* [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
@ 2016-03-11 15:41 Tejun Heo
  2016-03-11 15:41 ` [PATCH 02/10] cgroup: un-inline cgroup_path() and friends Tejun Heo
  ` (4 more replies)
  0 siblings, 5 replies; 50+ messages in thread
From: Tejun Heo @ 2016-03-11 15:41 UTC (permalink / raw)
  To: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
      akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
      a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, mingo-H+wXaHxf7aLQT0dZR+AlfA,
      lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w,
      pjt-hpIqsD4AKlfQT0dZR+AlfA
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
      linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg

Hello,

This patchset extends cgroup v2 to support rgroup (resource group) for
in-process hierarchical resource control and implements PRIO_RGRP for
setpriority(2) on top to allow in-process hierarchical CPU cycle
control in a seamless way.

cgroup v1 allowed putting threads of a process in different cgroups,
which enabled ad-hoc in-process resource control of some resources.
Unfortunately, this approach was fraught with problems such as
membership ambiguity with per-process resources and lack of isolation
between system management and in-process properties.  For a more
detailed discussion of the subject, please refer to the following
message.

 [1] [RFD] cgroup: thread granularity support for cpu controller

This patchset implements the mechanism outlined in the above message.
The new mechanism is named rgroup (resource group).  When explicitly
designating a non-rgroup cgroup, the term sgroup (system group) is
used.  rgroup has the following properties.

* A rgroup is a cgroup which is invisible in and transparent to the
  system-level cgroupfs interface.

* A rgroup can be created by specifying the CLONE_NEWRGRP flag, along
  with CLONE_THREAD, during clone(2).  A new rgroup is created under
  the parent thread's cgroup and the new thread is created in it.

* A rgroup is automatically destroyed when empty.

* A top-level rgroup of a process is a rgroup whose parent cgroup is a
  sgroup.  A process may have multiple top-level rgroups and thus
  multiple rgroup subtrees under the same parent sgroup.

* Unlike sgroups, rgroups are allowed to compete against peer threads.
  Each rgroup behaves like a sibling task.

* rgroup subtrees are local to the process.  When the process forks or
  execs, its rgroup subtrees are collapsed.

* When a process is migrated to a different cgroup, its rgroup
  subtrees are preserved.

* A subset of the controllers available on the parent sgroup is
  available to rgroup subtrees.  Controller management on rgroups is
  automatic and implicit and doesn't interfere with system-level
  cgroup controller management.  If a controller is made unavailable
  on the parent sgroup, it's automatically disabled in child rgroup
  subtrees.

rgroup lays the foundation for other kernel mechanisms to make use of
resource controllers while providing proper isolation between system
management and in-process operations, removing the awkward and
layer-violating requirement for coordination between individual
applications and system management.

On top of the rgroup mechanism, PRIO_RGRP is implemented for
{set|get}priority(2).

* PRIO_RGRP can only be used if the target task is already in a
  rgroup.  If setpriority(2) is used and the cpu controller is
  available, the cpu controller is enabled level-by-level until the
  target rgroup is covered and the specified nice value is set as the
  weight of the rgroup.

* The specified nice value has the same meaning as for tasks.  For
  example, a rgroup and a task competing under the same parent would
  behave exactly the same as two tasks.

* For top-level rgroups, PRIO_RGRP follows the same rlimit
  restrictions as PRIO_PROCESS; however, as nested rgroups only
  distribute CPU cycles which are allocated to the process, no
  restriction is applied.

PRIO_RGRP allows in-process hierarchical control of CPU cycles in a
manner which is a straightforward and minimal extension of existing
task and priority management.

There are still some missing pieces.

* Documentation updates.

* A mechanism that applications can use to publish certain rgroups so
  that external entities can determine which IDs to use to change
  rgroup settings.  I already have the interface and implementation
  design mostly pinned down.

* Userland updates such as integrating CLONE_NEWRGRP handling into
  pthread or updating renice(1) to handle resource groups.

I'll attach a test program which demonstrates PRIO_RGRP usage in a
follow-up email.

This patchset contains the following 10 patches.

 0001-cgroup-introduce-cgroup_-un-lock.patch
 0002-cgroup-un-inline-cgroup_path-and-friends.patch
 0003-cgroup-introduce-CGRP_MIGRATE_-flags.patch
 0004-signal-make-put_signal_struct-public.patch
 0005-cgroup-fork-add-new_rgrp_cset-p-and-clone_flags-to-c.patch
 0006-cgroup-fork-add-child-and-clone_flags-to-threadgroup.patch
 0007-cgroup-introduce-resource-group.patch
 0008-cgroup-implement-rgroup-control-mask-handling.patch
 0009-cgroup-implement-rgroup-subtree-migration.patch
 0010-cgroup-sched-implement-PRIO_RGRP-for-set-get-priorit.patch

0001-0006 are preparatory patches.  0007-0009 implement rgroup
support.  0010 implements PRIO_RGRP.

This patchset is on top of

  cgroup/for-4.6 f6d635ad341d ("cgroup: implement cgroup_subsys->implicit_on_dfl")
+ [2] [PATCH 2/2] cgroup, perf_event: make perf_event controller work on cgroup2 hierarchy
+ [3] [PATCHSET REPOST] sched, cgroup: implement cgroup v2 interface for cpu controller

and is available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup2-rgroup

diffstat follows.

 fs/exec.c                     |    8
 include/linux/cgroup-defs.h   |   72 ++-
 include/linux/cgroup.h        |   60 +--
 include/linux/sched.h         |   31 +
 include/uapi/linux/resource.h |    1
 include/uapi/linux/sched.h    |    1
 kernel/cgroup.c               |  828 ++++++++++++++++++++++++++++++++++++++----
 kernel/fork.c                 |   27 -
 kernel/sched/core.c           |   32 +
 kernel/signal.c               |    6
 kernel/sys.c                  |   11
 11 files changed, 917 insertions(+), 160 deletions(-)

Thanks.

--
tejun

[1] http://lkml.kernel.org/g/20160105154503.GC5995-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org
[2] http://lkml.kernel.org/g/1456351975-1899-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
[3] http://lkml.kernel.org/g/20160105164758.GD5995-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org

^ permalink raw reply	[flat|nested] 50+ messages in thread
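For a concrete picture of the interface, here is a minimal usage
sketch.  This is not the test program referenced above; it assumes a
kernel with this patchset applied and the glibc clone(3) wrapper, and
the CLONE_NEWRGRP and PRIO_RGRP values are copied from patches 0007
and 0010.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

#define CLONE_NEWRGRP	0x00001000	/* from patch 0007's uapi change */
#define PRIO_RGRP	3		/* from patch 0010's uapi change */

static int worker(void *arg)
{
	/*
	 * This thread runs in a fresh rgroup nested under the parent
	 * thread's cgroup.  Assign the rgroup a nice value of 10; it
	 * then competes with its peers like a single task would.
	 * @who == 0 is assumed to mean the calling thread's rgroup,
	 * mirroring PRIO_PROCESS semantics.
	 */
	if (setpriority(PRIO_RGRP, 0, 10))
		perror("setpriority(PRIO_RGRP)");
	for (;;)
		pause();
	return 0;
}

int main(void)
{
	char *stack = malloc(64 * 1024);

	if (!stack)
		return 1;
	/* CLONE_NEWRGRP is only accepted together with CLONE_THREAD */
	if (clone(worker, stack + 64 * 1024,
		  CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
		  CLONE_THREAD | CLONE_NEWRGRP, NULL) == -1) {
		perror("clone");
		return 1;
	}
	pause();
	return 0;
}

Per patch 0010, the setpriority() call above would fail with -ENODEV
if the cpu controller isn't available on the nearest sgroup.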
* [PATCH 02/10] cgroup: un-inline cgroup_path() and friends 2016-03-11 15:41 [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Tejun Heo @ 2016-03-11 15:41 ` Tejun Heo [not found] ` <1457710888-31182-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ` (3 subsequent siblings) 4 siblings, 0 replies; 50+ messages in thread From: Tejun Heo @ 2016-03-11 15:41 UTC (permalink / raw) To: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt Cc: linux-kernel, cgroups, linux-api, kernel-team, Tejun Heo They're trivial wrappers now but to support in-process resource control, cgroup_path() and friends will need to access more of cgroup internals. Un-inline them. Signed-off-by: Tejun Heo <tj@kernel.org> --- include/linux/cgroup.h | 31 +++++-------------------------- kernel/cgroup.c | 21 +++++++++++++++++++++ 2 files changed, 26 insertions(+), 26 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 2162dca..4717f43 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -74,6 +74,11 @@ extern struct css_set init_css_set; #define cgroup_subsys_on_dfl(ss) \ static_branch_likely(&ss ## _on_dfl_key) +int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen); +char * __must_check cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen); +void pr_cont_cgroup_name(struct cgroup *cgrp); +void pr_cont_cgroup_path(struct cgroup *cgrp); + bool css_has_online_children(struct cgroup_subsys_state *css); struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss); struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup, @@ -522,32 +527,6 @@ static inline struct cgroup_subsys_state *seq_css(struct seq_file *seq) return of_css(seq->private); } -/* - * Name / path handling functions. All are thin wrappers around the kernfs - * counterparts and can be called under any context. - */ - -static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen) -{ - return kernfs_name(cgrp->kn, buf, buflen); -} - -static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf, - size_t buflen) -{ - return kernfs_path(cgrp->kn, buf, buflen); -} - -static inline void pr_cont_cgroup_name(struct cgroup *cgrp) -{ - pr_cont_kernfs_name(cgrp->kn); -} - -static inline void pr_cont_cgroup_path(struct cgroup *cgrp) -{ - pr_cont_kernfs_path(cgrp->kn); -} - #else /* !CONFIG_CGROUPS */ struct cgroup_subsys_state; diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 2297bf6..4c3c43e 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -618,6 +618,27 @@ static int notify_on_release(const struct cgroup *cgrp) static void cgroup_release_agent(struct work_struct *work); static void check_for_release(struct cgroup *cgrp); +int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen) +{ + return kernfs_name(cgrp->kn, buf, buflen); +} + +char * __must_check cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen) +{ + return kernfs_path(cgrp->kn, buf, buflen); +} +EXPORT_SYMBOL_GPL(cgroup_path); + +void pr_cont_cgroup_name(struct cgroup *cgrp) +{ + pr_cont_kernfs_name(cgrp->kn); +} + +void pr_cont_cgroup_path(struct cgroup *cgrp) +{ + pr_cont_kernfs_path(cgrp->kn); +} + /* * A cgroup can be associated with multiple css_sets as different tasks may * belong to different cgroups on different hierarchies. In the other -- 2.5.0 ^ permalink raw reply related [flat|nested] 50+ messages in thread
[parent not found: <1457710888-31182-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>]
* [PATCH 01/10] cgroup: introduce cgroup_[un]lock() [not found] ` <1457710888-31182-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> @ 2016-03-11 15:41 ` Tejun Heo 2016-03-11 15:41 ` [PATCH 03/10] cgroup: introduce CGRP_MIGRATE_* flags Tejun Heo ` (8 subsequent siblings) 9 siblings, 0 replies; 50+ messages in thread From: Tejun Heo @ 2016-03-11 15:41 UTC (permalink / raw) To: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, Tejun Heo Introduce thin wrappers which lock and unlock cgroup_mutex. While this doesn't introduce any functional differences now, they will later be used to perform some extra operations around locking. Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> --- kernel/cgroup.c | 100 +++++++++++++++++++++++++++++++++----------------------- 1 file changed, 59 insertions(+), 41 deletions(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index e22df5d8..2297bf6 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -237,6 +237,24 @@ static int cgroup_addrm_files(struct cgroup_subsys_state *css, bool is_add); /** + * cgroup_lock - lock cgroup_mutex and perform related operations + */ +static void cgroup_lock(void) + __acquires(&cgroup_mutex) +{ + mutex_lock(&cgroup_mutex); +} + +/** + * cgroup_unlock - unlock cgroup_mutex and perform related operations + */ +static void cgroup_unlock(void) + __releases(&cgroup_mutex) +{ + mutex_unlock(&cgroup_mutex); +} + +/** * cgroup_ssid_enabled - cgroup subsys enabled test by subsys ID * @ssid: subsys ID of interest * @@ -1194,7 +1212,7 @@ static void cgroup_destroy_root(struct cgroup_root *root) cgroup_exit_root_id(root); - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); kernfs_destroy_root(root->kf_root); cgroup_free_root(root); @@ -1373,7 +1391,7 @@ static void cgroup_kn_unlock(struct kernfs_node *kn) else cgrp = kn->parent->priv; - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); kernfs_unbreak_active_protection(kn); cgroup_put(cgrp); @@ -1419,7 +1437,7 @@ static struct cgroup *cgroup_kn_lock_live(struct kernfs_node *kn, if (drain_offline) cgroup_lock_and_drain_offline(cgrp); else - mutex_lock(&cgroup_mutex); + cgroup_lock(); if (!cgroup_is_dead(cgrp)) return cgrp; @@ -1804,7 +1822,7 @@ static int cgroup_remount(struct kernfs_root *kf_root, int *flags, char *data) out_unlock: kfree(opts.release_agent); kfree(opts.name); - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); return ret; } @@ -2045,7 +2063,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, continue; if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) { - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); msleep(10); ret = restart_syscall(); goto out_free; @@ -2100,7 +2118,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, pinned_sb = kernfs_pin_sb(root->kf_root, NULL); if (IS_ERR(pinned_sb) || !percpu_ref_tryget_live(&root->cgrp.self.refcnt)) { - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); if (!IS_ERR_OR_NULL(pinned_sb)) deactivate_super(pinned_sb); msleep(10); @@ -2135,7 +2153,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, cgroup_free_root(root); out_unlock: - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); out_free: kfree(opts.release_agent); 
kfree(opts.name); @@ -2214,7 +2232,7 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen) int hierarchy_id = 1; char *path = NULL; - mutex_lock(&cgroup_mutex); + cgroup_lock(); spin_lock_bh(&css_set_lock); root = idr_get_next(&cgroup_hierarchy_idr, &hierarchy_id); @@ -2229,7 +2247,7 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen) } spin_unlock_bh(&css_set_lock); - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); return path; } EXPORT_SYMBOL_GPL(task_cgroup_path); @@ -2790,7 +2808,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) struct cgroup_root *root; int retval = 0; - mutex_lock(&cgroup_mutex); + cgroup_lock(); for_each_root(root) { struct cgroup *from_cgrp; @@ -2805,7 +2823,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) if (retval) break; } - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); return retval; } @@ -2968,7 +2986,7 @@ static void cgroup_lock_and_drain_offline(struct cgroup *cgrp) int ssid; restart: - mutex_lock(&cgroup_mutex); + cgroup_lock(); cgroup_for_each_live_descendant_post(dsct, d_css, cgrp) { for_each_subsys(ss, ssid) { @@ -2982,7 +3000,7 @@ static void cgroup_lock_and_drain_offline(struct cgroup *cgrp) prepare_to_wait(&dsct->offline_waitq, &wait, TASK_UNINTERRUPTIBLE); - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); schedule(); finish_wait(&dsct->offline_waitq, &wait); @@ -3426,11 +3444,11 @@ static int cgroup_rename(struct kernfs_node *kn, struct kernfs_node *new_parent, kernfs_break_active_protection(new_parent); kernfs_break_active_protection(kn); - mutex_lock(&cgroup_mutex); + cgroup_lock(); ret = kernfs_rename(kn, new_parent, new_name_str); - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); kernfs_unbreak_active_protection(kn); kernfs_unbreak_active_protection(new_parent); @@ -3637,9 +3655,9 @@ int cgroup_rm_cftypes(struct cftype *cfts) { int ret; - mutex_lock(&cgroup_mutex); + cgroup_lock(); ret = cgroup_rm_cftypes_locked(cfts); - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); return ret; } @@ -3671,14 +3689,14 @@ static int cgroup_add_cftypes(struct cgroup_subsys *ss, struct cftype *cfts) if (ret) return ret; - mutex_lock(&cgroup_mutex); + cgroup_lock(); list_add_tail(&cfts->node, &ss->cfts); ret = cgroup_apply_cftypes(cfts, true); if (ret) cgroup_rm_cftypes_locked(cfts); - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); return ret; } @@ -4170,7 +4188,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) if (!cgroup_may_migrate_to(to)) return -EBUSY; - mutex_lock(&cgroup_mutex); + cgroup_lock(); /* all tasks in @from are being moved, all csets are source */ spin_lock_bh(&css_set_lock); @@ -4200,7 +4218,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) } while (task && !ret); out_err: cgroup_migrate_finish(&preloaded_csets); - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); return ret; } @@ -4507,7 +4525,7 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry) kernfs_type(kn) != KERNFS_DIR) return -EINVAL; - mutex_lock(&cgroup_mutex); + cgroup_lock(); /* * We aren't being called from kernfs and there's no guarantee on @@ -4518,7 +4536,7 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry) cgrp = rcu_dereference(kn->priv); if (!cgrp || cgroup_is_dead(cgrp)) { rcu_read_unlock(); - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); return -ENOENT; } rcu_read_unlock(); @@ -4546,7 +4564,7 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry) } 
css_task_iter_end(&it); - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); return 0; } @@ -4847,7 +4865,7 @@ static void css_release_work_fn(struct work_struct *work) struct cgroup_subsys *ss = css->ss; struct cgroup *cgrp = css->cgroup; - mutex_lock(&cgroup_mutex); + cgroup_lock(); css->flags |= CSS_RELEASED; list_del_rcu(&css->sibling); @@ -4874,7 +4892,7 @@ static void css_release_work_fn(struct work_struct *work) NULL); } - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); call_rcu(&css->rcu_head, css_free_rcu_fn); } @@ -5168,7 +5186,7 @@ static void css_killed_work_fn(struct work_struct *work) struct cgroup_subsys_state *css = container_of(work, struct cgroup_subsys_state, destroy_work); - mutex_lock(&cgroup_mutex); + cgroup_lock(); do { offline_css(css); @@ -5177,7 +5195,7 @@ static void css_killed_work_fn(struct work_struct *work) css = css->parent; } while (css && atomic_dec_and_test(&css->online_cnt)); - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); } /* css kill confirmation processing requires process context, bounce */ @@ -5330,7 +5348,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early) pr_debug("Initializing cgroup subsys %s\n", ss->name); - mutex_lock(&cgroup_mutex); + cgroup_lock(); idr_init(&ss->css_idr); INIT_LIST_HEAD(&ss->cfts); @@ -5374,7 +5392,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early) BUG_ON(online_css(css)); - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); } /** @@ -5431,7 +5449,7 @@ int __init cgroup_init(void) BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files)); BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files)); - mutex_lock(&cgroup_mutex); + cgroup_lock(); /* * Add init_css_set to the hash table so that dfl_root can link to @@ -5442,7 +5460,7 @@ int __init cgroup_init(void) BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0)); - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); for_each_subsys(ss, ssid) { if (ss->early_init) { @@ -5548,7 +5566,7 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns, if (!buf) goto out; - mutex_lock(&cgroup_mutex); + cgroup_lock(); spin_lock_bh(&css_set_lock); for_each_root(root) { @@ -5602,7 +5620,7 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns, retval = 0; out_unlock: spin_unlock_bh(&css_set_lock); - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); kfree(buf); out: return retval; @@ -5620,7 +5638,7 @@ static int proc_cgroupstats_show(struct seq_file *m, void *v) * cgroup_mutex is also necessary to guarantee an atomic snapshot of * subsys/hierarchy state. 
*/ - mutex_lock(&cgroup_mutex); + cgroup_lock(); for_each_subsys(ss, i) seq_printf(m, "%s\t%d\t%d\t%d\n", @@ -5628,7 +5646,7 @@ static int proc_cgroupstats_show(struct seq_file *m, void *v) atomic_read(&ss->root->nr_cgrps), cgroup_ssid_enabled(i)); - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); return 0; } @@ -5860,7 +5878,7 @@ static void cgroup_release_agent(struct work_struct *work) char *pathbuf = NULL, *agentbuf = NULL, *path; char *argv[3], *envp[3]; - mutex_lock(&cgroup_mutex); + cgroup_lock(); pathbuf = kmalloc(PATH_MAX, GFP_KERNEL); agentbuf = kstrdup(cgrp->root->release_agent_path, GFP_KERNEL); @@ -5880,11 +5898,11 @@ static void cgroup_release_agent(struct work_struct *work) envp[1] = "PATH=/sbin:/bin:/usr/sbin:/usr/bin"; envp[2] = NULL; - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC); goto out_free; out: - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); out_free: kfree(agentbuf); kfree(pathbuf); @@ -6006,7 +6024,7 @@ struct cgroup *cgroup_get_from_path(const char *path) struct kernfs_node *kn; struct cgroup *cgrp; - mutex_lock(&cgroup_mutex); + cgroup_lock(); kn = kernfs_walk_and_get(cgrp_dfl_root.cgrp.kn, path); if (kn) { @@ -6021,7 +6039,7 @@ struct cgroup *cgroup_get_from_path(const char *path) cgrp = ERR_PTR(-ENOENT); } - mutex_unlock(&cgroup_mutex); + cgroup_unlock(); return cgrp; } EXPORT_SYMBOL_GPL(cgroup_get_from_path); -- 2.5.0 ^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 03/10] cgroup: introduce CGRP_MIGRATE_* flags [not found] ` <1457710888-31182-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2016-03-11 15:41 ` [PATCH 01/10] cgroup: introduce cgroup_[un]lock() Tejun Heo @ 2016-03-11 15:41 ` Tejun Heo 2016-03-11 15:41 ` [PATCH 05/10] cgroup, fork: add @new_rgrp_cset[p] and @clone_flags to cgroup fork callbacks Tejun Heo ` (7 subsequent siblings) 9 siblings, 0 replies; 50+ messages in thread From: Tejun Heo @ 2016-03-11 15:41 UTC (permalink / raw) To: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, Tejun Heo Currently, the only migration modifier is boolean @threadgroup. Make the functions pass around unsigned flag mask instead and replace @threadgroup with CGRP_MIGRATE_PROCESS. This will allow addition of other modifiers. This patch doesn't introduce any behavior changes. Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> --- kernel/cgroup.c | 32 ++++++++++++++++++-------------- 1 file changed, 18 insertions(+), 14 deletions(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 4c3c43e..8ecb8d6 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -2403,6 +2403,10 @@ struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset, return NULL; } +enum { + CGRP_MIGRATE_PROCESS = (1 << 0), /* migrate the whole process */ +}; + /** * cgroup_taskset_migrate - migrate a taskset * @tset: taget taskset @@ -2634,7 +2638,7 @@ static int cgroup_migrate_prepare_dst(struct list_head *preloaded_csets) /** * cgroup_migrate - migrate a process or task to a cgroup * @leader: the leader of the process or the task to migrate - * @threadgroup: whether @leader points to the whole process or a single task + * @mflags: CGRP_MIGRATE_* flags * @root: cgroup root migration is taking place on * * Migrate a process or task denoted by @leader. If migrating a process, @@ -2649,7 +2653,7 @@ static int cgroup_migrate_prepare_dst(struct list_head *preloaded_csets) * decided for all targets by invoking group_migrate_prepare_dst() before * actually starting migrating. */ -static int cgroup_migrate(struct task_struct *leader, bool threadgroup, +static int cgroup_migrate(struct task_struct *leader, unsigned mflags, struct cgroup_root *root) { struct cgroup_taskset tset = CGROUP_TASKSET_INIT(tset); @@ -2665,7 +2669,7 @@ static int cgroup_migrate(struct task_struct *leader, bool threadgroup, task = leader; do { cgroup_taskset_add(task, &tset); - if (!threadgroup) + if (!(mflags & CGRP_MIGRATE_PROCESS)) break; } while_each_thread(leader, task); rcu_read_unlock(); @@ -2678,12 +2682,12 @@ static int cgroup_migrate(struct task_struct *leader, bool threadgroup, * cgroup_attach_task - attach a task or a whole threadgroup to a cgroup * @dst_cgrp: the cgroup to attach to * @leader: the task or the leader of the threadgroup to be attached - * @threadgroup: attach the whole threadgroup? + * @mflags: CGRP_MIGRATE_* flags * * Call holding cgroup_mutex and cgroup_threadgroup_rwsem. 
*/ static int cgroup_attach_task(struct cgroup *dst_cgrp, - struct task_struct *leader, bool threadgroup) + struct task_struct *leader, unsigned mflags) { LIST_HEAD(preloaded_csets); struct task_struct *task; @@ -2699,7 +2703,7 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp, do { cgroup_migrate_add_src(task_css_set(task), dst_cgrp, &preloaded_csets); - if (!threadgroup) + if (!(mflags & CGRP_MIGRATE_PROCESS)) break; } while_each_thread(leader, task); rcu_read_unlock(); @@ -2708,7 +2712,7 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp, /* prepare dst csets and commit */ ret = cgroup_migrate_prepare_dst(&preloaded_csets); if (!ret) - ret = cgroup_migrate(leader, threadgroup, dst_cgrp->root); + ret = cgroup_migrate(leader, mflags, dst_cgrp->root); cgroup_migrate_finish(&preloaded_csets); return ret; @@ -2761,7 +2765,7 @@ static int cgroup_procs_write_permission(struct task_struct *task, * cgroup_mutex and threadgroup. */ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf, - size_t nbytes, loff_t off, bool threadgroup) + size_t nbytes, loff_t off, unsigned mflags) { struct task_struct *tsk; struct cgroup *cgrp; @@ -2787,7 +2791,7 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf, tsk = current; } - if (threadgroup) + if (mflags & CGRP_MIGRATE_PROCESS) tsk = tsk->group_leader; /* @@ -2805,7 +2809,7 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf, ret = cgroup_procs_write_permission(tsk, cgrp, of); if (!ret) - ret = cgroup_attach_task(cgrp, tsk, threadgroup); + ret = cgroup_attach_task(cgrp, tsk, mflags); put_task_struct(tsk); goto out_unlock_threadgroup; @@ -2840,7 +2844,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) from_cgrp = task_cgroup_from_root(from, root); spin_unlock_bh(&css_set_lock); - retval = cgroup_attach_task(from_cgrp, tsk, false); + retval = cgroup_attach_task(from_cgrp, tsk, 0); if (retval) break; } @@ -2853,13 +2857,13 @@ EXPORT_SYMBOL_GPL(cgroup_attach_task_all); static ssize_t cgroup_tasks_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { - return __cgroup_procs_write(of, buf, nbytes, off, false); + return __cgroup_procs_write(of, buf, nbytes, off, 0); } static ssize_t cgroup_procs_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { - return __cgroup_procs_write(of, buf, nbytes, off, true); + return __cgroup_procs_write(of, buf, nbytes, off, CGRP_MIGRATE_PROCESS); } static ssize_t cgroup_release_agent_write(struct kernfs_open_file *of, @@ -4233,7 +4237,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) css_task_iter_end(&it); if (task) { - ret = cgroup_migrate(task, false, to->root); + ret = cgroup_migrate(task, 0, to->root); put_task_struct(task); } } while (task && !ret); -- 2.5.0 ^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 05/10] cgroup, fork: add @new_rgrp_cset[p] and @clone_flags to cgroup fork callbacks [not found] ` <1457710888-31182-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2016-03-11 15:41 ` [PATCH 01/10] cgroup: introduce cgroup_[un]lock() Tejun Heo 2016-03-11 15:41 ` [PATCH 03/10] cgroup: introduce CGRP_MIGRATE_* flags Tejun Heo @ 2016-03-11 15:41 ` Tejun Heo 2016-03-11 15:41 ` [PATCH 07/10] cgroup: introduce resource group Tejun Heo ` (6 subsequent siblings) 9 siblings, 0 replies; 50+ messages in thread From: Tejun Heo @ 2016-03-11 15:41 UTC (permalink / raw) To: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, Tejun Heo, Peter Zijlstra, Oleg Nesterov Add two extra arguments to cgroup_{can|cancel|post}_fork(). These will be used to implement in-process resource control. The extra arguments aren't used yet. Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Cc: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Cc: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> Cc: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> --- include/linux/cgroup-defs.h | 2 ++ include/linux/cgroup.h | 21 +++++++++++++++------ kernel/cgroup.c | 15 ++++++++++++--- kernel/fork.c | 7 ++++--- 4 files changed, 33 insertions(+), 12 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 34b42f0..d3d1f92 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -542,6 +542,8 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) #else /* CONFIG_CGROUPS */ +struct css_set; + #define CGROUP_SUBSYS_COUNT 0 static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) {} diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 4717f43..ebcf21f 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -102,9 +102,12 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *tsk); void cgroup_fork(struct task_struct *p); -extern int cgroup_can_fork(struct task_struct *p); -extern void cgroup_cancel_fork(struct task_struct *p); -extern void cgroup_post_fork(struct task_struct *p); +extern int cgroup_can_fork(struct task_struct *p, unsigned long clone_flags, + struct css_set **new_rgrp_csetp); +extern void cgroup_cancel_fork(struct task_struct *p, unsigned long clone_flags, + struct css_set *new_rgrp_cset); +extern void cgroup_post_fork(struct task_struct *p, unsigned long clone_flags, + struct css_set *new_rgrp_cset); void cgroup_exit(struct task_struct *p); void cgroup_free(struct task_struct *p); @@ -538,9 +541,15 @@ static inline int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry) { return -EINVAL; } static inline void cgroup_fork(struct task_struct *p) {} -static inline int cgroup_can_fork(struct task_struct *p) { return 0; } -static inline void cgroup_cancel_fork(struct task_struct *p) {} -static inline void cgroup_post_fork(struct task_struct *p) {} +static inline int cgroup_can_fork(struct task_struct *p, + unsigned long clone_flags, + struct css_set **new_rgrp_csetp) { return 0; } +static inline void cgroup_cancel_fork(struct task_struct *p, + unsigned long 
clone_flags, + struct css_set *new_rgrp_cset) {} +static inline void cgroup_post_fork(struct task_struct *p, + unsigned long clone_flags, + struct css_set *new_rgrp_cset) {} static inline void cgroup_exit(struct task_struct *p) {} static inline void cgroup_free(struct task_struct *p) {} diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 8ecb8d6..ac207ae 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -5704,12 +5704,15 @@ void cgroup_fork(struct task_struct *child) /** * cgroup_can_fork - called on a new task before the process is exposed * @child: the task in question. + * @clone_flags: CLONE_* flags for @child + * @new_rgrp_csetp: rgroup out parameter to be passed to post/cancel_fork * * This calls the subsystem can_fork() callbacks. If the can_fork() callback * returns an error, the fork aborts with that error code. This allows for * a cgroup subsystem to conditionally allow or deny new forks. */ -int cgroup_can_fork(struct task_struct *child) +int cgroup_can_fork(struct task_struct *child, unsigned long clone_flags, + struct css_set **new_rgrp_csetp) { struct cgroup_subsys *ss; int i, j, ret; @@ -5736,11 +5739,14 @@ int cgroup_can_fork(struct task_struct *child) /** * cgroup_cancel_fork - called if a fork failed after cgroup_can_fork() * @child: the task in question + * @clone_flags: CLONE_* flags for @child + * @new_rgrp_cset: *@new_rgrp_csetp from cgroup_can_fork() * * This calls the cancel_fork() callbacks if a fork failed *after* * cgroup_can_fork() succeded. */ -void cgroup_cancel_fork(struct task_struct *child) +void cgroup_cancel_fork(struct task_struct *child, unsigned long clone_flags, + struct css_set *new_rgrp_cset) { struct cgroup_subsys *ss; int i; @@ -5753,6 +5759,8 @@ void cgroup_cancel_fork(struct task_struct *child) /** * cgroup_post_fork - called on a new task after adding it to the task list * @child: the task in question + * @clone_flags: CLONE_* flags for @child + * @new_rgrp_cset: *@new_rgrp_csetp from cgroup_can_fork() * * Adds the task to the list running through its css_set if necessary and * call the subsystem fork() callbacks. Has to be after the task is @@ -5760,7 +5768,8 @@ void cgroup_cancel_fork(struct task_struct *child) * cgroup_task_iter_start() - to guarantee that the new task ends up on its * list. */ -void cgroup_post_fork(struct task_struct *child) +void cgroup_post_fork(struct task_struct *child, unsigned long clone_flags, + struct css_set *new_rgrp_cset) { struct cgroup_subsys *ss; int i; diff --git a/kernel/fork.c b/kernel/fork.c index fd826de..812d477 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1249,6 +1249,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, { int retval; struct task_struct *p; + struct css_set *new_rgrp_cset; if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS)) return ERR_PTR(-EINVAL); @@ -1525,7 +1526,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, * between here and cgroup_post_fork() if an organisation operation is in * progress. 
*/ - retval = cgroup_can_fork(p); + retval = cgroup_can_fork(p, clone_flags, &new_rgrp_cset); if (retval) goto bad_fork_free_pid; @@ -1607,7 +1608,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, write_unlock_irq(&tasklist_lock); proc_fork_connector(p); - cgroup_post_fork(p); + cgroup_post_fork(p, clone_flags, new_rgrp_cset); threadgroup_change_end(current); perf_event_fork(p); @@ -1617,7 +1618,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, return p; bad_fork_cancel_cgroup: - cgroup_cancel_fork(p); + cgroup_cancel_fork(p, clone_flags, new_rgrp_cset); bad_fork_free_pid: if (pid != &init_struct_pid) free_pid(pid); -- 2.5.0 ^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 07/10] cgroup: introduce resource group
  [not found] ` <1457710888-31182-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  ` (2 preceding siblings ...)
  2016-03-11 15:41 ` [PATCH 05/10] cgroup, fork: add @new_rgrp_cset[p] and @clone_flags to cgroup fork callbacks Tejun Heo
@ 2016-03-11 15:41 ` Tejun Heo
  2016-03-11 15:41 ` [PATCH 08/10] cgroup: implement rgroup control mask handling Tejun Heo
  ` (5 subsequent siblings)
  9 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2016-03-11 15:41 UTC (permalink / raw)
  To: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
      akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
      a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, mingo-H+wXaHxf7aLQT0dZR+AlfA,
      lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w,
      pjt-hpIqsD4AKlfQT0dZR+AlfA
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
      linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg,
      Tejun Heo, Peter Zijlstra, Oleg Nesterov

cgroup v1 allowed tasks of a process to be put in different cgroups,
thus allowing control of resource distribution inside a process;
however, controlling in-process properties through a filesystem
interface is highly unusual and has various issues around delegation,
ownership, and lack of integration with process-altering operations.

rgroup (resource group) is a type of v2 cgroup which can be created by
setting CLONE_NEWRGRP during clone(2).  A newly created rgroup always
nests below the cgroup of the parent task, whether that is a sgroup
(system group) or rgroup.  rgroups are wholly owned by the associated
process and not visible through cgroupfs.

This patch implements the basic support for rgroups.

* A new rgroup can be created through CLONE_NEWRGRP.  Top-level
  rgroups are linked on the owning process's signal struct and all
  such signal structs are linked on the parent sgroup.

* A rgroup is destroyed automatically when it becomes depopulated.

* When a new process is forked, it is spawned in the nearest sgroup.

* When a task execs, it is moved to the nearest sgroup.

This patch doesn't yet implement actual resource control or
sub-hierarchy migration and all controllers are suppressed in rgroups.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Cc: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Cc: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> Cc: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Cc: Paul Turner <pjt-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> --- fs/exec.c | 2 +- include/linux/cgroup-defs.h | 26 +++++ include/linux/cgroup.h | 2 + include/linux/sched.h | 4 + include/uapi/linux/sched.h | 1 + kernel/cgroup.c | 229 ++++++++++++++++++++++++++++++++++++++++++-- kernel/fork.c | 11 +++ 7 files changed, 266 insertions(+), 9 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 5b81bbb..286141e 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1044,7 +1044,7 @@ static int de_thread(struct task_struct *tsk) } BUG_ON(!thread_group_leader(tsk)); - return 0; + return cgroup_exec(); killed: /* protects against exit_notify() and __exit_signal() */ diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 3c4a75b..f1ee756 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -201,6 +201,14 @@ struct css_set { struct css_set *mg_dst_cset; /* + * If this cset points to a rgroup, the following is a cset which + * is equivalent except that it points to the nearest sgroup.
This + * allows tasks to be escaped to the nearest sgroup without + * introducing deeply nested error cases. + */ + struct css_set *sgrp_cset; + + /* * On the default hierarhcy, ->subsys[ssid] may point to a css * attached to an ancestor instead of the cgroup this css_set is * associated with. The following node is anchored at @@ -285,6 +293,24 @@ struct cgroup { struct list_head e_csets[CGROUP_SUBSYS_COUNT]; /* + * If not NULL, the cgroup is a rgroup (resource group) of the + * process associated with the following signal struct. A rgroup + * is used for in-process resource control. rgroups are created by + * specifying CLONE_NEWRGRP during clone(2), tied to the associated + * process, and invisible and transparent to cgroupfs. + * + * The term "sgroup" (system group) is used for a cgroup which is + * explicitly not a rgroup. + */ + struct signal_struct *rgrp_sig; + + /* top-level rgroups linked on rgrp_sig->rgrps */ + struct list_head rgrp_node; + + /* signal structs with rgroups below this cgroup */ + struct list_head rgrp_child_sigs; + + /* * list of pidlists, up to two for each namespace (one for procs, one * for tasks); created on demand. */ diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 1e00fc0..ca1ec50 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -107,6 +107,7 @@ extern void cgroup_cancel_fork(struct task_struct *p, unsigned long clone_flags, struct css_set *new_rgrp_cset); extern void cgroup_post_fork(struct task_struct *p, unsigned long clone_flags, struct css_set *new_rgrp_cset); +int cgroup_exec(void); void cgroup_exit(struct task_struct *p); void cgroup_free(struct task_struct *p); @@ -548,6 +549,7 @@ static inline void cgroup_cancel_fork(struct task_struct *p, static inline void cgroup_post_fork(struct task_struct *p, unsigned long clone_flags, struct css_set *new_rgrp_cset) {} +static inline int cgroup_exec(void) { return 0; } static inline void cgroup_exit(struct task_struct *p) {} static inline void cgroup_free(struct task_struct *p) {} diff --git a/include/linux/sched.h b/include/linux/sched.h index d4ae795..7886919 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -778,6 +778,10 @@ struct signal_struct { unsigned audit_tty_log_passwd; struct tty_audit_buf *tty_audit_buf; #endif +#ifdef CONFIG_CGROUPS + struct list_head rgrps; /* top-level rgroups under this sig */ + struct list_head rgrp_node; /* parent_sgrp->child_rgrp_sigs list */ +#endif oom_flags_t oom_flags; short oom_score_adj; /* OOM kill score adjustment */ diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h index cc89dde..ac6cec9 100644 --- a/include/uapi/linux/sched.h +++ b/include/uapi/linux/sched.h @@ -9,6 +9,7 @@ #define CLONE_FS 0x00000200 /* set if fs info shared between processes */ #define CLONE_FILES 0x00000400 /* set if open files shared between processes */ #define CLONE_SIGHAND 0x00000800 /* set if signal handlers and blocked signals shared */ +#define CLONE_NEWRGRP 0x00001000 /* New resource group */ #define CLONE_PTRACE 0x00002000 /* set if we want to let tracing continue on the child too */ #define CLONE_VFORK 0x00004000 /* set if the parent wants the child to wake it up on mm_release */ #define CLONE_PARENT 0x00008000 /* set if we want to have the same parent as the cloner */ diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 70f9985..53f479c 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -126,6 +126,13 @@ static struct percpu_rw_semaphore cgroup_threadgroup_rwsem; static struct workqueue_struct 
*cgroup_destroy_wq; /* + * rgroups are automatically destroyed when they become unpopulated. + * Destructions are bounced through the following workqueue which is + * ordered to avoid trying to destroy a parent before its children. + */ +static struct workqueue_struct *rgroup_destroy_wq; + +/* * pidlist destructions need to be flushed on cgroup destruction. Use a * separate workqueue as flush domain. */ @@ -228,6 +235,7 @@ static int cgroup_apply_control(struct cgroup *cgrp); static void cgroup_finalize_control(struct cgroup *cgrp, int ret); static void css_task_iter_advance(struct css_task_iter *it); static int cgroup_destroy_locked(struct cgroup *cgrp); +static void rgroup_destroy_schedule(struct cgroup *rgrp); static struct cgroup_subsys_state *css_create(struct cgroup *cgrp, struct cgroup_subsys *ss); static void css_release(struct percpu_ref *ref); @@ -242,6 +250,16 @@ static int cgroup_addrm_files(struct cgroup_subsys_state *css, static void cgroup_lock(void) __acquires(&cgroup_mutex) { + /* + * In-flight rgroup destructions can interfere with subsequent + * operations. For example, rmdir of the nearest sgroup would fail + * while rgroup destructions are in flight. rgroup destructions + * don't involve any time-consuming operations and the following + * flush shouldn't be noticeable. + */ + if (rgroup_destroy_wq) + flush_workqueue(rgroup_destroy_wq); + mutex_lock(&cgroup_mutex); } @@ -330,6 +348,11 @@ static bool cgroup_on_dfl(const struct cgroup *cgrp) return cgrp->root == &cgrp_dfl_root; } +static bool is_rgroup(struct cgroup *cgrp) +{ + return cgrp->rgrp_sig; +} + /* IDR wrappers which synchronize using cgroup_idr_lock */ static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end, gfp_t gfp_mask) @@ -370,12 +393,29 @@ static struct cgroup *cgroup_parent(struct cgroup *cgrp) return NULL; } +/** + * nearest_sgroup - find the nearest system group + * @cgrp: cgroup of question + * + * Find the closest sgroup ancestor. If @cgrp is not a rgroup, @cgrp is + * returned. A rgroup subtree is always nested under a sgroup. 
+ */ +static struct cgroup *nearest_sgroup(struct cgroup *cgrp) +{ + while (is_rgroup(cgrp)) + cgrp = cgroup_parent(cgrp); + return cgrp; +} + /* subsystems visibly enabled on a cgroup */ static u16 cgroup_control(struct cgroup *cgrp) { struct cgroup *parent = cgroup_parent(cgrp); u16 root_ss_mask = cgrp->root->subsys_mask; + if (is_rgroup(cgrp)) + return 0; + if (parent) return parent->subtree_control; @@ -390,6 +430,9 @@ static u16 cgroup_ss_mask(struct cgroup *cgrp) { struct cgroup *parent = cgroup_parent(cgrp); + if (is_rgroup(cgrp)) + return 0; + if (parent) return parent->subtree_ss_mask; @@ -620,22 +663,26 @@ static void check_for_release(struct cgroup *cgrp); int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen) { + cgrp = nearest_sgroup(cgrp); return kernfs_name(cgrp->kn, buf, buflen); } char * __must_check cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen) { + cgrp = nearest_sgroup(cgrp); return kernfs_path(cgrp->kn, buf, buflen); } EXPORT_SYMBOL_GPL(cgroup_path); void pr_cont_cgroup_name(struct cgroup *cgrp) { + cgrp = nearest_sgroup(cgrp); pr_cont_kernfs_name(cgrp->kn); } void pr_cont_cgroup_path(struct cgroup *cgrp) { + cgrp = nearest_sgroup(cgrp); pr_cont_kernfs_path(cgrp->kn); } @@ -720,8 +767,14 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated) if (!trigger) break; - check_for_release(cgrp); - cgroup_file_notify(&cgrp->events_file); + /* rgroups are automatically destroyed when empty */ + if (is_rgroup(cgrp)) { + if (!cgrp->populated_cnt) + rgroup_destroy_schedule(cgrp); + } else { + check_for_release(cgrp); + cgroup_file_notify(&cgrp->events_file); + } cgrp = cgroup_parent(cgrp); } while (cgrp); @@ -856,6 +909,9 @@ static void put_css_set_locked(struct css_set *cset) kfree(link); } + if (cset->sgrp_cset) + put_css_set_locked(cset->sgrp_cset); + kfree_rcu(cset, rcu_head); } @@ -1154,6 +1210,16 @@ static struct css_set *find_css_set(struct css_set *old_cset, spin_unlock_bh(&css_set_lock); + if (is_rgroup(cset->dfl_cgrp)) { + struct cgroup *c = nearest_sgroup(cset->dfl_cgrp); + + cset->sgrp_cset = find_css_set(cset, c); + if (!cset->sgrp_cset) { + put_css_set(cset); + return NULL; + } + } + return cset; } @@ -1909,6 +1975,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp) INIT_LIST_HEAD(&cgrp->self.sibling); INIT_LIST_HEAD(&cgrp->self.children); INIT_LIST_HEAD(&cgrp->cset_links); + INIT_LIST_HEAD(&cgrp->rgrp_child_sigs); + INIT_LIST_HEAD(&cgrp->rgrp_node); INIT_LIST_HEAD(&cgrp->pidlists); mutex_init(&cgrp->pidlist_mutex); cgrp->self.cgroup = cgrp; @@ -3307,9 +3375,10 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of, continue; } - /* a child has it enabled? */ + /* a child sgroup has it enabled? 
*/ cgroup_for_each_live_child(child, cgrp) { - if (child->subtree_control & (1 << ssid)) { + if (!is_rgroup(child) && + child->subtree_control & (1 << ssid)) { ret = -EBUSY; goto out_unlock; } @@ -5060,7 +5129,8 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp, return ERR_PTR(err); } -static struct cgroup *cgroup_create(struct cgroup *parent) +static struct cgroup *cgroup_create(struct cgroup *parent, + struct signal_struct *rgrp_sig) { struct cgroup_root *root = parent->root; struct cgroup *cgrp, *tcgrp; @@ -5103,6 +5173,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent) set_bit(CGRP_CPUSET_CLONE_CHILDREN, &cgrp->flags); cgrp->self.serial_nr = css_serial_nr_next++; + cgrp->rgrp_sig = rgrp_sig; /* allocation complete, commit to creation */ list_add_tail_rcu(&cgrp->self.sibling, &cgroup_parent(cgrp)->self.children); @@ -5156,7 +5227,7 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, if (!parent) return -ENODEV; - cgrp = cgroup_create(parent); + cgrp = cgroup_create(parent, NULL); if (IS_ERR(cgrp)) { ret = PTR_ERR(cgrp); goto out_unlock; @@ -5201,6 +5272,75 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, return ret; } +static void rgroup_destroy_work_fn(struct work_struct *work) +{ + struct cgroup *rgrp = container_of(work, struct cgroup, + self.destroy_work); + struct signal_struct *sig = rgrp->rgrp_sig; + + /* + * cgroup_lock() flushes rgroup_destroy_wq and using it here would + * lead to deadlock. Grab cgroup_mutex directly. + */ + mutex_lock(&cgroup_mutex); + + if (WARN_ON_ONCE(cgroup_destroy_locked(rgrp))) { + mutex_unlock(&cgroup_mutex); + return; + } + + list_del(&rgrp->rgrp_node); + + if (sig && list_empty(&sig->rgrps)) { + list_del(&sig->rgrp_node); + put_signal_struct(sig); + } + + mutex_unlock(&cgroup_mutex); +} + +/** + * rgroup_destroy_schedule - schedule destruction of a rgroup + * @rgrp: rgroup to be destroyed + * + * Schedule destruction of @rgrp. Destructions are guarantee to be + * performed in order and flushed on cgroup_lock(). + */ +static void rgroup_destroy_schedule(struct cgroup *rgrp) +{ + INIT_WORK(&rgrp->self.destroy_work, rgroup_destroy_work_fn); + queue_work(rgroup_destroy_wq, &rgrp->self.destroy_work); +} + +/** + * rgroup_create - create a rgroup + * @parent: parent cgroup (sgroup or rgroup) + * @sig: signal_struct of the target process + * + * Create a rgroup under @parent for the process associated with @sig. + */ +static struct cgroup *rgroup_create(struct cgroup *parent, + struct signal_struct *sig) +{ + struct cgroup *rgrp; + + lockdep_assert_held(&cgroup_mutex); + + rgrp = cgroup_create(parent, sig); + if (IS_ERR(rgrp)) + return rgrp; + + if (!is_rgroup(parent)) + list_add_tail(&rgrp->rgrp_node, &sig->rgrps); + + if (list_empty(&sig->rgrp_node)) { + atomic_inc(&sig->sigcnt); + list_add_tail(&sig->rgrp_node, &parent->rgrp_child_sigs); + } + + return rgrp; +} + /* * This is called when the refcnt of a css is confirmed to be killed. * css_tryget_online() is now guaranteed to fail. Tell the subsystem to @@ -5562,6 +5702,9 @@ static int __init cgroup_wq_init(void) cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1); BUG_ON(!cgroup_destroy_wq); + rgroup_destroy_wq = alloc_ordered_workqueue("rgroup_destroy", 0); + BUG_ON(!rgroup_destroy_wq); + /* * Used to destroy pidlists and separate to serve as flush domain. * Cap @max_active to 1 too. 
@@ -5694,7 +5837,8 @@ static const struct file_operations proc_cgroupstats_operations = { * @clone_flags: clone flags if forking * * Called from threadgroup_change_begin() and allows cgroup operations to - * synchronize against threadgroup changes using a percpu_rw_semaphore. + * synchronize against threadgroup changes using a percpu_rw_semaphore. If + * clone(2) is requesting a new rgroup, also grab cgroup_mutex. */ void cgroup_threadgroup_change_begin(struct task_struct *tsk, struct task_struct *child, @@ -5709,6 +5853,9 @@ void cgroup_threadgroup_change_begin(struct task_struct *tsk, */ RCU_INIT_POINTER(child->cgroups, &init_css_set); INIT_LIST_HEAD(&child->cg_list); + + if (clone_flags & CLONE_NEWRGRP) + cgroup_lock(); } percpu_down_read(&cgroup_threadgroup_rwsem); @@ -5728,6 +5875,9 @@ void cgroup_threadgroup_change_end(struct task_struct *tsk, unsigned long clone_flags) { percpu_up_read(&cgroup_threadgroup_rwsem); + + if (child && (clone_flags & CLONE_NEWRGRP)) + cgroup_unlock(); } /** @@ -5746,6 +5896,23 @@ int cgroup_can_fork(struct task_struct *child, unsigned long clone_flags, struct cgroup_subsys *ss; int i, j, ret; + if (clone_flags & CLONE_NEWRGRP) { + struct css_set *cset = task_css_set(current); + struct cgroup *rgrp; + + rgrp = rgroup_create(cset->dfl_cgrp, current->signal); + if (IS_ERR(rgrp)) + return PTR_ERR(rgrp); + + *new_rgrp_csetp = find_css_set(cset, rgrp); + if (IS_ERR(*new_rgrp_csetp)) { + rgroup_destroy_schedule(rgrp); + return PTR_ERR(*new_rgrp_csetp); + } + } else { + *new_rgrp_csetp = NULL; + } + do_each_subsys_mask(ss, i, have_canfork_callback) { ret = ss->can_fork(child); if (ret) @@ -5780,6 +5947,11 @@ void cgroup_cancel_fork(struct task_struct *child, unsigned long clone_flags, struct cgroup_subsys *ss; int i; + if (new_rgrp_cset) { + rgroup_destroy_schedule(new_rgrp_cset->dfl_cgrp); + put_css_set(new_rgrp_cset); + } + for_each_subsys(ss, i) if (ss->cancel_fork) ss->cancel_fork(child); @@ -5828,11 +6000,29 @@ void cgroup_post_fork(struct task_struct *child, unsigned long clone_flags, struct css_set *cset; spin_lock_bh(&css_set_lock); - cset = task_css_set(current); + + /* + * If @new_rgrp_cset is set, it contains the requested new + * rgroup created by cgroup_can_fork(). + */ + if (new_rgrp_cset) { + cset = new_rgrp_cset; + } else { + cset = task_css_set(current); + /* + * If a new process is being created, it shouldn't + * be put in this process's rgroup. Escape it to + * the nearest sgroup. 
+ */ + if (!(clone_flags & CLONE_THREAD) && cset->sgrp_cset) + cset = cset->sgrp_cset; + } + if (list_empty(&child->cg_list)) { get_css_set(cset); css_set_move_task(child, NULL, cset, false); } + spin_unlock_bh(&css_set_lock); } @@ -5846,6 +6036,29 @@ void cgroup_post_fork(struct task_struct *child, unsigned long clone_flags, } while_each_subsys_mask(); } +int cgroup_exec(void) +{ + struct cgroup *cgrp; + bool is_rgrp; + int ret; + + /* whether a task is in a sgroup or rgroup is immutable */ + rcu_read_lock(); + is_rgrp = is_rgroup(task_css_set(current)->dfl_cgrp); + rcu_read_unlock(); + + if (!is_rgrp) + return 0; + + /* exec should reset rgroup, escape to the nearest sgroup */ + cgroup_lock(); + cgrp = nearest_sgroup(task_css_set(current)->dfl_cgrp); + ret = cgroup_attach_task(cgrp, current, CGRP_MIGRATE_PROCESS); + cgroup_unlock(); + + return ret; +} + /** * cgroup_exit - detach cgroup from exiting task * @tsk: pointer to task_struct of exiting process diff --git a/kernel/fork.c b/kernel/fork.c index 840b662..70903fc 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -234,6 +234,9 @@ EXPORT_SYMBOL(free_task); static inline void free_signal_struct(struct signal_struct *sig) { +#ifdef CONFIG_CGROUPS + WARN_ON_ONCE(!list_empty(&sig->rgrps)); +#endif taskstats_tgid_free(sig); sched_autogroup_exit(sig); kmem_cache_free(signal_cachep, sig); @@ -1159,6 +1162,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) mutex_init(&sig->cred_guard_mutex); +#ifdef CONFIG_CGROUPS + INIT_LIST_HEAD(&sig->rgrps); + INIT_LIST_HEAD(&sig->rgrp_node); +#endif return 0; } @@ -1293,6 +1300,10 @@ static struct task_struct *copy_process(unsigned long clone_flags, return ERR_PTR(-EINVAL); } + /* Only threads can be put in child resource groups. */ + if (!(clone_flags & CLONE_THREAD) && (clone_flags & CLONE_NEWRGRP)) + return ERR_PTR(-EINVAL); + retval = security_task_create(clone_flags); if (retval) goto fork_out; -- 2.5.0 ^ permalink raw reply related [flat|nested] 50+ messages in thread
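The fork/exec escape rules above are the user-visible contract of this
patch.  A small hypothetical snippet to illustrate the CLONE_THREAD
requirement enforced in copy_process() (sketch only; the CLONE_NEWRGRP
value is copied from this patch's uapi addition):

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CLONE_NEWRGRP	0x00001000	/* from this patch's uapi addition */

static int fn(void *arg)
{
	return 0;
}

int main(void)
{
	char *stack = malloc(64 * 1024);

	if (!stack)
		return 1;
	/*
	 * A process-creating clone(2) may not request a new rgroup:
	 * only threads can be put in child resource groups, so
	 * copy_process() above rejects this combination with -EINVAL.
	 */
	if (clone(fn, stack + 64 * 1024, SIGCHLD | CLONE_NEWRGRP, NULL) < 0)
		printf("clone: %s (EINVAL expected)\n", strerror(errno));
	return 0;
}

fork(2) and execve(2) issued from a rgroup member then follow the
escape rules above: the new process, or the exec'ing task, ends up in
the nearest sgroup.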
* [PATCH 08/10] cgroup: implement rgroup control mask handling
  [not found] ` <1457710888-31182-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  ` (3 preceding siblings ...)
  2016-03-11 15:41 ` [PATCH 07/10] cgroup: introduce resource group Tejun Heo
@ 2016-03-11 15:41 ` Tejun Heo
  2016-03-11 15:41 ` [PATCH 10/10] cgroup, sched: implement PRIO_RGRP for {set|get}priority() Tejun Heo
  ` (4 subsequent siblings)
  9 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2016-03-11 15:41 UTC (permalink / raw)
  To: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
      akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
      a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, mingo-H+wXaHxf7aLQT0dZR+AlfA,
      lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w,
      pjt-hpIqsD4AKlfQT0dZR+AlfA
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
      linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg,
      Tejun Heo, Peter Zijlstra, Oleg Nesterov

To support in-process resource control, the previous patch implemented
the basic rgroup infrastructure which allows creating rgroups by
specifying CLONE_NEWRGRP during clone(2).  This patch implements
control mask handling for rgroups so that controllers can be enabled
on them.

There can be multiple top-level rgroups per process, all nested right
below the sgroup that the process belongs to.  They should share the
same control masks, which can be different from those of top-level
rgroups belonging to different processes or peer tasks.  These
top-level rgroup control masks are added to signal_struct and all
control mask handling functions are updated accordingly.

Note that a top-level rgroup should behave the same as a peer task
rather than a peer sibling cgroup.  As such, the set of controllers
available to a rgroup should be the same as the set for peer tasks.
To satisfy this requirement, controller availability for a top-level
rgroup is constrained by cgroup_{control|ss_mask}(nearest_sgrp)
instead of nearest_sgrp->subtree_{control|ss_mask}.

After this change, csses can be created on rgroups.
css_{clear|populate}_dir() are updated to become noops for csses on
rgroups.
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Cc: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Cc: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> Cc: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Cc: Paul Turner <pjt-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> --- include/linux/sched.h | 4 ++++ kernel/cgroup.c | 49 +++++++++++++++++++++++++++++++++++++++++++------ 2 files changed, 47 insertions(+), 6 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 7886919..d3849ad 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -781,6 +781,10 @@ struct signal_struct { #ifdef CONFIG_CGROUPS struct list_head rgrps; /* top-level rgroups under this sig */ struct list_head rgrp_node; /* parent_sgrp->child_rgrp_sigs list */ + u16 rgrp_subtree_control; /* control for top-level rgroups */ + u16 rgrp_subtree_ss_mask; /* ss_mask for top-level rgroups */ + u16 rgrp_old_subtree_control; /* used during control updates */ + u16 rgrp_old_subtree_ss_mask; /* used during control updates */ #endif oom_flags_t oom_flags; diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 53f479c..283b7ed 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -413,8 +413,8 @@ static u16 cgroup_control(struct cgroup *cgrp) struct cgroup *parent = cgroup_parent(cgrp); u16 root_ss_mask = cgrp->root->subsys_mask; - if (is_rgroup(cgrp)) - return 0; + if (is_rgroup(cgrp) && !is_rgroup(parent)) + return cgrp->rgrp_sig->rgrp_subtree_control; if (parent) return parent->subtree_control; @@ -430,8 +430,8 @@ static u16 cgroup_ss_mask(struct cgroup *cgrp) { struct cgroup *parent = cgroup_parent(cgrp); - if (is_rgroup(cgrp)) - return 0; + if (is_rgroup(cgrp) && !is_rgroup(parent)) + return cgrp->rgrp_sig->rgrp_subtree_ss_mask; if (parent) return parent->subtree_ss_mask; @@ -1565,6 +1565,9 @@ static void css_clear_dir(struct cgroup_subsys_state *css) css->flags &= ~CSS_VISIBLE; + if (is_rgroup(cgrp)) + return; + list_for_each_entry(cfts, &css->ss->cfts, node) cgroup_addrm_files(css, cgrp, cfts, false); } @@ -1584,6 +1587,9 @@ static int css_populate_dir(struct cgroup_subsys_state *css) if ((css->flags & CSS_VISIBLE) || !cgrp->kn) return 0; + if (is_rgroup(cgrp)) + goto done; + if (!css->ss) { if (cgroup_on_dfl(cgrp)) cfts = cgroup_dfl_base_files; @@ -1600,9 +1606,8 @@ static int css_populate_dir(struct cgroup_subsys_state *css) goto err; } } - +done: css->flags |= CSS_VISIBLE; - return 0; err: list_for_each_entry(cfts, &css->ss->cfts, node) { @@ -3116,8 +3121,15 @@ static void cgroup_save_control(struct cgroup *cgrp) struct cgroup_subsys_state *d_css; cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) { + struct signal_struct *sig; + dsct->old_subtree_control = dsct->subtree_control; dsct->old_subtree_ss_mask = dsct->subtree_ss_mask; + + list_for_each_entry(sig, &dsct->rgrp_child_sigs, rgrp_node) { + sig->rgrp_old_subtree_control = sig->rgrp_subtree_control; + sig->rgrp_old_subtree_ss_mask = sig->rgrp_subtree_ss_mask; + } } } @@ -3135,10 +3147,19 @@ static void cgroup_propagate_control(struct cgroup *cgrp) struct cgroup_subsys_state *d_css; cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) { + struct signal_struct *sig; + dsct->subtree_control &= cgroup_control(dsct); dsct->subtree_ss_mask = cgroup_calc_subtree_ss_mask(dsct->subtree_control, cgroup_ss_mask(dsct)); + + list_for_each_entry(sig, &dsct->rgrp_child_sigs, rgrp_node) { + sig->rgrp_subtree_control &= cgroup_control(dsct); + sig->rgrp_subtree_ss_mask = + 
cgroup_calc_subtree_ss_mask(sig->rgrp_subtree_control, + cgroup_ss_mask(dsct)); + } } } @@ -3155,6 +3176,13 @@ static void cgroup_restore_control(struct cgroup *cgrp) struct cgroup_subsys_state *d_css; cgroup_for_each_live_descendant_post(dsct, d_css, cgrp) { + struct signal_struct *sig; + + list_for_each_entry(sig, &dsct->rgrp_child_sigs, rgrp_node) { + sig->rgrp_subtree_control = sig->rgrp_old_subtree_control; + sig->rgrp_subtree_ss_mask = sig->rgrp_old_subtree_ss_mask; + } + dsct->subtree_control = dsct->old_subtree_control; dsct->subtree_ss_mask = dsct->old_subtree_ss_mask; } @@ -5326,6 +5354,15 @@ static struct cgroup *rgroup_create(struct cgroup *parent, lockdep_assert_held(&cgroup_mutex); + if (list_empty(&sig->rgrps)) { + WARN_ON_ONCE(is_rgroup(parent)); + + sig->rgrp_subtree_control = 0; + sig->rgrp_subtree_ss_mask = + cgroup_calc_subtree_ss_mask(sig->rgrp_subtree_control, + cgroup_ss_mask(parent)); + } + rgrp = cgroup_create(parent, sig); if (IS_ERR(rgrp)) return rgrp; -- 2.5.0 ^ permalink raw reply related [flat|nested] 50+ messages in thread
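To make the availability rule above concrete, here is a minimal sketch in plain C. It is illustrative only, not code from the patch; the struct and function names below are reduced stand-ins for the struct cgroup and struct signal_struct fields this series introduces.

#include <stdint.h>

/* stand-in for the nearest sgroup as a top-level rgroup sees it;
 * "control" plays the role of cgroup_control(sgrp), i.e. the
 * controllers enabled *for* the sgroup by its parent, not
 * sgrp->subtree_control */
struct sgrp_view {
	uint16_t control;
};

/* stand-in for the new per-process top-level rgroup mask kept in
 * signal_struct */
struct sig_view {
	uint16_t rgrp_subtree_control;
};

/* mirrors what the rgroup hunk in cgroup_propagate_control() does:
 * the per-process mask is clamped to what the nearest sgroup itself
 * is granted, so a top-level rgroup, which competes as a peer of the
 * sgroup's member threads, can never enable more than those threads
 * could use */
static void rgrp_clamp_control(struct sig_view *sig,
			       const struct sgrp_view *sgrp)
{
	sig->rgrp_subtree_control &= sgrp->control;
}

If the system manager later disables a controller for the sgroup, the clamp propagates downward and the controller silently disappears from the process's rgroup subtrees, matching the behavior described in the commit message above.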
* [PATCH 10/10] cgroup, sched: implement PRIO_RGRP for {set|get}priority() [not found] ` <1457710888-31182-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ` (4 preceding siblings ...) 2016-03-11 15:41 ` [PATCH 08/10] cgroup: implement rgroup control mask handling Tejun Heo @ 2016-03-11 15:41 ` Tejun Heo 2016-03-11 16:05 ` Example program for PRIO_RGRP Tejun Heo ` (3 subsequent siblings) 9 siblings, 0 replies; 50+ messages in thread From: Tejun Heo @ 2016-03-11 15:41 UTC (permalink / raw) To: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, Tejun Heo, Peter Zijlstra, Oleg Nesterov One of the missing features in cgroup v2 is the ability to control cpu cycle distribution hierarchically among threads of a process. With the rgroup infrastructure in place, this can be implemented as a natural extension of setpriority(). This patch introduces a new @which selector PRIO_RGRP for {set|get}priority() which can be used only when the calling thread is in a rgroup and respectively sets and gets the nice priority of the rgroup that the calling thread belongs to. The nice values have exactly the same meaning as for a single task and top-level rgroups compete with peer tasks as if the entire subtree were a single task with the specified nice value. setpriority(PRIO_RGRP, nice) automatically enables the cpu controller up to the rgroup of the thread. The cpu controller is available iff it's mounted on the default hierarchy and available on the nearest sgroup (i.e. the parent of the nearest sgroup should have it enabled in its subtree_control). If the controller isn't available, setpriority() fails with -ENODEV. If the cpu controller is made unavailable either through clearing of subtree_control or migration to a cgroup which doesn't have it available, the cpu controller is disabled for the affected rgroup subtrees. 
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> Cc: Ingo Molnar <mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Cc: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> Cc: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Cc: Paul Turner <pjt-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> --- include/linux/cgroup.h | 4 + include/linux/sched.h | 5 ++ include/uapi/linux/resource.h | 1 + kernel/cgroup.c | 190 ++++++++++++++++++++++++++++++++++++++++++ kernel/sched/core.c | 32 +++++++ kernel/sys.c | 11 ++- 6 files changed, 241 insertions(+), 2 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index ca1ec50..885c29e 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -110,6 +110,8 @@ extern void cgroup_post_fork(struct task_struct *p, unsigned long clone_flags, int cgroup_exec(void); void cgroup_exit(struct task_struct *p); void cgroup_free(struct task_struct *p); +int rgroup_setpriority(pid_t vpid, int nice); +int rgroup_getpriority(pid_t vpid); int cgroup_init_early(void); int cgroup_init(void); @@ -552,6 +554,8 @@ static inline void cgroup_post_fork(struct task_struct *p, static inline int cgroup_exec(void) { return 0; } static inline void cgroup_exit(struct task_struct *p) {} static inline void cgroup_free(struct task_struct *p) {} +static inline int rgroup_setpriority(pid_t vpid, int nice) { return -ENODEV; } +static inline int rgroup_getpriority(pid_t vpid) { return -ENODEV; } static inline int cgroup_init_early(void) { return 0; } static inline int cgroup_init(void) { return 0; } diff --git a/include/linux/sched.h b/include/linux/sched.h index d3849ad..36fc5cb 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2371,6 +2371,11 @@ extern u64 scheduler_tick_max_deferment(void); static inline bool sched_can_stop_tick(void) { return false; } #endif +#ifdef CONFIG_FAIR_GROUP_SCHED +extern int cpu_cgroup_setpriority(struct cgroup_subsys_state *css, int nice); +extern int cpu_cgroup_getpriority(struct cgroup_subsys_state *css); +#endif + #ifdef CONFIG_SCHED_AUTOGROUP extern void sched_autogroup_create_attach(struct task_struct *p); extern void sched_autogroup_detach(struct task_struct *p); diff --git a/include/uapi/linux/resource.h b/include/uapi/linux/resource.h index 36fb3b5..da15cb1 100644 --- a/include/uapi/linux/resource.h +++ b/include/uapi/linux/resource.h @@ -57,6 +57,7 @@ struct rlimit64 { #define PRIO_PROCESS 0 #define PRIO_PGRP 1 #define PRIO_USER 2 +#define PRIO_RGRP 3 /* * Limit the stack by to some sane default: root can always diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 6107a1f..92eb74d 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -6305,6 +6305,196 @@ void cgroup_free(struct task_struct *task) put_css_set(cset); } +/** + * task_rgroup_lock_and_drain_offline - lock a task's rgroup and drain + * @task: target task + * + * Look up @task's rgroup, lock, drain and return it. If @task doesn't + * belong to a rgroup, ERR_PTR(-ENODEV) is returned. + */ +static struct cgroup * +task_rgroup_lock_and_drain_offline(struct task_struct *task) +{ + struct cgroup *rgrp; + +retry: + rcu_read_lock(); + + do { + rgrp = task_css_set(task)->dfl_cgrp; + if (!is_rgroup(rgrp)) { + rcu_read_unlock(); + return ERR_PTR(-ENODEV); + } + + if (!cgroup_tryget(rgrp)) { + cpu_relax(); + continue; + } + } while (false); + + rcu_read_unlock(); + + cgroup_lock_and_drain_offline(rgrp); + + /* did we race against migration? 
*/ + if (rgrp != task_css_set(task)->dfl_cgrp) { + cgroup_unlock(); + goto retry; + } + + /* + * @task can't be moved to another cgroup while cgroup_mutex is + * held. No need to hold the extra reference. + */ + cgroup_put(rgrp); + + return rgrp; +} + +/** + * vpid_rgroup_lock_and_drain_offline - lock a vpid's rgroup and drain + * @vpid: target vpid + * @taskp: out parameter for the found task + * + * Look up the task for @vpid. If @vpid is zero, %current is used. If the + * task is found, look up its rgroup, lock, drain and return it. On + * success, the task's refcnt is incremented and *@taskp points to it. + * An ERR_PTR() value is returned on failure. + */ +static struct cgroup * +vpid_rgroup_lock_and_drain_offline(pid_t vpid, struct task_struct **taskp) +{ + struct task_struct *task; + struct cgroup *rgrp; + + rcu_read_lock(); + if (vpid) { + task = find_task_by_vpid(vpid); + if (!task) { + rcu_read_unlock(); + return ERR_PTR(-ESRCH); + } + } else { + task = current; + } + get_task_struct(task); + rcu_read_unlock(); + + rgrp = task_rgroup_lock_and_drain_offline(task); + if (IS_ERR(rgrp)) + put_task_struct(task); + else + *taskp = task; + + return rgrp; +} + +/** + * rgroup_enable_subsys - enable a subsystem on a rgroup + * @rgrp: target rgroup + * @sgrp: nearest sgroup of @rgrp + * @ss: subsystem to enable + * + * Try to enable @ss on @rgrp. On success, 0 is returned and @ss is + * enabled on @rgrp; otherwise, -errno is returned. The caller must always + * call cgroup_finalize_control() afterwards. + */ +static int __maybe_unused rgroup_enable_subsys(struct cgroup *rgrp, + struct cgroup *sgrp, + struct cgroup_subsys *ss) +{ + struct cgroup *pos; + int ret; + + lockdep_assert_held(&cgroup_mutex); + + cgroup_save_control(sgrp); + + for (pos = rgrp; pos != sgrp; pos = cgroup_parent(pos)) { + struct cgroup *parent = cgroup_parent(pos); + + if (parent == sgrp) + pos->rgrp_sig->rgrp_subtree_control |= 1 << ss->id; + else + parent->subtree_control |= 1 << ss->id; + } + + ret = cgroup_apply_control(sgrp); + if (ret) + return ret; + + /* did control propagation disable @ss? */ + if (!cgroup_css(rgrp, ss)) + return -ENODEV; + + return 0; +} + +int rgroup_setpriority(pid_t vpid, int nice) +{ + struct task_struct *task; + struct cgroup *rgrp; + struct cgroup *sgrp __maybe_unused; + int ret; + + rgrp = vpid_rgroup_lock_and_drain_offline(vpid, &task); + if (IS_ERR(rgrp)) + return PTR_ERR(rgrp); + + /* + * If @rgrp is top-level, it should be put under the same nice + * level restriction as @task; otherwise, limits are already + * applied higher up the hierarchy and there's no reason to + * restrict nice levels. 
+ */ + if (!is_rgroup(cgroup_parent(rgrp)) && !can_nice(task, nice)) { + ret = -EPERM; + goto out_unlock; + } + + ret = -ENODEV; + /* do ifdef late to preserve the correct error response */ +#ifdef CONFIG_FAIR_GROUP_SCHED + sgrp = nearest_sgroup(rgrp); + + /* enable cpu and apply weight */ + ret = rgroup_enable_subsys(rgrp, sgrp, &cpu_cgrp_subsys); + if (!ret) + ret = cpu_cgroup_setpriority(cgroup_css(rgrp, &cpu_cgrp_subsys), + nice); + cgroup_finalize_control(sgrp, ret); +#endif + +out_unlock: + cgroup_unlock(); + put_task_struct(task); + return ret; +} + +int rgroup_getpriority(pid_t vpid) +{ + struct task_struct *task; + struct cgroup *rgrp; + int ret; + + rgrp = vpid_rgroup_lock_and_drain_offline(vpid, &task); + if (IS_ERR(rgrp)) + return PTR_ERR(rgrp); + + ret = -ENODEV; + /* do ifdef late to preserve the correct error response */ +#ifdef CONFIG_FAIR_GROUP_SCHED + if (cgroup_css(rgrp, &cpu_cgrp_subsys)) { + ret = cpu_cgroup_getpriority(cgroup_css(rgrp, &cpu_cgrp_subsys)); + ret = nice_to_rlimit(ret); + } +#endif + cgroup_unlock(); + put_task_struct(task); + return ret; +} + static void check_for_release(struct cgroup *cgrp) { if (notify_on_release(cgrp) && !cgroup_is_populated(cgrp) && diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 16ad92b..e22e0ce 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -8747,6 +8747,35 @@ static int cpu_weight_write_u64(struct cgroup_subsys_state *css, return sched_group_set_shares(css_tg(css), scale_load(weight)); } + +static int cpu_cgroup_css_copy(struct cgroup_subsys_state *to, + struct cgroup_subsys_state *from) +{ + struct task_group *to_tg = css_tg(to); + struct task_group *from_tg = css_tg(from); + + return sched_group_set_shares(to_tg, from_tg->shares); +} + +int cpu_cgroup_setpriority(struct cgroup_subsys_state *css, int nice) +{ + int prio = NICE_TO_PRIO(clamp_val(nice, MIN_NICE, MAX_NICE)); + int weight = sched_prio_to_weight[prio - MAX_RT_PRIO]; + + return sched_group_set_shares(css_tg(css), scale_load(weight)); +} + +int cpu_cgroup_getpriority(struct cgroup_subsys_state *css) +{ + int weight = css_tg(css)->shares; + int idx; + + for (idx = 0; idx < ARRAY_SIZE(sched_prio_to_weight) - 1; idx++) + if (weight >= sched_prio_to_weight[idx]) + break; + + return PRIO_TO_NICE(idx + MAX_RT_PRIO); +} #endif static void __maybe_unused cpu_period_quota_print(struct seq_file *sf, @@ -8835,6 +8864,9 @@ struct cgroup_subsys cpu_cgrp_subsys = { .css_free = cpu_cgroup_css_free, .css_online = cpu_cgroup_css_online, .css_offline = cpu_cgroup_css_offline, +#ifdef CONFIG_FAIR_GROUP_SCHED + .css_copy = cpu_cgroup_css_copy, +#endif .fork = cpu_cgroup_fork, .can_attach = cpu_cgroup_can_attach, .attach = cpu_cgroup_attach, diff --git a/kernel/sys.c b/kernel/sys.c index 78947de..923f66a 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -41,6 +41,7 @@ #include <linux/syscore_ops.h> #include <linux/version.h> #include <linux/ctype.h> +#include <linux/cgroup.h> #include <linux/compat.h> #include <linux/syscalls.h> @@ -181,7 +182,7 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval) struct pid *pgrp; kuid_t uid; - if (which > PRIO_USER || which < PRIO_PROCESS) + if (which > PRIO_RGRP || which < PRIO_PROCESS) goto out; /* normalize: avoid signed division (rounding problems) */ @@ -191,6 +192,9 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval) if (niceval > MAX_NICE) niceval = MAX_NICE; + if (which == PRIO_RGRP) + return rgroup_setpriority(who, niceval); + rcu_read_lock(); read_lock(&tasklist_lock); switch (which) { 
@@ -251,9 +255,12 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who) struct pid *pgrp; kuid_t uid; - if (which > PRIO_USER || which < PRIO_PROCESS) + if (which > PRIO_RGRP || which < PRIO_PROCESS) return -EINVAL; + if (which == PRIO_RGRP) + return rgroup_getpriority(who); + rcu_read_lock(); read_lock(&tasklist_lock); switch (which) { -- 2.5.0 ^ permalink raw reply related [flat|nested] 50+ messages in thread
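Before the example program that follows, here is a rough sketch of what using the new selector looks like from userland when calling the syscalls directly. It assumes only the uapi addition above (PRIO_RGRP == 3) and a target thread that is already in a rgroup; the helper names are hypothetical. Note that the raw getpriority() syscall returns the rlimit-style value (20 - nice) produced by nice_to_rlimit() in the patch; the glibc wrapper normally converts this back to a nice value.

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

#ifndef PRIO_RGRP
#define PRIO_RGRP 3	/* from the include/uapi/linux/resource.h hunk above */
#endif

/* set the nice level of @tid's rgroup (@tid == 0 targets the calling
 * thread); returns -1 with errno set on failure, e.g. ENODEV when the
 * thread isn't in a rgroup or the cpu controller isn't available on
 * the nearest sgroup */
static int rgrp_set_nice(pid_t tid, int nice)
{
	return syscall(SYS_setpriority, PRIO_RGRP, (int)tid, nice);
}

/* read the rgroup nice level back; the raw syscall's return value is
 * rlimit-style, so convert before use */
static int rgrp_get_nice(pid_t tid, int *nice)
{
	long ret = syscall(SYS_getpriority, PRIO_RGRP, (int)tid);

	if (ret == -1)
		return -1;	/* errno set */
	*nice = 20 - (int)ret;
	return 0;
}

The example program posted next in this thread exercises the same pair of calls through the glibc wrappers.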
* Example program for PRIO_RGRP [not found] ` <1457710888-31182-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ` (5 preceding siblings ...) 2016-03-11 15:41 ` [PATCH 10/10] cgroup, sched: implement PRIO_RGRP for {set|get}priority() Tejun Heo @ 2016-03-11 16:05 ` Tejun Heo 2016-03-12 6:26 ` [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Mike Galbraith ` (2 subsequent siblings) 9 siblings, 0 replies; 50+ messages in thread From: Tejun Heo @ 2016-03-11 16:05 UTC (permalink / raw) To: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg [-- Attachment #1: Type: text/plain, Size: 1202 bytes --] Hello, The attached test-rgrp-burn.c is an example program making use of the new PRIO_RGRP. The test program creates the following rgroup hierarchy sgroup - main thread + [rgroup-0] burner-0 + [rgroup-1] + [rgroup-2] burner-1 + [rgroup-3] burner-2 and takes up to 4 arguments, respectively specifying the nice level for each rgroup. Each burner thread executes CPU burning loops and periodically prints out how many loops it has completed. * "./test-rgrp-burn" If the program is run without any argument, on a kernel which doesn't support rgroup, or from a cgroup where the cpu controller is not available, the three burner threads would run at about the same speed. * "./test-rgrp-burn 0" from a cgroup w/ cpu controller The cpu controller is enabled across the top level, so rgroup-0 and rgroup-1 compete on equal footing and burner-0 runs about twice as fast as burner-1 or burner-2, which split rgroup-1's allocation between them. * "./test-rgrp-burn 0 3 -1 2" from a cgroup w/ cpu controller The cpu controller is enabled at both levels. A nice level difference of 3 is about a twofold difference in weight, so rgroup-0 gets roughly twice the CPU of rgroup-1 and, within rgroup-1, burner-1 roughly twice that of burner-2; the ratio would roughly be burner-0 : burner-1 : burner-2 ~= 6 : 2 : 1 Thanks. -- tejun [-- Attachment #2: test-rgrp-burn.c --] [-- Type: text/plain, Size: 3315 bytes --] /* * test-rgrp-burn - rgrp test program * * Creates the following rgrp hierarchy of three CPU cycle burner * threads. * * sgrp - main thread * + [rgrp-0] burner thread * + [rgrp-1] + [rgrp-2] nested burner thread * + [rgrp-3] nested burner thread * * Takes up to 4 arguments specifying the nice level of each rgrp. */ #define _GNU_SOURCE #include <sys/types.h> #include <sys/wait.h> #include <sched.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <errno.h> #include <limits.h> #include <pthread.h> #include <sys/syscall.h> #include <sys/time.h> #include <sys/resource.h> #define CLONE_NEWRGRP 0x00001000 /* New resource group */ #define PRIO_RGRP 3 #define STACK_SIZE (4 * 1024 * 1024) #define CLONE_THREAD_FLAGS (CLONE_THREAD | CLONE_SIGHAND | CLONE_VM | \ CLONE_FS | CLONE_FILES) static int nice_val[] = { [0 ... 3] = INT_MIN }; static pthread_mutex_t lprintf_mutex; #define lprintf(fmt, args...) 
do { \ pthread_mutex_lock(&lprintf_mutex); \ printf(fmt, ##args); \ pthread_mutex_unlock(&lprintf_mutex); \ } while (0) static int gettid(void) { return syscall(SYS_gettid); } static int burner_fn(void *arg) { unsigned long a = 37, cnt = 0; sleep(1); lprintf("burner : %d started\n", gettid()); while (1) { *(volatile unsigned long *)&a = a * 37 / 13 + 53; *(volatile unsigned long *)&a = a * 37 / 13 + 53; *(volatile unsigned long *)&a = a * 37 / 13 + 53; *(volatile unsigned long *)&a = a * 37 / 13 + 53; *(volatile unsigned long *)&a = a * 37 / 13 + 53; *(volatile unsigned long *)&a = a * 37 / 13 + 53; *(volatile unsigned long *)&a = a * 37 / 13 + 53; *(volatile unsigned long *)&a = a * 37 / 13 + 53; if (!(++cnt % (1000000 * 100))) { int prio; errno = 0; prio = getpriority(PRIO_RGRP, 0); lprintf("burner : %d finished %lum loops (rgrp nice=%d errno=%d)\n", gettid(), cnt / 1000000, prio, errno); } } return 0; } static void rgrp_setprio(pid_t pid, int nice) { if (nice == INT_MIN) return; lprintf("setprio: setting PRIO_RGRP to %d on %d\n", nice, pid); if (setpriority(PRIO_RGRP, pid, nice)) perror("setpriority"); } static int child_fn(void *arg) { char *stack; pid_t pid; stack = malloc(STACK_SIZE) + STACK_SIZE; pid = clone(burner_fn, stack, CLONE_THREAD_FLAGS | CLONE_NEWRGRP, NULL); lprintf("child : cloned nested burner %d\n", pid); rgrp_setprio(pid, nice_val[2]); stack = malloc(STACK_SIZE) + STACK_SIZE; pid = clone(burner_fn, stack, CLONE_THREAD_FLAGS | CLONE_NEWRGRP, NULL); lprintf("child : cloned nested burner %d\n", pid); rgrp_setprio(pid, nice_val[3]); sleep(500); return 0; } int main(int argc, char **argv) { char *stack; pid_t pid; int i; if (argc > 5) argc = 5; for (i = 1; i < argc; i++) nice_val[i - 1] = atoi(argv[i]); pthread_mutex_init(&lprintf_mutex, NULL); stack = malloc(STACK_SIZE) + STACK_SIZE; pid = clone(burner_fn, stack, CLONE_THREAD_FLAGS | CLONE_NEWRGRP, NULL); lprintf("main : cloned burner %d\n", pid); rgrp_setprio(pid, nice_val[0]); stack = malloc(STACK_SIZE) + STACK_SIZE; pid = clone(child_fn, stack, CLONE_THREAD_FLAGS | CLONE_NEWRGRP, NULL); lprintf("main : cloned child %d\n", pid); rgrp_setprio(pid, nice_val[1]); sleep(500); return 0; } ^ permalink raw reply [flat|nested] 50+ messages in thread
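As a quick cross-check of the ratios predicted in the mail above: the kernel's sched_prio_to_weight table maps nice -1, 0, 2 and 3 to weights 1277, 1024, 655 and 526. Plugging those into the two-level hierarchy of "./test-rgrp-burn 0 3 -1 2" gives the following back-of-the-envelope sketch (not part of the posting):

#include <stdio.h>

int main(void)
{
	double rg0 = 1024.0;	/* rgroup-0, nice 0 */
	double rg1 = 526.0;	/* rgroup-1, nice 3 */
	double rg2 = 1277.0;	/* rgroup-2, nice -1 */
	double rg3 = 655.0;	/* rgroup-3, nice 2 */

	/* top level: burner-0 competes against all of rgroup-1 */
	double b0 = rg0 / (rg0 + rg1);
	/* rgroup-1's share splits between the two nested burners */
	double b1 = (1.0 - b0) * rg2 / (rg2 + rg3);
	double b2 = (1.0 - b0) * rg3 / (rg2 + rg3);

	/* prints 5.74 : 1.95 : 1, i.e. roughly 6 : 2 : 1 */
	printf("%.2f : %.2f : 1\n", b0 / b2, b1 / b2);
	return 0;
}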
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <1457710888-31182-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ` (6 preceding siblings ...) 2016-03-11 16:05 ` Example program for PRIO_RGRP Tejun Heo @ 2016-03-12 6:26 ` Mike Galbraith [not found] ` <1457764019.10402.72.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2016-03-14 11:30 ` Peter Zijlstra 2016-03-15 17:21 ` Michal Hocko 9 siblings, 1 reply; 50+ messages in thread From: Mike Galbraith @ 2016-03-12 6:26 UTC (permalink / raw) To: Tejun Heo, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Fri, 2016-03-11 at 10:41 -0500, Tejun Heo wrote: > Hello, > > This patchset extends cgroup v2 to support rgroup (resource group) for > in-process hierarchical resource control and implements PRIO_RGRP for > setpriority(2) on top to allow in-process hierarchical CPU cycle > control in a seamless way. > > cgroup v1 allowed putting threads of a process in different cgroups > which enabled ad-hoc in-process resource control of some resources. > Unfortunately, this approach was fraught with problems such as > membership ambiguity with per-process resources and lack of isolation > between system management and in-process properties. For a more > detailed discussion on the subject, please refer to the following > message. > > [1] [RFD] cgroup: thread granularity support for cpu controller > > This patchset implements the mechanism outlined in the above message. > The new mechanism is named rgroup (resource group). When explicitly > designating a non-rgroup cgroup, the term sgroup (system group) is > used. rgroup has the following properties. > > * A rgroup is a cgroup which is invisible on and transparent to the > system-level cgroupfs interface. > > * A rgroup can be created by specifying CLONE_NEWRGRP flag, along with > CLONE_THREAD, during clone(2). A new rgroup is created under the > parent thread's cgroup and the new thread is created in it. > > * A rgroup is automatically destroyed when empty. > > * A top-level rgroup of a process is a rgroup whose parent cgroup is a > sgroup. A process may have multiple top-level rgroups and thus > multiple rgroup subtrees under the same parent sgroup. > > * Unlike sgroups, rgroups are allowed to compete against peer threads. > Each rgroup behaves equivalent to a sibling task. > > * rgroup subtrees are local to the process. When the process forks or > execs, its rgroup subtrees are collapsed. > > * When a process is migrated to a different cgroup, its rgroup > subtrees are preserved. > > * Subset of controllers available on the parent sgroup are available > to rgroup subtrees. Controller management on rgroups is automatic > and implicit and doesn't interfere with system-level cgroup > controller management. If a controller is made unavailable on the > parent sgroup, it's automatically disabled from child rgroup > subtrees. 
> > rgroup lays the foundation for other kernel mechanisms to make use of > resource controllers while providing proper isolation between system > management and in-process operations removing the awkward and > layer-violating requirement for coordination between individual > applications and system management. On top of the rgroup mechanism, > PRIO_RGRP is implemented for {set|get}priority(2). > > * PRIO_RGRP can only be used if the target task is already in a > rgroup. If setpriority(2) is used and cpu controller is available, > cpu controller is enabled until the target rgroup is covered and the > specified nice value is set as the weight of the rgroup. > > * The specified nice value has the same meaning as for tasks. For > example, a rgroup and a task competing under the same parent would > behave exactly the same as two tasks. > > * For top-level rgroups, PRIO_RGRP follows the same rlimit > restrictions as PRIO_PROCESS; however, as nested rgroups only > distribute CPU cycles which are allocated to the process, no > restriction is applied. > > PRIO_RGRP allows in-process hierarchical control of CPU cycles in a > manner which is a straight-forward and minimal extension of existing > task and priority management. Hrm. You're showing that per-thread groups can coexist just fine, which is good given need and usage exists today out in the wild. Why do such groups have to be invisible with a unique interface though? Given the core has to deal with them whether they're visible or not, and given they exist to fulfill a need, seems they should be first class citizens, not some Quasimodo like creature sneaking into the cathedral via a back door and slinking about in the shadows. -Mike ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <1457764019.10402.72.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2016-03-12 17:04 ` Mike Galbraith [not found] ` <1457802262.3628.129.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2016-03-13 15:00 ` Tejun Heo 1 sibling, 1 reply; 50+ messages in thread From: Mike Galbraith @ 2016-03-12 17:04 UTC (permalink / raw) To: Tejun Heo, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Sat, 2016-03-12 at 07:26 +0100, Mike Galbraith wrote: > On Fri, 2016-03-11 at 10:41 -0500, Tejun Heo wrote: > > Hello, > > > > This patchset extends cgroup v2 to support rgroup (resource group) for > > in-process hierarchical resource control and implements PRIO_RGRP for > > setpriority(2) on top to allow in-process hierarchical CPU cycle > > control in a seamless way. > > > > cgroup v1 allowed putting threads of a process in different cgroups > > which enabled ad-hoc in-process resource control of some resources. BTW, within the scheduler, "process" does not exist. A high level composite entity is what we currently aggregate from arbitrary individual entities, a.k.a threads. Whether an individual entity be an un-threaded "process" bash, a thread of "process" oracle, or one of "process!?!" kernel is irrelevant. What entity aggregation has to do with "process" eludes me completely. What's ad-hoc or unusual about a thread pool servicing an arbitrary number of customers using cgroup bean accounting? Job arrives from customer, worker is dispatched to customer workshop (cgroup), it does whatever on behest of customer, sends bean count off to the billing department, and returns to the break room. What's so annoying about using bean counters for.. counting beans that you want to forbid it? -Mike ^ permalink raw reply [flat|nested] 50+ messages in thread
* cgroup NAKs ignored? Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <1457802262.3628.129.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2016-03-12 17:13 ` Ingo Molnar [not found] ` <20160312171318.GD1108-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 50+ messages in thread From: Ingo Molnar @ 2016-03-12 17:13 UTC (permalink / raw) To: Mike Galbraith Cc: Tejun Heo, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, Thomas Gleixner * Mike Galbraith <umgwanakikbuti-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > On Sat, 2016-03-12 at 07:26 +0100, Mike Galbraith wrote: > > On Fri, 2016-03-11 at 10:41 -0500, Tejun Heo wrote: > > > Hello, > > > > > > This patchset extends cgroup v2 to support rgroup (resource group) for > > > in-process hierarchical resource control and implements PRIO_RGRP for > > > setpriority(2) on top to allow in-process hierarchical CPU cycle > > > control in a seamless way. > > > > > > cgroup v1 allowed putting threads of a process in different cgroups > > > which enabled ad-hoc in-process resource control of some resources. > > BTW, within the scheduler, "process" does not exist. [...] Yes, and that's very fundamental. And I see that many bits of the broken 'v2' cgroups ABI already snuck into the upstream kernel in this merge window, without this detail having been agreed upon! :-( Tejun, this _REALLY_ sucks. We had pending NAKs over the design, still you moved ahead like nothing happened, why?! > [...] A high level composite entity is what we currently aggregate from > arbitrary individual entities, a.k.a threads. Whether an individual entity be > an un-threaded "process" bash, a thread of "process" oracle, or one of > "process!?!" kernel is irrelevant. What entity aggregation has to do with > "process" eludes me completely. > > What's ad-hoc or unusual about a thread pool servicing an arbitrary number of > customers using cgroup bean accounting? Job arrives from customer, worker is > dispatched to customer workshop (cgroup), it does whatever on behest of > customer, sends bean count off to the billing department, and returns to the > break room. What's so annoying about using bean counters for.. counting beans > that you want to forbid it? Agreed ... and many others expressed this concern as well. Why were these concerns ignored? Thanks, Ingo ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: cgroup NAKs ignored? Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160312171318.GD1108-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2016-03-13 14:42 ` Tejun Heo 0 siblings, 0 replies; 50+ messages in thread From: Tejun Heo @ 2016-03-13 14:42 UTC (permalink / raw) To: Ingo Molnar Cc: Mike Galbraith, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, Thomas Gleixner Hello, Ingo. On Sat, Mar 12, 2016 at 06:13:18PM +0100, Ingo Molnar wrote: > > BTW, within the scheduler, "process" does not exist. [...] > > Yes, and that's very fundamental. I'll go into this part later. > And I see that many bits of the broken 'v2' cgroups ABI already snuck into the > upstream kernel in this merge window, without this detail having been agreed upon! > :-( > > Tejun, this _REALLY_ sucks. We had pending NAKs over the design, still you moved > ahead like nothing happened, why?! Hmmmm? The cpu controller is still in review branch. The thread sprawled out but the disagreement there was about missing the ability to hierarchically distribute CPU cycles in-process, and the two alternatives discussed throughout the thread were a per-process private filesystem under /proc/PID and extension of existing process resource management mechanisms. Going back to the per-process part, I described the rationales in cgroup-v2 documentation and the RFD document but here are some important bits. 1. Common resource domains * When different resources get intermixed as do memory and io during writeback, without a common resource domain defined across the different resource types, it's impossible to perform resource control. As a simplistic example, let's say there are four processes (1, 2, 3, 4), two memory cgroups (ma, mb) and two io cgroups (ia, ib) with the following membership. ma: 1, 2 mb: 3, 4 ia: 1, 3 ib: 2, 4 Writeback and dirty throttling are regulated by the proportion of dirty memory against available memory and the writeback bandwidth of the target backing device. When resource domains are orthogonal like the above, it's impossible to define a clear relationship. This is one of the main reasons why writeback behavior has been so erratic with respect to cgroups. * It is a lot more useful and less painful to have common resource domains defined across all resource types as it allows expressing things like "if this belongs to resource domain F, do XYZ". A lot of use cases are already doing this by building the identical hierarchies (to differing depths) across all controllers. 2. Per-process * There is a relatively pronounced boundary between system management and internal operations of an application and one side-effect of allowing threads to be assigned arbitrarily across the system cgroupfs hierarchy is that it mandates close coordination between individual applications and system management (whether that be a human being or system agent software). This is userland suffering because the kernel fails to provide properly abstracted and isolated constructs. Decoupling system management and in-application operations makes hierarchical resource grouping and control easily accessible to individual applications without worrying about how the system is managed in larger scope. 
Process is a fairly good approximation of this boundary. * For some resources, going beyond process granularity doesn't make much sense. While we can just let users do whatever they wanna do and declare certain configurations to yield undefined behaviors (io controller on v1 hierarchy actually does this), it is better to provide abstractions which match the actual characteristics. Combined with the above, it is natural to distinguish across-process and in-process operations. > > [...] A high level composite entity is what we currently aggregate from > > arbitrary individual entities, a.k.a threads. Whether an individual entity be > > an un-threaded "process" bash, a thread of "process" oracle, or one of > > "process!?!" kernel is irrelevant. What entity aggregation has to do with > > "process" eludes me completely. > > > > What's ad-hoc or unusual about a thread pool servicing an arbitrary number of > > customers using cgroup bean accounting? Job arrives from customer, worker is > > dispatched to customer workshop (cgroup), it does whatever on behest of > > customer, sends bean count off to the billing department, and returns to the > > break room. What's so annoying about using bean counters for.. counting beans > > that you want to forbid it? > > Agreed ... and many others expressed this concern as well. Why were these concerns > ignored? They weren't ignored. The concern expressed was the loss of the ability to hierarchically distribute resources in-process, and the RFD document and this patchset are the attempts at resolving that specific issue. Going back to Mike's "why can't these be arbitrary bean counters?", yes, they can be. That's what one gets when the cpu controller is mounted on its own hierarchy. If that's what the use case at hand calls for, that is the way to go and there's nothing preventing that. In fact, with recent restructuring of cgroup core, moving a stateless controller to a new hierarchy can be made a lot easier for such use cases. However, as explained above, controlling a resource in an abstraction- and restriction-free style also has its costs. There's no way to tie different types of resources serving the same purpose, which can be generally painful and makes some cross-resource operations impossible. Or entangling in-process operations with system management, IOW, a process having to speak to the external $SYSTEM_AGENT to manage its threadpools. What the proposed solution tries to achieve is balancing flexibility at the system management level with proper abstractions and isolation so that hierarchical resource management is actually accessible to a lot wider set of applications and use-cases. Given how cgroup is used in the wild, I'm pretty sure that the structured approach will reach a lot wider audience without getting in the way of what they try to achieve. That said, again, for specific use cases where the benefits from the structured approach can or should be ignored, using the cpu controller as arbitrary hierarchical bean counters is completely fine and the right solution. Thanks. -- tejun ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <1457764019.10402.72.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2016-03-12 17:04 ` Mike Galbraith @ 2016-03-13 15:00 ` Tejun Heo [not found] ` <20160313150012.GB13405-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 1 sibling, 1 reply; 50+ messages in thread From: Tejun Heo @ 2016-03-13 15:00 UTC (permalink / raw) To: Mike Galbraith Cc: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg Hello, Mike. On Sat, Mar 12, 2016 at 07:26:59AM +0100, Mike Galbraith wrote: > Hrm. You're showing that per-thread groups can coexist just fine, > which is good given need and usage exists today out in the wild. Why > do such groups have to be invisible with a unique interface though? I tried to explain these in the aforementioned RFD document. I'll give a brief example here. Let's say there is an application which wants to manage resource distributions across its multiple threadpools in a hierarchical way. With cgroupfs interface as the only system-wide interface, it has to coordinate who or whatever is managing that interface. Maybe it can get a subtree delegated to it, maybe it has to ask the system thing to create and place threads there, maybe it can just expose the pids and let the system management do its thing (what if the threads in the pools are dynamic tho?). There is no reliable universal way of doing this. Each such application has to be ready to specifically coordinate with the specific system management in use. This is kernel failing to provide proper abstraction and isolation between different layers. The "raw" feature is there but it's unsafe to use and thus can't be used widely. > Given the core has to deal with them whether they're visible or not, > and given they exist to fulfill a need, seems they should be first > class citizens, not some Quasimodo like creature sneaking into the > cathedral via a back door and slinking about in the shadows. In terms of programmability and accessibility for individual applications, group resource management being available through straight-forward and incremental extension of existing mechanisms is *way* more first class citizen. It is two seamless extensions to clone(2) and setpriority(2) making hierarchical resource management generally available to applications. There can be use cases where building cpu resource hierarchy which is completely alien to how the rest of the system is organized is useful. For those cases, the only thing which can be done is building a separate hierarchy for the cpu controller and that capability isn't going anywhere. Thanks. -- tejun ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160313150012.GB13405-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> @ 2016-03-13 17:40 ` Mike Galbraith [not found] ` <1457890835.3859.72.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2016-03-14 2:23 ` Mike Galbraith 1 sibling, 1 reply; 50+ messages in thread From: Mike Galbraith @ 2016-03-13 17:40 UTC (permalink / raw) To: Tejun Heo Cc: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Sun, 2016-03-13 at 11:00 -0400, Tejun Heo wrote: > Hello, Mike. > > On Sat, Mar 12, 2016 at 07:26:59AM +0100, Mike Galbraith wrote: > > Hrm. You're showing that per-thread groups can coexist just fine, > > which is good given need and usage exists today out in the wild. Why > > do such groups have to be invisible with a unique interface though? > > I tried to explain these in the aforementioned RFD document. I'll give > a brief example here. > > Let's say there is an application which wants to manage resource > distributions across its multiple threadpools in a hierarchical way. > With cgroupfs interface as the only system-wide interface, it has to > coordinate who or whatever is managing that interface. Maybe it can > get a subtree delegated to it, maybe it has to ask the system thing to > create and place threads there, maybe it can just expose the pids and > let the system management do its thing (what if the threads in the > pools are dynamic tho?). There is no reliable universal way of doing > this. Each such application has to be ready to specifically > coordinate with the specific system management in use. The last thing I ever want to see on my boxen is random applications either doing their own thing with my cgroups management interface, or conspiring with "the system thing" behind my back to do things that I did not specifically ask them to do. "The system thing" started doing its own thing behind my back, and oddly enough, its tentacles started falling off. By golly, its eyes seem to have fallen out as well. That's what happens when control freak meets control freak, one of them ends up in pieces. There can be only one, and that one is me, the administrator. Applications don't coordinate spit, if I put on my administrator hat and stuff 'em in a box, they better stay stuffed. > This is kernel failing to provide proper abstraction and isolation > between different layers. The "raw" feature is there but it's unsafe > to use and thus can't be used widely. Good, management of my boxen is my turf. The raw feature works fine today, and I'd like to see it keep on working tomorrow. If tools written for administration diddle, that's fine IFF I'm the guy on the other end of the tool. All other manipulators of my management interfaces can go.. fish. > > Given the core has to deal with them whether they're visible or not, > > and given they exist to fulfill a need, seems they should be first > > class citizens, not some Quasimodo like creature sneaking into the > > cathedral via a back door and slinking about in the shadows. 
> > In terms of programmability and accessibility for individual > applications, group resource management being available through > straight-forward and incremental extension of exsiting mechanisms is > *way* more first class citizen. It is two seamless extensions to > clone(2) and setpriority(2) making hierarchical resource management > generally available to applications. To me, that sounds like chaos. > There can be use cases where building cpu resource hierarchy which is > completely alien to how the rest of the system is organized is useful. > For those cases, the only thing which can be done is building a > separate hierarchy for the cpu controller and that capability isn't > going anywhere. As long as administrators can use the system interface to aggregate what they see fit, I'm happy. The scheduler schedules threads, ergo the cpu controller must aggregate threads. There is no process. -Mike ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <1457890835.3859.72.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2016-04-07 0:00 ` Tejun Heo [not found] ` <20160407000034.GL24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 50+ messages in thread From: Tejun Heo @ 2016-04-07 0:00 UTC (permalink / raw) To: Mike Galbraith Cc: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg Hello, Mike. On Sun, Mar 13, 2016 at 06:40:35PM +0100, Mike Galbraith wrote: > On Sun, 2016-03-13 at 11:00 -0400, Tejun Heo wrote: > > Let's say there is an application which wants to manage resource > > distributions across its multiple threadpools in a hierarchical way. > > With cgroupfs interface as the only system-wide interface, it has to > > coordinate who or whatever is managing that interface. Maybe it can > > get a subtree delegated to it, maybe it has to ask the system thing to > > create and place threads there, maybe it can just expose the pids and > > let the system management do its thing (what if the threads in the > > pools are dynamic tho?). There is no reliable universal way of doing > > this. Each such application has to be ready to specifically > > coordinate with the specific system management in use. > > The last thing I ever want to see on my boxen is random applications > either doing their own thing with my cgroups management interface, or > conspiring with "the system thing" behind my back to do things that I > did not specifically ask them to do. That isn't too different from saying that you don't want applications to be calling setpriority(2) on its threads, which is a weird thing to say, especially given that there are situations where applying control from outside simply can't work - thread pools can be dynamic and there is no reliable way of telling which threads are for which purposes from outside. This is not to say that admin override is unnecessary or unsupported. In fact, rgroup and cgroup give the admin a lot more control. Controller access can be revoked from applications in subtrees and the entire controller can be detached from the hierarchy for full override. > "The system thing" started doing its own thing behind my back, and > oddly enough, its tentacles started falling off. By golly, its eyes > seem to have fallen out as well. > > That's what happens when control freak meets control freak, one of them > ends up in pieces. There can be only one, and that one is me, the > administrator. Applications don't coordinate spit, if I put on my > administrator hat and stuff 'em in a box, they better stay stuffed. Sure, you're a control freak. Be that. rgroup doesn't get in the way of you doing that; however, you also have to realize that a single person hand-configuring a specialized setup for oneself isn't the only mode of usage. Those are, in fact, vocal but clearly minority use cases. What's more common would be systematic management of resources and applications configuring resource distribution across their threads. If you wanna assume full control, do so. Nothing is preventing that, and, at the same time, that shouldn't get in the way of implementing mechanisms which are more widely useful. 
> > > Given the core has to deal with them whether they're visible or not, > > > and given they exist to fulfill a need, seems they should be first > > > class citizens, not some Quasimodo like creature sneaking into the > > > cathedral via a back door and slinking about in the shadows. > > > > In terms of programmability and accessibility for individual > > applications, group resource management being available through > > straight-forward and incremental extension of exsiting mechanisms is > > *way* more first class citizen. It is two seamless extensions to > > clone(2) and setpriority(2) making hierarchical resource management > > generally available to applications. > > To me, that sounds like chaos. Care to elaborate rationales for the claim? > > There can be use cases where building cpu resource hierarchy which is > > completely alien to how the rest of the system is organized is useful. > > For those cases, the only thing which can be done is building a > > separate hierarchy for the cpu controller and that capability isn't > > going anywhere. > > As long as administrators can use the system interface to aggregate > what they see fit, I'm happy. The scheduler schedules threads, ergo > the cpu controller must aggregate threads. There is no process. Scheduler not knowing beyond threads is great but that doesn't make the concept of process any less of a real thing when the kernel interacts with userland. Processes and threads are clearly the primary constructs that our userland uses for execution contexts, and process boundaries are frequently used to delimit various isolation domains both in kerne API and programming conventions. A capability can also become inaccessible when it's exposed in a way which doesn't work in conjunction with the existing abstractions. The kernel not providing isolation at the expected layers is a failure that can prevent the feature from being useable in a lot of cases. What rgroup tries to do is exposing cgroup's capabilities in a way which integrates with the existing programming constructs that userland already depends upon to make these capabilities accessible to them. Again, if you wanna be a control freak, nothing stands in your way, but control freak admins aren't the only consumers of kernel. Thanks. -- tejun ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160407000034.GL24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> @ 2016-04-07 3:26 ` Mike Galbraith 0 siblings, 0 replies; 50+ messages in thread From: Mike Galbraith @ 2016-04-07 3:26 UTC (permalink / raw) To: Tejun Heo Cc: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Wed, 2016-04-06 at 20:00 -0400, Tejun Heo wrote: > Hello, Mike. > > On Sun, Mar 13, 2016 at 06:40:35PM +0100, Mike Galbraith wrote: > > On Sun, 2016-03-13 at 11:00 -0400, Tejun Heo wrote: > > > Let's say there is an application which wants to manage resource > > > distributions across its multiple threadpools in a hierarchical way. > > > With cgroupfs interface as the only system-wide interface, it has to > > > coordinate who or whatever is managing that interface. Maybe it can > > > get a subtree delegated to it, maybe it has to ask the system thing to > > > create and place threads there, maybe it can just expose the pids and > > > let the system management do its thing (what if the threads in the > > > pools are dynamic tho?). There is no reliable universal way of doing > > > this. Each such application has to be ready to specifically > > > coordinate with the specific system management in use. > > > > The last thing I ever want to see on my boxen is random applications > > either doing their own thing with my cgroups management interface, or > > conspiring with "the system thing" behind my back to do things that I > > did not specifically ask them to do. > > That isn't too different from saying that you don't want applications > to be calling setpriority(2) on its threads, which is a weird thing to > say, especially given that there are situations where applying control > from outside simply can't work - thread pools can be dynamic and there > is no reliable way of telling which threads are for which purposes > from outside. > > This is not to say that admin override is unnecessary or unsupported. > In fact, rgroup and cgroup give the admin a lot more control. > Controller access can be revoked from applications in subtrees and the > entire controller can be detached from the hierarchy for full > override. My box lacks a hierarchy of say CPUs. I fail to see how a gaggle of apps, myself, and system thing are going to all agree on who gets what percentage of which CPUs. The traditional thing is for apps to communicate their needs and desires to the administrator. > > "The system thing" started doing its own thing behind my back, and > > oddly enough, its tentacles started falling off. By golly, its eyes > > seem to have fallen out as well. > > > > That's what happens when control freak meets control freak, one of them > > ends up in pieces. There can be only one, and that one is me, the > > administrator. Applications don't coordinate spit, if I put on my > > administrator hat and stuff 'em in a box, they better stay stuffed. > > Sure, you're a control freak. Be that. rgroup doesn't get in the way > of you doing that; however, you also have to realize that a single > person hand-configuring a specialized setup for oneself isn't the only > mode of usage. 
Those are, in fact, vocal but clearly minority use > cases. Minority says you. I'm not hearing the voices of this alleged multitude, so remain highly skeptical. The only voices I've heard is yours and the system thing authors. Perhaps time will prove my concerns to be unfounded, meanwhile my machete stays at the ready. > What's more common would be systematic management of resources > and applications configuring resource distribution across their > threads. If you wanna assume full control, do so. Nothing is > preventing that, and, at the same time, that shouldn't get in the way > of implementing mechanisms which are more widely useful. As long as the admin remains in the drivers seat, and overhead remains restricted to those who are playing with cgroups (not the case atm), I couldn't possibly care less. > > > > Given the core has to deal with them whether they're visible or not, > > > > and given they exist to fulfill a need, seems they should be first > > > > class citizens, not some Quasimodo like creature sneaking into the > > > > cathedral via a back door and slinking about in the shadows. > > > > > > In terms of programmability and accessibility for individual > > > applications, group resource management being available through > > > straight-forward and incremental extension of exsiting mechanisms is > > > *way* more first class citizen. It is two seamless extensions to > > > clone(2) and setpriority(2) making hierarchical resource management > > > generally available to applications. > > > > To me, that sounds like chaos. > > Care to elaborate rationales for the claim? I don't have a hierarchy of independent resources, I have one set of resources. If system thing or anybody else starts say dorking around with the CPU controller behind my back (system thing did), I pay for it in performance loss. Various remote individuals thinking cgroups are so cool that _of course_ everybody will want and benefit from group scheduling does not make it true. > > > There can be use cases where building cpu resource hierarchy which is > > > completely alien to how the rest of the system is organized is useful. > > > For those cases, the only thing which can be done is building a > > > separate hierarchy for the cpu controller and that capability isn't > > > going anywhere. > > > > As long as administrators can use the system interface to aggregate > > what they see fit, I'm happy. The scheduler schedules threads, ergo > > the cpu controller must aggregate threads. There is no process. > > Scheduler not knowing beyond threads is great but that doesn't make > the concept of process any less of a real thing when the kernel > interacts with userland. Processes and threads are clearly the > primary constructs that our userland uses for execution contexts, and > process boundaries are frequently used to delimit various isolation > domains both in kerne API and programming conventions. > > A capability can also become inaccessible when it's exposed in a way > which doesn't work in conjunction with the existing abstractions. The > kernel not providing isolation at the expected layers is a failure > that can prevent the feature from being useable in a lot of cases. > What rgroup tries to do is exposing cgroup's capabilities in a way > which integrates with the existing programming constructs that > userland already depends upon to make these capabilities accessible to > them. 
> > Again, if you wanna be a control freak, nothing stands in your way, > but control freak admins aren't the only consumers of kernel. This control freak admin has however in fact bumped heads with control freak system thing authors, and sees more of that in his future. I did not care until remote idiocy gave me a local reason to care. If the admin remains the final authority, and remote individuals can't do things that affect all, and thus annoy him/her behind his/her back, said admins including this one won't have any reason to care. -Mike ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160313150012.GB13405-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 2016-03-13 17:40 ` Mike Galbraith @ 2016-03-14 2:23 ` Mike Galbraith 1 sibling, 0 replies; 50+ messages in thread From: Mike Galbraith @ 2016-03-14 2:23 UTC (permalink / raw) To: Tejun Heo Cc: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Sun, 2016-03-13 at 11:00 -0400, Tejun Heo wrote: > There can be use cases where building cpu resource hierarchy which is > completely alien to how the rest of the system is organized is useful. Well, from my POV it's the "process" oriented thingy that's the alien, and one that wants to meet an alien juice resistant shovel ;-) Another example is cpusets. I use it, and I know beyond doubt that my employer's customers use it for RT/HPC purposes out in the wild, and that very much includes placing critical _threads_ in exclusive sets. -Mike ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <1457710888-31182-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ` (7 preceding siblings ...) 2016-03-12 6:26 ` [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Mike Galbraith @ 2016-03-14 11:30 ` Peter Zijlstra [not found] ` <20160314113013.GM6344-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 2016-03-15 17:21 ` Michal Hocko 9 siblings, 1 reply; 50+ messages in thread From: Peter Zijlstra @ 2016-03-14 11:30 UTC (permalink / raw) To: Tejun Heo Cc: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Fri, Mar 11, 2016 at 10:41:18AM -0500, Tejun Heo wrote: > * A rgroup is a cgroup which is invisible on and transparent to the > system-level cgroupfs interface. > > * A rgroup can be created by specifying CLONE_NEWRGRP flag, along with > CLONE_THREAD, during clone(2). A new rgroup is created under the > parent thread's cgroup and the new thread is created in it. This seems overly restrictive. As you well know there's people moving threads about after creation. Also, with this interface the whole thing cannot be used until your libc's pthread_create() has been patched to allow use of this new flag. > * A rgroup is automatically destroyed when empty. Except for Zombies it appears.. > * A top-level rgroup of a process is a rgroup whose parent cgroup is a > sgroup. A process may have multiple top-level rgroups and thus > multiple rgroup subtrees under the same parent sgroup. > > * Unlike sgroups, rgroups are allowed to compete against peer threads. > Each rgroup behaves equivalent to a sibling task. > > * rgroup subtrees are local to the process. When the process forks or > execs, its rgroup subtrees are collapsed. > > * When a process is migrated to a different cgroup, its rgroup > subtrees are preserved. This all makes it impossible to say put a single thread outside of the hierarchy forced upon it by the process. Like putting a RT thread in an isolated group on the side. Which is a rather common thing to do. > rgroup lays the foundation for other kernel mechanisms to make use of > resource controllers while providing proper isolation between system > management and in-process operations removing the awkward and > layer-violating requirement for coordination between individual > applications and system management. On top of the rgroup mechanism, > PRIO_RGRP is implemented for {set|get}priority(2). > > * PRIO_RGRP can only be used if the target task is already in a > rgroup. If setpriority(2) is used and cpu controller is available, > cpu controller is enabled until the target rgroup is covered and the > specified nice value is set as the weight of the rgroup. > > * The specified nice value has the same meaning as for tasks. For > example, a rgroup and a task competing under the same parent would > behave exactly the same as two tasks. > > * For top-level rgroups, PRIO_RGRP follows the same rlimit > restrictions as PRIO_PROCESS; however, as nested rgroups only > distribute CPU cycles which are allocated to the process, no > restriction is applied. 
While this appears neat, I doubt it will remain so in the face of this: > * A mechanism that applications can use to publish certain rgroups so > that external entities can determine which IDs to use to change > rgroup settings. I already have interface and implementation design > mostly pinned down. So you need some newfangled way to set/query all the other possible cgroup parameters supported, and then suddenly you have one that has two possible interfaces. That's way ugly. While I appreciate the sentiment that having two entities poking at the cgroup filesystem without coordination is a problem, I don't see this as the solution. I would much rather just kill the system-wide thing, that too solves the problem. IOW, I'm unconvinced this approach will cater to current practices or even allow similar functionality. ^ permalink raw reply [flat|nested] 50+ messages in thread
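To make the interface under discussion concrete, here is a minimal sketch of what creating a thread in a new rgroup via raw clone(2) might look like. CLONE_NEWRGRP exists only in this RFC patchset; the flag value below is a placeholder chosen for illustration, not something from any released kernel header.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef CLONE_NEWRGRP
#define CLONE_NEWRGRP	0x00001000	/* placeholder value, not a real flag */
#endif

#define STACK_SZ	(64 * 1024)

static int worker(void *arg)
{
	/* this thread starts life in a fresh rgroup nested under the
	 * parent thread's cgroup */
	sleep(1);
	return 0;
}

int main(void)
{
	char *stack = malloc(STACK_SZ);
	int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
		    CLONE_THREAD | CLONE_NEWRGRP;

	/* stacks grow down on the common architectures, so pass the top */
	if (clone(worker, stack + STACK_SZ, flags, NULL) == -1)
		perror("clone");
	sleep(2);	/* don't exit_group() before the worker runs */
	return 0;
}

Until pthread_create() grows support for the flag, something like this raw call would indeed be the only way in, which is the crux of the objection above.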
[parent not found: <20160314113013.GM6344-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160314113013.GM6344-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2016-04-06 15:58 ` Tejun Heo [not found] ` <20160406155830.GI24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 50+ messages in thread From: Tejun Heo @ 2016-04-06 15:58 UTC (permalink / raw) To: Peter Zijlstra Cc: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg Hello, Peter. Sorry about the delay. On Mon, Mar 14, 2016 at 12:30:13PM +0100, Peter Zijlstra wrote: > On Fri, Mar 11, 2016 at 10:41:18AM -0500, Tejun Heo wrote: > > * A rgroup is a cgroup which is invisible on and transparent to the > > system-level cgroupfs interface. > > > > * A rgroup can be created by specifying CLONE_NEWRGRP flag, along with > > CLONE_THREAD, during clone(2). A new rgroup is created under the > > parent thread's cgroup and the new thread is created in it. > > This seems overly restrictive. As you well know there's people moving > threads about after creation. Will get to this later. > Also, with this interface the whole thing cannot be used until your > libc's pthread_create() has been patched to allow use of this new flag. This isn't difficult to change but is this a problem in the long term? Once added, this is gonna be a permanent part of the API and I think we better get it right than quick. If this is a concern, we can go for a setsid(2) style syscall so that users can have easier access to it. > > * A rgroup is automatically destroyed when empty. > > Except for Zombies it appears.. Zombies do hold onto their rgroups but at that point the rgroup is draining refs and can't be populated again. The same state as a rmdir'd cgroup with zombies. > > * A top-level rgroup of a process is a rgroup whose parent cgroup is a > > sgroup. A process may have multiple top-level rgroups and thus > > multiple rgroup subtrees under the same parent sgroup. > > > > * Unlike sgroups, rgroups are allowed to compete against peer threads. > > Each rgroup behaves equivalent to a sibling task. > > > > * rgroup subtrees are local to the process. When the process forks or > > execs, its rgroup subtrees are collapsed. > > > > * When a process is migrated to a different cgroup, its rgroup > > subtrees are preserved. > > This all makes it impossible to say put a single thread outside of the > hierarchy forced upon it by the process. Like putting a RT thread in an > isolated group on the side. > > Which is a rather common thing to do. I don't think the mentioned RT case is problematic. Depending on the desired outcome, 1. If the admin doesn't want the program to be able to meddle with the cpu resource control at all, it can just disable the cpu controller in subtree_control (this is tied to the parent control now but will be moved to the associated cgroup itself). The application will create its rgroup hierarchy but won't be able to use CPU resource control and the admin would be able to treat all threads as if they don't have rgroups at all. 2.
If the admin still wants to allow the application to retain CPU resource control, unless said program is actively getting in the way, the admin can set the limits the way they want along the hierarchy down to the specific thread. Note that #1 can be done after the fact. The admin can revoke CPU controller access anytime. For example, assuming the following hierarchy (cX is a cgroup, rX is a rgroup, NNN are threads).

  cA - 234
     + r235 - 235
            + 236

If the process 234 configured CPU resource control in a specific way and the admin wants to override, the admin can simply do "echo -cpu > cA.subtree_control". Afterwards, as far as CPU resource control is concerned, all threads will behave as if there are no rgroups at all and the admin can tweak the settings of individual threads using the usual scheduler system calls. > > rgroup lays the foundation for other kernel mechanisms to make use of > > resource controllers while providing proper isolation between system > > management and in-process operations removing the awkward and > > layer-violating requirement for coordination between individual > > applications and system management. On top of the rgroup mechanism, > > PRIO_RGRP is implemented for {set|get}priority(2). > > > > * PRIO_RGRP can only be used if the target task is already in a > > rgroup. If setpriority(2) is used and cpu controller is available, > > cpu controller is enabled until the target rgroup is covered and the > > specified nice value is set as the weight of the rgroup. > > > > * The specified nice value has the same meaning as for tasks. For > > example, a rgroup and a task competing under the same parent would > > behave exactly the same as two tasks. > > > > * For top-level rgroups, PRIO_RGRP follows the same rlimit > > restrictions as PRIO_PROCESS; however, as nested rgroups only > > distribute CPU cycles which are allocated to the process, no > > restriction is applied. > > While this appears neat, I doubt it will remain so in the face of this: > > > * A mechanism that applications can use to publish certain rgroups so > > that external entities can determine which IDs to use to change > > rgroup settings. I already have interface and implementation design > > mostly pinned down. > > So you need some newfangled way to set/query all the other possible > cgroup parameters supported, and then suddenly you have one that has two > possible interfaces. That's way ugly. So, the above response is a bit confusing because publishing rgroups doesn't require setting or querying all other possible cgroup parameters. Regarding the need to add separate interfaces for each control knob for rgroups, 1. There aren't many knobs which make sense for in-process control to begin with. 2. As shown in this patchset's modification to setpriority(2), for stuff which makes sense, we're likely to already have constructs which deal with the issue (it is a needed capability with or without cgroup). The right way forward is seamlessly extending existing interfaces. 3. If there is no exactly matching interface, we want to add one for both groups and threads in a way which is consistent with other syscalls which deal with related issues, especially for the scheduler. Just in case, here's a more concrete explanation about publishing rgroups. The only addition needed for external access is a way to determine which ID maps to which rgroup - a proc file listing (rgroup name, ID) pairs. Let's say the program creates the following internal hierarchy.
cgroup - service0 - highpri_workers
       |          + lowpri_workers
       + service1 - highpri_workers
                  + lowpri_workers

The rgroups are published by a member thread performing, for example, prctl(PR_SET_RGROUP_NAME, "service0.highpri_workers"). All it does is pin the pid so that it stays associated with the rgroup and publish it in proc as follows.

 # cat /proc/234/rgroups
 service0.highpri_workers 240
 service0.lowpri_workers 241
 service1.highpri_workers 248
 service1.lowpri_workers 249

From the tooling side, renice(1) can be extended to understand rgroups so that something like the following works.

 # renice -n -10 -r 234:service1.highpri_workers

Thanks. -- tejun ^ permalink raw reply [flat|nested] 50+ messages in thread
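Putting the proposed pieces together from the application side, a sketch might look like the following. PR_SET_RGROUP_NAME and PRIO_RGRP exist only in this RFC; the numeric values below are placeholders rather than part of any released ABI, and the assumption that who == 0 targets the caller's rgroup mirrors how PRIO_PROCESS treats 0.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/time.h>
#include <sys/resource.h>

#ifndef PR_SET_RGROUP_NAME
#define PR_SET_RGROUP_NAME	48	/* placeholder value */
#endif
#ifndef PRIO_RGRP
#define PRIO_RGRP		3	/* placeholder; PRIO_USER is 2 */
#endif

int main(void)
{
	/* a member thread pins its pid to the rgroup and publishes it
	 * under a name external tools can look up in /proc/PID/rgroups */
	if (prctl(PR_SET_RGROUP_NAME, "service0.highpri_workers", 0, 0, 0))
		perror("prctl");

	/* weight the calling thread's rgroup as if it were a task
	 * running at nice -10 */
	if (setpriority(PRIO_RGRP, 0, -10))
		perror("setpriority");
	return 0;
}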
[parent not found: <20160406155830.GI24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160406155830.GI24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> @ 2016-04-07 6:45 ` Peter Zijlstra [not found] ` <20160407064549.GH3430-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 0 siblings, 1 reply; 50+ messages in thread From: Peter Zijlstra @ 2016-04-07 6:45 UTC (permalink / raw) To: Tejun Heo Cc: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg So I recently got made aware of the fact that cgroupv2 doesn't allow tasks to be associated with !leaf cgroups, this is yet another capability of cpu-cgroup you've destroyed. At this point I really don't see why I should spend another second considering anything v2. So full NAK and stop wasting my time. ^ permalink raw reply [flat|nested] 50+ messages in thread
[parent not found: <20160407064549.GH3430-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160407064549.GH3430-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2016-04-07 7:35 ` Johannes Weiner [not found] ` <20160407073547.GA12560-druUgvl0LCNAfugRpC6u6w@public.gmane.org> 0 siblings, 1 reply; 50+ messages in thread From: Johannes Weiner @ 2016-04-07 7:35 UTC (permalink / raw) To: Peter Zijlstra Cc: Tejun Heo, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Thu, Apr 07, 2016 at 08:45:49AM +0200, Peter Zijlstra wrote: > So I recently got made aware of the fact that cgroupv2 doesn't allow > tasks to be associated with !leaf cgroups, this is yet another > capability of cpu-cgroup you've destroyed. May I ask how you are using that? The behavior for tasks in !leaf groups was fairly inconsistent across controllers because they all did different things, or didn't handle it at all. For example, the block controller in v1 implements separate weight knobs for the group as a subtree root as well as for the tasks only inside the group itself. But it didn't do so for bandwidth limits. The memory controller on the other hand only had a singular set of controls that applied to both the local tasks and all subgroups. And I know Google had a lot of trouble with that because they ended up with basically uncontrollable leftover cache in the top-level group of some subtree that would put pressure on the real workload leaf groups below. There was a lot of back and forth about whether we should add a second set of knobs just to control the local tasks separately from the subtree, but ended up concluding that the situation can be expressed more clearly by creating dedicated leaf subgroups for stuff like management software and launchers instead, so that their memory pools/LRUs are clearly delineated from other groups and separately controllable. And we couldn't think of any meaningful configuration that could not be expressed in that scheme. I mean, it's the same thing, right? Only that with tasks in !leaf groups the controller would have to emulate a hidden leaf subgroup and provide additional interfacing, and without it the leaf groups are explicit and a single set of knobs suffices. I.e. it seems more of a convenience thing than actual functionality, but one that forces ugly redundancy in the interface. So it was a nice cleanup for the memory controller and I believe the IO controller as well. I'd be curious how it'd be a problem for CPU? ^ permalink raw reply [flat|nested] 50+ messages in thread
[parent not found: <20160407073547.GA12560-druUgvl0LCNAfugRpC6u6w@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160407073547.GA12560-druUgvl0LCNAfugRpC6u6w@public.gmane.org> @ 2016-04-07 8:05 ` Mike Galbraith 2016-04-07 8:08 ` Peter Zijlstra 2016-04-07 8:28 ` Peter Zijlstra 2 siblings, 0 replies; 50+ messages in thread From: Mike Galbraith @ 2016-04-07 8:05 UTC (permalink / raw) To: Johannes Weiner, Peter Zijlstra Cc: Tejun Heo, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Thu, 2016-04-07 at 03:35 -0400, Johannes Weiner wrote: > On Thu, Apr 07, 2016 at 08:45:49AM +0200, Peter Zijlstra wrote: > > So I recently got made aware of the fact that cgroupv2 doesn't > > allow > > tasks to be associated with !leaf cgroups, this is yet another > > capability of cpu-cgroup you've destroyed. > > May I ask how you are using that? One real world usage is the thread pool servicing customers on a cgroup = account basis that I outlined. -Mike ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160407073547.GA12560-druUgvl0LCNAfugRpC6u6w@public.gmane.org> 2016-04-07 8:05 ` Mike Galbraith @ 2016-04-07 8:08 ` Peter Zijlstra [not found] ` <20160407080833.GK3430-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 2016-04-07 8:28 ` Peter Zijlstra 2 siblings, 1 reply; 50+ messages in thread From: Peter Zijlstra @ 2016-04-07 8:08 UTC (permalink / raw) To: Johannes Weiner Cc: Tejun Heo, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Thu, Apr 07, 2016 at 03:35:47AM -0400, Johannes Weiner wrote: > On Thu, Apr 07, 2016 at 08:45:49AM +0200, Peter Zijlstra wrote: > > So I recently got made aware of the fact that cgroupv2 doesn't allow > > tasks to be associated with !leaf cgroups, this is yet another > > capability of cpu-cgroup you've destroyed. > > May I ask how you are using that? _I_ use a kernel with CONFIG_CGROUPS=n (yes really). But seriously? You have to ask? The root cgroup is per definition not a leaf, and all tasks start life there, and some cannot be ever moved out. Therefore _everybody_ uses this. > The behavior for tasks in !leaf groups was fairly inconsistent across > controllers because they all did different things, or didn't handle it > at all. Then they're all bloody broken, because fully hierarchical was an early requirement for cgroups; I know, because I had to throw away many days of work and start over with cgroup support when they did that. > So it was a nice cleanup for the memory controller and I believe the > IO controller as well. I'd be curious how it'd be a problem for CPU? The full hierarchy took years to make work and is fully ingrained with how the thing works, changing it isn't going to be nice or easy. So sure, go with a lowest common denominator, instead of fixing shit, yay for progress :/ ^ permalink raw reply [flat|nested] 50+ messages in thread
[parent not found: <20160407080833.GK3430-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160407080833.GK3430-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2016-04-07 9:28 ` Johannes Weiner [not found] ` <20160407092824.GA13839-druUgvl0LCNAfugRpC6u6w@public.gmane.org> 2016-04-07 19:45 ` Tejun Heo 1 sibling, 1 reply; 50+ messages in thread From: Johannes Weiner @ 2016-04-07 9:28 UTC (permalink / raw) To: Peter Zijlstra Cc: Tejun Heo, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Thu, Apr 07, 2016 at 10:08:33AM +0200, Peter Zijlstra wrote: > On Thu, Apr 07, 2016 at 03:35:47AM -0400, Johannes Weiner wrote: > > On Thu, Apr 07, 2016 at 08:45:49AM +0200, Peter Zijlstra wrote: > > > So I recently got made aware of the fact that cgroupv2 doesn't allow > > > tasks to be associated with !leaf cgroups, this is yet another > > > capability of cpu-cgroup you've destroyed. > > > > May I ask how you are using that? > > _I_ use a kernel with CONFIG_CGROUPS=n (yes really). > > But seriously? You have to ask? > > The root cgroup is per definition not a leaf, and all tasks start life > there, and some cannot be ever moved out. > > Therefore _everybody_ uses this. Hm? The root group can always contain tasks. It's not the only thing the root is exempt from, it can't control any resources either:

sched_group_set_shares():

	/*
	 * We can't change the weight of the root cgroup.
	 */
	if (!tg->se[0])
		return -EINVAL;

tg_set_cfs_bandwidth():

	if (tg == &root_task_group)
		return -EINVAL;

etc. and all the problems that led to this rule stem from resource control. > > The behavior for tasks in !leaf groups was fairly inconsistent across > > controllers because they all did different things, or didn't handle it > > at all. > > Then they're all bloody broken, because fully hierarchical was an early > requirement for cgroups; I know, because I had to throw away many days > of work and start over with cgroup support when they did that. I think we're talking past each other. They're all fully hierarchical in the sense of accounting and divvying up resources along a tree structure, and configurable groups competing with other configurable groups or subtrees. That all works perfectly fine. It's the concept of loose unconfigurable tasks competing with configured groups or subtrees that invites problems. It's not a question of implementation, it's that the configurations that people created with e.g. the memory controller repeatedly ended up creating the same problems and the same stupid patches to add the local-only knobs (which the cpu cgroup doesn't have either AFAICS). This is not some gratuitous cutting away of convenience, it's hours and hours of discussions both on the mailinglists and at conferences about such lovely stuff as to whether the memory lowlimit (softlimit) should apply to only the local memory pool or hierarchically because that user happened to have memory pools in !leaf nodes which they had to control somehow. Swear to god. [ And yes, the root group IS "loose unconfigurable tasks" that compete with configured subtrees. But that is very explicit in the interface and you move stuff that consumes significant resources and needs to be controlled out of the root group; it doesn't have the same issue.
] If that happens once or twice I'm willing to write it off as PEBCAK, but if otherwise competent users like Google repeatedly create configurations that lead to these problems, and then end up pushing and lobbying in this case for non-hierarchical knobs to work around problems in the structural organization of the workloads, it's more likely that the interface is shit. So we added a rule that doesn't take away any functionality, but it forces you to organize your workloads more explicitly to take away that ambiguity. ^ permalink raw reply [flat|nested] 50+ messages in thread
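For illustration, the kind of explicit organization described here might look like the sketch below; the cgroup2 mount point and the group names are assumptions for the example, not anything mandated by the interface.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* append our pid to the cgroup.procs file of the given cgroup */
static int put_self(const char *cgrp)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/cgroup.procs", cgrp);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", getpid());
	return fclose(f);
}

int main(void)
{
	/* workload/ stays an internal node with no loose tasks; the
	 * launcher and its services each get an explicit leaf, so their
	 * memory pools/LRUs are delineated and separately controllable */
	mkdir("/sys/fs/cgroup/workload", 0755);
	mkdir("/sys/fs/cgroup/workload/launcher", 0755);
	mkdir("/sys/fs/cgroup/workload/service0", 0755);

	if (put_self("/sys/fs/cgroup/workload/launcher"))
		perror("cgroup.procs");
	return 0;
}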
[parent not found: <20160407092824.GA13839-druUgvl0LCNAfugRpC6u6w@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160407092824.GA13839-druUgvl0LCNAfugRpC6u6w@public.gmane.org> @ 2016-04-07 10:42 ` Peter Zijlstra 0 siblings, 0 replies; 50+ messages in thread From: Peter Zijlstra @ 2016-04-07 10:42 UTC (permalink / raw) To: Johannes Weiner Cc: Tejun Heo, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Thu, Apr 07, 2016 at 05:28:24AM -0400, Johannes Weiner wrote: > Hm? The root group can always contain tasks. It's not the only thing > the root is exempt from, it can't control any resources either: it does in fact control resources; the hierarchy directly affects the proportional distribution of time.

> sched_group_set_shares():
>
>	/*
>	 * We can't change the weight of the root cgroup.
>	 */
>	if (!tg->se[0])
>		return -EINVAL;

The root has, per definition, no siblings, so setting a weight is entirely pointless.

> tg_set_cfs_bandwidth():
>
>	if (tg == &root_task_group)
>		return -EINVAL;

We have had patches to implement this, but have so far held off because they add a bunch of cycles to some really hot paths and we'd rather not do that. It's not impossible, or unthinkable, to do this otherwise. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160407080833.GK3430-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 2016-04-07 9:28 ` Johannes Weiner @ 2016-04-07 19:45 ` Tejun Heo [not found] ` <20160407194555.GI7822-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 1 sibling, 1 reply; 50+ messages in thread From: Tejun Heo @ 2016-04-07 19:45 UTC (permalink / raw) To: Peter Zijlstra Cc: Johannes Weiner, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg Hello, Peter. On Thu, Apr 07, 2016 at 10:08:33AM +0200, Peter Zijlstra wrote: > On Thu, Apr 07, 2016 at 03:35:47AM -0400, Johannes Weiner wrote: > > So it was a nice cleanup for the memory controller and I believe the > > IO controller as well. I'd be curious how it'd be a problem for CPU? > > The full hierarchy took years to make work and is fully ingrained with > how the thing works, changing it isn't going to be nice or easy. > > So sure, go with a lowest common denominator, instead of fixing shit, > yay for progress :/ It's easy to get fixated on what each subsystem can do and develop towards different directions siloed in each subsystem. That's what we've had for quite a while in cgroup. Expectedly, this sends off controllers towards different directions. Direct competition between tasks and child cgroups was one of the main sources of balkanization. The balkanization was no coincidence either. Tasks and cgroups are different types of entities and don't have the same control knobs or follow the same lifetime rules. For absolute limits, it isn't clear how much of the parent's resources should be distributed to internal children as opposed to child cgroups. People end up depending on specific implementation details and proposing one-off hacks and interface additions. Proportional weights aren't much better either. CPU has an internal mapping between nice values and shares and treats them equally, which can get confusing as the configured weights behave differently depending on how many threads are in the parent cgroup which often is opaque and can't be controlled from outside. Widely diverging from CPU's behavior, IO grouped all internal tasks into an internal leaf node and used to assign a fixed weight to it. Now, you might think that none of it matters and each subsystem treating cgroup hierarchy as arbitrary and orthogonal collections of bean counters is fine; however, that makes it impossible to account for and control operations which span different types of resources. This prevented us from implementing resource control over frigging buffered writes, making the whole IO control thing a joke. While CPU currently doesn't directly tie into it, that is only because CPU cycles spent during writeback aren't yet properly accounted. The structural constraints and resulting consistency don't just subtract from the abilities of each controller. They establish a common base, the shared resource domains and consistent behaviors on top of them, that further capabilities can be built upon, capabilities as fundamental as comprehensive resource control over buffered writeback. It can be convenient to have subsystem-specific raw bean counters.
If that's what the use case calls for, individual controllers can easily be moved to a separate hierarchy although it would naturally lose the capabilities coming from cooperating over shared resource domains. However, please understand that there are a lot of use cases where comprehensive and consistent resource accounting and control over all major resources is useful and necessary. Thanks. -- tejun ^ permalink raw reply [flat|nested] 50+ messages in thread
[parent not found: <20160407194555.GI7822-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160407194555.GI7822-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> @ 2016-04-07 20:25 ` Peter Zijlstra [not found] ` <20160407202542.GD3448-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 0 siblings, 1 reply; 50+ messages in thread From: Peter Zijlstra @ 2016-04-07 20:25 UTC (permalink / raw) To: Tejun Heo Cc: Johannes Weiner, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Thu, Apr 07, 2016 at 03:45:55PM -0400, Tejun Heo wrote: > Hello, Peter. > > On Thu, Apr 07, 2016 at 10:08:33AM +0200, Peter Zijlstra wrote: > > On Thu, Apr 07, 2016 at 03:35:47AM -0400, Johannes Weiner wrote: > > > So it was a nice cleanup for the memory controller and I believe the > > > IO controller as well. I'd be curious how it'd be a problem for CPU? > > > > The full hierarchy took years to make work and is fully ingrained with > > how the thing works, changing it isn't going to be nice or easy. > > > > So sure, go with a lowest common denominator, instead of fixing shit, > > yay for progress :/ > > It's easy to get fixated on what each subsystem can do and develop > towards different directions siloed in each subsystem. That's what > we've had for quite a while in cgroup. Expectedly, this sends off > controllers towards different directions. Direct competition between > tasks and child cgroups was one of the main sources of balkanization. > > The balkanization was no coincidence either. Tasks and cgroups are > different types of entities and don't have the same control knobs or > follow the same lifetime rules. For absolute limits, it isn't clear > how much of the parent's resources should be distributed to internal > children as opposed to child cgroups. People end up depending on > specific implementation details and proposing one-off hacks and > interface additions. Yes, I'm familiar with the problem; but simply mandating leaf only nodes is not a solution, for the very simple fact that there are tasks in the root cgroup that cannot ever be moved out, so we _must_ be able to deal with !leaf nodes containing tasks. A consistent interface for absolute controllers to divvy up the resources between local tasks and child cgroups isn't _that_ hard. And this leaf only business totally screwed over anything proportional. This simply cannot work. > Proportional weights aren't much better either. CPU has an internal > mapping between nice values and shares and treats them equally, which > can get confusing as the configured weights behave differently > depending on how many threads are in the parent cgroup which often is > opaque and can't be controlled from outside. Huh what? There's nothing confusing there, the nice to weight mapping is static and can easily be consulted. Alternatively we can make an interface where you can set weight through nice values, for those people that are afraid of numbers. But the configured weights do _not_ behave differently depending on the number of tasks, they behave exactly as specified in the proportional weight based rate distribution. We've done the math.. > Widely diverging from > CPU's behavior, IO grouped all internal tasks into an internal leaf > node and used to assign a fixed weight to it. That's just plain broken...
That is not how a proportional weight based hierarchical controller works. > Now, you might think that none of it matters and each subsystem > treating cgroup hierarchy as arbitrary and orthogonal collections of > bean counters is fine; however, that makes it impossible to account > for and control operations which span different types of resources. > This prevented us from implementing resource control over frigging > buffered writes, making the whole IO control thing a joke. While CPU > currently doesn't directly tie into it, that is only because CPU > cycles spent during writeback aren't yet properly accounted. CPU cycles spent in waitqueues aren't properly accounted to whoever queued the job either, and there's a metric ton of async stuff that's not properly accounted, so what? > However, please understand that there are a lot of use cases where > comprehensive and consistent resource accounting and control over all > major resources is useful and necessary. Maybe, but so far I've only heard people complain this v2 thing didn't work for them, and as far as I can see the whole v2 model is internally inconsistent and impossible to implement. The suggestion by Johannes to adjust the leaf node weight depending on the number of tasks in it is so ludicrous I don't even know where to start enumerating the fail. ^ permalink raw reply [flat|nested] 50+ messages in thread
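For reference, the nice-to-weight mapping called static above can indeed be consulted directly; the table below reproduces the scheduler's prio_to_weight[] as it appeared in kernel/sched/sched.h around this time. The share() helper and the example main() are illustrative assumptions, not kernel code, showing how proportional shares fall out of the weights.

#include <stdio.h>

/* nice -20 .. +19; nice 0 maps to 1024, each nice step is ~1.25x */
static const int prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

/* an entity's share of its parent is its weight over the sum of all
 * sibling weights, regardless of how many entities there are */
static double share(int weight, int total_sibling_weight)
{
	return (double)weight / total_sibling_weight;
}

int main(void)
{
	/* two nice-0 tasks plus one weight-1024 group under one parent:
	 * every edge weighs 1024, so each receives exactly 1/3 */
	int w = prio_to_weight[20];	/* nice 0 -> 1024 */

	printf("each share: %.3f\n", share(w, 3 * w));
	return 0;
}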
[parent not found: <20160407202542.GD3448-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160407202542.GD3448-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2016-04-08 20:11 ` Tejun Heo [not found] ` <20160408201135.GO24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 50+ messages in thread From: Tejun Heo @ 2016-04-08 20:11 UTC (permalink / raw) To: Peter Zijlstra Cc: Johannes Weiner, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg Hello, Peter. On Thu, Apr 07, 2016 at 10:25:42PM +0200, Peter Zijlstra wrote: > > The balkanization was no coincidence either. Tasks and cgroups are > > different types of entities and don't have the same control knobs or > > follow the same lifetime rules. For absolute limits, it isn't clear > > how much of the parent's resources should be distributed to internal > > children as opposed to child cgroups. People end up depending on > > specific implementation details and proposing one-off hacks and > > interface additions. > > Yes, I'm familiar with the problem; but simply mandating leaf only nodes > is not a solution, for the very simple fact that there are tasks in the > root cgroup that cannot ever be moved out, so we _must_ be able to deal > with !leaf nodes containing tasks. As Johannes already pointed out, the root cgroup has always been special. While pure practicality, performance implications and implementation convenience do play important roles in the special treatment, another contributing aspect is avoiding exposing statistics and control knobs which are duplicates of and/or conflicting with what's already available at the system level. It's never fun to have multiple sources of truth. > A consistent interface for absolute controllers to divvy up the > resources between local tasks and child cgroups isn't _that_ hard. I've spent months thinking about it and didn't get too far. If you have a good solution, I'd be happy to be enlightened. Also, please note that the current solution is based on restricting certain configurations. If we can find a better solution, we can relax the relevant constraints and move onto it without breaking compatibility. > And this leaf only business totally screwed over anything proportional. > > This simply cannot work. Will get to this below. > > Proportional weights aren't much better either. CPU has an internal > > mapping between nice values and shares and treats them equally, which > > can get confusing as the configured weights behave differently > > depending on how many threads are in the parent cgroup which often is > > opaque and can't be controlled from outside. > > Huh what? There's nothing confusing there, the nice to weight mapping is > static and can easily be consulted. Alternatively we can make an > interface where you can set weight through nice values, for those people > that are afraid of numbers. > > But the configured weights do _not_ behave differently depending on the > number of tasks, they behave exactly as specified in the proportional > weight based rate distribution. We've done the math.. Yes, once one understands what's going on, it isn't confusing. It's just not something users can intuitively understand from the presented interface.
The confusion, of course, is worsened severely by different controller behaviors. > > Widely diverging from > > CPU's behavior, IO grouped all internal tasks into an internal leaf > > node and used to assign a fixed weight to it. > > That's just plain broken... That is not how a proportional weight based > hierarchical controller works. That's a strong statement. When the hierarchy is composed of equivalent objects as in CPU, not distinguishing internal and leaf nodes would be a more natural way to organize; however, it isn't necessarily true in all cases. For example, while a writeback IO would be issued by some task, the task itself might not have done anything to cause that IO and the IO would essentially be anonymous in the resource domain. Also, different controllers use different units of organization - CPU sees threads, IO sees IO contexts which are usually shared in a process. The difference would lead to differing scaling behaviors in proportional distribution. While the separate buckets and entities model may not be as elegant as a tree of uniform objects, it is far from uncommon and more robust when dealing with different types of objects. > > Now, you might think that none of it matters and each subsystem > > treating cgroup hierarchy as arbitrary and orthogonal collections of > > bean counters is fine; however, that makes it impossible to account > > for and control operations which span different types of resources. > > This prevented us from implementing resource control over frigging > > buffered writes, making the whole IO control thing a joke. While CPU > > currently doesn't directly tie into it, that is only because CPU > > cycles spent during writeback aren't yet properly accounted. > > CPU cycles spent in waitqueues aren't properly accounted to whoever > queued the job either, and there's a metric ton of async stuff that's > not properly accounted, so what? The ultimate goal of cgroup resource control is accounting and controlling all significant resource consumptions as configured. Some system operations are inherently global and others are simply too cheap to justify the overhead; however, there still are significant aggregate operations which are being missed out including almost everything taking place in the writeback path. So, yes, we eventually want to be able to account for them, of course in a way which doesn't get in the way of actual operation. > > However, please understand that there are a lot of use cases where > > comprehensive and consistent resource accounting and control over all > > major resources is useful and necessary. > > Maybe, but so far I've only heard people complain this v2 thing didn't > work for them, and as far as I can see the whole v2 model is internally > inconsistent and impossible to implement. I suppose we live in different bubbles. Can you please elaborate on which parts of the cgroup v2 model are internally inconsistent and impossible to implement? I'd be happy to rectify the situation. > The suggestion by Johannes to adjust the leaf node weight depending on > the number of tasks in it is so ludicrous I don't even know where to start > enumerating the fail. That sounds like a pretty uncharitable way to read his message. I think he was trying to find out the underlying requirements so that a way forward can be discussed. I do have the same question. It's difficult to have discussions about trade-offs without knowing where the requirements are coming from.
Do you have something in mind for cases where internal tasks have to compete with sibling cgroups? Thanks. -- tejun ^ permalink raw reply [flat|nested] 50+ messages in thread
[parent not found: <20160408201135.GO24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160408201135.GO24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> @ 2016-04-09 6:16 ` Mike Galbraith 2016-04-09 13:39 ` Peter Zijlstra 2016-04-09 16:02 ` Peter Zijlstra 2 siblings, 0 replies; 50+ messages in thread From: Mike Galbraith @ 2016-04-09 6:16 UTC (permalink / raw) To: Tejun Heo, Peter Zijlstra Cc: Johannes Weiner, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Fri, 2016-04-08 at 16:11 -0400, Tejun Heo wrote: > > That's just plain broken... That is not how a proportional weight based > > hierarchical controller works. > > That's a strong statement. When the hierarchy is composed of > equivalent objects as in CPU, not distinguishing internal and leaf > nodes would be a more natural way to organize; however... You almost said it yourself, you want to make the natural organization of cpu, cpuacct and cpuset controllers a prohibited organization. There is no "however..." that can possibly justify that. It's akin to mandating: henceforth "horse" shall be spelled "cow", riders thereof shall teach their "cow" the proper enunciation of "moo". It's silly. Like it or not, these controllers have thread encoded in their DNA, it's an integral part of what they are, and how real users in the real world use them. No rationalization will change that cold hard fact. Make an "Aunt Tilly" button for those incapable of comprehending the complexities if you will, but please don't make cgroups so rigid and idiot proof that only idiots (hi system thing) can use it. -Mike ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160408201135.GO24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 2016-04-09 6:16 ` Mike Galbraith @ 2016-04-09 13:39 ` Peter Zijlstra [not found] ` <20160409133917.GV3448-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 2016-04-09 16:02 ` Peter Zijlstra 2 siblings, 1 reply; 50+ messages in thread From: Peter Zijlstra @ 2016-04-09 13:39 UTC (permalink / raw) To: Tejun Heo Cc: Johannes Weiner, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Fri, Apr 08, 2016 at 04:11:35PM -0400, Tejun Heo wrote: > > > Widely diverging from > > > CPU's behavior, IO grouped all internal tasks into an internal leaf > > > node and used to assign a fixed weight to it. > > > > That's just plain broken... That is not how a proportional weight based > > hierarchical controller works. > > That's a strong statement. No, it's plain fact. If you modify a graph, it is not the same graph. Even if you argue by merit of the function on this graph, and state that only the result of this function is important, and any modification to the graph that leaves this result intact is good; i.e. a modification invariant to the function, this fails. Because for proportional controllers all that matters is the number and weight of edges leaving a node. The modification described above does clearly change the outcome and is not invariant under the proportional weight distribution function. > When the hierarchy is composed of > equivalent objects as in CPU, not distinguishing internal and leaf > nodes would be a more natural way to organize; however, it isn't > necessarily true in all cases. For example, while a writeback IO > would be issued by some task, the task itself might not have done > anything to cause that IO and the IO would essentially be anonymous in > the resource domain. Also, different controllers use different units > of organization - CPU sees threads, IO sees IO contexts which are > usually shared in a process. The difference would lead to differing > scaling behaviors in proportional distribution. > > While the separate buckets and entities model may not be as elegant as > a tree of uniform objects, it is far from uncommon and more robust when > dealing with different types of objects. The graph does not care about the type of objects the nodes represent, and proportional weight distribution only cares about the edges. With cpu-cgroup the nodes are not of uniform type either, they can be a group or a task. You get runtime type identification and make it work. There just isn't an excuse for crazy crap like this. It's wrong, no two ways about it. ^ permalink raw reply [flat|nested] 50+ messages in thread
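To put numbers on the invariance argument, here is a small worked example under assumed weights: two weight-1024 tasks A and B plus one weight-1024 child group G sharing one parent.

#include <stdio.h>

int main(void)
{
	/* CPU model: A, B and G compete as three peer edges, so G
	 * receives 1024 / 3072 = 1/3 of the parent's time */
	double flat_g = 1024.0 / (1024 + 1024 + 1024);

	/* old IO model: A and B are folded into one hidden leaf node
	 * of fixed weight 1024, leaving two edges at the parent, so G
	 * jumps to 1/2 while A and B split the hidden leaf's half */
	double hidden_g = 1024.0 / (1024 + 1024);
	double hidden_a = (1024.0 / 2048) * (1024.0 / 2048);	/* 1/4 */

	printf("G: %.3f vs %.3f, A: %.3f\n", flat_g, hidden_g, hidden_a);
	return 0;
}

The same three entities with the same weights end up with different shares depending on whether the hidden node is drawn in, which is what makes the transformation non-invariant.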
[parent not found: <20160409133917.GV3448-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160409133917.GV3448-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2016-04-12 22:29 ` Tejun Heo [not found] ` <20160412222915.GT24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 50+ messages in thread From: Tejun Heo @ 2016-04-12 22:29 UTC (permalink / raw) To: Peter Zijlstra Cc: Johannes Weiner, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg Hello, Peter. On Sat, Apr 09, 2016 at 03:39:17PM +0200, Peter Zijlstra wrote: > > While the separate buckets and entities model may not be as elegant as > > a tree of uniform objects, it is far from uncommon and more robust when > > dealing with different types of objects. > > The graph does not care about the type of objects the nodes represent, > and proportional weight distribution only cares about the edges. > > With cpu-cgroup the nodes are not of uniform type either, they can be a > group or a task. You get runtime type identification and make it work. > > There just isn't an excuse for crazy crap like this. It's wrong, no two > ways about it. Abstracting tasks and groups as equivalent objects works well for the scheduler and that's great. This is also because the domain lends itself very well to such a simple and elegant approach. The only entities of interest are tasks, as you and Mike pointed out earlier in the thread, and group priority can be easily mapped to task priority. However, this isn't necessarily the case for other controllers. There's also the issue of mapping the model to absolute controllers. For the uniform model to work, there must be a way to treat internal and leaf entities in the same way. For memory, the leaf entities are processes and applying the same model would mean that the memory controller would have to implement equivalent per-process control knobs. We don't have that. In fact, we can't have that - a significant part of memory consumption can't be attached to a single process. There is a fundamental distinction between internal and leaf nodes in the memory resource graph. We aren't designing a spherical cow in a vacuum, and, I believe, should aspire to make pragmatic trade-offs of all involved factors. If multiple controllers co-operating on the same resource domains is beneficial and required, we should figure out a way to make different controllers agree and that way will most likely require some trade-offs from various controllers. Given the currently known requirements and constraints, restricting internal competition is a simple and straight-forward way to isolate leaf node handling details of different controllers. The cost is part aesthetic and part practical. While less elegant than a tree of uniform objects, it seems a stretch to call the internal/leaf node distinction broken especially given that the model is natural to some controllers. The practical cost is loss of the ability to let leaf entities compete against groups. However, we can't evaluate how important such capability is without actual use-cases.
I understand that the CPU controller getting constrained due to other controllers can feel frustrating; however, the constraint is there to solve practical problems which hopefully are being explained in this conversation. If there is a better trade-off, we can easily get rid of it and move on, but such a decision can only be made considering all the relevant factors. If you can think of a better solution, let's please discuss it. Thanks. -- tejun ^ permalink raw reply [flat|nested] 50+ messages in thread
[parent not found: <20160412222915.GT24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160412222915.GT24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> @ 2016-04-13 7:43 ` Mike Galbraith 2016-04-13 15:59 ` Tejun Heo 0 siblings, 1 reply; 50+ messages in thread From: Mike Galbraith @ 2016-04-13 7:43 UTC (permalink / raw) To: Tejun Heo, Peter Zijlstra Cc: Johannes Weiner, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Tue, 2016-04-12 at 18:29 -0400, Tejun Heo wrote: > Hello, Peter. > > On Sat, Apr 09, 2016 at 03:39:17PM +0200, Peter Zijlstra wrote: > > > While the separate buckets and entities model may not be as elegant as > > > a tree of uniform objects, it is far from uncommon and more robust when > > > dealing with different types of objects. > > > > The graph does not care about the type of objects the nodes represent, > > and proportional weight distribution only cares about the edges. > > > > With cpu-cgroup the nodes are not of uniform type either, they can be a > > group or a task. You get runtime type identification and make it work. > > > > There just isn't an excuse for crazy crap like this. It's wrong, no two > > ways about it. > > Abstracting tasks and groups as equivalent objects works well for the > scheduler and that's great. This is also because the domain lends > itself very well to such a simple and elegant approach. The only > entities of interest are tasks, as you and Mike pointed out earlier in > the thread, and group priority can be easily mapped to task priority. > However, this isn't necessarily the case for other controllers. > > There's also the issue of mapping the model to absolute controllers. > For the uniform model to work, there must be a way to treat internal > and leaf entities in the same way. For memory, the leaf entities are > processes and applying the same model would mean that the memory > controller would have to implement equivalent per-process control > knobs. We don't have that. In fact, we can't have that - a > significant part of memory consumption can't be attached to a single > process. There is a fundamental distinction between internal and leaf > nodes in the memory resource graph. > > We aren't designing a spherical cow in a vacuum, and, I believe, > should aspire to make pragmatic trade-offs of all involved factors. > If multiple controllers co-operating on the same resource domains is > beneficial and required, we should figure out a way to make different > controllers agree and that way will most likely require some > trade-offs from various controllers. > > Given the currently known requirements and constraints, restricting > internal competition is a simple and straight-forward way to isolate > leaf node handling details of different controllers. > > The cost is part aesthetic and part practical. While less elegant > than a tree of uniform objects, it seems a stretch to call the internal/ > leaf node distinction broken especially given that the model is > natural to some controllers. That justifies prohibiting proper usages of three controllers, cpu, cpuacct and cpuset? > The practical cost is loss of the ability to let leaf entities compete > against groups. However, we can't evaluate how important such > capability is without actual use-cases.
If there are important ones, > please bring them up, so that we can examine the actual requirements > and try to find a good trade-off to support them. Hm, I thought Google did that, and I know I mentioned another gigabuck-sized outfit. Whatever, ob trade-off.. Another cpuset example is something I was asked to look into recently. There are folks out in the real world who want to run RT guests. Now VIRTUAL REALtime tickles my funny-bone, but I piddled around with it nonetheless to see what such can deliver (not much). System thing and/or libvirt created a cpuset home for qemu, but with VCPUs sharing CPU with other qemu threads and the rest of the world, RT performance in little virtual box was as pathetic as one would expect. What did I do about it? Among others, the obvious: I created an exclusive cpuset, and distributed qemu contexts having different requirements among context containment vessels having the required properties. I won't be doing any more of that particular scenario, but certainly will want to distribute various contexts among various context containment vessels in future. I soon enough won't care about cgroups, but others will surely expect cpu, cpuacct and cpuset controllers to continue to function properly. > I understand that the CPU controller getting constrained due to other > controllers can feel frustrating; however, the constraint is there to > solve practical problems which hopefully are being explained in this > conversation. If there is a better trade-off, we can easily get rid > of it and move on, but such a decision can only be made considering all > the relevant factors. If you can think of a better solution, let's > please discuss it. None here. Any artificial restriction placed on controllers will render same broken in one way or another that will matter to someone somewhere. Making something less than it was will do that. -Mike ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP 2016-04-13 7:43 ` Mike Galbraith @ 2016-04-13 15:59 ` Tejun Heo [not found] ` <20160413155952.GU24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 2016-04-14 6:07 ` Mike Galbraith 0 siblings, 2 replies; 50+ messages in thread From: Tejun Heo @ 2016-04-13 15:59 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, Johannes Weiner, torvalds, akpm, mingo, lizefan, pjt, linux-kernel, cgroups, linux-api, kernel-team Hello, Mike. On Wed, Apr 13, 2016 at 09:43:01AM +0200, Mike Galbraith wrote: > > The cost is part aesthetic and part practical. While less elegant > > than a tree of uniform objects, it seems a stretch to call the internal/ > > leaf node distinction broken especially given that the model is > > natural to some controllers. > > That justifies prohibiting proper usages of three controllers, cpu, > cpuacct and cpuset? Neither cpuacct nor cpuset loses any capability from the constraint as there is no difference between tasks being in an internal cgroup or a leaf cgroup nested under it. The only practical impact is that we lose the ability to let internal tasks compete against sibling cgroups for proportional control. > > The practical cost is loss of the ability to let leaf entities compete > > against groups. However, we can't evaluate how important such > > capability is without actual use-cases. If there are important ones, > > please bring them up, so that we can examine the actual requirements > > and try to find a good trade-off to support them. > > Hm, I thought Google did that, and I know I mentioned another gigabuck- > sized outfit. Whatever, ob trade-off.. Are you saying that you're aware that google or another big outfit is making active use of internal tasks competing against sibling cgroups for proportional CPU distribution? If so, can you please be more specific? > Another cpuset example is something I was asked to look into recently. First of all, as mentioned above, cpuset isn't affected at all in practical terms. Besides, for a very specialized cpuset setup, the cpuset configuration might not have anything to do with the resource domains other controllers use and it might make sense to keep cpuset on a separate hierarchy. > > I understand that the CPU controller getting constrained due to other > > controllers can feel frustrating; however, the constraint is there to > > solve practical problems which hopefully are being explained in this > > conversation. If there is a better trade-off, we can easily get rid > > of it and move on, but such a decision can only be made considering all > > the relevant factors. If you can think of a better solution, let's > > please discuss it. > > None here. Any artificial restriction placed on controllers will > render same broken in one way or another that will matter to someone > somewhere. Making something less than it was will do that. The specifics of gains and losses are what I've been trying to clarify in this thread. Hopefully, what we can gain from sharing common resource domains is clear by now. The practical cost is loss of the capability to let internal tasks compete against sibling cgroups for proportional control. However, to determine the weight of this cost, we have to know which use-cases call for it. Thanks. -- tejun ^ permalink raw reply [flat|nested] 50+ messages in thread
[parent not found: <20160413155952.GU24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160413155952.GU24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> @ 2016-04-13 19:15 ` Mike Galbraith 0 siblings, 0 replies; 50+ messages in thread From: Mike Galbraith @ 2016-04-13 19:15 UTC (permalink / raw) To: Tejun Heo Cc: Peter Zijlstra, Johannes Weiner, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Wed, 2016-04-13 at 11:59 -0400, Tejun Heo wrote: > Are you saying that you're aware that Google or another big outfit is > making active use of internal tasks competing against sibling cgroups > for proportional CPU distribution? If so, can you please be more > specific? What I'm aware of is a big outfit that moves thread pool workers in/out of a large number of cpu/cpuacct cgroups. What all a worker thread may spawn in a cgroup, or find already there upon arrival and thus compete with, I do not know. I'm baffled by why anyone would care which entity competes with which other entity. An entity is an entity is an entity. -Mike ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP 2016-04-13 15:59 ` Tejun Heo [not found] ` <20160413155952.GU24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> @ 2016-04-14 6:07 ` Mike Galbraith [not found] ` <1460614057.5100.150.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 1 sibling, 1 reply; 50+ messages in thread From: Mike Galbraith @ 2016-04-14 6:07 UTC (permalink / raw) To: Tejun Heo Cc: Peter Zijlstra, Johannes Weiner, torvalds, akpm, mingo, lizefan, pjt, linux-kernel, cgroups, linux-api, kernel-team On Wed, 2016-04-13 at 11:59 -0400, Tejun Heo wrote: > Hello, Mike. > > On Wed, Apr 13, 2016 at 09:43:01AM +0200, Mike Galbraith wrote: > > > The cost is part aesthetical and part practical. While less > > > elegant > > > than tree of uniform objects, it seems a stretch to call internal > > > / > > > leaf node distinction broken especially given that the model is > > > natural to some controllers. > > > > That justifies prohibiting proper usages of three controllers, cpu, > > cpuacct and cpuset? > > Neither cpuacct nor cpuset loses any capability from the constraint as > there is no difference between tasks being in an internal cgroup or a > leaf cgroup nested under it. The only practical impact is that we > lose the ability to let internal tasks compete against sibling cgroups > for proportional control. I'm not getting it.

A. entity = task[s] | cgroup[s]
B. entity = task[s] ^ cgroup[s]

A I get, B I don't, but you seem to be saying B, else we get the task-competes-with-sibling-cgroup business. Let /foo be an exclusive cpuset containing exclusive subset bar. How can any task acquire set foo affinity if B really really applies? My box calls me a dummy if I try to create a "proper" home for tasks, one with both no snobby neighbors and proper affinity. -Mike ^ permalink raw reply [flat|nested] 50+ messages in thread
[parent not found: <1460614057.5100.150.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <1460614057.5100.150.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2016-04-14 19:57 ` Tejun Heo [not found] ` <20160414195748.GK7822-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 50+ messages in thread From: Tejun Heo @ 2016-04-14 19:57 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, Johannes Weiner, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg Hello, Mike. On Thu, Apr 14, 2016 at 08:07:37AM +0200, Mike Galbraith wrote: > Let /foo be an exclusive cpuset containing exclusive subset bar. > How can any task acquire set foo affinity if B really really > applies? My box calls me a dummy if I try to create a "proper" home > for tasks, one with both no snobby neighbors and proper affinity. I'm not sure I quite understand what you're saying. Are you referring to the cpuset.{cpu|mem}_exclusive knobs? Thanks. -- tejun ^ permalink raw reply [flat|nested] 50+ messages in thread
[parent not found: <20160414195748.GK7822-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160414195748.GK7822-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> @ 2016-04-15 2:42 ` Mike Galbraith 0 siblings, 0 replies; 50+ messages in thread From: Mike Galbraith @ 2016-04-15 2:42 UTC (permalink / raw) To: Tejun Heo Cc: Peter Zijlstra, Johannes Weiner, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Thu, 2016-04-14 at 15:57 -0400, Tejun Heo wrote: > Hello, Mike. > > On Thu, Apr 14, 2016 at 08:07:37AM +0200, Mike Galbraith wrote: > > Let /foo be an exclusive cpuset containing exclusive subset bar. > > How can any task acquire set foo affinity if B really really > > applies? My box calls me a dummy if I try to create a "proper" > > home > > for tasks, one with both no snobby neighbors and proper affinity. > > I'm not sure I quite understand what you're saying. Are you referring > to the cpuset.{cpu|mem}_exclusive knobs? Yes. -Mike ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160408201135.GO24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 2016-04-09 6:16 ` Mike Galbraith 2016-04-09 13:39 ` Peter Zijlstra @ 2016-04-09 16:02 ` Peter Zijlstra 2 siblings, 0 replies; 50+ messages in thread From: Peter Zijlstra @ 2016-04-09 16:02 UTC (permalink / raw) To: Tejun Heo Cc: Johannes Weiner, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Fri, Apr 08, 2016 at 04:11:35PM -0400, Tejun Heo wrote: > > Yes, I'm familiar with the problem; but simply mandating leaf only nodes > > is not a solution, for the very simple fact that there are tasks in the > > root cgroup that cannot ever be moved out, so we _must_ be able to deal > > with !leaf nodes containing tasks. > > As Johannes already pointed out, the root cgroup has always been > special. The root of the tree isn't special except for two properties:

- it _is_ a root; iow, it doesn't have any incoming edges. This also means it doesn't have a parent; nor can it have a weight, since that is an edge property, not a node property.

- it always exists; for without a root there is no tree.

Making it _more_ special is silly. > > Maybe, but so far I've only heard people complain this v2 thing didn't > > work for them, and as far as I can see the whole v2 model is internally > > inconsistent and impossible to implement. > > I suppose we live in different bubbles. Can you please elaborate > which parts of cgroup v2 model are internally inconsistent and > impossible to implement? I'd be happy to rectify the situation. The fact that we have to deal with tasks in the root cgroup while not allowing tasks in any other node is internally inconsistent. If I can deal with tasks in one node (root) I can equally deal with tasks in any other node in exactly the same manner. Making it different is actually _more_ code. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160407073547.GA12560-druUgvl0LCNAfugRpC6u6w@public.gmane.org> 2016-04-07 8:05 ` Mike Galbraith 2016-04-07 8:08 ` Peter Zijlstra @ 2016-04-07 8:28 ` Peter Zijlstra [not found] ` <20160407082810.GN3430-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 2 siblings, 1 reply; 50+ messages in thread From: Peter Zijlstra @ 2016-04-07 8:28 UTC (permalink / raw) To: Johannes Weiner Cc: Tejun Heo, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Thu, Apr 07, 2016 at 03:35:47AM -0400, Johannes Weiner wrote: > There was a lot of back and forth whether we should add a second set > of knobs just to control the local tasks separately from the subtree, > but ended up concluding that the situation can be expressed more > clearly by creating dedicated leaf subgroups for stuff like management > software and launchers instead, so that their memory pools/LRUs are > clearly delineated from other groups and separately controllable. And > we couldn't think of any meaningful configuration that could not be > expressed in that scheme. I mean, it's the same thing, right? No, not the same.

        R
      / | \
    t1  t2  A
           / \
         t3   t4

Is fundamentally different from:

        R
      /   \
     L     A
    / \   / \
  t1  t2 t3  t4

Because if in the first hierarchy you add a task (t5) to R, all of a sudden A will run at 1/4th of total bandwidth where before it had 1/3rd, whereas with the second example, if you add our t5 to L, A doesn't get any less bandwidth. Please pull your collective heads out of the systemd arse and start thinking. ^ permalink raw reply [flat|nested] 50+ messages in thread
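To make the arithmetic behind the two diagrams explicit (assuming the scheduler's default weight of 1024 for every task and task group at nice 0):

first hierarchy:   share(A) = 1024 / (3 * 1024) = 1/3   (peers: t1, t2)
                   after t5 joins R:
                   share(A) = 1024 / (4 * 1024) = 1/4

second hierarchy:  share(A) = 1024 / (2 * 1024) = 1/2   (peer: L only)
                   t5 joining L only changes how L's half is split
                   among t1, t2 and t5; share(A) stays 1/2.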
[parent not found: <20160407082810.GN3430-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160407082810.GN3430-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2016-04-07 19:04 ` Johannes Weiner [not found] ` <20160407190424.GA20407-druUgvl0LCNAfugRpC6u6w@public.gmane.org> 0 siblings, 1 reply; 50+ messages in thread From: Johannes Weiner @ 2016-04-07 19:04 UTC (permalink / raw) To: Peter Zijlstra Cc: Tejun Heo, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Thu, Apr 07, 2016 at 10:28:10AM +0200, Peter Zijlstra wrote: > On Thu, Apr 07, 2016 at 03:35:47AM -0400, Johannes Weiner wrote: > > There was a lot of back and forth whether we should add a second set > > of knobs just to control the local tasks separately from the subtree, > > but ended up concluding that the situation can be expressed more > > clearly by creating dedicated leaf subgroups for stuff like management > > software and launchers instead, so that their memory pools/LRUs are > > clearly delineated from other groups and separately controllable. And > > we couldn't think of any meaningful configuration that could not be > > expressed in that scheme. I mean, it's the same thing, right? > > No, not the same.
>
>         R
>       / | \
>     t1  t2  A
>            / \
>          t3   t4
>
> Is fundamentally different from:
>
>         R
>       /   \
>      L     A
>     / \   / \
>   t1  t2 t3  t4
>
> Because if in the first hierarchy you add a task (t5) to R, all of a sudden A > will run at 1/4th of total bandwidth where before it had 1/3rd, whereas > with the second example, if you add our t5 to L, A doesn't get any less > bandwidth. I didn't mean the same exact configuration, I meant being able to configure with the same outcome of resource distribution. All this means here is that if you want to change the shares allocated to the tasks in R (or then L) you have to be explicit about it and update the weight configuration in L. Again, it's not gratuitous, it's based on the problems this concept in the interface created in more comprehensive container deployments. > Please pull your collective heads out of the systemd arse and start > thinking. I don't care about systemd here. In fact, in 5 years of rewriting the memory controller, zero percent of it was driven by systemd and most of it came from Google's feedback at LSF and email since they had by far the most experience and were pushing the frontier. And even though the performance and overhead of the memory controller was absolutely abysmal - routinely hitting double digits in page fault profiles - the discussions *always* centered around the interface and configuration. IMO, this thread is a little too focused on the reality of a single resource controller, when in real setups it doesn't exist in a vacuum. What these environments need is to robustly divide the machine up into parcels to isolate thousands of jobs on X dimensions at the same time: allocate CPU time, allocate memory, allocate IO. And then on top of that implement higher concepts such as dirty page quotas and writeback, accounting for kswapd's cpu time based on who owns the memory it reclaims, accounting IO time for the stuff it swaps out etc. That *needs* all three resources to be coordinated.
You disparagingly called it the lowest common denominator, but the thing is that streamlining the controllers and coordinating them around shared resource domains gives us much more powerful and robust ways to allocate the *machines* as a whole, and allows the proper tracking and accounting of cross-domain operations such as writeback that wasn't even possible before. And all that in a way that doesn't have the same usability pitfalls that v1 had when you actually push this stuff beyond the "i want to limit the cpu cycles of this one service" and move towards "this machine is an anonymous node in a data center and I want it to host thousands of different workloads - some sensitive to latency, some that only care about throughput - and they better not step on each other's toes on *any* of the resource pools." Those are my primary concerns when it comes to the v2 interface, and I think focusing too much on what's theoretically possible with a single controller is missing the bigger challenge of allocating machines. ^ permalink raw reply [flat|nested] 50+ messages in thread
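For concreteness, one way the layout described above could look on a cgroup v2 mount (the paths and controller set here are illustrative, not a prescribed configuration):

/sys/fs/cgroup/box/cgroup.subtree_control   <- "+cpu +memory +io"
/sys/fs/cgroup/box/mgmt/cgroup.procs        <- management software, launchers
/sys/fs/cgroup/box/jobA/cgroup.procs        <- a latency-sensitive workload
/sys/fs/cgroup/box/jobB/cgroup.procs        <- a throughput workload

The management tools get a dedicated leaf, so their memory pools/LRUs, CPU time and IO are delineated and controllable exactly like any other group's, and no process sits in the internal "box" node.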
[parent not found: <20160407190424.GA20407-druUgvl0LCNAfugRpC6u6w@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160407190424.GA20407-druUgvl0LCNAfugRpC6u6w@public.gmane.org> @ 2016-04-07 19:31 ` Peter Zijlstra [not found] ` <20160407193127.GB3448-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 0 siblings, 1 reply; 50+ messages in thread From: Peter Zijlstra @ 2016-04-07 19:31 UTC (permalink / raw) To: Johannes Weiner Cc: Tejun Heo, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Thu, Apr 07, 2016 at 03:04:24PM -0400, Johannes Weiner wrote: > All this means here is that if you want to change the shares allocated > to the tasks in R (or then L) you have to be explicit about it and > update the weight configuration in L. Updating the weight of L for every task spawned and killed is simply not an option. The fact that you're not willing to admit to this is troubling, but does confirm I can stop spending time on anything cgroup v2. cpu-cgroup just isn't going to move to this inferior interface. ^ permalink raw reply [flat|nested] 50+ messages in thread
[parent not found: <20160407193127.GB3448-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160407193127.GB3448-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2016-04-07 20:23 ` Johannes Weiner [not found] ` <20160407202344.GA22509-druUgvl0LCNAfugRpC6u6w@public.gmane.org> 0 siblings, 1 reply; 50+ messages in thread From: Johannes Weiner @ 2016-04-07 20:23 UTC (permalink / raw) To: Peter Zijlstra Cc: Tejun Heo, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Thu, Apr 07, 2016 at 09:31:27PM +0200, Peter Zijlstra wrote: > On Thu, Apr 07, 2016 at 03:04:24PM -0400, Johannes Weiner wrote: > > > All this means here is that if you want to change the shares allocated > > to the tasks in R (or then L) you have to be explicit about it and > > update the weight configuration in L. > > Updating the weight of L for every task spawned and killed is simply not > an option. > > The fact that you're not willing to admit to this is troubling, but does > confirm I can stop spending time on anything cgroup v2. cpu-cgroup just > isn't going to move to this inferior interface. I guess I walked right into that one, didn't I ;-) It probably makes more sense to discuss a real-life workload instead of a diagram. Obviously, if the threadpool size is highly variable it's not reasonable to ask the user to track every update and accordingly reconfigure the controller. I fully agree with you there. All I meant to point out is that the *implicit* behavior of the v1 interface did create real problems, to show you that this is not a one-sided discussion and that there are real life concerns that played into the decision of not letting loose tasks compete with groups. If this is a real workload rather than a thought experiment, it will need to be supported in v2 as well - just, if we can help it, hopefully without reverting to the tricky behavior of the v1 controller. One possible solution I could imagine, for example, is adding the option to configure a group's weight such that it's dynamically based on the # of threads. But it depends on what the exact requirements are here. Could you explain what this workload is so it's easier to reason about? Mike, is that the one you referred to with one group per customer account? If so, would you have a pointer to where you outline it? ^ permalink raw reply [flat|nested] 50+ messages in thread
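To put numbers on the dynamic-weight idea using Peter's earlier example (again assuming the default per-entity weight of 1024; this is purely an illustration of what Johannes floats, not a posted design): define weight(L) = nr_threads(L) * 1024. With t1 and t2 in L, share(A) = 1024 / (2 * 1024 + 1024) = 1/3, exactly what the v1 layout with t1, t2 and A as direct children of R gives; and when t5 joins L, share(A) drops to 1024 / (3 * 1024 + 1024) = 1/4 automatically, with no userspace reconfiguration.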
[parent not found: <20160407202344.GA22509-druUgvl0LCNAfugRpC6u6w@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160407202344.GA22509-druUgvl0LCNAfugRpC6u6w@public.gmane.org> @ 2016-04-08 3:13 ` Mike Galbraith 0 siblings, 0 replies; 50+ messages in thread From: Mike Galbraith @ 2016-04-08 3:13 UTC (permalink / raw) To: Johannes Weiner, Peter Zijlstra Cc: Tejun Heo, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Thu, 2016-04-07 at 16:23 -0400, Johannes Weiner wrote: > Mike, is that the one you referred to with one group per customer > account? If so, would you have a pointer to where you outline it? The usage I loosely outlined, I did in this thread. All of the gory details I do not have, do not want, and could not provide if I did. -Mike ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <1457710888-31182-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ` (8 preceding siblings ...) 2016-03-14 11:30 ` Peter Zijlstra @ 2016-03-15 17:21 ` Michal Hocko [not found] ` <20160315172136.GA6114-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 9 siblings, 1 reply; 50+ messages in thread From: Michal Hocko @ 2016-03-15 17:21 UTC (permalink / raw) To: Tejun Heo Cc: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Fri 11-03-16 10:41:18, Tejun Heo wrote: > Hello, > > This patchset extends cgroup v2 to support rgroup (resource group) for > in-process hierarchical resource control and implements PRIO_RGRP for > setpriority(2) on top to allow in-process hierarchical CPU cycle > control in a seamless way. > > cgroup v1 allowed putting threads of a process in different cgroups > which enabled ad-hoc in-process resource control of some resources. > Unfortunately, this approach was fraught with problems such as > membership ambiguity with per-process resources [Sorry if this has already been discussed, I haven't followed all the discussions regarding this topic] While I agree that per-thread granularity is no fun for controllers which operate on entities other than task_struct (like the memory cgroup controller), I am afraid that all the complications will not go away if we are strictly per-process anyway. For example, the memcg controller is not strictly per-process either: it operates on the mm_struct, and that might be shared between different _processes_. So we still might end up in the same schizophrenic situation where two different processes are living in different cgroups while one of them is silently operating in a different memcg. I really hate this, but this is what our clone(CLONE_VM) (without CLONE_THREAD) allows one to do. I do not know about other controllers, maybe only memcg is so special, but that would suggest that even a process-only restriction might turn out to be a problem in the future and controllers would have to face the same problem later on. Now I have to admit I do not have great ideas on how to cover all the possible cases, but wouldn't it make more sense to allow for more flexibility and allow thread migration, while letting the migration be vetoed by any controller should it cross into a different/incompatible cgroup? [...] Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 50+ messages in thread
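A minimal userspace sketch of the combination being described: two full processes (separate pids, each placeable in a different cgroup) sharing one mm. Nothing here is specific to this patchset; it is plain clone(2):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared;	/* lives in the address space both processes share */

static int child_fn(void *arg)
{
	shared = 42;	/* visible to the parent: one mm, two processes */
	printf("child:  pid=%d\n", getpid());
	return 0;
}

int main(void)
{
	char *stack = malloc(1024 * 1024);

	if (!stack)
		return 1;
	/* CLONE_VM without CLONE_THREAD: a separate process (own pid,
	 * own signal state) operating on the caller's mm_struct */
	pid_t pid = clone(child_fn, stack + 1024 * 1024,
			  CLONE_VM | SIGCHLD, NULL);

	waitpid(pid, NULL, 0);
	printf("parent: pid=%d shared=%d\n", getpid(), shared);
	return 0;
}

Each of the two pids can be written to a different cgroup.procs, yet memcg charges the single shared mm to only one of them, which is the schizophrenic membership being described.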
[parent not found: <20160315172136.GA6114-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160315172136.GA6114-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2016-04-06 21:53 ` Tejun Heo [not found] ` <20160406215307.GJ24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 50+ messages in thread From: Tejun Heo @ 2016-04-06 21:53 UTC (permalink / raw) To: Michal Hocko Cc: torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg Hello, Michal. Sorry about the delay. On Tue, Mar 15, 2016 at 06:21:36PM +0100, Michal Hocko wrote: > While I agree that per-thread granularity is no fun for controllers > which operate on entities other than task_struct (like the memory > cgroup controller), I am afraid that all the complications will not go > away if we are strictly per-process anyway. > > For example, the memcg controller is not strictly per-process either: > it operates on the mm_struct, and that might be shared between different > _processes_. So we still might end up in the same schizophrenic > situation where two different processes are living in different > cgroups while one of them is silently operating in a different memcg. > I really hate this, but this is what our clone(CLONE_VM) (without > CLONE_THREAD) allows one to do. Can you list applications which make use of CLONE_VM without CLONE_THREAD? I searched using searchcode.com and the only non-kernel code that I see is in niche pthread implementations and some strace-type audit tools. The only reason those thread packages use CLONE_VM && !CLONE_THREAD is that this used to be how linuxthreads was done before the linux kernel grew proper threading support with CLONE_THREAD. What you're pointing out is a historical vestige and if you can't bring yourself to agree to the fact that processes and threads are the primary abstractions that our userspace uses day in and day out, you are not thinking straight. Even the existing usages are *to* implement pthread. While the kernel can't assume CLONE_VM is always accompanied by CLONE_THREAD and shouldn't be crashing when such conditions occur, we don't and shouldn't architect or optimize for them either. In fact, both memory and io pretty much declare that the specific behaviors are undefined. > I do not know about other controllers, maybe only memcg is so special, > but that would suggest that even a process-only restriction might turn > out to be a problem in the future and controllers would have to face > the same problem later on. > > Now I have to admit I do not have great ideas on how to cover all the > possible cases, but wouldn't it make more sense to allow for more > flexibility and allow thread migration, while letting the migration be > vetoed by any controller should it cross into a different/incompatible > cgroup? This is a non-issue and designing an interface is not about "covering all the possible cases". Different cases have differing levels of importance. It'd be absolutely crazy to put the same amount of consideration towards the CLONE_VM && !CLONE_THREAD case when designing *anything*. Another factor to consider, which might not be immediately intuitive, is that exposing everything comes at a cost, often a steep one.
cgroup has reliably proven to be a very good example of this. Orthogonal hierarchies seem totally flexible on the surface, but they make it extremely awkward for different controllers to cooperate, preventing something as fundamental as control over buffered writes. This case is similar too. While exposing every possible combination to userland might seem to be a good idea on the surface, the end result is the kernel failing to provide the necessary isolation between operations internal to applications and system management, making resource control essentially inaccessible outside of specialized custom setups. It's a failure, not a feature. Thanks. -- tejun ^ permalink raw reply [flat|nested] 50+ messages in thread
[parent not found: <20160406215307.GJ24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>]
* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP [not found] ` <20160406215307.GJ24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> @ 2016-04-07 6:40 ` Peter Zijlstra 0 siblings, 0 replies; 50+ messages in thread From: Peter Zijlstra @ 2016-04-07 6:40 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-H+wXaHxf7aLQT0dZR+AlfA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, pjt-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Wed, Apr 06, 2016 at 05:53:07PM -0400, Tejun Heo wrote: > Can you list applications which make use of CLONE_VM without > CLONE_THREAD? I ran into one two years ago or so; I forgot what it was, but it made perf misbehave because we too had assumed this not to happen. Eventually it turned out a newer version of that software did no longer do this and we more or less left it at that. But it does show that crap is out there.. ^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 04/10] signal: make put_signal_struct() public 2016-03-11 15:41 [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Tejun Heo 2016-03-11 15:41 ` [PATCH 02/10] cgroup: un-inline cgroup_path() and friends Tejun Heo [not found] ` <1457710888-31182-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> @ 2016-03-11 15:41 ` Tejun Heo 2016-03-11 15:41 ` [PATCH 06/10] cgroup, fork: add @child and @clone_flags to threadgroup_change_begin/end() Tejun Heo 2016-03-11 15:41 ` [PATCH 09/10] cgroup: implement rgroup subtree migration Tejun Heo 4 siblings, 0 replies; 50+ messages in thread From: Tejun Heo @ 2016-03-11 15:41 UTC (permalink / raw) To: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt Cc: linux-kernel, cgroups, linux-api, kernel-team, Tejun Heo, Peter Zijlstra, Oleg Nesterov This will be used from cgroup while implementing in-process resource control. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> --- include/linux/sched.h | 2 ++ kernel/fork.c | 2 +- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index f1e81e1..80d6ed1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2034,6 +2034,8 @@ static inline int is_global_init(struct task_struct *tsk) extern struct pid *cad_pid; +extern void put_signal_struct(struct signal_struct *sig); + extern void free_task(struct task_struct *tsk); #define get_task_struct(tsk) do { atomic_inc(&(tsk)->usage); } while(0) diff --git a/kernel/fork.c b/kernel/fork.c index 2e391c7..fd826de 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -239,7 +239,7 @@ static inline void free_signal_struct(struct signal_struct *sig) kmem_cache_free(signal_cachep, sig); } -static inline void put_signal_struct(struct signal_struct *sig) +void put_signal_struct(struct signal_struct *sig) { if (atomic_dec_and_test(&sig->sigcnt)) free_signal_struct(sig); -- 2.5.0 ^ permalink raw reply related [flat|nested] 50+ messages in thread
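A sketch of how the now-public function is presumably used by the rgroup code later in the series; the rgrp_sig field and rgroup_destroy() below are illustrative names, not the series' actual code:

/* Illustration only: an rgroup pins the signal_struct of the owning
 * process at creation (the reference would be taken with
 * atomic_inc(&sig->sigcnt), as the CLONE_THREAD path in copy_process()
 * does) and drops it when the rgroup is torn down.
 */
static void rgroup_destroy(struct cgroup *rgrp)
{
	struct signal_struct *sig = rgrp->rgrp_sig;	/* hypothetical field */

	/* ... unlink the rgroup from sig's rgroup list, free it ... */

	put_signal_struct(sig);		/* may free @sig on the last reference */
}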
* [PATCH 06/10] cgroup, fork: add @child and @clone_flags to threadgroup_change_begin/end() 2016-03-11 15:41 [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Tejun Heo ` (2 preceding siblings ...) 2016-03-11 15:41 ` [PATCH 04/10] signal: make put_signal_struct() public Tejun Heo @ 2016-03-11 15:41 ` Tejun Heo 2016-03-11 15:41 ` [PATCH 09/10] cgroup: implement rgroup subtree migration Tejun Heo 4 siblings, 0 replies; 50+ messages in thread From: Tejun Heo @ 2016-03-11 15:41 UTC (permalink / raw) To: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt Cc: linux-kernel, cgroups, linux-api, kernel-team, Tejun Heo, Peter Zijlstra, Oleg Nesterov When threadgroup_change_begin/end() are called from the fork path, pass in @child and @clone_flags so that the fork path can be distinguished and fork related information is available. While at it, un-inline cgroup_threadgroup_change_begin/end() and fold cgroup_fork() into cgroup_threadgroup_change_begin(). These changes will be used to implement in-process resource control. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> --- fs/exec.c | 6 +++--- include/linux/cgroup-defs.h | 39 ++++++++++++------------------------- include/linux/cgroup.h | 2 -- include/linux/sched.h | 16 +++++++++++---- kernel/cgroup.c | 47 ++++++++++++++++++++++++++++++++++++--------- kernel/fork.c | 7 +++---- kernel/signal.c | 6 +++--- 7 files changed, 71 insertions(+), 52 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 828ec5f..5b81bbb 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -936,7 +936,7 @@ static int de_thread(struct task_struct *tsk) struct task_struct *leader = tsk->group_leader; for (;;) { - threadgroup_change_begin(tsk); + threadgroup_change_begin(tsk, NULL, 0); write_lock_irq(&tasklist_lock); /* * Do this under tasklist_lock to ensure that @@ -947,7 +947,7 @@ static int de_thread(struct task_struct *tsk) break; __set_current_state(TASK_KILLABLE); write_unlock_irq(&tasklist_lock); - threadgroup_change_end(tsk); + threadgroup_change_end(tsk, NULL, 0); schedule(); if (unlikely(__fatal_signal_pending(tsk))) goto killed; @@ -1005,7 +1005,7 @@ static int de_thread(struct task_struct *tsk) if (unlikely(leader->ptrace)) __wake_up_parent(leader, leader->parent); write_unlock_irq(&tasklist_lock); - threadgroup_change_end(tsk); + threadgroup_change_end(tsk, NULL, 0); release_task(leader); } diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index d3d1f92..3c4a75b 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -514,31 +514,12 @@ struct cgroup_subsys { unsigned int depends_on; }; -extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem; - -/** - * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups - * @tsk: target task - * - * Called from threadgroup_change_begin() and allows cgroup operations to - * synchronize against threadgroup changes using a percpu_rw_semaphore. - */ -static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) -{ - percpu_down_read(&cgroup_threadgroup_rwsem); -} - -/** - * cgroup_threadgroup_change_end - threadgroup exclusion for cgroups - * @tsk: target task - * - * Called from threadgroup_change_end(). Counterpart of - * cgroup_threadcgroup_change_begin().
- */ -static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) -{ - percpu_up_read(&cgroup_threadgroup_rwsem); -} +void cgroup_threadgroup_change_begin(struct task_struct *tsk, + struct task_struct *child, + unsigned long clone_flags); +void cgroup_threadgroup_change_end(struct task_struct *tsk, + struct task_struct *child, + unsigned long clone_flags); #else /* CONFIG_CGROUPS */ @@ -546,8 +527,12 @@ struct css_set; #define CGROUP_SUBSYS_COUNT 0 -static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) {} -static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) {} +static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk, + struct task_struct *child, + unsigned long clone_flags) {} +static inline void cgroup_threadgroup_change_end(struct task_struct *tsk, + struct task_struct *child, + unsigned long clone_flags) {} #endif /* CONFIG_CGROUPS */ diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index ebcf21f..1e00fc0 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -101,7 +101,6 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry); int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *tsk); -void cgroup_fork(struct task_struct *p); extern int cgroup_can_fork(struct task_struct *p, unsigned long clone_flags, struct css_set **new_rgrp_csetp); extern void cgroup_cancel_fork(struct task_struct *p, unsigned long clone_flags, @@ -540,7 +539,6 @@ static inline int cgroup_attach_task_all(struct task_struct *from, static inline int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry) { return -EINVAL; } -static inline void cgroup_fork(struct task_struct *p) {} static inline int cgroup_can_fork(struct task_struct *p, unsigned long clone_flags, struct css_set **new_rgrp_csetp) { return 0; } diff --git a/include/linux/sched.h b/include/linux/sched.h index 80d6ed1..d4ae795 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2783,6 +2783,8 @@ static inline void unlock_task_sighand(struct task_struct *tsk, /** * threadgroup_change_begin - mark the beginning of changes to a threadgroup * @tsk: task causing the changes + * @child: child task if forking, NULL otherwise + * @clone_flags: clone flags if forking * * All operations which modify a threadgroup - a new thread joining the * group, death of a member thread (the assertion of PF_EXITING) and @@ -2791,21 +2793,27 @@ static inline void unlock_task_sighand(struct task_struct *tsk, * subsystems needing threadgroup stability can hook into for * synchronization. */ -static inline void threadgroup_change_begin(struct task_struct *tsk) +static inline void threadgroup_change_begin(struct task_struct *tsk, + struct task_struct *child, + unsigned long clone_flags) { might_sleep(); - cgroup_threadgroup_change_begin(tsk); + cgroup_threadgroup_change_begin(tsk, child, clone_flags); } /** * threadgroup_change_end - mark the end of changes to a threadgroup * @tsk: task causing the changes + * @child: child task if forking, NULL otherwise + * @clone_flags: clone flags if forking * * See threadgroup_change_begin(). 
*/ -static inline void threadgroup_change_end(struct task_struct *tsk) +static inline void threadgroup_change_end(struct task_struct *tsk, + struct task_struct *child, + unsigned long clone_flags) { - cgroup_threadgroup_change_end(tsk); + cgroup_threadgroup_change_end(tsk, child, clone_flags); } #ifndef __HAVE_THREAD_FUNCTIONS diff --git a/kernel/cgroup.c b/kernel/cgroup.c index ac207ae..70f9985 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -110,7 +110,7 @@ static DEFINE_SPINLOCK(cgroup_file_kn_lock); */ static DEFINE_SPINLOCK(release_agent_path_lock); -struct percpu_rw_semaphore cgroup_threadgroup_rwsem; +static struct percpu_rw_semaphore cgroup_threadgroup_rwsem; #define cgroup_assert_mutex_or_rcu_locked() \ RCU_LOCKDEP_WARN(!rcu_read_lock_held() && \ @@ -5688,17 +5688,46 @@ static const struct file_operations proc_cgroupstats_operations = { }; /** - * cgroup_fork - initialize cgroup related fields during copy_process() - * @child: pointer to task_struct of forking parent process. + * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups + * @tsk: target task + * @child: child task if forking, NULL otherwise + * @clone_flags: clone flags if forking * - * A task is associated with the init_css_set until cgroup_post_fork() - * attaches it to the parent's css_set. Empty cg_list indicates that - * @child isn't holding reference to its css_set. + * Called from threadgroup_change_begin() and allows cgroup operations to + * synchronize against threadgroup changes using a percpu_rw_semaphore. */ -void cgroup_fork(struct task_struct *child) +void cgroup_threadgroup_change_begin(struct task_struct *tsk, + struct task_struct *child, + unsigned long clone_flags) { - RCU_INIT_POINTER(child->cgroups, &init_css_set); - INIT_LIST_HEAD(&child->cg_list); + if (child) { + /* + * A task is associated with the init_css_set until + * cgroup_post_fork() attaches it to the parent's css_set. + * Empty cg_list indicates that @child isn't holding + * reference to its css_set. + */ + RCU_INIT_POINTER(child->cgroups, &init_css_set); + INIT_LIST_HEAD(&child->cg_list); + } + + percpu_down_read(&cgroup_threadgroup_rwsem); +} + +/** + * cgroup_threadgroup_change_end - threadgroup exclusion for cgroups + * @tsk: target task + * @child: child task if forking, NULL otherwise + * @clone_flags: clone flags if forking + * + * Called from threadgroup_change_end(). Counterpart of + * cgroup_threadcgroup_change_begin(). 
+ */ +void cgroup_threadgroup_change_end(struct task_struct *tsk, + struct task_struct *child, + unsigned long clone_flags) +{ + percpu_up_read(&cgroup_threadgroup_rwsem); } /** diff --git a/kernel/fork.c b/kernel/fork.c index 812d477..840b662 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1368,8 +1368,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, p->real_start_time = ktime_get_boot_ns(); p->io_context = NULL; p->audit_context = NULL; - threadgroup_change_begin(current); - cgroup_fork(p); + threadgroup_change_begin(current, p, clone_flags); #ifdef CONFIG_NUMA p->mempolicy = mpol_dup(p->mempolicy); if (IS_ERR(p->mempolicy)) { @@ -1609,7 +1608,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, proc_fork_connector(p); cgroup_post_fork(p, clone_flags, new_rgrp_cset); - threadgroup_change_end(current); + threadgroup_change_end(current, p, clone_flags); perf_event_fork(p); trace_task_newtask(p, clone_flags); @@ -1650,7 +1649,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, mpol_put(p->mempolicy); bad_fork_cleanup_threadgroup_lock: #endif - threadgroup_change_end(current); + threadgroup_change_end(current, p, clone_flags); delayacct_tsk_free(p); bad_fork_cleanup_count: atomic_dec(&p->cred->user->processes); diff --git a/kernel/signal.c b/kernel/signal.c index f3f1f7a..1679c02 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2389,11 +2389,11 @@ void exit_signals(struct task_struct *tsk) * @tsk is about to have PF_EXITING set - lock out users which * expect stable threadgroup. */ - threadgroup_change_begin(tsk); + threadgroup_change_begin(tsk, NULL, 0); if (thread_group_empty(tsk) || signal_group_exit(tsk->signal)) { tsk->flags |= PF_EXITING; - threadgroup_change_end(tsk); + threadgroup_change_end(tsk, NULL, 0); return; } @@ -2404,7 +2404,7 @@ void exit_signals(struct task_struct *tsk) */ tsk->flags |= PF_EXITING; - threadgroup_change_end(tsk); + threadgroup_change_end(tsk, NULL, 0); if (!signal_pending(tsk)) goto out; -- 2.5.0 ^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 09/10] cgroup: implement rgroup subtree migration 2016-03-11 15:41 [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Tejun Heo ` (3 preceding siblings ...) 2016-03-11 15:41 ` [PATCH 06/10] cgroup, fork: add @child and @clone_flags to threadgroup_change_begin/end() Tejun Heo @ 2016-03-11 15:41 ` Tejun Heo 4 siblings, 0 replies; 50+ messages in thread From: Tejun Heo @ 2016-03-11 15:41 UTC (permalink / raw) To: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt Cc: linux-kernel, cgroups, linux-api, kernel-team, Tejun Heo, Peter Zijlstra, Oleg Nesterov Currently, when a process with rgroups is migrated, rgroup subtrees are not preserved and all threads are put directly under the migration destination cgroup. This patch implements rgroup subtree migration so that rgroup subtrees are preserved across process migration. Early during process migration, cgroup_migrate_copy_rgrps() duplicates rgroup subtrees of a process under the destination cgroup and links the counterparts with newly added src_cgrp->rgrp_target. Also, subsystems can implement css_copy() method to copy over settings and whatever states necessary. Once copying is complete, the actual migration uses ->rgrp_target as destination. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul Turner <pjt@google.com> --- include/linux/cgroup-defs.h | 5 ++ kernel/cgroup.c | 157 +++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 159 insertions(+), 3 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index f1ee756..9ffa2d8 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -310,6 +310,9 @@ struct cgroup { /* signal structs with rgroups below this cgroup */ struct list_head rgrp_child_sigs; + /* target rgroup, used during rgroup subtree migration */ + struct cgroup *rgrp_target; + /* * list of pidlists, up to two for each namespace (one for procs, one * for tasks); created on demand. 
@@ -462,6 +465,8 @@ struct cgroup_subsys { void (*css_offline)(struct cgroup_subsys_state *css); void (*css_released)(struct cgroup_subsys_state *css); void (*css_free)(struct cgroup_subsys_state *css); + int (*css_copy)(struct cgroup_subsys_state *to, + struct cgroup_subsys_state *from); void (*css_reset)(struct cgroup_subsys_state *css); int (*can_attach)(struct cgroup_taskset *tset); diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 283b7ed..6107a1f 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -229,8 +229,12 @@ static struct file_system_type cgroup2_fs_type; static struct cftype cgroup_dfl_base_files[]; static struct cftype cgroup_legacy_base_files[]; +static struct cgroup *rgroup_create(struct cgroup *parent, struct signal_struct *sig); static int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask); static void cgroup_lock_and_drain_offline(struct cgroup *cgrp); +static void cgroup_save_control(struct cgroup *cgrp); +static void cgroup_propagate_control(struct cgroup *cgrp); +static void cgroup_restore_control(struct cgroup *cgrp); static int cgroup_apply_control(struct cgroup *cgrp); static void cgroup_finalize_control(struct cgroup *cgrp, int ret); static void css_task_iter_advance(struct css_task_iter *it); @@ -2478,6 +2482,7 @@ struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset, enum { CGRP_MIGRATE_PROCESS = (1 << 0), /* migrate the whole process */ + CGRP_MIGRATE_COPY_RGRPS = (1 << 1), /* copy rgroup subtree */ }; /** @@ -2752,6 +2757,132 @@ static int cgroup_migrate(struct task_struct *leader, unsigned mflags, } /** + * cgroup_migrate_uncopy_rgrps - cancel in-flight rgroup subtree copying + * @dst_cgrp: migration target cgroup + * @leader: leader of process being migrated + * + * Undo cgroup_migrate_copy_rgrps(). + */ +static void cgroup_migrate_uncopy_rgrps(struct cgroup *dst_cgrp, + struct task_struct *leader) +{ + struct signal_struct *sig = leader->signal; + struct cgroup *sgrp = nearest_sgroup(task_css_set(leader)->dfl_cgrp); + struct cgroup *rgrp, *dsct; + struct cgroup_subsys_state *d_css; + + /* destroy rgroup copies */ + list_for_each_entry(rgrp, &sig->rgrps, rgrp_node) { + cgroup_for_each_live_descendant_post(dsct, d_css, rgrp) { + if (dsct->rgrp_target) { + rgroup_destroy_schedule(dsct->rgrp_target); + dsct->rgrp_target = NULL; + } + } + } + + /* move back @sig under the original nearest sgroup */ + if (!list_empty(&sig->rgrp_node)) + list_move_tail(&sig->rgrp_node, &sgrp->rgrp_child_sigs); + + sgrp->rgrp_target = NULL; + cgroup_restore_control(sgrp); +} + +/** + * cgroup_migrate_copy_rgrps - copy a process's rgroup subtrees + * @dst_cgrp: migration target cgroup + * @leader: leader of process being migrated + * + * @leader and its threads are being migrated under @dst_cgrp. Copy the + * process's rgroup subtrees under @dst_cgrp and make the source rgroups + * and their nearest sgroup point to the counterpart in the copied subtrees + * via ->rgrp_target. + * + * Before process migration is complete, this operation can be canceled + * using cgroup_migrate_uncopy_rgrps(). 
+ */ +static int cgroup_migrate_copy_rgrps(struct cgroup *dst_cgrp, + struct task_struct *leader) +{ + struct signal_struct *sig = leader->signal; + struct cgroup *sgrp = nearest_sgroup(task_css_set(leader)->dfl_cgrp); + struct cgroup *rgrp, *dsct; + struct cgroup_subsys_state *d_css; + int ret; + + if (WARN_ON_ONCE(!cgroup_on_dfl(dst_cgrp))) + return -EINVAL; + + /* save for uncopy */ + cgroup_save_control(sgrp); + sgrp->rgrp_target = dst_cgrp; + + /* + * Move @sig under @dst_cgrp for correct control propagation and + * update its control masks. + */ + if (!list_empty(&sig->rgrp_node)) + list_move_tail(&sig->rgrp_node, &dst_cgrp->rgrp_child_sigs); + + cgroup_propagate_control(dst_cgrp); + + /* + * Walk and copy each rgroup. As top-level copies are appended to + * &sig->rgrps, terminate on encountering one. + */ + list_for_each_entry(rgrp, &sig->rgrps, rgrp_node) { + if (cgroup_parent(rgrp) == dst_cgrp) + break; + + cgroup_for_each_live_descendant_pre(dsct, d_css, rgrp) { + struct cgroup *parent = cgroup_parent(dsct); + struct cgroup *copy; + struct cgroup_subsys_state *copy_css; + int ssid; + + if (WARN_ON_ONCE(!parent->rgrp_target) || + WARN_ON_ONCE(dsct->rgrp_target)) { + ret = -EINVAL; + goto out_uncopy; + } + + /* create a copy and refresh its control masks */ + copy = rgroup_create(parent->rgrp_target, sig); + if (IS_ERR(copy)) { + ret = PTR_ERR(copy); + goto out_uncopy; + } + + copy->subtree_control = dsct->subtree_control; + cgroup_propagate_control(copy); + + dsct->rgrp_target = copy; + + /* copy subsystem states */ + for_each_css(copy_css, ssid, copy) { + struct cgroup_subsys *ss = copy_css->ss; + struct cgroup_subsys_state *css = + cgroup_css(dsct, ss); + + if (!ss->css_copy || !css) + continue; + + ret = ss->css_copy(copy_css, css); + if (ret) + goto out_uncopy; + } + } + } + + return 0; + +out_uncopy: + cgroup_migrate_uncopy_rgrps(dst_cgrp, leader); + return ret; +} + +/** * cgroup_attach_task - attach a task or a whole threadgroup to a cgroup * @dst_cgrp: the cgroup to attach to * @leader: the task or the leader of the threadgroup to be attached @@ -2762,6 +2893,7 @@ static int cgroup_migrate(struct task_struct *leader, unsigned mflags, static int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, unsigned mflags) { + bool copy_rgrps = mflags & CGRP_MIGRATE_COPY_RGRPS; LIST_HEAD(preloaded_csets); struct task_struct *task; int ret; @@ -2769,13 +2901,25 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp, if (!cgroup_may_migrate_to(dst_cgrp)) return -EBUSY; + if (copy_rgrps) { + ret = cgroup_migrate_copy_rgrps(dst_cgrp, leader); + if (ret) + return ret; + } + /* look up all src csets */ spin_lock_bh(&css_set_lock); rcu_read_lock(); task = leader; do { - cgroup_migrate_add_src(task_css_set(task), dst_cgrp, - &preloaded_csets); + struct css_set *src_cset = task_css_set(task); + struct cgroup *dfl_cgrp = src_cset->dfl_cgrp; + struct cgroup *target_cgrp = dst_cgrp; + + if (copy_rgrps) + target_cgrp = dfl_cgrp->rgrp_target; + + cgroup_migrate_add_src(src_cset, target_cgrp, &preloaded_csets); if (!(mflags & CGRP_MIGRATE_PROCESS)) break; } while_each_thread(leader, task); @@ -2788,6 +2932,10 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp, ret = cgroup_migrate(leader, mflags, dst_cgrp->root); cgroup_migrate_finish(&preloaded_csets); + + if (copy_rgrps && ret) + cgroup_migrate_uncopy_rgrps(dst_cgrp, leader); + return ret; } @@ -2864,8 +3012,11 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf, tsk = current; } - if (mflags 
& CGRP_MIGRATE_PROCESS) + if (mflags & CGRP_MIGRATE_PROCESS) { tsk = tsk->group_leader; + if (cgroup_on_dfl(cgrp)) + mflags |= CGRP_MIGRATE_COPY_RGRPS; + } /* * Workqueue threads may acquire PF_NO_SETAFFINITY and become -- 2.5.0 ^ permalink raw reply related [flat|nested] 50+ messages in thread
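To illustrate the new ->css_copy() hook added above, here is roughly what a cpu controller implementation could look like. css_tg() and sched_group_set_shares() are existing helpers in kernel/sched/core.c, but this function is a sketch of the idea, not part of the posted series:

static int cpu_cgroup_css_copy(struct cgroup_subsys_state *to,
			       struct cgroup_subsys_state *from)
{
	/* carry the source rgroup's CPU weight over to its copy under
	 * the migration destination */
	return sched_group_set_shares(css_tg(to), css_tg(from)->shares);
}

With something like this wired up as .css_copy in cpu_cgrp_subsys, a process's in-process weight configuration would survive the subtree duplication performed by cgroup_migrate_copy_rgrps().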
end of thread, other threads:[~2016-04-15 2:42 UTC | newest] Thread overview: 50+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-03-11 15:41 [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Tejun Heo 2016-03-11 15:41 ` [PATCH 02/10] cgroup: un-inline cgroup_path() and friends Tejun Heo [not found] ` <1457710888-31182-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2016-03-11 15:41 ` [PATCH 01/10] cgroup: introduce cgroup_[un]lock() Tejun Heo 2016-03-11 15:41 ` [PATCH 03/10] cgroup: introduce CGRP_MIGRATE_* flags Tejun Heo 2016-03-11 15:41 ` [PATCH 05/10] cgroup, fork: add @new_rgrp_cset[p] and @clone_flags to cgroup fork callbacks Tejun Heo 2016-03-11 15:41 ` [PATCH 07/10] cgroup: introduce resource group Tejun Heo 2016-03-11 15:41 ` [PATCH 08/10] cgroup: implement rgroup control mask handling Tejun Heo 2016-03-11 15:41 ` [PATCH 10/10] cgroup, sched: implement PRIO_RGRP for {set|get}priority() Tejun Heo 2016-03-11 16:05 ` Example program for PRIO_RGRP Tejun Heo 2016-03-12 6:26 ` [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Mike Galbraith [not found] ` <1457764019.10402.72.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2016-03-12 17:04 ` Mike Galbraith [not found] ` <1457802262.3628.129.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2016-03-12 17:13 ` cgroup NAKs ignored? " Ingo Molnar [not found] ` <20160312171318.GD1108-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2016-03-13 14:42 ` Tejun Heo 2016-03-13 15:00 ` Tejun Heo [not found] ` <20160313150012.GB13405-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 2016-03-13 17:40 ` Mike Galbraith [not found] ` <1457890835.3859.72.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2016-04-07 0:00 ` Tejun Heo [not found] ` <20160407000034.GL24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 2016-04-07 3:26 ` Mike Galbraith 2016-03-14 2:23 ` Mike Galbraith 2016-03-14 11:30 ` Peter Zijlstra [not found] ` <20160314113013.GM6344-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 2016-04-06 15:58 ` Tejun Heo [not found] ` <20160406155830.GI24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 2016-04-07 6:45 ` Peter Zijlstra [not found] ` <20160407064549.GH3430-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 2016-04-07 7:35 ` Johannes Weiner [not found] ` <20160407073547.GA12560-druUgvl0LCNAfugRpC6u6w@public.gmane.org> 2016-04-07 8:05 ` Mike Galbraith 2016-04-07 8:08 ` Peter Zijlstra [not found] ` <20160407080833.GK3430-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 2016-04-07 9:28 ` Johannes Weiner [not found] ` <20160407092824.GA13839-druUgvl0LCNAfugRpC6u6w@public.gmane.org> 2016-04-07 10:42 ` Peter Zijlstra 2016-04-07 19:45 ` Tejun Heo [not found] ` <20160407194555.GI7822-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 2016-04-07 20:25 ` Peter Zijlstra [not found] ` <20160407202542.GD3448-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 2016-04-08 20:11 ` Tejun Heo [not found] ` <20160408201135.GO24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 2016-04-09 6:16 ` Mike Galbraith 2016-04-09 13:39 ` Peter Zijlstra [not found] ` <20160409133917.GV3448-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 2016-04-12 22:29 ` Tejun Heo [not found] ` <20160412222915.GT24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 2016-04-13 7:43 ` Mike Galbraith 2016-04-13 15:59 ` Tejun Heo [not found] ` <20160413155952.GU24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 2016-04-13 19:15 ` Mike Galbraith 
2016-04-14 6:07 ` Mike Galbraith [not found] ` <1460614057.5100.150.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2016-04-14 19:57 ` Tejun Heo [not found] ` <20160414195748.GK7822-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 2016-04-15 2:42 ` Mike Galbraith 2016-04-09 16:02 ` Peter Zijlstra 2016-04-07 8:28 ` Peter Zijlstra [not found] ` <20160407082810.GN3430-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 2016-04-07 19:04 ` Johannes Weiner [not found] ` <20160407190424.GA20407-druUgvl0LCNAfugRpC6u6w@public.gmane.org> 2016-04-07 19:31 ` Peter Zijlstra [not found] ` <20160407193127.GB3448-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 2016-04-07 20:23 ` Johannes Weiner [not found] ` <20160407202344.GA22509-druUgvl0LCNAfugRpC6u6w@public.gmane.org> 2016-04-08 3:13 ` Mike Galbraith 2016-03-15 17:21 ` Michal Hocko [not found] ` <20160315172136.GA6114-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2016-04-06 21:53 ` Tejun Heo [not found] ` <20160406215307.GJ24661-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 2016-04-07 6:40 ` Peter Zijlstra 2016-03-11 15:41 ` [PATCH 04/10] signal: make put_signal_struct() public Tejun Heo 2016-03-11 15:41 ` [PATCH 06/10] cgroup, fork: add @child and @clone_flags to threadgroup_change_begin/end() Tejun Heo 2016-03-11 15:41 ` [PATCH 09/10] cgroup: implement rgroup subtree migration Tejun Heo