* [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
@ 2017-02-02 20:06 Tejun Heo
From: Tejun Heo @ 2017-02-02 20:06 UTC
To: lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, peterz-wEGCiKHe2LqWVfeAwA7xHQ, mingo-H+wXaHxf7aLQT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ, efault-Mmb7MZpHnFY
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA

Hello,

This patchset implements cgroup v2 thread mode.  It is largely based
on the discussions that we had at Plumbers last year.  Here's the
rough outline.

* Thread mode is explicitly enabled on a cgroup by writing "enable"
  into the "cgroup.threads" file.  The cgroup shouldn't have any child
  cgroups or enabled controllers.

* Once enabled, an arbitrary sub-hierarchy can be created and threads
  can be put anywhere in the subtree by writing TIDs into the
  "cgroup.threads" file.  Process granularity and the
  no-internal-process constraint don't apply in a threaded subtree.

* To be used in a threaded subtree, controllers should explicitly
  declare thread mode support and should be able to handle internal
  competition in some way.

* The root of a threaded subtree serves as the resource domain for the
  whole subtree.  This is where all the controllers are guaranteed to
  have a common ground and where resource consumptions in the threaded
  subtree which aren't tied to a specific thread are charged.
  Non-threaded controllers never see beyond the thread root and can
  assume that all controllers will follow the same rules up to that
  point.  This allows threaded controllers to implement thread-granular
  resource control without getting in the way of system-level resource
  partitioning.

This patchset contains the following five patches.
For more details on the interface and behavior, please refer to the
last patch.

 0001-cgroup-reorganize-cgroup.procs-task-write-path.patch
 0002-cgroup-add-flags-to-css_task_iter_start-and-implemen.patch
 0003-cgroup-introduce-cgroup-proc_cgrp-and-threaded-css_s.patch
 0004-cgroup-implement-CSS_TASK_ITER_THREADED.patch
 0005-cgroup-implement-cgroup-v2-thread-support.patch

This patchset is on top of cgroup/for-4.11 63f1ca59453a ("Merge branch
'cgroup/for-4.11-rdmacg' into cgroup/for-4.11") and is available in
the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup2-threads

diffstat follows.  Thanks.

 Documentation/cgroup-v2.txt     |   75 ++++-
 include/linux/cgroup-defs.h     |   38 ++
 include/linux/cgroup.h          |   12
 kernel/cgroup/cgroup-internal.h |    8
 kernel/cgroup/cgroup-v1.c       |   64 +++-
 kernel/cgroup/cgroup.c          |  589 ++++++++++++++++++++++++++++++++--------
 kernel/cgroup/cpuset.c          |    6
 kernel/cgroup/freezer.c         |    6
 kernel/cgroup/pids.c            |    1
 kernel/events/core.c            |    1
 mm/memcontrol.c                 |    2
 net/core/netclassid_cgroup.c    |    2
 12 files changed, 671 insertions(+), 133 deletions(-)

--
tejun
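The enabling rules and resource-domain lookup described in the outline above can be sketched in userspace. The following is an illustrative Python model only, not the kernel implementation; the `Cgroup` class, its method names, and the inheritance of the threaded flag are all invented for the sketch:

```python
class Cgroup:
    """Toy model of a cgroup v2 node for illustrating the proposed rules."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.enabled_controllers = set()
        # A cgroup created inside a threaded subtree is part of it.
        self.threaded = parent.threaded if parent else False
        if parent:
            parent.children.append(self)

    def enable_thread_mode(self):
        # Per the cover letter: thread mode can only be enabled on a
        # cgroup with no child cgroups and no enabled controllers.
        if self.children or self.enabled_controllers:
            raise ValueError("busy: has children or enabled controllers")
        self.threaded = True

    def thread_root(self):
        # The topmost threaded ancestor is the resource domain for the
        # whole threaded subtree; None if this cgroup isn't threaded.
        cgrp = self
        while cgrp.parent and cgrp.parent.threaded:
            cgrp = cgrp.parent
        return cgrp if cgrp.threaded else None
```

For example, enabling thread mode on a leaf and then creating sub-cgroups under it makes that leaf the resource domain for everything below it, while writing "enable" on a cgroup that already has children fails.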
* [PATCH 1/5] cgroup: reorganize cgroup.procs / task write path
  2017-02-02 20:06 ` Tejun Heo
From: Tejun Heo @ 2017-02-02 20:06 UTC
To: lizefan, hannes, peterz, mingo, pjt, luto, efault
Cc: cgroups, linux-kernel, kernel-team, lvenanci, Tejun Heo

Currently, writes to the "cgroup.procs" and "cgroup.tasks" files are
all handled by __cgroup_procs_write() on both v1 and v2.  This patch
reorganizes the write path so that there are common helper functions
which the different write paths use.

While this somewhat increases LOC, the different paths are no longer
intertwined and each path has more flexibility to implement different
behaviors which will be necessary for the planned v2 thread support.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup/cgroup-internal.h |   8 +-
 kernel/cgroup/cgroup-v1.c       |  58 ++++++++++++--
 kernel/cgroup/cgroup.c          | 163 +++++++++++++++++++++-------------------
 3 files changed, 142 insertions(+), 87 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 9203bfb..6ef662a 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -179,10 +179,10 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
-ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
-			     size_t nbytes, loff_t off, bool threadgroup);
-ssize_t cgroup_procs_write(struct kernfs_open_file *of, char *buf, size_t nbytes,
-			   loff_t off);
+struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
+	__acquires(&cgroup_threadgroup_rwsem);
+void cgroup_procs_write_finish(void)
+
__releases(&cgroup_threadgroup_rwsem); void cgroup_lock_and_drain_offline(struct cgroup *cgrp); diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index fc34bcf..69146b3 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -511,10 +511,58 @@ static int cgroup_pidlist_show(struct seq_file *s, void *v) return 0; } -static ssize_t cgroup_tasks_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, loff_t off) +static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off, + bool threadgroup) { - return __cgroup_procs_write(of, buf, nbytes, off, false); + struct cgroup *cgrp; + struct task_struct *task; + const struct cred *cred, *tcred; + ssize_t ret; + + cgrp = cgroup_kn_lock_live(of->kn, false); + if (!cgrp) + return -ENODEV; + + task = cgroup_procs_write_start(buf, threadgroup); + ret = PTR_ERR_OR_ZERO(task); + if (ret) + goto out_unlock; + + /* + * Even if we're attaching all tasks in the thread group, we only + * need to check permissions on one of them. 
+ */ + cred = current_cred(); + tcred = get_task_cred(task); + if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) && + !uid_eq(cred->euid, tcred->uid) && + !uid_eq(cred->euid, tcred->suid)) + ret = -EACCES; + put_cred(tcred); + if (ret) + goto out_finish; + + ret = cgroup_attach_task(cgrp, task, threadgroup); + +out_finish: + cgroup_procs_write_finish(); +out_unlock: + cgroup_kn_unlock(of->kn); + + return ret ?: nbytes; +} + +static ssize_t cgroup1_procs_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + return __cgroup1_procs_write(of, buf, nbytes, off, true); +} + +static ssize_t cgroup1_tasks_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + return __cgroup1_procs_write(of, buf, nbytes, off, false); } static ssize_t cgroup_release_agent_write(struct kernfs_open_file *of, @@ -593,7 +641,7 @@ struct cftype cgroup1_base_files[] = { .seq_stop = cgroup_pidlist_stop, .seq_show = cgroup_pidlist_show, .private = CGROUP_FILE_PROCS, - .write = cgroup_procs_write, + .write = cgroup1_procs_write, }, { .name = "cgroup.clone_children", @@ -612,7 +660,7 @@ struct cftype cgroup1_base_files[] = { .seq_stop = cgroup_pidlist_stop, .seq_show = cgroup_pidlist_show, .private = CGROUP_FILE_TASKS, - .write = cgroup_tasks_write, + .write = cgroup1_tasks_write, }, { .name = "notify_on_release", diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index fe374f8..2c37b436 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1913,6 +1913,23 @@ int task_cgroup_path(struct task_struct *task, char *buf, size_t buflen) } EXPORT_SYMBOL_GPL(task_cgroup_path); +static struct cgroup *cgroup_migrate_common_ancestor(struct task_struct *task, + struct cgroup *dst_cgrp) +{ + struct cgroup *cgrp; + + lockdep_assert_held(&cgroup_mutex); + + spin_lock_irq(&css_set_lock); + cgrp = task_cgroup_from_root(task, &cgrp_dfl_root); + spin_unlock_irq(&css_set_lock); + + while (!cgroup_is_descendant(dst_cgrp, cgrp)) + cgrp = 
cgroup_parent(cgrp); + + return cgrp; +} + /** * cgroup_migrate_add_task - add a migration target task to a migration context * @task: target task @@ -2345,76 +2362,23 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, return ret; } -static int cgroup_procs_write_permission(struct task_struct *task, - struct cgroup *dst_cgrp, - struct kernfs_open_file *of) -{ - int ret = 0; - - if (cgroup_on_dfl(dst_cgrp)) { - struct super_block *sb = of->file->f_path.dentry->d_sb; - struct cgroup *cgrp; - struct inode *inode; - - spin_lock_irq(&css_set_lock); - cgrp = task_cgroup_from_root(task, &cgrp_dfl_root); - spin_unlock_irq(&css_set_lock); - - while (!cgroup_is_descendant(dst_cgrp, cgrp)) - cgrp = cgroup_parent(cgrp); - - ret = -ENOMEM; - inode = kernfs_get_inode(sb, cgrp->procs_file.kn); - if (inode) { - ret = inode_permission(inode, MAY_WRITE); - iput(inode); - } - } else { - const struct cred *cred = current_cred(); - const struct cred *tcred = get_task_cred(task); - - /* - * even if we're attaching all tasks in the thread group, - * we only need to check permissions on one of them. - */ - if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) && - !uid_eq(cred->euid, tcred->uid) && - !uid_eq(cred->euid, tcred->suid)) - ret = -EACCES; - put_cred(tcred); - } - - return ret; -} - -/* - * Find the task_struct of the task to attach by vpid and pass it along to the - * function to attach either it or all tasks in its threadgroup. Will lock - * cgroup_mutex and threadgroup. 
- */ -ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf, - size_t nbytes, loff_t off, bool threadgroup) +struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup) + __acquires(&cgroup_threadgroup_rwsem) { struct task_struct *tsk; - struct cgroup_subsys *ss; - struct cgroup *cgrp; pid_t pid; - int ssid, ret; if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0) - return -EINVAL; - - cgrp = cgroup_kn_lock_live(of->kn, false); - if (!cgrp) - return -ENODEV; + return ERR_PTR(-EINVAL); percpu_down_write(&cgroup_threadgroup_rwsem); + rcu_read_lock(); if (pid) { tsk = find_task_by_vpid(pid); if (!tsk) { - ret = -ESRCH; - goto out_unlock_rcu; + tsk = ERR_PTR(-ESRCH); + goto out_unlock_threadgroup; } } else { tsk = current; @@ -2429,35 +2393,30 @@ ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf, * with no rt_runtime allocated. Just say no. */ if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) { - ret = -EINVAL; - goto out_unlock_rcu; + tsk = ERR_PTR(-EINVAL); + goto out_unlock_threadgroup; } get_task_struct(tsk); - rcu_read_unlock(); - - ret = cgroup_procs_write_permission(tsk, cgrp, of); - if (!ret) - ret = cgroup_attach_task(cgrp, tsk, threadgroup); - - put_task_struct(tsk); - goto out_unlock_threadgroup; + goto out_unlock_rcu; +out_unlock_threadgroup: + percpu_up_write(&cgroup_threadgroup_rwsem); out_unlock_rcu: rcu_read_unlock(); -out_unlock_threadgroup: + return tsk; +} + +void cgroup_procs_write_finish(void) + __releases(&cgroup_threadgroup_rwsem) +{ + struct cgroup_subsys *ss; + int ssid; + percpu_up_write(&cgroup_threadgroup_rwsem); for_each_subsys(ss, ssid) if (ss->post_attach) ss->post_attach(); - cgroup_kn_unlock(of->kn); - return ret ?: nbytes; -} - -ssize_t cgroup_procs_write(struct kernfs_open_file *of, char *buf, size_t nbytes, - loff_t off) -{ - return __cgroup_procs_write(of, buf, nbytes, off, true); } static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask) @@ -3781,6 +3740,54 @@ 
static int cgroup_procs_show(struct seq_file *s, void *v) return 0; } +static int cgroup_procs_write_permission(struct cgroup *cgrp, + struct super_block *sb) +{ + struct inode *inode; + int ret; + + inode = kernfs_get_inode(sb, cgrp->procs_file.kn); + if (!inode) + return -ENOMEM; + + ret = inode_permission(inode, MAY_WRITE); + iput(inode); + return ret; +} + +static ssize_t cgroup_procs_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct cgroup *cgrp, *common_ancestor; + struct task_struct *task; + ssize_t ret; + + cgrp = cgroup_kn_lock_live(of->kn, false); + if (!cgrp) + return -ENODEV; + + task = cgroup_procs_write_start(buf, true); + ret = PTR_ERR_OR_ZERO(task); + if (ret) + goto out_unlock; + + common_ancestor = cgroup_migrate_common_ancestor(task, cgrp); + + ret = cgroup_procs_write_permission(common_ancestor, + of->file->f_path.dentry->d_sb); + if (ret) + goto out_finish; + + ret = cgroup_attach_task(cgrp, task, true); + +out_finish: + cgroup_procs_write_finish(); +out_unlock: + cgroup_kn_unlock(of->kn); + + return ret ?: nbytes; +} + /* cgroup core interface files for the default hierarchy */ static struct cftype cgroup_base_files[] = { { -- 2.9.3 ^ permalink raw reply related [flat|nested] 32+ messages in thread
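The new v2 permission check above works against the common ancestor of the task's current cgroup and the destination: cgroup_migrate_common_ancestor() walks the source cgroup upward until the destination is a descendant of it, and write permission on that ancestor's "cgroup.procs" file is what's required. A rough Python model of that walk (the `Node` class and helper names are invented for illustration; the kernel operates on `struct cgroup` under cgroup_mutex):

```python
class Node:
    """Minimal cgroup stand-in: just a parent pointer."""
    def __init__(self, parent=None):
        self.parent = parent


def common_ancestor(task_cgrp, dst_cgrp):
    """Walk up from the task's current cgroup until the destination is a
    descendant of (or equal to) it, mirroring the loop in
    cgroup_migrate_common_ancestor()."""

    def is_descendant(cgrp, ancestor):
        while cgrp:
            if cgrp is ancestor:
                return True
            cgrp = cgrp.parent
        return False

    cgrp = task_cgrp
    while not is_descendant(dst_cgrp, cgrp):
        cgrp = cgrp.parent
    return cgrp
```

For siblings the walk ends at their shared parent; when the destination already sits below the source cgroup, the source cgroup itself is returned.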
* [PATCH 2/5] cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS
  2017-02-02 20:06 ` Tejun Heo
From: Tejun Heo @ 2017-02-02 20:06 UTC
To: lizefan, hannes, peterz, mingo, pjt, luto, efault
Cc: cgroups, linux-kernel, kernel-team, lvenanci, Tejun Heo

css_task_iter currently always walks all tasks.  With the scheduled
cgroup v2 thread support, the iterator will need to handle multiple
types of iteration.  As a preparation, add @flags to
css_task_iter_start() and implement CSS_TASK_ITER_PROCS.  If the flag
is not specified, the iterator walks all tasks as before.  When the
flag is set, it only walks thread group leaders.

For now, the only user of the flag is the cgroup v2 "cgroup.procs"
file, which no longer needs to skip non-leader tasks in
cgroup_procs_next().  Note that cgroup v1 "cgroup.procs" can't use the
group leader walk, as v1 "cgroup.procs" doesn't mean "list all thread
group leaders in the cgroup" but "list all thread group IDs with any
threads in the cgroup".

While at it, update cgroup_procs_show() to use task_pid_vnr() instead
of task_tgid_vnr().  As the iteration guarantees that the function
only sees group leaders, this doesn't change the output and will allow
sharing the function for thread iteration.
Signed-off-by: Tejun Heo <tj@kernel.org> --- include/linux/cgroup.h | 6 +++++- kernel/cgroup/cgroup-v1.c | 6 +++--- kernel/cgroup/cgroup.c | 24 ++++++++++++++---------- kernel/cgroup/cpuset.c | 6 +++--- kernel/cgroup/freezer.c | 6 +++--- mm/memcontrol.c | 2 +- net/core/netclassid_cgroup.c | 2 +- 7 files changed, 30 insertions(+), 22 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index f6b43fb..20f6057 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -36,9 +36,13 @@ #define CGROUP_WEIGHT_DFL 100 #define CGROUP_WEIGHT_MAX 10000 +/* walk only threadgroup leaders */ +#define CSS_TASK_ITER_PROCS (1U << 0) + /* a css_task_iter should be treated as an opaque object */ struct css_task_iter { struct cgroup_subsys *ss; + unsigned int flags; struct list_head *cset_pos; struct list_head *cset_head; @@ -129,7 +133,7 @@ struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset, struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset, struct cgroup_subsys_state **dst_cssp); -void css_task_iter_start(struct cgroup_subsys_state *css, +void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags, struct css_task_iter *it); struct task_struct *css_task_iter_next(struct css_task_iter *it); void css_task_iter_end(struct css_task_iter *it); diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 69146b3..de4939d 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -118,7 +118,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) * ->can_attach() fails. 
*/ do { - css_task_iter_start(&from->self, &it); + css_task_iter_start(&from->self, 0, &it); task = css_task_iter_next(&it); if (task) get_task_struct(task); @@ -374,7 +374,7 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type, if (!array) return -ENOMEM; /* now, populate the array */ - css_task_iter_start(&cgrp->self, &it); + css_task_iter_start(&cgrp->self, 0, &it); while ((tsk = css_task_iter_next(&it))) { if (unlikely(n == length)) break; @@ -750,7 +750,7 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry) } rcu_read_unlock(); - css_task_iter_start(&cgrp->self, &it); + css_task_iter_start(&cgrp->self, 0, &it); while ((tsk = css_task_iter_next(&it))) { switch (tsk->state) { case TASK_RUNNING: diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 2c37b436..a9df46c 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3588,6 +3588,7 @@ static void css_task_iter_advance(struct css_task_iter *it) lockdep_assert_held(&css_set_lock); WARN_ON_ONCE(!l); +repeat: /* * Advance iterator to find next entry. cset->tasks is consumed * first and then ->mg_tasks. After ->mg_tasks, we move onto the @@ -3602,11 +3603,18 @@ static void css_task_iter_advance(struct css_task_iter *it) css_task_iter_advance_css_set(it); else it->task_pos = l; + + /* if PROCS, skip over tasks which aren't group leaders */ + if ((it->flags & CSS_TASK_ITER_PROCS) && it->task_pos && + !thread_group_leader(list_entry(it->task_pos, struct task_struct, + cg_list))) + goto repeat; } /** * css_task_iter_start - initiate task iteration * @css: the css to walk tasks of + * @flags: CSS_TASK_ITER_* flags * @it: the task iterator to use * * Initiate iteration through the tasks of @css. The caller can call @@ -3614,7 +3622,7 @@ static void css_task_iter_advance(struct css_task_iter *it) * returns NULL. On completion of iteration, css_task_iter_end() must be * called. 
*/ -void css_task_iter_start(struct cgroup_subsys_state *css, +void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags, struct css_task_iter *it) { /* no one should try to iterate before mounting cgroups */ @@ -3625,6 +3633,7 @@ void css_task_iter_start(struct cgroup_subsys_state *css, spin_lock_irq(&css_set_lock); it->ss = css->ss; + it->flags = flags; if (it->ss) it->cset_pos = &css->cgroup->e_csets[css->ss->id]; @@ -3698,13 +3707,8 @@ static void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos) { struct kernfs_open_file *of = s->private; struct css_task_iter *it = of->priv; - struct task_struct *task; - - do { - task = css_task_iter_next(it); - } while (task && !thread_group_leader(task)); - return task; + return css_task_iter_next(it); } static void *cgroup_procs_start(struct seq_file *s, loff_t *pos) @@ -3725,10 +3729,10 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos) if (!it) return ERR_PTR(-ENOMEM); of->priv = it; - css_task_iter_start(&cgrp->self, it); + css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it); } else if (!(*pos)++) { css_task_iter_end(it); - css_task_iter_start(&cgrp->self, it); + css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it); } return cgroup_procs_next(s, NULL, NULL); @@ -3736,7 +3740,7 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos) static int cgroup_procs_show(struct seq_file *s, void *v) { - seq_printf(s, "%d\n", task_tgid_vnr(v)); + seq_printf(s, "%d\n", task_pid_vnr(v)); return 0; } diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index b308888..35e7d43 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -859,7 +859,7 @@ static void update_tasks_cpumask(struct cpuset *cs) struct css_task_iter it; struct task_struct *task; - css_task_iter_start(&cs->css, &it); + css_task_iter_start(&cs->css, 0, &it); while ((task = css_task_iter_next(&it))) set_cpus_allowed_ptr(task, cs->effective_cpus); css_task_iter_end(&it); @@ -1104,7 
+1104,7 @@ static void update_tasks_nodemask(struct cpuset *cs) * It's ok if we rebind the same mm twice; mpol_rebind_mm() * is idempotent. Also migrate pages in each mm to new nodes. */ - css_task_iter_start(&cs->css, &it); + css_task_iter_start(&cs->css, 0, &it); while ((task = css_task_iter_next(&it))) { struct mm_struct *mm; bool migrate; @@ -1297,7 +1297,7 @@ static void update_tasks_flags(struct cpuset *cs) struct css_task_iter it; struct task_struct *task; - css_task_iter_start(&cs->css, &it); + css_task_iter_start(&cs->css, 0, &it); while ((task = css_task_iter_next(&it))) cpuset_update_task_spread_flag(cs, task); css_task_iter_end(&it); diff --git a/kernel/cgroup/freezer.c b/kernel/cgroup/freezer.c index 1b72d56..0823679 100644 --- a/kernel/cgroup/freezer.c +++ b/kernel/cgroup/freezer.c @@ -268,7 +268,7 @@ static void update_if_frozen(struct cgroup_subsys_state *css) rcu_read_unlock(); /* are all tasks frozen? */ - css_task_iter_start(css, &it); + css_task_iter_start(css, 0, &it); while ((task = css_task_iter_next(&it))) { if (freezing(task)) { @@ -320,7 +320,7 @@ static void freeze_cgroup(struct freezer *freezer) struct css_task_iter it; struct task_struct *task; - css_task_iter_start(&freezer->css, &it); + css_task_iter_start(&freezer->css, 0, &it); while ((task = css_task_iter_next(&it))) freeze_task(task); css_task_iter_end(&it); @@ -331,7 +331,7 @@ static void unfreeze_cgroup(struct freezer *freezer) struct css_task_iter it; struct task_struct *task; - css_task_iter_start(&freezer->css, &it); + css_task_iter_start(&freezer->css, 0, &it); while ((task = css_task_iter_next(&it))) __thaw_task(task); css_task_iter_end(&it); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4048897..9250128 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -945,7 +945,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg, struct css_task_iter it; struct task_struct *task; - css_task_iter_start(&iter->css, &it); + css_task_iter_start(&iter->css, 0, &it); while 
(!ret && (task = css_task_iter_next(&it))) ret = fn(task, arg); css_task_iter_end(&it); diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c index 11fce17..e24ea8c 100644 --- a/net/core/netclassid_cgroup.c +++ b/net/core/netclassid_cgroup.c @@ -74,7 +74,7 @@ static void update_classid(struct cgroup_subsys_state *css, void *v) struct css_task_iter it; struct task_struct *p; - css_task_iter_start(css, &it); + css_task_iter_start(css, 0, &it); while ((p = css_task_iter_next(&it))) { task_lock(p); iterate_fd(p->files, 0, update_classid_sock, v); -- 2.9.3 ^ permalink raw reply related [flat|nested] 32+ messages in thread
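The effect of CSS_TASK_ITER_PROCS can be modelled in a few lines: a task is a thread group leader exactly when its PID equals its TGID, and the flagged iterator simply skips everything else. This is only an illustration of the skip logic added to css_task_iter_advance(), not the kernel's css_set-based iterator; the `Task` tuple and generator are invented for the sketch:

```python
from collections import namedtuple

# pid == tgid identifies a thread group leader.
Task = namedtuple("Task", ["pid", "tgid"])

CSS_TASK_ITER_PROCS = 1 << 0


def css_task_iter(tasks, flags=0):
    """Yield tasks; with CSS_TASK_ITER_PROCS, skip non-leaders, mirroring
    the `goto repeat` skip in css_task_iter_advance()."""
    for task in tasks:
        if (flags & CSS_TASK_ITER_PROCS) and task.pid != task.tgid:
            continue  # not a thread group leader
        yield task
```

With the flag, a process with two threads contributes one entry (its leader); without it, every thread is visited, matching the old behavior.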
* [PATCH 4/5] cgroup: implement CSS_TASK_ITER_THREADED
  2017-02-02 20:06 ` Tejun Heo
From: Tejun Heo @ 2017-02-02 20:06 UTC
To: lizefan, hannes, peterz, mingo, pjt, luto, efault
Cc: cgroups, linux-kernel, kernel-team, lvenanci, Tejun Heo

cgroup v2 is in the process of growing thread granularity support.
Once thread mode is enabled, the root cgroup of the subtree serves as
the proc_cgrp to which the processes of the subtree conceptually
belong and to which domain-level resource consumptions not tied to any
specific task are charged.  In the subtree, threads aren't subject to
process granularity or the no-internal-task constraint and can be
distributed arbitrarily across the subtree.

This patch implements a new task iterator flag CSS_TASK_ITER_THREADED,
which, when used on a proc_cgrp, makes the iteration include the tasks
on all the associated threaded css_sets.  The "cgroup.procs" read path
is updated to use it so that reading the file on a proc_cgrp lists all
processes.  This will also be used by controller implementations which
need to walk processes or tasks at the resource domain level.

Task iteration is implemented nested in css_set iteration.  If
CSS_TASK_ITER_THREADED is specified, after walking the tasks of each
!threaded css_set, all the associated threaded css_sets are visited
before moving on to the next !threaded css_set.
Signed-off-by: Tejun Heo <tj@kernel.org> --- include/linux/cgroup.h | 6 ++++ kernel/cgroup/cgroup.c | 81 +++++++++++++++++++++++++++++++++++++++++--------- 2 files changed, 73 insertions(+), 14 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 20f6057..cd0be87 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -38,6 +38,8 @@ /* walk only threadgroup leaders */ #define CSS_TASK_ITER_PROCS (1U << 0) +/* walk threaded css_sets as part of their proc_csets */ +#define CSS_TASK_ITER_THREADED (1U << 1) /* a css_task_iter should be treated as an opaque object */ struct css_task_iter { @@ -47,11 +49,15 @@ struct css_task_iter { struct list_head *cset_pos; struct list_head *cset_head; + struct list_head *tcset_pos; + struct list_head *tcset_head; + struct list_head *task_pos; struct list_head *tasks_head; struct list_head *mg_tasks_head; struct css_set *cur_cset; + struct css_set *cur_pcset; struct task_struct *cur_task; struct list_head iters_node; /* css_set->task_iters */ }; diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 6c5658a..e73b3c3 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3590,27 +3590,36 @@ bool css_has_online_children(struct cgroup_subsys_state *css) return ret; } -/** - * css_task_iter_advance_css_set - advance a task itererator to the next css_set - * @it: the iterator to advance - * - * Advance @it to the next css_set to walk. 
- */ -static void css_task_iter_advance_css_set(struct css_task_iter *it) +static struct css_set *css_task_iter_next_css_set(struct css_task_iter *it) { - struct list_head *l = it->cset_pos; + bool threaded = it->flags & CSS_TASK_ITER_THREADED; + struct list_head *l; struct cgrp_cset_link *link; struct css_set *cset; lockdep_assert_held(&css_set_lock); - /* Advance to the next non-empty css_set */ + /* find the next threaded cset */ + if (it->tcset_pos) { + l = it->tcset_pos->next; + + if (l != it->tcset_head) { + it->tcset_pos = l; + return container_of(l, struct css_set, + threaded_csets_node); + } + + it->tcset_pos = NULL; + } + + /* find the next cset */ + l = it->cset_pos; + do { l = l->next; if (l == it->cset_head) { it->cset_pos = NULL; - it->task_pos = NULL; - return; + return NULL; } if (it->ss) { @@ -3620,10 +3629,50 @@ static void css_task_iter_advance_css_set(struct css_task_iter *it) link = list_entry(l, struct cgrp_cset_link, cset_link); cset = link->cset; } - } while (!css_set_populated(cset)); + + /* + * For threaded iterations, threaded csets are walked + * together with their proc_csets. Skip here. + */ + } while (threaded && css_set_threaded(cset)); it->cset_pos = l; + /* initialize threaded cset walking */ + if (threaded) { + if (it->cur_pcset) + put_css_set_locked(it->cur_pcset); + it->cur_pcset = cset; + get_css_set(cset); + + it->tcset_head = &cset->threaded_csets; + it->tcset_pos = &cset->threaded_csets; + } + + return cset; +} + +/** + * css_task_iter_advance_css_set - advance a task itererator to the next css_set + * @it: the iterator to advance + * + * Advance @it to the next css_set to walk. 
+ */ +static void css_task_iter_advance_css_set(struct css_task_iter *it) +{ + struct css_set *cset; + + lockdep_assert_held(&css_set_lock); + + /* Advance to the next non-empty css_set */ + do { + cset = css_task_iter_next_css_set(it); + if (!cset) { + it->task_pos = NULL; + return; + } + } while (!css_set_populated(cset)); + if (!list_empty(&cset->tasks)) it->task_pos = cset->tasks.next; else @@ -3766,6 +3815,9 @@ void css_task_iter_end(struct css_task_iter *it) spin_unlock_irq(&css_set_lock); } + if (it->cur_pcset) + put_css_set(it->cur_pcset); + if (it->cur_task) put_task_struct(it->cur_task); } @@ -3791,6 +3843,7 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos) struct kernfs_open_file *of = s->private; struct cgroup *cgrp = seq_css(s)->cgroup; struct css_task_iter *it = of->priv; + unsigned iter_flags = CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED; /* * When a seq_file is seeked, it's always traversed sequentially @@ -3804,10 +3857,10 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos) if (!it) return ERR_PTR(-ENOMEM); of->priv = it; - css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it); + css_task_iter_start(&cgrp->self, iter_flags, it); } else if (!(*pos)++) { css_task_iter_end(it); - css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it); + css_task_iter_start(&cgrp->self, iter_flags, it); } return cgroup_procs_next(s, NULL, NULL); -- 2.9.3 ^ permalink raw reply related [flat|nested] 32+ messages in thread
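The nesting described above — each !threaded css_set followed by its associated threaded css_sets — can be sketched as below. This is an invented, simplified Python model: the real css_task_iter_next_css_set() also distinguishes the ->tasks and ->mg_tasks lists, skips unpopulated csets, and pins the current proc cset under css_set_lock, none of which is modelled here:

```python
CSS_TASK_ITER_THREADED = 1 << 1


class CssSet:
    """Toy css_set: a task list plus the threaded csets hanging off it."""
    def __init__(self, tasks, threaded_csets=()):
        self.tasks = list(tasks)
        self.threaded_csets = list(threaded_csets)


def iter_tasks(domain_csets, flags=0):
    """Walk domain-level css_sets; with CSS_TASK_ITER_THREADED, visit each
    domain cset's threaded css_sets right after it, as the nested
    iteration in css_task_iter_next_css_set() does."""
    for cset in domain_csets:
        yield from cset.tasks
        if flags & CSS_TASK_ITER_THREADED:
            for tcset in cset.threaded_csets:
                yield from tcset.tasks
```

Reading "cgroup.procs" on a proc_cgrp thus sees tasks from the threaded subtree grouped with their domain cset, rather than skipping them entirely.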
* [PATCH 3/5] cgroup: introduce cgroup->proc_cgrp and threaded css_set handling
  2017-02-02 20:06 ` Tejun Heo
From: Tejun Heo @ 2017-02-02 20:06 UTC
To: lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, peterz-wEGCiKHe2LqWVfeAwA7xHQ, mingo-H+wXaHxf7aLQT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ, efault-Mmb7MZpHnFY
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA, Tejun Heo

cgroup v2 is in the process of growing thread granularity support.
Once thread mode is enabled, the root cgroup of the subtree serves as
the proc_cgrp to which the processes of the subtree conceptually
belong and to which domain-level resource consumptions not tied to any
specific task are charged.  In the subtree, threads aren't subject to
process granularity or the no-internal-task constraint and can be
distributed arbitrarily across the subtree.

This patch introduces cgroup->proc_cgrp along with threaded css_set
handling.

* cgroup->proc_cgrp is NULL if !threaded.  If threaded, it points to
  the proc_cgrp (root of the threaded subtree).

* css_set->proc_cset points to self if !threaded.  If threaded, it
  points to the css_set which belongs to the cgrp->proc_cgrp.  The
  proc_cgrp serves as the resource domain and needs the matching csses
  readily available.  The proc_cset holds those csses and makes them
  easily accessible.

* All threaded csets are linked on their proc_csets to enable
  iteration of all threaded tasks.

This patch adds the above but doesn't actually use them yet.  The
following patches will build on top.
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> --- include/linux/cgroup-defs.h | 22 ++++++++++++ kernel/cgroup/cgroup.c | 87 +++++++++++++++++++++++++++++++++++++++++---- 2 files changed, 103 insertions(+), 6 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 3c02404..22e894c 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -158,6 +158,15 @@ struct css_set { /* reference count */ atomic_t refcount; + /* + * If not threaded, the following points to self. If threaded, to + * a cset which belongs to the top cgroup of the threaded subtree. + * The proc_cset provides access to the process cgroup and its + * csses to which domain level resource consumptions should be + * charged. + */ + struct css_set __rcu *proc_cset; + /* the default cgroup associated with this css_set */ struct cgroup *dfl_cgrp; @@ -183,6 +192,10 @@ struct css_set { */ struct list_head e_cset_node[CGROUP_SUBSYS_COUNT]; + /* all csets whose ->proc_cset points to this cset */ + struct list_head threaded_csets; + struct list_head threaded_csets_node; + /* * List running through all cgroup groups in the same hash * slot. Protected by css_set_lock @@ -289,6 +302,15 @@ struct cgroup { struct list_head e_csets[CGROUP_SUBSYS_COUNT]; /* + * If !threaded, NULL. If threaded, it points to the top cgroup of + * the threaded subtree, on which it points to self. Threaded + * subtree is exempt from process granularity and no-internal-task + * constraint. Domain level resource consumptions which aren't + * tied to a specific task should be charged to the proc_cgrp. + */ + struct cgroup *proc_cgrp; + + /* * list of pidlists, up to two for each namespace (one for procs, one * for tasks); created on demand. 
*/ diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index a9df46c..6c5658a 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -554,9 +554,11 @@ EXPORT_SYMBOL_GPL(of_css); */ struct css_set init_css_set = { .refcount = ATOMIC_INIT(1), + .proc_cset = RCU_INITIALIZER(&init_css_set), .tasks = LIST_HEAD_INIT(init_css_set.tasks), .mg_tasks = LIST_HEAD_INIT(init_css_set.mg_tasks), .task_iters = LIST_HEAD_INIT(init_css_set.task_iters), + .threaded_csets = LIST_HEAD_INIT(init_css_set.threaded_csets), .cgrp_links = LIST_HEAD_INIT(init_css_set.cgrp_links), .mg_preload_node = LIST_HEAD_INIT(init_css_set.mg_preload_node), .mg_node = LIST_HEAD_INIT(init_css_set.mg_node), @@ -575,6 +577,17 @@ static bool css_set_populated(struct css_set *cset) return !list_empty(&cset->tasks) || !list_empty(&cset->mg_tasks); } +static struct css_set *proc_css_set(struct css_set *cset) +{ + return rcu_dereference_protected(cset->proc_cset, + lockdep_is_held(&css_set_lock)); +} + +static bool css_set_threaded(struct css_set *cset) +{ + return proc_css_set(cset) != cset; +} + /** * cgroup_update_populated - updated populated count of a cgroup * @cgrp: the target cgroup @@ -726,6 +739,8 @@ void put_css_set_locked(struct css_set *cset) if (!atomic_dec_and_test(&cset->refcount)) return; + WARN_ON_ONCE(!list_empty(&cset->threaded_csets)); + /* This css_set is dead. 
unlink it and release cgroup and css refs */ for_each_subsys(ss, ssid) { list_del(&cset->e_cset_node[ssid]); @@ -742,6 +757,11 @@ void put_css_set_locked(struct css_set *cset) kfree(link); } + if (css_set_threaded(cset)) { + list_del(&cset->threaded_csets_node); + put_css_set_locked(proc_css_set(cset)); + } + kfree_rcu(cset, rcu_head); } @@ -751,6 +771,7 @@ void put_css_set_locked(struct css_set *cset) * @old_cset: existing css_set for a task * @new_cgrp: cgroup that's being entered by the task * @template: desired set of css pointers in css_set (pre-calculated) + * @for_pcset: the comparison is for a new proc_cset * * Returns true if "cset" matches "old_cset" except for the hierarchy * which "new_cgrp" belongs to, for which it should match "new_cgrp". @@ -758,7 +779,8 @@ void put_css_set_locked(struct css_set *cset) static bool compare_css_sets(struct css_set *cset, struct css_set *old_cset, struct cgroup *new_cgrp, - struct cgroup_subsys_state *template[]) + struct cgroup_subsys_state *template[], + bool for_pcset) { struct list_head *l1, *l2; @@ -770,6 +792,32 @@ static bool compare_css_sets(struct css_set *cset, if (memcmp(template, cset->subsys, sizeof(cset->subsys))) return false; + if (for_pcset) { + /* + * We're looking for the pcset of @old_cset. As @old_cset + * doesn't have its ->proc_cset pointer set yet (we're + * trying to find out what to set it to), @old_cset itself + * may seem like a match here. Explicitly exclude identity + * matching. + */ + if (css_set_threaded(cset) || cset == old_cset) + return false; + } else { + bool is_threaded; + + /* + * Otherwise, @cset's threaded state should match the + * default cgroup's. + */ + if (cgroup_on_dfl(new_cgrp)) + is_threaded = new_cgrp->proc_cgrp; + else + is_threaded = old_cset->dfl_cgrp->proc_cgrp; + + if (is_threaded != css_set_threaded(cset)) + return false; + } + /* * Compare cgroup pointers in order to distinguish between * different cgroups in hierarchies. 
As different cgroups may @@ -822,10 +870,12 @@ static bool compare_css_sets(struct css_set *cset, * @old_cset: the css_set that we're using before the cgroup transition * @cgrp: the cgroup that we're moving into * @template: out param for the new set of csses, should be clear on entry + * @for_pcset: looking for a new proc_cset */ static struct css_set *find_existing_css_set(struct css_set *old_cset, struct cgroup *cgrp, - struct cgroup_subsys_state *template[]) + struct cgroup_subsys_state *template[], + bool for_pcset) { struct cgroup_root *root = cgrp->root; struct cgroup_subsys *ss; @@ -856,7 +906,7 @@ static struct css_set *find_existing_css_set(struct css_set *old_cset, key = css_set_hash(template); hash_for_each_possible(css_set_table, cset, hlist, key) { - if (!compare_css_sets(cset, old_cset, cgrp, template)) + if (!compare_css_sets(cset, old_cset, cgrp, template, for_pcset)) continue; /* This css_set matches what we need */ @@ -938,12 +988,13 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset, * find_css_set - return a new css_set with one cgroup updated * @old_cset: the baseline css_set * @cgrp: the cgroup to be updated + * @for_pcset: looking for a new proc_cset * * Return a new css_set that's equivalent to @old_cset, but with @cgrp * substituted into the appropriate hierarchy. 
*/ static struct css_set *find_css_set(struct css_set *old_cset, - struct cgroup *cgrp) + struct cgroup *cgrp, bool for_pcset) { struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT] = { }; struct css_set *cset; @@ -958,7 +1009,7 @@ static struct css_set *find_css_set(struct css_set *old_cset, /* First see if we already have a cgroup group that matches * the desired set */ spin_lock_irq(&css_set_lock); - cset = find_existing_css_set(old_cset, cgrp, template); + cset = find_existing_css_set(old_cset, cgrp, template, for_pcset); if (cset) get_css_set(cset); spin_unlock_irq(&css_set_lock); @@ -977,9 +1028,11 @@ static struct css_set *find_css_set(struct css_set *old_cset, } atomic_set(&cset->refcount, 1); + RCU_INIT_POINTER(cset->proc_cset, cset); INIT_LIST_HEAD(&cset->tasks); INIT_LIST_HEAD(&cset->mg_tasks); INIT_LIST_HEAD(&cset->task_iters); + INIT_LIST_HEAD(&cset->threaded_csets); INIT_HLIST_NODE(&cset->hlist); INIT_LIST_HEAD(&cset->cgrp_links); INIT_LIST_HEAD(&cset->mg_preload_node); @@ -1017,6 +1070,28 @@ static struct css_set *find_css_set(struct css_set *old_cset, spin_unlock_irq(&css_set_lock); + /* + * If @cset should be threaded, look up the matching proc_cset and + * link them up. We first fully initialize @cset then look for the + * pcset. It's simpler this way and safe as @cset is guaranteed to + * stay empty until we return. 
+ */ + if (!for_pcset && cset->dfl_cgrp->proc_cgrp) { + struct css_set *pcset; + + pcset = find_css_set(cset, cset->dfl_cgrp->proc_cgrp, true); + if (!pcset) { + put_css_set(cset); + return NULL; + } + + spin_lock_irq(&css_set_lock); + rcu_assign_pointer(cset->proc_cset, pcset); + list_add_tail(&cset->threaded_csets_node, + &pcset->threaded_csets); + spin_unlock_irq(&css_set_lock); + } + return cset; } @@ -2238,7 +2313,7 @@ int cgroup_migrate_prepare_dst(struct cgroup_mgctx *mgctx) struct cgroup_subsys *ss; int ssid; - dst_cset = find_css_set(src_cset, src_cset->mg_dst_cgrp); + dst_cset = find_css_set(src_cset, src_cset->mg_dst_cgrp, false); if (!dst_cset) goto err; -- 2.9.3 ^ permalink raw reply related [flat|nested] 32+ messages in thread
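The pointer invariants this patch introduces — a css_set's ->proc_cset points to itself when not threaded, otherwise to the matching css_set of the thread root, and every threaded cset is linked on its proc_cset's list — can be sketched as a small userspace model. The following Python is a hedged illustration with invented names, not kernel code; it only mirrors the pointer discipline, not refcounting or locking:

```python
# Toy model of the css_set <-> proc_cset invariant from this patch.
# Not kernel code: names and structure are invented for illustration.

class CssSet:
    def __init__(self, cgroup):
        self.dfl_cgrp = cgroup      # default-hierarchy cgroup this cset maps to
        self.proc_cset = self       # mirrors ->proc_cset: self when !threaded
        self.threaded_csets = []    # csets whose proc_cset points here

    def is_threaded(self):
        # mirrors css_set_threaded(): threaded iff proc_cset != self
        return self.proc_cset is not self

def link_threaded(cset, proc_cset):
    # mirrors the rcu_assign_pointer() + list_add_tail() pairing
    cset.proc_cset = proc_cset
    proc_cset.threaded_csets.append(cset)

# A thread root's cset and one cset inside its threaded subtree:
root_cset = CssSet("thread-root")
child_cset = CssSet("thread-root/worker")
link_threaded(child_cset, root_cset)

assert not root_cset.is_threaded()
assert child_cset.is_threaded()
# domain-level charges from the threaded cset resolve to the thread root
assert child_cset.proc_cset.dfl_cgrp == "thread-root"
```

The point of keeping the proc_cset pointer (rather than walking up cgroups at charge time) is that the thread root's csses stay one dereference away, which is what makes domain-level accounting cheap.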
* [PATCH 5/5] cgroup: implement cgroup v2 thread support [not found] ` <20170202200632.13992-1-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2017-02-02 20:06 ` [PATCH 3/5] cgroup: introduce cgroup->proc_cgrp and threaded css_set handling Tejun Heo @ 2017-02-02 20:06 ` Tejun Heo 2017-02-02 21:32 ` [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode Andy Lutomirski 2017-02-03 20:20 ` Peter Zijlstra 3 siblings, 0 replies; 32+ messages in thread From: Tejun Heo @ 2017-02-02 20:06 UTC (permalink / raw) To: lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, peterz-wEGCiKHe2LqWVfeAwA7xHQ, mingo-H+wXaHxf7aLQT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ, efault-Mmb7MZpHnFY Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA, Tejun Heo This patch implements cgroup v2 thread support. The goal of the thread mode is supporting hierarchical accounting and control at thread granularity while staying inside the resource domain model which allows coordination across different resource controllers and handling of anonymous resource consumptions. Once thread mode is enabled on a cgroup, the threads of the processes which are in its subtree can be placed inside the subtree without being restricted by process granularity or no-internal-process constraint. Note that the threads aren't allowed to escape to a different threaded subtree. To be used inside a threaded subtree, a controller should explicitly support threaded mode and be able to handle internal competition in the way which is appropriate for the resource. The root of a threaded subtree, where thread mode is enabled in the first place, is called the thread root and serves as the resource domain for the whole subtree. This is the last cgroup where non-threaded controllers are operational and where all the domain-level resource consumptions in the subtree are accounted. 
This allows threaded controllers to operate at thread granularity when requested while staying inside the scope of system-level resource distribution. Internally, in a threaded subtree, each css_set has its ->proc_cset pointing to a matching css_set which belongs to the thread root. This ensures that thread root level cgroup_subsys_state for all threaded controllers are readily accessible for domain-level operations. This patch enables threaded mode for the pids and perf_events controllers. Neither has to worry about domain-level resource consumptions and it's enough to simply set the flag. For more details on the interface and behavior of the thread mode, please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added by this patch. Note that the documentation update is not complete as the rest of the documentation needs to be updated accordingly. Rolling those updates into this patch can be confusing so that will be separate patches. Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> --- Documentation/cgroup-v2.txt | 75 +++++++++++++- include/linux/cgroup-defs.h | 16 +++ kernel/cgroup/cgroup.c | 240 +++++++++++++++++++++++++++++++++++++++++++- kernel/cgroup/pids.c | 1 + kernel/events/core.c | 1 + 5 files changed, 326 insertions(+), 7 deletions(-) diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt index 3b8449f..6c255394 100644 --- a/Documentation/cgroup-v2.txt +++ b/Documentation/cgroup-v2.txt @@ -16,7 +16,9 @@ CONTENTS 1-2. What is cgroup? 2. Basic Operations 2-1. Mounting - 2-2. Organizing Processes + 2-2. Organizing Processes and Threads + 2-2-1. Processes + 2-2-2. Threads 2-3. [Un]populated Notification 2-4. Controlling Controllers 2-4-1. Enabling and Disabling @@ -150,7 +152,9 @@ and experimenting easier, the kernel parameter cgroup_no_v1= allows disabling controllers in v1 and make them always available in v2. -2-2. Organizing Processes +2-2. Organizing Processes and Threads + +2-2-1. 
Processes Initially, only the root cgroup exists to which all processes belong. A child cgroup can be created by creating a sub-directory. @@ -201,6 +205,73 @@ is removed subsequently, " (deleted)" is appended to the path. 0::/test-cgroup/test-cgroup-nested (deleted) +2-2-2. Threads + +cgroup v2 supports thread granularity for a subset of controllers to +support use cases requiring hierarchical resource distribution across +the threads of a group of processes. By default, all threads of a +process belong to the same cgroup, which also serves as the resource +domain to host resource consumptions which are not specific to a +process or thread. The thread mode allows threads to be spread across +a subtree while still maintaining the common resource domain for them. + +Enabling thread mode on a subtree makes it threaded. The root of a +threaded subtree is called thread root and serves as the resource +domain for the entire subtree. In a threaded subtree, threads of a +process can be put in different cgroups and are not subject to the no +internal process constraint - threaded controllers can be enabled on +non-leaf cgroups whether they have threads in them or not. + +To enable the thread mode, the following conditions must be met. + +- The thread root doesn't have any child cgroups. + +- The thread root doesn't have any controllers enabled. + +Thread mode can be enabled by writing "enable" to "cgroup.threads" +file. + + # echo enable > cgroup.threads + +Inside a threaded subtree, "cgroup.threads" can be read and contains +the list of the thread IDs of all threads in the cgroup. Except that +the operations are per-thread instead of per-process, "cgroup.threads" +has the same format and behaves the same way as "cgroup.procs". + +The thread root serves as the resource domain for the whole subtree, +and, while the threads can be scattered across the subtree, all the +processes are considered to be in the thread root. 
"cgroup.procs" in +a thread root contains the PIDs of all processes in the subtree and is +not readable in the subtree proper. However, "cgroup.procs" can be +written to from anywhere in the subtree to migrate all threads of the +matching process to the cgroup. + +Only threaded controllers can be enabled in a threaded subtree. When +a threaded controller is enabled inside a threaded subtree, it only +accounts for and controls resource consumptions associated with the +threads in the cgroup and its descendants. All consumptions which +aren't tied to a specific thread belong to the thread root. + +Because a threaded subtree is exempt from no internal process +constraint, a threaded controller must be able to handle competition +between threads in a non-leaf cgroup and its child cgroups. Each +threaded controller defines how such competitions are handled. + +To disable the thread mode, the following conditions must be met. + +- The cgroup is a thread root. Thread mode can't be disabled + partially in the subtree. + +- The thread root doesn't have any child cgroups. + +- The thread root doesn't have any controllers enabled. + +Thread mode can be disabled by writing "disable" to "cgroup.threads" +file. + + # echo disable > cgroup.threads + + 2-3. [Un]populated Notification Each non-root cgroup has a "cgroup.events" file which contains diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 22e894c..c4fc469 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -226,6 +226,10 @@ struct css_set { struct cgroup *mg_dst_cgrp; struct css_set *mg_dst_cset; + /* used while updating ->proc_cset to enable/disable threaded mode */ + struct list_head pcset_preload_node; + struct css_set *pcset_preload; + /* dead and being drained, ignore for migration */ bool dead; @@ -497,6 +501,18 @@ struct cgroup_subsys { bool implicit_on_dfl:1; /* + * If %true, the controller, supports threaded mode on the default + * hierarchy. 
In a threaded subtree, both process granularity and + * no-internal-process constraint are ignored and threaded + * controllers should be able to handle that. + * + * Note that as an implicit controller is automatically enabled on + * all cgroups on the default hierarchy, it should also be + * threaded. implicit && !threaded is not supported. + */ + bool threaded:1; + + /* * If %false, this subsystem is properly hierarchical - * configuration, resource accounting and restriction on a parent * cgroup cover those of its children. If %true, hierarchy support diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index e73b3c3..4e5b3b2 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -161,6 +161,9 @@ static u16 cgrp_dfl_inhibit_ss_mask; /* some controllers are implicitly enabled on the default hierarchy */ static u16 cgrp_dfl_implicit_ss_mask; +/* some controllers can be threaded on the default hierarchy */ +static u16 cgrp_dfl_threaded_ss_mask; + /* The list of hierarchy roots */ LIST_HEAD(cgroup_roots); static int cgroup_root_count; @@ -2909,11 +2912,18 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of, goto out_unlock; } + /* can't enable !threaded controllers on a threaded cgroup */ + if (cgrp->proc_cgrp && (enable & ~cgrp_dfl_threaded_ss_mask)) { + ret = -EBUSY; + goto out_unlock; + } + /* - * Except for the root, subtree_control must be zero for a cgroup - * with tasks so that child cgroups don't compete against tasks. + * Except for root and threaded cgroups, subtree_control must be + * zero for a cgroup with tasks so that child cgroups don't compete + * against tasks. 
*/ - if (enable && cgroup_parent(cgrp)) { + if (enable && cgroup_parent(cgrp) && !cgrp->proc_cgrp) { struct cgrp_cset_link *link; /* @@ -2954,6 +2964,124 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of, return ret ?: nbytes; } +static int cgroup_enable_threaded(struct cgroup *cgrp) +{ + LIST_HEAD(csets); + struct cgrp_cset_link *link; + struct css_set *cset, *cset_next; + int ret; + + lockdep_assert_held(&cgroup_mutex); + + /* noop if already threaded */ + if (cgrp->proc_cgrp) + return 0; + + /* allow only if there are neither children nor enabled controllers */ + if (css_has_online_children(&cgrp->self) || cgrp->subtree_control) + return -EBUSY; + + /* find all csets which need ->proc_cset updated */ + spin_lock_irq(&css_set_lock); + list_for_each_entry(link, &cgrp->cset_links, cset_link) { + cset = link->cset; + if (css_set_populated(cset)) { + WARN_ON_ONCE(css_set_threaded(cset)); + WARN_ON_ONCE(cset->pcset_preload); + + list_add_tail(&cset->pcset_preload_node, &csets); + get_css_set(cset); + } + } + spin_unlock_irq(&css_set_lock); + + /* find the proc_csets to associate */ + list_for_each_entry(cset, &csets, pcset_preload_node) { + struct css_set *pcset = find_css_set(cset, cgrp, true); + + WARN_ON_ONCE(cset == pcset); + if (!pcset) { + ret = -ENOMEM; + goto err_put_csets; + } + cset->pcset_preload = pcset; + } + + /* install ->proc_cset */ + spin_lock_irq(&css_set_lock); + list_for_each_entry_safe(cset, cset_next, &csets, pcset_preload_node) { + rcu_assign_pointer(cset->proc_cset, cset->pcset_preload); + list_add_tail(&cset->threaded_csets_node, + &cset->pcset_preload->threaded_csets); + + cset->pcset_preload = NULL; + list_del(&cset->pcset_preload_node); + put_css_set_locked(cset); + } + spin_unlock_irq(&css_set_lock); + + /* mark it threaded */ + cgrp->proc_cgrp = cgrp; + + return 0; + +err_put_csets: + spin_lock_irq(&css_set_lock); + list_for_each_entry_safe(cset, cset_next, &csets, pcset_preload_node) { + if (cset->pcset_preload) 
{ + put_css_set_locked(cset->pcset_preload); + cset->pcset_preload = NULL; + } + list_del(&cset->pcset_preload_node); + put_css_set_locked(cset); + } + spin_unlock_irq(&css_set_lock); + return ret; +} + +static int cgroup_disable_threaded(struct cgroup *cgrp) +{ + struct cgrp_cset_link *link; + + lockdep_assert_held(&cgroup_mutex); + + /* noop if already !threaded */ + if (!cgrp->proc_cgrp) + return 0; + + /* partial disable isn't supported */ + if (cgrp->proc_cgrp != cgrp) + return -EBUSY; + + /* allow only if there are neither children nor enabled controllers */ + if (css_has_online_children(&cgrp->self) || cgrp->subtree_control) + return -EBUSY; + + /* walk all csets and reset ->proc_cset */ + spin_lock_irq(&css_set_lock); + list_for_each_entry(link, &cgrp->cset_links, cset_link) { + struct css_set *cset = link->cset; + + if (css_set_threaded(cset)) { + struct css_set *pcset = proc_css_set(cset); + + WARN_ON_ONCE(pcset->dfl_cgrp != cgrp); + rcu_assign_pointer(cset->proc_cset, cset); + list_del(&cset->threaded_csets_node); + + /* + * @pcset is never @cset and safe to put during + * iteration. 
+ */ + put_css_set_locked(pcset); + } + } + cgrp->proc_cgrp = NULL; + spin_unlock_irq(&css_set_lock); + + return 0; +} + static int cgroup_events_show(struct seq_file *seq, void *v) { seq_printf(seq, "populated %d\n", @@ -3838,12 +3966,12 @@ static void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos) return css_task_iter_next(it); } -static void *cgroup_procs_start(struct seq_file *s, loff_t *pos) +static void *__cgroup_procs_start(struct seq_file *s, loff_t *pos, + unsigned int iter_flags) { struct kernfs_open_file *of = s->private; struct cgroup *cgrp = seq_css(s)->cgroup; struct css_task_iter *it = of->priv; - unsigned iter_flags = CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED; /* * When a seq_file is seeked, it's always traversed sequentially @@ -3866,6 +3994,23 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos) return cgroup_procs_next(s, NULL, NULL); } +static void *cgroup_procs_start(struct seq_file *s, loff_t *pos) +{ + struct cgroup *cgrp = seq_css(s)->cgroup; + + /* + * All processes of a threaded subtree are in the top threaded + * cgroup. Only threads can be distributed across the subtree. + * Reject reads on cgroup.procs in the subtree proper. They're + * always empty anyway. 
+ */ + if (cgrp->proc_cgrp && cgrp->proc_cgrp != cgrp) + return ERR_PTR(-EINVAL); + + return __cgroup_procs_start(s, pos, CSS_TASK_ITER_PROCS | + CSS_TASK_ITER_THREADED); +} + static int cgroup_procs_show(struct seq_file *s, void *v) { seq_printf(s, "%d\n", task_pid_vnr(v)); @@ -3920,6 +4065,76 @@ static ssize_t cgroup_procs_write(struct kernfs_open_file *of, return ret ?: nbytes; } +static void *cgroup_threads_start(struct seq_file *s, loff_t *pos) +{ + struct cgroup *cgrp = seq_css(s)->cgroup; + + if (!cgrp->proc_cgrp) + return ERR_PTR(-EINVAL); + + return __cgroup_procs_start(s, pos, 0); +} + +static ssize_t cgroup_threads_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct super_block *sb = of->file->f_path.dentry->d_sb; + struct cgroup *cgrp, *common_ancestor; + struct task_struct *task; + ssize_t ret; + + buf = strstrip(buf); + + cgrp = cgroup_kn_lock_live(of->kn, false); + if (!cgrp) + return -ENODEV; + + /* cgroup.procs determines delegation, require permission on it too */ + ret = cgroup_procs_write_permission(cgrp, sb); + if (ret) + goto out_unlock; + + /* enable or disable? 
*/ + if (!strcmp(buf, "enable")) { + ret = cgroup_enable_threaded(cgrp); + goto out_unlock; + } else if (!strcmp(buf, "disable")) { + ret = cgroup_disable_threaded(cgrp); + goto out_unlock; + } + + /* thread migration */ + ret = -EINVAL; + if (!cgrp->proc_cgrp) + goto out_unlock; + + task = cgroup_procs_write_start(buf, false); + ret = PTR_ERR_OR_ZERO(task); + if (ret) + goto out_unlock; + + common_ancestor = cgroup_migrate_common_ancestor(task, cgrp); + + /* can't migrate across disjoint threaded subtrees */ + ret = -EACCES; + if (common_ancestor->proc_cgrp != cgrp->proc_cgrp) + goto out_finish; + + /* and follow the cgroup.procs delegation rule */ + ret = cgroup_procs_write_permission(common_ancestor, sb); + if (ret) + goto out_finish; + + ret = cgroup_attach_task(cgrp, task, false); + +out_finish: + cgroup_procs_write_finish(); +out_unlock: + cgroup_kn_unlock(of->kn); + + return ret ?: nbytes; +} + /* cgroup core interface files for the default hierarchy */ static struct cftype cgroup_base_files[] = { { @@ -3932,6 +4147,14 @@ static struct cftype cgroup_base_files[] = { .write = cgroup_procs_write, }, { + .name = "cgroup.threads", + .release = cgroup_procs_release, + .seq_start = cgroup_threads_start, + .seq_next = cgroup_procs_next, + .seq_show = cgroup_procs_show, + .write = cgroup_threads_write, + }, + { .name = "cgroup.controllers", .seq_show = cgroup_controllers_show, }, @@ -4245,6 +4468,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent) cgrp->self.parent = &parent->self; cgrp->root = root; cgrp->level = level; + cgrp->proc_cgrp = parent->proc_cgrp; for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp)) cgrp->ancestor_ids[tcgrp->level] = tcgrp->id; @@ -4687,11 +4911,17 @@ int __init cgroup_init(void) cgrp_dfl_root.subsys_mask |= 1 << ss->id; + /* implicit controllers must be threaded too */ + WARN_ON(ss->implicit_on_dfl && !ss->threaded); + if (ss->implicit_on_dfl) cgrp_dfl_implicit_ss_mask |= 1 << ss->id; else if (!ss->dfl_cftypes) 
cgrp_dfl_inhibit_ss_mask |= 1 << ss->id; + if (ss->threaded) + cgrp_dfl_threaded_ss_mask |= 1 << ss->id; + if (ss->dfl_cftypes == ss->legacy_cftypes) { WARN_ON(cgroup_add_cftypes(ss, ss->dfl_cftypes)); } else { diff --git a/kernel/cgroup/pids.c b/kernel/cgroup/pids.c index 2bd6737..49b5424 100644 --- a/kernel/cgroup/pids.c +++ b/kernel/cgroup/pids.c @@ -345,4 +345,5 @@ struct cgroup_subsys pids_cgrp_subsys = { .free = pids_free, .legacy_cftypes = pids_files, .dfl_cftypes = pids_files, + .threaded = true, }; diff --git a/kernel/events/core.c b/kernel/events/core.c index d72128d..508c0d9 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -10798,5 +10798,6 @@ struct cgroup_subsys perf_event_cgrp_subsys = { * controller is not mounted on a legacy hierarchy. */ .implicit_on_dfl = true, + .threaded = true, }; #endif /* CONFIG_CGROUP_PERF */ -- 2.9.3 ^ permalink raw reply related [flat|nested] 32+ messages in thread
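The acceptance rules enforced by cgroup_enable_threaded(), cgroup_disable_threaded() and cgroup_threads_write() in the patch above can be summarized in a short userspace model. This Python sketch is illustrative only: it uses invented names, collapses the kernel's common-ancestor check into a direct thread-root comparison, and the errno strings merely mirror the kernel's return values:

```python
# Rough model of the cgroup.threads rules in this patch. Purely
# illustrative; no css_set handling, locking, or refcounting.

class Cgroup:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.subtree_control = set()   # controllers enabled for the subtree
        # mirrors cgroup_create(): proc_cgrp is inherited from the parent
        self.proc_cgrp = parent.proc_cgrp if parent else None
        if parent:
            parent.children.append(self)

def enable_threaded(cgrp):
    if cgrp.proc_cgrp:                 # noop if already threaded
        return 0
    if cgrp.children or cgrp.subtree_control:
        return "EBUSY"                 # must have no children/controllers
    cgrp.proc_cgrp = cgrp              # becomes its own thread root
    return 0

def disable_threaded(cgrp):
    if not cgrp.proc_cgrp:             # noop if already !threaded
        return 0
    if cgrp.proc_cgrp is not cgrp:     # partial disable isn't supported
        return "EBUSY"
    if cgrp.children or cgrp.subtree_control:
        return "EBUSY"
    cgrp.proc_cgrp = None
    return 0

def migrate_thread(src, dst):
    # simplified: threads may only move within one threaded subtree
    if dst.proc_cgrp is None:
        return "EINVAL"
    if src.proc_cgrp is not dst.proc_cgrp:
        return "EACCES"
    return 0

root = Cgroup()
a = Cgroup(root)
assert enable_threaded(a) == 0
a1, a2 = Cgroup(a), Cgroup(a)
assert a1.proc_cgrp is a                 # thread root inherited at creation
assert migrate_thread(a1, a2) == 0       # movement inside the subtree is fine
assert disable_threaded(a) == "EBUSY"    # thread root now has children
b = Cgroup(root)
assert migrate_thread(b, a1) == "EACCES" # can't enter a disjoint subtree
```

The asymmetry worth noticing is that enable and disable share the "no children, no controllers" precondition, so the threaded flag can only be toggled on an empty leaf; everything else in the subtree inherits it at creation time.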
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode [not found] ` <20170202200632.13992-1-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2017-02-02 20:06 ` [PATCH 3/5] cgroup: introduce cgroup->proc_cgrp and threaded css_set handling Tejun Heo 2017-02-02 20:06 ` [PATCH 5/5] cgroup: implement cgroup v2 thread support Tejun Heo @ 2017-02-02 21:32 ` Andy Lutomirski [not found] ` <CALCETrW6Mqj9VLogd0XaLgVAzEqsZ+VnZjN5NROCqr0ssdYaKg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2017-02-03 20:20 ` Peter Zijlstra 3 siblings, 1 reply; 32+ messages in thread From: Andy Lutomirski @ 2017-02-02 21:32 UTC (permalink / raw) To: Tejun Heo, Linux API Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, Paul Turner, Mike Galbraith, open list:CONTROL GROUP (CGROUP), linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA On Thu, Feb 2, 2017 at 12:06 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: > Hello, > > This patchset implements cgroup v2 thread mode. It is largely based > on the discussions that we had at the plumbers last year. Here's the > rough outline. I like this, but I have some design questions: > > * Thread mode is explicitly enabled on a cgroup by writing "enable" > into "cgroup.threads" file. The cgroup shouldn't have any child > cgroups or enabled controllers. Why do you need to manually turn it on? That is, couldn't it be automatic based on what controllers are enabled? > > * Once enabled, arbitrary sub-hierarchy can be created and threads can > be put anywhere in the subtree by writing TIDs into "cgroup.threads" > file. Process granularity and no-internal-process constraint don't > apply in a threaded subtree. I'm a bit worried that this conflates two different things. There's thread support, i.e. allowing individual threads to be placed into cgroups. There's also more flexible sub-hierarchy support, i.e. relaxing no-internal-process constraints. 
For the "cpuacct" controller, for example, both of these make sense. But what if someone writes a controller (directio, for example, just to make something up) for which thread granularity makes sense but relaxing no-internal-process constraints does not? --Andy ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode [not found] ` <CALCETrW6Mqj9VLogd0XaLgVAzEqsZ+VnZjN5NROCqr0ssdYaKg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2017-02-02 21:52 ` Tejun Heo [not found] ` <20170202215229.GA27231-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2017-02-02 21:52 UTC (permalink / raw) To: Andy Lutomirski Cc: Linux API, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, Paul Turner, Mike Galbraith, open list:CONTROL GROUP (CGROUP), linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA Hello, On Thu, Feb 02, 2017 at 01:32:19PM -0800, Andy Lutomirski wrote: > > * Thread mode is explicitly enabled on a cgroup by writing "enable" > > into "cgroup.threads" file. The cgroup shouldn't have any child > > cgroups or enabled controllers. > > Why do you need to manually turn it on? That is, couldn't it be > automatic based on what controllers are enabled? This came up already but it's not like some controllers are inherently thread-only. Consider CPU: all in-context CPU cycle consumptions are tied to the thread; however, we also want to be able to account for CPU cycles consumed for, for example, memory reclaim or encryption during writeback. I played with an interface where thread mode is enabled automatically up to the common ancestor of the threads but not only was it complicated to implement but also the eventual behavior was very confusing as the resource domain can change without any active actions from the user. I think keeping things simple is the right choice here. > > * Once enabled, arbitrary sub-hierarchy can be created and threads can > > be put anywhere in the subtree by writing TIDs into "cgroup.threads" > > file. Process granularity and no-internal-process constraint don't > > apply in a threaded subtree. > > I'm a bit worried that this conflates two different things. There's > thread support, i.e. 
allowing individual threads to be placed into > cgroups. There's also more flexible sub-hierarchy support, i.e. > relaxing no-internal-process constraints. For the "cpuacct" > controller, for example, both of these make sense. But what if > someone writes a controller (directio, for example, just to make > something up) for which thread granularity makes sense but relaxing > no-internal-process constraints does not? If a controller can't possibly define how internal competition should be handled, which is unlikely - the problem is being consistent and sensible, defining something isn't difficult - the controller can simply error out those cases either on configuration or migration. Again, I'm very doubtful we'll need that but if we ever need that denying specific configurations is the best we can do anyway. Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode [not found] ` <20170202215229.GA27231-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> @ 2017-02-03 21:10 ` Andy Lutomirski 2017-02-03 21:56 ` Tejun Heo 2017-02-06 9:50 ` Peter Zijlstra 1 sibling, 1 reply; 32+ messages in thread From: Andy Lutomirski @ 2017-02-03 21:10 UTC (permalink / raw) To: Tejun Heo Cc: Linux API, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, Paul Turner, Mike Galbraith, open list:CONTROL GROUP (CGROUP), linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA On Thu, Feb 2, 2017 at 1:52 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: > Hello, > > On Thu, Feb 02, 2017 at 01:32:19PM -0800, Andy Lutomirski wrote: >> > * Thread mode is explicitly enabled on a cgroup by writing "enable" >> > into "cgroup.threads" file. The cgroup shouldn't have any child >> > cgroups or enabled controllers. >> >> Why do you need to manually turn it on? That is, couldn't it be >> automatic based on what controllers are enabled? > > This came up already but it's not like some controllers are inherently > thread-only. Consider CPU, all in-context CPU cycle consumptions are > tied to the thread; however, we also want to be able to account for > CPU cycles consumed for, for example, memory reclaim or encryption > during writeback. > Is this flexible enough for the real-world usecases? For my use case (if I actually ported over to this), it would mean that I'd have to enable thread mode on the root. What about letting a given process (actually mm, perhaps) live in a cgroup but let the threads be in different cgroups without any particular constraints. Then process-wide stuff would be accounted to the cgroup that owns the process. > >> > * Once enabled, arbitrary sub-hierarchy can be created and threads can >> > be put anywhere in the subtree by writing TIDs into "cgroup.threads" >> > file. 
Process granularity and no-internal-process constraint don't >> > apply in a threaded subtree. >> >> I'm a bit worried that this conflates two different things. There's >> thread support, i.e. allowing individual threads to be placed into >> cgroups. There's also more flexible sub-hierarchy support, i.e. >> relaxing no-internal-process constraints. For the "cpuacct" >> controller, for example, both of these make sense. But what if >> someone writes a controller (directio, for example, just to make >> something up) for which thread granularity makes sense but relaxing >> no-internal-process constraints does not? > > If a controller can't possibly define how internal competition should > be handled, which is unlikely - the problem is being consistent and > sensible, defining something isn't difficult - the controller can > simply error out those cases either on configuration or migration. > Again, I'm very doubtful we'll need that but if we ever need that > denying specific configurations is the best we can do anyway. > I'm not sure I follow. I'm suggesting something quite simple: let controllers that don't need the no-internal-process constraints set a flag so that the constraints don't apply if all enabled controllers have the flag set. --Andy ^ permalink raw reply [flat|nested] 32+ messages in thread
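Andy's scheme above — waive the no-internal-process constraint only when every enabled controller opts out of it — can be condensed into a toy model. The controller names and flag values below are purely illustrative assumptions, not kernel code or real controller properties:

```python
# Hypothetical model of the flag-based relaxation Andy suggests: each
# controller declares whether it tolerates internal processes, and the
# no-internal-process constraint is waived only when ALL enabled
# controllers have the flag set. Names and values are illustrative.

CONTROLLER_FLAGS = {
    "cpu": True,       # assumed: task-granular, tolerates internal tasks
    "cpuacct": True,
    "memory": False,   # assumed: needs clean resource domains
    "io": False,
}

def internal_processes_allowed(enabled_controllers):
    """Constraint is relaxed iff every enabled controller opts out of it."""
    return all(CONTROLLER_FLAGS[c] for c in enabled_controllers)

print(internal_processes_allowed({"cpu", "cpuacct"}))  # True
print(internal_processes_allowed({"cpu", "memory"}))   # False
```

Tejun's counter-argument in the reply below is that committing to such a flag would permanently preclude adding domain-level operations to a flagged controller.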
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode 2017-02-03 21:10 ` Andy Lutomirski @ 2017-02-03 21:56 ` Tejun Heo 0 siblings, 0 replies; 32+ messages in thread From: Tejun Heo @ 2017-02-03 21:56 UTC (permalink / raw) To: Andy Lutomirski Cc: Linux API, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, Paul Turner, Mike Galbraith, open list:CONTROL GROUP (CGROUP), linux-kernel@vger.kernel.org, kernel-team, lvenanci Hello, On Fri, Feb 03, 2017 at 01:10:21PM -0800, Andy Lutomirski wrote: > Is this flexible enough for the real-world usecases? For my use case I can't think of a reason why it won't be. Capability-wise, nothing is being lost by the interface. > (if I actually ported over to this), it would mean that I'd have to > enable thread mode on the root. What about letting a given process > (actually mm, perhaps) live in a cgroup but let the threads be in > different cgroups without any particular constraints. Then > process-wide stuff would be accounted to the cgroup that owns the > process. I don't know. So, then, we basically have completely separate trees for resource domains and threads. That is exactly what mounting the cpu controller separately does. It doesn't make sense to put them on the same hierarchy. Why? > > If a controller can't possibly define how internal competition should > > be handled, which is unlikely - the problem is being consistent and > > sensible, defining something isn't difficult - the controller can > > simply error out those cases either on configuration or migration. > > Again, I'm very doubtful we'll need that but if we ever need that > > denying specific configurations is the best we can do anyway. > > I'm not sure I follow. > > I'm suggesting something quite simple: let controllers that don't need > the no-internal-process constraints set a flag so that the constraints > don't apply if all enabled controllers have the flag set.
Firstly, I think it's better to have the rules as simple and consistent as possible as long as we don't sacrifice critical capabilities. Secondly, all the major resource controllers including cpu would eventually need a resource domain, so there is no real practical upside to doing so. Thirdly, if we commit to something like "controller X is not subject to no-internal-process constraint", that commitment would prevent us from ever adding domain level operations to that controller without breaking the userland-visible interface. All controllers do and have to support process level operations. Some controllers can do thread level operations. Keeping the latter opt-in doesn't block us from adding thread mode later on. Doing it the other way around blocks us from adding domain level operations later on. Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode [not found] ` <20170202215229.GA27231-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 2017-02-03 21:10 ` Andy Lutomirski @ 2017-02-06 9:50 ` Peter Zijlstra 1 sibling, 0 replies; 32+ messages in thread From: Peter Zijlstra @ 2017-02-06 9:50 UTC (permalink / raw) To: Tejun Heo Cc: Andy Lutomirski, Linux API, Li Zefan, Johannes Weiner, Ingo Molnar, Paul Turner, Mike Galbraith, open list:CONTROL GROUP (CGROUP), linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA On Thu, Feb 02, 2017 at 04:52:29PM -0500, Tejun Heo wrote: > > Why do you need to manually turn it on? That is, couldn't it be > > automatic based on what controllers are enabled? > > This came up already but it's not like some controllers are inherently > thread-only. Consider CPU, all in-context CPU cycle consumptions are > tied to the thread; however, we also want to be able to account for > CPU cycles consumed for, for example, memory reclaim or encryption > during writeback. > > I played with an interface where thread mode is enabled automatically > upto the common ancestor of the threads but not only was it > complicated to implement but also the eventual behavior was very > confusing as the resource domain can change without any active actions > from the user. I think keeping things simple is the right choice > here. Note that the explicit marking of the resource domains gets you exactly that. But let me reply in the other subthread. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode [not found] ` <20170202200632.13992-1-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ` (2 preceding siblings ...) 2017-02-02 21:32 ` [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode Andy Lutomirski @ 2017-02-03 20:20 ` Peter Zijlstra [not found] ` <20170203202048.GD6515-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 3 siblings, 1 reply; 32+ messages in thread From: Peter Zijlstra @ 2017-02-03 20:20 UTC (permalink / raw) To: Tejun Heo Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, mingo-H+wXaHxf7aLQT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ, efault-Mmb7MZpHnFY, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA On Thu, Feb 02, 2017 at 03:06:27PM -0500, Tejun Heo wrote: > Hello, > > This patchset implements cgroup v2 thread mode. It is largely based > on the discussions that we had at the plumbers last year. Here's the > rough outline. > > * Thread mode is explicitly enabled on a cgroup by writing "enable" > into "cgroup.threads" file. The cgroup shouldn't have any child > cgroups or enabled controllers. > > * Once enabled, arbitrary sub-hierarchy can be created and threads can > be put anywhere in the subtree by writing TIDs into "cgroup.threads" > file. Process granularity and no-internal-process constraint don't > apply in a threaded subtree. > > * To be used in a threaded subtree, controllers should explicitly > declare thread mode support and should be able to handle internal > competition in some way. > > * The root of a threaded subtree serves as the resource domain for the > whole subtree. This is where all the controllers are guaranteed to > have a common ground and resource consumptions in the threaded > subtree which aren't tied to a specific thread are charged. 
> Non-threaded controllers never see beyond thread root and can assume > that all controllers will follow the same rules upto that point. > > This allows threaded controllers to implement thread granular resource > control without getting in the way of system level resource > partitioning. I'm still confused. So suppose I mark my root cgroup as such, because I run RT tasks there. Does this then mean I cannot later start a VM and have that containered properly? That is, I think threaded controllers very much get in the way of system level resource partitioning this way. So my proposal was to do the inverse of what you propose here. Instead of marking special 'thread' subtrees, explicitly mark resource domains in the tree. So always allow arbitrary hierarchies and allow threads to be assigned to cgroups, as long as they're all in the same resource domain. Controllers that do not support things fall back to mapping everything to the nearest resource domain. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode [not found] ` <20170203202048.GD6515-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2017-02-03 20:59 ` Tejun Heo [not found] ` <20170203205955.GA9886-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2017-02-03 20:59 UTC (permalink / raw) To: Peter Zijlstra Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, mingo-H+wXaHxf7aLQT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ, efault-Mmb7MZpHnFY, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA Hello, Peter. On Fri, Feb 03, 2017 at 09:20:48PM +0100, Peter Zijlstra wrote: > So my proposal was to do the inverse of what you propose here. Instead > of marking special 'thread' subtrees, explicitly mark resource domains > in the tree. > > So always allow arbitrary hierarchies and allow threads to be assigned > to cgroups, as long as they're all in the same resource domain. > > Controllers that do not support things, fallback to mapping everything > to the nearest resource domain. That sounds counter-intuitive as all controllers can do resource domains and only a subset of them can do thread mode. Also, thread subtrees are necessarily a sub-hierarchy of a resource domain. Also, expanding resource domains from the root after the trees are populated would make the behaviors surprising as the resource domains that these subtrees belong to would change dynamically. In practice, how would this work? To enable memcg, the user has to first create the subtree and then explicitly have to make that a domain and then enable memcg? If so, that would be a completely unnecessary deviation from the current behavior while not achieving any more functionalities, right? It's just flipping the default value the other way around and the default wouldn't be supported by many of the controllers. 
I can't see how that is a better option. Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode [not found] ` <20170203205955.GA9886-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> @ 2017-02-06 12:49 ` Peter Zijlstra [not found] ` <20170206124943.GJ6515-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 0 siblings, 1 reply; 32+ messages in thread From: Peter Zijlstra @ 2017-02-06 12:49 UTC (permalink / raw) To: Tejun Heo Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, mingo-H+wXaHxf7aLQT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ, efault-Mmb7MZpHnFY, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA On Fri, Feb 03, 2017 at 03:59:55PM -0500, Tejun Heo wrote: > Hello, Peter. > > On Fri, Feb 03, 2017 at 09:20:48PM +0100, Peter Zijlstra wrote: > > So my proposal was to do the inverse of what you propose here. Instead > > of marking special 'thread' subtrees, explicitly mark resource domains > > in the tree. > > > > So always allow arbitrary hierarchies and allow threads to be assigned > > to cgroups, as long as they're all in the same resource domain. > > > > Controllers that do not support things, fallback to mapping everything > > to the nearest resource domain. > > That sounds counter-intuitive as all controllers can do resource > domains and only a subset of them can do thread mode. But to me the resource domain is your primary new construct; so it makes more sense to explicitly mark that. (FWIW note that your whole initial cgroup-v2 thing is counter intuitive to me, as someone who has only ever dealt with thread capable controllers.) > Also, thread subtrees are necessarily a sub-hierarchy of a resource domain. Sure, don't see how that is relevant though. Or rather, I don't see it being an argument one way or the other. 
> Also, > expanding resource domains from the root after the trees are populated > would make the behaviors surprising as the resource domains that these > subtrees belong to would change dynamically. Uh what? I cannot parse that. My question was, if you have root.threads=1, can you then still create (sub) resource domains? Can I create a child cgroup and clear "threads" again? (I'm assuming "threads" is inherited when creating new groups). Now, _if_ we can do the above, then "threads" is not sufficient to uniquely identify resource domains, which I think was your point in the other email. Which argues against the interface. Because a group can be a resource domain _and_ have threads sub-trees. OTOH, if you can _not_ do this, then this proposal is insufficient/inadequate. > In practice, how would this work? To enable memcg, the user has to > first create the subtree and then explicitly have to make that a > domain and then enable memcg? If so, that would be a completely > unnecessary deviation from the current behavior while not achieving > any more functionalities, right? It's just flipping the default value > the other way around and the default wouldn't be supported by many of > the controllers. I can't see how that is a better option. OK, so I'm entirely unaware of this enabling of controllers. What's that about? I thought the whole point of cgroup-v2 was to have all controllers enabled over the entire tree, this is not so? In any case, yes, more or less like that, except of course, not at all :-) If we make this flag inherited (which I think it should be), you don't need to do anything different from today, because the root group must be a resource domain, any new sub-group will automagically also be. Only once you clear the flag somewhere do you get 'new' behaviour. Note that the only extra constraint is that all threads of a process must stay within the same resource domain, anything else goes. 
Task based controllers operate on the actual cgroup, resource domain controllers always map it back to the resource group. Finding a random task's resource domain is trivial; simply walk up the hierarchy until you find a group with the flag set.

So, just to recap, my proposal is as follows:

 1) each cgroup will have a new flag, indicating if this is a resource
    domain.

    a) this flag will be inherited; when creating a new cgroup, the
       state of the flag will be set to the value of the parent cgroup.

    b) the root cgroup is a resource domain per definition, will have
       it set (cannot be cleared).

 2) all tasks of a process shall be part of the same resource domain

 3) controllers come in two types:

    a) task based controllers; these use the direct cgroup the task is
       assigned to.

    b) resource controllers; these use the effective resource group of
       the task, which is the first parent group with our new flag set.

With an optional:

 1c) this flag can only be changed on empty groups

to ease implementation.

From these rules it follows that:

 - 1a and 1b together ensure no change in behaviour per default for
   cgroup-v2.

 - 2 and 3a together ensure resource groups automagically work for task
   based controllers (under the assumption that the controller is
   strictly hierarchical).

For example, cpuacct does the accounting strictly hierarchical, it adds the cpu usage to each parent group. Therefore the total cpu usage accounted to the resource group is the same as if all tasks were part of that group. ^ permalink raw reply [flat|nested] 32+ messages in thread
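Peter's recap can be replayed as a small simulation. This is a sketch of the proposed semantics under his stated rules only — the class and group names are hypothetical and this is not an implementation:

```python
# Toy model of the recap above: the "resource domain" flag is inherited
# at creation (rule 1a), the root has it set (rule 1b), and a resource
# controller maps a task's cgroup back to the nearest flagged ancestor
# (rule 3b). Task based controllers would use the cgroup directly (3a).

class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        # 1a: inherit the flag from the parent; 1b: the root is a domain.
        self.is_domain = parent.is_domain if parent else True

def resource_domain(cgroup):
    """3b: first self-or-ancestor cgroup with the domain flag set."""
    while not cgroup.is_domain:
        cgroup = cgroup.parent
    return cgroup

root = Cgroup("root")                # resource domain per definition
pool = Cgroup("pool", root)
pool.is_domain = False               # cleared: 'new' behaviour starts here
t1 = Cgroup("t1", pool)              # inherits the cleared flag (1a)
vm = Cgroup("vm", root)              # inherits the set flag: its own domain

assert resource_domain(t1) is root   # resource controllers charge root
assert resource_domain(vm) is vm     # vm is contained in its own domain
```

This also illustrates the cpuacct point: if accounting is strictly hierarchical, charges made at `t1` roll up to `root` either way.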
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode [not found] ` <20170206124943.GJ6515-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2017-02-08 23:08 ` Tejun Heo [not found] ` <20170208230819.GD25826-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2017-02-08 23:08 UTC (permalink / raw) To: Peter Zijlstra Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, mingo-H+wXaHxf7aLQT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ, efault-Mmb7MZpHnFY, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA, Linus Torvalds, Andrew Morton (cc'ing Linus and Andrew for visibility) Hello, Peter. On Mon, Feb 06, 2017 at 01:49:43PM +0100, Peter Zijlstra wrote: > But to me the resource domain is your primary new construct; so it makes > more sense to explicitly mark that. Whether it's new or not isn't the point. Resource domains weren't added arbitrarily. We were missing critical resource accounting and control capabilities because cgroup v1's abstraction wasn't strong enough to cover some of the major resource consumers and how different resources can interact with each other. Resource domains were added to address that. Given that cgroup's primary goal is providing resource accounting and control, it doesn't make sense to make this optional. > (FWIW note that your whole initial cgroup-v2 thing is counter intuitive > to me, as someone who has only ever dealt with thread capable > controllers.) That is completely fine but you can't direct the overall design while claiming to be not using, disinterested and unfamiliar with the subject at hand. I do understand that you have certain use cases that you think should be covered. Let's focus on that. > My question was, if you have root.threads=1, can you then still create > (sub) resource domains? 
Can I create a child cgroup and clear "threads" > again? > > (I'm assuming "threads" is inherited when creating new groups). > > Now, _if_ we can do the above, then "threads" is not sufficient to > uniquely identify resource domains, which I think was your point in the > other email. Which argues against the interface. Because a group can be > a resource domain _and_ have threads sub-trees. > > OTOH, if you can _not_ do this, then this proposal is > insufficient/inadequate. No, you can't flip it back and I'm not convinced this matters. More on this below. > > In practice, how would this work? To enable memcg, the user has to > > first create the subtree and then explicitly have to make that a > > domain and then enable memcg? If so, that would be a completely > > unnecessary deviation from the current behavior while not achieving > > any more functionalities, right? It's just flipping the default value > > the other way around and the default wouldn't be supported by many of > > the controllers. I can't see how that is a better option. > > OK, so I'm entirely unaware of this enabling of controllers. What's that > about? I thought the whole point of cgroup-v2 was to have all > controllers enabled over the entire tree, this is not so? This is one of the most basic aspects of cgroup v2. In short, while the controllers share the hierarchy, each doesn't have to be enabled all the way down to the leaf. Different controllers can see upto different subsets of the hierarchy spreading out from the root. > In any case, yes, more or less like that, except of course, not at all > :-) If we make this flag inherited (which I think it should be), you > don't need to do anything different from today, because the root group > must be a resource domain, any new sub-group will automagically also be. > > Only once you clear the flag somewhere do you get 'new' behaviour. 
Note > that the only extra constraint is that all threads of a process must > stay within the same resource domain, anything else goes. > > Task based controllers operate on the actual cgroup, resource domain > controllers always map it back to the resource group. Finding a random > task's resource domain is trivial; simply walk up the hierarchy until > you find a group with the flag set. > > So, just to recap, my proposal is as follows: > > 1) each cgroup will have a new flag, indicating if this is a resource > domain. > > a) this flag will be inherited; when creating a new cgroup, the > state of the flag will be set to the value of the parent cgroup. > > b) the root cgroup is a resource domain per definition, will > have it set (cannot be cleared). > > 2) all tasks of a process shall be part of the same resource domain > > 3) controllers come in two types: > > a) task based controllers; these use the direct cgroup the task > is assigned to. > > b) resource controllers; these use the effective resource group > of the task, which is the first parent group with our new > flag set. > > > With an optional: > > 1c) this flag can only be changed on empty groups > > to ease implementation. > > From these rules it follows that: > > - 1a and 1b together ensure no change in behaviour per default > for cgroup-v2. > > - 2 and 3a together ensure resource groups automagically work for task > based controllers (under the assumption that the controller is > strictly hierarchical). > > For example, cpuacct does the accounting strictly hierarchical, it adds > the cpu usage to each parent group. Therefore the total cpu usage > accounted to the resource group is the same as if all tasks were part of > that group. So, what you're proposing isn't that different from what the posted patches implement except that what's implemented doesn't allow flipping a part of a threaded subtree back to domain mode. 
Your proposal is more complicated while seemingly not adding much to what can be achieved. The original proposal is very simple. It allows declaring a subtree to be threaded (or task based) and that's it. A threaded subtree can't have resource domains under it. The only addition that your proposal has is the ability to mark portions of such a subtree to be domains again. This would require making domain controllers and thread controllers see different hierarchies, which brings in something completely new to the basic hierarchy. Different controllers seeing differing levels of the same hierarchy is part of the basic behaviors and making subtrees threaded is a straight-forward extension of that - threaded controllers just see further into the hierarchy. Adding threaded sub-sections in the middle is more complex and frankly confusing. Let's say we can make that work but what are the use cases which would require such a setup where we have to alternate between thread and domain modes throughout the resource hierarchy? This will be a considerable departure and added complexity from the existing behaviors and code. We gotta be achieving something significant if we're doing that. Why would we want this? And here's another aspect. The currently proposed interface doesn't preclude adding the behavior you're describing in the future. Once thread mode is enabled on a subtree, it isn't allowed to be disabled in its proper subtree; however, if there actually are use cases which require flipping it back, we can later implement the behavior and lift that restriction. I think it makes sense to start with a simple model. Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
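The patchset's rules as Tejun describes them in this thread — "enable" is rejected unless the cgroup is empty, the whole subtree below becomes threaded, and the mode can't be cleared inside a threaded subtree — can also be modeled as a sketch. This is a hypothetical illustration of the described behavior, not the kernel code in the patches:

```python
# Toy model of the posted proposal's constraints as stated in the thread:
# enabling requires no children and no enabled controllers, and once a
# subtree is threaded it cannot be un-threaded further down.

class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.controllers = set()
        self.threaded = parent.threaded if parent else False
        if parent:
            parent.children.append(self)

    def enable_thread_mode(self):
        # "The cgroup shouldn't have any child cgroups or enabled
        # controllers" when writing "enable" to cgroup.threads.
        if self.children or self.controllers:
            raise ValueError("cgroup.threads: cgroup must be empty")
        self.threaded = True

    def disable_thread_mode(self):
        # Thread mode can't be disabled in the proper subtree of a
        # threaded cgroup.
        if self.parent and self.parent.threaded:
            raise ValueError("cannot un-thread inside a threaded subtree")
        self.threaded = False

root = Cgroup("root")
tgrp = Cgroup("tgrp", root)
tgrp.enable_thread_mode()        # allowed: tgrp is empty at this point
sub = Cgroup("sub", tgrp)        # inherits threaded mode
try:
    sub.disable_thread_mode()    # rejected: the subtree stays threaded
except ValueError as e:
    print(e)
```

Compared with the previous model, the difference under dispute is exactly this last rejection: flipping a group inside the threaded subtree back to a resource domain is disallowed here.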
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode [not found] ` <20170208230819.GD25826-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> @ 2017-02-09 10:29 ` Peter Zijlstra [not found] ` <20170209102909.GC6515-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 0 siblings, 1 reply; 32+ messages in thread From: Peter Zijlstra @ 2017-02-09 10:29 UTC (permalink / raw) To: Tejun Heo Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, mingo-H+wXaHxf7aLQT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ, efault-Mmb7MZpHnFY, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA, Linus Torvalds, Andrew Morton On Wed, Feb 08, 2017 at 06:08:19PM -0500, Tejun Heo wrote: > (cc'ing Linus and Andrew for visibility) > > Hello, Peter. > > On Mon, Feb 06, 2017 at 01:49:43PM +0100, Peter Zijlstra wrote: > > But to me the resource domain is your primary new construct; so it makes > > more sense to explicitly mark that. > > Whether it's new or not isn't the point. Resource domains weren't > added arbitrarily. We were missing critical resource accounting and > control capabilities because cgroup v1's abstraction wasn't strong > enough to cover some of the major resource consumers and how different > resources can interact with each other. > > Resource domains were added to address that. Given that cgroup's > primary goal is providing resource accounting and control, it doesn't > make sense to make this optional. I'm not sure what you're saying here. Are you agreeing that 'resource domains' are the primary new construct or not? > > My question was, if you have root.threads=1, can you then still create > > (sub) resource domains? Can I create a child cgroup and clear "threads" > > again? > > > > (I'm assuming "threads" is inherited when creating new groups). 
> > > > Now, _if_ we can do the above, then "threads" is not sufficient to > > uniquely identify resource domains, which I think was your point in the > > other email. Which argues against the interface. Because a group can be > > a resource domain _and_ have threads sub-trees. > > > > OTOH, if you can _not_ do this, then this proposal is > > insufficient/inadequate. > > No, you can't flip it back and I'm not convinced this matters. More > on this below. Then I shall preliminary NAK your proposal right here, but I shall continue to read on. > > So, just to recap, my proposal is as follows: > > > > 1) each cgroup will have a new flag, indicating if this is a resource > > domain. > > > > a) this flag will be inherited; when creating a new cgroup, the > > state of the flag will be set to the value of the parent cgroup. > > > > b) the root cgroup is a resource domain per definition, will > > have it set (cannot be cleared). > > > > 2) all tasks of a process shall be part of the same resource domain > > > > 3) controllers come in two types: > > > > a) task based controllers; these use the direct cgroup the task > > is assigned to. > > > > b) resource controllers; these use the effective resource group > > of the task, which is the first parent group with our new > > flag set. > > > > > > With an optional: > > > > 1c) this flag can only be changed on empty groups > > > > to ease implementation. > > > > From these rules it follows that: > > > > - 1a and 1b together ensure no change in behaviour per default > > for cgroup-v2. > > > > - 2 and 3a together ensure resource groups automagically work for task > > based controllers (under the assumption that the controller is > > strictly hierarchical). > > > > For example, cpuacct does the accounting strictly hierarchical, it adds > > the cpu usage to each parent group. Therefore the total cpu usage > > accounted to the resource group is the same as if all tasks were part of > > that group. 
> > So, what you're proposing isn't that different from what the posted > patches implement except that what's implemented doesn't allow > flipping a part of a threaded subtree back to domain mode. > > Your proposal is more complicated while seemingly not adding much to > what can be achieved. The orignal proposal is very simple. It allows > declaring a subtree to be threaded (or task based) and that's it. A > threaded subtree can't have resource domains under it. > > The only addition that your proposal has is the ability to mark > portions of such subtree to be domains again. This would require > making domain controllers and thread controllers to see different > hierarchies, Uhm, no. They would see the exact same hierarchy, seeing how there is only one tree. They would have different view of it maybe, but I don't see how that matters, nor do you explain. > which brings in something completely new to the basic hierarchy. I'm failing to see what. > Different controllers seeing differing levels of the same hierarchy is > part of the basic behaviors I have no idea what you mean there. > and making subtrees threaded is a > straight-forward extension of that - threaded controllers just see > further into the hierarchy. Adding threaded sub-sections in the > middle is more complex and frankly confusing. I disagree, as I completely fail to see any confusion. The rules are simple and straight forward. I also don't see why you would want to impose this artificial restriction. It doesn't get you anything. Why are you so keen on designs with these artificial limits on? > Let's say we can make that work but what are the use cases which would > require such setup where we have to alternate between thread and > domain modes through out the resource hierarchy? I would very much like to run my main workload in the root resource group. This means I need to have threaded subtrees at the root level. 
Your design would then mean I cannot run a VM (which uses all these cgroups muck and needs its own resource domain) for some less critical/isolated workload. Now, you'll argue I should set up a subtree for the main workload; but why would I do that? Why would you force me into making this choice; which has performance penalties associated (because the root resource domain is special cased in a bunch of places; and because the shallower the cgroup tree the less overhead etc.). > This will be a > considerable departure and added complexity from the existing > behaviors and code. We gotta be achieving something significant if > we're doing that. Why would we want this? How is this a departure? I do not understand. Why would we not want to do this? Why would we want to impose artificial limitations? What specifically is hard about what I propose? You have no actual arguments on why what I propose would be hard to implement. As far as I can tell it should be fairly similar in complexity to what you already proposed. > And here's another aspect. The currently proposed interface doesn't > preclude adding the behavior you're describing in the future. Once > thread mode is enabled on a subtree, it isn't allowed to be disabled > in its proper subtree; however, if there actually are use cases which > require flipping it back, we can later implement the behavior and lift > that restriction. I think it makes sense to start with a simple > model. Your choice of flag makes it impossible to tell what is a resource domain and what is not in that situation. Suppose I set the root group threaded and I create subgroups (which will also all have threaded set). Suppose I clear the threaded bit somewhere in the subtree to create a new resource group, but then immediately set the threaded bit again to allow that resource group to have thread subgroups as well. Now the entire hierarchy will have the threaded flag set and it becomes impossible to find the resource domains.
This is all a direct consequence of your flag not denoting the primary construct, e.g. resource domains. IOW; you've completely failed to convince me and my NAK stands. ^ permalink raw reply [flat|nested] 32+ messages in thread
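The ambiguity Peter describes is easy to replay in a toy simulation that tracks only an inherited per-cgroup "threaded" flag (a hypothetical model, not either proposal's implementation):

```python
# Replay of the scenario above: mark the root threaded, clear the flag
# somewhere below to carve out a resource group, then set it again so
# that group can host thread subgroups. Afterwards the flag alone no
# longer identifies the resource domains.

class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.threaded = parent.threaded if parent else False

root = Cgroup("root"); root.threaded = True  # threaded root for RT tasks
a = Cgroup("a", root)                        # threaded (inherited)
b = Cgroup("b", a)
b.threaded = False   # clear: b is intended as a new resource domain...
b.threaded = True    # ...then set again so b can hold thread subgroups
c = Cgroup("c", b)

# Every group now carries threaded=True, so walking the flags cannot
# recover that b was meant to be a resource domain:
assert all(g.threaded for g in (root, a, b, c))
```

This is the concrete sense in which a single "threaded" bit underdetermines the hierarchy once flipping in both directions is permitted; a dedicated "resource domain" bit would not lose that information.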
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode [not found] ` <20170209102909.GC6515-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2017-02-10 15:45 ` Tejun Heo [not found] ` <20170210154508.GA16097-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2017-02-10 15:45 UTC (permalink / raw) To: Peter Zijlstra Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, mingo-H+wXaHxf7aLQT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ, efault-Mmb7MZpHnFY, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA, Linus Torvalds, Andrew Morton

Hello,

On Thu, Feb 09, 2017 at 11:29:09AM +0100, Peter Zijlstra wrote:
> Uhm, no. They would see the exact same hierarchy, seeing how there is
> only one tree. They would have different view of it maybe, but I don't
> see how that matters, nor do you explain.

Sure, the base hierarchy is the same, but different controllers would
need to see different subsets (or views) of the hierarchy. As I wrote
before, cgroup v2 already does this to a certain extent by controllers
ignoring the hierarchy beyond certain points. You're proposing to add a
new "view" of the hierarchy. I'll explain why it matters below.

> > which brings in something completely new to the basic hierarchy.
>
> I'm failing to see what.
>
> > Different controllers seeing differing levels of the same hierarchy is
> > part of the basic behaviors
>
> I have no idea what you mean there.

It's explained in Documentation/cgroup-v2.txt but, for example, if the
whole hierarchy is

  A - B - C
   \
    D

one controller might only see

  A - B
   \
    D

while another sees the whole thing.

> > and making subtrees threaded is a straight-forward extension of
> > that - threaded controllers just see further into the hierarchy.
> > Adding threaded sub-sections in the middle is more complex and
> > frankly confusing.
> I disagree, as I completely fail to see any confusion. The rules are
> simple and straight forward.
>
> I also don't see why you would want to impose this artificial
> restriction. It doesn't get you anything. Why are you so keen on
> designs with these artificial limits on?

Because I actually understand and use this thing day in and day out?

Let's go back to the no-internal-process constraint. The main reason
behind that is avoiding resource competition between child cgroups and
processes. The reason why we need this is that for some resources the
terminal consumer (be that a process or task or anonymous) and the
resource domain that it belongs to (be that the system itself or a
cgroup) aren't equivalent.

If you make a memcg, put some processes in it and then create some child
cgroups, how the resource should be distributed between those processes
and the child cgroups is not clearly defined and can't be controlled
from userspace. The resource control knobs in a child cgroup govern how
the resource is distributed from the parent. For child processes, we
don't have those knobs.

There are multiple ways to deal with the problem. We can add a separate
set of control knobs to govern resource consumption by internal
processes. This effectively adds an implicit leaf node to each cgroup so
that internal processes or tasks are always in their own leaf resource
domain. This, however, adds a lot of cruft to the interface, the
implementation gets nasty and the presented resource hierarchy can be
misleading to users.

Another option would be just letting each controller do whatever, which
is pretty much what we did in v1. This got really bad because the
behaviors were widely inconsistent across controllers and often
implementation dependent, without any way for the user to configure or
monitor what's going on. Who gets how much becomes a matter of accident
and people optimize for whatever arbitrary behaviors the kernel they're
using happens to show.
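The constraint is visible directly in the mainline cgroup2 interface; a
minimal sketch (the cgroup names are invented, this needs root and a
cgroup2 mount at /sys/fs/cgroup, and the EBUSY is the kernel enforcing
the rule):

```shell
cd /sys/fs/cgroup
mkdir -p parent/child
echo $$ > parent/cgroup.procs                   # parent now hosts a process
echo "+memory" > parent/cgroup.subtree_control  # rejected (EBUSY): the
                                                # internal process would
                                                # compete with child cgroups
echo $$ > parent/child/cgroup.procs             # move the process to a leaf
echo "+memory" > parent/cgroup.subtree_control  # now accepted
```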
The no-internal-process rule establishes that resource domains are
always terminal in the resource graph for a given controller, so that
every competition along the resource hierarchy is always clearly defined
and configurable. Only the terminal resource domains actually host
resource consumptions, and they can behave analogously to a system which
doesn't have any cgroups at all. Establishing resource domains this way
isn't the only approach to solving the problem; however, it is a valid,
simple and effective one.

Now, back to not allowing switching back and forth between resource
domains and thread subtrees. Let's say we allow that and compose a
hierarchy as follows, where A and B are resource domains and the T's are
subtrees of threads.

  A - T1 - B - T2

The resource domain controllers would see the following hierarchy.

  A - B

A will contain the processes from T1, and B those from T2. Both A and B
would have internal consumptions from those processes, and the
no-internal-process constraint and thus the resource domain abstraction
are broken. If we want to support a hierarchy like that, we'll
internally have to do something like

  A - B
   \
    A'

where cgroup A' contains the processes from T1 and B those from T2. Now,
this is exactly the same problem as having internal processes and can be
solved in the same ways. The only realistic way to handle this in a
generic and consistent manner is creating a leaf cgroup to contain the
processes. We sure can try to hide this from userspace and convolute the
interface, but it can be solved *far* more elegantly by simply requiring
thread subtrees to be leaf subtrees.

And here's another point: currently, all controllers are enabled
consecutively from the root. If we have leaf thread subtrees, this still
works fine. Resource domain controllers won't be enabled into thread
subtrees. If we allow switching back and forth, what do we do in the
middle while we're in the thread part?
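The leaf thread subtree being argued for here would be set up roughly as
follows under the interface proposed in this series ("enable" written to
"cgroup.threads", per the cover letter); the interface was still under
discussion at this point, so treat the exact file names and semantics as
tentative, and the cgroup name A and $TID as placeholders:

```shell
cd /sys/fs/cgroup/A            # A: a resource domain about to become a
                               # thread root; it must have no child cgroups
                               # or enabled controllers yet
echo enable > cgroup.threads   # turn on thread mode for the subtree
mkdir -p T/hot T/cold          # arbitrary sub-hierarchy below the thread root
echo "$TID" > T/hot/cgroup.threads  # individual threads (TIDs), not whole
                                    # processes, can be placed in the subtree
# Resource domain controllers stop at A; only controllers declaring
# thread support distribute resources further down into T.
```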
No matter what we do, it's gonna be more confusing and we lose basic
invariants like "a parent always has a superset of the control knobs
that its child has". If we're gonna override the above points, we gotta
gain something really substantial.

> > Let's say we can make that work but what are the use cases which
> > would require such setup where we have to alternate between thread
> > and domain modes throughout the resource hierarchy?
>
> I would very much like to run my main workload in the root resource
> group. This means I need to have threaded subtrees at the root level.

But this is just a whim. It isn't even a functional requirement.

> Your design would then mean I then cannot run a VM (which uses all this
> cgroup muck and needs its own resource domain) for some less
> critical/isolated workload.
>
> Now, you'll argue I should set up a subtree for the main workload; but
> why would I do that? Why would you force me into making this choice,
> which has performance penalties associated (because the root resource
> domain is special-cased in a bunch of places, and because the shallower
> the cgroup tree the less overhead, etc.).

Because what you want costs a lot of complexity and significantly
worsens the interface. "I just want to do it in the root" isn't a valid
justification.

As for the runtime overhead, if you get affected by adding a top-level
cgroup in any measurable way, we need to fix that. That's not a valid
argument for messing up the interface.

> > This will be a considerable departure and added complexity from the
> > existing behaviors and code. We gotta be achieving something
> > significant if we're doing that. Why would we want this?
>
> How is this a departure? I do not understand.
>
> Why would we not want to do this? Why would we want to impose
> artificial limitations? What specifically is hard about what I propose?
>
> You have no actual arguments on why what I propose would be hard to
> implement.
> As far as I can tell it should be fairly similar in complexity to what
> you already proposed.

I hope it's explained now.

> > And here's another aspect. The currently proposed interface doesn't
> > preclude adding the behavior you're describing in the future. Once
> > thread mode is enabled on a subtree, it isn't allowed to be disabled
> > in its proper subtree; however, if there actually are use cases which
> > require flipping it back, we can later implement the behavior and
> > lift that restriction. I think it makes sense to start with a simple
> > model.
>
> Your choice of flag makes it impossible to tell what is a resource
> domain and what is not in that situation.
>
> Suppose I set the root group threaded and I create subgroups (which
> will also all have threaded set). Suppose I clear the threaded bit
> somewhere in the subtree to create a new resource group, but then
> immediately set the threaded bit again to allow that resource group to
> have thread subgroups as well. Now the entire hierarchy will have the
> threaded flag set and it becomes impossible to find the resource
> domains.
>
> This is all a direct consequence of your flag not denoting the primary
> construct, i.e. resource domains.

Even if we allow switching back and forth, we can't make the same cgroup
both resource domain && thread root. Not in a sane way at least.

> IOW, you've completely failed to convince me and my NAK stands.

You have a narrow view from a single component and have been openly
claiming and demonstrating to be not using, disinterested in and
uninformed about cgroup. It's unfortunate and bullshit that the whole
thing is blocked on your NAK, especially when the part you're holding
hostage is something a lot of users want and won't change no matter what
we do about threads.

--
tejun
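The per-controller "views" Tejun describes earlier in this message (the
A - B - C / D example) can be made concrete with the mainline cgroup2
`cgroup.subtree_control` interface; a minimal sketch, with the mount
point, cgroup names and controller choices as illustrative assumptions
(needs root):

```shell
# cgroup2 assumed mounted at /sys/fs/cgroup; A, B, C, D are made-up names.
cd /sys/fs/cgroup
mkdir -p A/B/C A/D
echo "+memory +io" > cgroup.subtree_control   # both controllers govern A
echo "+memory" > A/cgroup.subtree_control     # only memory extends to B, D
# io's view of the hierarchy now stops at A, while memory sees A - B and
# A - D; neither sees C until A/B/cgroup.subtree_control enables it there.
```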
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode [not found] ` <20170210154508.GA16097-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> @ 2017-02-10 17:51 ` Peter Zijlstra [not found] ` <20170210175145.GJ6515-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 0 siblings, 1 reply; 32+ messages in thread From: Peter Zijlstra @ 2017-02-10 17:51 UTC (permalink / raw) To: Tejun Heo Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, mingo-H+wXaHxf7aLQT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ, efault-Mmb7MZpHnFY, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA, Linus Torvalds, Andrew Morton

On Fri, Feb 10, 2017 at 10:45:08AM -0500, Tejun Heo wrote:
> > > and making subtrees threaded is a straight-forward extension of
> > > that - threaded controllers just see further into the hierarchy.
> > > Adding threaded sub-sections in the middle is more complex and
> > > frankly confusing.
> >
> > I disagree, as I completely fail to see any confusion. The rules are
> > simple and straight forward.
> >
> > I also don't see why you would want to impose this artificial
> > restriction. It doesn't get you anything. Why are you so keen on
> > designs with these artificial limits on?
>
> Because I actually understand and use this thing day in and day out?

Just because you don't have the use-cases doesn't mean they're invalid.

Also, the above is effectively "because I say so", which isn't much of
an argument.

> Let's go back to the no-internal-process constraint. The main reason
> behind that is avoiding resource competition between child cgroups and
> processes. The reason why we need this is because for some resources
> the terminal consumer (be that a process or task or anonymous) and the
> resource domain that it belongs to (be that the system itself or a
> cgroup) aren't equivalent.

Sure, we're past that.
This isn't about what memcg can or cannot do. Previous discussions
established that controllers come in two shapes:

 - task based controllers; these are built on per-task properties and
   groups are aggregates over sets of tasks. Since by definition
   inter-task competition is already defined on individual tasks, it's
   fairly trivial to extend the same rules to sets of tasks etc..

   Examples: cpu, cpuset, cpuacct, perf, pid, (freezer)

 - system controllers; instead of building from tasks upwards, they
   split what previously would be machine wide / global state. For these
   there is no natural competition rule vs tasks, and hence your
   no-internal-task rule.

   Examples: memcg, io, hugetlb

(I have no idea where devices, net_cls, net_prio and debug fall in this
classification, nor is that really relevant.)

Now, cgroup-v2 is entirely built around the use-case of
containerization, where you want a single hierarchy describing the
containers and their resources. Now, because of that single hierarchy
and single use-case, you let the system controllers dominate and dictate
the rules.

By doing that you've completely killed off a whole bunch of use-cases
that were possible with pure task controllers. And you seem to have a
very hard time accepting that this is a problem.

Furthermore, the argument that people who need that can continue to use
v1 doesn't work, because v2 and v1 are mutually exclusive and do not
respect the namespace/container invariant. That is, if a controller is
used in v2, a sub-container is forced to also use v2. Therefore it is
important to fix v2 if possible, or do v3 if not, such that all
use-cases can be met in a single setup that respects the container
invariant.

> Now, back to not allowing switching back and forth between resource
> domains and thread subtrees. Let's say we allow that and compose a
> hierarchy as follows. Let's say A and B are resource domains and T's
> are subtrees of threads.
>   A - T1 - B - T2
>
> The resource domain controllers would see the following hierarchy.
>
>   A - B
>
> A will contain processes from T1 and B those from T2. Both A and B
> would have internal consumptions from the processes and the
> no-internal-process constraint and thus the resource domain abstraction
> are broken.
>
> If we want to support a hierarchy like that, we'll internally have to
> do something like
>
>   A - B
>    \
>     A'

Because, and it took me a little while to get here, this:

        A
       / \
     T1   t1
     / \
   t2   B
       / \
     t3   T2
          / \
        t4   t5

ends up being this from a resource domain pov (because the task
controllers are hierarchical, their effective contribution collapses
onto the resource domain):

        A
       / \
      B   t1, t2
      |
  t3, t4, t5

> Now, this is exactly the same problem as having internal processes

Indeed, bugger.

> And here's another point, currently, all controllers are enabled
> consecutively from root. If we have leaf thread subtrees, this still
> works fine. Resource domain controllers won't be enabled into thread
> subtrees. If we allow switching back and forth, what do we do in the
> middle while we're in the thread part?

From what I understand you cannot re-enable a controller once it's been
disabled, right? If you disable it, it's dead for the entire subtree.

I think it would work naturally if you only allow disabling system
controllers at the resource domain levels (thread controllers can be
disabled at any point).

That means that thread nodes will have the exact same system controllers
enabled as their resource domain, which makes perfect sense, since all
tasks in the thread nodes are effectively mapped into the resource
domain for system controllers. That is:

  A (cpu, memory)
  |
  T (memory)

is a perfectly valid setup, since all tasks under T will still use the
memory setup of A.

> No matter what we do, it's gonna be more confusing and we lose basic
> invariants like "parent always has superset of control knobs that its
> child has".

No, exactly that. I don't think I ever proposed something different.
The "resource domain" flag I proposed violates the no-internal-processes thing, but it doesn't violate that rule afaict. > > > Let's say we can make that work but what are the use cases which would > > > require such setup where we have to alternate between thread and > > > domain modes through out the resource hierarchy? > > > > I would very much like to run my main workload in the root resource > > group. This means I need to have threaded subtrees at the root level. > > But this is just a whim. It isn't even a functional requirement. You're always so very quick to dismiss use-cases :/ Or do I read this that performance is not a functional requirement? (don't bite, I know you don't mean that) Sure, I had not seen that I violated the no internal processes rule in the resource domain view; equally you had not made it very clear either. > As for the runtime overhead, if you get affected by adding a top-level > cgroup in any measureable way, we need to fix that. That's not a > valid argument for messing up the interface. I think cgroup tree depth is a more significant issue; because of hierarchy we often do tree walks (uo-to-root or down-to-task). So creating elaborate trees is something I try not to do. > > You have no actual arguments on why what I propose would be hard to > > implement. As far as I can tell it should be fairly similar in > > complexity to what you already proposed. > > I hope it's explained now. I think I got there.. > > > And here's another aspect. The currently proposed interface doesn't > > > preclude adding the behavior you're describing in the future. Once > > > thread mode is enabled on a subtree, it isn't allowed to be disabled > > > in its proper subtree; however, if there actually are use cases which > > > require flipping it back, we can later implemnt the behavior and lift > > > that restriction. I think it makes sense to start with a simple > > > model. 
> > Your choice of flag makes it impossible to tell what is a resource
> > domain and what is not in that situation.
> >
> > Suppose I set the root group threaded and I create subgroups (which
> > will also all have threaded set). Suppose I clear the threaded bit
> > somewhere in the subtree to create a new resource group, but then
> > immediately set the threaded bit again to allow that resource group
> > to have thread subgroups as well. Now the entire hierarchy will have
> > the threaded flag set and it becomes impossible to find the resource
> > domains.
> >
> > This is all a direct consequence of your flag not denoting the
> > primary construct, i.e. resource domains.
>
> Even if we allow switching back and forth, we can't make the same
> cgroup both resource domain && thread root. Not in a sane way at
> least.

The back and forth thing, yes; but even with a single level, the one
resource domain you tag will be both resource domain and thread root.

> > IOW, you've completely failed to convince me and my NAK stands.
>
> You have a narrow view from a single component and have been openly
> claiming and demonstrating to be not using, disinterested in and
> uninformed about cgroup.

I use a narrow set of cgroup-v1 capabilities not present in v2. You've
been very aggressively dismissing and ignoring those for a long time.
Given that, why should I be interested in v2?

> It's unfortunate and bullshit that the whole thing is blocked on your
> NAK, especially when the part you're holding hostage is something a
> lot of users want and won't change no matter what we do about threads.

I understand your frustration, I have plenty of my own; see the
paragraph above. Then again, I'm glad you're now more open to discussing
these things.

I don't see it as a given that things will not change until the threads
situation is solved. Call me a pessimist if you will, but I want to see
a full picture first.

In any case, let me ponder these new insights for a bit.
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode [not found] ` <20170210175145.GJ6515-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2017-02-12 5:05 ` Tejun Heo [not found] ` <20170212050544.GJ29323-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 2017-02-14 10:35 ` Peter Zijlstra 0 siblings, 2 replies; 32+ messages in thread From: Tejun Heo @ 2017-02-12 5:05 UTC (permalink / raw) To: Peter Zijlstra Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, mingo-H+wXaHxf7aLQT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ, efault-Mmb7MZpHnFY, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA, Linus Torvalds, Andrew Morton

Hello,

On Fri, Feb 10, 2017 at 06:51:45PM +0100, Peter Zijlstra wrote:
> Sure, we're past that. This isn't about what memcg can or cannot do.
> Previous discussions established that controllers come in two shapes:
>
> - task based controllers; these are built on per-task properties and
>   groups are aggregates over sets of tasks. Since by definition
>   inter-task competition is already defined on individual tasks, it's
>   fairly trivial to extend the same rules to sets of tasks etc..
>
>   Examples: cpu, cpuset, cpuacct, perf, pid, (freezer)
>
> - system controllers; instead of building from tasks upwards, they
>   split what previously would be machine wide / global state. For these
>   there is no natural competition rule vs tasks, and hence your
>   no-internal-task rule.
>
>   Examples: memcg, io, hugetlb

This is a bit of a delta, but as I wrote before, at least cpu (and
accordingly cpuacct) won't stay purely task-based, as we should charge
resource consumption which isn't tied to a specific task to the matching
domain (e.g. CPU consumption during writeback, disk encryption, or CPU
cycles spent to receive packets).

> > And here's another point, currently, all controllers are enabled
> > consecutively from root.
> > If we have leaf thread subtrees, this still works fine. Resource
> > domain controllers won't be enabled into thread subtrees. If we allow
> > switching back and forth, what do we do in the middle while we're in
> > the thread part?
>
> From what I understand you cannot re-enable a controller once it's been
> disabled, right? If you disable it, it's dead for the entire subtree.

cgroups on creation don't enable controllers by default and users can
enable and disable controllers dynamically as long as the conditions are
met. So, they can be disabled and re-enabled.

> > No matter what we do, it's gonna be more confusing and we lose basic
> > invariants like "parent always has superset of control knobs that its
> > child has".
>
> No, exactly that. I don't think I ever proposed something different.
>
> The "resource domain" flag I proposed violates the
> no-internal-processes thing, but it doesn't violate that rule afaict.

If we go to thread mode and back to domain mode, the control knobs for
domain controllers don't make sense on the thread part of the tree and
they won't have a cgroup_subsys_state to correspond to either. For
example, in

  A - T - B

B's memcg knobs would control memory distribution from A and the cgroups
in T can't have memcg knobs. It'd be weird to indicate that memcg is
enabled in those cgroups too. We can make it work somehow. It's just a
weird-ass interface.

> > As for the runtime overhead, if you get affected by adding a
> > top-level cgroup in any measurable way, we need to fix that. That's
> > not a valid argument for messing up the interface.
>
> I think cgroup tree depth is a more significant issue; because of the
> hierarchy we often do tree walks (up-to-root or down-to-task).
>
> So creating elaborate trees is something I try not to do.

So, as long as the depth stays reasonable (single digit or lower), what
we try to do is keeping tree traversal operations aggregated or located
on slow paths. There still are places where this overhead shows up (e.g.
the block controllers aren't too optimized) but it isn't particularly
difficult to make a handful of layers not matter at all. memcg batches
the charging operations and it's impossible to measure the overhead of
several levels of hierarchy.

In general, I think it's important to ensure that this is the case, so
that users can use logical layouts matching the actual resource
hierarchy rather than having to twist the layout for optimization.

> > Even if we allow switching back and forth, we can't make the same
> > cgroup both resource domain && thread root. Not in a sane way at
> > least.
>
> The back and forth thing yes, but even with a single level, the one
> resource domain you tag will be both resource domain and thread root.

Ah, you're right. Thanks.

--
tejun
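The dynamic enable/disable behavior Tejun describes is simply repeated
writes to `cgroup.subtree_control`; a minimal sketch of it against a
mainline cgroup2 mount (the cgroup name and controller are illustrative,
and this needs root):

```shell
cd /sys/fs/cgroup
mkdir -p grp
echo "+io" > cgroup.subtree_control   # enable io for the root's children
cat grp/cgroup.controllers            # io is now listed here
echo "-io" > cgroup.subtree_control   # disable it again...
echo "+io" > cgroup.subtree_control   # ...and re-enable; there is no
                                      # one-way latch as long as the usual
                                      # conditions hold
```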
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode [not found] ` <20170212050544.GJ29323-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> @ 2017-02-12 6:59 ` Mike Galbraith 2017-02-13 5:45 ` Mike Galbraith 0 siblings, 1 reply; 32+ messages in thread From: Mike Galbraith @ 2017-02-12 6:59 UTC (permalink / raw) To: Tejun Heo, Peter Zijlstra Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, mingo-H+wXaHxf7aLQT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA, Linus Torvalds, Andrew Morton

On Sun, 2017-02-12 at 14:05 +0900, Tejun Heo wrote:
> > I think cgroup tree depth is a more significant issue; because of the
> > hierarchy we often do tree walks (up-to-root or down-to-task).
> >
> > So creating elaborate trees is something I try not to do.
>
> So, as long as the depth stays reasonable (single digit or lower),
> what we try to do is keeping tree traversal operations aggregated or
> located on slow paths. There still are places where this overhead
> shows up (e.g. the block controllers aren't too optimized) but it
> isn't particularly difficult to make a handful of layers not matter at
> all.

A handful of cpu bean counting layers stings considerably.
homer:/abuild # pipe-test 1
2.010057 usecs/loop -- avg 2.010057 995.0 KHz
2.006630 usecs/loop -- avg 2.009714 995.2 KHz
2.127118 usecs/loop -- avg 2.021455 989.4 KHz
2.256244 usecs/loop -- avg 2.044934 978.0 KHz
1.993693 usecs/loop -- avg 2.039810 980.5 KHz
^C
homer:/abuild # cgexec -g cpu:hurt pipe-test 1
2.771641 usecs/loop -- avg 2.771641 721.6 KHz
2.432333 usecs/loop -- avg 2.737710 730.5 KHz
2.750493 usecs/loop -- avg 2.738988 730.2 KHz
2.663203 usecs/loop -- avg 2.731410 732.2 KHz
2.762564 usecs/loop -- avg 2.734525 731.4 KHz
^C
homer:/abuild # cgexec -g cpu:hurt/pain pipe-test 1
2.967201 usecs/loop -- avg 2.967201 674.0 KHz
3.049012 usecs/loop -- avg 2.975382 672.2 KHz
3.031226 usecs/loop -- avg 2.980966 670.9 KHz
2.954259 usecs/loop -- avg 2.978296 671.5 KHz
2.933432 usecs/loop -- avg 2.973809 672.5 KHz
^C
...
homer:/abuild # cgexec -g cpu:hurt/pain/ouch/moan/groan pipe-test 1
4.417044 usecs/loop -- avg 4.417044 452.8 KHz
4.494913 usecs/loop -- avg 4.424831 452.0 KHz
4.253861 usecs/loop -- avg 4.407734 453.7 KHz
4.378059 usecs/loop -- avg 4.404766 454.1 KHz
4.179895 usecs/loop -- avg 4.382279 456.4 KHz
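For reference, the tree above can be reproduced along these lines;
pipe-test is Mike's out-of-tree scheduler micro-benchmark and cgexec
comes from libcgroup, so both are assumed to be installed, and the v1
cpu controller mount point may differ per distro:

```shell
mkdir -p /sys/fs/cgroup/cpu/hurt/pain/ouch/moan/groan
pipe-test 1                                          # baseline, root group
cgexec -g cpu:hurt pipe-test 1                       # one cpu layer
cgexec -g cpu:hurt/pain pipe-test 1                  # two layers
cgexec -g cpu:hurt/pain/ouch/moan/groan pipe-test 1  # five layers
```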
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode 2017-02-12 6:59 ` Mike Galbraith @ 2017-02-13 5:45 ` Mike Galbraith [not found] ` <1486964707.5912.93.camel-Mmb7MZpHnFY@public.gmane.org> 0 siblings, 1 reply; 32+ messages in thread From: Mike Galbraith @ 2017-02-13 5:45 UTC (permalink / raw) To: Tejun Heo, Peter Zijlstra Cc: lizefan, hannes, mingo, pjt, luto, cgroups, linux-kernel, kernel-team, lvenanci, Linus Torvalds, Andrew Morton

On Sun, 2017-02-12 at 07:59 +0100, Mike Galbraith wrote:
> On Sun, 2017-02-12 at 14:05 +0900, Tejun Heo wrote:
> > > I think cgroup tree depth is a more significant issue; because of
> > > the hierarchy we often do tree walks (up-to-root or down-to-task).
> > >
> > > So creating elaborate trees is something I try not to do.
> >
> > So, as long as the depth stays reasonable (single digit or lower),
> > what we try to do is keeping tree traversal operations aggregated or
> > located on slow paths. There still are places where this overhead
> > shows up (e.g. the block controllers aren't too optimized) but it
> > isn't particularly difficult to make a handful of layers not matter
> > at all.
>
> A handful of cpu bean counting layers stings considerably.

BTW, that overhead is also why merging cpu/cpuacct is not really as
wonderful as it may seem on paper. If you only want to account, you may
not have anything to gain from group scheduling (in fact it may wreck
performance), but you'll pay for it.
> homer:/abuild # pipe-test 1
> 2.010057 usecs/loop -- avg 2.010057 995.0 KHz
> 2.006630 usecs/loop -- avg 2.009714 995.2 KHz
> 2.127118 usecs/loop -- avg 2.021455 989.4 KHz
> 2.256244 usecs/loop -- avg 2.044934 978.0 KHz
> 1.993693 usecs/loop -- avg 2.039810 980.5 KHz
> ^C
> homer:/abuild # cgexec -g cpu:hurt pipe-test 1
> 2.771641 usecs/loop -- avg 2.771641 721.6 KHz
> 2.432333 usecs/loop -- avg 2.737710 730.5 KHz
> 2.750493 usecs/loop -- avg 2.738988 730.2 KHz
> 2.663203 usecs/loop -- avg 2.731410 732.2 KHz
> 2.762564 usecs/loop -- avg 2.734525 731.4 KHz
> ^C
> homer:/abuild # cgexec -g cpu:hurt/pain pipe-test 1
> 2.967201 usecs/loop -- avg 2.967201 674.0 KHz
> 3.049012 usecs/loop -- avg 2.975382 672.2 KHz
> 3.031226 usecs/loop -- avg 2.980966 670.9 KHz
> 2.954259 usecs/loop -- avg 2.978296 671.5 KHz
> 2.933432 usecs/loop -- avg 2.973809 672.5 KHz
> ^C
> ...
> homer:/abuild # cgexec -g cpu:hurt/pain/ouch/moan/groan pipe-test 1
> 4.417044 usecs/loop -- avg 4.417044 452.8 KHz
> 4.494913 usecs/loop -- avg 4.424831 452.0 KHz
> 4.253861 usecs/loop -- avg 4.407734 453.7 KHz
> 4.378059 usecs/loop -- avg 4.404766 454.1 KHz
> 4.179895 usecs/loop -- avg 4.382279 456.4 KHz
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode [not found] ` <1486964707.5912.93.camel-Mmb7MZpHnFY@public.gmane.org> @ 2017-03-13 19:26 ` Tejun Heo 2017-03-14 14:45 ` Mike Galbraith 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2017-03-13 19:26 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, mingo-H+wXaHxf7aLQT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA, Linus Torvalds, Andrew Morton

Hello, Mike.

Sorry about the long delay.

On Mon, Feb 13, 2017 at 06:45:07AM +0100, Mike Galbraith wrote:
> > > So, as long as the depth stays reasonable (single digit or lower),
> > > what we try to do is keeping tree traversal operations aggregated
> > > or located on slow paths. There still are places where this
> > > overhead shows up (e.g. the block controllers aren't too optimized)
> > > but it isn't particularly difficult to make a handful of layers not
> > > matter at all.
> >
> > A handful of cpu bean counting layers stings considerably.

Hmm... yeah, I was trying to think about ways to avoid the full
scheduling overhead at each layer (the scheduler does a lot per layer of
scheduling) but I don't think it's possible to circumvent that without
introducing a whole lot of scheduling artifacts.

In a lot of workloads, the added overhead from several layers of CPU
controllers doesn't seem to get in the way too much (most threads do
something other than scheduling, after all). The only major issue that
we're seeing in the fleet is the cgroup iteration in the idle
rebalancing code pushing up the scheduling latency too much, but that's
a different issue.

Anyways, I understand that there are cases where people would want to
avoid any extra layers. I'll continue on PeterZ's message.
> BTW, that overhead is also why merging cpu/cpuacct is not really as
> wonderful as it may seem on paper. If you only want to account, you
> may not have anything to gain from group scheduling (in fact it may
> wreck performance), but you'll pay for it.

There's another reason why we would want accounting separate: weight
based controllers, cpu and io currently, can't be enabled without
affecting the scheduling behavior. However, accounting is different from
CPU control in that all the heavy part of the operations can be shifted
to the readers (we just need to do per-cpu updates from hot paths), so
we might as well publish those stats by default on the v2 hierarchy. We
couldn't do the same in v1 because the number of hierarchies wasn't
limited.

Thanks.

--
tejun
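For what it's worth, this is roughly what "publish those stats by
default" turned into in later mainline kernels: basic CPU accounting is
exposed in every v2 cgroup even where the cpu controller itself isn't
enabled (later-kernel behavior, not part of this patchset; the cgroup
name is illustrative):

```shell
mkdir /sys/fs/cgroup/grp
cat /sys/fs/cgroup/grp/cpu.stat   # usage_usec / user_usec / system_usec
                                  # are reported without writing "+cpu"
                                  # anywhere in the hierarchy
```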
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode 2017-03-13 19:26 ` Tejun Heo @ 2017-03-14 14:45 ` Mike Galbraith 0 siblings, 0 replies; 32+ messages in thread From: Mike Galbraith @ 2017-03-14 14:45 UTC (permalink / raw) To: Tejun Heo Cc: Peter Zijlstra, lizefan, hannes, mingo, pjt, luto, cgroups, linux-kernel, kernel-team, lvenanci, Linus Torvalds, Andrew Morton On Mon, 2017-03-13 at 15:26 -0400, Tejun Heo wrote: > Hello, Mike. > > Sorry about the long delay. > > On Mon, Feb 13, 2017 at 06:45:07AM +0100, Mike Galbraith wrote: > > > > So, as long as the depth stays reasonable (single digit or lower), > > > > what we try to do is keeping tree traversal operations aggregated or > > > > located on slow paths. There still are places that this overhead > > > > shows up (e.g. the block controllers aren't too optimized) but it > > > > isn't particularly difficult to make a handful of layers not matter at > > > > all. > > > > > > A handful of cpu bean counting layers stings considerably. > > Hmm... yeah, I was trying to think about ways to avoid full scheduling > overhead at each layer (the scheduler does a lot per each layer of > scheduling) but don't think it's possible to circumvent that without > introducing a whole lot of scheduling artifacts. Yup. > In a lot of workloads, the added overhead from several layers of CPU > controllers doesn't seem to get in the way too much (most threads do > something other than scheduling after all). Sure, don't schedule a lot, it doesn't hurt much, but there are plenty of loads that routinely do schedule a LOT, and there it matters a LOT.. which is why network benchmarks tend to be severely allergic to scheduler lard. > The only major issue that > we're seeing in the fleet is the cgroup iteration in idle rebalancing > code pushing up the scheduling latency too much but that's a different > issue. Hm, I would suspect PELT to be the culprit there. 
It helps smooth out load balancing, but will stack "skinny looking" tasks. -Mike
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode 2017-02-12 5:05 ` Tejun Heo [not found] ` <20170212050544.GJ29323-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> @ 2017-02-14 10:35 ` Peter Zijlstra [not found] ` <20170214103541.GS6515-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 1 sibling, 1 reply; 32+ messages in thread From: Peter Zijlstra @ 2017-02-14 10:35 UTC (permalink / raw) To: Tejun Heo Cc: lizefan, hannes, mingo, pjt, luto, efault, cgroups, linux-kernel, kernel-team, lvenanci, Linus Torvalds, Andrew Morton On Sun, Feb 12, 2017 at 02:05:44PM +0900, Tejun Heo wrote: > Hello, > > On Fri, Feb 10, 2017 at 06:51:45PM +0100, Peter Zijlstra wrote: > > Sure, we're past that. This isn't about what memcg can or cannot do. > > Previous discussions established that controllers come in two shapes: > > > > - task based controllers; these are build on per task properties and > > groups are aggregates over sets of tasks. Since per definition inter > > task competition is already defined on individual tasks, its fairly > > trivial to extend the same rules to sets of tasks etc.. > > > > Examples: cpu, cpuset, cpuacct, perf, pid, (freezer) > > > > - system controllers; instead of building from tasks upwards, they > > split what previously would be machine wide / global state. For these > > there is no natural competition rule vs tasks, and hence your > > no-internal-task rule. > > > > Examples: memcg, io, hugetlb > > This is a bit of delta but as I wrote before, at least cpu (and > accordingly cpuacct) won't stay purely task-based as we should account > for resource consumptions which aren't tied to specific tasks to the > matching domain (e.g. CPU consumption during writeback, disk > encryption or CPU cycles spent to receive packets). We should probably do that in another thread, but I'd probably insist on separate controllers that co-operate to get that done. 
> > > And here's another point, currently, all controllers are enabled > > > consecutively from root. If we have leaf thread subtrees, this still > > > works fine. Resource domain controllers won't be enabled into thread > > > subtrees. If we allow switching back and forth, what do we do in the > > > middle while we're in the thread part? > > > > From what I understand you cannot re-enable a controller once it's been > > disabled, right? If you disable it, it's dead for the entire subtree. > > cgroups on creation don't enable controllers by default and users can > enable and disable controllers dynamically as long as the conditions > are met. So, they can be disabled and re-enabled. I was talking in a hierarchical sense; your section 2-4-2 ("Top-Down Constraint") seems to state similar things to what I meant. Once you disable a controller it cannot be re-enabled in a subtree. > > > No matter what we do, it's > > > gonna be more confusing and we lose basic invariants like "parent > > > always has superset of control knobs that its child has". > > > > No, exactly that. I don't think I ever proposed something different. > > > > The "resource domain" flag I proposed violates the no-internal-processes > > thing, but it doesn't violate that rule afaict. > > If we go to thread mode and back to domain mode, the control knobs for > domain controllers don't make sense on the thread part of the tree and > they won't have cgroup_subsys_state to correspond to either. For > example, > > A - T - B > > B's memcg knobs would control memory distribution from A and cgroups > in T can't have memcg knobs. It'd be weird to indicate that memcg is > enabled in those cgroups too. But memcg _is_ enabled for T. All the tasks are mapped onto A for the purpose of the system controller (memcg) and are subject to its constraints. > We can make it work somehow. It's just weird-ass interface. You could make these control files (read-only?) symlinks back to A's actual control files. 
To more explicitly show this. > > > As for the runtime overhead, if you get affected by adding a top-level > > > cgroup in any measureable way, we need to fix that. That's not a > > > valid argument for messing up the interface. > > > > I think cgroup tree depth is a more significant issue; because of > > hierarchy we often do tree walks (uo-to-root or down-to-task). > > > > So creating elaborate trees is something I try not to do. > > So, as long as the depth stays reasonable (single digit or lower), > what we try to do is keeping tree traversal operations aggregated or > located on slow paths. While at the same time you allowed that BPF cgroup thing to not be hierarchical because iterating the tree was too expensive; or did I misunderstand? Also, I think Mike showed you the pain and hurt are quite visible for even a few levels. Batching is tricky, you need to somehow bound the error function in order to not become too big a factor on behaviour. Esp. for cpu, cpuacct obviously doesn't care much as it doesn't enforce anything. > In general, I think it's important to ensure that this in general is > the case so that users can use the logical layouts matching the actual > resource hierarchy rather than having to twist the layout for > optimization. One does what one can.. But it is important to understand the constraints, nothing comes for free. > > > Even if we allow switching back and forth, we can't make the same > > > cgroup both resource domain && thread root. Not in a sane way at > > > least. > > > > The back and forth thing yes, but even with a single level, the one > > resource domain you tag will be both resource domain and thread root. > > Ah, you're right. Also, there is the one giant wart in v2 wrt no-internal-processes; namely the root group is allowed to have them. Now I understand why this is so; so don't feel compelled to explain that again, but it does make the model very ugly and has a real problem, see below. 
OTOH, since it is there, I would very much like to make use of this 'feature' and allow a thread-group on the root group. And since you then _can_ have nested thread groups, it again becomes very important to be able to find the resource domains, which brings me back to my proposal; albeit with an additional constraint. Now on to the problem of the no-internal-processes wart; how does cgroup-v2 currently implement the whole container invariant? Because by that invariant, a container's 'root' group must also allow internal-processes.
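The A - T - B disagreement in this exchange comes down to two controllers holding two views of one tree. A few lines of Python sketch the difference (the names and the walk-up rule here are illustrative, not the patchset's implementation): a resource-domain controller such as memcg skips threaded cgroups when resolving a parent, while a threaded controller such as cpu uses the tree as-is.

```python
# Illustrative model of the two views: memcg (a resource-domain
# controller) skips the threaded part of the tree; cpu (a threaded
# controller) does not.
class Cgroup:
    def __init__(self, name, parent=None, threaded=False):
        self.name, self.parent, self.threaded = name, parent, threaded

def domain_parent(cg):
    # resource-domain view: walk up past threaded ancestors to the
    # nearest domain cgroup
    p = cg.parent
    while p is not None and p.threaded:
        p = p.parent
    return p

A = Cgroup("A")
T = Cgroup("T", A, threaded=True)  # threaded subtree under A
B = Cgroup("B", T)                 # back to domain mode below T

print(domain_parent(B).name)  # "A": B's memcg knobs distribute A's memory
print(B.parent.name)          # "T": the threaded view keeps the real parent
```

In this model T itself has no memcg knobs at all, which is exactly the interface problem being debated: there is no cgroup in the memcg view at which T's consumption could be configured against B's.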
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode [not found] ` <20170214103541.GS6515-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2017-03-13 20:05 ` Tejun Heo [not found] ` <20170313200544.GE15709-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 32+ messages in thread From: Tejun Heo @ 2017-03-13 20:05 UTC (permalink / raw) To: Peter Zijlstra Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, mingo-H+wXaHxf7aLQT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ, efault-Mmb7MZpHnFY, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA, Linus Torvalds, Andrew Morton Hey, Peter. Sorry about the long delay. On Tue, Feb 14, 2017 at 11:35:41AM +0100, Peter Zijlstra wrote: > > This is a bit of delta but as I wrote before, at least cpu (and > > accordingly cpuacct) won't stay purely task-based as we should account > > for resource consumptions which aren't tied to specific tasks to the > > matching domain (e.g. CPU consumption during writeback, disk > > encryption or CPU cycles spent to receive packets). > > We should probably do that in another thread, but I'd probably insist on > separate controllers that co-operate to get that done. Let's shelve this for now. > > cgroups on creation don't enable controllers by default and users can > > enable and disable controllers dynamically as long as the conditions > > are met. So, they can be disable and re-enabled. > > I was talking in a hierarchical sense, your section 2-4-2. Top-Down > constraint seems to state similar things to what I meant. > > Once you disable a controller it cannot be re-enabled in a subtree. Ah, yeah, you can't jump across parts of the tree. > > If we go to thread mode and back to domain mode, the control knobs for > > domain controllers don't make sense on the thread part of the tree and > > they won't have cgroup_subsys_state to correspond to either. 
For > > example, > > > > A - T - B > > > > B's memcg knobs would control memory distribution from A and cgroups > > in T can't have memcg knobs. It'd be weird to indicate that memcg is > > enabled in those cgroups too. > > But memcg _is_ enabled for T. All the tasks are mapped onto A for > purpose of the system controller (memcg) and are subject to its > constraints. Sure, T is contained in A but think about the interface. For memcg, T belongs to A. B is the first descendant when viewed from memcg, which brings about two problems - memcg doesn't have control knobs to assign throughout T which is just weird and there's no way to configure how T competes against B. > > We can make it work somehow. It's just weird-ass interface. > > You could make these control files (read-only?) symlinks back to A's > actual control files. To more explicitly show this. But the knobs are supposed to control how much resource a child gets from its parent. Flipping that over while walking down the same tree sounds horribly ugly and confusing to me. Besides, that doesn't solve the problem with lacking the ability configure T's consumptions against B. > > So, as long as the depth stays reasonable (single digit or lower), > > what we try to do is keeping tree traversal operations aggregated or > > located on slow paths. > > While at the same time you allowed that BPF cgroup thing to not be > hierarchical because iterating the tree was too expensive; or did I > misunderstand? That was more because that was supposed to be part of bpf (network or whatever) and just followed the model of table matching "is the target under this hierarchy?". That's where it came from after all. Hierarchical walking can be added but it's more work (defining the iteration direction and rules) and doesn't bring benefits without working delegation. If it were a cgroup controller, it should have been fully hierarchical no matter what but that involves solving bpf delegation first. 
> Also, I think Mike showed you the pain and hurt are quite visible for > even a few levels. > > Batching is tricky, you need to somehow bound the error function in > order to not become too big a factor on behaviour. Esp. for cpu, cpuacct > obviously doesn't care much as it doesn't enforce anything. Yeah, I thought about this for quite a while but I couldn't think of any easy way of circumventing overhead without introducing a lot of scheduling artifacts (e.g. multiplying down the weights to practically collapse multi levels of the hierarchy), at least for the weight based control which is what most people use. It looks like the only way to lower the overhead there is making generic scheduling cheaper but that still means that multi-level will always be noticeably more expensive in terms of raw scheduling performance. Scheduling hackbench is an extreme case tho and in practice at least we're not seeing noticeable issues with a few levels of nesting when the workload actually spends cpu cycles doing things other than scheduling. However, we're seeing a significant increase in scheduling latency coming from how cgroups are handled from the rebalance path. I'm still looking into it and will write about that in a separate thread. > > In general, I think it's important to ensure that this in general is > the case so that users can use the logical layouts matching the actual > resource hierarchy rather than having to twist the layout for > optimization. > > One does what one can.. But it is important to understand the > constraints, nothing comes for free. Yeah, for sure. > Also, there is the one giant wart in v2 wrt no-internal-processes; > namely the root group is allowed to have them. > > Now I understand why this is so; so don't feel compelled to explain that > again, but it does make the model very ugly and has a real problem, see > below. OTOH, since it is there, I would very much like to make use of > this 'feature' and allow a thread-group on the root group. 
> > And since you then _can_ have nested thread groups, it again becomes > very important to be able to find the resource domains, which brings me > back to my proposal; albeit with an addition constraint. I've thought quite a bit about ways to allow thread granularity from the top while still presenting a consistent picture to resource domain controllers. That's what's missing from the CPU controller side given Mike's claim that there's unavoidable overhead in nesting CPU controller and requiring at least one level of nesting on cgroup v2 for thread granularity might not be acceptable. Going back to why thread support on cgroup v2 was needed in the first place, it was to allow thread level control while cooperating with other controllers on v2. IOW, allowing thread level control for CPU while cooperating with resource domain type controllers. Now, going back to allowing thread hierarchies from the root, given that their resource domain can only be root, which is exactly what you get when CPU is mounted on a separate hierarchy, it seems kinda moot. The practical constraint with the current scheme is that in cases where other resource domain controllers need to be used, the thread hierarchies would have to be nested at least one level, but if you don't want any resource domain things, that's the same as mounting the controller separately. > Now on to the problem of the no-internal-processes wart; how does > cgroup-v2 currently implement the whole container invariant? Because by > that invariant, a container's 'root' group must also allow > internal-processes. I'm not sure I follow the question here. What's the "whole container invariant"? Thanks. -- tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
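The "multiplying down the weights" idea mentioned above is plain arithmetic: a task's steady-state share is the product of the weight fractions along its path, so a nested tree could be collapsed into flat weights with the same shares, but only while every entity stays runnable, which is where the scheduling artifacts come from. A sketch with illustrative numbers:

```python
# Effective CPU share under hierarchical weights: at each level a child
# receives parent_share * (child_weight / sum_of_sibling_weights).
def effective_share(path, tree):
    share = 1.0
    for parent, child in zip(path, path[1:]):
        siblings = tree[parent]
        share *= siblings[child] / sum(siblings.values())
    return share

# root has groups g1 (weight 200) and g2 (weight 100);
# g1 contains tasks t1 (weight 100) and t2 (weight 300).
tree = {"root": {"g1": 200, "g2": 100}, "g1": {"t1": 100, "t2": 300}}
t1 = effective_share(["root", "g1", "t1"], tree)
t2 = effective_share(["root", "g1", "t2"], tree)
g2 = effective_share(["root", "g2"], tree)
# Flat weights proportional to these shares (t1=1/6, t2=1/2, g2=1/3)
# reproduce the same split, but the equivalence breaks as soon as one
# task sleeps: in the real hierarchy its sibling inherits the whole
# group share, in the collapsed one it does not.
print(round(t1, 4), round(t2, 4), round(g2, 4))  # 0.1667 0.5 0.3333
```

The comment at the end is the artifact Tejun is referring to: collapsing only matches the nested semantics when the runnable set never changes, which it constantly does.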
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode [not found] ` <20170313200544.GE15709-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> @ 2017-03-21 12:39 ` Peter Zijlstra [not found] ` <20170321123958.af7mcvcovexxzahu-Nxj+rRp3nVydTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 0 siblings, 1 reply; 32+ messages in thread From: Peter Zijlstra @ 2017-03-21 12:39 UTC (permalink / raw) To: Tejun Heo Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, mingo-H+wXaHxf7aLQT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ, efault-Mmb7MZpHnFY, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA, Linus Torvalds, Andrew Morton On Mon, Mar 13, 2017 at 04:05:44PM -0400, Tejun Heo wrote: > Hey, Peter. Sorry about the long delay. No worries; we're all (too) busy. > > > If we go to thread mode and back to domain mode, the control knobs for > > > domain controllers don't make sense on the thread part of the tree and > > > they won't have cgroup_subsys_state to correspond to either. For > > > example, > > > > > > A - T - B > > > > > > B's memcg knobs would control memory distribution from A and cgroups > > > in T can't have memcg knobs. It'd be weird to indicate that memcg is > > > enabled in those cgroups too. > > > > But memcg _is_ enabled for T. All the tasks are mapped onto A for > > purpose of the system controller (memcg) and are subject to its > > constraints. > > Sure, T is contained in A but think about the interface. For memcg, T > belongs to A. B is the first descendant when viewed from memcg, which > brings about two problems - memcg doesn't have control knobs to assign > throughout T which is just weird and there's no way to configure how T > competes against B. > > > > We can make it work somehow. It's just weird-ass interface. > > > > You could make these control files (read-only?) symlinks back to A's > > actual control files. 
To more explicitly show this. > > But the knobs are supposed to control how much resource a child gets > from its parent. Flipping that over while walking down the same tree > sounds horribly ugly and confusing to me. Besides, that doesn't solve > the problem with lacking the ability configure T's consumptions > against B. So I'm not confused; and I suspect you're not either. But you're worried about 'simple' people getting confused? The rules really are fairly straight forward; but yes, it will be a little more involved than without this. But note that this is an optional thing, people don't have to make thread groups if they don't want to. And they further don't have to use non-leaf thread groups. And at some point, there's no helping stupid; and trying to do so will only make you insane. So the fundamental thing to realize (and explain) is that there are two different types of controllers; and that they operate on different views of the hierarchy. I think our goal as a kernel API should be presenting the capabilities in a concise and consistent manner; and I feel that the proposed interface is that. So the points you raise above; about system controller knobs in thread groups and competition between thread and system groups as seen for system controllers are confusion due to not considering the views. And yes, having to consider views is new and a direct consequence of this new optional feature. But I don't see how its a problem. > Scheduling hackbench is an extreme case tho and in practice at least > we're not seeing noticeable issues with a few levels of nesting when > the workload actually spends cpu cycles doing things other than > scheduling. Right; most workloads don't schedule _that_ much; and the overhead isn't too painful. > However, we're seeing significant increase in scheduling > latency coming from how cgroups are handled from the rebalance path. > I'm still looking into it and will write about that in a separate > thread. 
I have some vague memories of this being a pain point. IIRC it comes down to the problem that latency is an absolute measure while the weight is a relative thing. I think we mucked about with it some, many years ago; but haven't done so recently. > > Also, there is the one giant wart in v2 wrt no-internal-processes; > > namely the root group is allowed to have them. > > > > Now I understand why this is so; so don't feel compelled to explain that > > again, but it does make the model very ugly and has a real problem, see > > below. OTOH, since it is there, I would very much like to make use of > > this 'feature' and allow a thread-group on the root group. > > > > And since you then _can_ have nested thread groups, it again becomes > > very important to be able to find the resource domains, which brings me > > back to my proposal; albeit with an additional constraint. > > I've thought quite a bit about ways to allow thread granularity from > the top while still presenting a consistent picture to resource domain > controllers. That's what's missing from the CPU controller side given > Mike's claim that there's unavoidable overhead in nesting CPU > controller and requiring at least one level of nesting on cgroup v2 > for thread granularity might not be acceptable. > > Going back to why thread support on cgroup v2 was needed in the first > place, it was to allow thread level control while cooperating with > other controllers on v2. IOW, allowing thread level control for CPU > while cooperating with resource domain type controllers. Well, not only CPU, I can see the same being used for perf for example. > Now, going back to allowing thread hierarchies from the root, given > that their resource domain can only be root, which is exactly what you > get when CPU is mounted on a separate hierarchy, it seems kinda moot. Not quite; see below on the container thing. 
> The practical constraint with the current scheme is that in cases > where other resource domain controllers need to be used, the thread > hierarchies would have to be nested at least one level, but if you > don't want any resource domain things, that's the same as mounting the > controller separately. > > > Now on to the problem of the no-internal-processes wart; how does > > cgroup-v2 currently implement the whole container invariant? Because by > > that invariant, a container's 'root' group must also allow > > internal-processes. > > I'm not sure I follow the question here. What's the "whole container > invariant"? The container invariant is that everything inside a container looks and works *exactly* like a 'real' system. Containers do this with namespaces; the PID namespace 'hides' all processes not part of its namespace and has independent PID numbering, such that we can start at 1 again for our init; with that it also has a new child reaper etc. Similarly, the cgroup namespace should hide everything outside its subtree; but it should also provide a full new root cgroup, which _should_ include the no-internal-processes exception. Another constraint is the whole controller mounting nonsense; unless you would allow containers to (re)mount cgroup controllers differently, an unlikely case I feel, containers are constrained to whatever mount options the host kernel got dealt. This effectively means that controller mount options are not a viable configuration mechanism. This is really important for things like Docker and related. They must assume some standard setup of cgroups or otherwise cannot use it at all. But even aside of that, the mount thing is a fairly static and inflexible configuration. What if you have two workloads that require a different setup on the same machine?
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode [not found] ` <20170321123958.af7mcvcovexxzahu-Nxj+rRp3nVydTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2017-03-22 14:52 ` Peter Zijlstra 0 siblings, 0 replies; 32+ messages in thread From: Peter Zijlstra @ 2017-03-22 14:52 UTC (permalink / raw) To: Tejun Heo Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, mingo-H+wXaHxf7aLQT0dZR+AlfA, pjt-hpIqsD4AKlfQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ, efault-Mmb7MZpHnFY, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg, lvenanci-H+wXaHxf7aLQT0dZR+AlfA, Linus Torvalds, Andrew Morton On Tue, Mar 21, 2017 at 01:39:58PM +0100, Peter Zijlstra wrote: > > And yes, having to consider views is new and a direct consequence of > this new optional feature. But I don't see how its a problem. > So aside from having (RO) links in thread groups for system controllers, we could also have a ${controller}_parent link back to whatever group is the actual parent for that specific controller's view. So then your B's memcg_parent would point to A, not T. But I feel this is all superfluous window dressing; but if you want to clarify the filesystem interface, this could be something to consider. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode 2017-02-02 20:06 [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode Tejun Heo ` (3 preceding siblings ...) [not found] ` <20170202200632.13992-1-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> @ 2017-02-09 13:07 ` Paul Turner 2017-02-09 14:47 ` Peter Zijlstra [not found] ` <CAPM31RKHsM1-iWb5B6jkOtLomLhiOARtgMOTWZ8p1yjEn-ZK0A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 4 siblings, 2 replies; 32+ messages in thread From: Paul Turner @ 2017-02-09 13:07 UTC (permalink / raw) To: Tejun Heo Cc: lizefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, Andy Lutomirski, Mike Galbraith, cgroups, LKML, kernel-team, lvenanci, Linus Torvalds, Andrew Morton On Thu, Feb 2, 2017 at 12:06 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, > > This patchset implements cgroup v2 thread mode. It is largely based > on the discussions that we had at the plumbers last year. Here's the > rough outline. > > * Thread mode is explicitly enabled on a cgroup by writing "enable" > into "cgroup.threads" file. The cgroup shouldn't have any child > cgroups or enabled controllers. > > * Once enabled, arbitrary sub-hierarchy can be created and threads can > be put anywhere in the subtree by writing TIDs into "cgroup.threads" > file. Process granularity and no-internal-process constraint don't > apply in a threaded subtree. > > * To be used in a threaded subtree, controllers should explicitly > declare thread mode support and should be able to handle internal > competition in some way. > > * The root of a threaded subtree serves as the resource domain for the > whole subtree. This is where all the controllers are guaranteed to > have a common ground and resource consumptions in the threaded > subtree which aren't tied to a specific thread are charged. > Non-threaded controllers never see beyond thread root and can assume > that all controllers will follow the same rules upto that point. 
> > This allows threaded controllers to implement thread granular resource > control without getting in the way of system level resource > partitioning. > I think that this is definitely a step in the right direction versus previous proposals. However, as proposed it feels like the API is conflating the process/thread distinction with the core process hierarchy. While this does allow previous use-cases to be re-enabled, it seems to do so at an unnecessary API cost. As proposed, the cgroup.threads file means that threads are always anchored in the tree by their process parent. They may never move past it. I.e. if I have cgroups root/A/B, with B allowing sub-thread moves and the parent belonging to A or B, it is clear that the child cannot be moved beyond the parent. Now this, in itself, is a natural restriction. However, with this in hand, it means that we are effectively co-mingling two hierarchies onto the same tree: one that applies to processes, and per-process sub-trees. This introduces the following costs/restrictions: 1) We lose the ability to reasonably move a process. This puts us back to the existing challenge of the V1 API in which a thread is the unit we can move atomically. Hierarchies must be externally managed and synchronized. 2) This retains all of the problems of the existing V1 API for a process which wants to use these sub-groups to coordinate its threads. It must coordinate its operations on these groups with the global hierarchy (which is not consistently mounted) as well as potential migration -- (1) above. With the split as proposed, I fundamentally do not see the advantage of exposing these as the same hierarchy. By definition these .threads files are essentially introducing independent, process level, sub-hierarchies. 
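The anchoring restriction described above can be written as a predicate. This is an illustrative model of the stated rule ("threads may never move past their process parent"), not the patchset's code; the names and the exact walk-up are assumptions:

```python
# Illustrative predicate for the anchoring rule: a TID write may only
# place a thread inside the threaded subtree rooted at its process's
# own cgroup -- never above or beside it.
class Cgroup:
    def __init__(self, name, parent=None, threaded=False):
        self.name, self.parent, self.threaded = name, parent, threaded

def thread_root(cg):
    # walk up out of the threaded part; a non-threaded cgroup is its
    # own thread root
    while cg.threaded:
        cg = cg.parent
    return cg

def may_place_thread(process_cgrp, dest):
    # the destination must resolve back to the process's cgroup
    return thread_root(dest) is process_cgrp

root = Cgroup("root")
A = Cgroup("A", root)              # the process lives here (thread root)
B = Cgroup("B", A, threaded=True)  # threaded child created under A

print(may_place_thread(A, B))     # True: within the subtree
print(may_place_thread(A, root))  # False: may never move past A
```

Under this rule, moving the whole process means migrating A's membership, which is the loss of atomic process moves that point (1) above complains about.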
It seems greatly preferable to expose the sub-process level hierarchies via a separate path, e.g.: /proc/{pid, self}/thread_cgroups/ Any controllers enabled on the hierarchy that the process belongs to which support thread-level operations would appear within. This fully addresses (1) and (2) while allowing us to keep the unified process-granular v2-cgroup mounts as is. The only case that this does not support vs ".threads" would be some hybrid where we co-mingle threads from different processes (with the processes belonging to the same node in the hierarchy). I'm not aware of any usage that looks like this. What are the motivations that you see for forcing this all onto one mount-point via .threads sub-tree tags? > This patchset contains the following five patches. For more details > on the interface and behavior, please refer to the last patch. > > 0001-cgroup-reorganize-cgroup.procs-task-write-path.patch > 0002-cgroup-add-flags-to-css_task_iter_start-and-implemen.patch > 0003-cgroup-introduce-cgroup-proc_cgrp-and-threaded-css_s.patch > 0004-cgroup-implement-CSS_TASK_ITER_THREADED.patch > 0005-cgroup-implement-cgroup-v2-thread-support.patch > > This patchset is on top of cgroup/for-4.11 63f1ca59453a ("Merge branch > 'cgroup/for-4.11-rdmacg' into cgroup/for-4.11") and available in the > following git branch. > > git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup2-threads > > diffstat follows. Thanks. 
> > Documentation/cgroup-v2.txt | 75 ++++- > include/linux/cgroup-defs.h | 38 ++ > include/linux/cgroup.h | 12 > kernel/cgroup/cgroup-internal.h | 8 > kernel/cgroup/cgroup-v1.c | 64 +++- > kernel/cgroup/cgroup.c | 589 ++++++++++++++++++++++++++++++++-------- > kernel/cgroup/cpuset.c | 6 > kernel/cgroup/freezer.c | 6 > kernel/cgroup/pids.c | 1 > kernel/events/core.c | 1 > mm/memcontrol.c | 2 > net/core/netclassid_cgroup.c | 2 > 12 files changed, 671 insertions(+), 133 deletions(-) > > -- > tejun ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode 2017-02-09 13:07 ` Paul Turner @ 2017-02-09 14:47 ` Peter Zijlstra 2017-02-09 15:08 ` Mike Galbraith [not found] ` <CAPM31RJaJjFwenC36Abij+EdzO3KBm-DEjQ_crSmzrtrrn2N2A@mail.gmail.com> [not found] ` <CAPM31RKHsM1-iWb5B6jkOtLomLhiOARtgMOTWZ8p1yjEn-ZK0A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 2 replies; 32+ messages in thread From: Peter Zijlstra @ 2017-02-09 14:47 UTC (permalink / raw) To: Paul Turner Cc: Tejun Heo, lizefan, Johannes Weiner, Ingo Molnar, Andy Lutomirski, Mike Galbraith, cgroups, LKML, kernel-team, lvenanci, Linus Torvalds, Andrew Morton On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote: > The only case that this does not support vs ".threads" would be some > hybrid where we co-mingle threads from different processes (with the > processes belonging to the same node in the hierarchy). I'm not aware > of any usage that looks like this. If I understand you right; this is a fairly common thing with RT where we would stuff all the !rt threads of the various processes in a 'misc' bucket. Similarly, it happens that we stuff the various rt threads of processes in a specific (shared) 'rt' bucket. So I would certainly not like to exclude that setup. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  From: Mike Galbraith @ 2017-02-09 15:08 UTC
  To: Peter Zijlstra, Paul Turner
  Cc: Tejun Heo, lizefan, Johannes Weiner, Ingo Molnar, Andy Lutomirski,
      cgroups, LKML, kernel-team, lvenanci, Linus Torvalds, Andrew Morton

On Thu, 2017-02-09 at 15:47 +0100, Peter Zijlstra wrote:
> On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote:
> > The only case that this does not support vs ".threads" would be some
> > hybrid where we co-mingle threads from different processes (with the
> > processes belonging to the same node in the hierarchy). I'm not aware
> > of any usage that looks like this.
>
> If I understand you right; this is a fairly common thing with RT where
> we would stuff all the !rt threads of the various processes in a 'misc'
> bucket.
>
> Similarly, it happens that we stuff the various rt threads of processes
> in a specific (shared) 'rt' bucket.
>
> So I would certainly not like to exclude that setup.

Absolutely, you just described my daily bread performance setup.

	-Mike
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  From: Mike Galbraith @ 2017-02-13 5:28 UTC
  To: Paul Turner, Peter Zijlstra
  Cc: Tejun Heo, lizefan@huawei.com, Johannes Weiner, Ingo Molnar,
      Andy Lutomirski, cgroups, LKML, kernel-team, lvenanci@redhat.com,
      Linus Torvalds, Andrew Morton

On Sun, 2017-02-12 at 13:16 -0800, Paul Turner wrote:
> On Thursday, February 9, 2017, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote:
> > > The only case that this does not support vs ".threads" would be some
> > > hybrid where we co-mingle threads from different processes (with the
> > > processes belonging to the same node in the hierarchy). I'm not aware
> > > of any usage that looks like this.
> >
> > If I understand you right; this is a fairly common thing with RT where
> > we would stuff all the !rt threads of the various processes in a 'misc'
> > bucket.
> >
> > Similarly, it happens that we stuff the various rt threads of processes
> > in a specific (shared) 'rt' bucket.
> >
> > So I would certainly not like to exclude that setup.
>
> Unless you're using rt groups I'm not sure this one really changes.
> Whether the "misc" threads exist at the parent level or one below
> should not matter.

(with exclusive cpusets, a mask can exist at one and only one location)
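[Mike's parenthetical refers to the cpuset exclusivity rule: a CPU granted exclusively to one cpuset cannot also appear in a sibling's mask, so where in the hierarchy the shared bucket sits does matter. An illustrative v1 cpuset sketch under assumed paths; the exact error on the conflicting write may vary by kernel version.]

```shell
# Give the 'rt' cpuset exclusive ownership of CPUs 2-3.
mkdir -p /sys/fs/cgroup/cpuset/rt
echo 2-3 > /sys/fs/cgroup/cpuset/rt/cpuset.cpus
echo 1   > /sys/fs/cgroup/cpuset/rt/cpuset.cpu_exclusive

# A sibling cpuset can no longer claim those CPUs: the write below is
# expected to be rejected, because CPUs 2-3 are exclusively owned by
# the sibling 'rt' cpuset.
mkdir -p /sys/fs/cgroup/cpuset/other
echo 2-3 > /sys/fs/cgroup/cpuset/other/cpuset.cpus   # fails: overlap
```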
* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  From: Tejun Heo @ 2017-02-10 15:46 UTC
  To: Paul Turner
  Cc: lizefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
      Andy Lutomirski, Mike Galbraith, cgroups, LKML, kernel-team,
      lvenanci, Linus Torvalds, Andrew Morton

On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote:
> What are the motivations that you see for forcing this all onto one
> mount-point via .threads sub-tree tags?

So, you wanted rgroup but with /proc interface? I'm afraid it's too
late for that.

Thanks.

--
tejun