* [Documentation] State of CPU controller in cgroup v2
@ 2016-08-05 17:07 Tejun Heo
  2016-08-05 17:09 ` [PATCH 1/2] sched: Misc preps for cgroup unified hierarchy interface Tejun Heo
                   ` (3 more replies)
  0 siblings, 4 replies; 48+ messages in thread
From: Tejun Heo @ 2016-08-05 17:07 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Li Zefan, Johannes Weiner,
      Peter Zijlstra, Paul Turner, Mike Galbraith, Ingo Molnar
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
      linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg

Hello,

There have been several discussions around CPU controller support.
Unfortunately, no consensus was reached and cgroup v2 is sorely lacking
CPU controller support.  This document includes a summary of the
situation and the arguments, along with an interim solution for parties
who want to use the out-of-tree patches for CPU controller cgroup v2
support.  I'll post the two patches as replies for reference.

Thanks.


CPU Controller on Control Group v2

August, 2016		Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

While most controllers have support for cgroup v2 now, the CPU
controller support is not upstream yet due to objections from the
scheduler maintainers on the basic designs of cgroup v2.  This document
explains the current situation as well as an interim solution, and
details the disagreements and arguments.  The latest version of this
document can be found at the following URL.

 https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu


CONTENTS

1. Current Situation and Interim Solution
2. Disagreements and Arguments
  2-1. Contentious Restrictions
    2-1-1. Process Granularity
    2-1-2. No Internal Process Constraint
  2-2. Impact on CPU Controller
    2-2-1. Impact of Process Granularity
    2-2-2. Impact of No Internal Process Constraint
  2-3. Arguments for cgroup v2
3. Way Forward
4. References


1. Current Situation and Interim Solution

All objections from the scheduler maintainers apply to the cgroup v2
core design, and there are no known objections to the specifics of the
CPU controller cgroup v2 interface.  The only blocked part is the
changes to expose the CPU controller interface on cgroup v2, which
comprise the following two patches:

 [1] sched: Misc preps for cgroup unified hierarchy interface
 [2] sched: Implement interface for cgroup unified hierarchy

The necessary changes are superficial and implement the interface files
on cgroup v2.  The combined diffstat is as follows.

 kernel/sched/core.c    | 149 +++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/cpuacct.c |  57 ++++++++++++------
 kernel/sched/cpuacct.h |   5 +
 3 files changed, 189 insertions(+), 22 deletions(-)

The patches are easy to apply and forward-port.  The following git
branch will always carry the two patches on top of the latest release
of the upstream kernel.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/cgroup-v2-cpu

There also are versioned branches going back to v4.4.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/cgroup-v2-cpu-$KERNEL_VER

While it's difficult to tell whether the CPU controller support will be
merged, there are crucial resource control features in cgroup v2 which
are only possible due to the design choices that are being objected to,
and every effort will be made to ease enabling the CPU controller
cgroup v2 support out-of-tree for parties which choose to.

2. Disagreements and Arguments

There have been several lengthy discussion threads [3][4] on LKML
around the structural constraints of cgroup v2.  The two that affect
the CPU controller are process granularity and the no internal process
constraint.  Both arise primarily from the need for a common resource
domain definition across different resources.

The common resource domain is a powerful concept in cgroup v2 that
allows controllers to make basic assumptions about the structural
organization of processes and controllers inside the cgroup hierarchy,
and thus solve problems spanning multiple types of resources.  The
prime example for this is page cache writeback: dirty page cache is
regulated through throttling buffered writers based on memory
availability, and initiating batched writeouts to the disk based on IO
capacity.  Tracking and controlling writeback inside a cgroup thus
requires the direct cooperation of the memory and the IO controller.

This easily extends to other areas, such as CPU cycles consumed while
performing memory reclaim or IO encryption.


2-1. Contentious Restrictions

For controllers of different resources to work together, they must
agree on a common organization.  This uniform model across controllers
imposes two contentious restrictions on the CPU controller: process
granularity and the no-internal-process constraint.


2-1-1. Process Granularity

For memory, because an address space is shared between all threads of a
process, the terminal consumer is a process, not a thread.  Separating
the threads of a single process into different memory control domains
doesn't make semantic sense.  cgroup v2 ensures that all controllers
can agree on the same organization by requiring that threads of the
same process belong to the same cgroup.

There are other reasons to enforce process granularity.  One important
one is isolating system-level management operations from in-process
application operations.  The cgroup interface, being a virtual
filesystem, is ill-suited to multiple independent operations taking
place at the same time, as most operations have to be multi-step and
there is no way to synchronize multiple accessors.  (A user-space
sketch of such a multi-step operation follows the references at the
end of this document.)

See also [5] Documentation/cgroup-v2.txt, "R-2. Thread Granularity"


2-1-2. No Internal Process Constraint

cgroup v2 does not allow processes to belong to any cgroup which has
child cgroups when resource controllers are enabled on it (the notable
exception being the root cgroup itself).  This is because, for some
resources, a resource domain (cgroup) is not directly comparable to the
terminal consumer (process/task) of said resource, and so putting the
two into a sibling relationship isn't meaningful.

- Differing Control Parameters and Capabilities

  A cgroup controller has different resource control parameters and
  capabilities from a terminal consumer, be that a task or a process.
  There are a couple of cases where a cgroup control knob can be mapped
  to a per-task or per-process API, but they are exceptions, and the
  mappings aren't obvious even in those cases.

  For example, task priorities (also known as nice values) set through
  setpriority(2) are mapped to the CPU controller "cpu.shares" values.
  However, how exactly the two ranges map, and even the fact that they
  map to each other at all, are not obvious.

  The situation gets further muddled when considering other resource
  types and control knobs.
  IO priorities set through ioprio_set(2) cannot be mapped to IO
  controller weights, and most cgroup resource control knobs, including
  the bandwidth control knobs of the CPU controller, don't have
  counterparts in the terminal consumers.

- Anonymous Resource Consumption

  For CPU, every time slice consumed from inside a cgroup, which
  comprises most but not all of the CPU time consumed by the cgroup,
  can be clearly attributed to a specific task or process.  Because
  these two types of entities are directly comparable as consumers of
  CPU time, it's theoretically possible to mix tasks and cgroups on the
  same tree levels and let them directly compete for the time quota
  available to their common ancestor.

  However, the same can't be said for resource types like memory or IO:
  the memory consumed by the page cache, for example, can be tracked on
  a per-cgroup level, but, due to mismatches in the lifetimes of the
  involved objects (page cache can persist long after the processes are
  gone), shared usages and the implementation overhead of tracking
  persistent state, it can no longer be attributed to individual
  processes after instantiation.  Consequently, any IO incurred by page
  cache writeback can be attributed to a cgroup, but not to the
  individual consumers inside the cgroup.

For memory and IO, this makes a resource domain (cgroup) an object of a
fundamentally different type than a terminal consumer (process).  A
process can't be a first class object in the resource distribution
graph as its total resource consumption can't be described without the
containing resource domain.

Disallowing processes in internal cgroups avoids competition between
cgroups and processes, which cannot be meaningfully defined for these
resources.  All resource control takes place among cgroups, and a
terminal consumer interacts with the containing cgroup the same way it
would with the system without cgroups.

The root cgroup is exempt from this constraint, which is in line with
how the root cgroup is handled in general - it's excluded from cgroup
resource accounting and control.

Enforcing process granularity and the no internal process constraint
allows all controllers to be on the same footing in terms of the
resource distribution hierarchy.


2-2. Impact on CPU Controller

As indicated earlier, the CPU controller's resource distribution graph
is the simplest.  Every schedulable resource consumption can be
attributed to a specific task.  In addition, for weight based control,
the per-task priority set through setpriority(2) can be translated to
and from a per-cgroup weight.  As such, the CPU controller can treat a
task and a cgroup symmetrically, allowing support for any tree layout
of cgroups and tasks.  Both process granularity and the no internal
process constraint restrict how the CPU controller can be used.


2-2-1. Impact of Process Granularity

Process granularity prevents tasks belonging to the same process from
being assigned to different cgroups.  It was pointed out [6] that this
excludes the valid use case of hierarchical CPU distribution within
processes.

To address this issue, the rgroup (resource group) [7][8][9] interface,
an extension of the existing setpriority(2) API, was proposed.  It is
in line with other programmable priority mechanisms and eliminates the
risk of in-application configuration and system configuration stepping
on each other's toes.  Unfortunately, the proposal quickly turned into
discussions around cgroup v2 design decisions [4] and no consensus
could be reached.
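
As an aside, the nice-to-weight translation referred to in section 2-2
can be made concrete with a small user-space sketch.  This is an
editorial illustration, not part of the out-of-tree patches: the table
mirrors the scheduler's sched_prio_to_weight[] array in
kernel/sched/core.c, where nice 0 corresponds to weight 1024 and each
nice step changes the weight by roughly 25%.

 /*
  * Illustrative only: the per-task nice <-> CFS weight mapping.
  * Values mirror sched_prio_to_weight[] in kernel/sched/core.c.
  */
 #include <stdio.h>

 static const int nice_to_weight[40] = {
  /* -20 */ 88761, 71755, 56483, 46273, 36291,
  /* -15 */ 29154, 23254, 18705, 14949, 11916,
  /* -10 */  9548,  7620,  6100,  4904,  3906,
  /*  -5 */  3121,  2501,  1991,  1586,  1277,
  /*   0 */  1024,   820,   655,   526,   423,
  /*   5 */   335,   272,   215,   172,   137,
  /*  10 */   110,    87,    70,    56,    45,
  /*  15 */    36,    29,    23,    18,    15,
 };

 int main(void)
 {
 	for (int nice = -20; nice <= 19; nice++)
 		printf("nice %3d -> weight %5d\n",
 		       nice, nice_to_weight[nice + 20]);
 	return 0;
 }
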
2-2-2. Impact of No Internal Process Constraint

The no internal process constraint disallows tasks from competing
directly against cgroups.  Here is an excerpt from Peter Zijlstra
pointing out the issue [10] - R, L and A are cgroups; t1, t2, t3 and
t4 are tasks:

          R
        / | \
       t1 t2 A
            / \
           t3 t4

  Is fundamentally different from:

          R
         / \
        L   A
       / \ / \
      t1 t2 t3 t4

  Because if in the first hierarchy you add a task (t5) to R, all of
  its siblings will run at 1/4th of the total bandwidth where before
  they had 1/3rd, whereas with the second example, if you add our t5
  to L, A doesn't get any less bandwidth.

It is true that the trees are semantically different from each other
and the symmetric handling of tasks and cgroups is aesthetically
pleasing.  However, it isn't clear what the practical usefulness of a
layout with direct competition between tasks and cgroups would be,
considering that the number and behavior of tasks are controlled by
each application while cgroups primarily deal with system level
resource distribution; changes in the number of active threads would
directly impact resource distribution.  Real world use cases of such
layouts could not be established during the discussions.


2-3. Arguments for cgroup v2

There are strong demands for comprehensive hierarchical resource
control across all major resources, and establishing a common resource
hierarchy is an essential step.  As with most engineering decisions,
the common resource hierarchy definition comes with trade-offs.  With
cgroup v2, the trade-offs are in the form of structural constraints
which, among others, restrict the CPU controller's space of possible
configurations.

However, even with the restrictions, cgroup v2, in combination with
rgroup, covers most of the identified real world use cases while
enabling important new use cases of resource control across multiple
resource types that were fundamentally broken previously.

Furthermore, for resource control, treating resource domains as objects
of a different type from terminal consumers has important advantages -
it can account for resource consumption which is not tied to any
specific terminal consumer, be that a task or a process, and allows
decoupling resource distribution controls from in-application APIs.
Even the CPU controller may benefit from it, as the kernel can consume
a significant amount of CPU cycles in interrupt context or in tasks
shared across multiple resource domains (e.g. softirq).

Finally, it's important to note that enabling cgroup v2 support for the
CPU controller doesn't block use cases which require features that are
not available on cgroup v2.  Unlikely as it is, should anybody actually
rely on the CPU controller's symmetric handling of tasks and cgroups,
backward compatibility is and will be maintained: the controller can be
disconnected from the cgroup v2 hierarchy and used standalone.  This
also holds for cpuset, which is often used in highly customized
configurations that might be a poor fit for common resource domains.

The required changes are minimal, the benefits for the target use cases
are critical and obvious, and the use cases which have to use v1 can
continue to do so.


3. Way Forward

cgroup v2 primarily aims to solve the problem of comprehensive
hierarchical resource control across all major computing resources,
which is one of the core problems of modern server infrastructure
engineering.  The trade-offs that cgroup v2 took are the results of
pursuing that goal and gaining a better understanding of the nature of
resource control in the process.
I believe that real world usages will prove cgroup v2's model right,
considering the crucial pieces of comprehensive resource control that
cannot be implemented without common resource domains.  This is not to
say that cgroup v2 is set in stone and can't be updated; if there is an
approach which better serves both comprehensive resource control and
the CPU controller's flexibility, we will surely move towards that.  It
goes without saying that discussions around such an approach should
consider practical aspects of resource control as a whole rather than
focusing solely on a particular controller.

Until such a consensus can be reached, the CPU controller cgroup v2
support will be maintained out of the mainline kernel in an easily
accessible form.  If there is anything cgroup developers can do to ease
the pain, please feel free to contact us on the cgroup mailing list at
cgroups-u79uwXL29TaiAVqoAR/hOA@public.gmane.org.


4. References

[1]  http://lkml.kernel.org/r/20160105164834.GE5995-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org
     [PATCH 1/2] sched: Misc preps for cgroup unified hierarchy interface
     Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

[2]  http://lkml.kernel.org/r/20160105164852.GF5995-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org
     [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
     Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

[3]  http://lkml.kernel.org/r/1438641689-14655-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
     [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy
     Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

[4]  http://lkml.kernel.org/r/20160407064549.GH3430-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org
     Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
     Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>

[5]  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroup-v2.txt
     Control Group v2
     Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

[6]  http://lkml.kernel.org/r/CAPM31RJNy3jgG=DYe6GO=wyL4BPPxwUm1f2S6YXacQmo7viFZA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org
     Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy
     Paul Turner <pjt-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

[7]  http://lkml.kernel.org/r/20160105154503.GC5995-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org
     [RFD] cgroup: thread granularity support for cpu controller
     Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

[8]  http://lkml.kernel.org/r/1457710888-31182-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
     [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
     Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

[9]  http://lkml.kernel.org/r/20160311160522.GA24046-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org
     Example program for PRIO_RGRP
     Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

[10] http://lkml.kernel.org/r/20160407082810.GN3430-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org
     Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource
     Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>

^ permalink raw reply	[flat|nested] 48+ messages in thread
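
Referring back to section 2-1-1's point about the vfs interface being
multi-step and unsynchronized, here is a minimal user-space sketch.
This is an editorial illustration, not part of the original document;
the cgroup path and TID below are made up for the example, and each of
the three steps can race against another manager reorganizing the
hierarchy in between.

 #include <fcntl.h>
 #include <stdio.h>
 #include <sys/types.h>
 #include <unistd.h>

 /* move one thread into a v1 cgroup by writing its TID to "tasks" */
 static int move_tid(const char *cgroup_path, pid_t tid)
 {
 	char path[256], buf[32];
 	int fd, len;

 	snprintf(path, sizeof(path), "%s/tasks", cgroup_path);
 	fd = open(path, O_WRONLY);      /* step 1: cgroup may vanish here */
 	if (fd < 0)
 		return -1;
 	len = snprintf(buf, sizeof(buf), "%d", tid);
 	if (write(fd, buf, len) < 0) {  /* step 2: or be reorganized here */
 		close(fd);
 		return -1;
 	}
 	return close(fd);               /* step 3 */
 }

 int main(void)
 {
 	/* hypothetical path and TID, purely for illustration */
 	return move_tid("/sys/fs/cgroup/cpu/pool/account-1234", 4321) ? 1 : 0;
 }
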
* [PATCH 1/2] sched: Misc preps for cgroup unified hierarchy interface
  2016-08-05 17:07 [Documentation] State of CPU controller in cgroup v2 Tejun Heo
@ 2016-08-05 17:09 ` Tejun Heo
  2016-08-05 17:09 ` [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy Tejun Heo
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2016-08-05 17:09 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Li Zefan, Johannes Weiner,
      Peter Zijlstra, Paul Turner, Mike Galbraith, Ingo Molnar
  Cc: linux-kernel, cgroups, linux-api, kernel-team

From 0d966df508ef4d6c0b1baae9e369f4fb0d3e10af Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Fri, 11 Mar 2016 07:31:23 -0500

Make the following changes in preparation for the cpu controller
interface implementation for the unified hierarchy.  This patch
doesn't cause any functional differences.

* s/cpu_stats_show()/cpu_cfs_stats_show()/

* s/cpu_files/cpu_legacy_files/

* Separate out cpuacct_stats_read() from cpuacct_stats_show().  While
  at it, remove the pointless cpuacct_stat_desc[] array.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/core.c    |  8 ++++----
 kernel/sched/cpuacct.c | 33 +++++++++++++++------------------
 2 files changed, 19 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 97ee9ac..c148dfe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8482,7 +8482,7 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
 	return ret;
 }
 
-static int cpu_stats_show(struct seq_file *sf, void *v)
+static int cpu_cfs_stats_show(struct seq_file *sf, void *v)
 {
 	struct task_group *tg = css_tg(seq_css(sf));
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
@@ -8522,7 +8522,7 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
-static struct cftype cpu_files[] = {
+static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
 		.name = "shares",
@@ -8543,7 +8543,7 @@ static struct cftype cpu_files[] = {
 	},
 	{
 		.name = "stat",
-		.seq_show = cpu_stats_show,
+		.seq_show = cpu_cfs_stats_show,
 	},
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -8568,7 +8568,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
-	.legacy_cftypes	= cpu_files,
+	.legacy_cftypes	= cpu_legacy_files,
 	.early_init	= true,
 };
 
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 41f85c4..3eb9eda 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -242,36 +242,33 @@ static int cpuacct_percpu_seq_show(struct seq_file *m, void *V)
 	return __cpuacct_percpu_seq_show(m, CPUACCT_USAGE_NRUSAGE);
 }
 
-static const char * const cpuacct_stat_desc[] = {
-	[CPUACCT_STAT_USER] = "user",
-	[CPUACCT_STAT_SYSTEM] = "system",
-};
-
-static int cpuacct_stats_show(struct seq_file *sf, void *v)
+static void cpuacct_stats_read(struct cpuacct *ca, u64 *userp, u64 *sysp)
 {
-	struct cpuacct *ca = css_ca(seq_css(sf));
 	int cpu;
-	s64 val = 0;
 
+	*userp = 0;
 	for_each_possible_cpu(cpu) {
 		struct kernel_cpustat *kcpustat = per_cpu_ptr(ca->cpustat, cpu);
-		val += kcpustat->cpustat[CPUTIME_USER];
-		val += kcpustat->cpustat[CPUTIME_NICE];
+		*userp += kcpustat->cpustat[CPUTIME_USER];
+		*userp += kcpustat->cpustat[CPUTIME_NICE];
 	}
-	val = cputime64_to_clock_t(val);
-	seq_printf(sf, "%s %lld\n", cpuacct_stat_desc[CPUACCT_STAT_USER], val);
 
-	val = 0;
+	*sysp = 0;
 	for_each_possible_cpu(cpu) {
 		struct kernel_cpustat *kcpustat = per_cpu_ptr(ca->cpustat, cpu);
-		val += kcpustat->cpustat[CPUTIME_SYSTEM];
-		val += kcpustat->cpustat[CPUTIME_IRQ];
-		val += kcpustat->cpustat[CPUTIME_SOFTIRQ];
+		*sysp += kcpustat->cpustat[CPUTIME_SYSTEM];
+		*sysp += kcpustat->cpustat[CPUTIME_IRQ];
+		*sysp += kcpustat->cpustat[CPUTIME_SOFTIRQ];
 	}
+}
 
-	val = cputime64_to_clock_t(val);
-	seq_printf(sf, "%s %lld\n", cpuacct_stat_desc[CPUACCT_STAT_SYSTEM], val);
+static int cpuacct_stats_show(struct seq_file *sf, void *v)
+{
+	cputime64_t user, sys;
+
+	cpuacct_stats_read(css_ca(seq_css(sf)), &user, &sys);
+
+	seq_printf(sf, "user %lld\n", cputime64_to_clock_t(user));
+	seq_printf(sf, "system %lld\n", cputime64_to_clock_t(sys));
 	return 0;
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread
* [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
  2016-08-05 17:07 [Documentation] State of CPU controller in cgroup v2 Tejun Heo
  2016-08-05 17:09 ` [PATCH 1/2] sched: Misc preps for cgroup unified hierarchy interface Tejun Heo
@ 2016-08-05 17:09 ` Tejun Heo
  [not found] ` <20160805170752.GK2542-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  2016-08-17 20:18 ` Andy Lutomirski
  3 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2016-08-05 17:09 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Li Zefan, Johannes Weiner,
      Peter Zijlstra, Paul Turner, Mike Galbraith, Ingo Molnar
  Cc: linux-kernel, cgroups, linux-api, kernel-team

From ed6d93036ec930cb774da10b7c87f67905ce71f1 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Fri, 11 Mar 2016 07:31:23 -0500

While the cpu controller doesn't have any functional problems, there
are a couple of interface issues which can be addressed in the v2
interface.

* cpuacct being a separate controller.  This separation is artificial
  and rather pointless as demonstrated by most use cases co-mounting
  the two controllers.  It also forces certain information to be
  accounted twice.

* Use of different time units.  Writable control knobs use
  microseconds, some stat fields use nanoseconds while other cpuacct
  stat fields use centiseconds.

* Control knobs which can't be used in the root cgroup still show up
  in the root.

* Control knob names and semantics aren't consistent with other
  controllers.

This patchset implements the cpu controller's interface on the unified
hierarchy, which adheres to the controller file conventions described
in Documentation/cgroups/unified-hierarchy.txt.  Overall, the following
changes are made.

* cpuacct is implicitly enabled and disabled by cpu, and its
  information is reported through "cpu.stat" which now uses
  microseconds for all time durations.  All time duration fields now
  have "_usec" appended to them for clarity.  While this doesn't solve
  the double accounting immediately, once the majority of users switch
  to v2, cpu can directly account and report the relevant stats and
  cpuacct can be disabled on the unified hierarchy.

  Note that cpuacct.usage_percpu is currently not included in
  "cpu.stat".  If this information is actually called for, it can be
  added later.

* "cpu.shares" is replaced with "cpu.weight" and operates on the
  standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
  The weight is scaled to the scheduler weight so that 100 maps to 1024
  and the ratio relationship is preserved - if the weight is W and its
  scaled value is S, W / 100 == S / 1024.  While the mapped range is a
  bit smaller than the original scheduler weight range, the dead zones
  on both sides are relatively small and cover a wider range than the
  nice value mappings.  This file doesn't make sense in the root cgroup
  and isn't created on root.

* "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max"
  which contains both quota and period.

* "cpu.rt_runtime_us" and "cpu.rt_period_us" are replaced by
  "cpu.rt.max" which contains both runtime and period.

v2: cpu_stats_show() was incorrectly using CONFIG_FAIR_GROUP_SCHED for
    CFS bandwidth stats and also using raw division for u64.  Use
    CONFIG_CFS_BANDWIDTH and do_div() instead.

    The semantics of "cpu.rt.max" is not fully decided yet.  Dropped
    for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/core.c    | 141 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/cpuacct.c |  24 +++++++++
 kernel/sched/cpuacct.h |   5 ++
 3 files changed, 170 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c148dfe..7bba2c5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8561,6 +8561,139 @@ static struct cftype cpu_legacy_files[] = {
 	{ }	/* terminate */
 };
 
+static int cpu_stats_show(struct seq_file *sf, void *v)
+{
+	cpuacct_cpu_stats_show(sf);
+
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		struct task_group *tg = css_tg(seq_css(sf));
+		struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+		u64 throttled_usec;
+
+		throttled_usec = cfs_b->throttled_time;
+		do_div(throttled_usec, NSEC_PER_USEC);
+
+		seq_printf(sf, "nr_periods %d\n"
+			   "nr_throttled %d\n"
+			   "throttled_usec %llu\n",
+			   cfs_b->nr_periods, cfs_b->nr_throttled,
+			   throttled_usec);
+	}
+#endif
+	return 0;
+}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css,
+			       struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+	u64 weight = scale_load_down(tg->shares);
+
+	return DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024);
+}
+
+static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
+				struct cftype *cftype, u64 weight)
+{
+	/*
+	 * cgroup weight knobs should use the common MIN, DFL and MAX
+	 * values which are 1, 100 and 10000 respectively.  While it loses
+	 * a bit of range on both ends, it maps pretty well onto the shares
+	 * value used by scheduler and the round-trip conversions preserve
+	 * the original value over the entire range.
+	 */
+	if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX)
+		return -ERANGE;
+
+	weight = DIV_ROUND_CLOSEST_ULL(weight * 1024, CGROUP_WEIGHT_DFL);
+
+	return sched_group_set_shares(css_tg(css), scale_load(weight));
+}
+#endif
+
+static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
+						  long period, long quota)
+{
+	if (quota < 0)
+		seq_puts(sf, "max");
+	else
+		seq_printf(sf, "%ld", quota);
+
+	seq_printf(sf, " %ld\n", period);
+}
+
+/* caller should put the current value in *@periodp before calling */
+static int __maybe_unused cpu_period_quota_parse(char *buf,
+						 u64 *periodp, u64 *quotap)
+{
+	char tok[21];	/* U64_MAX */
+
+	if (!sscanf(buf, "%s %llu", tok, periodp))
+		return -EINVAL;
+
+	*periodp *= NSEC_PER_USEC;
+
+	if (sscanf(tok, "%llu", quotap))
+		*quotap *= NSEC_PER_USEC;
+	else if (!strcmp(tok, "max"))
+		*quotap = RUNTIME_INF;
+	else
+		return -EINVAL;
+
+	return 0;
+}
+
+#ifdef CONFIG_CFS_BANDWIDTH
+static int cpu_max_show(struct seq_file *sf, void *v)
+{
+	struct task_group *tg = css_tg(seq_css(sf));
+
+	cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
+	return 0;
+}
+
+static ssize_t cpu_max_write(struct kernfs_open_file *of,
+			     char *buf, size_t nbytes, loff_t off)
+{
+	struct task_group *tg = css_tg(of_css(of));
+	u64 period = tg_get_cfs_period(tg);
+	u64 quota;
+	int ret;
+
+	ret = cpu_period_quota_parse(buf, &period, &quota);
+	if (!ret)
+		ret = tg_set_cfs_bandwidth(tg, period, quota);
+	return ret ?: nbytes;
+}
+#endif
+
+static struct cftype cpu_files[] = {
+	{
+		.name = "stat",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_stats_show,
+	},
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	{
+		.name = "weight",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_weight_read_u64,
+		.write_u64 = cpu_weight_write_u64,
+	},
+#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.name = "max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_max_show,
+		.write = cpu_max_write,
+	},
+#endif
+	{ }	/* terminate */
+};
+
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
 	.css_released	= cpu_cgroup_css_released,
@@ -8569,7 +8702,15 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
 	.legacy_cftypes	= cpu_legacy_files,
+	.dfl_cftypes	= cpu_files,
 	.early_init	= true,
+#ifdef CONFIG_CGROUP_CPUACCT
+	/*
+	 * cpuacct is enabled together with cpu on the unified hierarchy
+	 * and its stats are reported through "cpu.stat".
+	 */
+	.depends_on	= 1 << cpuacct_cgrp_id,
+#endif
 };
 
 #endif	/* CONFIG_CGROUP_SCHED */
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 3eb9eda..7a02d26 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -305,6 +305,30 @@ static struct cftype files[] = {
 	{ }	/* terminate */
 };
 
+/* used to print cpuacct stats in cpu.stat on the unified hierarchy */
+void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+	struct cgroup_subsys_state *css;
+	u64 usage, user, sys;
+
+	css = cgroup_get_e_css(seq_css(sf)->cgroup, &cpuacct_cgrp_subsys);
+
+	usage = cpuusage_read(css, seq_cft(sf));
+	cpuacct_stats_read(css_ca(css), &user, &sys);
+
+	user *= TICK_NSEC;
+	sys *= TICK_NSEC;
+	do_div(usage, NSEC_PER_USEC);
+	do_div(user, NSEC_PER_USEC);
+	do_div(sys, NSEC_PER_USEC);
+
+	seq_printf(sf, "usage_usec %llu\n"
+		   "user_usec %llu\n"
+		   "system_usec %llu\n", usage, user, sys);
+
+	css_put(css);
+}
+
 /*
  * charge this task's execution time to its accounting group.
  *
diff --git a/kernel/sched/cpuacct.h b/kernel/sched/cpuacct.h
index ba72807..ddf7af4 100644
--- a/kernel/sched/cpuacct.h
+++ b/kernel/sched/cpuacct.h
@@ -2,6 +2,7 @@
 
 extern void cpuacct_charge(struct task_struct *tsk, u64 cputime);
 extern void cpuacct_account_field(struct task_struct *tsk, int index, u64 val);
+extern void cpuacct_cpu_stats_show(struct seq_file *sf);
 
 #else
 
@@ -14,4 +15,8 @@ cpuacct_account_field(struct task_struct *tsk, int index, u64 val)
 {
 }
 
+static inline void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+}
+
 #endif
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread
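
To illustrate the weight mapping described in the changelog above
(weight W maps to shares S with W / 100 == S / 1024), here is a small
user-space sketch that checks the round-trip claim from the patch's
comment using the same round-closest division.  This is an editorial
illustration, not part of the patch.

 #include <stdio.h>

 #define CGROUP_WEIGHT_MIN 1ULL
 #define CGROUP_WEIGHT_DFL 100ULL
 #define CGROUP_WEIGHT_MAX 10000ULL

 /* same rounding as the kernel's DIV_ROUND_CLOSEST_ULL() */
 static unsigned long long div_round_closest(unsigned long long x,
 					     unsigned long long d)
 {
 	return (x + d / 2) / d;
 }

 int main(void)
 {
 	for (unsigned long long w = CGROUP_WEIGHT_MIN;
 	     w <= CGROUP_WEIGHT_MAX; w++) {
 		/* cpu.weight -> scheduler shares, then back again */
 		unsigned long long s = div_round_closest(w * 1024,
 							 CGROUP_WEIGHT_DFL);
 		unsigned long long back = div_round_closest(s * CGROUP_WEIGHT_DFL,
 							    1024);

 		if (back != w) {
 			printf("round trip broke at weight %llu\n", w);
 			return 1;
 		}
 	}
 	printf("weight 1..10000 <-> shares: round trip preserved\n");
 	return 0;
 }
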
* Re: [Documentation] State of CPU controller in cgroup v2
  [not found] ` <20160805170752.GK2542-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2016-08-06  9:04 ` Mike Galbraith
  [not found] ` <1470474291.4117.243.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Mike Galbraith @ 2016-08-06 9:04 UTC (permalink / raw)
  To: Tejun Heo, Linus Torvalds, Andrew Morton, Li Zefan,
      Johannes Weiner, Peter Zijlstra, Paul Turner, Ingo Molnar
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
      linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg

On Fri, 2016-08-05 at 13:07 -0400, Tejun Heo wrote:

> 2-2. Impact on CPU Controller
> 
> As indicated earlier, the CPU controller's resource distribution graph
> is the simplest.  Every schedulable resource consumption can be
> attributed to a specific task.  In addition, for weight based control,
> the per-task priority set through setpriority(2) can be translated to
> and from a per-cgroup weight.  As such, the CPU controller can treat a
> task and a cgroup symmetrically, allowing support for any tree layout
> of cgroups and tasks.  Both process granularity and the no internal
> process constraint restrict how the CPU controller can be used.

Not only the cpu controller, but also cpuacct and cpuset.

> 2-2-1. Impact of Process Granularity
> 
> Process granularity prevents tasks belonging to the same process from
> being assigned to different cgroups.  It was pointed out [6] that this
> excludes the valid use case of hierarchical CPU distribution within
> processes.

Does that not obsolete the rather useful/common concept "thread pool"?

> 2-2-2. Impact of No Internal Process Constraint
> 
> The no internal process constraint disallows tasks from competing
> directly against cgroups.  Here is an excerpt from Peter Zijlstra
> pointing out the issue [10] - R, L and A are cgroups; t1, t2, t3 and
> t4 are tasks:
> 
>           R
>         / | \
>        t1 t2 A
>             / \
>            t3 t4
> 
>   Is fundamentally different from:
> 
>           R
>          / \
>         L   A
>        / \ / \
>       t1 t2 t3 t4
> 
>   Because if in the first hierarchy you add a task (t5) to R, all of
>   its siblings will run at 1/4th of the total bandwidth where before
>   they had 1/3rd, whereas with the second example, if you add our t5
>   to L, A doesn't get any less bandwidth.
> 
> It is true that the trees are semantically different from each other
> and the symmetric handling of tasks and cgroups is aesthetically
> pleasing.  However, it isn't clear what the practical usefulness of a
> layout with direct competition between tasks and cgroups would be,
> considering that the number and behavior of tasks are controlled by
> each application while cgroups primarily deal with system level
> resource distribution; changes in the number of active threads would
> directly impact resource distribution.  Real world use cases of such
> layouts could not be established during the discussions.

You apparently intend to ignore any real world usages that don't work
with these new constraints.  Priority and affinity are not process wide
attributes, never have been, but you're insisting that they must become
so for the sake of progress.

I mentioned a real world case of a thread pool servicing customer
accounts by doing something quite sane: hop into an account (cgroup),
do work therein, send bean count off to the $$ department, wash, rinse,
repeat.  That's real world users making real world cash registers go
ka-ching so real world people can pay their real world bills.
I also mentioned breakage to cpusets: given exclusive set A and
exclusive subset B therein, there is one and only one spot where
affinity A exists... at the soon-to-be-forbidden junction of A and B.

As with the thread pool, process granularity makes it impossible for
any threaded application affinity to be managed via cpusets, such as
say stuffing realtime critical threads into a shielded cpuset, mundane
threads into another.  There are any number of affinity usages that
will break.

Try as I may, I can't see anything progressive about enforcing process
granularity of per thread attributes.  I do see regression potential
for users of these controllers, and no viable means to even report them
as being such.  It will likely be systemd flipping v2's on switch, not
the kernel, not the user.  Regression reports would thus presumably be
deflected to... those who want this.  Sweet.

	-Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread
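
For context on the affinity point in the message above: CPU affinity on
Linux is indeed set per thread.  A minimal sketch using the standard
pthread and sched_setaffinity(2) APIs follows; this is an editorial
illustration of the mechanism being discussed, not code from the
thread.

 #define _GNU_SOURCE
 #include <pthread.h>
 #include <sched.h>
 #include <stdio.h>

 /* each worker pins itself to a different CPU */
 static void *worker(void *arg)
 {
 	int cpu = (int)(long)arg;
 	cpu_set_t set;

 	CPU_ZERO(&set);
 	CPU_SET(cpu, &set);
 	/* pid 0 means the calling thread, not the whole process */
 	if (sched_setaffinity(0, sizeof(set), &set))
 		perror("sched_setaffinity");
 	else
 		printf("thread pinned to cpu %d\n", cpu);
 	return NULL;
 }

 int main(void)
 {
 	pthread_t t[2];

 	for (long i = 0; i < 2; i++)
 		pthread_create(&t[i], NULL, worker, (void *)i);
 	for (int i = 0; i < 2; i++)
 		pthread_join(t[i], NULL);
 	return 0;
 }
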
* Re: [Documentation] State of CPU controller in cgroup v2
  [not found] ` <1470474291.4117.243.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2016-08-10 22:09 ` Johannes Weiner
  [not found] ` <20160810220944.GB3085-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Johannes Weiner @ 2016-08-10 22:09 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Tejun Heo, Linus Torvalds, Andrew Morton, Li Zefan,
      Peter Zijlstra, Paul Turner, Ingo Molnar,
      linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
      linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg

On Sat, Aug 06, 2016 at 11:04:51AM +0200, Mike Galbraith wrote:
> On Fri, 2016-08-05 at 13:07 -0400, Tejun Heo wrote:
> > It is true that the trees are semantically different from each other
> > and the symmetric handling of tasks and cgroups is aesthetically
> > pleasing.  However, it isn't clear what the practical usefulness of a
> > layout with direct competition between tasks and cgroups would be,
> > considering that the number and behavior of tasks are controlled by
> > each application while cgroups primarily deal with system level
> > resource distribution; changes in the number of active threads would
> > directly impact resource distribution.  Real world use cases of such
> > layouts could not be established during the discussions.
> 
> You apparently intend to ignore any real world usages that don't work
> with these new constraints.

He didn't ignore these use cases.  He offered alternatives like rgroup
to allow manipulating threads from within the application, only in a
way that does not interfere with cgroup2's common controller model.

The complete lack of cohesiveness between v1 controllers prevents us
from solving even the most fundamental resource control problems that
cloud fleets like Google's and Facebook's are facing, such as
controlling buffered IO; attributing CPU cycles spent receiving
packets, reclaiming memory in kswapd, encrypting the disk; attributing
swap IO etc.  That's why cgroup2 runs a tighter ship when it comes to
the controllers: to make something much bigger work.

Agreeing on something - in this case a common controller model - is
necessarily going to take away some flexibility from how you approach
a problem.  What matters is whether the problem can still be solved.

This argument that cgroup2 is not backward compatible is laughable.
Of course it's going to be different, otherwise we wouldn't have had
to version it.  The question is not whether the exact same
configurations and existing application design can be used in v1 and
v2 - that's a strange onus to put on a versioned interface.  The
question is whether you can translate a solution from v1 to v2.  Yeah,
it might be a hassle depending on how specialized your setup is, but
that's why we keep v1 around until the last user dies and allow you to
freely mix and match v1 and v2 controllers within a single system to
ease the transition.

But this distinction between approach and application design on the
one hand, and the application's actual purpose on the other, is
crucial.  Every time this discussion came up, somebody says 'moving
worker threads between different resource domains'.  That's not a
goal, though, that's a very specific means to an end, with no
explanation of why it has to be done that way.  When comparing the
cgroup v1 and v2 interfaces, we should be discussing goals, not 'this
is my favorite way to do it'.  If you have an actual real-world goal
that can be accomplished in v1 but not in v2 + rgroup, then that's
what we should be talking about.
Lastly, again - and this was the whole point of this document - the
changes in cgroup2 are not gratuitous.  They are driven by fundamental
resource control problems faced by more comprehensive applications of
cgroup.  On the other hand, the opposition here mainly seems to be the
inconvenience of switching some specialized setups from a v1-oriented
way of solving a problem to a v2-oriented way.

[ That, and a disturbing number of emotional outbursts against
  systemd, which has nothing to do with any of this. ]

It's a really myopic line of argument.

That being said, let's go through your points:

> Priority and affinity are not process wide attributes, never have
> been, but you're insisting that they must become so for the sake of
> progress.

Not really.

It's just questionable whether the cgroup interface is the best way to
manipulate these attributes, or whether existing interfaces like
setpriority() and sched_setaffinity() should be extended to manipulate
groups, like the rgroup proposal does.  The problems of using the
cgroup interface for this are extensively documented, including in the
email you were replying to.

> I mentioned a real world case of a thread pool servicing customer
> accounts by doing something quite sane: hop into an account (cgroup),
> do work therein, send bean count off to the $$ department, wash, rinse,
> repeat.  That's real world users making real world cash registers go
> ka-ching so real world people can pay their real world bills.

Sure, but you're implying that this is the only way to run this real
world cash register.  I think it's entirely justified to re-evaluate
this, given the myriad of much more fundamental problems that cgroup2
is solving by building on a common controller model.

I'm not going down the rabbit hole again of arguing against an
incomplete case description.  Scale matters.  The number of workers
matters.  The amount of work each thread does matters when evaluating
transaction overhead.  Task migration is an expensive operation etc.

> I also mentioned breakage to cpusets: given exclusive set A and
> exclusive subset B therein, there is one and only one spot where
> affinity A exists... at the soon-to-be-forbidden junction of A and B.

Again, a means to an end rather than a goal - and a particularly
suspicious one at that: why would a cgroup need to tell its *siblings*
which cpus/nodes it cannot use?  In the hierarchical model, it's
clearly the task of the ancestor to allocate the resources downward.

More details would be needed to properly discuss what we are trying to
accomplish here.

> As with the thread pool, process granularity makes it impossible for
> any threaded application affinity to be managed via cpusets, such as
> say stuffing realtime critical threads into a shielded cpuset, mundane
> threads into another.  There are any number of affinity usages that
> will break.

Ditto.  It's not obvious why this needs to be the cgroup interface and
couldn't instead be solved with extending sched_setaffinity() - again
weighing that against the power of the common controller model that
could be preserved this way.

> Try as I may, I can't see anything progressive about enforcing process
> granularity of per thread attributes.  I do see regression potential
> for users of these controllers,

I could understand not being entirely happy about the trade-offs if you
look at this from the perspective of a single controller in the entire
resource control subsystem.  But not seeing anything progressive in a
common controller model?  Have you read anything we have been writing?
> and no viable means to even report them as being such.  It will
> likely be systemd flipping v2's on switch, not the kernel, not the
> user.  Regression reports would thus presumably be deflected
> to... those who want this.  Sweet.

There it is...

^ permalink raw reply	[flat|nested] 48+ messages in thread
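
For context on the rgroup alternative raised in the message above: the
proposal extended setpriority(2) rather than the cgroup vfs.  The
sketch below is purely hypothetical and editorial; PRIO_RGRP is an
illustrative stand-in that never landed upstream, and the actual
constants and semantics in the RFC [8] may differ.  On a real kernel
the call below simply fails with EINVAL.

 #include <sys/resource.h>
 #include <stdio.h>

 /* illustrative value only, NOT a real Linux constant */
 #define PRIO_RGRP 3

 int main(void)
 {
 	/*
 	 * Hypothetically: renice a whole group of the caller's worker
 	 * threads from inside the application, with no cgroup vfs access.
 	 */
 	if (setpriority(PRIO_RGRP, 0, 10))
 		perror("setpriority(PRIO_RGRP) (expected: not a real API)");
 	return 0;
 }
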
* Re: [Documentation] State of CPU controller in cgroup v2
  [not found] ` <20160810220944.GB3085-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2016-08-11  6:25 ` Mike Galbraith
  [not found] ` <1470896706.4116.146.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2016-08-16 14:07 ` Peter Zijlstra
  1 sibling, 1 reply; 48+ messages in thread
From: Mike Galbraith @ 2016-08-11 6:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Linus Torvalds, Andrew Morton, Li Zefan,
      Peter Zijlstra, Paul Turner, Ingo Molnar,
      linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
      linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg

On Wed, 2016-08-10 at 18:09 -0400, Johannes Weiner wrote:
> The complete lack of cohesiveness between v1 controllers prevents us
> from solving even the most fundamental resource control problems that
> cloud fleets like Google's and Facebook's are facing, such as
> controlling buffered IO; attributing CPU cycles spent receiving
> packets, reclaiming memory in kswapd, encrypting the disk; attributing
> swap IO etc.  That's why cgroup2 runs a tighter ship when it comes to
> the controllers: to make something much bigger work.

Where is the gun-wielding thug forcing people to place tasks where v2
now explicitly forbids them?

> Agreeing on something - in this case a common controller model - is
> necessarily going to take away some flexibility from how you approach
> a problem.  What matters is whether the problem can still be solved.

What annoys me about this more than the seemingly gratuitous breakage
is that the decision is passed to third parties who have nothing to
lose, and have done quite a bit of breaking lately.

> This argument that cgroup2 is not backward compatible is laughable.

Fine, you're entitled to your sense of humor.  I have one too; I find
it laughable that threaded applications can only sit there like a lump
of mud simply because they share more than applications written as a
gaggle of tasks.  "Threads are like.. so yesterday, the future belongs
to the process" tickles my funny-bone.  Whatever, to each his own.

...

> Lastly, again - and this was the whole point of this document - the
> changes in cgroup2 are not gratuitous.  They are driven by fundamental
> resource control problems faced by more comprehensive applications of
> cgroup.  On the other hand, the opposition here mainly seems to be the
> inconvenience of switching some specialized setups from a v1-oriented
> way of solving a problem to a v2-oriented way.
> 
> [ That, and a disturbing number of emotional outbursts against
>   systemd, which has nothing to do with any of this. ]
> 
> It's a really myopic line of argument.

And I think the myopia is on the other side of my monitor, whatever.

> That being said, let's go through your points:
> 
> > Priority and affinity are not process wide attributes, never have
> > been, but you're insisting that they must become so for the sake of
> > progress.
> 
> Not really.
> 
> It's just questionable whether the cgroup interface is the best way to
> manipulate these attributes, or whether existing interfaces like
> setpriority() and sched_setaffinity() should be extended to manipulate
> groups, like the rgroup proposal does.  The problems of using the
> cgroup interface for this are extensively documented, including in the
> email you were replying to.
> > I mentioned a real world case of a thread pool servicing customer
> > accounts by doing something quite sane: hop into an account (cgroup),
> > do work therein, send bean count off to the $$ department, wash, rinse,
> > repeat.  That's real world users making real world cash registers go
> > ka-ching so real world people can pay their real world bills.
> 
> Sure, but you're implying that this is the only way to run this real
> world cash register.

I implied no such thing.  Of course it can be done differently, all
they have to do is rip out these archaic thread thingies.

Apologies for dripping sarcasm all over your monitor, but this annoys
me far more than it should any casual user of cgroups.  Perhaps I
shouldn't care about the users (suse customers) who will step in this
eventually, but I do.

> I'm not going down the rabbit hole again of arguing against an
> incomplete case description.  Scale matters.  The number of workers
> matters.  The amount of work each thread does matters when evaluating
> transaction overhead.  Task migration is an expensive operation etc.
> 
> > I also mentioned breakage to cpusets: given exclusive set A and
> > exclusive subset B therein, there is one and only one spot where
> > affinity A exists... at the soon-to-be-forbidden junction of A and B.
> 
> Again, a means to an end rather than a goal

I don't believe I described a means to an end, I believe I described
affinity bits going missing.

> - and a particularly
> suspicious one at that: why would a cgroup need to tell its *siblings*
> which cpus/nodes it cannot use?  In the hierarchical model, it's
> clearly the task of the ancestor to allocate the resources downward.
> 
> More details would be needed to properly discuss what we are trying to
> accomplish here.
> 
> > As with the thread pool, process granularity makes it impossible for
> > any threaded application affinity to be managed via cpusets, such as
> > say stuffing realtime critical threads into a shielded cpuset, mundane
> > threads into another.  There are any number of affinity usages that
> > will break.
> 
> Ditto.  It's not obvious why this needs to be the cgroup interface and
> couldn't instead be solved with extending sched_setaffinity() - again
> weighing that against the power of the common controller model that
> could be preserved this way.

Wow.  Well sure, anything that becomes broken can be replaced by
something else.  Hell, people can just stop using cgroups entirely, and
the way issues become non-issues with the wave of a hand makes me
suspect that some users are going to be forced to do just that.

	-Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [Documentation] State of CPU controller in cgroup v2
  [not found] ` <1470896706.4116.146.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2016-08-12 22:17 ` Johannes Weiner
  [not found] ` <20160812221742.GA24736-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  0 siblings, 1 reply; 48+ messages in thread
From: Johannes Weiner @ 2016-08-12 22:17 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Tejun Heo, Linus Torvalds, Andrew Morton, Li Zefan,
      Peter Zijlstra, Paul Turner, Ingo Molnar,
      linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
      linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg

On Thu, Aug 11, 2016 at 08:25:06AM +0200, Mike Galbraith wrote:
> On Wed, 2016-08-10 at 18:09 -0400, Johannes Weiner wrote:
> > The complete lack of cohesiveness between v1 controllers prevents us
> > from solving even the most fundamental resource control problems that
> > cloud fleets like Google's and Facebook's are facing, such as
> > controlling buffered IO; attributing CPU cycles spent receiving
> > packets, reclaiming memory in kswapd, encrypting the disk; attributing
> > swap IO etc.  That's why cgroup2 runs a tighter ship when it comes to
> > the controllers: to make something much bigger work.
> 
> Where is the gun-wielding thug forcing people to place tasks where v2
> now explicitly forbids them?

The problems with supporting this are well-documented.  Please see R-2
in Documentation/cgroup-v2.txt.

> > Agreeing on something - in this case a common controller model - is
> > necessarily going to take away some flexibility from how you approach
> > a problem.  What matters is whether the problem can still be solved.
> 
> What annoys me about this more than the seemingly gratuitous breakage
> is that the decision is passed to third parties who have nothing to
> lose, and have done quite a bit of breaking lately.

Mike, there is no connection between what you are quoting and what you
are replying to here.  We cannot have a technical discussion when you
enter it with your mind fully made up, repeat the same inflammatory
talking points over and over - some of them trivially false, some a
gross misrepresentation of what we have been trying to do - and are
completely unwilling to even entertain the idea that there might be
problems outside of the one-controller scope you are looking at.

But to address your point: there is no 'breakage' here.  Or in your
words: there is no gun-wielding thug forcing people to upgrade to v2.
If v1 does everything your specific setup needs, nobody forces you to
upgrade.  We are fairly confident that the majority of users *will*
upgrade, simply because v2 solves so many basic resource control
problems that v1 is inherently incapable of solving.  There is a
positive incentive, but we are trying not to create negative ones.

And even if you run a systemd distribution, and systemd switches to v2,
it's trivially easy to pry the CPU controller from its hands and
maintain your setup exactly as-is using the current CPU controller.
This is really not a technical argument.

> > This argument that cgroup2 is not backward compatible is laughable.
> 
> Fine, you're entitled to your sense of humor.  I have one too; I find
> it laughable that threaded applications can only sit there like a lump
> of mud simply because they share more than applications written as a
> gaggle of tasks.  "Threads are like.. so yesterday, the future belongs
> to the process" tickles my funny-bone.  Whatever, to each his own.

Who are you quoting here?
This is such a grotesque misrepresentation of what we have been saying
and implementing, it's not even funny.

In reality, the rgroup extension for setpriority() was directly based
on your and PeterZ's feedback regarding thread control.  Except that,
unlike cgroup1's approach to threads, which might work in some setups
but suffers immensely from the global nature of the vfs interface once
you have to cooperate with other applications and system management*,
rgroup was proposed as a much more generic and robust interface to do
hierarchical resource control from inside the application.

* This doesn't have to be systemd, btw.  We have used cgroups to
  isolate system services, maintenance jobs, cron jobs etc. from our
  applications way before systemd, and it's been a pita to coordinate
  the system-managing applications and the applications managing their
  workers using the same globally scoped vfs interface.

> > > I mentioned a real world case of a thread pool servicing customer
> > > accounts by doing something quite sane: hop into an account (cgroup),
> > > do work therein, send bean count off to the $$ department, wash, rinse,
> > > repeat.  That's real world users making real world cash registers go
> > > ka-ching so real world people can pay their real world bills.
> > 
> > Sure, but you're implying that this is the only way to run this real
> > world cash register.
> 
> I implied no such thing.  Of course it can be done differently, all
> they have to do is rip out these archaic thread thingies.
> 
> Apologies for dripping sarcasm all over your monitor, but this annoys
> me far more than it should any casual user of cgroups.  Perhaps I
> shouldn't care about the users (suse customers) who will step in this
> eventually, but I do.

https://yourlogicalfallacyis.com/black-or-white
https://yourlogicalfallacyis.com/strawman
https://yourlogicalfallacyis.com/appeal-to-emotion

Can you please try to stay objective?

> > > As with the thread pool, process granularity makes it impossible for
> > > any threaded application affinity to be managed via cpusets, such as
> > > say stuffing realtime critical threads into a shielded cpuset, mundane
> > > threads into another.  There are any number of affinity usages that
> > > will break.
> > 
> > Ditto.  It's not obvious why this needs to be the cgroup interface and
> > couldn't instead be solved with extending sched_setaffinity() - again
> > weighing that against the power of the common controller model that
> > could be preserved this way.
> 
> Wow.  Well sure, anything that becomes broken can be replaced by
> something else.  Hell, people can just stop using cgroups entirely, and
> the way issues become non-issues with the wave of a hand makes me
> suspect that some users are going to be forced to do just that.

We are not the ones doing the handwaving.  We have reacted with code
and with repeated attempts to restart a grounded technical discussion
on this issue, and were met time and again with polemics, categorical
dismissal of the problems we are facing in the cloud, and a flat-out
refusal to even consider a different approach to resource control.

It's great that cgroup1 works for some of your customers, and they are
free to keep using it, but there is only so much you can build with a
handful of loose shoestrings, and we are badly hitting the design
limitations of that model.
We have tried to work in your direction and proposed
interfaces/processes to support the different things people are
(ab)using cgroup1 for right now, but at some point you have to
acknowledge that cgroup2 is the result of problems we have run into
with cgroup1 and that, consequently, not everything from cgroup1 can be
retained as-is.  Only when that happens can we properly discuss
cgroup2's current design choices and whether it could be done better.

Ignoring the real problems that cgroup2 is solving will not remove the
demand for it.  It only squanders your chance to help shape it in the
interest of the particular group of users you feel most obligated to.

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [Documentation] State of CPU controller in cgroup v2
  [not found] ` <20160812221742.GA24736-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2016-08-13  5:08 ` Mike Galbraith
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Galbraith @ 2016-08-13 5:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Linus Torvalds, Andrew Morton, Li Zefan,
      Peter Zijlstra, Paul Turner, Ingo Molnar,
      linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA,
      linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg

On Fri, 2016-08-12 at 18:17 -0400, Johannes Weiner wrote:
> > > This argument that cgroup2 is not backward compatible is laughable.
> > 
> > Fine, you're entitled to your sense of humor.  I have one too; I find
> > it laughable that threaded applications can only sit there like a lump
> > of mud simply because they share more than applications written as a
> > gaggle of tasks.  "Threads are like.. so yesterday, the future belongs
> > to the process" tickles my funny-bone.  Whatever, to each his own.
> 
> Who are you quoting here?  This is such a grotesque misrepresentation
> of what we have been saying and implementing, it's not even funny.

Agreed, it's not funny to me either.  Excluding threaded applications
from doing.. anything.. implies to me that either someone thinks these
do not need resource management facilities due to some magical property
of threading itself, or someone doesn't realize that an application
thread is a task, i.e. one and the same thing, which can be doing one
and the same job.  No matter how I turn it, what I see is nonsense.

> https://yourlogicalfallacyis.com/black-or-white
> https://yourlogicalfallacyis.com/strawman
> https://yourlogicalfallacyis.com/appeal-to-emotion

Nope, plain ole sarcasm, an expression of shock and awe.

> It's great that cgroup1 works for some of your customers, and they are
> free to keep using it.

If no third party can flush my customers' investment down the toilet, I
can cease to care.  Please don't CC me in future, you're unlikely to
convince me that v2 is remotely sane, nor do you need to.  Lucky you.

	-Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <20160810220944.GB3085-druUgvl0LCNAfugRpC6u6w@public.gmane.org> 2016-08-11 6:25 ` Mike Galbraith @ 2016-08-16 14:07 ` Peter Zijlstra [not found] ` <20160816140738.GW6879-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 1 sibling, 1 reply; 48+ messages in thread From: Peter Zijlstra @ 2016-08-16 14:07 UTC (permalink / raw) To: Johannes Weiner Cc: Mike Galbraith, Tejun Heo, Linus Torvalds, Andrew Morton, Li Zefan, Paul Turner, Ingo Molnar, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Wed, Aug 10, 2016 at 06:09:44PM -0400, Johannes Weiner wrote: > [ That, and a disturbing number of emotional outbursts against > systemd, which has nothing to do with any of this. ] Oh, so I'm entirely dreaming this then: https://github.com/systemd/systemd/pull/3905 Completely unrelated. Also, the argument there seems unfair at best, you don't need cpu-v2 for buffered write control, you only need memcg and block co-mounted. ^ permalink raw reply [flat|nested] 48+ messages in thread
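For readers who want to see what "co-mounted" means concretely, a minimal sketch of the cgroup v1 arrangement PeterZ refers to might look as follows; the mount point, group name, device numbers and limit values are illustrative assumptions, not taken from the thread, while the controller and file names are standard v1 ones.

  # Sketch: put memory and blkio on the same v1 hierarchy so one
  # cgroup directory carries both controllers at once (assumed paths).
  mount -t cgroup -o memory,blkio cgroup /sys/fs/cgroup/memblk
  mkdir /sys/fs/cgroup/memblk/job
  echo $$ > /sys/fs/cgroup/memblk/job/cgroup.procs
  # Both resource dimensions are now configured in one place.
  echo $((1024*1024*1024)) > /sys/fs/cgroup/memblk/job/memory.limit_in_bytes
  echo "8:0 10485760" > /sys/fs/cgroup/memblk/job/blkio.throttle.write_bps_device

Note that v1 blkio throttling historically applied to direct and sync IO; whether a co-mount like this suffices for buffered writeback is exactly the question the thread goes on to dispute.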
[parent not found: <20160816140738.GW6879-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>]
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <20160816140738.GW6879-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2016-08-16 14:58 ` Chris Mason 2016-08-16 16:30 ` Johannes Weiner 2016-08-16 21:59 ` Tejun Heo 2 siblings, 0 replies; 48+ messages in thread From: Chris Mason @ 2016-08-16 14:58 UTC (permalink / raw) To: Peter Zijlstra, Johannes Weiner Cc: Mike Galbraith, Tejun Heo, Linus Torvalds, Andrew Morton, Li Zefan, Paul Turner, Ingo Molnar, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On 08/16/2016 10:07 AM, Peter Zijlstra wrote: > On Wed, Aug 10, 2016 at 06:09:44PM -0400, Johannes Weiner wrote: > >> [ That, and a disturbing number of emotional outbursts against >> systemd, which has nothing to do with any of this. ] > > Oh, so I'm entirely dreaming this then: > > https://github.com/systemd/systemd/pull/3905 > > Completely unrelated. > > Also, the argument there seems unfair at best, you don't need cpu-v2 for > buffered write control, you only need memcg and block co-mounted. > This isn't systemd dictating cgroups2 or systemd trying to get rid of v1. But systemd is a common user of cgroups, and we do use it here in production. We're just sending patches upstream for the tools we're using. It's better than keeping them private, or reinventing a completely different tool that does almost the same thing. -chris ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <20160816140738.GW6879-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 2016-08-16 14:58 ` Chris Mason @ 2016-08-16 16:30 ` Johannes Weiner 2016-08-17 9:33 ` Mike Galbraith 2016-08-16 21:59 ` Tejun Heo 2 siblings, 1 reply; 48+ messages in thread From: Johannes Weiner @ 2016-08-16 16:30 UTC (permalink / raw) To: Peter Zijlstra Cc: Mike Galbraith, Tejun Heo, Linus Torvalds, Andrew Morton, Li Zefan, Paul Turner, Ingo Molnar, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg On Tue, Aug 16, 2016 at 04:07:38PM +0200, Peter Zijlstra wrote: > On Wed, Aug 10, 2016 at 06:09:44PM -0400, Johannes Weiner wrote: > > > [ That, and a disturbing number of emotional outbursts against > > systemd, which has nothing to do with any of this. ] > > Oh, so I'm entirely dreaming this then: > > https://github.com/systemd/systemd/pull/3905 > > Completely unrelated. Yes and no. We certainly do use systemd (kind of hard not to at this point if you're using any major distribution), and we do feed back the changes we make to it upstream. But this is updating systemd to work with the resource control design choices we made in the kernel, not the other way round. As I wrote to Mike before, we have been running into these resource control issues way before systemd, when we used a combination of libcgroup and custom hacks to coordinate the jobs on the system. The cgroup2 design choices fell out of experiences with those setups. Neither the problem statement nor the proposed solutions depend on systemd, which is why I had hoped we could focus these cgroup2 debates around the broader resource control issues we are trying to address, rather than get hung up on one contentious user of the interface. > Also, the argument there seems unfair at best, you don't need cpu-v2 for > buffered write control, you only need memcg and block co-mounted. Yes, memcg and block agreeing is enough for that case. But I mentioned a whole bunch of these examples, to make the broader case for a common controller model. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Documentation] State of CPU controller in cgroup v2 2016-08-16 16:30 ` Johannes Weiner @ 2016-08-17 9:33 ` Mike Galbraith 0 siblings, 0 replies; 48+ messages in thread From: Mike Galbraith @ 2016-08-17 9:33 UTC (permalink / raw) To: Johannes Weiner, Peter Zijlstra Cc: Tejun Heo, Linus Torvalds, Andrew Morton, Li Zefan, Paul Turner, Ingo Molnar, linux-kernel, cgroups, linux-api, kernel-team On Tue, 2016-08-16 at 12:30 -0400, Johannes Weiner wrote: > On Tue, Aug 16, 2016 at 04:07:38PM +0200, Peter Zijlstra wrote: > > Also, the argument there seems unfair at best, you don't need cpu-v2 for > > buffered write control, you only need memcg and block co-mounted. > > Yes, memcg and block agreeing is enough for that case. But I mentioned > a whole bunch of these examples, to make the broader case for a common > controller model. The core issue I have with that model is that it defines context=mm, and declares context=task to be invalid, while in reality, both views are perfectly valid, useful, and in use. That redefinition of context is demonstrably harmful when applied to scheduler-related controllers, rendering a substantial portion of to-be-managed objects completely unmanageable. You (collectively) know that full well. AFAICT, there is only one viable option, and that is to continue to allow both. Whether you like the duality or not (who would), it's deeply embedded in what's under the controllers, and won't go away. I'll now go try a little harder while you ponder (or pop) this thought bubble, see if I can set a new personal best at the art of ignoring. (CC did not help btw, your bad if you don't like bubble content) -Mike ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <20160816140738.GW6879-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 2016-08-16 14:58 ` Chris Mason 2016-08-16 16:30 ` Johannes Weiner @ 2016-08-16 21:59 ` Tejun Heo 2 siblings, 0 replies; 48+ messages in thread From: Tejun Heo @ 2016-08-16 21:59 UTC (permalink / raw) To: Peter Zijlstra Cc: Johannes Weiner, Mike Galbraith, Linus Torvalds, Andrew Morton, Li Zefan, Paul Turner, Ingo Molnar, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg Hello, Peter. On Tue, Aug 16, 2016 at 04:07:38PM +0200, Peter Zijlstra wrote: > On Wed, Aug 10, 2016 at 06:09:44PM -0400, Johannes Weiner wrote: > > > [ That, and a disturbing number of emotional outbursts against > > systemd, which has nothing to do with any of this. ] > > Oh, so I'm entirely dreaming this then: > > https://github.com/systemd/systemd/pull/3905 > > Completely unrelated. We use centos in the fleet and are trying to control resources in the base system, which of course requires writeback control and thus cgroup v2. I'm working to solve the use cases people are facing and systemd is a piece of the puzzle. There is no big conspiracy. As Johannes and Chris already pointed out, systemd is a user of cgroup v2, a pretty important one at this point. While I of course care about it having proper support for cgroup v2, systemd is just picking up the changes in cgroup v2. The cgroup v2 design wouldn't be different without systemd. We'll just have something else playing its role in resource management. > Also, the argument there seems unfair at best, you don't need cpu-v2 for > buffered write control, you only need memcg and block co-mounted. ( Everything I'm gonna write below has already been extensively documented in the posted documentation. I'm gonna repeat the points for completeness but if we're gonna start an actually technical discussion, let's please start from the documentation instead of jumping off of a one-liner and trying to rebuild the entire argument each time. I'm not sure what exactly you meant by the above sentence; I'm assuming you're saying that there are no new capabilities gained by the cpu controller being on the v2 hierarchy and thus that the cpu controller doesn't need to be on cgroup v2. If I'm mistaken, please let me know. ) Just co-mounting isn't enough as it still leaves the problems with anonymous consumption, different handling of threads belonging to different cgroups, and whether it's acceptable to always require blkio to use the memory controller. cgroup v2 is what we got after working through all these issues. While it is true that the cpu controller doesn't need to be on cgroup v2 for writeback control to work, it misses the point about the larger design issues identified during writeback control work, which can be easily applied to the cpu controller - e.g. accounting cpu cycles spent for packet reception, memory reclaim, IO encryption and so on. In addition, it is an unnecessary inconvenience for users who want writeback control to require the complication of mixed v1 and v2 hierarchies when their requirements can be easily served by v2, especially considering that the only blocked part is trivial changes to expose the cpu controller interface on v2 and that enabling it on v2 doesn't preclude it from being used on a v1 hierarchy if necessary. Thanks. -- tejun ^ permalink raw reply [flat|nested] 48+ messages in thread
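By contrast, the v2 arrangement Tejun describes gets the memory/io agreement by construction, since both controllers live on the single unified tree; a rough sketch, with the mount point, names and values again being illustrative assumptions:

  # Sketch: one unified hierarchy; writeback can always correlate the
  # memory and io domains because they are the same cgroup.
  mount -t cgroup2 none /sys/fs/cgroup
  echo "+memory +io" > /sys/fs/cgroup/cgroup.subtree_control
  mkdir /sys/fs/cgroup/job
  echo $$ > /sys/fs/cgroup/job/cgroup.procs
  echo 1G > /sys/fs/cgroup/job/memory.max
  echo "8:0 wbps=10485760" > /sys/fs/cgroup/job/io.max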
* Re: [Documentation] State of CPU controller in cgroup v2 2016-08-05 17:07 [Documentation] State of CPU controller in cgroup v2 Tejun Heo ` (2 preceding siblings ...) [not found] ` <20160805170752.GK2542-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> @ 2016-08-17 20:18 ` Andy Lutomirski [not found] ` <CALCETrXvLNeds+ugZ8j3eD1Zg1RZYJSAET3Kguz5G2vqSLFCwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 3 siblings, 1 reply; 48+ messages in thread From: Andy Lutomirski @ 2016-08-17 20:18 UTC (permalink / raw) To: Tejun Heo Cc: Ingo Molnar, Mike Galbraith, linux-kernel@vger.kernel.org, kernel-team, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra, Johannes Weiner, Linus Torvalds On Aug 5, 2016 7:07 PM, "Tejun Heo" <tj@kernel.org> wrote: > > Hello, > > There have been several discussions around CPU controller support. > Unfortunately, no consensus was reached and cgroup v2 is sorely > lacking CPU controller support. This document includes summary of the > situation and arguments along with an interim solution for parties who > want to use the out-of-tree patches for CPU controller cgroup v2 > support. I'll post the two patches as replies for reference. > > Thanks. > > > CPU Controller on Control Group v2 > > August, 2016 Tejun Heo <tj@kernel.org> > > > While most controllers have support for cgroup v2 now, the CPU > controller support is not upstream yet due to objections from the > scheduler maintainers on the basic designs of cgroup v2. This > document explains the current situation as well as an interim > solution, and details the disagreements and arguments. The latest > version of this document can be found at the following URL. > > https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu > > > CONTENTS > > 1. Current Situation and Interim Solution > 2. Disagreements and Arguments > 2-1. Contentious Restrictions > 2-1-1. Process Granularity > 2-1-2. No Internal Process Constraint > 2-2. Impact on CPU Controller > 2-2-1. Impact of Process Granularity > 2-2-2. Impact of No Internal Process Constraint > 2-3. Arguments for cgroup v2 > 3. Way Forward > 4. References > > > 1. Current Situation and Interim Solution > > All objections from the scheduler maintainers apply to cgroup v2 core > design, and there are no known objections to the specifics of the CPU > controller cgroup v2 interface. The only blocked part is changes to > expose the CPU controller interface on cgroup v2, which comprises the > following two patches: > > [1] sched: Misc preps for cgroup unified hierarchy interface > [2] sched: Implement interface for cgroup unified hierarchy > > The necessary changes are superficial and implement the interface > files on cgroup v2. The combined diffstat is as follows. > > kernel/sched/core.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++-- > kernel/sched/cpuacct.c | 57 ++++++++++++------ > kernel/sched/cpuacct.h | 5 + > 3 files changed, 189 insertions(+), 22 deletions(-) > > The patches are easy to apply and forward-port. The following git > branch will always carry the two patches on top of the latest release > of the upstream kernel. > > git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/cgroup-v2-cpu > > There also are versioned branches going back to v4.4. 
> > git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/cgroup-v2-cpu-$KERNEL_VER > > While it's difficult to tell whether the CPU controller support will > be merged, there are crucial resource control features in cgroup v2 > that are only possible due to the design choices that are being > objected to, and every effort will be made to ease enabling the CPU > controller cgroup v2 support out-of-tree for parties which choose to. > > > 2. Disagreements and Arguments > > There have been several lengthy discussion threads [3][4] on LKML > around the structural constraints of cgroup v2. The two that affect > the CPU controller are process granularity and no internal process > constraint. Both arise primarily from the need for common resource > domain definition across different resources. > > The common resource domain is a powerful concept in cgroup v2 that > allows controllers to make basic assumptions about the structural > organization of processes and controllers inside the cgroup hierarchy, > and thus solve problems spanning multiple types of resources. The > prime example for this is page cache writeback: dirty page cache is > regulated through throttling buffered writers based on memory > availability, and initiating batched write outs to the disk based on > IO capacity. Tracking and controlling writeback inside a cgroup thus > requires the direct cooperation of the memory and the IO controller. > > This easily extends to other areas, such as CPU cycles consumed while > performing memory reclaim or IO encryption. > > > 2-1. Contentious Restrictions > > For controllers of different resources to work together, they must > agree on a common organization. This uniform model across controllers > imposes two contentious restrictions on the CPU controller: process > granularity and the no-internal-process constraint. > > > 2-1-1. Process Granularity > > For memory, because an address space is shared between all threads > of a process, the terminal consumer is a process, not a thread. > Separating the threads of a single process into different memory > control domains doesn't make semantical sense. cgroup v2 ensures > that all controller can agree on the same organization by requiring > that threads of the same process belong to the same cgroup. I haven't followed all of the history here, but it seems to me that this argument is less accurate than it appears. Linux, for better or for worse, has somewhat orthogonal concepts of thread groups (processes), mms, and file tables. An mm has VMAs in it, and VMAs can reference things (files, etc) that hold resources. (Two mms can share resources by mapping the same thing or using fork().) File tables hold files, and files can use resources. Both of these are, at best, moderately good approximations of what actually holds resources. Meanwhile, threads (tasks) do syscalls, take page faults, *allocate* resources, etc. So I think it's not really true to say that the "terminal consumer" of anything is a process, not a thread. While it's certainly easier to think about assigning processes to cgroups, and I certainly agree that, in the common case, it's the right thing to do, I don't see why requiring it is a good idea. Can we turn this around: what actually goes wrong if cgroup v2 were to allow assigning individual threads if a user specifically requests it? > > There are other reasons to enforce process granularity. One > important one is isolating system-level management operations from > in-process application operations. 
The cgroup interface, being a > virtual filesystem, is very unfit for multiple independent > operations taking place at the same time as most operations have to > be multi-step and there is no way to synchronize multiple accessors. > See also [5] Documentation/cgroup-v2.txt, "R-2. Thread Granularity" I don't buy this argument at all. System-level code is likely to assign single process *trees*, which are a different beast entirely. I.e. you fork, move the child into a cgroup, and that child and its children stay in that cgroup. I don't see how the thread/process distinction matters. On the contrary: with cgroup namespaces, one could easily create a cgroup namespace, shove a process in it, and let that process delegate its threads to child cgroups however it likes. (Well, children of the namespace root.) > > > 2-1-2. No Internal Process Constraint > > cgroup v2 does not allow processes to belong to any cgroup which has > child cgroups when resource controllers are enabled on it (the > notable exception being the root cgroup itself). Can you elaborate on this exception? How do you get any of the supposed benefits of not having processes and cgroups exist as siblings when you make an exception for the root? Similarly, if you make an exception for the root, what do you do about cgroup namespaces where the apparent root isn't the global root? --Andy ^ permalink raw reply [flat|nested] 48+ messages in thread
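The delegation Andy sketches could be prototyped with the cgroup namespace support that went into v4.6; the following is a rough sketch under that assumption (the /app name and the unshare(1) flags are illustrative, and an application could not rely on such a setup being present, which is part of the objection raised in the reply below):

  # Sketch: confine a process's view of the hierarchy to its own subtree.
  mkdir /sys/fs/cgroup/app
  echo $APP_PID > /sys/fs/cgroup/app/cgroup.procs
  # Run from within that process: unshare the cgroup and mount
  # namespaces, then remount cgroup2 so the subtree appears as "/".
  unshare --cgroup --mount sh -c \
      'umount /sys/fs/cgroup; mount -t cgroup2 none /sys/fs/cgroup; \
       mkdir /sys/fs/cgroup/worker-threads'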
[parent not found: <CALCETrXvLNeds+ugZ8j3eD1Zg1RZYJSAET3Kguz5G2vqSLFCwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <CALCETrXvLNeds+ugZ8j3eD1Zg1RZYJSAET3Kguz5G2vqSLFCwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2016-08-20 15:56 ` Tejun Heo 2016-08-20 18:45 ` Andy Lutomirski [not found] ` <20160820155659.GA16906-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 2016-08-21 5:34 ` James Bottomley 1 sibling, 2 replies; 48+ messages in thread From: Tejun Heo @ 2016-08-20 15:56 UTC (permalink / raw) To: Andy Lutomirski Cc: Ingo Molnar, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra, Johannes Weiner, Linus Torvalds Hello, Andy. On Wed, Aug 17, 2016 at 01:18:24PM -0700, Andy Lutomirski wrote: > > 2-1-1. Process Granularity > > > > For memory, because an address space is shared between all threads > > of a process, the terminal consumer is a process, not a thread. > > Separating the threads of a single process into different memory > > control domains doesn't make semantical sense. cgroup v2 ensures > > that all controller can agree on the same organization by requiring > > that threads of the same process belong to the same cgroup. > > I haven't followed all of the history here, but it seems to me that > this argument is less accurate than it appears. Linux, for better or > for worse, has somewhat orthogonal concepts of thread groups > (processes), mms, and file tables. An mm has VMAs in it, and VMAs can > reference things (files, etc) that hold resources. (Two mms can share > resources by mapping the same thing or using fork().) File tables > hold files, and files can use resources. Both of these are, at best, > moderately good approximations of what actually holds resources. > Meanwhile, threads (tasks) do syscalls, take page faults, *allocate* > resources, etc. > > So I think it's not really true to say that the "terminal consumer" of > anything is a process, not a thread. The terminal consumer is actually the mm context. A task may be the allocating entity but not always for itself. This becomes clear whenever an entity is allocating memory on behalf of someone else - get_user_pages(), khugepaged, swapoff and so on (and likely userfaultfd too). When a task is trying to add a page to a VMA, the task might not have any relationship with the VMA other than that it's operating on it for someone else. The page has to be charged to whoever is responsible for the VMA and the only ownership which can be established is the containing mm_struct. While a mm_struct technically may not map to a process, it is a very close approximation which is hardly ever broken in practice. > While it's certainly easier to think about assigning processes to > cgroups, and I certainly agree that, in the common case, it's the > right thing to do, I don't see why requiring it is a good idea. Can > we turn this around: what actually goes wrong if cgroup v2 were to > allow assigning individual threads if a user specifically requests it? Consider the scenario where you have somebody faulting on behalf of a foreign VMA, but the thread who created and is actively using that VMA is in a different cgroup than the process leader. Who are we going to charge? All possible answers seem erratic. Please note that I agree that thread granularity can be useful for some resources; however, my points are 1. it should be scoped so that the resource distribution tree as a whole can be shared across different resources, and, 2.
the cgroup filesystem interface isn't a good interface for the purpose. I'll continue the second point below. > > there are other reasons to enforce process granularity. One > > important one is isolating system-level management operations from > > in-process application operations. The cgroup interface, being a > > virtual filesystem, is very unfit for multiple independent > > operations taking place at the same time as most operations have to > > be multi-step and there is no way to synchronize multiple accessors. > > See also [5] Documentation/cgroup-v2.txt, "R-2. Thread Granularity" > > I don't buy this argument at all. System-level code is likely to > assign single process *trees*, which are a different beast entirely. > I.e. you fork, move the child into a cgroup, and that child and its > children stay in that cgroup. I don't see how the thread/process > distinction matters. Good point on the multi-process issue, this is something which nagged me a bit while working on rgroup, although I have to point out that the issue here is one of not going far enough rather than the approach being wrong. There are limitations to scoping it to individual processes but that doesn't negate the underlying problem or the usefulness of in-process control. For system-level and process-level operations to not step on each other's toes, they need to agree on the granularity boundary - system-level should be able to treat an application hierarchy as a single unit. A possible solution is allowing rgroup hierarchies to span across process boundaries and implementing cgroup migration operations which treat such hierarchies as a single unit. I'm not yet sure whether the boundary should be at program groups or rgroups. > On the contrary: with cgroup namespaces, one could easily create a > cgroup namespace, shove a process in it, and let that process delegate > its threads to child cgroups however it likes. (Well, children of the > namespace root.) cgroup namespace solves just one piece of the whole problem and not in a very robust way. It's okay for containers but not so for individual applications. * Using namespace is neither trivial nor dependable. It requires explicit mount setups, and, more importantly, an application can't rely on a specific namespace setup being there, unlike a setpriority() extension. This affects application designs in the first place and severely hampers the accessibility and thus usefulness of in-application resource control. * While it makes the names consistent from inside, it doesn't solve the atomicity issues when system and application operate on the subtree concurrently. Imagine a system-level operation trying to relocate the namespace: the symbolic names can be made to stay the same before and after, but that's about it. During migration, depending on how migration is implemented, some may see paths linking back to the old or the new location. Even the open files for the filesystem knobs wouldn't work after such migration. * It's just a bad interface if one has to use setpriority(2) to set a thread priority but resort to opening a file, parsing a path, opening another file, and writing a number string which uses a completely different value range for thread groups. > > 2-1-2. No Internal Process Constraint > > > > cgroup v2 does not allow processes to belong to any cgroup which has > > child cgroups when resource controllers are enabled on it (the > > notable exception being the root cgroup itself). > > Can you elaborate on this exception?
How do you get any of the > supposed benefits of not having processes and cgroups exist as > siblings when you make an exception for the root? Similarly, if you > make an exception for the root, what do you do about cgroup namespaces > where the apparent root isn't the global root? Having a special case doesn't necessarily get in the way of benefiting from a set of general rules. The root cgroup is inherently special as it has to be the catch-all scope for entities and resource consumptions which can't be tied to any specific consumer - irq handling, packet rx, journal writes, memory reclaim from global memory pressure and so on. None of the sub-cgroups have to worry about them. These base-system operations are special regardless of cgroup and we already have sometimes crude ways to affect their behaviors where necessary through sysctl knobs, priorities on specific kernel threads and so on. cgroup doesn't change the situation all that much. What gets left in the root cgroup usually are the base-system operations which are outside the scope of cgroup resource control in the first place, and the cgroup resource graph can treat the root as an opaque anchor point. There can be other ways to deal with the issue; however, treating the root cgroup this way has the big advantage of minimizing the gap between configurations without and with cgroups both in terms of mental model and implementation. Hopefully, the case of a namespace root is clear now. If it's gonna have a sub-hierarchy, it itself can't contain processes but the system root just contains base-system entities and resources which a namespace root doesn't have to worry about. Ignoring base-system stuff, a namespace root is topologically in the same position as the system root in the cgroup resource graph. Thanks. -- tejun ^ permalink raw reply [flat|nested] 48+ messages in thread
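Returning to the interface contrast drawn in the bullet points of the message above: the asymmetry between setpriority(2) and the vfs knobs is easiest to see side by side. Below is a sketch against cgroup v1's cpu controller, where per-thread moves are allowed; the mount point is an assumption, and the nice value and shares value are arbitrary, since the absence of an obvious mapping between the two scales is exactly the point.

  # Per-thread priority: a single syscall-backed command.
  renice -n 5 -p $TID

  # The vfs route: find our own cgroup, create a subgroup, write a
  # weight on an unrelated scale, then move the thread over.
  CG=$(awk -F: '$2 ~ /(^|,)cpu(,|$)/ {print $3; exit}' /proc/self/cgroup)
  mkdir /sys/fs/cgroup/cpu$CG/low
  echo 512 > /sys/fs/cgroup/cpu$CG/low/cpu.shares
  echo $TID > /sys/fs/cgroup/cpu$CG/low/tasks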
* Re: [Documentation] State of CPU controller in cgroup v2 2016-08-20 15:56 ` Tejun Heo @ 2016-08-20 18:45 ` Andy Lutomirski [not found] ` <CALCETrUWn1ux-ZRJoMjFCuP1aQrPOo3oTPD7k-ojsaov29NsRw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> [not found] ` <20160820155659.GA16906-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 1 sibling, 1 reply; 48+ messages in thread From: Andy Lutomirski @ 2016-08-20 18:45 UTC (permalink / raw) To: Tejun Heo Cc: Ingo Molnar, Mike Galbraith, linux-kernel@vger.kernel.org, kernel-team, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra, Johannes Weiner, Linus Torvalds On Sat, Aug 20, 2016 at 8:56 AM, Tejun Heo <tj@kernel.org> wrote: > Hello, Andy. > > On Wed, Aug 17, 2016 at 01:18:24PM -0700, Andy Lutomirski wrote: >> > 2-1-1. Process Granularity >> > >> > For memory, because an address space is shared between all threads >> > of a process, the terminal consumer is a process, not a thread. >> > Separating the threads of a single process into different memory >> > control domains doesn't make semantical sense. cgroup v2 ensures >> > that all controller can agree on the same organization by requiring >> > that threads of the same process belong to the same cgroup. >> >> I haven't followed all of the history here, but it seems to me that >> this argument is less accurate than it appears. Linux, for better or >> for worse, has somewhat orthogonal concepts of thread groups >> (processes), mms, and file tables. An mm has VMAs in it, and VMAs can >> reference things (files, etc) that hold resources. (Two mms can share >> resources by mapping the same thing or using fork().) File tables >> hold files, and files can use resources. Both of these are, at best, >> moderately good approximations of what actually holds resources. >> Meanwhile, threads (tasks) do syscalls, take page faults, *allocate* >> resources, etc. >> >> So I think it's not really true to say that the "terminal consumer" of >> anything is a process, not a thread. > > The terminal consumer is actually the mm context. A task may be the > allocating entity but not always for itself. > > This becomes clear whenever an entity is allocating memory on behalf > of someone else - get_user_pages(), khugepaged, swapoff and so on (and > likely userfaultfd too). When a task is trying to add a page to a > VMA, the task might not have any relationship with the VMA other than > that it's operating on it for someone else. The page has to be > charged to whoever is responsible for the VMA and the only ownership > which can be established is the containing mm_struct. This surprises me a bit. If I do access_process_vm(), then I would have expected the charge to go to the caller, not the mm being accessed. What happens if a program calls read(2), though? A page may be inserted into page cache on behalf of an address_space without any particular mm being involved. There will usually be a calling task, though. But this is all very memcg-specific. What about other cgroups? I/O is per-task, right? Scheduling is definitely per-task. > > While a mm_struct technically may not map to a process, it is a very > close approximation which is hardly ever broken in practice. > >> While it's certainly easier to think about assigning processes to >> cgroups, and I certainly agree that, in the common case, it's the >> right thing to do, I don't see why requiring it is a good idea.
Can >> we turn this around: what actually goes wrong if cgroup v2 were to >> allow assigning individual threads if a user specifically requests it? > > Consider the scenario where you have somebody faulting on behalf of a > foreign VMA, but the thread who created and is actively using that VMA > is in a different cgroup than the process leader. Who are we going to > charge? All possible answers seem erratic. > Indeed, and this problem is probably not solvable in practice unless you charge all involved cgroups. But the caller's *mm* is entirely irrelevant here, so I don't see how this implies that cgroups need to keep tasks in the same process together. The relevant entities are the calling *task* and the target mm, and you're going to be hard-pressed to ensure that they belong to the same cgroup, so I think you need to be able to handle weird cases in which there isn't an obviously correct cgroup to charge. >> > there are other reasons to enforce process granularity. One >> > important one is isolating system-level management operations from >> > in-process application operations. The cgroup interface, being a >> > virtual filesystem, is very unfit for multiple independent >> > operations taking place at the same time as most operations have to >> > be multi-step and there is no way to synchronize multiple accessors. >> > See also [5] Documentation/cgroup-v2.txt, "R-2. Thread Granularity" >> >> I don't buy this argument at all. System-level code is likely to >> assign single process *trees*, which are a different beast entirely. >> I.e. you fork, move the child into a cgroup, and that child and its >> children stay in that cgroup. I don't see how the thread/process >> distinction matters. > > Good point on the multi-process issue, this is something which nagged > me a bit while working on rgroup, although I have to point out that > the issue here is one of not going far enough rather than the approach > being wrong. There are limitations to scoping it to individual > processes but that doesn't negate the underlying problem or the > usefulness of in-process control. > > For system-level and process-level operations to not step on each > other's toes, they need to agree on the granularity boundary - > system-level should be able to treat an application hierarchy as a > single unit. A possible solution is allowing rgroup hierarchies to > span across process boundaries and implementing cgroup migration > operations which treat such hierarchies as a single unit. I'm not yet > sure whether the boundary should be at program groups or rgroups. I think that, if the system cgroup manager is moving processes around after starting them and execing the final binary, there will be races and confusion, and no amount of granularity fiddling will fix that. I know nothing about rgroups. Are they upstream? > >> > 2-1-2. No Internal Process Constraint >> > >> > cgroup v2 does not allow processes to belong to any cgroup which has >> > child cgroups when resource controllers are enabled on it (the >> > notable exception being the root cgroup itself). >> >> Can you elaborate on this exception? How do you get any of the >> supposed benefits of not having processes and cgroups exist as >> siblings when you make an exception for the root? Similarly, if you >> make an exception for the root, what do you do about cgroup namespaces >> where the apparent root isn't the global root? > > Having a special case doesn't necessarily get in the way of benefiting > from a set of general rules.
The root cgroup is inherently special as > it has to be the catch-all scope for entities and resource > consumptions which can't be tied to any specific consumer - irq > handling, packet rx, journal writes, memory reclaim from global memory > pressure and so on. None of the sub-cgroups have to worry about them. > > These base-system operations are special regardless of cgroup and we > already have sometimes crude ways to affect their behaviors where > necessary through sysctl knobs, priorities on specific kernel threads > and so on. cgroup doesn't change the situation all that much. What > gets left in the root cgroup usually are the base-system operations > which are outside the scope of cgroup resource control in the first > place and the cgroup resource graph can treat the root as an opaque anchor > point. This seems to explain why the controllers need to be able to handle things being charged to the root cgroup (or to an unidentifiable cgroup, anyway). That isn't quite the same thing as allowing, from an ABI point of view, the root cgroup to contain processes and cgroups but not allowing other cgroups to do the same thing. Consider: suppose that systemd (or some competing cgroup manager) is designed to run in the root cgroup namespace. It presumably expects *itself* to be in the root cgroup. Now try to run it using cgroups v2 in a non-root namespace. I don't see how it can possibly work if the hierarchy constraints don't permit it to create sub-cgroups while it's still in the root. In fact, this seems impossible to fix even with user code changes. The manager would need to simultaneously create a new child cgroup to contain itself and assign itself to that child cgroup, because the intermediate state is illegal. I really, really think that cgroup v2 should supply the same *interface* inside and outside of a non-root namespace. If this is impossible due to ABI compatibility, then you could, in the worst case, introduce cgroup v3, fix it there, and remove cgroup v2, since apparently cgroup v2 isn't in use right now in mainline kernels. (To be clear, I think either decision -- allowing tasks and cgroups to be siblings or disallowing it -- is okay, but I think that the interface should apply the same constraint at all levels.) --Andy ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <CALCETrUWn1ux-ZRJoMjFCuP1aQrPOo3oTPD7k-ojsaov29NsRw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <CALCETrUWn1ux-ZRJoMjFCuP1aQrPOo3oTPD7k-ojsaov29NsRw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2016-08-29 22:20 ` Tejun Heo [not found] ` <20160829222048.GH28713-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 48+ messages in thread From: Tejun Heo @ 2016-08-29 22:20 UTC (permalink / raw) To: Andy Lutomirski Cc: Ingo Molnar, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra, Johannes Weiner, Linus Torvalds Hello, Andy. Sorry about the delay. Was kinda overwhelmed with other things. On Sat, Aug 20, 2016 at 11:45:55AM -0700, Andy Lutomirski wrote: > > This becomes clear whenever an entity is allocating memory on behalf > > of someone else - get_user_pages(), khugepaged, swapoff and so on (and > > likely userfaultfd too). When a task is trying to add a page to a > > VMA, the task might not have any relationship with the VMA other than > > that it's operating on it for someone else. The page has to be > > charged to whoever is responsible for the VMA and the only ownership > > which can be established is the containing mm_struct. > > This surprises me a bit. If I do access_process_vm(), then I would > have expected the charge to go to the caller, not the mm being accessed. It does and should go to the target mm. Who faults in a page shouldn't be the final determinant in the ownership; otherwise, we end up in situations where the ownership changes due to, for example, fluctuations in the page fault pattern. It doesn't make semantical sense either. If a kthread is doing PIO for a process, why would it get charged for the memory it's faulting in? > What happens if a program calls read(2), though? A page may be > inserted into page cache on behalf of an address_space without any > particular mm being involved. There will usually be a calling task, > though. Most faults are synchronous and the faulting thread is a member of the mm to be charged, so this usually isn't an issue. I don't think there are places where we populate an address_space without knowing who it is for (as opposed to, or in addition to, who the operator is). > But this is all very memcg-specific. What about other cgroups? I/O > is per-task, right? Scheduling is definitely per-task. They aren't separate. Think about IOs to write out page cache, CPU cycles spent reclaiming memory or encrypting writeback IOs. It's fine to get more granular with specific resources but the semantics gets messy for cross-resource accounting and control without proper scoping. > > Consider the scenario where you have somebody faulting on behalf of a > > foreign VMA, but the thread who created and is actively using that VMA > > is in a different cgroup than the process leader. Who are we going to > > charge? All possible answers seem erratic. > > Indeed, and this problem is probably not solvable in practice unless > you charge all involved cgroups. But the caller's *mm* is entirely > irrelevant here, so I don't see how this implies that cgroups need to > keep tasks in the same process together. The relevant entities are > the calling *task* and the target mm, and you're going to be > hard-pressed to ensure that they belong to the same cgroup, so I think > you need to be able to handle weird cases in which there isn't an > obviously correct cgroup to charge.
It is an erratic case which is caused by the userland interface allowing nonsensical configuration. We can accept it as a necessary trade-off given big enough benefits or unavoidable constraints but it isn't something to do willy-nilly. > > For system-level and process-level operations to not step on each > > other's toes, they need to agree on the granularity boundary - > > system-level should be able to treat an application hierarchy as a > > single unit. A possible solution is allowing rgroup hierarchies to > > span across process boundaries and implementing cgroup migration > > operations which treat such hierarchies as a single unit. I'm not yet > > sure whether the boundary should be at program groups or rgroups. > > I think that, if the system cgroup manager is moving processes around > after starting them and execing the final binary, there will be races > and confusion, and no amount of granularity fiddling will fix that. I don't see how that statement is true. For example, if you confine the hierarchy to in-process, there is proper isolation and whether the system agent migrates the process or not doesn't make any difference to the internal hierarchy. > I know nothing about rgroups. Are they upstream? It was linked from the original message. [7] http://lkml.kernel.org/r/20160105154503.GC5995-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org [RFD] cgroup: thread granularity support for cpu controller Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> [8] http://lkml.kernel.org/r/1457710888-31182-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> [9] http://lkml.kernel.org/r/20160311160522.GA24046-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org Example program for PRIO_RGRP Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> > > These base-system operations are special regardless of cgroup and we > > already have sometimes crude ways to affect their behaviors where > > necessary through sysctl knobs, priorities on specific kernel threads > > and so on. cgroup doesn't change the situation all that much. What > > gets left in the root cgroup usually are the base-system operations > > which are outside the scope of cgroup resource control in the first > > place and the cgroup resource graph can treat the root as an opaque anchor > > point. > > This seems to explain why the controllers need to be able to handle > things being charged to the root cgroup (or to an unidentifiable > cgroup, anyway). That isn't quite the same thing as allowing, from an > ABI point of view, the root cgroup to contain processes and cgroups > but not allowing other cgroups to do the same thing. Consider: The points are 1. we need the root to be a special container anyway, 2. allowing it to be special and contain system-wide consumptions doesn't make the resource graph inconsistent once all non-system-wide consumptions are put in non-root cgroups, and 3. this is the most natural way to handle the situation both from implementation and interface standpoints as it makes non-cgroup configuration a natural degenerate case of cgroup configuration. > suppose that systemd (or some competing cgroup manager) is designed to > run in the root cgroup namespace. It presumably expects *itself* to > be in the root cgroup. Now try to run it using cgroups v2 in a > non-root namespace.
I don't see how it can possibly work if the > hierarchy constraints don't permit it to create sub-cgroups while it's > still in the root. In fact, this seems impossible to fix even with > user code changes. The manager would need to simultaneously create a > new child cgroup to contain itself and assign itself to that child > cgroup, because the intermediate state is illegal. Please re-read the constraint. It doesn't prevent any organizational operations before resource control is enabled. > I really, really think that cgroup v2 should supply the same > *interface* inside and outside of a non-root namespace. If this is It *does*. That's what I tried to explain, that it's exactly isomorphic once you discount the system-wide consumptions. Thanks. -- tejun ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <20160829222048.GH28713-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>]
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <20160829222048.GH28713-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> @ 2016-08-31 3:42 ` Andy Lutomirski 2016-08-31 17:32 ` Tejun Heo 2016-08-31 19:57 ` Andy Lutomirski 1 sibling, 1 reply; 48+ messages in thread From: Andy Lutomirski @ 2016-08-31 3:42 UTC (permalink / raw) To: Tejun Heo Cc: Ingo Molnar, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra, Johannes Weiner, Linus Torvalds On Mon, Aug 29, 2016 at 3:20 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: >> > These base-system operations are special regardless of cgroup and we >> > already have sometimes crude ways to affect their behaviors where >> > necessary through sysctl knobs, priorities on specific kernel threads >> > and so on. cgroup doesn't change the situation all that much. What >> > gets left in the root cgroup usually are the base-system operations >> > which are outside the scope of cgroup resource control in the first >> > place and cgroup resource graph can treat the root as an opaque anchor >> > point. >> >> This seems to explain why the controllers need to be able to handle >> things being charged to the root cgroup (or to an unidentifiable >> cgroup, anyway). That isn't quite the same thing as allowing, from an >> ABI point of view, the root cgroup to contain processes and cgroups >> but not allowing other cgroups to do the same thing. Consider: > > The points are 1. we need the root to be a special container anyway But you don't need to let userspace see that. > 2. allowing it to be special and contain system-wide consumptions > doesn't make the resource graph inconsistent once all non-system-wide > consumptions are put in non-root cgroups, and 3. this is the most > natural way to handle the situation both from implementation and > interface standpoints as it makes non-cgroup configuration a natural > degenerate case of cgroup configuration. > >> suppose that systemd (or some competing cgroup manager) is designed to >> run in the root cgroup namespace. It presumably expects *itself* to >> be in the root cgroup. Now try to run it using cgroups v2 in a >> non-root namespace. I don't see how it can possibly work if it the >> hierarchy constraints don't permit it to create sub-cgroups while it's >> still in the root. In fact, this seems impossible to fix even with >> user code changes. The manager would need to simultaneously create a >> new child cgroup to contain itself and assign itself to that child >> cgroup, because the intermediate state is illegal. > > Please re-read the constraint. It doesn't prevent any organizational > operations before resource control is enabled. > >> I really, really think that cgroup v2 should supply the same >> *interface* inside and outside of a non-root namespace. If this is > > It *does*. That's what I tried to explain, that it's exactly > isomorhpic once you discount the system-wide consumptions. > I don't think I agree. Suppose I wrote an init program or a cgroup manager. I can expect that init program to be started in the root cgroup. The program can be lazy and write +io to /cgroup/cgroup.subtree_control and then create some new cgroup /cgroup/a and it will work (I just tried it). Now I run that program in a namespace. It will not work because it'll get -EBUSY when it tries to write to cgroup.subtree_control. 
(I just tried this, too, only using cd instead of a namespace.) So it's *not* isomorphic. It *also* won't work (I think) if subtree control is enabled on the root, but I don't think this is a problem in practice because subtree control won't be enabled on the namespace root by a sensible cgroup manager. --Andy ^ permalink raw reply [flat|nested] 48+ messages in thread
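Andy's two experiments map onto roughly the following sequence; this is a sketch assuming a v2 mount at /sys/fs/cgroup (he used /cgroup) with the io controller available, and, as he did, using a sub-cgroup to stand in for a namespace root:

  # At the real root, with processes still inside it, the lazy
  # sequence succeeds thanks to the root exception:
  echo +io > /sys/fs/cgroup/cgroup.subtree_control
  mkdir /sys/fs/cgroup/a                                    # works

  # One level down, standing in for a namespace root, the same
  # sequence trips over the no-internal-process constraint:
  mkdir /sys/fs/cgroup/nsroot
  echo $$ > /sys/fs/cgroup/nsroot/cgroup.procs
  echo +io > /sys/fs/cgroup/nsroot/cgroup.subtree_control   # -EBUSY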
* Re: [Documentation] State of CPU controller in cgroup v2 2016-08-31 3:42 ` Andy Lutomirski @ 2016-08-31 17:32 ` Tejun Heo [not found] ` <20160831173251.GY12660-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 48+ messages in thread From: Tejun Heo @ 2016-08-31 17:32 UTC (permalink / raw) To: Andy Lutomirski Cc: Ingo Molnar, Mike Galbraith, linux-kernel@vger.kernel.org, kernel-team, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra, Johannes Weiner, Linus Torvalds Hello, Andy. On Tue, Aug 30, 2016 at 08:42:20PM -0700, Andy Lutomirski wrote: > On Mon, Aug 29, 2016 at 3:20 PM, Tejun Heo <tj@kernel.org> wrote: > >> This seems to explain why the controllers need to be able to handle > >> things being charged to the root cgroup (or to an unidentifiable > >> cgroup, anyway). That isn't quite the same thing as allowing, from an > >> ABI point of view, the root cgroup to contain processes and cgroups > >> but not allowing other cgroups to do the same thing. Consider: > > > > The points are 1. we need the root to be a special container anyway > > But you don't need to let userspace see that. I'm not saying that what cgroup v2 implements is the only solution. There of course can be other approaches which don't expose this particular detail to userland. I was highlighting that there is an underlying condition to be dealt with and that what cgroup v2 implements is one working solution for it. It's fine to have, say, aesthetic disagreements on the specifics of the chosen approach, and, while a bit late, we can still talk about pros and cons of different possible approaches and make improvements where it makes sense. However, this isn't in any way a make-it-or-break-it issue as you implied before. > >> I really, really think that cgroup v2 should supply the same > >> *interface* inside and outside of a non-root namespace. If this is > > > > It *does*. That's what I tried to explain, that it's exactly > > isomorphic once you discount the system-wide consumptions. > > I don't think I agree. > > Suppose I wrote an init program or a cgroup manager. I can expect > that init program to be started in the root cgroup. The program can > be lazy and write +io to /cgroup/cgroup.subtree_control and then > create some new cgroup /cgroup/a and it will work (I just tried it). > > Now I run that program in a namespace. It will not work because it'll > get -EBUSY when it tries to write to cgroup.subtree_control. (I just > tried this, too, only using cd instead of a namespace.) So it's *not* > isomorphic. Yeah, it is possible to shoot yourself in the foot but both system-scope and namespace-scope can implement exactly the same behavior - move yourself out of root before enabling resource controls and get the same expected outcome, which BTW is how systemd behaves already. You can say that allowing the possibility of deviation isn't a good design choice but it is a design choice with other implications - on how we deal with configurations without cgroup at all, transitioning from v1, bootstrapping a system and avoiding surprising userland-visible behaviors (e.g. like creating magic preset cgroups and silently migrating processes there on certain events). > It *also* won't work (I think) if subtree control is enabled on the > root, but I don't think this is a problem in practice because subtree > control won't be enabled on the namespace root by a sensible cgroup > manager. Exactly the same thing. You can shoot yourself in the foot but it's easy not to. Thanks.
-- tejun ^ permalink raw reply [flat|nested] 48+ messages in thread
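The ordering Tejun describes works out the same whether the manager runs at the system root or at a namespace root; a sketch follows (the init.scope name mirrors what systemd uses, everything else is an illustrative assumption):

  # Park ourselves in a leaf first, then enable resource control on
  # the now process-free root; the identical steps work in a namespace.
  mkdir /sys/fs/cgroup/init.scope
  echo $$ > /sys/fs/cgroup/init.scope/cgroup.procs
  echo "+io +memory" > /sys/fs/cgroup/cgroup.subtree_control
  mkdir /sys/fs/cgroup/workload                             # no -EBUSY now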
[parent not found: <20160831173251.GY12660-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>]
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <20160831173251.GY12660-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> @ 2016-08-31 19:11 ` Andy Lutomirski [not found] ` <CALCETrUKOJZS+=QDPyQD+vxXpwyjoj4+Crg6wU7Xk8rP4prYkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 48+ messages in thread From: Andy Lutomirski @ 2016-08-31 19:11 UTC (permalink / raw) To: Tejun Heo Cc: Ingo Molnar, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra, Johannes Weiner, Linus Torvalds On Wed, Aug 31, 2016 at 10:32 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: > Hello, Andy. > > >> >> I really, really think that cgroup v2 should supply the same >> >> *interface* inside and outside of a non-root namespace. If this is >> > >> > It *does*. That's what I tried to explain, that it's exactly >> > isomorhpic once you discount the system-wide consumptions. >> >> I don't think I agree. >> >> Suppose I wrote an init program or a cgroup manager. I can expect >> that init program to be started in the root cgroup. The program can >> be lazy and write +io to /cgroup/cgroup.subtree_control and then >> create some new cgroup /cgroup/a and it will work (I just tried it). >> >> Now I run that program in a namespace. It will not work because it'll >> get -EBUSY when it tries to write to cgroup.subtree_control. (I just >> tried this, too, only using cd instead of a namespace.) So it's *not* >> isomorphic. > > Yeah, it is possible to shoot yourself in the foot but both > system-scope and namespace-scope can implement the exactly same > behavior - move yourself out of root before enabling resource controls > and get the same expected outcome, which BTW is how systemd behaves > already. > > You can say that allowing the possibility of deviation isn't a good > design choice but it is a design choice with other implications - on > how we deal with configurations without cgroup at all, transitioning > from v1, bootstrapping a system and avoiding surprising > userland-visible behaviors (e.g. like creating magic preset cgroups > and silently migrating process there on certain events). Are there existing userspace programs that use cgroup2 and enable subtree control on / when there are processes in /? If the answer is no, then I think you should change cgroup2 to just disallow it. If the answer is yes, then I think there's a problem and maybe you should consider a breaking change. Given that cgroup2 hasn't really launched on a large scale, it seems worthwhile to get it right. I don't understand what you're talking about wrt silently migrating processes. Are you thinking about usermodehelper? If so, maybe it really does make sense to allow (or require?) the cgroup manager to specify which cgroup these processes end up in. But, given that all the controllers need to support the current magic root exception (for genuinely unaccountable things if nothing else), can you explain what would actually go wrong if you just removed the restriction entirely? Also, here's an idea to maybe make PeterZ happier: relax the restriction a bit per-controller. Currently (except for /), if you have subtree control enabled you can't have any processes in the cgroup. Could you change this so it only applies to certain controllers? 
If the cpu controller is entirely happy to have processes and cgroups as siblings, then maybe a cgroup with only cpu subtree control enabled could allow processes to exist. > >> It *also* won't work (I think) if subtree control is enabled on the >> root, but I don't think this is a problem in practice because subtree >> control won't be enabled on the namespace root by a sensible cgroup >> manager. > > Exactly the same thing. You can shoot yourself in the foot but it's > easy not to. > Somewhat off-topic: this appears to be either a bug or a misfeature: bash-4.3# mkdir foo bash-4.3# ls foo cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control bash-4.3# mkdir foo/io.max <-- IMO this shouldn't have worked bash-4.3# echo +io >cgroup.subtree_control [ 40.470712] cgroup: cgroup_addrm_files: failed to add max, err=-17 Shouldn't cgroups with names that potentially conflict with kernel-provided dentries be disallowed? --Andy ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <CALCETrUKOJZS+=QDPyQD+vxXpwyjoj4+Crg6wU7Xk8rP4prYkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <CALCETrUKOJZS+=QDPyQD+vxXpwyjoj4+Crg6wU7Xk8rP4prYkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2016-08-31 21:07 ` Tejun Heo 2016-08-31 21:46 ` Andy Lutomirski 0 siblings, 1 reply; 48+ messages in thread From: Tejun Heo @ 2016-08-31 21:07 UTC (permalink / raw) To: Andy Lutomirski Cc: Ingo Molnar, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra, Johannes Weiner, Linus Torvalds Hello, On Wed, Aug 31, 2016 at 12:11:58PM -0700, Andy Lutomirski wrote: > > You can say that allowing the possibility of deviation isn't a good > > design choice but it is a design choice with other implications - on > > how we deal with configurations without cgroup at all, transitioning > > from v1, bootstrapping a system and avoiding surprising > > userland-visible behaviors (e.g. like creating magic preset cgroups > > and silently migrating processes there on certain events). > > Are there existing userspace programs that use cgroup2 and enable > subtree control on / when there are processes in /? If the answer is > no, then I think you should change cgroup2 to just disallow it. If > the answer is yes, then I think there's a problem and maybe you should > consider a breaking change. Given that cgroup2 hasn't really launched > on a large scale, it seems worthwhile to get it right. Adding the restriction isn't difficult from an implementation point of view, and for a system agent which controls the boot process, implementing that wouldn't be difficult either, but I can't see what the actual benefits of the extra restriction would be and there are tangible downsides to doing so. Consider a use case where the user isn't interested in fully accounting and dividing up system resources but wants to just cap resource usage from a subset of workloads. There is no reason to require such usages to fully contain all processes in non-root cgroups. Furthermore, it's not trivial to migrate all processes out of root to a sub-cgroup unless the agent is in full control of the boot process. At least up until this point in the discussion, I can't see actual benefits of adding this restriction, and the only reason for pushing it seems to be the initial misunderstanding and purism. > I don't understand what you're talking about wrt silently migrating > processes. Are you thinking about usermodehelper? If so, maybe it > really does make sense to allow (or require?) the cgroup manager to > specify which cgroup these processes end up in. That was from one of the ideas that I was considering way back where enabling resource control in an intermediate node automatically moves internal processes to a preset cgroup whether visible or hidden, which would be another way of addressing the problem. None of these affects what cgroup v2 can do at all and the only thing the userland is asked to do under the current scheme is "if you wanna keep the whole system divided up and use the same mode of operations across system-scope and namespace-scope move out of root while setting yourself up, which also happens to be what you have to do inside namespaces anyway." > But, given that all the controllers need to support the current magic > root exception (for genuinely unaccountable things if nothing else), > can you explain what would actually go wrong if you just removed the > restriction entirely? I have, multiple times.
Can you please read 2-1-2 of the document in the original post and take the discussion from there?

> Also, here's an idea to maybe make PeterZ happier: relax the
> restriction a bit per-controller.  Currently (except for /), if you
> have subtree control enabled you can't have any processes in the
> cgroup.  Could you change this so it only applies to certain
> controllers?  If the cpu controller is entirely happy to have
> processes and cgroups as siblings, then maybe a cgroup with only cpu
> subtree control enabled could allow processes to exist.

The document lists several reasons for not doing this, and also that there is no known real-world use case for such a configuration.  Please also note that the behavior you're describing is actually what rgroup implements.  It makes a lot more sense there because threads and groups share the same configuration mechanism and it only has to worry about competition among threads (anonymous consumption is out of scope for rgroup).

> >> It *also* won't work (I think) if subtree control is enabled on the
> >> root, but I don't think this is a problem in practice because subtree
> >> control won't be enabled on the namespace root by a sensible cgroup
> >> manager.
> >
> > Exactly the same thing.  You can shoot yourself in the foot but it's
> > easy not to.
>
> Somewhat off-topic: this appears to be either a bug or a misfeature:
>
> bash-4.3# mkdir foo
> bash-4.3# ls foo
> cgroup.controllers  cgroup.events  cgroup.procs  cgroup.subtree_control
> bash-4.3# mkdir foo/io.max   <-- IMO this shouldn't have worked
> bash-4.3# echo +io >cgroup.subtree_control
> [   40.470712] cgroup: cgroup_addrm_files: failed to add max, err=-17
>
> Shouldn't cgroups with names that potentially conflict with
> kernel-provided dentries be disallowed?

Yeap, the name collisions suck.  I thought about disallowing all sub-cgroups which start with "KNOWN_SUBSYS." but that has a non-trivial chance of breaking users who were happy before when a new controller gets added.  But, yeah, we at least should disallow the known filenames.  Will think more about it.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [Documentation] State of CPU controller in cgroup v2 2016-08-31 21:07 ` Tejun Heo @ 2016-08-31 21:46 ` Andy Lutomirski [not found] ` <CALCETrXj2Z=-GMaWV_EpCvw_8C3t1vc=D53Ff2wdvo=At8ZF1Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 48+ messages in thread From: Andy Lutomirski @ 2016-08-31 21:46 UTC (permalink / raw) To: Tejun Heo Cc: Ingo Molnar, Mike Galbraith, linux-kernel@vger.kernel.org, kernel-team, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra, Johannes Weiner, Linus Torvalds On Wed, Aug 31, 2016 at 2:07 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, > > On Wed, Aug 31, 2016 at 12:11:58PM -0700, Andy Lutomirski wrote: >> > You can say that allowing the possibility of deviation isn't a good >> > design choice but it is a design choice with other implications - on >> > how we deal with configurations without cgroup at all, transitioning >> > from v1, bootstrapping a system and avoiding surprising >> > userland-visible behaviors (e.g. like creating magic preset cgroups >> > and silently migrating process there on certain events). >> >> Are there existing userspace programs that use cgroup2 and enable >> subtree control on / when there are processes in /? If the answer is >> no, then I think you should change cgroup2 to just disallow it. If >> the answer is yes, then I think there's a problem and maybe you should >> consider a breaking change. Given that cgroup2 hasn't really launched >> on a large scale, it seems worthwhile to get it right. > > Adding the restriction isn't difficult from implementation point of > view and for a system agent which control the boot process > implementing that wouldn't be difficult either but I can't see what > the actual benefits of the extra restriction would be and there are > tangible downsides to doing so. > > Consider a use case where the user isn't interested in fully > accounting and dividing up system resources but wants to just cap > resource usage from a subset of workloads. There is no reason to > require such usages to fully contain all processes in non-root > cgroups. Furthermore, it's not trivial to migrate all processes out > of root to a sub-cgroup unless the agent is in full control of boot > process. Then please also consider exactly the same use case while running in a container. I'm a bit frustrated that you're saying that my example failure modes consist of shooting oneself in the foot and then you go on to come up with your own examples that have precisely the same problem. > >> I don't understand what you're talking about wrt silently migrating >> processes. Are you thinking about usermodehelper? If so, maybe it >> really does make sense to allow (or require?) the cgroup manager to >> specify which cgroup these processes end up in. > > That was from one of the ideas that I was considering way back where > enabling resource control in an intermediate node automatically moves > internal processes to a preset cgroup whether visible or hidden, which > would be another way of addressing the problem. > > None of these affects what cgroup v2 can do at all and the only thing > the userland is asked to do under the current scheme is "if you wanna > keep the whole system divided up and use the same mode of operations > across system-scope and namespace-scope move out of root while setting > yourself up, which also happens to be what you have to do inside > namespaces anyway." 
> >> But, given that all the controllers need to support the current magic >> root exception (for genuinely unaccountable things if nothing else), >> can you explain what would actually go wrong if you just removed the >> restriction entirely? > > I have, multiple times. Can you please read 2-1-2 of the document in > the original post and take the discussion from there? I've read it multiple times, and I don't see any explanation that's consistent with the fact that you are exempting the root cgroup from this constraint. If the constraint were really critical to everything working, then I would expect the root cgroup to have exactly the same problem. This makes me think that either something nasty is being fudged for the root cgroup or that the constraint isn't actually so important after all. The only thing on point I can find is: > Root cgroup is exempt from this constraint, which is in line with > how root cgroup is handled in general - it's excluded from cgroup > resource accounting and control. and that's not very helpful. > >> Also, here's an idea to maybe make PeterZ happier: relax the >> restriction a bit per-controller. Currently (except for /), if you >> have subtree control enabled you can't have any processes in the >> cgroup. Could you change this so it only applies to certain >> controllers? If the cpu controller is entirely happy to have >> processes and cgroups as siblings, then maybe a cgroup with only cpu >> subtree control enabled could allow processes to exist. > > The document lists several reasons for not doing this and also that > there is no known real world use case for such configuration. My company's production workload would map quite nicely to this relaxed model. I have quite a few processes each with several threads. Some of those threads get some CPUs, some get other CPUs, and they vary in what shares of what CPUs they get. To be clear, there is not a hierarchy of resource usage that's compatible with the process hierarchy. Multiple processes have threads that should be grouped in a different place in the hierarchy than other threads. Concretely, I have processes A and B with threads A1, A2, B1, and B2. (And many more, but this is enough to get the point across.) The natural grouping is: Group 1: A1 and B1 Group 2: A2 Group 3: B2 This cannot be expressed with rgroup or with cgroup2. cgroup1 has no problem with it. If I were using memcg, I would want to have a memcg hierarchy that was incompatible with the hierarchy above, so I actually find the cgroup2 insistence on a unified hierarchy to be a bit annoying, but I at least understand the motivation behind the unified hierarchy. And I don't care that the system controller can't atomically move this whole mess around. I'm currently running without systemd, so I don't *have* a system controller. If I end up migrating to systemd, I'll probably put this whole pile into its own slice and manage it manually. > >> >> It *also* won't work (I think) if subtree control is enabled on the >> >> root, but I don't think this is a problem in practice because subtree >> >> control won't be enabled on the namespace root by a sensible cgroup >> >> manager. >> > >> > Exactly the same thing. You can shoot yourself in the foot but it's >> > easy not to. 
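For concreteness, the Group 1/2/3 arrangement above can be written directly against a v1 hierarchy.  A minimal sketch, assuming a cgroup v1 "cpu" hierarchy mounted at /sys/fs/cgroup/cpu with the three groups already created, and hypothetical thread ids; error handling elided:

    #include <stdio.h>
    #include <sys/types.h>

    /* v1 "tasks" files accept individual thread ids, which is what
     * makes this cross-process grouping expressible at all. */
    static void add_tid(const char *group, pid_t tid)
    {
            char path[128];
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/fs/cgroup/cpu/%s/tasks", group);
            f = fopen(path, "w");
            if (!f)
                    return;
            fprintf(f, "%d\n", (int)tid);
            fclose(f);
    }

    int main(void)
    {
            pid_t a1 = 1001, a2 = 1002;     /* hypothetical tids, process A */
            pid_t b1 = 2001, b2 = 2002;     /* hypothetical tids, process B */

            add_tid("group1", a1);          /* Group 1: A1 and B1 */
            add_tid("group1", b1);
            add_tid("group2", a2);          /* Group 2: A2 */
            add_tid("group3", b2);          /* Group 3: B2 */
            return 0;
    }

These writes have no equivalent in cgroup2, which only exposes the process-granular cgroup.procs file.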
>> >> Somewhat off-topic: this appears to be either a bug or a misfeature: >> >> bash-4.3# mkdir foo >> bash-4.3# ls foo >> cgroup.controllers cgroup.events cgroup.procs cgroup.subtree_control >> bash-4.3# mkdir foo/io.max <-- IMO this shouldn't have worked >> bash-4.3# echo +io >cgroup.subtree_control >> [ 40.470712] cgroup: cgroup_addrm_files: failed to add max, err=-17 >> >> Shouldn't cgroups with names that potentially conflict with >> kernel-provided dentries be disallowed? > > Yeap, the name collisions suck. I thought about disallowing all > sub-cgroups which starts with "KNOWN_SUBSYS." but that has a > non-trivial chance of breaking users which were happy before when a > new controller gets added. But, yeah, we at least should disallow the > known filenames. Will think more about it. How about disallowing names that contain a '.'? --Andy ^ permalink raw reply [flat|nested] 48+ messages in thread
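A minimal sketch of the kind of check being proposed above -- purely hypothetical, not the actual kernel code; it assumes a validation helper invoked on the cgroup mkdir path:

    #include <errno.h>
    #include <string.h>

    /* Hypothetical helper: controller interface files are all named
     * "<controller>.<name>", so rejecting any '.' in a cgroup name
     * would rule out collisions with current and future controllers. */
    static int cgroup_name_allowed(const char *name)
    {
            if (strchr(name, '.'))
                    return -EINVAL;
            return 0;
    }

With such a check, the "mkdir foo/io.max" in the transcript above would fail with -EINVAL up front instead of tripping cgroup_addrm_files() later.  The trade-off, as the reply below notes, is that it would also break any existing user whose cgroup names happen to contain a dot.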
* Re: [Documentation] State of CPU controller in cgroup v2
[not found] ` <CALCETrXj2Z=-GMaWV_EpCvw_8C3t1vc=D53Ff2wdvo=At8ZF1Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-09-03 22:05 ` Tejun Heo
2016-09-05 17:37 ` Andy Lutomirski
0 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2016-09-03 22:05 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Ingo Molnar, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra, Johannes Weiner, Linus Torvalds

Hello, Andy.

On Wed, Aug 31, 2016 at 02:46:20PM -0700, Andy Lutomirski wrote:
> > Consider a use case where the user isn't interested in fully
> > accounting and dividing up system resources but wants to just cap
> > resource usage from a subset of workloads.  There is no reason to
> > require such usages to fully contain all processes in non-root
> > cgroups.  Furthermore, it's not trivial to migrate all processes out
> > of root to a sub-cgroup unless the agent is in full control of boot
> > process.
>
> Then please also consider exactly the same use case while running in a
> container.
>
> I'm a bit frustrated that you're saying that my example failure modes
> consist of shooting oneself in the foot and then you go on to come up
> with your own examples that have precisely the same problem.

You have a point, which is

  The system-root and namespace-roots are not symmetric.

and that's a valid concern.  Here's why the system-root is special.

* A system has entities and resource consumptions which can only be attributed to the "system".  The system-root is the natural place to put them.  The system-root has stuff no other cgroups, not even namespace-roots, have.  It's a unique situation.

* The need to bypass most cgroup-related overhead when not in use.  The system-root is there whether cgroup is actually in use or not and thus cannot impose noticeable overhead.  It has to make sense for both resource-controlled systems as well as ones that aren't.  Again, no other group has these requirements.

Note that this means that all controllers should be able to and already allow uncontained consumptions in the system-root.  I'll come back to this later.

Now, due to the various issues with direct competition between processes and cgroups, cgroup v2 disallows resource control across them (the no-internal-tasks restriction); however, cgroup v2 currently doesn't apply the restriction to the system-root.  Here are the reasons.

* It doesn't bring any practical benefits in terms of implementation.  As noted above, all controllers already have to allow uncontained consumptions in the system-root and that's the only attribute required for the exemption.

* It doesn't bring any practical benefits in terms of capability.  Userland can trivially handle the system-root and namespace-roots in a symmetrical manner.

* It's an unnecessary inconvenience, especially for cases where the cgroup agent isn't in control of the boot process, for partial usage cases, or just for playing with it.

You say that I'm ignoring the same use case for namespace-scope, but namespace-roots don't have the same hybrid function for partial and uncontrolled systems, so it's not clear why there even NEEDS to be strict symmetry.

On this subject, your only actual point is that there is an asymmetry and that's bothersome.  I've been trying to explain why the special case doesn't actually get in the way in terms of implementation or capability and is actually beneficial.
Instead of engaging in the actual discussion, you're constantly coming up with different ways of saying "it's not symmetric".  The system-root and namespace-roots aren't equivalent.  There are a lot of parallels between the system-root and a namespace-root, but they aren't the same thing (e.g. bootstrapping a namespace is a less complicated and more malleable process).  The system-root is not even a fully qualified node of the resource graph.

It's easy and understandable to get hangups on asymmetries or exemptions like this, but they also often are acceptable trade-offs.  It's really frustrating to see you first getting hung up on "this must be wrong" and even after explanations repeating the same thing just in different ways.  If there is something fundamentally wrong with it, sure, let's fix it, but what's actually broken?

> > I have, multiple times.  Can you please read 2-1-2 of the document in
> > the original post and take the discussion from there?
>
> I've read it multiple times, and I don't see any explanation that's
> consistent with the fact that you are exempting the root cgroup from
> this constraint.  If the constraint were really critical to everything
> working, then I would expect the root cgroup to have exactly the same
> problem.  This makes me think that either something nasty is being
> fudged for the root cgroup or that the constraint isn't actually so
> important after all.  The only thing on point I can find is:
>
> > Root cgroup is exempt from this constraint, which is in line with
> > how root cgroup is handled in general - it's excluded from cgroup
> > resource accounting and control.
>
> and that's not very helpful.

My apologies.  I somehow thought that was part of the documentation.  Will update it later, but here's an excerpt from my earlier response.

  Having a special case doesn't necessarily get in the way of benefiting from a set of general rules.  The root cgroup is inherently special as it has to be the catch-all scope for entities and resource consumptions which can't be tied to any specific consumer - irq handling, packet rx, journal writes, memory reclaim from global memory pressure and so on.  None of the sub-cgroups have to worry about them.

  These base-system operations are special regardless of cgroup and we already have sometimes crude ways to affect their behaviors where necessary, through sysctl knobs, priorities on specific kernel threads and so on.  cgroup doesn't change the situation all that much.  What gets left in the root cgroup usually are the base-system operations which are outside the scope of cgroup resource control in the first place, and the cgroup resource graph can treat the root as an opaque anchor point.

  There can be other ways to deal with the issue; however, treating the root cgroup this way has the big advantage of minimizing the gap between configurations without and with cgroups, both in terms of mental model and implementation.

Hopefully, the case of a namespace root is clear now.  If it's gonna have a sub-hierarchy, it itself can't contain processes, but the system root just contains base-system entities and resources which a namespace root doesn't have to worry about.  Ignoring base-system stuff, a namespace root is topologically in the same position as the system root in the cgroup resource graph.  Maybe this wasn't as clear as I thought it was.  I hope the earlier part of this message is enough of a clarification.

> >> Also, here's an idea to maybe make PeterZ happier: relax the
> >> restriction a bit per-controller.
> >> Currently (except for /), if you
> >> have subtree control enabled you can't have any processes in the
> >> cgroup.  Could you change this so it only applies to certain
> >> controllers?  If the cpu controller is entirely happy to have
> >> processes and cgroups as siblings, then maybe a cgroup with only cpu
> >> subtree control enabled could allow processes to exist.
> >
> > The document lists several reasons for not doing this and also that
> > there is no known real-world use case for such a configuration.

So, up until this point, we were talking about the no-internal-tasks constraint.

> My company's production workload would map quite nicely to this
> relaxed model.  I have quite a few processes each with several
> threads.  Some of those threads get some CPUs, some get other CPUs,
> and they vary in what shares of what CPUs they get.  To be clear,
> there is not a hierarchy of resource usage that's compatible with the
> process hierarchy.  Multiple processes have threads that should be
> grouped in a different place in the hierarchy than other threads.
> Concretely, I have processes A and B with threads A1, A2, B1, and B2.
> (And many more, but this is enough to get the point across.)  The
> natural grouping is:
>
> Group 1: A1 and B1
> Group 2: A2
> Group 3: B2

And now you're talking about process granularity.

> This cannot be expressed with rgroup or with cgroup2.  cgroup1 has no
> problem with it.  If I were using memcg, I would want to have a memcg
> hierarchy that was incompatible with the hierarchy above, so I
> actually find the cgroup2 insistence on a unified hierarchy to be a
> bit annoying, but I at least understand the motivation behind the
> unified hierarchy.
>
> And I don't care that the system controller can't atomically move this
> whole mess around.  I'm currently running without systemd, so I don't

I do.  It's a horrible userland API to expose to individual applications if the organization that a given application expects can be disturbed by system operations.  Imagine how this would be documented - "if this operation races with system operation, it may return -ENOENT.  Repeating the path lookup might make the operation succeed again."

> *have* a system controller.  If I end up migrating to systemd, I'll
> probably put this whole pile into its own slice and manage it
> manually.

Yeah, systemd has a delegation feature for cases like that, which we depend on too.

As for your example, who performs the cgroup setup and configuration, the application itself or an external entity?  If an external entity, how does it know which thread is what?

And, as for rgroup not covering it, would extending rgroup to cover multi-process cases be enough or are there more fundamental issues?

> > Yeap, the name collisions suck.  I thought about disallowing all
> > sub-cgroups which start with "KNOWN_SUBSYS." but that has a
> > non-trivial chance of breaking users who were happy before when a
> > new controller gets added.  But, yeah, we at least should disallow the
> > known filenames.  Will think more about it.
>
> How about disallowing names that contain a '.'?

That's guaranteed to break things left and right, and, given how far it departs from what has been allowed all along, including in v1, it'd be a gratuitously painful change.  While name collision is a nasty possibility, it seldom is a practical problem, as most users pick naming schemes which are unlikely to actually collide.  Even "$SUBSYS." is likely too broad.  Most cures seem worse than the disease here.

Thanks.
-- tejun ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Documentation] State of CPU controller in cgroup v2 2016-09-03 22:05 ` Tejun Heo @ 2016-09-05 17:37 ` Andy Lutomirski [not found] ` <CALCETrVcAjFWLQ1arjSP-g=4jRY_J7G-j9JJHrvTDgOnxApYPw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2016-09-09 22:57 ` Tejun Heo 0 siblings, 2 replies; 48+ messages in thread From: Andy Lutomirski @ 2016-09-05 17:37 UTC (permalink / raw) To: Tejun Heo Cc: Ingo Molnar, Mike Galbraith, linux-kernel@vger.kernel.org, kernel-team, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra, Johannes Weiner, Linus Torvalds On Sat, Sep 3, 2016 at 3:05 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, Andy. > > On Wed, Aug 31, 2016 at 02:46:20PM -0700, Andy Lutomirski wrote: >> > Consider a use case where the user isn't interested in fully >> > accounting and dividing up system resources but wants to just cap >> > resource usage from a subset of workloads. There is no reason to >> > require such usages to fully contain all processes in non-root >> > cgroups. Furthermore, it's not trivial to migrate all processes out >> > of root to a sub-cgroup unless the agent is in full control of boot >> > process. >> >> Then please also consider exactly the same use case while running in a >> container. >> >> I'm a bit frustrated that you're saying that my example failure modes >> consist of shooting oneself in the foot and then you go on to come up >> with your own examples that have precisely the same problem. > > You have a point, which is > > The system-root and namespace-roots are not symmetric. > > and that's a valid concern. Here's why the system-root is special. > [...] > > Now, due to the various issues with direct competition between > processes and cgroups, cgroup v2 disallows resource control across > them (the no-internal-tasks restriction); however, cgroup v2 currently > doesn't apply the restriction to the system-root. Here are the > reasons. > > * It doesn't bring any practical benefits in terms of implementation. > As noted above, all controllers already have to allow uncontained > consumptions in the system-root and that's the only attribute > required for the exemption. > > * It doesn't bring any practical benefits in terms of capability. > Userland can trivially handle the system-root and namespace-roots in > a symmetrical manner. Your idea of "trivially" doesn't match mine. You gave a use case in which userspace might take advantage of root being special. If userspace does that, then that userspace cannot be run in a container. This could be a problem for real users. Sure, "don't do that" is a *valid* answer, but it's not a very helpful answer. > > * It's an unncessary inconvenience, especially for cases where the > cgroup agent isn't in control of boot, for partial usage cases, or > just for playing with it. > > You say that I'm ignoring the same use case for namespace-scope but > namespace-roots don't have the same hybrid function for partial and > uncontrolled systems, so it's not clear why there even NEEDS to be > strict symmetry. I think their functions are much closer than you think they are. I want a whole Linux distro to be able to run in a container. This means that useful things people do in a distro or initramfs or whatever should just work if containerized. > > It's easy and understandable to get hangups on asymmetries or > exemptions like this, but they also often are acceptable trade-offs. 
> It's really frustrating to see you first getting hung up on "this must > be wrong" and even after explanations repeating the same thing just in > different ways. > > If there is something fundamentally wrong with it, sure, let's fix it, > but what's actually broken? I'm not saying it's fundamentally wrong. I'm saying it's a design that has a big wart, and that wart is unfortunate, and after thinking a bit, I'm starting to agree with PeterZ that this is problematic. It also seems fixable: the constraint could be relaxed. >> >> Also, here's an idea to maybe make PeterZ happier: relax the >> >> restriction a bit per-controller. Currently (except for /), if you >> >> have subtree control enabled you can't have any processes in the >> >> cgroup. Could you change this so it only applies to certain >> >> controllers? If the cpu controller is entirely happy to have >> >> processes and cgroups as siblings, then maybe a cgroup with only cpu >> >> subtree control enabled could allow processes to exist. >> > >> > The document lists several reasons for not doing this and also that >> > there is no known real world use case for such configuration. > > So, up until this point, we were talking about no-internal-tasks > constraint. Isn't this the same thing? IIUC the constraint in question is that, if a non-root cgroup has subtree control on, then it can't have processes in it. This is the no-internal-tasks constraint, right? And I still think that, at least for cpu, nothing at all goes wrong if you allow processes to exist in cgroups that have cpu set in subtree-control. ----- begin talking about process granularity ----- > >> My company's production workload would map quite nicely to this >> relaxed model. I have quite a few processes each with several >> threads. Some of those threads get some CPUs, some get other CPUs, >> and they vary in what shares of what CPUs they get. To be clear, >> there is not a hierarchy of resource usage that's compatible with the >> process hierarchy. Multiple processes have threads that should be >> grouped in a different place in the hierarchy than other threads. >> Concretely, I have processes A and B with threads A1, A2, B1, and B2. >> (And many more, but this is enough to get the point across.) The >> natural grouping is: >> >> Group 1: A1 and B1 >> Group 2: A2 >> Group 3: B2 > > And now you're talking about process granularity. Yes. > >> This cannot be expressed with rgroup or with cgroup2. cgroup1 has no >> problem with it. If I were using memcg, I would want to have a memcg >> hierarchy that was incompatible with the hierarchy above, so I >> actually find the cgroup2 insistence on a unified hierarchy to be a >> bit annoying, but I at least understand the motivation behind the >> unified hierarchy. >> >> And I don't care that the system controller can't atomically move this >> whole mess around. I'm currently running without systemd, so I don't > > I do. It's a horrible userland API to expose to individual > applications if the organization that a given application expects can > be disturbed by system operations. Imagine how this would be > documented - "if this operation races with system operation, it may > return -ENOENT. Repeating the path lookup might make the operation > succeed again." It could be made to work without races, though, with minimal (or even no) ABI change. The managed program could grab an fd pointing to its cgroup. Then it would use openat, etc for all operations. 
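Concretely, a minimal sketch of that fd-based scheme, assuming cgroup2 mounted at /sys/fs/cgroup and an application that has been placed in /sys/fs/cgroup/a/b; error handling elided:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            /* Grab the cgroup directory once... */
            int cg = open("/sys/fs/cgroup/a/b", O_RDONLY | O_DIRECTORY);

            /* ...then address interface files relative to it, so the
             * lookups keep working even if an outside agent renames an
             * ancestor, e.g. "mv /sys/fs/cgroup/a /sys/fs/cgroup/c". */
            int procs = openat(cg, "cgroup.procs", O_WRONLY);

            dprintf(procs, "%d\n", getpid());       /* attach self */
            close(procs);
            close(cg);
            return 0;
    }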
As long as "mv /cgroup/a/b /cgroup/c/" didn't cause that fd to stop working, we're fine.  Note that this pretty much has to work if cgroup namespaces are to allow rearrangement of the hierarchy -- '/cgroup/' from inside the namespace has to remain valid at all times.

Obviously this only works if the cgroup in question doesn't itself get destroyed, but having an internal hierarchy is a bit nonsensical if the application shares a cgroup with another application, so that shouldn't be a problem in practice.

In fact, ISTM that allowing applications to manage cgroup sub-hierarchies has almost exactly the same set of constraints as allowing namespaced cgroup managers to work.  In a container, the outer manager manages where the container lives and the container manages its own hierarchy.  Why can't fancy cgroup-aware applications work exactly the same way?

>
>> *have* a system controller.  If I end up migrating to systemd, I'll
>> probably put this whole pile into its own slice and manage it
>> manually.
>
> Yeah, systemd has a delegation feature for cases like that, which we
> depend on too.
>
> As for your example, who performs the cgroup setup and configuration,
> the application itself or an external entity?  If an external entity,
> how does it know which thread is what?

In my case, it would be a little script that reads a config file that knows all kinds of internal information about the application and its threads.

> And, as for rgroup not covering it, would extending rgroup to cover
> multi-process cases be enough or are there more fundamental issues?

Maybe, as long as the configuration could actually be created -- IIUC the current rgroup proposal requires that the hierarchy of groups matches the hierarchy implied by clone(), which isn't going to happen in my case.

But, given that this fancy-cgroup-aware-multiprocess-application case looks so much like a cgroup-using container, ISTM you could solve the problem completely by just allowing tasks to be split out by users who want to do it.  (Obviously those users will get funny results if they try to do this to memcg.  "Don't do that" seems fine here.)  I don't expect the race condition issues you're worried about to happen in practice.  Certainly not in my case, since I control the entire system.

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [Documentation] State of CPU controller in cgroup v2
[not found] ` <CALCETrVcAjFWLQ1arjSP-g=4jRY_J7G-j9JJHrvTDgOnxApYPw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-09-06 10:29 ` Peter Zijlstra
2016-10-04 14:47 ` Tejun Heo
0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2016-09-06 10:29 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Tejun Heo, Ingo Molnar, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Johannes Weiner, Linus Torvalds

On Mon, Sep 05, 2016 at 10:37:55AM -0700, Andy Lutomirski wrote:
> And I still think that, at least for cpu, nothing at all goes wrong if
> you allow processes to exist in cgroups that have cpu set in
> subtree-control.

cpu, cpuset, perf, cpuacct (although we all agree that really should be part of cpu), pid, and possibly freezer (but I think we all agree freezer is 'broken').  That's roughly half the controllers out there.  They all work on tasks, and should therefore have no problems whatsoever with allowing the full hierarchy without silly exceptions and constraints.

The fundamental problem is that we have 2 different types of controllers.  On the one hand we have the controllers above, which work on tasks, form groups of them and build up from that.  Let's call them task-controllers.  On the other hand we have controllers like memcg which take the 'system' as a whole and shrink it down into smaller bits.  Let's call these system-controllers.  The two are fundamentally at odds in capabilities, simply because of the granularity they can work on.

Merging the two into a common hierarchy is a useful concept for containerization, no argument on that, esp. when also coupled with namespaces and the like.  However, where I object _most_ strongly is having this one use dominate and destroy the capabilities (which are in use) of the task-controllers.

> > I do.  It's a horrible userland API to expose to individual
> > applications if the organization that a given application expects can
> > be disturbed by system operations.  Imagine how this would be
> > documented - "if this operation races with system operation, it may
> > return -ENOENT.  Repeating the path lookup might make the operation
> > succeed again."
>
> It could be made to work without races, though, with minimal (or even
> no) ABI change.  The managed program could grab an fd pointing to its
> cgroup.  Then it would use openat, etc for all operations.  As long as
> "mv /cgroup/a/b /cgroup/c/" didn't cause that fd to stop working,
> we're fine.

I've mentioned openat() and related APIs several times, but so far never got good reasons why that wouldn't work.

Also note that in order to partition the cpus with cpusets, you're required to generate a disjoint hierarchy (that is, one where the (common) parent is 'disabled' and the children have no overlap).  This is rather fundamental to partitioning, which by its very nature requires separation.

The result is that if you want to place your RT threads (consider an application that consists of RT and !RT parts) in a different partition, there is no common parent you can place the process in.  cgroup-v2, by placing the system-style controllers first and foremost, completely renders that scenario impossible.  Note also that any proposed rgroup would not work for this, since that, per design, is a subtree, and therefore not disjoint.
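To make the cpuset scenario concrete, here is a minimal sketch under stated assumptions: a cgroup v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset, a 4-CPU single-node machine, and hypothetical thread ids 1234 (RT) and 1235 (!RT) belonging to the same process; error handling elided:

    #include <stdio.h>
    #include <sys/stat.h>

    static void put(const char *path, const char *val)
    {
            FILE *f = fopen(path, "w");

            if (f) {
                    fputs(val, f);
                    fclose(f);
            }
    }

    int main(void)
    {
            /* two exclusive, disjoint partitions; the parent (root)
             * is effectively 'disabled' for placement purposes */
            mkdir("/sys/fs/cgroup/cpuset/rt", 0755);
            mkdir("/sys/fs/cgroup/cpuset/other", 0755);

            put("/sys/fs/cgroup/cpuset/rt/cpuset.cpus", "0-1");
            put("/sys/fs/cgroup/cpuset/rt/cpuset.mems", "0");
            put("/sys/fs/cgroup/cpuset/rt/cpuset.cpu_exclusive", "1");

            put("/sys/fs/cgroup/cpuset/other/cpuset.cpus", "2-3");
            put("/sys/fs/cgroup/cpuset/other/cpuset.mems", "0");
            put("/sys/fs/cgroup/cpuset/other/cpuset.cpu_exclusive", "1");

            /* v1 "tasks" files take thread ids, so the RT and !RT
             * threads of one process land in different partitions;
             * the process as a whole has no common parent to live in */
            put("/sys/fs/cgroup/cpuset/rt/tasks", "1234");
            put("/sys/fs/cgroup/cpuset/other/tasks", "1235");
            return 0;
    }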
So my objection to the whole cgroup-v2 model and implementation stems from the fact that it purports to be a 'better' and 'improved' system, while in actuality it neuters and destroys a lot of useful use cases.  It completely disregards all task-controllers and labels their use cases as irrelevant.

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [Documentation] State of CPU controller in cgroup v2
2016-09-06 10:29 ` Peter Zijlstra
@ 2016-10-04 14:47 ` Tejun Heo
[not found] ` <20161004144717.GA4205-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
0 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2016-10-04 14:47 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andy Lutomirski, Ingo Molnar, Mike Galbraith, linux-kernel@vger.kernel.org, kernel-team, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Johannes Weiner, Linus Torvalds

Hello, Peter.

On Tue, Sep 06, 2016 at 12:29:50PM +0200, Peter Zijlstra wrote:
> The fundamental problem is that we have 2 different types of
> controllers, on the one hand these controllers above, that work on tasks
> and form groups of them and build up from that.  Lets call them
> task-controllers.
>
> On the other hand we have controllers like memcg which take the 'system'
> as a whole and shrink it down into smaller bits.  Lets call these
> system-controllers.
>
> They are fundamentally at odds with capabilities, simply because of the
> granularity they can work on.

As pointed out multiple times, the picture is not that simple.  For example, eventually, we want to be able to account for cpu cycles spent during memory reclaim or processing IOs (e.g. encryption), which can only be tied to the resource domain, not a specific task.

There surely are things that can only be done by task-level controllers, but there are two different aspects here.  One is the actual capabilities (e.g. hierarchical proportional cpu cycle distribution) and the other is how such capabilities are exposed.  I'll continue below.

> Merging the two into a common hierarchy is a useful concept for
> containerization, no argument on that, esp. when also coupled with
> namespaces and the like.

Great, we now agree that comprehensive system resource control is useful.

> However, where I object _most_ strongly is having this one use dominate
> and destroy the capabilities (which are in use) of the task-controllers.

The objection isn't necessarily just about loss of capabilities but also about not being able to do them in the same way as in v1.  The reason I proposed rgroup instead of scoped task-granularity is that I think a properly insulated programmable interface which is in line with other widely used APIs is a better solution in the long run.  If we go the cgroupfs route for thread granularity, we pretty much lose the possibility, or at least make it very difficult, to make hierarchical resource control widely available to individual applications.

How important such use cases are is debatable.  I don't find it too difficult to imagine scenarios where individual applications like apache or torrent clients make use of it.  Probably more importantly, rgroup, or something like it, gives an application an officially supported way to build and expose its resource hierarchies, which can then be used both by the application itself and from the outside to monitor and manipulate resource distribution.

The decision between cgroupfs thread granularity and something like rgroup isn't an obvious one.  Choosing the former is the path of lower resistance, but it is so at the cost of certain long-term benefits.

> > It could be made to work without races, though, with minimal (or even
> > no) ABI change.  The managed program could grab an fd pointing to its
> > cgroup.  Then it would use openat, etc for all operations.  As long as
> > "mv /cgroup/a/b /cgroup/c/" didn't cause that fd to stop working,
> > we're fine.
>
> I've mentioned openat() and related APIs several times, but so far never
> got good reasons why that wouldn't work.

Hopefully, this part was addressed in my reply to Andy.

> cgroup-v2, by placing the system-style controllers first and foremost,
> completely renders that scenario impossible.  Note also that any proposed
> rgroup would not work for this, since that, per design, is a subtree,
> and therefore not disjoint.

If a use case absolutely requires disjoint resource hierarchies, the only solution is to keep using multiple v1 hierarchies, which necessarily excludes the possibility of doing anything across different resource types.

> So my objection to the whole cgroup-v2 model and implementation stems
> from the fact that it purports to be a 'better' and 'improved' system,
> while in actuality it neuters and destroys a lot of useful use cases.
>
> It completely disregards all task-controllers and labels their use cases
> as irrelevant.

Your objection then doesn't have much to do with the specifics of the cgroup v2 model or implementation.  It's an objection against establishing common resource domains, as that excludes building orthogonal multiple hierarchies.  That, necessarily, can only be achieved by having multiple hierarchies for different resource types and thus giving up the benefits of common resource domains.

Assuming that, I don't think your position is against cgroup v2 but more toward keeping v1 around.  We're talking about two quite different, mutually exclusive classes of use cases.  You need unified for one and disjoint for the other.  v1 is gonna be there and can easily be used alongside v2 for different controller types, which would in most cases be cpu and cpuset.

I can't see a reason why this would need to block properly supporting containerization use cases.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [Documentation] State of CPU controller in cgroup v2
[not found] ` <20161004144717.GA4205-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
@ 2016-10-05 8:07 ` Peter Zijlstra
0 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2016-10-05 8:07 UTC (permalink / raw)
To: Tejun Heo
Cc: Andy Lutomirski, Ingo Molnar, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Johannes Weiner, Linus Torvalds

On Tue, Oct 04, 2016 at 10:47:17AM -0400, Tejun Heo wrote:
> > cgroup-v2, by placing the system-style controllers first and foremost,
> > completely renders that scenario impossible.  Note also that any proposed
> > rgroup would not work for this, since that, per design, is a subtree,
> > and therefore not disjoint.
>
> If a use case absolutely requires disjoint resource hierarchies, the
> only solution is to keep using multiple v1 hierarchies, which
> necessarily excludes the possibility of doing anything across different
> resource types.
>
> > So my objection to the whole cgroup-v2 model and implementation stems
> > from the fact that it purports to be a 'better' and 'improved' system,
> > while in actuality it neuters and destroys a lot of useful use cases.
> >
> > It completely disregards all task-controllers and labels their use cases
> > as irrelevant.
>
> Your objection then doesn't have much to do with the specifics of the
> cgroup v2 model or implementation.

It is too; I've stated multiple times that the no-internal-tasks thing is bad and that the root exception is an inexcusable wart that makes the whole thing internally inconsistent.  But talking to you guys is pointless.  You'll just keep moving air until the other party tires and gives up.  My NAK on v2 stands.

> It's an objection against
> establishing common resource domains as that excludes building
> orthogonal multiple hierarchies.  That, necessarily, can only be
> achieved by having multiple hierarchies for different resource types
> and thus giving up the benefits of common resource domains.

Yes, v2 not allowing that rules it out as a valid model.

> Assuming that, I don't think your position is against cgroup v2 but
> more toward keeping v1 around.  We're talking about two quite
> different mutually exclusive classes of use cases.  You need unified
> for one and disjoint for the other.  v1 is gonna be there and can
> easily be used alongside v2 for different controller types, which
> would in most cases be cpu and cpuset.
>
> I can't see a reason why this would need to block properly supporting
> containerization use cases.

I don't block that use case; I block cgroup-v2, it's shit.

The fact is, the naming "v2" suggests it's a replacement that will deprecate "v1".  Also, the implementation is mutually exclusive with v1: you have to pick one, and the other becomes inaccessible.  You cannot even pick another one inside a container, breaking the container invariant.

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [Documentation] State of CPU controller in cgroup v2
2016-09-05 17:37 ` Andy Lutomirski
[not found] ` <CALCETrVcAjFWLQ1arjSP-g=4jRY_J7G-j9JJHrvTDgOnxApYPw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-09-09 22:57 ` Tejun Heo
[not found] ` <20160909225747.GA30105-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
[not found] ` <CALCETrUhpPQdyZ-6WRjdB+iLbpGBduRZMWXQtCuS+R7Cq7rygg@mail.gmail.com>
1 sibling, 2 replies; 48+ messages in thread
From: Tejun Heo @ 2016-09-09 22:57 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Ingo Molnar, Mike Galbraith, linux-kernel@vger.kernel.org, kernel-team, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra, Johannes Weiner, Linus Torvalds

Hello, again.

On Mon, Sep 05, 2016 at 10:37:55AM -0700, Andy Lutomirski wrote:
> > * It doesn't bring any practical benefits in terms of capability.
> > Userland can trivially handle the system-root and namespace-roots in
> > a symmetrical manner.
>
> Your idea of "trivially" doesn't match mine.  You gave a use case in

I suppose I wasn't clear enough.  It is trivial in the sense that if the userland implements something which works for namespace-root, it would work the same in system-root without further modifications.

> which userspace might take advantage of root being special.  If

I was emphasizing the cases where userspace would have to deal with the inherent differences, and, when they don't, they can behave exactly the same way.

> userspace does that, then that userspace cannot be run in a container.
> This could be a problem for real users.  Sure, "don't do that" is a
> *valid* answer, but it's not a very helpful answer.

Great, now we agree that what's currently implemented is valid.  I think you're still failing to recognize the inherent specialness of the system-root and how much unnecessary pain the removal of the exemption would cause at virtually no practical gain.  I won't repeat the same backing points here.

> > * It's an unnecessary inconvenience, especially for cases where the
> > cgroup agent isn't in control of boot, for partial usage cases, or
> > just for playing with it.
> >
> > You say that I'm ignoring the same use case for namespace-scope but
> > namespace-roots don't have the same hybrid function for partial and
> > uncontrolled systems, so it's not clear why there even NEEDS to be
> > strict symmetry.
>
> I think their functions are much closer than you think they are.  I
> want a whole Linux distro to be able to run in a container.  This
> means that useful things people do in a distro or initramfs or
> whatever should just work if containerized.

There isn't much which is getting in the way of doing that.  Again, something which follows the no-internal-tasks rule would behave the same no matter where it is.  The system-root is different in that it is exempt from the rule and thus is more flexible, but that difference is serving the purpose of handling the inherent specialness of the system-root.  AFAICS, it is the solution which causes the least amount of contortion and unnecessary inconvenience to userland.

> > It's easy and understandable to get hangups on asymmetries or
> > exemptions like this, but they also often are acceptable trade-offs.
> > It's really frustrating to see you first getting hung up on "this must
> > be wrong" and even after explanations repeating the same thing just in
> > different ways.
> >
> > If there is something fundamentally wrong with it, sure, let's fix it,
> > but what's actually broken?
>
> I'm not saying it's fundamentally wrong.
> I'm saying it's a design

You were.

> that has a big wart, and that wart is unfortunate, and after thinking
> a bit, I'm starting to agree with PeterZ that this is problematic.  It
> also seems fixable: the constraint could be relaxed.

You've been pushing for enforcing the restriction on the system-root too and now are jumping to the opposite end.  It's really frustrating that this is such a whack-a-mole game where you throw ideas without really thinking through them and only concede the bare minimum when all other logical avenues are closed off.  Here, again, you seem to be stating a strong opinion when you haven't fully thought about it or tried to understand the reasons behind it.

But, whatever, let's go there: Given the arguments that I laid out for the no-internal-tasks rule, how does the problem seem fixable through relaxing the constraint?

> >> >> Also, here's an idea to maybe make PeterZ happier: relax the
> >> >> restriction a bit per-controller.  Currently (except for /), if you
> >> >> have subtree control enabled you can't have any processes in the
> >> >> cgroup.  Could you change this so it only applies to certain
> >> >> controllers?  If the cpu controller is entirely happy to have
> >> >> processes and cgroups as siblings, then maybe a cgroup with only cpu
> >> >> subtree control enabled could allow processes to exist.
> >> >
> >> > The document lists several reasons for not doing this and also that
> >> > there is no known real-world use case for such a configuration.
> >
> > So, up until this point, we were talking about the no-internal-tasks
> > constraint.
>
> Isn't this the same thing?  IIUC the constraint in question is that,
> if a non-root cgroup has subtree control on, then it can't have
> processes in it.  This is the no-internal-tasks constraint, right?

Yes, that is what the no-internal-tasks rule is, but I don't understand how that is the same thing as process granularity.  Am I completely misunderstanding what you are trying to say here?

> And I still think that, at least for cpu, nothing at all goes wrong if
> you allow processes to exist in cgroups that have cpu set in
> subtree-control.

If you confine it to the cpu controller, ignore anonymous consumptions, the rather ugly mapping between nice and weight values and the fact that nobody could come up with a practical usefulness for such a setup, yes.  My point was never that the cpu controller can't do it but that we should find a better way of coordinating it with other controllers and exposing it to individual applications.

> ----- begin talking about process granularity -----

...

> > I do.  It's a horrible userland API to expose to individual
> > applications if the organization that a given application expects can
> > be disturbed by system operations.  Imagine how this would be
> > documented - "if this operation races with system operation, it may
> > return -ENOENT.  Repeating the path lookup might make the operation
> > succeed again."
>
> It could be made to work without races, though, with minimal (or even
> no) ABI change.  The managed program could grab an fd pointing to its
> cgroup.  Then it would use openat, etc for all operations.  As long as
> "mv /cgroup/a/b /cgroup/c/" didn't cause that fd to stop working,
> we're fine.

After a migration, the cgroup and its interface knobs are a different directory and files.  Semantically, during migration, we aren't moving the directory or files, and it'd be bizarre to overlay the semantics you're describing on top of the existing cgroupfs.
We will have to break away from the very basic vfs rule that an fd, once opened, always corresponds to the same file.  The only thing openat(2) does is abstract away prefix handling, and that is only a small part of the problem.

A more acceptable way could be implementing, say, a per-task filesystem which always appears at a fixed location and proxies the operations; however, even this wouldn't be able to handle issues stemming from the lack of actual atomicity.  Think about two tasks accessing the same interface file.  If they race against an outside agent migrating them one-by-one, they may or may not be accessing the same file.  If they perform operations with side effects such as config changes, creation of sub-cgroups and migrations, what would be the end result?

In addition, a per-task filesystem is a far worse interface to program against than a system-call-based API, especially when the same API which is used to do the exact same operations on threads can be reused for resource groups.

> Note that this pretty much has to work if cgroup namespaces are to
> allow rearrangement of the hierarchy -- '/cgroup/' from inside the
> namespace has to remain valid at all times.

If I'm not mistaken, namespaces don't allow this type of dynamic migration.

> Obviously this only works if the cgroup in question doesn't itself get
> destroyed, but having an internal hierarchy is a bit nonsensical if
> the application shares a cgroup with another application, so that
> shouldn't be a problem in practice.
>
> In fact, ISTM that allowing applications to manage cgroup
> sub-hierarchies has almost exactly the same set of constraints as
> allowing namespaced cgroup managers to work.  In a container, the
> outer manager manages where the container lives and the container
> manages its own hierarchy.  Why can't fancy cgroup-aware applications
> work exactly the same way?

System agents and individual applications are different.  This is the same argument that you brought up earlier in this thread where you said that userland can just set up namespaces for individual applications.  In purely mathematical terms, they can be mapped to each other, but that grossly ignores the practical differences between them.

Most applications should and want to keep their assumptions conservative, robust and portable, and not dependent on some crazy fragile and custom-built namespace setup that nobody in the stack is really responsible for.  How many would ever program against something like that?  A system agent has a large part of the system configuration under its control (it's the system agent after all) and thus is way more flexible in what assumptions it can dictate and depend on.
> > And, as for rgroup not covering it, would extending rgroup to cover
> > multi-process cases be enough or are there more fundamental issues?
>
> Maybe, as long as the configuration could actually be created -- IIUC
> the current rgroup proposal requires that the hierarchy of groups
> matches the hierarchy implied by clone(), which isn't going to happen
> in my case.

We can make that dynamic as long as the subtree is properly scoped; however, there is an important design decision to make here.  If we open up full-on dynamic migrations to individual applications, we commit ourselves to supporting arbitrarily high-frequency migration operations, which we've never supported before and which will restrict what we can do in terms of optimizing hot paths over migration.

We haven't had to face this decision because cgroup has never properly supported delegating to applications, and the in-use setups where this happens are custom configurations where there is no boundary between system and applications and ad-hoc trial-and-error is a good enough way to find a working solution.  That wiggle room goes away once we officially open this up to individual applications.

So, if we decide to open up dynamic assignment, we need to weigh what we gain in terms of capabilities against the reduction of implementation maneuvering room.  I guess there can be a middle ground where, for example, only initial assignment is allowed.

It is really difficult to understand your position without understanding where the requirements are coming from.  Can you please elaborate more on the workload?  Why is the specific configuration useful?  What is it trying to achieve?

> But, given that this fancy-cgroup-aware-multiprocess-application case
> looks so much like a cgroup-using container, ISTM you could solve the
> problem completely by just allowing tasks to be split out by users who
> want to do it.  (Obviously those users will get funny results if they
> try to do this to memcg.  "Don't do that" seems fine here.)  I don't
> expect the race condition issues you're worried about to happen in
> practice.  Certainly not in my case, since I control the entire
> system.

What people do now with cgroup inside an application is extremely limited.  Because there is no proper support for it, each use case has to craft up a dedicated custom setup which is all but guaranteed to be incompatible with what someone else would come up with for another application.  Everybody is in a "this is mine, I control the entire system" mindset, which is fine for those specific setups but detrimental to making it widely available and useful.

Accepting some measured restrictions and building a common ground for everyone can make in-application cgroup usages vastly more accessible and useful than now.  Certain things would need to be done differently and maybe some scenarios won't be supported as well, but those are trade-offs that we'd need to weigh against what we gain.

Another point is that, for very specific use cases where none of these generic concerns matter, keeping using cgroup v1 is fine.  The lack of common resource domains has never been an issue for those use cases anyway.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [Documentation] State of CPU controller in cgroup v2
[not found] ` <20160909225747.GA30105-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2016-09-10 8:54 ` Mike Galbraith
2016-09-10 10:08 ` Mike Galbraith
2016-09-12 15:20 ` Austin S. Hemmelgarn
2 siblings, 0 replies; 48+ messages in thread
From: Mike Galbraith @ 2016-09-10 8:54 UTC (permalink / raw)
To: Tejun Heo, Andy Lutomirski
Cc: Ingo Molnar, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra, Johannes Weiner, Linus Torvalds

On Fri, 2016-09-09 at 18:57 -0400, Tejun Heo wrote:

> But, whatever, let's go there: Given the arguments that I laid out for
> the no-internal-tasks rule, how does the problem seem fixable through
> relaxing the constraint?

Well, for one thing, cpusets would cease to leak CPUs.  With the no-internal-tasks constraint, no task can acquire the affinity of exclusive set A if set B is an exclusive subset thereof, as there is one and only one spot where the affinity of set A exists: in the forbidden set A.  Relaxing no-internal-tasks would fix that, but without also relaxing the process-only rule, cpusets would remain useless for the purpose for which they were created.  After all, it doesn't do much good to use the one and only dynamic partitioning tool to partition a box if you cannot subsequently place your tasks/threads properly therein.

> What people do now with cgroup inside an application is extremely
> limited.  Because there is no proper support for it, each use case has
> to craft up a dedicated custom setup which is all but guaranteed to be
> incompatible with what someone else would come up with for another
> application.  Everybody is in a "this is mine, I control the entire
> system" mindset, which is fine for those specific setups but
> detrimental to making it widely available and useful.

IMO, the problem with that "making it available to the huddled masses" bit is that it is a completely unrealistic fantasy.  Can hordes of programs really autonomously carve up a single set of resources?  I do not believe they can.  The system agent cannot autonomously do so either.  Intimate knowledge of local requirements is not optional; it is a prerequisite to sound decision making.  You have to have a well-defined need before it makes any sense to turn these things on; they are not free, and the impact is global.

-Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <20160909225747.GA30105-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 2016-09-10 8:54 ` Mike Galbraith @ 2016-09-10 10:08 ` Mike Galbraith [not found] ` <1473502137.3857.218.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2016-09-12 15:20 ` Austin S. Hemmelgarn 2 siblings, 1 reply; 48+ messages in thread From: Mike Galbraith @ 2016-09-10 10:08 UTC (permalink / raw) To: Tejun Heo, Andy Lutomirski Cc: Ingo Molnar, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra, Johannes Weiner, Linus Torvalds On Fri, 2016-09-09 at 18:57 -0400, Tejun Heo wrote: > > > As for your example, who performs the cgroup setup and configuration, > > > the application itself or an external entity? If an external entity, > > > how does it know which thread is what? > > > > In my case, it would be a little script that reads a config file that > > knows all kinds of internal information about the application and its > > threads. > > I see. One-of-a-kind custom setup. This is a completely valid usage; > however, please also recognize that it's an extremely specific one > which is niche by definition. This is the same pigeon-hole you placed Google into. So Google, my (also decidedly non-petite) users, and now Andy are all sharing the one-of-a-kind, extremely specific niche... it's becoming a tad crowded. -Mike ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <1473502137.3857.218.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>]
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <1473502137.3857.218.camel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2016-09-30 9:06 ` Tejun Heo [not found] ` <20160930090603.GD29207-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 48+ messages in thread From: Tejun Heo @ 2016-09-30 9:06 UTC (permalink / raw) To: Mike Galbraith Cc: Andy Lutomirski, Ingo Molnar, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra, Johannes Weiner, Linus Torvalds Hello, Mike. On Sat, Sep 10, 2016 at 12:08:57PM +0200, Mike Galbraith wrote: > On Fri, 2016-09-09 at 18:57 -0400, Tejun Heo wrote: > > > > As for your example, who performs the cgroup setup and configuration, > > > > the application itself or an external entity? If an external entity, > > > > how does it know which thread is what? > > > > > > In my case, it would be a little script that reads a config file that > > > knows all kinds of internal information about the application and its > > > threads. > > > > I see. One-of-a-kind custom setup. This is a completely valid usage; > > however, please also recognize that it's an extremely specific one > > which is niche by definition. > > This is the same pigeon-hole you placed Google into. So Google, my > (also decidedly non-petite) users, and now Andy are all sharing the > one-of-a-kind, extremely specific niche... it's becoming a tad crowded. I wasn't trying to say that these use cases are small in numbers when added up, but that they're all isolated in their own small silos. Facebook has a lot of these usages too but they're almost all mutually exclusive. Making workloads share machines or even adding resource control for base system operations afterwards is extremely difficult. There are cases where these ad hoc approaches make sense but insisting that this is all there is to resource control is short-sighted. Thanks. -- tejun ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <20160930090603.GD29207-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>]
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <20160930090603.GD29207-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> @ 2016-09-30 14:53 ` Mike Galbraith 0 siblings, 0 replies; 48+ messages in thread From: Mike Galbraith @ 2016-09-30 14:53 UTC (permalink / raw) To: Tejun Heo Cc: Andy Lutomirski, Ingo Molnar, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra, Johannes Weiner, Linus Torvalds On Fri, 2016-09-30 at 11:06 +0200, Tejun Heo wrote: > Hello, Mike. > > On Sat, Sep 10, 2016 at 12:08:57PM +0200, Mike Galbraith wrote: > > On Fri, 2016-09-09 at 18:57 -0400, Tejun Heo wrote: > > > > > As for your example, who performs the cgroup setup and configuration, > > > > > the application itself or an external entity? If an external entity, > > > > > how does it know which thread is what? > > > > > > > > In my case, it would be a little script that reads a config file that > > > > knows all kinds of internal information about the application and its > > > > threads. > > > > > > I see. One-of-a-kind custom setup. This is a completely valid usage; > > > however, please also recognize that it's an extremely specific one > > > which is niche by definition. > > > > This is the same pigeon-hole you placed Google into. So Google, my > > (also decidedly non-petite) users, and now Andy are all sharing the > > one-of-a-kind, extremely specific niche... it's becoming a tad crowded. > > I wasn't trying to say that these use cases are small in numbers when > added up, but that they're all isolated in their own small silos. These use cases exist, and are perfectly valid use cases. That is the sum and total of what is relevant. > Facebook has a lot of these usages too but they're almost all mutually > exclusive. Making workloads share machines or even adding resource > control for base system operations afterwards is extremely difficult. The cases I have in mind are not difficult to deal with, as you don't have to worry about collisions. > There are cases where these ad hoc approaches make sense but insisting that > this is all there is to resource control is short-sighted. 1. I never insisted any such thing. 2. Please stop pigeon-holing. The use cases in question are no more ad hoc than any other usage, they are all "for this", none are globally applicable. What they are is power users utilizing the intimate knowledge that is both required and in their possession, who are in fact using controllers precisely as said controllers were designed to be used. No, these usages do not belong in an "ad hoc" (aka disposable refuse) pigeon-hole. I choose to ignore the one you stuffed me into. -Mike ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <20160909225747.GA30105-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 2016-09-10 8:54 ` Mike Galbraith 2016-09-10 10:08 ` Mike Galbraith @ 2016-09-12 15:20 ` Austin S. Hemmelgarn [not found] ` <ab6f3376-4c09-a339-f984-937f537ddc17-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2 siblings, 1 reply; 48+ messages in thread From: Austin S. Hemmelgarn @ 2016-09-12 15:20 UTC (permalink / raw) To: Tejun Heo, Andy Lutomirski Cc: Ingo Molnar, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra, Johannes Weiner, Linus Torvalds On 2016-09-09 18:57, Tejun Heo wrote: > Hello, again. > > On Mon, Sep 05, 2016 at 10:37:55AM -0700, Andy Lutomirski wrote: >>> * It doesn't bring any practical benefits in terms of capability. >>> Userland can trivially handle the system-root and namespace-roots in >>> a symmetrical manner. >> >> Your idea of "trivially" doesn't match mine. You gave a use case in > > I suppose I wasn't clear enough. It is trivial in the sense that if > the userland implements something which works for namespace-root, it > would work the same in system-root without further modifications. > >> which userspace might take advantage of root being special. If > > I was emphasizing the cases where userspace would have to deal with > the inherent differences, and, when they don't, they can behave > exactly the same way. > >> userspace does that, then that userspace cannot be run in a container. >> This could be a problem for real users. Sure, "don't do that" is a >> *valid* answer, but it's not a very helpful answer. > > Great, now we agree that what's currently implemented is valid. I > think you're still failing to recognize the inherent specialness of > the system-root and how much unnecessary pain the removal of the > exemption would cause at virtually no practical gain. I won't repeat > the same backing points here. > >>> * It's an unnecessary inconvenience, especially for cases where the >>> cgroup agent isn't in control of boot, for partial usage cases, or >>> just for playing with it. >>> >>> You say that I'm ignoring the same use case for namespace-scope but >>> namespace-roots don't have the same hybrid function for partial and >>> uncontrolled systems, so it's not clear why there even NEEDS to be >>> strict symmetry. >> >> I think their functions are much closer than you think they are. I >> want a whole Linux distro to be able to run in a container. This >> means that useful things people do in a distro or initramfs or >> whatever should just work if containerized. > > There isn't much which is getting in the way of doing that. Again, > something which follows the no-internal-tasks rule would behave the same no > matter where it is. The system-root is different in that it is exempt > from the rule and thus is more flexible but that difference is serving > the purpose of handling the inherent specialness of the system-root. > AFAICS, it is the solution which causes the least amount of contortion > and unnecessary inconvenience to userland. > >>> It's easy and understandable to get hangups on asymmetries or >>> exemptions like this, but they also often are acceptable trade-offs. >>> It's really frustrating to see you first getting hung up on "this must >>> be wrong" and even after explanations repeating the same thing just in >>> different ways.
>>> >>> If there is something fundamentally wrong with it, sure, let's fix it, >>> but what's actually broken? >> >> I'm not saying it's fundamentally wrong. I'm saying it's a design > > You were. > >> that has a big wart, and that wart is unfortunate, and after thinking >> a bit, I'm starting to agree with PeterZ that this is problematic. It >> also seems fixable: the constraint could be relaxed. > > You've been pushing for enforcing the restriction on the system-root > too and now are jumping to the opposite end. It's really frustrating > that this is such a whack-a-mole game where you throw ideas without > really thinking through them and only concede the bare minimum when > all other logical avenues are closed off. Here, again, you seem to be > stating a strong opinion when you haven't fully thought about it or > tried to understand the reasons behind it. > > But, whatever, let's go there: Given the arguments that I laid out for > the no-internal-tasks rule, how does the problem seem fixable through > relaxing the constraint? > >>>>>> Also, here's an idea to maybe make PeterZ happier: relax the >>>>>> restriction a bit per-controller. Currently (except for /), if you >>>>>> have subtree control enabled you can't have any processes in the >>>>>> cgroup. Could you change this so it only applies to certain >>>>>> controllers? If the cpu controller is entirely happy to have >>>>>> processes and cgroups as siblings, then maybe a cgroup with only cpu >>>>>> subtree control enabled could allow processes to exist. >>>>> >>>>> The document lists several reasons for not doing this and also that >>>>> there is no known real-world use case for such a configuration. >>> >>> So, up until this point, we were talking about the no-internal-tasks >>> constraint. >> >> Isn't this the same thing? IIUC the constraint in question is that, >> if a non-root cgroup has subtree control on, then it can't have >> processes in it. This is the no-internal-tasks constraint, right? > > Yes, that is what the no-internal-tasks rule is, but I don't understand how > that is the same thing as process granularity. Am I completely > misunderstanding what you are trying to say here? > >> And I still think that, at least for cpu, nothing at all goes wrong if >> you allow processes to exist in cgroups that have cpu set in >> subtree-control. > > If you confine it to the cpu controller and ignore anonymous > consumption, the rather ugly mapping between nice and weight values, > and the fact that nobody could come up with a practical use for > such a setup, yes. My point was never that the cpu controller can't do > it but that we should find a better way of coordinating it with other > controllers and exposing it to individual applications. So, having a container where not everything in the container is split further into subgroups is not a practically useful situation? Because that's exactly what both systemd and every other cgroup management tool expect to have work as things stand right now. The root cgroup within a cgroup namespace has to function exactly like the system-root; otherwise nothing can depend on the special cases for the system root, because they might get run in a cgroup namespace and such assumptions will be invalid. This in turn means that no current distro can run unmodified in a cgroup namespace under a v2 hierarchy, which is a Very Bad Thing. > >> ----- begin talking about process granularity ----- > ... >>> I do.
It's a horrible userland API to expose to individual >>> applications if the organization that a given application expects can >>> be disturbed by system operations. Imagine how this would be >>> documented - "if this operation races with a system operation, it may >>> return -ENOENT. Repeating the path lookup might make the operation >>> succeed again." >> >> It could be made to work without races, though, with minimal (or even >> no) ABI change. The managed program could grab an fd pointing to its >> cgroup. Then it would use openat, etc for all operations. As long as >> 'mv /cgroup/a/b /cgroup/c/' didn't cause that fd to stop working, >> we're fine. > > After a migration, the cgroup and its interface knobs are a different > directory and files. Semantically, during migration, we aren't moving > the directory or files and it'd be bizarre to overlay the semantics > you're describing on top of the existing cgroupfs. We will have to > break away from the very basic vfs rules such as an fd, once opened, > always corresponding to the same file. The only thing openat(2) does > is abstract away prefix handling, and that is only a small part of > the problem. > > A more acceptable way could be implementing, say, a per-task filesystem > which always appears at a fixed location and proxies the operations; > however, even this wouldn't be able to handle issues stemming from > lack of actual atomicity. Think about two tasks accessing the same > interface file. If they race against an outside agent migrating them > one-by-one, they may or may not be accessing the same file. If they > perform operations with side effects such as config changes, creation > of sub-cgroups and migrations, what would be the end result? > > In addition, a per-task filesystem is a far worse interface to > program against than a system-call-based API, especially when the same > API which is used to do the exact same operations on threads can be > reused for resource groups. > >> Note that this pretty much has to work if cgroup namespaces are to >> allow rearrangement of the hierarchy -- '/cgroup/' from inside the >> namespace has to remain valid at all times > > If I'm not mistaken, namespaces don't allow this type of dynamic > migration. > >> Obviously this only works if the cgroup in question doesn't itself get >> destroyed, but having an internal hierarchy is a bit nonsensical if >> the application shares a cgroup with another application, so that >> shouldn't be a problem in practice. >> >> In fact, ISTM that allowing applications to manage cgroup >> sub-hierarchies has almost exactly the same set of constraints as >> allowing namespaced cgroup managers to work. In a container, the >> outer manager manages where the container lives and the container >> manages its own hierarchy. Why can't fancy cgroup-aware applications >> work exactly the same way? > > System agents and individual applications are different. This is the > same argument that you brought up earlier in this thread where you > said that userland can just set up namespaces for individual > applications. In purely mathematical terms, they can be mapped to > each other but that grossly ignores practical differences between > them. > > Most applications should and want to keep their assumptions > conservative, robust and portable, and not dependent on some crazy > fragile and custom-built namespace setup that nobody in the stack is > really responsible for. How many would ever program against something > like that?
> > A system agent has a large part of the system configuration under its > control (it's the system agent after all) and thus is way more > flexible in what assumptions it can dictate and depend on. > >>> Yeah, systemd has a delegation feature for cases like that which we >>> depend on too. >>> >>> As for your example, who performs the cgroup setup and configuration, >>> the application itself or an external entity? If an external entity, >>> how does it know which thread is what? >> >> In my case, it would be a little script that reads a config file that >> knows all kinds of internal information about the application and its >> threads. > > I see. One-of-a-kind custom setup. This is a completely valid usage; > however, please also recognize that it's an extremely specific one > which is niche by definition. If we're going to support > in-application hierarchical resource control, I think it's very > important to make sure that it's something which is easy to use and > widely accessible so that any lay application can make use of it. > I'll come back to this point later. > >>> And, as for rgroup not covering it, would extending rgroup to cover >>> multi-process cases be enough or are there more fundamental issues? >> >> Maybe, as long as the configuration could actually be created -- IIUC >> the current rgroup proposal requires that the hierarchy of groups >> matches the hierarchy implied by clone(), which isn't going to happen >> in my case. > > We can make that dynamic as long as the subtree is properly scoped; > however, there is an important design decision to make here. If we > open up full-on dynamic migrations to individual applications, we > commit ourselves to supporting arbitrarily high frequency migration > operations, which we've never supported before and which will restrict what > we can do in terms of optimizing hot paths over migration. > > We haven't had to face this decision because cgroup has never properly > supported delegating to applications, and the in-use setups where this > happens are custom configurations where there is no boundary between > system and applications and ad hoc trial and error is a good enough way > to find a working solution. That wiggle room goes away once we > officially open this up to individual applications. > > So, if we decide to open up dynamic assignment, we need to weigh what > we gain in terms of capabilities against the reduction of implementation > maneuvering room. I guess there can be a middle ground where, for > example, only initial assignment is allowed. > > It is really difficult to understand your position without > understanding where the requirements are coming from. Can you please > elaborate more on the workload? Why is the specific configuration > useful? What is it trying to achieve? > >> But, given that this fancy-cgroup-aware-multiprocess-application case >> looks so much like cgroup-using container, ISTM you could solve the >> problem completely by just allowing tasks to be split out by users who >> want to do it. (Obviously those users will get funny results if they >> try to do this to memcg. "Don't do that" seems fine here.) I don't >> expect the race condition issues you're worried about to happen in >> practice. Certainly not in my case, since I control the entire >> system. > > What people do now with cgroup inside an application is extremely > limited.
Because there is no proper support for it, each use case has > to craft up a dedicated custom setup which is all but guaranteed to be > incompatible with what someone else would come up with for another > application. Everybody is in a "this is mine, I control the entire > system" mindset, which is fine for those specific setups but > detrimental to making it widely available and useful. > > Accepting some measured restrictions and building a common ground for > everyone can make in-application cgroup usages vastly more accessible > and useful than now. Certain things would need to be done differently > and maybe some scenarios won't be supported as well, but those are > trade-offs that we'd need to weigh against what we gain. Another > point is that, for very specific use cases where none of these generic > concerns matter, continuing to use cgroup v1 is fine. The lack of common > resource domains has never been an issue for those use cases anyway. > > Thanks. > ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <ab6f3376-4c09-a339-f984-937f537ddc17-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>]
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <ab6f3376-4c09-a339-f984-937f537ddc17-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2016-09-19 21:34 ` Tejun Heo 0 siblings, 0 replies; 48+ messages in thread From: Tejun Heo @ 2016-09-19 21:34 UTC (permalink / raw) To: Austin S. Hemmelgarn Cc: Andy Lutomirski, Ingo Molnar, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP), Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra, Johannes Weiner, Linus Torvalds Hello, Austin. On Mon, Sep 12, 2016 at 11:20:03AM -0400, Austin S. Hemmelgarn wrote: > > If you confine it to the cpu controller and ignore anonymous > > consumption, the rather ugly mapping between nice and weight values, > > and the fact that nobody could come up with a practical use for > > such a setup, yes. My point was never that the cpu controller can't do > > it but that we should find a better way of coordinating it with other > > controllers and exposing it to individual applications. > > So, having a container where not everything in the container is split > further into subgroups is not a practically useful situation? Because > that's exactly what both systemd and every other cgroup management tool > expect to have work as things stand right now. The root cgroup within a Not true.

  $ cat /proc/1/cgroup
  11:hugetlb:/
  10:pids:/init.scope
  9:blkio:/
  8:cpuset:/
  7:memory:/
  6:freezer:/
  5:perf_event:/
  4:net_cls,net_prio:/
  3:cpu,cpuacct:/
  2:devices:/init.scope
  1:name=systemd:/init.scope

  $ systemctl --version
  systemd 229
  +PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN

> cgroup namespace has to function exactly like the system-root, otherwise > nothing can depend on the special cases for the system root, because they > might get run in a cgroup namespace and such assumptions will be invalid. systemd already behaves exactly the same whether it's inside a namespace or not. > This in turn means that no current distro can run unmodified in a cgroup > namespace under a v2 hierarchy, which is a Very Bad Thing. cgroup v1 hierarchies can be mounted the same inside a namespace whether the system itself is on cgroup v1 or v2. Obviously, a given controller can only be attached to one hierarchy, so a controller can't be used at the same time on both v1 and v2 hierarchies; however, that is true with different v1 hierarchies too, and, given that delegation doesn't work properly on v1, shouldn't be that much of an issue. I'm not just claiming it. systemd-nspawn can already be on either v1 or v2 hierarchies regardless of what the outer systemd uses. Out of the claims that you made, the only one which holds up is that existing software can't make use of cgroup v2 without modifications, which is true but at the same time doesn't mean much of anything. Thanks. -- tejun ^ permalink raw reply [flat|nested] 48+ messages in thread
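The mounting claim is easy to make concrete; a sketch of what a container manager might do inside a namespace, with mount points chosen purely for illustration:

  # the v2 unified hierarchy and v1 hierarchies are different fs types;
  # either can be mounted inside a container
  mount -t cgroup2 none /sys/fs/cgroup/unified
  mount -t cgroup -o cpu,cpuacct none /sys/fs/cgroup/cpu
  # caveat from above: a controller already attached to one hierarchy
  # on the host cannot be bound to another at the same time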
[parent not found: <CALCETrUhpPQdyZ-6WRjdB+iLbpGBduRZMWXQtCuS+R7Cq7rygg@mail.gmail.com>]
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <CALCETrUhpPQdyZ-6WRjdB+iLbpGBduRZMWXQtCuS+R7Cq7rygg@mail.gmail.com> @ 2016-09-14 20:00 ` Tejun Heo [not found] ` <20160914200041.GB6832-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 48+ messages in thread From: Tejun Heo @ 2016-09-14 20:00 UTC (permalink / raw) To: Andy Lutomirski Cc: Ingo Molnar, Mike Galbraith, Andrew Morton, kernel-team, open list:CONTROL GROUP (CGROUP), Linux API, Li Zefan, Paul Turner, linux-kernel@vger.kernel.org, Linus Torvalds, Johannes Weiner, Peter Zijlstra Hello, On Mon, Sep 12, 2016 at 10:39:04AM -0700, Andy Lutomirski wrote: > > > Your idea of "trivially" doesn't match mine. You gave a use case in > > > > I suppose I wasn't clear enough. It is trivial in the sense that if > > the userland implements something which works for namespace-root, it > > would work the same in system-root without further modifications. > > So I guess userspace can trivially get it right and can just as trivially > get it wrong. I wasn't trying to play a word game. What I was trying to say is that a configuration which works for namespace-roots works for the system-root too, in terms of cgroup hierarchy, without any modifications. > > Great, now we agree that what's currently implemented is valid. I > > think you're still failing to recognize the inherent specialness of > > the system-root and how much unnecessary pain the removal of the > > exemption would cause at virtually no practical gain. I won't repeat > > the same backing points here. > > I'm starting to think that you could extend the exemption with considerably > less difficulty. Can you please elaborate? It feels like you're repeating the same opinions without really describing them in detail or backing them up in the last couple of replies. Having differing opinions is fine but to actually hash them out, the opinions and their rationales need to be laid out in detail. > > There isn't much which is getting in the way of doing that. Again, > > something which follows no-internal-task rule would behave the same no > > matter where it is. The system-root is different in that it is exempt > > from the rule and thus is more flexible but that difference is serving > > the purpose of handling the inherent specialness of the system-root. > > From *userspace's* POV, I still don't think there's any specialness except > from an accounting POV. After all, userspace has no control over the > special stuff anyway. And accounting doesn't matter: a namespace could > just see zeros in any special root accounting slots. The disagreement here isn't really consequential. The only reason this part became important is because you felt that something must be broken, which you now don't think is the case. I agree that there can be other ways to handle this but what's your proposal here? And how would that be practically and substantially better than what is implemented now? > > You've been pushing for enforcing the restriction on the system-root > > too and now are jumping to the opposite end. It's really frustrating > > that this is such a whack-a-mole game where you throw ideas without > > really thinking through them and only concede the bare minimum when > > all other logical avenues are closed off. Here, again, you seem to be > > stating a strong opinion when you haven't fully thought about it or > > tried to understand the reasons behind it. > > I think you should make it work the same way in namespace roots as it does > in the system root.
I acknowledge that there are pros and cons of each. I > think the current middle ground is worse than either of the consistent > options. Again, the only thing you're doing is restating the same opinion. I understand that you have an impression that this can be done better but how exactly? > > But, whatever, let's go there: Given the arguments that I laid out for > > the no-internal-tasks rule, how does the problem seem fixable through > > relaxing the constraint? > > By deciding that, despite the arguments you laid out, it's still worth > relaxing the constraint. Or by deciding to add the constraint to the root. You're not really saying anything of substance in the above paragraph. > > > Isn't this the same thing? IIUC the constraint in question is that, > > > if a non-root cgroup has subtree control on, then it can't have > > > processes in it. This is the no-internal-tasks constraint, right? > > > > Yes, that is what no-internal-tasks rule is but I don't understand how > > that is the same thing as process granularity. Am I completely > > misunderstanding what you are trying to say here? > > Yes. I'm saying that no-internal-tasks could be relaxed per controller. I was asking whether you were wondering whether the no-internal-tasks rule and process granularity are the same thing. And, if that's not the case, what the previous sentence meant. I can't make out what you're responding to. > > If you confine it to the cpu controller, ignore anonymous > > consumptions, the rather ugly mapping between nice and weight values > > and the fact that nobody could come up with a practical usefulness for > > such setup, yes. My point was never that the cpu controller can't do > > it but that we should find a better way of coordinating it with other > > controllers and exposing it to individual applications. > > I'm not sure what the nice-vs-weight thing has to do with internal > processes, but all of this is a question for Peter. That part comes from the cgroup cpu controller weight being mapped to task nice values, because the priorities between the two have to be somehow comparable. It's not a critical issue, just awkward. > > After a migration, the cgroup and its interface knobs are a different > > directory and files. Semantically, during migration, we aren't moving > > the directory or files and it'd be bizarre to overlay the semantics > > you're describing on top of the existing cgroupfs. We will have to > > break away from the very basic vfs rules such as a fd, once opened, > > always corresponding to the same file. > > What kind of migration do you mean? Having fds follow rename(2) around is > the normal vfs behavior, so I don't really know what you mean. Process or task migration by writing a pid to the cgroup.procs or tasks file. cgroup has never supported directory/cgroup-level migrations. > > If I'm not mistaken, namespaces don't allow this type of dynamic > > migrations. > > I don't see why they couldn't allow exactly this. If you rename(2) a > cgroup, any namespace with that cgroup as root should keep it as root, > completely atomically. If this doesn't work, I'd argue that it's a bug. I hope this part is clear now. > > A system agent has a large part of the system configuration under its > > control (it's the system agent after all) and thus is way more > > flexible in what assumptions it can dictate and depend on.
> > Can you give an example of any use case for which a system agent would > fork, exec a daemon that isn't written by the same developers as the system > agent, and then walk that daemon's process tree and move the processes > around in the cgroup hierarchy one by one? I think this is what you're > describing, and I don't see why doing so is sensible. Certainly if a > system agent gives the daemon write access to cgroupfs, it should not start > moving that daemon's children around individually. That's the only way anything can be moved across cgroups. In terms of resource control, I can't think of scenarios which would *require* this behavior but it's still a behavior cgroup has to allow as there's no "spawn this process in that cgroup" call and all migrations are dynamic. We can proclaim that once an application is started, the outer scope shouldn't meddle with it. It would be another restriction where violations would actually break applications, though. And it doesn't address the other downsides - making in-application controls less approachable, as it requires specific setup and cooperation from the system agent, and the interface being awkward. > > We can make that dynamic as long as the subtree is properly scoped; > > however, there is an important design decision to make here. If we > > open up full-on dynamic migrations to individual applications, we > > commit ourselves to supporting arbitrarily high frequency migration > > operations, which we've never supported before and will restrict what > > we can do in terms of optimizing hot paths over migration. > > I haven't (yet?) seen use cases where changing cgroups *quickly* is > important. Android does something along this line - creating preset cgroups and migrating processes according to their current states. The problem is that once we generally open up the API to individual applications, there is no good way of policing the usages and there certainly are multiple ways to make use of frequent cgroup membership changes, especially for stateless controllers like CPU. We can easily end up in situations where having several of these usages on the same machine bogs down the whole system. One way to avoid this is building the API so that changing cgroup membership is naturally unattractive - e.g. membership can be assigned only on creation of a new thread or process, or migration can only be toward a deeper level in the tree, so that migrations can be used to organize the threads and processes as necessary but not used as the primary method of adjusting configurations dynamically. > > It is really difficult to understand your position without > > understanding where the requirements are coming from. Can you please > > elaborate more on the workload? Why is the specific configuration > > useful? What is it trying to achieve? > > Multiple cooperating RT processes, most of which have non-RT helper > threads. For scheduling purposes, I lump the non-RT threads together. I see. Can you please share how the cgroups are actually configured (i.e. how the weights are assigned and so on)? > Will you (Tejun), PeterZ, and maybe some of the other interested parties be > at KS? Maybe this is worth hashing out in person. Yeap, it'd be nice to talk in person. However, I'm not sure talking offline is the best way to hash out technical details. The discussion has been painful but we're actually addressing technical misunderstandings and where the actual disagreements lie. We really need to agree on what we disagree on and why first. Thanks.
-- tejun ^ permalink raw reply [flat|nested] 48+ messages in thread
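For reference, the migration interface Tejun describes is nothing more than a write to a cgroupfs file; the paths below are illustrative:

  # cgroup v2: writing a PID to cgroup.procs moves the whole process
  echo "$PID" > /sys/fs/cgroup/myapp/cgroup.procs
  # cgroup v1: writing a TID to "tasks" moves a single thread
  echo "$TID" > /sys/fs/cgroup/cpu/myapp/tasks
  # there is no "spawn in this cgroup" call; placement is always a
  # migration after the fact, which is why races with an outside
  # agent are possible at all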
[parent not found: <20160914200041.GB6832-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>]
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <20160914200041.GB6832-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> @ 2016-09-15 20:08 ` Andy Lutomirski [not found] ` <CALCETrUA6_noue4kq9JLqr-V_yo7hB+v1Arhg6i6fFn0tyTrpw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 48+ messages in thread From: Andy Lutomirski @ 2016-09-15 20:08 UTC (permalink / raw) To: Tejun Heo Cc: Ingo Molnar, Mike Galbraith, Andrew Morton, kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP), Linux API, Li Zefan, Paul Turner, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linus Torvalds, Johannes Weiner, Peter Zijlstra On Wed, Sep 14, 2016 at 1:00 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: > Hello, > With regard to no-internal-tasks, I see (at least) three options: 1. Keep the cgroup2 status quo. Lots of distros and such are likely to have their cgroup management fail if run in a container. I really, really dislike this option. 2. Enforce no-internal-tasks for the root cgroup. Un-cgroupable things will still get accounted to the root cgroup even if subtree control is on, but no tasks can be in the root cgroup if the root cgroup has subtree control on. (If some controllers removed the no-internal-tasks restriction, this would apply to the root as well.) I think this may annoy certain users. If so, and if those users are doing something valid, then I think that either those users should be strongly encouraged or even forced to change so namespacing works for them or that we should do (3) instead. 3. Remove the no-internal-tasks restriction entirely. I can see this resulting in a lot of configuration awkwardness, but I think it will *work*, especially since all of the controllers already need to do something vaguely intelligent when subtree control is on in the root and there are tasks in the root. What I'm trying to say is that I think that option (1) is sufficiently bad that cgroup2 should do (2) or (3) instead. If option (2) is preferred and if it would break userspace, then I think we can work around it by entirely deprecating cgroup2, renaming it to cgroup3, and doing option (2) there. You've given reasons you don't like options (2) and (3). I mostly agree with those reasons, but I don't think they're strong enough to overcome the problems with (1). BTW, Mike keeps mentioning exclusive cgroups as problematic with the no-internal-tasks constraints. Do exclusive cgroups still exist in cgroup2? Could we perhaps just remove that capability entirely? I've never understood what problem exclusive cpusets and such solve that can't be more comprehensibly solved by just assigning the cpusets the normal inclusive way. >> > After a migration, the cgroup and its interface knobs are a different >> > directory and files. Semantically, during migration, we aren't moving >> > the directory or files and it'd be bizarre to overlay the semantics >> > you're describing on top of the existing cgroupfs. We will have to >> > break away from the very basic vfs rules such as a fd, once opened, >> > always corresponding to the same file. >> >> What kind of migration do you mean? Having fds follow rename(2) around is >> the normal vfs behavior, so I don't really know what you mean. > > Process or task migration by writing pid to cgroup.procs or tasks > file. cgroup never supported directory / cgroup level migrations. > Ugh. Perhaps cgroup2 should start supporting this.
I think that making rename(2) work is simpler than adding a whole new API for rgroups, and I think it could solve a lot of the same problems that rgroups are trying to solve. --Andy ^ permalink raw reply [flat|nested] 48+ messages in thread
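From userspace, what Andy proposes would look like an ordinary directory move; per Tejun above, cgroup has never supported directory-level migration, so the sketch below shows only the proposed interface, with illustrative paths:

  # proposed: atomically re-parent a cgroup together with its
  # descendants and member processes
  mv /sys/fs/cgroup/a/b /sys/fs/cgroup/c/b
  # an application holding an O_DIRECTORY fd on "b" could keep using
  # openat(fd, "cgroup.procs", ...) across the move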
[parent not found: <CALCETrUA6_noue4kq9JLqr-V_yo7hB+v1Arhg6i6fFn0tyTrpw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <CALCETrUA6_noue4kq9JLqr-V_yo7hB+v1Arhg6i6fFn0tyTrpw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2016-09-16 7:51 ` Peter Zijlstra [not found] ` <20160916075137.GK5012-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 2016-09-19 21:53 ` Tejun Heo 1 sibling, 1 reply; 48+ messages in thread From: Peter Zijlstra @ 2016-09-16 7:51 UTC (permalink / raw) To: Andy Lutomirski Cc: Tejun Heo, Ingo Molnar, Mike Galbraith, Andrew Morton, kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP), Linux API, Li Zefan, Paul Turner, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linus Torvalds, Johannes Weiner On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote: > BTW, Mike keeps mentioning exclusive cgroups as problematic with the > no-internal-tasks constraints. Do exclusive cgroups still exist in > cgroup2? Could we perhaps just remove that capability entirely? I've > never understood what problem exclusive cpusets and such solve that > can't be more comprehensibly solved by just assigning the cpusets the > normal inclusive way. Without exclusive sets we cannot split the sched_domain structure. Which leads to not being able to actually partition things. That would break DL for one. ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <20160916075137.GK5012-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>]
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <20160916075137.GK5012-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2016-09-16 15:12 ` Andy Lutomirski [not found] ` <CALCETrXzrXJmZoFVfAXS1Zf9uNZjibnHizEhwgqdmRvnbJEksw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 48+ messages in thread From: Andy Lutomirski @ 2016-09-16 15:12 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Mike Galbraith, kernel-team-b10kYP2dOMg, Andrew Morton, open list:CONTROL GROUP (CGROUP), Paul Turner, Li Zefan, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Tejun Heo, Johannes Weiner, Linus Torvalds On Sep 16, 2016 12:51 AM, "Peter Zijlstra" <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > > On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote: > > BTW, Mike keeps mentioning exclusive cgroups as problematic with the > > no-internal-tasks constraints. Do exclusive cgroups still exist in > > cgroup2? Could we perhaps just remove that capability entirely? I've > > never understood what problem exclusive cpusets and such solve that > > can't be more comprehensibly solved by just assigning the cpusets the > > normal inclusive way. > > Without exclusive sets we cannot split the sched_domain structure. > Which leads to not being able to actually partition things. That would > break DL for one. Can you sketch out a toy example? And what's DL? ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <CALCETrXzrXJmZoFVfAXS1Zf9uNZjibnHizEhwgqdmRvnbJEksw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <CALCETrXzrXJmZoFVfAXS1Zf9uNZjibnHizEhwgqdmRvnbJEksw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2016-09-16 16:19 ` Peter Zijlstra [not found] ` <20160916161951.GH5016-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 0 siblings, 1 reply; 48+ messages in thread From: Peter Zijlstra @ 2016-09-16 16:19 UTC (permalink / raw) To: Andy Lutomirski Cc: Ingo Molnar, Mike Galbraith, kernel-team-b10kYP2dOMg, Andrew Morton, open list:CONTROL GROUP (CGROUP), Paul Turner, Li Zefan, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Tejun Heo, Johannes Weiner, Linus Torvalds On Fri, Sep 16, 2016 at 08:12:58AM -0700, Andy Lutomirski wrote: > On Sep 16, 2016 12:51 AM, "Peter Zijlstra" <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > > > > On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote: > > > BTW, Mike keeps mentioning exclusive cgroups as problematic with the > > > no-internal-tasks constraints. Do exclusive cgroups still exist in > > > cgroup2? Could we perhaps just remove that capability entirely? I've > > > never understood what problem exclusive cpusets and such solve that > > > can't be more comprehensibly solved by just assigning the cpusets the > > > normal inclusive way. > > > > Without exclusive sets we cannot split the sched_domain structure. > > Which leads to not being able to actually partition things. That would > > break DL for one. > > Can you sketch out a toy example? [ Also see Documentation/cgroup-v1/cpusets.txt section 1.7 ]

  mkdir /cpuset

  mount -t cgroup -o cpuset none /cpuset

  mkdir /cpuset/A
  mkdir /cpuset/B

  cat /sys/devices/system/node/node0/cpulist > /cpuset/A/cpuset.cpus
  echo 0 > /cpuset/A/cpuset.mems

  cat /sys/devices/system/node/node1/cpulist > /cpuset/B/cpuset.cpus
  echo 1 > /cpuset/B/cpuset.mems

  # move all movable tasks into A
  cat /cpuset/tasks | while read task; do echo $task > /cpuset/A/tasks ; done

  # kill machine-wide load-balancing
  echo 0 > /cpuset/cpuset.sched_load_balance

  # now place 'special' tasks in B

This partitions the scheduler into two, one for each node. Hereafter no task will be moved from one node to another. The load-balancer is split in two, one balances in A one balances in B nothing crosses. (It is important that A.cpus and B.cpus do not intersect.) Ideally no task would remain in the root group; back in the day we could actually do this (with the exception of the cpu-bound kernel threads), but this has significantly regressed :-( (still hate the workqueue affinity interface) As is, tasks that are left in the root group get balanced within whatever domain they ended up in. > And what's DL? SCHED_DEADLINE; it's a 'Global'-EDF-like scheduler that doesn't support CPU affinities (because that doesn't make sense). The only way to restrict it is to partition. 'Global' because you can partition it. If you reduce your system to single CPU partitions you'll reduce to P-EDF. (The same is true of SCHED_FIFO, that's a 'Global'-FIFO on the same partition scheme, it however does support sched_affinity, but using it gives 'interesting' schedulability results -- call it a historic accident). Note that related, but differently, we have the isolcpus boot parameter which creates single CPU partitions for all listed CPUs and gives the rest to the root cpuset. Ideally we'd kill this option given it's a boot-time setting (for something which is trivial to do at runtime).
But this cannot be done, because that would mean we'd have to start with a !0 cpuset layout:

                  '/'
             load_balance=0
             /           \
       'system'        'isolated'
    cpus=~isolcpus    cpus=isolcpus
                      load_balance=0

And start with _everything_ in the /system group (including default IRQ affinities). Of course, that will break everything cgroup :-( ^ permalink raw reply [flat|nested] 48+ messages in thread
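For reference, the knob Peter would like to retire is a kernel command-line parameter; e.g., assuming CPUs 2 and 3 are the ones to isolate:

  # at boot: exclude CPUs 2-3 from the general scheduler domains
  isolcpus=2,3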
[parent not found: <20160916161951.GH5016-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>]
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <20160916161951.GH5016-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2016-09-16 16:29 ` Andy Lutomirski [not found] ` <CALCETrXoTfhaDxZJ9_XcFknnniDvrYLY9SATVXj+tK1UdaWw4A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 48+ messages in thread From: Andy Lutomirski @ 2016-09-16 16:29 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Mike Galbraith, kernel-team-b10kYP2dOMg, Andrew Morton, open list:CONTROL GROUP (CGROUP), Paul Turner, Li Zefan, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Tejun Heo, Johannes Weiner, Linus Torvalds On Fri, Sep 16, 2016 at 9:19 AM, Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > On Fri, Sep 16, 2016 at 08:12:58AM -0700, Andy Lutomirski wrote: >> On Sep 16, 2016 12:51 AM, "Peter Zijlstra" <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: >> > >> > On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote: >> > > BTW, Mike keeps mentioning exclusive cgroups as problematic with the >> > > no-internal-tasks constraints. Do exclusive cgroups still exist in >> > > cgroup2? Could we perhaps just remove that capability entirely? I've >> > > never understood what problem exclusive cpusets and such solve that >> > > can't be more comprehensibly solved by just assigning the cpusets the >> > > normal inclusive way. >> > >> > Without exclusive sets we cannot split the sched_domain structure. >> > Which leads to not being able to actually partition things. That would >> > break DL for one. >> >> Can you sketch out a toy example? > > [ Also see Documentation/cgroup-v1/cpusets.txt section 1.7 ]
>
>   mkdir /cpuset
>
>   mount -t cgroup -o cpuset none /cpuset
>
>   mkdir /cpuset/A
>   mkdir /cpuset/B
>
>   cat /sys/devices/system/node/node0/cpulist > /cpuset/A/cpuset.cpus
>   echo 0 > /cpuset/A/cpuset.mems
>
>   cat /sys/devices/system/node/node1/cpulist > /cpuset/B/cpuset.cpus
>   echo 1 > /cpuset/B/cpuset.mems
>
>   # move all movable tasks into A
>   cat /cpuset/tasks | while read task; do echo $task > /cpuset/A/tasks ; done
>
>   # kill machine-wide load-balancing
>   echo 0 > /cpuset/cpuset.sched_load_balance
>
>   # now place 'special' tasks in B
>
> This partitions the scheduler into two, one for each node. > > Hereafter no task will be moved from one node to another. The > load-balancer is split in two, one balances in A one balances in B > nothing crosses. (It is important that A.cpus and B.cpus do not > intersect.) > > Ideally no task would remain in the root group; back in the day we could > actually do this (with the exception of the cpu-bound kernel threads), but > this has significantly regressed :-( > (still hate the workqueue affinity interface) I wonder if we could address this by creating (automatically at boot or when the cpuset controller is enabled or whatever) a /cpuset/random_kernel_shit cgroup and have all of the unmovable tasks land there? > > As is, tasks that are left in the root group get balanced within > whatever domain they ended up in. > >> And what's DL? > > SCHED_DEADLINE; it's a 'Global'-EDF-like scheduler that doesn't support > CPU affinities (because that doesn't make sense). The only way to > restrict it is to partition. > > 'Global' because you can partition it. If you reduce your system to > single CPU partitions you'll reduce to P-EDF.
> > (The same is true of SCHED_FIFO, that's a 'Global'-FIFO on the same > partition scheme, it however does support sched_affinity, but using it > gives 'interesting' schedulability results -- call it a historic > accident). Hmm, I didn't realize that the deadline scheduler was global. But ISTM requiring the use of "exclusive" to get this working is unfortunate. What if a user wants two separate partitions, one using CPUs 1 and 2 and the other using CPUs 3 and 4 (with 5 reserved for non-RT stuff)? Shouldn't we be able to have a cgroup for each of the DL partitions and do something to tell the deadline scheduler "here is your domain"? > > > Note that related, but differently, we have the isolcpus boot parameter > which creates single CPU partitions for all listed CPUs and gives the > rest to the root cpuset. Ideally we'd kill this option given it's a boot-time > setting (for something which is trivial to do at runtime). > > But this cannot be done, because that would mean we'd have to start with > a !0 cpuset layout:
>
>                 '/'
>            load_balance=0
>            /           \
>      'system'        'isolated'
>   cpus=~isolcpus    cpus=isolcpus
>                     load_balance=0
>
> And start with _everything_ in the /system group (including default IRQ > affinities). > > Of course, that will break everything cgroup :-( > I would actually *much* prefer this over the status quo. I'm tired of my crappy, partially-working script that sits there and creates exactly this configuration (minus the isolcpus part because I actually want migration to work) on boot. (Actually, it could have two automatic cgroups: /kernel and /init -- init and UMH would go in init and kernel threads and such would go in /kernel. Userspace would be able to request that a different cgroup be used for newly-created kernel threads.) Heck, even systemd would probably prefer this. Then it could cleanly expose a "slice" or whatever it's called for random kernel shit and at least you could configure it meaningfully. ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <CALCETrXoTfhaDxZJ9_XcFknnniDvrYLY9SATVXj+tK1UdaWw4A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <CALCETrXoTfhaDxZJ9_XcFknnniDvrYLY9SATVXj+tK1UdaWw4A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2016-09-16 16:50 ` Peter Zijlstra [not found] ` <20160916165045.GJ5016-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> 0 siblings, 1 reply; 48+ messages in thread From: Peter Zijlstra @ 2016-09-16 16:50 UTC (permalink / raw) To: Andy Lutomirski Cc: Ingo Molnar, Mike Galbraith, kernel-team-b10kYP2dOMg, Andrew Morton, open list:CONTROL GROUP (CGROUP), Paul Turner, Li Zefan, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Tejun Heo, Johannes Weiner, Linus Torvalds On Fri, Sep 16, 2016 at 09:29:06AM -0700, Andy Lutomirski wrote: > > SCHED_DEADLINE; it's a 'Global'-EDF-like scheduler that doesn't support > > CPU affinities (because that doesn't make sense). The only way to > > restrict it is to partition. > > > > 'Global' because you can partition it. If you reduce your system to > > single CPU partitions you'll reduce to P-EDF. > > > > (The same is true of SCHED_FIFO, that's a 'Global'-FIFO on the same > > partition scheme, it however does support sched_affinity, but using it > > gives 'interesting' schedulability results -- call it a historic > > accident). > > Hmm, I didn't realize that the deadline scheduler was global. But > ISTM requiring the use of "exclusive" to get this working is > unfortunate. What if a user wants two separate partitions, one using > CPUs 1 and 2 and the other using CPUs 3 and 4 (with 5 reserved for > non-RT stuff)? {1,2} {3,4} {5} seem exclusive, did I miss something? (other than that 5 cpu parts are 'rare'). > Shouldn't we be able to have a cgroup for each of the > DL partitions and do something to tell the deadline scheduler "here is > your domain"? Somewhat confused: by doing the non-overlapping domains, you do exactly that, no? You end up with 2 (or more) independent deadline schedulers, but if you're not running deadline tasks (like in the /system partition) you don't care it's there. > > Note that related, but differently, we have the isolcpus boot parameter > > which creates single CPU partitions for all listed CPUs and gives the > > rest to the root cpuset. Ideally we'd kill this option given it's a boot-time > > setting (for something which is trivial to do at runtime). > > > > But this cannot be done, because that would mean we'd have to start with > > a !0 cpuset layout:
> >
> >                 '/'
> >            load_balance=0
> >            /           \
> >      'system'        'isolated'
> >   cpus=~isolcpus    cpus=isolcpus
> >                     load_balance=0
> >
> > And start with _everything_ in the /system group (including default IRQ > > affinities). > > > > Of course, that will break everything cgroup :-( > > > I would actually *much* prefer this over the status quo. I'm tired of > my crappy, partially-working script that sits there and creates > exactly this configuration (minus the isolcpus part because I actually > want migration to work) on boot. (Actually, it could have two > automatic cgroups: /kernel and /init -- init and UMH would go in init > and kernel threads and such would go in /kernel. Userspace would be > able to request that a different cgroup be used for newly-created > kernel threads.) So there's a problem with sticking kernel threads (and esp. kthreadd) into !root groups. For example if you place it in a cpuset that doesn't
The unbound workqueue stuff is totally arbitrary borkage though, that can be made to work just fine, TJ didn't like it for some reason which I really cannot remember. Also, UMH? > Heck, even systemd would probably prefer this. Then it could cleanly > expose a "slice" or whatever it's called for random kernel shit and at > least you could configure it meaningfully. No clue about systemd, I'm still on systems without that virus. ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <20160916165045.GJ5016-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>]
* Re: [Documentation] State of CPU controller in cgroup v2 [not found] ` <20160916165045.GJ5016-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org> @ 2016-09-16 18:19 ` Andy Lutomirski [not found] ` <CALCETrVMw4Nd-QZER9qzOzRte5s48WrUaM8ZZzkY_g3B6s+5Ow-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 48+ messages in thread From: Andy Lutomirski @ 2016-09-16 18:19 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Mike Galbraith, kernel-team-b10kYP2dOMg, Andrew Morton, open list:CONTROL GROUP (CGROUP), Paul Turner, Li Zefan, Linux API, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Tejun Heo, Johannes Weiner, Linus Torvalds On Fri, Sep 16, 2016 at 9:50 AM, Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > On Fri, Sep 16, 2016 at 09:29:06AM -0700, Andy Lutomirski wrote: > >> > SCHED_DEADLINE; it's a 'Global'-EDF-like scheduler that doesn't support >> > CPU affinities (because that doesn't make sense). The only way to >> > restrict it is to partition. >> > >> > 'Global' because you can partition it. If you reduce your system to >> > single CPU partitions you'll reduce to P-EDF. >> > >> > (The same is true of SCHED_FIFO, that's a 'Global'-FIFO on the same >> > partition scheme, it however does support sched_affinity, but using it >> > gives 'interesting' schedulability results -- call it a historic >> > accident). >> >> Hmm, I didn't realize that the deadline scheduler was global. But >> ISTM requiring the use of "exclusive" to get this working is >> unfortunate. What if a user wants two separate partitions, one using >> CPUs 1 and 2 and the other using CPUs 3 and 4 (with 5 reserved for >> non-RT stuff)? > > {1,2} {3,4} {5} seem exclusive, did I miss something? (other than that 5 > cpu parts are 'rare'). There's no overlap, so they're logically exclusive, but it avoids needing the "cpu_exclusive" parameter. It always seemed confusing to me that a setting on a child cgroup would strictly remove a resource from the parent. (To be clear: I don't have any particularly strong objection to cpu_exclusive. It just always seemed like a bit of a hack that mostly duplicated what you could get by just setting the cpusets appropriately throughout the hierarchy.) >> > Note that related, but differently, we have the isolcpus boot parameter >> > which creates single CPU partitions for all listed CPUs and gives the >> > rest to the root cpuset. Ideally we'd kill this option given it's a boot-time >> > setting (for something which is trivial to do at runtime). >> > >> > But this cannot be done, because that would mean we'd have to start with >> > a !0 cpuset layout:
>> >
>> >                 '/'
>> >            load_balance=0
>> >            /           \
>> >      'system'        'isolated'
>> >   cpus=~isolcpus    cpus=isolcpus
>> >                     load_balance=0
>> >
>> > And start with _everything_ in the /system group (including default IRQ >> > affinities). >> > >> > Of course, that will break everything cgroup :-( >> > >> >> I would actually *much* prefer this over the status quo. I'm tired of >> my crappy, partially-working script that sits there and creates >> exactly this configuration (minus the isolcpus part because I actually >> want migration to work) on boot. (Actually, it could have two >> automatic cgroups: /kernel and /init -- init and UMH would go in init >> and kernel threads and such would go in /kernel. Userspace would be >> able to request that a different cgroup be used for newly-created >> kernel threads.) > > So there's a problem with sticking kernel threads (and esp. kthreadd) > into !root groups.
For example if you place it in a cpuset that doesn't > have all cpus, then binding your shiny new kthread to a cpu will fail. > > You can fix that of course, and we used to do exactly that, but we kept > running into 'fun' cases like that. Blech. But may this *should* have that effect. I'm sick of random kernel crap being scheduled on my RT CPUs and on the CPUs that I intend to be kept forcibly idle. > > The unbound workqueue stuff is totally arbitrary borkage though, that > can be made to work just fine, TJ didn't like it for some reason which I > really cannot remember. > > Also, UMH? User mode helper. Fortunately most users are gone now, but it still exists. ^ permalink raw reply [flat|nested] 48+ messages in thread
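(For concreteness: the {1,2} {3,4} {5} layout discussed above can be
built with the plain v1 cpuset interface, without cpu_exclusive, since
disjoint cpusets with load balancing disabled at the root are what
actually split the scheduling domains. A minimal sketch, assuming the
cpuset controller is mounted at /sys/fs/cgroup/cpuset, a single memory
node 0, and placeholder group names rt0/rt1/slack:

    cd /sys/fs/cgroup/cpuset
    # Stop balancing across the whole machine so that the
    # children become separate scheduling domains.
    echo 0 > cpuset.sched_load_balance

    mkdir rt0 rt1 slack
    echo 1-2 > rt0/cpuset.cpus
    echo 3-4 > rt1/cpuset.cpus
    echo 5   > slack/cpuset.cpus     # leftover partition for non-RT work

    # v1 cpusets also require a memory node assignment before
    # tasks can be moved in.
    for g in rt0 rt1 slack; do echo 0 > $g/cpuset.mems; done

This reflects the point being argued: the non-overlapping cpuset.cpus
values do the partitioning, while cpu_exclusive only guards against
someone later creating an overlap.)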
* Re: [Documentation] State of CPU controller in cgroup v2

From: Peter Zijlstra @ 2016-09-17 1:47 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Ingo Molnar, Mike Galbraith, kernel-team-b10kYP2dOMg, Andrew Morton,
    open list:CONTROL GROUP (CGROUP), Paul Turner, Li Zefan, Linux API,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Tejun Heo,
    Johannes Weiner, Linus Torvalds

On Fri, Sep 16, 2016 at 11:19:38AM -0700, Andy Lutomirski wrote:
> On Fri, Sep 16, 2016 at 9:50 AM, Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:
> > {1,2} {3,4} {5} seem exclusive, did I miss something? (Other than
> > that 5-CPU parts are 'rare'.)
>
> There's no overlap, so they're logically exclusive, but it avoids
> needing the "cpu_exclusive" parameter.

I'd need to double check, but I don't think you _need_ that. That's
more for enforcing that nobody else steals your CPUs and 'accidentally'
creates overlaps. But if you configure it right, non-overlap should be
enough.

That is, generate_sched_domains() only uses cpusets_overlap(), which is
cpumask_intersects(). Then again, it is almost 4am, so who knows.

> > So there's a problem with sticking kernel threads (and esp.
> > kthreadd) into !root groups. For example, if you place it in a
> > cpuset that doesn't have all cpus, then binding your shiny new
> > kthread to a cpu will fail.
> >
> > You can fix that of course, and we used to do exactly that, but we
> > kept running into 'fun' cases like that.
>
> Blech. But maybe this *should* have that effect. I'm sick of random
> kernel crap being scheduled on my RT CPUs and on the CPUs that I
> intend to be kept forcibly idle.

Hehe, so ideally those threads don't do anything unless the tasks
running on those CPUs explicitly ask for it. If you find any of the
CPU-bound kernel tasks doing work that is unrelated to the tasks
running on that CPU, we should certainly look into it.

Personally I'm not much bothered by idle threads sitting about.
* Re: [Documentation] State of CPU controller in cgroup v2

From: Tejun Heo @ 2016-09-19 21:53 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Ingo Molnar, Mike Galbraith, Andrew Morton, kernel-team-b10kYP2dOMg,
    open list:CONTROL GROUP (CGROUP), Linux API, Li Zefan, Paul Turner,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linus Torvalds,
    Johannes Weiner, Peter Zijlstra

Hello,

On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote:
> With regard to no-internal-tasks, I see (at least) three options:
>
> 1. Keep the cgroup2 status quo. Lots of distros and such are likely
> to have their cgroup management fail if run in a container. I really,

I don't know where you're getting this. The no-internal-tasks rule has
*NOTHING* to do with how cgroup v1 hierarchies can or cannot be used
inside a namespace. I suppose this is coming from the same
misunderstanding that Austin has. Please see my reply there for more
details.

> really dislike this option.

Up until this point, you haven't supplied any valid technical reasons
for your objection. Repeating "really" doesn't add to the discussion at
all. If you're objecting on aesthetic grounds, please just say so.

> 2. Enforce no-internal-tasks for the root cgroup. Un-cgroupable
> things will still get accounted to the root cgroup even if subtree
> control is on, but no tasks can be in the root cgroup if the root
> cgroup has subtree control on. (If some controllers removed the
> no-internal-tasks restriction, this would apply to the root as well.)
> I think this may annoy certain users. If so, and if those users are
> doing something valid, then I think that either those users should be
> strongly encouraged or even forced to change so that namespacing works
> for them, or that we should do (3) instead.

Theoretically, we can do that, but what are the upsides, and are they
enough to justify the added inconveniences? Up until now, the only
argument you have provided is that people may do certain things in the
system-root which might not work in a namespace-root, but that isn't a
critical problem. No real functionality is lost by implementing the
same behaviors both inside and outside namespaces.

> 3. Remove the no-internal-tasks restriction entirely. I can see this
> resulting in a lot of configuration awkwardness, but I think it will
> *work*, especially since all of the controllers already need to do
> something vaguely intelligent when subtree control is on in the root
> and there are tasks in the root.

The reasons for the no-internal-tasks restriction have been explained
multiple times in the documentation and throughout this thread, and we
also discussed how and why the system-root is special and why allowing
the system-root's special treatment doesn't break things.

> What I'm trying to say is that I think that option (1) is sufficiently
> bad that cgroup2 should do (2) or (3) instead. If option (2) is
> preferred and if it would break userspace, then I think we can work
> around it by entirely deprecating cgroup2, renaming it to cgroup3, and
> doing option (2) there. You've given reasons you don't like options
> (2) and (3). I mostly agree with those reasons, but I don't think
> they're strong enough to overcome the problems with (1).

And you keep suggesting very drastic measures for an issue which isn't
critical, without providing any substantial technical reasons why such
drastic measures would be necessary. This part of the discussion
started with your misunderstanding of the implications of the
system-root being special, and the only reason you presented in the
previous message is still a, different, misunderstanding. The only
thing which isn't changing here is your opinion of how it should be. It
is a baffling situation, because your opinions don't seem to be
affected at all by the validity of the reasons for holding them.

> BTW, Mike keeps mentioning exclusive cgroups as problematic with the
> no-internal-tasks constraints. Do exclusive cgroups still exist in
> cgroup2? Could we perhaps just remove that capability entirely? I've
> never understood what problem exclusive cpusets and such solve that
> can't be more comprehensibly solved by just assigning the cpusets the
> normal inclusive way.

This was explained before during the discussion. Maybe it wasn't clear
enough. The knob is a config protector which keeps one's configuration
from being changed from elsewhere. It doesn't really belong in the
kernel. My guess is that it was added because the delegation model
wasn't properly established and people tried to delegate resource
control knobs along with the cgroups and then wanted to prevent those
knobs from being changed in certain ways.

> >> What kind of migration do you mean? Having fds follow rename(2)
> >> around is the normal vfs behavior, so I don't really know what you
> >> mean.
> >
> > Process or task migration by writing pid to cgroup.procs or tasks
> > file. cgroup never supported directory / cgroup level migrations.
>
> Ugh. Perhaps cgroup2 should start supporting this. I think that
> making rename(2) work is simpler than adding a whole new API for
> rgroups, and I think it could solve a lot of the same problems that
> rgroups are trying to solve.

We haven't needed that yet, and supporting rename(2) doesn't
necessarily make the API safe in terms of migration atomicity. Also, as
pointed out in my previous reply (and the rgroup documentation),
atomicity is only one part of the rationale for rgroup.

Thanks.

--
tejun
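(To make the mechanics referred to above concrete: in cgroup v2,
migration is always done by writing a PID into the destination's
cgroup.procs -- there is no directory-level move, and rename(2) of a
populated cgroup does not migrate its processes. The no-internal-process
rule surfaces as a write error when enabling controllers for a subtree
that still has member processes. A rough sketch, assuming a v2
hierarchy at /sys/fs/cgroup and a shell that is currently the only
process in a delegated cgroup with the placeholder name jobs:

    cd /sys/fs/cgroup/jobs
    mkdir batch

    # Fails (typically -EBUSY): this cgroup still has a member
    # process, so controllers can't be enabled for its subtree.
    echo "+memory" > cgroup.subtree_control

    # Migrate the shell into the leaf first; writing a PID to
    # cgroup.procs moves the whole process, never a single thread.
    echo $$ > batch/cgroup.procs

    # Now 'jobs' has no internal processes and the write succeeds.
    echo "+memory" > cgroup.subtree_control

The group names are placeholders, and the exact error value and
delegation setup depend on the kernel version and configuration.)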
* Re: [Documentation] State of CPU controller in cgroup v2

From: Andy Lutomirski @ 2016-08-31 19:57 UTC (permalink / raw)
To: Tejun Heo
Cc: Ingo Molnar, Mike Galbraith,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP),
    Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
    Johannes Weiner, Linus Torvalds

I'm replying separately to keep the two issues in separate emails.

On Mon, Aug 29, 2016 at 3:20 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello, Andy.
>
> Sorry about the delay. Was kinda overwhelmed with other things.
>
> On Sat, Aug 20, 2016 at 11:45:55AM -0700, Andy Lutomirski wrote:
>> > This becomes clear whenever an entity is allocating memory on
>> > behalf of someone else -- get_user_pages(), khugepaged, swapoff and
>> > so on (and likely userfaultfd too). When a task is trying to add a
>> > page to a VMA, the task might not have any relationship with the
>> > VMA other than that it's operating on it for someone else. The page
>> > has to be charged to whoever is responsible for the VMA, and the
>> > only ownership which can be established is the containing
>> > mm_struct.
>>
>> This surprises me a bit. If I do access_process_vm(), then I would
>> have expected the charge to go to the caller, not the mm being
>> accessed.
>
> It does and should go to the target mm. Who faults in a page shouldn't
> be the final determinant in the ownership; otherwise, we end up in
> situations where the ownership changes due to, for example,
> fluctuations in page fault pattern. It doesn't make semantical sense
> either. If a kthread is doing PIO for a process, why would it get
> charged for the memory it's faulting in?

OK, that makes sense. Although, given that cgroup1 allows tasks in the
same process to be split up, how does this work in cgroup1? Do you just
pick the mm associated with the thread group leader? If so, why can't
cgroup2 do the same thing?

But even this is at best a vague approximation. If you have MAP_SHARED
mappings (libc.so, for example), then the cgroup you charge it to is
more or less arbitrary.

>> What happens if a program calls read(2), though? A page may be
>> inserted into page cache on behalf of an address_space without any
>> particular mm being involved. There will usually be a calling task,
>> though.
>
> Most faults are synchronous and the faulting thread is a member of the
> mm to be charged, so this usually isn't an issue. I don't think there
> are places where we populate an address_space without knowing who it
> is for (as opposed / in addition to who the operator is).

True, but there's no *mm* involved in any fundamental sense. You can
look at the task and find the task's mm (or actually the task's thread
group leader, since cgroup2 doesn't literally map mms to cgroups), but
that seems to me to be a pretty poor reason to argue that tasks should
have to be kept together.

>> But this is all very memcg-specific. What about other cgroups? I/O
>> is per-task, right? Scheduling is definitely per-task.
>
> They aren't separate. Think about IOs to write out page cache, CPU
> cycles spent reclaiming memory or encrypting writeback IOs. It's fine
> to get more granular with specific resources, but the semantics gets
> messy for cross-resource accounting and control without proper
> scoping.

Page cache doesn't belong to a specific mm. Memory reclaim only has an
mm associated if the memory being reclaimed belongs cleanly to an mm.
Encrypting writeback (I assume you mean the CPU usage) is just like
page cache writeback IO -- there's no specific mm involved in general.

>> > Consider the scenario where you have somebody faulting on behalf of
>> > a foreign VMA, but the thread who created and is actively using
>> > that VMA is in a different cgroup than the process leader. Who are
>> > we going to charge? All possible answers seem erratic.
>>
>> Indeed, and this problem is probably not solvable in practice unless
>> you charge all involved cgroups. But the caller's *mm* is entirely
>> irrelevant here, so I don't see how this implies that cgroups need to
>> keep tasks in the same process together. The relevant entities are
>> the calling *task* and the target mm, and you're going to be
>> hard-pressed to ensure that they belong to the same cgroup, so I
>> think you need to be able to handle weird cases in which there isn't
>> an obviously correct cgroup to charge.
>
> It is an erratic case which is caused by the userland interface
> allowing non-sensical configuration. We can accept it as a necessary
> trade-off given big enough benefits or unavoidable constraints, but it
> isn't something to do willy-nilly.
>
>> > For system-level and process-level operations to not step on each
>> > other's toes, they need to agree on the granularity boundary --
>> > system-level should be able to treat an application hierarchy as a
>> > single unit. A possible solution is allowing rgroup hierarchies to
>> > span across process boundaries and implementing cgroup migration
>> > operations which treat such hierarchies as a single unit. I'm not
>> > yet sure whether the boundary should be at program groups or
>> > rgroups.
>>
>> I think that, if the system cgroup manager is moving processes around
>> after starting them and execing the final binary, there will be races
>> and confusion, and no amount of granularity fiddling will fix that.
>
> I don't see how that statement is true. For example, if you confine
> the hierarchy to in-process, there is proper isolation, and whether
> the system agent migrates the process or not doesn't make any
> difference to the internal hierarchy.

But the hierarchy isn't always per-process. Some real-world services
have threads and subprocesses.

>> I know nothing about rgroups. Are they upstream?
>
> It was linked from the original message.
>
> [7] http://lkml.kernel.org/r/20160105154503.GC5995-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org
>     [RFD] cgroup: thread granularity support for cpu controller
>     Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

I can see two issues here:

1. You're allowing groups and tasks to be siblings. If you're okay
allowing that for rgroups, why not allow it for cgroup2 on the same set
of controllers?

2. It looks impossible to fork and keep a child in the same group as
one of your non-leader threads.

I think I'm starting to agree with PeterZ here. Why not just make
cgroup2 more flexible?

--Andy
* Re: [Documentation] State of CPU controller in cgroup v2

From: Mike Galbraith @ 2016-08-22 10:12 UTC (permalink / raw)
To: Tejun Heo, Andy Lutomirski
Cc: Ingo Molnar, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP),
    Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
    Johannes Weiner, Linus Torvalds

On Sat, 2016-08-20 at 11:56 -0400, Tejun Heo wrote:
> > > There are other reasons to enforce process granularity. One
> > > important one is isolating system-level management operations from
> > > in-process application operations. The cgroup interface, being a
> > > virtual filesystem, is very unfit for multiple independent
> > > operations taking place at the same time, as most operations have
> > > to be multi-step and there is no way to synchronize multiple
> > > accessors. See also [5] Documentation/cgroup-v2.txt, "R-2. Thread
> > > Granularity"
> >
> > I don't buy this argument at all. System-level code is likely to
> > assign single process *trees*, which are a different beast entirely.
> > I.e. you fork, move the child into a cgroup, and that child and its
> > children stay in that cgroup. I don't see how the thread/process
> > distinction matters.
>
> Good point on the multi-process issue. This is something which nagged
> me a bit while working on rgroup, although I have to point out that
> the issue here is one of not going far enough rather than the approach
> being wrong. There are limitations to scoping it to individual
> processes, but that doesn't negate the underlying problem or the
> usefulness of in-process control.
>
> For system-level and process-level operations to not step on each
> other's toes, they need to agree on the granularity boundary --
> system-level should be able to treat an application hierarchy as a
> single unit. A possible solution is allowing rgroup hierarchies to
> span across process boundaries and implementing cgroup migration
> operations which treat such hierarchies as a single unit. I'm not yet
> sure whether the boundary should be at program groups or rgroups.

Why is it not viable to predicate contentious
lowest-common-denominator restrictions upon the set of enabled
controllers? If only thread-granularity controllers are enabled, from
that point onward the v2 restrictions cease to make any sense and thus
could be lifted, leaving nobody cast adrift in a leaky v1 lifeboat when
v2 sets sail. Or?

	-Mike
* Re: [Documentation] State of CPU controller in cgroup v2

From: James Bottomley @ 2016-08-21 5:34 UTC (permalink / raw)
To: Andy Lutomirski, Tejun Heo
Cc: Ingo Molnar, Mike Galbraith,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP),
    Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
    Johannes Weiner, Linus Torvalds

On Wed, 2016-08-17 at 13:18 -0700, Andy Lutomirski wrote:
> On Aug 5, 2016 7:07 PM, "Tejun Heo" <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
[...]
> > 2-1-1. Process Granularity
> >
> > For memory, because an address space is shared between all threads
> > of a process, the terminal consumer is a process, not a thread.
> > Separating the threads of a single process into different memory
> > control domains doesn't make semantical sense. cgroup v2 ensures
> > that all controllers can agree on the same organization by
> > requiring that threads of the same process belong to the same
> > cgroup.
>
> I haven't followed all of the history here, but it seems to me that
> this argument is less accurate than it appears. Linux, for better or
> for worse, has somewhat orthogonal concepts of thread groups
> (processes), mms, and file tables. An mm has VMAs in it, and VMAs can
> reference things (files, etc.) that hold resources. (Two mms can
> share resources by mapping the same thing or using fork().) File
> tables hold files, and files can use resources. Both of these are, at
> best, moderately good approximations of what actually holds
> resources. Meanwhile, threads (tasks) do syscalls, take page faults,
> *allocate* resources, etc.
>
> So I think it's not really true to say that the "terminal consumer"
> of anything is a process, not a thread.
>
> While it's certainly easier to think about assigning processes to
> cgroups, and I certainly agree that, in the common case, it's the
> right thing to do, I don't see why requiring it is a good idea. Can
> we turn this around: what actually goes wrong if cgroup v2 were to
> allow assigning individual threads if a user specifically requests
> it?

A similar point from a different consumer: from the unprivileged
containers point of view, I'm interested in a thread-based interface as
well. The principal utility of unprivileged containers is to allow
applications that wish to use container properties (effectively to
become self-containerising). Some that use the producer/consumer model
use process pools (apache springs to mind instantly), but some use
thread pools. It is useful to the latter to preserve the concept of a
thread as the entity inhabiting the cgroup (but only where the
granularity of the cgroup permits threads to participate), so we can
easily modify them to be self-containerising without forcing them to
switch back from a thread pool model to a process pool model.

I can see that process-based is conceptually easier in v2, because you
begin with a process tree, but it would really be a pity to lose the
thread-based controls we have now and permanently lose the ability to
create more as we find uses for them. I can't really see how improving
the "common resource domain" is a good tradeoff for this.

James
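(For reference, the thread-granularity mechanism being defended here is
the v1 'tasks' file, which places a single thread, as opposed to
cgroup.procs, which places a whole process. A sketch of what a
self-containerising thread pool can do today, assuming a v1 cpu
hierarchy mounted at /sys/fs/cgroup/cpu with a delegated subtree named
pool, and with $TID standing in for a worker thread's kernel thread ID
as returned by gettid():

    cd /sys/fs/cgroup/cpu/pool
    mkdir fast slow
    echo 2048 > fast/cpu.shares
    echo  128 > slow/cpu.shares

    # Thread granularity: moves only the one thread.
    echo $TID > slow/tasks

    # Process granularity: would drag every thread of the
    # process along.
    echo $PID > slow/cgroup.procs

Under the cgroup v2 process-granularity rule only the second form
exists, which is the capability loss being debated. The group names and
share values are placeholders.)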
* Re: [Documentation] State of CPU controller in cgroup v2

From: Tejun Heo @ 2016-08-29 22:35 UTC (permalink / raw)
To: James Bottomley
Cc: Andy Lutomirski, Ingo Molnar, Mike Galbraith,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP),
    Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
    Johannes Weiner, Linus Torvalds

Hello, James.

On Sat, Aug 20, 2016 at 10:34:14PM -0700, James Bottomley wrote:
> I can see that process-based is conceptually easier in v2, because
> you begin with a process tree, but it would really be a pity to lose
> the thread-based controls we have now and permanently lose the
> ability to create more as we find uses for them. I can't really see
> how improving the "common resource domain" is a good tradeoff for
> this.

Thread-based control for a namespace is not a different problem from
thread-based control for individual applications, right? And the
problems with using cgroupfs directly for in-process control still
apply the same, whether it's system-wide or inside a namespace.

One argument could be that inside a namespace, as the cgroupfs is
already scoped, cgroup path headaches are less of an issue, which is
true; however, that isn't applicable to applications which aren't
scoped in their own namespaces, and we can't scope every binary on the
system. More importantly, a given application can't rely on being
scoped in a certain way. You can craft a custom config for a specific
setup, but that's a horrible way to solve the problem of in-application
hierarchical resource distribution, and that's what rgroup was all
about.

Thanks.

--
tejun