* [PATCH-cgroup v2] cgroup: Show # of subsystem CSSes in root cgroup.stat
From: Waiman Long @ 2024-07-09 13:28 UTC
To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet
Cc: cgroups, linux-doc, linux-kernel, Waiman Long
The /proc/cgroups file shows the number of cgroups for each of the
subsystems. With cgroup v1, the number of CSSes is the same as the
number of cgroups. That is not the case anymore with cgroup v2. The
/proc/cgroups file cannot show the actual number of CSSes for the
subsystems that are bound to cgroup v2.
So if a v2 cgroup subsystem is leaking cgroups (usually memory cgroup),
we can't tell by looking at /proc/cgroups which cgroup subsystems may be
responsible. This patch adds a CSS count to the cgroup_subsys structure
to keep track of the number of CSSes for each of the cgroup subsystems.
As cgroup v2 has deprecated the use of /proc/cgroups, the root
cgroup.stat file is extended to show the number of outstanding CSSes
associated with all the non-inhibited cgroup subsystems that have been
bound to cgroup v2. This will help us pinpoint which subsystems may be
responsible for the increasing number of dying (nr_dying_descendants)
cgroups.
The cgroup-v2.rst file is updated to discuss this new behavior.
With this patch applied, a sample output from the root cgroup.stat file
is shown below.
nr_descendants 53
nr_dying_descendants 34
nr_cpuset 1
nr_cpu 40
nr_io 40
nr_memory 87
nr_perf_event 54
nr_hugetlb 1
nr_pids 53
nr_rdma 1
nr_misc 1
In this particular case, it can be seen that memory cgroup is the most
likely culprit for causing the 34 dying cgroups.
Signed-off-by: Waiman Long <longman@redhat.com>
---
Documentation/admin-guide/cgroup-v2.rst | 10 ++++++++--
include/linux/cgroup-defs.h | 3 +++
kernel/cgroup/cgroup.c | 19 +++++++++++++++++++
3 files changed, 30 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 52763d6b2919..65af2f30196f 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -981,6 +981,12 @@ All cgroup core files are prefixed with "cgroup."
A dying cgroup can consume system resources not exceeding
limits, which were active at the moment of cgroup deletion.
+ nr_<cgroup_subsys>
+ Total number of cgroups associated with that cgroup
+ subsystem, e.g. cpuset or memory. These cgroup counts
+ will only be shown in the root cgroup and for subsystems
+ bound to cgroup v2.
+
cgroup.freeze
A read-write single value file which exists on non-root cgroups.
Allowed values are "0" and "1". The default is "0".
@@ -2930,8 +2936,8 @@ Deprecated v1 Core Features
- "cgroup.clone_children" is removed.
-- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file
- at the root instead.
+- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" or
+ "cgroup.stat" files at the root instead.
Issues with v1 and Rationales for v2
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index b36690ca0d3f..522ab77f0406 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -776,6 +776,9 @@ struct cgroup_subsys {
* specifies the mask of subsystems that this one depends on.
*/
unsigned int depends_on;
+
+ /* Number of CSSes, used only for /proc/cgroups */
+ atomic_t nr_csses;
};
extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c8e4b62b436a..48eba2737b1a 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3669,12 +3669,27 @@ static int cgroup_events_show(struct seq_file *seq, void *v)
static int cgroup_stat_show(struct seq_file *seq, void *v)
{
struct cgroup *cgroup = seq_css(seq)->cgroup;
+ struct cgroup_subsys *ss;
+ int i;
seq_printf(seq, "nr_descendants %d\n",
cgroup->nr_descendants);
seq_printf(seq, "nr_dying_descendants %d\n",
cgroup->nr_dying_descendants);
+ if (cgroup_parent(cgroup))
+ return 0;
+
+ /*
+ * For the root cgroup, shows the number of csses associated
+ * with each of non-inhibited cgroup subsystems bound to it.
+ */
+ do_each_subsys_mask(ss, i, ~cgrp_dfl_inhibit_ss_mask) {
+ if (ss->root != &cgrp_dfl_root)
+ continue;
+ seq_printf(seq, "nr_%s %d\n", ss->name,
+ atomic_read(&ss->nr_csses));
+ } while_each_subsys_mask();
return 0;
}
@@ -5375,6 +5390,7 @@ static void css_free_rwork_fn(struct work_struct *work)
int id = css->id;
ss->css_free(css);
+ atomic_dec(&ss->nr_csses);
cgroup_idr_remove(&ss->css_idr, id);
cgroup_put(cgrp);
@@ -5567,6 +5583,7 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
if (IS_ERR(css))
return css;
+ atomic_inc(&ss->nr_csses);
init_and_link_css(css, ss, cgrp);
err = percpu_ref_init(&css->refcnt, css_release, 0, GFP_KERNEL);
@@ -6005,6 +6022,8 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
/* Create the root cgroup state for this subsystem */
ss->root = &cgrp_dfl_root;
css = ss->css_alloc(NULL);
+ atomic_set(&ss->nr_csses, 1);
+
/* We don't handle early failures gracefully */
BUG_ON(IS_ERR(css));
init_and_link_css(css, ss, &cgrp_dfl_root.cgrp);
--
2.39.3
* Re: [PATCH-cgroup v2] cgroup: Show # of subsystem CSSes in root cgroup.stat
From: Kamalesh Babulal @ 2024-07-09 15:58 UTC
To: Waiman Long, Tejun Heo, Zefan Li, Johannes Weiner,
Jonathan Corbet
Cc: cgroups, linux-doc, linux-kernel
On 7/9/24 6:58 PM, Waiman Long wrote:
[...]
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index c8e4b62b436a..48eba2737b1a 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3669,12 +3669,27 @@ static int cgroup_events_show(struct seq_file *seq, void *v)
> static int cgroup_stat_show(struct seq_file *seq, void *v)
> {
> struct cgroup *cgroup = seq_css(seq)->cgroup;
> + struct cgroup_subsys *ss;
> + int i;
>
> seq_printf(seq, "nr_descendants %d\n",
> cgroup->nr_descendants);
> seq_printf(seq, "nr_dying_descendants %d\n",
> cgroup->nr_dying_descendants);
>
> + if (cgroup_parent(cgroup))
> + return 0;
> +
> + /*
> + * For the root cgroup, shows the number of csses associated
> + * with each of non-inhibited cgroup subsystems bound to it.
> + */
> + do_each_subsys_mask(ss, i, ~cgrp_dfl_inhibit_ss_mask) {
> + if (ss->root != &cgrp_dfl_root)
> + continue;
> + seq_printf(seq, "nr_%s %d\n", ss->name,
> + atomic_read(&ss->nr_csses));
> + } while_each_subsys_mask();
> return 0;
> }
>
Thanks for adding nr_csses, the patch looks good to me. A preference comment,
nr_<subsys>_css format, makes it easier to interpret the count.
With or without the changes to the cgroup subsys format:
Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
--
Thanks,
Kamalesh
* Re: [PATCH-cgroup v2] cgroup: Show # of subsystem CSSes in root cgroup.stat
From: Waiman Long @ 2024-07-09 16:09 UTC
To: Kamalesh Babulal, Tejun Heo, Zefan Li, Johannes Weiner,
Jonathan Corbet
Cc: cgroups, linux-doc, linux-kernel
On 7/9/24 11:58, Kamalesh Babulal wrote:
>
> On 7/9/24 6:58 PM, Waiman Long wrote:
[...]
> Thanks for adding nr_csses, the patch looks good to me. A preference comment,
> nr_<subsys>_css format, makes it easier to interpret the count.
>
> With or without the changes to the cgroup subsys format:
>
> Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Thanks for the review.
CSS is a kernel-internal name for the cgroup subsystem state. Non-kernel
developers or users may not know what a CSS is, and cgroup-v2.rst doesn't
mention CSS at all. So I don't think it is a good idea to add the "_css"
suffix. From the user's point of view, the proper term to use here is the
number of cgroups, just like what "nr_descendants" and
"nr_dying_descendants" are referring to before this patch. The only
issue that I didn't address is the use of the proper plural form, which
is hard to do with the cgroup subsystem names that we have.
Cheers,
Longman
* Re: [PATCH-cgroup v2] cgroup: Show # of subsystem CSSes in root cgroup.stat
From: Johannes Weiner @ 2024-07-09 18:02 UTC
To: Waiman Long
Cc: Kamalesh Babulal, Tejun Heo, Zefan Li, Jonathan Corbet, cgroups,
linux-doc, linux-kernel
On Tue, Jul 09, 2024 at 12:09:05PM -0400, Waiman Long wrote:
> On 7/9/24 11:58, Kamalesh Babulal wrote:
> >
> > On 7/9/24 6:58 PM, Waiman Long wrote:
[...]
> > Thanks for adding nr_csses, the patch looks good to me. A preference comment,
> > nr_<subsys>_css format, makes it easier to interpret the count.
> >
> > With or without the changes to the cgroup subsys format:
> >
> > Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
>
> Thanks for the review.
>
> CSS is a kernel-internal name for the cgroup subsystem state. Non-kernel
> developers or users may not know what a CSS is, and cgroup-v2.rst doesn't
> mention CSS at all. So I don't think it is a good idea to add the "_css"
> suffix. From the user's point of view, the proper term to use here is the
> number of cgroups, just like what "nr_descendants" and
> "nr_dying_descendants" are referring to before this patch. The only
> issue that I didn't address is the use of the proper plural form, which
> is hard to do with the cgroup subsystem names that we have.
It's not quite the same right? You could have 1 dying cgroup with
multiple zombie subsys states. At least in theory. It could be
confusing to add these counts without introducing the css concept.
I also wonder if it would be better to just report the dying css
instead of all of them. Live ones are 1) under user control and 2)
easy to inspect in cgroupfs. I can see a scenario for the
nr_descendants aggregation ("Oh, that's a lot of subgroups!"); and a
scenario for dying css ("Oh, it's memory state pinning dead groups!").
But not so much "Oh, that's a lot of live memory controlled groups!"
I can't think of a good name for it though.
nr_dying_memory_css is a mouthful
nr_offline_memory?
nr_zombie_memory?
Should this be in debugfs?
* Re: [PATCH-cgroup v2] cgroup: Show # of subsystem CSSes in root cgroup.stat
From: Waiman Long @ 2024-07-09 19:12 UTC
To: Johannes Weiner
Cc: Kamalesh Babulal, Tejun Heo, Zefan Li, Jonathan Corbet, cgroups,
linux-doc, linux-kernel
On 7/9/24 14:02, Johannes Weiner wrote:
> On Tue, Jul 09, 2024 at 12:09:05PM -0400, Waiman Long wrote:
>> On 7/9/24 11:58, Kamalesh Babulal wrote:
>>> On 7/9/24 6:58 PM, Waiman Long wrote:
[...]
>>> Thanks for adding nr_csses, the patch looks good to me. A preference comment,
>>> nr_<subsys>_css format, makes it easier to interpret the count.
>>>
>>> With or without the changes to the cgroup subsys format:
>>>
>>> Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
>> Thanks for the review.
>>
>> CSS is a kernel-internal name for the cgroup subsystem state. Non-kernel
>> developers or users may not know what a CSS is, and cgroup-v2.rst doesn't
>> mention CSS at all. So I don't think it is a good idea to add the "_css"
>> suffix. From the user's point of view, the proper term to use here is the
>> number of cgroups, just like what "nr_descendants" and
>> "nr_dying_descendants" are referring to before this patch. The only
>> issue that I didn't address is the use of the proper plural form, which
>> is hard to do with the cgroup subsystem names that we have.
> It's not quite the same right? You could have 1 dying cgroup with
> multiple zombie subsys states. At least in theory. It could be
> confusing to add these counts without introducing the css concept.
Right. There can be multiple csses of different subsystems associated
with each cgroup.
>
> I also wonder if it would be better to just report the dying css
> instead of all of them. Live ones are 1) under user control and 2)
> easy to inspect in cgroupfs. I can see a scenario for the
> nr_descendants aggregation ("Oh, that's a lot of subgroups!"); and a
> scenario for dying css ("Oh, it's memory state pinning dead groups!").
> But not so much "Oh, that's a lot of live memory controlled groups!"
Right now, the patch just keeps a set of simple counters for the total
number of outstanding csses for each cgroup subsystem. We currently
don't track how many of those csses are associated with dying cgroups.
That would require either iterating over all the csses to see if they
are attached to a dying cgroup, or adding more housekeeping code whenever
the state of a cgroup changes. So it will burn more CPU cycles.
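Just to illustrate what that extra housekeeping could look like (a rough,
untested sketch, not part of this patch; the field and hook placement are
hypothetical), each subsystem could carry a second counter that is bumped
when a css goes offline and dropped when it is finally freed:

	/* hypothetical extra field in struct cgroup_subsys */
	atomic_t nr_dying_csses;

	/* e.g. in offline_css(), once the css has been taken offline */
	atomic_inc(&css->ss->nr_dying_csses);

	/* e.g. in css_free_rwork_fn(), next to the nr_csses decrement */
	atomic_dec(&css->ss->nr_dying_csses);

That keeps the read side cheap, but it adds work to every offline/free
transition, which is the extra cost I am referring to above.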
>
> I can't think of a good name for it though.
>
> nr_dying_memory_css is a mouthful
>
> nr_offline_memory?
>
> nr_zombie_memory?
>
> Should this be in debugfs?
The current css counts include csses of both live and dying cgroups. As I
said above, it would be much more work if we want to report only the dying
csses. That information is certainly good to have, but the current counts
should provide enough information for an educated guess.
As for the location of this information, I believe it will be easier for
users to consume if it sits in the same place as nr_descendants and
nr_dying_descendants.
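For example, a monitoring script that already reads the root cgroup.stat
for nr_dying_descendants will pick up the new per-subsystem lines for
free. A minimal userspace sketch (assuming cgroup v2 is mounted at
/sys/fs/cgroup):

	#include <stdio.h>

	int main(void)
	{
		/* root cgroup.stat; path assumes the usual v2 mount point */
		FILE *f = fopen("/sys/fs/cgroup/cgroup.stat", "r");
		char key[64];
		long val;

		if (!f) {
			perror("fopen");
			return 1;
		}
		/* each line is "<key> <value>", including the nr_<subsys> ones */
		while (fscanf(f, "%63s %ld", key, &val) == 2)
			printf("%-24s %ld\n", key, val);
		fclose(f);
		return 0;
	}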
Cheers,
Longman
* Re: [PATCH-cgroup v2] cgroup: Show # of subsystem CSSes in root cgroup.stat
From: Waiman Long @ 2024-07-09 19:15 UTC
To: Johannes Weiner
Cc: Kamalesh Babulal, Tejun Heo, Zefan Li, Jonathan Corbet, cgroups,
linux-doc, linux-kernel
On 7/9/24 15:12, Waiman Long wrote:
> On 7/9/24 14:02, Johannes Weiner wrote:
>> On Tue, Jul 09, 2024 at 12:09:05PM -0400, Waiman Long wrote:
>>> On 7/9/24 11:58, Kamalesh Babulal wrote:
>>>> On 7/9/24 6:58 PM, Waiman Long wrote:
[...]
BTW, I consider the new counts in cgroup.stat to be the same information
you can get from /proc/cgroups when using cgroup v1. These counts are
now moved into cgroup.stat instead because we want to retire the use of
/proc/cgroups.
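For reference, /proc/cgroups has one row per subsystem in the form below
(illustrative values only); once a subsystem is bound to v2, the hierarchy
column reads 0 and num_cgroups no longer tells you anything useful about
that particular subsystem, which is the limitation described in the
changelog:

	#subsys_name	hierarchy	num_cgroups	enabled
	cpuset	0	105	1
	memory	0	105	1

The new nr_<subsys> counts in cgroup.stat give the per-subsystem numbers
that this file can no longer provide.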
Cheers,
Longman
* Re: [PATCH-cgroup v2] cgroup: Show # of subsystem CSSes in root cgroup.stat
From: Roman Gushchin @ 2024-07-09 23:08 UTC
To: Waiman Long
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, cgroups,
linux-doc, linux-kernel
On Tue, Jul 09, 2024 at 09:28:14AM -0400, Waiman Long wrote:
[...]
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index b36690ca0d3f..522ab77f0406 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -776,6 +776,9 @@ struct cgroup_subsys {
> * specifies the mask of subsystems that this one depends on.
> */
> unsigned int depends_on;
> +
> + /* Number of CSSes, used only for /proc/cgroups */
> + atomic_t nr_csses;
I believe it should be doable without atomics because most css
operations are already synchronized using the cgroup mutex.
Other than that, I believe that this information is useful. Maybe
it can be retrieved using a drgn/bpf iterator, but adding this functionality
to the kernel makes it easier to retrieve, and the overhead is modest.
Also, if you add it to cgroupfs, why not make it fully hierarchical
like the existing entries in cgroup.stat? If not, I'd agree with Johannes
that it looks like debugfs material.
Thanks!
* Re: [PATCH-cgroup v2] cgroup: Show # of subsystem CSSes in root cgroup.stat
From: Waiman Long @ 2024-07-10 0:23 UTC
To: Roman Gushchin
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, cgroups,
linux-doc, linux-kernel
On 7/9/24 19:08, Roman Gushchin wrote:
> On Tue, Jul 09, 2024 at 09:28:14AM -0400, Waiman Long wrote:
[...]
>> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
>> index b36690ca0d3f..522ab77f0406 100644
>> --- a/include/linux/cgroup-defs.h
>> +++ b/include/linux/cgroup-defs.h
>> @@ -776,6 +776,9 @@ struct cgroup_subsys {
>> * specifies the mask of subsystems that this one depends on.
>> */
>> unsigned int depends_on;
>> +
>> + /* Number of CSSes, used only for /proc/cgroups */
>> + atomic_t nr_csses;
> I believe it should be doable without atomics because most css
> operations are already synchronized using the cgroup mutex.
css_create() is protected by cgroup_mutex, but I don't believe
css_free_rwork_fn() is; it is called from a kworker. So atomic_t is
still needed.
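For completeness, a non-atomic variant would roughly look like the sketch
below (untested; it assumes both the increment and the decrement can be
moved into paths that run with cgroup_mutex held, e.g. the css release
work, which would have to be double checked):

	/* hypothetical: plain counter, writers serialized by cgroup_mutex */
	int nr_csses;

	static void ss_adjust_nr_csses(struct cgroup_subsys *ss, int delta)
	{
		lockdep_assert_held(&cgroup_mutex);
		ss->nr_csses += delta;
	}

Given that the free path runs from a kworker without the mutex today,
the atomic_t keeps the patch simpler.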
>
> Other than that, I believe that this information is useful. Maybe
> it can be retrieved using a drgn/bpf iterator, but adding this functionality
> to the kernel makes it easier to retrieve, and the overhead is modest.
>
> Also, if you add it to cgroupfs, why not make it fully hierarchical
> like the existing entries in cgroup.stat? If not, I'd agree with Johannes
> that it looks like debugfs material.
To make it hierarchical, I would have to store nr_descendants and
nr_dying_descendants counts in each css, just like the corresponding ones
in struct cgroup. I think it is doable, but the patch will be much more
complex.
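Roughly speaking (hypothetical field and helper names, untested), every
css creation/release would then also have to walk up the css parent chain,
e.g.:

	static void css_propagate_count(struct cgroup_subsys_state *css,
					int delta)
	{
		struct cgroup_subsys_state *parent;

		/* bump the hypothetical per-css descendant counters upwards */
		for (parent = css->parent; parent; parent = parent->parent)
			atomic_add(delta, &parent->nr_descendant_csses);
	}

which is essentially what nr_descendants/nr_dying_descendants already do
at the cgroup level.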
Cheers,
Longman
>
> Thanks!
>
* Re: [PATCH-cgroup v2] cgroup: Show # of subsystem CSSes in root cgroup.stat
From: Kamalesh Babulal @ 2024-07-10 8:43 UTC
To: Waiman Long, Tejun Heo, Zefan Li, Johannes Weiner,
Jonathan Corbet
Cc: cgroups, linux-doc, linux-kernel
On 7/9/24 9:39 PM, Waiman Long wrote:
> On 7/9/24 11:58, Kamalesh Babulal wrote:
>>
>> On 7/9/24 6:58 PM, Waiman Long wrote:
[...]
>> Thanks for adding nr_csses, the patch looks good to me. A preference comment,
>> nr_<subsys>_css format, makes it easier to interpret the count.
>>
>> With or without the changes to the cgroup subsys format:
>>
>> Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
>
> Thanks for the review.
>
> CSS is a kernel-internal name for the cgroup subsystem state. Non-kernel
> developers or users may not know what a CSS is, and cgroup-v2.rst doesn't
> mention CSS at all. So I don't think it is a good idea to add the "_css"
> suffix. From the user's point of view, the proper term to use here is the
> number of cgroups, just like what "nr_descendants" and
> "nr_dying_descendants" are referring to before this patch. The only issue
> that I didn't address is the use of the proper plural form, which is hard
> to do with the cgroup subsystem names that we have.
Agreed, css might not be a familiar term to many outside kernel development,
though it has been introduced to some extent in the cgroup-v1 documentation.
The nr_<subsys> count is the sum of all subsystem state objects, so the
"_css" suffix sounded preferable to me, but it might not be as intuitive to
the user. nr_<subsys>_cgrps was the other suffix I thought of, but it reads
as redundant in the cgroup.stat file.
--
Thanks,
Kamalesh
* Re: [PATCH-cgroup v2] cgroup: Show # of subsystem CSSes in root cgroup.stat
From: Waiman Long @ 2024-07-10 18:28 UTC
To: Johannes Weiner
Cc: Kamalesh Babulal, Tejun Heo, Zefan Li, Jonathan Corbet, cgroups,
linux-doc, linux-kernel
On 7/9/24 14:02, Johannes Weiner wrote:
> On Tue, Jul 09, 2024 at 12:09:05PM -0400, Waiman Long wrote:
>> On 7/9/24 11:58, Kamalesh Babulal wrote:
>>> On 7/9/24 6:58 PM, Waiman Long wrote:
[...]
> It's not quite the same right? You could have 1 dying cgroup with
> multiple zombie subsys states. At least in theory. It could be
> confusing to add these counts without introducing the css concept.
>
> I also wonder if it would be better to just report the dying css
> instead of all of them. Live ones are 1) under user control and 2)
> easy to inspect in cgroupfs. I can see a scenario for the
> nr_descendants aggregation ("Oh, that's a lot of subgroups!"); and a
> scenario for dying css ("Oh, it's memory state pinning dead groups!").
> But not so much "Oh, that's a lot of live memory controlled groups!"
>
> I can't think of a good name for it though.
>
> nr_dying_memory_css is a mouthful
>
> nr_offline_memory?
>
> nr_zombie_memory?
>
> Should this be in debugfs?
>
I have just sent out a v3 patch that makes this hierarchical and separates
out live and dying csses. Let me know if you have other suggestions.
Thanks,
Longman