public inbox for cgroups@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
@ 2024-07-10 18:23 Waiman Long
  2024-07-10 18:23 ` [PATCH v3 2/2] cgroup: Limit frequency of reading cgroup.stat for unprivileged users Waiman Long
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Waiman Long @ 2024-07-10 18:23 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet
  Cc: cgroups, linux-doc, linux-kernel, Kamalesh Babulal,
	Roman Gushchin, Waiman Long

Cgroup subsystem state (CSS) is an abstraction in the cgroup layer to
help manage different structures in various cgroup subsystems by being
an embedded element inside a larger structure like cpuset or mem_cgroup.

The /proc/cgroups file shows the number of cgroups for each of the
subsystems.  With cgroup v1, the number of CSSes is the same as the
number of cgroups.  That is not the case anymore with cgroup v2. The
/proc/cgroups file cannot show the actual number of CSSes for the
subsystems that are bound to cgroup v2.

So if a v2 cgroup subsystem is leaking cgroups (usually memory cgroup),
we can't tell by looking at /proc/cgroups which cgroup subsystems may
be responsible.

As cgroup v2 had deprecated the use of /proc/cgroups, the hierarchical
cgroup.stat file is now being extended to show the number of live and
dying CSSes associated with all the non-inhibited cgroup subsystems
that have been bound to cgroup v2 as long as it is not zero.  The number
includes CSSes in the current cgroup as well as in all the descendants
underneath it.  This will help us pinpoint which subsystems are
responsible for the increasing number of dying (nr_dying_descendants)
cgroups.

The cgroup-v2.rst file is updated to discuss this new behavior.

With this patch applied, a sample output from root cgroup.stat file
was shown below.

	nr_descendants 54
	nr_dying_descendants 44
	nr_cpuset 1
	nr_cpu 40
	nr_io 40
	nr_memory 54
	nr_dying_memory 44
	nr_perf_event 55
	nr_hugetlb 1
	nr_pids 54
	nr_rdma 1
	nr_misc 1

Another sample output from system.slice/cgroup.stat was:

	nr_descendants 32
	nr_dying_descendants 37
	nr_cpu 30
	nr_io 30
	nr_memory 32
	nr_dying_memory 37
	nr_perf_event 33
	nr_pids 32

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 14 ++++++-
 include/linux/cgroup-defs.h             |  7 ++++
 kernel/cgroup/cgroup.c                  | 50 ++++++++++++++++++++++++-
 3 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 52763d6b2919..9031419271cd 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -981,6 +981,16 @@ All cgroup core files are prefixed with "cgroup."
 		A dying cgroup can consume system resources not exceeding
 		limits, which were active at the moment of cgroup deletion.
 
+	  nr_<cgroup_subsys>
+		Total number of live cgroups associated with that cgroup
+		subsystem (e.g. memory) at and beneath the current
+		cgroup.  An entry will only be shown if it is not zero.
+
+	  nr_dying_<cgroup_subsys>
+		Total number of dying cgroups associated with that cgroup
+		subsystem (e.g. memory) beneath the current cgroup.
+		An entry will only be shown if it is not zero.
+
   cgroup.freeze
 	A read-write single value file which exists on non-root cgroups.
 	Allowed values are "0" and "1". The default is "0".
@@ -2930,8 +2940,8 @@ Deprecated v1 Core Features
 
 - "cgroup.clone_children" is removed.
 
-- /proc/cgroups is meaningless for v2.  Use "cgroup.controllers" file
-  at the root instead.
+- /proc/cgroups is meaningless for v2.  Use "cgroup.controllers" or
+  "cgroup.stat" files at the root instead.
 
 
 Issues with v1 and Rationales for v2
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index b36690ca0d3f..62de18874508 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -210,6 +210,13 @@ struct cgroup_subsys_state {
 	 * fields of the containing structure.
 	 */
 	struct cgroup_subsys_state *parent;
+
+	/*
+	 * Keep track of total numbers of visible and dying descendant CSSes.
+	 * Protected by cgroup_mutex.
+	 */
+	int nr_descendants;
+	int nr_dying_descendants;
 };
 
 /*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c8e4b62b436a..18c982a06446 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3669,12 +3669,34 @@ static int cgroup_events_show(struct seq_file *seq, void *v)
 static int cgroup_stat_show(struct seq_file *seq, void *v)
 {
 	struct cgroup *cgroup = seq_css(seq)->cgroup;
+	struct cgroup_subsys_state *css;
+	int ssid;
 
+	/* cgroup_mutex required for for_each_css() */
+	cgroup_lock();
 	seq_printf(seq, "nr_descendants %d\n",
 		   cgroup->nr_descendants);
 	seq_printf(seq, "nr_dying_descendants %d\n",
 		   cgroup->nr_dying_descendants);
 
+	/*
+	 * Show the number of live and dying csses associated with each of
+	 * non-inhibited cgroup subsystems bound to cgroup v2 if non-zero.
+	 */
+	for_each_css(css, ssid, cgroup) {
+		if ((BIT(ssid) & cgrp_dfl_inhibit_ss_mask) ||
+		    (cgroup_subsys[ssid]->root !=  &cgrp_dfl_root))
+			continue;
+
+		seq_printf(seq, "nr_%s %d\n", cgroup_subsys[ssid]->name,
+			   css->nr_descendants + 1);
+		/* Current css is online */
+		if (css->nr_dying_descendants)
+			seq_printf(seq, "nr_dying_%s %d\n",
+				   cgroup_subsys[ssid]->name,
+				   css->nr_dying_descendants);
+	}
+	cgroup_unlock();
 	return 0;
 }
 
@@ -5424,6 +5446,8 @@ static void css_release_work_fn(struct work_struct *work)
 	list_del_rcu(&css->sibling);
 
 	if (ss) {
+		struct cgroup_subsys_state *parent_css;
+
 		/* css release path */
 		if (!list_empty(&css->rstat_css_node)) {
 			cgroup_rstat_flush(cgrp);
@@ -5433,6 +5457,14 @@ static void css_release_work_fn(struct work_struct *work)
 		cgroup_idr_replace(&ss->css_idr, NULL, css->id);
 		if (ss->css_released)
 			ss->css_released(css);
+
+		WARN_ON_ONCE(css->nr_descendants || css->nr_dying_descendants);
+		parent_css = css->parent;
+		while (parent_css) {
+			parent_css->nr_dying_descendants--;
+			parent_css = parent_css->parent;
+		}
+		css_put(css->parent);	/* Parent can be freed now */
 	} else {
 		struct cgroup *tcgrp;
 
@@ -5517,8 +5549,11 @@ static int online_css(struct cgroup_subsys_state *css)
 		rcu_assign_pointer(css->cgroup->subsys[ss->id], css);
 
 		atomic_inc(&css->online_cnt);
-		if (css->parent)
+		if (css->parent) {
 			atomic_inc(&css->parent->online_cnt);
+			while ((css = css->parent))
+				css->nr_descendants++;
+		}
 	}
 	return ret;
 }
@@ -5540,6 +5575,19 @@ static void offline_css(struct cgroup_subsys_state *css)
 	RCU_INIT_POINTER(css->cgroup->subsys[ss->id], NULL);
 
 	wake_up_all(&css->cgroup->offline_waitq);
+
+	/*
+	 * Get a reference to parent css to ensure reliable access to its
+	 * nr_dying_descendants until after this child css is ready to be
+	 * freed.
+	 */
+	if (css->parent)
+		css_get(css->parent);
+
+	while ((css = css->parent)) {
+		css->nr_descendants--;
+		css->nr_dying_descendants++;
+	}
 }
 
 /**
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 2/2] cgroup: Limit frequency of reading cgroup.stat for unprivileged users
  2024-07-10 18:23 [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat Waiman Long
@ 2024-07-10 18:23 ` Waiman Long
  2024-07-10 21:43 ` [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat Roman Gushchin
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Waiman Long @ 2024-07-10 18:23 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet
  Cc: cgroups, linux-doc, linux-kernel, Kamalesh Babulal,
	Roman Gushchin, Waiman Long

Since cgroup_mutex is being acquired when reading from a world readable
cgroup.stat control file, it is possible that repeated reading of
cgroup.stat may be used as a denial of service (DoS) attack vector
by unprivileged users to greatly hinder cgroup related operations in
a system.

To prevent this, we are limiting the reading of cgroup.stat file from
unprivileged users to at most 8 times per second.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cgroup.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 18c982a06446..56ac9f14d100 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3668,12 +3668,22 @@ static int cgroup_events_show(struct seq_file *seq, void *v)
 
 static int cgroup_stat_show(struct seq_file *seq, void *v)
 {
+	static unsigned long unpriv_timestamp;
 	struct cgroup *cgroup = seq_css(seq)->cgroup;
 	struct cgroup_subsys_state *css;
 	int ssid;
 
+	/*
+	 * Unprivileged users can only read cgroup.stat up to 8 times
+	 * per second to avoid potential DoS attack.
+	 */
+	if (!capable(CAP_SYS_ADMIN) &&
+	    time_before(jiffies, unpriv_timestamp + (HZ >> 3)))
+		msleep(HZ >> 3);
+
 	/* cgroup_mutex required for for_each_css() */
 	cgroup_lock();
+	unpriv_timestamp = jiffies;
 	seq_printf(seq, "nr_descendants %d\n",
 		   cgroup->nr_descendants);
 	seq_printf(seq, "nr_dying_descendants %d\n",
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
  2024-07-10 18:23 [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat Waiman Long
  2024-07-10 18:23 ` [PATCH v3 2/2] cgroup: Limit frequency of reading cgroup.stat for unprivileged users Waiman Long
@ 2024-07-10 21:43 ` Roman Gushchin
  2024-07-10 23:49   ` Waiman Long
  2024-07-10 21:59 ` Tejun Heo
  2024-07-11 13:49 ` Johannes Weiner
  3 siblings, 1 reply; 21+ messages in thread
From: Roman Gushchin @ 2024-07-10 21:43 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, cgroups,
	linux-doc, linux-kernel, Kamalesh Babulal

On Wed, Jul 10, 2024 at 02:23:52PM -0400, Waiman Long wrote:
> Cgroup subsystem state (CSS) is an abstraction in the cgroup layer to
> help manage different structures in various cgroup subsystems by being
> an embedded element inside a larger structure like cpuset or mem_cgroup.
> 
> The /proc/cgroups file shows the number of cgroups for each of the
> subsystems.  With cgroup v1, the number of CSSes is the same as the
> number of cgroups.  That is not the case anymore with cgroup v2. The
> /proc/cgroups file cannot show the actual number of CSSes for the
> subsystems that are bound to cgroup v2.
> 
> So if a v2 cgroup subsystem is leaking cgroups (usually memory cgroup),
> we can't tell by looking at /proc/cgroups which cgroup subsystems may
> be responsible.
> 
> As cgroup v2 had deprecated the use of /proc/cgroups, the hierarchical
> cgroup.stat file is now being extended to show the number of live and
> dying CSSes associated with all the non-inhibited cgroup subsystems
> that have been bound to cgroup v2 as long as it is not zero.  The number
> includes CSSes in the current cgroup as well as in all the descendants
> underneath it.  This will help us pinpoint which subsystems are
> responsible for the increasing number of dying (nr_dying_descendants)
> cgroups.
> 
> The cgroup-v2.rst file is updated to discuss this new behavior.
> 
> With this patch applied, a sample output from root cgroup.stat file
> was shown below.
> 
> 	nr_descendants 54
> 	nr_dying_descendants 44
> 	nr_cpuset 1
> 	nr_cpu 40
> 	nr_io 40
> 	nr_memory 54
> 	nr_dying_memory 44
> 	nr_perf_event 55
> 	nr_hugetlb 1
> 	nr_pids 54
> 	nr_rdma 1
> 	nr_misc 1
> 
> Another sample output from system.slice/cgroup.stat was:
> 
> 	nr_descendants 32
> 	nr_dying_descendants 37
> 	nr_cpu 30
> 	nr_io 30
> 	nr_memory 32
> 	nr_dying_memory 37
> 	nr_perf_event 33
> 	nr_pids 32
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

I like it way more than the previous version, thank you for the update.

> ---
>  Documentation/admin-guide/cgroup-v2.rst | 14 ++++++-
>  include/linux/cgroup-defs.h             |  7 ++++
>  kernel/cgroup/cgroup.c                  | 50 ++++++++++++++++++++++++-
>  3 files changed, 68 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 52763d6b2919..9031419271cd 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -981,6 +981,16 @@ All cgroup core files are prefixed with "cgroup."
>  		A dying cgroup can consume system resources not exceeding
>  		limits, which were active at the moment of cgroup deletion.
>  
> +	  nr_<cgroup_subsys>
> +		Total number of live cgroups associated with that cgroup
> +		subsystem (e.g. memory) at and beneath the current
> +		cgroup.  An entry will only be shown if it is not zero.
> +
> +	  nr_dying_<cgroup_subsys>
> +		Total number of dying cgroups associated with that cgroup
> +		subsystem (e.g. memory) beneath the current cgroup.
> +		An entry will only be shown if it is not zero.
> +
>    cgroup.freeze
>  	A read-write single value file which exists on non-root cgroups.
>  	Allowed values are "0" and "1". The default is "0".
> @@ -2930,8 +2940,8 @@ Deprecated v1 Core Features
>  
>  - "cgroup.clone_children" is removed.
>  
> -- /proc/cgroups is meaningless for v2.  Use "cgroup.controllers" file
> -  at the root instead.
> +- /proc/cgroups is meaningless for v2.  Use "cgroup.controllers" or
> +  "cgroup.stat" files at the root instead.
>  
>  
>  Issues with v1 and Rationales for v2
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index b36690ca0d3f..62de18874508 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -210,6 +210,13 @@ struct cgroup_subsys_state {
>  	 * fields of the containing structure.
>  	 */
>  	struct cgroup_subsys_state *parent;
> +
> +	/*
> +	 * Keep track of total numbers of visible and dying descendant CSSes.
> +	 * Protected by cgroup_mutex.
> +	 */
> +	int nr_descendants;
> +	int nr_dying_descendants;
>  };
>  
>  /*
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index c8e4b62b436a..18c982a06446 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3669,12 +3669,34 @@ static int cgroup_events_show(struct seq_file *seq, void *v)
>  static int cgroup_stat_show(struct seq_file *seq, void *v)
>  {
>  	struct cgroup *cgroup = seq_css(seq)->cgroup;
> +	struct cgroup_subsys_state *css;
> +	int ssid;
>  
> +	/* cgroup_mutex required for for_each_css() */
> +	cgroup_lock();

I *guess* it can be done under a rcu_read_lock(), isn't it?
That would eliminate a need for the second patch as well, which
is questionable (e.g. one unprivileged user can block others?)

Thanks!

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
  2024-07-10 18:23 [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat Waiman Long
  2024-07-10 18:23 ` [PATCH v3 2/2] cgroup: Limit frequency of reading cgroup.stat for unprivileged users Waiman Long
  2024-07-10 21:43 ` [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat Roman Gushchin
@ 2024-07-10 21:59 ` Tejun Heo
  2024-07-10 23:51   ` Waiman Long
  2024-07-11 13:49 ` Johannes Weiner
  3 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2024-07-10 21:59 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, cgroups, linux-doc,
	linux-kernel, Kamalesh Babulal, Roman Gushchin

Hello,

On Wed, Jul 10, 2024 at 02:23:52PM -0400, Waiman Long wrote:
> With this patch applied, a sample output from root cgroup.stat file
> was shown below.
> 
> 	nr_descendants 54
> 	nr_dying_descendants 44
> 	nr_cpuset 1
> 	nr_cpu 40
> 	nr_io 40
> 	nr_memory 54
> 	nr_dying_memory 44
> 	nr_perf_event 55
> 	nr_hugetlb 1
> 	nr_pids 54
> 	nr_rdma 1
> 	nr_misc 1

So, css may be too specific a name but this looks a bit disorganized. How
about using controller as the common prefix? Maybe something like:

	nr_controllers_cpu 40
	nr_controllers_io 40
	nr_controllers_memory 54
	nr_controller_perf_event 55
	...
	nr_dying_controllers_memory 44

If controllers is too long, we can shorten it somehow or use subsys, maybe?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
  2024-07-10 21:43 ` [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat Roman Gushchin
@ 2024-07-10 23:49   ` Waiman Long
  0 siblings, 0 replies; 21+ messages in thread
From: Waiman Long @ 2024-07-10 23:49 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, cgroups,
	linux-doc, linux-kernel, Kamalesh Babulal

On 7/10/24 17:43, Roman Gushchin wrote:
> On Wed, Jul 10, 2024 at 02:23:52PM -0400, Waiman Long wrote:
>> Cgroup subsystem state (CSS) is an abstraction in the cgroup layer to
>> help manage different structures in various cgroup subsystems by being
>> an embedded element inside a larger structure like cpuset or mem_cgroup.
>>
>> The /proc/cgroups file shows the number of cgroups for each of the
>> subsystems.  With cgroup v1, the number of CSSes is the same as the
>> number of cgroups.  That is not the case anymore with cgroup v2. The
>> /proc/cgroups file cannot show the actual number of CSSes for the
>> subsystems that are bound to cgroup v2.
>>
>> So if a v2 cgroup subsystem is leaking cgroups (usually memory cgroup),
>> we can't tell by looking at /proc/cgroups which cgroup subsystems may
>> be responsible.
>>
>> As cgroup v2 had deprecated the use of /proc/cgroups, the hierarchical
>> cgroup.stat file is now being extended to show the number of live and
>> dying CSSes associated with all the non-inhibited cgroup subsystems
>> that have been bound to cgroup v2 as long as it is not zero.  The number
>> includes CSSes in the current cgroup as well as in all the descendants
>> underneath it.  This will help us pinpoint which subsystems are
>> responsible for the increasing number of dying (nr_dying_descendants)
>> cgroups.
>>
>> The cgroup-v2.rst file is updated to discuss this new behavior.
>>
>> With this patch applied, a sample output from root cgroup.stat file
>> was shown below.
>>
>> 	nr_descendants 54
>> 	nr_dying_descendants 44
>> 	nr_cpuset 1
>> 	nr_cpu 40
>> 	nr_io 40
>> 	nr_memory 54
>> 	nr_dying_memory 44
>> 	nr_perf_event 55
>> 	nr_hugetlb 1
>> 	nr_pids 54
>> 	nr_rdma 1
>> 	nr_misc 1
>>
>> Another sample output from system.slice/cgroup.stat was:
>>
>> 	nr_descendants 32
>> 	nr_dying_descendants 37
>> 	nr_cpu 30
>> 	nr_io 30
>> 	nr_memory 32
>> 	nr_dying_memory 37
>> 	nr_perf_event 33
>> 	nr_pids 32
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> I like it way more than the previous version, thank you for the update.
>
>> ---
>>   Documentation/admin-guide/cgroup-v2.rst | 14 ++++++-
>>   include/linux/cgroup-defs.h             |  7 ++++
>>   kernel/cgroup/cgroup.c                  | 50 ++++++++++++++++++++++++-
>>   3 files changed, 68 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>> index 52763d6b2919..9031419271cd 100644
>> --- a/Documentation/admin-guide/cgroup-v2.rst
>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>> @@ -981,6 +981,16 @@ All cgroup core files are prefixed with "cgroup."
>>   		A dying cgroup can consume system resources not exceeding
>>   		limits, which were active at the moment of cgroup deletion.
>>   
>> +	  nr_<cgroup_subsys>
>> +		Total number of live cgroups associated with that cgroup
>> +		subsystem (e.g. memory) at and beneath the current
>> +		cgroup.  An entry will only be shown if it is not zero.
>> +
>> +	  nr_dying_<cgroup_subsys>
>> +		Total number of dying cgroups associated with that cgroup
>> +		subsystem (e.g. memory) beneath the current cgroup.
>> +		An entry will only be shown if it is not zero.
>> +
>>     cgroup.freeze
>>   	A read-write single value file which exists on non-root cgroups.
>>   	Allowed values are "0" and "1". The default is "0".
>> @@ -2930,8 +2940,8 @@ Deprecated v1 Core Features
>>   
>>   - "cgroup.clone_children" is removed.
>>   
>> -- /proc/cgroups is meaningless for v2.  Use "cgroup.controllers" file
>> -  at the root instead.
>> +- /proc/cgroups is meaningless for v2.  Use "cgroup.controllers" or
>> +  "cgroup.stat" files at the root instead.
>>   
>>   
>>   Issues with v1 and Rationales for v2
>> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
>> index b36690ca0d3f..62de18874508 100644
>> --- a/include/linux/cgroup-defs.h
>> +++ b/include/linux/cgroup-defs.h
>> @@ -210,6 +210,13 @@ struct cgroup_subsys_state {
>>   	 * fields of the containing structure.
>>   	 */
>>   	struct cgroup_subsys_state *parent;
>> +
>> +	/*
>> +	 * Keep track of total numbers of visible and dying descendant CSSes.
>> +	 * Protected by cgroup_mutex.
>> +	 */
>> +	int nr_descendants;
>> +	int nr_dying_descendants;
>>   };
>>   
>>   /*
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index c8e4b62b436a..18c982a06446 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -3669,12 +3669,34 @@ static int cgroup_events_show(struct seq_file *seq, void *v)
>>   static int cgroup_stat_show(struct seq_file *seq, void *v)
>>   {
>>   	struct cgroup *cgroup = seq_css(seq)->cgroup;
>> +	struct cgroup_subsys_state *css;
>> +	int ssid;
>>   
>> +	/* cgroup_mutex required for for_each_css() */
>> +	cgroup_lock();
> I *guess* it can be done under a rcu_read_lock(), isn't it?
> That would eliminate a need for the second patch as well, which
> is questionable (e.g. one unprivileged user can block others?)

I am just following the instruction in the for_each_css() macro:

  *
  * Should be called under cgroup_mutex.
  */

I think taking rcu_read_lock() should also work in this case. Will try 
it out and update the patch after some testing.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
  2024-07-10 21:59 ` Tejun Heo
@ 2024-07-10 23:51   ` Waiman Long
  0 siblings, 0 replies; 21+ messages in thread
From: Waiman Long @ 2024-07-10 23:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, cgroups, linux-doc,
	linux-kernel, Kamalesh Babulal, Roman Gushchin


On 7/10/24 17:59, Tejun Heo wrote:
> Hello,
>
> On Wed, Jul 10, 2024 at 02:23:52PM -0400, Waiman Long wrote:
>> With this patch applied, a sample output from root cgroup.stat file
>> was shown below.
>>
>> 	nr_descendants 54
>> 	nr_dying_descendants 44
>> 	nr_cpuset 1
>> 	nr_cpu 40
>> 	nr_io 40
>> 	nr_memory 54
>> 	nr_dying_memory 44
>> 	nr_perf_event 55
>> 	nr_hugetlb 1
>> 	nr_pids 54
>> 	nr_rdma 1
>> 	nr_misc 1
> So, css may be too specific a name but this looks a bit disorganized. How
> about using controller as the common prefix? Maybe something like:
>
> 	nr_controllers_cpu 40
> 	nr_controllers_io 40
> 	nr_controllers_memory 54
> 	nr_controller_perf_event 55
> 	...
> 	nr_dying_controllers_memory 44
>
> If controllers is too long, we can shorten it somehow or use subsys, maybe?

I think "controllers" is too long. I am fine with "subsys". Will make 
change in the next version.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
  2024-07-10 18:23 [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat Waiman Long
                   ` (2 preceding siblings ...)
  2024-07-10 21:59 ` Tejun Heo
@ 2024-07-11 13:49 ` Johannes Weiner
  2024-07-11 14:05   ` Waiman Long
  3 siblings, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2024-07-11 13:49 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Jonathan Corbet, cgroups, linux-doc,
	linux-kernel, Kamalesh Babulal, Roman Gushchin

On Wed, Jul 10, 2024 at 02:23:52PM -0400, Waiman Long wrote:
> @@ -3669,12 +3669,34 @@ static int cgroup_events_show(struct seq_file *seq, void *v)
>  static int cgroup_stat_show(struct seq_file *seq, void *v)
>  {
>  	struct cgroup *cgroup = seq_css(seq)->cgroup;
> +	struct cgroup_subsys_state *css;
> +	int ssid;
>  
> +	/* cgroup_mutex required for for_each_css() */
> +	cgroup_lock();
>  	seq_printf(seq, "nr_descendants %d\n",
>  		   cgroup->nr_descendants);
>  	seq_printf(seq, "nr_dying_descendants %d\n",
>  		   cgroup->nr_dying_descendants);
>  
> +	/*
> +	 * Show the number of live and dying csses associated with each of
> +	 * non-inhibited cgroup subsystems bound to cgroup v2 if non-zero.
> +	 */
> +	for_each_css(css, ssid, cgroup) {
> +		if ((BIT(ssid) & cgrp_dfl_inhibit_ss_mask) ||
> +		    (cgroup_subsys[ssid]->root !=  &cgrp_dfl_root))
> +			continue;
> +
> +		seq_printf(seq, "nr_%s %d\n", cgroup_subsys[ssid]->name,
> +			   css->nr_descendants + 1);
> +		/* Current css is online */
> +		if (css->nr_dying_descendants)
> +			seq_printf(seq, "nr_dying_%s %d\n",
> +				   cgroup_subsys[ssid]->name,
> +				   css->nr_dying_descendants);
> +	}

I think it'd be better to print the dying count unconditionally. It
makes the output more predictable for parsers, and also it's clearer
to users which data points are being tracked and reported.

With that, and TJ's "subsys" suggestion for the name, it looks good to
me. Thanks!

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
  2024-07-11 13:49 ` Johannes Weiner
@ 2024-07-11 14:05   ` Waiman Long
  2024-07-11 17:18     ` Tejun Heo
  0 siblings, 1 reply; 21+ messages in thread
From: Waiman Long @ 2024-07-11 14:05 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Zefan Li, Jonathan Corbet, cgroups, linux-doc,
	linux-kernel, Kamalesh Babulal, Roman Gushchin


On 7/11/24 09:49, Johannes Weiner wrote:
> On Wed, Jul 10, 2024 at 02:23:52PM -0400, Waiman Long wrote:
>> @@ -3669,12 +3669,34 @@ static int cgroup_events_show(struct seq_file *seq, void *v)
>>   static int cgroup_stat_show(struct seq_file *seq, void *v)
>>   {
>>   	struct cgroup *cgroup = seq_css(seq)->cgroup;
>> +	struct cgroup_subsys_state *css;
>> +	int ssid;
>>   
>> +	/* cgroup_mutex required for for_each_css() */
>> +	cgroup_lock();
>>   	seq_printf(seq, "nr_descendants %d\n",
>>   		   cgroup->nr_descendants);
>>   	seq_printf(seq, "nr_dying_descendants %d\n",
>>   		   cgroup->nr_dying_descendants);
>>   
>> +	/*
>> +	 * Show the number of live and dying csses associated with each of
>> +	 * non-inhibited cgroup subsystems bound to cgroup v2 if non-zero.
>> +	 */
>> +	for_each_css(css, ssid, cgroup) {
>> +		if ((BIT(ssid) & cgrp_dfl_inhibit_ss_mask) ||
>> +		    (cgroup_subsys[ssid]->root !=  &cgrp_dfl_root))
>> +			continue;
>> +
>> +		seq_printf(seq, "nr_%s %d\n", cgroup_subsys[ssid]->name,
>> +			   css->nr_descendants + 1);
>> +		/* Current css is online */
>> +		if (css->nr_dying_descendants)
>> +			seq_printf(seq, "nr_dying_%s %d\n",
>> +				   cgroup_subsys[ssid]->name,
>> +				   css->nr_dying_descendants);
>> +	}
> I think it'd be better to print the dying count unconditionally. It
> makes the output more predictable for parsers, and also it's clearer
> to users which data points are being tracked and reported.
>
> With that, and TJ's "subsys" suggestion for the name, it looks good to
> me. Thanks!

Given the fact that for_each_css() iteration is filtering out csses that 
are absent, the dying counts follow the same logic of skipping it if 
there is no dying css. That also makes it easier to identify cgroups 
with dying descendant csses as we don't need filter out entries with a 0 
dying count. It also makes the output less verbose and let user focus 
more on what are significant.

I do understand that it makes it inconsistent with the ways 
nr_descendants and nr_dying_descendants are being handled as entries 
with 0 count are also displayed. I can update the patch to display those 
entries with 0 dying subsys count if other people also think that is the 
better way forward.

Cheers,
Longman

>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
  2024-07-11 14:05   ` Waiman Long
@ 2024-07-11 17:18     ` Tejun Heo
  2024-07-11 17:39       ` Waiman Long
  0 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2024-07-11 17:18 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Zefan Li, Jonathan Corbet, cgroups, linux-doc,
	linux-kernel, Kamalesh Babulal, Roman Gushchin

Hello,

On Thu, Jul 11, 2024 at 10:05:22AM -0400, Waiman Long wrote:
> Given the fact that for_each_css() iteration is filtering out csses that are
> absent, the dying counts follow the same logic of skipping it if there is no
> dying css. That also makes it easier to identify cgroups with dying
> descendant csses as we don't need filter out entries with a 0 dying count.
> It also makes the output less verbose and let user focus more on what are
> significant.
> 
> I do understand that it makes it inconsistent with the ways nr_descendants
> and nr_dying_descendants are being handled as entries with 0 count are also
> displayed. I can update the patch to display those entries with 0 dying
> subsys count if other people also think that is the better way forward.

I think it'd be better to have all the keys. There are some dynamic keys in
stat files but those are mostly for things which can come and go (e.g. block
and misc devices), so yeah, I think it'd be better to show all the keys even
when they're zero.

Also, I personally would much prefer if the same prefixes are collected
together - ie. totals first and then dying. It's just a lot easier on the
eyes that way.

	nr_subsys_cpu
	nr_subsys_memory
	nr_subsys_io
	...
	nr_dying_subsys_cpu
	nr_dying_subsys_memory
	nr_dying_subsys_io
	...

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
  2024-07-11 17:18     ` Tejun Heo
@ 2024-07-11 17:39       ` Waiman Long
  2024-07-11 18:44         ` Tejun Heo
  0 siblings, 1 reply; 21+ messages in thread
From: Waiman Long @ 2024-07-11 17:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Zefan Li, Jonathan Corbet, cgroups, linux-doc,
	linux-kernel, Kamalesh Babulal, Roman Gushchin


On 7/11/24 13:18, Tejun Heo wrote:
> Hello,
>
> On Thu, Jul 11, 2024 at 10:05:22AM -0400, Waiman Long wrote:
>> Given the fact that for_each_css() iteration is filtering out csses that are
>> absent, the dying counts follow the same logic of skipping it if there is no
>> dying css. That also makes it easier to identify cgroups with dying
>> descendant csses as we don't need filter out entries with a 0 dying count.
>> It also makes the output less verbose and let user focus more on what are
>> significant.
>>
>> I do understand that it makes it inconsistent with the ways nr_descendants
>> and nr_dying_descendants are being handled as entries with 0 count are also
>> displayed. I can update the patch to display those entries with 0 dying
>> subsys count if other people also think that is the better way forward.
> I think it'd be better to have all the keys. There are some dynamic keys in
> stat files but those are mostly for things which can come and go (e.g. block
> and misc devices), so yeah, I think it'd be better to show all the keys even
> when they're zero.
Currently, I use the for_each_css() macro for iteration. If you mean 
displaying all the possible cgroup subsystems even if they are not 
enabled for the current cgroup, I will have to manually do the iteration.
>
> Also, I personally would much prefer if the same prefixes are collected
> together - ie. totals first and then dying. It's just a lot easier on the
> eyes that way.
>
> 	nr_subsys_cpu
> 	nr_subsys_memory
> 	nr_subsys_io
> 	...
> 	nr_dying_subsys_cpu
> 	nr_dying_subsys_memory
> 	nr_dying_subsys_io
> 	...

That is fine. I can group entries with the same prefix together.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
  2024-07-11 17:39       ` Waiman Long
@ 2024-07-11 18:44         ` Tejun Heo
  2024-07-11 18:51           ` Waiman Long
  0 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2024-07-11 18:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Zefan Li, Jonathan Corbet, cgroups, linux-doc,
	linux-kernel, Kamalesh Babulal, Roman Gushchin

Hello,

On Thu, Jul 11, 2024 at 01:39:38PM -0400, Waiman Long wrote:
> On 7/11/24 13:18, Tejun Heo wrote:
...
> Currently, I use the for_each_css() macro for iteration. If you mean
> displaying all the possible cgroup subsystems even if they are not enabled
> for the current cgroup, I will have to manually do the iteration.

Just wrapping it with for_each_subsys() should do, no? for_each_css() won't
iterate anything if css doesn't exist for the cgroup.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
  2024-07-11 18:44         ` Tejun Heo
@ 2024-07-11 18:51           ` Waiman Long
  2024-07-11 18:59             ` Tejun Heo
  0 siblings, 1 reply; 21+ messages in thread
From: Waiman Long @ 2024-07-11 18:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Zefan Li, Jonathan Corbet, cgroups, linux-doc,
	linux-kernel, Kamalesh Babulal, Roman Gushchin


On 7/11/24 14:44, Tejun Heo wrote:
> Hello,
>
> On Thu, Jul 11, 2024 at 01:39:38PM -0400, Waiman Long wrote:
>> On 7/11/24 13:18, Tejun Heo wrote:
> ...
>> Currently, I use the for_each_css() macro for iteration. If you mean
>> displaying all the possible cgroup subsystems even if they are not enabled
>> for the current cgroup, I will have to manually do the iteration.
> Just wrapping it with for_each_subsys() should do, no? for_each_css() won't
> iterate anything if css doesn't exist for the cgroup.

OK, I wasn't sure if you were asking to list all the possible cgroup v2 
cgroup subsystems even if they weren't enabled in the current cgroup. 
Apparently, that is the case. I prefer it that way too.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
  2024-07-11 18:51           ` Waiman Long
@ 2024-07-11 18:59             ` Tejun Heo
  2024-07-11 19:13               ` Waiman Long
  0 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2024-07-11 18:59 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Zefan Li, Jonathan Corbet, cgroups, linux-doc,
	linux-kernel, Kamalesh Babulal, Roman Gushchin

On Thu, Jul 11, 2024 at 02:51:38PM -0400, Waiman Long wrote:
> 
> On 7/11/24 14:44, Tejun Heo wrote:
> > Hello,
> > 
> > On Thu, Jul 11, 2024 at 01:39:38PM -0400, Waiman Long wrote:
> > > On 7/11/24 13:18, Tejun Heo wrote:
> > ...
> > > Currently, I use the for_each_css() macro for iteration. If you mean
> > > displaying all the possible cgroup subsystems even if they are not enabled
> > > for the current cgroup, I will have to manually do the iteration.
> > Just wrapping it with for_each_subsys() should do, no? for_each_css() won't
> > iterate anything if css doesn't exist for the cgroup.
> 
> OK, I wasn't sure if you were asking to list all the possible cgroup v2
> cgroup subsystems even if they weren't enabled in the current cgroup.
> Apparently, that is the case. I prefer it that way too.

Yeah, I think listing all is better. If the list corresponded directly to
cgroup.controllers, it may make sense to only show enabled ones but we can
have dying ones and implicitly enabled memory and so on, so I think it'd be
cleaner to just list them all.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
  2024-07-11 18:59             ` Tejun Heo
@ 2024-07-11 19:13               ` Waiman Long
  2024-07-11 19:21                 ` Tejun Heo
  2024-07-11 19:59                 ` Johannes Weiner
  0 siblings, 2 replies; 21+ messages in thread
From: Waiman Long @ 2024-07-11 19:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Zefan Li, Jonathan Corbet, cgroups, linux-doc,
	linux-kernel, Kamalesh Babulal, Roman Gushchin


On 7/11/24 14:59, Tejun Heo wrote:
> On Thu, Jul 11, 2024 at 02:51:38PM -0400, Waiman Long wrote:
>> On 7/11/24 14:44, Tejun Heo wrote:
>>> Hello,
>>>
>>> On Thu, Jul 11, 2024 at 01:39:38PM -0400, Waiman Long wrote:
>>>> On 7/11/24 13:18, Tejun Heo wrote:
>>> ...
>>>> Currently, I use the for_each_css() macro for iteration. If you mean
>>>> displaying all the possible cgroup subsystems even if they are not enabled
>>>> for the current cgroup, I will have to manually do the iteration.
>>> Just wrapping it with for_each_subsys() should do, no? for_each_css() won't
>>> iterate anything if css doesn't exist for the cgroup.
>> OK, I wasn't sure if you were asking to list all the possible cgroup v2
>> cgroup subsystems even if they weren't enabled in the current cgroup.
>> Apparently, that is the case. I prefer it that way too.
> Yeah, I think listing all is better. If the list corresponded directly to
> cgroup.controllers, it may make sense to only show enabled ones but we can
> have dying ones and implicitly enabled memory and so on, so I think it'd be
> cleaner to just list them all.

That will means cgroup subsystems that are seldomly used like rdma, misc 
or even hugetlb will always be shown in all the cgroup.stat output. I 
actually prefer just showing those that are enabled. As for dying memory 
cgroups, they will only be shown in its online ancestors. We currently 
don't know how many level down are each of the dying ones.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
  2024-07-11 19:13               ` Waiman Long
@ 2024-07-11 19:21                 ` Tejun Heo
  2024-07-11 19:29                   ` Waiman Long
  2024-07-11 19:59                 ` Johannes Weiner
  1 sibling, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2024-07-11 19:21 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Zefan Li, Jonathan Corbet, cgroups, linux-doc,
	linux-kernel, Kamalesh Babulal, Roman Gushchin

Hello,

On Thu, Jul 11, 2024 at 03:13:12PM -0400, Waiman Long wrote:
> That will means cgroup subsystems that are seldomly used like rdma, misc or
> even hugetlb will always be shown in all the cgroup.stat output. I actually

Hmm... yeah, that does increase the amount of output, but I don't know. The
trade-off is between consistency and brevity and I think I'd go for
consistency here.

> prefer just showing those that are enabled. As for dying memory cgroups,
> they will only be shown in its online ancestors. We currently don't know how

So, one peculiarity with memory is that when you enable io, it gets
implicitly enabled together and we likely wanna show that.

> many level down are each of the dying ones.

I didn't understand the last sentence.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
  2024-07-11 19:21                 ` Tejun Heo
@ 2024-07-11 19:29                   ` Waiman Long
  0 siblings, 0 replies; 21+ messages in thread
From: Waiman Long @ 2024-07-11 19:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Zefan Li, Jonathan Corbet, cgroups, linux-doc,
	linux-kernel, Kamalesh Babulal, Roman Gushchin


On 7/11/24 15:21, Tejun Heo wrote:
> Hello,
>
> On Thu, Jul 11, 2024 at 03:13:12PM -0400, Waiman Long wrote:
>> That will means cgroup subsystems that are seldomly used like rdma, misc or
>> even hugetlb will always be shown in all the cgroup.stat output. I actually
> Hmm... yeah, that does increase the amount of output, but I don't know. The
> trade-off is between consistency and brevity and I think I'd go for
> consistency here.
>
>> prefer just showing those that are enabled. As for dying memory cgroups,
>> they will only be shown in its online ancestors. We currently don't know how
> So, one peculiarity with memory is that when you enable io, it gets
> implicitly enabled together and we likely wanna show that.

If memory is implicitly enabled, it is treated as enabled as the 
corresponding cgroup->subsys[] entry should be set. I currently don't 
filter out those in the cgrp_dfl_implicit_ss_mask. So perf_event, which 
is implicitly enabled, is shown in all the cgroup.stat.

If you still want to display all even if some of them are not enabled, I 
can certainly do that.

Cheers,
Longman




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
  2024-07-11 19:13               ` Waiman Long
  2024-07-11 19:21                 ` Tejun Heo
@ 2024-07-11 19:59                 ` Johannes Weiner
  2024-07-11 21:00                   ` Waiman Long
  1 sibling, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2024-07-11 19:59 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Jonathan Corbet, cgroups, linux-doc,
	linux-kernel, Kamalesh Babulal, Roman Gushchin

On Thu, Jul 11, 2024 at 03:13:12PM -0400, Waiman Long wrote:
> 
> On 7/11/24 14:59, Tejun Heo wrote:
> > On Thu, Jul 11, 2024 at 02:51:38PM -0400, Waiman Long wrote:
> >> On 7/11/24 14:44, Tejun Heo wrote:
> >>> Hello,
> >>>
> >>> On Thu, Jul 11, 2024 at 01:39:38PM -0400, Waiman Long wrote:
> >>>> On 7/11/24 13:18, Tejun Heo wrote:
> >>> ...
> >>>> Currently, I use the for_each_css() macro for iteration. If you mean
> >>>> displaying all the possible cgroup subsystems even if they are not enabled
> >>>> for the current cgroup, I will have to manually do the iteration.
> >>> Just wrapping it with for_each_subsys() should do, no? for_each_css() won't
> >>> iterate anything if css doesn't exist for the cgroup.
> >> OK, I wasn't sure if you were asking to list all the possible cgroup v2
> >> cgroup subsystems even if they weren't enabled in the current cgroup.
> >> Apparently, that is the case. I prefer it that way too.
> > Yeah, I think listing all is better. If the list corresponded directly to
> > cgroup.controllers, it may make sense to only show enabled ones but we can
> > have dying ones and implicitly enabled memory and so on, so I think it'd be
> > cleaner to just list them all.
> 
> That will means cgroup subsystems that are seldomly used like rdma, misc 
> or even hugetlb will always be shown in all the cgroup.stat output. I 
> actually prefer just showing those that are enabled. As for dying memory 
> cgroups, they will only be shown in its online ancestors. We currently 
> don't know how many level down are each of the dying ones.

It seems odd to me to not show dead ones after a cgroup has disabled
the controller again. They still consume memory, after all, and so
continue to be property of that cgroup afterwards.

Instead of doing for_each_css(), would it make more sense to have

	struct cgroup {
		...
		int nr_dying_subsys[CGROUP_SUBSYS_COUNT];
		...
	}

and just always print them all, regardless of what is, or was,
enabled?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
  2024-07-11 19:59                 ` Johannes Weiner
@ 2024-07-11 21:00                   ` Waiman Long
  2024-07-11 21:57                     ` Waiman Long
  2024-07-12 16:29                     ` Johannes Weiner
  0 siblings, 2 replies; 21+ messages in thread
From: Waiman Long @ 2024-07-11 21:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Zefan Li, Jonathan Corbet, cgroups, linux-doc,
	linux-kernel, Kamalesh Babulal, Roman Gushchin

On 7/11/24 15:59, Johannes Weiner wrote:
> On Thu, Jul 11, 2024 at 03:13:12PM -0400, Waiman Long wrote:
>> On 7/11/24 14:59, Tejun Heo wrote:
>>> On Thu, Jul 11, 2024 at 02:51:38PM -0400, Waiman Long wrote:
>>>> On 7/11/24 14:44, Tejun Heo wrote:
>>>>> Hello,
>>>>>
>>>>> On Thu, Jul 11, 2024 at 01:39:38PM -0400, Waiman Long wrote:
>>>>>> On 7/11/24 13:18, Tejun Heo wrote:
>>>>> ...
>>>>>> Currently, I use the for_each_css() macro for iteration. If you mean
>>>>>> displaying all the possible cgroup subsystems even if they are not enabled
>>>>>> for the current cgroup, I will have to manually do the iteration.
>>>>> Just wrapping it with for_each_subsys() should do, no? for_each_css() won't
>>>>> iterate anything if css doesn't exist for the cgroup.
>>>> OK, I wasn't sure if you were asking to list all the possible cgroup v2
>>>> cgroup subsystems even if they weren't enabled in the current cgroup.
>>>> Apparently, that is the case. I prefer it that way too.
>>> Yeah, I think listing all is better. If the list corresponded directly to
>>> cgroup.controllers, it may make sense to only show enabled ones but we can
>>> have dying ones and implicitly enabled memory and so on, so I think it'd be
>>> cleaner to just list them all.
>> That will means cgroup subsystems that are seldomly used like rdma, misc
>> or even hugetlb will always be shown in all the cgroup.stat output. I
>> actually prefer just showing those that are enabled. As for dying memory
>> cgroups, they will only be shown in its online ancestors. We currently
>> don't know how many level down are each of the dying ones.
> It seems odd to me to not show dead ones after a cgroup has disabled
> the controller again. They still consume memory, after all, and so
> continue to be property of that cgroup afterwards.
>
> Instead of doing for_each_css(), would it make more sense to have
>
> 	struct cgroup {
> 		...
> 		int nr_dying_subsys[CGROUP_SUBSYS_COUNT];

What exactly does new this array for? Is this for copying out 
css->nr_dying_descendants before disabling a controller? The number may 
be out of date when it is used. I would think we should store the actual 
css and clearing it again once the css is ready to be freed.

Anyway, I would suggest doing it as a separate add-on patch if we decide 
to do it instead of adding it to the current patch.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
  2024-07-11 21:00                   ` Waiman Long
@ 2024-07-11 21:57                     ` Waiman Long
  2024-07-12 16:29                     ` Johannes Weiner
  1 sibling, 0 replies; 21+ messages in thread
From: Waiman Long @ 2024-07-11 21:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Zefan Li, Jonathan Corbet, cgroups, linux-doc,
	linux-kernel, Kamalesh Babulal, Roman Gushchin

On 7/11/24 17:00, Waiman Long wrote:
> On 7/11/24 15:59, Johannes Weiner wrote:
>> On Thu, Jul 11, 2024 at 03:13:12PM -0400, Waiman Long wrote:
>>> On 7/11/24 14:59, Tejun Heo wrote:
>>>> On Thu, Jul 11, 2024 at 02:51:38PM -0400, Waiman Long wrote:
>>>>> On 7/11/24 14:44, Tejun Heo wrote:
>>>>>> Hello,
>>>>>>
>>>>>> On Thu, Jul 11, 2024 at 01:39:38PM -0400, Waiman Long wrote:
>>>>>>> On 7/11/24 13:18, Tejun Heo wrote:
>>>>>> ...
>>>>>>> Currently, I use the for_each_css() macro for iteration. If you 
>>>>>>> mean
>>>>>>> displaying all the possible cgroup subsystems even if they are 
>>>>>>> not enabled
>>>>>>> for the current cgroup, I will have to manually do the iteration.
>>>>>> Just wrapping it with for_each_subsys() should do, no? 
>>>>>> for_each_css() won't
>>>>>> iterate anything if css doesn't exist for the cgroup.
>>>>> OK, I wasn't sure if you were asking to list all the possible 
>>>>> cgroup v2
>>>>> cgroup subsystems even if they weren't enabled in the current cgroup.
>>>>> Apparently, that is the case. I prefer it that way too.
>>>> Yeah, I think listing all is better. If the list corresponded 
>>>> directly to
>>>> cgroup.controllers, it may make sense to only show enabled ones but 
>>>> we can
>>>> have dying ones and implicitly enabled memory and so on, so I think 
>>>> it'd be
>>>> cleaner to just list them all.
>>> That will means cgroup subsystems that are seldomly used like rdma, 
>>> misc
>>> or even hugetlb will always be shown in all the cgroup.stat output. I
>>> actually prefer just showing those that are enabled. As for dying 
>>> memory
>>> cgroups, they will only be shown in its online ancestors. We currently
>>> don't know how many level down are each of the dying ones.
>> It seems odd to me to not show dead ones after a cgroup has disabled
>> the controller again. They still consume memory, after all, and so
>> continue to be property of that cgroup afterwards.
>>
>> Instead of doing for_each_css(), would it make more sense to have
>>
>>     struct cgroup {
>>         ...
>>         int nr_dying_subsys[CGROUP_SUBSYS_COUNT];
>
> What exactly does new this array for? Is this for copying out 
> css->nr_dying_descendants before disabling a controller? The number 
> may be out of date when it is used. I would think we should store the 
> actual css and clearing it again once the css is ready to be freed.
>
> Anyway, I would suggest doing it as a separate add-on patch if we 
> decide to do it instead of adding it to the current patch.

Alternatively, we could delay the clearing of cgroup->subsys[] entry 
from offline time to until the css is ready to be freed. We do need to 
add check about the CSS_ONLINE flag when we only want to deal with 
online csses.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
  2024-07-11 21:00                   ` Waiman Long
  2024-07-11 21:57                     ` Waiman Long
@ 2024-07-12 16:29                     ` Johannes Weiner
  2024-07-12 17:10                       ` Waiman Long
  1 sibling, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2024-07-12 16:29 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Zefan Li, Jonathan Corbet, cgroups, linux-doc,
	linux-kernel, Kamalesh Babulal, Roman Gushchin

On Thu, Jul 11, 2024 at 05:00:41PM -0400, Waiman Long wrote:
> On 7/11/24 15:59, Johannes Weiner wrote:
> > On Thu, Jul 11, 2024 at 03:13:12PM -0400, Waiman Long wrote:
> >> On 7/11/24 14:59, Tejun Heo wrote:
> >>> On Thu, Jul 11, 2024 at 02:51:38PM -0400, Waiman Long wrote:
> >>>> On 7/11/24 14:44, Tejun Heo wrote:
> >>>>> Hello,
> >>>>>
> >>>>> On Thu, Jul 11, 2024 at 01:39:38PM -0400, Waiman Long wrote:
> >>>>>> On 7/11/24 13:18, Tejun Heo wrote:
> >>>>> ...
> >>>>>> Currently, I use the for_each_css() macro for iteration. If you mean
> >>>>>> displaying all the possible cgroup subsystems even if they are not enabled
> >>>>>> for the current cgroup, I will have to manually do the iteration.
> >>>>> Just wrapping it with for_each_subsys() should do, no? for_each_css() won't
> >>>>> iterate anything if css doesn't exist for the cgroup.
> >>>> OK, I wasn't sure if you were asking to list all the possible cgroup v2
> >>>> cgroup subsystems even if they weren't enabled in the current cgroup.
> >>>> Apparently, that is the case. I prefer it that way too.
> >>> Yeah, I think listing all is better. If the list corresponded directly to
> >>> cgroup.controllers, it may make sense to only show enabled ones but we can
> >>> have dying ones and implicitly enabled memory and so on, so I think it'd be
> >>> cleaner to just list them all.
> >> That will means cgroup subsystems that are seldomly used like rdma, misc
> >> or even hugetlb will always be shown in all the cgroup.stat output. I
> >> actually prefer just showing those that are enabled. As for dying memory
> >> cgroups, they will only be shown in its online ancestors. We currently
> >> don't know how many level down are each of the dying ones.
> > It seems odd to me to not show dead ones after a cgroup has disabled
> > the controller again. They still consume memory, after all, and so
> > continue to be property of that cgroup afterwards.
> >
> > Instead of doing for_each_css(), would it make more sense to have
> >
> > 	struct cgroup {
> > 		...
> > 		int nr_dying_subsys[CGROUP_SUBSYS_COUNT];
> 
> What exactly does new this array for?

For keeping the counts. Instead of inside the css.

AFAICS, with your current patch, if somebody were to disable the
controller in cgroup.subtree_control, it would offline the css at that
level, become unreachable from cgroup->subsys[], and you'd lose
remaining counts of dead css that are still associated with that
cgroup. Re-enabling the controller would create a new css with new
descendant counts, and now the reported numbers are actively misleading.

That seems undesirable.

If you track the counts in the cgroup itself, then cgroup.stat would
reliably show the total counts of dead controllers that are associated
with the subtree, even after disabling or toggling controllers.

The hooks in online, offline, release should be the same, just update
css->cgroup->nr_dying_subsys[id] instead of css->nr_dying_descendants.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat
  2024-07-12 16:29                     ` Johannes Weiner
@ 2024-07-12 17:10                       ` Waiman Long
  0 siblings, 0 replies; 21+ messages in thread
From: Waiman Long @ 2024-07-12 17:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Zefan Li, Jonathan Corbet, cgroups, linux-doc,
	linux-kernel, Kamalesh Babulal, Roman Gushchin

On 7/12/24 12:29, Johannes Weiner wrote:
> On Thu, Jul 11, 2024 at 05:00:41PM -0400, Waiman Long wrote:
>> On 7/11/24 15:59, Johannes Weiner wrote:
>>> On Thu, Jul 11, 2024 at 03:13:12PM -0400, Waiman Long wrote:
>>>> On 7/11/24 14:59, Tejun Heo wrote:
>>>>> On Thu, Jul 11, 2024 at 02:51:38PM -0400, Waiman Long wrote:
>>>>>> On 7/11/24 14:44, Tejun Heo wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> On Thu, Jul 11, 2024 at 01:39:38PM -0400, Waiman Long wrote:
>>>>>>>> On 7/11/24 13:18, Tejun Heo wrote:
>>>>>>> ...
>>>>>>>> Currently, I use the for_each_css() macro for iteration. If you mean
>>>>>>>> displaying all the possible cgroup subsystems even if they are not enabled
>>>>>>>> for the current cgroup, I will have to manually do the iteration.
>>>>>>> Just wrapping it with for_each_subsys() should do, no? for_each_css() won't
>>>>>>> iterate anything if css doesn't exist for the cgroup.
>>>>>> OK, I wasn't sure if you were asking to list all the possible cgroup v2
>>>>>> cgroup subsystems even if they weren't enabled in the current cgroup.
>>>>>> Apparently, that is the case. I prefer it that way too.
>>>>> Yeah, I think listing all is better. If the list corresponded directly to
>>>>> cgroup.controllers, it may make sense to only show enabled ones but we can
>>>>> have dying ones and implicitly enabled memory and so on, so I think it'd be
>>>>> cleaner to just list them all.
>>>> That will means cgroup subsystems that are seldomly used like rdma, misc
>>>> or even hugetlb will always be shown in all the cgroup.stat output. I
>>>> actually prefer just showing those that are enabled. As for dying memory
>>>> cgroups, they will only be shown in its online ancestors. We currently
>>>> don't know how many level down are each of the dying ones.
>>> It seems odd to me to not show dead ones after a cgroup has disabled
>>> the controller again. They still consume memory, after all, and so
>>> continue to be property of that cgroup afterwards.
>>>
>>> Instead of doing for_each_css(), would it make more sense to have
>>>
>>> 	struct cgroup {
>>> 		...
>>> 		int nr_dying_subsys[CGROUP_SUBSYS_COUNT];
>> What exactly does new this array for?
> For keeping the counts. Instead of inside the css.
>
> AFAICS, with your current patch, if somebody were to disable the
> controller in cgroup.subtree_control, it would offline the css at that
> level, become unreachable from cgroup->subsys[], and you'd lose
> remaining counts of dead css that are still associated with that
> cgroup. Re-enabling the controller would create a new css with new
> descendant counts, and now the reported numbers are actively misleading.
>
> That seems undesirable.
>
> If you track the counts in the cgroup itself, then cgroup.stat would
> reliably show the total counts of dead controllers that are associated
> with the subtree, even after disabling or toggling controllers.
>
> The hooks in online, offline, release should be the same, just update
> css->cgroup->nr_dying_subsys[id] instead of css->nr_dying_descendants.

That does make sense. Thank for the suggestion. I will update the patch 
accordingly.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2024-07-12 17:10 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-10 18:23 [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat Waiman Long
2024-07-10 18:23 ` [PATCH v3 2/2] cgroup: Limit frequency of reading cgroup.stat for unprivileged users Waiman Long
2024-07-10 21:43 ` [PATCH v3 1/2] cgroup: Show # of subsystem CSSes in cgroup.stat Roman Gushchin
2024-07-10 23:49   ` Waiman Long
2024-07-10 21:59 ` Tejun Heo
2024-07-10 23:51   ` Waiman Long
2024-07-11 13:49 ` Johannes Weiner
2024-07-11 14:05   ` Waiman Long
2024-07-11 17:18     ` Tejun Heo
2024-07-11 17:39       ` Waiman Long
2024-07-11 18:44         ` Tejun Heo
2024-07-11 18:51           ` Waiman Long
2024-07-11 18:59             ` Tejun Heo
2024-07-11 19:13               ` Waiman Long
2024-07-11 19:21                 ` Tejun Heo
2024-07-11 19:29                   ` Waiman Long
2024-07-11 19:59                 ` Johannes Weiner
2024-07-11 21:00                   ` Waiman Long
2024-07-11 21:57                     ` Waiman Long
2024-07-12 16:29                     ` Johannes Weiner
2024-07-12 17:10                       ` Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox