Re: [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Yosry Ahmed <yosry.ahmed@linux.dev>
To: inwardvessel <inwardvessel@gmail.com>
Cc: tj@kernel.org, shakeel.butt@linux.dev, mhocko@kernel.org,
	hannes@cmpxchg.org, akpm@linux-foundation.org,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	kernel-team@meta.com
Subject: Re: [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
Date: Fri, 28 Feb 2025 19:04:37 +0000	[thread overview]
Message-ID: <Z8IIxUdRpqxZyIHO@google.com> (raw)
In-Reply-To: <20250227215543.49928-2-inwardvessel@gmail.com>

On Thu, Feb 27, 2025 at 01:55:40PM -0800, inwardvessel wrote:
> From: JP Kobryn <inwardvessel@gmail.com>
> 
> Each cgroup owns rstat pointers. This means that a tree of pending rstat
> updates can contain changes from different subsystems. Because of this
> arrangement, when one subsystem is flushed via the public api
> cgroup_rstat_flushed(), all other subsystems with pending updates will
> also be flushed. Remove the rstat pointers from the cgroup and instead
> give them to each cgroup_subsys_state. Separate rstat trees will now
> exist for each unique subsystem. This separation allows for subsystems
> to make updates and flushes without the side effects of other
> subsystems. i.e. flushing the cpu stats does not cause the memory stats
> to be flushed and vice versa. The change in pointer ownership from
> cgroup to cgroup_subsys_state allows for direct flushing of the css, so
> the rcu list management entities and operations previously tied to the
> cgroup which were used for managing a list of subsystem states with
> pending flushes are removed. In terms of client code, public api calls
> were changed to now accept a reference to the cgroup_subsys_state so
> that when flushing or updating, a specific subsystem is associated with
> the call.

I think the subject is misleading. It makes it seem like this is a
refactoring patch that is only moving a member from one struct to
another, but this is actually the core of the series.

Maybe something lik "cgroup: use separate rstat trees for diffrent
subsystems"?

Also, breaking down the commit message into paragraphs helps with
readability.

[..]
> @@ -386,8 +394,8 @@ struct cgroup_rstat_cpu {
>  	 *
>  	 * Protected by per-cpu cgroup_rstat_cpu_lock.
>  	 */
> -	struct cgroup *updated_children;	/* terminated by self cgroup */
> -	struct cgroup *updated_next;		/* NULL iff not on the list */
> +	struct cgroup_subsys_state *updated_children;	/* terminated by self */
> +	struct cgroup_subsys_state *updated_next;		/* NULL if not on list */

nit: comment indentation needs fixing here

>  };
>  
>  struct cgroup_freezer_state {
[..]  
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index afc665b7b1fe..31b3bfebf7ba 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -164,7 +164,9 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
>  static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu);

Should we rename cgrp_dfl_root_rstat_cpu to indicate that it's specific
to self css?

>  
>  /* the default hierarchy */
> -struct cgroup_root cgrp_dfl_root = { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
> +struct cgroup_root cgrp_dfl_root = {
> +	.cgrp.self.rstat_cpu = &cgrp_dfl_root_rstat_cpu
> +};
>  EXPORT_SYMBOL_GPL(cgrp_dfl_root);
>  
>  /*
[..]
> @@ -5407,7 +5401,11 @@ static void css_free_rwork_fn(struct work_struct *work)
>  		struct cgroup_subsys_state *parent = css->parent;
>  		int id = css->id;
>  
> +		if (css->ss->css_rstat_flush)
> +			cgroup_rstat_exit(css);
> +
>  		ss->css_free(css);
> +

nit: extra blank line here

>  		cgroup_idr_remove(&ss->css_idr, id);
>  		cgroup_put(cgrp);
>  
> @@ -5431,7 +5429,7 @@ static void css_free_rwork_fn(struct work_struct *work)
>  			cgroup_put(cgroup_parent(cgrp));
>  			kernfs_put(cgrp->kn);
>  			psi_cgroup_free(cgrp);
> -			cgroup_rstat_exit(cgrp);
> +			cgroup_rstat_exit(&cgrp->self);
>  			kfree(cgrp);
>  		} else {
>  			/*
> @@ -5459,11 +5457,7 @@ static void css_release_work_fn(struct work_struct *work)
>  	if (ss) {
>  		struct cgroup *parent_cgrp;
>  
> -		/* css release path */
> -		if (!list_empty(&css->rstat_css_node)) {
> -			cgroup_rstat_flush(cgrp);
> -			list_del_rcu(&css->rstat_css_node);
> -		}
> +		cgroup_rstat_flush(css);

Here we used to call cgroup_rstat_flush() only if there was a
css_rstat_flush() callback registered, now we call it unconditionally.

Could this cause a NULL dereference when we try to call
css->ss->css_rstat_flush() for controllers that did not register a
callback?

>  
>  		cgroup_idr_replace(&ss->css_idr, NULL, css->id);
>  		if (ss->css_released)
[..]
> @@ -6188,6 +6186,9 @@ int __init cgroup_init(void)
>  			css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
>  						   GFP_KERNEL);
>  			BUG_ON(css->id < 0);
> +
> +			if (css->ss && css->ss->css_rstat_flush)
> +				BUG_ON(cgroup_rstat_init(css));

Why do we need this call here? We already call cgroup_rstat_init() in
cgroup_init_subsys(). IIUC for subsystems with ss->early_init, we will
have already called cgroup_init_subsys() in cgroup_init_early().

Did I miss something?

>  		} else {
>  			cgroup_init_subsys(ss, false);
>  		}
[..]
> @@ -300,27 +306,25 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
>  }
>  
>  /* see cgroup_rstat_flush() */
> -static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
> +static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
>  	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
>  {
> +	struct cgroup *cgrp = css->cgroup;
>  	int cpu;
>  
>  	lockdep_assert_held(&cgroup_rstat_lock);
>  
>  	for_each_possible_cpu(cpu) {
> -		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
> +		struct cgroup_subsys_state *pos;
>  
> +		pos = cgroup_rstat_updated_list(css, cpu);
>  		for (; pos; pos = pos->rstat_flush_next) {
> -			struct cgroup_subsys_state *css;
> +			if (!pos->ss)
> +				cgroup_base_stat_flush(pos->cgroup, cpu);
> +			else
> +				pos->ss->css_rstat_flush(pos, cpu);
>  
> -			cgroup_base_stat_flush(pos, cpu);
> -			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
> -
> -			rcu_read_lock();
> -			list_for_each_entry_rcu(css, &pos->rstat_css_list,
> -						rstat_css_node)
> -				css->ss->css_rstat_flush(css, cpu);
> -			rcu_read_unlock();
> +			bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu);

We should call bpf_rstat_flush() only if (!pos->ss) as well, right?
Otherwise we will call BPF rstat flush whenever any subsystem is
flushed.

I guess it's because BPF can now pass any subsystem to
cgroup_rstat_flush(), and we don't keep track. I think it would be
better if we do not allow BPF programs to select a css and always make
them flush the self css.

We can perhaps introduce a bpf_cgroup_rstat_flush() wrapper that takes
in a cgroup and passes cgroup->self internally to cgroup_rstat_flush().

But if the plan is to remove the bpf_rstat_flush() call here soon then
it's probably not worth the hassle.

Shakeel (and others), WDYT?

next prev parent reply	other threads:[~2025-02-28 19:04 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-27 21:55 [PATCH 0/4 v2] cgroup: separate rstat trees inwardvessel
2025-02-27 21:55 ` [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state inwardvessel
2025-02-27 22:43   ` Shakeel Butt
2025-02-28 19:04   ` Yosry Ahmed [this message]
2025-03-01  1:06     ` JP Kobryn
2025-03-01  1:25       ` Yosry Ahmed
2025-03-01  1:30         ` JP Kobryn
2025-03-03 18:18         ` Shakeel Butt
2025-03-03 18:21           ` Yosry Ahmed
2025-03-03 15:20   ` Michal Koutný
2025-03-03 19:31     ` JP Kobryn
2025-02-27 21:55 ` [PATCH 2/4 v2] cgroup: rstat lock indirection inwardvessel
2025-03-03 15:21   ` Michal Koutný
2025-02-27 21:55 ` [PATCH 3/4 v2] cgroup: separate rstat locks for subsystems inwardvessel
2025-02-27 22:52   ` Shakeel Butt
2025-02-28 16:07     ` JP Kobryn
2025-02-28 17:37   ` JP Kobryn
2025-02-28 19:20   ` Yosry Ahmed
2025-03-06 21:47     ` JP Kobryn
2025-03-01 23:00   ` kernel test robot
2025-03-03 15:22   ` Michal Koutný
2025-03-03 18:29     ` Yosry Ahmed
2025-03-03 18:40       ` Shakeel Butt
2025-03-03 19:23         ` JP Kobryn
2025-03-03 19:39           ` Shakeel Butt
2025-03-03 19:50         ` Yosry Ahmed
2025-03-03 20:09           ` Shakeel Butt
2025-03-03 18:49       ` Michal Koutný
2025-03-10 17:59         ` JP Kobryn
2025-03-11 13:49           ` Michal Koutný
2025-03-06 21:36       ` JP Kobryn
2025-03-03 23:49   ` kernel test robot
2025-02-27 21:55 ` [PATCH 4/4 v2] cgroup: separate rstat list pointers from base stats inwardvessel
2025-02-27 23:01   ` Shakeel Butt
2025-02-28 20:33   ` Yosry Ahmed
2025-02-28 18:22 ` [PATCH 0/4 v2] cgroup: separate rstat trees Yosry Ahmed
2025-03-03 15:19 ` Michal Koutný
2025-03-06  1:07   ` JP Kobryn
2025-03-11 13:49     ` Michal Koutný

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z8IIxUdRpqxZyIHO@google.com \
    --to=yosry.ahmed@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=inwardvessel@gmail.com \
    --cc=kernel-team@meta.com \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=shakeel.butt@linux.dev \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.