From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-180.mta0.migadu.com (out-180.mta0.migadu.com [91.218.175.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B61511F8743 for ; Fri, 28 Feb 2025 19:04:44 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.180 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1740769486; cv=none; b=ngwgefGU80KvsC99XwRR8++FN7e6AGU74CRWXcenq0QJ73qbZi4w22dbPsX0OFjhZRIuJU6CYMnp2zjMxHZ43UMd4ck/6VwT/HmoA+7qu3Ba10Q8qUGjX1V7kIO2ZWfjyLwUw3KkT6S88Yhc83+6JXwHb+DFJVPCaYHA0lrdD1o= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1740769486; c=relaxed/simple; bh=rF65UpvkFrTK18kYQr44FCKkPQK75A02IX2G1n2HixI=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=ZMW3XjQi60Bz3dLF/V9pwzki8xR7fr4tcNuC/T05NEu988pU7GrQ43mGO1aRVKjvq3ZOg1EmC3aDaHf1GlRTKgRWarwUgvdJkwXuPIQoowt/ol2Dwiv6v8Iz0ukqALvO7QGTSi7cMAZNyanZL4qWmevn377VptS1od88YM3DBA8= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=UMLnaGOu; arc=none smtp.client-ip=91.218.175.180 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="UMLnaGOu" Date: Fri, 28 Feb 2025 19:04:37 +0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1740769482; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=EUoDXTNpMhUllcwII7ASYWMxp1W5rvEw1o/euNXhFvk=; b=UMLnaGOufHW2t9TCLQ7Uujtem/w0EipDTVwUlF6w8zUz34cLvZLa3FyJ7oQf21Fy75YZEd gBEOUDBqj8votVKeejBpflSyHbUF8fw/sNQcDFsFb7xQ+FLe1Dr8uweLmfTVLxiEQN+/g1 BiP+swD13BBQEaenUpa8u5wStxxP2HU= X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Yosry Ahmed To: inwardvessel Cc: tj@kernel.org, shakeel.butt@linux.dev, mhocko@kernel.org, hannes@cmpxchg.org, akpm@linux-foundation.org, linux-mm@kvack.org, cgroups@vger.kernel.org, kernel-team@meta.com Subject: Re: [PATCH 1/4 v2] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state Message-ID: References: <20250227215543.49928-1-inwardvessel@gmail.com> <20250227215543.49928-2-inwardvessel@gmail.com> Precedence: bulk X-Mailing-List: cgroups@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20250227215543.49928-2-inwardvessel@gmail.com> X-Migadu-Flow: FLOW_OUT On Thu, Feb 27, 2025 at 01:55:40PM -0800, inwardvessel wrote: > From: JP Kobryn > > Each cgroup owns rstat pointers. This means that a tree of pending rstat > updates can contain changes from different subsystems. Because of this > arrangement, when one subsystem is flushed via the public api > cgroup_rstat_flushed(), all other subsystems with pending updates will > also be flushed. Remove the rstat pointers from the cgroup and instead > give them to each cgroup_subsys_state. Separate rstat trees will now > exist for each unique subsystem. This separation allows for subsystems > to make updates and flushes without the side effects of other > subsystems. i.e. flushing the cpu stats does not cause the memory stats > to be flushed and vice versa. The change in pointer ownership from > cgroup to cgroup_subsys_state allows for direct flushing of the css, so > the rcu list management entities and operations previously tied to the > cgroup which were used for managing a list of subsystem states with > pending flushes are removed. In terms of client code, public api calls > were changed to now accept a reference to the cgroup_subsys_state so > that when flushing or updating, a specific subsystem is associated with > the call. I think the subject is misleading. It makes it seem like this is a refactoring patch that is only moving a member from one struct to another, but this is actually the core of the series. Maybe something lik "cgroup: use separate rstat trees for diffrent subsystems"? Also, breaking down the commit message into paragraphs helps with readability. [..] > @@ -386,8 +394,8 @@ struct cgroup_rstat_cpu { > * > * Protected by per-cpu cgroup_rstat_cpu_lock. > */ > - struct cgroup *updated_children; /* terminated by self cgroup */ > - struct cgroup *updated_next; /* NULL iff not on the list */ > + struct cgroup_subsys_state *updated_children; /* terminated by self */ > + struct cgroup_subsys_state *updated_next; /* NULL if not on list */ nit: comment indentation needs fixing here > }; > > struct cgroup_freezer_state { [..] > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index afc665b7b1fe..31b3bfebf7ba 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -164,7 +164,9 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = { > static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu); Should we rename cgrp_dfl_root_rstat_cpu to indicate that it's specific to self css? > > /* the default hierarchy */ > -struct cgroup_root cgrp_dfl_root = { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu }; > +struct cgroup_root cgrp_dfl_root = { > + .cgrp.self.rstat_cpu = &cgrp_dfl_root_rstat_cpu > +}; > EXPORT_SYMBOL_GPL(cgrp_dfl_root); > > /* [..] > @@ -5407,7 +5401,11 @@ static void css_free_rwork_fn(struct work_struct *work) > struct cgroup_subsys_state *parent = css->parent; > int id = css->id; > > + if (css->ss->css_rstat_flush) > + cgroup_rstat_exit(css); > + > ss->css_free(css); > + nit: extra blank line here > cgroup_idr_remove(&ss->css_idr, id); > cgroup_put(cgrp); > > @@ -5431,7 +5429,7 @@ static void css_free_rwork_fn(struct work_struct *work) > cgroup_put(cgroup_parent(cgrp)); > kernfs_put(cgrp->kn); > psi_cgroup_free(cgrp); > - cgroup_rstat_exit(cgrp); > + cgroup_rstat_exit(&cgrp->self); > kfree(cgrp); > } else { > /* > @@ -5459,11 +5457,7 @@ static void css_release_work_fn(struct work_struct *work) > if (ss) { > struct cgroup *parent_cgrp; > > - /* css release path */ > - if (!list_empty(&css->rstat_css_node)) { > - cgroup_rstat_flush(cgrp); > - list_del_rcu(&css->rstat_css_node); > - } > + cgroup_rstat_flush(css); Here we used to call cgroup_rstat_flush() only if there was a css_rstat_flush() callback registered, now we call it unconditionally. Could this cause a NULL dereference when we try to call css->ss->css_rstat_flush() for controllers that did not register a callback? > > cgroup_idr_replace(&ss->css_idr, NULL, css->id); > if (ss->css_released) [..] > @@ -6188,6 +6186,9 @@ int __init cgroup_init(void) > css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2, > GFP_KERNEL); > BUG_ON(css->id < 0); > + > + if (css->ss && css->ss->css_rstat_flush) > + BUG_ON(cgroup_rstat_init(css)); Why do we need this call here? We already call cgroup_rstat_init() in cgroup_init_subsys(). IIUC for subsystems with ss->early_init, we will have already called cgroup_init_subsys() in cgroup_init_early(). Did I miss something? > } else { > cgroup_init_subsys(ss, false); > } [..] > @@ -300,27 +306,25 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop) > } > > /* see cgroup_rstat_flush() */ > -static void cgroup_rstat_flush_locked(struct cgroup *cgrp) > +static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css) > __releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock) > { > + struct cgroup *cgrp = css->cgroup; > int cpu; > > lockdep_assert_held(&cgroup_rstat_lock); > > for_each_possible_cpu(cpu) { > - struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu); > + struct cgroup_subsys_state *pos; > > + pos = cgroup_rstat_updated_list(css, cpu); > for (; pos; pos = pos->rstat_flush_next) { > - struct cgroup_subsys_state *css; > + if (!pos->ss) > + cgroup_base_stat_flush(pos->cgroup, cpu); > + else > + pos->ss->css_rstat_flush(pos, cpu); > > - cgroup_base_stat_flush(pos, cpu); > - bpf_rstat_flush(pos, cgroup_parent(pos), cpu); > - > - rcu_read_lock(); > - list_for_each_entry_rcu(css, &pos->rstat_css_list, > - rstat_css_node) > - css->ss->css_rstat_flush(css, cpu); > - rcu_read_unlock(); > + bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu); We should call bpf_rstat_flush() only if (!pos->ss) as well, right? Otherwise we will call BPF rstat flush whenever any subsystem is flushed. I guess it's because BPF can now pass any subsystem to cgroup_rstat_flush(), and we don't keep track. I think it would be better if we do not allow BPF programs to select a css and always make them flush the self css. We can perhaps introduce a bpf_cgroup_rstat_flush() wrapper that takes in a cgroup and passes cgroup->self internally to cgroup_rstat_flush(). But if the plan is to remove the bpf_rstat_flush() call here soon then it's probably not worth the hassle. Shakeel (and others), WDYT?