Date: Tue, 22 Apr 2025 06:33:23 -0700
From: Yosry Ahmed
To: JP Kobryn
Cc: tj@kernel.org, shakeel.butt@linux.dev, yosryahmed@google.com,
	mkoutny@suse.com, hannes@cmpxchg.org, akpm@linux-foundation.org,
	linux-mm@kvack.org, cgroups@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCH v4 4/5] cgroup: use separate rstat trees for each subsystem
References: <20250404011050.121777-1-inwardvessel@gmail.com>
	<20250404011050.121777-5-inwardvessel@gmail.com>
In-Reply-To: <20250404011050.121777-5-inwardvessel@gmail.com>

On Thu, Apr 03, 2025 at 06:10:49PM -0700, JP Kobryn wrote:
> Different subsystems may call cgroup_rstat_updated() within the same
> cgroup, resulting in a tree of pending updates from multiple subsystems.
> When one of these subsystems is flushed via cgroup_rstat_flushed(), all
> other subsystems with pending updates on the tree will also be flushed.
>
> Change the paradigm of having a single rstat tree for all subsystems to
> having separate trees for each subsystem. This separation allows for
> subsystems to perform flushes without the side effects of other subsystems.
> As an example, flushing the cpu stats will no longer cause the memory stats
> to be flushed and vice versa.
>
> In order to achieve subsystem-specific trees, change the tree node type
> from cgroup to cgroup_subsys_state pointer. Then remove those pointers from
> the cgroup and instead place them on the css. Finally, change update/flush
> functions to make use of the different node type (css). These changes allow
> a specific subsystem to be associated with an update or flush. Separate
> rstat trees will now exist for each unique subsystem.
>
> Since updating/flushing will now be done at the subsystem level, there is
> no longer a need to keep track of updated css nodes at the cgroup level.
> The list management of these nodes done within the cgroup (rstat_css_list
> and related) has been removed accordingly.
>
> Signed-off-by: JP Kobryn
[..]
> @@ -219,6 +219,13 @@ struct cgroup_subsys_state {
>  	 * Protected by cgroup_mutex.
>  	 */
>  	int nr_descendants;
> +
> +	/*
> +	 * A singly-linked list of css structures to be rstat flushed.
> +	 * This is a scratch field to be used exclusively by
> +	 * css_rstat_flush_locked() and protected by cgroup_rstat_lock.
> +	 */

Ditto obsolete function name.

> +	struct cgroup_subsys_state *rstat_flush_next;
>  };
>
>  /*
> @@ -329,10 +336,10 @@ struct cgroup_base_stat {
>
>  /*
>   * rstat - cgroup scalable recursive statistics. Accounting is done
> - * per-cpu in cgroup_rstat_cpu which is then lazily propagated up the
> + * per-cpu in css_rstat_cpu which is then lazily propagated up the
>   * hierarchy on reads.
>   *
> - * When a stat gets updated, the cgroup_rstat_cpu and its ancestors are
> + * When a stat gets updated, the css_rstat_cpu and its ancestors are
>   * linked into the updated tree. On the following read, propagation only
>   * considers and consumes the updated tree. This makes reading O(the
>   * number of descendants which have been active since last read) instead of
> @@ -346,7 +353,7 @@ struct cgroup_base_stat {
>   * This struct hosts both the fields which implement the above -
>   * updated_children and updated_next.
>   */
> -struct cgroup_rstat_cpu {
> +struct css_rstat_cpu {
>  	/*
>  	 * Child cgroups with stat updates on this cpu since the last read
>  	 * are linked on the parent's ->updated_children through
>  	 * ->updated_next.
>  	 *
>  	 * Protected by per-cpu cgroup_rstat_cpu_lock.
>  	 */
> -	struct cgroup *updated_children;	/* terminated by self cgroup */
> -	struct cgroup *updated_next;		/* NULL iff not on the list */
> +	struct cgroup_subsys_state *updated_children;	/* terminated by self cgroup */
> +	struct cgroup_subsys_state *updated_next;	/* NULL iff not on the list */
>  };
>
>  /*
> @@ -521,25 +528,16 @@ struct cgroup {
>  	struct cgroup *dom_cgrp;
>  	struct cgroup *old_dom_cgrp;	/* used while enabling threaded */
>
> -	/* per-cpu recursive resource statistics */
> -	struct cgroup_rstat_cpu __percpu *rstat_cpu;
> +	/* per-cpu recursive basic resource statistics */

This comment should probably be dropped now as it's only providing partial
info and is potentially confusing.

>  	struct cgroup_rstat_base_cpu __percpu *rstat_base_cpu;
> -	struct list_head rstat_css_list;
>
>  	/*
> -	 * Add padding to separate the read mostly rstat_cpu and
> -	 * rstat_css_list into a different cacheline from the following
> -	 * rstat_flush_next and *bstat fields which can have frequent updates.
> +	 * Add padding to keep the read mostly rstat per-cpu pointer on a
> +	 * different cacheline than the following *bstat fields which can have
> +	 * frequent updates.
>  	 */
>  	CACHELINE_PADDING(_pad_);
>
> -	/*
> -	 * A singly-linked list of cgroup structures to be rstat flushed.
> -	 * This is a scratch field to be used exclusively by
> -	 * css_rstat_flush_locked() and protected by cgroup_rstat_lock.
> -	 */
> -	struct cgroup *rstat_flush_next;
> -
>  	/* cgroup basic resource statistics */
>  	struct cgroup_base_stat last_bstat;
>  	struct cgroup_base_stat bstat;
[..]
> @@ -5425,6 +5417,9 @@ static void css_free_rwork_fn(struct work_struct *work)
>  		struct cgroup_subsys_state *parent = css->parent;
>  		int id = css->id;
>
> +		if (ss->css_rstat_flush)
> +			css_rstat_exit(css);
> +

This call now exists in both branches (self css or not), so it's probably
best to pull it outside. We should probably also pull the call in
cgroup_destroy_root() into css_free_rwork_fn() so that we end up with a
single call to css_rstat_exit() (apart from failure paths).

We can probably also use css_is_cgroup() here instead of 'if (ss)' for
consistency.
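Something like the below, perhaps (completely untested sketch; I am assuming
css_is_cgroup() is true for the cgroup's self css, i.e. when css->ss is NULL):

	/* single css_rstat_exit() call before the branches instead of one in each */
	if (css_is_cgroup(css) || css->ss->css_rstat_flush)
		css_rstat_exit(css);

	if (!css_is_cgroup(css)) {
		/* css free path */
		...
	} else {
		/* cgroup free path */
		...
	}
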
>  		ss->css_free(css);
>  		cgroup_idr_remove(&ss->css_idr, id);
>  		cgroup_put(cgrp);
[..]
> @@ -5659,6 +5647,12 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
>  		goto err_free_css;
>  	css->id = err;
>
> +	if (ss->css_rstat_flush) {
> +		err = css_rstat_init(css);
> +		if (err)
> +			goto err_free_css;
> +	}
> +
>  	/* @css is ready to be brought online now, make it visible */
>  	list_add_tail_rcu(&css->sibling, &parent_css->children);
>  	cgroup_idr_replace(&ss->css_idr, css, css->id);
> @@ -5672,7 +5666,6 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
>  err_list_del:
>  	list_del_rcu(&css->sibling);
>  err_free_css:
> -	list_del_rcu(&css->rstat_css_node);
>  	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
>  	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
>  	return ERR_PTR(err);
> @@ -6104,11 +6097,17 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
>  	css->flags |= CSS_NO_REF;
>
>  	if (early) {
> -		/* allocation can't be done safely during early init */
> +		/*
> +		 * Allocation can't be done safely during early init.
> +		 * Defer IDR and rstat allocations until cgroup_init().
> +		 */
>  		css->id = 1;
>  	} else {
>  		css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2, GFP_KERNEL);
>  		BUG_ON(css->id < 0);
> +
> +		if (ss->css_rstat_flush)
> +			BUG_ON(css_rstat_init(css));
>  	}
>
>  	/* Update the init_css_set to contain a subsys
> @@ -6207,9 +6206,17 @@ int __init cgroup_init(void)
>  			struct cgroup_subsys_state *css =
>  				init_css_set.subsys[ss->id];
>
> +			/*
> +			 * It is now safe to perform allocations.
> +			 * Finish setting up subsystems that previously
> +			 * deferred IDR and rstat allocations.
> +			 */
>  			css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
>  						   GFP_KERNEL);
>  			BUG_ON(css->id < 0);
> +
> +			if (ss->css_rstat_flush)
> +				BUG_ON(css_rstat_init(css));

The calls to css_rstat_init() are really difficult to track. Let's recap:
before this change we had two calls:
- In cgroup_setup_root(), for root cgroups.
- In cgroup_create(), for non-root cgroups.

This patch adds 3 more, so we end up with 5 calls, as follows:
- In cgroup_setup_root(), for root self css's.
- In cgroup_create(), for non-root self css's.
- In cgroup_init_subsys(), for root subsys css's without early
  initialization.
- In cgroup_init(), for root subsys css's with early initialization, since
  we cannot call it from cgroup_init_subsys() that early because allocations
  are not allowed during early init.
- In css_create(), for non-root non-self css's.

We should try to consolidate as much as possible. For example:

- Can we always make the call for root subsys css's in cgroup_init(),
  regardless of early initialization status? Is there a need to make the
  call early in cgroup_init_subsys() for subsystems that use early
  initialization? (Rough sketch at the end of this mail.)

- Can we always make the call for root css's in cgroup_init(), regardless of
  whether the css is a self css or a subsys css? I imagine we'd still need
  two separate calls, one outside the loop for the self css's and one in the
  loop for subsys css's, but having them in the same place should make
  things easier.

Ideally, if we can do both of the above, we'd end up with only 3 calling
functions:
- cgroup_init() -> for all root css's.
- cgroup_create() -> for non-root self css's.
- css_create() -> for non-root subsys css's.

Also, we should probably document all the different call paths for
css_rstat_init() and css_rstat_exit() somewhere.

>  		} else {
>  			cgroup_init_subsys(ss, false);
>  		}
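For the first point above, a rough sketch of what I have in mind for the
for_each_subsys() loop in cgroup_init() (untested; assuming it is fine to
delay the rstat init of early-init subsystems until this point, and that
cgroup_init_subsys() then drops its own css_rstat_init() call):

	for_each_subsys(ss, ssid) {
		if (ss->early_init) {
			struct cgroup_subsys_state *css =
				init_css_set.subsys[ss->id];

			css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
						   GFP_KERNEL);
			BUG_ON(css->id < 0);
		} else {
			cgroup_init_subsys(ss, false);
		}

		/*
		 * Single call site for root subsys css's, regardless of
		 * whether the subsystem uses early init or not.
		 */
		if (ss->css_rstat_flush)
			BUG_ON(css_rstat_init(init_css_set.subsys[ss->id]));

		...
	}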