From: Shakeel Butt <shakeel.butt@linux.dev>
To: Tejun Heo, Andrew Morton, Alexei Starovoitov
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Yosry Ahmed, Michal Koutný, Vlastimil Babka, Sebastian Andrzej Siewior, JP Kobryn, bpf@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Meta kernel team
Subject: [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe
Date: Mon, 28 Apr 2025 23:12:09 -0700
Message-ID: <20250429061211.1295443-4-shakeel.butt@linux.dev>
In-Reply-To: <20250429061211.1295443-1-shakeel.butt@linux.dev>
References: <20250429061211.1295443-1-shakeel.butt@linux.dev>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

For css_rstat_updated() to run safely in nmi context, it cannot spin on locks; instead it has to trylock the per-cpu per-ss raw spinlock. This patch implements a backlog mechanism to handle failure to acquire that lock. Each subsystem provides a per-cpu lockless list on which the kernel stores the css passed to css_rstat_updated() when the trylock fails. These lockless lists serve as the backlog.
On the cgroup stats flushing path, the kernel first processes all the per-cpu lockless backlog lists of the given ss and then proceeds to flush the updated stat trees.

With css_rstat_updated() being nmi safe, the memcg stats can and will be converted to be nmi safe, to enable nmi-safe memory charging.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 kernel/cgroup/rstat.c | 99 +++++++++++++++++++++++++++++++++----------
 1 file changed, 76 insertions(+), 23 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d3092b4c85d7..ac533e46afa9 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -11,6 +11,7 @@
 static DEFINE_SPINLOCK(rstat_base_lock);
 static DEFINE_PER_CPU(raw_spinlock_t, rstat_base_cpu_lock);
+static DEFINE_PER_CPU(struct llist_head, rstat_backlog_list);
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
 
@@ -42,6 +43,13 @@ static raw_spinlock_t *ss_rstat_cpu_lock(struct cgroup_subsys *ss, int cpu)
 	return per_cpu_ptr(&rstat_base_cpu_lock, cpu);
 }
 
+static struct llist_head *ss_lhead_cpu(struct cgroup_subsys *ss, int cpu)
+{
+	if (ss)
+		return per_cpu_ptr(ss->lhead, cpu);
+	return per_cpu_ptr(&rstat_backlog_list, cpu);
+}
+
 /*
  * Helper functions for rstat per CPU locks.
  *
@@ -86,6 +94,21 @@ unsigned long _css_rstat_cpu_lock(struct cgroup_subsys_state *css, int cpu,
 	return flags;
 }
 
+static __always_inline
+bool _css_rstat_cpu_trylock(struct cgroup_subsys_state *css, int cpu,
+		unsigned long *flags)
+{
+	struct cgroup *cgrp = css->cgroup;
+	raw_spinlock_t *cpu_lock;
+	bool contended;
+
+	cpu_lock = ss_rstat_cpu_lock(css->ss, cpu);
+	contended = !raw_spin_trylock_irqsave(cpu_lock, *flags);
+	if (contended)
+		trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);
+	return !contended;
+}
+
 static __always_inline
 void _css_rstat_cpu_unlock(struct cgroup_subsys_state *css, int cpu,
 		unsigned long flags, const bool fast_path)
@@ -102,32 +125,16 @@ void _css_rstat_cpu_unlock(struct cgroup_subsys_state *css, int cpu,
 	raw_spin_unlock_irqrestore(cpu_lock, flags);
 }
 
-/**
- * css_rstat_updated - keep track of updated rstat_cpu
- * @css: target cgroup subsystem state
- * @cpu: cpu on which rstat_cpu was updated
- *
- * @css's rstat_cpu on @cpu was updated. Put it on the parent's matching
- * rstat_cpu->updated_children list. See the comment on top of
- * css_rstat_cpu definition for details.
- */
-__bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
+static void css_add_to_backlog(struct cgroup_subsys_state *css, int cpu)
 {
-	unsigned long flags;
-
-	/*
-	 * Speculative already-on-list test. This may race leading to
-	 * temporary inaccuracies, which is fine.
-	 *
-	 * Because @parent's updated_children is terminated with @parent
-	 * instead of NULL, we can tell whether @css is on the list by
-	 * testing the next pointer for NULL.
-	 */
-	if (data_race(css_rstat_cpu(css, cpu)->updated_next))
-		return;
+	struct llist_head *lhead = ss_lhead_cpu(css->ss, cpu);
+	struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
 
-	flags = _css_rstat_cpu_lock(css, cpu, true);
+	llist_add_iff_not_on_list(&rstatc->lnode, lhead);
+}
 
+static void __css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
+{
 	/* put @css and all ancestors on the corresponding updated lists */
 	while (true) {
 		struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
@@ -153,6 +160,51 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 
 		css = parent;
 	}
+}
+
+static void css_process_backlog(struct cgroup_subsys *ss, int cpu)
+{
+	struct llist_head *lhead = ss_lhead_cpu(ss, cpu);
+	struct llist_node *lnode;
+
+	while ((lnode = llist_del_first_init(lhead))) {
+		struct css_rstat_cpu *rstatc;
+
+		rstatc = container_of(lnode, struct css_rstat_cpu, lnode);
+		__css_rstat_updated(rstatc->owner, cpu);
+	}
+}
+
+/**
+ * css_rstat_updated - keep track of updated rstat_cpu
+ * @css: target cgroup subsystem state
+ * @cpu: cpu on which rstat_cpu was updated
+ *
+ * @css's rstat_cpu on @cpu was updated. Put it on the parent's matching
+ * rstat_cpu->updated_children list. See the comment on top of
+ * css_rstat_cpu definition for details.
+ */
+__bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
+{
+	unsigned long flags;
+
+	/*
+	 * Speculative already-on-list test. This may race leading to
+	 * temporary inaccuracies, which is fine.
+	 *
+	 * Because @parent's updated_children is terminated with @parent
+	 * instead of NULL, we can tell whether @css is on the list by
+	 * testing the next pointer for NULL.
+	 */
+	if (data_race(css_rstat_cpu(css, cpu)->updated_next))
+		return;
+
+	if (!_css_rstat_cpu_trylock(css, cpu, &flags)) {
+		css_add_to_backlog(css, cpu);
+		return;
+	}
+
+	__css_rstat_updated(css, cpu);
 
 	_css_rstat_cpu_unlock(css, cpu, flags, true);
 }
@@ -255,6 +307,7 @@ static struct cgroup_subsys_state *css_rstat_updated_list(
 
 	flags = _css_rstat_cpu_lock(root, cpu, false);
 
+	css_process_backlog(root->ss, cpu);
 	/* Return NULL if this subtree is not on-list */
 	if (!rstatc->updated_next)
 		goto unlock_ret;
-- 
2.47.1