Message-ID: <0ef04a5c-ced8-4e30-bcde-43c218e35387@kernel.org>
Date: Tue, 14 May 2024 07:18:18 +0200
Subject: Re: [PATCH v1] cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints
To: tj@kernel.org, hannes@cmpxchg.org, lizefan.x@bytedance.com, cgroups@vger.kernel.org, yosryahmed@google.com, longman@redhat.com
Cc: netdev@vger.kernel.org, linux-mm@kvack.org, shakeel.butt@linux.dev, kernel-team@cloudflare.com, Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior
References: <171457225108.4159924.12821205549807669839.stgit@firesoul>
From: Jesper Dangaard Brouer
In-Reply-To: <171457225108.4159924.12821205549807669839.stgit@firesoul>

Hi Tejun,

Could we please apply this for-6.10, to avoid splitting the tracepoint
changes over multiple kernel versions?

--Jesper

P.S. Cloudflare is already running with this applied in production.

On 01/05/2024 16.04, Jesper Dangaard Brouer wrote:
> This closely resembles the helpers added for the global cgroup_rstat_lock
> in commit fc29e04ae1ad ("cgroup/rstat: add cgroup_rstat_lock helpers and
> tracepoints"). This patch adds the same for the per-CPU lock
> cgroup_rstat_cpu_lock.
>
> Based on production workloads, we observe that the fast-path "update"
> function cgroup_rstat_updated() is invoked around 3 million times per
> second, while the "flush" function cgroup_rstat_flush_locked(), walking
> each possible CPU, can see periodic spikes of 700 invocations/sec.
>
> For this reason, the tracepoints are split into normal and fastpath
> versions for this per-CPU lock, making it feasible for production to
> continuously monitor the non-fastpath tracepoint to detect lock
> contention issues.
> The reason for monitoring is that this lock disables IRQs, which can
> disturb e.g. softirq processing on the local CPUs involved. When the
> global cgroup_rstat_lock stops disabling IRQs (e.g. gets converted to a
> mutex), this per-CPU lock becomes the next bottleneck that can introduce
> latency variations.
>
> A practical bpftrace script for monitoring contention latency:
>
>   bpftrace -e '
>     tracepoint:cgroup:cgroup_rstat_cpu_lock_contended {
>       @start[tid]=nsecs; @cnt[probe]=count()}
>     tracepoint:cgroup:cgroup_rstat_cpu_locked {
>       if (args->contended) {
>         @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}
>       @cnt[probe]=count()}
>     interval:s:1 {time("%H:%M:%S "); print(@wait_ns); print(@cnt); clear(@cnt);}'
>
> Signed-off-by: Jesper Dangaard Brouer
> ---
>  include/trace/events/cgroup.h |   56 +++++++++++++++++++++++++++++----
>  kernel/cgroup/rstat.c         |   70 ++++++++++++++++++++++++++++++++++-------
>  2 files changed, 108 insertions(+), 18 deletions(-)
>
> diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h
> index 13f375800135..0b95865a90f3 100644
> --- a/include/trace/events/cgroup.h
> +++ b/include/trace/events/cgroup.h
> @@ -206,15 +206,15 @@ DEFINE_EVENT(cgroup_event, cgroup_notify_frozen,
>
>  DECLARE_EVENT_CLASS(cgroup_rstat,
>
> -	TP_PROTO(struct cgroup *cgrp, int cpu_in_loop, bool contended),
> +	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
>
> -	TP_ARGS(cgrp, cpu_in_loop, contended),
> +	TP_ARGS(cgrp, cpu, contended),
>
>  	TP_STRUCT__entry(
>  		__field(	int,	root		)
>  		__field(	int,	level		)
>  		__field(	u64,	id		)
> -		__field(	int,	cpu_in_loop	)
> +		__field(	int,	cpu		)
>  		__field(	bool,	contended	)
>  	),
>
> @@ -222,15 +222,16 @@ DECLARE_EVENT_CLASS(cgroup_rstat,
>  		__entry->root = cgrp->root->hierarchy_id;
>  		__entry->id = cgroup_id(cgrp);
>  		__entry->level = cgrp->level;
> -		__entry->cpu_in_loop = cpu_in_loop;
> +		__entry->cpu = cpu;
>  		__entry->contended = contended;
>  	),
>
> -	TP_printk("root=%d id=%llu level=%d cpu_in_loop=%d lock contended:%d",
> +	TP_printk("root=%d id=%llu level=%d cpu=%d lock contended:%d",
>  		  __entry->root, __entry->id, __entry->level,
> -		  __entry->cpu_in_loop, __entry->contended)
> +		  __entry->cpu, __entry->contended)
>  );
>
> +/* Related to global: cgroup_rstat_lock */
>  DEFINE_EVENT(cgroup_rstat, cgroup_rstat_lock_contended,
>
>  	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
> @@ -252,6 +253,49 @@ DEFINE_EVENT(cgroup_rstat, cgroup_rstat_unlock,
>  	TP_ARGS(cgrp, cpu, contended)
>  );
>
> +/* Related to per CPU: cgroup_rstat_cpu_lock */
> +DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_lock_contended,
> +
> +	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
> +
> +	TP_ARGS(cgrp, cpu, contended)
> +);
> +
> +DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_lock_contended_fastpath,
> +
> +	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
> +
> +	TP_ARGS(cgrp, cpu, contended)
> +);
> +
> +DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_locked,
> +
> +	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
> +
> +	TP_ARGS(cgrp, cpu, contended)
> +);
> +
> +DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_locked_fastpath,
> +
> +	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
> +
> +	TP_ARGS(cgrp, cpu, contended)
> +);
> +
> +DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_unlock,
> +
> +	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
> +
> +	TP_ARGS(cgrp, cpu, contended)
> +);
> +
> +DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_unlock_fastpath,
> +
> +	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
> +
> +	TP_ARGS(cgrp, cpu, contended)
> +);
> +
>  #endif /* _TRACE_CGROUP_H */
>
>  /* This part must be outside protection */
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index 52e3b0ed1cee..fb8b49437573 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -19,6 +19,60 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
>  	return per_cpu_ptr(cgrp->rstat_cpu, cpu);
>  }
>
> +/*
> + * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
> + *
> + * This makes it easier to diagnose locking issues and contention in
> + * production environments. The parameter @fast_path determines the
> + * tracepoints being added, allowing us to diagnose "flush" related
> + * operations without handling high-frequency fast-path "update" events.
> + */
> +static __always_inline
> +unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
> +				     struct cgroup *cgrp, const bool fast_path)
> +{
> +	unsigned long flags;
> +	bool contended;
> +
> +	/*
> +	 * The _irqsave() is needed because cgroup_rstat_lock is
> +	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
> +	 * this lock with the _irq() suffix only disables interrupts on
> +	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
> +	 * interrupts on both configurations. The _irqsave() ensures
> +	 * that interrupts are always disabled and later restored.
> +	 */
> +	contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
> +	if (contended) {
> +		if (fast_path)
> +			trace_cgroup_rstat_cpu_lock_contended_fastpath(cgrp, cpu, contended);
> +		else
> +			trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);
> +
> +		raw_spin_lock_irqsave(cpu_lock, flags);
> +	}
> +
> +	if (fast_path)
> +		trace_cgroup_rstat_cpu_locked_fastpath(cgrp, cpu, contended);
> +	else
> +		trace_cgroup_rstat_cpu_locked(cgrp, cpu, contended);
> +
> +	return flags;
> +}
> +
> +static __always_inline
> +void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
> +			      struct cgroup *cgrp, unsigned long flags,
> +			      const bool fast_path)
> +{
> +	if (fast_path)
> +		trace_cgroup_rstat_cpu_unlock_fastpath(cgrp, cpu, false);
> +	else
> +		trace_cgroup_rstat_cpu_unlock(cgrp, cpu, false);
> +
> +	raw_spin_unlock_irqrestore(cpu_lock, flags);
> +}
> +
>  /**
>   * cgroup_rstat_updated - keep track of updated rstat_cpu
>   * @cgrp: target cgroup
> @@ -44,7 +98,7 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
>  	if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next))
>  		return;
>
> -	raw_spin_lock_irqsave(cpu_lock, flags);
> +	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
>
>  	/* put @cgrp and all ancestors on the corresponding updated lists */
>  	while (true) {
> @@ -72,7 +126,7 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
>  		cgrp = parent;
>  	}
>
> -	raw_spin_unlock_irqrestore(cpu_lock, flags);
> +	_cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, true);
>  }
>
>  /**
> @@ -153,15 +207,7 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
>  	struct cgroup *head = NULL, *parent, *child;
>  	unsigned long flags;
>
> -	/*
> -	 * The _irqsave() is needed because cgroup_rstat_lock is
> -	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
> -	 * this lock with the _irq() suffix only disables interrupts on
> -	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
> -	 * interrupts on both configurations. The _irqsave() ensures
> -	 * that interrupts are always disabled and later restored.
> -	 */
> -	raw_spin_lock_irqsave(cpu_lock, flags);
> +	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, root, false);
>
>  	/* Return NULL if this subtree is not on-list */
>  	if (!rstatc->updated_next)
> @@ -198,7 +244,7 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
>  		if (child != root)
>  			head = cgroup_rstat_push_children(head, child, cpu);
>  unlock_ret:
> -	raw_spin_unlock_irqrestore(cpu_lock, flags);
> +	_cgroup_rstat_cpu_unlock(cpu_lock, cpu, root, flags, false);
>  	return head;
>  }
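For readers skimming the diff: the helpers rely on a trylock-first pattern.
The raw_spin_trylock_irqsave() both acquires the lock on the uncontended
path and detects contention; only on failure do we emit the "contended"
tracepoint and fall back to a blocking acquire. Below is a minimal
userspace sketch of the same idea, assuming pthreads; the names are
illustrative only, pthread_mutex_t stands in for the kernel's
raw_spinlock_t, and fprintf() stands in for the tracepoints.

  /* Hypothetical userspace analogue of _cgroup_rstat_cpu_lock(). */
  #include <pthread.h>
  #include <stdbool.h>
  #include <stdio.h>

  static pthread_mutex_t cpu_lock = PTHREAD_MUTEX_INITIALIZER;

  /* Try the lock first to learn whether we contended, then block. */
  static bool lock_with_contention_probe(pthread_mutex_t *lock)
  {
          bool contended = (pthread_mutex_trylock(lock) != 0);

          if (contended) {
                  /* kernel: trace_cgroup_rstat_cpu_lock_contended(...) */
                  fprintf(stderr, "lock contended, blocking\n");
                  pthread_mutex_lock(lock);
          }
          /* kernel: trace_cgroup_rstat_cpu_locked(..., contended) */
          return contended;
  }

  int main(void)
  {
          bool contended = lock_with_contention_probe(&cpu_lock);

          printf("acquired, contended=%d\n", contended);
          pthread_mutex_unlock(&cpu_lock);
          return 0;
  }

On the uncontended path this costs the same single atomic acquire as a
plain lock, which is why wrapping even the ~3 million calls/sec update
path stays cheap.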