From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: Thu, 18 Apr 2024 11:02:06 +0200
Subject: Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex
To: Yosry Ahmed
Cc: tj@kernel.org, hannes@cmpxchg.org, lizefan.x@bytedance.com, cgroups@vger.kernel.org, longman@redhat.com, netdev@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, shakeel.butt@linux.dev, kernel-team@cloudflare.com, Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko@kernel.org
Message-ID: <651a52ac-b545-4b25-b82f-ad3a2a57bf69@kernel.org>
In-Reply-To: <171328989335.3930751.3091577850420501533.stgit@firesoul>
References: <171328983017.3930751.9484082608778623495.stgit@firesoul> <171328989335.3930751.3091577850420501533.stgit@firesoul>
On 18/04/2024 04.19, Yosry Ahmed wrote:
> On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer wrote:
>>
>> Since kernel v4.18, cgroup_rstat_lock has been an IRQ-disabling spinlock,
>> as introduced by commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex
>> with a spinlock").
>>
>> Despite efforts in cgroup_rstat_flush_locked() to yield the lock when
>> necessary during the collection of per-CPU stats, this approach has led
>> to several scaling issues observed in production environments. Holding
>> this IRQ lock has caused starvation of other critical kernel functions,
>> such as softirq (e.g., timers and netstack). Although kernel v6.8
>> introduced optimizations in this area, we continue to observe instances
>> where the spin_lock is held for 64-128 ms in production.
>>
>> This patch converts cgroup_rstat_lock back to being a mutex lock. This
>> change is made possible thanks to the significant effort by Yosry Ahmed
>> to eliminate all atomic context use-cases through multiple commits,
>> ending in 0a2dc6ac3329 ("cgroup: remove cgroup_rstat_flush_atomic()"),
>> included in kernel v6.5.
>>
>> After this patch lock contention will be less obvious, as converting this
>> to a mutex avoids multiple CPUs spinning while waiting for the lock, but
>> it doesn't remove the lock contention. It is recommended to use the
>> tracepoints to diagnose this.
>
> I will keep the high-level conversation about using the mutex here in
> the cover letter thread, but I am wondering why we are keeping the
> lock dropping logic here with the mutex?
>

I agree that yielding the mutex in the loop makes less sense, especially
since the raw_spin_unlock_irqrestore(cpu_lock, flags) call will already
act as a preemption point for my softirq.
But I kept it because we are running a CONFIG_PREEMPT_VOLUNTARY kernel,
so I still worried that there would be no scheduling point for other
userspace processes while holding the mutex. I don't fully know the
scheduling implications of holding a mutex.

> If this is to reduce lock contention, why does it depend on
> need_resched()? spin_needbreak() is a good indicator for lock
> contention, but need_resched() isn't, right?
>

As I said, I'm unsure of the semantics of holding a mutex.

> Also, how was this tested?
>

I tested this in a testlab, prior to posting upstream, with parallel
readers of the stat files. As I said in the other mail, I plan to
experiment with these patches (2+3) in production, as micro-benchmarking
will not reveal the corner cases we care about. With BPF-based
measurements of the lock congestion time, I hope we can catch production
issues on a time scale that lets us react before the impact becomes
user visible.

> When I did previous changes to the flushing logic I used to make sure
> that userspace read latency was not impacted, as well as in-kernel
> flushers (e.g. reclaim). We should make sure there are no regressions
> on both fronts.
>

Agree, we should consider both userspace readers and in-kernel flushers.
Maybe these need separate handling, as they have separate needs.
>>
>> Signed-off-by: Jesper Dangaard Brouer
>> ---
>>  kernel/cgroup/rstat.c |   10 +++++-----
>>  1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
>> index ff68c904e647..a90d68a7c27f 100644
>> --- a/kernel/cgroup/rstat.c
>> +++ b/kernel/cgroup/rstat.c
>> @@ -9,7 +9,7 @@
>>
>>  #include 
>>
>> -static DEFINE_SPINLOCK(cgroup_rstat_lock);
>> +static DEFINE_MUTEX(cgroup_rstat_lock);
>>  static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
>>
>>  static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
>> @@ -238,10 +238,10 @@ static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
>>  {
>>         bool contended;
>>
>> -       contended = !spin_trylock_irq(&cgroup_rstat_lock);
>> +       contended = !mutex_trylock(&cgroup_rstat_lock);
>>         if (contended) {
>>                 trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
>> -               spin_lock_irq(&cgroup_rstat_lock);
>> +               mutex_lock(&cgroup_rstat_lock);
>>         }
>>         trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
>>  }
>> @@ -250,7 +250,7 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
>>         __releases(&cgroup_rstat_lock)
>>  {
>>         trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
>> -       spin_unlock_irq(&cgroup_rstat_lock);
>> +       mutex_unlock(&cgroup_rstat_lock);
>>  }
>>
>>  /* see cgroup_rstat_flush() */
>> @@ -278,7 +278,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>>         }
>>
>>         /* play nice and yield if necessary */
>> -       if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
>> +       if (need_resched()) {
>>                 __cgroup_rstat_unlock(cgrp, cpu);
>>                 if (!cond_resched())
>>                         cpu_relax();