From: Yosry Ahmed <yosryahmed@google.com>
Date: Thu, 18 Jul 2024 08:55:28 -0700
Subject: Re: [PATCH V7 1/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes
To: Jesper Dangaard Brouer
Cc: tj@kernel.org, cgroups@vger.kernel.org, shakeel.butt@linux.dev, hannes@cmpxchg.org, lizefan.x@bytedance.com, longman@redhat.com, kernel-team@cloudflare.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
On Thu, Jul 18, 2024 at 1:12 AM Jesper Dangaard Brouer wrote:
>
> On 17/07/2024 18.49, Yosry Ahmed wrote:
> > On Wed, Jul 17, 2024 at 9:36 AM Jesper Dangaard Brouer wrote:
> >>
> >> On 17/07/2024 02.35, Yosry Ahmed wrote:
> >>> [..]
> >>>>
> >>>> This is a clean (meaning no cadvisor interference) example of kswapd
> >>>> starting simultaneously on many NUMA nodes, that in 27 out of 98 cases
> >>>> hit the race (which is handled in V6 and V7).
> >>>>
> >>>> The BPF "cnt" maps are getting cleared every second, so this
> >>>> approximates per-second numbers. This patch reduces pressure on the
> >>>> lock, but we are still seeing full flushes
> >>>> (kfunc:vmlinux:cgroup_rstat_flush_locked) approx 37 per sec (every
> >>>> 27 ms). On the positive side, the ongoing_flusher mitigation stopped
> >>>> 98 per sec of these.
> >>>>
> >>>> In this clean kswapd case the patch removes the lock contention issue
> >>>> for kswapd. The 27 lock_contended cases all seem to be related to the
> >>>> 27 handled_race cases.
> >>>>
> >>>> The remaining high flush rate should also be addressed, and we should
> >>>> also work on approaches to limit this, like my earlier proposal[1].
> >>>
> >>> I honestly don't think a high number of flushes is a problem on its
> >>> own as long as we are not spending too much time flushing, especially
> >>> when we have magnitude-based thresholding so we know there is
> >>> something to flush (although it may not be relevant to what we are
> >>> doing).
> >>>
> >>
> >> We are "spending too much time flushing", see below.
> >>
> >>> If we keep observing a lot of lock contention, one thing that I
> >>> thought about is to have a variant of spin_lock with a timeout. This
> >>> limits the flushing latency, instead of limiting the number of
> >>> flushes (which I believe is the wrong metric to optimize).
> >>>
> >>> It also seems to me that we are doing a flush every 27 ms, and your
> >>> proposed threshold was once per 50 ms. It doesn't seem like a
> >>> fundamental difference.
> >>>
> >>
> >> Looking at the production numbers for the time the lock is held for level 0:
> >>
> >> @locked_time_level[0]:
> >> [4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
> >> [8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >> [16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                    |
> >> [32M, 64M)   275 |@@@@@@@@@@@@@@@@                                     |
> >>
> >> The time is in nanoseconds, so M corresponds to ms (milliseconds).
> >>
> >> With 36 flushes per second (as shown earlier) this is a flush every
> >> 27.7 ms. It is not unreasonable (from the above data) that the flush
> >> itself also takes 27 ms, which means that we spend a full CPU-second
> >> flushing. That is spending too much time flushing.
> >>
> >> This roughly 1 sec of CPU usage for kswapd is also quite clear in the
> >> attached grafana graph for when the server was rebooted into this V7
> >> kernel.
> >>
> >> I chose 50 ms because at the time I saw flushes taking around 30 ms,
> >> and I view the flush time as queue service-time. When the
> >> arrival-rate is faster than the service-time, a queue will form. So
> >> choosing 50 ms as the arrival-rate gave me some headroom. As I
> >> mentioned earlier, optimally this threshold should be dynamically
> >> measured.
> >
> > Thanks for the data. Yeah, this doesn't look good.
> >
> > Does it make sense to just throttle flushers at some point to increase
> > the chances of coalescing multiple flushers?
> >
> > Otherwise I think it makes sense in this case to ratelimit flushing in
> > general. Although instead of just checking how much time elapsed since
> > the last flush, can we use something like __ratelimit()?
> >
> > This will make sure that we skip flushes when we actually have a high
> > rate of flushing over a period of time, not because two flushes
> > happened to be requested in close succession while the flushing rate
> > is generally low.
> >
>
> I really think "time elapsed since the last flush" is the right solution
> here, as we *do* want to catch the case you describe: "two flushes
> happened to be requested in close succession and the flushing rate is
> generally low."
>
> (After this patch fixing the lock contention triggered by kswapd.)
> The remaining problem with kswapd is that those flushes that don't
> "collide" on the lock will be flushing in close succession. And we
> likely have a generally low flushing rate, until kswapd starts up.
I do not disagree; I think I may have misrepresented my intention. Time
since last flush is essentially ratelimiting to 2 flushes per (let's
say) 50 ms. So if we only have 2 flushes in an entire second, but they
happen to occur <50 ms apart, the latter will be skipped (arguably
unnecessarily).

What I am trying to say is to make it more generalized/less granular,
and apply ratelimiting when we are actually spending a large portion of
our time flushing. For example, cap it at 20 flushes per 1 s. I think
__ratelimit() does exactly that; it's currently mostly used to
ratelimit warnings, but I think we can reuse it here.

Ideally, of course, the number would be dynamic as you described
before, but we can start with a reasonable static value based on data.
What we do when the limit is hit could be changed later as well. We can
just skip flushes initially, and later we can be smarter about it and
throttle them in a way that makes them collide more effectively.

>
> Some production data from a "slow" period where only kswapd is active:
>
> 05:59:32 @ongoing_flusher_cnt[kswapd11]: 1
> @ongoing_flusher_cnt[kswapd7]: 1
> @ongoing_flusher_cnt[kswapd3]: 1
> @ongoing_flusher_cnt[kswapd5]: 1
> @ongoing_flusher_cnt[kswapd10]: 1
> @ongoing_flusher_cnt[kswapd6]: 2
> @ongoing_flusher_cnt[kswapd8]: 2
> @ongoing_flusher_cnt[kswapd1]: 2
> @ongoing_flusher_cnt[kswapd9]: 2
> @ongoing_flusher_cnt[kswapd0]: 2
> @ongoing_flusher_cnt[kswapd2]: 2
> @ongoing_flusher_cnt[kswapd4]: 2
> @ongoing_flusher_cnt[handled_race]: 2
> @ongoing_flusher_cnt[all]: 14
> @cnt[tracepoint:cgroup:cgroup_rstat_lock_contended]: 2
> @cnt[tracepoint:cgroup:cgroup_ongoing_flusher_wait]: 10
> @cnt[kfunc:vmlinux:cgroup_rstat_flush_locked]: 43
> @cnt[tracepoint:cgroup:cgroup_rstat_locked]: 51
>
> We observe that the ongoing_flusher scheme saves/avoids 14 of the
> flushes, great, but we still have 43 flushes in this period.
> I think only kswapd
> is doing the flushing here, so I claim 41 of these flushes are likely
> unnecessary (as there are indications of 2 kswapd startups in the
> period). Also observe that some of the kswapdNN processes only have
> ongoing_flusher_cnt 1, which indicates they didn't fully "collide"
> with an ongoing flusher, while others have 2.
>
> --Jesper