Subject: Re: [PATCH V7 1/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes
From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 17 Jul 2024 09:31:48 -0700
To: Jesper Dangaard Brouer
Cc: tj@kernel.org, cgroups@vger.kernel.org, shakeel.butt@linux.dev,
 hannes@cmpxchg.org, lizefan.x@bytedance.com, longman@redhat.com,
 kernel-team@cloudflare.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
In-Reply-To: <8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org>
References: <172070450139.2992819.13210624094367257881.stgit@firesoul>
 <8c123882-a5c5-409a-938b-cb5aec9b9ab5@kernel.org>

[..]
>
> I agree that the description should be updated as the patch has
> evolved, and your suggestions make sense, thanks.
>
> But I'm not sure we are finished evolving this patch, because the
> scheme of allowing all sub-cgroups to become the ongoing flusher (and
> only avoiding lock-spinning when cgroup_is_descendant) is causing
> issues in production.

Sure, I just wanted to point out that the commit log should be updated
going forward, not implying that we are done here :)

> We are using cadvisor in production for collecting metrics, which
> walks all the cgroups reading stat files, e.g. io.stat, cpu.stat and
> memory.stat (doing 3 flushes of the same cgroup in a short timespan),
> but individual cgroups don't overlap much to benefit from the
> ongoing_flusher scheme.
>
> Below is (1 sec) production data, where cadvisor and kswapd collide:
>
> > 02:01:33 @ongoing_flusher_cnt[kworker/u395:14]: 1
> > @ongoing_flusher_cnt[kswapd5]: 3
> > @ongoing_flusher_cnt[kswapd7]: 4
> > @ongoing_flusher_cnt[kswapd6]: 4
> > @ongoing_flusher_cnt[kswapd8]: 4
> > @ongoing_flusher_cnt[kswapd9]: 5
> > @ongoing_flusher_cnt[kswapd2]: 5
> > @ongoing_flusher_cnt[kswapd10]: 5
> > @ongoing_flusher_cnt[kswapd3]: 5
> > @ongoing_flusher_cnt[kswapd11]: 5
> > @ongoing_flusher_cnt[kswapd0]: 5
> > @ongoing_flusher_cnt[kswapd4]: 6
> > @ongoing_flusher_cnt[kswapd1]: 7
> > @ongoing_flusher_cnt[cadvisor]: 9
>
> For cadvisor, ongoing_flusher only helps avoid a flush 9 times.
>
> > @ongoing_flusher_cnt[handled_race]: 18
> > @ongoing_flusher_cnt[all]: 61
>
> Our ongoing_flusher scheme detects overlap and avoids 61 out of 462
> flushes.
>
> > @cnt[tracepoint:cgroup:cgroup_ongoing_flusher_wait]: 61
> > @cnt[kfunc:vmlinux:cgroup_rstat_flush_locked]: 462
> > @cnt[tracepoint:cgroup:cgroup_rstat_lock_contended]: 9032
>
> This is bad: the lock is contended 9032 times within this 1 sec
> period. Below, the lock_contended cases are captured in more detail.
>
> > @cnt[tracepoint:cgroup:cgroup_rstat_locked]: 9435
> > @lock_contended[normal, 4]: 1
> > @lock_contended[normal, 1]: 2
> > @lock_contended[normal, 3]: 6
> > @lock_contended[normal, 2]: 15
> > @lock_contended[yield, 4]: 49
> > @lock_contended[after_obtaining_lock, 4]: 49
> > @lock_contended[normal, 0]: 59
>
> The "normal" lock_contended case for level 0 means the cgroup root
> was contended 59 times within this 1 sec period.
>
> > @lock_contended[after_obtaining_lock, 1]: 117
> > @lock_contended[yield, 1]: 118
> > @lock_contended[yield, 3]: 258
> > @lock_contended[after_obtaining_lock, 3]: 258
> > @lock_contended[after_obtaining_lock, 2]: 946
> > @lock_contended[yield, 2]: 946
>
> Lock contention for the 'yielded' case at level 2 goes crazy with
> 946/sec.
>
> > @lock_contended[yield, 0]: 7579
>
> Lock contention for the 'yielded' case at level 0, the root, is
> really crazy with 7579/sec lock-spin cases.
>
> > @lock_contended[after_obtaining_lock, 0]: 7579
>
> IMHO this shows that, due to lock yielding, the scheme of
> ongoing_flusher for sub-trees only causes issues.

Just to make sure I understand correctly: allowing non-root cgroups to
become ongoing_flushers is problematic because we only support a
single ongoing_flusher, so if we have a subsequent root flush, it
won't be the ongoing_flusher and we won't benefit from skipping other
flushes. Correct?

I honestly think the problem here is not that we support
ongoing_flusher for subtrees, which imo is nice to have at a low
additional cost, and does have its benefits (see below). I think the
problem is that lock yielding means that we can have multiple ongoing
flushers, while the current scheme only records one.

What I think we should be doing is either supporting multiple
ongoing_flushers (by replacing the global variable with a per-cgroup
flag, perhaps), or biting the bullet and starting to use a mutex to
drop lock yielding. If there aren't any concerns beyond priority
inversion (raised by Shakeel), maybe adding a mutex_lock_timeout()
variant as I suggested could fix this issue (and also put a bound on
the flushing latency in general)?

(I suppose the latter is preferable.)
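For the mutex direction, something along these lines is what I have in
mind. This is only a rough sketch: mutex_lock_timeout() does not exist
today and would need to be added, and the timeout value and error
convention below are made up:

  /*
   * Rough sketch only: mutex_lock_timeout() is the hypothetical
   * variant discussed above; assume it returns 0 on success and
   * -ETIMEDOUT if the mutex was not acquired in time.
   */
  static DEFINE_MUTEX(cgroup_rstat_mutex);

  static void cgroup_rstat_flush_bounded(struct cgroup *cgrp)
  {
          /*
           * Waiters sleep on the mutex instead of spinning on the
           * rstat lock, so the lock-yielding logic (and with it the
           * multiple in-flight flushers) goes away entirely.
           */
          if (mutex_lock_timeout(&cgroup_rstat_mutex,
                                 msecs_to_jiffies(50)) == -ETIMEDOUT)
                  return; /* bounded latency: read slightly stale stats */

          cgroup_rstat_flush_locked(cgrp);
          mutex_unlock(&cgroup_rstat_mutex);
  }

The timeout is also what puts a bound on the flushing latency: a
reader stalled behind a slow flusher gives up after the timeout and
proceeds with slightly stale stats instead of waiting indefinitely.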
>
> IMHO we should go back to only doing ongoing_flusher for the
> root-cgroup. There is a really low chance of sub-tree flushes being
> concurrent enough to benefit from this, and it causes issues and
> needs (ugly) race handling.
>
> Furthermore, sub-tree flushing doesn't take as long as level 0
> flushing, which further lowers the chance of concurrent flushes.

I agree that the race handling is not pretty, and we can try to
improve the implementation or handle the race in other ways. However,
I think that skipping flushes for subtrees is valuable. I am not sure
about the cgroup arrangement in your use case, but we do have cgroups
with a lot of tasks in them (and/or child cgroups). If there is memory
pressure (or they hit the cgroup limit), they all may go into direct
reclaim concurrently, so skipping flushes could really be beneficial.
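To be concrete about what we would be giving up, the subtree skip
works roughly like this (a sketch of the scheme being discussed, not
the actual patch code; the ongoing-flusher pointer and the flush_done
completion are illustrative names):

  /*
   * Illustrative sketch: cgrp_rstat_ongoing_flusher and flush_done
   * are made-up names standing in for the patch's actual state.
   */
  static struct cgroup *cgrp_rstat_ongoing_flusher;

  static bool cgroup_rstat_wait_for_ongoing(struct cgroup *cgrp)
  {
          struct cgroup *ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);

          /*
           * If an ancestor is already flushing, this subtree's stats
           * are covered by that flush; sleep until it completes
           * instead of spinning on the global rstat lock.
           */
          if (ongoing && cgroup_is_descendant(cgrp, ongoing)) {
                  wait_for_completion(&ongoing->flush_done);
                  return true;  /* flush was skipped */
          }
          return false;         /* caller takes the lock and flushes */
  }

When many tasks in the same subtree enter direct reclaim at once, all
but the first flusher would take the early-return path above.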
Of course, if the difference in complexity is not justified, we can go
back to only supporting root cgroups for ongoing_flusher for now. But
as I mentioned in the v4 email thread, some of the complexity may
still remain anyway, as we have multiple root cgroups in v1.

> Let's get some quick data on flush times from production, to support
> my claim:

Thanks for the data. I agree that in general root flushes will be a
lot more expensive than subtree flushes, but keep in mind that the
data may look really different depending on the cgroup setup. As I
mentioned, I think we should still support subtree flushes unless the
difference in complexity is not justified.

> The bpftrace one-liner:
>
> sudo bpftrace -e '
>   tracepoint:cgroup:cgroup_rstat_locked {
>     if (args->cpu == -1) { @start[tid] = nsecs; }
>     @cnt[probe] = count();
>     if (args->contended) {
>       @lock_contended["after_obtaining_lock", args->level] = count();
>     }}
>   tracepoint:cgroup:cgroup_rstat_unlock {
>     if (args->cpu == -1) {
>       $now = nsecs; $start = @start[tid]; $diff = $now - $start;
>       @locked_time_level[args->level] = hist($diff);
>     }
>     @cnt[probe] = count()}
>   kfunc:cgroup_rstat_flush_locked { @cnt[probe] = count() }
>   interval:s:1 { time("%H:%M:%S ");
>     print(@cnt);
>     print(@lock_contended);
>     print(@locked_time_level);
>     clear(@cnt);
>     clear(@lock_contended); }
>   END { clear(@start) }'
>
> Time below is in nanoseconds.
>
> > @locked_time_level[0]:
> > [4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@              |
> > [8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
> > [32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |
> >
> > @locked_time_level[1]:
> > [4K, 8K)       6 |@@@@                                                |
> > [8K, 16K)     65 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [16K, 32K)    52 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          |
> > [32K, 64K)    23 |@@@@@@@@@@@@@@@@@@                                  |
> > [64K, 128K)   15 |@@@@@@@@@@@@                                        |
> > [128K, 256K)  10 |@@@@@@@@                                            |
> > [256K, 512K)   6 |@@@@                                                |
> > [512K, 1M)    15 |@@@@@@@@@@@@                                        |
> > [1M, 2M)       2 |@                                                   |
> > [2M, 4M)      14 |@@@@@@@@@@@                                         |
> > [4M, 8M)       6 |@@@@                                                |
> > [8M, 16M)      7 |@@@@@                                               |
> > [16M, 32M)     1 |                                                    |
> >
> > @locked_time_level[2]:
> > [2K, 4K)       1 |                                                    |
> > [4K, 8K)     160 |@@@@@@@@@                                           |
> > [8K, 16K)    733 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
> > [16K, 32K)   901 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [32K, 64K)   191 |@@@@@@@@@@@                                         |
> > [64K, 128K)  115 |@@@@@@                                              |
> > [128K, 256K)  61 |@@@                                                 |
> > [256K, 512K)  70 |@@@@                                                |
> > [512K, 1M)    59 |@@@                                                 |
> > [1M, 2M)      27 |@                                                   |
> > [2M, 4M)       9 |                                                    |
> >
> > @locked_time_level[3]:
> > [1K, 2K)       3 |                                                    |
> > [2K, 4K)       2 |                                                    |
> > [4K, 8K)       5 |                                                    |
> > [8K, 16K)    147 |@@@@@@                                              |
> > [16K, 32K)  1222 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [32K, 64K)   266 |@@@@@@@@@@@                                         |
> > [64K, 128K)  199 |@@@@@@@@                                            |
> > [128K, 256K) 146 |@@@@@@                                              |
> > [256K, 512K) 124 |@@@@@                                               |
> > [512K, 1M)    17 |                                                    |
> > [1M, 2M)       0 |                                                    |
> > [2M, 4M)       0 |                                                    |
> > [4M, 8M)       1 |                                                    |
> >
> > @locked_time_level[4]:
> > [4K, 8K)       2 |@@                                                  |
> > [8K, 16K)     17 |@@@@@@@@@@@@@@@@@@@@@@                              |
> > [16K, 32K)    40 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [32K, 64K)     4 |@@@@@                                               |
>
> --Jesper