Date: Thu, 17 Apr 2025 21:45:29 +0000
From: Roman Gushchin <roman.gushchin@linux.dev>
To: Kairui Song
Cc: Muchun Song, Muchun Song, hannes@cmpxchg.org, mhocko@kernel.org, shakeel.butt@linux.dev, akpm@linux-foundation.org, david@fromorbit.com, zhengqi.arch@bytedance.com, yosry.ahmed@linux.dev, nphamcs@gmail.com, chengming.zhou@linux.dev, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, hamzamahfooz@linux.microsoft.com, apais@linux.microsoft.com, yuzhao@google.com
Subject: Re: [PATCH RFC 00/28] Eliminate Dying Memory Cgroup
References: <20250415024532.26632-1-songmuchun@bytedance.com>

On Fri, Apr 18, 2025 at 02:22:12AM +0800, Kairui Song wrote:
> On Tue, Apr 15, 2025 at 4:02 PM Muchun Song wrote:
> >
> > > On Apr 15, 2025, at 14:19, Kairui Song wrote:
> > >
> > > On Tue, Apr 15, 2025 at 10:46 AM Muchun Song wrote:
> > >>
> > >> This patchset is based on v6.15-rc2. It functions correctly only
> > >> when CONFIG_LRU_GEN (Multi-Gen LRU) is disabled. Several issues were
> > >> encountered during rebasing onto the latest code. For more details
> > >> and assistance, refer to the "Challenges" section. This is the
> > >> reason for adding the RFC tag.
> > >>
> > >> ## Introduction
> > >>
> > >> This patchset is intended to transfer the LRU pages to the object
> > >> cgroup without holding a reference to the original memory cgroup, in
> > >> order to address the issue of the dying memory cgroup. A consensus
> > >> has already been reached regarding this approach recently [1].
> > >>
> > >> ## Background
> > >>
> > >> The issue of a dying memory cgroup refers to a situation where a
> > >> memory cgroup is no longer being used by users, but memory (the
> > >> metadata associated with memory cgroups) remains allocated to it.
> > >> This situation may result in memory leaks or inefficiencies in
> > >> memory reclamation and has persisted as an issue for several years.
> > >> Any memory allocation that endures longer than the lifespan (from
> > >> the users' perspective) of a memory cgroup can lead to the dying
> > >> memory cgroup issue. We have made great efforts to tackle this
> > >> problem by introducing the object cgroup infrastructure [2].
> > >>
> > >> Presently, numerous types of objects (slab objects, non-slab kernel
> > >> allocations, per-CPU objects) are charged to the object cgroup
> > >> without holding a reference to the original memory cgroup. The
> > >> remaining allocations, LRU pages (anonymous pages and file pages),
> > >> are charged at allocation time and continue to hold a reference to
> > >> the original memory cgroup until reclaimed.
> > >>
> > >> File pages are more complex than anonymous pages as they can be
> > >> shared among different memory cgroups and may persist beyond the
> > >> lifespan of the memory cgroup. The long-term pinning of file pages
> > >> to memory cgroups is a widespread issue that causes recurring
> > >> problems in practical scenarios [3]. File pages remain unreclaimed
> > >> for extended periods. Additionally, they are accessed by successive
> > >> instances (second, third, fourth, etc.) of the same job, which is
> > >> restarted into a new cgroup each time. As a result, unreclaimable
> > >> dying memory cgroups accumulate, leading to memory wastage and
> > >> significantly reducing the efficiency of page reclamation.
> > >>
> > >> ## Fundamentals
> > >>
> > >> A folio will no longer pin its corresponding memory cgroup. It is
> > >> necessary to ensure that the memory cgroup, or the lruvec associated
> > >> with it, is not released while a user holds a pointer returned by
> > >> folio_memcg() or folio_lruvec(). Users who are not concerned about
> > >> the stability of the binding between the folio and its memory cgroup
> > >> are required to hold the RCU read lock or acquire a reference to the
> > >> memory cgroup associated with the folio to prevent its release.
> > >> However, some users of folio_lruvec() (i.e., users of the lruvec
> > >> lock) need a stable binding between the folio and its memory cgroup.
> > >> An approach is needed to ensure the stability of the binding while
> > >> the lruvec lock is held, and to detect the case where the wrong
> > >> lruvec lock is held due to a race with memory cgroup reparenting.
> > >> The following four steps are taken to achieve these goals.
> > >>
> > >> 1. The first step is to identify all users of both functions
> > >>    (folio_memcg() and folio_lruvec()) who are not concerned about
> > >>    binding stability and implement appropriate measures (such as
> > >>    holding an RCU read lock or temporarily obtaining a reference to
> > >>    the memory cgroup for a brief period) to prevent the release of
> > >>    the memory cgroup.
> > >>
> > >> 2. Secondly, the following refactoring of folio_lruvec_lock()
> > >>    demonstrates how to ensure binding stability for users of
> > >>    folio_lruvec():
> > >>
> > >>    struct lruvec *folio_lruvec_lock(struct folio *folio)
> > >>    {
> > >>            struct lruvec *lruvec;
> > >>
> > >>            rcu_read_lock();
> > >>    retry:
> > >>            lruvec = folio_lruvec(folio);
> > >>            spin_lock(&lruvec->lru_lock);
> > >>            if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
> > >>                    spin_unlock(&lruvec->lru_lock);
> > >>                    goto retry;
> > >>            }
> > >>
> > >>            return lruvec;
> > >>    }
> > >>
> > >>    From the perspective of memory cgroup removal, the entire
> > >>    reparenting process (altering the binding relationship between a
> > >>    folio and its memory cgroup, and moving the LRU lists to the
> > >>    parent memory cgroup) should be carried out under both the lruvec
> > >>    lock of the memory cgroup being removed and the lruvec lock of
> > >>    its parent.
> > >>
> > >> 3. Thirdly, another lock that requires the same approach is the
> > >>    split-queue lock of THP.
> > >>
> > >> 4. Finally, transfer the LRU pages to the object cgroup without
> > >>    holding a reference to the original memory cgroup.
> > >>
> > >
> > > Hi, Muchun, thanks for the patch.
> >
> > Thanks for your reply and attention.
> >
> > >
> > >> ## Challenges
> > >>
> > >> In a non-MGLRU scenario, each lruvec of every memory cgroup
> > >> comprises four LRU lists (i.e., two active lists for anonymous and
> > >> file folios, and two inactive lists for anonymous and file folios).
> > >> Due to the symmetry of the LRU lists, it is feasible to transfer
> > >> the LRU lists from a memory cgroup to its parent memory cgroup
> > >> during the reparenting process.
> > >
> > > Symmetry of LRU lists doesn't mean symmetric 'hotness'; it's
> > > entirely possible that a child's active LRU is colder and should be
> > > evicted before the parent's inactive LRU (this might even be a
> > > common scenario for certain workloads).
> >
> > Yes.
> >
> > > This only affects performance, not correctness, so it's not a big
> > > problem.
> > >
> > > So will it be easier to just assume a dying cgroup's folios are
> > > colder? Simply moving them to the parent's LRU tail is OK. This
> > > would make the logic applicable to both the active/inactive LRU and
> > > MGLRU.
> >
> > I think you mean moving the whole child LRU list to the parent
> > memcg's inactive list. It works well for your case. But sometimes,
> > due to shared page cache pages, some pages in the child list may be
> > accessed more frequently than those in the parent's. Still, it's okay
> > as they can be promoted quickly later. So I am fine with this change.
> >
> > >
> > >> In an MGLRU scenario, each lruvec of every memory cgroup comprises
> > >> at least 2 (MIN_NR_GENS) generations and at most 4 (MAX_NR_GENS)
> > >> generations.
> > >>
> > >> 1. The first question is how to move the LRU lists from a memory
> > >>    cgroup to its parent memory cgroup during the reparenting
> > >>    process, given that the quantity of LRU lists (aka generations)
> > >>    may differ between a child memory cgroup and its parent.
> > >>
> > >> 2. The second question is how to make the reparenting process more
> > >>    efficient, since each folio charged to a memory cgroup stores its
> > >>    generation counter in its ->flags, and the generation counter may
> > >>    differ between a child memory cgroup and its parent because the
> > >>    values of ->min_seq and ->max_seq are not identical.
> > >>    Should those generation counters be updated correspondingly?
> > >
> > > I think you do have to iterate through the folios to set or clear
> > > their generation flags if you want to put the folios in the right
> > > generation.
> > >
> > > MGLRU does a similar thing in inc_min_seq. MGLRU uses the gen flags
> > > to defer the actual LRU movement of folios; that's a very important
> > > optimization per my tests.
> >
> > I noticed that, which is why I asked the second question. It's
> > inefficient when dealing with the numerous pages belonging to a
> > memory cgroup.
> >
> > >
> > >> I am uncertain about how to handle them appropriately as I am not
> > >> an expert on MGLRU. I would appreciate it if you could offer some
> > >> suggestions. Moreover, if you are willing to provide your patches
> > >> directly, I would be glad to incorporate them into this patchset.
> > >
> > > If we just follow the above idea (move them to the parent's tail),
> > > we can just keep the folios' tier info untouched here.
> > >
> > > For mapped file folios, they will still be promoted upon eviction if
> > > their access bits are set (rmap walk), and MGLRU's page table walker
> > > might promote them just fine.
> > >
> > > For unmapped file folios, if we just keep their tier info and add
> > > the child's MGLRU tier PID counter back to the parent, workingset
> > > protection of MGLRU should still work just fine.
> > >
> > >>
> > >> ## Compositions
> > >>
> > >> Patches 1-8 involve code refactoring and cleanup with the aim of
> > >> facilitating the transfer of LRU folios to the object cgroup
> > >> infrastructure.
> > >>
> > >> Patches 9-10 aim to allocate the object cgroup for non-kmem
> > >> scenarios, enabling LRU folios to be charged to it and aligning the
> > >> behavior of object-cgroup-related APIs with that of the memory
> > >> cgroup.
> > >>
> > >> Patches 11-19 aim to prevent the memory cgroup returned by
> > >> folio_memcg() from being released.
> > >>
> > >> Patches 20-23 aim to prevent the lruvec returned by folio_lruvec()
> > >> from being released.
> > >>
> > >> Patches 24-25 implement the core mechanism to guarantee binding
> > >> stability between the folio and its corresponding memory cgroup
> > >> while holding the lruvec lock or the split-queue lock of THP.
> > >>
> > >> Patches 26-27 are intended to transfer the LRU pages to the object
> > >> cgroup without holding a reference to the original memory cgroup,
> > >> in order to address the issue of the dying memory cgroup.
> > >>
> > >> Patch 28 aims to add VM_WARN_ON_ONCE_FOLIO to the LRU maintenance
> > >> helpers to ensure correct folio operations in the future.
> > >>
> > >> ## Effect
> > >>
> > >> Finally, it can be observed that the quantity of dying memory
> > >> cgroups will not experience a significant increase if the following
> > >> test script is executed to reproduce the issue.
> > >>
> > >> ```bash
> > >> #!/bin/bash
> > >>
> > >> # Create a temporary file 'temp' filled with zero bytes
> > >> dd if=/dev/zero of=temp bs=4096 count=1
> > >>
> > >> # Display memory-cgroup info from /proc/cgroups
> > >> cat /proc/cgroups | grep memory
> > >>
> > >> for i in {0..2000}
> > >> do
> > >>     mkdir /sys/fs/cgroup/memory/test$i
> > >>     echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs
> > >>
> > >>     # Append 'temp' file content to 'log'
> > >>     cat temp >> log
> > >>
> > >>     echo $$ > /sys/fs/cgroup/memory/cgroup.procs
> > >>
> > >>     # Potentially create a dying memory cgroup
> > >>     rmdir /sys/fs/cgroup/memory/test$i
> > >> done
> > >>
> > >> # Display memory-cgroup info after the test
> > >> cat /proc/cgroups | grep memory
> > >>
> > >> rm -f temp log
> > >> ```
> > >>
> > >> ## References
> > >>
> > >> [1] https://lore.kernel.org/linux-mm/Z6OkXXYDorPrBvEQ@hm-sls2/
> > >> [2] https://lwn.net/Articles/895431/
> > >> [3] https://github.com/systemd/systemd/pull/36827
> > >
> > > How much overhead will it be?
> > > Objcg has some extra overhead, and we have some extra conventions
> > > for retrieving the memcg of a folio now; I'm not sure if this will
> > > cause an observable slowdown.
> >
> > I don't think there'll be an observable slowdown. I think objcg is
> > more effective for slab objects as they're more sensitive than user
> > pages. If it's acceptable for slab objects, it should be acceptable
> > for user pages too.
>
> We currently have some workloads running with `nokmem` due to objcg
> performance issues. I know there are efforts to improve them, but so
> far it's still not painless to have. So I'm a bit worried about
> this...

Do you mind sharing more details here? Thanks!