Date: Fri, 31 Oct 2025 11:35:50 +0100
From: Michal Hocko
To: Qi Zheng
Cc: hannes@cmpxchg.org, hughd@google.com, roman.gushchin@linux.dev,
	shakeel.butt@linux.dev, muchun.song@linux.dev, david@redhat.com,
	lorenzo.stoakes@oracle.com, ziy@nvidia.com, harry.yoo@oracle.com,
	imran.f.khan@oracle.com, kamalesh.babulal@oracle.com,
	axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com,
	akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: Re: [PATCH v1 00/26] Eliminate Dying Memory Cgroup
References: <8edf2f49-54f6-4604-8d01-42751234bee9@linux.dev>
In-Reply-To: <8edf2f49-54f6-4604-8d01-42751234bee9@linux.dev>

On Wed 29-10-25 16:05:16, Qi Zheng wrote:
> Hi Michal,
> 
> On 10/29/25 3:53 PM, Michal Hocko wrote:
> > On Tue 28-10-25 21:58:13, Qi Zheng wrote:
> > > From: Qi Zheng
> > > 
> > > Hi all,
> > > 
> > > This series aims to eliminate the problem of the dying memory cgroup.
> > > It completes the adaptation to MGLRU scenarios based on Muchun Song's
> > > patchset [1].
> > 
> > A high level summary and the main design decisions should be described
> > in the cover letter.
> 
> Got it. Will add it in the next version.
> 
> I've pasted the contents of Muchun Song's cover letter below:
> 
> ```
> ## Introduction
> 
> This patchset is intended to transfer the LRU pages to the object cgroup
> without holding a reference to the original memory cgroup in order to
> address the issue of the dying memory cgroup. A consensus has already
> been reached regarding this approach recently [1].

Could you add those referenced links as well please?

> ## Background
> 
> The issue of a dying memory cgroup refers to a situation where a memory
> cgroup is no longer being used by users, but memory (the metadata
> associated with memory cgroups) remains allocated to it. This situation
> may potentially result in memory leaks or inefficiencies in memory
> reclamation and has persisted as an issue for several years. Any memory
> allocation that endures longer than the lifespan (from the users'
> perspective) of a memory cgroup can lead to the dying memory cgroup
> problem. We have exerted considerable effort to tackle this problem by
> introducing the object cgroup infrastructure [2].
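To make the objcg indirection concrete for those who have not followed
the earlier discussions: the folio is made to pin an obj_cgroup rather
than the memory cgroup itself, and on offlining the obj_cgroup is simply
re-pointed to the parent. A minimal sketch of the lookup path
(hypothetical helper names, not the actual patchset code):

	/*
	 * Illustration only, names are made up: the folio references an
	 * obj_cgroup, and reparenting switches objcg->memcg to the
	 * parent, so a dying memcg no longer has to stay around.  The
	 * caller must hold rcu_read_lock() because that binding can
	 * change under it at any time.
	 */
	static struct mem_cgroup *folio_memcg_via_objcg(struct folio *folio)
	{
		struct obj_cgroup *objcg = folio_objcg(folio);

		return objcg ? obj_cgroup_memcg(objcg) : NULL;
	}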
> Presently, numerous types of objects (slab objects, non-slab kernel
> allocations, per-CPU objects) are charged to the object cgroup without
> holding a reference to the original memory cgroup. The final allocations
> for LRU pages (anonymous pages and file pages) are charged at allocation
> time and continue to hold a reference to the original memory cgroup
> until they are reclaimed.
> 
> File pages are more complex than anonymous pages as they can be shared
> among different memory cgroups and may persist beyond the lifespan of
> the memory cgroup. The long-term pinning of file pages to memory cgroups
> is a widespread issue that causes recurring problems in practical
> scenarios [3]. File pages remain unreclaimed for extended periods.
> Additionally, they are accessed by successive instances (second, third,
> fourth, etc.) of the same job, which is restarted into a new cgroup each
> time. As a result, unreclaimable dying memory cgroups accumulate,
> leading to memory wastage and significantly reducing the efficiency of
> page reclamation.

Very useful introduction to the problem. Thanks!

> ## Fundamentals
> 
> A folio will no longer pin its corresponding memory cgroup. It is
> necessary to ensure that the memory cgroup, or the lruvec associated
> with it, is not released while a user holds a pointer returned by
> folio_memcg() or folio_lruvec(). Users who are not concerned about the
> stability of the binding between the folio and its memory cgroup are
> required to hold the RCU read lock, or to acquire a reference to the
> memory cgroup associated with the folio, to prevent its release.
> However, some users of folio_lruvec() (i.e., of the lruvec lock) desire
> a stable binding between the folio and its corresponding memory cgroup.
> An approach is needed to ensure the stability of the binding while the
> lruvec lock is held, and to detect the situation of holding an incorrect
> lruvec lock when there is a race with memory cgroup reparenting. The
> following four steps are taken to achieve these goals.
> 
> 1. The first step is to identify all users of both functions
> (folio_memcg() and folio_lruvec()) who are not concerned about binding
> stability and implement appropriate measures (such as holding the RCU
> read lock or temporarily obtaining a reference to the memory cgroup for
> a brief period) to prevent the release of the memory cgroup.
> 
> 2. Secondly, the following refactoring of folio_lruvec_lock()
> demonstrates how to ensure the binding stability from the perspective of
> a folio_lruvec() user:
> 
> struct lruvec *folio_lruvec_lock(struct folio *folio)
> {
> 	struct lruvec *lruvec;
> 
> 	rcu_read_lock();
> retry:
> 	lruvec = folio_lruvec(folio);
> 	spin_lock(&lruvec->lru_lock);
> 	if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
> 		spin_unlock(&lruvec->lru_lock);
> 		goto retry;
> 	}
> 
> 	return lruvec;
> }
> 
> From the perspective of memory cgroup removal, the entire reparenting
> process (altering the binding relationship between a folio and its
> memory cgroup and moving the LRU lists to the parental memory cgroup)
> should be carried out under both the lruvec lock of the memory cgroup
> being removed and the lruvec lock of its parent.
> 
> 3. Thirdly, another lock that requires the same approach is the
> split-queue lock of THP.
> 
> 4. Finally, transfer the LRU pages to the object cgroup without holding
> a reference to the original memory cgroup.
> ```
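Side note for readers of this thread: the retry loop in
folio_lruvec_lock() above only works if the reparenting side holds both
lru locks while the binding is switched. Roughly something like this (a
sketch with made-up names, and the lock ordering is an assumption, not
the actual patch):

	/*
	 * Rough sketch, not the actual implementation: take the lru_lock
	 * of both the parent and the child while folio->memcg bindings
	 * are switched and the LRU lists are moved, so that a racing
	 * folio_lruvec_lock() caller observes the memcg mismatch and
	 * retries on the correct lruvec.
	 */
	static void lruvec_reparent(struct lruvec *child, struct lruvec *parent)
	{
		spin_lock(&parent->lru_lock);
		spin_lock_nested(&child->lru_lock, SINGLE_DEPTH_NESTING);

		/* ... move the LRU lists and redirect objcg->memcg here ... */

		spin_unlock(&child->lru_lock);
		spin_unlock(&parent->lru_lock);
	}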
> 
> And the details of the adaptation are below:
> 
> ```
> Similar to traditional LRU folios, in order to solve the dying memcg
> problem, we also need to reparent MGLRU folios to the parent memcg when
> the memcg is offlined.
> 
> However, there are the following challenges:
> 
> 1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations. The
> number of generations of the parent and child memcg may differ, so we
> cannot simply transfer MGLRU folios in the child memcg to the parent
> memcg as we do for traditional LRU folios.
> 2. The generation information is stored in folio->flags, but we cannot
> traverse these folios while holding the lru lock, otherwise it may cause
> a softlockup.
> 3. In walk_update_folio(), the gen of a folio and the corresponding lru
> size may be updated, but the folio is not immediately moved to the
> corresponding lru list. Therefore, there may be folios of different
> generations on one LRU list.
> 4. In lru_gen_del_folio(), the generation to which the folio belongs is
> found based on the generation information in folio->flags, and the
> corresponding LRU size is updated. Therefore, we need to update the lru
> size correctly during reparenting, otherwise the lru size may be updated
> incorrectly in lru_gen_del_folio().
> 
> Finally, this patch chose a compromise method, which is to splice the
> lru list in the child memcg onto the lru list of the same generation in
> the parent memcg during reparenting. And in order to ensure that the
> parent memcg has the same generations, we need to increase the number of
> generations in the parent memcg to MAX_NR_GENS before reparenting.
> 
> Of course, the same generation has different meanings in the parent and
> child memcg, so this will cause confusion in the hot and cold
> information of folios. But other than that, this method is simple
> enough, the lru size is correct, and there is no need to consider some
> concurrency issues (such as lru_gen_del_folio()).
> ```

Thank you, this is very useful. A high level overview of how a patch
series of this size is structured would be appreciated as well.
-- 
Michal Hocko
SUSE Labs