From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mail-qk1-f179.google.com (mail-qk1-f179.google.com [209.85.222.179])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id DCCB082899
	for <cgroups@vger.kernel.org>; Tue,  2 Jun 2026 21:46:08 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.222.179
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1780436772; cv=none; b=jglnrvmXbpAwAZ8nCbMnruo9ijfTO8b/QtBUBKSCj6hdsvw2wxCy9qTc6BqF6izCnzJHSW+/MHvsfxTUM7NMxKd8CMo3+IsG3suI3hCMKzeHdprUA1dPPtlhdbL+GnfecM/Fr9CqR37u9T3DDgjvatDxBJTUv7PwxNQeg5kxCJ4=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1780436772; c=relaxed/simple;
	bh=0l0dcw5CApKRg7LKg7VLrffJIS2BXOTOXYGn5heCHH0=;
	h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version:
	 Content-Type:Content-Disposition:In-Reply-To; b=E/3x/UdYUp/7Nq6oVDEZN852qvu/MCzaL8hR24+LQPpD7qbRtrbtUZ2w+bVdFojv3651kWeG7BMUhdeIqPN4aECAPmHLbKZtARDfeRqIs6p3CICapVzYykqmi/0A9oM4Y6nneMs/y0EgWpj+ubaAS8b+buAGPt5nJ7d2/SqekxI=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=cmpxchg.org; spf=pass smtp.mailfrom=cmpxchg.org; dkim=pass (2048-bit key) header.d=cmpxchg.org header.i=@cmpxchg.org header.b=q4pP0rmd; arc=none smtp.client-ip=209.85.222.179
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=cmpxchg.org
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=cmpxchg.org
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=cmpxchg.org header.i=@cmpxchg.org header.b="q4pP0rmd"
Received: by mail-qk1-f179.google.com with SMTP id af79cd13be357-91588056619so44831085a.2
        for <cgroups@vger.kernel.org>; Tue, 02 Jun 2026 14:46:08 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=cmpxchg.org; s=google; t=1780436768; x=1781041568; darn=vger.kernel.org;
        h=in-reply-to:content-disposition:mime-version:references:message-id
         :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to;
        bh=I/4Vsto/1l7Al8DkWZsvzlSoLvXQ7vHOlLVSRBNyTZU=;
        b=q4pP0rmd9KxzpuJ/SInQCeMzXihVtfDjbDMkyjE5H1Pe082PpbSrDPFlqvAhqNEsqL
         u90MQdqagODPkBDRKJ7Xu+q3DJqV4I2dUI7Ds+Zo9+yEj0xGc0FBjiJRZ8B9xn+mgUt1
         NtX8EmGkRFEUCIe2TCqP0uVcsKPfjn76kkaZ0438ZaKxmt5dytE7noDSZnXQU2VfRJZ0
         luqYCUh3BL6nL1eGb4Nr+xNesRbaUnM/V1QUNVvI/rb+A1gpbmxrkWdP9rRthcHs01vH
         gw/+iyXEhf9fdlUa//AFJbP6VyhasFNnyolZBhG2nBE6q2+CzG5iu6DgQedEWUfPcogm
         62TQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20251104; t=1780436768; x=1781041568;
        h=in-reply-to:content-disposition:mime-version:references:message-id
         :subject:cc:to:from:date:x-gm-gg:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=I/4Vsto/1l7Al8DkWZsvzlSoLvXQ7vHOlLVSRBNyTZU=;
        b=aoKA0xG6saDy9sWQPoTCjZZeyJdgrC6N7HnPIMNAUJ/USEfA4hGPbgm4tk7Gy5VTfn
         5GXLS6IaRGOyP0eTf/sWr9isvzJd7L8DOOiTxHIxu1gXjD5zvBCSjSTAkVFRoYb9JBQs
         z3C4eOTPRjJxXEC3o9lYRRqyjKgWOC0tp4zGejSDGtWkVYQGU33D2YP/kzGyB1JxplqL
         m82ncifhFERfeXuRxe2GrtpclFFo+TWuaFGnQqJmEq380fh0n6E17fc9+uTN7hFz8IQr
         73PzkN8Kjfgb4xXVUwkaWUwkuTjnrX7eqCUVK3+aNp123WfC2mkfCwBMpVXSA1/RuFTV
         wsFg==
X-Forwarded-Encrypted: i=1; AFNElJ8ovGuRIFg9HbXShHovnuKuFqah1lhjV3zRz0Q8l1yjAo32FhI6mjDqWdZjmRyxL7Ixh3Ul4OiX@vger.kernel.org
X-Gm-Message-State: AOJu0YwNGO9P66suldFvDUU6TMjljPI6vynDzb4N2iDMWwVaN2XNPtEn
	LPmggAUo/BkuEq48B81lms+mE0jlwsa1IqeW5Tf+98wenJEdos+yBV6PV/DcKaPoNO0=
X-Gm-Gg: Acq92OHXK/IpQgq0AO0cNiPT4Mq7en1+s4eg56CzXbFNHoI5085nqgyyFfwvTxh1Vdg
	bVkN92DP/9nDKjP0KCPpTdtzu1xy4VXhn48Xb/a4EMxJOUoKRlUfMoCcKYsheBzbdmfzi+Mtv5E
	zmsrMJ8yAtj5WPExMD/rGX7PzDUxB9W5XNXULLY/BkNP/vRYgUw7ygrArKl2YYgkK/uA36yTL46
	KdQVkLu87JxjWimZaoHQ9Lz+d22+6jcX2xwKmDFRT7c54I9AmjbxnrEeg5o3YX9d8d0WOj38qis
	njkQmxCIyHLG9idfs3obr2cnlhn9ayAgp6xJKkOVWGFuUIc8IORwZXW8WdTSVKvoBOFkoqbGAeg
	Foaj56Xv/1nUz1cmipShU/DcvuR0PlOp1ekLWjGv4dxhhmqtnaAqO06XMwZbG+leeQ5WjioHTJe
	YRrdd0YIsptESDIL8TzFdGElIHrJNui2a5
X-Received: by 2002:a05:620a:271d:b0:915:675d:a2d with SMTP id af79cd13be357-9158a858a41mr159358785a.51.1780436767594;
        Tue, 02 Jun 2026 14:46:07 -0700 (PDT)
Received: from localhost ([2603:7001:f100:500:365a:60ff:fe62:ff29])
        by smtp.gmail.com with ESMTPSA id af79cd13be357-9158a3e0d55sm57779185a.43.2026.06.02.14.46.06
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 02 Jun 2026 14:46:06 -0700 (PDT)
Date: Tue, 2 Jun 2026 17:46:02 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Lance Yang <lance.yang@linux.dev>
Cc: akpm@linux-foundation.org, david@kernel.org, ljs@kernel.org,
	shakeel.butt@linux.dev, mhocko@kernel.org, david@fromorbit.com,
	roman.gushchin@linux.dev, muchun.song@linux.dev, qi.zheng@linux.dev,
	yosry.ahmed@linux.dev, ziy@nvidia.com, liam@infradead.org,
	usama.arif@linux.dev, kas@kernel.org, vbabka@kernel.org,
	ryncsn@gmail.com, zaslonko@linux.ibm.com, gor@linux.ibm.com,
	baolin.wang@linux.alibaba.com, baohua@kernel.org, dev.jain@arm.com,
	npache@redhat.com, ryan.roberts@arm.com, cgroups@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 0/9] mm: switch THP shrinker to list_lru
Message-ID: <ah9PGv12mqai84ES@cmpxchg.org>
References: <20260527204757.2544958-1-hannes@cmpxchg.org>
 <20260601083652.59539-1-lance.yang@linux.dev>
Precedence: bulk
X-Mailing-List: cgroups@vger.kernel.org
List-Id: <cgroups.vger.kernel.org>
List-Subscribe: <mailto:cgroups+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:cgroups+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20260601083652.59539-1-lance.yang@linux.dev>

On Mon, Jun 01, 2026 at 04:36:52PM +0800, Lance Yang wrote:
> As the changelog above says, the old queue is per-memcg only, rather
> than per-memcg-per-node. So reclaim on one node can still walk the whole
> memcg queue and split underused THPs from other nodes in the same memcg.
> 
> But I think the new one can lose reclaim in the cgroup.memory=nokmem
> case ...
> 
> With nokmem, the deferred shrinker can still run from memcg reclaim,
> because it is SHRINKER_NONSLAB. But the list_lru is no longer per-memcg:
> 
> __list_lru_init() clears memcg_aware,
> 
> 	if (mem_cgroup_kmem_disabled())
> 		memcg_aware = false;
> 
> so list_lru_from_memcg_idx() falls back to the shared node list:
> 
> static inline struct list_lru_one *
> list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
> {
> 	if (list_lru_memcg_aware(lru) && idx >= 0) {
> [...]
> 	}
> 	return &lru->node[nid].lru;
> }
> 
> That makes the shrinker bit unreliable. __list_lru_add() still sets the
> bit on the memcg passed in, but only when the list goes from empty to
> non-empty:
> 
> bool __list_lru_add(struct list_lru *lru, struct list_lru_one *l,
> 		    struct list_head *item, int nid,
> 		    struct mem_cgroup *memcg)
> {
> 	if (list_empty(item)) {
> [...]
> 		if (!l->nr_items++)
> 			set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
> [...]
> 		return true;
> 	}
> 	return false;
> }
> 
> If memcg A adds the first folio, A gets the bit. If memcg B later adds a
> folio to the same shared list, B does not get a bit, because the list
> was already non-empty.
> 
> So in the A-first/B-later case, reclaim from B may not call the deferred
> shrinker at all. The shared list is scanned from memcg reclaim only if
> reclaim runs from the memcg that has the bit, such as A here, or from
> global reclaim :)
> 
> Anyway, only after the shared list is emptied does the next memcg to add
> a folio get to be the one with the bit, IIUC :)

Sorry for the delay, this took me a bit to think about. The shrinker
code is a mess.

I read it the same way you do. And this is true for all list_lru users
when nokmem is set: we just set random nonsense shrinker bits.

HOWEVER, the generic shrinker code fixes that up by IGNORING random
shrinker bits like this when !memcg_kmem_online(). And shrinking
correctly happens only against the shared root queue when the reclaim
iterator walks root_mem_cgroup.

HOWEVER, the THP shrinker explicitly sets SHRINKER_NONSLAB, which in
turn overrides the previous override. So yes, there is a weirdness: we
get the root cgroup invocation against the shared queue, and then one
more time triggered by that random memcg bit.
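
To make the sequence above concrete, here is a small userspace model.
It is only a sketch under stated assumptions, not kernel code: struct
model_lru, model_lru_add() and model_reclaim_invocations() are made-up
names, memcg index 0 stands in for root_mem_cgroup, and the bit check
is a simplified rendering of what the memcg shrinker walk does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the behavior discussed, not kernel code. */
enum { MODEL_NR_MEMCGS = 3 };	/* index 0 stands in for root_mem_cgroup */

struct model_lru {
	bool memcg_aware;			/* false under cgroup.memory=nokmem */
	long shared_items;			/* the shared per-node list */
	long memcg_items[MODEL_NR_MEMCGS];	/* per-memcg lists when memcg_aware */
	bool shrinker_bit[MODEL_NR_MEMCGS];	/* one shrinker bit per memcg */
};

/*
 * Mirrors the quoted __list_lru_add() logic: the bit is set on the memcg
 * passed in, but only when the chosen list goes from empty to non-empty.
 */
static void model_lru_add(struct model_lru *lru, int memcg)
{
	long *nr;

	assert(memcg >= 0 && memcg < MODEL_NR_MEMCGS);
	nr = lru->memcg_aware ? &lru->memcg_items[memcg] : &lru->shared_items;
	if (!(*nr)++)
		lru->shrinker_bit[memcg] = true;
}

/*
 * Count shrinker invocations for one reclaim pass over all memcgs.
 * Assumption: the root pass always calls the shrinker; non-root memcgs
 * call it only if their bit is set, and without kmem only if the
 * shrinker declares itself NONSLAB.
 */
static int model_reclaim_invocations(const struct model_lru *lru,
				     bool kmem_online, bool nonslab)
{
	int memcg, calls = 0;

	for (memcg = 0; memcg < MODEL_NR_MEMCGS; memcg++) {
		if (memcg == 0) {
			calls++;	/* root scan against the shared queue */
			continue;
		}
		if (!lru->shrinker_bit[memcg])
			continue;
		if (!kmem_online && !nonslab)
			continue;	/* stray bit ignored */
		calls++;
	}
	return calls;
}
```

In this model, with memcg_aware cleared, adding from memcg 1 and then
from memcg 2 leaves the bit only on memcg 1. A reclaim pass then counts
two invocations with nonslab set, and one with it cleared.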

The most direct fix is to just drop SHRINKER_NONSLAB. It declares
independence from kmem, which is no longer true.

Cleaning up the shrinker code is left for another day.

Andrew, if there are no objections, can you please fold this?

---

>From 6787efabb9584824c196bf01c517d93aae3764c3 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Tue, 2 Jun 2026 17:11:46 -0400
Subject: [PATCH] mm: switch deferred split shrinker to list_lru fix

Lance Yang points out a weirdness in the list_lru code with
cgroup.memory=nokmem: in this mode, list_lru collapses to a shared
per-node list that holds the folios, but __list_lru_add() still sets
the shrinker bit on the owning memcg.

Usually this is fine, because the generic shrinker code ignores these
random bits when !memcg_kmem_online(). But the THP shrinker still has
the SHRINKER_NONSLAB flag set, which specifically declares an
independence from kmem. As a result, the shrinker fires twice per
reclaim cycle: once during the regular root cgroup scan, and then one
more time triggered from whichever memcg got the shrinker bit.

Drop the flag, since it's no longer true. The deferred_split shrinker
then behaves like every other list_lru-backed shrinker under nokmem,
including the non-kmem ones (zswap, workingset shadow_nodes): skipped
from memcg-internal reclaim, driven by global reclaim only.

This needs proper cleaning up on the shrinker and list_lru side, but
that's scope for a follow-up series. Just make it consistent now.

Reported-by: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/huge_memory.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 72f6caf0fec6..aef495891f8c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -956,8 +956,7 @@ int folio_memcg_alloc_deferred(struct folio *folio)
 static int __init thp_shrinker_init(void)
 {
 	deferred_split_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
-						 SHRINKER_MEMCG_AWARE |
-						 SHRINKER_NONSLAB,
+						 SHRINKER_MEMCG_AWARE,
 						 "thp-deferred_split");
 	if (!deferred_split_shrinker)
 		return -ENOMEM;
-- 
2.54.0