Date: Tue, 2 Jun 2026 17:46:02 -0400
From: Johannes Weiner
To: Lance Yang
Cc: akpm@linux-foundation.org, david@kernel.org, ljs@kernel.org, shakeel.butt@linux.dev, mhocko@kernel.org, david@fromorbit.com, roman.gushchin@linux.dev, muchun.song@linux.dev, qi.zheng@linux.dev, yosry.ahmed@linux.dev, ziy@nvidia.com, liam@infradead.org, usama.arif@linux.dev, kas@kernel.org, vbabka@kernel.org, ryncsn@gmail.com, zaslonko@linux.ibm.com, gor@linux.ibm.com, baolin.wang@linux.alibaba.com, baohua@kernel.org, dev.jain@arm.com, npache@redhat.com, ryan.roberts@arm.com, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 0/9] mm: switch THP shrinker to list_lru
References: <20260527204757.2544958-1-hannes@cmpxchg.org> <20260601083652.59539-1-lance.yang@linux.dev>
In-Reply-To: <20260601083652.59539-1-lance.yang@linux.dev>

On Mon, Jun 01, 2026 at 04:36:52PM +0800, Lance Yang wrote:
> As the changelog above says, the old queue is per-memcg only, rather
> than per-memcg-per-node. So reclaim on one node can still walk the whole
> memcg queue and split underused THPs from other nodes in the same memcg.
>
> But I think the new one can lose reclaim in the cgroup.memory=nokmem
> case ...
>
> With nokmem, the deferred shrinker can still run from memcg reclaim,
> because it is SHRINKER_NONSLAB.
> But the list_lru is no longer per-memcg:
>
> __list_lru_init() clears memcg_aware,
>
> 	if (mem_cgroup_kmem_disabled())
> 		memcg_aware = false;
>
> so list_lru_from_memcg_idx() falls back to the shared node list:
>
> static inline struct list_lru_one *
> list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
> {
> 	if (list_lru_memcg_aware(lru) && idx >= 0) {
> 		[...]
> 	}
> 	return &lru->node[nid].lru;
> }
>
> That makes the shrinker bit unreliable. __list_lru_add() still sets the
> bit on the memcg passed in, but only when the list goes from empty to
> non-empty:
>
> bool __list_lru_add(struct list_lru *lru, struct list_lru_one *l,
> 		   struct list_head *item, int nid,
> 		   struct mem_cgroup *memcg)
> {
> 	if (list_empty(item)) {
> 		[...]
> 		if (!l->nr_items++)
> 			set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
> 		[...]
> 		return true;
> 	}
> 	return false;
> }
>
> If memcg A adds the first folio, A gets the bit. If memcg B later adds a
> folio to the same shared list, B does not get a bit, because the list
> was already non-empty.
>
> So in the A-first/B-later case, reclaim from B may not call the deferred
> shrinker at all. The shared list is scanned from memcg reclaim only if
> reclaim runs from the memcg that has the bit, such as A here, or from
> global reclaim :)
>
> Anyway, only after the shared list is emptied does the next memcg to add
> a folio get to be the one with the bit, IIUC :)

Sorry for the delay, this took me a bit to think about.

The shrinker code is a mess. I read it the same way you do.

And this is true for all list_lru users when nokmem is set: we just set
random nonsense shrinker bits.

HOWEVER, the generic shrinker code fixes that up by IGNORING random
shrinker bits like this when !memcg_kmem_online(). And shrinking
correctly happens only against the shared root queue when the reclaim
iterator walks root_mem_cgroup.

HOWEVER, the THP shrinker explicitly sets SHRINKER_NONSLAB, which in
turn overrides the previous override.
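To make the A-first/B-later case concrete, here is a minimal userspace model in C. It is a hypothetical sketch, not kernel code: struct model_shared_list, model_lru_add(), model_lru_del() and the MODEL_* names are invented for illustration. They stand in for one node's shared list under nokmem and mimic only the quoted "if (!l->nr_items++) set_shrinker_bit(...)" step; clearing of stale bits by the shrinker core is deliberately not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model (not kernel code) of one node's list_lru
 * under cgroup.memory=nokmem: all memcgs share a single list, but the
 * shrinker bit is still set on whichever memcg makes the list go from
 * empty to non-empty, as in the quoted __list_lru_add().
 */
enum { MODEL_MEMCG_A, MODEL_MEMCG_B, MODEL_NR_MEMCGS };

struct model_shared_list {
	long nr_items;                      /* folios on the shared node list */
	bool shrinker_bit[MODEL_NR_MEMCGS]; /* per-memcg shrinker bit, this node */
};

/* Adds one folio owned by @memcg to the shared list. */
static void model_lru_add(struct model_shared_list *l, int memcg)
{
	assert(memcg >= 0 && memcg < MODEL_NR_MEMCGS);
	/* Only the empty -> non-empty transition sets a bit. */
	if (!l->nr_items++)
		l->shrinker_bit[memcg] = true;
}

/*
 * Removes one folio. Bits are never cleared here; in this model a stale
 * bit simply stays set (the kernel's lazy clearing is not modeled).
 */
static void model_lru_del(struct model_shared_list *l)
{
	assert(l->nr_items > 0);
	l->nr_items--;
}
```

Adding a folio as A and then as B leaves only A's bit set. B gets a bit only once the list has drained and B performs the next empty-to-non-empty transition, which matches the reading above.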
So yes, there is a weirdness: we get the root cgroup invocation against
the shared queue, and then one more time triggered by that random memcg
bit.

The most direct fix is to just drop SHRINKER_NONSLAB. It declares
independence from kmem, which is no longer true.

Cleaning up the shrinker code is left for another day.

Andrew, if there are no objections, can you please fold this?

---

From 6787efabb9584824c196bf01c517d93aae3764c3 Mon Sep 17 00:00:00 2001
From: Johannes Weiner
Date: Tue, 2 Jun 2026 17:11:46 -0400
Subject: [PATCH] mm: switch deferred split shrinker to list_lru fix

Lance Yang points out a weirdness in the list_lru code with
cgroup.memory=nokmem: in this mode, list_lru collapses to a shared
per-node list that holds the folios, but __list_lru_add() still sets
the shrinker bit on the owning memcg.

Usually this is fine, because the generic shrinker code ignores these
random bits when !memcg_kmem_online(). But the THP shrinker still has
the SHRINKER_NONSLAB flag set, which specifically declares independence
from kmem. As a result, the shrinker fires twice per reclaim cycle:
once during the regular root cgroup scan, and then once more, triggered
from whichever memcg got the shrinker bit.

Drop the flag, since it's no longer true. The deferred_split shrinker
then behaves like every other list_lru-backed shrinker under nokmem,
including the non-kmem ones (zswap, workingset shadow_nodes): skipped
from memcg-internal reclaim, driven by global reclaim only.

This needs proper cleaning up on the shrinker and list_lru side, but
that's scope for a follow-up series. Just make it consistent now.
Reported-by: Lance Yang
Signed-off-by: Johannes Weiner
---
 mm/huge_memory.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 72f6caf0fec6..aef495891f8c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -956,8 +956,7 @@ int folio_memcg_alloc_deferred(struct folio *folio)
 static int __init thp_shrinker_init(void)
 {
 	deferred_split_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
-						 SHRINKER_MEMCG_AWARE |
-						 SHRINKER_NONSLAB,
+						 SHRINKER_MEMCG_AWARE,
 						 "thp-deferred_split");
 	if (!deferred_split_shrinker)
 		return -ENOMEM;
-- 
2.54.0
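The effect of the flag change can be sketched as a small decision model. This is a hypothetical userspace illustration, not the shrinker core: MODEL_SHRINKER_NONSLAB, model_call_from_memcg() and model_nokmem_invocations() are made-up names, and the flag value is arbitrary. Following the changelog, the model assumes that under nokmem the root_mem_cgroup pass always scans the shared queue, while a per-memcg pass honors a stray shrinker bit only if kmem is online or the shrinker declares SHRINKER_NONSLAB.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag value for this model only; not the kernel's bit. */
#define MODEL_SHRINKER_NONSLAB (1u << 0)

/*
 * Hypothetical: does the per-memcg pass call the shrinker for a memcg?
 * Mirrors the rule described above: bits are ignored when kmem is off,
 * unless the shrinker declares itself independent of kmem (NONSLAB).
 */
static bool model_call_from_memcg(unsigned int flags, bool kmem_online,
				  bool bit_set)
{
	if (!bit_set)
		return false;
	if (!kmem_online && !(flags & MODEL_SHRINKER_NONSLAB))
		return false;
	return true;
}

/*
 * Counts deferred-split shrinker invocations in one global reclaim cycle
 * under nokmem that walks root_mem_cgroup plus @nr_memcgs children, where
 * only child @bit_owner holds the stray shrinker bit.
 */
static int model_nokmem_invocations(unsigned int flags, int nr_memcgs,
				    int bit_owner)
{
	int calls = 1; /* root_mem_cgroup pass over the shared queue */

	assert(nr_memcgs >= 0);
	for (int i = 0; i < nr_memcgs; i++)
		if (model_call_from_memcg(flags, false, i == bit_owner))
			calls++;
	return calls;
}
```

With the flag, a global reclaim cycle over three child memcgs, one of which holds the stray bit, yields two invocations; with the flag dropped, only the root pass remains. With kmem online, a set bit is honored either way, so the model only changes behavior in the nokmem case.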