All of lore.kernel.org
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Lance Yang <lance.yang@linux.dev>
Cc: akpm@linux-foundation.org, david@kernel.org, ljs@kernel.org,
	shakeel.butt@linux.dev, mhocko@kernel.org, david@fromorbit.com,
	roman.gushchin@linux.dev, muchun.song@linux.dev,
	qi.zheng@linux.dev, yosry.ahmed@linux.dev, ziy@nvidia.com,
	liam@infradead.org, usama.arif@linux.dev, kas@kernel.org,
	vbabka@kernel.org, ryncsn@gmail.com, zaslonko@linux.ibm.com,
	gor@linux.ibm.com, baolin.wang@linux.alibaba.com,
	baohua@kernel.org, dev.jain@arm.com, npache@redhat.com,
	ryan.roberts@arm.com, cgroups@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 0/9] mm: switch THP shrinker to list_lru
Date: Tue, 2 Jun 2026 17:46:02 -0400	[thread overview]
Message-ID: <ah9PGv12mqai84ES@cmpxchg.org> (raw)
In-Reply-To: <20260601083652.59539-1-lance.yang@linux.dev>

On Mon, Jun 01, 2026 at 04:36:52PM +0800, Lance Yang wrote:
> As the changelog above says, the old queue is per-memcg only, rather
> than per-memcg-per-node. So reclaim on one node can still walk the whole
> memcg queue and split underused THPs from other nodes in the same memcg.
> 
> But I think the new one can lose reclaim in the cgroup.memory=nokmem
> case ...
> 
> With nokmem, the deferred shrinker can still run from memcg reclaim,
> because it is SHRINKER_NONSLAB. But the list_lru is no longer per-memcg:
> 
> __list_lru_init() clears memcg_aware,
> 
> 	if (mem_cgroup_kmem_disabled())
> 		memcg_aware = false;
> 
> so list_lru_from_memcg_idx() falls back to the shared node list:
> 
> static inline struct list_lru_one *
> list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
> {
> 	if (list_lru_memcg_aware(lru) && idx >= 0) {
> [...]
> 	}
> 	return &lru->node[nid].lru;
> }
> 
> That makes the shrinker bit unreliable. __list_lru_add() still sets the
> bit on the memcg passed in, but only when the list goes from empty to
> non-empty:
> 
> bool __list_lru_add(struct list_lru *lru, struct list_lru_one *l,
> 		    struct list_head *item, int nid,
> 		    struct mem_cgroup *memcg)
> {
> 	if (list_empty(item)) {
> [...]
> 		if (!l->nr_items++)
> 			set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
> [...]
> 		return true;
> 	}
> 	return false;
> }
> 
> If memcg A adds the first folio, A gets the bit. If memcg B later adds a
> folio to the same shared list, B does not get a bit, because the list
> was already non-empty.
> 
> So in the A-first/B-later case, reclaim from B may not call the deferred
> shrinker at all. The shared list is scanned from memcg reclaim only if
> reclaim runs from the memcg that has the bit, such as A here, or from
> global reclaim :)
> 
> Anyway, only after the shared list is emptied does the next memcg to add
> a folio get to be the one with the bit, IIUC :)

Sorry for the delay, this took me a bit to think about. The shrinker
code is a mess.

I read it the same way you do. And this is true for all list_lru users
when nokmem is set: we just set random nonsense shrinker bits.

HOWEVER, the generic shrinker code fixes that up by IGNORING random
shrinker bits like this when !memcg_kmem_online(). And shrinking
correctly happens only against the shared root queue when the reclaim
iterator walks root_mem_cgroup.

HOWEVER, the THP shrinker explicitly sets SHRINKER_NONSLAB, which in
turn overrides the previous override. So yes there is a weirdness: we
get the root cgroup invocation against the shared queue, and then one
more time triggered by that random memcg bit.

The most direct fix is to just drop SHRINKER_NONSLAB. It declares
independence from kmem, which is no longer true.

Cleaning up the shrinker code is left for another day.

Andrew, if there are no objections, can you please fold this?

---

From 6787efabb9584824c196bf01c517d93aae3764c3 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Tue, 2 Jun 2026 17:11:46 -0400
Subject: [PATCH] mm: switch deferred split shrinker to list_lru fix

Lance Yang points out a weirdness in the list_lru code with
cgroup.memory=nokmem: in this mode, list_lru collapses to a shared
per-node list that holds the folios, but __list_lru_add() still sets
the shrinker bit on the owning memcg.

Usually this is fine, because the generic shrinker code ignores these
random bits when !memcg_kmem_online(). But the THP shrinker still has
the SHRINKER_NONSLAB flag set, which specifically declares an
independence from kmem. As a result, the shrinker fires twice per
reclaim cycle: one during the regular root cgroup scan, and then one
more time triggered from whichever memcg got the shrinker bit.

Drop the flag, since it's no longer true. The deferred_split shrinker
then behaves like every other list_lru-backed shrinker under nokmem,
including the non-kmem ones (zswap, workingset shadow_nodes): skipped
from memcg-internal reclaim, driven by global reclaim only.

This needs proper cleaning up on the shrinker and list_lru side, but
that's scope for a follow-up series. Just make it consistent now.

Reported-by: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/huge_memory.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 72f6caf0fec6..aef495891f8c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -956,8 +956,7 @@ int folio_memcg_alloc_deferred(struct folio *folio)
 static int __init thp_shrinker_init(void)
 {
 	deferred_split_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
-						 SHRINKER_MEMCG_AWARE |
-						 SHRINKER_NONSLAB,
+						 SHRINKER_MEMCG_AWARE,
 						 "thp-deferred_split");
 	if (!deferred_split_shrinker)
 		return -ENOMEM;
-- 
2.54.0


  reply	other threads:[~2026-06-02 21:46 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-27 20:45 [PATCH v5 0/9] mm: switch THP shrinker to list_lru Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 1/9] mm: list_lru: fix set_shrinker_bit() call during race with cgroup deletion Johannes Weiner
2026-05-28 13:25   ` Usama Arif
2026-05-30  2:38   ` Wei Yang
2026-05-27 20:45 ` [PATCH v5 2/9] mm: list_lru: lock_list_lru_of_memcg() cannot return NULL if !skip_empty Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 3/9] mm: list_lru: deduplicate unlock_list_lru() Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 4/9] mm: list_lru: move list dead check to lock_list_lru_of_memcg() Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 5/9] mm: list_lru: deduplicate lock_list_lru() Johannes Weiner
2026-05-29  9:56   ` Wei Yang
2026-05-29 13:42     ` Johannes Weiner
2026-05-30  1:25       ` Wei Yang
2026-05-27 20:45 ` [PATCH v5 6/9] mm: list_lru: introduce caller locking for additions and deletions Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 7/9] mm: list_lru: introduce folio_memcg_list_lru_alloc() Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 8/9] mm: memory: flatten alloc_anon_folio() retry loop Johannes Weiner
2026-05-30  9:06   ` Dev Jain
2026-05-27 20:45 ` [PATCH v5 9/9] mm: switch deferred split shrinker to list_lru Johannes Weiner
2026-05-28  7:08   ` SeongJae Park
2026-05-28 14:03     ` Johannes Weiner
2026-05-28 13:32   ` Usama Arif
2026-05-28 14:02     ` Johannes Weiner
2026-05-28 15:31       ` Usama Arif
2026-05-29 17:33   ` Kairui Song
2026-05-31  8:00   ` Wei Yang
2026-06-01 10:39   ` Lance Yang
2026-06-01 11:09     ` Lance Yang
2026-06-01 13:21   ` Lance Yang
2026-06-01 18:17     ` Johannes Weiner
2026-06-01  8:36 ` [PATCH v5 0/9] mm: switch THP " Lance Yang
2026-06-02 21:46   ` Johannes Weiner [this message]
2026-06-03  4:44     ` Lance Yang
2026-06-03 11:41       ` Johannes Weiner
2026-06-03 11:53         ` Lance Yang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ah9PGv12mqai84ES@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=cgroups@vger.kernel.org \
    --cc=david@fromorbit.com \
    --cc=david@kernel.org \
    --cc=dev.jain@arm.com \
    --cc=gor@linux.ibm.com \
    --cc=kas@kernel.org \
    --cc=lance.yang@linux.dev \
    --cc=liam@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=mhocko@kernel.org \
    --cc=muchun.song@linux.dev \
    --cc=npache@redhat.com \
    --cc=qi.zheng@linux.dev \
    --cc=roman.gushchin@linux.dev \
    --cc=ryan.roberts@arm.com \
    --cc=ryncsn@gmail.com \
    --cc=shakeel.butt@linux.dev \
    --cc=usama.arif@linux.dev \
    --cc=vbabka@kernel.org \
    --cc=yosry.ahmed@linux.dev \
    --cc=zaslonko@linux.ibm.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.