Re: [PATCH v5 0/9] mm: switch THP shrinker to list_lru

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Lance Yang <lance.yang@linux.dev>
To: hannes@cmpxchg.org
Cc: akpm@linux-foundation.org, david@kernel.org, ljs@kernel.org,
	shakeel.butt@linux.dev, mhocko@kernel.org, david@fromorbit.com,
	roman.gushchin@linux.dev, muchun.song@linux.dev,
	qi.zheng@linux.dev, yosry.ahmed@linux.dev, ziy@nvidia.com,
	liam@infradead.org, usama.arif@linux.dev, kas@kernel.org,
	vbabka@kernel.org, ryncsn@gmail.com, zaslonko@linux.ibm.com,
	gor@linux.ibm.com, baolin.wang@linux.alibaba.com,
	baohua@kernel.org, dev.jain@arm.com, lance.yang@linux.dev,
	npache@redhat.com, ryan.roberts@arm.com, cgroups@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 0/9] mm: switch THP shrinker to list_lru
Date: Mon,  1 Jun 2026 16:36:52 +0800	[thread overview]
Message-ID: <20260601083652.59539-1-lance.yang@linux.dev> (raw)
In-Reply-To: <20260527204757.2544958-1-hannes@cmpxchg.org>

Hi Johannes,

On Wed, May 27, 2026 at 04:45:07PM -0400, Johannes Weiner wrote:
>This is version 5 of switching the THP shrinker to list_lru.
>
>Core of the new version is the list_lru/set_shrinker_bit fix up front,
>which minimally affects later patches; and a rebase onto the latest
>mm-unstable - replaced alloc_swap_folio() with __swap_cache_alloc().
>
>The changes seemed small enough that *I chose to keep the review tags
>from v4*. Please shout if you object to this!
>
>Changes in v5:
>- patch 1 is a new fix for a very old, pre-existing set_shrinker_bit()
>  problem in list_lru, where the bit can be set on a dying child memcg
>  instead of the ancestor that actually received the item. Pointed out
>  by Usama Arif and Sashiko; fix it first to make it minimally
>  backportable and so the conversion is safe.
>- patches 6 and 9 adapt to that fix's new memcg-by-reference
>  lock_list_lru_of_memcg() signature
>- collapse_huge_page(): propagate folio_memcg_alloc_deferred() failure
>  as SCAN_ALLOC_HUGE_PAGE_FAIL instead of leaking SCAN_SUCCEED and
>  falsely reporting a successful MADV_COLLAPSE (Usama Arif, Sashiko)
>- deferred_split_isolate(): fix a UAF by reading folio state before
>  list_lru_isolate(); once removed, a racing folio_put() frees the
>  folio via the lockless list_empty() check while we still touch its
>  flags and stats (Sashiko)
>- rebased to mm-unstable of 2026-05-27, which simplifies the flatten
>  prep patch (now anon-only, as alloc_swap_folio() was folded into the
>  new __swap_cache_alloc()) and moves the swap-side
>  folio_memcg_alloc_deferred() hook into __swap_cache_alloc(). Kairui,
>  I would appreciate an eyeball on that.
>
>Changes in v4:
>- guard folio_memcg_alloc_deferred() with mem_cgroup_disabled() to fix
>  NULL deref in __memcg_list_lru_alloc() when booting with
>  cgroup_disable=memory (e.g., kdump capture kernel) -- reported and
>  tested by Mikhail Zaslonko on s390 and x86
>- flatten if (folio) branches in alloc_swap_folio() and alloc_anon_folio()
>  in a prep patch so the list_lru allocation additions are a clean minimal
>  diff (Lorenzo)
>- folio_memcg_alloc_deferred() moved out of alloc_charge_folio() into the
>  anon-only collapse_huge_page() path; collapse_file() shares that helper
>  but its pages don't go on the THP shrinker queue (David)
>- guard folio_memcg_alloc_deferred() with order > 1; mTHPs below order-2
>  can't be queued on the deferred split list (David)
>- make deferred_split_lru static, hide behind folio_memcg_alloc_deferred()
>  wrapper with GFP_KERNEL (Lorenzo)
>- rename l -> lru throughout huge_memory.c (Lorenzo)
>- kdoc for folio_memcg_list_lru_alloc() (Lorenzo)
>- list_lru_lock_irq()/unlock_irq()/add_irq() irq-disabling variants;
>  use list_lru_add_irq() in deferred_split_scan() (Lorenzo)
>- reorder shrinker_free() before list_lru_destroy() (Lorenzo)
>
>Changes in v3:
>- dedicated lockdep_key for irqsafe deferred_split_lru.lock (syzbot)
>- conditional list_lru ops in __folio_freeze_and_split_unmapped() (syzbot)
>- annotate runs of inscrutable false, NULL, false function arguments (David)
>- rename to folio_memcg_list_lru_alloc() (David)
>
>Changes in v2:
>- explicit rcu_read_lock() in __folio_freeze_and_split_unmapped() (Usama)
>- split out list_lru prep bits (Dave)
>
>The open-coded deferred split queue has issues. It's not NUMA-aware
>(when cgroup is enabled), and it's more complicated in the callsites
>interacting with it. Switching to list_lru fixes the NUMA problem and
>streamlines things. It also simplifies planned shrinker work.
>
>Patch 1 fixes a pre-existing list_lru bug where the shrinker bit is
>set on the caller's memcg rather than the ancestor whose sublist the
>item actually lands on after a walk-up. Standalone, backportable; the
>rest of the series depends on it.
>
>Patches 2-5 are cleanups and small refactors in list_lru code. They're
>basically independent, but make the THP shrinker conversion easier.
>
>Patch 6 extends the list_lru API to allow the caller to control the
>locking scope. The THP shrinker has private state it needs to keep
>synchronized with the LRU state.
>
>Patch 7 extends the list_lru API with a convenience helper to do
>list_lru head allocation (memcg_list_lru_alloc) when coming from a
>folio. Anon THPs are instantiated in several places, and with the
>folio reparenting patches pending, folio_memcg() access is now a more
>delicate dance. This avoids having to replicate that dance everywhere.
>
>Patch 8 flattens the alloc_anon_folio() retry loop so the next patch's
>list_lru hook lands as a clean addition rather than nested deep inside
>an if (folio) block.
>
>Patch 9 finally switches the deferred_split_queue to list_lru.

As the changelog above says, the old queue is per-memcg only, rather
than per-memcg-per-node. So reclaim on one node can still walk the whole
memcg queue and split underused THPs from other nodes in the same memcg.

But I think the new one can lose reclaim in the cgroup.memory=nokmem
case ...

With nokmem, the deferred shrinker can still run from memcg reclaim,
because it is SHRINKER_NONSLAB. But the list_lru is no longer per-memcg:

__list_lru_init() clears memcg_aware,

	if (mem_cgroup_kmem_disabled())
		memcg_aware = false;

so list_lru_from_memcg_idx() falls back to the shared node list:

static inline struct list_lru_one *
list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
{
	if (list_lru_memcg_aware(lru) && idx >= 0) {
[...]
	}
	return &lru->node[nid].lru;
}

That makes the shrinker bit unreliable. __list_lru_add() still sets the
bit on the memcg passed in, but only when the list goes from empty to
non-empty:

bool __list_lru_add(struct list_lru *lru, struct list_lru_one *l,
		    struct list_head *item, int nid,
		    struct mem_cgroup *memcg)
{
	if (list_empty(item)) {
[...]
		if (!l->nr_items++)
			set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
[...]
		return true;
	}
	return false;
}

If memcg A adds the first folio, A gets the bit. If memcg B later adds a
folio to the same shared list, B does not get a bit, because the list
was already non-empty.

So in the A-first/B-later case, reclaim from B may not call the deferred
shrinker at all. The shared list is scanned from memcg reclaim only if
reclaim runs from the memcg that has the bit, such as A here, or from
global reclaim :)

Anyway, only after the shared list is emptied does the next memcg to add
a folio get to be the one with the bit, IIUC :)

Hopefully I didn't miss somthing important ...

Cheers, Lance

next prev parent reply	other threads:[~2026-06-01  8:37 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-27 20:45 [PATCH v5 0/9] mm: switch THP shrinker to list_lru Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 1/9] mm: list_lru: fix set_shrinker_bit() call during race with cgroup deletion Johannes Weiner
2026-05-28 13:25   ` Usama Arif
2026-05-30  2:38   ` Wei Yang
2026-05-27 20:45 ` [PATCH v5 2/9] mm: list_lru: lock_list_lru_of_memcg() cannot return NULL if !skip_empty Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 3/9] mm: list_lru: deduplicate unlock_list_lru() Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 4/9] mm: list_lru: move list dead check to lock_list_lru_of_memcg() Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 5/9] mm: list_lru: deduplicate lock_list_lru() Johannes Weiner
2026-05-29  9:56   ` Wei Yang
2026-05-29 13:42     ` Johannes Weiner
2026-05-30  1:25       ` Wei Yang
2026-05-27 20:45 ` [PATCH v5 6/9] mm: list_lru: introduce caller locking for additions and deletions Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 7/9] mm: list_lru: introduce folio_memcg_list_lru_alloc() Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 8/9] mm: memory: flatten alloc_anon_folio() retry loop Johannes Weiner
2026-05-30  9:06   ` Dev Jain
2026-05-27 20:45 ` [PATCH v5 9/9] mm: switch deferred split shrinker to list_lru Johannes Weiner
2026-05-28  7:08   ` SeongJae Park
2026-05-28 14:03     ` Johannes Weiner
2026-05-28 13:32   ` Usama Arif
2026-05-28 14:02     ` Johannes Weiner
2026-05-28 15:31       ` Usama Arif
2026-05-29 17:33   ` Kairui Song
2026-05-31  8:00   ` Wei Yang
2026-06-01 10:39   ` Lance Yang
2026-06-01 11:09     ` Lance Yang
2026-06-01 13:21   ` Lance Yang
2026-06-01 18:17     ` Johannes Weiner
2026-06-01  8:36 ` Lance Yang [this message]
2026-06-02 21:46   ` [PATCH v5 0/9] mm: switch THP " Johannes Weiner
2026-06-03  4:44     ` Lance Yang
2026-06-03 11:41       ` Johannes Weiner
2026-06-03 11:53         ` Lance Yang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260601083652.59539-1-lance.yang@linux.dev \
    --to=lance.yang@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=cgroups@vger.kernel.org \
    --cc=david@fromorbit.com \
    --cc=david@kernel.org \
    --cc=dev.jain@arm.com \
    --cc=gor@linux.ibm.com \
    --cc=hannes@cmpxchg.org \
    --cc=kas@kernel.org \
    --cc=liam@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=mhocko@kernel.org \
    --cc=muchun.song@linux.dev \
    --cc=npache@redhat.com \
    --cc=qi.zheng@linux.dev \
    --cc=roman.gushchin@linux.dev \
    --cc=ryan.roberts@arm.com \
    --cc=ryncsn@gmail.com \
    --cc=shakeel.butt@linux.dev \
    --cc=usama.arif@linux.dev \
    --cc=vbabka@kernel.org \
    --cc=yosry.ahmed@linux.dev \
    --cc=zaslonko@linux.ibm.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.