Linux cgroups development
 help / color / mirror / Atom feed
From: Lance Yang <lance.yang@linux.dev>
To: hannes@cmpxchg.org
Cc: akpm@linux-foundation.org, david@kernel.org, ljs@kernel.org,
	shakeel.butt@linux.dev, mhocko@kernel.org, david@fromorbit.com,
	roman.gushchin@linux.dev, muchun.song@linux.dev,
	qi.zheng@linux.dev, yosry.ahmed@linux.dev, ziy@nvidia.com,
	liam@infradead.org, usama.arif@linux.dev, kas@kernel.org,
	vbabka@kernel.org, ryncsn@gmail.com, zaslonko@linux.ibm.com,
	gor@linux.ibm.com, baolin.wang@linux.alibaba.com,
	baohua@kernel.org, dev.jain@arm.com, lance.yang@linux.dev,
	npache@redhat.com, ryan.roberts@arm.com, cgroups@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 0/9] mm: switch THP shrinker to list_lru
Date: Mon,  1 Jun 2026 16:36:52 +0800	[thread overview]
Message-ID: <20260601083652.59539-1-lance.yang@linux.dev> (raw)
In-Reply-To: <20260527204757.2544958-1-hannes@cmpxchg.org>

Hi Johannes,

On Wed, May 27, 2026 at 04:45:07PM -0400, Johannes Weiner wrote:
>This is version 5 of switching the THP shrinker to list_lru.
>
>Core of the new version is the list_lru/set_shrinker_bit fix up front,
>which minimally affects later patches; and a rebase onto the latest
>mm-unstable - replaced alloc_swap_folio() with __swap_cache_alloc().
>
>The changes seemed small enough that *I chose to keep the review tags
>from v4*. Please shout if you object to this!
>
>Changes in v5:
>- patch 1 is a new fix for a very old, pre-existing set_shrinker_bit()
>  problem in list_lru, where the bit can be set on a dying child memcg
>  instead of the ancestor that actually received the item. Pointed out
>  by Usama Arif and Sashiko; fix it first to make it minimally
>  backportable and so the conversion is safe.
>- patches 6 and 9 adapt to that fix's new memcg-by-reference
>  lock_list_lru_of_memcg() signature
>- collapse_huge_page(): propagate folio_memcg_alloc_deferred() failure
>  as SCAN_ALLOC_HUGE_PAGE_FAIL instead of leaking SCAN_SUCCEED and
>  falsely reporting a successful MADV_COLLAPSE (Usama Arif, Sashiko)
>- deferred_split_isolate(): fix a UAF by reading folio state before
>  list_lru_isolate(); once removed, a racing folio_put() frees the
>  folio via the lockless list_empty() check while we still touch its
>  flags and stats (Sashiko)
>- rebased to mm-unstable of 2026-05-27, which simplifies the flatten
>  prep patch (now anon-only, as alloc_swap_folio() was folded into the
>  new __swap_cache_alloc()) and moves the swap-side
>  folio_memcg_alloc_deferred() hook into __swap_cache_alloc(). Kairui,
>  I would appreciate an eyeball on that.
>
>Changes in v4:
>- guard folio_memcg_alloc_deferred() with mem_cgroup_disabled() to fix
>  NULL deref in __memcg_list_lru_alloc() when booting with
>  cgroup_disable=memory (e.g., kdump capture kernel) -- reported and
>  tested by Mikhail Zaslonko on s390 and x86
>- flatten if (folio) branches in alloc_swap_folio() and alloc_anon_folio()
>  in a prep patch so the list_lru allocation additions are a clean minimal
>  diff (Lorenzo)
>- folio_memcg_alloc_deferred() moved out of alloc_charge_folio() into the
>  anon-only collapse_huge_page() path; collapse_file() shares that helper
>  but its pages don't go on the THP shrinker queue (David)
>- guard folio_memcg_alloc_deferred() with order > 1; mTHPs below order-2
>  can't be queued on the deferred split list (David)
>- make deferred_split_lru static, hide behind folio_memcg_alloc_deferred()
>  wrapper with GFP_KERNEL (Lorenzo)
>- rename l -> lru throughout huge_memory.c (Lorenzo)
>- kdoc for folio_memcg_list_lru_alloc() (Lorenzo)
>- list_lru_lock_irq()/unlock_irq()/add_irq() irq-disabling variants;
>  use list_lru_add_irq() in deferred_split_scan() (Lorenzo)
>- reorder shrinker_free() before list_lru_destroy() (Lorenzo)
>
>Changes in v3:
>- dedicated lockdep_key for irqsafe deferred_split_lru.lock (syzbot)
>- conditional list_lru ops in __folio_freeze_and_split_unmapped() (syzbot)
>- annotate runs of inscrutable false, NULL, false function arguments (David)
>- rename to folio_memcg_list_lru_alloc() (David)
>
>Changes in v2:
>- explicit rcu_read_lock() in __folio_freeze_and_split_unmapped() (Usama)
>- split out list_lru prep bits (Dave)
>
>The open-coded deferred split queue has issues. It's not NUMA-aware
>(when cgroup is enabled), and it's more complicated in the callsites
>interacting with it. Switching to list_lru fixes the NUMA problem and
>streamlines things. It also simplifies planned shrinker work.
>
>Patch 1 fixes a pre-existing list_lru bug where the shrinker bit is
>set on the caller's memcg rather than the ancestor whose sublist the
>item actually lands on after a walk-up. Standalone, backportable; the
>rest of the series depends on it.
>
>Patches 2-5 are cleanups and small refactors in list_lru code. They're
>basically independent, but make the THP shrinker conversion easier.
>
>Patch 6 extends the list_lru API to allow the caller to control the
>locking scope. The THP shrinker has private state it needs to keep
>synchronized with the LRU state.
>
>Patch 7 extends the list_lru API with a convenience helper to do
>list_lru head allocation (memcg_list_lru_alloc) when coming from a
>folio. Anon THPs are instantiated in several places, and with the
>folio reparenting patches pending, folio_memcg() access is now a more
>delicate dance. This avoids having to replicate that dance everywhere.
>
>Patch 8 flattens the alloc_anon_folio() retry loop so the next patch's
>list_lru hook lands as a clean addition rather than nested deep inside
>an if (folio) block.
>
>Patch 9 finally switches the deferred_split_queue to list_lru.

As the changelog above says, the old queue is per-memcg only, rather
than per-memcg-per-node. So reclaim on one node can still walk the whole
memcg queue and split underused THPs from other nodes in the same memcg.

But I think the new one can lose reclaim in the cgroup.memory=nokmem
case ...

With nokmem, the deferred shrinker can still run from memcg reclaim,
because it is SHRINKER_NONSLAB. But the list_lru is no longer per-memcg:

__list_lru_init() clears memcg_aware,

	if (mem_cgroup_kmem_disabled())
		memcg_aware = false;

so list_lru_from_memcg_idx() falls back to the shared node list:

static inline struct list_lru_one *
list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
{
	if (list_lru_memcg_aware(lru) && idx >= 0) {
[...]
	}
	return &lru->node[nid].lru;
}

That makes the shrinker bit unreliable. __list_lru_add() still sets the
bit on the memcg passed in, but only when the list goes from empty to
non-empty:

bool __list_lru_add(struct list_lru *lru, struct list_lru_one *l,
		    struct list_head *item, int nid,
		    struct mem_cgroup *memcg)
{
	if (list_empty(item)) {
[...]
		if (!l->nr_items++)
			set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
[...]
		return true;
	}
	return false;
}

If memcg A adds the first folio, A gets the bit. If memcg B later adds a
folio to the same shared list, B does not get a bit, because the list
was already non-empty.

So in the A-first/B-later case, reclaim from B may not call the deferred
shrinker at all. The shared list is scanned from memcg reclaim only if
reclaim runs from the memcg that has the bit, such as A here, or from
global reclaim :)

Anyway, only after the shared list is emptied does the next memcg to add
a folio get to be the one with the bit, IIUC :)

Hopefully I didn't miss somthing important ...

Cheers, Lance

      parent reply	other threads:[~2026-06-01  8:37 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-27 20:45 [PATCH v5 0/9] mm: switch THP shrinker to list_lru Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 1/9] mm: list_lru: fix set_shrinker_bit() call during race with cgroup deletion Johannes Weiner
2026-05-28 13:25   ` Usama Arif
2026-05-30  2:38   ` Wei Yang
2026-05-27 20:45 ` [PATCH v5 2/9] mm: list_lru: lock_list_lru_of_memcg() cannot return NULL if !skip_empty Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 3/9] mm: list_lru: deduplicate unlock_list_lru() Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 4/9] mm: list_lru: move list dead check to lock_list_lru_of_memcg() Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 5/9] mm: list_lru: deduplicate lock_list_lru() Johannes Weiner
2026-05-29  9:56   ` Wei Yang
2026-05-29 13:42     ` Johannes Weiner
2026-05-30  1:25       ` Wei Yang
2026-05-27 20:45 ` [PATCH v5 6/9] mm: list_lru: introduce caller locking for additions and deletions Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 7/9] mm: list_lru: introduce folio_memcg_list_lru_alloc() Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 8/9] mm: memory: flatten alloc_anon_folio() retry loop Johannes Weiner
2026-05-30  9:06   ` Dev Jain
2026-05-27 20:45 ` [PATCH v5 9/9] mm: switch deferred split shrinker to list_lru Johannes Weiner
2026-05-28  7:08   ` SeongJae Park
2026-05-28 14:03     ` Johannes Weiner
2026-05-28 13:32   ` Usama Arif
2026-05-28 14:02     ` Johannes Weiner
2026-05-28 15:31       ` Usama Arif
2026-05-29 17:33   ` Kairui Song
2026-05-31  8:00   ` Wei Yang
2026-06-01 10:39   ` Lance Yang
2026-06-01 11:09     ` Lance Yang
2026-06-01 13:21   ` Lance Yang
2026-06-01 18:17     ` Johannes Weiner
2026-06-01  8:36 ` Lance Yang [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260601083652.59539-1-lance.yang@linux.dev \
    --to=lance.yang@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=cgroups@vger.kernel.org \
    --cc=david@fromorbit.com \
    --cc=david@kernel.org \
    --cc=dev.jain@arm.com \
    --cc=gor@linux.ibm.com \
    --cc=hannes@cmpxchg.org \
    --cc=kas@kernel.org \
    --cc=liam@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=mhocko@kernel.org \
    --cc=muchun.song@linux.dev \
    --cc=npache@redhat.com \
    --cc=qi.zheng@linux.dev \
    --cc=roman.gushchin@linux.dev \
    --cc=ryan.roberts@arm.com \
    --cc=ryncsn@gmail.com \
    --cc=shakeel.butt@linux.dev \
    --cc=usama.arif@linux.dev \
    --cc=vbabka@kernel.org \
    --cc=yosry.ahmed@linux.dev \
    --cc=zaslonko@linux.ibm.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox