All of lore.kernel.org
 help / color / mirror / Atom feed
From: Lance Yang <lance.yang@linux.dev>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: akpm@linux-foundation.org, david@kernel.org, ljs@kernel.org,
	shakeel.butt@linux.dev, mhocko@kernel.org, david@fromorbit.com,
	roman.gushchin@linux.dev, muchun.song@linux.dev,
	qi.zheng@linux.dev, yosry.ahmed@linux.dev, ziy@nvidia.com,
	liam@infradead.org, usama.arif@linux.dev, kas@kernel.org,
	vbabka@kernel.org, ryncsn@gmail.com, zaslonko@linux.ibm.com,
	gor@linux.ibm.com, baolin.wang@linux.alibaba.com,
	baohua@kernel.org, dev.jain@arm.com, npache@redhat.com,
	ryan.roberts@arm.com, cgroups@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 0/9] mm: switch THP shrinker to list_lru
Date: Wed, 3 Jun 2026 19:53:17 +0800	[thread overview]
Message-ID: <4b2ebbf7-8c5e-45ca-a17c-111f68e2324c@linux.dev> (raw)
In-Reply-To: <aiAS7GxsffqXWILD@cmpxchg.org>



On 2026/6/3 19:41, Johannes Weiner wrote:
> On Wed, Jun 03, 2026 at 12:44:26PM +0800, Lance Yang wrote:
>>
>> On Tue, Jun 02, 2026 at 05:46:02PM -0400, Johannes Weiner wrote:
>>> On Mon, Jun 01, 2026 at 04:36:52PM +0800, Lance Yang wrote:
>>>> As the changelog above says, the old queue is per-memcg only, rather
>>>> than per-memcg-per-node. So reclaim on one node can still walk the whole
>>>> memcg queue and split underused THPs from other nodes in the same memcg.
>>>>
>>>> But I think the new one can lose reclaim in the cgroup.memory=nokmem
>>>> case ...
>>>>
>>>> With nokmem, the deferred shrinker can still run from memcg reclaim,
>>>> because it is SHRINKER_NONSLAB. But the list_lru is no longer per-memcg:
>>>>
>>>> __list_lru_init() clears memcg_aware,
>>>>
>>>> 	if (mem_cgroup_kmem_disabled())
>>>> 		memcg_aware = false;
>>>>
>>>> so list_lru_from_memcg_idx() falls back to the shared node list:
>>>>
>>>> static inline struct list_lru_one *
>>>> list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
>>>> {
>>>> 	if (list_lru_memcg_aware(lru) && idx >= 0) {
>>>> [...]
>>>> 	}
>>>> 	return &lru->node[nid].lru;
>>>> }
>>>>
>>>> That makes the shrinker bit unreliable. __list_lru_add() still sets the
>>>> bit on the memcg passed in, but only when the list goes from empty to
>>>> non-empty:
>>>>
>>>> bool __list_lru_add(struct list_lru *lru, struct list_lru_one *l,
>>>> 		    struct list_head *item, int nid,
>>>> 		    struct mem_cgroup *memcg)
>>>> {
>>>> 	if (list_empty(item)) {
>>>> [...]
>>>> 		if (!l->nr_items++)
>>>> 			set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
>>>> [...]
>>>> 		return true;
>>>> 	}
>>>> 	return false;
>>>> }
>>>>
>>>> If memcg A adds the first folio, A gets the bit. If memcg B later adds a
>>>> folio to the same shared list, B does not get a bit, because the list
>>>> was already non-empty.
>>>>
>>>> So in the A-first/B-later case, reclaim from B may not call the deferred
>>>> shrinker at all. The shared list is scanned from memcg reclaim only if
>>>> reclaim runs from the memcg that has the bit, such as A here, or from
>>>> global reclaim :)
>>>>
>>>> Anyway, only after the shared list is emptied does the next memcg to add
>>>> a folio get to be the one with the bit, IIUC :)
>>>
>>> Sorry for the delay, this took me a bit to think about. The shrinker
>>> code is a mess.
>>>
>>> I read it the same way you do. And this is true for all list_lru users
>>> when nokmem is set: we just set random nonsense shrinker bits.
>>>
>>> HOWEVER, the generic shrinker code fixes that up by IGNORING random
>>> shrinker bits like this when !memcg_kmem_online(). And shrinking
>>> correctly happens only against the shared root queue when the reclaim
>>> iterator walks root_mem_cgroup.
>>>
>>> HOWEVER, the THP shrinker explicitly sets SHRINKER_NONSLAB, which in
>>> turn overrides the previous override. So yes there is a weirdness: we
>>> get the root cgroup invocation against the shared queue, and then one
>>> more time triggered by that random memcg bit.
>>>
>>> The most direct fix is to just drop SHRINKER_NONSLAB. It declares
>>> independence from kmem, which is no longer true.
>>>
>>> Cleaning up the shrinker code is left for another day.
>>
>> Thanks for working on this!
>>
>> Wondering if this fix trades one problem for another, though ...
>>
>> Before this series, the deferred split shrinker had a real per-memcg
>> queue. Even with cgroup.memory=nokmem, memcg reclaim could still scan
>> that memcg's own deferred_split_queue:
>>
>> memcg reclaim -> deferred split shrinker -> sc->memcg->deferred_split_queue
>>
>> With the fix, nokmem + w/o SHRINKER_NONSLAB falls back to a
>> non-memcg-aware shrinker:
>>
>> memcg reclaim -> skip deferred split shrinker
>>
>> root/global reclaim -> deferred split shrinker -> shared list_lru
>>
>> Is that expected? There woud be no memcg-driven deferred split reclaim
>> under nokmem, IIUC ...
> 
> Yes, this is all correct. list_lru is still inherently tied to the
> kmem component of memcg (memcg_kmem_id()).
> 
> So without kmem, no isolation. But without kmem, no isolation *for a
> lot of stuff*. It's a legacy knob when slab accounting was new and
> expensive. But so many things depend on it now, disabling it just
> punches a nassive hole into memcg functionality and isolation
> coverage. It's not a sanctioned production use flag.
> 
> This change is negligible from a memcg semantics POV.

Thanks for clarifying!

No strong objection from me. Just wanted to call out the nokmem
behavior change and hear what folks think :D

>> Not sure what the right fix is, as I am not a memcg expert ...



      reply	other threads:[~2026-06-03 11:53 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-27 20:45 [PATCH v5 0/9] mm: switch THP shrinker to list_lru Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 1/9] mm: list_lru: fix set_shrinker_bit() call during race with cgroup deletion Johannes Weiner
2026-05-28 13:25   ` Usama Arif
2026-05-30  2:38   ` Wei Yang
2026-05-27 20:45 ` [PATCH v5 2/9] mm: list_lru: lock_list_lru_of_memcg() cannot return NULL if !skip_empty Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 3/9] mm: list_lru: deduplicate unlock_list_lru() Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 4/9] mm: list_lru: move list dead check to lock_list_lru_of_memcg() Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 5/9] mm: list_lru: deduplicate lock_list_lru() Johannes Weiner
2026-05-29  9:56   ` Wei Yang
2026-05-29 13:42     ` Johannes Weiner
2026-05-30  1:25       ` Wei Yang
2026-05-27 20:45 ` [PATCH v5 6/9] mm: list_lru: introduce caller locking for additions and deletions Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 7/9] mm: list_lru: introduce folio_memcg_list_lru_alloc() Johannes Weiner
2026-05-27 20:45 ` [PATCH v5 8/9] mm: memory: flatten alloc_anon_folio() retry loop Johannes Weiner
2026-05-30  9:06   ` Dev Jain
2026-05-27 20:45 ` [PATCH v5 9/9] mm: switch deferred split shrinker to list_lru Johannes Weiner
2026-05-28  7:08   ` SeongJae Park
2026-05-28 14:03     ` Johannes Weiner
2026-05-28 13:32   ` Usama Arif
2026-05-28 14:02     ` Johannes Weiner
2026-05-28 15:31       ` Usama Arif
2026-05-29 17:33   ` Kairui Song
2026-05-31  8:00   ` Wei Yang
2026-06-01 10:39   ` Lance Yang
2026-06-01 11:09     ` Lance Yang
2026-06-01 13:21   ` Lance Yang
2026-06-01 18:17     ` Johannes Weiner
2026-06-01  8:36 ` [PATCH v5 0/9] mm: switch THP " Lance Yang
2026-06-02 21:46   ` Johannes Weiner
2026-06-03  4:44     ` Lance Yang
2026-06-03 11:41       ` Johannes Weiner
2026-06-03 11:53         ` Lance Yang [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4b2ebbf7-8c5e-45ca-a17c-111f68e2324c@linux.dev \
    --to=lance.yang@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=cgroups@vger.kernel.org \
    --cc=david@fromorbit.com \
    --cc=david@kernel.org \
    --cc=dev.jain@arm.com \
    --cc=gor@linux.ibm.com \
    --cc=hannes@cmpxchg.org \
    --cc=kas@kernel.org \
    --cc=liam@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=mhocko@kernel.org \
    --cc=muchun.song@linux.dev \
    --cc=npache@redhat.com \
    --cc=qi.zheng@linux.dev \
    --cc=roman.gushchin@linux.dev \
    --cc=ryan.roberts@arm.com \
    --cc=ryncsn@gmail.com \
    --cc=shakeel.butt@linux.dev \
    --cc=usama.arif@linux.dev \
    --cc=vbabka@kernel.org \
    --cc=yosry.ahmed@linux.dev \
    --cc=zaslonko@linux.ibm.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.