From: Johannes Weiner <hannes@cmpxchg.org>
To: "David Hildenbrand (Arm)" <david@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Shakeel Butt <shakeel.butt@linux.dev>,
Yosry Ahmed <yosry.ahmed@linux.dev>, Zi Yan <ziy@nvidia.com>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Usama Arif <usama.arif@linux.dev>,
Kiryl Shutsemau <kas@kernel.org>,
Dave Chinner <david@fromorbit.com>,
Roman Gushchin <roman.gushchin@linux.dev>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 7/7] mm: switch deferred split shrinker to list_lru
Date: Wed, 18 Mar 2026 18:48:54 -0400 [thread overview]
Message-ID: <absr1vgNP_tM1OEP@cmpxchg.org> (raw)
In-Reply-To: <61d86249-cd89-4e99-99d8-ab7c72e95f34@kernel.org>

On Wed, Mar 18, 2026 at 09:25:17PM +0100, David Hildenbrand (Arm) wrote:
> On 3/12/26 21:51, Johannes Weiner wrote:
> > The deferred split queue handles cgroups in a suboptimal fashion. The
> > queue is per-NUMA node or per-cgroup, not the intersection. That means
> > on a cgrouped system, a node-restricted allocation entering reclaim
> > can end up splitting large pages on other nodes:
> >
> > alloc/unmap
> > deferred_split_folio()
> > list_add_tail(memcg->split_queue)
> > set_shrinker_bit(memcg, node, deferred_shrinker_id)
> >
> > for_each_zone_zonelist_nodemask(restricted_nodes)
> > mem_cgroup_iter()
> > shrink_slab(node, memcg)
> > shrink_slab_memcg(node, memcg)
> > if test_shrinker_bit(memcg, node, deferred_shrinker_id)
> > deferred_split_scan()
> > walks memcg->split_queue
> >
> > The shrinker bit adds an imperfect guard rail. As soon as the cgroup
> > has a single large page on the node of interest, all large pages owned
> > by that memcg, including those on other nodes, will be split.
> >
> > list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
> > streamlines a lot of the list operations and reclaim walks. It's used
> > widely by other major shrinkers already. Convert the deferred split
> > queue as well.
> >
> > The list_lru per-memcg heads are instantiated on demand when the first
> > object of interest is allocated for a cgroup, by calling
> > memcg_list_lru_alloc_folio(). Add calls to where splittable pages are
> > created: anon faults, swapin faults, khugepaged collapse.
> >
> > These calls create all possible node heads for the cgroup at once, so
> > the migration code (between nodes) doesn't need any special care.
>
>
> [...]
>
> > -
> > static inline bool is_transparent_hugepage(const struct folio *folio)
> > {
> > if (!folio_test_large(folio))
> > @@ -1293,6 +1189,14 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
> > count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> > return NULL;
> > }
> > +
> > + if (memcg_list_lru_alloc_folio(folio, &deferred_split_lru, gfp)) {
> > + folio_put(folio);
> > + count_vm_event(THP_FAULT_FALLBACK);
> > + count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
> > + return NULL;
> > + }
>
> So, in all anon alloc paths, we essentially have
>
> 1) vma_alloc_folio / __folio_alloc (khugepaged being odd)
> 2) mem_cgroup_charge / mem_cgroup_swapin_charge_folio
> 3) memcg_list_lru_alloc_folio
>
> I wonder if we could do better in most cases and have something like a
>
> vma_alloc_anon_folio()
>
> That wraps the vma_alloc_folio() + memcg_list_lru_alloc_folio(), but
> still leaves the charging to the caller?

Hm, but it's the charging that figures out the memcg and sets
folio_memcg() :(

> That would at least combine 1) and 3) in a single API. (except for the
> odd cases without a VMA).
>
> I guess we would want to skip the memcg_list_lru_alloc_folio() for
> order-0 folios, correct?

Yeah, we don't use the queue for order <= 1. In deferred_split_folio():

/*
* Order 1 folios have no space for a deferred list, but we also
* won't waste much memory by not adding them to the deferred list.
*/
if (folio_order(folio) <= 1)
return;
> > @@ -3802,33 +3706,28 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
> > struct folio *new_folio, *next;
> > int old_order = folio_order(folio);
> > int ret = 0;
> > - struct deferred_split *ds_queue;
> > + struct list_lru_one *l;
> >
> > VM_WARN_ON_ONCE(!mapping && end);
> > /* Prevent deferred_split_scan() touching ->_refcount */
> > - ds_queue = folio_split_queue_lock(folio);
> > + rcu_read_lock();
>
> The RCU lock is for the folio_memcg(), right?
>
> I recall I raised in the past that some get/put-like logic (that wraps
> the rcu_read_lock() + folio_memcg()) might make this a lot easier to
> get right.
>
>
> memcg = folio_memcg_lookup(folio)
>
> ... do stuff
>
> folio_memcg_putback(folio, memcg);
>
> Or sth like that.
>
>
> Alternativey, you could have some helpers that do the
> list_lru_lock+unlock etc.
>
> folio_memcg_list_lru_lock()
> ...
> folio_memcg_list_lru_unlock(l);
>
> Just some thoughts as inspiration :)

I remember you raising this in the objcg + reparenting patches. There
are a few more instances of

    rcu_read_lock()
    foo = folio_memcg()
    ...
    rcu_read_unlock()

in other parts of the code not touched by these patches here, so the
first pattern is a more universal encapsulation.
Let me look into this. Would you be okay with a follow-up that covers
the others as well?