From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Johannes Weiner <hannes@cmpxchg.org>,
Andrew Morton <akpm@linux-foundation.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>,
Yosry Ahmed <yosry.ahmed@linux.dev>, Zi Yan <ziy@nvidia.com>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Usama Arif <usama.arif@linux.dev>,
Kiryl Shutsemau <kas@kernel.org>,
Dave Chinner <david@fromorbit.com>,
Roman Gushchin <roman.gushchin@linux.dev>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 7/7] mm: switch deferred split shrinker to list_lru
Date: Wed, 18 Mar 2026 21:25:17 +0100
Message-ID: <61d86249-cd89-4e99-99d8-ab7c72e95f34@kernel.org>
In-Reply-To: <20260312205321.638053-8-hannes@cmpxchg.org>
On 3/12/26 21:51, Johannes Weiner wrote:
> The deferred split queue handles cgroups in a suboptimal fashion. The
> queue is per-NUMA node or per-cgroup, not the intersection. That means
> on a cgrouped system, a node-restricted allocation entering reclaim
> can end up splitting large pages on other nodes:
>
> alloc/unmap
> deferred_split_folio()
> list_add_tail(memcg->split_queue)
> set_shrinker_bit(memcg, node, deferred_shrinker_id)
>
> for_each_zone_zonelist_nodemask(restricted_nodes)
> mem_cgroup_iter()
> shrink_slab(node, memcg)
> shrink_slab_memcg(node, memcg)
> if test_shrinker_bit(memcg, node, deferred_shrinker_id)
> deferred_split_scan()
> walks memcg->split_queue
>
> The shrinker bit adds an imperfect guard rail. As soon as the cgroup
> has a single large page on the node of interest, all large pages owned
> by that memcg, including those on other nodes, will be split.
>
> list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
> streamlines a lot of the list operations and reclaim walks. It's used
> widely by other major shrinkers already. Convert the deferred split
> queue as well.
>
> The list_lru per-memcg heads are instantiated on demand when the first
> object of interest is allocated for a cgroup, by calling
> memcg_list_lru_alloc_folio(). Add calls to where splittable pages are
> created: anon faults, swapin faults, khugepaged collapse.
>
> These calls create all possible node heads for the cgroup at once, so
> the migration code (between nodes) doesn't need any special care.
[...]
> -
> static inline bool is_transparent_hugepage(const struct folio *folio)
> {
> if (!folio_test_large(folio))
> @@ -1293,6 +1189,14 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
> count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> return NULL;
> }
> +
> + if (memcg_list_lru_alloc_folio(folio, &deferred_split_lru, gfp)) {
> + folio_put(folio);
> + count_vm_event(THP_FAULT_FALLBACK);
> + count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
> + return NULL;
> + }
So, in all anon alloc paths, we essentially have:
1) vma_alloc_folio / __folio_alloc (khugepaged being odd)
2) mem_cgroup_charge / mem_cgroup_swapin_charge_folio
3) memcg_list_lru_alloc_folio
I wonder if we could do better in most cases and have something like a
vma_alloc_anon_folio() that wraps vma_alloc_folio() +
memcg_list_lru_alloc_folio(), but still leaves the charging to the
caller?
That would at least combine 1) and 3) into a single API (except for the
odd cases without a VMA).
I guess we would want to skip the memcg_list_lru_alloc_folio() for
order-0 folios, correct?
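A minimal sketch of what I have in mind (the name, the order > 1
cutoff, and whether the lru-head allocation can really happen before
charging are all my assumptions, the latter depending on where
memcg_list_lru_alloc_folio() takes its memcg from):

static struct folio *vma_alloc_anon_folio(gfp_t gfp, int order,
		struct vm_area_struct *vma, unsigned long addr)
{
	struct folio *folio;

	folio = vma_alloc_folio(gfp, order, vma, addr);
	if (!folio)
		return NULL;

	/* Only folios with a _deferred_list (order > 1) need lru heads. */
	if (order > 1 &&
	    memcg_list_lru_alloc_folio(folio, &deferred_split_lru, gfp)) {
		folio_put(folio);
		return NULL;
	}
	return folio;
}

Callers would then still do the charging and the fallback accounting on
NULL themselves, like they do today.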
> +
> folio_throttle_swaprate(folio, gfp);
>
> /*
> @@ -3802,33 +3706,28 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
> struct folio *new_folio, *next;
> int old_order = folio_order(folio);
> int ret = 0;
> - struct deferred_split *ds_queue;
> + struct list_lru_one *l;
>
> VM_WARN_ON_ONCE(!mapping && end);
> /* Prevent deferred_split_scan() touching ->_refcount */
> - ds_queue = folio_split_queue_lock(folio);
> + rcu_read_lock();
The RCU lock is for the folio_memcg(), right?
I recall I raised in the past that some get/put-like logic (that wraps
the rcu_read_lock() + folio_memcg()) might make this a lot easier to
get right:
memcg = folio_memcg_lookup(folio)
... do stuff
folio_memcg_putback(folio, memcg);
Or something like that.
Alternatively, you could have some helpers that do the
list_lru_lock+unlock etc.:
folio_memcg_list_lru_lock()
...
folio_memcg_list_lru_unlock(l);
Just some thoughts as inspiration :)
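For the latter, a rough sketch based only on the calls visible in this
hunk (helper names made up):

static inline struct list_lru_one *
folio_memcg_list_lru_lock(struct list_lru *lru, struct folio *folio)
{
	/* RCU stabilizes folio_memcg() until the list is unlocked. */
	rcu_read_lock();
	return list_lru_lock(lru, folio_nid(folio), folio_memcg(folio));
}

static inline void folio_memcg_list_lru_unlock(struct list_lru_one *l)
{
	list_lru_unlock(l);
	rcu_read_unlock();
}

That would at least keep the rcu_read_lock()/unlock() pairing out of
the callers.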
> + l = list_lru_lock(&deferred_split_lru, folio_nid(folio), folio_memcg(folio));
> if (folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1)) {
> struct swap_cluster_info *ci = NULL;
> struct lruvec *lruvec;
>
> if (old_order > 1) {
> - if (!list_empty(&folio->_deferred_list)) {
> - ds_queue->split_queue_len--;
> - /*
> - * Reinitialize page_deferred_list after removing the
> - * page from the split_queue, otherwise a subsequent
> - * split will see list corruption when checking the
> - * page_deferred_list.
> - */
> - list_del_init(&folio->_deferred_list);
> - }
> + __list_lru_del(&deferred_split_lru, l,
> + &folio->_deferred_list, folio_nid(folio));
> if (folio_test_partially_mapped(folio)) {
> folio_clear_partially_mapped(folio);
> mod_mthp_stat(old_order,
> MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> }
> }
> - split_queue_unlock(ds_queue);
> + list_lru_unlock(l);
> + rcu_read_unlock();
> +
> if (mapping) {
[...]
Most changes here look mostly mechanical, quite nice. I'll probably
have to go over some bits once again with a fresh mind :)
--
Cheers,
David