From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Johannes Weiner <hannes@cmpxchg.org>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>,
	Yosry Ahmed <yosry.ahmed@linux.dev>, Zi Yan <ziy@nvidia.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Usama Arif <usama.arif@linux.dev>,
	Kiryl Shutsemau <kas@kernel.org>,
	Dave Chinner <david@fromorbit.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 7/7] mm: switch deferred split shrinker to list_lru
Date: Wed, 18 Mar 2026 21:25:17 +0100
Message-ID: <61d86249-cd89-4e99-99d8-ab7c72e95f34@kernel.org>
In-Reply-To: <20260312205321.638053-8-hannes@cmpxchg.org>

On 3/12/26 21:51, Johannes Weiner wrote:
> The deferred split queue handles cgroups in a suboptimal fashion. The
> queue is per-NUMA node or per-cgroup, not the intersection. That means
> on a cgrouped system, a node-restricted allocation entering reclaim
> can end up splitting large pages on other nodes:
> 
> 	alloc/unmap
> 	  deferred_split_folio()
> 	    list_add_tail(memcg->split_queue)
> 	    set_shrinker_bit(memcg, node, deferred_shrinker_id)
> 
> 	for_each_zone_zonelist_nodemask(restricted_nodes)
> 	  mem_cgroup_iter()
> 	    shrink_slab(node, memcg)
> 	      shrink_slab_memcg(node, memcg)
> 	        if test_shrinker_bit(memcg, node, deferred_shrinker_id)
> 	          deferred_split_scan()
> 	            walks memcg->split_queue
> 
> The shrinker bit adds an imperfect guard rail. As soon as the cgroup
> has a single large page on the node of interest, all large pages owned
> by that memcg, including those on other nodes, will be split.
> 
> list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
> streamlines a lot of the list operations and reclaim walks. It's used
> widely by other major shrinkers already. Convert the deferred split
> queue as well.
> 
> The list_lru per-memcg heads are instantiated on demand when the first
> object of interest is allocated for a cgroup, by calling
> memcg_list_lru_alloc_folio(). Add calls to where splittable pages are
> created: anon faults, swapin faults, khugepaged collapse.
> 
> These calls create all possible node heads for the cgroup at once, so
> the migration code (between nodes) doesn't need any special care.


[...]

> -
>  static inline bool is_transparent_hugepage(const struct folio *folio)
>  {
>  	if (!folio_test_large(folio))
> @@ -1293,6 +1189,14 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
>  		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
>  		return NULL;
>  	}
> +
> +	if (memcg_list_lru_alloc_folio(folio, &deferred_split_lru, gfp)) {
> +		folio_put(folio);
> +		count_vm_event(THP_FAULT_FALLBACK);
> +		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
> +		return NULL;
> +	}

So, in all anon alloc paths, we essentially have

1) vma_alloc_folio / __folio_alloc (khugepaged being odd)
2) mem_cgroup_charge / mem_cgroup_swapin_charge_folio
3) memcg_list_lru_alloc_folio

I wonder if we could do better in most cases and have something like a

	vma_alloc_anon_folio()

that wraps the vma_alloc_folio() + memcg_list_lru_alloc_folio(), but
still leaves the charging to the caller?

That would at least combine 1) and 3) in a single API (except for the
odd cases without a VMA).

I guess we would want to skip the memcg_list_lru_alloc_folio() for
order-0 folios, correct?
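
Completely untested sketch, just to illustrate the API shape I have in
mind (and assuming memcg_list_lru_alloc_folio() is fine on a
not-yet-charged folio; if not, the wrapper would need a memcg parameter
or would have to do the charging itself):

static struct folio *vma_alloc_anon_folio(gfp_t gfp, int order,
		struct vm_area_struct *vma, unsigned long addr)
{
	struct folio *folio;

	folio = vma_alloc_folio(gfp, order, vma, addr);
	if (!folio)
		return NULL;

	/* Order-0 folios never end up on the deferred split queue. */
	if (order &&
	    memcg_list_lru_alloc_folio(folio, &deferred_split_lru, gfp)) {
		folio_put(folio);
		return NULL;
	}

	return folio;
}

Callers would still do the charging and the FALLBACK accounting
themselves.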

> +
>  	folio_throttle_swaprate(folio, gfp);
>  
>         /*
> @@ -3802,33 +3706,28 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
>  	struct folio *new_folio, *next;
>  	int old_order = folio_order(folio);
>  	int ret = 0;
> -	struct deferred_split *ds_queue;
> +	struct list_lru_one *l;
>  
>  	VM_WARN_ON_ONCE(!mapping && end);
>  	/* Prevent deferred_split_scan() touching ->_refcount */
> -	ds_queue = folio_split_queue_lock(folio);
> +	rcu_read_lock();

The RCU lock is for the folio_memcg(), right?

I recall I raised in the past that some get/put-like logic (that wraps
the rcu_read_lock() + folio_memcg()) might make this a lot easier to
get right.


memcg = folio_memcg_lookup(folio)

... do stuff

folio_memcg_putback(folio, memcg);

Or sth like that.


Alternatively, you could have some helpers that do the
list_lru_lock+unlock etc.

folio_memcg_list_lru_lock()
...
folio_memcg_list_lru_unlock(l);

Just some thoughts as inspiration :)
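
For the latter, a rough (untested) sketch, reusing the list_lru_lock()
signature from this series:

static inline struct list_lru_one *
folio_memcg_list_lru_lock(struct list_lru *lru, struct folio *folio)
{
	/* RCU keeps the folio's memcg stable until we hold the list lock. */
	rcu_read_lock();
	return list_lru_lock(lru, folio_nid(folio), folio_memcg(folio));
}

static inline void folio_memcg_list_lru_unlock(struct list_lru_one *l)
{
	list_lru_unlock(l);
	rcu_read_unlock();
}

The hunk below would then pair
folio_memcg_list_lru_lock(&deferred_split_lru, folio) with
folio_memcg_list_lru_unlock(l), and the RCU section would be hidden
inside the helpers.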

> +	l = list_lru_lock(&deferred_split_lru, folio_nid(folio), folio_memcg(folio));
>  	if (folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1)) {
>  		struct swap_cluster_info *ci = NULL;
>  		struct lruvec *lruvec;
>  
>  		if (old_order > 1) {
> -			if (!list_empty(&folio->_deferred_list)) {
> -				ds_queue->split_queue_len--;
> -				/*
> -				 * Reinitialize page_deferred_list after removing the
> -				 * page from the split_queue, otherwise a subsequent
> -				 * split will see list corruption when checking the
> -				 * page_deferred_list.
> -				 */
> -				list_del_init(&folio->_deferred_list);
> -			}
> +			__list_lru_del(&deferred_split_lru, l,
> +				       &folio->_deferred_list, folio_nid(folio));
>  			if (folio_test_partially_mapped(folio)) {
>  				folio_clear_partially_mapped(folio);
>  				mod_mthp_stat(old_order,
>  					MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>  			}
>  		}
> -		split_queue_unlock(ds_queue);
> +		list_lru_unlock(l);
> +		rcu_read_unlock();
> +
>  		if (mapping) {

[...]

Most changes here look mostly mechanical, quite nice. I'll probably
have to go over some bits once again with a fresh mind :)

-- 
Cheers,

David
