From: Johannes Weiner <hannes@cmpxchg.org>
To: Muchun Song <songmuchun@bytedance.com>
Cc: mhocko@kernel.org, roman.gushchin@linux.dev,
shakeel.butt@linux.dev, muchun.song@linux.dev,
akpm@linux-foundation.org, david@fromorbit.com,
zhengqi.arch@bytedance.com, yosry.ahmed@linux.dev,
nphamcs@gmail.com, chengming.zhou@linux.dev,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
linux-mm@kvack.org, hamzamahfooz@linux.microsoft.com,
apais@linux.microsoft.com, Hugh Dickins <hughd@google.com>
Subject: Re: [PATCH RFC 07/28] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()
Date: Wed, 30 Apr 2025 10:37:14 -0400
Message-ID: <20250430143714.GA2020@cmpxchg.org>
In-Reply-To: <20250415024532.26632-8-songmuchun@bytedance.com>
On Tue, Apr 15, 2025 at 10:45:11AM +0800, Muchun Song wrote:
> The maintenance of the folio->_deferred_list is intricate because it's
> reused in a local list.
>
> Here are some peculiarities:
>
> 1) When a folio is removed from its split queue and added to a local
> on-stack list in deferred_split_scan(), the ->split_queue_len isn't
> updated, leading to an inconsistency between it and the actual
> number of folios in the split queue.
>
> 2) When the folio is split via split_folio() later, it's removed from
> the local list while holding the split queue lock. At this time,
> this lock protects the local list, not the split queue.
>
> 3) To handle the race condition with a third-party freeing or migrating
> the preceding folio, we must ensure there's always one safe (with
> raised refcount) folio before by delaying its folio_put(). More
> details can be found in commit e66f3185fa04. It's rather tricky.
>
> We can use the folio_batch infrastructure to handle this clearly. In this
> case, ->split_queue_len will be consistent with the real number of folios
> in the split queue. If list_empty(&folio->_deferred_list) returns false,
> it's clear the folio must be in its split queue (not in a local list
> anymore).
>
> In the future, we aim to reparent LRU folios during memcg offline to
> eliminate dying memory cgroups. This patch prepares for using
> folio_split_queue_lock_irqsave() as folio memcg may change then.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
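
To make the bookkeeping in the changelog above concrete, here is a
minimal userspace sketch, not kernel code. All names in it (struct node,
struct queue, BATCH_MAX, isolate_to_local_list(), isolate_to_batch())
are hypothetical stand-ins for the split queue, ->split_queue_len and
the two isolation schemes. It only illustrates peculiarity 1 and the
list_empty() ambiguity; it does not model locking or refcounts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for list_head and the split queue. */
struct node {
	struct node *prev, *next;
};

struct queue {
	struct node head;	/* plays the role of ->split_queue */
	unsigned long len;	/* plays the role of ->split_queue_len */
};

static void node_init(struct node *n) { n->prev = n->next = n; }

/* Like list_empty(): true when the node is linked to nothing. */
static int node_unlinked(const struct node *n) { return n->next == n; }

static void node_del_init(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);
}

static void node_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void queue_add(struct queue *q, struct node *n)
{
	node_add_tail(n, &q->head);
	q->len++;
}

/* Walk the queue and count what is really on it. */
static unsigned long queue_count(const struct queue *q)
{
	unsigned long n = 0;
	const struct node *p;

	for (p = q->head.next; p != &q->head; p = p->next)
		n++;
	return n;
}

/* Old scheme: entries move to a local list, len is fixed up later. */
static void isolate_to_local_list(struct queue *q, struct node *local)
{
	while (!node_unlinked(&q->head)) {
		struct node *n = q->head.next;

		node_del_init(n);
		node_add_tail(n, local);	/* q->len is now stale */
	}
}

#define BATCH_MAX 4	/* assumed capacity, for illustration only */

/* New scheme: entries are unlinked, counted out, and kept in an array. */
static unsigned int isolate_to_batch(struct queue *q, struct node **batch)
{
	unsigned int nr = 0;

	while (!node_unlinked(&q->head) && nr < BATCH_MAX) {
		struct node *n = q->head.next;

		node_del_init(n);
		q->len--;
		batch[nr++] = n;
	}
	return nr;
}
```

With the local list, an isolated entry still looks "queued" to a
list_empty()-style check and the counter disagrees with the real queue
length. With the batch, an unlinked entry is unambiguously off the
queue and the counter always matches.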

This is a very nice simplification. Getting rid of the stack list, and
its subtle implications for all the various current and future
list_empty(&folio->_deferred_list) checks, should make this code much
more robust.

However, I think there is one snag related to this:

> ---
> mm/huge_memory.c | 69 +++++++++++++++++++++---------------------------
> 1 file changed, 30 insertions(+), 39 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 70820fa75c1f..d2bc943a40e8 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -4220,40 +4220,47 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> struct pglist_data *pgdata = NODE_DATA(sc->nid);
> struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
> unsigned long flags;
> - LIST_HEAD(list);
> - struct folio *folio, *next, *prev = NULL;
> - int split = 0, removed = 0;
> + struct folio *folio, *next;
> + int split = 0, i;
> + struct folio_batch fbatch;
> + bool done;
>
> #ifdef CONFIG_MEMCG
> if (sc->memcg)
> ds_queue = &sc->memcg->deferred_split_queue;
> #endif
> -
> + folio_batch_init(&fbatch);
> +retry:
> + done = true;
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> /* Take pin on all head pages to avoid freeing them under us */
> list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
> _deferred_list) {
> if (folio_try_get(folio)) {
> - list_move(&folio->_deferred_list, &list);
> - } else {
> + folio_batch_add(&fbatch, folio);
> + } else if (folio_test_partially_mapped(folio)) {
> /* We lost race with folio_put() */
> - if (folio_test_partially_mapped(folio)) {
> - folio_clear_partially_mapped(folio);
> - mod_mthp_stat(folio_order(folio),
> - MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> - }
> - list_del_init(&folio->_deferred_list);
> - ds_queue->split_queue_len--;
> + folio_clear_partially_mapped(folio);
> + mod_mthp_stat(folio_order(folio),
> + MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> }
> + list_del_init(&folio->_deferred_list);
> + ds_queue->split_queue_len--;
> if (!--sc->nr_to_scan)
> break;
> + if (folio_batch_space(&fbatch) == 0) {
> + done = false;
> + break;
> + }
> }
> split_queue_unlock_irqrestore(ds_queue, flags);
>
> - list_for_each_entry_safe(folio, next, &list, _deferred_list) {
> + for (i = 0; i < folio_batch_count(&fbatch); i++) {
> bool did_split = false;
> bool underused = false;
> + struct deferred_split *fqueue;
>
> + folio = fbatch.folios[i];
> if (!folio_test_partially_mapped(folio)) {
> underused = thp_underused(folio);
> if (!underused)
> @@ -4269,39 +4276,23 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> }
> folio_unlock(folio);
> next:
> + if (did_split || !folio_test_partially_mapped(folio))
> + continue;
There IS a list_empty() check in the splitting code that we actually
relied on, for cleaning up the partially_mapped state and counter:
		    !list_empty(&folio->_deferred_list)) {
			ds_queue->split_queue_len--;
			if (folio_test_partially_mapped(folio)) {
				folio_clear_partially_mapped(folio);
				mod_mthp_stat(folio_order(folio),
					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
			}
			/*
			 * Reinitialize page_deferred_list after removing the
			 * page from the split_queue, otherwise a subsequent
			 * split will see list corruption when checking the
			 * page_deferred_list.
			 */
			list_del_init(&folio->_deferred_list);
With the folios isolated up front, it looks like you need to handle
this from the shrinker.
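
To spell out the concern, here is a small userspace model, offered as a
sketch under stated assumptions rather than a proposed fix. The types
and functions (struct fake_folio, struct fake_stats,
split_path_cleanup(), shrinker_isolate(), shrinker_after_split()) are
hypothetical; they only mirror the shape of the quoted split-path
cleanup and of the isolation loop in the patch for a folio whose
reference was successfully taken.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two bits of folio state involved. */
struct fake_folio {
	bool on_queue;		/* !list_empty(&folio->_deferred_list) */
	bool partially_mapped;	/* folio_test_partially_mapped() */
};

struct fake_stats {
	long queue_len;			/* ->split_queue_len */
	long nr_partially_mapped;	/* the partially-mapped counter */
};

/*
 * Same shape as the quoted split-path cleanup: the flag and counter
 * are only touched when the folio is still found on the queue.
 */
static void split_path_cleanup(struct fake_folio *f, struct fake_stats *s)
{
	if (!f->on_queue)
		return;
	s->queue_len--;
	if (f->partially_mapped) {
		f->partially_mapped = false;
		s->nr_partially_mapped--;
	}
	f->on_queue = false;
}

/*
 * Isolation as in the patch when the reference was obtained: the folio
 * leaves the queue up front and the partially_mapped state is kept.
 */
static void shrinker_isolate(struct fake_folio *f, struct fake_stats *s)
{
	assert(f->on_queue);
	f->on_queue = false;
	s->queue_len--;
}

/*
 * One possible shrinker-side cleanup after a successful split. This is
 * an assumption about where the handling could live, nothing more.
 */
static void shrinker_after_split(struct fake_folio *f, struct fake_stats *s,
				 bool did_split)
{
	if (did_split && f->partially_mapped) {
		f->partially_mapped = false;
		s->nr_partially_mapped--;
	}
}
```

In this model, once shrinker_isolate() has run, split_path_cleanup()
returns early and the counter stays elevated until something like
shrinker_after_split() clears it.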

Otherwise this looks correct to me. But this code is subtle, so I would
feel much better if Hugh (CC-ed) could take a look as well.

Thanks!

> /*
> - * split_folio() removes folio from list on success.
> * Only add back to the queue if folio is partially mapped.
> * If thp_underused returns false, or if split_folio fails
> * in the case it was underused, then consider it used and
> * don't add it back to split_queue.
> */
> - if (did_split) {
> - ; /* folio already removed from list */
> - } else if (!folio_test_partially_mapped(folio)) {
> - list_del_init(&folio->_deferred_list);
> - removed++;
> - } else {
> - /*
> - * That unlocked list_del_init() above would be unsafe,
> - * unless its folio is separated from any earlier folios
> - * left on the list (which may be concurrently unqueued)
> - * by one safe folio with refcount still raised.
> - */
> - swap(folio, prev);
> - }
> - if (folio)
> - folio_put(folio);
> + fqueue = folio_split_queue_lock_irqsave(folio, &flags);
> + list_add_tail(&folio->_deferred_list, &fqueue->split_queue);
> + fqueue->split_queue_len++;
> + split_queue_unlock_irqrestore(fqueue, flags);
> }
> + folios_put(&fbatch);
>
> - spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> - list_splice_tail(&list, &ds_queue->split_queue);
> - ds_queue->split_queue_len -= removed;
> - split_queue_unlock_irqrestore(ds_queue, flags);
> -
> - if (prev)
> - folio_put(prev);
> -
> + if (!done)
> + goto retry;
> /*
> * Stop shrinker if we didn't split any page, but the queue is empty.
> * This can happen if pages were freed under us.
> --
> 2.20.1
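
For readers following the control flow of the new loop, the retry/done
structure in the diff above can be modelled in userspace roughly as
follows. This is a sketch only: struct batch, BATCH_MAX, struct
scan_result and scan() are hypothetical names, the queue is a plain
array, and the "locked" and "unlocked" phases are just comments.

```c
#include <assert.h>
#include <stdbool.h>

#define BATCH_MAX 4	/* assumed capacity; stand-in for the batch size */

struct batch {
	unsigned int nr;
	int items[BATCH_MAX];
};

static void batch_init(struct batch *b) { b->nr = 0; }

static unsigned int batch_space(const struct batch *b)
{
	return BATCH_MAX - b->nr;
}

static void batch_add(struct batch *b, int item)
{
	assert(b->nr < BATCH_MAX);
	b->items[b->nr++] = item;
}

struct scan_result {
	unsigned int processed;	/* entries handled in total */
	unsigned int rounds;	/* number of fill/process passes */
};

static struct scan_result scan(const int *queue, unsigned int queue_len,
			       unsigned long nr_to_scan)
{
	struct scan_result res = { 0, 0 };
	struct batch b;
	unsigned int pos = 0;
	bool done;

	if (!nr_to_scan)	/* guard added for the sketch only */
		return res;

	batch_init(&b);
retry:
	done = true;
	res.rounds++;

	/* "Locked" phase: pull entries off the queue into the batch. */
	while (pos < queue_len) {
		batch_add(&b, queue[pos++]);
		if (!--nr_to_scan)
			break;		/* scan budget exhausted */
		if (batch_space(&b) == 0) {
			done = false;	/* batch full, come back for more */
			break;
		}
	}

	/* "Unlocked" phase: process the batch, then empty it. */
	res.processed += b.nr;
	batch_init(&b);

	if (!done)
		goto retry;
	return res;
}
```

With ten queued entries and a large budget, this model takes three
rounds (4 + 4 + 2). A budget that runs out in the middle of a batch
ends the scan in that same round.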