From: Qi Zheng <qi.zheng@linux.dev>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: hannes@cmpxchg.org, hughd@google.com, mhocko@suse.com,
roman.gushchin@linux.dev, muchun.song@linux.dev,
david@redhat.com, lorenzo.stoakes@oracle.com, ziy@nvidia.com,
harry.yoo@oracle.com, baolin.wang@linux.alibaba.com,
Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
dev.jain@arm.com, baohua@kernel.org, lance.yang@linux.dev,
akpm@linux-foundation.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
Muchun Song <songmuchun@bytedance.com>,
Qi Zheng <zhengqi.arch@bytedance.com>
Subject: Re: [PATCH v4 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()
Date: Mon, 13 Oct 2025 15:28:28 +0800
Message-ID: <ba2ec325-72d1-4a11-943f-b36a090cb68b@linux.dev>
In-Reply-To: <x4d36plhxcbyp76q4gmesktnnh7yi7bfifx3amk3fwx2moqkk6@77umpnw6rkg3>

Hi Shakeel,

On 10/7/25 7:16 AM, Shakeel Butt wrote:
> On Sat, Oct 04, 2025 at 12:53:17AM +0800, Qi Zheng wrote:
>> From: Muchun Song <songmuchun@bytedance.com>
>>
>> The maintenance of the folio->_deferred_list is intricate because it's
>> reused in a local list.
>>
>> Here are some peculiarities:
>>
>> 1) When a folio is removed from its split queue and added to a local
>> on-stack list in deferred_split_scan(), the ->split_queue_len isn't
>> updated, leading to an inconsistency between it and the actual
>> number of folios in the split queue.
>>
>> 2) When the folio is split via split_folio() later, it's removed from
>> the local list while holding the split queue lock. At this time,
>> this lock protects the local list, not the split queue.
>
> I think the above text needs some massaging. Rather than saying lock
> protects the local list, I think, it would be better to say that the
> lock is not needed as it is not protecting anything.

Makes sense, will do.

>
>>
>> 3) To handle the race condition with a third-party freeing or migrating
>> the preceding folio, we must ensure there's always one safe (with
>> raised refcount) folio before by delaying its folio_put(). More
>> details can be found in commit e66f3185fa04 ("mm/thp: fix deferred
>> split queue not partially_mapped"). It's rather tricky.
>>
>> We can use the folio_batch infrastructure to handle this clearly. In this
>> case, ->split_queue_len will be consistent with the real number of folios
>> in the split queue. If list_empty(&folio->_deferred_list) returns false,
>> it's clear the folio must be in its split queue (not in a local list
>> anymore).
>>
>> In the future, we will reparent LRU folios during memcg offline to
>> eliminate dying memory cgroups, which requires reparenting the split queue
>> to its parent first. So this patch prepares for using
>> folio_split_queue_lock_irqsave() as the memcg may change then.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> Reviewed-by: Zi Yan <ziy@nvidia.com>
>> Acked-by: David Hildenbrand <david@redhat.com>
>
> One nit below.
>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Thanks!
>
>> ---
>> mm/huge_memory.c | 85 ++++++++++++++++++++++--------------------------
>> 1 file changed, 39 insertions(+), 46 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 134666503440d..59ddebc9f3232 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3782,21 +3782,22 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> struct lruvec *lruvec;
>> int expected_refs;
>>
>> - if (folio_order(folio) > 1 &&
>> - !list_empty(&folio->_deferred_list)) {
>> - ds_queue->split_queue_len--;
>> + if (folio_order(folio) > 1) {
>> + if (!list_empty(&folio->_deferred_list)) {
>> + ds_queue->split_queue_len--;
>> + /*
>> + * Reinitialize page_deferred_list after removing the
>> + * page from the split_queue, otherwise a subsequent
>> + * split will see list corruption when checking the
>> + * page_deferred_list.
>> + */
>> + list_del_init(&folio->_deferred_list);
>> + }
>> if (folio_test_partially_mapped(folio)) {
>> folio_clear_partially_mapped(folio);
>> mod_mthp_stat(folio_order(folio),
>> MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>> }
>> - /*
>> - * Reinitialize page_deferred_list after removing the
>> - * page from the split_queue, otherwise a subsequent
>> - * split will see list corruption when checking the
>> - * page_deferred_list.
>> - */
>> - list_del_init(&folio->_deferred_list);
>> }
>> split_queue_unlock(ds_queue);
>> if (mapping) {
>> @@ -4185,35 +4186,40 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>> {
>> struct deferred_split *ds_queue;
>> unsigned long flags;
>> - LIST_HEAD(list);
>> - struct folio *folio, *next, *prev = NULL;
>> - int split = 0, removed = 0;
>> + struct folio *folio, *next;
>> + int split = 0, i;
>> + struct folio_batch fbatch;
>>
>> + folio_batch_init(&fbatch);
>> +
>> +retry:
>> ds_queue = split_queue_lock_irqsave(sc->nid, sc->memcg, &flags);
>> /* Take pin on all head pages to avoid freeing them under us */
>> list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>> _deferred_list) {
>> if (folio_try_get(folio)) {
>> - list_move(&folio->_deferred_list, &list);
>> - } else {
>> + folio_batch_add(&fbatch, folio);
>> + } else if (folio_test_partially_mapped(folio)) {
>> /* We lost race with folio_put() */
>> - if (folio_test_partially_mapped(folio)) {
>> - folio_clear_partially_mapped(folio);
>> - mod_mthp_stat(folio_order(folio),
>> - MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>> - }
>> - list_del_init(&folio->_deferred_list);
>> - ds_queue->split_queue_len--;
>> + folio_clear_partially_mapped(folio);
>> + mod_mthp_stat(folio_order(folio),
>> + MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>> }
>> + list_del_init(&folio->_deferred_list);
>> + ds_queue->split_queue_len--;
>> if (!--sc->nr_to_scan)
>> break;
>> + if (!folio_batch_space(&fbatch))
>> + break;
>> }
>> split_queue_unlock_irqrestore(ds_queue, flags);
>>
>> - list_for_each_entry_safe(folio, next, &list, _deferred_list) {
>> + for (i = 0; i < folio_batch_count(&fbatch); i++) {
>> bool did_split = false;
>> bool underused = false;
>> + struct deferred_split *fqueue;
>>
>> + folio = fbatch.folios[i];
>> if (!folio_test_partially_mapped(folio)) {
>> /*
>> * See try_to_map_unused_to_zeropage(): we cannot
>> @@ -4236,38 +4242,25 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>> }
>> folio_unlock(folio);
>> next:
>> + if (did_split || !folio_test_partially_mapped(folio))
>> + continue;
>> /*
>> - * split_folio() removes folio from list on success.
>> * Only add back to the queue if folio is partially mapped.
>> * If thp_underused returns false, or if split_folio fails
>> * in the case it was underused, then consider it used and
>> * don't add it back to split_queue.
>> */
>> - if (did_split) {
>> - ; /* folio already removed from list */
>> - } else if (!folio_test_partially_mapped(folio)) {
>> - list_del_init(&folio->_deferred_list);
>> - removed++;
>> - } else {
>> - /*
>> - * That unlocked list_del_init() above would be unsafe,
>> - * unless its folio is separated from any earlier folios
>> - * left on the list (which may be concurrently unqueued)
>> - * by one safe folio with refcount still raised.
>> - */
>> - swap(folio, prev);
>> + fqueue = folio_split_queue_lock_irqsave(folio, &flags);
>> + if (list_empty(&folio->_deferred_list)) {
>> + list_add_tail(&folio->_deferred_list, &fqueue->split_queue);
>> + fqueue->split_queue_len++;
>> }
>> - if (folio)
>> - folio_put(folio);
>> + split_queue_unlock_irqrestore(fqueue, flags);
>
> Is it possible to move this lock/list_add/unlock code chunk out of loop
> and before the folios_put(). I think it would be possible if you tag the
> corresponding index or have a separate bool array. It is also reasonable
> to claim that the contention of this lock is not a concern for now.

Given the extra code complexity this would add, perhaps we could leave it
as is until contention on this lock actually becomes a problem?
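
For reference, the tagged variant suggested above could look roughly like
the userspace sketch below. Everything in it (toy_lock, struct item,
struct queue, scan_one_batch(), BATCH_MAX) is hypothetical and only models
the shape of the change; it is not kernel code. It is single-threaded, the
"lock" merely counts acquisitions, and it assumes every tagged entry goes
back to the same queue, which may not hold once the split queue can be
reparented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_MAX 8	/* hypothetical; stands in for the folio_batch capacity */

/* Toy lock that only counts acquisitions; stand-in for the split queue lock. */
struct toy_lock {
	bool held;
	unsigned int acquisitions;
};

void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
	l->acquisitions++;
}

void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Toy stand-in for a folio that can sit on a deferred split queue. */
struct item {
	struct item *next;
	bool queued;		/* models !list_empty(&folio->_deferred_list) */
	bool partially_mapped;
	bool splittable;	/* whether the pretend split succeeds */
};

struct queue {
	struct toy_lock lock;
	struct item *head, *tail;
	size_t len;		/* models ->split_queue_len */
};

/* Caller must hold q->lock. */
void queue_add_tail(struct queue *q, struct item *it)
{
	assert(q->lock.held);
	it->next = NULL;
	if (q->tail)
		q->tail->next = it;
	else
		q->head = it;
	q->tail = it;
	it->queued = true;
	q->len++;
}

/* Caller must hold q->lock. Returns NULL when the queue is empty. */
struct item *queue_pop_head(struct queue *q)
{
	struct item *it = q->head;

	assert(q->lock.held);
	if (!it)
		return NULL;
	q->head = it->next;
	if (!q->head)
		q->tail = NULL;
	it->next = NULL;
	it->queued = false;
	q->len--;
	return it;
}

/* Drain one batch, process it unlocked, requeue tagged entries in one go. */
int scan_one_batch(struct queue *q)
{
	struct item *batch[BATCH_MAX];
	bool requeue[BATCH_MAX] = { false };
	bool any = false;
	size_t n = 0, i;
	int split = 0;

	toy_lock_acquire(&q->lock);
	while (n < BATCH_MAX) {
		struct item *it = queue_pop_head(q);

		if (!it)
			break;
		batch[n++] = it;
	}
	toy_lock_release(&q->lock);

	/* Unlocked phase: only tag the index, do not touch the queue. */
	for (i = 0; i < n; i++) {
		if (batch[i]->splittable) {
			split++;
		} else if (batch[i]->partially_mapped) {
			requeue[i] = true;
			any = true;
		}
	}

	/* Single lock hold for all requeues instead of one per entry. */
	if (any) {
		toy_lock_acquire(&q->lock);
		for (i = 0; i < n; i++)
			if (requeue[i] && !batch[i]->queued)
				queue_add_tail(q, batch[i]);
		toy_lock_release(&q->lock);
	}
	return split;
}
```

The cost is the extra bool array and a second pass over the batch, in
exchange for at most one additional lock acquisition per batch rather than
one per requeued entry.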
Thanks,
Qi
>
>> }
>> + folios_put(&fbatch);
>>
>> - spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> - list_splice_tail(&list, &ds_queue->split_queue);
>> - ds_queue->split_queue_len -= removed;
>> - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>> -
>> - if (prev)
>> - folio_put(prev);
>> + if (sc->nr_to_scan)
>> + goto retry;
>>
>> /*
>> * Stop shrinker if we didn't split any page, but the queue is empty.
>> --
>> 2.20.1
>>