All of lore.kernel.org
 help / color / mirror / Atom feed
From: Qi Zheng <qi.zheng@linux.dev>
To: Wei Yang <richard.weiyang@gmail.com>
Cc: hannes@cmpxchg.org, hughd@google.com, mhocko@suse.com,
	roman.gushchin@linux.dev, shakeel.butt@linux.dev,
	muchun.song@linux.dev, david@redhat.com,
	lorenzo.stoakes@oracle.com, ziy@nvidia.com, harry.yoo@oracle.com,
	baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com,
	npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com,
	baohua@kernel.org, lance.yang@linux.dev,
	akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	Muchun Song <songmuchun@bytedance.com>,
	Qi Zheng <zhengqi.arch@bytedance.com>
Subject: Re: [PATCH v5 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()
Date: Fri, 7 Nov 2025 10:51:15 +0800	[thread overview]
Message-ID: <131176ed-8901-4a04-92ce-e270fc536404@linux.dev> (raw)
In-Reply-To: <20251106145213.jblfgslgjzfr3z7h@master>



On 11/6/25 10:52 PM, Wei Yang wrote:
> On Wed, Oct 15, 2025 at 02:35:32PM +0800, Qi Zheng wrote:
>> From: Muchun Song <songmuchun@bytedance.com>
>>
>> The maintenance of the folio->_deferred_list is intricate because it's
>> reused in a local list.
>>
>> Here are some peculiarities:
>>
>>    1) When a folio is removed from its split queue and added to a local
>>       on-stack list in deferred_split_scan(), the ->split_queue_len isn't
>>       updated, leading to an inconsistency between it and the actual
>>       number of folios in the split queue.
>>
>>    2) When the folio is split via split_folio() later, it's removed from
>>       the local list while holding the split queue lock. At this time,
>>       the lock is not needed as it is not protecting anything.
>>
>>    3) To handle the race condition with a third-party freeing or migrating
>>       the preceding folio, we must ensure there's always one safe (with
>>       raised refcount) folio before by delaying its folio_put(). More
>>       details can be found in commit e66f3185fa04 ("mm/thp: fix deferred
>>       split queue not partially_mapped"). It's rather tricky.
>>
>> We can use the folio_batch infrastructure to handle this clearly. In this
>> case, ->split_queue_len will be consistent with the real number of folios
>> in the split queue. If list_empty(&folio->_deferred_list) returns false,
>> it's clear the folio must be in its split queue (not in a local list
>> anymore).
>>
>> In the future, we will reparent LRU folios during memcg offline to
>> eliminate dying memory cgroups, which requires reparenting the split queue
>> to its parent first. So this patch prepares for using
>> folio_split_queue_lock_irqsave() as the memcg may change then.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> Reviewed-by: Zi Yan <ziy@nvidia.com>
>> Acked-by: David Hildenbrand <david@redhat.com>
>> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
>> ---
>> mm/huge_memory.c | 87 +++++++++++++++++++++++-------------------------
>> 1 file changed, 41 insertions(+), 46 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index a68f26547cd99..e850bc10da3e2 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3782,21 +3782,22 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> 		struct lruvec *lruvec;
>> 		int expected_refs;
>>
>> -		if (folio_order(folio) > 1 &&
>> -		    !list_empty(&folio->_deferred_list)) {
>> -			ds_queue->split_queue_len--;
>> +		if (folio_order(folio) > 1) {
>> +			if (!list_empty(&folio->_deferred_list)) {
>> +				ds_queue->split_queue_len--;
>> +				/*
>> +				 * Reinitialize page_deferred_list after removing the
>> +				 * page from the split_queue, otherwise a subsequent
>> +				 * split will see list corruption when checking the
>> +				 * page_deferred_list.
>> +				 */
>> +				list_del_init(&folio->_deferred_list);
>> +			}
>> 			if (folio_test_partially_mapped(folio)) {
>> 				folio_clear_partially_mapped(folio);
>> 				mod_mthp_stat(folio_order(folio),
>> 					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>> 			}
>> -			/*
>> -			 * Reinitialize page_deferred_list after removing the
>> -			 * page from the split_queue, otherwise a subsequent
>> -			 * split will see list corruption when checking the
>> -			 * page_deferred_list.
>> -			 */
>> -			list_del_init(&folio->_deferred_list);
> 
> @Andrew
> 
> Current mm-new looks not merge the code correctly?
> 
> The above removed code is still there.
> 
> @Qi
> 
> After rescan this, I am confused about this code change.
> 
> The difference here is originally it would check/clear partially_mapped if
> folio is on a list. But now we would do this even folio is not on a list.
> 
> If my understanding is correct, after this change, !list_empty() means folio
> is on its ds_queue. And there are total three places to remove it from
> ds_queue.
> 
>    1) __folio_unqueue_deferred_split()
>    2) deferred_split_scan()
>    3) __folio_split()
> 
> In 1) and 2) we all clear partially_mapped bit before removing folio from
> ds_queue, this means if the folio is not on ds_queue in __folio_split(), it is
> not necessary to check/clear partially_mapped bit.

In deferred_split_scan(), if folio_try_get() succeeds, then only the
folio will be removed from ds_queue, but not clear partially_mapped.

> 
> Maybe I missed something, would you mind correct me on this?
> 


  parent reply	other threads:[~2025-11-07  2:51 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-15  6:35 [PATCH v5 0/4] reparent the THP split queue Qi Zheng
2025-10-15  6:35 ` [PATCH v5 1/4] mm: thp: replace folio_memcg() with folio_memcg_charged() Qi Zheng
2025-10-15  6:35 ` [PATCH v5 2/4] mm: thp: introduce folio_split_queue_lock and its variants Qi Zheng
2025-10-15  6:35 ` [PATCH v5 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan() Qi Zheng
2025-10-17  0:46   ` Wei Yang
2025-10-17  2:33     ` Qi Zheng
2025-10-17  5:38   ` Harry Yoo
2025-11-06 14:52   ` Wei Yang
2025-11-07  2:29     ` Andrew Morton
2025-11-07  2:52       ` Qi Zheng
2025-11-07  2:51     ` Qi Zheng [this message]
2025-11-07  2:59       ` Wei Yang
2025-10-15  6:35 ` [PATCH v5 4/4] mm: thp: reparent the split queue during memcg offline Qi Zheng
2025-10-21  6:09   ` Harry Yoo
2025-10-21  6:21     ` Qi Zheng
2025-10-21  9:29       ` Harry Yoo
2025-10-21  9:43         ` Qi Zheng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=131176ed-8901-4a04-92ce-e270fc536404@linux.dev \
    --to=qi.zheng@linux.dev \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=cgroups@vger.kernel.org \
    --cc=david@redhat.com \
    --cc=dev.jain@arm.com \
    --cc=hannes@cmpxchg.org \
    --cc=harry.yoo@oracle.com \
    --cc=hughd@google.com \
    --cc=lance.yang@linux.dev \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mhocko@suse.com \
    --cc=muchun.song@linux.dev \
    --cc=npache@redhat.com \
    --cc=richard.weiyang@gmail.com \
    --cc=roman.gushchin@linux.dev \
    --cc=ryan.roberts@arm.com \
    --cc=shakeel.butt@linux.dev \
    --cc=songmuchun@bytedance.com \
    --cc=zhengqi.arch@bytedance.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.