From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Wei Yang <richard.weiyang@gmail.com>
Cc: akpm@linux-foundation.org, ljs@kernel.org, ziy@nvidia.com,
	baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com,
	npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com,
	baohua@kernel.org, lance.yang@linux.dev, riel@surriel.com,
	vbabka@kernel.org, harry@kernel.org, jannh@google.com,
	rppt@kernel.org, surenb@google.com, mhocko@suse.com,
	shuah@kernel.org, linux-mm@kvack.org,
	Gavin Guo <gavinguo@igalia.com>
Subject: Re: [PATCH 1/2] mm/huge_memory: return true if split_huge_pmd_locked() split PMD to migration entry
Date: Mon, 4 May 2026 14:44:43 +0200	[thread overview]
Message-ID: <ccb23360-d257-491e-926e-2a7f7b9c98c6@kernel.org> (raw)
In-Reply-To: <20260503003818.t35q5roc7osx6se2@master>

On 5/3/26 02:38, Wei Yang wrote:
> On Wed, Apr 29, 2026 at 08:55:07AM +0200, David Hildenbrand (Arm) wrote:
>> On 4/29/26 04:49, Wei Yang wrote:
>>>
>>> Below is my proposed change:
>>>
>>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>> index a4d52fdb3056..6e915d35ae54 100644
>>> --- a/mm/page_vma_mapped.c
>>> +++ b/mm/page_vma_mapped.c
>>> @@ -273,17 +273,21 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>  
>>>  			if (softleaf_is_device_private(entry)) {
>>>  				pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>>> -				return true;
>>> +				if (pmd_same(pmde, pmdp_get_lockless(pvmw->pmd)))
>>> +					return true;
>>
>> As we have a softleaf entry, I assume we wouldn't expect to get any other bits
>> (access/dirty) set until we grab the lock. Verifying
>> softleaf_is_device_private() again would be cleaner, though.
>>
> 
> Got it.
> 
>> But really, I do wonder if we should just have a "goto retry" back to the "pmde
>> = pmdp_get_lockless(pvmw->pmd);" instead?
>>
> 
> Sounds reasonable. See below.
> 
>>
>> And now I wonder why we don't have check_pmd() handling in there? :/
>>
>> Should we check for the pfn here?
> 
> Thanks for pointing that out. I think you are right.
> 
> After re-reading the code, more questions came to mind. I am afraid we need
> more cleanup for page_vma_mapped_walk().
> 
> Below are my findings, based on my current understanding:
> 
> 1. thp_migration_supported() seems unnecessary
> 
>    Reaching this code means pmd_is_migration_entry() returned true, which
>    means CONFIG_ARCH_ENABLE_THP_MIGRATION is set; otherwise
>    softleaf_from_pmd() would return softleaf_mk_none(), which is not a
>    migration softleaf.
>    
>    CONFIG_ARCH_ENABLE_THP_MIGRATION being set in turn means
>    CONFIG_TRANSPARENT_HUGEPAGE is set, so thp_migration_supported() must
>    return true.
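> 
>    For reference, thp_migration_supported() boils down to a Kconfig check
>    (include/linux/huge_mm.h):
> 
> 	static inline bool thp_migration_supported(void)
> 	{
> 		return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
> 	}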
> 
> 2. if the migration entry changes under us, we may need to handle it at the pte level
> 
>    In the pmd_is_migration_entry() -> !pmd_present() branch, we have:
> 
> 		if (!softleaf_is_migration(entry) ||
> 		    !check_pmd(softleaf_to_pfn(entry), pvmw))
> 			return not_found(pvmw);
> 		return true;
> 
>    But I think we need to do this:
> 
> 		if (softleaf_is_migration(entry)) {
> 			if (!check_pmd(softleaf_to_pfn(entry), pvmw))
> 				return not_found(pvmw);
> 			return true;
> 		}
> 		/* entry changed under us: fall through to the pte level */
> 
>    Per my understanding, if the migration entry changed under us, we need
>    to handle it at the pte level, just like in the pmd_trans_huge() case.
>    Breaking the loop and returning false seems inconsistent.
> 
> 3. add proper checks for device private entries
> 
>    For a device private entry, we currently just grab the lock and return.
>    Following the handling of pmd_trans_huge() and pmd_is_migration_entry(),
>    we should (see the sketch after this list):
> 
>      * re-validate that it is still a device private entry after pmd_lock()
>      * check PVMW_MIGRATION
>      * check_pmd()
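> 
>    Completely untested, but roughly like this (the failure path may instead
>    want the pte-level fallback from point 2, and I am not sure the
>    PVMW_MIGRATION check is right for device private entries):
> 
> 		if (softleaf_is_device_private(entry)) {
> 			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> 			/* re-validate now that we hold the lock */
> 			if (!pmd_same(pmde, pmdp_get_lockless(pvmw->pmd)) ||
> 			    !(pvmw->flags & PVMW_MIGRATION) ||
> 			    !check_pmd(softleaf_to_pfn(entry), pvmw))
> 				return not_found(pvmw);
> 			return true;
> 		}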
> 
> 4. consolidate pmd entry handling
> 
>    Per my understanding, there are 4 cases for pmd entry handling:
> 
>      * pmd_trans_huge()
>      * pmd_is_migration_entry()
>      * pmd_is_device_private_entry()
>      * !pmd_present()
> 
>    We currently handle them via mixed state checks, which complicates the
>    logic, and the first three share similar steps (if my analysis above is
>    correct):
> 
>      * grab pmd_lock()
>      * re-validate pmde
>      * check PVMW_MIGRATION
>      * check_pmd
> 
>    Here I would like to take a bolder step: consolidate the handling of
>    these three cases.
> 
>    Below is what it would look like.
> 
> 	pmde = pmdp_get_lockless(pvmw->pmd);
> 
> 	if (pmd_trans_huge(pmde) || pmd_is_valid_softleaf(pmde)) {
> 		unsigned long pfn;
> 		bool is_migration;
> 		bool for_migration;
> 
> 		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> 		if (pmd_same(pmde, pmdp_get_lockless(pvmw->pmd))) {
> 			is_migration = pmd_is_migration_entry(pmde);
> 			for_migration = !!(pvmw->flags & PVMW_MIGRATION);
> 
> 			if (is_migration != for_migration)
> 				return not_found(pvmw);
> 
> 			if (pmd_trans_huge(pmde))
> 				pfn = pmd_pfn(pmde);
> 			else
> 				pfn = softleaf_to_pfn(softleaf_from_pmd(pmde));
> 
> 			if (!check_pmd(pfn, pvmw))
> 				return not_found(pvmw);
> 
> 			return true;
> 		}
> 		/* THP pmd was split under us: handle on pte level */
> 		spin_unlock(pvmw->ptl);
> 		pvmw->ptl = NULL;
> 	} else if (!pmd_present(pmde)) {
> 		if ((pvmw->flags & PVMW_SYNC) &&
> 		    thp_vma_suitable_order(vma, pvmw->address,
> 					   PMD_ORDER) &&
> 		    (pvmw->nr_pages >= HPAGE_PMD_NR))
> 			sync_with_folio_pmd_zap(mm, pvmw->pmd);
> 
> 		step_forward(pvmw, PMD_SIZE);
> 		continue;
> 	}
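> 
>    Here pmd_is_valid_softleaf() would be a new helper, roughly:
> 
> 	static inline bool pmd_is_valid_softleaf(pmd_t pmd)
> 	{
> 		return pmd_is_migration_entry(pmd) ||
> 		       pmd_is_device_private_entry(pmd);
> 	}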
> 
> 5. use "goto retry"
> 
>    As you mentioned above: instead of "handle on pte level", go back to
>    pmdp_get_lockless() and retry. This looks more reasonable to me; a rough
>    sketch follows.
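> 
>    i.e. something like:
> 
> retry:
> 	pmde = pmdp_get_lockless(pvmw->pmd);
> 
> 	if (pmd_trans_huge(pmde) || pmd_is_valid_softleaf(pmde)) {
> 		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> 		if (!pmd_same(pmde, pmdp_get_lockless(pvmw->pmd))) {
> 			/* changed under us (e.g. split): re-read and retry */
> 			spin_unlock(pvmw->ptl);
> 			pvmw->ptl = NULL;
> 			goto retry;
> 		}
> 		/* ... checks as in point 4 ... */
> 	}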
>>
>>> +				/* THP pmd was split under us: handle on pte level */
>>> +				spin_unlock(pvmw->ptl);
>>> +				pvmw->ptl = NULL;
>>
>>> +			} else {
>>> +				if ((pvmw->flags & PVMW_SYNC) &&
>>> +				    thp_vma_suitable_order(vma, pvmw->address,
>>> +							   PMD_ORDER) &&
>>> +				    (pvmw->nr_pages >= HPAGE_PMD_NR))
>>> +					sync_with_folio_pmd_zap(mm, pvmw->pmd);
>>> +
>>> +				step_forward(pvmw, PMD_SIZE);
>>> +				continue;
>>>  			}
>>> -
>>> -			if ((pvmw->flags & PVMW_SYNC) &&
>>> -			    thp_vma_suitable_order(vma, pvmw->address,
>>> -						   PMD_ORDER) &&
>>> -			    (pvmw->nr_pages >= HPAGE_PMD_NR))
>>> -				sync_with_folio_pmd_zap(mm, pvmw->pmd);
>>> -
>>> -			step_forward(pvmw, PMD_SIZE);
>>> -			continue;
>>>  		}
>>>  		if (!map_pte(pvmw, &pmde, &ptl)) {
>>>  			if (!pvmw->pte)
>>>
>>> After this, we could simplify the logic in try_to_migrate_one() as:
>>>
>>> @@ -2471,14 +2471,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>>                                  * so we can detect this scenario and properly
>>>                                  * abort the walk.
>>>                                  */
>>> -                               if (split_huge_pmd_locked(vma, pvmw.address,
>>> -                                                     pvmw.pmd, true)) {
>>> -                                       page_vma_mapped_walk_done(&pvmw);
>>> -                                       break;
>>> -                               }
>>> -                               flags &= ~TTU_SPLIT_HUGE_PMD;
>>> -                               page_vma_mapped_walk_restart(&pvmw);
>>> -                               continue;
>>> +                               ret = split_huge_pmd_locked(vma, pvmw.address,
>>> +                                                     pvmw.pmd, true);
>>> +                               page_vma_mapped_walk_done(&pvmw);
>>> +                               break;
>>>                         }
>>
>> Right. But just to be clear: Let's split the page_vma_mapped_walk() validation
>> (which looks like a bugfix to me) from the other optimization.
>>
> 
> Sure, maybe we can split the page_vma_mapped_walk() cleanup out into a
> separate patch set for easier review?

Yes, but I assume it could even be fixes?

-- 
Cheers,

David

