From: Wei Yang <richard.weiyang@gmail.com>
To: Wei Yang <richard.weiyang@gmail.com>
Cc: "David Hildenbrand (Arm)" <david@kernel.org>,
	akpm@linux-foundation.org, ljs@kernel.org, ziy@nvidia.com,
	baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com,
	npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com,
	baohua@kernel.org, lance.yang@linux.dev, riel@surriel.com,
	vbabka@kernel.org, harry@kernel.org, jannh@google.com,
	rppt@kernel.org, surenb@google.com, mhocko@suse.com,
	shuah@kernel.org, linux-mm@kvack.org,
	Gavin Guo <gavinguo@igalia.com>
Subject: Re: [PATCH 1/2] mm/huge_memory: return true if split_huge_pmd_locked() split PMD to migration entry
Date: Thu, 7 May 2026 07:56:59 +0000
Message-ID: <20260507075659.mverdcmofmoymtmf@master>
In-Reply-To: <20260505031514.hlnn5o7wkad4teo2@master>

On Tue, May 05, 2026 at 03:15:14AM +0000, Wei Yang wrote:
>On Mon, May 04, 2026 at 02:44:43PM +0200, David Hildenbrand (Arm) wrote:
>>On 5/3/26 02:38, Wei Yang wrote:
>>> On Wed, Apr 29, 2026 at 08:55:07AM +0200, David Hildenbrand (Arm) wrote:
>>>> On 4/29/26 04:49, Wei Yang wrote:
>>>>>
>>>>> Below is my proposed change:
>>>>>
>>>>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>>>> index a4d52fdb3056..6e915d35ae54 100644
>>>>> --- a/mm/page_vma_mapped.c
>>>>> +++ b/mm/page_vma_mapped.c
>>>>> @@ -273,17 +273,21 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>>>  
>>>>>  			if (softleaf_is_device_private(entry)) {
>>>>>  				pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>>>>> -				return true;
>>>>> +				if (pmd_same(pmde, pmdp_get_lockless(pvmw->pmd)))
>>>>> +					return true;
>>>>
>>>> As we have a softleaf entry, I assume we wouldn't expect to get any other bits
>>>> (access/dirty) set until we grab the lock. Verifying
>>>> softleaf_is_device_private() again would be cleaner, though.
>>>>
>>> 
>>> Got it.
>>> 
>>>> But really, I do wonder if we should just have a "goto retry" back to the "pmde
>>>> = pmdp_get_lockless(pvmw->pmd);" instead?
>>>>
>>> 
>>> Sounds reasonable. See below.
>>> 
>>>>
>>>> And now I wonder why we don't have check_pmd() handling in there? :/
>>>>
>>>> Should we check for the pfn here?
>>> 
>>> Thanks for pointing that out. I think you are right.
>>> 
>>> After re-reading the code, more questions came to mind. I am afraid we need
>>> more cleanup in page_vma_mapped_walk().
>>> 
>>> Below are my findings, based on my current understanding:
>>> 
>>> 1. thp_migration_supported() seems unnecessary
>>> 
>>>    Reaching this code means pmd_is_migration_entry() returned true, which
>>>    means CONFIG_ARCH_ENABLE_THP_MIGRATION is set; otherwise
>>>    softleaf_from_pmd() would return softleaf_mk_none(), which is not a
>>>    migration softleaf.
>>> 
>>>    CONFIG_ARCH_ENABLE_THP_MIGRATION being set in turn means
>>>    CONFIG_TRANSPARENT_HUGEPAGE is set, so thp_migration_supported() must
>>>    return true (see the sketch below).
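>>> 
>>>    For reference, a minimal sketch of what I am assuming here:
>>>    thp_migration_supported() as defined in include/linux/huge_mm.h, and
>>>    the arch Kconfigs selecting ARCH_ENABLE_THP_MIGRATION only when
>>>    TRANSPARENT_HUGEPAGE is enabled.
>>> 
>>> 	/* include/linux/huge_mm.h */
>>> 	static inline bool thp_migration_supported(void)
>>> 	{
>>> 		return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
>>> 	}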
>>> 
>>> 2. if the migration entry changed under us, we may need pte-level handling
>>> 
>>>    In the pmd_is_migration_entry() -> !pmd_present() branch, we have:
>>> 
>>> 		if (!softleaf_is_migration(entry) ||
>>> 		    !check_pmd(softleaf_to_pfn(entry), pvmw))
>>> 			return not_found(pvmw);
>>> 		return true;
>>> 
>>>    But I think we need to do this:
>>> 
>>> 		if (softleaf_is_migration(entry)) {
>>> 			if (!check_pmd(softleaf_to_pfn(entry), pvmw))
>>> 				return not_found(pvmw);
>>> 			return true;
>>> 		}
>>> 
>>>    Per my understanding, if the migration entry changed under us, we need
>>>    to handle it at the pte level, just like the pmd_trans_huge() case.
>>>    Breaking the loop and returning false seems inconsistent.
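>>> 
>>>    The non-migration case would then fall through to something like the
>>>    following (a sketch only; it assumes the pmd lock is already held at
>>>    this point, as in the other branches):
>>> 
>>> 		/* entry changed under us: handle on pte level */
>>> 		spin_unlock(pvmw->ptl);
>>> 		pvmw->ptl = NULL;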
>>> 
>>> 3. add a proper check for device private entries
>>> 
>>>    For a device private entry, currently we just grab the lock and return.
>>>    But to be consistent with the handling of pmd_trans_huge() and
>>>    pmd_is_migration_entry(), we should (see the sketch after this list):
>>> 
>>>      * re-validate that it is still a device private entry after pmd_lock()
>>>      * check PVMW_MIGRATION
>>>      * check_pmd()
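>>> 
>>>    Roughly like this (a sketch only, reusing the names from above; not
>>>    tested, and whether PVMW_MIGRATION should reject or match a device
>>>    private entry still needs confirming):
>>> 
>>> 		if (softleaf_is_device_private(entry)) {
>>> 			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>>> 			pmde = pmdp_get(pvmw->pmd);
>>> 			entry = softleaf_from_pmd(pmde);
>>> 			if (!softleaf_is_device_private(entry) ||
>>> 			    (pvmw->flags & PVMW_MIGRATION) ||
>>> 			    !check_pmd(softleaf_to_pfn(entry), pvmw))
>>> 				return not_found(pvmw);
>>> 			return true;
>>> 		}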
>>> 
>>> 4. consolidate pmd entry handling
>>> 
>>>    Per my understanding, there are 4 cases for pmd entry handling:
>>> 
>>>      * pmd_trans_huge()
>>>      * pmd_is_migration_entry()
>>>      * pmd_is_device_private_entry()
>>>      * !pmd_present()
>>> 
>>>    Now we handle them with mixed state checks, which complicates the logic.
>>>    And the first three share similar steps (if my analysis above is correct):
>>> 
>>>      * grab pmd_lock()
>>>      * re-validate pmde
>>>      * check PVMW_MIGRATION
>>>      * check_pmd
>>> 
>>>    Here I would like to take a bolder step: consolidate the handling of
>>>    these three cases.
>>> 
>>>    Below is what it would look like.
>>> 
>>> 	pmde = pmdp_get_lockless(pvmw->pmd);
>>> 
>>> 	if (pmd_trans_huge(pmde) || pmd_is_valid_softleaf(pmde)) {
>>> 		unsigned long pfn;
>>> 		bool is_migration;
>>> 		bool for_migration;
>>> 
>>> 		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>>> 		if (pmd_same(pmde, pmdp_get_lockless(pvmw->pmd))) {
>>> 			is_migration = pmd_is_migration_entry(pmde);
>>> 			for_migration = !!(pvmw->flags & PVMW_MIGRATION);
>>> 
>>> 			if (is_migration != for_migration)
>>> 				return not_found(pvmw);
>>> 
>>> 			if (pmd_trans_huge(pmde))
>>> 				pfn = pmd_pfn(pmde);
>>> 			else
>>> 				pfn = softleaf_to_pfn(softleaf_from_pmd(pmde));
>>> 
>>> 			if (!check_pmd(pfn, pvmw))
>>> 				return not_found(pvmw);
>>> 
>>> 			return true;
>>> 		}
>>> 		/* THP pmd was split under us: handle on pte level */
>>> 		spin_unlock(pvmw->ptl);
>>> 		pvmw->ptl = NULL;
>>> 	} else if (!pmd_present(pmde)) {
>>> 		if ((pvmw->flags & PVMW_SYNC) &&
>>> 		    thp_vma_suitable_order(vma, pvmw->address,
>>> 					   PMD_ORDER) &&
>>> 		    (pvmw->nr_pages >= HPAGE_PMD_NR))
>>> 			sync_with_folio_pmd_zap(mm, pvmw->pmd);
>>> 
>>> 		step_forward(pvmw, PMD_SIZE);
>>> 		continue;
>>> 	}
>>> 
>>> 5. use "goto retry"
>>> 
>>>    As you mentioned above, instead of "handle on pte level", go back to
>>>    pmdp_get_lockless() and retry. This looks more reasonable to me.
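>>> 
>>>    Roughly (a sketch only):
>>> 
>>> 	retry:
>>> 		pmde = pmdp_get_lockless(pvmw->pmd);
>>> 		...
>>> 		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>>> 		if (!pmd_same(pmde, pmdp_get(pvmw->pmd))) {
>>> 			spin_unlock(pvmw->ptl);
>>> 			pvmw->ptl = NULL;
>>> 			goto retry;
>>> 		}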
>>>>
>>>>> +				/* THP pmd was split under us: handle on pte level */
>>>>> +				spin_unlock(pvmw->ptl);
>>>>> +				pvmw->ptl = NULL;
>>>>
>>>>
>>>>
>>>>> +			} else {
>>>>> +				if ((pvmw->flags & PVMW_SYNC) &&
>>>>> +				    thp_vma_suitable_order(vma, pvmw->address,
>>>>> +							   PMD_ORDER) &&
>>>>> +				    (pvmw->nr_pages >= HPAGE_PMD_NR))
>>>>> +					sync_with_folio_pmd_zap(mm, pvmw->pmd);
>>>>> +
>>>>> +				step_forward(pvmw, PMD_SIZE);
>>>>> +				continue;
>>>>>  			}
>>>>> -
>>>>> -			if ((pvmw->flags & PVMW_SYNC) &&
>>>>> -			    thp_vma_suitable_order(vma, pvmw->address,
>>>>> -						   PMD_ORDER) &&
>>>>> -			    (pvmw->nr_pages >= HPAGE_PMD_NR))
>>>>> -				sync_with_folio_pmd_zap(mm, pvmw->pmd);
>>>>> -
>>>>> -			step_forward(pvmw, PMD_SIZE);
>>>>> -			continue;
>>>>>  		}
>>>>>  		if (!map_pte(pvmw, &pmde, &ptl)) {
>>>>>  			if (!pvmw->pte)
>>>>>
>>>>> After this, we could simplify the logic in try_to_migrate_one() as:
>>>>>
>>>>> @@ -2471,14 +2471,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>                                  * so we can detect this scenario and properly
>>>>>                                  * abort the walk.
>>>>>                                  */
>>>>> -                               if (split_huge_pmd_locked(vma, pvmw.address,
>>>>> -                                                     pvmw.pmd, true)) {
>>>>> -                                       page_vma_mapped_walk_done(&pvmw);
>>>>> -                                       break;
>>>>> -                               }
>>>>> -                               flags &= ~TTU_SPLIT_HUGE_PMD;
>>>>> -                               page_vma_mapped_walk_restart(&pvmw);
>>>>> -                               continue;
>>>>> +                               ret = split_huge_pmd_locked(vma, pvmw.address,
>>>>> +                                                     pvmw.pmd, true);
>>>>> +                               page_vma_mapped_walk_done(&pvmw);
>>>>> +                               break;
>>>>>                         }
>>>>
>>>> Right. But just to be clear: Let's split the page_vma_mapped_walk() validation
>>>> (which looks like a bugfix to me) from the other optimization.
>>>>
>>> 
>>> Sure, maybe we can split the page_vma_mapped_walk() cleanup out into a
>>> separate patch set for easier review?
>>
>>Yes, but I assume it could even be fixes?
>
>Agree.
>
>Of the proposed changes above, #2 and #3 are the ones suitable as fixes.
>
> 2. if the migration entry changed under us, we may need pte-level handling

While preparing a fix for this, I couldn't find an actual bug case.

If a pmd migration entry is split under us, it becomes a pmd_present() entry,
while the related check is in the !pmd_present() branch. So the split case is
not affected.

There is one small behavioral change introduced by commit 616b8371539a.

  Before commit 616b8371539a, when a pmd_trans_huge() entry was zapped after
  pmd_lock(), the walk still continued at the pte level and then returned
  false.

  After commit 616b8371539a, a zapped pmd_trans_huge() entry is caught by
  !pmd_present() and returns not_found() directly.
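
To make that concrete, the flow after 616b8371539a is roughly (paraphrased
from memory, not the exact code):

	} else if (!pmd_present(pmde)) {
		/* a zapped THP lands here and is not a migration entry */
		return not_found(pvmw);
	}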

But I can't say this is a bug.

So I would put the related change in the cleanup series instead of the fixes.

> 3. add a proper check for device private entries
>
>The corresponding commits to fix are:
>
>   commit 616b8371539a6c487404c3b8fb04078016dab4ba
>   Author: Zi Yan <zi.yan@cs.rutgers.edu>
>   Date:   Fri Sep 8 16:10:57 2017 -0700
>   
>       mm: thp: enable thp migration in generic path
>
>   commit 65edfda6f3f2e58f757485a056e4f1775a1404a8
>   Author: Balbir Singh <balbirs@nvidia.com>
>   Date:   Wed Oct 1 16:56:55 2025 +1000
>
>       mm/rmap: extend rmap and migration support device-private entries
>
>As Andrew suggested, I will send the fixes and the cleanups separately.
>
>After all these settle down, I will respin this series.
>
>-- 
>Wei Yang
>Help you, Help me

-- 
Wei Yang
Help you, Help me


Thread overview: 12+ messages
2026-04-15  1:08 [PATCH 0/2] mm/huge_memory: optimize migration when huge PMD needs split Wei Yang
2026-04-15  1:08 ` [PATCH 1/2] mm/huge_memory: return true if split_huge_pmd_locked() split PMD to migration entry Wei Yang
2026-04-24 19:29   ` David Hildenbrand (Arm)
2026-04-26  9:19     ` Wei Yang
2026-04-28  8:24       ` David Hildenbrand (Arm)
2026-04-29  2:49         ` Wei Yang
2026-04-29  6:55           ` David Hildenbrand (Arm)
2026-05-03  0:38             ` Wei Yang
2026-05-04 12:44               ` David Hildenbrand (Arm)
2026-05-05  3:15                 ` Wei Yang
2026-05-07  7:56                   ` Wei Yang [this message]
2026-04-15  1:08 ` [PATCH 2/2] mm/selftests: add split_shared_pmd() Wei Yang
