Re: [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Lance Yang <lance.yang@linux.dev>
To: ziy@nvidia.com
Cc: Liam.Howlett@oracle.com, akpm@linux-foundation.org,
	baolin.wang@linux.alibaba.com, david@kernel.org,
	gavinguo@igalia.com, gshan@redhat.com, harry.yoo@oracle.com,
	jannh@google.com, linux-mm@kvack.org, lorenzo.stoakes@oracle.com,
	richard.weiyang@gmail.com, riel@surriel.com,
	stable@vger.kernel.org, vbabka@suse.cz,
	Lance Yang <lance.yang@linux.dev>
Subject: Re: [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp
Date: Tue,  3 Feb 2026 21:20:06 +0800	[thread overview]
Message-ID: <20260203132006.66958-1-lance.yang@linux.dev> (raw)
In-Reply-To: <4D8CC775-A86C-4D80-ADB3-6F5CD0FF9330@nvidia.com>


On Sun, Feb 01, 2026 at 09:20:35AM -0500, Zi Yan wrote:
>On 1 Feb 2026, at 8:04, Gavin Guo wrote:
>
>> On 2/1/26 11:39, Zi Yan wrote:
>>> On 31 Jan 2026, at 21:09, Wei Yang wrote:
>>>
>>>> On Fri, Jan 30, 2026 at 09:44:10PM -0500, Zi Yan wrote:
>>>>> On 30 Jan 2026, at 18:00, Wei Yang wrote:
>>>>>
>>>>>> Commit 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and
>>>>>> split_huge_pmd_locked()") return false unconditionally after
>>>>>> split_huge_pmd_locked() which may fail early during try_to_migrate() for
>>>>>> shared thp. This will lead to unexpected folio split failure.
>>>>>>
>>>>>> One way to reproduce:
>>>>>>
>>>>>>      Create an anonymous thp range and fork 512 children, so we have a
>>>>>>      thp shared mapped in 513 processes. Then trigger folio split with
>>>>>>      /sys/kernel/debug/split_huge_pages debugfs to split the thp folio to
>>>>>>      order 0.
>>>>>>
>>>>>> Without the above commit, we can successfully split to order 0.
>>>>>> With the above commit, the folio is still a large folio.
>>>>>>
>>>>>> The reason is the above commit return false after split pmd
>>>>>> unconditionally in the first process and break try_to_migrate().
>>>>>
>>>>> The reasoning looks good to me.
>>>>>
>>>>>>
>>>>>> The tricky thing in above reproduce method is current debugfs interface
>>>>>> leverage function split_huge_pages_pid(), which will iterate the whole
>>>>>> pmd range and do folio split on each base page address. This means it
>>>>>> will try 512 times, and each time split one pmd from pmd mapped to pte
>>>>>> mapped thp. If there are less than 512 shared mapped process,
>>>>>> the folio is still split successfully at last. But in real world, we
>>>>>> usually try it for once.
>>>>>>
>>>>>> This patch fixes this by removing the unconditional false return after
>>>>>> split_huge_pmd_locked(). Later, we may introduce a true fail early if
>>>>>> split_huge_pmd_locked() does fail.
>>>>>>
>>>>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>>>>>> Fixes: 60fbb14396d5 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()")
>>>>>> Cc: Gavin Guo <gavinguo@igalia.com>
>>>>>> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>> Cc: <stable@vger.kernel.org>
>>>>>> ---
>>>>>>   mm/rmap.c | 1 -
>>>>>>   1 file changed, 1 deletion(-)
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 618df3385c8b..eed971568d65 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -2448,7 +2448,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>>   			if (flags & TTU_SPLIT_HUGE_PMD) {
>>>>>>   				split_huge_pmd_locked(vma, pvmw.address,
>>>>>>   						      pvmw.pmd, true);
>>>>>> -				ret = false;
>>>>>>   				page_vma_mapped_walk_done(&pvmw);
>>>>>>   				break;
>>>>>>   			}
>>>>>
>>>>> How about the patch below? It matches the pattern of set_pmd_migration_entry() below.
>>>>> Basically, continue if the operation is successful, break otherwise.
>>>>>
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 618df3385c8b..83cc9d98533e 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -2448,9 +2448,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>>>> 			if (flags & TTU_SPLIT_HUGE_PMD) {
>>>>> 				split_huge_pmd_locked(vma, pvmw.address,
>>>>> 						      pvmw.pmd, true);
>>>>> -				ret = false;
>>>>> -				page_vma_mapped_walk_done(&pvmw);
>>>>> -				break;
>>>>> +				continue;
>>>>> 			}
>>>>
>>>> Per my understanding if @freeze is trur, split_huge_pmd_locked() may "fail" as
>>>> the comment says:
>>>>
>>>> 		 * Without "freeze", we'll simply split the PMD, propagating the
>>>> 		 * PageAnonExclusive() flag for each PTE by setting it for
>>>> 		 * each subpage -- no need to (temporarily) clear.
>>>> 		 *
>>>> 		 * With "freeze" we want to replace mapped pages by
>>>> 		 * migration entries right away. This is only possible if we
>>>> 		 * managed to clear PageAnonExclusive() -- see
>>>> 		 * set_pmd_migration_entry().
>>>> 		 *
>>>> 		 * In case we cannot clear PageAnonExclusive(), split the PMD
>>>> 		 * only and let try_to_migrate_one() fail later.
>>>>
>>>> While currently we don't return the status of split_huge_pmd_locked() to
>>>> indicate whether it does replaced PMD with migration entries successfully. So
>>>> we are not sure this operation succeed.
>>>
>>> This is the right reasoning. This means to properly handle it, split_huge_pmd_locked()
>>> needs to return whether it inserts migration entries or not when freeze is true.
>>>
>>>>
>>>> Another difference from set_pmd_migration_entry() is split_huge_pmd_locked()
>>>> would change the page table from PMD mapped to PTE mapped.
>>>> page_vma_mapped_walk() can handle it now for (pvmw->pmd && !pvmw->pte), but I
>>>> am not sure this is what we expected. For example, in try_to_unmap_one(), we
>>>> use page_vma_mapped_walk_restart() after pmd splitted.
>>>>
>>>> So I prefer just remove the "ret = false" for a fix. Not sure this is
>>>> reasonable to you.
>>>>
>>>> I am thinking two things after this fix:
>>>>
>>>>    * add one similar test in selftests
>>>>    * let split_huge_pmd_locked() return value to indicate freeze is degrade to
>>>>      !freeze, and fail early on try_to_migrate() like the thp migration branch
>>>>
>>>> Look forward your opinion on whether it worth to do it.
>>>
>>> This is not the right fix, neither was mine above. Because before commit 60fbb14396d5,
>>> the code handles PAE properly. If PAE is cleared, PMD is split into PTEs and each
>>> PTE becomes a migration entry, page_vma_mapped_walk(&pvmw) returns false,
>>> and try_to_migrate_one() returns true. If PAE is not cleared, PMD is split into PTEs
>>> and each PTE is not a migration entry, inside while (page_vma_mapped_walk(&pvmw)),
>>> PAE will be attempted to get cleared again and it will fail again, leading to
>>> try_to_migrate_one() returns false. After commit 60fbb14396d5, no matter PAE is
>>> cleared or not, try_to_migrate_one() always returns false. It causes folio split
>>> failures for shared PMD THPs.
>>>
>>> Now with your fix (and mine above), no matter PAE is cleared or not, try_to_migrate_one()
>>> always returns true. It just flips the code to a different issue. So the proper fix
>>> is to let split_huge_pmd_locked() returns whether it inserts migration entries or not
>>> and do the same pattern as THP migration code path.
>>
>> How about aligning with the try_to_unmap_one()? The behavior would be the same before applying the commit 60fbb14396d5:
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 7b9879ef442d..0c96f0883013 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2333,9 +2333,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>                         if (flags & TTU_SPLIT_HUGE_PMD) {
>>                                 split_huge_pmd_locked(vma, pvmw.address,
>>                                                       pvmw.pmd, true);
>> -                               ret = false;
>> -                               page_vma_mapped_walk_done(&pvmw);
>> -                               break;
>> +                               flags &= ~TTU_SPLIT_HUGE_PMD;
>> +                               page_vma_mapped_walk_restart(&pvmw);
>> +                               continue;
>>                         }
>>  #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>>                         pmdval = pmdp_get(pvmw.pmd);
>
>Yes, it works and definitely needs a comment like "After split_huge_pmd_locked(), restart
>the walk to detect PageAnonExclusive handling failure in __split_huge_pmd_locked()".
>The change is good for backporting, but an additional patch to fix it properly by adding
>a return value to split_huge_pmd_locked() is also necessary.

Right. IIUC, after split_huge_pmd_locked() we always have 512 PTEs: either
migration entries, or present PTEs with PageAnonExclusive still set.

And try_to_migrate_one() doesn't use PVMW_MIGRATION. So when we restart the
walk we're either seeing migration — then map_pte/check_pte() won't match,
we hit not_found() and leave the loop with ret still true.

Or we see present (with PageAnonExclusive still set) — then we drop into
the normal PTE path, call folio_try_share_anon_rmap_pte() again, and set
ret=false when it fails.

Cheers,
Lance

next prev parent reply	other threads:[~2026-02-03 13:20 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-30 23:00 [PATCH] mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared thp Wei Yang
2026-01-31  2:44 ` Zi Yan
2026-02-01  2:09   ` Wei Yang
2026-02-01  3:39     ` Zi Yan
2026-02-01 13:04       ` Gavin Guo
2026-02-01 14:20         ` Zi Yan
2026-02-03  0:00           ` Wei Yang
2026-02-03  0:07             ` Zi Yan
2026-02-03 13:04               ` Wei Yang
2026-02-03 13:07                 ` Zi Yan
2026-02-03 13:20           ` Lance Yang [this message]
2026-02-02 23:57       ` Wei Yang
2026-02-03  0:05         ` Zi Yan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260203132006.66958-1-lance.yang@linux.dev \
    --to=lance.yang@linux.dev \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=david@kernel.org \
    --cc=gavinguo@igalia.com \
    --cc=gshan@redhat.com \
    --cc=harry.yoo@oracle.com \
    --cc=jannh@google.com \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=richard.weiyang@gmail.com \
    --cc=riel@surriel.com \
    --cc=stable@vger.kernel.org \
    --cc=vbabka@suse.cz \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.