All of lore.kernel.org
 help / color / mirror / Atom feed
From: Lance Yang <lance.yang@linux.dev>
To: Zi Yan <ziy@nvidia.com>
Cc: akpm@linux-foundation.org, david@redhat.com,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	baohua@kernel.org, baolin.wang@linux.alibaba.com,
	dev.jain@arm.com, hughd@google.com, ioworker0@gmail.com,
	kirill@shutemov.name, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, mpenttil@redhat.com, npache@redhat.com,
	ryan.roberts@arm.com, richard.weiyang@gmail.com
Subject: Re: [PATCH mm-new v3 1/1] mm/khugepaged: abort collapse scan on non-swap entries
Date: Fri, 10 Oct 2025 23:34:14 +0800	[thread overview]
Message-ID: <73bbc22f-2739-4f0e-b8d7-b8e344a3fead@linux.dev> (raw)
In-Reply-To: <3ABDAD81-A650-4C9A-BB4B-5515180F0743@nvidia.com>



On 2025/10/10 23:21, Zi Yan wrote:
> On 7 Oct 2025, at 23:26, Lance Yang wrote:
> 
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Currently, special non-swap entries (like PTE markers) are not caught
>> early in hpage_collapse_scan_pmd(), leading to failures deep in the
>> swap-in logic.
>>
>> A function that is called __collapse_huge_page_swapin() and documented
>> to "Bring missing pages in from swap" will handle other types as well.
>>
>> As analyzed by David[1], we could have ended up with the following
>> entry types right before do_swap_page():
>>
>>    (1) Migration entries. We would have waited.
>>        -> Maybe worth it to wait, maybe not. We suspect we don't stumble
>>           into that frequently such that we don't care. We could always
>>           unlock this separately later.
>>
>>    (2) Device-exclusive entries. We would have converted to non-exclusive.
>>        -> See make_device_exclusive(), we cannot tolerate PMD entries and
>>           have to split them through FOLL_SPLIT_PMD. As popped up during
>>           a recent discussion, collapsing here is actually
>>           counter-productive, because the next conversion will PTE-map
>>           it again.
>>        -> Ok to not collapse.
>>
>>    (3) Device-private entries. We would have migrated to RAM.
>>        -> Device-private still does not support THPs, so collapsing right
>>           now just means that the next device access would split the
>>           folio again.
>>        -> Ok to not collapse.
>>
>>    (4) HWPoison entries
>>        -> Cannot collapse
>>
>>    (5) Markers
>>        -> Cannot collapse
>>
>> First, this patch adds an early check for these non-swap entries. If
>> any one is found, the scan is aborted immediately with the
>> SCAN_PTE_NON_PRESENT result, as Lorenzo suggested[2], avoiding wasted
>> work. While at it, convert pte_swp_uffd_wp_any() to pte_swp_uffd_wp()
>> since we are in the swap pte branch.
>>
>> Second, as Wei pointed out[3], we may have a chance to get a non-swap
>> entry, since we will drop and re-acquire the mmap lock before
>> __collapse_huge_page_swapin(). To handle this, we also add a
>> non_swap_entry() check there.
>>
>> Note that we can unlock later what we really need, and not account it
>> towards max_swap_ptes.
>>
>> [1] https://lore.kernel.org/linux-mm/09eaca7b-9988-41c7-8d6e-4802055b3f1e@redhat.com
>> [2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local
>> [3] https://lore.kernel.org/linux-mm/20251005010511.ysek2nqojebqngf3@master
>>
>> Acked-by: David Hildenbrand <david@redhat.com>
>> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
>> Reviewed-by: Dev Jain <dev.jain@arm.com>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>> v2 -> v3:
>>   - Collect Acked-by from David - thanks!
>>   - Collect Reviewed-by from Wei and Dev - thanks!
>>   - Add a non_swap_entry() check in __collapse_huge_page_swapin() (per Wei
>>     and David) - thanks!
>>   - Rework the changelog to incorporate David's detailed analysis of
>>     non-swap entry types - thanks!!!
>>   - https://lore.kernel.org/linux-mm/20251001032251.85888-1-lance.yang@linux.dev/
>>
>> v1 -> v2:
>>   - Skip all non-present entries except swap entries (per David) thanks!
>>   - https://lore.kernel.org/linux-mm/20250924100207.28332-1-lance.yang@linux.dev/
>>
>>   mm/khugepaged.c | 37 +++++++++++++++++++++++--------------
>>   1 file changed, 23 insertions(+), 14 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index abe54f0043c7..bec3e268dc76 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1020,6 +1020,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>>   		if (!is_swap_pte(vmf.orig_pte))
>>   			continue;
>>
>> +		if (non_swap_entry(pte_to_swp_entry(vmf.orig_pte))) {
>> +			result = SCAN_PTE_NON_PRESENT;
>> +			goto out;
>> +		}
>> +
>>   		vmf.pte = pte;
>>   		vmf.ptl = ptl;
>>   		ret = do_swap_page(&vmf);
>> @@ -1281,7 +1286,23 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>   	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>>   	     _pte++, addr += PAGE_SIZE) {
>>   		pte_t pteval = ptep_get(_pte);
>> -		if (is_swap_pte(pteval)) {
>> +		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>> +			++none_or_zero;
>> +			if (!userfaultfd_armed(vma) &&
>> +			    (!cc->is_khugepaged ||
>> +			     none_or_zero <= khugepaged_max_ptes_none)) {
>> +				continue;
>> +			} else {
>> +				result = SCAN_EXCEED_NONE_PTE;
>> +				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> +				goto out_unmap;
>> +			}
>> +		} else if (!pte_present(pteval)) {
>> +			if (non_swap_entry(pte_to_swp_entry(pteval))) { 
>> +				result = SCAN_PTE_NON_PRESENT;
>> +				goto out_unmap;
>> +			}
>> +
>>   			++unmapped;
>>   			if (!cc->is_khugepaged ||
>>   			    unmapped <= khugepaged_max_ptes_swap) {
>> @@ -1290,7 +1311,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>   				 * enabled swap entries.  Please see
>>   				 * comment below for pte_uffd_wp().
>>   				 */
>> -				if (pte_swp_uffd_wp_any(pteval)) {
>> +				if (pte_swp_uffd_wp(pteval)) {
> 
> pte_swp_uffd_wp_any() returns true for both pte_swp_uffd_wp() and
> pte_marker_uffd_wp(). Why is it OK to just check pte_swp_uffd_wp() here?

+		} else if (!pte_present(pteval)) {
+			if (non_swap_entry(pte_to_swp_entry(pteval))) { <--
+				result = SCAN_PTE_NON_PRESENT;
+				goto out_unmap;
+			}

IIUC, we have just handled all non-swap entries above (which would
include pte_marker_uffd_wp()), right?

> 
>>   					result = SCAN_PTE_UFFD_WP;
>>   					goto out_unmap;
>>   				}
>> @@ -1301,18 +1322,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>   				goto out_unmap;
>>   			}
>>   		}
>> -		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>> -			++none_or_zero;
>> -			if (!userfaultfd_armed(vma) &&
>> -			    (!cc->is_khugepaged ||
>> -			     none_or_zero <= khugepaged_max_ptes_none)) {
>> -				continue;
>> -			} else {
>> -				result = SCAN_EXCEED_NONE_PTE;
>> -				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> -				goto out_unmap;
>> -			}
>> -		}
>>   		if (pte_uffd_wp(pteval)) {
>>   			/*
>>   			 * Don't collapse the page if any of the small
>> -- 
>> 2.49.0
> 
> 
> --
> Best Regards,
> Yan, Zi



  reply	other threads:[~2025-10-10 15:34 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-08  3:26 [PATCH mm-new v3 1/1] mm/khugepaged: abort collapse scan on non-swap entries Lance Yang
2025-10-08  8:18 ` David Hildenbrand
2025-10-10  3:44 ` Barry Song
2025-10-10 15:21 ` Zi Yan
2025-10-10 15:34   ` Lance Yang [this message]
2025-10-10 15:35     ` Zi Yan
2025-10-14 11:08 ` Lorenzo Stoakes
2025-10-14 14:26   ` Lance Yang
2025-10-14 14:32     ` David Hildenbrand
2025-10-14 14:37       ` Lance Yang
2025-10-14 14:43         ` Lorenzo Stoakes
2025-10-14 14:39     ` Lorenzo Stoakes
2025-10-14 15:01       ` Lance Yang
2025-10-14 15:12         ` David Hildenbrand
2025-10-14 15:33           ` David Hildenbrand
2025-10-14 16:10             ` Lorenzo Stoakes
2025-10-14 15:41           ` Lorenzo Stoakes
2025-10-14 15:48             ` David Hildenbrand
2025-10-14 15:57               ` Lorenzo Stoakes
2025-10-14 15:11       ` David Hildenbrand
2025-10-14 15:52         ` Lorenzo Stoakes
2025-10-14 16:09           ` Lance Yang
2025-10-14 16:27             ` Lorenzo Stoakes
2025-10-15  1:52               ` Lance Yang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=73bbc22f-2739-4f0e-b8d7-b8e344a3fead@linux.dev \
    --to=lance.yang@linux.dev \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=david@redhat.com \
    --cc=dev.jain@arm.com \
    --cc=hughd@google.com \
    --cc=ioworker0@gmail.com \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mpenttil@redhat.com \
    --cc=npache@redhat.com \
    --cc=richard.weiyang@gmail.com \
    --cc=ryan.roberts@arm.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.