* [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()
2025-09-14 14:35 [PATCH mm-new 0/3] mm/khugepaged: optimize collapse candidate detection Lance Yang
@ 2025-09-14 14:35 ` Lance Yang
2025-09-14 16:16 ` Dev Jain
2025-09-16 5:32 ` Hugh Dickins
2025-09-14 14:35 ` [PATCH mm-new 2/3] mm: clean up and expose is_guard_pte_marker() Lance Yang
2025-09-14 14:35 ` [PATCH mm-new 3/3] mm/khugepaged: abort collapse scan on guard PTEs Lance Yang
2 siblings, 2 replies; 25+ messages in thread
From: Lance Yang @ 2025-09-14 14:35 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, ioworker0, linux-kernel, linux-mm, Lance Yang
From: Lance Yang <lance.yang@linux.dev>
Let's skip unsuitable VMAs early in the khugepaged scan; specifically,
mlocked VMAs should not be touched.
Note that the only other user of the VM_NO_KHUGEPAGED mask is
__thp_vma_allowable_orders(), which is also used by the MADV_COLLAPSE
path. Since MADV_COLLAPSE has different rules (e.g., for mlocked VMAs), we
cannot simply make the shared mask stricter as that would break it.
So, we also introduce a new VM_NO_THP_COLLAPSE mask for that helper,
leaving the stricter checks to be applied only within the khugepaged path
itself.
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
include/linux/mm.h | 6 +++++-
mm/huge_memory.c | 2 +-
mm/khugepaged.c | 14 +++++++++++++-
3 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index be3e6fb4d0db..cb54d94b2343 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -505,7 +505,11 @@ extern unsigned int kobjsize(const void *objp);
#define VM_REMAP_FLAGS (VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP)
/* This mask prevents VMA from being scanned with khugepaged */
-#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB)
+#define VM_NO_KHUGEPAGED \
+ (VM_SPECIAL | VM_HUGETLB | VM_LOCKED_MASK | VM_NOHUGEPAGE)
+
+/* This mask prevents VMA from being collapsed by any THP path */
+#define VM_NO_THP_COLLAPSE (VM_SPECIAL | VM_HUGETLB)
/* This mask defines which mm->def_flags a process can inherit its parent */
#define VM_INIT_DEF_MASK VM_NOHUGEPAGE
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d6fc669e11c1..2e91526a037f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -134,7 +134,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
* Must be checked after dax since some dax mappings may have
* VM_MIXEDMAP set.
*/
- if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
+ if (!in_pf && !smaps && (vm_flags & VM_NO_THP_COLLAPSE))
return 0;
/*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 7c5ff1b23e93..e54f99bb0b57 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -345,6 +345,17 @@ struct attribute_group khugepaged_attr_group = {
};
#endif /* CONFIG_SYSFS */
+/**
+ * khugepaged_should_scan_vma - check if a VMA is a candidate for collapse
+ * @vm_flags: The flags of the VMA to check.
+ *
+ * Returns: true if the VMA should be scanned by khugepaged, false otherwise.
+ */
+static inline bool khugepaged_should_scan_vma(vm_flags_t vm_flags)
+{
+ return !(vm_flags & VM_NO_KHUGEPAGED);
+}
+
int hugepage_madvise(struct vm_area_struct *vma,
vm_flags_t *vm_flags, int advice)
{
@@ -2443,7 +2454,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
progress++;
break;
}
- if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
+ if (!khugepaged_should_scan_vma(vma->vm_flags) ||
+ !thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
skip:
progress++;
continue;
--
2.49.0
* Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()
2025-09-14 14:35 ` [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot() Lance Yang
@ 2025-09-14 16:16 ` Dev Jain
2025-09-15 3:02 ` Lance Yang
2025-09-16 5:32 ` Hugh Dickins
1 sibling, 1 reply; 25+ messages in thread
From: Dev Jain @ 2025-09-14 16:16 UTC (permalink / raw)
To: Lance Yang, akpm, david, lorenzo.stoakes
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, baohua,
ioworker0, linux-kernel, linux-mm
On 14/09/25 8:05 pm, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> Let's skip unsuitable VMAs early in the khugepaged scan; specifically,
> mlocked VMAs should not be touched.
>
> Note that the only other user of the VM_NO_KHUGEPAGED mask is
> __thp_vma_allowable_orders(), which is also used by the MADV_COLLAPSE
> path. Since MADV_COLLAPSE has different rules (e.g., for mlocked VMAs), we
> cannot simply make the shared mask stricter as that would break it.
>
> So, we also introduce a new VM_NO_THP_COLLAPSE mask for that helper,
> leaving the stricter checks to be applied only within the khugepaged path
> itself.
>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
> include/linux/mm.h | 6 +++++-
> mm/huge_memory.c | 2 +-
> mm/khugepaged.c | 14 +++++++++++++-
> 3 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index be3e6fb4d0db..cb54d94b2343 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -505,7 +505,11 @@ extern unsigned int kobjsize(const void *objp);
> #define VM_REMAP_FLAGS (VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP)
>
> /* This mask prevents VMA from being scanned with khugepaged */
> -#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB)
> +#define VM_NO_KHUGEPAGED \
> + (VM_SPECIAL | VM_HUGETLB | VM_LOCKED_MASK | VM_NOHUGEPAGE)
> +
> +/* This mask prevents VMA from being collapsed by any THP path */
> +#define VM_NO_THP_COLLAPSE (VM_SPECIAL | VM_HUGETLB)
VM_NO_KHUGEPAGED should then be defined as VM_NO_THP_COLLAPSE | VM_LOCKED_MASK | VM_NOHUGEPAGE.
But...
I believe that the eligibility checking for khugepaged collapse is the business of
thp_vma_allowable_order(). This functionality should be put there; we literally
have a TVA_KHUGEPAGED flag :)
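Roughly, something like this (an untested sketch using this patch's mask
names; the exact placement and flag handling inside
__thp_vma_allowable_orders() may differ):

	/* Shared rules for every collapse path. */
	if (!in_pf && !smaps && (vm_flags & VM_NO_THP_COLLAPSE))
		return 0;

	/*
	 * Hypothetical: khugepaged-only restrictions, applied only when
	 * the caller passed TVA_KHUGEPAGED, so MADV_COLLAPSE and the
	 * fault path keep their current behaviour.
	 */
	if ((tva_flags & TVA_KHUGEPAGED) &&
	    (vm_flags & (VM_LOCKED_MASK | VM_NOHUGEPAGE)))
		return 0;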
>
> /* This mask defines which mm->def_flags a process can inherit its parent */
> #define VM_INIT_DEF_MASK VM_NOHUGEPAGE
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d6fc669e11c1..2e91526a037f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -134,7 +134,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> * Must be checked after dax since some dax mappings may have
> * VM_MIXEDMAP set.
> */
> - if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
> + if (!in_pf && !smaps && (vm_flags & VM_NO_THP_COLLAPSE))
> return 0;
>
> /*
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 7c5ff1b23e93..e54f99bb0b57 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -345,6 +345,17 @@ struct attribute_group khugepaged_attr_group = {
> };
> #endif /* CONFIG_SYSFS */
>
> +/**
> + * khugepaged_should_scan_vma - check if a VMA is a candidate for collapse
> + * @vm_flags: The flags of the VMA to check.
> + *
> + * Returns: true if the VMA should be scanned by khugepaged, false otherwise.
> + */
> +static inline bool khugepaged_should_scan_vma(vm_flags_t vm_flags)
> +{
> + return !(vm_flags & VM_NO_KHUGEPAGED);
> +}
> +
> int hugepage_madvise(struct vm_area_struct *vma,
> vm_flags_t *vm_flags, int advice)
> {
> @@ -2443,7 +2454,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> progress++;
> break;
> }
> - if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> + if (!khugepaged_should_scan_vma(vma->vm_flags) ||
> + !thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> skip:
> progress++;
> continue;
* Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()
2025-09-14 16:16 ` Dev Jain
@ 2025-09-15 3:02 ` Lance Yang
0 siblings, 0 replies; 25+ messages in thread
From: Lance Yang @ 2025-09-15 3:02 UTC (permalink / raw)
To: Dev Jain
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, baohua,
ioworker0, linux-kernel, linux-mm, akpm, lorenzo.stoakes, david
Hey Dev,
Thanks for taking the time to review!
On 2025/9/15 00:16, Dev Jain wrote:
>
> On 14/09/25 8:05 pm, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Let's skip unsuitable VMAs early in the khugepaged scan; specifically,
>> mlocked VMAs should not be touched.
>>
>> Note that the only other user of the VM_NO_KHUGEPAGED mask is
>> __thp_vma_allowable_orders(), which is also used by the MADV_COLLAPSE
>> path. Since MADV_COLLAPSE has different rules (e.g., for mlocked
>> VMAs), we
>> cannot simply make the shared mask stricter as that would break it.
>>
>> So, we also introduce a new VM_NO_THP_COLLAPSE mask for that helper,
>> leaving the stricter checks to be applied only within the khugepaged path
>> itself.
>>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>> include/linux/mm.h | 6 +++++-
>> mm/huge_memory.c | 2 +-
>> mm/khugepaged.c | 14 +++++++++++++-
>> 3 files changed, 19 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index be3e6fb4d0db..cb54d94b2343 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -505,7 +505,11 @@ extern unsigned int kobjsize(const void *objp);
>> #define VM_REMAP_FLAGS (VM_IO | VM_PFNMAP | VM_DONTEXPAND |
>> VM_DONTDUMP)
>> /* This mask prevents VMA from being scanned with khugepaged */
>> -#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB)
>> +#define VM_NO_KHUGEPAGED \
>> + (VM_SPECIAL | VM_HUGETLB | VM_LOCKED_MASK | VM_NOHUGEPAGE)
>> +
>> +/* This mask prevents VMA from being collapsed by any THP path */
>> +#define VM_NO_THP_COLLAPSE (VM_SPECIAL | VM_HUGETLB)
>
> VM_NO_KHUGEPAGED should then be defined as VM_NO_THP_COLLAPSE |
> VM_LOCKED_MASK | VM_NOHUGEPAGE.
Yep, it's a good cleanup ;)
> But...
>
> I believe that the eligibility checking for khugepaged collapse is the
> business of
> thp_vma_allowable_order(). This functionality should be put there, we
> literally
> have a TVA_KHUGEPAGED flag :)
Good spot. That's a much better approach!
My initial thinking was to keep thp_vma_allowable_order() as generic as
possible, avoiding specific checks for individual callers ;)
BUT you are right, the TVA_KHUGEPAGED flag is only passed from the
khugepaged path, so the compiler will optimize out the branch for other
callers, leaving no runtime overhead.
Will rework this patch for v2 as you suggested!
Thanks,
Lance
>
>> /* This mask defines which mm->def_flags a process can inherit its
>> parent */
>> #define VM_INIT_DEF_MASK VM_NOHUGEPAGE
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index d6fc669e11c1..2e91526a037f 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -134,7 +134,7 @@ unsigned long __thp_vma_allowable_orders(struct
>> vm_area_struct *vma,
>> * Must be checked after dax since some dax mappings may have
>> * VM_MIXEDMAP set.
>> */
>> - if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
>> + if (!in_pf && !smaps && (vm_flags & VM_NO_THP_COLLAPSE))
>> return 0;
>> /*
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 7c5ff1b23e93..e54f99bb0b57 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -345,6 +345,17 @@ struct attribute_group khugepaged_attr_group = {
>> };
>> #endif /* CONFIG_SYSFS */
>> +/**
>> + * khugepaged_should_scan_vma - check if a VMA is a candidate for
>> collapse
>> + * @vm_flags: The flags of the VMA to check.
>> + *
>> + * Returns: true if the VMA should be scanned by khugepaged, false
>> otherwise.
>> + */
>> +static inline bool khugepaged_should_scan_vma(vm_flags_t vm_flags)
>> +{
>> + return !(vm_flags & VM_NO_KHUGEPAGED);
>> +}
>> +
>> int hugepage_madvise(struct vm_area_struct *vma,
>> vm_flags_t *vm_flags, int advice)
>> {
>> @@ -2443,7 +2454,8 @@ static unsigned int
>> khugepaged_scan_mm_slot(unsigned int pages, int *result,
>> progress++;
>> break;
>> }
>> - if (!thp_vma_allowable_order(vma, vma->vm_flags,
>> TVA_KHUGEPAGED, PMD_ORDER)) {
>> + if (!khugepaged_should_scan_vma(vma->vm_flags) ||
>> + !thp_vma_allowable_order(vma, vma->vm_flags,
>> TVA_KHUGEPAGED, PMD_ORDER)) {
>> skip:
>> progress++;
>> continue;
* Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()
2025-09-14 14:35 ` [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot() Lance Yang
2025-09-14 16:16 ` Dev Jain
@ 2025-09-16 5:32 ` Hugh Dickins
2025-09-16 6:21 ` Lance Yang
1 sibling, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2025-09-16 5:32 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, ioworker0, linux-kernel,
linux-mm
On Sun, 14 Sep 2025, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> Let's skip unsuitable VMAs early in the khugepaged scan; specifically,
> mlocked VMAs should not be touched.
Why? That's a change in behaviour, isn't it?
I'm aware that hugepage collapse on an mlocked VMA can insert a fault
latency, not universally welcome; but I've not seen discussion, let
alone agreement, that current behaviour should be changed.
Somewhere in yet-to-be-read mail? Please give us a link.
Hugh
* Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()
2025-09-16 5:32 ` Hugh Dickins
@ 2025-09-16 6:21 ` Lance Yang
2025-09-16 6:42 ` Hugh Dickins
2025-09-16 9:29 ` Kiryl Shutsemau
0 siblings, 2 replies; 25+ messages in thread
From: Lance Yang @ 2025-09-16 6:21 UTC (permalink / raw)
To: Hugh Dickins
Cc: akpm, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, ioworker0, linux-kernel,
linux-mm
Hi Hugh,
Thanks for taking a look and for raising this important point!
On 2025/9/16 13:32, Hugh Dickins wrote:
> On Sun, 14 Sep 2025, Lance Yang wrote:
>
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Let's skip unsuitable VMAs early in the khugepaged scan; specifically,
>> mlocked VMAs should not be touched.
>
> Why? That's a change in behaviour, isn't it?
>
> I'm aware that hugepage collapse on an mlocked VMA can insert a fault
> latency, not universally welcome; but I've not seen discussion, let
> alone agreement, that current behaviour should be changed.
> Somewhere in yet-to-be-read mail? Please give us a link.
>
> Hugh
You're right, this is indeed a change in behaviour. But it's specifically
for khugepaged.
Users of mlock() expect low and predictable latency. THP collapse is a
heavy operation that introduces exactly the kind of unpredictable delays
they want to avoid. It has to unmap PTEs, copy data from the small folios
to a new THP, and then remap the THP back to the PMD ;)
IMO, that change is acceptable because THP is generally transparent to
users, and khugepaged does not guarantee when THP collapse or split will
happen.
Well, there hasn't been a discussion on that; it's just something I noticed.
Thanks,
Lance
* Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()
2025-09-16 6:21 ` Lance Yang
@ 2025-09-16 6:42 ` Hugh Dickins
2025-09-16 7:05 ` Lance Yang
2025-09-16 9:29 ` Kiryl Shutsemau
1 sibling, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2025-09-16 6:42 UTC (permalink / raw)
To: Lance Yang
Cc: Hugh Dickins, akpm, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
linux-kernel, linux-mm
On Tue, 16 Sep 2025, Lance Yang wrote:
> Hi Hugh,
>
> Thanks for taking a look and for raising this important point!
>
> On 2025/9/16 13:32, Hugh Dickins wrote:
> > On Sun, 14 Sep 2025, Lance Yang wrote:
> >
> >> From: Lance Yang <lance.yang@linux.dev>
> >>
> >> Let's skip unsuitable VMAs early in the khugepaged scan; specifically,
> >> mlocked VMAs should not be touched.
> >
> > Why? That's a change in behaviour, isn't it?
> >
> > I'm aware that hugepage collapse on an mlocked VMA can insert a fault
> > latency, not universally welcome; but I've not seen discussion, let
> > alone agreement, that current behaviour should be changed.
> > Somewhere in yet-to-be-read mail? Please give us a link.
> >
> > Hugh
>
> You're right, this is indeed a change in behaviour. But it's specifically
> for khugepaged.
>
> Users of mlock() expect low and predictable latency. THP collapse is a
> heavy operation that introduces exactly the kind of unpredictable delays
> they want to avoid. It has to unmap PTEs, copy data from the small folios
> to a new THP, and then remap the THP back to the PMD ;)
>
> IMO, that change is acceptable because THP is generally transparent to
> users, and khugepaged does not guarantee when THP collapse or split will
> happen.
I disagree. Many of those who have khugepaged enabled would prefer
it to give them hugepages, even or especially on mlocked areas.
If you make that change, it must be guarded by a sysfs or sysctl tuning.
Perhaps it could share the sysctl_compact_unevictable_allowed tuning
(I'm not sure whether that's a good or bad idea: opinions will differ).
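Something along these lines, say (a hypothetical sketch only: it assumes
sysctl_compact_unevictable_allowed is visible to khugepaged, and whether
sharing a compaction knob here is sensible is exactly the open question):

	static inline bool khugepaged_should_scan_vma(vm_flags_t vm_flags)
	{
		/* Hypothetical: honour the compaction tunable for mlock. */
		if (!sysctl_compact_unevictable_allowed &&
		    (vm_flags & VM_LOCKED_MASK))
			return false;
		return !(vm_flags & (VM_SPECIAL | VM_HUGETLB | VM_NOHUGEPAGE));
	}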
Hugh
>
> Well, we don't have a discussion on that, just something I noticed.
>
> Thanks,
> Lance
* Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()
2025-09-16 6:42 ` Hugh Dickins
@ 2025-09-16 7:05 ` Lance Yang
0 siblings, 0 replies; 25+ messages in thread
From: Lance Yang @ 2025-09-16 7:05 UTC (permalink / raw)
To: Hugh Dickins, david, lorenzo.stoakes
Cc: akpm, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, ioworker0, linux-kernel, linux-mm
On 2025/9/16 14:42, Hugh Dickins wrote:
> On Tue, 16 Sep 2025, Lance Yang wrote:
>
>> Hi Hugh,
>>
>> Thanks for taking a look and for raising this important point!
>>
>> On 2025/9/16 13:32, Hugh Dickins wrote:
>>> On Sun, 14 Sep 2025, Lance Yang wrote:
>>>
>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>
>>>> Let's skip unsuitable VMAs early in the khugepaged scan; specifically,
>>>> mlocked VMAs should not be touched.
>>>
>>> Why? That's a change in behaviour, isn't it?
>>>
>>> I'm aware that hugepage collapse on an mlocked VMA can insert a fault
>>> latency, not universally welcome; but I've not seen discussion, let
>>> alone agreement, that current behaviour should be changed.
>>> Somewhere in yet-to-be-read mail? Please give us a link.
>>>
>>> Hugh
>>
>> You're right, this is indeed a change in behaviour. But it's specifically
>> for khugepaged.
>>
>> Users of mlock() expect low and predictable latency. THP collapse is a
>> heavy operation that introduces exactly the kind of unpredictable delays
>> they want to avoid. It has to unmap PTEs, copy data from the small folios
>> to a new THP, and then remap the THP back to the PMD ;)
>>
>> IMO, that change is acceptable because THP is generally transparent to
>> users, and khugepaged does not guarantee when THP collapse or split will
>> happen.
>
> I disagree. Many of those who have khugepaged enabled would prefer
> it to give them hugepages, even or especially on mlocked areas.
>
> If you make that change, it must be guarded by a sysfs or sysctl tuning.
Thanks for the feedback!
Well, it seems we're not on the same page. Let's gather more opinions from
other folks ;)
>
> Perhaps it could share the sysctl_compact_unevictable_allowed tuning
> (I'm not sure whether that's a good or bad idea: opinions will differ).
Thanks,
Lance
>
> Hugh
>
>>
>> Well, we don't have a discussion on that, just something I noticed.
>>
>> Thanks,
>> Lance
* Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()
2025-09-16 6:21 ` Lance Yang
2025-09-16 6:42 ` Hugh Dickins
@ 2025-09-16 9:29 ` Kiryl Shutsemau
2025-09-16 9:39 ` Lorenzo Stoakes
1 sibling, 1 reply; 25+ messages in thread
From: Kiryl Shutsemau @ 2025-09-16 9:29 UTC (permalink / raw)
To: Lance Yang
Cc: Hugh Dickins, akpm, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
linux-kernel, linux-mm
On Tue, Sep 16, 2025 at 02:21:26PM +0800, Lance Yang wrote:
> Users of mlock() expect low and predictable latency. THP collapse is a
> heavy operation that introduces exactly the kind of unpredictable delays
> they want to avoid. It has to unmap PTEs, copy data from the small folios
> to a new THP, and then remap the THP back to the PMD ;)
Generally, we allow minor page faults into mlocked VMAs and avoid major.
This is minor page fault territory in my view.
Also it is very similar to what compaction does and we allow compaction
of mlocked VMA by default, unless sysctl vm.compact_unevictable_allowed
is set to zero.
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()
2025-09-16 9:29 ` Kiryl Shutsemau
@ 2025-09-16 9:39 ` Lorenzo Stoakes
2025-09-16 9:48 ` Kiryl Shutsemau
2025-09-16 9:59 ` Lance Yang
0 siblings, 2 replies; 25+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 9:39 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Lance Yang, Hugh Dickins, akpm, david, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
linux-kernel, linux-mm
On Tue, Sep 16, 2025 at 10:29:11AM +0100, Kiryl Shutsemau wrote:
> On Tue, Sep 16, 2025 at 02:21:26PM +0800, Lance Yang wrote:
> > Users of mlock() expect low and predictable latency. THP collapse is a
> > heavy operation that introduces exactly the kind of unpredictable delays
> > they want to avoid. It has to unmap PTEs, copy data from the small folios
> > to a new THP, and then remap the THP back to the PMD ;)
>
> Generally, we allow minor page faults into mlocked VMAs and avoid major.
> This is minor page fault territory in my view.
Hm, but we won't be causing minor faults via reclaim right, since they're
not on any LRU?
>
> Also it is very similar to what compaction does and we allow compaction
> of mlocked VMA by default, unless sysctl vm.compact_unevictable_allowed
> is set to zero.
This is a much stronger point.
I think we are sometimes too vague as to what mlock() means in
totality. But given that we default allow compaction it seems sensible to
keep this behaviour the same.
Unless you have a specific situation where this is problematic, Lance?
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov
Cheers, Lorenzo
* Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()
2025-09-16 9:39 ` Lorenzo Stoakes
@ 2025-09-16 9:48 ` Kiryl Shutsemau
2025-09-16 9:58 ` Lorenzo Stoakes
2025-09-16 9:59 ` Lance Yang
1 sibling, 1 reply; 25+ messages in thread
From: Kiryl Shutsemau @ 2025-09-16 9:48 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Lance Yang, Hugh Dickins, akpm, david, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
linux-kernel, linux-mm
On Tue, Sep 16, 2025 at 10:39:53AM +0100, Lorenzo Stoakes wrote:
> On Tue, Sep 16, 2025 at 10:29:11AM +0100, Kiryl Shutsemau wrote:
> > On Tue, Sep 16, 2025 at 02:21:26PM +0800, Lance Yang wrote:
> > > Users of mlock() expect low and predictable latency. THP collapse is a
> > > heavy operation that introduces exactly the kind of unpredictable delays
> > > they want to avoid. It has to unmap PTEs, copy data from the small folios
> > > to a new THP, and then remap the THP back to the PMD ;)
> >
> > Generally, we allow minor page faults into mlocked VMAs and avoid major.
> > This is minor page fault territory in my view.
>
> Hm, but we won't be causing minor faults via reclaim right, since they're
> not on any LRU?
PTEs are still present when we do THP allocation. No reclaim while the
access is blocked. We only block the access on copy and PTEs->PMD
collapse.
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()
2025-09-16 9:48 ` Kiryl Shutsemau
@ 2025-09-16 9:58 ` Lorenzo Stoakes
2025-09-16 10:00 ` Lance Yang
0 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 9:58 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Lance Yang, Hugh Dickins, akpm, david, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
linux-kernel, linux-mm
On Tue, Sep 16, 2025 at 10:48:22AM +0100, Kiryl Shutsemau wrote:
> On Tue, Sep 16, 2025 at 10:39:53AM +0100, Lorenzo Stoakes wrote:
> > On Tue, Sep 16, 2025 at 10:29:11AM +0100, Kiryl Shutsemau wrote:
> > > On Tue, Sep 16, 2025 at 02:21:26PM +0800, Lance Yang wrote:
> > > > Users of mlock() expect low and predictable latency. THP collapse is a
> > > > heavy operation that introduces exactly the kind of unpredictable delays
> > > > they want to avoid. It has to unmap PTEs, copy data from the small folios
> > > > to a new THP, and then remap the THP back to the PMD ;)
> > >
> > > Generally, we allow minor page faults into mlocked VMAs and avoid major.
> > > This is minor page fault territory in my view.
> >
> > Hm, but we won't be causing minor faults via reclaim right, since they're
> > not on any LRU?
>
> PTEs are still present when we do THP allocation. No reclaim while the
> access is blocked. We only block the access on copy and PTEs->PMD
> collapse.
Right indeed, esp. with compaction being allowed for mlock, I agree with you
that this patch should be dropped :)
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov
Cheers, Lorenzo
* Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()
2025-09-16 9:58 ` Lorenzo Stoakes
@ 2025-09-16 10:00 ` Lance Yang
0 siblings, 0 replies; 25+ messages in thread
From: Lance Yang @ 2025-09-16 10:00 UTC (permalink / raw)
To: Lorenzo Stoakes, Kiryl Shutsemau
Cc: Hugh Dickins, akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, ioworker0, linux-kernel, linux-mm
On 2025/9/16 17:58, Lorenzo Stoakes wrote:
> On Tue, Sep 16, 2025 at 10:48:22AM +0100, Kiryl Shutsemau wrote:
>> On Tue, Sep 16, 2025 at 10:39:53AM +0100, Lorenzo Stoakes wrote:
>>> On Tue, Sep 16, 2025 at 10:29:11AM +0100, Kiryl Shutsemau wrote:
>>>> On Tue, Sep 16, 2025 at 02:21:26PM +0800, Lance Yang wrote:
>>>>> Users of mlock() expect low and predictable latency. THP collapse is a
>>>>> heavy operation that introduces exactly the kind of unpredictable delays
>>>>> they want to avoid. It has to unmap PTEs, copy data from the small folios
>>>>> to a new THP, and then remap the THP back to the PMD ;)
>>>>
>>>> Generally, we allow minor page faults into mlocked VMAs and avoid major.
>>>> This is minor page fault territory in my view.
>>>
>>> Hm, but we won't be causing minor faults via reclaim right, since they're
>>> not on any LRU?
>>
>> PTEs are still present when we do THP allocation. No reclaim while the
>> access is blocked. We only block the access on copy and PTEs->PMD
>> collapse.
>
> Right indeed, esp. with compaction being allowed for mlock, I agree with you
> that this patch should be dropped :)
Got it. Will do ;)
Thanks,
Lance
* Re: [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot()
2025-09-16 9:39 ` Lorenzo Stoakes
2025-09-16 9:48 ` Kiryl Shutsemau
@ 2025-09-16 9:59 ` Lance Yang
1 sibling, 0 replies; 25+ messages in thread
From: Lance Yang @ 2025-09-16 9:59 UTC (permalink / raw)
To: Lorenzo Stoakes, Kiryl Shutsemau, Hugh Dickins
Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, ioworker0, linux-kernel, linux-mm
On 2025/9/16 17:39, Lorenzo Stoakes wrote:
> On Tue, Sep 16, 2025 at 10:29:11AM +0100, Kiryl Shutsemau wrote:
>> On Tue, Sep 16, 2025 at 02:21:26PM +0800, Lance Yang wrote:
>>> Users of mlock() expect low and predictable latency. THP collapse is a
>>> heavy operation that introduces exactly the kind of unpredictable delays
>>> they want to avoid. It has to unmap PTEs, copy data from the small folios
>>> to a new THP, and then remap the THP back to the PMD ;)
>>
>> Generally, we allow minor page faults into mlocked VMAs and avoid major.
>> This is minor page fault territory in my view.
Makes sense to me!
>
> Hm, but we won't be causing minor faults via reclaim right, since they're
> not on any LRU?
>
>>
>> Also it is very similar to what compaction does and we allow compaction
>> of mlocked VMA by default, unless sysctl vm.compact_unevictable_allowed
>> is set to zero.
>
> This is a much stronger point.
Ah, indeed, the compaction analogy is quite strong here, thanks!
>
> I think we are sometimes too vague as to what mlock() means in
Totally agree on too vague ;)
> totality. But given that we default allow compaction it seems sensible to
> keep this behaviour the same.
>
> Unless you have a specific situation where this is problematic Lance?
No, I don't have a specific situation right now that would clearly make this problematic.
Anyway, I will drop this patch from the series.
Thanks again for all the feedback everyone!
Lance
* [PATCH mm-new 2/3] mm: clean up and expose is_guard_pte_marker()
2025-09-14 14:35 [PATCH mm-new 0/3] mm/khugepaged: optimize collapse candidate detection Lance Yang
2025-09-14 14:35 ` [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot() Lance Yang
@ 2025-09-14 14:35 ` Lance Yang
2025-09-14 16:38 ` Dev Jain
` (2 more replies)
2025-09-14 14:35 ` [PATCH mm-new 3/3] mm/khugepaged: abort collapse scan on guard PTEs Lance Yang
2 siblings, 3 replies; 25+ messages in thread
From: Lance Yang @ 2025-09-14 14:35 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, ioworker0, linux-kernel, linux-mm, Kairui Song,
Lance Yang
From: Lance Yang <lance.yang@linux.dev>
is_guard_pte_marker() performs a redundant check because it calls both
is_pte_marker() and is_guard_swp_entry(), both of which internally check
for a PTE marker.
is_guard_pte_marker()
|- is_pte_marker()
| `- is_pte_marker_entry() // First check
`- is_guard_swp_entry()
`- is_pte_marker_entry() // Second, redundant check
While a modern compiler could likely optimize this away, let's have clean
code and not rely on it ;)
Also, make it available for hugepage collapsing code.
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
include/linux/swapops.h | 6 ++++++
mm/madvise.c | 6 ------
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 59c5889a4d54..7f5684fa043b 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -469,6 +469,12 @@ static inline int is_guard_swp_entry(swp_entry_t entry)
(pte_marker_get(entry) & PTE_MARKER_GUARD);
}
+static inline bool is_guard_pte_marker(pte_t ptent)
+{
+ return is_swap_pte(ptent) &&
+ is_guard_swp_entry(pte_to_swp_entry(ptent));
+}
+
/*
* This is a special version to check pte_none() just to cover the case when
* the pte is a pte marker. It existed because in many cases the pte marker
diff --git a/mm/madvise.c b/mm/madvise.c
index 35ed4ab0d7c5..bd46e6788fac 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1069,12 +1069,6 @@ static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
return !(vma->vm_flags & disallowed);
}
-static bool is_guard_pte_marker(pte_t ptent)
-{
- return is_pte_marker(ptent) &&
- is_guard_swp_entry(pte_to_swp_entry(ptent));
-}
-
static int guard_install_pud_entry(pud_t *pud, unsigned long addr,
unsigned long next, struct mm_walk *walk)
{
--
2.49.0
* Re: [PATCH mm-new 2/3] mm: clean up and expose is_guard_pte_marker()
2025-09-14 14:35 ` [PATCH mm-new 2/3] mm: clean up and expose is_guard_pte_marker() Lance Yang
@ 2025-09-14 16:38 ` Dev Jain
2025-09-15 4:24 ` Lance Yang
2025-09-15 13:54 ` Lorenzo Stoakes
2025-09-17 10:32 ` David Hildenbrand
2 siblings, 1 reply; 25+ messages in thread
From: Dev Jain @ 2025-09-14 16:38 UTC (permalink / raw)
To: Lance Yang, akpm, david, lorenzo.stoakes
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, baohua,
ioworker0, linux-kernel, linux-mm, Kairui Song
On 14/09/25 8:05 pm, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> is_guard_pte_marker() performs a redundant check because it calls both
> is_pte_marker() and is_guard_swp_entry(), both of which internally check
> for a PTE marker.
>
> is_guard_pte_marker()
> |- is_pte_marker()
> | `- is_pte_marker_entry() // First check
> `- is_guard_swp_entry()
> `- is_pte_marker_entry() // Second, redundant check
>
> While a modern compiler could likely optimize this away, let's have clean
> code and not rely on it ;)
>
> Also, make it available for hugepage collapsing code.
The movement of the code should be merged with the next patch.
>
> Cc: Kairui Song <kasong@tencent.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
> include/linux/swapops.h | 6 ++++++
> mm/madvise.c | 6 ------
> 2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 59c5889a4d54..7f5684fa043b 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -469,6 +469,12 @@ static inline int is_guard_swp_entry(swp_entry_t entry)
> (pte_marker_get(entry) & PTE_MARKER_GUARD);
> }
>
> +static inline bool is_guard_pte_marker(pte_t ptent)
> +{
> + return is_swap_pte(ptent) &&
> + is_guard_swp_entry(pte_to_swp_entry(ptent));
> +}
> +
> /*
> * This is a special version to check pte_none() just to cover the case when
> * the pte is a pte marker. It existed because in many cases the pte marker
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 35ed4ab0d7c5..bd46e6788fac 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1069,12 +1069,6 @@ static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
> return !(vma->vm_flags & disallowed);
> }
>
> -static bool is_guard_pte_marker(pte_t ptent)
> -{
> - return is_pte_marker(ptent) &&
> - is_guard_swp_entry(pte_to_swp_entry(ptent));
> -}
> -
> static int guard_install_pud_entry(pud_t *pud, unsigned long addr,
> unsigned long next, struct mm_walk *walk)
> {
* Re: [PATCH mm-new 2/3] mm: clean up and expose is_guard_pte_marker()
2025-09-14 16:38 ` Dev Jain
@ 2025-09-15 4:24 ` Lance Yang
0 siblings, 0 replies; 25+ messages in thread
From: Lance Yang @ 2025-09-15 4:24 UTC (permalink / raw)
To: Dev Jain
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, baohua,
ioworker0, lorenzo.stoakes, linux-kernel, linux-mm, Kairui Song,
akpm, david
On 2025/9/15 00:38, Dev Jain wrote:
>
> On 14/09/25 8:05 pm, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> is_guard_pte_marker() performs a redundant check because it calls both
>> is_pte_marker() and is_guard_swp_entry(), both of which internally check
>> for a PTE marker.
>>
>> is_guard_pte_marker()
>> |- is_pte_marker()
>> | `- is_pte_marker_entry() // First check
>> `- is_guard_swp_entry()
>> `- is_pte_marker_entry() // Second, redundant check
>>
>> While a modern compiler could likely optimize this away, let's have clean
>> code and not rely on it ;)
>>
>> Also, make it available for hugepage collapsing code.
>
> The movement of the code should be merged with the next patch.
Thanks for the suggestion ;)
I'd prefer to keep them as separate patches to make them easier to review,
if that's okay.
Cheers,
Lance
>
>>
>> Cc: Kairui Song <kasong@tencent.com>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>> include/linux/swapops.h | 6 ++++++
>> mm/madvise.c | 6 ------
>> 2 files changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
>> index 59c5889a4d54..7f5684fa043b 100644
>> --- a/include/linux/swapops.h
>> +++ b/include/linux/swapops.h
>> @@ -469,6 +469,12 @@ static inline int is_guard_swp_entry(swp_entry_t
>> entry)
>> (pte_marker_get(entry) & PTE_MARKER_GUARD);
>> }
>> +static inline bool is_guard_pte_marker(pte_t ptent)
>> +{
>> + return is_swap_pte(ptent) &&
>> + is_guard_swp_entry(pte_to_swp_entry(ptent));
>> +}
>> +
>> /*
>> * This is a special version to check pte_none() just to cover the
>> case when
>> * the pte is a pte marker. It existed because in many cases the
>> pte marker
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 35ed4ab0d7c5..bd46e6788fac 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -1069,12 +1069,6 @@ static bool is_valid_guard_vma(struct
>> vm_area_struct *vma, bool allow_locked)
>> return !(vma->vm_flags & disallowed);
>> }
>> -static bool is_guard_pte_marker(pte_t ptent)
>> -{
>> - return is_pte_marker(ptent) &&
>> - is_guard_swp_entry(pte_to_swp_entry(ptent));
>> -}
>> -
>> static int guard_install_pud_entry(pud_t *pud, unsigned long addr,
>> unsigned long next, struct mm_walk *walk)
>> {
* Re: [PATCH mm-new 2/3] mm: clean up and expose is_guard_pte_marker()
2025-09-14 14:35 ` [PATCH mm-new 2/3] mm: clean up and expose is_guard_pte_marker() Lance Yang
2025-09-14 16:38 ` Dev Jain
@ 2025-09-15 13:54 ` Lorenzo Stoakes
2025-09-15 14:26 ` Lance Yang
2025-09-17 10:32 ` David Hildenbrand
2 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 13:54 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, ioworker0, linux-kernel, linux-mm, Kairui Song
On Sun, Sep 14, 2025 at 10:35:46PM +0800, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> is_guard_pte_marker() performs a redundant check because it calls both
> is_pte_marker() and is_guard_swp_entry(), both of which internally check
> for a PTE marker.
>
> is_guard_pte_marker()
> |- is_pte_marker()
> | `- is_pte_marker_entry() // First check
> `- is_guard_swp_entry()
> `- is_pte_marker_entry() // Second, redundant check
>
I mean, it expands to:
is_swap_pte(pte) && is_pte_marker_entry(pte_to_swp_entry(pte)) &&
is_pte_marker_entry(pte_to_swp_entry(pte))
So I don't think it's really unreasonable to expect compiler magic here...
But you're right that I should have just used is_swap_pte() really, it's a bit
silly not to, so this is fine :)
> While a modern compiler could likely optimize this away, let's have clean
> code and not rely on it ;)
Please don't put smileys in commit messages :) as cute as they are, this is
going on the permanent kernel record and while we all love them, it's
probably not the best place to put them :P
>
> Also, make it available for hugepage collapsing code.
Nit but put a newline after this.
I think probably if I'm really really nitty I'd say that you should put
this bit first, as it's the primary motivation for the change, and put the
refactoring stuff after.
> Cc: Kairui Song <kasong@tencent.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
This seems fine to me, so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> include/linux/swapops.h | 6 ++++++
> mm/madvise.c | 6 ------
> 2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 59c5889a4d54..7f5684fa043b 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -469,6 +469,12 @@ static inline int is_guard_swp_entry(swp_entry_t entry)
> (pte_marker_get(entry) & PTE_MARKER_GUARD);
> }
>
> +static inline bool is_guard_pte_marker(pte_t ptent)
> +{
> + return is_swap_pte(ptent) &&
> + is_guard_swp_entry(pte_to_swp_entry(ptent));
> +}
> +
> /*
> * This is a special version to check pte_none() just to cover the case when
> * the pte is a pte marker. It existed because in many cases the pte marker
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 35ed4ab0d7c5..bd46e6788fac 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1069,12 +1069,6 @@ static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
> return !(vma->vm_flags & disallowed);
> }
>
> -static bool is_guard_pte_marker(pte_t ptent)
> -{
> - return is_pte_marker(ptent) &&
> - is_guard_swp_entry(pte_to_swp_entry(ptent));
> -}
> -
> static int guard_install_pud_entry(pud_t *pud, unsigned long addr,
> unsigned long next, struct mm_walk *walk)
> {
> --
> 2.49.0
>
* Re: [PATCH mm-new 2/3] mm: clean up and expose is_guard_pte_marker()
2025-09-15 13:54 ` Lorenzo Stoakes
@ 2025-09-15 14:26 ` Lance Yang
0 siblings, 0 replies; 25+ messages in thread
From: Lance Yang @ 2025-09-15 14:26 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, ioworker0, linux-kernel, linux-mm, Kairui Song
Hi Lorenzo,
Thanks for taking the time to review!
On 2025/9/15 21:54, Lorenzo Stoakes wrote:
> On Sun, Sep 14, 2025 at 10:35:46PM +0800, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> is_guard_pte_marker() performs a redundant check because it calls both
>> is_pte_marker() and is_guard_swp_entry(), both of which internally check
>> for a PTE marker.
>>
>> is_guard_pte_marker()
>> |- is_pte_marker()
>> | `- is_pte_marker_entry() // First check
>> `- is_guard_swp_entry()
>> `- is_pte_marker_entry() // Second, redundant check
>>
>
> I mean, it expands to:
>
> is_swap_pte(pte) && is_pte_marker_entry(pte_to_swp_entry(pte)) &&
> is_pte_marker_entry(pte_to_swp_entry(pte))
Yes, that's a much clearer way to lay it out ;)
>
> So I don't think it's really unreasonable to expect compiler magic here...
>
> But you're right that I should have just used is_swap_pte() really, it's a bit
> silly not to, so this is fine :)
Exactly. Glad we're on the same page!
>
>> While a modern compiler could likely optimize this away, let's have clean
>> code and not rely on it ;)
>
> Please don't put smileys in commit messages :) as cute as they are, this is
> going on the permanent kernel record and while we all love them, it's
> probably not the best place to put them :P
>
>>
>> Also, make it available for hugepage collapsing code.
>
> Nit but put a newline after this.
Got it. Will fix up all nits in v2.
>
> I think probably if I'm really really nitty I'd say that you should put
> this bit first, as it's the primary motivation for the change, and put the
> refactoring stuff after.
Ah, right. The motivation for exposing the helper should come first. I'll
reorder this changelog in v2.
>
>> Cc: Kairui Song <kasong@tencent.com>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>
> This seems fine to me, so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Thanks,
Lance
>
>> ---
>> include/linux/swapops.h | 6 ++++++
>> mm/madvise.c | 6 ------
>> 2 files changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
>> index 59c5889a4d54..7f5684fa043b 100644
>> --- a/include/linux/swapops.h
>> +++ b/include/linux/swapops.h
>> @@ -469,6 +469,12 @@ static inline int is_guard_swp_entry(swp_entry_t entry)
>> (pte_marker_get(entry) & PTE_MARKER_GUARD);
>> }
>>
>> +static inline bool is_guard_pte_marker(pte_t ptent)
>> +{
>> + return is_swap_pte(ptent) &&
>> + is_guard_swp_entry(pte_to_swp_entry(ptent));
>> +}
>> +
>> /*
>> * This is a special version to check pte_none() just to cover the case when
>> * the pte is a pte marker. It existed because in many cases the pte marker
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 35ed4ab0d7c5..bd46e6788fac 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -1069,12 +1069,6 @@ static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
>> return !(vma->vm_flags & disallowed);
>> }
>>
>> -static bool is_guard_pte_marker(pte_t ptent)
>> -{
>> - return is_pte_marker(ptent) &&
>> - is_guard_swp_entry(pte_to_swp_entry(ptent));
>> -}
>> -
>> static int guard_install_pud_entry(pud_t *pud, unsigned long addr,
>> unsigned long next, struct mm_walk *walk)
>> {
>> --
>> 2.49.0
>>
* Re: [PATCH mm-new 2/3] mm: clean up and expose is_guard_pte_marker()
2025-09-14 14:35 ` [PATCH mm-new 2/3] mm: clean up and expose is_guard_pte_marker() Lance Yang
2025-09-14 16:38 ` Dev Jain
2025-09-15 13:54 ` Lorenzo Stoakes
@ 2025-09-17 10:32 ` David Hildenbrand
2 siblings, 0 replies; 25+ messages in thread
From: David Hildenbrand @ 2025-09-17 10:32 UTC (permalink / raw)
To: Lance Yang, akpm, lorenzo.stoakes
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, ioworker0, linux-kernel, linux-mm, Kairui Song
On 14.09.25 16:35, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> is_guard_pte_marker() performs a redundant check because it calls both
> is_pte_marker() and is_guard_swp_entry(), both of which internally check
> for a PTE marker.
>
> is_guard_pte_marker()
> |- is_pte_marker()
> | `- is_pte_marker_entry() // First check
> `- is_guard_swp_entry()
> `- is_pte_marker_entry() // Second, redundant check
>
> While a modern compiler could likely optimize this away, let's have clean
> code and not rely on it ;)
>
> Also, make it available for hugepage collapsing code.
>
> Cc: Kairui Song <kasong@tencent.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
* [PATCH mm-new 3/3] mm/khugepaged: abort collapse scan on guard PTEs
2025-09-14 14:35 [PATCH mm-new 0/3] mm/khugepaged: optimize collapse candidate detection Lance Yang
2025-09-14 14:35 ` [PATCH mm-new 1/3] mm/khugepaged: skip unsuitable VMAs earlier in khugepaged_scan_mm_slot() Lance Yang
2025-09-14 14:35 ` [PATCH mm-new 2/3] mm: clean up and expose is_guard_pte_marker() Lance Yang
@ 2025-09-14 14:35 ` Lance Yang
2025-09-14 17:03 ` Dev Jain
2025-09-15 14:08 ` Lorenzo Stoakes
2 siblings, 2 replies; 25+ messages in thread
From: Lance Yang @ 2025-09-14 14:35 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, ioworker0, linux-kernel, linux-mm, Lance Yang
From: Lance Yang <lance.yang@linux.dev>
Guard PTE markers are installed via MADV_GUARD_INSTALL to create
lightweight guard regions.
Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
encountering such a range.
MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
the special marker in __collapse_huge_page_swapin().
hpage_collapse_scan_pmd()
`- collapse_huge_page()
`- __collapse_huge_page_swapin() -> fails!
khugepaged's behavior is slightly different due to its max_ptes_swap limit
(default 64). It won't fail as deep, but it will still needlessly scan up
to 64 swap entries before bailing out.
IMHO, we can and should detect this much earlier ;)
This patch adds a check directly inside the PTE scan loop. If a guard
marker is found, the scan is aborted immediately with a new SCAN_PTE_GUARD
status, avoiding wasted work.
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
mm/khugepaged.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e54f99bb0b57..910a6f2ec8a9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -59,6 +59,7 @@ enum scan_result {
SCAN_STORE_FAILED,
SCAN_COPY_MC,
SCAN_PAGE_FILLED,
+ SCAN_PTE_GUARD,
};
#define CREATE_TRACE_POINTS
@@ -1317,6 +1318,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
result = SCAN_PTE_UFFD_WP;
goto out_unmap;
}
+ /*
+ * Guard PTE markers are installed by
+ * MADV_GUARD_INSTALL. Any collapse path must
+ * not touch them, so abort the scan immediately
+ * if one is found.
+ */
+ if (is_guard_pte_marker(pteval)) {
+ result = SCAN_PTE_GUARD;
+ goto out_unmap;
+ }
continue;
} else {
result = SCAN_EXCEED_SWAP_PTE;
@@ -2860,6 +2871,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
case SCAN_PAGE_COMPOUND:
case SCAN_PAGE_LRU:
case SCAN_DEL_PAGE_LRU:
+ case SCAN_PTE_GUARD:
last_fail = result;
break;
default:
--
2.49.0
* Re: [PATCH mm-new 3/3] mm/khugepaged: abort collapse scan on guard PTEs
2025-09-14 14:35 ` [PATCH mm-new 3/3] mm/khugepaged: abort collapse scan on guard PTEs Lance Yang
@ 2025-09-14 17:03 ` Dev Jain
2025-09-15 3:36 ` Lance Yang
2025-09-15 14:08 ` Lorenzo Stoakes
1 sibling, 1 reply; 25+ messages in thread
From: Dev Jain @ 2025-09-14 17:03 UTC (permalink / raw)
To: Lance Yang, akpm, david, lorenzo.stoakes
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, baohua,
ioworker0, linux-kernel, linux-mm
On 14/09/25 8:05 pm, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
> lightweight guard regions.
>
> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
> encountering such a range.
>
> MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
> the special marker in __collapse_huge_page_swapin().
>
> hpage_collapse_scan_pmd()
> `- collapse_huge_page()
> `- __collapse_huge_page_swapin() -> fails!
>
> khugepaged's behavior is slightly different due to its max_ptes_swap limit
> (default 64). It won't fail as deep, but it will still needlessly scan up
> to 64 swap entries before bailing out.
>
> IMHO, we can and should detect this much earlier ;)
>
> This patch adds a check directly inside the PTE scan loop. If a guard
> marker is found, the scan is aborted immediately with a new SCAN_PTE_GUARD
> status, avoiding wasted work.
>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
> mm/khugepaged.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index e54f99bb0b57..910a6f2ec8a9 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -59,6 +59,7 @@ enum scan_result {
> SCAN_STORE_FAILED,
> SCAN_COPY_MC,
> SCAN_PAGE_FILLED,
> + SCAN_PTE_GUARD,
> };
>
> #define CREATE_TRACE_POINTS
> @@ -1317,6 +1318,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> result = SCAN_PTE_UFFD_WP;
> goto out_unmap;
> }
> + /*
> + * Guard PTE markers are installed by
> + * MADV_GUARD_INSTALL. Any collapse path must
> + * not touch them, so abort the scan immediately
> + * if one is found.
> + */
> + if (is_guard_pte_marker(pteval)) {
> + result = SCAN_PTE_GUARD;
> + goto out_unmap;
> + }
> continue;
This looks good, but see below.
> } else {
> result = SCAN_EXCEED_SWAP_PTE;
> @@ -2860,6 +2871,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> case SCAN_PAGE_COMPOUND:
> case SCAN_PAGE_LRU:
> case SCAN_DEL_PAGE_LRU:
> + case SCAN_PTE_GUARD:
> last_fail = result;
Should we not do this, and just send this case over to the default case? That
would mean an immediate exit with -EINVAL, instead of iterating over the complete
range, potentially collapsing a non-guard range, and returning -EINVAL. I do not
think we should spend significant time in the kernel when the user is literally
invoking madvise(MADV_GUARD_INSTALL) and madvise(MADV_COLLAPSE) on overlapping regions.
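That is, something like the following (a sketch against the hunk above; the
cleanup and locking on the early-exit path in madvise_collapse() are elided):

	case SCAN_PAGE_COMPOUND:
	case SCAN_PAGE_LRU:
	case SCAN_DEL_PAGE_LRU:
		/* Retriable failures: remember and try the next range. */
		last_fail = result;
		break;
	default:
		/*
		 * Without its own case label, SCAN_PTE_GUARD lands here:
		 * record it and stop the walk at once, so the caller
		 * gets -EINVAL immediately (hypothetical early exit).
		 */
		last_fail = result;
		goto break_loop;	/* hypothetical label */
	}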
> break;
> default:
* Re: [PATCH mm-new 3/3] mm/khugepaged: abort collapse scan on guard PTEs
2025-09-14 17:03 ` Dev Jain
@ 2025-09-15 3:36 ` Lance Yang
0 siblings, 0 replies; 25+ messages in thread
From: Lance Yang @ 2025-09-15 3:36 UTC (permalink / raw)
To: Dev Jain
Cc: ziy, baolin.wang, Liam.Howlett, npache, lorenzo.stoakes,
ryan.roberts, baohua, ioworker0, linux-kernel, linux-mm, akpm,
david
On 2025/9/15 01:03, Dev Jain wrote:
>
> On 14/09/25 8:05 pm, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
>> lightweight guard regions.
>>
>> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
>> encountering such a range.
>>
>> MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
>> the special marker in __collapse_huge_page_swapin().
>>
>> hpage_collapse_scan_pmd()
>> `- collapse_huge_page()
>> `- __collapse_huge_page_swapin() -> fails!
>>
>> khugepaged's behavior is slightly different due to its max_ptes_swap
>> limit
>> (default 64). It won't fail as deep, but it will still needlessly scan up
>> to 64 swap entries before bailing out.
>>
>> IMHO, we can and should detect this much earlier ;)
>>
>> This patch adds a check directly inside the PTE scan loop. If a guard
>> marker is found, the scan is aborted immediately with a new
>> SCAN_PTE_GUARD
>> status, avoiding wasted work.
>>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>> mm/khugepaged.c | 12 ++++++++++++
>> 1 file changed, 12 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index e54f99bb0b57..910a6f2ec8a9 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -59,6 +59,7 @@ enum scan_result {
>> SCAN_STORE_FAILED,
>> SCAN_COPY_MC,
>> SCAN_PAGE_FILLED,
>> + SCAN_PTE_GUARD,
>> };
>> #define CREATE_TRACE_POINTS
>> @@ -1317,6 +1318,16 @@ static int hpage_collapse_scan_pmd(struct
>> mm_struct *mm,
>> result = SCAN_PTE_UFFD_WP;
>> goto out_unmap;
>> }
>> + /*
>> + * Guard PTE markers are installed by
>> + * MADV_GUARD_INSTALL. Any collapse path must
>> + * not touch them, so abort the scan immediately
>> + * if one is found.
>> + */
>> + if (is_guard_pte_marker(pteval)) {
>> + result = SCAN_PTE_GUARD;
>> + goto out_unmap;
>> + }
>> continue;
>
> This looks good, but see below.
>
>> } else {
>> result = SCAN_EXCEED_SWAP_PTE;
>> @@ -2860,6 +2871,7 @@ int madvise_collapse(struct vm_area_struct *vma,
>> unsigned long start,
>> case SCAN_PAGE_COMPOUND:
>> case SCAN_PAGE_LRU:
>> case SCAN_DEL_PAGE_LRU:
>> + case SCAN_PTE_GUARD:
>> last_fail = result;
>
> Should we not do this, and just send this case over to the default case.
> That
> would mean immediate exit with -EINVAL, instead of iterating over the
> complete
> range, potentially collapsing a non-guard range, and returning -EINVAL.
That makes sense to me ;)
> I do not
> think we should spend a significant time in the kernel when the user is
> literally
> invoking madvise(MADV_GUARD_INSTALL) and madvise(MADV_COLLAPSE) on
> overlapping regions.
I'm just a bit unsure because the MADV_COLLAPSE man page[1] describes it
as a "best-effort" collapse. This patch follows that idea, collapsing what
it can.
MADV_COLLAPSE (since Linux 6.1)
Perform a best-effort synchronous collapse of the native
pages mapped by the memory range into Transparent Huge
Pages (THPs). MADV_COLLAPSE operates on the current state
of memory of the calling process and makes no persistent
changes or guarantees on how pages will be mapped,
constructed, or faulted in the future.
A hard-fail on a guard PTE marker might go against that.
Well, I'm open to either approach. What do other folks think?
[1] https://man7.org/linux/man-pages/man2/madvise.2.html
Cheers,
Lance
>
>> break;
>> default:
* Re: [PATCH mm-new 3/3] mm/khugepaged: abort collapse scan on guard PTEs
2025-09-14 14:35 ` [PATCH mm-new 3/3] mm/khugepaged: abort collapse scan on guard PTEs Lance Yang
2025-09-14 17:03 ` Dev Jain
@ 2025-09-15 14:08 ` Lorenzo Stoakes
2025-09-15 14:42 ` Lance Yang
1 sibling, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2025-09-15 14:08 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, ioworker0, linux-kernel, linux-mm
On Sun, Sep 14, 2025 at 10:35:47PM +0800, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
> lightweight guard regions.
>
> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
> encountering such a range.
>
> MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
> the special marker in __collapse_huge_page_swapin().
>
> hpage_collapse_scan_pmd()
> `- collapse_huge_page()
> `- __collapse_huge_page_swapin() -> fails!
>
> khugepaged's behavior is slightly different due to its max_ptes_swap limit
> (default 64). It won't fail as deep, but it will still needlessly scan up
> to 64 swap entries before bailing out.
>
> IMHO, we can and should detect this much earlier ;)
No smileys in commit messages please... :)
>
> This patch adds a check directly inside the PTE scan loop. If a guard
> marker is found, the scan is aborted immediately with a new SCAN_PTE_GUARD
> status, avoiding wasted work.
>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
> mm/khugepaged.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index e54f99bb0b57..910a6f2ec8a9 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -59,6 +59,7 @@ enum scan_result {
> SCAN_STORE_FAILED,
> SCAN_COPY_MC,
> SCAN_PAGE_FILLED,
> + SCAN_PTE_GUARD,
I wonder if we really need to have it literally called out though, can we just
use:
SCAN_PTE_NON_PRESENT
Instead?
As it is, indeed, non-present :)
> };
>
> #define CREATE_TRACE_POINTS
> @@ -1317,6 +1318,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> result = SCAN_PTE_UFFD_WP;
> goto out_unmap;
> }
> + /*
> + * Guard PTE markers are installed by
> + * MADV_GUARD_INSTALL. Any collapse path must
> + * not touch them, so abort the scan immediately
> + * if one is found.
> + */
> + if (is_guard_pte_marker(pteval)) {
> + result = SCAN_PTE_GUARD;
> + goto out_unmap;
> + }
> continue;
> } else {
> result = SCAN_EXCEED_SWAP_PTE;
> @@ -2860,6 +2871,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> case SCAN_PAGE_COMPOUND:
> case SCAN_PAGE_LRU:
> case SCAN_DEL_PAGE_LRU:
> + case SCAN_PTE_GUARD:
> last_fail = result;
> break;
> default:
> --
> 2.49.0
>
Cheers, Lorenzo
* Re: [PATCH mm-new 3/3] mm/khugepaged: abort collapse scan on guard PTEs
2025-09-15 14:08 ` Lorenzo Stoakes
@ 2025-09-15 14:42 ` Lance Yang
0 siblings, 0 replies; 25+ messages in thread
From: Lance Yang @ 2025-09-15 14:42 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, ioworker0, linux-kernel, linux-mm
On 2025/9/15 22:08, Lorenzo Stoakes wrote:
> On Sun, Sep 14, 2025 at 10:35:47PM +0800, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
>> lightweight guard regions.
>>
>> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
>> encountering such a range.
>>
>> MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
>> the special marker in __collapse_huge_page_swapin().
>>
>> hpage_collapse_scan_pmd()
>> `- collapse_huge_page()
>> `- __collapse_huge_page_swapin() -> fails!
>>
>> khugepaged's behavior is slightly different due to its max_ptes_swap limit
>> (default 64). It won't fail as deep, but it will still needlessly scan up
>> to 64 swap entries before bailing out.
>>
>> IMHO, we can and should detect this much earlier ;)
>
> No smileys in commit messages please... :)
Got it. Apparently, I'm a bit too fond of them ... Will remove them in v2.
>
>>
>> This patch adds a check directly inside the PTE scan loop. If a guard
>> marker is found, the scan is aborted immediately with a new SCAN_PTE_GUARD
>> status, avoiding wasted work.
>>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>> mm/khugepaged.c | 12 ++++++++++++
>> 1 file changed, 12 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index e54f99bb0b57..910a6f2ec8a9 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -59,6 +59,7 @@ enum scan_result {
>> SCAN_STORE_FAILED,
>> SCAN_COPY_MC,
>> SCAN_PAGE_FILLED,
>> + SCAN_PTE_GUARD,
>
> I wonder if we really need to have it literally called out though, can we just
> use:
>
> SCAN_PTE_NON_PRESENT
>
> Instead?
>
> As it is, indeed, non-present :)
Makes sense to me. A guard PTE is indeed a special non-present case.
So, let's reuse SCAN_PTE_NON_PRESENT for that case ;)
Cheers,
Lance
>
>> };
>>
>> #define CREATE_TRACE_POINTS
>> @@ -1317,6 +1318,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>> result = SCAN_PTE_UFFD_WP;
>> goto out_unmap;
>> }
>> + /*
>> + * Guard PTE markers are installed by
>> + * MADV_GUARD_INSTALL. Any collapse path must
>> + * not touch them, so abort the scan immediately
>> + * if one is found.
>> + */
>> + if (is_guard_pte_marker(pteval)) {
>> + result = SCAN_PTE_GUARD;
>> + goto out_unmap;
>> + }
>> continue;
>> } else {
>> result = SCAN_EXCEED_SWAP_PTE;
>> @@ -2860,6 +2871,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>> case SCAN_PAGE_COMPOUND:
>> case SCAN_PAGE_LRU:
>> case SCAN_DEL_PAGE_LRU:
>> + case SCAN_PTE_GUARD:
>> last_fail = result;
>> break;
>> default:
>> --
>> 2.49.0
>>
>
> Cheers, Lorenzo