linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
@ 2025-06-25  1:40 Baolin Wang
  2025-06-25  1:40 ` [PATCH v4 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs " Baolin Wang
                   ` (3 more replies)
  0 siblings, 4 replies; 37+ messages in thread
From: Baolin Wang @ 2025-06-25  1:40 UTC (permalink / raw)
  To: akpm, hughd, david
  Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, baolin.wang, linux-mm, linux-kernel

When invoking thp_vma_allowable_orders(), if the TVA_ENFORCE_SYSFS flag is not
specified, we will ignore the THP sysfs settings. Whilst it makes sense for the
callers who do not specify this flag, it creates a odd and surprising situation
where a sysadmin specifying 'never' for all THP sizes still observing THP pages
being allocated and used on the system. And the MADV_COLLAPSE is an example of
such a case, that means it will not set TVA_ENFORCE_SYSFS when calling
thp_vma_allowable_orders().

As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
the system-wide anon/shmem THP sysfs settings, which means that even though
we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
attempt to collapse into a anon/shmem THP. This violates the rule we have
agreed upon: never means never.

For example, system administrators who disabled THP everywhere must indeed very
much not want THP to be used for whatever reason - having individual programs
being able to quietly override this is very surprising and likely to cause headaches
for those who desire this not to happen on their systems.

This patch set will address the MADV_COLLAPSE issue.

Test
====
1. Tested the mm selftests and found no regressions.
2. With toggling different Anon mTHP settings, the allocation and madvise collapse for
anonymous pages work well.
3. With toggling different shmem mTHP settings, the allocation and madvise collapse for
shmem work well.
4. Tested the large order allocation for tmpfs, and works as expected.

[1] https://lore.kernel.org/all/1f00fdc3-a3a3-464b-8565-4c1b23d34f8d@linux.alibaba.com/

Changes from v3:
 - Collect reviewed tags. Thanks.
 - Update the commit message, per David.

Changes from v2:
 - Update the commit message and cover letter, per Lorenzo. Thanks.
 - Simplify the logic in thp_vma_allowable_orders(), per Lorenzo and David. Thanks.

Changes from v1:
 - Update the commit message, per Zi.
 - Add Zi's reviewed tag. Thanks.
 - Update the shmem logic.

Baolin Wang (2):
  mm: huge_memory: disallow hugepages if the system-wide THP sysfs
    settings are disabled
  mm: shmem: disallow hugepages if the system-wide shmem THP sysfs
    settings are disabled

 include/linux/huge_mm.h                 | 51 ++++++++++++++++++-------
 mm/shmem.c                              |  6 +--
 tools/testing/selftests/mm/khugepaged.c |  8 +---
 3 files changed, 43 insertions(+), 22 deletions(-)

-- 
2.43.5



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v4 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
  2025-06-25  1:40 [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled Baolin Wang
@ 2025-06-25  1:40 ` Baolin Wang
  2025-06-25  4:34   ` Dev Jain
  2025-06-25  1:40 ` [PATCH v4 2/2] mm: shmem: disallow hugepages if the system-wide shmem " Baolin Wang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 37+ messages in thread
From: Baolin Wang @ 2025-06-25  1:40 UTC (permalink / raw)
  To: akpm, hughd, david
  Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, baolin.wang, linux-mm, linux-kernel

When invoking thp_vma_allowable_orders(), the TVA_ENFORCE_SYSFS flag is not
specified, we will ignore the THP sysfs settings. Whilst it makes sense for the
callers who do not specify this flag, it creates a odd and surprising situation
where a sysadmin specifying 'never' for all THP sizes still observing THP pages
being allocated and used on the system.

The motivating case for this is MADV_COLLAPSE. The MADV_COLLAPSE will ignore
the system-wide Anon THP sysfs settings, which means that even though we have
disabled the Anon THP configuration, MADV_COLLAPSE will still attempt to collapse
into a Anon THP. This violates the rule we have agreed upon: never means never.

Currently, besides MADV_COLLAPSE not setting TVA_ENFORCE_SYSFS, there is only
one other instance where TVA_ENFORCE_SYSFS is not set, which is in the
collapse_pte_mapped_thp() function, but I believe this is reasonable from its
comments.

"
/*
 * If we are here, we've succeeded in replacing all the native pages
 * in the page cache with a single hugepage. If a mm were to fault-in
 * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
 * and map it by a PMD, regardless of sysfs THP settings. As such, let's
 * analogously elide sysfs THP settings here.
 */
if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
"

Another rule for madvise, referring to David's suggestion: “allowing for
collapsing in a VM without VM_HUGEPAGE in the "madvise" mode would be fine".

To address this issue, the current strategy should be:

If no hugepage modes are enabled for the desired orders, nor can we enable them
by inheriting from a 'global' enabled setting - then it must be the case that
all desired orders either specify or inherit 'NEVER' - and we must abort.

Meanwhile, we should fix the khugepaged selftest for MADV_COLLAPSE. Originally,
we could prevent khugepaged by setting THP_MADVISE and removing MADV_HUGEPAGE
setting, while madvise_collapse() can still perform THP collapse. However,
this would cause some test cases to fail because some tests previously set
MADV_NOHUGEPAGE, and now there is no other way to clear the MADV_NOHUGEPAGE
flag except for setting MADV_HUGEPAGE. Therefore, it should be changed to
THP_ALWAYS here to allow madvise_collapse() to perform THP collapse.

Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 include/linux/huge_mm.h                 | 51 ++++++++++++++++++-------
 tools/testing/selftests/mm/khugepaged.c |  6 +--
 2 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4d5bb67dc4ec..ab70ca4e704b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -267,6 +267,42 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 unsigned long tva_flags,
 					 unsigned long orders);
 
+/* Strictly mask requested anonymous orders according to sysfs settings. */
+static inline unsigned long __thp_mask_anon_orders(unsigned long vm_flags,
+		unsigned long tva_flags, unsigned long orders)
+{
+	const unsigned long always = READ_ONCE(huge_anon_orders_always);
+	const unsigned long madvise = READ_ONCE(huge_anon_orders_madvise);
+	const unsigned long inherit = READ_ONCE(huge_anon_orders_inherit);
+	const unsigned long never = ~(always | madvise | inherit);
+	const bool inherit_never = !hugepage_global_enabled();
+
+	/* Disallow orders that are set to NEVER directly ... */
+	orders &= ~never;
+
+	/* ... or through inheritance (global == NEVER). */
+	if (inherit_never)
+		orders &= ~inherit;
+
+	/*
+	 * Otherwise, we only enforce sysfs settings if asked. In addition,
+	 * if the user sets a sysfs mode of madvise and if TVA_ENFORCE_SYSFS
+	 * is not set, we don't bother checking whether the VMA has VM_HUGEPAGE
+	 * set.
+	 */
+	if (!(tva_flags & TVA_ENFORCE_SYSFS))
+		return orders;
+
+	/* We already excluded never inherit above. */
+	if (vm_flags & VM_HUGEPAGE)
+		return orders & (always | madvise | inherit);
+
+	if (hugepage_global_always())
+		return orders & (always | inherit);
+
+	return orders & always;
+}
+
 /**
  * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
  * @vma:  the vm area to check
@@ -289,19 +325,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 				       unsigned long orders)
 {
 	/* Optimization to check if required orders are enabled early. */
-	if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
-		unsigned long mask = READ_ONCE(huge_anon_orders_always);
-
-		if (vm_flags & VM_HUGEPAGE)
-			mask |= READ_ONCE(huge_anon_orders_madvise);
-		if (hugepage_global_always() ||
-		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
-			mask |= READ_ONCE(huge_anon_orders_inherit);
-
-		orders &= mask;
-		if (!orders)
-			return 0;
-	}
+	if (vma_is_anonymous(vma))
+		orders = __thp_mask_anon_orders(vm_flags, tva_flags, orders);
 
 	return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
 }
diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
index 4341ce6b3b38..85bfff53dba6 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -501,11 +501,7 @@ static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
 
 	printf("%s...", msg);
 
-	/*
-	 * Prevent khugepaged interference and tests that MADV_COLLAPSE
-	 * ignores /sys/kernel/mm/transparent_hugepage/enabled
-	 */
-	settings.thp_enabled = THP_NEVER;
+	settings.thp_enabled = THP_ALWAYS;
 	settings.shmem_enabled = SHMEM_NEVER;
 	thp_push_settings(&settings);
 
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 2/2] mm: shmem: disallow hugepages if the system-wide shmem THP sysfs settings are disabled
  2025-06-25  1:40 [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled Baolin Wang
  2025-06-25  1:40 ` [PATCH v4 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs " Baolin Wang
@ 2025-06-25  1:40 ` Baolin Wang
  2025-06-25  5:53 ` [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP " Hugh Dickins
  2025-07-09 12:36 ` Lorenzo Stoakes
  3 siblings, 0 replies; 37+ messages in thread
From: Baolin Wang @ 2025-06-25  1:40 UTC (permalink / raw)
  To: akpm, hughd, david
  Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, baolin.wang, linux-mm, linux-kernel

When invoking thp_vma_allowable_orders(), if the TVA_ENFORCE_SYSFS flag is not
specified, we will ignore the THP sysfs settings. And the MADV_COLLAPSE is an
example of such a case.

The MADV_COLLAPSE will ignore the system-wide shmem THP sysfs settings, which
means that even though we have disabled the shmem THP configuration, MADV_COLLAPSE
will still attempt to collapse into a shmem THP. This violates the rule we have
agreed upon: never means never.

Another rule for madvise, referring to David's suggestion: “allowing for collapsing
in a VM without VM_HUGEPAGE in the "madvise" mode would be fine".

To fix the MADV_COLLAPSE issue for shmem, then the current strategy should be:

For shmem, if none of ‘always’, ‘madvise’, ‘within_size’, and ‘inherit’ have enabled
PMD-sized THP, then MADV_COLLAPSE will be prohibited from collapsing PMD-sized THP.
Similarly, if 'deny' is set, it will also prohibit MADV_COLLAPSE.

For tmpfs, if the mount option is set with the 'huge=never' parameter, then
MADV_COLLAPSE will be prohibited from collapsing PMD-sized THP.

Meanwhile, we should fix the khugepaged selftest for shmem MADV_COLLAPSE by enabling
shmem THP like anonymous pages collapse.

Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c                              | 6 +++---
 tools/testing/selftests/mm/khugepaged.c | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 2b19965d27df..e3f51fab2b7d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -637,7 +637,7 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
 		return 0;
 	if (shmem_huge == SHMEM_HUGE_DENY)
 		return 0;
-	if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
+	if (shmem_huge == SHMEM_HUGE_FORCE)
 		return maybe_pmd_order;
 
 	/*
@@ -672,7 +672,7 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
 
 		fallthrough;
 	case SHMEM_HUGE_ADVISE:
-		if (vm_flags & VM_HUGEPAGE)
+		if (shmem_huge_force || (vm_flags & VM_HUGEPAGE))
 			return maybe_pmd_order;
 		fallthrough;
 	default:
@@ -1806,7 +1806,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
 	/* Allow mTHP that will be fully within i_size. */
 	mask |= shmem_get_orders_within_size(inode, within_size_orders, index, 0);
 
-	if (vm_flags & VM_HUGEPAGE)
+	if (shmem_huge_force || (vm_flags & VM_HUGEPAGE))
 		mask |= READ_ONCE(huge_shmem_orders_madvise);
 
 	if (global_orders > 0)
diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
index 85bfff53dba6..9517ed99c382 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -502,7 +502,7 @@ static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
 	printf("%s...", msg);
 
 	settings.thp_enabled = THP_ALWAYS;
-	settings.shmem_enabled = SHMEM_NEVER;
+	settings.shmem_enabled = SHMEM_ALWAYS;
 	thp_push_settings(&settings);
 
 	/* Clear VM_NOHUGEPAGE */
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
  2025-06-25  1:40 ` [PATCH v4 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs " Baolin Wang
@ 2025-06-25  4:34   ` Dev Jain
  0 siblings, 0 replies; 37+ messages in thread
From: Dev Jain @ 2025-06-25  4:34 UTC (permalink / raw)
  To: Baolin Wang, akpm, hughd, david
  Cc: ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, baohua,
	linux-mm, linux-kernel


On 25/06/25 7:10 am, Baolin Wang wrote:
> When invoking thp_vma_allowable_orders(), the TVA_ENFORCE_SYSFS flag is not
> specified, we will ignore the THP sysfs settings. Whilst it makes sense for the
> callers who do not specify this flag, it creates a odd and surprising situation
> where a sysadmin specifying 'never' for all THP sizes still observing THP pages
> being allocated and used on the system.
>
> The motivating case for this is MADV_COLLAPSE. The MADV_COLLAPSE will ignore
> the system-wide Anon THP sysfs settings, which means that even though we have
> disabled the Anon THP configuration, MADV_COLLAPSE will still attempt to collapse
> into a Anon THP. This violates the rule we have agreed upon: never means never.
>
> Currently, besides MADV_COLLAPSE not setting TVA_ENFORCE_SYSFS, there is only
> one other instance where TVA_ENFORCE_SYSFS is not set, which is in the
> collapse_pte_mapped_thp() function, but I believe this is reasonable from its
> comments.
>
> "
> /*
>   * If we are here, we've succeeded in replacing all the native pages
>   * in the page cache with a single hugepage. If a mm were to fault-in
>   * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
>   * and map it by a PMD, regardless of sysfs THP settings. As such, let's
>   * analogously elide sysfs THP settings here.
>   */
> if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
> "
>
> Another rule for madvise, referring to David's suggestion: “allowing for
> collapsing in a VM without VM_HUGEPAGE in the "madvise" mode would be fine".
>
> To address this issue, the current strategy should be:
>
> If no hugepage modes are enabled for the desired orders, nor can we enable them
> by inheriting from a 'global' enabled setting - then it must be the case that
> all desired orders either specify or inherit 'NEVER' - and we must abort.
>
> Meanwhile, we should fix the khugepaged selftest for MADV_COLLAPSE. Originally,
> we could prevent khugepaged by setting THP_MADVISE and removing MADV_HUGEPAGE
> setting, while madvise_collapse() can still perform THP collapse. However,
> this would cause some test cases to fail because some tests previously set
> MADV_NOHUGEPAGE, and now there is no other way to clear the MADV_NOHUGEPAGE
> flag except for setting MADV_HUGEPAGE. Therefore, it should be changed to
> THP_ALWAYS here to allow madvise_collapse() to perform THP collapse.
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>   

Reviewed-by: Dev Jain <dev.jain@arm.com>

>   


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  1:40 [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled Baolin Wang
  2025-06-25  1:40 ` [PATCH v4 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs " Baolin Wang
  2025-06-25  1:40 ` [PATCH v4 2/2] mm: shmem: disallow hugepages if the system-wide shmem " Baolin Wang
@ 2025-06-25  5:53 ` Hugh Dickins
  2025-06-25  6:05   ` Dev Jain
                     ` (3 more replies)
  2025-07-09 12:36 ` Lorenzo Stoakes
  3 siblings, 4 replies; 37+ messages in thread
From: Hugh Dickins @ 2025-06-25  5:53 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, hughd, david, ziy, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, zokeefe, shy828301, usamaarif642,
	linux-mm, linux-kernel

On Wed, 25 Jun 2025, Baolin Wang wrote:

> When invoking thp_vma_allowable_orders(), if the TVA_ENFORCE_SYSFS flag is not
> specified, we will ignore the THP sysfs settings. Whilst it makes sense for the
> callers who do not specify this flag, it creates a odd and surprising situation
> where a sysadmin specifying 'never' for all THP sizes still observing THP pages
> being allocated and used on the system. And the MADV_COLLAPSE is an example of
> such a case, that means it will not set TVA_ENFORCE_SYSFS when calling
> thp_vma_allowable_orders().
> 
> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
> the system-wide anon/shmem THP sysfs settings, which means that even though
> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
> attempt to collapse into a anon/shmem THP. This violates the rule we have
> agreed upon: never means never.
> 
> For example, system administrators who disabled THP everywhere must indeed very
> much not want THP to be used for whatever reason - having individual programs
> being able to quietly override this is very surprising and likely to cause headaches
> for those who desire this not to happen on their systems.
> 
> This patch set will address the MADV_COLLAPSE issue.
> 
> Test
> ====
> 1. Tested the mm selftests and found no regressions.
> 2. With toggling different Anon mTHP settings, the allocation and madvise collapse for
> anonymous pages work well.
> 3. With toggling different shmem mTHP settings, the allocation and madvise collapse for
> shmem work well.
> 4. Tested the large order allocation for tmpfs, and works as expected.
> 
> [1] https://lore.kernel.org/all/1f00fdc3-a3a3-464b-8565-4c1b23d34f8d@linux.alibaba.com/
> 
> Changes from v3:
>  - Collect reviewed tags. Thanks.
>  - Update the commit message, per David.
> 
> Changes from v2:
>  - Update the commit message and cover letter, per Lorenzo. Thanks.
>  - Simplify the logic in thp_vma_allowable_orders(), per Lorenzo and David. Thanks.
> 
> Changes from v1:
>  - Update the commit message, per Zi.
>  - Add Zi's reviewed tag. Thanks.
>  - Update the shmem logic.
> 
> Baolin Wang (2):
>   mm: huge_memory: disallow hugepages if the system-wide THP sysfs
>     settings are disabled
>   mm: shmem: disallow hugepages if the system-wide shmem THP sysfs
>     settings are disabled
> 
>  include/linux/huge_mm.h                 | 51 ++++++++++++++++++-------
>  mm/shmem.c                              |  6 +--
>  tools/testing/selftests/mm/khugepaged.c |  8 +---
>  3 files changed, 43 insertions(+), 22 deletions(-)
> 
> -- 
> 2.43.5

Sorry for chiming in so late, after so much effort: but I beg you,
please drop these.

I did not want to get into a fight, and had been hoping a voice of
reason would come from others, before I got around to responding.

And indeed Ryan understood correctly at the start; and he, Usama
and Barry, perhaps others I've missed, have raised appropriate
concerns but not prevailed.

If we're sloganeering, I much prefer "never break userspace" to
"never means never", attractive though that over-simplification is.

Seldom has a feature been so thorougly documented as MADV_COLLAPSE,
in its 6.1 commits and in the "man 2 madvise" page: which are
explicit about MADV_COLLAPSE providing a way to get THPs where the
sysfs setting governing automatic behaviour does not insert them.

We would all prefer a less messy world of THP tunables.  I certainly
find plenty to dislike there too; and wish that a less assertive name
than "never" had been chosen originally for the default off position.

But please don't break the accepted and documented behaviour of
MADV_COLLAPSE now.

If you want to exclude all possibility of THPs, then please use the
prctl(PR_SET_THP_DISABLE); or shmem_enabled=deny (I think it was me
who insisted that be respected by MADV_COLLAPSE back then).

Add a "deny" option to /sys/kernel/mm/transparent_hugepage/enabled
if you like.  (But in these days of filesystem large folios, adding
new protections against them seems a few years late.)

If Andrew decides that these patches should go in, then I'll have to
scrutinize them more carefully than I've done so far: but currently
I'm hoping to avoid that.

Hugh


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  5:53 ` [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP " Hugh Dickins
@ 2025-06-25  6:05   ` Dev Jain
  2025-06-25  6:26   ` Baolin Wang
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 37+ messages in thread
From: Dev Jain @ 2025-06-25  6:05 UTC (permalink / raw)
  To: Hugh Dickins, Baolin Wang
  Cc: akpm, david, ziy, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, baohua, zokeefe, shy828301, usamaarif642, linux-mm,
	linux-kernel


On 25/06/25 11:23 am, Hugh Dickins wrote:
> On Wed, 25 Jun 2025, Baolin Wang wrote:
>
>> When invoking thp_vma_allowable_orders(), if the TVA_ENFORCE_SYSFS flag is not
>> specified, we will ignore the THP sysfs settings. Whilst it makes sense for the
>> callers who do not specify this flag, it creates a odd and surprising situation
>> where a sysadmin specifying 'never' for all THP sizes still observing THP pages
>> being allocated and used on the system. And the MADV_COLLAPSE is an example of
>> such a case, that means it will not set TVA_ENFORCE_SYSFS when calling
>> thp_vma_allowable_orders().
>>
>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
>> the system-wide anon/shmem THP sysfs settings, which means that even though
>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
>> attempt to collapse into a anon/shmem THP. This violates the rule we have
>> agreed upon: never means never.
>>
>> For example, system administrators who disabled THP everywhere must indeed very
>> much not want THP to be used for whatever reason - having individual programs
>> being able to quietly override this is very surprising and likely to cause headaches
>> for those who desire this not to happen on their systems.
>>
>> This patch set will address the MADV_COLLAPSE issue.
>>
>> Test
>> ====
>> 1. Tested the mm selftests and found no regressions.
>> 2. With toggling different Anon mTHP settings, the allocation and madvise collapse for
>> anonymous pages work well.
>> 3. With toggling different shmem mTHP settings, the allocation and madvise collapse for
>> shmem work well.
>> 4. Tested the large order allocation for tmpfs, and works as expected.
>>
>> [1] https://lore.kernel.org/all/1f00fdc3-a3a3-464b-8565-4c1b23d34f8d@linux.alibaba.com/
>>
>> Changes from v3:
>>   - Collect reviewed tags. Thanks.
>>   - Update the commit message, per David.
>>
>> Changes from v2:
>>   - Update the commit message and cover letter, per Lorenzo. Thanks.
>>   - Simplify the logic in thp_vma_allowable_orders(), per Lorenzo and David. Thanks.
>>
>> Changes from v1:
>>   - Update the commit message, per Zi.
>>   - Add Zi's reviewed tag. Thanks.
>>   - Update the shmem logic.
>>
>> Baolin Wang (2):
>>    mm: huge_memory: disallow hugepages if the system-wide THP sysfs
>>      settings are disabled
>>    mm: shmem: disallow hugepages if the system-wide shmem THP sysfs
>>      settings are disabled
>>
>>   include/linux/huge_mm.h                 | 51 ++++++++++++++++++-------
>>   mm/shmem.c                              |  6 +--
>>   tools/testing/selftests/mm/khugepaged.c |  8 +---
>>   3 files changed, 43 insertions(+), 22 deletions(-)
>>
>> -- 
>> 2.43.5
> Sorry for chiming in so late, after so much effort: but I beg you,
> please drop these.
>
> I did not want to get into a fight, and had been hoping a voice of
> reason would come from others, before I got around to responding.
>
> And indeed Ryan understood correctly at the start; and he, Usama
> and Barry, perhaps others I've missed, have raised appropriate
> concerns but not prevailed.
>
> If we're sloganeering, I much prefer "never break userspace" to
> "never means never", attractive though that over-simplification is.

FWIW I had the same thought of "never break userspace" when I first
saw the patch but then this point did not shine out in the discussion
so I became nervous to mention it again :)

>
> Seldom has a feature been so thorougly documented as MADV_COLLAPSE,
> in its 6.1 commits and in the "man 2 madvise" page: which are
> explicit about MADV_COLLAPSE providing a way to get THPs where the
> sysfs setting governing automatic behaviour does not insert them.
>
> We would all prefer a less messy world of THP tunables.  I certainly
> find plenty to dislike there too; and wish that a less assertive name
> than "never" had been chosen originally for the default off position.
>
> But please don't break the accepted and documented behaviour of
> MADV_COLLAPSE now.
>
> If you want to exclude all possibility of THPs, then please use the
> prctl(PR_SET_THP_DISABLE); or shmem_enabled=deny (I think it was me
> who insisted that be respected by MADV_COLLAPSE back then).
>
> Add a "deny" option to /sys/kernel/mm/transparent_hugepage/enabled
> if you like.  (But in these days of filesystem large folios, adding
> new protections against them seems a few years late.)
>
> If Andrew decides that these patches should go in, then I'll have to
> scrutinize them more carefully than I've done so far: but currently
> I'm hoping to avoid that.
>
> Hugh


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  5:53 ` [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP " Hugh Dickins
  2025-06-25  6:05   ` Dev Jain
@ 2025-06-25  6:26   ` Baolin Wang
  2025-06-25  6:49     ` Dev Jain
  2025-06-25  7:20   ` Lorenzo Stoakes
  2025-06-25  7:23   ` David Hildenbrand
  3 siblings, 1 reply; 37+ messages in thread
From: Baolin Wang @ 2025-06-25  6:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, david, ziy, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, zokeefe, shy828301, usamaarif642,
	linux-mm, linux-kernel



On 2025/6/25 13:53, Hugh Dickins wrote:
> On Wed, 25 Jun 2025, Baolin Wang wrote:
> 
>> When invoking thp_vma_allowable_orders(), if the TVA_ENFORCE_SYSFS flag is not
>> specified, we will ignore the THP sysfs settings. Whilst it makes sense for the
>> callers who do not specify this flag, it creates a odd and surprising situation
>> where a sysadmin specifying 'never' for all THP sizes still observing THP pages
>> being allocated and used on the system. And the MADV_COLLAPSE is an example of
>> such a case, that means it will not set TVA_ENFORCE_SYSFS when calling
>> thp_vma_allowable_orders().
>>
>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
>> the system-wide anon/shmem THP sysfs settings, which means that even though
>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
>> attempt to collapse into a anon/shmem THP. This violates the rule we have
>> agreed upon: never means never.
>>
>> For example, system administrators who disabled THP everywhere must indeed very
>> much not want THP to be used for whatever reason - having individual programs
>> being able to quietly override this is very surprising and likely to cause headaches
>> for those who desire this not to happen on their systems.
>>
>> This patch set will address the MADV_COLLAPSE issue.
>>
>> Test
>> ====
>> 1. Tested the mm selftests and found no regressions.
>> 2. With toggling different Anon mTHP settings, the allocation and madvise collapse for
>> anonymous pages work well.
>> 3. With toggling different shmem mTHP settings, the allocation and madvise collapse for
>> shmem work well.
>> 4. Tested the large order allocation for tmpfs, and works as expected.
>>
>> [1] https://lore.kernel.org/all/1f00fdc3-a3a3-464b-8565-4c1b23d34f8d@linux.alibaba.com/
>>
>> Changes from v3:
>>   - Collect reviewed tags. Thanks.
>>   - Update the commit message, per David.
>>
>> Changes from v2:
>>   - Update the commit message and cover letter, per Lorenzo. Thanks.
>>   - Simplify the logic in thp_vma_allowable_orders(), per Lorenzo and David. Thanks.
>>
>> Changes from v1:
>>   - Update the commit message, per Zi.
>>   - Add Zi's reviewed tag. Thanks.
>>   - Update the shmem logic.
>>
>> Baolin Wang (2):
>>    mm: huge_memory: disallow hugepages if the system-wide THP sysfs
>>      settings are disabled
>>    mm: shmem: disallow hugepages if the system-wide shmem THP sysfs
>>      settings are disabled
>>
>>   include/linux/huge_mm.h                 | 51 ++++++++++++++++++-------
>>   mm/shmem.c                              |  6 +--
>>   tools/testing/selftests/mm/khugepaged.c |  8 +---
>>   3 files changed, 43 insertions(+), 22 deletions(-)
>>
>> -- 
>> 2.43.5
> 
> Sorry for chiming in so late, after so much effort: but I beg you,
> please drop these.

Thanks Hugh for your input. (yes, we put in a lot of effort on 
discussion and testing:( ).

> I did not want to get into a fight, and had been hoping a voice of
> reason would come from others, before I got around to responding.
> 
> And indeed Ryan understood correctly at the start; and he, Usama
> and Barry, perhaps others I've missed, have raised appropriate
> concerns but not prevailed.
> 
> If we're sloganeering, I much prefer "never break userspace" to
> "never means never", attractive though that over-simplification is.

Yes, agree. we should not break userspace, however, I suspect whether 
this can really break userspace. We can set 
'/sys/kernel/mm/transparent_hugepage/enabled' to 'madvise' to allow 
MADV_COLLAPSE. Additionally, I really doubt that when the system-wide 
THP settings are set to 'never', userspace would still expect to 
collapse into THP using MADV_COLLAPSE.

Moreover, what makes this issue particularly frustrating is that when we 
introduce mTHP collapse[1], MADV_COLLAPSE complicates matters further. 
That is, when the system only enables 64K mTHP, MADV_COLLAPSE still 
allows collapsing into PMD-sized THP. This really breaks the user's 
settings.

[1] https://lore.kernel.org/all/20250515032226.128900-1-npache@redhat.com/

> Seldom has a feature been so thorougly documented as MADV_COLLAPSE,
> in its 6.1 commits and in the "man 2 madvise" page: which are
> explicit about MADV_COLLAPSE providing a way to get THPs where the
> sysfs setting governing automatic behaviour does not insert them.
> 
> We would all prefer a less messy world of THP tunables.  I certainly
> find plenty to dislike there too; and wish that a less assertive name
> than "never" had been chosen originally for the default off position.
> 
> But please don't break the accepted and documented behaviour of
> MADV_COLLAPSE now.
> 
> If you want to exclude all possibility of THPs, then please use the
> prctl(PR_SET_THP_DISABLE); or shmem_enabled=deny (I think it was me
> who insisted that be respected by MADV_COLLAPSE back then).

Yes, that will prevent MADV_COLLAPSE.

> Add a "deny" option to /sys/kernel/mm/transparent_hugepage/enabled
> if you like.  (But in these days of filesystem large folios, adding
> new protections against them seems a few years late.)
> 
> If Andrew decides that these patches should go in, then I'll have to
> scrutinize them more carefully than I've done so far: but currently
> I'm hoping to avoid that.
> 
> Hugh



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  6:26   ` Baolin Wang
@ 2025-06-25  6:49     ` Dev Jain
  2025-06-25  6:55       ` Baolin Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Dev Jain @ 2025-06-25  6:49 UTC (permalink / raw)
  To: Baolin Wang, Hugh Dickins
  Cc: akpm, david, ziy, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, baohua, zokeefe, shy828301, usamaarif642, linux-mm,
	linux-kernel


On 25/06/25 11:56 am, Baolin Wang wrote:
>
>
> On 2025/6/25 13:53, Hugh Dickins wrote:
>> On Wed, 25 Jun 2025, Baolin Wang wrote:
>>
>>> When invoking thp_vma_allowable_orders(), if the TVA_ENFORCE_SYSFS 
>>> flag is not
>>> specified, we will ignore the THP sysfs settings. Whilst it makes 
>>> sense for the
>>> callers who do not specify this flag, it creates a odd and 
>>> surprising situation
>>> where a sysadmin specifying 'never' for all THP sizes still 
>>> observing THP pages
>>> being allocated and used on the system. And the MADV_COLLAPSE is an 
>>> example of
>>> such a case, that means it will not set TVA_ENFORCE_SYSFS when calling
>>> thp_vma_allowable_orders().
>>>
>>> As we discussed in the previous thread [1], the MADV_COLLAPSE will 
>>> ignore
>>> the system-wide anon/shmem THP sysfs settings, which means that even 
>>> though
>>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE 
>>> will still
>>> attempt to collapse into a anon/shmem THP. This violates the rule we 
>>> have
>>> agreed upon: never means never.
>>>
>>> For example, system administrators who disabled THP everywhere must 
>>> indeed very
>>> much not want THP to be used for whatever reason - having individual 
>>> programs
>>> being able to quietly override this is very surprising and likely to 
>>> cause headaches
>>> for those who desire this not to happen on their systems.
>>>
>>> This patch set will address the MADV_COLLAPSE issue.
>>>
>>> Test
>>> ====
>>> 1. Tested the mm selftests and found no regressions.
>>> 2. With toggling different Anon mTHP settings, the allocation and 
>>> madvise collapse for
>>> anonymous pages work well.
>>> 3. With toggling different shmem mTHP settings, the allocation and 
>>> madvise collapse for
>>> shmem work well.
>>> 4. Tested the large order allocation for tmpfs, and works as expected.
>>>
>>> [1] 
>>> https://lore.kernel.org/all/1f00fdc3-a3a3-464b-8565-4c1b23d34f8d@linux.alibaba.com/
>>>
>>> Changes from v3:
>>>   - Collect reviewed tags. Thanks.
>>>   - Update the commit message, per David.
>>>
>>> Changes from v2:
>>>   - Update the commit message and cover letter, per Lorenzo. Thanks.
>>>   - Simplify the logic in thp_vma_allowable_orders(), per Lorenzo 
>>> and David. Thanks.
>>>
>>> Changes from v1:
>>>   - Update the commit message, per Zi.
>>>   - Add Zi's reviewed tag. Thanks.
>>>   - Update the shmem logic.
>>>
>>> Baolin Wang (2):
>>>    mm: huge_memory: disallow hugepages if the system-wide THP sysfs
>>>      settings are disabled
>>>    mm: shmem: disallow hugepages if the system-wide shmem THP sysfs
>>>      settings are disabled
>>>
>>>   include/linux/huge_mm.h                 | 51 
>>> ++++++++++++++++++-------
>>>   mm/shmem.c                              |  6 +--
>>>   tools/testing/selftests/mm/khugepaged.c |  8 +---
>>>   3 files changed, 43 insertions(+), 22 deletions(-)
>>>
>>> -- 
>>> 2.43.5
>>
>> Sorry for chiming in so late, after so much effort: but I beg you,
>> please drop these.
>
> Thanks Hugh for your input. (yes, we put in a lot of effort on 
> discussion and testing:( ).
>
>> I did not want to get into a fight, and had been hoping a voice of
>> reason would come from others, before I got around to responding.
>>
>> And indeed Ryan understood correctly at the start; and he, Usama
>> and Barry, perhaps others I've missed, have raised appropriate
>> concerns but not prevailed.
>>
>> If we're sloganeering, I much prefer "never break userspace" to
>> "never means never", attractive though that over-simplification is.
>
> Yes, agree. we should not break userspace, however, I suspect whether 
> this can really break userspace. We can set 
> '/sys/kernel/mm/transparent_hugepage/enabled' to 'madvise' to allow 
> MADV_COLLAPSE. Additionally, I really doubt that when the system-wide 
> THP settings are set to 'never', userspace would still expect to 
> collapse into THP using MADV_COLLAPSE.

After this patch, will a user still be able to use MADV_COLLAPSE and 
ensure no interference from khugepaged?


>
> Moreover, what makes this issue particularly frustrating is that when 
> we introduce mTHP collapse[1], MADV_COLLAPSE complicates matters 
> further. That is, when the system only enables 64K mTHP, MADV_COLLAPSE 
> still allows collapsing into PMD-sized THP. This really breaks the 
> user's settings.

This issue will still be there without this patch right?


>
> [1] 
> https://lore.kernel.org/all/20250515032226.128900-1-npache@redhat.com/
>
>> Seldom has a feature been so thorougly documented as MADV_COLLAPSE,
>> in its 6.1 commits and in the "man 2 madvise" page: which are
>> explicit about MADV_COLLAPSE providing a way to get THPs where the
>> sysfs setting governing automatic behaviour does not insert them.
>>
>> We would all prefer a less messy world of THP tunables.  I certainly
>> find plenty to dislike there too; and wish that a less assertive name
>> than "never" had been chosen originally for the default off position.
>>
>> But please don't break the accepted and documented behaviour of
>> MADV_COLLAPSE now.
>>
>> If you want to exclude all possibility of THPs, then please use the
>> prctl(PR_SET_THP_DISABLE); or shmem_enabled=deny (I think it was me
>> who insisted that be respected by MADV_COLLAPSE back then).
>
> Yes, that will prevent MADV_COLLAPSE.
>
>> Add a "deny" option to /sys/kernel/mm/transparent_hugepage/enabled
>> if you like.  (But in these days of filesystem large folios, adding
>> new protections against them seems a few years late.)
>>
>> If Andrew decides that these patches should go in, then I'll have to
>> scrutinize them more carefully than I've done so far: but currently
>> I'm hoping to avoid that.
>>
>> Hugh
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  6:49     ` Dev Jain
@ 2025-06-25  6:55       ` Baolin Wang
  0 siblings, 0 replies; 37+ messages in thread
From: Baolin Wang @ 2025-06-25  6:55 UTC (permalink / raw)
  To: Dev Jain, Hugh Dickins
  Cc: akpm, david, ziy, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, baohua, zokeefe, shy828301, usamaarif642, linux-mm,
	linux-kernel



On 2025/6/25 14:49, Dev Jain wrote:
> 
> On 25/06/25 11:56 am, Baolin Wang wrote:
>>
>>
>> On 2025/6/25 13:53, Hugh Dickins wrote:
>>> On Wed, 25 Jun 2025, Baolin Wang wrote:
>>>
>>>> When invoking thp_vma_allowable_orders(), if the TVA_ENFORCE_SYSFS 
>>>> flag is not
>>>> specified, we will ignore the THP sysfs settings. Whilst it makes 
>>>> sense for the
>>>> callers who do not specify this flag, it creates a odd and 
>>>> surprising situation
>>>> where a sysadmin specifying 'never' for all THP sizes still 
>>>> observing THP pages
>>>> being allocated and used on the system. And the MADV_COLLAPSE is an 
>>>> example of
>>>> such a case, that means it will not set TVA_ENFORCE_SYSFS when calling
>>>> thp_vma_allowable_orders().
>>>>
>>>> As we discussed in the previous thread [1], the MADV_COLLAPSE will 
>>>> ignore
>>>> the system-wide anon/shmem THP sysfs settings, which means that even 
>>>> though
>>>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE 
>>>> will still
>>>> attempt to collapse into a anon/shmem THP. This violates the rule we 
>>>> have
>>>> agreed upon: never means never.
>>>>
>>>> For example, system administrators who disabled THP everywhere must 
>>>> indeed very
>>>> much not want THP to be used for whatever reason - having individual 
>>>> programs
>>>> being able to quietly override this is very surprising and likely to 
>>>> cause headaches
>>>> for those who desire this not to happen on their systems.
>>>>
>>>> This patch set will address the MADV_COLLAPSE issue.
>>>>
>>>> Test
>>>> ====
>>>> 1. Tested the mm selftests and found no regressions.
>>>> 2. With toggling different Anon mTHP settings, the allocation and 
>>>> madvise collapse for
>>>> anonymous pages work well.
>>>> 3. With toggling different shmem mTHP settings, the allocation and 
>>>> madvise collapse for
>>>> shmem work well.
>>>> 4. Tested the large order allocation for tmpfs, and works as expected.
>>>>
>>>> [1] https://lore.kernel.org/all/1f00fdc3- 
>>>> a3a3-464b-8565-4c1b23d34f8d@linux.alibaba.com/
>>>>
>>>> Changes from v3:
>>>>   - Collect reviewed tags. Thanks.
>>>>   - Update the commit message, per David.
>>>>
>>>> Changes from v2:
>>>>   - Update the commit message and cover letter, per Lorenzo. Thanks.
>>>>   - Simplify the logic in thp_vma_allowable_orders(), per Lorenzo 
>>>> and David. Thanks.
>>>>
>>>> Changes from v1:
>>>>   - Update the commit message, per Zi.
>>>>   - Add Zi's reviewed tag. Thanks.
>>>>   - Update the shmem logic.
>>>>
>>>> Baolin Wang (2):
>>>>    mm: huge_memory: disallow hugepages if the system-wide THP sysfs
>>>>      settings are disabled
>>>>    mm: shmem: disallow hugepages if the system-wide shmem THP sysfs
>>>>      settings are disabled
>>>>
>>>>   include/linux/huge_mm.h                 | 51 +++++++++++++++++ 
>>>> +-------
>>>>   mm/shmem.c                              |  6 +--
>>>>   tools/testing/selftests/mm/khugepaged.c |  8 +---
>>>>   3 files changed, 43 insertions(+), 22 deletions(-)
>>>>
>>>> -- 
>>>> 2.43.5
>>>
>>> Sorry for chiming in so late, after so much effort: but I beg you,
>>> please drop these.
>>
>> Thanks Hugh for your input. (yes, we put in a lot of effort on 
>> discussion and testing:( ).
>>
>>> I did not want to get into a fight, and had been hoping a voice of
>>> reason would come from others, before I got around to responding.
>>>
>>> And indeed Ryan understood correctly at the start; and he, Usama
>>> and Barry, perhaps others I've missed, have raised appropriate
>>> concerns but not prevailed.
>>>
>>> If we're sloganeering, I much prefer "never break userspace" to
>>> "never means never", attractive though that over-simplification is.
>>
>> Yes, agree. we should not break userspace, however, I suspect whether 
>> this can really break userspace. We can set '/sys/kernel/mm/ 
>> transparent_hugepage/enabled' to 'madvise' to allow MADV_COLLAPSE. 
>> Additionally, I really doubt that when the system-wide THP settings 
>> are set to 'never', userspace would still expect to collapse into THP 
>> using MADV_COLLAPSE.
> 
> After this patch, will a user still be able to use MADV_COLLAPSE and 
> ensure no interference from khugepaged?

I think so. Becuase khugepaged will still check VM_HUGEPAGE if we set 
'/sys/kernel/mm/transparent_hugepage/enabled' to 'madvise'.

>> Moreover, what makes this issue particularly frustrating is that when 
>> we introduce mTHP collapse[1], MADV_COLLAPSE complicates matters 
>> further. That is, when the system only enables 64K mTHP, MADV_COLLAPSE 
>> still allows collapsing into PMD-sized THP. This really breaks the 
>> user's settings.
> 
> This issue will still be there without this patch right?

NO. Will fix this issue. After this patch, MADV_COLLAPSE can not 
continue to collapse PMD-sized THP if the system only enables 64K mTHP.

>> [1] https://lore.kernel.org/all/20250515032226.128900-1- 
>> npache@redhat.com/
>>
>>> Seldom has a feature been so thorougly documented as MADV_COLLAPSE,
>>> in its 6.1 commits and in the "man 2 madvise" page: which are
>>> explicit about MADV_COLLAPSE providing a way to get THPs where the
>>> sysfs setting governing automatic behaviour does not insert them.
>>>
>>> We would all prefer a less messy world of THP tunables.  I certainly
>>> find plenty to dislike there too; and wish that a less assertive name
>>> than "never" had been chosen originally for the default off position.
>>>
>>> But please don't break the accepted and documented behaviour of
>>> MADV_COLLAPSE now.
>>>
>>> If you want to exclude all possibility of THPs, then please use the
>>> prctl(PR_SET_THP_DISABLE); or shmem_enabled=deny (I think it was me
>>> who insisted that be respected by MADV_COLLAPSE back then).
>>
>> Yes, that will prevent MADV_COLLAPSE.
>>
>>> Add a "deny" option to /sys/kernel/mm/transparent_hugepage/enabled
>>> if you like.  (But in these days of filesystem large folios, adding
>>> new protections against them seems a few years late.)
>>>
>>> If Andrew decides that these patches should go in, then I'll have to
>>> scrutinize them more carefully than I've done so far: but currently
>>> I'm hoping to avoid that.
>>>
>>> Hugh
>>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  5:53 ` [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP " Hugh Dickins
  2025-06-25  6:05   ` Dev Jain
  2025-06-25  6:26   ` Baolin Wang
@ 2025-06-25  7:20   ` Lorenzo Stoakes
  2025-06-25  7:34     ` David Hildenbrand
  2025-06-25  7:23   ` David Hildenbrand
  3 siblings, 1 reply; 37+ messages in thread
From: Lorenzo Stoakes @ 2025-06-25  7:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Baolin Wang, akpm, david, ziy, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, zokeefe, shy828301, usamaarif642, linux-mm,
	linux-kernel

On Tue, Jun 24, 2025 at 10:53:28PM -0700, Hugh Dickins wrote:
> On Wed, 25 Jun 2025, Baolin Wang wrote:
> Sorry for chiming in so late, after so much effort: but I beg you,
> please drop these.
>
> I did not want to get into a fight, and had been hoping a voice of
> reason would come from others, before I got around to responding.

There'll be no fighting :>)

I for one (and I'm absolutely confident others in this thread) very much
respect your opinion - so some civil disagreement is perfectly natural and
healthy.

This is how we get the best outcomes...

>
> And indeed Ryan understood correctly at the start; and he, Usama
> and Barry, perhaps others I've missed, have raised appropriate
> concerns but not prevailed.
>
> If we're sloganeering, I much prefer "never break userspace" to
> "never means never", attractive though that over-simplification is.

It would have been useful to have this discussion earlier indeed... :)

I disagree we're breaking userspace. See below.

>
> Seldom has a feature been so thorougly documented as MADV_COLLAPSE,
> in its 6.1 commits and in the "man 2 madvise" page: which are
> explicit about MADV_COLLAPSE providing a way to get THPs where the
> sysfs setting governing automatic behaviour does not insert them.

I disagree, I feel like the unfortunately poorly named 'never' toggle in
sysfs makes everything uncertain.

One can easily read 'is independent of any sysfs setting...  both in terms
of determining THP eligibility, and allocation semantics' as meaning that
it ignores things such as madvise vs. inherit etc. But it's not clear that
it means you ignore 'never' - yes a very poorly chosen name.

I think a reasonable person would interpret 'never' to mean literally
'never'.

But if we go further, note that the man page points to the
Documentation/admin-guide/mm/transhuge.rst document which states:

	Global THP controls
	-------------------

	Transparent Hugepage Support for anonymous memory can be entirely
	disabled (mostly for debugging purposes)...

"Entirely disabled".

So I disagree that 'seldom has feature been so thoroughly documented', I
mean in a sense - yes! But also unfortunately, and unintentionally,
vaguely.

This really does come down to some poor choices made early as to wording
and names.

But I'd argue that as a result of this, absolutely the expectation of
system administrators is that never means never.

So we're sort of arguing de jure vs de facto here.

>
> We would all prefer a less messy world of THP tunables.  I certainly
> find plenty to dislike there too; and wish that a less assertive name
> than "never" had been chosen originally for the default off position.
>
> But please don't break the accepted and documented behaviour of
> MADV_COLLAPSE now.

Again see above, I absolutely disagree this is documented _clearly_. And
that's the underlying issue here.

I feel like if you polled 100 system administrators (assuming they knew
about THP) as to how you globally disable THP, probably all 100 would say
you do it via:

# echo never > /sys/kernel/mm/transparent_hugepage/enabled

So shouldn't 'never break userspace' be based on practical reality rather
than a theorised interpretation of documents that sadly are not clear
enough?

>
> If you want to exclude all possibility of THPs, then please use the
> prctl(PR_SET_THP_DISABLE); or shmem_enabled=deny (I think it was me
> who insisted that be respected by MADV_COLLAPSE back then).

While it's useful to have this, prctl() is where APIs go to die. It's a
hidden wasteland that nobody knows about, it may as well not exist.

We have a whole sysctl directory for configuring this stuff. It's sort of
crazy to have that then to have a special prctl() hidden away also...

>
> Add a "deny" option to /sys/kernel/mm/transparent_hugepage/enabled
> if you like.  (But in these days of filesystem large folios, adding
> new protections against them seems a few years late.)

Based on a reasonable interpretation of 'never' I would say we retain this
series as-is, and 'deny' could be what 'never' was intended to be before.

>
> If Andrew decides that these patches should go in, then I'll have to
> scrutinize them more carefully than I've done so far: but currently
> I'm hoping to avoid that.

Sure and that'd be hugely appreciated!

>
> Hugh

Thanks for your feedback, it's much appreciated! I hope we can figure this
out sensibly.

Also note Baolin's point about mTHP which rather complicates matters.

I think we all feel the THP interfaces are... not perfectly ideal.

/sys/kernel/mm/transparent_hugepage_2 anyone? :P

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  5:53 ` [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP " Hugh Dickins
                     ` (2 preceding siblings ...)
  2025-06-25  7:20   ` Lorenzo Stoakes
@ 2025-06-25  7:23   ` David Hildenbrand
  2025-06-25  7:30     ` Lorenzo Stoakes
  3 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2025-06-25  7:23 UTC (permalink / raw)
  To: Hugh Dickins, Baolin Wang
  Cc: akpm, ziy, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, zokeefe, shy828301, usamaarif642, linux-mm,
	linux-kernel

On 25.06.25 07:53, Hugh Dickins wrote:
> On Wed, 25 Jun 2025, Baolin Wang wrote:
> 
>> When invoking thp_vma_allowable_orders(), if the TVA_ENFORCE_SYSFS flag is not
>> specified, we will ignore the THP sysfs settings. Whilst it makes sense for the
>> callers who do not specify this flag, it creates a odd and surprising situation
>> where a sysadmin specifying 'never' for all THP sizes still observing THP pages
>> being allocated and used on the system. And the MADV_COLLAPSE is an example of
>> such a case, that means it will not set TVA_ENFORCE_SYSFS when calling
>> thp_vma_allowable_orders().
>>
>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
>> the system-wide anon/shmem THP sysfs settings, which means that even though
>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
>> attempt to collapse into a anon/shmem THP. This violates the rule we have
>> agreed upon: never means never.
>>
>> For example, system administrators who disabled THP everywhere must indeed very
>> much not want THP to be used for whatever reason - having individual programs
>> being able to quietly override this is very surprising and likely to cause headaches
>> for those who desire this not to happen on their systems.
>>
>> This patch set will address the MADV_COLLAPSE issue.
>>
>> Test
>> ====
>> 1. Tested the mm selftests and found no regressions.
>> 2. With toggling different Anon mTHP settings, the allocation and madvise collapse for
>> anonymous pages work well.
>> 3. With toggling different shmem mTHP settings, the allocation and madvise collapse for
>> shmem work well.
>> 4. Tested the large order allocation for tmpfs, and works as expected.
>>
>> [1] https://lore.kernel.org/all/1f00fdc3-a3a3-464b-8565-4c1b23d34f8d@linux.alibaba.com/
>>
>> Changes from v3:
>>   - Collect reviewed tags. Thanks.
>>   - Update the commit message, per David.
>>
>> Changes from v2:
>>   - Update the commit message and cover letter, per Lorenzo. Thanks.
>>   - Simplify the logic in thp_vma_allowable_orders(), per Lorenzo and David. Thanks.
>>
>> Changes from v1:
>>   - Update the commit message, per Zi.
>>   - Add Zi's reviewed tag. Thanks.
>>   - Update the shmem logic.
>>
>> Baolin Wang (2):
>>    mm: huge_memory: disallow hugepages if the system-wide THP sysfs
>>      settings are disabled
>>    mm: shmem: disallow hugepages if the system-wide shmem THP sysfs
>>      settings are disabled
>>
>>   include/linux/huge_mm.h                 | 51 ++++++++++++++++++-------
>>   mm/shmem.c                              |  6 +--
>>   tools/testing/selftests/mm/khugepaged.c |  8 +---
>>   3 files changed, 43 insertions(+), 22 deletions(-)
>>
>> -- 
>> 2.43.5
> 
> Sorry for chiming in so late, after so much effort: but I beg you,
> please drop these.
> 
> I did not want to get into a fight, and had been hoping a voice of
> reason would come from others, before I got around to responding.

Thanks for being that voice of reason :)

I would have hoped someone from the original discussion would have 
raised that this was indeed all discussed before (below).

> 
> And indeed Ryan understood correctly at the start; and he, Usama
> and Barry, perhaps others I've missed, have raised appropriate
> concerns but not prevailed.
> 
> If we're sloganeering, I much prefer "never break userspace" to
> "never means never", attractive though that over-simplification is.

Well, one could argue we broke user space (admin settings) when we 
converted "never" to no longer mean "never", but "never by page faults + 
khugepaged". And we did so without updating the documentation.

I finally went back and checked the original discussions and, yes, this 
was deliberate [1].

As so often, we created a mess with THP toggles.

Probably best to fixup the "never" documentation, and state that there 
is no way to disable MADV_COLLAPSE anymore.

I agree that if we want a way to disable all of them, we better have a 
"deny" now. ... until someone else breaks that, then we can have a 
"really_never_deny_all" etc. ;)

[1] 
https://lore.kernel.org/linux-mm/CAAa6QmTCuHWuQ=dcdPX8hS3mKMucwjsjEoBCeFoDSwXCca6hpA@mail.gmail.com/

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  7:23   ` David Hildenbrand
@ 2025-06-25  7:30     ` Lorenzo Stoakes
  2025-06-25  7:36       ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Lorenzo Stoakes @ 2025-06-25  7:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Hugh Dickins, Baolin Wang, akpm, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, zokeefe, shy828301, usamaarif642,
	linux-mm, linux-kernel

On Wed, Jun 25, 2025 at 09:23:31AM +0200, David Hildenbrand wrote:
> Well, one could argue we broke user space (admin settings) when we converted
> "never" to no longer mean "never", but "never by page faults + khugepaged".
> And we did so without updating the documentation.
>
> I finally went back and checked the original discussions and, yes, this was
> deliberate [1].
>
> As so often, we created a mess with THP toggles.

I mean... !!!

>
> Probably best to fixup the "never" documentation, and state that there is no
> way to disable MADV_COLLAPSE anymore.

I disagree on the basis that system administrators will absolutely expect:

# echo never > /sys/kernel/mm/transparent-hugepage/enabled

To disable THP.

I _guarantee_ you that's what nearly everybody except a handful of people will
expect.

If we do decide to not do this series, _please_ can we seriously update the
documentation to be _absolutely crystal clear_ about this.

I will volunteer to do this in this case :)

>
> I agree that if we want a way to disable all of them, we better have a
> "deny" now. ... until someone else breaks that, then we can have a
> "really_never_deny_all" etc. ;)

I really really dislike this. 'Deny' is weaker than 'never'. And now we have to
add even more complexity to the thing.

Sorry but with 'never' we basically chose the strongest possible term.

It would have to be something really horrid like 'never_even_collapse'.

I want to throw up...


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  7:20   ` Lorenzo Stoakes
@ 2025-06-25  7:34     ` David Hildenbrand
  2025-06-25  7:55       ` Lorenzo Stoakes
  2025-06-25 11:03       ` Usama Arif
  0 siblings, 2 replies; 37+ messages in thread
From: David Hildenbrand @ 2025-06-25  7:34 UTC (permalink / raw)
  To: Lorenzo Stoakes, Hugh Dickins
  Cc: Baolin Wang, akpm, ziy, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, zokeefe, shy828301, usamaarif642, linux-mm,
	linux-kernel

>>
>> We would all prefer a less messy world of THP tunables.  I certainly
>> find plenty to dislike there too; and wish that a less assertive name
>> than "never" had been chosen originally for the default off position.
>>
>> But please don't break the accepted and documented behaviour of
>> MADV_COLLAPSE now.
> 
> Again see above, I absolutely disagree this is documented _clearly_. And
> that's the underlying issue here.
 > > I feel like if you polled 100 system administrators (assuming they knew
> about THP) as to how you globally disable THP, probably all 100 would say
> you do it via:
> 
> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
> 

Yes. One big problem is that the documentation was not updated.

Changing the meaning of "entirely disabled" to "entirely disabled 
automatically (page faults, khugepaged)"

> So shouldn't 'never break userspace' be based on practical reality rather
> than a theorised interpretation of documents that sadly are not clear
> enough?

I think the problem is that there might indeed be more users out there 
relying on "never+MADV_COLLPASE" to now place THPs than 
"never+MADV_COLLPASE" to no place THPs.

What is the harm when not placing THPs? Performance degradation for some 
apps?

What is the harm when placing THPs although disabled? Making the life 
for David for debugging customer issues harder :P

> 
>>
>> If you want to exclude all possibility of THPs, then please use the
>> prctl(PR_SET_THP_DISABLE); or shmem_enabled=deny (I think it was me
>> who insisted that be respected by MADV_COLLAPSE back then).
> 
> While it's useful to have this, prctl() is where APIs go to die. It's a
> hidden wasteland that nobody knows about, it may as well not exist.
> 
> We have a whole sysctl directory for configuring this stuff. It's sort of
> crazy to have that then to have a special prctl() hidden away also...

prctl(PR_SET_THP_DISABLE) is today not really a reasonable way for an 
admin to disable THPs system wide I'm afraid.

> 
>>
>> Add a "deny" option to /sys/kernel/mm/transparent_hugepage/enabled
>> if you like.  (But in these days of filesystem large folios, adding
>> new protections against them seems a few years late.)
> 
> Based on a reasonable interpretation of 'never' I would say we retain this
> series as-is, and 'deny' could be what 'never' was intended to be before.

At least for shmem_enabled never means "unless overwritten per mount 
through huge=" and "deny" means "force off for all mounts". So "deny" is 
"harder"

Inverting the semantics here might be causing even more problems.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  7:30     ` Lorenzo Stoakes
@ 2025-06-25  7:36       ` David Hildenbrand
  2025-06-25  7:42         ` Lorenzo Stoakes
  2025-06-25 21:51         ` Hugh Dickins
  0 siblings, 2 replies; 37+ messages in thread
From: David Hildenbrand @ 2025-06-25  7:36 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Hugh Dickins, Baolin Wang, akpm, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, zokeefe, shy828301, usamaarif642,
	linux-mm, linux-kernel

On 25.06.25 09:30, Lorenzo Stoakes wrote:
> On Wed, Jun 25, 2025 at 09:23:31AM +0200, David Hildenbrand wrote:
>> Well, one could argue we broke user space (admin settings) when we converted
>> "never" to no longer mean "never", but "never by page faults + khugepaged".
>> And we did so without updating the documentation.
>>
>> I finally went back and checked the original discussions and, yes, this was
>> deliberate [1].
>>
>> As so often, we created a mess with THP toggles.
> 
> I mean... !!!
> 
>>
>> Probably best to fixup the "never" documentation, and state that there is no
>> way to disable MADV_COLLAPSE anymore.
> 
> I disagree on the basis that system administrators will absolutely expect:
> 
> # echo never > /sys/kernel/mm/transparent-hugepage/enabled
> 
> To disable THP.
> 
> I _guarantee_ you that's what nearly everybody except a handful of people will
> expect.

I know, See my other mail, the problem is rather if there is no somebody 
relying on never+MADV_COLLAPSE from doing the MADV_COLLAPSE-documented 
thing.

It's a mess.

(I have the suspicion that Hugh might know of such a user :) )

> 
> If we do decide to not do this series, _please_ can we seriously update the
> documentation to be _absolutely crystal clear_ about this.
> 
> I will volunteer to do this in this case :)
> 
>>
>> I agree that if we want a way to disable all of them, we better have a
>> "deny" now. ... until someone else breaks that, then we can have a
>> "really_never_deny_all" etc. ;)
> 
> I really really dislike this. 'Deny' is weaker than 'never'. And now we have to
> add even more complexity to the thing.

See my other mail, for shmem_enabled "deny" is stronger than "never", lol.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  7:36       ` David Hildenbrand
@ 2025-06-25  7:42         ` Lorenzo Stoakes
  2025-06-25  7:49           ` David Hildenbrand
  2025-06-25 21:51         ` Hugh Dickins
  1 sibling, 1 reply; 37+ messages in thread
From: Lorenzo Stoakes @ 2025-06-25  7:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Hugh Dickins, Baolin Wang, akpm, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, zokeefe, shy828301, usamaarif642,
	linux-mm, linux-kernel

On Wed, Jun 25, 2025 at 09:36:53AM +0200, David Hildenbrand wrote:
> On 25.06.25 09:30, Lorenzo Stoakes wrote:
> > I _guarantee_ you that's what nearly everybody except a handful of people will
> > expect.
>
> I know, See my other mail, the problem is rather if there is no somebody
> relying on never+MADV_COLLAPSE from doing the MADV_COLLAPSE-documented
> thing.
>
> It's a mess.

Well now we have an almost philosophical debate - we have different sets of
users, 99% of whom believe the uAPI is X, and 1% of whom believe it is Y.

Now what is the uAPI? What is 'breaking userspace'? :)

Temptation to cc Linus here ;)

>
> (I have the suspicion that Hugh might know of such a user :) )

Speak up Hugh we're all friends here haha ;)

I mean if this really is a problem for real users then I'll concede the point.

Let me reply on other thread...

> > I really really dislike this. 'Deny' is weaker than 'never'. And now we have to
> > add even more complexity to the thing.
>
> See my other mail, for shmem_enabled "deny" is stronger than "never", lol.

OK so we're past the throwing up stage, can I cry now? :P


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  7:42         ` Lorenzo Stoakes
@ 2025-06-25  7:49           ` David Hildenbrand
  2025-06-25  8:16             ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2025-06-25  7:49 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Hugh Dickins, Baolin Wang, akpm, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, zokeefe, shy828301, usamaarif642,
	linux-mm, linux-kernel

On 25.06.25 09:42, Lorenzo Stoakes wrote:
> On Wed, Jun 25, 2025 at 09:36:53AM +0200, David Hildenbrand wrote:
>> On 25.06.25 09:30, Lorenzo Stoakes wrote:
>>> I _guarantee_ you that's what nearly everybody except a handful of people will
>>> expect.
>>
>> I know, See my other mail, the problem is rather if there is no somebody
>> relying on never+MADV_COLLAPSE from doing the MADV_COLLAPSE-documented
>> thing.
>>
>> It's a mess.
> 
> Well now we have an almost philosophical debate - we have different sets of
> users, 99% of whom believe the uAPI is X, and 1% of whom believe it is Y.
> 
> Now what is the uAPI? What is 'breaking userspace'? :)

Yeah, that's why I mentioned that I think we broke "something" when we 
changed the semantics. But that breakage probably only affects real 
corner cases (debugging, customer workarounds).

I think the whole use case of using MADV_COLLAPSE to completely control 
THP allocation in a system is otherwise pretty hard to achieve, if there 
is no other way to tame THP allocation through page faults+khugepaged.

Mess.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  7:34     ` David Hildenbrand
@ 2025-06-25  7:55       ` Lorenzo Stoakes
  2025-06-25  8:12         ` Lorenzo Stoakes
  2025-06-25 11:03       ` Usama Arif
  1 sibling, 1 reply; 37+ messages in thread
From: Lorenzo Stoakes @ 2025-06-25  7:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Hugh Dickins, Baolin Wang, akpm, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, zokeefe, shy828301, usamaarif642,
	linux-mm, linux-kernel

On Wed, Jun 25, 2025 at 09:34:51AM +0200, David Hildenbrand wrote:
> > >
> > > We would all prefer a less messy world of THP tunables.  I certainly
> > > find plenty to dislike there too; and wish that a less assertive name
> > > than "never" had been chosen originally for the default off position.
> > >
> > > But please don't break the accepted and documented behaviour of
> > > MADV_COLLAPSE now.
> >
> > Again see above, I absolutely disagree this is documented _clearly_. And
> > that's the underlying issue here.
> > > I feel like if you polled 100 system administrators (assuming they knew
> > about THP) as to how you globally disable THP, probably all 100 would say
> > you do it via:
> >
> > # echo never > /sys/kernel/mm/transparent_hugepage/enabled
> >
>
> Yes. One big problem is that the documentation was not updated.
>
> Changing the meaning of "entirely disabled" to "entirely disabled
> automatically (page faults, khugepaged)"

I mean, if we decide to drop this series, I will fix this in both man page
and documentation.

It'll be basically heavily underlining the fact that sysfs never DOES NOT
MEAN NEVER it means never _automatically_ as you say.

>
> > So shouldn't 'never break userspace' be based on practical reality rather
> > than a theorised interpretation of documents that sadly are not clear
> > enough?
>
> I think the problem is that there might indeed be more users out there
> relying on "never+MADV_COLLPASE" to now place THPs than
> "never+MADV_COLLPASE" to no place THPs.
>
> What is the harm when not placing THPs? Performance degradation for some
> apps?

Right, but we're already not placing the _right_ kinds of
pages. MADV_COLLAPSE was a pre-mTHP era thing...

>
> What is the harm when placing THPs although disabled? Making the life for
> David for debugging customer issues harder :P

Well Baolin pointed out the case of a system that only enables mTHP for
instance. Now these apps all get PMD pages. That is surprising and also
absolutely reduces availablility of mTHP pages.

This is a very big problem.

I guess this comes down to - do we have users who are _absolutely reliant_
on never + collapse?

And no we do not want to degrade performance. But do such people exist?

I'm guessing Hugh might know ;)

> > While it's useful to have this, prctl() is where APIs go to die. It's a
> > hidden wasteland that nobody knows about, it may as well not exist.
> >
> > We have a whole sysctl directory for configuring this stuff. It's sort of
> > crazy to have that then to have a special prctl() hidden away also...
>
> prctl(PR_SET_THP_DISABLE) is today not really a reasonable way for an admin
> to disable THPs system wide I'm afraid.

Yes.

>
> >
> > >
> > > Add a "deny" option to /sys/kernel/mm/transparent_hugepage/enabled
> > > if you like.  (But in these days of filesystem large folios, adding
> > > new protections against them seems a few years late.)
> >
> > Based on a reasonable interpretation of 'never' I would say we retain this
> > series as-is, and 'deny' could be what 'never' was intended to be before.
>
> At least for shmem_enabled never means "unless overwritten per mount through
> huge=" and "deny" means "force off for all mounts". So "deny" is "harder"

This is so... ugh. Lord. My sweet English language... "I will never let you
in here" vs. "I deny you entry", one is a lot stronger, and it's not the
latter one...

Anyway ok given that this is the case, I guess we could use deny for this.

>
> Inverting the semantics here might be causing even more problems.

Yeah you're probably right.

But obviously every distro who defaults to 'never' is going to get a
surprise here.

>
> --
> Cheers,
>
> David / dhildenb
>

I suppose the least awful way of addressing Baolin's concerns re: mTHP
while simultaneosly keeping existing semantics is:

1. Introduce deny to mean what never should have meant.
2. Use something like the logic here to enforce it.
3. Heavily document it (I can do this).

But I still find this yucky based on the fact that nobody will expect any
of this.

But maybe the THP toggles are such a mess that we're past that anyway?...


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  7:55       ` Lorenzo Stoakes
@ 2025-06-25  8:12         ` Lorenzo Stoakes
  2025-06-25  8:24           ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Lorenzo Stoakes @ 2025-06-25  8:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Hugh Dickins, Baolin Wang, akpm, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, zokeefe, shy828301, usamaarif642,
	linux-mm, linux-kernel

On Wed, Jun 25, 2025 at 08:55:28AM +0100, Lorenzo Stoakes wrote:
> I suppose the least awful way of addressing Baolin's concerns re: mTHP
> while simultaneosly keeping existing semantics is:
>
> 1. Introduce deny to mean what never should have meant.

To fix Baolin's issue btw we'd have to add 'deny' to both 'global' settings
_and_ each page size setting.

Because otherwise we'd end up in a weird case where say:

global 'deny'

 2 MiB 'never'
64 KiB 'inherit'

And err... get 2 MiB THP pages from MADV_COLLAPSE :)

Or:

global 'deny'

 2 MiB 'never'
64 KiB 'always'

Or:

global 'never'

 2 MiB 'never'
64 KiB 'always'

Or:

global 'never'

 2 MiB 'madvise'
64 KiB 'always'

All doing the same. Not very clear is it?

We have sowed the seeds of something terrible here, truly.

Even with this change we'd end up with:

global 'always'

  2 MiB 'deny'
256 KiB 'never'
 64 KiB 'never'

Means 256 KiB...

I wonder whether doing this just adds even more of a mess?

> 2. Use something like the logic here to enforce it.
> 3. Heavily document it (I can do this).
>
> But I still find this yucky based on the fact that nobody will expect any
> of this.
>
> But maybe the THP toggles are such a mess that we're past that anyway?...


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  7:49           ` David Hildenbrand
@ 2025-06-25  8:16             ` David Hildenbrand
  2025-06-25  8:22               ` Lorenzo Stoakes
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2025-06-25  8:16 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Hugh Dickins, Baolin Wang, akpm, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, zokeefe, shy828301, usamaarif642,
	linux-mm, linux-kernel

On 25.06.25 09:49, David Hildenbrand wrote:
> On 25.06.25 09:42, Lorenzo Stoakes wrote:
>> On Wed, Jun 25, 2025 at 09:36:53AM +0200, David Hildenbrand wrote:
>>> On 25.06.25 09:30, Lorenzo Stoakes wrote:
>>>> I _guarantee_ you that's what nearly everybody except a handful of people will
>>>> expect.
>>>
>>> I know, See my other mail, the problem is rather if there is no somebody
>>> relying on never+MADV_COLLAPSE from doing the MADV_COLLAPSE-documented
>>> thing.
>>>
>>> It's a mess.
>>
>> Well now we have an almost philosophical debate - we have different sets of
>> users, 99% of whom believe the uAPI is X, and 1% of whom believe it is Y.
>>
>> Now what is the uAPI? What is 'breaking userspace'? :)
> 
> Yeah, that's why I mentioned that I think we broke "something" when we
> changed the semantics. But that breakage probably only affects real
> corner cases (debugging, customer workarounds).
> 
> I think the whole use case of using MADV_COLLAPSE to completely control
> THP allocation in a system is otherwise pretty hard to achieve, if there
> is no other way to tame THP allocation through page faults+khugepaged.

Just want to add: for an app itself, it's doable in "madvise" mode 
perfectly fine.

If your app does a MADV_HUGEPAGE, it can get a THP during page-fault + 
khugepaged.

If your app does not do a MADV_HUGEPAGE, it can get a THP through 
MADV_COLLAPSE.

So the "madvise" mode actually works.



The problem appears as soon as we want to control other processes that 
might be setting MADV_HUGEPAGE, and we actually want to control the 
behavior using process_madvise(MADV_COLLAPSE), to say "well, the 
MADV_HUGEPAGE" should be ignored.

Then, you configure "never" system-wide and use 
process_madvise(MADV_COLLAPSE) to drive it all manually.

Curious to learn if there is such a user out there.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  8:16             ` David Hildenbrand
@ 2025-06-25  8:22               ` Lorenzo Stoakes
  2025-06-25  8:40                 ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Lorenzo Stoakes @ 2025-06-25  8:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Hugh Dickins, Baolin Wang, akpm, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, zokeefe, shy828301, usamaarif642,
	linux-mm, linux-kernel

On Wed, Jun 25, 2025 at 10:16:46AM +0200, David Hildenbrand wrote:
> On 25.06.25 09:49, David Hildenbrand wrote:
> > I think the whole use case of using MADV_COLLAPSE to completely control
> > THP allocation in a system is otherwise pretty hard to achieve, if there
> > is no other way to tame THP allocation through page faults+khugepaged.
>
> Just want to add: for an app itself, it's doable in "madvise" mode perfectly
> fine.
>
> If your app does a MADV_HUGEPAGE, it can get a THP during page-fault +
> khugepaged.
>
> If your app does not do a MADV_HUGEPAGE, it can get a THP through
> MADV_COLLAPSE.
>
> So the "madvise" mode actually works.

Right, but for me MADV_COLLAPSE is more about 'I want THPs _now_ (if available),
not when khugepaged decides to give me some'.

So we have multiple semantics at work here, unfortunately.

>
> The problem appears as soon as we want to control other processes that might
> be setting MADV_HUGEPAGE, and we actually want to control the behavior using
> process_madvise(MADV_COLLAPSE), to say "well, the MADV_HUGEPAGE" should be
> ignored.

This is a _very_ specialist use.

I'd argue for a 'manual' mode to be added to sysfs to cover this case, with
'never' having the 'actually means never' semantics.

You might argue that could confuse things, but it'd retain the 'de facto'
understanding nearly everybody has about what thees flags mean, but give
whatever user is out there that needs this the ability to continue doing what
they want.

And we get into philosophy about not 'breaking' userland, not sure we have a
TLB/page fault/folio allocation efficiency contract with userland :)

No program will break with this patch applied. Just potentially get performance
degradation in a very, very specialist case.

>
> Then, you configure "never" system-wide and use
> process_madvise(MADV_COLLAPSE) to drive it all manually.
>
> Curious to learn if there is such a user out there.

Oh me too :)

>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  8:12         ` Lorenzo Stoakes
@ 2025-06-25  8:24           ` David Hildenbrand
  2025-06-25  8:37             ` Lorenzo Stoakes
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2025-06-25  8:24 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Hugh Dickins, Baolin Wang, akpm, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, zokeefe, shy828301, usamaarif642,
	linux-mm, linux-kernel

On 25.06.25 10:12, Lorenzo Stoakes wrote:
> On Wed, Jun 25, 2025 at 08:55:28AM +0100, Lorenzo Stoakes wrote:
>> I suppose the least awful way of addressing Baolin's concerns re: mTHP
>> while simultaneosly keeping existing semantics is:
>>
>> 1. Introduce deny to mean what never should have meant.
> 
> To fix Baolin's issue btw we'd have to add 'deny' to both 'global' settings
> _and_ each page size setting.
> 
> Because otherwise we'd end up in a weird case where say:
> 
> global 'deny'
> 
>   2 MiB 'never'
> 64 KiB 'inherit'
> 
> And err... get 2 MiB THP pages from MADV_COLLAPSE :)
> 
> Or:
> 
> global 'deny'
> 
>   2 MiB 'never'
> 64 KiB 'always'
> 
> Or:
> 
> global 'never'
> 
>   2 MiB 'never'
> 64 KiB 'always'
> 
> Or:
> 
> global 'never'
> 
>   2 MiB 'madvise'
> 64 KiB 'always'
> 
> All doing the same. Not very clear is it?
> 
> We have sowed the seeds of something terrible here, truly.

Fully agreed. "Deny" is nasty. Maybe if we really need a way to disable 
"madv_collapse", it should be done differently, not using this toggle here.

Regarding MADV_COLLAPSE, I strongly assume that we should not change it 
to collapse smaller mTHPs as part of the khugepaged mTHP work. For now, 
it will simply always collapse to PMD THPs.

Once we want to support other sizes, likely MADV_COLLAPSE users want to 
have better control over which size to use, at which point it all gets 
nasty.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  8:24           ` David Hildenbrand
@ 2025-06-25  8:37             ` Lorenzo Stoakes
  2025-06-25  8:52               ` Baolin Wang
  2025-06-25  8:53               ` David Hildenbrand
  0 siblings, 2 replies; 37+ messages in thread
From: Lorenzo Stoakes @ 2025-06-25  8:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Hugh Dickins, Baolin Wang, akpm, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, zokeefe, shy828301, usamaarif642,
	linux-mm, linux-kernel

On Wed, Jun 25, 2025 at 10:24:53AM +0200, David Hildenbrand wrote:
> On 25.06.25 10:12, Lorenzo Stoakes wrote:
> > On Wed, Jun 25, 2025 at 08:55:28AM +0100, Lorenzo Stoakes wrote:
> > > I suppose the least awful way of addressing Baolin's concerns re: mTHP
> > > while simultaneosly keeping existing semantics is:
> > >
> > > 1. Introduce deny to mean what never should have meant.
> >
> > To fix Baolin's issue btw we'd have to add 'deny' to both 'global' settings
> > _and_ each page size setting.
> >
> > Because otherwise we'd end up in a weird case where say:
> >
> > global 'deny'
> >
> >   2 MiB 'never'
> > 64 KiB 'inherit'
> >
> > And err... get 2 MiB THP pages from MADV_COLLAPSE :)
> >
> > Or:
> >
> > global 'deny'
> >
> >   2 MiB 'never'
> > 64 KiB 'always'
> >
> > Or:
> >
> > global 'never'
> >
> >   2 MiB 'never'
> > 64 KiB 'always'
> >
> > Or:
> >
> > global 'never'
> >
> >   2 MiB 'madvise'
> > 64 KiB 'always'
> >
> > All doing the same. Not very clear is it?
> >
> > We have sowed the seeds of something terrible here, truly.
>
> Fully agreed. "Deny" is nasty. Maybe if we really need a way to disable
> "madv_collapse", it should be done differently, not using this toggle here.

Yeah maybe the best way is to just have another tunable for this?

/sys/kernel/mm/transparent_hugepage/disable_collapse perhaps?

What do you think Hugh, Baolin?

>
> Regarding MADV_COLLAPSE, I strongly assume that we should not change it to
> collapse smaller mTHPs as part of the khugepaged mTHP work. For now, it will
> simply always collapse to PMD THPs.

Yeah thinking about it maybe this is the best way. And we can then update
the man page to make this ABUNDANTLY clear (am happy to do this).

This keeps things simple.

(One side note on PMD-sized MADV_COLLAPSE - this is basically completely
useless for 64 KB page size arm64 systems where PMD's are 512 MB :)

Thoughts Baolin?

>
> Once we want to support other sizes, likely MADV_COLLAPSE users want to have
> better control over which size to use, at which point it all gets nasty.

madvise2() this time with extra parameters? ;)

I sort of wish we had added a flags parameter there.

But lacking a time machine... :)

>
> --
> Cheers,
>
> David / dhildenb
>

To summarise:

Drop series:

* Might degrade performance for very specific users using
  never/MADV_COLLAPSE (quite possibly via process_madvise() + a remote
  process).

* Matches 'de jure' interpretation of documentation.

Keep series:

* Provides no means whatsoever to have a 'manual only' collapse mode,
  though does provide for manual khugepaged THP.

* MADV_COLLAPSE automatically gets mTHP support based on obeying 'never'.

* Matches likely 'de facto' understanding system admins have about THP
  usage.

Action items:

* Either way, I (Lorenzo) will improve documentation.

* If we drop the series, provide another means to disable
  MADV_COLLAPSE. But not using existing sysfs toggles, something new. We
  will document MADV_COLLAPSE as PMD only.

* If we drop the series, also consider how we might provide mTHP-compatible
  MADV_COLLAPSE.

* Totally and completely refactor the hell out of the THP implementation
  from top-to-bottom (over time this is becoming more and more of a me
  thing... as I'm getting ever more frustrated with the implementation ;)


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  8:22               ` Lorenzo Stoakes
@ 2025-06-25  8:40                 ` David Hildenbrand
  2025-06-25  8:45                   ` Lorenzo Stoakes
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2025-06-25  8:40 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Hugh Dickins, Baolin Wang, akpm, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, zokeefe, shy828301, usamaarif642,
	linux-mm, linux-kernel

On 25.06.25 10:22, Lorenzo Stoakes wrote:
> On Wed, Jun 25, 2025 at 10:16:46AM +0200, David Hildenbrand wrote:
>> On 25.06.25 09:49, David Hildenbrand wrote:
>>> I think the whole use case of using MADV_COLLAPSE to completely control
>>> THP allocation in a system is otherwise pretty hard to achieve, if there
>>> is no other way to tame THP allocation through page faults+khugepaged.
>>
>> Just want to add: for an app itself, it's doable in "madvise" mode perfectly
>> fine.
>>
>> If your app does a MADV_HUGEPAGE, it can get a THP during page-fault +
>> khugepaged.
>>
>> If your app does not do a MADV_HUGEPAGE, it can get a THP through
>> MADV_COLLAPSE.
>>
>> So the "madvise" mode actually works.
> 
> Right, but for me MADV_COLLAPSE is more about 'I want THPs _now_ (if available),
> not when khugepaged decides to give me some'.
> 
> So we have multiple semantics at work here, unfortunately.
> 
>>
>> The problem appears as soon as we want to control other processes that might
>> be setting MADV_HUGEPAGE, and we actually want to control the behavior using
>> process_madvise(MADV_COLLAPSE), to say "well, the MADV_HUGEPAGE" should be
>> ignored.
> 
> This is a _very_ specialist use.
> 
> I'd argue for a 'manual' mode to be added to sysfs to cover this case, with
> 'never' having the 'actually means never' semantics.
> 
> You might argue that could confuse things, but it'd retain the 'de facto'
> understanding nearly everybody has about what thees flags mean, but give
> whatever user is out there that needs this the ability to continue doing what
> they want.
> 
> And we get into philosophy about not 'breaking' userland, not sure we have a
> TLB/page fault/folio allocation efficiency contract with userland :)
> 
> No program will break with this patch applied. Just potentially get performance
> degradation in a very, very specialist case.
> 
>>
>> Then, you configure "never" system-wide and use
>> process_madvise(MADV_COLLAPSE) to drive it all manually.
>>
>> Curious to learn if there is such a user out there.
> 
> Oh me too :)

I just looked at the original use cases [1], such a use case is not 
mentioned.

But it did add process_madvise(MADV_COLLAPSE) in 
876b4a1896646cc85ec6b1fc1c9270928b7e0831 where we document

"
     This is useful for the development of userspace agents that seek to
     optimize THP utilization system-wide by using userspace signals to
     prioritize what memory is most deserving of being THP-backed.
"

The "prioritize" might indicate that this is used in combination with 
"madvise", not with "never"/


So yeah, it all boils down to

(1) If there is no such use case, "never can mean never". Because there
     is nothing to break, really.

(2) If there is such a use case, we might be breaking it.

[1] 
https://lore.kernel.org/linux-mm/20220706235936.2197195-1-zokeefe@google.com/

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  8:40                 ` David Hildenbrand
@ 2025-06-25  8:45                   ` Lorenzo Stoakes
  0 siblings, 0 replies; 37+ messages in thread
From: Lorenzo Stoakes @ 2025-06-25  8:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Hugh Dickins, Baolin Wang, akpm, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, zokeefe, shy828301, usamaarif642,
	linux-mm, linux-kernel

On Wed, Jun 25, 2025 at 10:40:23AM +0200, David Hildenbrand wrote:
>
> I just looked at the original use cases [1], such a use case is not
> mentioned.
>
> But it did add process_madvise(MADV_COLLAPSE) in
> 876b4a1896646cc85ec6b1fc1c9270928b7e0831 where we document
>
> "
>     This is useful for the development of userspace agents that seek to
>     optimize THP utilization system-wide by using userspace signals to
>     prioritize what memory is most deserving of being THP-backed.
> "
>
> The "prioritize" might indicate that this is used in combination with
> "madvise", not with "never"/
>
>
> So yeah, it all boils down to
>
> (1) If there is no such use case, "never can mean never". Because there
>     is nothing to break, really.

Yeah and as I said in my first response to Hugh, I think the (unintentional)
vagueness of the docs means we are not breaking documented behaviour anyway.

>
> (2) If there is such a use case, we might be breaking it.
>
> [1]
> https://lore.kernel.org/linux-mm/20220706235936.2197195-1-zokeefe@google.com/

Agreed.

This speaks to the 'I want THP now' being the intent of MADV_COLLAPSE. Not a
never/manual (but immediate) control over THP assignment.

Which aligns with the docs:

"Perform a best-effort synchronous collapse of the native pages mapped by the
 memory range into Transparent Huge Pages (THPs)."


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  8:37             ` Lorenzo Stoakes
@ 2025-06-25  8:52               ` Baolin Wang
  2025-06-25  9:31                 ` Lorenzo Stoakes
  2025-06-25  8:53               ` David Hildenbrand
  1 sibling, 1 reply; 37+ messages in thread
From: Baolin Wang @ 2025-06-25  8:52 UTC (permalink / raw)
  To: Lorenzo Stoakes, David Hildenbrand
  Cc: Hugh Dickins, akpm, ziy, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, zokeefe, shy828301, usamaarif642, linux-mm,
	linux-kernel



On 2025/6/25 16:37, Lorenzo Stoakes wrote:
> On Wed, Jun 25, 2025 at 10:24:53AM +0200, David Hildenbrand wrote:
>> On 25.06.25 10:12, Lorenzo Stoakes wrote:
>>> On Wed, Jun 25, 2025 at 08:55:28AM +0100, Lorenzo Stoakes wrote:
>>>> I suppose the least awful way of addressing Baolin's concerns re: mTHP
>>>> while simultaneosly keeping existing semantics is:
>>>>
>>>> 1. Introduce deny to mean what never should have meant.
>>>
>>> To fix Baolin's issue btw we'd have to add 'deny' to both 'global' settings
>>> _and_ each page size setting.
>>>
>>> Because otherwise we'd end up in a weird case where say:
>>>
>>> global 'deny'
>>>
>>>    2 MiB 'never'
>>> 64 KiB 'inherit'
>>>
>>> And err... get 2 MiB THP pages from MADV_COLLAPSE :)
>>>
>>> Or:
>>>
>>> global 'deny'
>>>
>>>    2 MiB 'never'
>>> 64 KiB 'always'
>>>
>>> Or:
>>>
>>> global 'never'
>>>
>>>    2 MiB 'never'
>>> 64 KiB 'always'
>>>
>>> Or:
>>>
>>> global 'never'
>>>
>>>    2 MiB 'madvise'
>>> 64 KiB 'always'
>>>
>>> All doing the same. Not very clear is it?
>>>
>>> We have sowed the seeds of something terrible here, truly.
>>
>> Fully agreed. "Deny" is nasty. Maybe if we really need a way to disable
>> "madv_collapse", it should be done differently, not using this toggle here.
> 
> Yeah maybe the best way is to just have another tunable for this?
> 
> /sys/kernel/mm/transparent_hugepage/disable_collapse perhaps?
> 
> What do you think Hugh, Baolin?

I think it's not necessary to find a way to disable madvise_collapse. 
Essentially, it's a conflict between the semantics of madvise_collapse 
and the '/sys/kernel/mm/transparent_hugepage/enabled' interface. We 
should reach a consensus on the semantics first:

Semantic 1: madv_collapse() should ignore any THP system settings, 
meaning we need to update the 'never' semantics in 
'/sys/kernel/mm/transparent_hugepage/enabled', which would only disable 
page fault and khugepaged, not including madvise_collapse. If we agree 
on this, then the 'never' for per-sized mTHP would have the same 
semantics, i.e., when I set 64K mTHP to 'always' and 2M mTHP to 'never', 
madvise_collapse would still allow the collapse of 2M THP. We should 
document this clearly in case users still want 64K mTHP from 
madvise_collapse.


Semantic 2: madv_collapse() needs to respect THP system settings, which 
is what my patch does. Never means never, and we would need to update 
the documentation of madv_collapse() to make it clearer.

>> Regarding MADV_COLLAPSE, I strongly assume that we should not change it to
>> collapse smaller mTHPs as part of the khugepaged mTHP work. For now, it will
>> simply always collapse to PMD THPs.
> 
> Yeah thinking about it maybe this is the best way. And we can then update
> the man page to make this ABUNDANTLY clear (am happy to do this).
> 
> This keeps things simple.

Yes, agree.

> (One side note on PMD-sized MADV_COLLAPSE - this is basically completely
> useless for 64 KB page size arm64 systems where PMD's are 512 MB :)
> 
> Thoughts Baolin?

We should not collapse 512MB THP on 64K pagesize kernel. So seems 
madv_collapse() can not work on 64K pagesize kernel.

>> Once we want to support other sizes, likely MADV_COLLAPSE users want to have
>> better control over which size to use, at which point it all gets nasty.
> 
> madvise2() this time with extra parameters? ;)
> 
> I sort of wish we had added a flags parameter there.
> 
> But lacking a time machine... :)
> 
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
> 
> To summarise:
> 
> Drop series:
> 
> * Might degrade performance for very specific users using
>    never/MADV_COLLAPSE (quite possibly via process_madvise() + a remote
>    process).
> 
> * Matches 'de jure' interpretation of documentation.
> 
> Keep series:
> 
> * Provides no means whatsoever to have a 'manual only' collapse mode,
>    though does provide for manual khugepaged THP.
> 
> * MADV_COLLAPSE automatically gets mTHP support based on obeying 'never'.
> 
> * Matches likely 'de facto' understanding system admins have about THP
>    usage.
> 
> Action items:
> 
> * Either way, I (Lorenzo) will improve documentation.

Great. Thanks.

> * If we drop the series, provide another means to disable
>    MADV_COLLAPSE. But not using existing sysfs toggles, something new. We
>    will document MADV_COLLAPSE as PMD only.
> 
> * If we drop the series, also consider how we might provide mTHP-compatible
>    MADV_COLLAPSE.

Yes. Agree. Will be another mess I guess :)

> 
> * Totally and completely refactor the hell out of the THP implementation
>    from top-to-bottom (over time this is becoming more and more of a me
>    thing... as I'm getting ever more frustrated with the implementation ;)

Yes.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  8:37             ` Lorenzo Stoakes
  2025-06-25  8:52               ` Baolin Wang
@ 2025-06-25  8:53               ` David Hildenbrand
  1 sibling, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2025-06-25  8:53 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Hugh Dickins, Baolin Wang, akpm, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, zokeefe, shy828301, usamaarif642,
	linux-mm, linux-kernel

On 25.06.25 10:37, Lorenzo Stoakes wrote:
> On Wed, Jun 25, 2025 at 10:24:53AM +0200, David Hildenbrand wrote:
>> On 25.06.25 10:12, Lorenzo Stoakes wrote:
>>> On Wed, Jun 25, 2025 at 08:55:28AM +0100, Lorenzo Stoakes wrote:
>>>> I suppose the least awful way of addressing Baolin's concerns re: mTHP
>>>> while simultaneosly keeping existing semantics is:
>>>>
>>>> 1. Introduce deny to mean what never should have meant.
>>>
>>> To fix Baolin's issue btw we'd have to add 'deny' to both 'global' settings
>>> _and_ each page size setting.
>>>
>>> Because otherwise we'd end up in a weird case where say:
>>>
>>> global 'deny'
>>>
>>>    2 MiB 'never'
>>> 64 KiB 'inherit'
>>>
>>> And err... get 2 MiB THP pages from MADV_COLLAPSE :)
>>>
>>> Or:
>>>
>>> global 'deny'
>>>
>>>    2 MiB 'never'
>>> 64 KiB 'always'
>>>
>>> Or:
>>>
>>> global 'never'
>>>
>>>    2 MiB 'never'
>>> 64 KiB 'always'
>>>
>>> Or:
>>>
>>> global 'never'
>>>
>>>    2 MiB 'madvise'
>>> 64 KiB 'always'
>>>
>>> All doing the same. Not very clear is it?
>>>
>>> We have sowed the seeds of something terrible here, truly.
>>
>> Fully agreed. "Deny" is nasty. Maybe if we really need a way to disable
>> "madv_collapse", it should be done differently, not using this toggle here.
> 
> Yeah maybe the best way is to just have another tunable for this?
> 
> /sys/kernel/mm/transparent_hugepage/disable_collapse perhaps?
> 
> What do you think Hugh, Baolin?

Probably, for debugging purposes, a kernel cmdline option would work as 
well. Something we can just easily change later.

> 
>>
>> Regarding MADV_COLLAPSE, I strongly assume that we should not change it to
>> collapse smaller mTHPs as part of the khugepaged mTHP work. For now, it will
>> simply always collapse to PMD THPs.
> 
> Yeah thinking about it maybe this is the best way. And we can then update
> the man page to make this ABUNDANTLY clear (am happy to do this).
> 
> This keeps things simple.
> 
> (One side note on PMD-sized MADV_COLLAPSE - this is basically completely
> useless for 64 KB page size arm64 systems where PMD's are 512 MB :)
> 
> Thoughts Baolin?
> 
>>
>> Once we want to support other sizes, likely MADV_COLLAPSE users want to have
>> better control over which size to use, at which point it all gets nasty.
> 
> madvise2() this time with extra parameters? ;)
> 
> I sort of wish we had added a flags parameter there.
> 
> But lacking a time machine... :)

Reminds me of how MAP_HUGETLB squeezed in the sizes by stealing plenty 
of bits ... :)


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  8:52               ` Baolin Wang
@ 2025-06-25  9:31                 ` Lorenzo Stoakes
  2025-06-25 10:02                   ` Baolin Wang
  0 siblings, 1 reply; 37+ messages in thread
From: Lorenzo Stoakes @ 2025-06-25  9:31 UTC (permalink / raw)
  To: Baolin Wang
  Cc: David Hildenbrand, Hugh Dickins, akpm, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, zokeefe, shy828301, usamaarif642,
	linux-mm, linux-kernel

On Wed, Jun 25, 2025 at 04:52:03PM +0800, Baolin Wang wrote:
>
>
> On 2025/6/25 16:37, Lorenzo Stoakes wrote:
> > Yeah maybe the best way is to just have another tunable for this?
> >
> > /sys/kernel/mm/transparent_hugepage/disable_collapse perhaps?
> >
> > What do you think Hugh, Baolin?
>
> I think it's not necessary to find a way to disable madvise_collapse.
> Essentially, it's a conflict between the semantics of madvise_collapse and
> the '/sys/kernel/mm/transparent_hugepage/enabled' interface. We should reach
> a consensus on the semantics first:
>
> Semantic 1: madv_collapse() should ignore any THP system settings, meaning
> we need to update the 'never' semantics in
> '/sys/kernel/mm/transparent_hugepage/enabled', which would only disable page
> fault and khugepaged, not including madvise_collapse. If we agree on this,
> then the 'never' for per-sized mTHP would have the same semantics, i.e.,
> when I set 64K mTHP to 'always' and 2M mTHP to 'never', madvise_collapse
> would still allow the collapse of 2M THP. We should document this clearly in
> case users still want 64K mTHP from madvise_collapse.

Right yeah, I mean this is in effect how things are now. So the task is
documentation.

>
>
> Semantic 2: madv_collapse() needs to respect THP system settings, which is
> what my patch does. Never means never, and we would need to update the
> documentation of madv_collapse() to make it clearer.

Yes, and indeed this is the choice.

I think, as David said, it comes down to whether we have a legit use case that
truly relies on this.

> > (One side note on PMD-sized MADV_COLLAPSE - this is basically completely
> > useless for 64 KB page size arm64 systems where PMD's are 512 MB :)
> >
> > Thoughts Baolin?
>
> We should not collapse 512MB THP on 64K pagesize kernel. So seems
> madv_collapse() can not work on 64K pagesize kernel.

Well I don't think anything would prevent this now right? So MADV_COLLAPSE is
pretty problematic on 64K pagesize kernels in general.

Anyway that's maybe a problem for another time :)

> >
> > * If we drop the series, also consider how we might provide mTHP-compatible
> >    MADV_COLLAPSE.
>
> Yes. Agree. Will be another mess I guess :)

Sigh yes sadly. Let's try and make it as unmessy as we can I guess :>)


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  9:31                 ` Lorenzo Stoakes
@ 2025-06-25 10:02                   ` Baolin Wang
  2025-06-25 10:07                     ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Baolin Wang @ 2025-06-25 10:02 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, Hugh Dickins, akpm, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, zokeefe, shy828301, usamaarif642,
	linux-mm, linux-kernel



On 2025/6/25 17:31, Lorenzo Stoakes wrote:
> On Wed, Jun 25, 2025 at 04:52:03PM +0800, Baolin Wang wrote:
>>
>>
>> On 2025/6/25 16:37, Lorenzo Stoakes wrote:
>>> Yeah maybe the best way is to just have another tunable for this?
>>>
>>> /sys/kernel/mm/transparent_hugepage/disable_collapse perhaps?
>>>
>>> What do you think Hugh, Baolin?
>>
>> I think it's not necessary to find a way to disable madvise_collapse.
>> Essentially, it's a conflict between the semantics of madvise_collapse and
>> the '/sys/kernel/mm/transparent_hugepage/enabled' interface. We should reach
>> a consensus on the semantics first:
>>
>> Semantic 1: madv_collapse() should ignore any THP system settings, meaning
>> we need to update the 'never' semantics in
>> '/sys/kernel/mm/transparent_hugepage/enabled', which would only disable page
>> fault and khugepaged, not including madvise_collapse. If we agree on this,
>> then the 'never' for per-sized mTHP would have the same semantics, i.e.,
>> when I set 64K mTHP to 'always' and 2M mTHP to 'never', madvise_collapse
>> would still allow the collapse of 2M THP. We should document this clearly in
>> case users still want 64K mTHP from madvise_collapse.
> 
> Right yeah, I mean this is in effect how things are now. So the task is
> documentation.
> 
>>
>>
>> Semantic 2: madv_collapse() needs to respect THP system settings, which is
>> what my patch does. Never means never, and we would need to update the
>> documentation of madv_collapse() to make it clearer.
> 
> Yes, and indeed this is the choice.
> 
> I think, as David said, it comes down to whether we have a legit use case that
> truly relies on this.
> 
>>> (One side note on PMD-sized MADV_COLLAPSE - this is basically completely
>>> useless for 64 KB page size arm64 systems where PMD's are 512 MB :)
>>>
>>> Thoughts Baolin?
>>
>> We should not collapse 512MB THP on 64K pagesize kernel. So seems
>> madv_collapse() can not work on 64K pagesize kernel.
> 
> Well I don't think anything would prevent this now right? So MADV_COLLAPSE is
> pretty problematic on 64K pagesize kernels in general.

Yes, I don't mean it will prevent madvise_collapse(), just as you said 
that it could be problematic (it's horrible to try to collapse 512MB).

> Anyway that's maybe a problem for another time :)

Yeah, should consider mTHP-compatible MADV_COLLAPSE.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25 10:02                   ` Baolin Wang
@ 2025-06-25 10:07                     ` David Hildenbrand
  2025-06-25 10:15                       ` Lorenzo Stoakes
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2025-06-25 10:07 UTC (permalink / raw)
  To: Baolin Wang, Lorenzo Stoakes
  Cc: Hugh Dickins, akpm, ziy, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, zokeefe, shy828301, usamaarif642, linux-mm,
	linux-kernel

On 25.06.25 12:02, Baolin Wang wrote:
> 
> 
> On 2025/6/25 17:31, Lorenzo Stoakes wrote:
>> On Wed, Jun 25, 2025 at 04:52:03PM +0800, Baolin Wang wrote:
>>>
>>>
>>> On 2025/6/25 16:37, Lorenzo Stoakes wrote:
>>>> Yeah maybe the best way is to just have another tunable for this?
>>>>
>>>> /sys/kernel/mm/transparent_hugepage/disable_collapse perhaps?
>>>>
>>>> What do you think Hugh, Baolin?
>>>
>>> I think it's not necessary to find a way to disable madvise_collapse.
>>> Essentially, it's a conflict between the semantics of madvise_collapse and
>>> the '/sys/kernel/mm/transparent_hugepage/enabled' interface. We should reach
>>> a consensus on the semantics first:
>>>
>>> Semantic 1: madv_collapse() should ignore any THP system settings, meaning
>>> we need to update the 'never' semantics in
>>> '/sys/kernel/mm/transparent_hugepage/enabled', which would only disable page
>>> fault and khugepaged, not including madvise_collapse. If we agree on this,
>>> then the 'never' for per-sized mTHP would have the same semantics, i.e.,
>>> when I set 64K mTHP to 'always' and 2M mTHP to 'never', madvise_collapse
>>> would still allow the collapse of 2M THP. We should document this clearly in
>>> case users still want 64K mTHP from madvise_collapse.
>>
>> Right yeah, I mean this is in effect how things are now. So the task is
>> documentation.
>>
>>>
>>>
>>> Semantic 2: madv_collapse() needs to respect THP system settings, which is
>>> what my patch does. Never means never, and we would need to update the
>>> documentation of madv_collapse() to make it clearer.
>>
>> Yes, and indeed this is the choice.
>>
>> I think, as David said, it comes down to whether we have a legit use case that
>> truly relies on this.
>>
>>>> (One side note on PMD-sized MADV_COLLAPSE - this is basically completely
>>>> useless for 64 KB page size arm64 systems where PMD's are 512 MB :)
>>>>
>>>> Thoughts Baolin?
>>>
>>> We should not collapse 512MB THP on 64K pagesize kernel. So seems
>>> madv_collapse() can not work on 64K pagesize kernel.
>>
>> Well I don't think anything would prevent this now right? So MADV_COLLAPSE is
>> pretty problematic on 64K pagesize kernels in general.
> 
> Yes, I don't mean it will prevent madvise_collapse(), just as you said
> that it could be problematic (it's horrible to try to collapse 512MB).

Well, assume you have a VM at that is 2 GiB and could use 4 THPs. It's 
stupid, but there might be some selected use cases where it's not 
completely stupid.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25 10:07                     ` David Hildenbrand
@ 2025-06-25 10:15                       ` Lorenzo Stoakes
  2025-06-25 10:29                         ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Lorenzo Stoakes @ 2025-06-25 10:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Baolin Wang, Hugh Dickins, akpm, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, zokeefe, shy828301, usamaarif642,
	linux-mm, linux-kernel

On Wed, Jun 25, 2025 at 12:07:12PM +0200, David Hildenbrand wrote:
> > Yes, I don't mean it will prevent madvise_collapse(), just as you said
> > that it could be problematic (it's horrible to try to collapse 512MB).
>
> Well, assume you have a VM at that is 2 GiB and could use 4 THPs. It's
> stupid, but there might be some selected use cases where it's not completely
> stupid.

I guess we limit the stupidity by MADV_COLLAPSE working on a PMD-aligned range
so it'll be a no-op for an attempt to collapse 2 MiB (unless I'm misreading the
code).

But it is a bit weird to think that users might assume they will get some
benefit in collapsing a range and in this case would not.

Anyway 64 KiB page sizes throws up a lot worse issues than this (e.g., page
blocks + water marks) so this is tip of the iceberg for that.

>
> --
> Cheers,
>
> David / dhildenb
>
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25 10:15                       ` Lorenzo Stoakes
@ 2025-06-25 10:29                         ` David Hildenbrand
  0 siblings, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2025-06-25 10:29 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Baolin Wang, Hugh Dickins, akpm, ziy, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, zokeefe, shy828301, usamaarif642,
	linux-mm, linux-kernel

On 25.06.25 12:15, Lorenzo Stoakes wrote:
> On Wed, Jun 25, 2025 at 12:07:12PM +0200, David Hildenbrand wrote:
>>> Yes, I don't mean it will prevent madvise_collapse(), just as you said
>>> that it could be problematic (it's horrible to try to collapse 512MB).
>>
>> Well, assume you have a VM at that is 2 GiB and could use 4 THPs. It's
>> stupid, but there might be some selected use cases where it's not completely
>> stupid.
> 
> I guess we limit the stupidity by MADV_COLLAPSE working on a PMD-aligned range
> so it'll be a no-op for an attempt to collapse 2 MiB (unless I'm misreading the
> code).

At least the doc says: "MADV_COLLAPSE will automatically clamp the 
provided range to be hugepage-aligned.", and skimming over the code that 
seems to be the case.

Which makes sense to me: never collapse outside the provided range.

So if someone would ever specify something smaller (e.g., 2MiB aligned 
range with 512MiB PMD size) we could decide to collapse a 2MiB THP there 
instead. Right now, it would be ignored.

Guess the rule is simply to not collapse outside of the requested range. 
Maybe that rule can be used to support collapsing of smaller THPs in the 
future as well. Maybe  (can of worms) :)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  7:34     ` David Hildenbrand
  2025-06-25  7:55       ` Lorenzo Stoakes
@ 2025-06-25 11:03       ` Usama Arif
  2025-06-25 11:09         ` David Hildenbrand
  1 sibling, 1 reply; 37+ messages in thread
From: Usama Arif @ 2025-06-25 11:03 UTC (permalink / raw)
  To: David Hildenbrand, Lorenzo Stoakes, Hugh Dickins
  Cc: Baolin Wang, akpm, ziy, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, zokeefe, shy828301, linux-mm, linux-kernel,
	Johannes Weiner



On 25/06/2025 08:34, David Hildenbrand wrote:
>>>
>>> We would all prefer a less messy world of THP tunables.  I certainly
>>> find plenty to dislike there too; and wish that a less assertive name
>>> than "never" had been chosen originally for the default off position.
>>>
>>> But please don't break the accepted and documented behaviour of
>>> MADV_COLLAPSE now.
>>
>> Again see above, I absolutely disagree this is documented _clearly_. And
>> that's the underlying issue here.
>> > I feel like if you polled 100 system administrators (assuming they knew
>> about THP) as to how you globally disable THP, probably all 100 would say
>> you do it via:
>>
>> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
>>
> 
> Yes. One big problem is that the documentation was not updated.
> 
> Changing the meaning of "entirely disabled" to "entirely disabled automatically (page faults, khugepaged)"
> 
>> So shouldn't 'never break userspace' be based on practical reality rather
>> than a theorised interpretation of documents that sadly are not clear
>> enough?
> 
> I think the problem is that there might indeed be more users out there relying on "never+MADV_COLLPASE" to now place THPs than "never+MADV_COLLPASE" to no place THPs.
> 
> What is the harm when not placing THPs? Performance degradation for some apps?
> 

I think a bigger issue than performance degradation is someone upgrading the kernel and not
seeing MADV_COLLAPSE working as it has since the beginning and not knowing that its due
to a kernel change.

I feel transparent_hugepage/enabled is too messed up, and its difficult to fix it without
breaking it for someone? I still find it weird that we can set transparent_hugepage/enabled
to never and transparent_hugepage/hugepages-2048kB/enabled to madvise and still get hugepages.
(And we actually use this configuration in production for our ARM servers).

Introducing deny for global and page size I feel will over complicate it because of the issue in
the previous paragraph, page size setting overrides global setting. so even if
transparent_hugepage/enabled is deny, we might still get a THP if the page setting is not.
Someone needs to file to deny, which is the same as setting every file to never.

So I just wanted to throw another bad idea in the mix, what if we introduce another sysfs file
(I hate introducing sysfs :)), something like /sys/kernel/mm/thp_allowed (or some other alternate name)
which is default 1.
Once someone sets it to 0, no one can ever get a THP, no matter what future changes we make. Whether its
madv_collapse, bpf THPs, cgroup THPs, prctls, syscalls.. never will mean never.
Notice that its /sys/kernel/mm/thp_allowed and not /sys/kernel/mm/transparent_hugepage/thp_allowed.
Having it one directory above will make it look uglier, but it highlights that whatever you
set in /sys/kernel/mm/transparent_hugepage/ wont matter if /sys/kernel/mm/thp_allowed is set to 0.
Ideally this would be /sys/kernel/mm/transparent_hugepage/enabled=never if we were developing this
from scratch..
Not pushing for this idea, just throwing it out there.

Thanks,
Usama
 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25 11:03       ` Usama Arif
@ 2025-06-25 11:09         ` David Hildenbrand
  2025-06-26  3:49           ` Hugh Dickins
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2025-06-25 11:09 UTC (permalink / raw)
  To: Usama Arif, Lorenzo Stoakes, Hugh Dickins
  Cc: Baolin Wang, akpm, ziy, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, zokeefe, shy828301, linux-mm, linux-kernel,
	Johannes Weiner

On 25.06.25 13:03, Usama Arif wrote:
> 
> 
> On 25/06/2025 08:34, David Hildenbrand wrote:
>>>>
>>>> We would all prefer a less messy world of THP tunables.  I certainly
>>>> find plenty to dislike there too; and wish that a less assertive name
>>>> than "never" had been chosen originally for the default off position.
>>>>
>>>> But please don't break the accepted and documented behaviour of
>>>> MADV_COLLAPSE now.
>>>
>>> Again see above, I absolutely disagree this is documented _clearly_. And
>>> that's the underlying issue here.
>>>> I feel like if you polled 100 system administrators (assuming they knew
>>> about THP) as to how you globally disable THP, probably all 100 would say
>>> you do it via:
>>>
>>> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
>>>
>>
>> Yes. One big problem is that the documentation was not updated.
>>
>> Changing the meaning of "entirely disabled" to "entirely disabled automatically (page faults, khugepaged)"
>>
>>> So shouldn't 'never break userspace' be based on practical reality rather
>>> than a theorised interpretation of documents that sadly are not clear
>>> enough?
>>
>> I think the problem is that there might indeed be more users out there relying on "never+MADV_COLLPASE" to now place THPs than "never+MADV_COLLPASE" to no place THPs.
>>
>> What is the harm when not placing THPs? Performance degradation for some apps?
>>
> 
> I think a bigger issue than performance degradation is someone upgrading the kernel and not
> seeing MADV_COLLAPSE working as it has since the beginning and not knowing that its due
> to a kernel change.
> 
> I feel transparent_hugepage/enabled is too messed up, and its difficult to fix it without
> breaking it for someone? I still find it weird that we can set transparent_hugepage/enabled
> to never and transparent_hugepage/hugepages-2048kB/enabled to madvise and still get hugepages.
> (And we actually use this configuration in production for our ARM servers).
> 
> Introducing deny for global and page size I feel will over complicate it because of the issue in
> the previous paragraph, page size setting overrides global setting. so even if
> transparent_hugepage/enabled is deny, we might still get a THP if the page setting is not.
> Someone needs to file to deny, which is the same as setting every file to never.
> 
> So I just wanted to throw another bad idea in the mix, what if we introduce another sysfs file
> (I hate introducing sysfs :)), something like /sys/kernel/mm/thp_allowed (or some other alternate name)
> which is default 1.

Let's rather not :)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  7:36       ` David Hildenbrand
  2025-06-25  7:42         ` Lorenzo Stoakes
@ 2025-06-25 21:51         ` Hugh Dickins
  1 sibling, 0 replies; 37+ messages in thread
From: Hugh Dickins @ 2025-06-25 21:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Hugh Dickins, Baolin Wang, akpm, ziy,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, zokeefe,
	shy828301, usamaarif642, linux-mm, linux-kernel


On Wed, 25 Jun 2025, David Hildenbrand wrote:
> 
> I know, See my other mail, the problem is rather if there is no somebody
> relying on never+MADV_COLLAPSE from doing the MADV_COLLAPSE-documented thing.
> 
> It's a mess.
> 
> (I have the suspicion that Hugh might know of such a user :) )

I've not yet read my way through this firestorm I've raised,
but on this point: no, I don't know of such a user.

You'll be suspecting that I know of some such use by Google: no, I haven't
even investigated to see if that's the case; I think not, I think we're
not neverers, but cannot check at the moment (and might have got in a 
muddle with the defrag options).

I never(!) bothered to investigate that, because if Google does rely on
this behaviour, it is very easy for Google to maintain its own variation
from whatever is decided here (though management won't like me saying so).

My concern is for everyone else who read the madvise MADV_COLLAPSE manpage.

Hugh


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25 11:09         ` David Hildenbrand
@ 2025-06-26  3:49           ` Hugh Dickins
  0 siblings, 0 replies; 37+ messages in thread
From: Hugh Dickins @ 2025-06-26  3:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Usama Arif, Lorenzo Stoakes, Hugh Dickins, Baolin Wang, akpm, ziy,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, zokeefe,
	shy828301, linux-mm, linux-kernel, Johannes Weiner

On Wed, 25 Jun 2025, David Hildenbrand wrote:
> On 25.06.25 13:03, Usama Arif wrote:
> > 
> > So I just wanted to throw another bad idea in the mix, what if we introduce
> > another sysfs file
> > (I hate introducing sysfs :)), something like /sys/kernel/mm/thp_allowed (or
> > some other alternate name)
> > which is default 1.
> 
> Let's rather not :)

:)

Thank you all, I've read through and agreed with much and pondered, I wish
wish wish I had something useful to add but do not, so shall just go quiet
again ...

... but there seems to be no need for rush to decide on any change at all.

Hugh


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-06-25  1:40 [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled Baolin Wang
                   ` (2 preceding siblings ...)
  2025-06-25  5:53 ` [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP " Hugh Dickins
@ 2025-07-09 12:36 ` Lorenzo Stoakes
  2025-07-10  1:58   ` Baolin Wang
  3 siblings, 1 reply; 37+ messages in thread
From: Lorenzo Stoakes @ 2025-07-09 12:36 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, hughd, david, ziy, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, linux-mm, linux-kernel, Pedro Falcato

+cc Pedro as he'd raised concerns here also.

Hi Baolin,

Just for some clarification on this - thank you very much for this series,
but based on discussion with David and concerns raised by Hugh + others,
overall it feels as if, while the documentation is no doubt vague in ways
it ought not to be, this behaviour is something we have put out into the
world and we should continue to support it.

So overall I feel that this series should not be applied.

Your work here is great, and really massive apologies for this after all
the work you've put in (and of course the review work here also), but on
reflection I think it's a risk we shouldn't take.

I understand this means that MADV_COLLAPSE can't be used to collapse at a
mTHP granularity - we definitely need to have a think about how we might
provide this sensibly.

As for how to move forward - I will go ahead and update documentation to
make the situation absolutely crystal clear, both in the man page and the
rst.

Thanks, Lorenzo

On Wed, Jun 25, 2025 at 09:40:08AM +0800, Baolin Wang wrote:
> When invoking thp_vma_allowable_orders(), if the TVA_ENFORCE_SYSFS flag is not
> specified, we will ignore the THP sysfs settings. Whilst it makes sense for the
> callers who do not specify this flag, it creates a odd and surprising situation
> where a sysadmin specifying 'never' for all THP sizes still observing THP pages
> being allocated and used on the system. And the MADV_COLLAPSE is an example of
> such a case, that means it will not set TVA_ENFORCE_SYSFS when calling
> thp_vma_allowable_orders().
>
> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
> the system-wide anon/shmem THP sysfs settings, which means that even though
> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
> attempt to collapse into a anon/shmem THP. This violates the rule we have
> agreed upon: never means never.
>
> For example, system administrators who disabled THP everywhere must indeed very
> much not want THP to be used for whatever reason - having individual programs
> being able to quietly override this is very surprising and likely to cause headaches
> for those who desire this not to happen on their systems.
>
> This patch set will address the MADV_COLLAPSE issue.
>
> Test
> ====
> 1. Tested the mm selftests and found no regressions.
> 2. With toggling different Anon mTHP settings, the allocation and madvise collapse for
> anonymous pages work well.
> 3. With toggling different shmem mTHP settings, the allocation and madvise collapse for
> shmem work well.
> 4. Tested the large order allocation for tmpfs, and works as expected.
>
> [1] https://lore.kernel.org/all/1f00fdc3-a3a3-464b-8565-4c1b23d34f8d@linux.alibaba.com/
>
> Changes from v3:
>  - Collect reviewed tags. Thanks.
>  - Update the commit message, per David.
>
> Changes from v2:
>  - Update the commit message and cover letter, per Lorenzo. Thanks.
>  - Simplify the logic in thp_vma_allowable_orders(), per Lorenzo and David. Thanks.
>
> Changes from v1:
>  - Update the commit message, per Zi.
>  - Add Zi's reviewed tag. Thanks.
>  - Update the shmem logic.
>
> Baolin Wang (2):
>   mm: huge_memory: disallow hugepages if the system-wide THP sysfs
>     settings are disabled
>   mm: shmem: disallow hugepages if the system-wide shmem THP sysfs
>     settings are disabled
>
>  include/linux/huge_mm.h                 | 51 ++++++++++++++++++-------
>  mm/shmem.c                              |  6 +--
>  tools/testing/selftests/mm/khugepaged.c |  8 +---
>  3 files changed, 43 insertions(+), 22 deletions(-)
>
> --
> 2.43.5
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
  2025-07-09 12:36 ` Lorenzo Stoakes
@ 2025-07-10  1:58   ` Baolin Wang
  0 siblings, 0 replies; 37+ messages in thread
From: Baolin Wang @ 2025-07-10  1:58 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, hughd, david, ziy, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, linux-mm, linux-kernel, Pedro Falcato



On 2025/7/9 20:36, Lorenzo Stoakes wrote:
> +cc Pedro as he'd raised concerns here also.
> 
> Hi Baolin,
> 
> Just for some clarification on this - thank you very much for this series,
> but based on discussion with David and concerns raised by Hugh + others,
> overall it feels as if, while the documentation is no doubt vague in ways
> it ought not to be, this behaviour is something we have put out into the
> world and we should continue to support it.
> 
> So overall I feel that this series should not be applied.

Fair enough.

> 
> Your work here is great, and really massive apologies for this after all
> the work you've put in (and of course the review work here also), but on
> reflection I think it's a risk we shouldn't take.

Consensus is the key. Thank you and David for the discussion and 
suggestions.

> I understand this means that MADV_COLLAPSE can't be used to collapse at a
> mTHP granularity - we definitely need to have a think about how we might
> provide this sensibly.
> 
> As for how to move forward - I will go ahead and update documentation to
> make the situation absolutely crystal clear, both in the man page and the
> rst.

OK. Great. Thanks.

> Thanks, Lorenzo
> 
> On Wed, Jun 25, 2025 at 09:40:08AM +0800, Baolin Wang wrote:
>> When invoking thp_vma_allowable_orders(), if the TVA_ENFORCE_SYSFS flag is not
>> specified, we will ignore the THP sysfs settings. Whilst it makes sense for the
>> callers who do not specify this flag, it creates a odd and surprising situation
>> where a sysadmin specifying 'never' for all THP sizes still observing THP pages
>> being allocated and used on the system. And the MADV_COLLAPSE is an example of
>> such a case, that means it will not set TVA_ENFORCE_SYSFS when calling
>> thp_vma_allowable_orders().
>>
>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
>> the system-wide anon/shmem THP sysfs settings, which means that even though
>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
>> attempt to collapse into a anon/shmem THP. This violates the rule we have
>> agreed upon: never means never.
>>
>> For example, system administrators who disabled THP everywhere must indeed very
>> much not want THP to be used for whatever reason - having individual programs
>> being able to quietly override this is very surprising and likely to cause headaches
>> for those who desire this not to happen on their systems.
>>
>> This patch set will address the MADV_COLLAPSE issue.
>>
>> Test
>> ====
>> 1. Tested the mm selftests and found no regressions.
>> 2. With toggling different Anon mTHP settings, the allocation and madvise collapse for
>> anonymous pages work well.
>> 3. With toggling different shmem mTHP settings, the allocation and madvise collapse for
>> shmem work well.
>> 4. Tested the large order allocation for tmpfs, and works as expected.
>>
>> [1] https://lore.kernel.org/all/1f00fdc3-a3a3-464b-8565-4c1b23d34f8d@linux.alibaba.com/
>>
>> Changes from v3:
>>   - Collect reviewed tags. Thanks.
>>   - Update the commit message, per David.
>>
>> Changes from v2:
>>   - Update the commit message and cover letter, per Lorenzo. Thanks.
>>   - Simplify the logic in thp_vma_allowable_orders(), per Lorenzo and David. Thanks.
>>
>> Changes from v1:
>>   - Update the commit message, per Zi.
>>   - Add Zi's reviewed tag. Thanks.
>>   - Update the shmem logic.
>>
>> Baolin Wang (2):
>>    mm: huge_memory: disallow hugepages if the system-wide THP sysfs
>>      settings are disabled
>>    mm: shmem: disallow hugepages if the system-wide shmem THP sysfs
>>      settings are disabled
>>
>>   include/linux/huge_mm.h                 | 51 ++++++++++++++++++-------
>>   mm/shmem.c                              |  6 +--
>>   tools/testing/selftests/mm/khugepaged.c |  8 +---
>>   3 files changed, 43 insertions(+), 22 deletions(-)
>>
>> --
>> 2.43.5
>>



^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2025-07-10  1:58 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-25  1:40 [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP settings are disabled Baolin Wang
2025-06-25  1:40 ` [PATCH v4 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs " Baolin Wang
2025-06-25  4:34   ` Dev Jain
2025-06-25  1:40 ` [PATCH v4 2/2] mm: shmem: disallow hugepages if the system-wide shmem " Baolin Wang
2025-06-25  5:53 ` [PATCH v4 0/2] fix MADV_COLLAPSE issue if THP " Hugh Dickins
2025-06-25  6:05   ` Dev Jain
2025-06-25  6:26   ` Baolin Wang
2025-06-25  6:49     ` Dev Jain
2025-06-25  6:55       ` Baolin Wang
2025-06-25  7:20   ` Lorenzo Stoakes
2025-06-25  7:34     ` David Hildenbrand
2025-06-25  7:55       ` Lorenzo Stoakes
2025-06-25  8:12         ` Lorenzo Stoakes
2025-06-25  8:24           ` David Hildenbrand
2025-06-25  8:37             ` Lorenzo Stoakes
2025-06-25  8:52               ` Baolin Wang
2025-06-25  9:31                 ` Lorenzo Stoakes
2025-06-25 10:02                   ` Baolin Wang
2025-06-25 10:07                     ` David Hildenbrand
2025-06-25 10:15                       ` Lorenzo Stoakes
2025-06-25 10:29                         ` David Hildenbrand
2025-06-25  8:53               ` David Hildenbrand
2025-06-25 11:03       ` Usama Arif
2025-06-25 11:09         ` David Hildenbrand
2025-06-26  3:49           ` Hugh Dickins
2025-06-25  7:23   ` David Hildenbrand
2025-06-25  7:30     ` Lorenzo Stoakes
2025-06-25  7:36       ` David Hildenbrand
2025-06-25  7:42         ` Lorenzo Stoakes
2025-06-25  7:49           ` David Hildenbrand
2025-06-25  8:16             ` David Hildenbrand
2025-06-25  8:22               ` Lorenzo Stoakes
2025-06-25  8:40                 ` David Hildenbrand
2025-06-25  8:45                   ` Lorenzo Stoakes
2025-06-25 21:51         ` Hugh Dickins
2025-07-09 12:36 ` Lorenzo Stoakes
2025-07-10  1:58   ` Baolin Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).