* [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
@ 2025-05-29 8:23 Baolin Wang
2025-05-29 8:23 ` [PATCH 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs " Baolin Wang
` (2 more replies)
0 siblings, 3 replies; 26+ messages in thread
From: Baolin Wang @ 2025-05-29 8:23 UTC (permalink / raw)
To: akpm, hughd, david
Cc: lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
ziy, baolin.wang, linux-mm, linux-fsdevel, linux-kernel
As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
the system-wide anon/shmem THP sysfs settings, which means that even though
we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
attempt to collapse into a anon/shmem THP. This violates the rule we have
agreed upon: never means never. This patch set will address this issue.
[1] https://lore.kernel.org/all/1f00fdc3-a3a3-464b-8565-4c1b23d34f8d@linux.alibaba.com/
Baolin Wang (2):
mm: huge_memory: disallow hugepages if the system-wide THP sysfs
settings are disabled
mm: shmem: disallow hugepages if the system-wide shmem THP sysfs
settings are disabled
include/linux/huge_mm.h | 23 +++++++++++++++++++----
mm/huge_memory.c | 2 +-
mm/shmem.c | 12 ++++++------
3 files changed, 26 insertions(+), 11 deletions(-)
--
2.43.5
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
2025-05-29 8:23 [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled Baolin Wang
@ 2025-05-29 8:23 ` Baolin Wang
2025-05-29 15:10 ` Zi Yan
2025-05-29 8:23 ` [PATCH 2/2] mm: shmem: disallow hugepages if the system-wide shmem " Baolin Wang
2025-05-30 8:04 ` [PATCH 0/2] fix MADV_COLLAPSE issue if THP " Ryan Roberts
2 siblings, 1 reply; 26+ messages in thread
From: Baolin Wang @ 2025-05-29 8:23 UTC (permalink / raw)
To: akpm, hughd, david
Cc: lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
ziy, baolin.wang, linux-mm, linux-fsdevel, linux-kernel
The MADV_COLLAPSE will ignore the system-wide Anon THP sysfs settings, which
means that even though we have disabled the Anon THP configuration, MADV_COLLAPSE
will still attempt to collapse into a Anon THP. This violates the rule we have
agreed upon: never means never.
To address this issue, should check whether the Anon THP configuration is disabled
in thp_vma_allowable_orders(), even when the TVA_ENFORCE_SYSFS flag is set.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
include/linux/huge_mm.h | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..199ddc9f04a1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -287,20 +287,35 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
unsigned long orders)
{
/* Optimization to check if required orders are enabled early. */
- if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
- unsigned long mask = READ_ONCE(huge_anon_orders_always);
+ if (vma_is_anonymous(vma)) {
+ unsigned long always = READ_ONCE(huge_anon_orders_always);
+ unsigned long madvise = READ_ONCE(huge_anon_orders_madvise);
+ unsigned long inherit = READ_ONCE(huge_anon_orders_inherit);
+ unsigned long mask = always | madvise;
+
+ /*
+ * If the system-wide THP/mTHP sysfs settings are disabled,
+ * then we should never allow hugepages.
+ */
+ if (!(mask & orders) && !(hugepage_global_enabled() && (inherit & orders)))
+ return 0;
+
+ if (!(tva_flags & TVA_ENFORCE_SYSFS))
+ goto skip;
+ mask = always;
if (vm_flags & VM_HUGEPAGE)
- mask |= READ_ONCE(huge_anon_orders_madvise);
+ mask |= madvise;
if (hugepage_global_always() ||
((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
- mask |= READ_ONCE(huge_anon_orders_inherit);
+ mask |= inherit;
orders &= mask;
if (!orders)
return 0;
}
+skip:
return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
}
--
2.43.5
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH 2/2] mm: shmem: disallow hugepages if the system-wide shmem THP sysfs settings are disabled
2025-05-29 8:23 [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled Baolin Wang
2025-05-29 8:23 ` [PATCH 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs " Baolin Wang
@ 2025-05-29 8:23 ` Baolin Wang
2025-05-29 15:21 ` Zi Yan
2025-05-30 8:04 ` [PATCH 0/2] fix MADV_COLLAPSE issue if THP " Ryan Roberts
2 siblings, 1 reply; 26+ messages in thread
From: Baolin Wang @ 2025-05-29 8:23 UTC (permalink / raw)
To: akpm, hughd, david
Cc: lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
ziy, baolin.wang, linux-mm, linux-fsdevel, linux-kernel
The MADV_COLLAPSE will ignore the system-wide shmem THP sysfs settings, which
means that even though we have disabled the shmem THP configuration, MADV_COLLAPSE
will still attempt to collapse into a shmem THP. This violates the rule we have
agreed upon: never means never.
Then the current strategy is:
For shmem, if none of always, madvise, within_size, and inherit have enabled
PMD-sized mTHP, then MADV_COLLAPSE will be prohibited from collapsing PMD-sized mTHP.
For tmpfs, if the mount option is set with the 'huge=never' parameter, then
MADV_COLLAPSE will be prohibited from collapsing PMD-sized mTHP.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/huge_memory.c | 2 +-
mm/shmem.c | 12 ++++++------
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d3e66136e41a..a8cfa37cae72 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -166,7 +166,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
* own flags.
*/
if (!in_pf && shmem_file(vma->vm_file))
- return shmem_allowable_huge_orders(file_inode(vma->vm_file),
+ return orders & shmem_allowable_huge_orders(file_inode(vma->vm_file),
vma, vma->vm_pgoff, 0,
!enforce_sysfs);
diff --git a/mm/shmem.c b/mm/shmem.c
index 4b42419ce6b2..4dbb28d85cd9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -613,7 +613,7 @@ static unsigned int shmem_get_orders_within_size(struct inode *inode,
}
static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
- loff_t write_end, bool shmem_huge_force,
+ loff_t write_end,
struct vm_area_struct *vma,
unsigned long vm_flags)
{
@@ -625,7 +625,7 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
return 0;
if (shmem_huge == SHMEM_HUGE_DENY)
return 0;
- if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
+ if (shmem_huge == SHMEM_HUGE_FORCE)
return maybe_pmd_order;
/*
@@ -860,7 +860,7 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
}
static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
- loff_t write_end, bool shmem_huge_force,
+ loff_t write_end,
struct vm_area_struct *vma,
unsigned long vm_flags)
{
@@ -1261,7 +1261,7 @@ static int shmem_getattr(struct mnt_idmap *idmap,
STATX_ATTR_NODUMP);
generic_fillattr(idmap, request_mask, inode, stat);
- if (shmem_huge_global_enabled(inode, 0, 0, false, NULL, 0))
+ if (shmem_huge_global_enabled(inode, 0, 0, NULL, 0))
stat->blksize = HPAGE_PMD_SIZE;
if (request_mask & STATX_BTIME) {
@@ -1768,7 +1768,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
return 0;
global_orders = shmem_huge_global_enabled(inode, index, write_end,
- shmem_huge_force, vma, vm_flags);
+ vma, vm_flags);
/* Tmpfs huge pages allocation */
if (!vma || !vma_is_anon_shmem(vma))
return global_orders;
@@ -1790,7 +1790,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
/* Allow mTHP that will be fully within i_size. */
mask |= shmem_get_orders_within_size(inode, within_size_orders, index, 0);
- if (vm_flags & VM_HUGEPAGE)
+ if (shmem_huge_force || (vm_flags & VM_HUGEPAGE))
mask |= READ_ONCE(huge_shmem_orders_madvise);
if (global_orders > 0)
--
2.43.5
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
2025-05-29 8:23 ` [PATCH 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs " Baolin Wang
@ 2025-05-29 15:10 ` Zi Yan
2025-05-30 1:51 ` Baolin Wang
0 siblings, 1 reply; 26+ messages in thread
From: Zi Yan @ 2025-05-29 15:10 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, david, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, linux-mm, linux-fsdevel, linux-kernel
On 29 May 2025, at 4:23, Baolin Wang wrote:
> The MADV_COLLAPSE will ignore the system-wide Anon THP sysfs settings, which
> means that even though we have disabled the Anon THP configuration, MADV_COLLAPSE
> will still attempt to collapse into a Anon THP. This violates the rule we have
> agreed upon: never means never.
>
> To address this issue, should check whether the Anon THP configuration is disabled
> in thp_vma_allowable_orders(), even when the TVA_ENFORCE_SYSFS flag is set.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> include/linux/huge_mm.h | 23 +++++++++++++++++++----
> 1 file changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..199ddc9f04a1 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -287,20 +287,35 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> unsigned long orders)
> {
> /* Optimization to check if required orders are enabled early. */
> - if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
> - unsigned long mask = READ_ONCE(huge_anon_orders_always);
> + if (vma_is_anonymous(vma)) {
> + unsigned long always = READ_ONCE(huge_anon_orders_always);
> + unsigned long madvise = READ_ONCE(huge_anon_orders_madvise);
> + unsigned long inherit = READ_ONCE(huge_anon_orders_inherit);
> + unsigned long mask = always | madvise;
> +
> + /*
> + * If the system-wide THP/mTHP sysfs settings are disabled,
> + * then we should never allow hugepages.
> + */
> + if (!(mask & orders) && !(hugepage_global_enabled() && (inherit & orders)))
Can you explain the logic here? Is it equivalent to:
1. if THP is set to always, always_mask & orders == 0, or
2. if THP if set to madvise, madvise_mask & order == 0, or
3. if THP is set to inherit, inherit_mask & order == 0?
I cannot figure out why (always | madvise) & orders does not check
THP enablement case, but inherit & orders checks hugepage_global_enabled().
Thanks.
> + return 0;
> +
> + if (!(tva_flags & TVA_ENFORCE_SYSFS))
> + goto skip;
>
> + mask = always;
> if (vm_flags & VM_HUGEPAGE)
> - mask |= READ_ONCE(huge_anon_orders_madvise);
> + mask |= madvise;
> if (hugepage_global_always() ||
> ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
> - mask |= READ_ONCE(huge_anon_orders_inherit);
> + mask |= inherit;
>
> orders &= mask;
> if (!orders)
> return 0;
> }
>
> +skip:
> return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
> }
>
> --
> 2.43.5
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/2] mm: shmem: disallow hugepages if the system-wide shmem THP sysfs settings are disabled
2025-05-29 8:23 ` [PATCH 2/2] mm: shmem: disallow hugepages if the system-wide shmem " Baolin Wang
@ 2025-05-29 15:21 ` Zi Yan
2025-05-30 1:58 ` Baolin Wang
0 siblings, 1 reply; 26+ messages in thread
From: Zi Yan @ 2025-05-29 15:21 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, david, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, linux-mm, linux-fsdevel, linux-kernel
On 29 May 2025, at 4:23, Baolin Wang wrote:
> The MADV_COLLAPSE will ignore the system-wide shmem THP sysfs settings, which
> means that even though we have disabled the shmem THP configuration, MADV_COLLAPSE
> will still attempt to collapse into a shmem THP. This violates the rule we have
> agreed upon: never means never.
>
> Then the current strategy is:
> For shmem, if none of always, madvise, within_size, and inherit have enabled
> PMD-sized mTHP, then MADV_COLLAPSE will be prohibited from collapsing PMD-sized mTHP.
>
> For tmpfs, if the mount option is set with the 'huge=never' parameter, then
> MADV_COLLAPSE will be prohibited from collapsing PMD-sized mTHP.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> mm/huge_memory.c | 2 +-
> mm/shmem.c | 12 ++++++------
> 2 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d3e66136e41a..a8cfa37cae72 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -166,7 +166,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> * own flags.
> */
> if (!in_pf && shmem_file(vma->vm_file))
> - return shmem_allowable_huge_orders(file_inode(vma->vm_file),
> + return orders & shmem_allowable_huge_orders(file_inode(vma->vm_file),
> vma, vma->vm_pgoff, 0,
> !enforce_sysfs);
OK, here orders is checked against allowed orders.
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 4b42419ce6b2..4dbb28d85cd9 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -613,7 +613,7 @@ static unsigned int shmem_get_orders_within_size(struct inode *inode,
> }
>
> static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
> - loff_t write_end, bool shmem_huge_force,
> + loff_t write_end,
> struct vm_area_struct *vma,
> unsigned long vm_flags)
> {
> @@ -625,7 +625,7 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
> return 0;
> if (shmem_huge == SHMEM_HUGE_DENY)
> return 0;
> - if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
> + if (shmem_huge == SHMEM_HUGE_FORCE)
> return maybe_pmd_order;
shmem_huge is set by sysfs?
>
> /*
> @@ -860,7 +860,7 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
> }
>
> static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
> - loff_t write_end, bool shmem_huge_force,
> + loff_t write_end,
> struct vm_area_struct *vma,
> unsigned long vm_flags)
> {
> @@ -1261,7 +1261,7 @@ static int shmem_getattr(struct mnt_idmap *idmap,
> STATX_ATTR_NODUMP);
> generic_fillattr(idmap, request_mask, inode, stat);
>
> - if (shmem_huge_global_enabled(inode, 0, 0, false, NULL, 0))
> + if (shmem_huge_global_enabled(inode, 0, 0, NULL, 0))
> stat->blksize = HPAGE_PMD_SIZE;
>
> if (request_mask & STATX_BTIME) {
> @@ -1768,7 +1768,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
> return 0;
>
> global_orders = shmem_huge_global_enabled(inode, index, write_end,
> - shmem_huge_force, vma, vm_flags);
> + vma, vm_flags);
> /* Tmpfs huge pages allocation */
> if (!vma || !vma_is_anon_shmem(vma))
> return global_orders;
> @@ -1790,7 +1790,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
> /* Allow mTHP that will be fully within i_size. */
> mask |= shmem_get_orders_within_size(inode, within_size_orders, index, 0);
>
> - if (vm_flags & VM_HUGEPAGE)
> + if (shmem_huge_force || (vm_flags & VM_HUGEPAGE))
> mask |= READ_ONCE(huge_shmem_orders_madvise);
>
> if (global_orders > 0)
> --
> 2.43.5
shmem_huge_force comes from !enforce_sysfs in __thp_vma_allowable_orders().
Do you know when sysfs is not enforced and why?
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
2025-05-29 15:10 ` Zi Yan
@ 2025-05-30 1:51 ` Baolin Wang
2025-05-30 2:04 ` Zi Yan
0 siblings, 1 reply; 26+ messages in thread
From: Baolin Wang @ 2025-05-30 1:51 UTC (permalink / raw)
To: Zi Yan
Cc: akpm, hughd, david, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, linux-mm, linux-fsdevel, linux-kernel
On 2025/5/29 23:10, Zi Yan wrote:
> On 29 May 2025, at 4:23, Baolin Wang wrote:
>
>> The MADV_COLLAPSE will ignore the system-wide Anon THP sysfs settings, which
>> means that even though we have disabled the Anon THP configuration, MADV_COLLAPSE
>> will still attempt to collapse into a Anon THP. This violates the rule we have
>> agreed upon: never means never.
>>
>> To address this issue, should check whether the Anon THP configuration is disabled
>> in thp_vma_allowable_orders(), even when the TVA_ENFORCE_SYSFS flag is set.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> include/linux/huge_mm.h | 23 +++++++++++++++++++----
>> 1 file changed, 19 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2f190c90192d..199ddc9f04a1 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -287,20 +287,35 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>> unsigned long orders)
>> {
>> /* Optimization to check if required orders are enabled early. */
>> - if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
>> - unsigned long mask = READ_ONCE(huge_anon_orders_always);
>> + if (vma_is_anonymous(vma)) {
>> + unsigned long always = READ_ONCE(huge_anon_orders_always);
>> + unsigned long madvise = READ_ONCE(huge_anon_orders_madvise);
>> + unsigned long inherit = READ_ONCE(huge_anon_orders_inherit);
>> + unsigned long mask = always | madvise;
>> +
>> + /*
>> + * If the system-wide THP/mTHP sysfs settings are disabled,
>> + * then we should never allow hugepages.
>> + */
>> + if (!(mask & orders) && !(hugepage_global_enabled() && (inherit & orders)))
>
> Can you explain the logic here? Is it equivalent to:
> 1. if THP is set to always, always_mask & orders == 0, or
> 2. if THP if set to madvise, madvise_mask & order == 0, or
> 3. if THP is set to inherit, inherit_mask & order == 0?
>
> I cannot figure out why (always | madvise) & orders does not check
> THP enablement case, but inherit & orders checks hugepage_global_enabled().
Sorry for not being clear. Let me try again:
Now we can control per-sized mTHP through ‘huge_anon_orders_always’, so
always does not need to rely on the check of hugepage_global_enabled().
For madvise, referring to David's suggestion: “allowing for collapsing
in a VM without VM_HUGEPAGE in the "madvise" mode would be fine", so we
can just check 'huge_anon_orders_madvise' without relying on
hugepage_global_enabled().
In the case where hugepage_global_enabled() is enabled, we need to check
whether the 'inherit' has enabled the corresponding orders.
In summary, the current strategy is:
1. If always & orders == 0, and madvise & orders == 0, and
hugepage_global_enabled() == false (global THP settings are not
enabled), it means mTHP of the orders are prohibited from being used,
then madvise_collapse() is forbidden.
2. If always & orders == 0, and madvise & orders == 0, and
hugepage_global_enabled() == true (global THP settings are enabled), and
inherit & orders == 0, it means mTHP of the orders are still prohibited
from being used, and thus madvise_collapse() is not allowed.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/2] mm: shmem: disallow hugepages if the system-wide shmem THP sysfs settings are disabled
2025-05-29 15:21 ` Zi Yan
@ 2025-05-30 1:58 ` Baolin Wang
2025-05-30 2:17 ` Zi Yan
0 siblings, 1 reply; 26+ messages in thread
From: Baolin Wang @ 2025-05-30 1:58 UTC (permalink / raw)
To: Zi Yan
Cc: akpm, hughd, david, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, linux-mm, linux-fsdevel, linux-kernel
On 2025/5/29 23:21, Zi Yan wrote:
> On 29 May 2025, at 4:23, Baolin Wang wrote:
>
>> The MADV_COLLAPSE will ignore the system-wide shmem THP sysfs settings, which
>> means that even though we have disabled the shmem THP configuration, MADV_COLLAPSE
>> will still attempt to collapse into a shmem THP. This violates the rule we have
>> agreed upon: never means never.
>>
>> Then the current strategy is:
>> For shmem, if none of always, madvise, within_size, and inherit have enabled
>> PMD-sized mTHP, then MADV_COLLAPSE will be prohibited from collapsing PMD-sized mTHP.
>>
>> For tmpfs, if the mount option is set with the 'huge=never' parameter, then
>> MADV_COLLAPSE will be prohibited from collapsing PMD-sized mTHP.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> mm/huge_memory.c | 2 +-
>> mm/shmem.c | 12 ++++++------
>> 2 files changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index d3e66136e41a..a8cfa37cae72 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -166,7 +166,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>> * own flags.
>> */
>> if (!in_pf && shmem_file(vma->vm_file))
>> - return shmem_allowable_huge_orders(file_inode(vma->vm_file),
>> + return orders & shmem_allowable_huge_orders(file_inode(vma->vm_file),
>> vma, vma->vm_pgoff, 0,
>> !enforce_sysfs);
>
> OK, here orders is checked against allowed orders.
>
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 4b42419ce6b2..4dbb28d85cd9 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -613,7 +613,7 @@ static unsigned int shmem_get_orders_within_size(struct inode *inode,
>> }
>>
>> static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
>> - loff_t write_end, bool shmem_huge_force,
>> + loff_t write_end,
>> struct vm_area_struct *vma,
>> unsigned long vm_flags)
>> {
>> @@ -625,7 +625,7 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
>> return 0;
>> if (shmem_huge == SHMEM_HUGE_DENY)
>> return 0;
>> - if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
>> + if (shmem_huge == SHMEM_HUGE_FORCE)
>> return maybe_pmd_order;
>
> shmem_huge is set by sysfs?
Yes, through the '/sys/kernel/mm/transparent_hugepage/shmem_enabled'
interface.
>> /*
>> @@ -860,7 +860,7 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
>> }
>>
>> static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
>> - loff_t write_end, bool shmem_huge_force,
>> + loff_t write_end,
>> struct vm_area_struct *vma,
>> unsigned long vm_flags)
>> {
>> @@ -1261,7 +1261,7 @@ static int shmem_getattr(struct mnt_idmap *idmap,
>> STATX_ATTR_NODUMP);
>> generic_fillattr(idmap, request_mask, inode, stat);
>>
>> - if (shmem_huge_global_enabled(inode, 0, 0, false, NULL, 0))
>> + if (shmem_huge_global_enabled(inode, 0, 0, NULL, 0))
>> stat->blksize = HPAGE_PMD_SIZE;
>>
>> if (request_mask & STATX_BTIME) {
>> @@ -1768,7 +1768,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
>> return 0;
>>
>> global_orders = shmem_huge_global_enabled(inode, index, write_end,
>> - shmem_huge_force, vma, vm_flags);
>> + vma, vm_flags);
>> /* Tmpfs huge pages allocation */
>> if (!vma || !vma_is_anon_shmem(vma))
>> return global_orders;
>> @@ -1790,7 +1790,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
>> /* Allow mTHP that will be fully within i_size. */
>> mask |= shmem_get_orders_within_size(inode, within_size_orders, index, 0);
>>
>> - if (vm_flags & VM_HUGEPAGE)
>> + if (shmem_huge_force || (vm_flags & VM_HUGEPAGE))
>> mask |= READ_ONCE(huge_shmem_orders_madvise);
>>
>> if (global_orders > 0)
>> --
>> 2.43.5
>
> shmem_huge_force comes from !enforce_sysfs in __thp_vma_allowable_orders().
> Do you know when sysfs is not enforced and why?
IIUC, shmem_huge_force will only be set during MADV_COLLAPSE.
Originally, MADV_COLLAPSE was intended to ignore the system-wide THP
sysfs settings. However, if all system-wide shmem THP settings are
disabled, we should not allow MADV_COLLAPSE to collapse a THP. This is
the issue this patchset aims to fix. Thanks for the review.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
2025-05-30 1:51 ` Baolin Wang
@ 2025-05-30 2:04 ` Zi Yan
2025-05-30 2:21 ` Baolin Wang
0 siblings, 1 reply; 26+ messages in thread
From: Zi Yan @ 2025-05-30 2:04 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, david, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, linux-mm, linux-fsdevel, linux-kernel
On 29 May 2025, at 21:51, Baolin Wang wrote:
> On 2025/5/29 23:10, Zi Yan wrote:
>> On 29 May 2025, at 4:23, Baolin Wang wrote:
>>
>>> The MADV_COLLAPSE will ignore the system-wide Anon THP sysfs settings, which
>>> means that even though we have disabled the Anon THP configuration, MADV_COLLAPSE
>>> will still attempt to collapse into a Anon THP. This violates the rule we have
>>> agreed upon: never means never.
>>>
>>> To address this issue, should check whether the Anon THP configuration is disabled
>>> in thp_vma_allowable_orders(), even when the TVA_ENFORCE_SYSFS flag is set.
>>>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>> include/linux/huge_mm.h | 23 +++++++++++++++++++----
>>> 1 file changed, 19 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 2f190c90192d..199ddc9f04a1 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -287,20 +287,35 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>>> unsigned long orders)
>>> {
>>> /* Optimization to check if required orders are enabled early. */
>>> - if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
>>> - unsigned long mask = READ_ONCE(huge_anon_orders_always);
>>> + if (vma_is_anonymous(vma)) {
>>> + unsigned long always = READ_ONCE(huge_anon_orders_always);
>>> + unsigned long madvise = READ_ONCE(huge_anon_orders_madvise);
>>> + unsigned long inherit = READ_ONCE(huge_anon_orders_inherit);
>>> + unsigned long mask = always | madvise;
>>> +
>>> + /*
>>> + * If the system-wide THP/mTHP sysfs settings are disabled,
>>> + * then we should never allow hugepages.
>>> + */
>>> + if (!(mask & orders) && !(hugepage_global_enabled() && (inherit & orders)))
>>
>> Can you explain the logic here? Is it equivalent to:
>> 1. if THP is set to always, always_mask & orders == 0, or
>> 2. if THP if set to madvise, madvise_mask & order == 0, or
>> 3. if THP is set to inherit, inherit_mask & order == 0?
>>
>> I cannot figure out why (always | madvise) & orders does not check
>> THP enablement case, but inherit & orders checks hugepage_global_enabled().
>
> Sorry for not being clear. Let me try again:
>
> Now we can control per-sized mTHP through ‘huge_anon_orders_always’, so always does not need to rely on the check of hugepage_global_enabled().
>
> For madvise, referring to David's suggestion: “allowing for collapsing in a VM without VM_HUGEPAGE in the "madvise" mode would be fine", so we can just check 'huge_anon_orders_madvise' without relying on hugepage_global_enabled().
Got it. Setting always or madvise knob in per-size mTHP means user wants to
enable that size, so their orders are not limited by the global config.
Setting inherit means user wants to follow the global config.
Now it makes sense to me. I wonder if renaming inherit to inherit_global
and huge_anon_orders_inherit to huge_anon_orders_inherit_global
could make code more clear (We cannot change sysfs names, but changing
kernel variable names should be fine?).
>
> In the case where hugepage_global_enabled() is enabled, we need to check whether the 'inherit' has enabled the corresponding orders.
>
> In summary, the current strategy is:
>
> 1. If always & orders == 0, and madvise & orders == 0, and hugepage_global_enabled() == false (global THP settings are not enabled), it means mTHP of the orders are prohibited from being used, then madvise_collapse() is forbidden.
>
> 2. If always & orders == 0, and madvise & orders == 0, and hugepage_global_enabled() == true (global THP settings are enabled), and inherit & orders == 0, it means mTHP of the orders are still prohibited from being used, and thus madvise_collapse() is not allowed.
Putting the summary in the comment might be very helpful. WDYT?
Otherwise, the patch looks good to me. Thanks.
Reviewed-by: Zi Yan <ziy@nvidia.com>
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/2] mm: shmem: disallow hugepages if the system-wide shmem THP sysfs settings are disabled
2025-05-30 1:58 ` Baolin Wang
@ 2025-05-30 2:17 ` Zi Yan
2025-05-30 2:32 ` Baolin Wang
0 siblings, 1 reply; 26+ messages in thread
From: Zi Yan @ 2025-05-30 2:17 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, david, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, linux-mm, linux-fsdevel, linux-kernel
On 29 May 2025, at 21:58, Baolin Wang wrote:
> On 2025/5/29 23:21, Zi Yan wrote:
>> On 29 May 2025, at 4:23, Baolin Wang wrote:
>>
>>> The MADV_COLLAPSE will ignore the system-wide shmem THP sysfs settings, which
>>> means that even though we have disabled the shmem THP configuration, MADV_COLLAPSE
>>> will still attempt to collapse into a shmem THP. This violates the rule we have
>>> agreed upon: never means never.
>>>
>>> Then the current strategy is:
>>> For shmem, if none of always, madvise, within_size, and inherit have enabled
>>> PMD-sized mTHP, then MADV_COLLAPSE will be prohibited from collapsing PMD-sized mTHP.
>>>
>>> For tmpfs, if the mount option is set with the 'huge=never' parameter, then
>>> MADV_COLLAPSE will be prohibited from collapsing PMD-sized mTHP.
>>>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>> mm/huge_memory.c | 2 +-
>>> mm/shmem.c | 12 ++++++------
>>> 2 files changed, 7 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index d3e66136e41a..a8cfa37cae72 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -166,7 +166,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>> * own flags.
>>> */
>>> if (!in_pf && shmem_file(vma->vm_file))
>>> - return shmem_allowable_huge_orders(file_inode(vma->vm_file),
>>> + return orders & shmem_allowable_huge_orders(file_inode(vma->vm_file),
>>> vma, vma->vm_pgoff, 0,
>>> !enforce_sysfs);
>>
>> OK, here orders is checked against allowed orders.
>>
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 4b42419ce6b2..4dbb28d85cd9 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -613,7 +613,7 @@ static unsigned int shmem_get_orders_within_size(struct inode *inode,
>>> }
>>>
>>> static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
>>> - loff_t write_end, bool shmem_huge_force,
>>> + loff_t write_end,
>>> struct vm_area_struct *vma,
>>> unsigned long vm_flags)
>>> {
>>> @@ -625,7 +625,7 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
>>> return 0;
>>> if (shmem_huge == SHMEM_HUGE_DENY)
>>> return 0;
>>> - if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
>>> + if (shmem_huge == SHMEM_HUGE_FORCE)
>>> return maybe_pmd_order;
>>
>> shmem_huge is set by sysfs?
>
> Yes, through the '/sys/kernel/mm/transparent_hugepage/shmem_enabled' interface.
>
>>> /*
>>> @@ -860,7 +860,7 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
>>> }
>>>
>>> static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
>>> - loff_t write_end, bool shmem_huge_force,
>>> + loff_t write_end,
>>> struct vm_area_struct *vma,
>>> unsigned long vm_flags)
>>> {
>>> @@ -1261,7 +1261,7 @@ static int shmem_getattr(struct mnt_idmap *idmap,
>>> STATX_ATTR_NODUMP);
>>> generic_fillattr(idmap, request_mask, inode, stat);
>>>
>>> - if (shmem_huge_global_enabled(inode, 0, 0, false, NULL, 0))
>>> + if (shmem_huge_global_enabled(inode, 0, 0, NULL, 0))
>>> stat->blksize = HPAGE_PMD_SIZE;
>>>
>>> if (request_mask & STATX_BTIME) {
>>> @@ -1768,7 +1768,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
>>> return 0;
>>>
>>> global_orders = shmem_huge_global_enabled(inode, index, write_end,
>>> - shmem_huge_force, vma, vm_flags);
>>> + vma, vm_flags);
>>> /* Tmpfs huge pages allocation */
>>> if (!vma || !vma_is_anon_shmem(vma))
>>> return global_orders;
>>> @@ -1790,7 +1790,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
>>> /* Allow mTHP that will be fully within i_size. */
>>> mask |= shmem_get_orders_within_size(inode, within_size_orders, index, 0);
>>>
>>> - if (vm_flags & VM_HUGEPAGE)
>>> + if (shmem_huge_force || (vm_flags & VM_HUGEPAGE))
>>> mask |= READ_ONCE(huge_shmem_orders_madvise);
>>>
>>> if (global_orders > 0)
>>> --
>>> 2.43.5
>>
>> shmem_huge_force comes from !enforce_sysfs in __thp_vma_allowable_orders().
>> Do you know when sysfs is not enforced and why?
>
> IIUC, shmem_huge_force will only be set during MADV_COLLAPSE. Originally, MADV_COLLAPSE was intended to ignore the system-wide THP sysfs settings. However, if all system-wide shmem THP settings are disabled, we should not allow MADV_COLLAPSE to collapse a THP. This is the issue this patchset aims to fix. Thanks for the review.
Got it. If we want to enforce sysfs, why not just get rid of TVA_ENFORCE_SYSFS
and make everyone follow sysfs?
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
2025-05-30 2:04 ` Zi Yan
@ 2025-05-30 2:21 ` Baolin Wang
2025-05-30 2:22 ` Zi Yan
0 siblings, 1 reply; 26+ messages in thread
From: Baolin Wang @ 2025-05-30 2:21 UTC (permalink / raw)
To: Zi Yan
Cc: akpm, hughd, david, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, linux-mm, linux-fsdevel, linux-kernel
On 2025/5/30 10:04, Zi Yan wrote:
> On 29 May 2025, at 21:51, Baolin Wang wrote:
>
>> On 2025/5/29 23:10, Zi Yan wrote:
>>> On 29 May 2025, at 4:23, Baolin Wang wrote:
>>>
>>>> The MADV_COLLAPSE will ignore the system-wide Anon THP sysfs settings, which
>>>> means that even though we have disabled the Anon THP configuration, MADV_COLLAPSE
>>>> will still attempt to collapse into a Anon THP. This violates the rule we have
>>>> agreed upon: never means never.
>>>>
>>>> To address this issue, should check whether the Anon THP configuration is disabled
>>>> in thp_vma_allowable_orders(), even when the TVA_ENFORCE_SYSFS flag is set.
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>> include/linux/huge_mm.h | 23 +++++++++++++++++++----
>>>> 1 file changed, 19 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index 2f190c90192d..199ddc9f04a1 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -287,20 +287,35 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>> unsigned long orders)
>>>> {
>>>> /* Optimization to check if required orders are enabled early. */
>>>> - if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
>>>> - unsigned long mask = READ_ONCE(huge_anon_orders_always);
>>>> + if (vma_is_anonymous(vma)) {
>>>> + unsigned long always = READ_ONCE(huge_anon_orders_always);
>>>> + unsigned long madvise = READ_ONCE(huge_anon_orders_madvise);
>>>> + unsigned long inherit = READ_ONCE(huge_anon_orders_inherit);
>>>> + unsigned long mask = always | madvise;
>>>> +
>>>> + /*
>>>> + * If the system-wide THP/mTHP sysfs settings are disabled,
>>>> + * then we should never allow hugepages.
>>>> + */
>>>> + if (!(mask & orders) && !(hugepage_global_enabled() && (inherit & orders)))
>>>
>>> Can you explain the logic here? Is it equivalent to:
>>> 1. if THP is set to always, always_mask & orders == 0, or
>>> 2. if THP if set to madvise, madvise_mask & order == 0, or
>>> 3. if THP is set to inherit, inherit_mask & order == 0?
>>>
>>> I cannot figure out why (always | madvise) & orders does not check
>>> THP enablement case, but inherit & orders checks hugepage_global_enabled().
>>
>> Sorry for not being clear. Let me try again:
>>
>> Now we can control per-sized mTHP through ‘huge_anon_orders_always’, so always does not need to rely on the check of hugepage_global_enabled().
>>
>> For madvise, referring to David's suggestion: “allowing for collapsing in a VM without VM_HUGEPAGE in the "madvise" mode would be fine", so we can just check 'huge_anon_orders_madvise' without relying on hugepage_global_enabled().
>
> Got it. Setting always or madvise knob in per-size mTHP means user wants to
> enable that size, so their orders are not limited by the global config.
> Setting inherit means user wants to follow the global config.
> Now it makes sense to me. I wonder if renaming inherit to inherit_global
> and huge_anon_orders_inherit to huge_anon_orders_inherit_global
> could make code more clear (We cannot change sysfs names, but changing
> kernel variable names should be fine?).
The 'huge_anon_orders_inherit' is also a per-size mTHP interface. See
the doc:
"
Alternatively it is possible to specify that a given hugepage size
will inherit the top-level "enabled" value::
echo inherit
>/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
"
The 'inherit' already implies that it is meant to inherit the top-level
'enabled' value, so I personally still prefer the variable name
'inherit', just as we use it elsewhere.
>> In the case where hugepage_global_enabled() is enabled, we need to check whether the 'inherit' has enabled the corresponding orders.
>>
>> In summary, the current strategy is:
>>
>> 1. If always & orders == 0, and madvise & orders == 0, and hugepage_global_enabled() == false (global THP settings are not enabled), it means mTHP of the orders are prohibited from being used, then madvise_collapse() is forbidden.
>>
>> 2. If always & orders == 0, and madvise & orders == 0, and hugepage_global_enabled() == true (global THP settings are enabled), and inherit & orders == 0, it means mTHP of the orders are still prohibited from being used, and thus madvise_collapse() is not allowed.
>
> Putting the summary in the comment might be very helpful. WDYT?
Sure. will do.
> Otherwise, the patch looks good to me. Thanks.
>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
Thanks.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs settings are disabled
2025-05-30 2:21 ` Baolin Wang
@ 2025-05-30 2:22 ` Zi Yan
0 siblings, 0 replies; 26+ messages in thread
From: Zi Yan @ 2025-05-30 2:22 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, david, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, linux-mm, linux-fsdevel, linux-kernel
On 29 May 2025, at 22:21, Baolin Wang wrote:
> On 2025/5/30 10:04, Zi Yan wrote:
>> On 29 May 2025, at 21:51, Baolin Wang wrote:
>>
>>> On 2025/5/29 23:10, Zi Yan wrote:
>>>> On 29 May 2025, at 4:23, Baolin Wang wrote:
>>>>
>>>>> The MADV_COLLAPSE will ignore the system-wide Anon THP sysfs settings, which
>>>>> means that even though we have disabled the Anon THP configuration, MADV_COLLAPSE
>>>>> will still attempt to collapse into a Anon THP. This violates the rule we have
>>>>> agreed upon: never means never.
>>>>>
>>>>> To address this issue, should check whether the Anon THP configuration is disabled
>>>>> in thp_vma_allowable_orders(), even when the TVA_ENFORCE_SYSFS flag is set.
>>>>>
>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>> ---
>>>>> include/linux/huge_mm.h | 23 +++++++++++++++++++----
>>>>> 1 file changed, 19 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>> index 2f190c90192d..199ddc9f04a1 100644
>>>>> --- a/include/linux/huge_mm.h
>>>>> +++ b/include/linux/huge_mm.h
>>>>> @@ -287,20 +287,35 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>>> unsigned long orders)
>>>>> {
>>>>> /* Optimization to check if required orders are enabled early. */
>>>>> - if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
>>>>> - unsigned long mask = READ_ONCE(huge_anon_orders_always);
>>>>> + if (vma_is_anonymous(vma)) {
>>>>> + unsigned long always = READ_ONCE(huge_anon_orders_always);
>>>>> + unsigned long madvise = READ_ONCE(huge_anon_orders_madvise);
>>>>> + unsigned long inherit = READ_ONCE(huge_anon_orders_inherit);
>>>>> + unsigned long mask = always | madvise;
>>>>> +
>>>>> + /*
>>>>> + * If the system-wide THP/mTHP sysfs settings are disabled,
>>>>> + * then we should never allow hugepages.
>>>>> + */
>>>>> + if (!(mask & orders) && !(hugepage_global_enabled() && (inherit & orders)))
>>>>
>>>> Can you explain the logic here? Is it equivalent to:
>>>> 1. if THP is set to always, always_mask & orders == 0, or
>>>> 2. if THP if set to madvise, madvise_mask & order == 0, or
>>>> 3. if THP is set to inherit, inherit_mask & order == 0?
>>>>
>>>> I cannot figure out why (always | madvise) & orders does not check
>>>> THP enablement case, but inherit & orders checks hugepage_global_enabled().
>>>
>>> Sorry for not being clear. Let me try again:
>>>
>>> Now we can control per-sized mTHP through ‘huge_anon_orders_always’, so always does not need to rely on the check of hugepage_global_enabled().
>>>
>>> For madvise, referring to David's suggestion: “allowing for collapsing in a VM without VM_HUGEPAGE in the "madvise" mode would be fine", so we can just check 'huge_anon_orders_madvise' without relying on hugepage_global_enabled().
>>
>> Got it. Setting always or madvise knob in per-size mTHP means user wants to
>> enable that size, so their orders are not limited by the global config.
>> Setting inherit means user wants to follow the global config.
>> Now it makes sense to me. I wonder if renaming inherit to inherit_global
>> and huge_anon_orders_inherit to huge_anon_orders_inherit_global
>> could make code more clear (We cannot change sysfs names, but changing
>> kernel variable names should be fine?).
>
> The 'huge_anon_orders_inherit' is also a per-size mTHP interface. See the doc:
> "
> Alternatively it is possible to specify that a given hugepage size
> will inherit the top-level "enabled" value::
>
> echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
> "
>
> The 'inherit' already implies that it is meant to inherit the top-level 'enabled' value, so I personally still prefer the variable name 'inherit', just as we use it elsewhere.
OK. No problem.
>
>>> In the case where hugepage_global_enabled() is enabled, we need to check whether the 'inherit' has enabled the corresponding orders.
>>>
>>> In summary, the current strategy is:
>>>
>>> 1. If always & orders == 0, and madvise & orders == 0, and hugepage_global_enabled() == false (global THP settings are not enabled), it means mTHP of the orders are prohibited from being used, then madvise_collapse() is forbidden.
>>>
>>> 2. If always & orders == 0, and madvise & orders == 0, and hugepage_global_enabled() == true (global THP settings are enabled), and inherit & orders == 0, it means mTHP of the orders are still prohibited from being used, and thus madvise_collapse() is not allowed.
>>
>> Putting the summary in the comment might be very helpful. WDYT?
>
> Sure. will do.
Thanks.
>
>> Otherwise, the patch looks good to me. Thanks.
>>
>> Reviewed-by: Zi Yan <ziy@nvidia.com>
>
> Thanks.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/2] mm: shmem: disallow hugepages if the system-wide shmem THP sysfs settings are disabled
2025-05-30 2:17 ` Zi Yan
@ 2025-05-30 2:32 ` Baolin Wang
2025-05-30 2:35 ` Zi Yan
0 siblings, 1 reply; 26+ messages in thread
From: Baolin Wang @ 2025-05-30 2:32 UTC (permalink / raw)
To: Zi Yan
Cc: akpm, hughd, david, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, linux-mm, linux-fsdevel, linux-kernel
On 2025/5/30 10:17, Zi Yan wrote:
> On 29 May 2025, at 21:58, Baolin Wang wrote:
>
>> On 2025/5/29 23:21, Zi Yan wrote:
>>> On 29 May 2025, at 4:23, Baolin Wang wrote:
>>>
>>>> The MADV_COLLAPSE will ignore the system-wide shmem THP sysfs settings, which
>>>> means that even though we have disabled the shmem THP configuration, MADV_COLLAPSE
>>>> will still attempt to collapse into a shmem THP. This violates the rule we have
>>>> agreed upon: never means never.
>>>>
>>>> Then the current strategy is:
>>>> For shmem, if none of always, madvise, within_size, and inherit have enabled
>>>> PMD-sized mTHP, then MADV_COLLAPSE will be prohibited from collapsing PMD-sized mTHP.
>>>>
>>>> For tmpfs, if the mount option is set with the 'huge=never' parameter, then
>>>> MADV_COLLAPSE will be prohibited from collapsing PMD-sized mTHP.
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>> mm/huge_memory.c | 2 +-
>>>> mm/shmem.c | 12 ++++++------
>>>> 2 files changed, 7 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index d3e66136e41a..a8cfa37cae72 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -166,7 +166,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>> * own flags.
>>>> */
>>>> if (!in_pf && shmem_file(vma->vm_file))
>>>> - return shmem_allowable_huge_orders(file_inode(vma->vm_file),
>>>> + return orders & shmem_allowable_huge_orders(file_inode(vma->vm_file),
>>>> vma, vma->vm_pgoff, 0,
>>>> !enforce_sysfs);
>>>
>>> OK, here orders is checked against allowed orders.
>>>
>>>>
>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>> index 4b42419ce6b2..4dbb28d85cd9 100644
>>>> --- a/mm/shmem.c
>>>> +++ b/mm/shmem.c
>>>> @@ -613,7 +613,7 @@ static unsigned int shmem_get_orders_within_size(struct inode *inode,
>>>> }
>>>>
>>>> static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
>>>> - loff_t write_end, bool shmem_huge_force,
>>>> + loff_t write_end,
>>>> struct vm_area_struct *vma,
>>>> unsigned long vm_flags)
>>>> {
>>>> @@ -625,7 +625,7 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
>>>> return 0;
>>>> if (shmem_huge == SHMEM_HUGE_DENY)
>>>> return 0;
>>>> - if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
>>>> + if (shmem_huge == SHMEM_HUGE_FORCE)
>>>> return maybe_pmd_order;
>>>
>>> shmem_huge is set by sysfs?
>>
>> Yes, through the '/sys/kernel/mm/transparent_hugepage/shmem_enabled' interface.
>>
>>>> /*
>>>> @@ -860,7 +860,7 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
>>>> }
>>>>
>>>> static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
>>>> - loff_t write_end, bool shmem_huge_force,
>>>> + loff_t write_end,
>>>> struct vm_area_struct *vma,
>>>> unsigned long vm_flags)
>>>> {
>>>> @@ -1261,7 +1261,7 @@ static int shmem_getattr(struct mnt_idmap *idmap,
>>>> STATX_ATTR_NODUMP);
>>>> generic_fillattr(idmap, request_mask, inode, stat);
>>>>
>>>> - if (shmem_huge_global_enabled(inode, 0, 0, false, NULL, 0))
>>>> + if (shmem_huge_global_enabled(inode, 0, 0, NULL, 0))
>>>> stat->blksize = HPAGE_PMD_SIZE;
>>>>
>>>> if (request_mask & STATX_BTIME) {
>>>> @@ -1768,7 +1768,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
>>>> return 0;
>>>>
>>>> global_orders = shmem_huge_global_enabled(inode, index, write_end,
>>>> - shmem_huge_force, vma, vm_flags);
>>>> + vma, vm_flags);
>>>> /* Tmpfs huge pages allocation */
>>>> if (!vma || !vma_is_anon_shmem(vma))
>>>> return global_orders;
>>>> @@ -1790,7 +1790,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
>>>> /* Allow mTHP that will be fully within i_size. */
>>>> mask |= shmem_get_orders_within_size(inode, within_size_orders, index, 0);
>>>>
>>>> - if (vm_flags & VM_HUGEPAGE)
>>>> + if (shmem_huge_force || (vm_flags & VM_HUGEPAGE))
>>>> mask |= READ_ONCE(huge_shmem_orders_madvise);
>>>>
>>>> if (global_orders > 0)
>>>> --
>>>> 2.43.5
>>>
>>> shmem_huge_force comes from !enforce_sysfs in __thp_vma_allowable_orders().
>>> Do you know when sysfs is not enforced and why?
>>
>> IIUC, shmem_huge_force will only be set during MADV_COLLAPSE. Originally, MADV_COLLAPSE was intended to ignore the system-wide THP sysfs settings. However, if all system-wide shmem THP settings are disabled, we should not allow MADV_COLLAPSE to collapse a THP. This is the issue this patchset aims to fix. Thanks for the review.
>
> Got it. If we want to enforce sysfs, why not just get rid of TVA_ENFORCE_SYSFS
> and make everyone follow sysfs?
Now MADV_COLLAPSE will ignore the VM_HUGEPAGE, while the others will
check the VM_HUGEPAGE flag before using 'huge_shmem_orders_madvise' with
the TVA_ENFORCE_SYSFS flag set.
That is to follow the rule: “allowing for collapsing in a VM without
VM_HUGEPAGE in the "madvise" mode would be fine".
So I think we should still keep the TVA_ENFORCE_SYSFS flag.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/2] mm: shmem: disallow hugepages if the system-wide shmem THP sysfs settings are disabled
2025-05-30 2:32 ` Baolin Wang
@ 2025-05-30 2:35 ` Zi Yan
2025-05-30 2:39 ` Baolin Wang
0 siblings, 1 reply; 26+ messages in thread
From: Zi Yan @ 2025-05-30 2:35 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, hughd, david, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, linux-mm, linux-fsdevel, linux-kernel
On 29 May 2025, at 22:32, Baolin Wang wrote:
> On 2025/5/30 10:17, Zi Yan wrote:
>> On 29 May 2025, at 21:58, Baolin Wang wrote:
>>
>>> On 2025/5/29 23:21, Zi Yan wrote:
>>>> On 29 May 2025, at 4:23, Baolin Wang wrote:
>>>>
>>>>> The MADV_COLLAPSE will ignore the system-wide shmem THP sysfs settings, which
>>>>> means that even though we have disabled the shmem THP configuration, MADV_COLLAPSE
>>>>> will still attempt to collapse into a shmem THP. This violates the rule we have
>>>>> agreed upon: never means never.
>>>>>
>>>>> Then the current strategy is:
>>>>> For shmem, if none of always, madvise, within_size, and inherit have enabled
>>>>> PMD-sized mTHP, then MADV_COLLAPSE will be prohibited from collapsing PMD-sized mTHP.
>>>>>
>>>>> For tmpfs, if the mount option is set with the 'huge=never' parameter, then
>>>>> MADV_COLLAPSE will be prohibited from collapsing PMD-sized mTHP.
>>>>>
>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>> ---
>>>>> mm/huge_memory.c | 2 +-
>>>>> mm/shmem.c | 12 ++++++------
>>>>> 2 files changed, 7 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>> index d3e66136e41a..a8cfa37cae72 100644
>>>>> --- a/mm/huge_memory.c
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -166,7 +166,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>>> * own flags.
>>>>> */
>>>>> if (!in_pf && shmem_file(vma->vm_file))
>>>>> - return shmem_allowable_huge_orders(file_inode(vma->vm_file),
>>>>> + return orders & shmem_allowable_huge_orders(file_inode(vma->vm_file),
>>>>> vma, vma->vm_pgoff, 0,
>>>>> !enforce_sysfs);
>>>>
>>>> OK, here orders is checked against allowed orders.
>>>>
>>>>>
>>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>>> index 4b42419ce6b2..4dbb28d85cd9 100644
>>>>> --- a/mm/shmem.c
>>>>> +++ b/mm/shmem.c
>>>>> @@ -613,7 +613,7 @@ static unsigned int shmem_get_orders_within_size(struct inode *inode,
>>>>> }
>>>>>
>>>>> static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
>>>>> - loff_t write_end, bool shmem_huge_force,
>>>>> + loff_t write_end,
>>>>> struct vm_area_struct *vma,
>>>>> unsigned long vm_flags)
>>>>> {
>>>>> @@ -625,7 +625,7 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
>>>>> return 0;
>>>>> if (shmem_huge == SHMEM_HUGE_DENY)
>>>>> return 0;
>>>>> - if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
>>>>> + if (shmem_huge == SHMEM_HUGE_FORCE)
>>>>> return maybe_pmd_order;
>>>>
>>>> shmem_huge is set by sysfs?
>>>
>>> Yes, through the '/sys/kernel/mm/transparent_hugepage/shmem_enabled' interface.
>>>
>>>>> /*
>>>>> @@ -860,7 +860,7 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
>>>>> }
>>>>>
>>>>> static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
>>>>> - loff_t write_end, bool shmem_huge_force,
>>>>> + loff_t write_end,
>>>>> struct vm_area_struct *vma,
>>>>> unsigned long vm_flags)
>>>>> {
>>>>> @@ -1261,7 +1261,7 @@ static int shmem_getattr(struct mnt_idmap *idmap,
>>>>> STATX_ATTR_NODUMP);
>>>>> generic_fillattr(idmap, request_mask, inode, stat);
>>>>>
>>>>> - if (shmem_huge_global_enabled(inode, 0, 0, false, NULL, 0))
>>>>> + if (shmem_huge_global_enabled(inode, 0, 0, NULL, 0))
>>>>> stat->blksize = HPAGE_PMD_SIZE;
>>>>>
>>>>> if (request_mask & STATX_BTIME) {
>>>>> @@ -1768,7 +1768,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
>>>>> return 0;
>>>>>
>>>>> global_orders = shmem_huge_global_enabled(inode, index, write_end,
>>>>> - shmem_huge_force, vma, vm_flags);
>>>>> + vma, vm_flags);
>>>>> /* Tmpfs huge pages allocation */
>>>>> if (!vma || !vma_is_anon_shmem(vma))
>>>>> return global_orders;
>>>>> @@ -1790,7 +1790,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
>>>>> /* Allow mTHP that will be fully within i_size. */
>>>>> mask |= shmem_get_orders_within_size(inode, within_size_orders, index, 0);
>>>>>
>>>>> - if (vm_flags & VM_HUGEPAGE)
>>>>> + if (shmem_huge_force || (vm_flags & VM_HUGEPAGE))
>>>>> mask |= READ_ONCE(huge_shmem_orders_madvise);
>>>>>
>>>>> if (global_orders > 0)
>>>>> --
>>>>> 2.43.5
>>>>
>>>> shmem_huge_force comes from !enforce_sysfs in __thp_vma_allowable_orders().
>>>> Do you know when sysfs is not enforced and why?
>>>
>>> IIUC, shmem_huge_force will only be set during MADV_COLLAPSE. Originally, MADV_COLLAPSE was intended to ignore the system-wide THP sysfs settings. However, if all system-wide shmem THP settings are disabled, we should not allow MADV_COLLAPSE to collapse a THP. This is the issue this patchset aims to fix. Thanks for the review.
>>
>> Got it. If we want to enforce sysfs, why not just get rid of TVA_ENFORCE_SYSFS
>> and make everyone follow sysfs?
>
> Now MADV_COLLAPSE will ignore the VM_HUGEPAGE, while the others will check the VM_HUGEPAGE flag before using 'huge_shmem_orders_madvise' with the TVA_ENFORCE_SYSFS flag set.
>
> That is to follow the rule: “allowing for collapsing in a VM without VM_HUGEPAGE in the "madvise" mode would be fine".
Can you add this rule in your commit message? It clarifies things.
>
> So I think we should still keep the TVA_ENFORCE_SYSFS flag.
Got it. Thank you for the explanation.
Acked-by: Zi Yan <ziy@nvidia.com>
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/2] mm: shmem: disallow hugepages if the system-wide shmem THP sysfs settings are disabled
2025-05-30 2:35 ` Zi Yan
@ 2025-05-30 2:39 ` Baolin Wang
0 siblings, 0 replies; 26+ messages in thread
From: Baolin Wang @ 2025-05-30 2:39 UTC (permalink / raw)
To: Zi Yan
Cc: akpm, hughd, david, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, linux-mm, linux-fsdevel, linux-kernel
On 2025/5/30 10:35, Zi Yan wrote:
> On 29 May 2025, at 22:32, Baolin Wang wrote:
>
>> On 2025/5/30 10:17, Zi Yan wrote:
>>> On 29 May 2025, at 21:58, Baolin Wang wrote:
>>>
>>>> On 2025/5/29 23:21, Zi Yan wrote:
>>>>> On 29 May 2025, at 4:23, Baolin Wang wrote:
>>>>>
>>>>>> The MADV_COLLAPSE will ignore the system-wide shmem THP sysfs settings, which
>>>>>> means that even though we have disabled the shmem THP configuration, MADV_COLLAPSE
>>>>>> will still attempt to collapse into a shmem THP. This violates the rule we have
>>>>>> agreed upon: never means never.
>>>>>>
>>>>>> Then the current strategy is:
>>>>>> For shmem, if none of always, madvise, within_size, and inherit have enabled
>>>>>> PMD-sized mTHP, then MADV_COLLAPSE will be prohibited from collapsing PMD-sized mTHP.
>>>>>>
>>>>>> For tmpfs, if the mount option is set with the 'huge=never' parameter, then
>>>>>> MADV_COLLAPSE will be prohibited from collapsing PMD-sized mTHP.
>>>>>>
>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>> ---
>>>>>> mm/huge_memory.c | 2 +-
>>>>>> mm/shmem.c | 12 ++++++------
>>>>>> 2 files changed, 7 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index d3e66136e41a..a8cfa37cae72 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -166,7 +166,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>>>> * own flags.
>>>>>> */
>>>>>> if (!in_pf && shmem_file(vma->vm_file))
>>>>>> - return shmem_allowable_huge_orders(file_inode(vma->vm_file),
>>>>>> + return orders & shmem_allowable_huge_orders(file_inode(vma->vm_file),
>>>>>> vma, vma->vm_pgoff, 0,
>>>>>> !enforce_sysfs);
>>>>>
>>>>> OK, here orders is checked against allowed orders.
>>>>>
>>>>>>
>>>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>>>> index 4b42419ce6b2..4dbb28d85cd9 100644
>>>>>> --- a/mm/shmem.c
>>>>>> +++ b/mm/shmem.c
>>>>>> @@ -613,7 +613,7 @@ static unsigned int shmem_get_orders_within_size(struct inode *inode,
>>>>>> }
>>>>>>
>>>>>> static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
>>>>>> - loff_t write_end, bool shmem_huge_force,
>>>>>> + loff_t write_end,
>>>>>> struct vm_area_struct *vma,
>>>>>> unsigned long vm_flags)
>>>>>> {
>>>>>> @@ -625,7 +625,7 @@ static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index
>>>>>> return 0;
>>>>>> if (shmem_huge == SHMEM_HUGE_DENY)
>>>>>> return 0;
>>>>>> - if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
>>>>>> + if (shmem_huge == SHMEM_HUGE_FORCE)
>>>>>> return maybe_pmd_order;
>>>>>
>>>>> shmem_huge is set by sysfs?
>>>>
>>>> Yes, through the '/sys/kernel/mm/transparent_hugepage/shmem_enabled' interface.
>>>>
>>>>>> /*
>>>>>> @@ -860,7 +860,7 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
>>>>>> }
>>>>>>
>>>>>> static unsigned int shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
>>>>>> - loff_t write_end, bool shmem_huge_force,
>>>>>> + loff_t write_end,
>>>>>> struct vm_area_struct *vma,
>>>>>> unsigned long vm_flags)
>>>>>> {
>>>>>> @@ -1261,7 +1261,7 @@ static int shmem_getattr(struct mnt_idmap *idmap,
>>>>>> STATX_ATTR_NODUMP);
>>>>>> generic_fillattr(idmap, request_mask, inode, stat);
>>>>>>
>>>>>> - if (shmem_huge_global_enabled(inode, 0, 0, false, NULL, 0))
>>>>>> + if (shmem_huge_global_enabled(inode, 0, 0, NULL, 0))
>>>>>> stat->blksize = HPAGE_PMD_SIZE;
>>>>>>
>>>>>> if (request_mask & STATX_BTIME) {
>>>>>> @@ -1768,7 +1768,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
>>>>>> return 0;
>>>>>>
>>>>>> global_orders = shmem_huge_global_enabled(inode, index, write_end,
>>>>>> - shmem_huge_force, vma, vm_flags);
>>>>>> + vma, vm_flags);
>>>>>> /* Tmpfs huge pages allocation */
>>>>>> if (!vma || !vma_is_anon_shmem(vma))
>>>>>> return global_orders;
>>>>>> @@ -1790,7 +1790,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
>>>>>> /* Allow mTHP that will be fully within i_size. */
>>>>>> mask |= shmem_get_orders_within_size(inode, within_size_orders, index, 0);
>>>>>>
>>>>>> - if (vm_flags & VM_HUGEPAGE)
>>>>>> + if (shmem_huge_force || (vm_flags & VM_HUGEPAGE))
>>>>>> mask |= READ_ONCE(huge_shmem_orders_madvise);
>>>>>>
>>>>>> if (global_orders > 0)
>>>>>> --
>>>>>> 2.43.5
>>>>>
>>>>> shmem_huge_force comes from !enforce_sysfs in __thp_vma_allowable_orders().
>>>>> Do you know when sysfs is not enforced and why?
>>>>
>>>> IIUC, shmem_huge_force will only be set during MADV_COLLAPSE. Originally, MADV_COLLAPSE was intended to ignore the system-wide THP sysfs settings. However, if all system-wide shmem THP settings are disabled, we should not allow MADV_COLLAPSE to collapse a THP. This is the issue this patchset aims to fix. Thanks for the review.
>>>
>>> Got it. If we want to enforce sysfs, why not just get rid of TVA_ENFORCE_SYSFS
>>> and make everyone follow sysfs?
>>
>> Now MADV_COLLAPSE will ignore the VM_HUGEPAGE, while the others will check the VM_HUGEPAGE flag before using 'huge_shmem_orders_madvise' with the TVA_ENFORCE_SYSFS flag set.
>>
>> That is to follow the rule: “allowing for collapsing in a VM without VM_HUGEPAGE in the "madvise" mode would be fine".
>
> Can you add this rule in your commit message? It clarifies things.
Sure. Will do.
>> So I think we should still keep the TVA_ENFORCE_SYSFS flag.
>
> Got it. Thank you for the explanation.
>
> Acked-by: Zi Yan <ziy@nvidia.com>
Thanks for reviewing.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
2025-05-29 8:23 [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled Baolin Wang
2025-05-29 8:23 ` [PATCH 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs " Baolin Wang
2025-05-29 8:23 ` [PATCH 2/2] mm: shmem: disallow hugepages if the system-wide shmem " Baolin Wang
@ 2025-05-30 8:04 ` Ryan Roberts
2025-05-30 8:44 ` David Hildenbrand
2 siblings, 1 reply; 26+ messages in thread
From: Ryan Roberts @ 2025-05-30 8:04 UTC (permalink / raw)
To: Baolin Wang, akpm, hughd, david
Cc: lorenzo.stoakes, Liam.Howlett, npache, dev.jain, ziy, linux-mm,
linux-fsdevel, linux-kernel
On 29/05/2025 09:23, Baolin Wang wrote:
> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
> the system-wide anon/shmem THP sysfs settings, which means that even though
> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
> attempt to collapse into a anon/shmem THP. This violates the rule we have
> agreed upon: never means never. This patch set will address this issue.
This is a drive-by comment from me without having the previous context, but...
Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a deliberate
user-initiated, synchonous request to use huge pages for a range of memory.
There is nothing *transparent* about it, it just happens to be implemented using
the same logic that THP uses.
I always thought this was a deliberate design decision.
Thanks,
Ryan
>
> [1] https://lore.kernel.org/all/1f00fdc3-a3a3-464b-8565-4c1b23d34f8d@linux.alibaba.com/
>
> Baolin Wang (2):
> mm: huge_memory: disallow hugepages if the system-wide THP sysfs
> settings are disabled
> mm: shmem: disallow hugepages if the system-wide shmem THP sysfs
> settings are disabled
>
> include/linux/huge_mm.h | 23 +++++++++++++++++++----
> mm/huge_memory.c | 2 +-
> mm/shmem.c | 12 ++++++------
> 3 files changed, 26 insertions(+), 11 deletions(-)
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
2025-05-30 8:04 ` [PATCH 0/2] fix MADV_COLLAPSE issue if THP " Ryan Roberts
@ 2025-05-30 8:44 ` David Hildenbrand
2025-05-30 8:47 ` Lorenzo Stoakes
2025-05-30 8:59 ` Ryan Roberts
0 siblings, 2 replies; 26+ messages in thread
From: David Hildenbrand @ 2025-05-30 8:44 UTC (permalink / raw)
To: Ryan Roberts, Baolin Wang, akpm, hughd
Cc: lorenzo.stoakes, Liam.Howlett, npache, dev.jain, ziy, linux-mm,
linux-fsdevel, linux-kernel
On 30.05.25 10:04, Ryan Roberts wrote:
> On 29/05/2025 09:23, Baolin Wang wrote:
>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
>> the system-wide anon/shmem THP sysfs settings, which means that even though
>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
>> attempt to collapse into a anon/shmem THP. This violates the rule we have
>> agreed upon: never means never. This patch set will address this issue.
>
> This is a drive-by comment from me without having the previous context, but...
>
> Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a deliberate
> user-initiated, synchonous request to use huge pages for a range of memory.
> There is nothing *transparent* about it, it just happens to be implemented using
> the same logic that THP uses.
>
> I always thought this was a deliberate design decision.
If the admin said "never", then why should a user be able to overwrite that?
The design decision I recall is that if VM_NOHUGEPAGE is set, we'll
ignore that. Because that was set by the app itself (MADV_NOHUEPAGE).
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
2025-05-30 8:44 ` David Hildenbrand
@ 2025-05-30 8:47 ` Lorenzo Stoakes
2025-05-30 8:52 ` David Hildenbrand
2025-05-30 8:59 ` Ryan Roberts
1 sibling, 1 reply; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-05-30 8:47 UTC (permalink / raw)
To: David Hildenbrand
Cc: Ryan Roberts, Baolin Wang, akpm, hughd, Liam.Howlett, npache,
dev.jain, ziy, linux-mm, linux-fsdevel, linux-kernel
On Fri, May 30, 2025 at 10:44:36AM +0200, David Hildenbrand wrote:
> On 30.05.25 10:04, Ryan Roberts wrote:
> > On 29/05/2025 09:23, Baolin Wang wrote:
> > > As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
> > > the system-wide anon/shmem THP sysfs settings, which means that even though
> > > we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
> > > attempt to collapse into a anon/shmem THP. This violates the rule we have
> > > agreed upon: never means never. This patch set will address this issue.
> >
> > This is a drive-by comment from me without having the previous context, but...
> >
> > Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a deliberate
> > user-initiated, synchonous request to use huge pages for a range of memory.
> > There is nothing *transparent* about it, it just happens to be implemented using
> > the same logic that THP uses.
> >
> > I always thought this was a deliberate design decision.
>
> If the admin said "never", then why should a user be able to overwrite that?
>
> The design decision I recall is that if VM_NOHUGEPAGE is set, we'll ignore
> that. Because that was set by the app itself (MADV_NOHUEPAGE).
>
I'm with David on this one.
I think it's principal of least surprise - to me 'never' is pretty
emphatic, and keep in mind the other choices are 'always' and... 'madvise'
:) which explicitly is 'hey only do this if madvise tells you to'.
I'd be rather surprised if I hadn't set madvise and a user uses madvise to
in some fashion override the never.
I mean I think we all agree this interface is to use a technical term -
crap - and we need something a lot more fine-grained and smart, but I think
given the situation we're in we should make it at least as least surprising
as possible.
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
2025-05-30 8:47 ` Lorenzo Stoakes
@ 2025-05-30 8:52 ` David Hildenbrand
2025-05-30 9:07 ` Ryan Roberts
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-05-30 8:52 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Ryan Roberts, Baolin Wang, akpm, hughd, Liam.Howlett, npache,
dev.jain, ziy, linux-mm, linux-fsdevel, linux-kernel
On 30.05.25 10:47, Lorenzo Stoakes wrote:
> On Fri, May 30, 2025 at 10:44:36AM +0200, David Hildenbrand wrote:
>> On 30.05.25 10:04, Ryan Roberts wrote:
>>> On 29/05/2025 09:23, Baolin Wang wrote:
>>>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
>>>> the system-wide anon/shmem THP sysfs settings, which means that even though
>>>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
>>>> attempt to collapse into a anon/shmem THP. This violates the rule we have
>>>> agreed upon: never means never. This patch set will address this issue.
>>>
>>> This is a drive-by comment from me without having the previous context, but...
>>>
>>> Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a deliberate
>>> user-initiated, synchonous request to use huge pages for a range of memory.
>>> There is nothing *transparent* about it, it just happens to be implemented using
>>> the same logic that THP uses.
>>>
>>> I always thought this was a deliberate design decision.
>>
>> If the admin said "never", then why should a user be able to overwrite that?
>>
>> The design decision I recall is that if VM_NOHUGEPAGE is set, we'll ignore
>> that. Because that was set by the app itself (MADV_NOHUEPAGE).
>>
>
> I'm with David on this one.
>
> I think it's principal of least surprise - to me 'never' is pretty
> emphatic, and keep in mind the other choices are 'always' and... 'madvise'
> :) which explicitly is 'hey only do this if madvise tells you to'.
>
> I'd be rather surprised if I hadn't set madvise and a user uses madvise to
> in some fashion override the never.
>
> I mean I think we all agree this interface is to use a technical term -
> crap - and we need something a lot more fine-grained and smart, but I think
> given the situation we're in we should make it at least as least surprising
> as possible.
Yes. If you configure "never" you are supposed to suffer, consistently.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
2025-05-30 8:44 ` David Hildenbrand
2025-05-30 8:47 ` Lorenzo Stoakes
@ 2025-05-30 8:59 ` Ryan Roberts
2025-05-30 9:10 ` David Hildenbrand
1 sibling, 1 reply; 26+ messages in thread
From: Ryan Roberts @ 2025-05-30 8:59 UTC (permalink / raw)
To: David Hildenbrand, Baolin Wang, akpm, hughd
Cc: lorenzo.stoakes, Liam.Howlett, npache, dev.jain, ziy, linux-mm,
linux-fsdevel, linux-kernel
On 30/05/2025 09:44, David Hildenbrand wrote:
> On 30.05.25 10:04, Ryan Roberts wrote:
>> On 29/05/2025 09:23, Baolin Wang wrote:
>>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
>>> the system-wide anon/shmem THP sysfs settings, which means that even though
>>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
>>> attempt to collapse into a anon/shmem THP. This violates the rule we have
>>> agreed upon: never means never. This patch set will address this issue.
>>
>> This is a drive-by comment from me without having the previous context, but...
>>
>> Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a deliberate
>> user-initiated, synchonous request to use huge pages for a range of memory.
>> There is nothing *transparent* about it, it just happens to be implemented using
>> the same logic that THP uses.
>>
>> I always thought this was a deliberate design decision.
>
> If the admin said "never", then why should a user be able to overwrite that?
Well my interpretation would be that the admin is saying never *transparently*
give anyone any hugepages; on balance it does more harm than good for my
workloads. The toggle is called transparent_hugepage/enabled, after all.
Whereas MADV_COLLAPSE is deliberately applied to a specific region at an
opportune moment in time, presumably because the user knows that the region
*will* benefit and because that point in the execution is not sensitive to latency.
I see them as logically separate.
>
> The design decision I recall is that if VM_NOHUGEPAGE is set, we'll ignore that.
> Because that was set by the app itself (MADV_NOHUEPAGE).
Hmm, ok. My instinct would have been the opposite; MADV_NOHUGEPAGE means "I
don't want the risk of latency spikes and memory bloat that THP can cause". Not
"ignore my explicit requests to MADV_COLLAPSE".
But if that descision was already taken and that's the current behavior then I
agree we have an inconsistency with respect to the sysfs control.
Perhaps we should be guided by real world usage - AIUI there is a cloud that
disables THP at system level today (Google?). Is there any concern that there
are workloads in such environments that are using MADV_COLLAPSE today that would
then see a performance drop?
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
2025-05-30 8:52 ` David Hildenbrand
@ 2025-05-30 9:07 ` Ryan Roberts
2025-05-30 9:14 ` Lorenzo Stoakes
0 siblings, 1 reply; 26+ messages in thread
From: Ryan Roberts @ 2025-05-30 9:07 UTC (permalink / raw)
To: David Hildenbrand, Lorenzo Stoakes
Cc: Baolin Wang, akpm, hughd, Liam.Howlett, npache, dev.jain, ziy,
linux-mm, linux-fsdevel, linux-kernel
On 30/05/2025 09:52, David Hildenbrand wrote:
> On 30.05.25 10:47, Lorenzo Stoakes wrote:
>> On Fri, May 30, 2025 at 10:44:36AM +0200, David Hildenbrand wrote:
>>> On 30.05.25 10:04, Ryan Roberts wrote:
>>>> On 29/05/2025 09:23, Baolin Wang wrote:
>>>>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
>>>>> the system-wide anon/shmem THP sysfs settings, which means that even though
>>>>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
>>>>> attempt to collapse into a anon/shmem THP. This violates the rule we have
>>>>> agreed upon: never means never. This patch set will address this issue.
>>>>
>>>> This is a drive-by comment from me without having the previous context, but...
>>>>
>>>> Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a deliberate
>>>> user-initiated, synchonous request to use huge pages for a range of memory.
>>>> There is nothing *transparent* about it, it just happens to be implemented
>>>> using
>>>> the same logic that THP uses.
>>>>
>>>> I always thought this was a deliberate design decision.
>>>
>>> If the admin said "never", then why should a user be able to overwrite that?
>>>
>>> The design decision I recall is that if VM_NOHUGEPAGE is set, we'll ignore
>>> that. Because that was set by the app itself (MADV_NOHUEPAGE).
>>>
>>
>> I'm with David on this one.
>>
>> I think it's principal of least surprise - to me 'never' is pretty
>> emphatic, and keep in mind the other choices are 'always' and... 'madvise'
>> :) which explicitly is 'hey only do this if madvise tells you to'.
I think it's a bit reductive to suggest that enabled=madvise means all madvise
calls though. I don't think anyone would suggest MADV_DONTNEED should be ignored
if enabled=never. MADV_COLLAPSE just happens to be implemented on top of the THP
logic. But it's a different feature in my view.
>>
>> I'd be rather surprised if I hadn't set madvise and a user uses madvise to
>> in some fashion override the never.
>>
>> I mean I think we all agree this interface is to use a technical term -
>> crap - and we need something a lot more fine-grained and smart,
Yes agreed there!
>> but I think
>> given the situation we're in we should make it at least as least surprising
>> as possible.
>
> Yes. If you configure "never" you are supposed to suffer, consistently.
>
OK fair enough. Just giving my 2 cents.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
2025-05-30 8:59 ` Ryan Roberts
@ 2025-05-30 9:10 ` David Hildenbrand
2025-05-30 9:16 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-05-30 9:10 UTC (permalink / raw)
To: Ryan Roberts, Baolin Wang, akpm, hughd
Cc: lorenzo.stoakes, Liam.Howlett, npache, dev.jain, ziy, linux-mm,
linux-fsdevel, linux-kernel
On 30.05.25 10:59, Ryan Roberts wrote:
> On 30/05/2025 09:44, David Hildenbrand wrote:
>> On 30.05.25 10:04, Ryan Roberts wrote:
>>> On 29/05/2025 09:23, Baolin Wang wrote:
>>>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
>>>> the system-wide anon/shmem THP sysfs settings, which means that even though
>>>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
>>>> attempt to collapse into a anon/shmem THP. This violates the rule we have
>>>> agreed upon: never means never. This patch set will address this issue.
>>>
>>> This is a drive-by comment from me without having the previous context, but...
>>>
>>> Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a deliberate
>>> user-initiated, synchonous request to use huge pages for a range of memory.
>>> There is nothing *transparent* about it, it just happens to be implemented using
>>> the same logic that THP uses.
>>>
>>> I always thought this was a deliberate design decision.
>>
>> If the admin said "never", then why should a user be able to overwrite that?
>
> Well my interpretation would be that the admin is saying never *transparently*
> give anyone any hugepages; on balance it does more harm than good for my
> workloads. The toggle is called transparent_hugepage/enabled, after all.
I'd say it's "enabling transparent huge pages" not "transparently
enabling huge pages". After all, these things are ... transparent huge
pages.
But yeah, it's confusing.
>
> Whereas MADV_COLLAPSE is deliberately applied to a specific region at an
> opportune moment in time, presumably because the user knows that the region
> *will* benefit and because that point in the execution is not sensitive to latency.
Not sure if MADV_HUGEPAGE is really *that* different.
>
> I see them as logically separate.
>
>>
>> The design decision I recall is that if VM_NOHUGEPAGE is set, we'll ignore that.
>> Because that was set by the app itself (MADV_NOHUEPAGE).
>
> Hmm, ok. My instinct would have been the opposite; MADV_NOHUGEPAGE means "I
> don't want the risk of latency spikes and memory bloat that THP can cause". Not
> "ignore my explicit requests to MADV_COLLAPSE".
>
> But if that descision was already taken and that's the current behavior then I
> agree we have an inconsistency with respect to the sysfs control.
>
> Perhaps we should be guided by real world usage - AIUI there is a cloud that
> disables THP at system level today (Google?).
The use case I am aware of for disabling it for debugging purposes.
Saved us quite some headake in the past at customer sites for
troubleshooting + workarounds ...
Let's take a look at the man page:
MADV_COLLAPSE is independent of any sysfs (see sysfs(5)) setting
under /sys/kernel/mm/transparent_hugepage, both in terms of determining
THP eligibility, and allocation semantics.
I recall we discussed that it should ignore the max_ptes_none/swap/shared.
But "any" setting would include "enable" ...
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
2025-05-30 9:07 ` Ryan Roberts
@ 2025-05-30 9:14 ` Lorenzo Stoakes
0 siblings, 0 replies; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-05-30 9:14 UTC (permalink / raw)
To: Ryan Roberts
Cc: David Hildenbrand, Baolin Wang, akpm, hughd, Liam.Howlett, npache,
dev.jain, ziy, linux-mm, linux-fsdevel, linux-kernel
On Fri, May 30, 2025 at 10:07:11AM +0100, Ryan Roberts wrote:
> On 30/05/2025 09:52, David Hildenbrand wrote:
> > On 30.05.25 10:47, Lorenzo Stoakes wrote:
> >> On Fri, May 30, 2025 at 10:44:36AM +0200, David Hildenbrand wrote:
> >>> On 30.05.25 10:04, Ryan Roberts wrote:
> >>>> On 29/05/2025 09:23, Baolin Wang wrote:
> >>>>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
> >>>>> the system-wide anon/shmem THP sysfs settings, which means that even though
> >>>>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
> >>>>> attempt to collapse into a anon/shmem THP. This violates the rule we have
> >>>>> agreed upon: never means never. This patch set will address this issue.
> >>>>
> >>>> This is a drive-by comment from me without having the previous context, but...
> >>>>
> >>>> Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a deliberate
> >>>> user-initiated, synchonous request to use huge pages for a range of memory.
> >>>> There is nothing *transparent* about it, it just happens to be implemented
> >>>> using
> >>>> the same logic that THP uses.
> >>>>
> >>>> I always thought this was a deliberate design decision.
> >>>
> >>> If the admin said "never", then why should a user be able to overwrite that?
> >>>
> >>> The design decision I recall is that if VM_NOHUGEPAGE is set, we'll ignore
> >>> that. Because that was set by the app itself (MADV_NOHUEPAGE).
> >>>
> >>
> >> I'm with David on this one.
> >>
> >> I think it's principal of least surprise - to me 'never' is pretty
> >> emphatic, and keep in mind the other choices are 'always' and... 'madvise'
> >> :) which explicitly is 'hey only do this if madvise tells you to'.
>
> I think it's a bit reductive to suggest that enabled=madvise means all madvise
> calls though. I don't think anyone would suggest MADV_DONTNEED should be ignored
> if enabled=never. MADV_COLLAPSE just happens to be implemented on top of the THP
> logic. But it's a different feature in my view.
No I absolutely take your point, and indeed this is very reductive, but I think
that's a product of this interface being... sub-optimal.
if you dig into the docs for instance it's explicit about that referring to
MADV_[NO]HUGEPAGE.
But, as a user/sys-admin, I'd definitely find that surprising.
I think the intent of 'never' people is 'THP bad I don't want it' for whatever
reason that might be the case.
>
> >>
> >> I'd be rather surprised if I hadn't set madvise and a user uses madvise to
> >> in some fashion override the never.
> >>
> >> I mean I think we all agree this interface is to use a technical term -
> >> crap - and we need something a lot more fine-grained and smart,
>
> Yes agreed there!
>
> >> but I think
> >> given the situation we're in we should make it at least as least surprising
> >> as possible.
>
> >
> > Yes. If you configure "never" you are supposed to suffer, consistently.
> >
>
> OK fair enough. Just giving my 2 cents.
>
>
Your input is very welcome! We have made a mess here so it's good to talk it
through.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
2025-05-30 9:10 ` David Hildenbrand
@ 2025-05-30 9:16 ` David Hildenbrand
2025-05-30 9:34 ` Lorenzo Stoakes
2025-05-30 9:52 ` Baolin Wang
0 siblings, 2 replies; 26+ messages in thread
From: David Hildenbrand @ 2025-05-30 9:16 UTC (permalink / raw)
To: Ryan Roberts, Baolin Wang, akpm, hughd
Cc: lorenzo.stoakes, Liam.Howlett, npache, dev.jain, ziy, linux-mm,
linux-fsdevel, linux-kernel
On 30.05.25 11:10, David Hildenbrand wrote:
> On 30.05.25 10:59, Ryan Roberts wrote:
>> On 30/05/2025 09:44, David Hildenbrand wrote:
>>> On 30.05.25 10:04, Ryan Roberts wrote:
>>>> On 29/05/2025 09:23, Baolin Wang wrote:
>>>>> As we discussed in the previous thread [1], the MADV_COLLAPSE will ignore
>>>>> the system-wide anon/shmem THP sysfs settings, which means that even though
>>>>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE will still
>>>>> attempt to collapse into a anon/shmem THP. This violates the rule we have
>>>>> agreed upon: never means never. This patch set will address this issue.
>>>>
>>>> This is a drive-by comment from me without having the previous context, but...
>>>>
>>>> Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a deliberate
>>>> user-initiated, synchonous request to use huge pages for a range of memory.
>>>> There is nothing *transparent* about it, it just happens to be implemented using
>>>> the same logic that THP uses.
>>>>
>>>> I always thought this was a deliberate design decision.
>>>
>>> If the admin said "never", then why should a user be able to overwrite that?
>>
>> Well my interpretation would be that the admin is saying never *transparently*
>> give anyone any hugepages; on balance it does more harm than good for my
>> workloads. The toggle is called transparent_hugepage/enabled, after all.
>
> I'd say it's "enabling transparent huge pages" not "transparently
> enabling huge pages". After all, these things are ... transparent huge
> pages.
>
> But yeah, it's confusing.
>
>>
>> Whereas MADV_COLLAPSE is deliberately applied to a specific region at an
>> opportune moment in time, presumably because the user knows that the region
>> *will* benefit and because that point in the execution is not sensitive to latency.
>
> Not sure if MADV_HUGEPAGE is really *that* different.
>
>>
>> I see them as logically separate.
>>
>>>
>>> The design decision I recall is that if VM_NOHUGEPAGE is set, we'll ignore that.
>>> Because that was set by the app itself (MADV_NOHUEPAGE).
>>
>> Hmm, ok. My instinct would have been the opposite; MADV_NOHUGEPAGE means "I
>> don't want the risk of latency spikes and memory bloat that THP can cause". Not
>> "ignore my explicit requests to MADV_COLLAPSE".
>>
>> But if that descision was already taken and that's the current behavior then I
>> agree we have an inconsistency with respect to the sysfs control.
>>
>> Perhaps we should be guided by real world usage - AIUI there is a cloud that
>> disables THP at system level today (Google?).
> The use case I am aware of for disabling it for debugging purposes.
> Saved us quite some headake in the past at customer sites for
> troubleshooting + workarounds ...
>
>
> Let's take a look at the man page:
>
> MADV_COLLAPSE is independent of any sysfs (see sysfs(5)) setting
> under /sys/kernel/mm/transparent_hugepage, both in terms of determining
> THP eligibility, and allocation semantics.
>
> I recall we discussed that it should ignore the max_ptes_none/swap/shared.
>
> But "any" setting would include "enable" ...
It kind-of contradicts the linked
Documentation/admin-guide/mm/transhuge.rst, where we have this
*beautiful* comment
"Transparent Hugepage Support for anonymous memory can be entirely
disable (mostly for debugging purposes".
I mean, "entirely" is also pretty clear to me.
I would assume that the man page of MADV_COLLAPSE should have talked
about ignoring *khugepaged* toggles (max_ptes_none ...), at least that's
what I recall from the discussions back then.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
2025-05-30 9:16 ` David Hildenbrand
@ 2025-05-30 9:34 ` Lorenzo Stoakes
2025-05-30 9:52 ` Baolin Wang
1 sibling, 0 replies; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-05-30 9:34 UTC (permalink / raw)
To: David Hildenbrand
Cc: Ryan Roberts, Baolin Wang, akpm, hughd, Liam.Howlett, npache,
dev.jain, ziy, linux-mm, linux-fsdevel, linux-kernel
On Fri, May 30, 2025 at 11:16:51AM +0200, David Hildenbrand wrote:
[snip]
> It kind-of contradicts the linked
> Documentation/admin-guide/mm/transhuge.rst, where we have this *beautiful*
> comment
>
> "Transparent Hugepage Support for anonymous memory can be entirely disable
> (mostly for debugging purposes".
>
> I mean, "entirely" is also pretty clear to me.
>
> I would assume that the man page of MADV_COLLAPSE should have talked about
> ignoring *khugepaged* toggles (max_ptes_none ...), at least that's what I
> recall from the discussions back then.
Sorry I don't want to turn this stuff into too much of a mega-thread but
just a small comment here - I think we should go and update the
documentation/man pages to be clearer and consistent.
There is enough confusion around this as it is...
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
2025-05-30 9:16 ` David Hildenbrand
2025-05-30 9:34 ` Lorenzo Stoakes
@ 2025-05-30 9:52 ` Baolin Wang
2025-05-30 10:45 ` David Hildenbrand
1 sibling, 1 reply; 26+ messages in thread
From: Baolin Wang @ 2025-05-30 9:52 UTC (permalink / raw)
To: David Hildenbrand, Ryan Roberts, akpm, hughd
Cc: lorenzo.stoakes, Liam.Howlett, npache, dev.jain, ziy, linux-mm,
linux-fsdevel, linux-kernel
On 2025/5/30 17:16, David Hildenbrand wrote:
> On 30.05.25 11:10, David Hildenbrand wrote:
>> On 30.05.25 10:59, Ryan Roberts wrote:
>>> On 30/05/2025 09:44, David Hildenbrand wrote:
>>>> On 30.05.25 10:04, Ryan Roberts wrote:
>>>>> On 29/05/2025 09:23, Baolin Wang wrote:
>>>>>> As we discussed in the previous thread [1], the MADV_COLLAPSE will
>>>>>> ignore
>>>>>> the system-wide anon/shmem THP sysfs settings, which means that
>>>>>> even though
>>>>>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE
>>>>>> will still
>>>>>> attempt to collapse into a anon/shmem THP. This violates the rule
>>>>>> we have
>>>>>> agreed upon: never means never. This patch set will address this
>>>>>> issue.
>>>>>
>>>>> This is a drive-by comment from me without having the previous
>>>>> context, but...
>>>>>
>>>>> Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a
>>>>> deliberate
>>>>> user-initiated, synchonous request to use huge pages for a range of
>>>>> memory.
>>>>> There is nothing *transparent* about it, it just happens to be
>>>>> implemented using
>>>>> the same logic that THP uses.
>>>>>
>>>>> I always thought this was a deliberate design decision.
>>>>
>>>> If the admin said "never", then why should a user be able to
>>>> overwrite that?
>>>
>>> Well my interpretation would be that the admin is saying never
>>> *transparently*
>>> give anyone any hugepages; on balance it does more harm than good for my
>>> workloads. The toggle is called transparent_hugepage/enabled, after all.
>>
>> I'd say it's "enabling transparent huge pages" not "transparently
>> enabling huge pages". After all, these things are ... transparent huge
>> pages.
>>
>> But yeah, it's confusing.
>>
>>>
>>> Whereas MADV_COLLAPSE is deliberately applied to a specific region at an
>>> opportune moment in time, presumably because the user knows that the
>>> region
>>> *will* benefit and because that point in the execution is not
>>> sensitive to latency.
>>
>> Not sure if MADV_HUGEPAGE is really *that* different.
>>
>>>
>>> I see them as logically separate.
>>>
>>>>
>>>> The design decision I recall is that if VM_NOHUGEPAGE is set, we'll
>>>> ignore that.
>>>> Because that was set by the app itself (MADV_NOHUEPAGE).
IIUC, MADV_COLLAPSE does not ignore the VM_NOHUGEPAGE setting, if we set
VM_NOHUGEPAGE, then MADV_COLLAPSE will not be allowed to collapse a THP.
See:
__thp_vma_allowable_orders() ---> vma_thp_disabled()
>>> Hmm, ok. My instinct would have been the opposite; MADV_NOHUGEPAGE
>>> means "I
>>> don't want the risk of latency spikes and memory bloat that THP can
>>> cause". Not
>>> "ignore my explicit requests to MADV_COLLAPSE".
>>>
>>> But if that descision was already taken and that's the current
>>> behavior then I
>>> agree we have an inconsistency with respect to the sysfs control.
>>>
>>> Perhaps we should be guided by real world usage - AIUI there is a
>>> cloud that
>>> disables THP at system level today (Google?).
>> The use case I am aware of for disabling it for debugging purposes.
>> Saved us quite some headake in the past at customer sites for
>> troubleshooting + workarounds ...
>>
>>
>> Let's take a look at the man page:
>>
>> MADV_COLLAPSE is independent of any sysfs (see sysfs(5)) setting
>> under /sys/kernel/mm/transparent_hugepage, both in terms of determining
>> THP eligibility, and allocation semantics.
>>
>> I recall we discussed that it should ignore the
>> max_ptes_none/swap/shared.
>>
>> But "any" setting would include "enable" ...
>
> It kind-of contradicts the linked
> Documentation/admin-guide/mm/transhuge.rst, where we have this
> *beautiful* comment
>
> "Transparent Hugepage Support for anonymous memory can be entirely
> disable (mostly for debugging purposes".
>
> I mean, "entirely" is also pretty clear to me.
Yes, agree. We have encountered issues caused by THP in our Alibaba
fleet. The quickest way to stop the bleeding was to disable THP. In such
case, we do not expect MADV_HUGEPAGE to still collapse a THP.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled
2025-05-30 9:52 ` Baolin Wang
@ 2025-05-30 10:45 ` David Hildenbrand
0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2025-05-30 10:45 UTC (permalink / raw)
To: Baolin Wang, Ryan Roberts, akpm, hughd
Cc: lorenzo.stoakes, Liam.Howlett, npache, dev.jain, ziy, linux-mm,
linux-fsdevel, linux-kernel
On 30.05.25 11:52, Baolin Wang wrote:
>
>
> On 2025/5/30 17:16, David Hildenbrand wrote:
>> On 30.05.25 11:10, David Hildenbrand wrote:
>>> On 30.05.25 10:59, Ryan Roberts wrote:
>>>> On 30/05/2025 09:44, David Hildenbrand wrote:
>>>>> On 30.05.25 10:04, Ryan Roberts wrote:
>>>>>> On 29/05/2025 09:23, Baolin Wang wrote:
>>>>>>> As we discussed in the previous thread [1], the MADV_COLLAPSE will
>>>>>>> ignore
>>>>>>> the system-wide anon/shmem THP sysfs settings, which means that
>>>>>>> even though
>>>>>>> we have disabled the anon/shmem THP configuration, MADV_COLLAPSE
>>>>>>> will still
>>>>>>> attempt to collapse into a anon/shmem THP. This violates the rule
>>>>>>> we have
>>>>>>> agreed upon: never means never. This patch set will address this
>>>>>>> issue.
>>>>>>
>>>>>> This is a drive-by comment from me without having the previous
>>>>>> context, but...
>>>>>>
>>>>>> Surely MADV_COLLAPSE *should* ignore the THP sysfs settings? It's a
>>>>>> deliberate
>>>>>> user-initiated, synchonous request to use huge pages for a range of
>>>>>> memory.
>>>>>> There is nothing *transparent* about it, it just happens to be
>>>>>> implemented using
>>>>>> the same logic that THP uses.
>>>>>>
>>>>>> I always thought this was a deliberate design decision.
>>>>>
>>>>> If the admin said "never", then why should a user be able to
>>>>> overwrite that?
>>>>
>>>> Well my interpretation would be that the admin is saying never
>>>> *transparently*
>>>> give anyone any hugepages; on balance it does more harm than good for my
>>>> workloads. The toggle is called transparent_hugepage/enabled, after all.
>>>
>>> I'd say it's "enabling transparent huge pages" not "transparently
>>> enabling huge pages". After all, these things are ... transparent huge
>>> pages.
>>>
>>> But yeah, it's confusing.
>>>
>>>>
>>>> Whereas MADV_COLLAPSE is deliberately applied to a specific region at an
>>>> opportune moment in time, presumably because the user knows that the
>>>> region
>>>> *will* benefit and because that point in the execution is not
>>>> sensitive to latency.
>>>
>>> Not sure if MADV_HUGEPAGE is really *that* different.
>>>
>>>>
>>>> I see them as logically separate.
>>>>
>>>>>
>>>>> The design decision I recall is that if VM_NOHUGEPAGE is set, we'll
>>>>> ignore that.
>>>>> Because that was set by the app itself (MADV_NOHUEPAGE).
>
> IIUC, MADV_COLLAPSE does not ignore the VM_NOHUGEPAGE setting, if we set
> VM_NOHUGEPAGE, then MADV_COLLAPSE will not be allowed to collapse a THP.
> See:
> __thp_vma_allowable_orders() ---> vma_thp_disabled()
Interesting, maybe I misremember things.
Maybe because process_madvise() could try MADV_COLLAPSE on a different
process. And if that process as VM_NOHUGEPAGE set, it could be problematic.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2025-05-30 10:45 UTC | newest]
Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-29 8:23 [PATCH 0/2] fix MADV_COLLAPSE issue if THP settings are disabled Baolin Wang
2025-05-29 8:23 ` [PATCH 1/2] mm: huge_memory: disallow hugepages if the system-wide THP sysfs " Baolin Wang
2025-05-29 15:10 ` Zi Yan
2025-05-30 1:51 ` Baolin Wang
2025-05-30 2:04 ` Zi Yan
2025-05-30 2:21 ` Baolin Wang
2025-05-30 2:22 ` Zi Yan
2025-05-29 8:23 ` [PATCH 2/2] mm: shmem: disallow hugepages if the system-wide shmem " Baolin Wang
2025-05-29 15:21 ` Zi Yan
2025-05-30 1:58 ` Baolin Wang
2025-05-30 2:17 ` Zi Yan
2025-05-30 2:32 ` Baolin Wang
2025-05-30 2:35 ` Zi Yan
2025-05-30 2:39 ` Baolin Wang
2025-05-30 8:04 ` [PATCH 0/2] fix MADV_COLLAPSE issue if THP " Ryan Roberts
2025-05-30 8:44 ` David Hildenbrand
2025-05-30 8:47 ` Lorenzo Stoakes
2025-05-30 8:52 ` David Hildenbrand
2025-05-30 9:07 ` Ryan Roberts
2025-05-30 9:14 ` Lorenzo Stoakes
2025-05-30 8:59 ` Ryan Roberts
2025-05-30 9:10 ` David Hildenbrand
2025-05-30 9:16 ` David Hildenbrand
2025-05-30 9:34 ` Lorenzo Stoakes
2025-05-30 9:52 ` Baolin Wang
2025-05-30 10:45 ` David Hildenbrand
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).