* [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY
@ 2025-05-15 13:33 Usama Arif
2025-05-15 13:33 ` [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process Usama Arif
` (6 more replies)
0 siblings, 7 replies; 51+ messages in thread
From: Usama Arif @ 2025-05-15 13:33 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel,
linux-doc, kernel-team, Usama Arif
This series allows the THP policy of a process to be changed according to
the value passed in arg2. Each policy is inherited across fork+exec:
- PR_THP_POLICY_DEFAULT_HUGE: This will set the MMF2_THP_VMA_DEFAULT_HUGE
process flag which changes the default of new VMAs to be VM_HUGEPAGE. The
call also modifies all existing VMAs that are not VM_NOHUGEPAGE
to be VM_HUGEPAGE.
This allows systems where the global policy is set to "madvise"
to effectively have THPs always for the process. In an environment
where different types of workloads are stacked on the same machine
whose global policy is set to "madvise", this will allow workloads
that benefit from always having hugepages to do so, without regressing
those that don't.
- PR_THP_POLICY_DEFAULT_NOHUGE: This will set the MMF2_THP_VMA_DEFAULT_NOHUGE
process flag which changes the default of new VMAs to be VM_NOHUGEPAGE.
The call also modifies all existing VMAs that are not VM_HUGEPAGE
to be VM_NOHUGEPAGE.
This allows systems where the global policy is set to "always"
to effectively have THPs on madvise only for the process. In an
environment where different types of workloads are stacked on the
same machine whose global policy is set to "always", this will allow
workloads that benefit from having hugepages on an madvise basis only
to do so, without regressing those that benefit from having hugepages
always.
- PR_THP_POLICY_SYSTEM: This will clear the MMF2_THP_VMA_DEFAULT_HUGE
and MMF2_THP_VMA_DEFAULT_NOHUGE process flags.
These patches are required for rolling out hugepages in hyperscaler
configurations for workloads that benefit from them. In such configurations
workloads are stacked and a single global THP policy is likely to be used
across the entire fleet; the prctl allows that policy to be overridden per
process.
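As a minimal userspace sketch of the interface described above: the option and policy numbers are the ones proposed in this series, they are not in released kernel headers, and they may be assigned differently (or to something else) on any given kernel. The helper names thp_policy_name(), set_thp_policy() and get_thp_policy() are hypothetical and exist only for illustration.

```c
#include <sys/prctl.h>

/* Values proposed by this series; assumptions, not released UAPI. */
#ifndef PR_SET_THP_POLICY
#define PR_SET_THP_POLICY		78
#define PR_GET_THP_POLICY		79
#define PR_THP_POLICY_DEFAULT_HUGE	0
#define PR_THP_POLICY_DEFAULT_NOHUGE	1
#define PR_THP_POLICY_SYSTEM		2
#endif

/* Hypothetical helper: printable name for a policy value. */
const char *thp_policy_name(int policy)
{
	switch (policy) {
	case PR_THP_POLICY_DEFAULT_HUGE:
		return "default-huge";
	case PR_THP_POLICY_DEFAULT_NOHUGE:
		return "default-nohuge";
	case PR_THP_POLICY_SYSTEM:
		return "system";
	default:
		return "unknown";
	}
}

/*
 * Hypothetical wrapper: set the policy for the calling process.
 * arg3..arg5 must be zero. Returns 0 on success, -1 with errno set.
 */
int set_thp_policy(unsigned long policy)
{
	return prctl(PR_SET_THP_POLICY, policy, 0, 0, 0);
}

/*
 * Hypothetical wrapper: query the current policy.
 * Returns a PR_THP_POLICY_* value, or -1 with errno set.
 */
int get_thp_policy(void)
{
	return prctl(PR_GET_THP_POLICY, 0, 0, 0, 0);
}
```

A launcher would typically call set_thp_policy() before exec'ing the workload, relying on the inheritance across fork+exec described above.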
v1->v2:
- change from modifying the THP decision making for the process, to modifying
VMA flags only. This prevents further complicating the logic used to
determine THP order (Thanks David!)
- change from using a prctl per policy change to just using PR_SET_THP_POLICY
and arg2 to set the policy. (Zi Yan)
- Introduce PR_THP_POLICY_DEFAULT_NOHUGE and PR_THP_POLICY_SYSTEM
- Add selftests and documentation.
Usama Arif (6):
prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
prctl: introduce PR_THP_POLICY_DEFAULT_NOHUGE for the process
prctl: introduce PR_THP_POLICY_SYSTEM for the process
selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_NOHUGE
selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
docs: transhuge: document process level THP controls
Documentation/admin-guide/mm/transhuge.rst | 40 +++
include/linux/huge_mm.h | 4 +
include/linux/mm_types.h | 14 +
include/uapi/linux/prctl.h | 6 +
kernel/fork.c | 1 +
kernel/sys.c | 35 +++
mm/huge_memory.c | 56 ++++
mm/vma.c | 2 +
tools/include/uapi/linux/prctl.h | 6 +
.../trace/beauty/include/uapi/linux/prctl.h | 6 +
tools/testing/selftests/prctl/Makefile | 2 +-
tools/testing/selftests/prctl/thp_policy.c | 286 ++++++++++++++++++
12 files changed, 457 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/prctl/thp_policy.c
--
2.47.1
* [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 13:33 [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
@ 2025-05-15 13:33 ` Usama Arif
2025-05-15 14:40 ` Lorenzo Stoakes
2025-05-16 6:12 ` kernel test robot
2025-05-15 13:33 ` [PATCH 2/6] prctl: introduce PR_THP_POLICY_DEFAULT_NOHUGE " Usama Arif
` (5 subsequent siblings)
6 siblings, 2 replies; 51+ messages in thread
From: Usama Arif @ 2025-05-15 13:33 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel,
linux-doc, kernel-team, Usama Arif
This is set via the new PR_SET_THP_POLICY prctl.
This will set the MMF2_THP_VMA_DEFAULT_HUGE process flag
which changes the default of new VMAs to be VM_HUGEPAGE. The
call also modifies all existing VMAs that are not VM_NOHUGEPAGE
to be VM_HUGEPAGE. The policy is inherited during fork+exec.
This allows systems where the global policy is set to "madvise"
to effectively have THPs always for the process. In an environment
where different types of workloads are stacked on the same machine,
this will allow workloads that benefit from always having hugepages
to do so, without regressing those that don't.
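The effect on existing VMAs described above can be sketched as a small userspace model. This is not the kernel code: the MODEL_* flag bits are illustrative stand-ins for VM_HUGEPAGE and VM_NOHUGEPAGE, and model_default_huge() is a hypothetical name.

```c
#include <stddef.h>

/* Illustrative bits only; the kernel's VM_* values are different. */
#define MODEL_VM_HUGEPAGE	(1UL << 0)
#define MODEL_VM_NOHUGEPAGE	(1UL << 1)

/*
 * Model of the sweep over existing VMAs for PR_THP_POLICY_DEFAULT_HUGE:
 * every VMA that has not opted out via VM_NOHUGEPAGE gains VM_HUGEPAGE.
 */
void model_default_huge(unsigned long *vma_flags, size_t nr_vmas)
{
	for (size_t i = 0; i < nr_vmas; i++) {
		if (vma_flags[i] & MODEL_VM_NOHUGEPAGE)
			continue;	/* explicit opt-out is preserved */
		vma_flags[i] |= MODEL_VM_HUGEPAGE;
	}
}
```

The point of the model is that an explicit MADV_NOHUGEPAGE on a mapping still wins over the process-wide default.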
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
include/linux/huge_mm.h | 3 ++
include/linux/mm_types.h | 11 +++++++
include/uapi/linux/prctl.h | 4 +++
kernel/fork.c | 1 +
kernel/sys.c | 21 ++++++++++++
mm/huge_memory.c | 32 +++++++++++++++++++
mm/vma.c | 2 ++
tools/include/uapi/linux/prctl.h | 4 +++
.../trace/beauty/include/uapi/linux/prctl.h | 4 +++
9 files changed, 82 insertions(+)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..e652ad9ddbbd 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -260,6 +260,9 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
return orders;
}
+void vma_set_thp_policy(struct vm_area_struct *vma);
+void process_vmas_thp_default_huge(struct mm_struct *mm);
+
unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
unsigned long vm_flags,
unsigned long tva_flags,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e76bade9ebb1..2fe93965e761 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1066,6 +1066,7 @@ struct mm_struct {
mm_context_t context;
unsigned long flags; /* Must use atomic bitops to access */
+ unsigned long flags2;
#ifdef CONFIG_AIO
spinlock_t ioctx_lock;
@@ -1744,6 +1745,11 @@ enum {
MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
+#define MMF2_THP_VMA_DEFAULT_HUGE 0
+#define MMF2_THP_VMA_DEFAULT_HUGE_MASK (1 << MMF2_THP_VMA_DEFAULT_HUGE)
+
+#define MMF2_INIT_MASK (MMF2_THP_VMA_DEFAULT_HUGE_MASK)
+
static inline unsigned long mmf_init_flags(unsigned long flags)
{
if (flags & (1UL << MMF_HAS_MDWE_NO_INHERIT))
@@ -1752,4 +1758,9 @@ static inline unsigned long mmf_init_flags(unsigned long flags)
return flags & MMF_INIT_MASK;
}
+static inline unsigned long mmf2_init_flags(unsigned long flags)
+{
+ return flags & MMF2_INIT_MASK;
+}
+
#endif /* _LINUX_MM_TYPES_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 15c18ef4eb11..325c72f40a93 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -364,4 +364,8 @@ struct prctl_mm_map {
# define PR_TIMER_CREATE_RESTORE_IDS_ON 1
# define PR_TIMER_CREATE_RESTORE_IDS_GET 2
+#define PR_SET_THP_POLICY 78
+#define PR_GET_THP_POLICY 79
+#define PR_THP_POLICY_DEFAULT_HUGE 0
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9e4616dacd82..6e5f4a8869dc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1054,6 +1054,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
if (current->mm) {
mm->flags = mmf_init_flags(current->mm->flags);
+ mm->flags2 = mmf2_init_flags(current->mm->flags2);
mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
} else {
mm->flags = default_dump_filter;
diff --git a/kernel/sys.c b/kernel/sys.c
index c434968e9f5d..1115f258f253 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2658,6 +2658,27 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
clear_bit(MMF_DISABLE_THP, &me->mm->flags);
mmap_write_unlock(me->mm);
break;
+ case PR_GET_THP_POLICY:
+ if (arg2 || arg3 || arg4 || arg5)
+ return -EINVAL;
+ if (!!test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2))
+ error = PR_THP_POLICY_DEFAULT_HUGE;
+ break;
+ case PR_SET_THP_POLICY:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ if (mmap_write_lock_killable(me->mm))
+ return -EINTR;
+ switch (arg2) {
+ case PR_THP_POLICY_DEFAULT_HUGE:
+ set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
+ process_vmas_thp_default_huge(me->mm);
+ break;
+ default:
+ return -EINVAL;
+ }
+ mmap_write_unlock(me->mm);
+ break;
case PR_MPX_ENABLE_MANAGEMENT:
case PR_MPX_DISABLE_MANAGEMENT:
/* No longer implemented: */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2780a12b25f0..64f66d5295e8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -98,6 +98,38 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
}
+void vma_set_thp_policy(struct vm_area_struct *vma)
+{
+ struct mm_struct *mm = vma->vm_mm;
+
+ if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
+ vm_flags_set(vma, VM_HUGEPAGE);
+}
+
+static void vmas_thp_default_huge(struct mm_struct *mm)
+{
+ struct vm_area_struct *vma;
+ unsigned long vm_flags;
+
+ VMA_ITERATOR(vmi, mm, 0);
+ for_each_vma(vmi, vma) {
+ vm_flags = vma->vm_flags;
+ if (vm_flags & VM_NOHUGEPAGE)
+ continue;
+ vm_flags_set(vma, VM_HUGEPAGE);
+ }
+}
+
+void process_vmas_thp_default_huge(struct mm_struct *mm)
+{
+ if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
+ return;
+
+ set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2);
+ vmas_thp_default_huge(mm);
+}
+
+
unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
unsigned long vm_flags,
unsigned long tva_flags,
diff --git a/mm/vma.c b/mm/vma.c
index 1f2634b29568..101b19c96803 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2476,6 +2476,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
if (!vma_is_anonymous(vma))
khugepaged_enter_vma(vma, map->flags);
ksm_add_vma(vma);
+ vma_set_thp_policy(vma);
*vmap = vma;
return 0;
@@ -2705,6 +2706,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
mm->map_count++;
validate_mm(mm);
ksm_add_vma(vma);
+ vma_set_thp_policy(vma);
out:
perf_event_mmap(vma);
mm->total_vm += len >> PAGE_SHIFT;
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 35791791a879..f5945ebfe3f2 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -328,4 +328,8 @@ struct prctl_mm_map {
# define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */
# define PR_PPC_DEXCR_CTRL_MASK 0x1f
+#define PR_SET_THP_POLICY 78
+#define PR_GET_THP_POLICY 79
+#define PR_THP_POLICY_DEFAULT_HUGE 0
+
#endif /* _LINUX_PRCTL_H */
diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
index 15c18ef4eb11..325c72f40a93 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
@@ -364,4 +364,8 @@ struct prctl_mm_map {
# define PR_TIMER_CREATE_RESTORE_IDS_ON 1
# define PR_TIMER_CREATE_RESTORE_IDS_GET 2
+#define PR_SET_THP_POLICY 78
+#define PR_GET_THP_POLICY 79
+#define PR_THP_POLICY_DEFAULT_HUGE 0
+
#endif /* _LINUX_PRCTL_H */
--
2.47.1
* [PATCH 2/6] prctl: introduce PR_THP_POLICY_DEFAULT_NOHUGE for the process
2025-05-15 13:33 [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
2025-05-15 13:33 ` [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process Usama Arif
@ 2025-05-15 13:33 ` Usama Arif
2025-05-16 8:19 ` kernel test robot
2025-05-15 13:33 ` [PATCH 3/6] prctl: introduce PR_THP_POLICY_SYSTEM " Usama Arif
` (4 subsequent siblings)
6 siblings, 1 reply; 51+ messages in thread
From: Usama Arif @ 2025-05-15 13:33 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel,
linux-doc, kernel-team, Usama Arif
This is set via the new PR_SET_THP_POLICY prctl.
This will set the MMF2_THP_VMA_DEFAULT_NOHUGE process flag
which changes the default of new VMAs to be VM_NOHUGEPAGE. The
call also modifies all existing VMAs that are not VM_HUGEPAGE
to be VM_NOHUGEPAGE. The policy is inherited during fork+exec.
This allows systems where the global policy is set to "always"
to effectively have THPs on madvise only for the process. In an
environment where different types of workloads are stacked on the
same machine, this will allow workloads that benefit from having
hugepages on an madvise basis only to do so, without regressing those
that benefit from having hugepages always.
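Why marking VMAs VM_NOHUGEPAGE under a global "always" policy behaves like "madvise" can be sketched with a simplified userspace model. It assumes only two inputs (the global policy and the per-VMA flags) and ignores every other eligibility check the kernel performs; all MODEL_* and model_* names are hypothetical.

```c
#include <stdbool.h>

enum model_global {
	MODEL_GLOBAL_ALWAYS,
	MODEL_GLOBAL_MADVISE,
	MODEL_GLOBAL_NEVER,
};

/* Illustrative bits only; the kernel's VM_* values are different. */
#define MODEL_VM_HUGEPAGE	(1UL << 0)
#define MODEL_VM_NOHUGEPAGE	(1UL << 1)

/* Simplified eligibility: per-VMA opt-out first, then the global policy. */
bool model_thp_eligible(enum model_global global, unsigned long vma_flags)
{
	if (vma_flags & MODEL_VM_NOHUGEPAGE)
		return false;

	switch (global) {
	case MODEL_GLOBAL_ALWAYS:
		return true;
	case MODEL_GLOBAL_MADVISE:
		return (vma_flags & MODEL_VM_HUGEPAGE) != 0;
	default:
		return false;
	}
}

/* Model of madvise(MADV_HUGEPAGE): set HUGEPAGE, clear NOHUGEPAGE. */
unsigned long model_madv_hugepage(unsigned long vma_flags)
{
	return (vma_flags & ~MODEL_VM_NOHUGEPAGE) | MODEL_VM_HUGEPAGE;
}
```

In this model a VMA created with the NOHUGE default is ineligible until the application calls MADV_HUGEPAGE on it, which is the madvise-only behaviour the commit message describes.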
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
include/linux/huge_mm.h | 1 +
include/linux/mm_types.h | 5 +++-
include/uapi/linux/prctl.h | 1 +
kernel/sys.c | 8 +++++++
mm/huge_memory.c | 24 +++++++++++++++++++
tools/include/uapi/linux/prctl.h | 1 +
.../trace/beauty/include/uapi/linux/prctl.h | 1 +
7 files changed, 40 insertions(+), 1 deletion(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e652ad9ddbbd..d46bba282701 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -262,6 +262,7 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
void vma_set_thp_policy(struct vm_area_struct *vma);
void process_vmas_thp_default_huge(struct mm_struct *mm);
+void process_vmas_thp_default_nohuge(struct mm_struct *mm);
unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
unsigned long vm_flags,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2fe93965e761..5e770411d8d1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1747,8 +1747,11 @@ enum {
#define MMF2_THP_VMA_DEFAULT_HUGE 0
#define MMF2_THP_VMA_DEFAULT_HUGE_MASK (1 << MMF2_THP_VMA_DEFAULT_HUGE)
+#define MMF2_THP_VMA_DEFAULT_NOHUGE 1
+#define MMF2_THP_VMA_DEFAULT_NOHUGE_MASK (1 << MMF2_THP_VMA_DEFAULT_NOHUGE)
-#define MMF2_INIT_MASK (MMF2_THP_VMA_DEFAULT_HUGE_MASK)
+#define MMF2_INIT_MASK (MMF2_THP_VMA_DEFAULT_HUGE_MASK |\
+ MMF2_THP_VMA_DEFAULT_NOHUGE_MASK)
static inline unsigned long mmf_init_flags(unsigned long flags)
{
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 325c72f40a93..d25458f4db9e 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -367,5 +367,6 @@ struct prctl_mm_map {
#define PR_SET_THP_POLICY 78
#define PR_GET_THP_POLICY 79
#define PR_THP_POLICY_DEFAULT_HUGE 0
+#define PR_THP_POLICY_DEFAULT_NOHUGE 1
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 1115f258f253..d91203e6dd0d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2663,6 +2663,8 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
if (!!test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2))
error = PR_THP_POLICY_DEFAULT_HUGE;
+ else if (!!test_bit(MMF2_THP_VMA_DEFAULT_NOHUGE, &me->mm->flags2))
+ error = PR_THP_POLICY_DEFAULT_NOHUGE;
break;
case PR_SET_THP_POLICY:
if (arg3 || arg4 || arg5)
@@ -2672,8 +2674,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
switch (arg2) {
case PR_THP_POLICY_DEFAULT_HUGE:
set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
+ clear_bit(MMF2_THP_VMA_DEFAULT_NOHUGE, &me->mm->flags2);
process_vmas_thp_default_huge(me->mm);
break;
+ case PR_THP_POLICY_DEFAULT_NOHUGE:
+ clear_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
+ set_bit(MMF2_THP_VMA_DEFAULT_NOHUGE, &me->mm->flags2);
+ process_vmas_thp_default_nohuge(me->mm);
+ break;
default:
return -EINVAL;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 64f66d5295e8..9d70a365ced3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -104,6 +104,8 @@ void vma_set_thp_policy(struct vm_area_struct *vma)
if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
vm_flags_set(vma, VM_HUGEPAGE);
+ else if (test_bit(MMF2_THP_VMA_DEFAULT_NOHUGE, &mm->flags2))
+ vm_flags_set(vma, VM_NOHUGEPAGE);
}
static void vmas_thp_default_huge(struct mm_struct *mm)
@@ -129,6 +131,28 @@ void process_vmas_thp_default_huge(struct mm_struct *mm)
vmas_thp_default_huge(mm);
}
+static void vmas_thp_default_nohuge(struct mm_struct *mm)
+{
+ struct vm_area_struct *vma;
+ unsigned long vm_flags;
+
+ VMA_ITERATOR(vmi, mm, 0);
+ for_each_vma(vmi, vma) {
+ vm_flags = vma->vm_flags;
+ if (vm_flags & VM_HUGEPAGE)
+ continue;
+ vm_flags_set(vma, VM_NOHUGEPAGE);
+ }
+}
+
+void process_vmas_thp_default_nohuge(struct mm_struct *mm)
+{
+ if (test_bit(MMF2_THP_VMA_DEFAULT_NOHUGE, &mm->flags2))
+ return;
+
+ set_bit(MMF2_THP_VMA_DEFAULT_NOHUGE, &mm->flags2);
+ vmas_thp_default_nohuge(mm);
+}
unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
unsigned long vm_flags,
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index f5945ebfe3f2..e03d0ed890c5 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -331,5 +331,6 @@ struct prctl_mm_map {
#define PR_SET_THP_POLICY 78
#define PR_GET_THP_POLICY 79
#define PR_THP_POLICY_DEFAULT_HUGE 0
+#define PR_THP_POLICY_DEFAULT_NOHUGE 1
#endif /* _LINUX_PRCTL_H */
diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
index 325c72f40a93..d25458f4db9e 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
@@ -367,5 +367,6 @@ struct prctl_mm_map {
#define PR_SET_THP_POLICY 78
#define PR_GET_THP_POLICY 79
#define PR_THP_POLICY_DEFAULT_HUGE 0
+#define PR_THP_POLICY_DEFAULT_NOHUGE 1
#endif /* _LINUX_PRCTL_H */
--
2.47.1
* [PATCH 3/6] prctl: introduce PR_THP_POLICY_SYSTEM for the process
2025-05-15 13:33 [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
2025-05-15 13:33 ` [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process Usama Arif
2025-05-15 13:33 ` [PATCH 2/6] prctl: introduce PR_THP_POLICY_DEFAULT_NOHUGE " Usama Arif
@ 2025-05-15 13:33 ` Usama Arif
2025-05-15 13:33 ` [PATCH 4/6] selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_NOHUGE Usama Arif
` (3 subsequent siblings)
6 siblings, 0 replies; 51+ messages in thread
From: Usama Arif @ 2025-05-15 13:33 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel,
linux-doc, kernel-team, Usama Arif
This is set via the new PR_SET_THP_POLICY prctl.
This will clear both the MMF2_THP_VMA_DEFAULT_NOHUGE and
MMF2_THP_VMA_DEFAULT_HUGE process flags, so that the defaults for
new VMAs of the process once again follow the system policy.
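The mapping between the two process flag bits and the value reported by PR_GET_THP_POLICY can be sketched in userspace as follows. The bit numbers and policy values are the ones used in this series; model_get_policy() and model_set_policy() are hypothetical names, and the model deliberately leaves out locking and the VMA sweep.

```c
/* Bit numbers and policy values as proposed in this series. */
#define MMF2_THP_VMA_DEFAULT_HUGE	0
#define MMF2_THP_VMA_DEFAULT_NOHUGE	1

#ifndef PR_THP_POLICY_DEFAULT_HUGE
#define PR_THP_POLICY_DEFAULT_HUGE	0
#define PR_THP_POLICY_DEFAULT_NOHUGE	1
#define PR_THP_POLICY_SYSTEM		2
#endif

/* Model of PR_GET_THP_POLICY: HUGE is checked first, then NOHUGE. */
int model_get_policy(unsigned long flags2)
{
	if (flags2 & (1UL << MMF2_THP_VMA_DEFAULT_HUGE))
		return PR_THP_POLICY_DEFAULT_HUGE;
	if (flags2 & (1UL << MMF2_THP_VMA_DEFAULT_NOHUGE))
		return PR_THP_POLICY_DEFAULT_NOHUGE;
	return PR_THP_POLICY_SYSTEM;
}

/*
 * Model of the flag updates done by PR_SET_THP_POLICY.
 * Returns 0 on success, -1 for an unknown policy (flags left untouched).
 */
int model_set_policy(unsigned long *flags2, unsigned long policy)
{
	const unsigned long huge = 1UL << MMF2_THP_VMA_DEFAULT_HUGE;
	const unsigned long nohuge = 1UL << MMF2_THP_VMA_DEFAULT_NOHUGE;

	switch (policy) {
	case PR_THP_POLICY_DEFAULT_HUGE:
		*flags2 = (*flags2 | huge) & ~nohuge;
		return 0;
	case PR_THP_POLICY_DEFAULT_NOHUGE:
		*flags2 = (*flags2 | nohuge) & ~huge;
		return 0;
	case PR_THP_POLICY_SYSTEM:
		*flags2 &= ~(huge | nohuge);
		return 0;
	default:
		return -1;
	}
}
```

The two bits are mutually exclusive in this model, so the three policies map one-to-one onto the three reachable flag states.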
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
include/uapi/linux/prctl.h | 1 +
kernel/sys.c | 6 ++++++
tools/include/uapi/linux/prctl.h | 1 +
tools/perf/trace/beauty/include/uapi/linux/prctl.h | 1 +
4 files changed, 9 insertions(+)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index d25458f4db9e..340d5ff769a9 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -368,5 +368,6 @@ struct prctl_mm_map {
#define PR_GET_THP_POLICY 79
#define PR_THP_POLICY_DEFAULT_HUGE 0
#define PR_THP_POLICY_DEFAULT_NOHUGE 1
+#define PR_THP_POLICY_SYSTEM 2
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index d91203e6dd0d..d556cdea97c4 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2665,6 +2665,8 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = PR_THP_POLICY_DEFAULT_HUGE;
else if (!!test_bit(MMF2_THP_VMA_DEFAULT_NOHUGE, &me->mm->flags2))
error = PR_THP_POLICY_DEFAULT_NOHUGE;
+ else
+ error = PR_THP_POLICY_SYSTEM;
break;
case PR_SET_THP_POLICY:
if (arg3 || arg4 || arg5)
@@ -2682,6 +2684,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
set_bit(MMF2_THP_VMA_DEFAULT_NOHUGE, &me->mm->flags2);
process_vmas_thp_default_nohuge(me->mm);
break;
+ case PR_THP_POLICY_SYSTEM:
+ clear_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
+ clear_bit(MMF2_THP_VMA_DEFAULT_NOHUGE, &me->mm->flags2);
+ break;
default:
return -EINVAL;
}
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index e03d0ed890c5..cc209c9a8afb 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -332,5 +332,6 @@ struct prctl_mm_map {
#define PR_GET_THP_POLICY 79
#define PR_THP_POLICY_DEFAULT_HUGE 0
#define PR_THP_POLICY_DEFAULT_NOHUGE 1
+#define PR_THP_POLICY_SYSTEM 2
#endif /* _LINUX_PRCTL_H */
diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
index d25458f4db9e..340d5ff769a9 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
@@ -368,5 +368,6 @@ struct prctl_mm_map {
#define PR_GET_THP_POLICY 79
#define PR_THP_POLICY_DEFAULT_HUGE 0
#define PR_THP_POLICY_DEFAULT_NOHUGE 1
+#define PR_THP_POLICY_SYSTEM 2
#endif /* _LINUX_PRCTL_H */
--
2.47.1
* [PATCH 4/6] selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_NOHUGE
2025-05-15 13:33 [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
` (2 preceding siblings ...)
2025-05-15 13:33 ` [PATCH 3/6] prctl: introduce PR_THP_POLICY_SYSTEM " Usama Arif
@ 2025-05-15 13:33 ` Usama Arif
2025-05-15 13:33 ` [PATCH 5/6] selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE Usama Arif
` (2 subsequent siblings)
6 siblings, 0 replies; 51+ messages in thread
From: Usama Arif @ 2025-05-15 13:33 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel,
linux-doc, kernel-team, Usama Arif
The test is limited to 2M PMD THPs. It does not modify the system
settings in order to not disturb other processes running in the system.
It checks if the PMD size is 2M, if the 2M policy is set to inherit
and if the system global THP policy is set to "always", so that
the change in behaviour due to PR_THP_POLICY_DEFAULT_NOHUGE can
be seen.
This tests that:
- the process can successfully set the policy
- the policy is carried over to the new process across fork
- no hugepage is obtained when the process doesn't use MADV_HUGEPAGE
- a hugepage is obtained when the process does use MADV_HUGEPAGE
- the process can successfully reset the policy to PR_THP_POLICY_SYSTEM
- a hugepage is obtained after the policy reset
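The "hugepage obtained" checks above come down to finding an AnonHugePages entry in /proc/self/smaps that covers the whole test buffer. The selftest matches the literal string "24576 kB"; the sketch below is an alternative that parses the number instead. parse_anon_huge_kb() is a hypothetical helper, not part of the patch.

```c
#include <stdio.h>

/* Same buffer size as the selftest: 12 PMD-sized (2MB) pages. */
#define BUF_SIZE	(12 * 2 * 1024 * 1024)
#define BUF_SIZE_KB	(BUF_SIZE / 1024)	/* 24576 kB */

/*
 * Hypothetical helper: parse one smaps line of the form
 * "AnonHugePages:     24576 kB".
 * Returns the size in kB, or -1 if the line is a different field.
 */
long parse_anon_huge_kb(const char *line)
{
	long kb;

	if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
		return kb;
	return -1;
}
```

A caller would treat the buffer as fully THP-backed when some line yields a value equal to BUF_SIZE_KB.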
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
tools/testing/selftests/prctl/Makefile | 2 +-
tools/testing/selftests/prctl/thp_policy.c | 214 +++++++++++++++++++++
2 files changed, 215 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/prctl/thp_policy.c
diff --git a/tools/testing/selftests/prctl/Makefile b/tools/testing/selftests/prctl/Makefile
index 01dc90fbb509..ee8c98e45b53 100644
--- a/tools/testing/selftests/prctl/Makefile
+++ b/tools/testing/selftests/prctl/Makefile
@@ -5,7 +5,7 @@ ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)
ifeq ($(ARCH),x86)
TEST_PROGS := disable-tsc-ctxt-sw-stress-test disable-tsc-on-off-stress-test \
- disable-tsc-test set-anon-vma-name-test set-process-name
+ disable-tsc-test set-anon-vma-name-test set-process-name thp_policy
all: $(TEST_PROGS)
include ../lib.mk
diff --git a/tools/testing/selftests/prctl/thp_policy.c b/tools/testing/selftests/prctl/thp_policy.c
new file mode 100644
index 000000000000..e39872a6d429
--- /dev/null
+++ b/tools/testing/selftests/prctl/thp_policy.c
@@ -0,0 +1,214 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This test covers the PR_GET/SET_THP_POLICY functionality of prctl calls
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+
+#ifndef PR_SET_THP_POLICY
+#define PR_SET_THP_POLICY 78
+#define PR_GET_THP_POLICY 79
+#define PR_THP_POLICY_DEFAULT_HUGE 0
+#define PR_THP_POLICY_DEFAULT_NOHUGE 1
+#define PR_THP_POLICY_SYSTEM 2
+#endif
+
+#define CONTENT_SIZE 256
+#define BUF_SIZE (12 * 2 * 1024 * 1024) // 12 x 2MB pages
+
+enum system_policy {
+ SYSTEM_POLICY_ALWAYS,
+ SYSTEM_POLICY_MADVISE,
+ SYSTEM_POLICY_NEVER,
+};
+
+int system_thp_policy;
+
+/* check if the sysfs file contains the expected substring */
+static int check_file_content(const char *file_path, const char *expected_substring)
+{
+ FILE *file = fopen(file_path, "r");
+ char buffer[CONTENT_SIZE];
+
+ if (!file) {
+ perror("Failed to open file");
+ return -1;
+ }
+ if (fgets(buffer, CONTENT_SIZE, file) == NULL) {
+ perror("Failed to read file");
+ fclose(file);
+ return -1;
+ }
+ fclose(file);
+ // Remove newline character from the buffer
+ buffer[strcspn(buffer, "\n")] = '\0';
+ if (strstr(buffer, expected_substring))
+ return 0;
+ else
+ return 1;
+}
+
+/*
+ * The test is designed for 2M hugepages only.
+ * Check if hugepage size is 2M, if 2M size inherits from global
+ * setting, and if the global setting is madvise or always.
+ */
+static int sysfs_check(void)
+{
+ int res = 0;
+
+ res = check_file_content("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "2097152");
+ if (res) {
+ printf("hpage_pmd_size is not set to 2MB. Skipping test.\n");
+ return -1;
+ }
+ res |= check_file_content("/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled",
+ "[inherit]");
+ if (res) {
+ printf("hugepages-2048kB does not inherit global setting. Skipping test.\n");
+ return -1;
+ }
+
+ res = check_file_content("/sys/kernel/mm/transparent_hugepage/enabled", "[madvise]");
+ if (!res) {
+ system_thp_policy = SYSTEM_POLICY_MADVISE;
+ return 0;
+ }
+ res = check_file_content("/sys/kernel/mm/transparent_hugepage/enabled", "[always]");
+ if (!res) {
+ system_thp_policy = SYSTEM_POLICY_ALWAYS;
+ return 0;
+ }
+ printf("Global THP policy not set to madvise or always. Skipping test.\n");
+ return -1;
+}
+
+static int check_smaps_for_huge(void)
+{
+ FILE *file = fopen("/proc/self/smaps", "r");
+ int is_anonhuge = 0;
+ char line[256];
+
+ if (!file) {
+ perror("fopen");
+ return -1;
+ }
+
+ while (fgets(line, sizeof(line), file)) {
+ if (strstr(line, "AnonHugePages:") && strstr(line, "24576 kB")) {
+ is_anonhuge = 1;
+ break;
+ }
+ }
+ fclose(file);
+ return is_anonhuge;
+}
+
+static int test_mmap_thp(int madvise_buffer)
+{
+ int is_anonhuge;
+
+ char *buffer = (char *)mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (buffer == MAP_FAILED) {
+ perror("mmap");
+ return -1;
+ }
+ if (madvise_buffer)
+ madvise(buffer, BUF_SIZE, MADV_HUGEPAGE);
+
+ // set memory to ensure it's allocated
+ memset(buffer, 0, BUF_SIZE);
+ is_anonhuge = check_smaps_for_huge();
+ munmap(buffer, BUF_SIZE);
+ return is_anonhuge;
+}
+
+/* Global policy is always, process is changed to NOHUGE (process becomes madvise) */
+static int test_global_always_process_nohuge(void)
+{
+ int is_anonhuge = 0, res = 0, status = 0;
+ pid_t pid;
+
+ if (prctl(PR_SET_THP_POLICY, PR_THP_POLICY_DEFAULT_NOHUGE, NULL, NULL, NULL) != 0) {
+ perror("prctl failed to set policy to madvise");
+ return -1;
+ }
+
+ /* Make sure prctl changes are carried across fork */
+ pid = fork();
+ if (pid < 0) {
+ perror("fork");
+ exit(EXIT_FAILURE);
+ }
+
+ res = prctl(PR_GET_THP_POLICY, NULL, NULL, NULL, NULL);
+ if (res != PR_THP_POLICY_DEFAULT_NOHUGE) {
+ printf("prctl PR_GET_THP_POLICY returned %d pid %d\n", res, pid);
+ goto err_out;
+ }
+
+ /* global = always, process = madvise, we shouldn't get HPs without madvise */
+ is_anonhuge = test_mmap_thp(0);
+ if (is_anonhuge) {
+ printf(
+ "PR_THP_POLICY_DEFAULT_NOHUGE set but still got hugepages without MADV_HUGEPAGE\n");
+ goto err_out;
+ }
+
+ is_anonhuge = test_mmap_thp(1);
+ if (!is_anonhuge) {
+ printf(
+ "PR_THP_POLICY_DEFAULT_NOHUGE set but didn't get hugepages with MADV_HUGEPAGE\n");
+ goto err_out;
+ }
+
+ /* Reset to system policy */
+ if (prctl(PR_SET_THP_POLICY, PR_THP_POLICY_SYSTEM, NULL, NULL, NULL) != 0) {
+ perror("prctl failed to set policy to system");
+ goto err_out;
+ }
+
+ is_anonhuge = test_mmap_thp(0);
+ if (!is_anonhuge) {
+ printf("global policy is always but we still didn't get hugepages\n");
+ goto err_out;
+ }
+
+ is_anonhuge = test_mmap_thp(1);
+ if (!is_anonhuge) {
+ printf("global policy is always but we still didn't get hugepages\n");
+ goto err_out;
+ }
+
+ if (pid == 0) {
+ exit(EXIT_SUCCESS);
+ } else {
+ wait(&status);
+ if (WIFEXITED(status))
+ return 0;
+ else
+ return -1;
+ }
+
+err_out:
+ if (pid == 0)
+ exit(EXIT_FAILURE);
+ else
+ return -1;
+}
+
+int main(void)
+{
+ if (sysfs_check())
+ return 0;
+
+ if (system_thp_policy == SYSTEM_POLICY_ALWAYS)
+ return test_global_always_process_nohuge();
+
+}
--
2.47.1
* [PATCH 5/6] selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
2025-05-15 13:33 [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
` (3 preceding siblings ...)
2025-05-15 13:33 ` [PATCH 4/6] selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_NOHUGE Usama Arif
@ 2025-05-15 13:33 ` Usama Arif
2025-05-15 13:33 ` [PATCH 6/6] docs: transhuge: document process level THP controls Usama Arif
2025-05-15 13:55 ` [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY Lorenzo Stoakes
6 siblings, 0 replies; 51+ messages in thread
From: Usama Arif @ 2025-05-15 13:33 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel,
linux-doc, kernel-team, Usama Arif
The test is limited to 2M PMD THPs. It does not modify the system
settings in order to not disturb other processes running in the
system.
It runs if the PMD size is 2M, if the 2M policy is set to inherit
and if the system global THP policy is set to "madvise", so that
the change in behaviour due to PR_THP_POLICY_DEFAULT_HUGE can
be seen.
This tests that:
- the process can successfully set the policy
- the policy is carried over to the new process across fork
- a hugepage is obtained both with and without madvise
- the process can successfully reset the policy to
PR_THP_POLICY_SYSTEM
- after the policy reset, a hugepage is obtained only with MADV_HUGEPAGE
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
tools/testing/selftests/prctl/thp_policy.c | 74 +++++++++++++++++++++-
1 file changed, 73 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/prctl/thp_policy.c b/tools/testing/selftests/prctl/thp_policy.c
index e39872a6d429..65c87f99423e 100644
--- a/tools/testing/selftests/prctl/thp_policy.c
+++ b/tools/testing/selftests/prctl/thp_policy.c
@@ -203,6 +203,77 @@ static int test_global_always_process_nohuge(void)
return -1;
}
+/* Global policy is madvise, process is changed to HUGE (process becomes always) */
+static int test_global_madvise_process_huge(void)
+{
+ int is_anonhuge = 0, res = 0, status = 0;
+ pid_t pid;
+
+ if (prctl(PR_SET_THP_POLICY, PR_THP_POLICY_DEFAULT_HUGE, NULL, NULL, NULL) != 0) {
+ perror("prctl failed to set process policy to always");
+ return -1;
+ }
+
+ /* Make sure prctl changes are carried across fork */
+ pid = fork();
+ if (pid < 0) {
+ perror("fork");
+ exit(EXIT_FAILURE);
+ }
+
+ res = prctl(PR_GET_THP_POLICY, NULL, NULL, NULL, NULL);
+ if (res != PR_THP_POLICY_DEFAULT_HUGE) {
+ printf("prctl PR_GET_THP_POLICY returned %d pid %d\n", res, pid);
+ goto err_out;
+ }
+
+ /* global = madvise, process = always, we should get HPs irrespective of MADV_HUGEPAGE */
+ is_anonhuge = test_mmap_thp(0);
+ if (!is_anonhuge) {
+ printf("PR_THP_POLICY_DEFAULT_HUGE set but didn't get hugepages\n");
+ goto err_out;
+ }
+
+ is_anonhuge = test_mmap_thp(1);
+ if (!is_anonhuge) {
+ printf("PR_THP_POLICY_DEFAULT_HUGE set but didn't get hugepages\n");
+ goto err_out;
+ }
+
+ /* Reset to system policy */
+ if (prctl(PR_SET_THP_POLICY, PR_THP_POLICY_SYSTEM, NULL, NULL, NULL) != 0) {
+ perror("prctl failed to set policy to system");
+ goto err_out;
+ }
+
+ is_anonhuge = test_mmap_thp(0);
+ if (is_anonhuge) {
+ printf("global policy is madvise but got hugepages without MADV_HUGEPAGE\n");
+ goto err_out;
+ }
+
+ is_anonhuge = test_mmap_thp(1);
+ if (!is_anonhuge) {
+ printf("global policy is madvise but didn't get hugepages with MADV_HUGEPAGE\n");
+ goto err_out;
+ }
+
+ if (pid == 0) {
+ exit(EXIT_SUCCESS);
+ } else {
+ wait(&status);
+ if (WIFEXITED(status) && WEXITSTATUS(status) == EXIT_SUCCESS)
+ return 0;
+ else
+ return -1;
+ }
+err_out:
+ if (pid == 0)
+ exit(EXIT_FAILURE);
+ else
+ return -1;
+}
+
int main(void)
{
if (sysfs_check())
@@ -210,5 +281,6 @@ int main(void)
if (system_thp_policy == SYSTEM_POLICY_ALWAYS)
return test_global_always_process_nohuge();
-
+ else
+ return test_global_madvise_process_huge();
}
--
2.47.1
* [PATCH 6/6] docs: transhuge: document process level THP controls
2025-05-15 13:33 [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
` (4 preceding siblings ...)
2025-05-15 13:33 ` [PATCH 5/6] selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE Usama Arif
@ 2025-05-15 13:33 ` Usama Arif
2025-05-15 13:55 ` [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY Lorenzo Stoakes
6 siblings, 0 replies; 51+ messages in thread
From: Usama Arif @ 2025-05-15 13:33 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, linux-kernel,
linux-doc, kernel-team, Usama Arif
This includes the already existing PR_GET/SET_THP_DISABLE policy,
as well as the newly introduced PR_GET/SET_THP_POLICY.
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
Documentation/admin-guide/mm/transhuge.rst | 40 ++++++++++++++++++++++
1 file changed, 40 insertions(+)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index dff8d5985f0f..cf3092eb239a 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -218,6 +218,46 @@ to "always" or "madvise"), and it'll be automatically shutdown when
PMD-sized THP is disabled (when both the per-size anon control and the
top-level control are "never")
+Process THP controls
+--------------------
+
+Transparent Hugepage behaviour of a process can be modified/obtained by
+using the prctl system call. The following operations are supported:
+
+PR_SET_THP_DISABLE
+ This will set the MMF_DISABLE_THP process flag which will result
+ in no hugepages being faulted in or collapsed by khugepaged,
+ irrespective of global THP controls.
+
+PR_GET_THP_DISABLE
+ This will return the MMF_DISABLE_THP process flag, which will be
+ set if the process has previously been set with PR_SET_THP_DISABLE.
+
+PR_SET_THP_POLICY
+ This is used to change the behaviour of existing and future VMAs.
+ It has support for the following policies:
+
+ PR_THP_POLICY_DEFAULT_HUGE
+ This will set the MMF2_THP_VMA_DEFAULT_HUGE process flag which
+ changes the default of new VMAs to be VM_HUGEPAGE. The call
+ also modifies all existing VMAs that are not VM_NOHUGEPAGE
+ to be VM_HUGEPAGE. The policy is inherited during fork+exec.
+
+ PR_THP_POLICY_DEFAULT_NOHUGE
+ This will set the MMF2_THP_VMA_DEFAULT_NOHUGE process flag which
+ changes the default of new VMAs to be VM_NOHUGEPAGE. The call
+ also modifies all existing VMAs that are not VM_HUGEPAGE
+ to be VM_NOHUGEPAGE. The policy is inherited during fork+exec.
+
+ PR_THP_POLICY_DEFAULT_SYSTEM
+ This will clear both MMF2_THP_VMA_DEFAULT_HUGE and
+ MMF2_THP_VMA_DEFAULT_NOHUGE process flags.
+
+PR_GET_THP_POLICY
+ This will return the current THP policy of the process, i.e.
+ PR_THP_POLICY_DEFAULT_HUGE, PR_THP_POLICY_DEFAULT_NOHUGE or
+ PR_THP_POLICY_DEFAULT_SYSTEM.
+
Khugepaged controls
-------------------
--
2.47.1
* Re: [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY
2025-05-15 13:33 [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
` (5 preceding siblings ...)
2025-05-15 13:33 ` [PATCH 6/6] docs: transhuge: document process level THP controls Usama Arif
@ 2025-05-15 13:55 ` Lorenzo Stoakes
2025-05-15 14:50 ` Usama Arif
6 siblings, 1 reply; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-05-15 13:55 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On Thu, May 15, 2025 at 02:33:29PM +0100, Usama Arif wrote:
> This allows to change the THP policy of a process, according to the value
> set in arg2, all of which will be inherited during fork+exec:
This is pretty confusing.
It should be something like 'add a new prctl() option that allows...' etc.
> - PR_THP_POLICY_DEFAULT_HUGE: This will set the MMF2_THP_VMA_DEFAULT_HUGE
> process flag which changes the default of new VMAs to be VM_HUGEPAGE. The
> call also modifies all existing VMAs that are not VM_NOHUGEPAGE
> to be VM_HUGEPAGE.
This is referring to implementation detail that doesn't matter for an overview,
just add a summary here e.g.
PR_THP_POLICY_DEFAULT_HUGE - set VM_HUGEPAGE flag in all VMAs by default,
including after fork/exec, ignoring global policy.
PR_THP_POLICY_DEFAULT_NOHUGE - clear VM_HUGEPAGE flag in all VMAs by default,
including after fork/exec, ignoring global policy.
PR_THP_POLICY_DEFAULT_SYSTEM - Eliminate any policy set above.
> This allows systems where the global policy is set to "madvise"
> to effectively have THPs always for the process. In an environment
> where different types of workloads are stacked on the same machine
> whose global policy is set to "madvise", this will allow workloads
> that benefit from always having hugepages to do so, without regressing
> those that don't.
So does this just ignore and override the global policy? I'm not sure I'm
comfortable with that.
What about if the policy is 'never'? Does this override that? That seems
completely wrong.
> - PR_THP_POLICY_DEFAULT_NOHUGE: This will set the MMF2_THP_VMA_DEFAULT_NOHUGE
> process flag which changes the default of new VMAs to be VM_NOHUGEPAGE.
> The call also modifies all existing VMAs that are not VM_HUGEPAGE
> to be VM_NOHUGEPAGE.
> This allows systems where the global policy is set to "always"
> to effectively have THPs on madvise only for the process. In an
> environment where different types of workloads are stacked on the
> same machine whose global policy is set to "always", this will allow
> workloads that benefit from having hugepages on an madvise basis only
> to do so, without regressing those that benefit from having hugepages
> always.
Wait, so 'no huge' means 'madvise'? What? This is confusing.
> - PR_THP_POLICY_DEFAULT_SYSTEM: This will clear the MMF2_THP_VMA_DEFAULT_HUGE
> and MMF2_THP_VMA_DEFAULT_NOHUGE process flags.
>
> These patches are required in rolling out hugepages in hyperscaler
> configurations for workloads that benefit from them, where workloads are
> stacked and a single THP global policy is likely to be used across the entire
> fleet, and prctl will help override it.
I don't understand this justification whatsoever. What does 'stacked' mean? And
you're not justifying why you'd override the policy?
This series has no actual justification here at all? You really need to provide one.
>
> v1->v2:
Where was the v1? Is it [0]?
This seems like a massive change compared to that series?
You've renamed it and not referenced the old series, please make sure you link
it or somehow let somebody see what this is against, because it makes review
difficult.
[0]: https://lore.kernel.org/linux-mm/20250507141132.2773275-1-usamaarif642@gmail.com/
> - change from modifying the THP decision making for the process, to modifying
> VMA flags only. This prevents further complicating the logic used to
> determine THP order (Thanks David!)
> - change from using a prctl per policy change to just using PR_SET_THP_POLICY
> and arg2 to set the policy. (Zi Yan)
> - Introduce PR_THP_POLICY_DEFAULT_NOHUGE and PR_THP_POLICY_DEFAULT_SYSTEM
> - Add selftests and documentation.
>
> Usama Arif (6):
> prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
> prctl: introduce PR_THP_POLICY_DEFAULT_NOHUGE for the process
> prctl: introduce PR_THP_POLICY_SYSTEM for the process
> selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_NOHUGE
> selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
> docs: transhuge: document process level THP controls
>
> Documentation/admin-guide/mm/transhuge.rst | 40 +++
> include/linux/huge_mm.h | 4 +
> include/linux/mm_types.h | 14 +
> include/uapi/linux/prctl.h | 6 +
> kernel/fork.c | 1 +
> kernel/sys.c | 35 +++
> mm/huge_memory.c | 56 ++++
> mm/vma.c | 2 +
> tools/include/uapi/linux/prctl.h | 6 +
> .../trace/beauty/include/uapi/linux/prctl.h | 6 +
> tools/testing/selftests/prctl/Makefile | 2 +-
> tools/testing/selftests/prctl/thp_policy.c | 286 ++++++++++++++++++
> 12 files changed, 457 insertions(+), 1 deletion(-)
> create mode 100644 tools/testing/selftests/prctl/thp_policy.c
>
> --
> 2.47.1
>
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 13:33 ` [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process Usama Arif
@ 2025-05-15 14:40 ` Lorenzo Stoakes
2025-05-15 14:44 ` David Hildenbrand
2025-05-15 15:28 ` Usama Arif
2025-05-16 6:12 ` kernel test robot
1 sibling, 2 replies; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-05-15 14:40 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
Overall I feel this series should _DEFINITELY_ be an RFC. This is pretty
outlandish stuff and needs discussion.
You're basically making it so /sys/kernel/mm/transparent_hugepage/enabled =
never is completely ignored and overridden. Which I am emphatically not
comfortable with. And you're not saying that you're doing this,
anywhere. Which is wrong.
Also, this patch is quite broken.
I'm hugely not a fan of adding mm_struct->flags2, and I'm even more not a
fan of you not mentioning such a completely fundamental change in the
commit message.
This patch also breaks VMA merging and the VMA tests...
I really feel this series needs to be an RFC until we can get some
consensus on how to approach this.
On Thu, May 15, 2025 at 02:33:30PM +0100, Usama Arif wrote:
> This is set via the new PR_SET_THP_POLICY prctl.
What is?
You're making very major changes here, including adding a new flag to
mm_struct (!!) and the explanation/justification for this is missing.
> This will set the MMF2_THP_VMA_DEFAULT_HUGE process flag
> which changes the default of new VMAs to be VM_HUGEPAGE. The
> call also modifies all existing VMAs that are not VM_NOHUGEPAGE
> to be VM_HUGEPAGE. The policy is inherited during fork+exec.
So you can only set this flag?
>
> This allows systems where the global policy is set to "madvise"
> to effectively have THPs always for the process. In an environment
> where different types of workloads are stacked on the same machine,
> this will allow workloads that benefit from always having hugepages
> to do so, without regressing those that don't.
Again, this explanation really makes no sense at all to me, I don't really
know what you mean, you're not going into what you're doing in this change,
this is just a very unclear commit message.
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
> include/linux/huge_mm.h | 3 ++
> include/linux/mm_types.h | 11 +++++++
> include/uapi/linux/prctl.h | 4 +++
> kernel/fork.c | 1 +
> kernel/sys.c | 21 ++++++++++++
> mm/huge_memory.c | 32 +++++++++++++++++++
> mm/vma.c | 2 ++
> tools/include/uapi/linux/prctl.h | 4 +++
> .../trace/beauty/include/uapi/linux/prctl.h | 4 +++
> 9 files changed, 82 insertions(+)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..e652ad9ddbbd 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -260,6 +260,9 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
> return orders;
> }
>
> +void vma_set_thp_policy(struct vm_area_struct *vma);
This is a VMA-specific function but you're putting it in huge_mm.h? Why
can't this be in vma.h or vma.c?
> +void process_vmas_thp_default_huge(struct mm_struct *mm);
'vmas' is redundant here.
> +
> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> unsigned long vm_flags,
> unsigned long tva_flags,
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index e76bade9ebb1..2fe93965e761 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1066,6 +1066,7 @@ struct mm_struct {
> mm_context_t context;
>
> unsigned long flags; /* Must use atomic bitops to access */
> + unsigned long flags2;
Ugh, god really??
I really am not a fan of adding flags2 just to add a prctl() feature like
this. This is crazy.
Also this is a TERRIBLE name. I mean, no please PLEASE no.
Do we really have absolutely no choice but to add a new flags field here?
It again doesn't help that you don't mention nor even try to justify this
in the commit message or cover letter.
If this is a 32-bit kernel vs. 64-bit kernel thing so we 'ran out of bits',
let's just go make this flags field 64-bit on 32-bit kernels.
I mean - I'm kind of insisting we do that to be honest. Because I really
don't like this.
Also if we _HAVE_ to have this, shouldn't we duplicate that comment about
atomic bitops?...
>
> #ifdef CONFIG_AIO
> spinlock_t ioctx_lock;
> @@ -1744,6 +1745,11 @@ enum {
> MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
> MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
>
> +#define MMF2_THP_VMA_DEFAULT_HUGE 0
I thought the whole idea was to move away from explicitly referencing 'THP'
in a future where large folios are implicit and now we're saying 'THP'.
Anyway the 'VMA' is totally redundant here.
> +#define MMF2_THP_VMA_DEFAULT_HUGE_MASK (1 << MMF2_THP_VMA_DEFAULT_HUGE)
Do we really need explicit trivial mask declarations like this?
> +
> +#define MMF2_INIT_MASK (MMF2_THP_VMA_DEFAULT_HUGE_MASK)
> +
> static inline unsigned long mmf_init_flags(unsigned long flags)
> {
> if (flags & (1UL << MMF_HAS_MDWE_NO_INHERIT))
> @@ -1752,4 +1758,9 @@ static inline unsigned long mmf_init_flags(unsigned long flags)
> return flags & MMF_INIT_MASK;
> }
>
> +static inline unsigned long mmf2_init_flags(unsigned long flags)
> +{
> + return flags & MMF2_INIT_MASK;
> +}
> +
> #endif /* _LINUX_MM_TYPES_H */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 15c18ef4eb11..325c72f40a93 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -364,4 +364,8 @@ struct prctl_mm_map {
> # define PR_TIMER_CREATE_RESTORE_IDS_ON 1
> # define PR_TIMER_CREATE_RESTORE_IDS_GET 2
>
> +#define PR_SET_THP_POLICY 78
> +#define PR_GET_THP_POLICY 79
> +#define PR_THP_POLICY_DEFAULT_HUGE 0
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9e4616dacd82..6e5f4a8869dc 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1054,6 +1054,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>
> if (current->mm) {
> mm->flags = mmf_init_flags(current->mm->flags);
> + mm->flags2 = mmf2_init_flags(current->mm->flags2);
> mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
> } else {
> mm->flags = default_dump_filter;
> diff --git a/kernel/sys.c b/kernel/sys.c
> index c434968e9f5d..1115f258f253 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2658,6 +2658,27 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> clear_bit(MMF_DISABLE_THP, &me->mm->flags);
> mmap_write_unlock(me->mm);
> break;
> + case PR_GET_THP_POLICY:
> + if (arg2 || arg3 || arg4 || arg5)
> + return -EINVAL;
> + if (!!test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2))
I really don't think we need the !!? Do we?
Shouldn't we lock the mm when we do this no? Can't somebody change this?
> + error = PR_THP_POLICY_DEFAULT_HUGE;
> + break;
> + case PR_SET_THP_POLICY:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + if (mmap_write_lock_killable(me->mm))
> + return -EINTR;
> + switch (arg2) {
> + case PR_THP_POLICY_DEFAULT_HUGE:
> + set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
> + process_vmas_thp_default_huge(me->mm);
> + break;
> + default:
> + return -EINVAL;
> + }
> + mmap_write_unlock(me->mm);
> + break;
> case PR_MPX_ENABLE_MANAGEMENT:
> case PR_MPX_DISABLE_MANAGEMENT:
> /* No longer implemented: */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2780a12b25f0..64f66d5295e8 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -98,6 +98,38 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
> return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
> }
>
> +void vma_set_thp_policy(struct vm_area_struct *vma)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> +
> + if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
> + vm_flags_set(vma, VM_HUGEPAGE);
> +}
> +
> +static void vmas_thp_default_huge(struct mm_struct *mm)
> +{
> + struct vm_area_struct *vma;
> + unsigned long vm_flags;
> +
> + VMA_ITERATOR(vmi, mm, 0);
This is a declaration, it should be grouped with declarations...
> + for_each_vma(vmi, vma) {
> + vm_flags = vma->vm_flags;
> + if (vm_flags & VM_NOHUGEPAGE)
> + continue;
Literally no point in you putting vm_flags as a separate variable here.
So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise
is to override global 'never'?
I'm really concerned about this.
> + vm_flags_set(vma, VM_HUGEPAGE);
> + }
> +}
Do we have an mmap write lock established here? Can you confirm that? Also
you should add an assert for that here.
> +
> +void process_vmas_thp_default_huge(struct mm_struct *mm)
> +{
> + if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
> + return;
> +
> + set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2);
> + vmas_thp_default_huge(mm);
> +}
> +
> +
> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> unsigned long vm_flags,
> unsigned long tva_flags,
> diff --git a/mm/vma.c b/mm/vma.c
> index 1f2634b29568..101b19c96803 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -2476,6 +2476,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> if (!vma_is_anonymous(vma))
> khugepaged_enter_vma(vma, map->flags);
> ksm_add_vma(vma);
> + vma_set_thp_policy(vma);
You're breaking VMA merging completely by doing this here...
Now I can map one VMA with this policy set, then map another immediately
next to it and - oops - no merge, ever, because the VM_HUGEPAGE flag is not
set in the new VMA on merge attempt.
I realise KSM is just as broken (grr) but this doesn't justify us
completely breaking VMA merging here.
You need to set earlier than this. Then of course a driver might decide to
override this, so maybe then we need to override that.
But then we're getting into realms of changing fundamental VMA code _just
for this feature_.
Again I'm iffy about this. Very.
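The merging concern above can be sketched in userspace as follows. This is a toy model, not kernel code: toy_vma, toy_mm, TOY_VM_HUGEPAGE and the helper names are all made up for illustration. The only behaviour assumed from the kernel is that a new mapping can merge with its neighbour only if their flags match at the time of the merge attempt.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_VM_HUGEPAGE 0x1UL	/* hypothetical stand-in for VM_HUGEPAGE */
#define TOY_MAX_VMAS 8

struct toy_vma {
	unsigned long start, end, flags;
};

struct toy_mm {
	struct toy_vma vmas[TOY_MAX_VMAS];
	size_t nr;
	int default_huge;	/* models the per-process policy bit */
};

/*
 * Map [start, end) with the requested flags. With apply_policy_early set,
 * the policy flag is folded into the flags before the merge attempt;
 * otherwise it is set only after the VMA has been created, which is the
 * ordering criticised above.
 */
static void toy_mmap(struct toy_mm *mm, unsigned long start, unsigned long end,
		     unsigned long flags, int apply_policy_early)
{
	struct toy_vma *prev = mm->nr ? &mm->vmas[mm->nr - 1] : NULL;
	struct toy_vma *vma;

	if (apply_policy_early && mm->default_huge)
		flags |= TOY_VM_HUGEPAGE;

	/* Merge attempt: ranges must be adjacent and flags identical. */
	if (prev && prev->end == start && prev->flags == flags) {
		prev->end = end;
		return;
	}

	assert(mm->nr < TOY_MAX_VMAS);
	vma = &mm->vmas[mm->nr++];
	vma->start = start;
	vma->end = end;
	vma->flags = flags;

	if (!apply_policy_early && mm->default_huge)
		vma->flags |= TOY_VM_HUGEPAGE;	/* too late to help merging */
}

/* Map two adjacent ranges and report how many VMAs result. */
static size_t toy_count_after_two_maps(int apply_policy_early)
{
	struct toy_mm mm = { .nr = 0, .default_huge = 1 };

	toy_mmap(&mm, 0x1000, 0x2000, 0, apply_policy_early);
	toy_mmap(&mm, 0x2000, 0x3000, 0, apply_policy_early);
	return mm.nr;
}
```

In this model toy_count_after_two_maps(0) leaves two separate VMAs, while toy_count_after_two_maps(1) merges them into one, which is why the flag would need to be applied before the merge attempt.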
Also you've broken the VMA userland tests here:
$ cd tools/testing/vma
$ make
...
In file included from vma.c:33:
../../../mm/vma.c: In function ‘__mmap_new_vma’:
../../../mm/vma.c:2486:9: error: implicit declaration of function ‘vma_set_thp_policy’; did you mean ‘vma_dup_policy’? [-Wimplicit-function-declaration]
2486 | vma_set_thp_policy(vma);
| ^~~~~~~~~~~~~~~~~~
| vma_dup_policy
make: *** [<builtin>: vma.o] Error 1
You need to create stubs accordingly.
> *vmap = vma;
> return 0;
>
> @@ -2705,6 +2706,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> mm->map_count++;
> validate_mm(mm);
> ksm_add_vma(vma);
> + vma_set_thp_policy(vma);
You're breaking merging again... This is quite a bad case too as now you'll
have totally fragmented brk VMAs no?
We can't have it implemented this way.
> out:
> perf_event_mmap(vma);
> mm->total_vm += len >> PAGE_SHIFT;
> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
> index 35791791a879..f5945ebfe3f2 100644
> --- a/tools/include/uapi/linux/prctl.h
> +++ b/tools/include/uapi/linux/prctl.h
> @@ -328,4 +328,8 @@ struct prctl_mm_map {
> # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */
> # define PR_PPC_DEXCR_CTRL_MASK 0x1f
>
> +#define PR_SET_THP_POLICY 78
> +#define PR_GET_THP_POLICY 79
> +#define PR_THP_POLICY_DEFAULT_HUGE 0
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
> index 15c18ef4eb11..325c72f40a93 100644
> --- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
> +++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
> @@ -364,4 +364,8 @@ struct prctl_mm_map {
> # define PR_TIMER_CREATE_RESTORE_IDS_ON 1
> # define PR_TIMER_CREATE_RESTORE_IDS_GET 2
>
> +#define PR_SET_THP_POLICY 78
> +#define PR_GET_THP_POLICY 79
> +#define PR_THP_POLICY_DEFAULT_HUGE 0
> +
> #endif /* _LINUX_PRCTL_H */
> --
> 2.47.1
>
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 14:40 ` Lorenzo Stoakes
@ 2025-05-15 14:44 ` David Hildenbrand
2025-05-15 14:56 ` Usama Arif
2025-05-15 15:45 ` Liam R. Howlett
2025-05-15 15:28 ` Usama Arif
1 sibling, 2 replies; 51+ messages in thread
From: David Hildenbrand @ 2025-05-15 14:44 UTC (permalink / raw)
To: Lorenzo Stoakes, Usama Arif
Cc: Andrew Morton, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On 15.05.25 16:40, Lorenzo Stoakes wrote:
> Overall I feel this series should _DEFINITELY_ be an RFC. This is pretty
> outlandish stuff and needs discussion.
>
> You're basically making it so /sys/kernel/mm/transparent_hugepage/enabled =
> never is completely ignored and overridden.
I thought I made it very clear during earlier discussions that never
means never.
--
Cheers,
David / dhildenb
* Re: [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY
2025-05-15 13:55 ` [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY Lorenzo Stoakes
@ 2025-05-15 14:50 ` Usama Arif
2025-05-15 15:15 ` Lorenzo Stoakes
0 siblings, 1 reply; 51+ messages in thread
From: Usama Arif @ 2025-05-15 14:50 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On 15/05/2025 14:55, Lorenzo Stoakes wrote:
> On Thu, May 15, 2025 at 02:33:29PM +0100, Usama Arif wrote:
>> This allows to change the THP policy of a process, according to the value
>> set in arg2, all of which will be inherited during fork+exec:
>
> This is pretty confusing.
>
> It should be something like 'add a new prctl() option that allows...' etc.
>
>> - PR_THP_POLICY_DEFAULT_HUGE: This will set the MMF2_THP_VMA_DEFAULT_HUGE
>> process flag which changes the default of new VMAs to be VM_HUGEPAGE. The
>> call also modifies all existing VMAs that are not VM_NOHUGEPAGE
>> to be VM_HUGEPAGE.
>
> This is referring to implementation detail that doesn't matter for an overview,
> just add a summary here e.g.
>
> PR_THP_POLICY_DEFAULT_HUGE - set VM_HUGEPAGE flag in all VMAs by default,
> including after fork/exec, ignoring global policy.
>
> PR_THP_POLICY_DEFAULT_NOHUGE - clear VM_HUGEPAGE flag in all VMAs by default,
> including after fork/exec, ignoring global policy.
>
> PR_THP_POLICY_DEFAULT_SYSTEM - Eliminate any policy set above.
Hi Lorenzo,
Thanks for the review. I will make the cover letter clearer in the next revision.
>
>> This allows systems where the global policy is set to "madvise"
>> to effectively have THPs always for the process. In an environment
>> where different types of workloads are stacked on the same machine
>> whose global policy is set to "madvise", this will allow workloads
>> that benefit from always having hugepages to do so, without regressing
>> those that don't.
>
> So does this just ignore and override the global policy? I'm not sure I'm
> comfortable with that.
No. The decision making of when and what order THPs are allowed is not
changed, i.e. there are no changes in __thp_vma_allowable_orders and
thp_vma_allowable_orders. David has the same concern as you and this
current series is implementing what David suggested in
https://lore.kernel.org/all/3f7ba97d-04d5-4ea4-9f08-6ec3584e0d4c@redhat.com/
It will change the existing VMA (NO)HUGE flags according to
the prctl. For example, doing PR_THP_POLICY_DEFAULT_HUGE will not give
a THP when the global policy is never.
>
> What about if the policy is 'never'? Does this override that? That seems
> completely wrong.
No, it won't override it. hugepage_global_always and hugepage_global_enabled
will still evaluate to false and you won't get a hugepage no matter what prctl
is set.
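As a rough model of the answer above, the per-VMA outcome can be sketched as below. This is an illustration only and not the kernel's __thp_vma_allowable_orders logic: the enum, the flag bits and the function name are hypothetical. It encodes the usual meaning of the global never/madvise/always settings combined with the VM_HUGEPAGE and VM_NOHUGEPAGE VMA flags, assuming the prctl only changes VMA flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the global sysfs setting and the VMA flags. */
enum toy_global_policy { TOY_NEVER, TOY_MADVISE, TOY_ALWAYS };

#define TOY_VM_HUGEPAGE   0x1UL
#define TOY_VM_NOHUGEPAGE 0x2UL

static bool toy_thp_allowed(enum toy_global_policy global, unsigned long vm_flags)
{
	/* A VMA carries at most one of the two hints. */
	assert(!((vm_flags & TOY_VM_HUGEPAGE) && (vm_flags & TOY_VM_NOHUGEPAGE)));

	/* An explicit opt-out always wins. */
	if (vm_flags & TOY_VM_NOHUGEPAGE)
		return false;

	switch (global) {
	case TOY_ALWAYS:
		return true;
	case TOY_MADVISE:
		/* Only VMAs marked VM_HUGEPAGE qualify. */
		return (vm_flags & TOY_VM_HUGEPAGE) != 0;
	case TOY_NEVER:
	default:
		/* "never" means never, whatever the VMA flags say. */
		return false;
	}
}
```

Under this model a process with VM_HUGEPAGE set on every VMA behaves like "always" when the global setting is "madvise", but still gets no THPs when the global setting is "never".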
>
>> - PR_THP_POLICY_DEFAULT_NOHUGE: This will set the MMF2_THP_VMA_DEFAULT_NOHUGE
>> process flag which changes the default of new VMAs to be VM_NOHUGEPAGE.
>> The call also modifies all existing VMAs that are not VM_HUGEPAGE
>> to be VM_NOHUGEPAGE.
>> This allows systems where the global policy is set to "always"
>> to effectively have THPs on madvise only for the process. In an
>> environment where different types of workloads are stacked on the
>> same machine whose global policy is set to "always", this will allow
>> workloads that benefit from having hugepages on an madvise basis only
>> to do so, without regressing those that benefit from having hugepages
>> always.
>
> Wait, so 'no huge' means 'madvise'? What? This is confusing.
I probably made the cover letter confusing :) or maybe I need to rename the flags.
This flag works as follows:
a) Changes the default flag of new VMAs to be VM_NOHUGEPAGE
b) Modifies all existing VMAs that are not VM_HUGEPAGE to be VM_NOHUGEPAGE
c) Is inherited during fork+exec
I think maybe I should add VMA to the flag names and rename the flags to
PR_THP_POLICY_DEFAULT_VMA_(NO)HUGE?
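Step b) above, and its mirror image for the HUGE policy, can be sketched in userspace like this. It is a toy model under stated assumptions: the flag bits, the enum and toy_apply_policy are hypothetical names, and a plain array of flag words stands in for the VMA list. It does not model steps a) or c).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for VM_HUGEPAGE / VM_NOHUGEPAGE. */
#define TOY_VM_HUGEPAGE   0x1UL
#define TOY_VM_NOHUGEPAGE 0x2UL

enum toy_policy { TOY_DEFAULT_HUGE, TOY_DEFAULT_NOHUGE };

/*
 * Rewrite the flags of the existing VMAs for the given policy. A VMA that
 * already carries the opposite hint (for example from an earlier madvise)
 * is left untouched.
 */
static void toy_apply_policy(unsigned long *vm_flags, size_t n,
			     enum toy_policy policy)
{
	unsigned long set = policy == TOY_DEFAULT_HUGE ?
			    TOY_VM_HUGEPAGE : TOY_VM_NOHUGEPAGE;
	unsigned long keep = policy == TOY_DEFAULT_HUGE ?
			     TOY_VM_NOHUGEPAGE : TOY_VM_HUGEPAGE;
	size_t i;

	assert(vm_flags != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (vm_flags[i] & keep)
			continue;	/* explicit opposite choice is respected */
		vm_flags[i] |= set;
	}
}
```

With the NOHUGE policy, a VMA that was explicitly marked VM_HUGEPAGE keeps that flag, which is why the result is comparable to "madvise" behaviour rather than to disabling THP outright.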
>
>> - PR_THP_POLICY_DEFAULT_SYSTEM: This will clear the MMF2_THP_VMA_DEFAULT_HUGE
>> and MMF2_THP_VMA_DEFAULT_NOHUGE process flags.
>>
>> These patches are required in rolling out hugepages in hyperscaler
>> configurations for workloads that benefit from them, where workloads are
>> stacked and a single THP global policy is likely to be used across the entire
>> fleet, and prctl will help override it.
>
> I don't understand this justification whatsoever. What does 'stacked' mean? And
> you're not justifying why you'd override the policy?
By stacked I just meant different types of workloads running on the same machine.
Let's say we have a single server whose global policy is set to madvise.
You can have a container on that server running some database workload that best
works with madvise.
You can have another container on that same server running some AI workload that would
benefit from having VM_HUGEPAGE set on all new VMAs. We can use prctl
PR_THP_POLICY_DEFAULT_HUGE to get VM_HUGEPAGE set by default on all new VMAs for that
container.
>
> This series has no actual justification here at all? You really need to provide one.
>
There was a discussion on the usecases in
https://lore.kernel.org/all/13b68fa0-8755-43d8-8504-d181c2d46134@gmail.com/
I tried (and I guess failed :)) to summarize the justification from that thread.
I will try and rephrase it here.
In hyperscalers, we have a single THP policy for the entire fleet.
We have different types of workloads (e.g. AI/compute/databases/etc)
running on a single server (this is what I meant by 'stacked').
Some of these workloads will benefit from always getting THP at fault (or collapsed
by khugepaged), some of them will benefit by only getting them at madvise.
This series is useful for 2 use cases:
1) global system policy = madvise, while we want some workloads to get THPs
at fault and by khugepaged: some processes (e.g. AI workloads) benefit from getting
THPs at fault (and collapsed by khugepaged). Other workloads like databases would
regress (either in performance, or because they are completely memory bound and
even a very slight increase in memory will cause them to OOM). These patches
allow setting prctl(PR_THP_POLICY_DEFAULT_HUGE) on the AI workloads.
(This is how workloads are deployed in our (Meta's/Facebook) fleet at this moment.)
2) global system policy = always, while we want some workloads to get THPs
only on a madvise basis: same reason as 1). These patches
allow setting prctl(PR_THP_POLICY_DEFAULT_NOHUGE) on the database
workloads.
(We hope this is us (Meta) in the near future: if a majority of workloads show that they
benefit from always, we flip the default host setting to "always" across the fleet and
workloads that regress can opt out and be "madvise".
New services developed will then be tested with always by default. "always" is also the
default defconfig option upstream, so I would imagine this is faced by others as well.)
Hope this makes the justification for the patches clearer :)
>>
>> v1->v2:
>
> Where was the v1? Is it [0]?
>
> This seems like a massive change compared to that series?
>
> You've renamed it and not referenced the old series, please make sure you link
> it or somehow let somebody see what this is against, because it makes review
> difficult.
>
Yes, it's the patch you linked below. Sorry, I should have linked it in this series.
It's a big change, but it was basically incorporating all feedback from David,
while trying to achieve a similar goal. Will link it in future series.
> [0]: https://lore.kernel.org/linux-mm/20250507141132.2773275-1-usamaarif642@gmail.com/
>
>> - change from modifying the THP decision making for the process, to modifying
>> VMA flags only. This prevents further complicating the logic used to
>> determine THP order (Thanks David!)
>> - change from using a prctl per policy change to just using PR_SET_THP_POLICY
>> and arg2 to set the policy. (Zi Yan)
>> - Introduce PR_THP_POLICY_DEFAULT_NOHUGE and PR_THP_POLICY_DEFAULT_SYSTEM
>> - Add selftests and documentation.
>>
>> Usama Arif (6):
>> prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
>> prctl: introduce PR_THP_POLICY_DEFAULT_NOHUGE for the process
>> prctl: introduce PR_THP_POLICY_SYSTEM for the process
>> selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_NOHUGE
>> selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
>> docs: transhuge: document process level THP controls
>>
>> Documentation/admin-guide/mm/transhuge.rst | 40 +++
>> include/linux/huge_mm.h | 4 +
>> include/linux/mm_types.h | 14 +
>> include/uapi/linux/prctl.h | 6 +
>> kernel/fork.c | 1 +
>> kernel/sys.c | 35 +++
>> mm/huge_memory.c | 56 ++++
>> mm/vma.c | 2 +
>> tools/include/uapi/linux/prctl.h | 6 +
>> .../trace/beauty/include/uapi/linux/prctl.h | 6 +
>> tools/testing/selftests/prctl/Makefile | 2 +-
>> tools/testing/selftests/prctl/thp_policy.c | 286 ++++++++++++++++++
>> 12 files changed, 457 insertions(+), 1 deletion(-)
>> create mode 100644 tools/testing/selftests/prctl/thp_policy.c
>>
>> --
>> 2.47.1
>>
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 14:44 ` David Hildenbrand
@ 2025-05-15 14:56 ` Usama Arif
2025-05-15 14:58 ` David Hildenbrand
2025-05-15 15:45 ` Liam R. Howlett
1 sibling, 1 reply; 51+ messages in thread
From: Usama Arif @ 2025-05-15 14:56 UTC (permalink / raw)
To: David Hildenbrand, Lorenzo Stoakes
Cc: Andrew Morton, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On 15/05/2025 15:44, David Hildenbrand wrote:
> On 15.05.25 16:40, Lorenzo Stoakes wrote:
>> Overall I feel this series should _DEFINITELY_ be an RFC. This is pretty
>> outlandish stuff and needs discussion.
>>
>> You're basically making it so /sys/kernel/mm/transparent_hugepage/enabled =
>> never is completely ignored and overridden.
>
> I thought I made it very clear during earlier discussions that never means never.
>
Yes never means never, this is only implementing your suggestion in the original series.
If the policy is set to never, hugepage_global_always and hugepage_global_enabled will
evaluate to false and we won't get a hugepage.
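To make the "never means never" point concrete, here is a rough userspace
model of the decision. It is not kernel code: the MODEL_* names and
model_thp_allowed() are invented for illustration, and the per-size sysfs
knobs are deliberately ignored. It only shows that the global setting is
still consulted after VM_HUGEPAGE has been set on a VMA.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's per-VMA THP hint flags. */
#define MODEL_VM_HUGEPAGE   (1UL << 0)
#define MODEL_VM_NOHUGEPAGE (1UL << 1)

/* The three values of /sys/kernel/mm/transparent_hugepage/enabled. */
enum model_global_policy {
	MODEL_THP_NEVER,
	MODEL_THP_MADVISE,
	MODEL_THP_ALWAYS,
};

/*
 * Simplified model: the global policy is checked no matter what the VMA
 * flags say, so setting VM_HUGEPAGE cannot defeat "never".
 */
static bool model_thp_allowed(enum model_global_policy policy,
			      unsigned long vm_flags)
{
	/* In this model a VMA never carries both hints at once. */
	assert(!((vm_flags & MODEL_VM_HUGEPAGE) &&
		 (vm_flags & MODEL_VM_NOHUGEPAGE)));

	/* An explicit opt-out on the VMA always wins. */
	if (vm_flags & MODEL_VM_NOHUGEPAGE)
		return false;

	switch (policy) {
	case MODEL_THP_ALWAYS:
		return true;
	case MODEL_THP_MADVISE:
		/* Only VMAs carrying the huge-page hint qualify. */
		return (vm_flags & MODEL_VM_HUGEPAGE) != 0;
	case MODEL_THP_NEVER:
	default:
		return false;
	}
}
```

With global policy "madvise", a VMA that the prctl marked VM_HUGEPAGE
qualifies; with "never", the same VMA still does not.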
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 14:56 ` Usama Arif
@ 2025-05-15 14:58 ` David Hildenbrand
2025-05-15 15:18 ` Lorenzo Stoakes
0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-05-15 14:58 UTC (permalink / raw)
To: Usama Arif, Lorenzo Stoakes
Cc: Andrew Morton, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On 15.05.25 16:56, Usama Arif wrote:
>
>
> On 15/05/2025 15:44, David Hildenbrand wrote:
>> On 15.05.25 16:40, Lorenzo Stoakes wrote:
>>> Overall I feel this series should _DEFINITELY_ be an RFC. This is pretty
>>> outlandish stuff and needs discussion.
>>>
>>> You're basically making it so /sys/kernel/mm/transparent_hugepage/enabled =
>>> never is completely ignored and overridden.
>>
>> I thought I made it very clear during earlier discussions that never means never.
>>
>
> Yes never means never
Good, likely worth stating clearly that there are no overrides (I
did not look into the series yet, I was only responding to Lorenzo's
concerns) :)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY
2025-05-15 14:50 ` Usama Arif
@ 2025-05-15 15:15 ` Lorenzo Stoakes
2025-05-15 15:54 ` Usama Arif
0 siblings, 1 reply; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-05-15 15:15 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
Thanks for coming back to me so quickly, appreciated :)
I am reacting in a 'WTF' way here, but it's in proportion to the (at least
perceived) magnitude of this change. We really need to be sure this is
right.
On Thu, May 15, 2025 at 03:50:47PM +0100, Usama Arif wrote:
>
>
> On 15/05/2025 14:55, Lorenzo Stoakes wrote:
> > On Thu, May 15, 2025 at 02:33:29PM +0100, Usama Arif wrote:
> >> This allows to change the THP policy of a process, according to the value
> >> set in arg2, all of which will be inherited during fork+exec:
> >
> > This is pretty confusing.
> >
> > It should be something like 'add a new prctl() option that allows...' etc.
> >
> >> - PR_THP_POLICY_DEFAULT_HUGE: This will set the MMF2_THP_VMA_DEFAULT_HUGE
> >> process flag which changes the default of new VMAs to be VM_HUGEPAGE. The
> >> call also modifies all existing VMAs that are not VM_NOHUGEPAGE
> >> to be VM_HUGEPAGE.
> >
> > This is referring to implementation detail that doesn't matter for an overview,
> > just add a summary here e.g.
> >
> > PR_THP_POLICY_DEFAULT_HUGE - set VM_HUGEPAGE flag in all VMAs by default,
> > including after fork/exec, ignoring global policy.
> >
> > PR_THP_POLICY_DEFAULT_NOHUGE - clear VM_HUGEPAGE flag in all VMAs by default,
> > including after fork/exec, ignoring global policy.
> >
> > PR_THP_POLICY_DEFAULT_SYSTEM - Eliminate any policy set above.
>
> Hi Lorenzo,
>
> Thanks for the review. I will make the cover letter clearer in the next revision.
The next version should emphatically be an RFC also, please. Your cover letter
should mention you're fundamentally changing mm_struct and VMA logic, and
explain why your use case is so important that that is justified.
>
> >
> >> This allows systems where the global policy is set to "madvise"
> >> to effectively have THPs always for the process. In an environment
> >> where different types of workloads are stacked on the same machine
> >> whose global policy is set to "madvise", this will allow workloads
> >> that benefit from always having hugepages to do so, without regressing
> >> those that don't.
> >
> > So does this just ignore and override the global policy? I'm not sure I'm
> > comfortable with that.
>
> No. The decision making of when and what order THPs are allowed is not
> changed, i.e. there are no changes in __thp_vma_allowable_orders and
> thp_vma_allowable_orders. David has the same concern as you and this
> current series is implementing what David suggested in
> https://lore.kernel.org/all/3f7ba97d-04d5-4ea4-9f08-6ec3584e0d4c@redhat.com/
>
> It will change the existing VMA (NO)HUGE flags according to
> the prctl. For e.g. doing PR_THP_POLICY_DEFAULT_HUGE will not give
> a THP when global policy is never.
Umm...
+ case PR_SET_THP_POLICY:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ if (mmap_write_lock_killable(me->mm))
+ return -EINTR;
+ switch (arg2) {
+ case PR_THP_POLICY_DEFAULT_HUGE:
+ set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
+ process_vmas_thp_default_huge(me->mm);
+ break;
+ default:
Where's the check against never? You're unconditionally setting VM_HUGEPAGE?
You're relying on VM_HUGEPAGE being ignored in this instance? But you're still:
1. Setting VM_HUGEPAGE everywhere (and breaking VMA merging everywhere).
2. Setting MMF2_THP_VMA_DEFAULT_HUGE and making it so PR_GET_THP_POLICY says it
has a policy of default huge even if policy is set to never?
I'm not ok with that. I'd much rather we do the never check here...
Also see hugepage_madvise(). There's arch-specific code that overrides
that, and you're now bypassing that (yes it's for one arch of course but
it's still a thing)
>
> >
> > What about if the policy is 'never'? Does this override that? That seems
> > completely wrong.
>
> No, it won't override it. hugepage_global_always and hugepage_global_enabled
> will still evaluate to false and you won't get a hugepage no matter what prctl
> is set.
Ack ok I see as above, you're relying on VM_HUGEPAGE enforcing this.
You really need to put stuff like this in the cover letter though!!
>
> >
> >> - PR_THP_POLICY_DEFAULT_NOHUGE: This will set the MMF2_THP_VMA_DEFAULT_NOHUGE
> >> process flag which changes the default of new VMAs to be VM_NOHUGEPAGE.
> >> The call also modifies all existing VMAs that are not VM_HUGEPAGE
> >> to be VM_NOHUGEPAGE.
> >> This allows systems where the global policy is set to "always"
> >> to effectively have THPs on madvise only for the process. In an
> >> environment where different types of workloads are stacked on the
> >> same machine whose global policy is set to "always", this will allow
> >> workloads that benefit from having hugepages on an madvise basis only
> >> to do so, without regressing those that benefit from having hugepages
> >> always.
> >
> > Wait, so 'no huge' means 'madvise'? What? This is confusing.
>
>
> I probably made the cover letter confusing :) or maybe need to rename the flags.
>
> This flag works as follows:
>
> a) Changes the default flag of new VMAs to be VM_NOHUGEPAGE
>
> b) Modifies all existing VMAs that are not VM_HUGEPAGE to be VM_NOHUGEPAGE
>
> c) Is inherited during fork+exec
>
> I think maybe I should add VMA to the flag names and rename the flags to
> PR_THP_POLICY_DEFAULT_VMA_(NO)HUGE ??
Please no :) 'VMA' is implicit re: mappings. If you're touching memory
mappings you're necessarily touching VMAs.
I know some prctl() (a pathway to many abilities some consider to be
unnatural) uses 'VMA' in some of the endpoints but generally when referring
to specific VMAs no?
These names are already kinda horrible (yes naming is hard, for everyone,
ask me about MADV_POISON/REMEDY) but I think something like:
PR_DEFAULT_MADV_HUGEPAGE
PR_DEFAULT_MADV_NOHUGEPAGE
-ish :)
>
> >
> >> - PR_THP_POLICY_DEFAULT_SYSTEM: This will clear the MMF2_THP_VMA_DEFAULT_HUGE
> >> and MMF2_THP_VMA_DEFAULT_NOHUGE process flags.
> >>
> >> These patches are required in rolling out hugepages in hyperscaler
> >> configurations for workloads that benefit from them, where workloads are
> >> stacked and a single THP global policy is likely to be used across the entire
> >> fleet, and prctl will help override it.
> >
> > I don't understand this justification whatsoever. What does 'stacked' mean? And
> > you're not justifying why you'd override the policy?
>
> By stacked I just meant different types of workloads running on the same machine.
> Let's say we have a single server whose global policy is set to madvise.
> You can have a container on that server running some database workload that best
> works with madvise.
> You can have another container on that same server running some AI workload that would
> benefit from having VM_HUGEPAGE set on all new VMAs. We can use prctl
> PR_THP_POLICY_DEFAULT_HUGE to get VM_HUGEPAGE set by default on all new VMAs for that
> container.
>
> >
> > This series has no actual justification here at all? You really need to provide one.
> >
>
> There was a discussion on the usecases in
> https://lore.kernel.org/all/13b68fa0-8755-43d8-8504-d181c2d46134@gmail.com/
>
> I tried (and I guess failed :)) to summarize the justification from that thread.
It's fine, I have most definitely not been as clear as I could be in series
too :>) just need to add a bigger summary.
Don't be afraid to waffle on... (I know I am not... ;)
>
> I will try and rephrase it here.
>
> In hyperscalers, we have a single THP policy for the entire fleet.
> We have different types of workloads (e.g. AI/compute/databases/etc)
> running on a single server (this is what I meant by 'stacked').
> Some of these workloads will benefit from always getting THP at fault (or collapsed
> by khugepaged), some of them will benefit by only getting them at madvise.
>
> This series is useful for 2 usecases:
>
> 1) global system policy = madvise, while we want some workloads to get THPs
> at fault and by khugepaged :- some processes (e.g. AI workloads) benefit from getting
> THPs at fault (and collapsed by khugepaged). Other workloads like databases will incur
> regression (either a performance regression or they are completely memory bound and
> even a very slight increase in memory will cause them to OOM). So what these patches
> will do is allow setting prctl(PR_THP_POLICY_DEFAULT_HUGE) on the AI workloads,
> (This is how workloads are deployed in our (Meta's/Facebook) fleet at this moment).
>
> 2) global system policy = always, while we want some workloads to get THPs
> only on madvise basis :- Same reason as 1). What these patches
> will do is allow setting prctl(PR_THP_POLICY_DEFAULT_NOHUGE) on the database
> workloads.
> (We hope this is us (Meta) in the near future, if a majority of workloads show that they
> benefit from always, we flip the default host setting to "always" across the fleet and
> workloads that regress can opt-out and be "madvise".
> New services developed will then be tested with always by default. "always" is also the
> default defconfig option upstream, so I would imagine this is faced by others as well.)
Right, but I'm not sure you're explaining why prctl(), one of the most cursed,
neglected and frankly evil (maybe exaggerating :P) APIs in the kernel is the way
to do this?
You do need to summarise why the suggested idea re: BPF, or cgroups, or whatnot
is _totally unworkable_.
And why not process_madvise() with MADV_HUGEPAGE?
I'm also not sure fork/exec is a great situation to have, because are you sure
the workloads stay the same across all fork/execs that you're now propagating?
It feels like this should be a cgroup thing, really.
>
> Hope this makes the justification for the patches clearer :)
Sure, please add this kind of thing to the cover letter to get fewer 'wtf'
reactions :)
You're doing something really _big_ and _opinionated_ here though, that's
basically fundamentally changing core stuff, so we need an extended discussion
of why you feel it's so important, why other approaches are not workable, why
the Sauron-spawned Mordor dwelling prctl() API is the way to go, etc.
>
> >>
> >> v1->v2:
> >
> > Where was the v1? Is it [0]?
> >
> > This seems like a massive change compared to that series?
> >
> > You've renamed it and not referenced the old series, please make sure you link
> > it or somehow let somebody see what this is against, because it makes review
> > difficult.
> >
>
> Yes, it's the patch you linked below. Sorry, I should have linked it in this series.
> It's a big change, but it was basically incorporating all feedback from David,
> while trying to achieve a similar goal. Will link it in future series.
Yeah, again, this should have been an RFC on that basis :)
>
> > [0]: https://lore.kernel.org/linux-mm/20250507141132.2773275-1-usamaarif642@gmail.com/
> >
> >> - change from modifying the THP decision making for the process, to modifying
> >> VMA flags only. This prevents further complicating the logic used to
> >> determine THP order (Thanks David!)
> >> - change from using a prctl per policy change to just using PR_SET_THP_POLICY
> >> and arg2 to set the policy. (Zi Yan)
> >> - Introduce PR_THP_POLICY_DEFAULT_NOHUGE and PR_THP_POLICY_DEFAULT_SYSTEM
> >> - Add selftests and documentation.
> >>
> >> Usama Arif (6):
> >> prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
> >> prctl: introduce PR_THP_POLICY_DEFAULT_NOHUGE for the process
> >> prctl: introduce PR_THP_POLICY_SYSTEM for the process
> >> selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_NOHUGE
> >> selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
> >> docs: transhuge: document process level THP controls
> >>
> >> Documentation/admin-guide/mm/transhuge.rst | 40 +++
> >> include/linux/huge_mm.h | 4 +
> >> include/linux/mm_types.h | 14 +
> >> include/uapi/linux/prctl.h | 6 +
> >> kernel/fork.c | 1 +
> >> kernel/sys.c | 35 +++
> >> mm/huge_memory.c | 56 ++++
> >> mm/vma.c | 2 +
> >> tools/include/uapi/linux/prctl.h | 6 +
> >> .../trace/beauty/include/uapi/linux/prctl.h | 6 +
> >> tools/testing/selftests/prctl/Makefile | 2 +-
> >> tools/testing/selftests/prctl/thp_policy.c | 286 ++++++++++++++++++
> >> 12 files changed, 457 insertions(+), 1 deletion(-)
> >> create mode 100644 tools/testing/selftests/prctl/thp_policy.c
> >>
> >> --
> >> 2.47.1
> >>
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 14:58 ` David Hildenbrand
@ 2025-05-15 15:18 ` Lorenzo Stoakes
0 siblings, 0 replies; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-05-15 15:18 UTC (permalink / raw)
To: David Hildenbrand
Cc: Usama Arif, Andrew Morton, linux-mm, hannes, shakeel.butt, riel,
ziy, laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On Thu, May 15, 2025 at 04:58:48PM +0200, David Hildenbrand wrote:
> On 15.05.25 16:56, Usama Arif wrote:
> >
> >
> > On 15/05/2025 15:44, David Hildenbrand wrote:
> > > On 15.05.25 16:40, Lorenzo Stoakes wrote:
> > > > Overall I feel this series should _DEFINITELY_ be an RFC. This is pretty
> > > > outlandish stuff and needs discussion.
> > > >
> > > > You're basically making it so /sys/kernel/mm/transparent_hugepage/enabled =
> > > > never is completely ignored and overridden.
> > >
> > > I thought I made it very clear during earlier discussions that never means never.
> > >
> >
> > Yes never means never
>
> Good, likely worth stating clearly that there are no overrides (I did
> not look into the series yet, I was only responding to Lorenzo's concerns)
> :)
Yes, never is enforced after all (phew).
But we are setting VM_HUGEPAGE anyway... but at any rate, discussion is
ongoing of course...
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 14:40 ` Lorenzo Stoakes
2025-05-15 14:44 ` David Hildenbrand
@ 2025-05-15 15:28 ` Usama Arif
2025-05-15 16:06 ` Lorenzo Stoakes
1 sibling, 1 reply; 51+ messages in thread
From: Usama Arif @ 2025-05-15 15:28 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On 15/05/2025 15:40, Lorenzo Stoakes wrote:
> Overall I feel this series should _DEFINITELY_ be an RFC. This is pretty
> outlandish stuff and needs discussion.
>
There was a lot of discussion in the
original patch (https://lore.kernel.org/linux-mm/20250507141132.2773275-1-usamaarif642@gmail.com/).
And there was a conclusion to go with David's suggestion (https://lore.kernel.org/all/13b68fa0-8755-43d8-8504-d181c2d46134@gmail.com/)
and the following reply (https://lore.kernel.org/all/97702ff0-fc50-4779-bfa8-83dc42352db1@redhat.com/)
> You're basically making it so /sys/kernel/mm/transparent_hugepage/enabled =
> never is completely ignored and overridden. Which I am emphatically not
> comfortable with. And you're not saying that you're doing this,
> anywhere. Which is wrong.
No I am not.
hugepage_global_always and hugepage_global_enabled will evaluate to false
and you will not get a hugepage.
>
> Also, this patch is quite broken.
>
> I'm hugely not a fan of adding mm_struct->flags2, and I'm even more not a
> fan of you not mentioning such a completely fundamental change in the
> commit message.
This was also discussed in the original series.
If there is a very serious issue with going with flags2, I can try and just
reuse mm->flags bit 18, but it will mean that only MMF2_THP_VMA_DEFAULT_HUGE
can be implemented and not MMF2_THP_VMA_DEFAULT_NOHUGE
We have run out of bits in mm->flags.
If there is something new that we are developing that needs another bit,
we either need to add flags2 (I don't care about the name, can be anything),
or we need to limit it to 64 bit machines only.
If the maintainers have an issue with flags2, I can limit this to 64 bits,
it will probably mean ifdefs everywhere...
>
> This patch also breaks VMA merging and the VMA tests...
>
It's doing the same as KSM, as suggested by David. Does KSM break these tests?
Is there some specific test you can point to that I can run that is breaking
with this patch and not without it?
> I really feel this series needs to be an RFC until we can get some
> consensus on how to approach this.
There was consensus in https://lore.kernel.org/all/97702ff0-fc50-4779-bfa8-83dc42352db1@redhat.com/
>
> On Thu, May 15, 2025 at 02:33:30PM +0100, Usama Arif wrote:
>> This is set via the new PR_SET_THP_POLICY prctl.
>
> What is?
>
> You're making very major changes here, including adding a new flag to
> mm_struct (!!) and the explanation/justification for this is missing.
>
I have added the justification in my reply to your comments on the cover letter.
>> This will set the MMF2_THP_VMA_DEFAULT_HUGE process flag
>> which changes the default of new VMAs to be VM_HUGEPAGE. The
>> call also modifies all existing VMAs that are not VM_NOHUGEPAGE
>> to be VM_HUGEPAGE. The policy is inherited during fork+exec.
>
> So you can only set this flag?
>
??
>>
>> This allows systems where the global policy is set to "madvise"
>> to effectively have THPs always for the process. In an environment
>> where different types of workloads are stacked on the same machine,
>> this will allow workloads that benefit from always having hugepages
>> to do so, without regressing those that don't.
>
> Again, this explanation really makes no sense at all to me, I don't really
> know what you mean, you're not going into what you're doing in this change,
> this is just a very unclear commit message.
>
I hope this is answered in my reply to your comments on the cover letter.
>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> ---
>> include/linux/huge_mm.h | 3 ++
>> include/linux/mm_types.h | 11 +++++++
>> include/uapi/linux/prctl.h | 4 +++
>> kernel/fork.c | 1 +
>> kernel/sys.c | 21 ++++++++++++
>> mm/huge_memory.c | 32 +++++++++++++++++++
>> mm/vma.c | 2 ++
>> tools/include/uapi/linux/prctl.h | 4 +++
>> .../trace/beauty/include/uapi/linux/prctl.h | 4 +++
>> 9 files changed, 82 insertions(+)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2f190c90192d..e652ad9ddbbd 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -260,6 +260,9 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
>> return orders;
>> }
>>
>> +void vma_set_thp_policy(struct vm_area_struct *vma);
>
> This is a VMA-specific function but you're putting it in huge_mm.h? Why
> can't this be in vma.h or vma.c?
>
Sure can move it there.
>> +void process_vmas_thp_default_huge(struct mm_struct *mm);
>
> 'vmas' is redundant here.
>
Sure.
>> +
>> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>> unsigned long vm_flags,
>> unsigned long tva_flags,
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index e76bade9ebb1..2fe93965e761 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -1066,6 +1066,7 @@ struct mm_struct {
>> mm_context_t context;
>>
>> unsigned long flags; /* Must use atomic bitops to access */
>> + unsigned long flags2;
>
>
> Ugh, god really??
>
> I really am not a fan of adding flags2 just to add a prctl() feature like
> this. This is crazy.
>
> Also this is a TERRIBLE name. I mean, no please PLEASE no.
>
> Do we really have absolutely no choice but to add a new flags field here?
>
> It again doesn't help that you don't mention nor even try to justify this
> in the commit message or cover letter.
>
And again, I hope my reply to your email has given you the justification.
> If this is a 32-bit kernel vs. 64-bit kernel thing so we 'ran out of bits',
> let's just go make this flags field 64-bit on 32-bit kernels.
>
> I mean - I'm kind of insisting we do that to be honest. Because I really
> don't like this.
If the maintainers want this, I will make it a 64 bit only feature. We
are only using it for 64 bit servers. But it will probably mean ifdef
config 64 bit in a lot of places.
>
> Also if we _HAVE_ to have this, shouldn't we duplicate that comment about
> atomic bitops?...
>
Sure
>>
>> #ifdef CONFIG_AIO
>> spinlock_t ioctx_lock;
>> @@ -1744,6 +1745,11 @@ enum {
>> MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
>> MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
>>
>> +#define MMF2_THP_VMA_DEFAULT_HUGE 0
>
> I thought the whole idea was to move away from explicitly referencing 'THP'
> in a future where large folios are implicit and now we're saying 'THP'.
>
> Anyway the 'VMA' is totally redundant here.
>
Sure, I can remove VMA.
I see THP everywhere in the kernel code.
It's mentioned 108 times in transhuge.rst alone :)
If you have any suggestion to rename this flag, happy to take it :)
>> +#define MMF2_THP_VMA_DEFAULT_HUGE_MASK (1 << MMF2_THP_VMA_DEFAULT_HUGE)
>
> Do we really need explicit trivial mask declarations like this?
>
I have followed the convention that has existed in this file, please see below
links :)
https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1645
https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1623
https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1603
https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1582
>> +
>> +#define MMF2_INIT_MASK (MMF2_THP_VMA_DEFAULT_HUGE_MASK)
>
>> +
>> static inline unsigned long mmf_init_flags(unsigned long flags)
>> {
>> if (flags & (1UL << MMF_HAS_MDWE_NO_INHERIT))
>> @@ -1752,4 +1758,9 @@ static inline unsigned long mmf_init_flags(unsigned long flags)
>> return flags & MMF_INIT_MASK;
>> }
>>
>> +static inline unsigned long mmf2_init_flags(unsigned long flags)
>> +{
>> + return flags & MMF2_INIT_MASK;
>> +}
>> +
>> #endif /* _LINUX_MM_TYPES_H */
>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>> index 15c18ef4eb11..325c72f40a93 100644
>> --- a/include/uapi/linux/prctl.h
>> +++ b/include/uapi/linux/prctl.h
>> @@ -364,4 +364,8 @@ struct prctl_mm_map {
>> # define PR_TIMER_CREATE_RESTORE_IDS_ON 1
>> # define PR_TIMER_CREATE_RESTORE_IDS_GET 2
>>
>> +#define PR_SET_THP_POLICY 78
>> +#define PR_GET_THP_POLICY 79
>> +#define PR_THP_POLICY_DEFAULT_HUGE 0
>> +
>> #endif /* _LINUX_PRCTL_H */
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 9e4616dacd82..6e5f4a8869dc 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1054,6 +1054,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>>
>> if (current->mm) {
>> mm->flags = mmf_init_flags(current->mm->flags);
>> + mm->flags2 = mmf2_init_flags(current->mm->flags2);
>> mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
>> } else {
>> mm->flags = default_dump_filter;
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index c434968e9f5d..1115f258f253 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -2658,6 +2658,27 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>> clear_bit(MMF_DISABLE_THP, &me->mm->flags);
>> mmap_write_unlock(me->mm);
>> break;
>> + case PR_GET_THP_POLICY:
>> + if (arg2 || arg3 || arg4 || arg5)
>> + return -EINVAL;
>> + if (!!test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2))
>
> I really don't think we need the !!? Do we?
I have followed the convention that has existed in this file already,
please see:
https://elixir.bootlin.com/linux/v6.14.6/source/kernel/sys.c#L2644
>
> Shouldn't we lock the mm when we do this no? Can't somebody change this?
>
It wasn't locked in PR_GET_THP_DISABLE
https://elixir.bootlin.com/linux/v6.14.6/source/kernel/sys.c#L2644
I can acquire mmap_write_lock_killable, the same as PR_SET_THP_POLICY,
in the next series.
I can also add the lock in PR_GET_THP_DISABLE.
>> + error = PR_THP_POLICY_DEFAULT_HUGE;
>> + break;
>> + case PR_SET_THP_POLICY:
>> + if (arg3 || arg4 || arg5)
>> + return -EINVAL;
>> + if (mmap_write_lock_killable(me->mm))
>> + return -EINTR;
>> + switch (arg2) {
>> + case PR_THP_POLICY_DEFAULT_HUGE:
>> + set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
>> + process_vmas_thp_default_huge(me->mm);
>> + break;
>> + default:
>> + return -EINVAL;
>> + }
>> + mmap_write_unlock(me->mm);
>> + break;
>> case PR_MPX_ENABLE_MANAGEMENT:
>> case PR_MPX_DISABLE_MANAGEMENT:
>> /* No longer implemented: */
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 2780a12b25f0..64f66d5295e8 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -98,6 +98,38 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>> return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
>> }
>>
>> +void vma_set_thp_policy(struct vm_area_struct *vma)
>> +{
>> + struct mm_struct *mm = vma->vm_mm;
>> +
>> + if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
>> + vm_flags_set(vma, VM_HUGEPAGE);
>> +}
>> +
>> +static void vmas_thp_default_huge(struct mm_struct *mm)
>> +{
>> + struct vm_area_struct *vma;
>> + unsigned long vm_flags;
>> +
>> + VMA_ITERATOR(vmi, mm, 0);
>
> This is a declaration, it should be grouped with declarations...
>
Sure, will make the change in next version.
Unfortunately checkpatch didn't complain.
>> + for_each_vma(vmi, vma) {
>> + vm_flags = vma->vm_flags;
>> + if (vm_flags & VM_NOHUGEPAGE)
>> + continue;
>
> Literally no point in you putting vm_flags as a separate variable here.
>
Sure, will make the change in next version.
> So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise
> is to override global 'never'?
>
Again, I am not overriding never.
hugepage_global_always and hugepage_global_enabled will evaluate to false
and you will not get a hugepage.
> I'm really concerned about this.
>
>> + vm_flags_set(vma, VM_HUGEPAGE);
>> + }
>> +}
>
> Do we have an mmap write lock established here? Can you confirm that? Also
> you should add an assert for that here.
>
Yes I do, it's only called in PR_SET_THP_POLICY where the mmap write lock was taken.
I can add an assert if it helps.
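For readers following along, the walk over existing VMAs can be modelled in
userspace roughly as below. This is only a sketch: struct model_vma, the
MODEL_* flags and model_default_huge() are invented stand-ins, a plain array
replaces the VMA iterator, and there is no locking. It mirrors the rule in
the quoted loop: a VMA with VM_NOHUGEPAGE is left alone, every other VMA
gets VM_HUGEPAGE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's per-VMA THP hint flags. */
#define MODEL_VM_HUGEPAGE   (1UL << 0)
#define MODEL_VM_NOHUGEPAGE (1UL << 1)

/* Heavily reduced, hypothetical stand-in for struct vm_area_struct. */
struct model_vma {
	unsigned long start;
	unsigned long end;
	unsigned long vm_flags;
};

/*
 * Apply the "default huge" rule to every VMA in the array.
 * Returns how many VMAs actually had their flags changed.
 */
static size_t model_default_huge(struct model_vma *vmas, size_t n)
{
	size_t changed = 0;

	assert(vmas != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		/* An explicit opt-out is preserved. */
		if (vmas[i].vm_flags & MODEL_VM_NOHUGEPAGE)
			continue;
		if (!(vmas[i].vm_flags & MODEL_VM_HUGEPAGE))
			changed++;
		vmas[i].vm_flags |= MODEL_VM_HUGEPAGE;
	}
	return changed;
}
```

Running it twice over the same array changes nothing the second time, so
the operation is idempotent in this model.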
>> +
>> +void process_vmas_thp_default_huge(struct mm_struct *mm)
>> +{
>> + if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
>> + return;
>> +
>> + set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2);
>> + vmas_thp_default_huge(mm);
>> +}
>> +
>> +
>> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>> unsigned long vm_flags,
>> unsigned long tva_flags,
>> diff --git a/mm/vma.c b/mm/vma.c
>> index 1f2634b29568..101b19c96803 100644
>> --- a/mm/vma.c
>> +++ b/mm/vma.c
>> @@ -2476,6 +2476,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
>> if (!vma_is_anonymous(vma))
>> khugepaged_enter_vma(vma, map->flags);
>> ksm_add_vma(vma);
>> + vma_set_thp_policy(vma);
>
> You're breaking VMA merging completely by doing this here...
>
> Now I can map one VMA with this policy set, then map another immediately
> next to it and - oops - no merge, ever, because the VM_HUGEPAGE flag is not
> set in the new VMA on merge attempt.
>
> I realise KSM is just as broken (grr) but this doesn't justify us
> completely breaking VMA merging here.
I think this answers it. It's doing the same as KSM.
>
> You need to set earlier than this. Then of course a driver might decide to
> override this, so maybe then we need to override that.
>
> But then we're getting into realms of changing fundamental VMA code _just
> for this feature_.
>
> Again I'm iffy about this. Very.
>
> Also you've broken the VMA userland tests here:
>
> $ cd tools/testing/vma
> $ make
> ...
> In file included from vma.c:33:
> ../../../mm/vma.c: In function ‘__mmap_new_vma’:
> ../../../mm/vma.c:2486:9: error: implicit declaration of function ‘vma_set_thp_policy’; did you mean ‘vma_dup_policy’? [-Wimplicit-function-declaration]
> 2486 | vma_set_thp_policy(vma);
> | ^~~~~~~~~~~~~~~~~~
> | vma_dup_policy
> make: *** [<builtin>: vma.o] Error 1
>
> You need to create stubs accordingly.
>
Thanks will do.
>> *vmap = vma;
>> return 0;
>>
>> @@ -2705,6 +2706,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
>> mm->map_count++;
>> validate_mm(mm);
>> ksm_add_vma(vma);
>> + vma_set_thp_policy(vma);
>
> You're breaking merging again... This is quite a bad case too as now you'll
> have totally fragmented brk VMAs no?
>
Again doing it the same as KSM.
> We can't have it implemented this way.
>
>> out:
>> perf_event_mmap(vma);
>> mm->total_vm += len >> PAGE_SHIFT;
>> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
>> index 35791791a879..f5945ebfe3f2 100644
>> --- a/tools/include/uapi/linux/prctl.h
>> +++ b/tools/include/uapi/linux/prctl.h
>> @@ -328,4 +328,8 @@ struct prctl_mm_map {
>> # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */
>> # define PR_PPC_DEXCR_CTRL_MASK 0x1f
>>
>> +#define PR_SET_THP_POLICY 78
>> +#define PR_GET_THP_POLICY 79
>> +#define PR_THP_POLICY_DEFAULT_HUGE 0
>> +
>> #endif /* _LINUX_PRCTL_H */
>> diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
>> index 15c18ef4eb11..325c72f40a93 100644
>> --- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
>> +++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
>> @@ -364,4 +364,8 @@ struct prctl_mm_map {
>> # define PR_TIMER_CREATE_RESTORE_IDS_ON 1
>> # define PR_TIMER_CREATE_RESTORE_IDS_GET 2
>>
>> +#define PR_SET_THP_POLICY 78
>> +#define PR_GET_THP_POLICY 79
>> +#define PR_THP_POLICY_DEFAULT_HUGE 0
>> +
>> #endif /* _LINUX_PRCTL_H */
>> --
>> 2.47.1
>>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 14:44 ` David Hildenbrand
2025-05-15 14:56 ` Usama Arif
@ 2025-05-15 15:45 ` Liam R. Howlett
2025-05-15 15:57 ` David Hildenbrand
1 sibling, 1 reply; 51+ messages in thread
From: Liam R. Howlett @ 2025-05-15 15:45 UTC (permalink / raw)
To: David Hildenbrand
Cc: Lorenzo Stoakes, Usama Arif, Andrew Morton, linux-mm, hannes,
shakeel.butt, riel, ziy, laoar.shao, baolin.wang, npache,
ryan.roberts, linux-kernel, linux-doc, kernel-team
* David Hildenbrand <david@redhat.com> [250515 10:44]:
> On 15.05.25 16:40, Lorenzo Stoakes wrote:
> > Overall I feel this series should _DEFINITELY_ be an RFC. This is pretty
> > outlandish stuff and needs discussion.
> >
> > You're basically making it so /sys/kernel/mm/transparent_hugepage/enabled =
> > never is completely ignored and overridden.
>
> I thought I made it very clear during earlier discussions that never means
> never.
I also thought so, but the comments later made here [1] seem to
contradict that?
It seems "never" means "default_no" and not actually "never"?
Maybe the global/system toggles need to affect the state of each other?
That is, if /sys/kernel/mm/transparent_hugepage/enabled is never and you
set /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled to
madvise, it should not leave /sys/kernel/mm/transparent_hugepage/enabled
as never.
I just don't see "never" as the shutoff of the feature that I would
expect if it is overwritten by another enabled setting?
Obviously the need exists for a usecase of thp setting being inherited
as this is the 3rd(?) attempt at it.
We have control groups for resource control. We have decided THP is not
a resource but a policy (right?) and policies don't belong in control
groups.
I'm fine with this, btw. I just do see the similarities in the
inheritance above and the control group layout. Also, the cgroups name
doesn't exactly limit the control to resources.
I agree with Lorenzo that discussion is needed because navigating what
we have now is difficult to understand and it's going to be difficult to
make any additions understandable.
Thanks,
Liam
[1]. https://lore.kernel.org/all/97702ff0-fc50-4779-bfa8-83dc42352db1@redhat.com/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY
2025-05-15 15:15 ` Lorenzo Stoakes
@ 2025-05-15 15:54 ` Usama Arif
2025-05-15 16:04 ` David Hildenbrand
2025-05-15 16:24 ` Lorenzo Stoakes
0 siblings, 2 replies; 51+ messages in thread
From: Usama Arif @ 2025-05-15 15:54 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On 15/05/2025 16:15, Lorenzo Stoakes wrote:
> Thanks for coming back to me so quickly, appreciated :)
>
> I am reacting in a 'WTF' way here, but it's in proportion to the (at least
> perceived) magnitude of this change. We really need to be sure this is
> right.
>
Lol I had to rewrite my replies a few times to tone them down.
Hopefully I don't come across as aggressive :)
> On Thu, May 15, 2025 at 03:50:47PM +0100, Usama Arif wrote:
>>
>>
>> On 15/05/2025 14:55, Lorenzo Stoakes wrote:
>>> On Thu, May 15, 2025 at 02:33:29PM +0100, Usama Arif wrote:
>>>> This allows to change the THP policy of a process, according to the value
>>>> set in arg2, all of which will be inherited during fork+exec:
>>>
>>> This is pretty confusing.
>>>
>>> It should be something like 'add a new prctl() option that allows...' etc.
>>>
>>>> - PR_THP_POLICY_DEFAULT_HUGE: This will set the MMF2_THP_VMA_DEFAULT_HUGE
>>>> process flag which changes the default of new VMAs to be VM_HUGEPAGE. The
>>>> call also modifies all existing VMAs that are not VM_NOHUGEPAGE
>>>> to be VM_HUGEPAGE.
>>>
>>> This is referring to implementation detail that doesn't matter for an overview,
>>> just add a summary here e.g.
>>>
>>> PR_THP_POLICY_DEFAULT_HUGE - set VM_HUGEPAGE flag in all VMAs by default,
>>> including after fork/exec, ignoring global policy.
>>>
>>> PR_THP_POLICY_DEFAULT_NOHUGE - clear VM_HUGEPAGE flag in all VMAs by default,
>>> including after fork/exec, ignoring global policy.
>>>
>>> PR_THP_POLICY_DEFAULT_SYSTEM - Eliminate any policy set above.
>>
>> Hi Lorenzo,
>>
>> Thanks for the review. I will make the cover letter clearer in the next revision.
>
> The next version should emphatically be an RFC also, please. Your cover letter
> should mention you're fundamentally changing mm_struct and VMA logic, and
> explain why your use case is so important that that is justified.
>
Thanks, will make it RFC and add that I am making changes to mm_struct and VMA logic.
>>
>>>
>>>> This allows systems where the global policy is set to "madvise"
>>>> to effectively have THPs always for the process. In an environment
>>>> where different types of workloads are stacked on the same machine
>>>> whose global policy is set to "madvise", this will allow workloads
>>>> that benefit from always having hugepages to do so, without regressing
>>>> those that don't.
>>>
>>> So does this just ignore and override the global policy? I'm not sure I'm
>>> comfortable with that.
>>
>> No. The decision making of when and what order THPs are allowed is not
>> changed, i.e. there are no changes in __thp_vma_allowable_orders and
>> thp_vma_allowable_orders. David has the same concern as you and this
>> current series is implementing what David suggested in
>> https://lore.kernel.org/all/3f7ba97d-04d5-4ea4-9f08-6ec3584e0d4c@redhat.com/
>>
>> It will change the existing VMA (NO)HUGE flags according to
>> the prctl. For e.g. doing PR_THP_POLICY_DEFAULT_HUGE will not give
>> a THP when global policy is never.
>
> Umm...
>
> + case PR_SET_THP_POLICY:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + if (mmap_write_lock_killable(me->mm))
> + return -EINTR;
> + switch (arg2) {
> + case PR_THP_POLICY_DEFAULT_HUGE:
> + set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
> + process_vmas_thp_default_huge(me->mm);
> + break;
> + default:
>
>
> Where's the check against never? You're unconditionally setting VM_HUGEPAGE?
So this was from the discussion with David. My initial implementation in v1
messed with the policy evaluation in thp_vma_allowable_orders and __thp_vma_allowable_orders.
The whole point of doing it this way is that you don't mess with the policy evaluation.
hugepage_global_always and hugepage_global_enabled will still evaluate to false
when never is set and you will not get a hugepage. But more on it below.
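To make the argument concrete, here is a toy model of that evaluation order. It is only a sketch under simplifying assumptions: the flag values and the enum are made up for illustration, per-size toggles and MMF_DISABLE_THP are ignored, and it is not the logic in __thp_vma_allowable_orders itself. The point it shows is that a VM_HUGEPAGE flag is only consulted once the global setting has allowed THP at all.

```c
#include <stdbool.h>

/* Illustrative stand-ins; the real kernel flag values differ. */
#define VM_HUGEPAGE   (1UL << 0)
#define VM_NOHUGEPAGE (1UL << 1)

/* Hypothetical encoding of /sys/kernel/mm/transparent_hugepage/enabled. */
enum thp_global { THP_NEVER, THP_MADVISE, THP_ALWAYS };

/*
 * Toy decision function: VM_NOHUGEPAGE always wins, "always" allows any
 * remaining VMA, "madvise" requires VM_HUGEPAGE, and "never" allows
 * nothing regardless of the VMA flags.
 */
bool thp_allowed(enum thp_global global, unsigned long vm_flags)
{
	if (vm_flags & VM_NOHUGEPAGE)
		return false;
	if (global == THP_ALWAYS)
		return true;
	if (global == THP_MADVISE)
		return (vm_flags & VM_HUGEPAGE) != 0;
	return false;	/* THP_NEVER: VM_HUGEPAGE makes no difference */
}
```

In this model, setting VM_HUGEPAGE on every VMA turns "madvise" into an effective "always" for the process, while "never" stays "never".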
>
> You're relying on VM_HUGEPAGE being ignored in this instance? But you're still:
>
> 1. Setting VM_HUGEPAGE everywhere (and breaking VMA merging everywhere).
>
> 2. Setting MMF2_THP_VMA_DEFAULT_HUGE and making it so PR_GET_THP_POLICY says it
> has a policy of default huge even if policy is set to never?
>
> I'm not ok with that. I'd much rather we do the never check here...
>
I am ok with that. I can add a check over here that wraps this in:
if (hugepage_global_enabled())
...
> Also see hugepage_madvise(). There's arch-specific code that overrides
> that, and you're now bypassing that (yes it's for one arch of course but
> it's still a thing)
>
Thanks, I will put
if (mm_has_pgste(vma->vm_mm))
return 0;
at the start.
>>
>>>
>>> What about if the policy is 'never'? Does this override that? That seems
>>> completely wrong.
>>
>> No, it won't override it. hugepage_global_always and hugepage_global_enabled
>> will still evaluate to false and you won't get a hugepage no matter what prctl
>> is set.
>
> Ack ok I see as above, you're relying on VM_HUGEPAGE enforcing this.
>
> You really need to put stuff like this in the cover letter though!!
>
Sure will do in the next revision, Thanks.
>>
>>>
>>>> - PR_THP_POLICY_DEFAULT_NOHUGE: This will set the MMF2_THP_VMA_DEFAULT_NOHUGE
>>>> process flag which changes the default of new VMAs to be VM_NOHUGEPAGE.
>>>> The call also modifies all existing VMAs that are not VM_HUGEPAGE
>>>> to be VM_NOHUGEPAGE.
>>>> This allows systems where the global policy is set to "always"
>>>> to effectively have THPs on madvise only for the process. In an
>>>> environment where different types of workloads are stacked on the
>>>> same machine whose global policy is set to "always", this will allow
>>>> workloads that benefit from having hugepages on an madvise basis only
>>>> to do so, without regressing those that benefit from having hugepages
>>>> always.
>>>
>>> Wait, so 'no huge' means 'madvise'? What? This is confusing.
>>
>>
>> I probably made the cover letter confusing :) or maybe need to rename the flags.
>>
>> This flag works as follows:
>>
>> a) Changes the default flag of new VMAs to be VM_NOHUGEPAGE
>>
>> b) Modifies all existing VMAs that are not VM_HUGEPAGE to be VM_NOHUGEPAGE
>>
>> c) Is inherited during fork+exec
>>
>> I think maybe I should add VMA to the flag names and rename the flags to
>> PR_THP_POLICY_DEFAULT_VMA_(NO)HUGE ??
>
> Please no :) 'VMA' is implicit re: mappings. If you're touching memory
> mappings you're necessarily touching VMAs.
>
> I know some prctl() (a pathway to many abilities some consider to be
> unnatural) uses 'VMA' in some of the endpoints but generally when referring
> to specific VMAs no?
>
> These names are already kinda horrible (yes naming is hard, for everyone,
> ask me about MADV_POISON/REMEDY) but I think something like:
>
> PR_DEFAULT_MADV_HUGEPAGE
> PR_DEFAULT_MADV_NOHUGEPAGE
>
> -ish :)
>
Sure, happy with that, Thanks.
>>
>>>
>>>> - PR_THP_POLICY_DEFAULT_SYSTEM: This will clear the MMF2_THP_VMA_DEFAULT_HUGE
>>>> and MMF2_THP_VMA_DEFAULT_NOHUGE process flags.
>>>>
>>>> These patches are required in rolling out hugepages in hyperscaler
>>>> configurations for workloads that benefit from them, where workloads are
>>>> stacked and a single THP global policy is likely to be used across the entire
>>>> fleet, and prctl will help override it.
>>>
>>> I don't understand this justification whatsoever. What does 'stacked' mean? And
>>> you're not justifying why you'd override the policy?
>>
>> By stacked I just meant different types of workloads running on the same machine.
>> Let's say we have a single server whose global policy is set to madvise.
>> You can have a container on that server running some database workload that best
>> works with madvise.
>> You can have another container on that same server running some AI workload that would
>> benefit from having VM_HUGEPAGE set on all new VMAs. We can use prctl
>> PR_THP_POLICY_DEFAULT_HUGE to get VM_HUGEPAGE set by default on all new VMAs for that
>> container.
>>
>>>
>>> This series has no actual justification here at all? You really need to provide one.
>>>
>>
>> There was a discussion on the usecases in
>> https://lore.kernel.org/all/13b68fa0-8755-43d8-8504-d181c2d46134@gmail.com/
>>
>> I tried (and I guess failed :)) to summarize the justification from that thread.
>
> It's fine, I have most definitely not been as clear as I could be in series
> too :>) just need to add a bigger summary.
>
> Don't be afraid to waffle on... (I know I am not... ;)
>
>>
>> I will try and rephrase it here.
>>
>> In hyperscalers, we have a single THP policy for the entire fleet.
>> We have different types of workloads (e.g. AI/compute/databases/etc)
>> running on a single server (this is what I meant by 'stacked').
>> Some of these workloads will benefit from always getting THP at fault (or collapsed
>> by khugepaged), some of them will benefit by only getting them at madvise.
>>
>> This series is useful for 2 usecases:
>>
>> 1) global system policy = madvise, while we want some workloads to get THPs
>> at fault and by khugepaged :- some processes (e.g. AI workloads) benefit from getting
>> THPs at fault (and collapsed by khugepaged). Other workloads like databases will incur
>> a regression (either a performance regression or they are completely memory bound and
>> even a very slight increase in memory will cause them to OOM). So what these patches
>> will do is allow setting prctl(PR_THP_POLICY_DEFAULT_HUGE) on the AI workloads.
>> (This is how workloads are deployed in our (Meta's/Facebook) fleet at this moment).
>>
>> 2) global system policy = always, while we want some workloads to get THPs
>> only on madvise basis :- Same reason as 1). What these patches
>> will do is allow setting prctl(PR_THP_POLICY_DEFAULT_NOHUGE) on the database
>> workloads.
>> (We hope this is us (Meta) in the near future, if a majority of workloads show that they
>> benefit from always, we flip the default host setting to "always" across the fleet and
>> workloads that regress can opt-out and be "madvise".
>> New services developed will then be tested with always by default. "always" is also the
>> default defconfig option upstream, so I would imagine this is faced by others as well.)
>
> Right, but I'm not sure you're explaining why prctl(), one of the most cursed,
> neglected and frankly evil (maybe exaggerating :P) APIs in the kernel is the way
> to do this?
>
> You do need to summarise why the suggested idea re: BPF, or cgroups, or whatnot
> is _totally unworkable_.
>
> And why not process_madvise() with MADV_HUGEPAGE?
>
> I'm also not sure fork/exec is a great situation to have, because are you sure
> the workloads stay the same across all fork/execs that you're now propagating?
>
> It feels like this should be a cgroup thing, really.
>
So I actually don't mind the cgroup implementation (that was actually my first
prototype and after that I saw there was someone who had posted it earlier).
It was shot down because it won't be hierarchical and doesn't solve it when
it's not being done in a cgroup.
A large proportion of the thread in v1 was discussion with David, Johannes, Zi and
Yafang (the bpf THP policy author) on different ways of doing this.
>>
>> Hope this makes the justification for the patches clearer :)
>
> Sure, please add this kind of thing to the cover letter to get fewer 'wtf'
> reactions :)
>
> You're doing something really _big_ and _opinionated_ here though, that's
> basically fundamentally changing core stuff, so we need an extended discussion of
> why you feel it's so important, why other approaches are not workable, why the
> Sauron-spawned Mordor dwelling prctl() API is the way to go, etc.
>
>>
>>>>
>>>> v1->v2:
>>>
>>> Where was the v1? Is it [0]?
>>>
>>> This seems like a massive change compared to that series?
>>>
>>> You've renamed it and not referenced the old series, please make sure you link
>>> it or somehow let somebody see what this is against, because it makes review
>>> difficult.
>>>
>>
>> Yes, it's the patch you linked below. Sorry, I should have linked it in this series.
>> It's a big change, but it was basically incorporating all feedback from David,
>> while trying to achieve a similar goal. Will link it in future series.
>
> Yeah, again, this should have been an RFC on that basis :)
>
>>
>>> [0]: https://lore.kernel.org/linux-mm/20250507141132.2773275-1-usamaarif642@gmail.com/
>>>
>>>> - change from modifying the THP decision making for the process, to modifying
>>>> VMA flags only. This prevents further complicating the logic used to
>>>> determine THP order (Thanks David!)
>>>> - change from using a prctl per policy change to just using PR_SET_THP_POLICY
>>>> and arg2 to set the policy. (Zi Yan)
>>>> - Introduce PR_THP_POLICY_DEFAULT_NOHUGE and PR_THP_POLICY_DEFAULT_SYSTEM
>>>> - Add selftests and documentation.
>>>>
>>>> Usama Arif (6):
>>>> prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
>>>> prctl: introduce PR_THP_POLICY_DEFAULT_NOHUGE for the process
>>>> prctl: introduce PR_THP_POLICY_SYSTEM for the process
>>>> selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_NOHUGE
>>>> selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
>>>> docs: transhuge: document process level THP controls
>>>>
>>>> Documentation/admin-guide/mm/transhuge.rst | 40 +++
>>>> include/linux/huge_mm.h | 4 +
>>>> include/linux/mm_types.h | 14 +
>>>> include/uapi/linux/prctl.h | 6 +
>>>> kernel/fork.c | 1 +
>>>> kernel/sys.c | 35 +++
>>>> mm/huge_memory.c | 56 ++++
>>>> mm/vma.c | 2 +
>>>> tools/include/uapi/linux/prctl.h | 6 +
>>>> .../trace/beauty/include/uapi/linux/prctl.h | 6 +
>>>> tools/testing/selftests/prctl/Makefile | 2 +-
>>>> tools/testing/selftests/prctl/thp_policy.c | 286 ++++++++++++++++++
>>>> 12 files changed, 457 insertions(+), 1 deletion(-)
>>>> create mode 100644 tools/testing/selftests/prctl/thp_policy.c
>>>>
>>>> --
>>>> 2.47.1
>>>>
>>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 15:45 ` Liam R. Howlett
@ 2025-05-15 15:57 ` David Hildenbrand
2025-05-15 16:38 ` Lorenzo Stoakes
0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-05-15 15:57 UTC (permalink / raw)
To: Liam R. Howlett, Lorenzo Stoakes, Usama Arif, Andrew Morton,
linux-mm, hannes, shakeel.butt, riel, ziy, laoar.shao,
baolin.wang, npache, ryan.roberts, linux-kernel, linux-doc,
kernel-team
On 15.05.25 17:45, Liam R. Howlett wrote:
> * David Hildenbrand <david@redhat.com> [250515 10:44]:
>> On 15.05.25 16:40, Lorenzo Stoakes wrote:
>>> Overall I feel this series should _DEFINITELY_ be an RFC. This is pretty
>>> outlandish stuff and needs discussion.
>>>
>>> You're basically making it so /sys/kernel/mm/transparent_hugepage/enabled =
>>> never is completely ignored and overridden.
>>
>> I thought I made it very clear during earlier discussions that never means
>> never.
>
> I also thought so, but the comments later made here [1] seem to
> contradict that?
It's ... complicated.
>
> It seems "never" means "default_no" and not actually "never"?
We should consider these system toggles a single set of toggles that define a
state, and not individual toggles that overwrite each other.
If you say
/sys/kernel/mm/transparent_hugepage/enabled = never
and
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled = always
instead of the *default*
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled = inherit
the admin explicitly states "I want the system behavior for 2048kB not to be configured using
/sys/kernel/mm/transparent_hugepage/enabled". That's an admin decision, not a
per-process overwrite or whatever.
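Read as "one set of toggles that define a state", the resolution rule can be sketched as below. This is an illustrative model only; the enum and function names are hypothetical and not taken from the kernel. The per-size value is used as-is unless it is "inherit", in which case the top-level value applies.

```c
/* Hypothetical encoding of the sysfs "enabled" values. */
enum thp_setting {
	THP_SET_NEVER,
	THP_SET_MADVISE,
	THP_SET_ALWAYS,
	THP_SET_INHERIT,	/* only meaningful for a per-size toggle */
};

/*
 * Effective setting for one page size: an explicit per-size choice is
 * the admin's decision for that size; "inherit" defers to the global
 * /sys/kernel/mm/transparent_hugepage/enabled value.
 */
enum thp_setting effective_setting(enum thp_setting global,
				   enum thp_setting per_size)
{
	return per_size == THP_SET_INHERIT ? global : per_size;
}
```

So with the global toggle at "never" and hugepages-2048kB/enabled at "always", the effective 2048kB state in this model is "always", and with the default "inherit" it is "never".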
>
> Maybe the global/system toggles need to affect the state of each other?
> That is, if /sys/kernel/mm/transparent_hugepage/enabled is never and you
> set /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled to
> madvise, it should not leave /sys/kernel/mm/transparent_hugepage/enabled
> as never.
I recall we discussed that, but there was also a catch to that. :(
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY
2025-05-15 15:54 ` Usama Arif
@ 2025-05-15 16:04 ` David Hildenbrand
2025-05-15 16:24 ` Lorenzo Stoakes
1 sibling, 0 replies; 51+ messages in thread
From: David Hildenbrand @ 2025-05-15 16:04 UTC (permalink / raw)
To: Usama Arif, Lorenzo Stoakes
Cc: Andrew Morton, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
>> Please no :) 'VMA' is implicit re: mappings. If you're touching memory
>> mappings you're necessarily touching VMAs.
>>
>> I know some prctl() (a pathway to many abilities some consider to be
>> unnatural) uses 'VMA' in some of the endpoints but generally when referring
>> to specific VMAs no?
>>
>> These names are already kinda horrible (yes naming is hard, for everyone,
>> ask me about MADV_POISON/REMEDY) but I think something like:
>>
>> PR_DEFAULT_MADV_HUGEPAGE
>> PR_DEFAULT_MADV_NOHUGEPAGE
>>
>> -ish :)
>>
>
> Sure, happy with that, Thanks.
Yes, please :)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 15:28 ` Usama Arif
@ 2025-05-15 16:06 ` Lorenzo Stoakes
2025-05-15 16:11 ` David Hildenbrand
2025-05-15 16:47 ` Usama Arif
0 siblings, 2 replies; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-05-15 16:06 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On Thu, May 15, 2025 at 04:28:51PM +0100, Usama Arif wrote:
>
>
> On 15/05/2025 15:40, Lorenzo Stoakes wrote:
> > Overall I feel this series should _DEFINITELY_ be an RFC. This is pretty
> > outlandish stuff and needs discussion.
> >
>
> There was a lot of discussion in the
> original patch (https://lore.kernel.org/linux-mm/20250507141132.2773275-1-usamaarif642@gmail.com/).
> And there was a conclusion to go with David's suggestion (https://lore.kernel.org/all/13b68fa0-8755-43d8-8504-d181c2d46134@gmail.com/)
> and the following reply (https://lore.kernel.org/all/97702ff0-fc50-4779-bfa8-83dc42352db1@redhat.com/)
>
>
> > You're basically making it so /sys/kernel/mm/transparent_hugepage/enabled =
> > never is completely ignored and overridden. Which I am emphatically not
> > comfortable with. And you're not saying that you're doing this,
> > anywhere. Which is wrong.
>
> No I am not.
>
> hugepage_global_always and hugepage_global_enabled will evaluate to false
> and you will not get a hugepage.
Ack yes as discussed elsewhere.
However again, as discussed elsewhere, you are setting the MMF and VMA
flags without accounting for the logic in hugepage_madvise().
>
> >
> > Also, this patch is quite broken.
> >
> > I'm hugely not a fan of adding mm_struct->flags2, and I'm even more not a
> > fan of you not mentioning such a completely fundamental change in the
> > commit message.
>
> This was also discussed in the original series.
It being discussed in the other series doesn't mean we can no longer
discuss it :)
>
> If there is a very serious issue with going with flags2, I can try and just
> reuse mm->flags bit 18, but it will mean that only MMF2_THP_VMA_DEFAULT_HUGE
> can be implemented and not MMF2_THP_VMA_DEFAULT_NOHUGE
I have a big problem with us fundamentally changing a core mm data
structure for what appear to be very specific purposes, yes.
But I think we have a better solution... see below
>
> We have run out of bits in mm->flags.
> If there is something new that we are developing that needs another bit,
> we either need to add flags2 (I don't care about the name, can be anything),
> or we need to limit it to 64 bit machines only.
>
> If the maintainers have an issue with flags2. I can limit this to 64 bits,
> it will probably mean ifdefs everywhere...
I'm a maintainer also good sir :)
As I say elsewhere, we could make a pre-requisite change of making
mm->flags 64-bit even on 32-bit kernels.
This is a change that I will be doing to vma->vm_flags soon as I have the
exact same problem, and it's very silly indeed.
Adding 8 bytes per process for every single process everywhere just for
this to suit essentially deprecated kernels is... not ideal :)
But there are also further issues with having 2 flags fields, which I discuss
elsewhere in this semi-stream-of-consciousness..
>
> >
> > This patch also breaks VMA merging and the VMA tests...
> >
>
> It's doing the same as KSM as suggested by David. Does KSM break these tests?
> Is there some specific test you can point to that I can run that is breaking
> with this patch and not without it?
They don't build, at all. I explain how you can attempt a build below.
And no, KSM doesn't break the tests, because steps were taken to make them
not break the tests :) I mean it's really easy - it's just adding some
trivial stubs.
If you need help with it just ping me in whatever way helps and I can help!
It's understandable as it's not necessarily clear this is a thing (maybe we
need self tests to build it, but that might break CI setups so unclear).
The merging is much more important!
>
>
> > I really feel this series needs to be an RFC until we can get some
> > consensus on how to approach this.
>
> There was consensus in https://lore.kernel.org/all/97702ff0-fc50-4779-bfa8-83dc42352db1@redhat.com/
I disagree with this assessment; that doesn't look like consensus at all.
I think this is at least a very contentious, or at least _complicated_, topic
that we need to really dig into.
So in my view - it's this kind of situation that warrants an RFC until
there's some stabilisation and agreement on a way forward.
>
> >
> > On Thu, May 15, 2025 at 02:33:30PM +0100, Usama Arif wrote:
> >> This is set via the new PR_SET_THP_POLICY prctl.
> >
> > What is?
> >
> > You're making very major changes here, including adding a new flag to
> > mm_struct (!!) and the explanation/justification for this is missing.
> >
>
> I have added the justification in your reply to the coverletter.
As stated there, you've not explained why alternatives are unworkable, I
think we need this!
Sort of:
1. Why not cgroups? blah blah blah
2. Why not process_madvise()? blah blah blah
3. Why not bpf? blah blah blah
4. Why not <something I've not thought of>? blah blah blah
>
> >> This will set the MMF2_THP_VMA_DEFAULT_HUGE process flag
> >> which changes the default of new VMAs to be VM_HUGEPAGE. The
> >> call also modifies all existing VMAs that are not VM_NOHUGEPAGE
> >> to be VM_HUGEPAGE. The policy is inherited during fork+exec.
> >
> > So you can only set this flag?
> >
>
> ??
This patch is only allowing the setting of this flag. I am asking 'so you
can only set this flag?'
To which it appears the answer is, yes I think :)
An improved cover letter could say something like:
"
Here we implement the first flag intended to allow the _overriding_ of huge
page policy to ensure that, when
/sys/kernel/mm/transparent_hugepage/enabled is set to madvise, we are able
to maintain fine-grained control of individual processes, including any
fork/exec'd, by setting this flag.
In subsequent commits, we intend to permit further such control.
"
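From userspace, the intended usage would presumably look something like the sketch below: a launcher sets the policy on itself and then execs the workload, which inherits it. The constants are the values from this posting, not a released kernel ABI, so on a kernel without the series the call is expected to fail (or the number may mean something else entirely). The two function names are hypothetical.

```c
#include <sys/prctl.h>
#include <unistd.h>

/* Values as proposed in this series; not part of any released uapi header. */
#ifndef PR_SET_THP_POLICY
#define PR_SET_THP_POLICY 78
#endif
#ifndef PR_THP_POLICY_DEFAULT_HUGE
#define PR_THP_POLICY_DEFAULT_HUGE 0
#endif

/* Returns 0 on success, -1 with errno set if the kernel rejects the call. */
int set_thp_default_huge(void)
{
	return prctl(PR_SET_THP_POLICY,
		     (unsigned long)PR_THP_POLICY_DEFAULT_HUGE,
		     0UL, 0UL, 0UL);
}

/*
 * Set the policy on the current process, then exec the workload so that
 * it starts with the policy already in place. Only returns on failure.
 */
int exec_with_thp_default_huge(char *const argv[])
{
	if (set_thp_default_huge() != 0)
		return -1;
	execvp(argv[0], argv);
	return -1;
}
```

A container runtime or service manager could call something like exec_with_thp_default_huge() for the workloads that want it and leave everything else on the system default.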
>
> >>
> >> This allows systems where the global policy is set to "madvise"
> >> to effectively have THPs always for the process. In an environment
> >> where different types of workloads are stacked on the same machine,
> >> this will allow workloads that benefit from always having hugepages
> >> to do so, without regressing those that don't.
> >
> > Again, this explanation really makes no sense at all to me, I don't really
> > know what you mean, you're not going into what you're doing in this change,
> > this is just a very unclear commit message.
> >
>
> I hope this is answered in my reply to your coverletter.
You still need to improve the cover letter here I think, see above for a
suggestion!
>
> >>
> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >> ---
> >> include/linux/huge_mm.h | 3 ++
> >> include/linux/mm_types.h | 11 +++++++
> >> include/uapi/linux/prctl.h | 4 +++
> >> kernel/fork.c | 1 +
> >> kernel/sys.c | 21 ++++++++++++
> >> mm/huge_memory.c | 32 +++++++++++++++++++
> >> mm/vma.c | 2 ++
> >> tools/include/uapi/linux/prctl.h | 4 +++
> >> .../trace/beauty/include/uapi/linux/prctl.h | 4 +++
> >> 9 files changed, 82 insertions(+)
> >>
> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> index 2f190c90192d..e652ad9ddbbd 100644
> >> --- a/include/linux/huge_mm.h
> >> +++ b/include/linux/huge_mm.h
> >> @@ -260,6 +260,9 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
> >> return orders;
> >> }
> >>
> >> +void vma_set_thp_policy(struct vm_area_struct *vma);
> >
> > This is a VMA-specific function but you're putting it in huge_mm.h? Why
> > can't this be in vma.h or vma.c?
> >
>
> Sure can move it there.
>
> >> +void process_vmas_thp_default_huge(struct mm_struct *mm);
> >
> > 'vmas' is redundant here.
> >
>
> Sure.
> >> +
> >> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> >> unsigned long vm_flags,
> >> unsigned long tva_flags,
> >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >> index e76bade9ebb1..2fe93965e761 100644
> >> --- a/include/linux/mm_types.h
> >> +++ b/include/linux/mm_types.h
> >> @@ -1066,6 +1066,7 @@ struct mm_struct {
> >> mm_context_t context;
> >>
> >> unsigned long flags; /* Must use atomic bitops to access */
> >> + unsigned long flags2;
> >
> >
> > Ugh, god really??
> >
> > I really am not a fan of adding flags2 just to add a prctl() feature like
> > this. This is crazy.
> >
> > Also this is a TERRIBLE name. I mean, no please PLEASE no.
> >
> > Do we really have absolutely no choice but to add a new flags field here?
> >
> > It again doesn't help that you don't mention nor even try to justify this
> > in the commit message or cover letter.
> >
>
> And again, I hope my reply to your email has given you the justification.
No :) I understood why you did this though of course.
>
> > If this is a 32-bit kernel vs. 64-bit kernel thing so we 'ran out of bits',
> > let's just go make this flags field 64-bit on 32-bit kernels.
> >
> > I mean - I'm kind of insisting we do that to be honest. Because I really
> > don't like this.
>
>
> If the maintainers want this, I will make it a 64 bit only feature. We
> are only using it for 64 bit servers. But it will probably mean ifdef
> config 64 bit in a lot of places.
I'm going to presume you are including me in this category rather than
implying that you are deferring only to others :)
So, there's another option:
Have a prerequisite series that makes mm_struct->flags 64-bit on 32-bit
kernels, which solves this problem everywhere and avoids us wasting a bunch
of memory for a very specific usecase, splitting flag state across 2 fields
(which are no longer atomic as a whole of course), adding confusion,
possibly subtly breaking anywhere that assumes mm->flags completely
describes mm-granularity flag state etc.
The RoI here is not looking good, otherwise.
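As a rough userspace illustration of the single-field argument (this is not kernel code, and the struct, bit numbers and helper names are all hypothetical), a flags word that is 64 bits wide everywhere lets the two mutually exclusive policy bits live above bit 31 and still be switched in one atomic step, which two separate unsigned long fields cannot offer:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mm-wide flags word, 64 bits wide on every architecture. */
struct mm_flags {
	_Atomic uint64_t bits;
};

/* Hypothetical bit numbers beyond the reach of a 32-bit unsigned long. */
#define MMF_DEFAULT_HUGE_BIT   32
#define MMF_DEFAULT_NOHUGE_BIT 33

/* Atomically set one bit. */
void mm_flag_set(struct mm_flags *f, unsigned int bit)
{
	atomic_fetch_or(&f->bits, UINT64_C(1) << bit);
}

/* Read one bit. */
bool mm_flag_test(struct mm_flags *f, unsigned int bit)
{
	return (atomic_load(&f->bits) >> bit) & 1;
}

/*
 * Clear "nohuge" and set "huge" with a single compare-and-swap, so no
 * reader can observe both bits set or both bits clear mid-update.
 */
void mm_set_default_huge(struct mm_flags *f)
{
	uint64_t old = atomic_load(&f->bits);
	uint64_t new;

	do {
		new = (old & ~(UINT64_C(1) << MMF_DEFAULT_NOHUGE_BIT)) |
		      (UINT64_C(1) << MMF_DEFAULT_HUGE_BIT);
	} while (!atomic_compare_exchange_weak(&f->bits, &old, new));
}
```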
>
> >
> > Also if we _HAVE_ to have this, shouldn't we duplicate that comment about
> > atomic bitops?...
> >
>
> Sure
>
> >>
> >> #ifdef CONFIG_AIO
> >> spinlock_t ioctx_lock;
> >> @@ -1744,6 +1745,11 @@ enum {
> >> MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
> >> MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
> >>
> >> +#define MMF2_THP_VMA_DEFAULT_HUGE 0
> >
> > I thought the whole idea was to move away from explicitly referencing 'THP'
> > in a future where large folios are implicit and now we're saying 'THP'.
> >
> > Anyway the 'VMA' is totally redundant here.
> >
>
> Sure, I can remove VMA.
> I see THP everywhere in the kernel code.
> It's mentioned 108 times in transhuge.rst alone :)
> If you have any suggestion to rename this flag, happy to take it :)
Yeah I mean it's a mess man, and it's not your fault... Again naming is
hard, I put a suggestion in reply to cover letter anyway...
>
> >> +#define MMF2_THP_VMA_DEFAULT_HUGE_MASK (1 << MMF2_THP_VMA_DEFAULT_HUGE)
> >
> > Do we really need explicit trivial mask declarations like this?
> >
>
> I have followed the convention that has existed in this file, please see below
> links :)
> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1645
> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1623
> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1603
> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1582
Ack, yuck but ack.
>
>
> >> +
> >> +#define MMF2_INIT_MASK (MMF2_THP_VMA_DEFAULT_HUGE_MASK)
> >
> >> +
> >> static inline unsigned long mmf_init_flags(unsigned long flags)
> >> {
> >> if (flags & (1UL << MMF_HAS_MDWE_NO_INHERIT))
> >> @@ -1752,4 +1758,9 @@ static inline unsigned long mmf_init_flags(unsigned long flags)
> >> return flags & MMF_INIT_MASK;
> >> }
> >>
> >> +static inline unsigned long mmf2_init_flags(unsigned long flags)
> >> +{
> >> + return flags & MMF2_INIT_MASK;
> >> +}
> >> +
> >> #endif /* _LINUX_MM_TYPES_H */
> >> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> >> index 15c18ef4eb11..325c72f40a93 100644
> >> --- a/include/uapi/linux/prctl.h
> >> +++ b/include/uapi/linux/prctl.h
> >> @@ -364,4 +364,8 @@ struct prctl_mm_map {
> >> # define PR_TIMER_CREATE_RESTORE_IDS_ON 1
> >> # define PR_TIMER_CREATE_RESTORE_IDS_GET 2
> >>
> >> +#define PR_SET_THP_POLICY 78
> >> +#define PR_GET_THP_POLICY 79
> >> +#define PR_THP_POLICY_DEFAULT_HUGE 0
> >> +
> >> #endif /* _LINUX_PRCTL_H */
> >> diff --git a/kernel/fork.c b/kernel/fork.c
> >> index 9e4616dacd82..6e5f4a8869dc 100644
> >> --- a/kernel/fork.c
> >> +++ b/kernel/fork.c
> >> @@ -1054,6 +1054,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> >>
> >> if (current->mm) {
> >> mm->flags = mmf_init_flags(current->mm->flags);
> >> + mm->flags2 = mmf2_init_flags(current->mm->flags2);
> >> mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
> >> } else {
> >> mm->flags = default_dump_filter;
> >> diff --git a/kernel/sys.c b/kernel/sys.c
> >> index c434968e9f5d..1115f258f253 100644
> >> --- a/kernel/sys.c
> >> +++ b/kernel/sys.c
> >> @@ -2658,6 +2658,27 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> >> clear_bit(MMF_DISABLE_THP, &me->mm->flags);
> >> mmap_write_unlock(me->mm);
> >> break;
> >> + case PR_GET_THP_POLICY:
> >> + if (arg2 || arg3 || arg4 || arg5)
> >> + return -EINVAL;
> >> + if (!!test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2))
> >
> > I really don't think we need the !!? Do we?
>
> I have followed the convention that has existed in this file already,
> please see:
> https://elixir.bootlin.com/linux/v6.14.6/source/kernel/sys.c#L2644
OK, but please don't, I don't see why this is necessary. if (truthy) is
fine.
Unless somebody has a really good reason why this is necessary, it's just
ugly ceremony.
>
> >
> > Shouldn't we lock the mm when we do this no? Can't somebody change this?
> >
>
> It wasn't locked in PR_GET_THP_DISABLE
> https://elixir.bootlin.com/linux/v6.14.6/source/kernel/sys.c#L2644
>
> I can acquire mmap_write_lock_killable the same as PR_SET_THP_POLICY
> in the next series.
>
> I can also add the lock in PR_GET_THP_DISABLE.
Well, the issue I guess is... if the flags field is atomic, and we know
over this call maybe we can rely on mm sticking around, then we probably
don't need an mmap lock actually.
>
> >> + error = PR_THP_POLICY_DEFAULT_HUGE;
Wait, error = PR_THP_POLICY_DEFAULT_HUGE? Is this the convention for
returning here? :)
> >> + break;
> >> + case PR_SET_THP_POLICY:
> >> + if (arg3 || arg4 || arg5)
> >> + return -EINVAL;
> >> + if (mmap_write_lock_killable(me->mm))
> >> + return -EINTR;
> >> + switch (arg2) {
> >> + case PR_THP_POLICY_DEFAULT_HUGE:
> >> + set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
> >> + process_vmas_thp_default_huge(me->mm);
> >> + break;
> >> + default:
> >> + return -EINVAL;
Oh I just noticed - this is really broken - you're not unlocking the mmap()
here on error... :) you definitely need to fix this.
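To make the error-path point concrete, here is a minimal userspace sketch, not kernel code. Everything in it is a hypothetical stand-in: struct mm_model, model_lock()/model_unlock() and set_thp_policy_model() are invented for illustration, and a counter models the mmap write lock. The idea is only to record the error and leave through the single unlock, rather than returning from inside the switch:

```c
#include <errno.h>

/* Hypothetical stand-in for the proposed PR_THP_POLICY_DEFAULT_HUGE value. */
enum { MODEL_POLICY_DEFAULT_HUGE = 0 };

/* Hypothetical model of an mm: a counter stands in for the mmap write lock. */
struct mm_model {
	int lock_depth;		/* > 0 while "write locked" */
	unsigned long flags2;
};

static void model_lock(struct mm_model *mm)   { mm->lock_depth++; }
static void model_unlock(struct mm_model *mm) { mm->lock_depth--; }

/* Returns 0 or -EINVAL; the lock is dropped on every path. */
static int set_thp_policy_model(struct mm_model *mm, unsigned long arg2)
{
	int error = 0;

	model_lock(mm);
	switch (arg2) {
	case MODEL_POLICY_DEFAULT_HUGE:
		mm->flags2 |= 1UL << 0;
		break;
	default:
		/* Record the error instead of returning with the lock held. */
		error = -EINVAL;
		break;
	}
	model_unlock(mm);
	return error;
}
```

After an invalid arg2 the model's lock_depth is back at zero, which is the property the quoted hunk is missing.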
> >> + }
> >> + mmap_write_unlock(me->mm);
> >> + break;
> >> case PR_MPX_ENABLE_MANAGEMENT:
> >> case PR_MPX_DISABLE_MANAGEMENT:
> >> /* No longer implemented: */
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 2780a12b25f0..64f66d5295e8 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -98,6 +98,38 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
> >> return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
> >> }
> >>
> >> +void vma_set_thp_policy(struct vm_area_struct *vma)
> >> +{
> >> + struct mm_struct *mm = vma->vm_mm;
> >> +
> >> + if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
> >> + vm_flags_set(vma, VM_HUGEPAGE);
> >> +}
> >> +
> >> +static void vmas_thp_default_huge(struct mm_struct *mm)
> >> +{
> >> + struct vm_area_struct *vma;
> >> + unsigned long vm_flags;
> >> +
> >> + VMA_ITERATOR(vmi, mm, 0);
> >
> > This is a declaration, it should be grouped with declarations...
> >
>
> Sure, will make the change in next version.
>
> Unfortunately checkpatch didn't complain.
Checkpatch actually complains the other way :P it doesn't understand
macros.
So you'll start getting a warning here, which you can ignore. It sucks, but
there we go. Making checkpatch.pl understand that would be a pain, probs.
>
> >> + for_each_vma(vmi, vma) {
> >> + vm_flags = vma->vm_flags;
> >> + if (vm_flags & VM_NOHUGEPAGE)
> >> + continue;
> >
> > Literally no point in you putting vm_flags as a separate variable here.
> >
>
> Sure, will make the change in next version.
Thanks!
>
> > So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise
> > is to override global 'never'?
> >
>
> Again, I am not overriding never.
>
> hugepage_global_always and hugepage_global_enabled will evaluate to false
> and you will not get a hugepage.
Yeah, again ack, but I kind of hate that we set VM_HUGEPAGE everywhere even
if the policy is never.
And we now get into realms of:
'Hey I set prctl() to make everything huge pages, and PR_GET_THP_POLICY
says I've set that, but nothing is huge? BUG???'
Of course then you get into - if somebody sets it to never, do we go around
and remove VM_HUGEPAGE and this MMF_ flag?
>
>
> > I'm really concerned about this.
> >
> >> + vm_flags_set(vma, VM_HUGEPAGE);
> >> + }
> >> +}
> >
> > Do we have an mmap write lock established here? Can you confirm that? Also
> > you should add an assert for that here.
> >
>
> Yes I do, it's only called in PR_SET_THP_POLICY where mmap_write lock was taken.
> I can add an assert if it helps.
It not only helps, it's utterly critical :)
'It's only called in xxx()' is famous last words for a programmer, because
later somebody (maybe even your good self) calls it from somewhere else
and... we've all been there...
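As a rough illustration of why the assert matters, here is a userspace model (all names hypothetical, not the kernel's types). In the kernel proper the existing helper for this is mmap_assert_write_locked(); the model below only mimics the shape of putting the check at the top of the function that walks and modifies the VMAs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for the model only. */
#define MODEL_VM_HUGEPAGE   0x1UL
#define MODEL_VM_NOHUGEPAGE 0x2UL

/* Hypothetical model: an mm with a lock state and a few VMA flag words. */
struct mm_model {
	bool write_locked;
	unsigned long vma_flags[4];
	size_t nr_vmas;
};

/* Stand-in for a lock assertion: aborts if the caller forgot the lock. */
static void model_assert_write_locked(const struct mm_model *mm)
{
	assert(mm->write_locked);
}

/* Sets the huge flag on every VMA not explicitly marked nohuge. */
static void model_vmas_default_huge(struct mm_model *mm)
{
	size_t i;

	model_assert_write_locked(mm);	/* documents and enforces the contract */
	for (i = 0; i < mm->nr_vmas; i++) {
		if (mm->vma_flags[i] & MODEL_VM_NOHUGEPAGE)
			continue;
		mm->vma_flags[i] |= MODEL_VM_HUGEPAGE;
	}
}
```

A future caller that skips the lock then trips the assertion immediately instead of racing silently.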
>
> >> +
> >> +void process_vmas_thp_default_huge(struct mm_struct *mm)
> >> +{
> >> + if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
> >> + return;
> >> +
> >> + set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2);
> >> + vmas_thp_default_huge(mm);
> >> +}
> >> +
> >> +
> >> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> >> unsigned long vm_flags,
> >> unsigned long tva_flags,
> >> diff --git a/mm/vma.c b/mm/vma.c
> >> index 1f2634b29568..101b19c96803 100644
> >> --- a/mm/vma.c
> >> +++ b/mm/vma.c
> >> @@ -2476,6 +2476,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> >> if (!vma_is_anonymous(vma))
> >> khugepaged_enter_vma(vma, map->flags);
> >> ksm_add_vma(vma);
> >> + vma_set_thp_policy(vma);
> >
> > You're breaking VMA merging completely by doing this here...
> >
> > Now I can map one VMA with this policy set, then map another immediately
> > next to it and - oops - no merge, ever, because the VM_HUGEPAGE flag is not
> > set in the new VMA on merge attempt.
> >
> > I realise KSM is just as broken (grr) but this doesn't justify us
> > completely breaking VMA merging here.
>
> I think this answers it. It's doing the same as KSM.
Yes, but as I said there, it's not acceptable, at all.
You're making it so literally VMA merging _does not happen at all_. That's
unacceptable and might even break some workloads.
You'll certainly cause very big kernel metadata usage.
Consider:
|-----------------------------|..................|
| some VMA flags, VM_HUGEPAGE | proposed new VMA |
|-----------------------------|..................|
Now, because you set VM_HUGEPAGE _after any merge is attempted_, this will
_always_ be fragmented, forever.
That's just not... acceptable.
The fact KSM is broken this way doesn't make that OK.
Especially on brk(), which now will _always_ allocate new VMAs for every
brk() expansion which doesn't seem very efficient.
It may also majorly degrade performance.
That makes me think we need some perf testing for this ideally...
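The ordering problem can be shown with a toy model. This is a sketch under loud assumptions: struct vma_model, struct mm_model and the model_*() helpers are all invented here, and the merge rule is reduced to "adjacent and identical flags", which is far simpler than the real merge logic. It only contrasts applying the policy flag before versus after the merge attempt:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag value and capacity for the model only. */
#define MODEL_VM_HUGEPAGE 0x1UL
#define MODEL_MAX_VMAS    16

struct vma_model {
	unsigned long start, end, flags;
};

struct mm_model {
	struct vma_model vmas[MODEL_MAX_VMAS];
	size_t nr;
	bool default_huge;	/* stands in for the per-process policy bit */
};

/* Simplified rule: merge only with an adjacent VMA whose flags are identical. */
static bool model_try_merge(struct mm_model *mm, unsigned long start,
			    unsigned long end, unsigned long flags)
{
	struct vma_model *prev;

	if (mm->nr == 0)
		return false;
	prev = &mm->vmas[mm->nr - 1];
	if (prev->end != start || prev->flags != flags)
		return false;
	prev->end = end;
	return true;
}

static void model_map(struct mm_model *mm, unsigned long start,
		      unsigned long end, bool policy_before_merge)
{
	unsigned long flags = 0;
	struct vma_model *vma;

	if (policy_before_merge && mm->default_huge)
		flags |= MODEL_VM_HUGEPAGE;
	if (model_try_merge(mm, start, end, flags))
		return;
	if (mm->nr == MODEL_MAX_VMAS)
		return;	/* model is full; nothing more to record */
	vma = &mm->vmas[mm->nr++];
	vma->start = start;
	vma->end = end;
	vma->flags = flags;
	/* The ordering under review: flag applied after the merge attempt. */
	if (!policy_before_merge && mm->default_huge)
		vma->flags |= MODEL_VM_HUGEPAGE;
}

/* Maps nr_maps adjacent 4 KiB regions and reports how many VMAs remain. */
static size_t model_count_vmas(size_t nr_maps, bool policy_before_merge)
{
	struct mm_model mm = { .nr = 0, .default_huge = true };
	size_t i;

	for (i = 0; i < nr_maps; i++)
		model_map(&mm, i * 4096, (i + 1) * 4096, policy_before_merge);
	return mm.nr;
}
```

In this model, eight adjacent mappings end up as eight separate entries when the flag is set after the merge attempt, and as a single entry when it is set before.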
>
> >
> > You need to set earlier than this. Then of course a driver might decide to
> > override this, so maybe then we need to override that.
> >
> > But then we're getting into realms of changing fundamental VMA code _just
> > for this feature_.
> >
> > Again I'm iffy about this. Very.
> >
> > Also you've broken the VMA userland tests here:
> >
> > $ cd tools/testing/vma
> > $ make
> > ...
> > In file included from vma.c:33:
> > ../../../mm/vma.c: In function ‘__mmap_new_vma’:
> > ../../../mm/vma.c:2486:9: error: implicit declaration of function ‘vma_set_thp_policy’; did you mean ‘vma_dup_policy’? [-Wimplicit-function-declaration]
> > 2486 | vma_set_thp_policy(vma);
> > | ^~~~~~~~~~~~~~~~~~
> > | vma_dup_policy
> > make: *** [<builtin>: vma.o] Error 1
> >
> > You need to create stubs accordingly.
> >
>
> Thanks will do.
Thanks!
>
> >> *vmap = vma;
> >> return 0;
> >>
> >> @@ -2705,6 +2706,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >> mm->map_count++;
> >> validate_mm(mm);
> >> ksm_add_vma(vma);
> >> + vma_set_thp_policy(vma);
> >
> > You're breaking merging again... This is quite a bad case too as now you'll
> > have totally fragmented brk VMAs no?
> >
>
> Again doing it the same as KSM.
That doesn't make it ok. Just because KSM is broken doesn't make this ok. I
mean grr at KSM :) I'm going to look into that and see about
investigating/fixing that behaviour.
obviously I can't accept anything that will fundamentally break VMA
merging.
The answer really is to do this earlier, but you risk a driver overriding
it, but that's OK I think (I don't even think any in-tree ones do actually
_anywhere_ - and yes I was literally reading through _every single_ .mmap()
callback lately because I am quite obviously insane ;)
Again I can help with this.
>
> > We can't have it implemented this way.
> >
> >> out:
> >> perf_event_mmap(vma);
> >> mm->total_vm += len >> PAGE_SHIFT;
> >> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
> >> index 35791791a879..f5945ebfe3f2 100644
> >> --- a/tools/include/uapi/linux/prctl.h
> >> +++ b/tools/include/uapi/linux/prctl.h
> >> @@ -328,4 +328,8 @@ struct prctl_mm_map {
> >> # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */
> >> # define PR_PPC_DEXCR_CTRL_MASK 0x1f
> >>
> >> +#define PR_SET_THP_POLICY 78
> >> +#define PR_GET_THP_POLICY 79
> >> +#define PR_THP_POLICY_DEFAULT_HUGE 0
> >> +
> >> #endif /* _LINUX_PRCTL_H */
> >> diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
> >> index 15c18ef4eb11..325c72f40a93 100644
> >> --- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
> >> +++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
> >> @@ -364,4 +364,8 @@ struct prctl_mm_map {
> >> # define PR_TIMER_CREATE_RESTORE_IDS_ON 1
> >> # define PR_TIMER_CREATE_RESTORE_IDS_GET 2
> >>
> >> +#define PR_SET_THP_POLICY 78
> >> +#define PR_GET_THP_POLICY 79
> >> +#define PR_THP_POLICY_DEFAULT_HUGE 0
> >> +
> >> #endif /* _LINUX_PRCTL_H */
> >> --
> >> 2.47.1
> >>
>
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 16:06 ` Lorenzo Stoakes
@ 2025-05-15 16:11 ` David Hildenbrand
2025-05-15 18:08 ` Lorenzo Stoakes
2025-05-15 16:47 ` Usama Arif
1 sibling, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-05-15 16:11 UTC (permalink / raw)
To: Lorenzo Stoakes, Usama Arif
Cc: Andrew Morton, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
>>> So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise
>>> is to override global 'never'?
>>>
>>
>> Again, I am not overriding never.
>>
>> hugepage_global_always and hugepage_global_enabled will evaluate to false
>> and you will not get a hugepage.
>
> Yeah, again ack, but I kind of hate that we set VM_HUGEPAGE everywhere even
> if the policy is never.
I think it should behave just as if someone does manually an madvise().
So whatever we do here during an madvise, we should try to do the same
thing here.
--
Cheers,
David / dhildenb
* Re: [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY
2025-05-15 15:54 ` Usama Arif
2025-05-15 16:04 ` David Hildenbrand
@ 2025-05-15 16:24 ` Lorenzo Stoakes
1 sibling, 0 replies; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-05-15 16:24 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On Thu, May 15, 2025 at 04:54:40PM +0100, Usama Arif wrote:
>
>
> On 15/05/2025 16:15, Lorenzo Stoakes wrote:
> > Thanks for coming back to me so quickly, appreciated :)
> >
> > I am reacting in a 'WTF' way here, but it's in proportion to the (at least
> > perceived) magnitude of this change. We really need to be sure this is
> > right.
> >
>
> Lol I had to rewrite my replies a few times to tone them down.
> Hopefully I don't come across as aggressive :)
Haha this is literally totally normal. And something I do a lot (I think we
all do :). Ask me about the time a friend called immediately prior to me
deleting a sentence that I had realised came across badly, then forgetting
to do so, and then sending the mail... ;)
No it's fine, equally apologies if I seem aggressive, it's not intended, I
am just passionate about _us getting it right_ - and I think the healthiest
approach is to be as direct as possible, while maintaining professionalism
(notwithstanding Star Wars references and comments on how prctl() is
simply the dictionary definition of deep, unmitigated evil).
But it's vital in kernel work to be able to be absolutely robust, in both
directions, while maintaining civility :)
>
>
> > On Thu, May 15, 2025 at 03:50:47PM +0100, Usama Arif wrote:
> >>
> >>
> >> On 15/05/2025 14:55, Lorenzo Stoakes wrote:
> >>> On Thu, May 15, 2025 at 02:33:29PM +0100, Usama Arif wrote:
> >>>> This allows to change the THP policy of a process, according to the value
> >>>> set in arg2, all of which will be inherited during fork+exec:
> >>>
> >>> This is pretty confusing.
> >>>
> >>> It should be something like 'add a new prctl() option that allows...' etc.
> >>>
> >>>> - PR_THP_POLICY_DEFAULT_HUGE: This will set the MMF2_THP_VMA_DEFAULT_HUGE
> >>>> process flag which changes the default of new VMAs to be VM_HUGEPAGE. The
> >>>> call also modifies all existing VMAs that are not VM_NOHUGEPAGE
> >>>> to be VM_HUGEPAGE.
> >>>
> >>> This is referring to implementation detail that doesn't matter for an overview,
> >>> just add a summary here e.g.
> >>>
> >>> PR_THP_POLICY_DEFAULT_HUGE - set VM_HUGEPAGE flag in all VMAs by default,
> >>> including after fork/exec, ignoring global policy.
> >>>
> >>> PR_THP_POLICY_DEFAULT_NOHUGE - clear VM_HUGEPAGE flag in all VMAs by default,
> >>> including after fork/exec, ignoring global policy.
> >>>
> >>> PR_THP_POLICY_DEFAULT_SYSTEM - Eliminate any policy set above.
> >>
> >> Hi Lorenzo,
> >>
> >> Thanks for the review. I will make the cover letter clearer in the next revision.
> >
> > The next version should emphatically be an RFC also, please. Your cover letter
> > should mention you're fundamentally changing mm_struct and VMA logic, and
> > explain why your use case is so important that that is justified.
> >
>
> Thanks, will make it RFC and add that I am making changes to mm_struct and VMA logic.
Thanks! This isn't sort of 'another way of saying no' by the way, it's
literally because, as is obvious already, we all need to talk about this :)
Once we stabilise on a way forwards then we're good to un-RFC.
>
> >>
> >>>
> >>>> This allows systems where the global policy is set to "madvise"
> >>>> to effectively have THPs always for the process. In an environment
> >>>> where different types of workloads are stacked on the same machine
> >>>> whose global policy is set to "madvise", this will allow workloads
> >>>> that benefit from always having hugepages to do so, without regressing
> >>>> those that don't.
> >>>
> >>> So does this just ignore and override the global policy? I'm not sure I'm
> >>> comfortable with that.
> >>
> >> No. The decision making of when and what order THPs are allowed is not
> >> changed, i.e. there are no changes in __thp_vma_allowable_orders and
> >> thp_vma_allowable_orders. David has the same concern as you and this
> >> current series is implementing what David suggested in
> >> https://lore.kernel.org/all/3f7ba97d-04d5-4ea4-9f08-6ec3584e0d4c@redhat.com/
> >>
> >> It will change the existing VMA (NO)HUGE flags according to
> >> the prctl. For e.g. doing PR_THP_POLICY_DEFAULT_HUGE will not give
> >> a THP when global policy is never.
> >
> > Umm...
> >
> > + case PR_SET_THP_POLICY:
> > + if (arg3 || arg4 || arg5)
> > + return -EINVAL;
> > + if (mmap_write_lock_killable(me->mm))
> > + return -EINTR;
> > + switch (arg2) {
> > + case PR_THP_POLICY_DEFAULT_HUGE:
> > + set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
> > + process_vmas_thp_default_huge(me->mm);
> > + break;
> > + default:
> >
> >
> > Where's the check against never? You're unconditionally setting VM_HUGEPAGE?
>
> So this was from the discussion with David. My initial implementation in v1,
> messed with the policy evaluation in thp_vma_allowable_orders and __thp_vma_allowable_orders.
>
> The whole point of doing it this way is that you don't mess with the policy evaluation.
>
> hugepage_global_always and hugepage_global_enabled will still evaluate to false
> when never is set and you will not get a hugepage. But more on it below.
Yeah, I mean I kind of hate this, but madvise(..., VM_[NO]HUGEPAGE) already
does this, regardless of whether the VMA is in any way suited to THP or
global policies say otherwise.
So maybe I can put my criticisms of this aside and wonder out loud:
1. We want to make this behave as if we were just setting madvise(..,
MADV_[NO]HUGEPAGE) everywhere.
2. If so, do we then take the semantics to be, in the case of ...HUGEPAGE,
'these VMAs should be considered by khugepaged for collapse if and only
if global policy is set to one of [madvise, always]'.
3. If so, do we also then take the semantics to be, in the case of
...NOHUGEPAGE, 'these VMAs should NOT be considered by khugepaged for
collapse even if global policy is set to [always]'.
I wonder, since this is at mm_struct granularity, why do we even bother
applying this to VMAs?
I guess the reason we are doing this at VMA level rather than mm level is
that we want the user to have the ability to override this per-VMA.
But couldn't we just more efficiently do this by having the khugepaged code
check both mm->flags AND vma->vm_flags, and have some function handle the
override _explicitly_ there? Is there some reason we're not doing it that
way?
Forgive me if this was already discussed.
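To make the question concrete, a check of that shape might look roughly like the sketch below. It is purely illustrative: the enums, flag values and model_thp_eligible() are hypothetical, and the precedence (an explicit per-VMA hint beats the per-process default, and global 'never' beats everything) is an assumption drawn from points 2 and 3 above, not from any existing kernel function:

```c
#include <stdbool.h>

/* Hypothetical flag values for the model only. */
#define MODEL_VM_HUGEPAGE   0x1UL
#define MODEL_VM_NOHUGEPAGE 0x2UL

enum global_policy { GLOBAL_NEVER, GLOBAL_MADVISE, GLOBAL_ALWAYS };
enum mm_default { MM_DEFAULT_SYSTEM, MM_DEFAULT_HUGE, MM_DEFAULT_NOHUGE };

/*
 * Decide eligibility from the global policy, the per-process default and
 * the per-VMA flags, without ever writing the default into the VMA.
 */
static bool model_thp_eligible(enum global_policy global,
			       enum mm_default mm_default,
			       unsigned long vm_flags)
{
	bool huge = (vm_flags & MODEL_VM_HUGEPAGE) != 0;
	bool nohuge = (vm_flags & MODEL_VM_NOHUGEPAGE) != 0;

	if (global == GLOBAL_NEVER)
		return false;	/* never means never */

	/* No explicit per-VMA hint: fall back to the process default. */
	if (!huge && !nohuge) {
		huge = mm_default == MM_DEFAULT_HUGE;
		nohuge = mm_default == MM_DEFAULT_NOHUGE;
	}

	if (nohuge)
		return false;
	return global == GLOBAL_ALWAYS || huge;
}
```

Because nothing is stamped into vm_flags here, the VMA flags stay untouched and merging would be unaffected; whether that trade-off is acceptable is exactly the open question.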
>
> >
> > You're relying on VM_HUGEPAGE being ignored in this instance? But you're still:
> >
> > 1. Setting VM_HUGEPAGE everywhere (and breaking VMA merging everywhere).
> >
> > 2. Setting MMF2_THP_VMA_DEFAULT_HUGE and making it so PR_GET_THP_POLICY says it
> > has a policy of default huge even if policy is set to never?
> >
> > I'm not ok with that. I'd much rather we do the never check here...
> >
>
> I am ok with that. I can add a check over here that wraps this in:
> if (hugepage_global_enabled())
> ...
>
> > Also see hugepage_madvise(). There's arch-specific code that overrides
> > that, and you're now bypassing that (yes it's for one arch of course but
> > it's still a thing)
> >
>
> Thanks, I will put
> if (mm_has_pgste(vma->vm_mm))
> return 0;
> at the start.
Well, hang on, let's maybe try to reuse this code if we can :) we
definitely don't want to duplicate this.
>
> >>
> >>>
> >>> What about if the policy is 'never'? Does this override that? That seems
> >>> completely wrong.
> >>
> >> No, it won't override it. hugepage_global_always and hugepage_global_enabled
> >> will still evaluate to false and you won't get a hugepage no matter what prctl
> >> is set.
> >
> > Ack ok I see as above, you're relying on VM_HUGEPAGE enforcing this.
> >
> > You really need to put stuff like this in the cover letter though!!
> >
>
> Sure will do in the next revision, Thanks.
Thanks!
> >>
> >>>
> >>>> - PR_THP_POLICY_DEFAULT_NOHUGE: This will set the MMF2_THP_VMA_DEFAULT_NOHUGE
> >>>> process flag which changes the default of new VMAs to be VM_NOHUGEPAGE.
> >>>> The call also modifies all existing VMAs that are not VM_HUGEPAGE
> >>>> to be VM_NOHUGEPAGE.
> >>>> This allows systems where the global policy is set to "always"
> >>>> to effectively have THPs on madvise only for the process. In an
> >>>> environment where different types of workloads are stacked on the
> >>>> same machine whose global policy is set to "always", this will allow
> >>>> workloads that benefit from having hugepages on an madvise basis only
> >>>> to do so, without regressing those that benefit from having hugepages
> >>>> always.
> >>>
> >>> Wait, so 'no huge' means 'madvise'? What? This is confusing.
> >>
> >>
> >> I probably made the cover letter confusing :) or maybe need to rename the flags.
> >>
> >> This flag works as follows:
> >>
> >> a) Changes the default flag of new VMAs to be VM_NOHUGEPAGE
> >>
> >> b) Modifies all existing VMAs that are not VM_HUGEPAGE to be VM_NOHUGEPAGE
> >>
> >> c) Is inherited during fork+exec
> >>
> >> I think maybe I should add VMA to the flag names and rename the flags to
> >> PR_THP_POLICY_DEFAULT_VMA_(NO)HUGE ??
> >
> > Please no :) 'VMA' is implicit re: mappings. If you're touching memory
> > mappings you're necessarily touching VMAs.
> >
> > I know some prctl() (a pathway to many abilities some consider to be
> > unnatural) uses 'VMA' in some of the endpoints but generally when referring
> > to specific VMAs no?
> >
> > These names are already kinda horrible (yes naming is hard, for everyone,
> > ask me about MADV_POISON/REMEDY) but I think something like:
> >
> > PR_DEFAULT_MADV_HUGEPAGE
> > PR_DEFAULT_MADV_NOHUGEPAGE
> >
> > -ish :)
> >
>
> Sure, happy with that, Thanks.
Thanks!
> >>
> >>>
> >>>> - PR_THP_POLICY_DEFAULT_SYSTEM: This will clear the MMF2_THP_VMA_DEFAULT_HUGE
> >>>> and MMF2_THP_VMA_DEFAULT_NOHUGE process flags.
> >>>>
> >>>> These patches are required in rolling out hugepages in hyperscaler
> >>>> configurations for workloads that benefit from them, where workloads are
> >>>> stacked and a single THP global policy is likely to be used across the entire
> >>>> fleet, and prctl will help override it.
> >>>
> >>> I don't understand this justification whatsoever. What does 'stacked' mean? And
> >>> you're not justifying why you'd override the policy?
> >>
> >> By stacked I just meant different types of workloads running on the same machine.
> >> Lets say we have a single server whose global policy is set to madvise.
> >> You can have a container on that server running some database workload that best
> >> works with madvise.
> >> You can have another container on that same server running some AI workload that would
> >> benefit from having VM_HUGEPAGE set on all new VMAs. We can use prctl
> >> PR_THP_POLICY_DEFAULT_HUGE to get VM_HUGEPAGE set by default on all new VMAs for that
> >> container.
> >>
> >>>
> >>> This series has no actual justification here at all? You really need to provide one.
> >>>
> >>
> >> There was a discussion on the usecases in
> >> https://lore.kernel.org/all/13b68fa0-8755-43d8-8504-d181c2d46134@gmail.com/
> >>
> >> I tried (and I guess failed :)) to summarize the justification from that thread.
> >
> > It's fine, I have most definitely not been as clear as I could be in series
> > too :>) just need to add a bigger summary.
> >
> > Don't be afraid to waffle on... (I know I am not... ;)
> >
> >>
> >> I will try and rephrase it here.
> >>
> >> In hyperscalers, we have a single THP policy for the entire fleet.
> >> We have different types of workloads (e.g. AI/compute/databases/etc)
> >> running on a single server (this is what I meant by 'stacked').
> >> Some of these workloads will benefit from always getting THP at fault (or collapsed
> >> by khugepaged), some of them will benefit by only getting them at madvise.
> >>
> >> This series is useful for 2 usecases:
> >>
> >> 1) global system policy = madvise, while we want some workloads to get THPs
> >> at fault and by khugepaged :- some processes (e.g. AI workloads) benefit from getting
> >> THPs at fault (and collapsed by khugepaged). Other workloads like databases will incur
> >> regression (either a performance regression or they are completely memory bound and
> >> even a very slight increase in memory will cause them to OOM). So what these patches
> >> will do is allow setting prctl(PR_THP_POLICY_DEFAULT_HUGE) on the AI workloads,
> >> (This is how workloads are deployed in our (Meta's/Facebook) fleet at this moment).
> >>
> >> 2) global system policy = always, while we want some workloads to get THPs
> >> only on madvise basis :- Same reason as 1). What these patches
> >> will do is allow setting prctl(PR_THP_POLICY_DEFAULT_NOHUGE) on the database
> >> workloads.
> >> (We hope this is us (Meta) in the near future, if a majority of workloads show that they
> >> benefit from always, we flip the default host setting to "always" across the fleet and
> >> workloads that regress can opt-out and be "madvise".
> >> New services developed will then be tested with always by default. "always" is also the
> >> default defconfig option upstream, so I would imagine this is faced by others as well.)
> >
> > Right, but I'm not sure you're explaining why prctl(), one of the most cursed,
> > neglected and frankly evil (maybe exaggerating :P) APIs in the kernel is the way
> > to do this?
> >
> > You do need to summarise why the suggested idea re: BPF, or cgroups, or whatnot
> > is _totally unworkable_.
> >
> > And why not process_madvise() with MADV_HUGEPAGE?
> >
> > I'm also not sure fork/exec is a great situation to have, because are you sure
> > the workloads stay the same across all fork/execs that you're now propagating?
> >
> > It feels like this should be a cgroup thing, really.
> >
>
> So I actually don't mind the cgroup implementation (that was actually my first
> prototype and after that I saw there was someone who had posted it earlier).
> It was shot down because it won't be hierarchical and doesn't solve it when
> it's not being done in a cgroup.
>
> A large proportion of the thread in v1 was discussion with David, Johannes, Zi and
> Yafang (the bpf THP policy author) on different ways of doing this.
Ack yes. I think this whole discussion underlines why it's so important for
you to _summarise_ that discussion. I'm repeating myself from elsewhere but
the cover letter needs something like:
We considered the following alternatives, and none of them were workable:
[list of alternatives]
And ack on cgroups maybe not being quite right due to this not being
strictly hierarchical.
>
> >>
> >> Hope this makes the justification for the patches clearer :)
> >
> > Sure, please add this kind of thing to the cover letter to get fewer 'wtf'
> > reactions :)
> >
> > You're doing something really _big_ and _opinionated_ here though, that's
> > basically fundamentally changing core stuff, so an extended discussion of why
> > you feel it's so important, why other approaches are not workable, why the
> > Sauron-spawned Mordor dwelling prctl() API is the way to go, etc.
> >
> >>
> >>>>
> >>>> v1->v2:
> >>>
> >>> Where was the v1? Is it [0]?
> >>>
> >>> This seems like a massive change compared to that series?
> >>>
> >>> You've renamed it and not referenced the old series, please make sure you link
> >>> it or somehow let somebody see what this is against, because it makes review
> >>> difficult.
> >>>
> >>
> >> Yes, it's the patch you linked below. Sorry, I should have linked it in this series.
> >> It's a big change, but it was basically incorporating all feedback from David,
> >> while trying to achieve a similar goal. Will link it in future series.
> >
> > Yeah, again, this should have been an RFC on that basis :)
> >
> >>
> >>> [0]: https://lore.kernel.org/linux-mm/20250507141132.2773275-1-usamaarif642@gmail.com/
> >>>
> >>>> - change from modifying the THP decision making for the process, to modifying
> >>>> VMA flags only. This prevents further complicating the logic used to
> >>>> determine THP order (Thanks David!)
> >>>> - change from using a prctl per policy change to just using PR_SET_THP_POLICY
> >>>> and arg2 to set the policy. (Zi Yan)
> >>>> - Introduce PR_THP_POLICY_DEFAULT_NOHUGE and PR_THP_POLICY_DEFAULT_SYSTEM
> >>>> - Add selftests and documentation.
> >>>>
> >>>> Usama Arif (6):
> >>>> prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
> >>>> prctl: introduce PR_THP_POLICY_DEFAULT_NOHUGE for the process
> >>>> prctl: introduce PR_THP_POLICY_SYSTEM for the process
> >>>> selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_NOHUGE
> >>>> selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
> >>>> docs: transhuge: document process level THP controls
> >>>>
> >>>> Documentation/admin-guide/mm/transhuge.rst | 40 +++
> >>>> include/linux/huge_mm.h | 4 +
> >>>> include/linux/mm_types.h | 14 +
> >>>> include/uapi/linux/prctl.h | 6 +
> >>>> kernel/fork.c | 1 +
> >>>> kernel/sys.c | 35 +++
> >>>> mm/huge_memory.c | 56 ++++
> >>>> mm/vma.c | 2 +
> >>>> tools/include/uapi/linux/prctl.h | 6 +
> >>>> .../trace/beauty/include/uapi/linux/prctl.h | 6 +
> >>>> tools/testing/selftests/prctl/Makefile | 2 +-
> >>>> tools/testing/selftests/prctl/thp_policy.c | 286 ++++++++++++++++++
> >>>> 12 files changed, 457 insertions(+), 1 deletion(-)
> >>>> create mode 100644 tools/testing/selftests/prctl/thp_policy.c
> >>>>
> >>>> --
> >>>> 2.47.1
> >>>>
> >>
>
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 15:57 ` David Hildenbrand
@ 2025-05-15 16:38 ` Lorenzo Stoakes
2025-05-15 17:29 ` David Hildenbrand
0 siblings, 1 reply; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-05-15 16:38 UTC (permalink / raw)
To: David Hildenbrand
Cc: Liam R. Howlett, Usama Arif, Andrew Morton, linux-mm, hannes,
shakeel.butt, riel, ziy, laoar.shao, baolin.wang, npache,
ryan.roberts, linux-kernel, linux-doc, kernel-team
On Thu, May 15, 2025 at 05:57:57PM +0200, David Hildenbrand wrote:
> On 15.05.25 17:45, Liam R. Howlett wrote:
> > * David Hildenbrand <david@redhat.com> [250515 10:44]:
> > > On 15.05.25 16:40, Lorenzo Stoakes wrote:
> > > > Overall I feel this series should _DEFINITELY_ be an RFC. This is pretty
> > > > outlandish stuff and needs discussion.
> > > >
> > > > You're basically making it so /sys/kernel/mm/transparent_hugepage/enabled =
> > > > never is completely ignored and overridden.
> > >
> > > I thought I made it very clear during earlier discussions that never means
> > > never.
> >
> > I also thought so, but the comments later made here [1] seem to
> > contradict that?
>
> It's ... complicated.
>
> >
> > It seems "never" means "default_no" and not actually "never"?
>
> We should consider these system toggles a single set of toggles that define a
> state, and not individual toggles that overwrite each other.
>
> If you say
> /sys/kernel/mm/transparent_hugepage/enabled = never
> and
> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled = always
>
> instead of the *default*
>
> /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled = inherit
>
> the admin explicitly states "I want the system behavior for 2048kB not to be configured using
> /sys/kernel/mm/transparent_hugepage/enabled". That's an admin decision, not a
> per-process overwrite or whatever.
>
>
> >
> > Maybe the global/system toggles need to affect the state of each other?
> > That is, if /sys/kernel/mm/transparent_hugepage/enabled is never and you
> > set /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled to
> > madvise, it should not leave /sys/kernel/mm/transparent_hugepage/enabled
> > as never.
>
> I recall we discussed that, but there was also a catch to that. :(
>
> --
> Cheers,
>
> David / dhildenb
>
Did we document all this? :)
It'd be good to be super explicit about these sorts of 'dependency chains'.
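
To make the 'dependency chain' above concrete, here is a minimal userspace
sketch of how the top-level and per-size "enabled" toggles combine into one
state. It is only a model under stated assumptions: the enum and helper
names are hypothetical (not kernel identifiers), and VM_NOHUGEPAGE handling
is deliberately left out.

```c
#include <stdbool.h>

/* Values accepted by the sysfs "enabled" files; INHERIT is per-size only. */
enum thp_setting { THP_NEVER, THP_MADVISE, THP_ALWAYS, THP_INHERIT };

/*
 * Hypothetical helper: resolve the setting that applies to one page size.
 * "inherit" defers to the top-level file; anything else is an explicit
 * admin decision for that size and is used as-is.
 */
static enum thp_setting effective_setting(enum thp_setting global,
                                          enum thp_setting per_size)
{
    return per_size == THP_INHERIT ? global : per_size;
}

/* Would a VMA be THP-eligible for this size under the resolved setting? */
static bool thp_eligible(enum thp_setting global, enum thp_setting per_size,
                         bool vma_madvised)
{
    switch (effective_setting(global, per_size)) {
    case THP_ALWAYS:
        return true;
    case THP_MADVISE:
        return vma_madvised;
    default:            /* THP_NEVER (or an invalid top-level value) */
        return false;
    }
}
```

In this model, top-level "never" plus per-size "always" yields an eligible
VMA, while top-level "never" plus per-size "inherit" never does, which is
the distinction David describes.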
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 16:06 ` Lorenzo Stoakes
2025-05-15 16:11 ` David Hildenbrand
@ 2025-05-15 16:47 ` Usama Arif
2025-05-15 18:36 ` Lorenzo Stoakes
1 sibling, 1 reply; 51+ messages in thread
From: Usama Arif @ 2025-05-15 16:47 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On 15/05/2025 17:06, Lorenzo Stoakes wrote:
>> Its doing the same as KSM as suggested by David. Does KSM break these tests?
>> Is there some specific test you can point to that I can run that is breaking
>> with this patch and not without it?
>
> They don't build, at all. I explain how you can attempt a build below.
>
Ah yes, I initially thought by break you meant they were failing. I saw that,
will fix it.
> And no, KSM doesn't break the tests, because steps were taken to make them
> not break the tests :) I mean it's really easy - it's just adding some
> trivial stubs.
>
Yes, will do Thanks!
> If you need help with it just ping me in whatever way helps and I can help!
>
> It's understandable as it's not necessarily clear this is a thing (maybe we
> need self tests to build it, but that might break CI setups so unclear).
>
> The merging is much more important!
>
>>
>>
>>> I really feel this series needs to be an RFC until we can get some
>>> consensus on how to approach this.
>>
>> There was consensus in https://lore.kernel.org/all/97702ff0-fc50-4779-bfa8-83dc42352db1@redhat.com/
>
> I disagree with this assessment, that doesn't look like consensus at all,
> I think at least this is a very contentious or at least _complicated_ topic
> that we need to really dig into.
>
> So in my view - it's this kind of situation that warrants an RFC until
> there's some stabilisation and agreement on a way forward.
>
Sure will change next revision to RFC, unless hopefully maybe we can
get a consensus in this revision :)
>>
>>>
>>> On Thu, May 15, 2025 at 02:33:30PM +0100, Usama Arif wrote:
>>>> This is set via the new PR_SET_THP_POLICY prctl.
>>>
>>> What is?
>>>
>>> You're making very major changes here, including adding a new flag to
>>> mm_struct (!!) and the explanation/justification for this is missing.
>>>
>>
>> I have added the justification in your reply to the coverletter.
>
> As stated there, you've not explained why alternatives are unworkable, I
> think we need this!
>
> Sort of:
>
> 1. Why not cgroups? blah blah blah
> 2. Why not process_madvise()? blah blah blah
> 3. Why not bpf? blah blah blah
> 4. Why not <something I've not thought of>? blah blah blah
>
I will add this in the next cover letter.
>>
>>>> This will set the MMF2_THP_VMA_DEFAULT_HUGE process flag
>>>> which changes the default of new VMAs to be VM_HUGEPAGE. The
>>>> call also modifies all existing VMAs that are not VM_NOHUGEPAGE
>>>> to be VM_HUGEPAGE. The policy is inherited during fork+exec.
>>>
>>> So you can only set this flag?
>>>
>>
>> ??
>
> This patch is only allowing the setting of this flag. I am asking 'so you
> can only set this flag?'
>
> To which it appears the answer is, yes I think :)
>
> An improved cover letter could say something like:
>
> "
> Here we implement the first flag intended to allow the _overriding_ of huge
> page policy to ensure that, when
> /sys/kernel/mm/transparent_hugepage/enabled is set to madvise, we are able
> to maintain fine-grained control of individual processes, including any
> fork/exec'd, by setting this flag.
>
> In subsequent commits, we intend to permit further such control.
> "
>
>>
>>>>
>>>> This allows systems where the global policy is set to "madvise"
>>>> to effectively have THPs always for the process. In an environment
>>>> where different types of workloads are stacked on the same machine,
>>>> this will allow workloads that benefit from always having hugepages
>>>> to do so, without regressing those that don't.
>>>
>>> Again, this explanation really makes no sense at all to me, I don't really
>>> know what you mean, you're not going into what you're doing in this change,
>>> this is just a very unclear commit message.
>>>
>>
>> I hope this is answered in my reply to your coverletter.
>
> You still need to improve the cover letter here I think, see above for a
> suggestion!
>
Sure, will do in the next revision, Thanks!
>>
>>>>
>>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>>> ---
>>>> include/linux/huge_mm.h | 3 ++
>>>> include/linux/mm_types.h | 11 +++++++
>>>> include/uapi/linux/prctl.h | 4 +++
>>>> kernel/fork.c | 1 +
>>>> kernel/sys.c | 21 ++++++++++++
>>>> mm/huge_memory.c | 32 +++++++++++++++++++
>>>> mm/vma.c | 2 ++
>>>> tools/include/uapi/linux/prctl.h | 4 +++
>>>> .../trace/beauty/include/uapi/linux/prctl.h | 4 +++
>>>> 9 files changed, 82 insertions(+)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index 2f190c90192d..e652ad9ddbbd 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -260,6 +260,9 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
>>>> return orders;
>>>> }
>>>>
>>>> +void vma_set_thp_policy(struct vm_area_struct *vma);
>>>
>>> This is a VMA-specific function but you're putting it in huge_mm.h? Why
>>> can't this be in vma.h or vma.c?
>>>
>>
>> Sure can move it there.
>>
>>>> +void process_vmas_thp_default_huge(struct mm_struct *mm);
>>>
>>> 'vmas' is redundant here.
>>>
>>
>> Sure.
>>>> +
>>>> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>> unsigned long vm_flags,
>>>> unsigned long tva_flags,
>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>> index e76bade9ebb1..2fe93965e761 100644
>>>> --- a/include/linux/mm_types.h
>>>> +++ b/include/linux/mm_types.h
>>>> @@ -1066,6 +1066,7 @@ struct mm_struct {
>>>> mm_context_t context;
>>>>
>>>> unsigned long flags; /* Must use atomic bitops to access */
>>>> + unsigned long flags2;
>>>
>>>
>>> Ugh, god really??
>>>
>>> I really am not a fan of adding flags2 just to add a prctl() feature like
>>> this. This is crazy.
>>>
>>> Also this is a TERRIBLE name. I mean, no please PLEASE no.
>>>
>>> Do we really have absolutely no choice but to add a new flags field here?
>>>
>>> It again doesn't help that you don't mention nor even try to justify this
>>> in the commit message or cover letter.
>>>
>>
>> And again, I hope my reply to your email has given you the justification.
>
> No :) I understood why you did this though of course.
>
>>
>>> If this is a 32-bit kernel vs. 64-bit kernel thing so we 'ran out of bits',
>>> let's just go make this flags field 64-bit on 32-bit kernels.
>>>
>>> I mean - I'm kind of insisting we do that to be honest. Because I really
>>> don't like this.
>>
>>
>> If the maintainers want this, I will make it a 64 bit only feature. We
>> are only using it for 64 bit servers. But it will probably mean ifdef
>> config 64 bit in a lot of places.
>
> I'm going to presume you are including me in this category rather than
> implying that you are deferring only to others :)
>
Yes, of course! I mean all maintainers :)
And hopefully everyone else as well :)
> So, there's another option:
>
> Have a prerequisite series that makes mm_struct->flags 64-bit on 32-bit
> kernels, which solves this problem everywhere and avoids us wasting a bunch
> of memory for a very specific usecase, splitting flag state across 2 fields
> (which are no longer atomic as a whole of course), adding confusion,
> possibly subtly breaking anywhere that assumes mm->flags completely
> describes mm-granularity flag state etc.
>
This is probably a very basic question, but by making mm_struct->flags 64-bit on 32-bit,
do you mean converting flags to unsigned long long when !CONFIG_64BIT?
> The RoI here is not looking good, otherwise.
>
>>
>>>
>>> Also if we _HAVE_ to have this, shouldn't we duplicate that comment about
>>> atomic bitops?...
>>>
>>
>> Sure
>>
>>>>
>>>> #ifdef CONFIG_AIO
>>>> spinlock_t ioctx_lock;
>>>> @@ -1744,6 +1745,11 @@ enum {
>>>> MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
>>>> MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
>>>>
>>>> +#define MMF2_THP_VMA_DEFAULT_HUGE 0
>>>
>>> I thought the whole idea was to move away from explicitly referencing 'THP'
>>> in a future where large folios are implicit and now we're saying 'THP'.
>>>
>>> Anyway the 'VMA' is totally redundant here.
>>>
>>
>> Sure, I can remove VMA.
>> I see THP everywhere in the kernel code.
>> Its mentioned 108 times in transhuge.rst alone :)
>> If you have any suggestion to rename this flag, happy to take it :)
>
> Yeah I mean it's a mess man, and it's not your fault... Again naming is
> hard, I put a suggestion in reply to cover letter anyway...
>
>>
>>>> +#define MMF2_THP_VMA_DEFAULT_HUGE_MASK (1 << MMF2_THP_VMA_DEFAULT_HUGE)
>>>
>>> Do we really need explicit trivial mask declarations like this?
>>>
>>
>> I have followed the convention that has existed in this file, please see below
>> links :)
>> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1645
>> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1623
>> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1603
>> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1582
>
> Ack, yuck but ack.
>
>>
>>
>>>> +
>>>> +#define MMF2_INIT_MASK (MMF2_THP_VMA_DEFAULT_HUGE_MASK)
>>>
>>>> +
>>>> static inline unsigned long mmf_init_flags(unsigned long flags)
>>>> {
>>>> if (flags & (1UL << MMF_HAS_MDWE_NO_INHERIT))
>>>> @@ -1752,4 +1758,9 @@ static inline unsigned long mmf_init_flags(unsigned long flags)
>>>> return flags & MMF_INIT_MASK;
>>>> }
>>>>
>>>> +static inline unsigned long mmf2_init_flags(unsigned long flags)
>>>> +{
>>>> + return flags & MMF2_INIT_MASK;
>>>> +}
>>>> +
>>>> #endif /* _LINUX_MM_TYPES_H */
>>>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>>>> index 15c18ef4eb11..325c72f40a93 100644
>>>> --- a/include/uapi/linux/prctl.h
>>>> +++ b/include/uapi/linux/prctl.h
>>>> @@ -364,4 +364,8 @@ struct prctl_mm_map {
>>>> # define PR_TIMER_CREATE_RESTORE_IDS_ON 1
>>>> # define PR_TIMER_CREATE_RESTORE_IDS_GET 2
>>>>
>>>> +#define PR_SET_THP_POLICY 78
>>>> +#define PR_GET_THP_POLICY 79
>>>> +#define PR_THP_POLICY_DEFAULT_HUGE 0
>>>> +
>>>> #endif /* _LINUX_PRCTL_H */
>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>> index 9e4616dacd82..6e5f4a8869dc 100644
>>>> --- a/kernel/fork.c
>>>> +++ b/kernel/fork.c
>>>> @@ -1054,6 +1054,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>>>>
>>>> if (current->mm) {
>>>> mm->flags = mmf_init_flags(current->mm->flags);
>>>> + mm->flags2 = mmf2_init_flags(current->mm->flags2);
>>>> mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
>>>> } else {
>>>> mm->flags = default_dump_filter;
>>>> diff --git a/kernel/sys.c b/kernel/sys.c
>>>> index c434968e9f5d..1115f258f253 100644
>>>> --- a/kernel/sys.c
>>>> +++ b/kernel/sys.c
>>>> @@ -2658,6 +2658,27 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>>>> clear_bit(MMF_DISABLE_THP, &me->mm->flags);
>>>> mmap_write_unlock(me->mm);
>>>> break;
>>>> + case PR_GET_THP_POLICY:
>>>> + if (arg2 || arg3 || arg4 || arg5)
>>>> + return -EINVAL;
>>>> + if (!!test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2))
>>>
>>> I really don't think we need the !!? Do we?
>>
>> I have followed the convention that has existed in this file already,
>> please see:
>> https://elixir.bootlin.com/linux/v6.14.6/source/kernel/sys.c#L2644
>
> OK, but please don't, I don't see why this is necessary. if (truthy) is
> fine.
>
> Unless somebody has a really good reason why this is necessary, it's just
> ugly ceremony.
>
Agreed :)
>>
>>>
>>> Shouldn't we lock the mm when we do this no? Can't somebody change this?
>>>
>>
>> It wasn't locked in PR_GET_THP_DISABLE
>> https://elixir.bootlin.com/linux/v6.14.6/source/kernel/sys.c#L2644
>>
>> I can take mmap_write_lock_killable, the same as PR_SET_THP_POLICY,
>> in the next series.
>>
>> I can also add the lock in PR_GET_THP_DISABLE.
>
> Well, the issue I guess is... if the flags field is atomic, and we know
> over this call maybe we can rely on mm sticking around, then we probably
> don't need an mmap lock actually.
>
>>
>>>> + error = PR_THP_POLICY_DEFAULT_HUGE;
>
> Wait, error = PR_THP_POLICY_DEFAULT_HUGE? Is this the convention for
> returning here? :)
I see a few of the PR_GET_.. setting the return value. I hope I didn't
misinterpret that.
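
For reference, the existing PR_GET_THP_DISABLE prctl does follow that
convention: the answer comes back as the syscall return value rather than
through a pointer. A minimal Linux userspace sketch (the wrapper name is
hypothetical):

```c
#include <sys/prctl.h>

/*
 * PR_GET_THP_DISABLE reports its answer as the return value: 0 or 1 on
 * success, or -1 with errno set if the call is rejected.
 */
static int thp_disabled_for_process(void)
{
    return prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0);
}
```

A caller therefore has to distinguish -1 (error) from the 0/1 answer before
using the result.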
>
>>>> + break;
>>>> + case PR_SET_THP_POLICY:
>>>> + if (arg3 || arg4 || arg5)
>>>> + return -EINVAL;
>>>> + if (mmap_write_lock_killable(me->mm))
>>>> + return -EINTR;
>>>> + switch (arg2) {
>>>> + case PR_THP_POLICY_DEFAULT_HUGE:
>>>> + set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
>>>> + process_vmas_thp_default_huge(me->mm);
>>>> + break;
>>>> + default:
>>>> + return -EINVAL;
>
> Oh I just noticed - this is really broken - you're not unlocking the mmap()
> here on error... :) you definitely need to fix this.
>
Ah yes, will do Thanks!
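
One possible shape for that fix, sketched as a userspace model rather than
the actual kernel/sys.c change: record the error and fall through to a
single unlock instead of returning from inside the locked region. The
toy_* types and constants are hypothetical stand-ins for the mm, its mmap
lock and the proposed flag.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-ins for the mm and its mmap lock (not kernel API). */
struct toy_mm {
    bool write_locked;
    unsigned long flags2;
};

#define TOY_POLICY_DEFAULT_HUGE 0UL  /* mirrors PR_THP_POLICY_DEFAULT_HUGE */
#define TOY_DEFAULT_HUGE_BIT    0    /* mirrors MMF2_THP_VMA_DEFAULT_HUGE */

static int toy_set_thp_policy(struct toy_mm *mm, unsigned long arg2)
{
    int error = 0;

    mm->write_locked = true;         /* stands in for mmap_write_lock_killable() */
    switch (arg2) {
    case TOY_POLICY_DEFAULT_HUGE:
        mm->flags2 |= 1UL << TOY_DEFAULT_HUGE_BIT;
        break;
    default:
        error = -EINVAL;             /* record the error, do not return yet */
        break;
    }
    mm->write_locked = false;        /* stands in for mmap_write_unlock(), every path */
    return error;
}
```

Validating arg2 before taking the lock at all would be another way to avoid
the leak; which one fits better depends on how many policies get added.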
>>>> + }
>>>> + mmap_write_unlock(me->mm);
>>>> + break;
>>>> case PR_MPX_ENABLE_MANAGEMENT:
>>>> case PR_MPX_DISABLE_MANAGEMENT:
>>>> /* No longer implemented: */
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 2780a12b25f0..64f66d5295e8 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -98,6 +98,38 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>>>> return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
>>>> }
>>>>
>>>> +void vma_set_thp_policy(struct vm_area_struct *vma)
>>>> +{
>>>> + struct mm_struct *mm = vma->vm_mm;
>>>> +
>>>> + if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
>>>> + vm_flags_set(vma, VM_HUGEPAGE);
>>>> +}
>>>> +
>>>> +static void vmas_thp_default_huge(struct mm_struct *mm)
>>>> +{
>>>> + struct vm_area_struct *vma;
>>>> + unsigned long vm_flags;
>>>> +
>>>> + VMA_ITERATOR(vmi, mm, 0);
>>>
>>> This is a declaration, it should be grouped with declarations...
>>>
>>
>> Sure, will make the change in next version.
>>
>> Unfortunately checkpatch didn't complain.
>
> Checkpatch actually complains the other way :P it doesn't understand
> macros.
>
> So you'll start getting a warning here, which you can ignore. It sucks, but
> there we go. Making checkpatch.pl understand that would be a pain, probs.
>
>>
>>>> + for_each_vma(vmi, vma) {
>>>> + vm_flags = vma->vm_flags;
>>>> + if (vm_flags & VM_NOHUGEPAGE)
>>>> + continue;
>>>
>>> Literally no point in you putting vm_flags as a separate variable here.
>>>
>>
>> Sure, will make the change in next version.
>
> Thanks!
>
>>
>>> So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise
>>> is to override global 'never'?
>>>
>>
>> Again, I am not overriding never.
>>
>> hugepage_global_always and hugepage_global_enabled will evaluate to false
>> and you will not get a hugepage.
>
> Yeah, again ack, but I kind of hate that we set VM_HUGEPAGE everywhere even
> if the policy is never.
>
> And we now get into realms of:
>
> 'Hey I set prctl() to make everything huge pages, and PR_GET_THP_POLICY
> says I've set that, but nothing is huge? BUG???'
>
> Of course then you get into - if somebody sets it to never, do we go around
> and remove VM_HUGEPAGE and this MMF_ flag?
>
>>
>>
>>> I'm really concerned about this.
>>>
>>>> + vm_flags_set(vma, VM_HUGEPAGE);
>>>> + }
>>>> +}
>>>
>>> Do we have an mmap write lock established here? Can you confirm that? Also
>>> you should add an assert for that here.
>>>
>>
>> Yes I do, its only called in PR_SET_THP_POLICY where mmap_write lock was taken.
>> I can add an assert if it helps.
>
> It not only helps, it's utterly critical :)
>
> 'It's only called in xxx()' is famous last words for a programmer, because
> later somebody (maybe even your good self) calls it from somewhere else
> and... we've all been there...
>
Thanks! Will do.
>>
>>>> +
>>>> +void process_vmas_thp_default_huge(struct mm_struct *mm)
>>>> +{
>>>> + if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
>>>> + return;
>>>> +
>>>> + set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2);
>>>> + vmas_thp_default_huge(mm);
>>>> +}
>>>> +
>>>> +
>>>> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>> unsigned long vm_flags,
>>>> unsigned long tva_flags,
>>>> diff --git a/mm/vma.c b/mm/vma.c
>>>> index 1f2634b29568..101b19c96803 100644
>>>> --- a/mm/vma.c
>>>> +++ b/mm/vma.c
>>>> @@ -2476,6 +2476,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
>>>> if (!vma_is_anonymous(vma))
>>>> khugepaged_enter_vma(vma, map->flags);
>>>> ksm_add_vma(vma);
>>>> + vma_set_thp_policy(vma);
>>>
>>> You're breaking VMA merging completely by doing this here...
>>>
>>> Now I can map one VMA with this policy set, then map another immediately
>>> next to it and - oops - no merge, ever, because the VM_HUGEPAGE flag is not
>>> set in the new VMA on merge attempt.
>>>
>>> I realise KSM is just as broken (grr) but this doesn't justify us
>>> completely breaking VMA merging here.
>>
>> I think this answers it. Its doing the same as KSM.
>
> Yes, but as I said there, it's not acceptable, at all.
>
> You're making it so literally VMA merging _does not happen at all_. That's
> unacceptable and might even break some workloads.
>
> You'll certainly cause very big kernel metadata usage.
>
> Consider:
>
> |-----------------------------|..................|
> | some VMA flags, VM_HUGEPAGE | proposed new VMA |
> |-----------------------------|..................|
>
> Now, because you set VM_HUGEPAGE _after any merge is attempted_, this will
> _always_ be fragmented, forever.
>
So if __mmap_new_vma and do_brk_flags are called after the merge attempt,
is it possible to call vma_set_thp_policy (or do something similar) before
the merge attempt?
Actually I just read your reply to the next block, so I think it's ok?
Added more to the next block.
I don't have any preference on where it's put, so happy with putting this
earlier.
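
A toy userspace model of the ordering concern, under heavy simplification:
here two mappings merge only if they are adjacent and carry identical
flags, whereas the real merge checks also look at things like the backing
file, offset and anon_vma. All toy_* names and the flag value are
hypothetical.

```c
#include <stdbool.h>

#define TOY_VM_HUGEPAGE 0x1UL   /* illustrative value, not the kernel's */

struct toy_vma {
    unsigned long start, end, flags;
};

/* Simplified merge rule: adjacent and identical flags. */
static bool toy_can_merge(const struct toy_vma *prev, unsigned long start,
                          unsigned long flags)
{
    return prev->end == start && prev->flags == flags;
}

/*
 * Map [start, end) next to prev. Returns true if prev was extended,
 * false if a separate VMA had to be created in *out.
 */
static bool toy_map(struct toy_vma *prev, struct toy_vma *out,
                    unsigned long start, unsigned long end,
                    unsigned long flags, bool policy_before_merge)
{
    if (policy_before_merge)
        flags |= TOY_VM_HUGEPAGE;   /* policy visible to the merge check */

    if (toy_can_merge(prev, start, flags)) {
        prev->end = end;
        return true;
    }

    out->start = start;
    out->end = end;
    out->flags = flags | TOY_VM_HUGEPAGE;   /* policy applied after the fact */
    return false;
}
```

With the policy applied after the merge check, a new mapping next to a
VM_HUGEPAGE neighbour always ends up as a separate VMA; applying it first
lets the neighbour simply be extended.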
> That's just not... acceptable.
>
> The fact KSM is broken this way doesn't make that OK.
>
> Especially on brk(), which now will _always_ allocate new VMAs for every
> brk() expansion which doesn't seem very efficient.
>
> It may also majorly degrade performance.
>
> That makes me think we need some perf testing for this ideally...
>
>>
>>>
>>> You need to set earlier than this. Then of course a driver might decide to
>>> override this, so maybe then we need to override that.
>>>
>>> But then we're getting into realms of changing fundamental VMA code _just
>>> for this feature_.
>>>
>>> Again I'm iffy about this. Very.
>>>
>>> Also you've broken the VMA userland tests here:
>>>
>>> $ cd tools/testing/vma
>>> $ make
>>> ...
>>> In file included from vma.c:33:
>>> ../../../mm/vma.c: In function ‘__mmap_new_vma’:
>>> ../../../mm/vma.c:2486:9: error: implicit declaration of function ‘vma_set_thp_policy’; did you mean ‘vma_dup_policy’? [-Wimplicit-function-declaration]
>>> 2486 | vma_set_thp_policy(vma);
>>> | ^~~~~~~~~~~~~~~~~~
>>> | vma_dup_policy
>>> make: *** [<builtin>: vma.o] Error 1
>>>
>>> You need to create stubs accordingly.
>>>
>>
>> Thanks will do.
>
> Thanks!
>
>>
>>>> *vmap = vma;
>>>> return 0;
>>>>
>>>> @@ -2705,6 +2706,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
>>>> mm->map_count++;
>>>> validate_mm(mm);
>>>> ksm_add_vma(vma);
>>>> + vma_set_thp_policy(vma);
>>>
>>> You're breaking merging again... This is quite a bad case too as now you'll
>>> have totally fragmented brk VMAs no?
>>>
>>
>> Again doing it the same as KSM.
>
> That doesn't make it ok. Just because KSM is broken doesn't make this ok. I
> mean grr at KSM :) I'm going to look into that and see about
> investigating/fixing that behaviour.
>
> obviously I can't accept anything that will fundamentally break VMA
> merging.
>
Of course!
> The answer really is to do this earlier, but you risk a driver overriding
> it, but that's OK I think (I don't even think any in-tree ones do actually
> _anywhere_ - and yes I was literally reading through _every single_ .mmap()
> callback lately because I am quite obviously insane ;)
>
> Again I can help with this.
>
Appreciate it!
I am actually not familiar with the merge code. I will try to have a look,
but if you could point me to the file:line after which it is no longer
acceptable to set the flag, I can move vma_set_thp_policy before it or try
to do something similar.
>>
>>> We can't have it implemented this way.
>>>
>>>> out:
>>>> perf_event_mmap(vma);
>>>> mm->total_vm += len >> PAGE_SHIFT;
>>>> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
>>>> index 35791791a879..f5945ebfe3f2 100644
>>>> --- a/tools/include/uapi/linux/prctl.h
>>>> +++ b/tools/include/uapi/linux/prctl.h
>>>> @@ -328,4 +328,8 @@ struct prctl_mm_map {
>>>> # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */
>>>> # define PR_PPC_DEXCR_CTRL_MASK 0x1f
>>>>
>>>> +#define PR_SET_THP_POLICY 78
>>>> +#define PR_GET_THP_POLICY 79
>>>> +#define PR_THP_POLICY_DEFAULT_HUGE 0
>>>> +
>>>> #endif /* _LINUX_PRCTL_H */
>>>> diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
>>>> index 15c18ef4eb11..325c72f40a93 100644
>>>> --- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
>>>> +++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
>>>> @@ -364,4 +364,8 @@ struct prctl_mm_map {
>>>> # define PR_TIMER_CREATE_RESTORE_IDS_ON 1
>>>> # define PR_TIMER_CREATE_RESTORE_IDS_GET 2
>>>>
>>>> +#define PR_SET_THP_POLICY 78
>>>> +#define PR_GET_THP_POLICY 79
>>>> +#define PR_THP_POLICY_DEFAULT_HUGE 0
>>>> +
>>>> #endif /* _LINUX_PRCTL_H */
>>>> --
>>>> 2.47.1
>>>>
>>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 16:38 ` Lorenzo Stoakes
@ 2025-05-15 17:29 ` David Hildenbrand
2025-05-15 18:09 ` Liam R. Howlett
0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-05-15 17:29 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Liam R. Howlett, Usama Arif, Andrew Morton, linux-mm, hannes,
shakeel.butt, riel, ziy, laoar.shao, baolin.wang, npache,
ryan.roberts, linux-kernel, linux-doc, kernel-team
>>
>
> Did we document all this? :)
>
> It'd be good to be super explicit about these sorts of 'dependency chains'.
>
Documentation/admin-guide/mm/transhuge.rst has under "Global THP
controls" quite some stuff about all that, yes.
The whole document needs an overhaul, to clarify on the whole
terminology, make it consistent, and better explain how the pagecache
behaves etc. On my todo list, but I'm afraid it will be a bit of work to
get it right / please most people.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 16:11 ` David Hildenbrand
@ 2025-05-15 18:08 ` Lorenzo Stoakes
2025-05-15 19:12 ` David Hildenbrand
0 siblings, 1 reply; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-05-15 18:08 UTC (permalink / raw)
To: David Hildenbrand
Cc: Usama Arif, Andrew Morton, linux-mm, hannes, shakeel.butt, riel,
ziy, laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On Thu, May 15, 2025 at 06:11:55PM +0200, David Hildenbrand wrote:
> > > > So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise
> > > > is to override global 'never'?
> > > >
> > >
> > > Again, I am not overriding never.
> > >
> > > hugepage_global_always and hugepage_global_enabled will evaluate to false
> > > and you will not get a hugepage.
> >
> > Yeah, again ack, but I kind of hate that we set VM_HUGEPAGE everywhere even
> > if the policy is never.
>
> I think it should behave just as if someone does manually an madvise(). So
> whatever we do here during an madvise, we should try to do the same thing
> here.
Ack I agree with this.
It actually simplifies things a LOT to view it this way - we're saying 'by
default apply madvise(...) to new VMAs'.
Hm I wonder if we could have a more generic version of this...
Note though that we're not _quite_ doing this.
So in hugepage_madvise():
int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
{
...
switch (advice) {
case MADV_HUGEPAGE:
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;
...
break;
...
}
...
}
So here we're actually clearing VM_NOHUGEPAGE and overriding it, but in the
proposed code we're not.
So we're back into confusing territory again :)
I wonder if we could...
1. Add an MADV_xxx that mimics the desired behaviour here.
2. Add a generic 'madvise() by default' thing at a process level?
Is this crazy?
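
To spell out the mismatch, a small userspace sketch of the two flag
transformations side by side. The bit values are illustrative only and the
toy_* names are hypothetical; the logic follows the hugepage_madvise()
excerpt above and the VMA walk in the proposed patch.

```c
/* Illustrative bit values only; the real VM_* constants differ. */
#define TOY_VM_HUGEPAGE   0x1UL
#define TOY_VM_NOHUGEPAGE 0x2UL

/* What MADV_HUGEPAGE does to a VMA's flags: NOHUGEPAGE is overridden. */
static unsigned long toy_madvise_hugepage(unsigned long vm_flags)
{
    vm_flags &= ~TOY_VM_NOHUGEPAGE;
    vm_flags |= TOY_VM_HUGEPAGE;
    return vm_flags;
}

/* What the proposed prctl walk does: NOHUGEPAGE VMAs are left alone. */
static unsigned long toy_prctl_default_huge(unsigned long vm_flags)
{
    if (vm_flags & TOY_VM_NOHUGEPAGE)
        return vm_flags;
    return vm_flags | TOY_VM_HUGEPAGE;
}
```

The two agree for a VMA with neither flag set and differ only when
VM_NOHUGEPAGE is already present, which is exactly the case that would need
defining if this is to be 'madvise() by default'.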
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 17:29 ` David Hildenbrand
@ 2025-05-15 18:09 ` Liam R. Howlett
2025-05-15 18:21 ` Lorenzo Stoakes
2025-05-15 19:20 ` David Hildenbrand
0 siblings, 2 replies; 51+ messages in thread
From: Liam R. Howlett @ 2025-05-15 18:09 UTC (permalink / raw)
To: David Hildenbrand
Cc: Lorenzo Stoakes, Usama Arif, Andrew Morton, linux-mm, hannes,
shakeel.butt, riel, ziy, laoar.shao, baolin.wang, npache,
ryan.roberts, linux-kernel, linux-doc, kernel-team
* David Hildenbrand <david@redhat.com> [250515 13:30]:
> > >
> >
> > Did we document all this? :)
> >
> > It'd be good to be super explicit about these sorts of 'dependency chains'.
> >
>
> Documentation/admin-guide/mm/transhuge.rst has under "Global THP controls"
> quite some stuff about all that, yes.
>
> The whole document needs an overhaul, to clarify on the whole terminology,
> make it consistent, and better explain how the pagecache behaves etc. On my
> todo list, but I'm afraid it will be a bit of work to get it right / please
> most people.
Yes, the whole thing is making me grumpy (more than my default state).
The more I think about it, the more I don't like the prctl approach
either...
I more than dislike flags2... I hate it.
but no prctl, no cgroups, no bpf.. what is left? A new policy groups
thing? No, not that either, please.
To state the obvious, none of this is transparent.
Regards,
Liam
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 18:09 ` Liam R. Howlett
@ 2025-05-15 18:21 ` Lorenzo Stoakes
2025-05-15 18:42 ` Zi Yan
2025-05-15 18:46 ` Usama Arif
2025-05-15 19:20 ` David Hildenbrand
1 sibling, 2 replies; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-05-15 18:21 UTC (permalink / raw)
To: Liam R. Howlett, David Hildenbrand, Usama Arif, Andrew Morton,
linux-mm, hannes, shakeel.butt, riel, ziy, laoar.shao,
baolin.wang, npache, ryan.roberts, linux-kernel, linux-doc,
kernel-team
On Thu, May 15, 2025 at 02:09:56PM -0400, Liam R. Howlett wrote:
> * David Hildenbrand <david@redhat.com> [250515 13:30]:
> > > >
> > >
> > > Did we document all this? :)
> > >
> > > It'd be good to be super explicit about these sorts of 'dependency chains'.
> > >
> >
> > Documentation/admin-guide/mm/transhuge.rst has under "Global THP controls"
> > quite some stuff about all that, yes.
> >
> > The whole document needs an overhaul, to clarify on the whole terminology,
> > make it consistent, and better explain how the pagecache behaves etc. On my
> > todo list, but I'm afraid it will be a bit of work to get it right / please
> > most people.
>
> Yes, the whole thing is making me grumpy (more than my default state).
> The more I think about it, the more I don't like the prctl approach
> either...
prctl() feels like it's literally never, ever the right choice.
It feels like we shove all the dark stuff we want to put under the rug
there.
Reading the man page is genuinely frightening. there's stuff about VMAs _I
wasn't aware of_.
It's also never really the _right time_ to do it - it's not process
inception is it? It's when the process has started, now you suddenly fiddle
with it.
Then relying on mm flags being propagated over fork/exec is just, it's a
hack really.
>
> I more than dislike flags2... I hate it.
Yeah, to be clear - I will NACK any series that tries to add flags2 unless
a VERY VERY good justification is given. It's horrid. And frankly this
feature doesn't warrant something as horrible.
But making mm->flags 64-bit on 32-bit kernels (which are in effect
deprecated in my view) would fix this.
>
> but no prctl, no cgroups, no bpf.. what is left? A new policy groups
> thing? No, not that either, please.
A new clone[,2,3]() flag?
process_madvise() feels literally made for this, but at the same time only
lets you do a small range at a time.
Some ability to default a process's madvise() state on VMA inception and
having a new syscall for that could be a thing, but then is that prctl()
wearing a hat...
>
> To state the obvious, none of this is transparent.
Indeed...
>
> Regards,
> Liam
>
>
I'm worried about knock on effects too, now we have a way to force-set VMA
flags.
And a working version of this series will involve literally hard-coding
this stuff into the VMA logic and 'just remembering' to always set it up
right on new VMA inception.
It's all very horrible.
I guess going to RFC is also about testing for _feasibility_ of this as an
idea.
But let's explore and assess, maybe there's some way of making this work
that isn't horrid...
I feel like it exposes some weaknesses on policy setting anyway that maybe
we need to think about more deeply.
pidfd to the rescue?? Somehow?? ;)
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 16:47 ` Usama Arif
@ 2025-05-15 18:36 ` Lorenzo Stoakes
2025-05-15 19:17 ` David Hildenbrand
0 siblings, 1 reply; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-05-15 18:36 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
One thing I wanted to emphasise - shouldn't we invoke
khugepaged_enter_vma() for VMAs we set VM_HUGEPAGE for?
I note in __mmap_new_vma() we do so for non-anon VMAs _prior_ to invoking
vma_set_thp_policy().
I _really_ hate putting in conditional logic like this for specific
cases. Feels very hacky.
On Thu, May 15, 2025 at 05:47:34PM +0100, Usama Arif wrote:
>
>
> On 15/05/2025 17:06, Lorenzo Stoakes wrote:
>
> >> Its doing the same as KSM as suggested by David. Does KSM break these tests?
> >> Is there some specific test you can point to that I can run that is breaking
> >> with this patch and not without it?
> >
> > They don't build, at all. I explain how you can attempt a build below.
> >
>
> Ah yes, I initially thought by break you meant they were failing. I saw that,
> will fix it.
>
> > And no, KSM doesn't break the tests, because steps were taken to make them
> > not break the tests :) I mean it's really easy - it's just adding some
> > trivial stubs.
> >
>
> Yes, will do Thanks!
Thanks.
>
> > If you need help with it just ping me in whatever way helps and I can help!
> >
> > It's understandable as it's not necessarily clear this is a thing (maybe we
> > need self tests to build it, but that might break CI setups so unclear).
> >
> > The merging is much more important!
To re-emphasise ^ the merging is key - I will unfortunately have to NACK
any series that breaks it. So this is an absolute requirement here.
> >
> >>
> >>
> >>> I really feel this series needs to be an RFC until we can get some
> >>> consensus on how to approach this.
> >>
> >> There was consensus in https://lore.kernel.org/all/97702ff0-fc50-4779-bfa8-83dc42352db1@redhat.com/
> >
> > I disagree with this assessment, that doesn't look like consensus at all,
> > I think at least this is a very contentious or at least _complicated_ topic
> > that we need to really dig into.
> >
> > So in my view - it's this kind of situation that warrants an RFC until
> > there's some stabilisation and agreement on a way forward.
> >
>
> Sure will change next revision to RFC, unless hopefully maybe we can
> get a consensus in this revision :)
Thanks!
>
> >>
> >>>
> >>> On Thu, May 15, 2025 at 02:33:30PM +0100, Usama Arif wrote:
> >>>> This is set via the new PR_SET_THP_POLICY prctl.
> >>>
> >>> What is?
> >>>
> >>> You're making very major changes here, including adding a new flag to
> >>> mm_struct (!!) and the explanation/justification for this is missing.
> >>>
> >>
> >> I have added the justification in your reply to the coverletter.
> >
> > As stated there, you've not explained why alternatives are unworkable, I
> > think we need this!
> >
> > Sort of:
> >
> > 1. Why not cgroups? blah blah blah
> > 2. Why not process_madvise()? blah blah blah
> > 3. Why not bpf? blah blah blah
> > 4. Why not <something I've not thought of>? blah blah blah
> >
>
> I will add this in the next cover letter.
Thanks
>
>
> >>
> >>>> This will set the MMF2_THP_VMA_DEFAULT_HUGE process flag
> >>>> which changes the default of new VMAs to be VM_HUGEPAGE. The
> >>>> call also modifies all existing VMAs that are not VM_NOHUGEPAGE
> >>>> to be VM_HUGEPAGE. The policy is inherited during fork+exec.
> >>>
> >>> So you can only set this flag?
> >>>
> >>
> >> ??
> >
> > This patch is only allowing the setting of this flag. I am asking 'so you
> > can only set this flag?'
> >
> > To which it appears the answer is, yes I think :)
> >
> > An improved cover letter could say something like:
> >
> > "
> > Here we implement the first flag intended to allow the _overriding_ of huge
> > page policy to ensure that, when
> > /sys/kernel/mm/transparent_hugepage/enabled is set to madvise, we are able
> > to maintain fine-grained control of individual processes, including any
> > fork/exec'd, by setting this flag.
> >
> > In subsequent commits, we intend to permit further such control.
> > "
> >
> >>
> >>>>
> >>>> This allows systems where the global policy is set to "madvise"
> >>>> to effectively have THPs always for the process. In an environment
> >>>> where different types of workloads are stacked on the same machine,
> >>>> this will allow workloads that benefit from always having hugepages
> >>>> to do so, without regressing those that don't.
> >>>
> >>> Again, this explanation really makes no sense at all to me, I don't really
> >>> know what you mean, you're not going into what you're doing in this change,
> >>> this is just a very unclear commit message.
> >>>
> >>
> >> I hope this is answered in my reply to your coverletter.
> >
> > You still need to improve the cover letter here I think, see above for a
> > suggestion!
> >
>
> Sure, will do in the next revision, Thanks!
Thanks
> >>
> >>>>
> >>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >>>> ---
> >>>> include/linux/huge_mm.h | 3 ++
> >>>> include/linux/mm_types.h | 11 +++++++
> >>>> include/uapi/linux/prctl.h | 4 +++
> >>>> kernel/fork.c | 1 +
> >>>> kernel/sys.c | 21 ++++++++++++
> >>>> mm/huge_memory.c | 32 +++++++++++++++++++
> >>>> mm/vma.c | 2 ++
> >>>> tools/include/uapi/linux/prctl.h | 4 +++
> >>>> .../trace/beauty/include/uapi/linux/prctl.h | 4 +++
> >>>> 9 files changed, 82 insertions(+)
> >>>>
> >>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >>>> index 2f190c90192d..e652ad9ddbbd 100644
> >>>> --- a/include/linux/huge_mm.h
> >>>> +++ b/include/linux/huge_mm.h
> >>>> @@ -260,6 +260,9 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
> >>>> return orders;
> >>>> }
> >>>>
> >>>> +void vma_set_thp_policy(struct vm_area_struct *vma);
> >>>
> >>> This is a VMA-specific function but you're putting it in huge_mm.h? Why
> >>> can't this be in vma.h or vma.c?
> >>>
> >>
> >> Sure can move it there.
> >>
> >>>> +void process_vmas_thp_default_huge(struct mm_struct *mm);
> >>>
> >>> 'vmas' is redundant here.
> >>>
> >>
> >> Sure.
> >>>> +
> >>>> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> >>>> unsigned long vm_flags,
> >>>> unsigned long tva_flags,
> >>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >>>> index e76bade9ebb1..2fe93965e761 100644
> >>>> --- a/include/linux/mm_types.h
> >>>> +++ b/include/linux/mm_types.h
> >>>> @@ -1066,6 +1066,7 @@ struct mm_struct {
> >>>> mm_context_t context;
> >>>>
> >>>> unsigned long flags; /* Must use atomic bitops to access */
> >>>> + unsigned long flags2;
> >>>
> >>>
> >>> Ugh, god really??
> >>>
> >>> I really am not a fan of adding flags2 just to add a prctl() feature like
> >>> this. This is crazy.
> >>>
> >>> Also this is a TERRIBLE name. I mean, no please PLEASE no.
> >>>
> >>> Do we really have absolutely no choice but to add a new flags field here?
> >>>
> >>> It again doesn't help that you don't mention nor even try to justify this
> >>> in the commit message or cover letter.
> >>>
> >>
> >> And again, I hope my reply to your email has given you the justification.
> >
> > No :) I understood why you did this though of course.
> >
> >>
> >>> If this is a 32-bit kernel vs. 64-bit kernel thing so we 'ran out of bits',
> >>> let's just go make this flags field 64-bit on 32-bit kernels.
> >>>
> >>> I mean - I'm kind of insisting we do that to be honest. Because I really
> >>> don't like this.
> >>
> >>
> >> If the maintainers want this, I will make it a 64 bit only feature. We
> >> are only using it for 64 bit servers. But it will probably mean ifdef
> >> config 64 bit in a lot of places.
> >
> > I'm going to presume you are including me in this category rather than
> > implying that you are deferring only to others :)
> >
>
> Yes, of course! I mean all maintainers :)
>
> And hopefully everyone else as well :)
>
> > So, there's another option:
> >
> > Have a prerequisite series that makes mm_struct->flags 64-bit on 32-bit
> > kernels, which solves this problem everywhere and avoids us wasting a bunch
> > of memory for a very specific usecase, splitting flag state across 2 fields
> > (which are no longer atomic as a whole of course), adding confusion,
> > possibly subtly breaking anywhere that assumes mm->flags completely
> > describes mm-granularity flag state etc.
> >
>
> This is probably a very basic question, but by making mm_struct->flags 64-bit on 32-bit,
> do you mean converting flags to unsigned long long when !CONFIG_64BIT?
Yes. This would be a worthwhile project in its own right.
I will be changing vma->vm_flags in the same way soon.
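As a rough illustration of the idea only (a user-space model, not a kernel
patch): the flag word gets a fixed 64-bit type, so bit numbers above 31 stay
valid even where unsigned long is 32 bits wide. The type name, the helper
names and bit number 40 are all hypothetical, and C11 atomics merely stand in
for the kernel's atomic bitops.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: one flag word that is 64 bits wide on every architecture. */
typedef _Atomic uint64_t mm_flags_model_t;

/* Hypothetical bit number that would not fit in a 32-bit unsigned long. */
#define MODEL_MMF_THP_DEFAULT_HUGE 40

/* Atomically set one bit in the 64-bit flag word. */
static inline void model_set_flag(mm_flags_model_t *flags, unsigned int bit)
{
	atomic_fetch_or(flags, UINT64_C(1) << bit);
}

/* Atomically clear one bit in the 64-bit flag word. */
static inline void model_clear_flag(mm_flags_model_t *flags, unsigned int bit)
{
	atomic_fetch_and(flags, ~(UINT64_C(1) << bit));
}

/* Read the flag word once and report whether the bit is set. */
static inline bool model_test_flag(mm_flags_model_t *flags, unsigned int bit)
{
	return (atomic_load(flags) >> bit) & 1;
}
```

With a single word like this there is no second field to keep in sync, which
is the property being asked for above; whether the real conversion would use
a plain 64-bit integer or something else is left open here.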
>
> > The RoI here is not looking good, otherwise.
> >
> >>
> >>>
> >>> Also if we _HAVE_ to have this, shouldn't we duplicate that comment about
> >>> atomic bitops?...
> >>>
> >>
> >> Sure
> >>
> >>>>
> >>>> #ifdef CONFIG_AIO
> >>>> spinlock_t ioctx_lock;
> >>>> @@ -1744,6 +1745,11 @@ enum {
> >>>> MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
> >>>> MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
> >>>>
> >>>> +#define MMF2_THP_VMA_DEFAULT_HUGE 0
> >>>
> >>> I thought the whole idea was to move away from explicitly referencing 'THP'
> >>> in a future where large folios are implicit and now we're saying 'THP'.
> >>>
> >>> Anyway the 'VMA' is totally redundant here.
> >>>
> >>
> >> Sure, I can remove VMA.
> >> I see THP everywhere in the kernel code.
> >> It's mentioned 108 times in transhuge.rst alone :)
> >> If you have any suggestion to rename this flag, happy to take it :)
> >
> > Yeah I mean it's a mess man, and it's not your fault... Again naming is
> > hard, I put a suggestion in reply to cover letter anyway...
> >
> >>
> >>>> +#define MMF2_THP_VMA_DEFAULT_HUGE_MASK (1 << MMF2_THP_VMA_DEFAULT_HUGE)
> >>>
> >>> Do we really need explicit trivial mask declarations like this?
> >>>
> >>
> >> I have followed the convention that has existed in this file, please see below
> >> links :)
> >> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1645
> >> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1623
> >> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1603
> >> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1582
> >
> > Ack, yuck but ack.
> >
> >>
> >>
> >>>> +
> >>>> +#define MMF2_INIT_MASK (MMF2_THP_VMA_DEFAULT_HUGE_MASK)
> >>>
> >>>> +
> >>>> static inline unsigned long mmf_init_flags(unsigned long flags)
> >>>> {
> >>>> if (flags & (1UL << MMF_HAS_MDWE_NO_INHERIT))
> >>>> @@ -1752,4 +1758,9 @@ static inline unsigned long mmf_init_flags(unsigned long flags)
> >>>> return flags & MMF_INIT_MASK;
> >>>> }
> >>>>
> >>>> +static inline unsigned long mmf2_init_flags(unsigned long flags)
> >>>> +{
> >>>> + return flags & MMF2_INIT_MASK;
> >>>> +}
> >>>> +
> >>>> #endif /* _LINUX_MM_TYPES_H */
> >>>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> >>>> index 15c18ef4eb11..325c72f40a93 100644
> >>>> --- a/include/uapi/linux/prctl.h
> >>>> +++ b/include/uapi/linux/prctl.h
> >>>> @@ -364,4 +364,8 @@ struct prctl_mm_map {
> >>>> # define PR_TIMER_CREATE_RESTORE_IDS_ON 1
> >>>> # define PR_TIMER_CREATE_RESTORE_IDS_GET 2
> >>>>
> >>>> +#define PR_SET_THP_POLICY 78
> >>>> +#define PR_GET_THP_POLICY 79
> >>>> +#define PR_THP_POLICY_DEFAULT_HUGE 0
> >>>> +
> >>>> #endif /* _LINUX_PRCTL_H */
> >>>> diff --git a/kernel/fork.c b/kernel/fork.c
> >>>> index 9e4616dacd82..6e5f4a8869dc 100644
> >>>> --- a/kernel/fork.c
> >>>> +++ b/kernel/fork.c
> >>>> @@ -1054,6 +1054,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> >>>>
> >>>> if (current->mm) {
> >>>> mm->flags = mmf_init_flags(current->mm->flags);
> >>>> + mm->flags2 = mmf2_init_flags(current->mm->flags2);
> >>>> mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
> >>>> } else {
> >>>> mm->flags = default_dump_filter;
> >>>> diff --git a/kernel/sys.c b/kernel/sys.c
> >>>> index c434968e9f5d..1115f258f253 100644
> >>>> --- a/kernel/sys.c
> >>>> +++ b/kernel/sys.c
> >>>> @@ -2658,6 +2658,27 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> >>>> clear_bit(MMF_DISABLE_THP, &me->mm->flags);
> >>>> mmap_write_unlock(me->mm);
> >>>> break;
> >>>> + case PR_GET_THP_POLICY:
> >>>> + if (arg2 || arg3 || arg4 || arg5)
> >>>> + return -EINVAL;
> >>>> + if (!!test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2))
> >>>
> >>> I really don't think we need the !!? Do we?
> >>
> >> I have followed the convention that has existed in this file already,
> >> please see:
> >> https://elixir.bootlin.com/linux/v6.14.6/source/kernel/sys.c#L2644
> >
> > OK, but please don't, I don't see why this is necessary. if (truthy) is
> > fine.
> >
> > Unless somebody has a really good reason why this is necessary, it's just
> > ugly ceremony.
> >
>
> Agreed :)
Thanks
>
> >>
> >>>
> >>> Shouldn't we lock the mm when we do this no? Can't somebody change this?
> >>>
> >>
> >> It wasn't locked in PR_GET_THP_DISABLE
> >> https://elixir.bootlin.com/linux/v6.14.6/source/kernel/sys.c#L2644
> >>
> >> I can take mmap_write_lock_killable(), the same as PR_SET_THP_POLICY does,
> >> in the next series.
> >>
> >> I can also add the lock in PR_GET_THP_DISABLE.
> >
> > Well, the issue I guess is... if the flags field is atomic, and we know
> > over this call maybe we can rely on mm sticking around, then we probably
> > don't need an mmap lock actually.
> >
> >>
> >>>> + error = PR_THP_POLICY_DEFAULT_HUGE;
> >
> > Wait, error = PR_THP_POLICY_DEFAULT_HUGE? Is this the convention for
> > returning here? :)
>
> I see a few of the PR_GET_.. setting the return value. I hope I didn't
> misinterpret that.
Yeah I thought it might be the case. I reemphasise my dislike of prctl().
>
> >
> >>>> + break;
> >>>> + case PR_SET_THP_POLICY:
> >>>> + if (arg3 || arg4 || arg5)
> >>>> + return -EINVAL;
> >>>> + if (mmap_write_lock_killable(me->mm))
> >>>> + return -EINTR;
> >>>> + switch (arg2) {
> >>>> + case PR_THP_POLICY_DEFAULT_HUGE:
> >>>> + set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
> >>>> + process_vmas_thp_default_huge(me->mm);
> >>>> + break;
> >>>> + default:
> >>>> + return -EINVAL;
> >
> > Oh I just noticed - this is really broken - you're not releasing the mmap
> > write lock here on error... :) you definitely need to fix this.
> >
>
> Ah yes, will do Thanks!
Thanks
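For reference, a minimal user-space sketch of the shape the fix could take:
record the error and fall through to a single unlock rather than returning
from inside the locked region. This is a model, not the actual patch; the
struct, the MODEL_* constants and the write_locked field are hypothetical
stand-ins for the mm, the prctl values and the mmap write lock.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the patch's constants; values are illustrative. */
#define MODEL_POLICY_DEFAULT_HUGE 0UL
#define MODEL_FLAG_DEFAULT_HUGE   (1UL << 0)

struct model_mm {
	int write_locked;	/* models the mmap write lock being held */
	unsigned long flags2;
};

/* Models the PR_SET_THP_POLICY case with exactly one unlock on every path. */
static int model_set_thp_policy(struct model_mm *mm, unsigned long arg2)
{
	int error = 0;

	assert(!mm->write_locked);	/* the model lock is not recursive */
	mm->write_locked = 1;		/* stands in for mmap_write_lock_killable() */

	switch (arg2) {
	case MODEL_POLICY_DEFAULT_HUGE:
		mm->flags2 |= MODEL_FLAG_DEFAULT_HUGE;
		break;
	default:
		error = -EINVAL;	/* record the error instead of returning */
		break;
	}

	mm->write_locked = 0;		/* stands in for mmap_write_unlock() */
	return error;
}
```

An unknown arg2 now yields -EINVAL with the lock dropped and the flags
untouched.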
>
> >>>> + }
> >>>> + mmap_write_unlock(me->mm);
> >>>> + break;
> >>>> case PR_MPX_ENABLE_MANAGEMENT:
> >>>> case PR_MPX_DISABLE_MANAGEMENT:
> >>>> /* No longer implemented: */
> >>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>>> index 2780a12b25f0..64f66d5295e8 100644
> >>>> --- a/mm/huge_memory.c
> >>>> +++ b/mm/huge_memory.c
> >>>> @@ -98,6 +98,38 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
> >>>> return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
> >>>> }
> >>>>
> >>>> +void vma_set_thp_policy(struct vm_area_struct *vma)
> >>>> +{
> >>>> + struct mm_struct *mm = vma->vm_mm;
> >>>> +
> >>>> + if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
> >>>> + vm_flags_set(vma, VM_HUGEPAGE);
> >>>> +}
> >>>> +
> >>>> +static void vmas_thp_default_huge(struct mm_struct *mm)
> >>>> +{
> >>>> + struct vm_area_struct *vma;
> >>>> + unsigned long vm_flags;
> >>>> +
> >>>> + VMA_ITERATOR(vmi, mm, 0);
> >>>
> >>> This is a declaration, it should be grouped with declarations...
> >>>
> >>
> >> Sure, will make the change in next version.
> >>
> >> Unfortunately checkpatch didn't complain.
> >
> > Checkpatch actually complains the other way :P it doesn't understand
> > macros.
> >
> > So you'll start getting a warning here, which you can ignore. It sucks, but
> > there we go. Making checkpatch.pl understand that would be a pain, probs.
> >
> >>
> >>>> + for_each_vma(vmi, vma) {
> >>>> + vm_flags = vma->vm_flags;
> >>>> + if (vm_flags & VM_NOHUGEPAGE)
> >>>> + continue;
> >>>
> >>> Literally no point in you putting vm_flags as a separate variable here.
> >>>
> >>
> >> Sure, will make the change in next version.
> >
> > Thanks!
> >
> >>
> >>> So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise
> >>> is to override global 'never'?
> >>>
> >>
> >> Again, I am not overriding never.
> >>
> >> hugepage_global_always and hugepage_global_enabled will evaluate to false
> >> and you will not get a hugepage.
> >
> > Yeah, again ack, but I kind of hate that we set VM_HUGEPAGE everywhere even
> > if the policy is never.
> >
> > And we now get into realms of:
> >
> > 'Hey I set prctl() to make everything huge pages, and PR_GET_THP_POLICY
> > says I've set that, but nothing is huge? BUG???'
> >
> > Of course then you get into - if somebody sets it to never, do we go around
> > and remove VM_HUGEPAGE and this MMF_ flag?
> >
> >>
> >>
> >>> I'm really concerned about this.
> >>>
> >>>> + vm_flags_set(vma, VM_HUGEPAGE);
> >>>> + }
> >>>> +}
> >>>
> >>> Do we have an mmap write lock established here? Can you confirm that? Also
> >>> you should add an assert for that here.
> >>>
> >>
> >> Yes I do, it's only called in PR_SET_THP_POLICY where mmap_write lock was taken.
> >> I can add an assert if it helps.
> >
> > It not only helps, it's utterly critical :)
> >
> > 'It's only called in xxx()' is famous last words for a programmer, because
> > later somebody (maybe even your good self) calls it from somewhere else
> > and... we've all been there...
> >
>
> Thanks! Will do.
Thanks.
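A small user-space model of what the assert buys us, assuming a boolean that
stands in for the mmap write lock. All model_* names are hypothetical; in
kernel code the check would be mmap_assert_write_locked(mm) at the top of the
VMA walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_VM_HUGEPAGE   0x1UL
#define MODEL_VM_NOHUGEPAGE 0x2UL

struct model_vma {
	unsigned long vm_flags;
};

struct model_mm {
	bool write_locked;	/* models the mmap write lock being held */
	struct model_vma *vmas;
	size_t nr_vmas;
};

/* Marks every VMA huge unless it was explicitly marked VM_NOHUGEPAGE. */
static void model_vmas_default_huge(struct model_mm *mm)
{
	size_t i;

	/* In kernel code this is where mmap_assert_write_locked(mm) would go. */
	assert(mm->write_locked);

	for (i = 0; i < mm->nr_vmas; i++) {
		if (mm->vmas[i].vm_flags & MODEL_VM_NOHUGEPAGE)
			continue;
		mm->vmas[i].vm_flags |= MODEL_VM_HUGEPAGE;
	}
}
```

A future caller that forgets the lock then trips the assertion immediately
instead of silently racing with other VMA updates.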
> >>
> >>>> +
> >>>> +void process_vmas_thp_default_huge(struct mm_struct *mm)
> >>>> +{
> >>>> + if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
> >>>> + return;
> >>>> +
> >>>> + set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2);
> >>>> + vmas_thp_default_huge(mm);
> >>>> +}
> >>>> +
> >>>> +
> >>>> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> >>>> unsigned long vm_flags,
> >>>> unsigned long tva_flags,
> >>>> diff --git a/mm/vma.c b/mm/vma.c
> >>>> index 1f2634b29568..101b19c96803 100644
> >>>> --- a/mm/vma.c
> >>>> +++ b/mm/vma.c
> >>>> @@ -2476,6 +2476,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> >>>> if (!vma_is_anonymous(vma))
> >>>> khugepaged_enter_vma(vma, map->flags);
> >>>> ksm_add_vma(vma);
> >>>> + vma_set_thp_policy(vma);
> >>>
> >>> You're breaking VMA merging completely by doing this here...
> >>>
> >>> Now I can map one VMA with this policy set, then map another immediately
> >>> next to it and - oops - no merge, ever, because the VM_HUGEPAGE flag is not
> >>> set in the new VMA on merge attempt.
> >>>
> >>> I realise KSM is just as broken (grr) but this doesn't justify us
> >>> completely breaking VMA merging here.
> >>
> >> I think this answers it. It's doing the same as KSM.
> >
> > Yes, but as I said there, it's not acceptable, at all.
> >
> > You're making it so literally VMA merging _does not happen at all_. That's
> > unacceptable and might even break some workloads.
> >
> > You'll certainly cause very big kernel metadata usage.
> >
> > Consider:
> >
> > |-----------------------------|..................|
> > | some VMA flags, VM_HUGEPAGE | proposed new VMA |
> > |-----------------------------|..................|
> >
> > Now, because you set VM_HUGEPAGE _after any merge is attempted_, this will
> > _always_ be fragmented, forever.
> >
>
> So if __mmap_new_vma and do_brk_flags are called after merge attempt,
> is it possible to vma_set_thp_policy (or do something similar) before
> the merge attempt?
>
> Actually I just read your reply to the next block, so I think it's ok?
> Added more to the next block.
>
> I don't have any preference on where it's put, so happy with putting this
> earlier.
Yeah, you can just do it earlier. But you maybe should just set the flag in
the appropriate field rather than using the set flags helper.
>
>
> > That's just not... acceptable.
> >
> > The fact KSM is broken this way doesn't make that OK.
> >
> > Especially on brk(), which now will _always_ allocate new VMAs for every
> > brk() expansion which doesn't seem very efficient.
> >
> > It may also majorly degrade performance.
> >
> > That makes me think we need some perf testing for this ideally...
> >
> >>
> >>>
> >>> You need to set earlier than this. Then of course a driver might decide to
> >>> override this, so maybe then we need to override that.
> >>>
> >>> But then we're getting into realms of changing fundamental VMA code _just
> >>> for this feature_.
> >>>
> >>> Again I'm iffy about this. Very.
> >>>
> >>> Also you've broken the VMA userland tests here:
> >>>
> >>> $ cd tools/testing/vma
> >>> $ make
> >>> ...
> >>> In file included from vma.c:33:
> >>> ../../../mm/vma.c: In function ‘__mmap_new_vma’:
> >>> ../../../mm/vma.c:2486:9: error: implicit declaration of function ‘vma_set_thp_policy’; did you mean ‘vma_dup_policy’? [-Wimplicit-function-declaration]
> >>> 2486 | vma_set_thp_policy(vma);
> >>> | ^~~~~~~~~~~~~~~~~~
> >>> | vma_dup_policy
> >>> make: *** [<builtin>: vma.o] Error 1
> >>>
> >>> You need to create stubs accordingly.
> >>>
> >>
> >> Thanks will do.
> >
> > Thanks!
> >
> >>
> >>>> *vmap = vma;
> >>>> return 0;
> >>>>
> >>>> @@ -2705,6 +2706,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >>>> mm->map_count++;
> >>>> validate_mm(mm);
> >>>> ksm_add_vma(vma);
> >>>> + vma_set_thp_policy(vma);
> >>>
> >>> You're breaking merging again... This is quite a bad case too as now you'll
> >>> have totally fragmented brk VMAs no?
> >>>
> >>
> >> Again doing it the same as KSM.
> >
> > That doesn't make it ok. Just because KSM is broken doesn't make this ok. I
> > mean grr at KSM :) I'm going to look into that and see about
> > investigating/fixing that behaviour.
> >
> > obviously I can't accept anything that will fundamentally break VMA
> > merging.
> >
>
> Of course!
>
> > The answer really is to do this earlier, but you risk a driver overriding
> > it, but that's OK I think (I don't even think any in-tree ones do actually
> > _anywhere_ - and yes I was literally reading through _every single_ .mmap()
> > callback lately because I am quite obviously insane ;)
> >
> > Again I can help with this.
> >
>
> Appreciate it!
>
> I am actually not familiar with the merge code. I will try and have a look,
> but if you could point me to the file:line after which it's not acceptable
> to set the flag, I can move vma_set_thp_policy() to before it or try to do
> something similar.
Ack.
I wrote the latest merge and mmap() code so am well placed on this :>)
But I don't think we should use vma_set_thp_policy() in these places, we
should just set the flag, to avoid trying to do a write lock etc. etc.,
plus we want to set the flag in a place that's not a VMA yet in both cases.
So we'd need something like in do_mmap():
+ vm_flags |= mm_implied_vma_flags(mm);
addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
Where mm_implied_vma_flags() reads the MMF flags and sees if any imply VMA
flags.
But we have something for that already don't we? mm->def_flags.
Can't we use that actually? That should work for mmap too?
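To spell out why the ordering matters, here is a grossly simplified
user-space model. Only the name mm_implied_vma_flags() comes from the
suggestion above; the structs, the MODEL_ constant and the merge check
(adjacent range plus identical flags, ignoring everything else the real merge
code compares) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_VM_HUGEPAGE 0x1UL

struct model_mm {
	bool default_huge;	/* models the per-process policy flag */
};

struct model_vma {
	unsigned long start, end, vm_flags;
};

/* Models the suggested mm_implied_vma_flags(): mm-level policy -> VMA flags. */
static unsigned long model_mm_implied_vma_flags(const struct model_mm *mm)
{
	return mm->default_huge ? MODEL_VM_HUGEPAGE : 0;
}

/* Grossly simplified merge check: adjacent range and identical flags. */
static bool model_can_merge(const struct model_vma *prev,
			    unsigned long start, unsigned long vm_flags)
{
	return prev->end == start && prev->vm_flags == vm_flags;
}
```

In this model a mapping requested with no flags next to a VM_HUGEPAGE VMA
cannot merge, whereas OR-ing in the implied flags before the merge check lets
it merge. That is the difference between setting the flag after the VMA
exists and folding it into vm_flags up front.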
>
> >>
> >>> We can't have it implemented this way.
> >>>
> >>>> out:
> >>>> perf_event_mmap(vma);
> >>>> mm->total_vm += len >> PAGE_SHIFT;
> >>>> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
> >>>> index 35791791a879..f5945ebfe3f2 100644
> >>>> --- a/tools/include/uapi/linux/prctl.h
> >>>> +++ b/tools/include/uapi/linux/prctl.h
> >>>> @@ -328,4 +328,8 @@ struct prctl_mm_map {
> >>>> # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */
> >>>> # define PR_PPC_DEXCR_CTRL_MASK 0x1f
> >>>>
> >>>> +#define PR_SET_THP_POLICY 78
> >>>> +#define PR_GET_THP_POLICY 79
> >>>> +#define PR_THP_POLICY_DEFAULT_HUGE 0
> >>>> +
> >>>> #endif /* _LINUX_PRCTL_H */
> >>>> diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
> >>>> index 15c18ef4eb11..325c72f40a93 100644
> >>>> --- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
> >>>> +++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
> >>>> @@ -364,4 +364,8 @@ struct prctl_mm_map {
> >>>> # define PR_TIMER_CREATE_RESTORE_IDS_ON 1
> >>>> # define PR_TIMER_CREATE_RESTORE_IDS_GET 2
> >>>>
> >>>> +#define PR_SET_THP_POLICY 78
> >>>> +#define PR_GET_THP_POLICY 79
> >>>> +#define PR_THP_POLICY_DEFAULT_HUGE 0
> >>>> +
> >>>> #endif /* _LINUX_PRCTL_H */
> >>>> --
> >>>> 2.47.1
> >>>>
> >>
>
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 18:21 ` Lorenzo Stoakes
@ 2025-05-15 18:42 ` Zi Yan
2025-05-15 21:04 ` Lorenzo Stoakes
2025-05-15 18:46 ` Usama Arif
1 sibling, 1 reply; 51+ messages in thread
From: Zi Yan @ 2025-05-15 18:42 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Liam R. Howlett, David Hildenbrand, Usama Arif, Andrew Morton,
linux-mm, hannes, shakeel.butt, riel, laoar.shao, baolin.wang,
npache, ryan.roberts, linux-kernel, linux-doc, kernel-team
On 15 May 2025, at 14:21, Lorenzo Stoakes wrote:
> On Thu, May 15, 2025 at 02:09:56PM -0400, Liam R. Howlett wrote:
>> * David Hildenbrand <david@redhat.com> [250515 13:30]:
>>>>>
>>>>
>>>> Did we document all this? :)
>>>>
>>>> It'd be good to be super explicit about these sorts of 'dependency chains'.
>>>>
>>>
>>> Documentation/admin-guide/mm/transhuge.rst has under "Global THP controls"
>>> quite some stuff about all that, yes.
>>>
>>> The whole document needs an overhaul, to clarify on the whole terminology,
>>> make it consistent, and better explain how the pagecache behaves etc. On my
>>> todo list, but I'm afraid it will be a bit of work to get it right / please
>>> most people.
>>
>> Yes, the whole thing is making me grumpy (more than my default state).
>> The more I think about it, the more I don't like the prctl approach
>> either...
>
> prctl() feels like it's literally never, ever the right choice.
>
> It feels like we shove all the dark stuff we want to put under the rug
> there.
>
> Reading the man page is genuinely frightening. There's stuff about VMAs _I
> wasn't aware of_.
>
> It's also never really the _right time_ to do it - it's not process
> inception is it? It's when the process has started, now you suddenly fiddle
> with it.
>
> Then relying on mm flags being propagated over fork/exec is just, it's a
> hack really.
>
>>
>> I more than dislike flags2... I hate it.
>
> Yeah, to be clear - I will NACK any series that tries to add flags2 unless
> a VERY VERY good justification is given. It's horrid. And frankly this
> feature doesn't warrant something as horrible.
>
> But making mm->flags 64-bit on 32-bit kernels (which are in effect
> deprecated in my view) would fix this.
>
>>
>> but no prctl, no cgroups, no bpf.. what is left? A new policy groups
>> thing? No, not that either, please.
BPF might be OK, as long as we provide the right functions for BPF to manipulate
system, process, MM, and VMA level knobs. My only objection to Yafang's patch[1] is
that the patch adds a VMA parameter to the global hugepage checking functions.
My take on the BPF approach is that it does not add new APIs, so we can change it
at any time, assuming people are willing to accept that the functions instrumented
by BPF can go away at any time and that the corresponding BPF programs are not
guaranteed to keep working. It allows us to explore various huge page policies
without the burden of maintaining APIs. Eventually, huge page policies become
transparent after we learn enough.
[1] https://lore.kernel.org/linux-mm/20250429024139.34365-1-laoar.shao@gmail.com/
--
Best Regards,
Yan, Zi
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 18:21 ` Lorenzo Stoakes
2025-05-15 18:42 ` Zi Yan
@ 2025-05-15 18:46 ` Usama Arif
1 sibling, 0 replies; 51+ messages in thread
From: Usama Arif @ 2025-05-15 18:46 UTC (permalink / raw)
To: Lorenzo Stoakes, Liam R. Howlett, David Hildenbrand,
Andrew Morton, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, npache, ryan.roberts, linux-kernel,
linux-doc, kernel-team
On 15/05/2025 19:21, Lorenzo Stoakes wrote:
> On Thu, May 15, 2025 at 02:09:56PM -0400, Liam R. Howlett wrote:
>>
>> I more than dislike flags2... I hate it.
>
> Yeah, to be clear - I will NACK any series that tries to add flags2 unless
> a VERY VERY good justification is given. It's horrid. And frankly this
> feature doesn't warrant something as horrible.
>
> But making mm->flags 64-bit on 32-bit kernels (which are in effect
> deprecated in my view) would fix this.
>
Just for clarity, flags2 is just one way of doing this.
I had also suggested making this a 64-bit-only feature in the initial version.
And Lorenzo's suggestion of making flags 64-bit on 32-bit machines is good for me
as well.
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 18:08 ` Lorenzo Stoakes
@ 2025-05-15 19:12 ` David Hildenbrand
2025-05-15 20:35 ` Lorenzo Stoakes
0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-05-15 19:12 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Usama Arif, Andrew Morton, linux-mm, hannes, shakeel.butt, riel,
ziy, laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On 15.05.25 20:08, Lorenzo Stoakes wrote:
> On Thu, May 15, 2025 at 06:11:55PM +0200, David Hildenbrand wrote:
>>>>> So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise
>>>>> is to override global 'never'?
>>>>>
>>>>
>>>> Again, I am not overriding never.
>>>>
>>>> hugepage_global_always and hugepage_global_enabled will evaluate to false
>>>> and you will not get a hugepage.
>>>
>>> Yeah, again ack, but I kind of hate that we set VM_HUGEPAGE everywhere even
>>> if the policy is never.
>>
>> I think it should behave just as if someone manually does an madvise(). So
>> whatever we do here during an madvise, we should try to do the same thing
>> here.
>
> Ack I agree with this.
>
> It actually simplifies things a LOT to view it this way - we're saying 'by
> default apply madvise(...) to new VMAs'.
>
> Hm I wonder if we could have a more generic version of this...
>
> Note though that we're not _quite_ doing this.
>
> So in hugepage_madvise():
>
> int hugepage_madvise(struct vm_area_struct *vma,
> unsigned long *vm_flags, int advice)
> {
> ...
>
> switch (advice) {
> case MADV_HUGEPAGE:
> *vm_flags &= ~VM_NOHUGEPAGE;
> *vm_flags |= VM_HUGEPAGE;
>
> ...
>
> break;
>
> ...
> }
>
> ...
> }
>
> So here we're actually clearing VM_NOHUGEPAGE and overriding it, but in the
> proposed code we're not.
Yeah, I think I suggested that, but probably we should just do exactly
what madvise() does.
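The difference between the two behaviours can be modelled on plain flag words.
This is a user-space sketch only; the MODEL_ constants and function names are
hypothetical, and the first helper mirrors just the two flag operations quoted
from hugepage_madvise() above.

```c
#include <assert.h>

#define MODEL_VM_HUGEPAGE   0x1UL
#define MODEL_VM_NOHUGEPAGE 0x2UL

/* MADV_HUGEPAGE as quoted above: clear VM_NOHUGEPAGE, then set VM_HUGEPAGE. */
static unsigned long model_madv_hugepage(unsigned long vm_flags)
{
	vm_flags &= ~MODEL_VM_NOHUGEPAGE;
	vm_flags |= MODEL_VM_HUGEPAGE;
	return vm_flags;
}

/* The patch as posted: leave VM_NOHUGEPAGE VMAs alone, set the rest huge. */
static unsigned long model_patch_default_huge(unsigned long vm_flags)
{
	if (vm_flags & MODEL_VM_NOHUGEPAGE)
		return vm_flags;
	return vm_flags | MODEL_VM_HUGEPAGE;
}
```

For a VMA that starts out VM_NOHUGEPAGE, the first helper ends with only
VM_HUGEPAGE set while the second leaves it untouched; for any other VMA the
two agree.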
>
> So we're back into confusing territory again :)
>
> I wonder if we could...
>
> 1. Add an MADV_xxx that mimics the desired behaviour here.
>
> 2. Add a generic 'madvise() by default' thing at a process level?
>
> Is this crazy?
I think that's what I had in mind, just a bit twisted.
What could work is
1) prctl to set the default
2) madvise() to adjust all existing VMAs
We might have to teach 2) to ignore non-compatible VMAs / holes. Maybe
not, worth an investigation.
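For step 2), the per-range building block already exists from userspace today.
A minimal sketch, assuming the caller already knows the range it wants to
adjust; the helper name is made up, and the proposed prctl from step 1) is
deliberately not called since it is not an existing interface.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Ask for THP on one existing mapping. Returns 0 on success or a negative
 * errno value (for example -EINVAL when the kernel is built without THP).
 */
static int advise_region_huge(void *addr, size_t len)
{
#ifdef MADV_HUGEPAGE
	if (madvise(addr, len, MADV_HUGEPAGE) != 0)
		return -errno;
	return 0;
#else
	(void)addr;
	(void)len;
	return -ENOSYS;	/* platform headers do not offer MADV_HUGEPAGE */
#endif
}
```

Covering "all existing VMAs" this way means walking the address space one
range at a time, which is where the question of holes and incompatible VMAs
comes in.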
--
Cheers,
David / dhildenb
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 18:36 ` Lorenzo Stoakes
@ 2025-05-15 19:17 ` David Hildenbrand
2025-05-15 20:42 ` Lorenzo Stoakes
0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-05-15 19:17 UTC (permalink / raw)
To: Lorenzo Stoakes, Usama Arif
Cc: Andrew Morton, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On 15.05.25 20:36, Lorenzo Stoakes wrote:
> One thing I wanted to emphasise - shouldn't we invoke
> khugepaged_enter_vma() for VMAs we set VM_HUGEPAGE for?
>
> I note in __mmap_new_vma() we do so for non-anon VMAs _prior_ to invoking
> vma_set_thp_policy().
>
> I _really_ hate putting in conditional logic like this for specific
> cases. Feels very hacky.
>
> On Thu, May 15, 2025 at 05:47:34PM +0100, Usama Arif wrote:
>>
>>
>> On 15/05/2025 17:06, Lorenzo Stoakes wrote:
>>
>>>> It's doing the same as KSM as suggested by David. Does KSM break these tests?
>>>> Is there some specific test you can point to that I can run that is breaking
>>>> with this patch and not without it?
>>>
>>> They don't build, at all. I explain how you can attempt a build below.
>>>
>>
>> Ah yes, I initially thought by break you meant they were failing. I saw that,
>> will fix it.
>>
>>> And no, KSM doesn't break the tests, because steps were taken to make them
>>> not break the tests :) I mean it's really easy - it's just adding some
>>> trivial stubs.
>>>
>>
>> Yes, will do Thanks!
>
> Thanks.
>
>>
>>> If you need help with it just ping me in whatever way helps and I can help!
>>>
>>> It's understandable as it's not necessarily clear this is a thing (maybe we
>>> need self tests to build it, but that might break CI setups so unclear).
>>>
>>> The merging is much more important!
>
> To re-emphasise ^ the merging is key - I will unfortunately have to NACK
> any series that breaks it. So this is an absolute requirement here.
>
>>>
>>>>
>>>>
>>>>> I really feel this series needs to be an RFC until we can get some
>>>>> consensus on how to approach this.
>>>>
>>>> There was consensus in https://lore.kernel.org/all/97702ff0-fc50-4779-bfa8-83dc42352db1@redhat.com/
>>>
>>> I disagree with this assessment, that doesn't look like consensus at all,
>>> I think at least this is a very contentious or at least _complicated_ topic
>>> that we need to really dig into.
>>>
>>> So in my view - it's this kind of situation that warrants an RFC until
>>> there's some stabilisation and agreement on a way forward.
>>>
>>
>> Sure will change next revision to RFC, unless hopefully maybe we can
>> get a consensus in this revision :)
>
> Thanks!
>
>>
>>>>
>>>>>
>>>>> On Thu, May 15, 2025 at 02:33:30PM +0100, Usama Arif wrote:
>>>>>> This is set via the new PR_SET_THP_POLICY prctl.
>>>>>
>>>>> What is?
>>>>>
>>>>> You're making very major changes here, including adding a new flag to
>>>>> mm_struct (!!) and the explanation/justification for this is missing.
>>>>>
>>>>
>>>> I have added the justification in your reply to the coverletter.
>>>
>>> As stated there, you've not explained why alternatives are unworkable, I
>>> think we need this!
>>>
>>> Sort of:
>>>
>>> 1. Why not cgroups? blah blah blah
>>> 2. Why not process_madvise()? blah blah blah
>>> 3. Why not bpf? blah blah blah
>>> 4. Why not <something I've not thought of>? blah blah blah
>>>
>>
>> I will add this in the next cover letter.
>
> Thanks
>
>>
>>
>>>>
>>>>>> This will set the MMF2_THP_VMA_DEFAULT_HUGE process flag
>>>>>> which changes the default of new VMAs to be VM_HUGEPAGE. The
>>>>>> call also modifies all existing VMAs that are not VM_NOHUGEPAGE
>>>>>> to be VM_HUGEPAGE. The policy is inherited during fork+exec.
>>>>>
>>>>> So you can only set this flag?
>>>>>
>>>>
>>>> ??
>>>
>>> This patch is only allowing the setting of this flag. I am asking 'so you
>>> can only set this flag?'
>>>
>>> To which it appears the answer is, yes I think :)
>>>
>>> An improved cover letter could say something like:
>>>
>>> "
>>> Here we implement the first flag intended to allow the _overriding_ of huge
>>> page policy to ensure that, when
>>> /sys/kernel/mm/transparent_hugepage/enabled is set to madvise, we are able
>>> to maintain fine-grained control of individual processes, including any
>>> fork/exec'd, by setting this flag.
>>>
>>> In subsequent commits, we intend to permit further such control.
>>> "
>>>
>>>>
>>>>>>
>>>>>> This allows systems where the global policy is set to "madvise"
>>>>>> to effectively have THPs always for the process. In an environment
>>>>>> where different types of workloads are stacked on the same machine,
>>>>>> this will allow workloads that benefit from always having hugepages
>>>>>> to do so, without regressing those that don't.
>>>>>
>>>>> Again, this explanation really makes no sense at all to me, I don't really
>>>>> know what you mean, you're not going into what you're doing in this change,
>>>>> this is just a very unclear commit message.
>>>>>
>>>>
>>>> I hope this is answered in my reply to your coverletter.
>>>
>>> You still need to improve the cover letter here I think, see above for a
>>> suggestion!
>>>
>>
>> Sure, will do in the next revision, Thanks!
>
> Thanks
>
>>>>
>>>>>>
>>>>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>>>>> ---
>>>>>> include/linux/huge_mm.h | 3 ++
>>>>>> include/linux/mm_types.h | 11 +++++++
>>>>>> include/uapi/linux/prctl.h | 4 +++
>>>>>> kernel/fork.c | 1 +
>>>>>> kernel/sys.c | 21 ++++++++++++
>>>>>> mm/huge_memory.c | 32 +++++++++++++++++++
>>>>>> mm/vma.c | 2 ++
>>>>>> tools/include/uapi/linux/prctl.h | 4 +++
>>>>>> .../trace/beauty/include/uapi/linux/prctl.h | 4 +++
>>>>>> 9 files changed, 82 insertions(+)
>>>>>>
>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>> index 2f190c90192d..e652ad9ddbbd 100644
>>>>>> --- a/include/linux/huge_mm.h
>>>>>> +++ b/include/linux/huge_mm.h
>>>>>> @@ -260,6 +260,9 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
>>>>>> return orders;
>>>>>> }
>>>>>>
>>>>>> +void vma_set_thp_policy(struct vm_area_struct *vma);
>>>>>
>>>>> This is a VMA-specific function but you're putting it in huge_mm.h? Why
>>>>> can't this be in vma.h or vma.c?
>>>>>
>>>>
>>>> Sure can move it there.
>>>>
>>>>>> +void process_vmas_thp_default_huge(struct mm_struct *mm);
>>>>>
>>>>> 'vmas' is redundant here.
>>>>>
>>>>
>>>> Sure.
>>>>>> +
>>>>>> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>>>> unsigned long vm_flags,
>>>>>> unsigned long tva_flags,
>>>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>>>> index e76bade9ebb1..2fe93965e761 100644
>>>>>> --- a/include/linux/mm_types.h
>>>>>> +++ b/include/linux/mm_types.h
>>>>>> @@ -1066,6 +1066,7 @@ struct mm_struct {
>>>>>> mm_context_t context;
>>>>>>
>>>>>> unsigned long flags; /* Must use atomic bitops to access */
>>>>>> + unsigned long flags2;
>>>>>
>>>>>
>>>>> Ugh, god really??
>>>>>
>>>>> I really am not a fan of adding flags2 just to add a prctl() feature like
>>>>> this. This is crazy.
>>>>>
>>>>> Also this is a TERRIBLE name. I mean, no please PLEASE no.
>>>>>
>>>>> Do we really have absolutely no choice but to add a new flags field here?
>>>>>
>>>>> It again doesn't help that you don't mention nor even try to justify this
>>>>> in the commit message or cover letter.
>>>>>
>>>>
>>>> And again, I hope my reply to your email has given you the justification.
>>>
>>> No :) I understood why you did this though of course.
>>>
>>>>
>>>>> If this is a 32-bit kernel vs. 64-bit kernel thing so we 'ran out of bits',
>>>>> let's just go make this flags field 64-bit on 32-bit kernels.
>>>>>
>>>>> I mean - I'm kind of insisting we do that to be honest. Because I really
>>>>> don't like this.
>>>>
>>>>
>>>> If the maintainers want this, I will make it a 64 bit only feature. We
>>>> are only using it for 64 bit servers. But it will probably mean ifdef
>>>> config 64 bit in a lot of places.
>>>
>>> I'm going to presume you are including me in this category rather than
>>> implying that you are deferring only to others :)
>>>
>>
>> Yes, of course! I mean all maintainers :)
>>
>> And hopefully everyone else as well :)
>>
>>> So, there's another option:
>>>
>>> Have a prerequisite series that makes mm_struct->flags 64-bit on 32-bit
>>> kernels, which solves this problem everywhere and avoids us wasting a bunch
>>> of memory for a very specific usecase, splitting flag state across 2 fields
>>> (which are no longer atomic as a whole of course), adding confusion,
>>> possibly subtly breaking anywhere that assumes mm->flags completely
>>> describes mm-granularity flag state etc.
>>>
>>
>> This is probably a very basic question, but by making mm_struct->flags 64-bit on 32-bit
>> do you mean converting flags to unsigned long long when !CONFIG_64BIT?
>
> Yes. This would be a worthwhile project in its own right.
>
> I will be changing vma->vm_flags in the same way soon.
>
>>
>>> The RoI here is not looking good, otherwise.
>>>
>>>>
>>>>>
>>>>> Also if we _HAVE_ to have this, shouldn't we duplicate that comment about
>>>>> atomic bitops?...
>>>>>
>>>>
>>>> Sure
>>>>
>>>>>>
>>>>>> #ifdef CONFIG_AIO
>>>>>> spinlock_t ioctx_lock;
>>>>>> @@ -1744,6 +1745,11 @@ enum {
>>>>>> MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
>>>>>> MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
>>>>>>
>>>>>> +#define MMF2_THP_VMA_DEFAULT_HUGE 0
>>>>>
>>>>> I thought the whole idea was to move away from explicitly referencing 'THP'
>>>>> in a future where large folios are implicit and now we're saying 'THP'.
>>>>>
>>>>> Anyway the 'VMA' is totally redundant here.
>>>>>
>>>>
>>>> Sure, I can remove VMA.
>>>> I see THP everywhere in the kernel code.
>>>> It's mentioned 108 times in transhuge.rst alone :)
>>>> If you have any suggestion to rename this flag, happy to take it :)
>>>
>>> Yeah I mean it's a mess man, and it's not your fault... Again naming is
>>> hard, I put a suggestion in reply to cover letter anyway...
>>>
>>>>
>>>>>> +#define MMF2_THP_VMA_DEFAULT_HUGE_MASK (1 << MMF2_THP_VMA_DEFAULT_HUGE)
>>>>>
>>>>> Do we really need explicit trivial mask declarations like this?
>>>>>
>>>>
>>>> I have followed the convention that has existed in this file, please see below
>>>> links :)
>>>> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1645
>>>> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1623
>>>> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1603
>>>> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1582
>>>
>>> Ack, yuck but ack.
>>>
>>>>
>>>>
>>>>>> +
>>>>>> +#define MMF2_INIT_MASK (MMF2_THP_VMA_DEFAULT_HUGE_MASK)
>>>>>
>>>>>> +
>>>>>> static inline unsigned long mmf_init_flags(unsigned long flags)
>>>>>> {
>>>>>> if (flags & (1UL << MMF_HAS_MDWE_NO_INHERIT))
>>>>>> @@ -1752,4 +1758,9 @@ static inline unsigned long mmf_init_flags(unsigned long flags)
>>>>>> return flags & MMF_INIT_MASK;
>>>>>> }
>>>>>>
>>>>>> +static inline unsigned long mmf2_init_flags(unsigned long flags)
>>>>>> +{
>>>>>> + return flags & MMF2_INIT_MASK;
>>>>>> +}
>>>>>> +
>>>>>> #endif /* _LINUX_MM_TYPES_H */
>>>>>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>>>>>> index 15c18ef4eb11..325c72f40a93 100644
>>>>>> --- a/include/uapi/linux/prctl.h
>>>>>> +++ b/include/uapi/linux/prctl.h
>>>>>> @@ -364,4 +364,8 @@ struct prctl_mm_map {
>>>>>> # define PR_TIMER_CREATE_RESTORE_IDS_ON 1
>>>>>> # define PR_TIMER_CREATE_RESTORE_IDS_GET 2
>>>>>>
>>>>>> +#define PR_SET_THP_POLICY 78
>>>>>> +#define PR_GET_THP_POLICY 79
>>>>>> +#define PR_THP_POLICY_DEFAULT_HUGE 0
>>>>>> +
>>>>>> #endif /* _LINUX_PRCTL_H */
>>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>>>> index 9e4616dacd82..6e5f4a8869dc 100644
>>>>>> --- a/kernel/fork.c
>>>>>> +++ b/kernel/fork.c
>>>>>> @@ -1054,6 +1054,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>>>>>>
>>>>>> if (current->mm) {
>>>>>> mm->flags = mmf_init_flags(current->mm->flags);
>>>>>> + mm->flags2 = mmf2_init_flags(current->mm->flags2);
>>>>>> mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
>>>>>> } else {
>>>>>> mm->flags = default_dump_filter;
>>>>>> diff --git a/kernel/sys.c b/kernel/sys.c
>>>>>> index c434968e9f5d..1115f258f253 100644
>>>>>> --- a/kernel/sys.c
>>>>>> +++ b/kernel/sys.c
>>>>>> @@ -2658,6 +2658,27 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>>>>>> clear_bit(MMF_DISABLE_THP, &me->mm->flags);
>>>>>> mmap_write_unlock(me->mm);
>>>>>> break;
>>>>>> + case PR_GET_THP_POLICY:
>>>>>> + if (arg2 || arg3 || arg4 || arg5)
>>>>>> + return -EINVAL;
>>>>>> + if (!!test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2))
>>>>>
>>>>> I really don't think we need the !!? Do we?
>>>>
>>>> I have followed the convention that has existed in this file already,
>>>> please see:
>>>> https://elixir.bootlin.com/linux/v6.14.6/source/kernel/sys.c#L2644
>>>
>>> OK, but please don't, I don't see why this is necessary. if (truthy) is
>>> fine.
>>>
>>> Unless somebody has a really good reason why this is necessary, it's just
>>> ugly ceremony.
>>>
>>
>> Agreed :)
>
> Thanks
>
>>
>>>>
>>>>>
>>>>> Shouldn't we lock the mm when we do this no? Can't somebody change this?
>>>>>
>>>>
>>>> It wasn't locked in PR_GET_THP_DISABLE
>>>> https://elixir.bootlin.com/linux/v6.14.6/source/kernel/sys.c#L2644
>>>>
>>>> I can acquire mmap_write_lock_killable, the same as PR_SET_THP_POLICY,
>>>> in the next series.
>>>>
>>>> I can also add the lock in PR_GET_THP_DISABLE.
>>>
>>> Well, the issue I guess is... if the flags field is atomic, and we know
>>> over this call maybe we can rely on mm sticking around, then we probably
>>> don't need an mmap lock actually.
>>>
>>>>
>>>>>> + error = PR_THP_POLICY_DEFAULT_HUGE;
>>>
>>> Wait, error = PR_THP_POLICY_DEFAULT_HUGE? Is this the convention for
>>> returning here? :)
>>
>> I see a few of the PR_GET_.. setting the return value. I hope I didn't
>> misinterpret that.
>
> Yeah I thought it might be the case. I reemphasise my dislike of prctl().
>
>>
>>>
>>>>>> + break;
>>>>>> + case PR_SET_THP_POLICY:
>>>>>> + if (arg3 || arg4 || arg5)
>>>>>> + return -EINVAL;
>>>>>> + if (mmap_write_lock_killable(me->mm))
>>>>>> + return -EINTR;
>>>>>> + switch (arg2) {
>>>>>> + case PR_THP_POLICY_DEFAULT_HUGE:
>>>>>> + set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
>>>>>> + process_vmas_thp_default_huge(me->mm);
>>>>>> + break;
>>>>>> + default:
>>>>>> + return -EINVAL;
>>>
>>> Oh I just noticed - this is really broken - you're not unlocking the mmap()
>>> here on error... :) you definitely need to fix this.
>>>
>>
>> Ah yes, will do Thanks!
>
> Thanks
>
>>
>>>>>> + }
>>>>>> + mmap_write_unlock(me->mm);
>>>>>> + break;
>>>>>> case PR_MPX_ENABLE_MANAGEMENT:
>>>>>> case PR_MPX_DISABLE_MANAGEMENT:
>>>>>> /* No longer implemented: */
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index 2780a12b25f0..64f66d5295e8 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -98,6 +98,38 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>>>>>> return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
>>>>>> }
>>>>>>
>>>>>> +void vma_set_thp_policy(struct vm_area_struct *vma)
>>>>>> +{
>>>>>> + struct mm_struct *mm = vma->vm_mm;
>>>>>> +
>>>>>> + if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
>>>>>> + vm_flags_set(vma, VM_HUGEPAGE);
>>>>>> +}
>>>>>> +
>>>>>> +static void vmas_thp_default_huge(struct mm_struct *mm)
>>>>>> +{
>>>>>> + struct vm_area_struct *vma;
>>>>>> + unsigned long vm_flags;
>>>>>> +
>>>>>> + VMA_ITERATOR(vmi, mm, 0);
>>>>>
>>>>> This is a declaration, it should be grouped with declarations...
>>>>>
>>>>
>>>> Sure, will make the change in next version.
>>>>
>>>> Unfortunately checkpatch didn't complain.
>>>
>>> Checkpatch actually complains the other way :P it doesn't understand
>>> macros.
>>>
>>> So you'll start getting a warning here, which you can ignore. It sucks, but
>>> there we go. Making checkpatch.pl understand that would be a pain, probs.
>>>
>>>>
>>>>>> + for_each_vma(vmi, vma) {
>>>>>> + vm_flags = vma->vm_flags;
>>>>>> + if (vm_flags & VM_NOHUGEPAGE)
>>>>>> + continue;
>>>>>
>>>>> Literally no point in you putting vm_flags as a separate variable here.
>>>>>
>>>>
>>>> Sure, will make the change in next version.
>>>
>>> Thanks!
>>>
>>>>
>>>>> So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise
>>>>> is to override global 'never'?
>>>>>
>>>>
>>>> Again, I am not overriding never.
>>>>
>>>> hugepage_global_always and hugepage_global_enabled will evaluate to false
>>>> and you will not get a hugepage.
>>>
>>> Yeah, again ack, but I kind of hate that we set VM_HUGEPAGE everywhere even
>>> if the policy is never.
>>>
>>> And we now get into realms of:
>>>
>>> 'Hey I set prctl() to make everything huge pages, and PR_GET_THP_POLICY
>>> says I've set that, but nothing is huge? BUG???'
>>>
>>> Of course then you get into - if somebody sets it to never, do we go around
>>> and remove VM_HUGEPAGE and this MMF_ flag?
>>>
>>>>
>>>>
>>>>> I'm really concerned about this.
>>>>>
>>>>>> + vm_flags_set(vma, VM_HUGEPAGE);
>>>>>> + }
>>>>>> +}
>>>>>
>>>>> Do we have an mmap write lock established here? Can you confirm that? Also
>>>>> you should add an assert for that here.
>>>>>
>>>>
>>>> Yes I do, it's only called in PR_SET_THP_POLICY where mmap_write lock was taken.
>>>> I can add an assert if it helps.
>>>
>>> It not only helps, it's utterly critical :)
>>>
>>> 'It's only called in xxx()' is famous last words for a programmer, because
>>> later somebody (maybe even your good self) calls it from somewhere else
>>> and... we've all been there...
>>>
>>
>> Thanks! Will do.
>
> Thanks.
>
>>>>
>>>>>> +
>>>>>> +void process_vmas_thp_default_huge(struct mm_struct *mm)
>>>>>> +{
>>>>>> + if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2))
>>>>>> + return;
>>>>>> +
>>>>>> + set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2);
>>>>>> + vmas_thp_default_huge(mm);
>>>>>> +}
>>>>>> +
>>>>>> +
>>>>>> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>>>> unsigned long vm_flags,
>>>>>> unsigned long tva_flags,
>>>>>> diff --git a/mm/vma.c b/mm/vma.c
>>>>>> index 1f2634b29568..101b19c96803 100644
>>>>>> --- a/mm/vma.c
>>>>>> +++ b/mm/vma.c
>>>>>> @@ -2476,6 +2476,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
>>>>>> if (!vma_is_anonymous(vma))
>>>>>> khugepaged_enter_vma(vma, map->flags);
>>>>>> ksm_add_vma(vma);
>>>>>> + vma_set_thp_policy(vma);
>>>>>
>>>>> You're breaking VMA merging completely by doing this here...
>>>>>
>>>>> Now I can map one VMA with this policy set, then map another immediately
>>>>> next to it and - oops - no merge, ever, because the VM_HUGEPAGE flag is not
>>>>> set in the new VMA on merge attempt.
>>>>>
>>>>> I realise KSM is just as broken (grr) but this doesn't justify us
>>>>> completely breaking VMA merging here.
>>>>
>>>> I think this answers it. It's doing the same as KSM.
>>>
>>> Yes, but as I said there, it's not acceptable, at all.
>>>
>>> You're making it so literally VMA merging _does not happen at all_. That's
>>> unacceptable and might even break some workloads.
>>>
>>> You'll certainly cause very big kernel metadata usage.
>>>
>>> Consider:
>>>
>>> |-----------------------------|..................|
>>> | some VMA flags, VM_HUGEPAGE | proposed new VMA |
>>> |-----------------------------|..................|
>>>
>>> Now, because you set VM_HUGEPAGE _after any merge is attempted_, this will
>>> _always_ be fragmented, forever.
>>>
>>
>> So if __mmap_new_vma and do_brk_flags are called after merge attempt,
>> is it possible to vma_set_thp_policy (or do something similar) before
>> the merge attempt?
>>
>> Actually I just read your reply to the next block, so I think it's OK?
>> Added more to the next block.
>>
>> I don't have any preference on where it's put, so happy with putting this
>> earlier.
>
> Yeah, you can just do it earlier. But you maybe should just set the flag in
> the appropriate field rather than using the set flags helper.
>
>>
>>
>>> That's just not... acceptable.
>>>
>>> The fact KSM is broken this way doesn't make that OK.
>>>
>>> Especially on brk(), which now will _always_ allocate new VMAs for every
>>> brk() expansion which doesn't seem very efficient.
>>>
>>> It may also majorly degrade performance.
>>>
>>> That makes me think we need some perf testing for this ideally...
>>>
>>>>
>>>>>
>>>>> You need to set earlier than this. Then of course a driver might decide to
>>>>> override this, so maybe then we need to override that.
>>>>>
>>>>> But then we're getting into realms of changing fundamental VMA code _just
>>>>> for this feature_.
>>>>>
>>>>> Again I'm iffy about this. Very.
>>>>>
>>>>> Also you've broken the VMA userland tests here:
>>>>>
>>>>> $ cd tools/testing/vma
>>>>> $ make
>>>>> ...
>>>>> In file included from vma.c:33:
>>>>> ../../../mm/vma.c: In function ‘__mmap_new_vma’:
>>>>> ../../../mm/vma.c:2486:9: error: implicit declaration of function ‘vma_set_thp_policy’; did you mean ‘vma_dup_policy’? [-Wimplicit-function-declaration]
>>>>> 2486 | vma_set_thp_policy(vma);
>>>>> | ^~~~~~~~~~~~~~~~~~
>>>>> | vma_dup_policy
>>>>> make: *** [<builtin>: vma.o] Error 1
>>>>>
>>>>> You need to create stubs accordingly.
>>>>>
>>>>
>>>> Thanks will do.
>>>
>>> Thanks!
>>>
>>>>
>>>>>> *vmap = vma;
>>>>>> return 0;
>>>>>>
>>>>>> @@ -2705,6 +2706,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
>>>>>> mm->map_count++;
>>>>>> validate_mm(mm);
>>>>>> ksm_add_vma(vma);
>>>>>> + vma_set_thp_policy(vma);
>>>>>
>>>>> You're breaking merging again... This is quite a bad case too as now you'll
>>>>> have totally fragmented brk VMAs no?
>>>>>
>>>>
>>>> Again doing it the same as KSM.
>>>
>>> That doesn't make it ok. Just because KSM is broken doesn't make this ok. I
>>> mean grr at KSM :) I'm going to look into that and see about
>>> investigating/fixing that behaviour.
>>>
>>> obviously I can't accept anything that will fundamentally break VMA
>>> merging.
>>>
>>
>> Of course!
>>
>>> The answer really is to do this earlier, but you risk a driver overriding
>>> it, but that's OK I think (I don't even think any in-tree ones do actually
>>> _anywhere_ - and yes I was literally reading through _every single_ .mmap()
>>> callback lately because I am quite obviously insane ;)
>>>
>>> Again I can help with this.
>>>
>>
>> Appreciate it!
>>
>> I am actually not familiar with the merge code. I will try and have a look,
>> but if you could give a pointer to the file:line after which it's not acceptable
>> to have it, I can move vma_set_thp_policy to before that or try to do something
>> similar.
>
> Ack.
>
> I wrote the latest merge and mmap() code so am well placed on this :>)
>
> But I don't think we should use vma_set_thp_policy() in these places, we
> should just set the flag, to avoid trying to do a write lock etc. etc.,
> plus we want to set the flag in a place that's not a VMA yet in both cases.
>
> So we'd need something like in do_mmap():
>
> + vm_flags |= mm_implied_vma_flags(mm);
> addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
>
> Where mm_implied_vma_flags() reads the MMF flags and sees if any imply VMA
> flags.
>
> But we have something for that already don't we? mm->def_flags.
>
> Can't we use that actually? That should work for mmap too?
Yeah, look at
commit 1860033237d4be09c5d7382585f0c7229367a534
Author: Michal Hocko <mhocko@suse.com>
Date: Mon Jul 10 15:48:02 2017 -0700
mm: make PR_SET_THP_DISABLE immediately active
Where we moved away from that. As raised, I am not sure I like what we
did with PR_SET_THP_DISABLE.
And I don't want any new magical prctl like that that adds new magical
internal toggles.
OTOH, I am completely fine with a prctl that just changes the default
for new VMAs (just like applying madvise immediately afterwards). I'm
also fine with a prctl that changes all existing VMAs, but maybe just
issuing a madvise() is the better solution, to cleanly separate it.
All not too crazy and not too invasive -- piggybacking on VM_HUGEPAGE /
VM_NOHUGEPAGE.
I can live with that if it solves a use case.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 18:09 ` Liam R. Howlett
2025-05-15 18:21 ` Lorenzo Stoakes
@ 2025-05-15 19:20 ` David Hildenbrand
1 sibling, 0 replies; 51+ messages in thread
From: David Hildenbrand @ 2025-05-15 19:20 UTC (permalink / raw)
To: Liam R. Howlett, Lorenzo Stoakes, Usama Arif, Andrew Morton,
linux-mm, hannes, shakeel.butt, riel, ziy, laoar.shao,
baolin.wang, npache, ryan.roberts, linux-kernel, linux-doc,
kernel-team
On 15.05.25 20:09, Liam R. Howlett wrote:
> * David Hildenbrand <david@redhat.com> [250515 13:30]:
>>>>
>>>
>>> Did we document all this? :)
>>>
>>> It'd be good to be super explicit about these sorts of 'dependency chains'.
>>>
>>
>> Documentation/admin-guide/mm/transhuge.rst has under "Global THP controls"
>> quite some stuff about all that, yes.
>>
>> The whole document needs an overhaul, to clarify on the whole terminology,
>> make it consistent, and better explain how the pagecache behaves etc. On my
>> todo list, but I'm afraid it will be a bit of work to get it right / please
>> most people.
>
> Yes, the whole thing is making me grumpy (more than my default state).
> The more I think about it, the more I don't like the prctl approach
> either...
>
> I more than dislike flags2... I hate it.
>
> but no prctl, no cgroups, no bpf.. what is left? A new policy groups
> thing? No, not that either, please.
>
> To state the obvious, none of this is transparent.
New to the "transparent" huge page world where not that much is
"transparent"!?
It's completely in-transparent to most people how it works :D
Yeah, that's why I suggested to piggyback on VM_HUGEPAGE/VM_NOHUGEPAGE.
Something we already have and that we will probably have for a long time
... :)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 19:12 ` David Hildenbrand
@ 2025-05-15 20:35 ` Lorenzo Stoakes
2025-05-16 7:45 ` David Hildenbrand
0 siblings, 1 reply; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-05-15 20:35 UTC (permalink / raw)
To: David Hildenbrand
Cc: Usama Arif, Andrew Morton, linux-mm, hannes, shakeel.butt, riel,
ziy, laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On Thu, May 15, 2025 at 09:12:13PM +0200, David Hildenbrand wrote:
> On 15.05.25 20:08, Lorenzo Stoakes wrote:
> > On Thu, May 15, 2025 at 06:11:55PM +0200, David Hildenbrand wrote:
> > > > > > So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise
> > > > > > is to override global 'never'?
> > > > > >
> > > > >
> > > > > Again, I am not overriding never.
> > > > >
> > > > > hugepage_global_always and hugepage_global_enabled will evaluate to false
> > > > > and you will not get a hugepage.
> > > >
> > > > Yeah, again ack, but I kind of hate that we set VM_HUGEPAGE everywhere even
> > > > if the policy is never.
> > >
> > > I think it should behave just as if someone does manually an madvise(). So
> > > whatever we do here during an madvise, we should try to do the same thing
> > > here.
> >
> > Ack I agree with this.
> >
> > It actually simplifies things a LOT to view it this way - we're saying 'by
> > default apply madvise(...) to new VMAs'.
> >
> > Hm I wonder if we could have a more generic version of this...
> >
> > Note though that we're not _quite_ doing this.
> >
> > So in hugepage_madvise():
> >
> > int hugepage_madvise(struct vm_area_struct *vma,
> > unsigned long *vm_flags, int advice)
> > {
> > ...
> >
> > switch (advice) {
> > case MADV_HUGEPAGE:
> > *vm_flags &= ~VM_NOHUGEPAGE;
> > *vm_flags |= VM_HUGEPAGE;
> >
> > ...
> >
> > break;
> >
> > ...
> > }
> >
> > ...
> > }
> >
> > So here we're actually clearing VM_NOHUGEPAGE and overriding it, but in the
> > proposed code we're not.
>
> Yeah, I think I suggested that, but probably we should just do exactly what
> madvise() does.
Yes, agreed.
Usama - do you have any issue with us switching to how madvise() does it?
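For illustration, the difference between the two behaviours can be sketched in plain userspace C. This is not kernel code: the bit values are placeholders, and apply_madv_hugepage() and apply_proposed_default_huge() are names made up for this sketch. The first helper mirrors the MADV_HUGEPAGE case quoted above, the second mirrors the loop in the posted patch that skips VM_NOHUGEPAGE VMAs.

```c
#include <assert.h>

/* Placeholder bit values for illustration only; not the kernel's definitions. */
#define VM_HUGEPAGE   (1UL << 0)
#define VM_NOHUGEPAGE (1UL << 1)

/* Hypothetical helper: what MADV_HUGEPAGE does, i.e. NOHUGEPAGE is overridden. */
static inline unsigned long apply_madv_hugepage(unsigned long vm_flags)
{
	vm_flags &= ~VM_NOHUGEPAGE;
	vm_flags |= VM_HUGEPAGE;
	return vm_flags;
}

/* Hypothetical helper: what the posted patch does, i.e. NOHUGEPAGE VMAs are left alone. */
static inline unsigned long apply_proposed_default_huge(unsigned long vm_flags)
{
	if (vm_flags & VM_NOHUGEPAGE)
		return vm_flags;
	return vm_flags | VM_HUGEPAGE;
}
```

The two only disagree for a VMA that already carries VM_NOHUGEPAGE: the madvise()-style update flips it to VM_HUGEPAGE, while the patch as posted leaves it untouched.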
>
> >
> > So we're back into confusing territory again :)
> >
> > I wonder if we could...
> >
> > 1. Add an MADV_xxx that mimics the desired behaviour here.
> >
> > 2. Add a generic 'madvise() by default' thing at a process level?
> >
> > Is this crazy?
>
> I think that's what I had in mind, just a bit twisted.
>
> What could work is
>
> 1) prctl to set the default
>
> 2) madvise() to adjust all existing VMAs
>
>
> We might have to teach 2) to ignore non-compatible VMAs / holes. Maybe not,
> worth an investigation.
Yeah, I think it'd _probably_ be ok except on s390 (which can fail, and so
we'd have to be able to say - skip on error, carry on).
We'll just get an -ENOMEM at the end for the gaps (god how I hate
that). Otherwise I don't think MADV_HUGEPAGE actually is really that
restrictive.
That would simplify :)
But I still so hate using prctl()... this might be one of those cases where
we simply figure out we have no other choice.
But when you put it as simply as this maybe it's not so bad. With the
flags2 gone by fixing this stupid 32-bit limit it's less awful.
Perhaps worth seeing what an improved RFC of this series looks like with
all the various bits fixed to give an idea.
But you do then wonder if we could make this _generic_ for _any_ madvise(),
and how _that_ would look.
But perhaps that's insane because many VMAs would simply not be suited to
having certain madvise flags set hmm.
Maybe let me have a think about an improved madvise() interface along these
lines anyway in general... interesting thought experiment :)
>
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 19:17 ` David Hildenbrand
@ 2025-05-15 20:42 ` Lorenzo Stoakes
0 siblings, 0 replies; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-05-15 20:42 UTC (permalink / raw)
To: David Hildenbrand
Cc: Usama Arif, Andrew Morton, linux-mm, hannes, shakeel.butt, riel,
ziy, laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On Thu, May 15, 2025 at 09:17:28PM +0200, David Hildenbrand wrote:
> On 15.05.25 20:36, Lorenzo Stoakes wrote:
[snip]
> > But I don't think we should use vma_set_thp_policy() in these places, we
> > should just set the flag, to avoid trying to do a write lock etc. etc.,
> > plus we want to set the flag in a place that's not a VMA yet in both cases.
> >
> > So we'd need something like in do_mmap():
> >
> > + vm_flags |= mm_implied_vma_flags(mm);
> > addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
> >
> > Where mm_implied_vma_flags() reads the MMF flags and sees if any imply VMA
> > flags.
> >
> > But we have something for that already don't we? mm->def_flags.
> >
> > Can't we use that actually? That should work for mmap too?
>
> Yeah, look at
>
> commit 1860033237d4be09c5d7382585f0c7229367a534
> Author: Michal Hocko <mhocko@suse.com>
> Date: Mon Jul 10 15:48:02 2017 -0700
>
> mm: make PR_SET_THP_DISABLE immediately active
>
>
> Where we moved away from that. As raised, I am not sure I like what we did
> with PR_SET_THP_DISABLE.
>
> And I don't want any new magical prctl like that that adds new magical
> internal toggles.
>
> OTOH, I am completely fine with a prctl that just changes the default for
> new VMAs (just like applying madvise immediately afterwards). I'm also fine
> with a prctl that changes all existing VMAs, but maybe just issuing a
> madvise() is the better solution, to cleanly separate it.
>
> All not too crazy and not too invasive -- piggybacking on VM_HUGEPAGE /
> VM_NOHUGEPAGE.
>
> I can live with that if it solves a use case.
I guess you're not suggesting using an MMF_ in this way which overrides
VMAs. I think the main reason Michal wanted to not use mm->def_flags here
is because doing so doesn't immediately change existing VMAs.
But we're doing things at the VMA level anyway, so we could just set:
1. mm->def_flags accordingly (no new MMF flag needed!)
2. update existing VMAs using (possibly improved) madvise() interface
And all should work.
The get policy stuff could then just check mm->def_flags & VM_HUGEPAGE or
VM_NOHUGEPAGE and use this as state.
This might be the least egregious way of doing this...
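Roughly, as a userspace toy model rather than kernel code (struct toy_mm, the toy_* helpers, the enum and the bit values are all invented for this sketch; only the idea of keeping the state in def_flags comes from the discussion above). It assumes the default flags are OR-ed into a new mapping's flags before any merge attempt, and it reduces the merge check to a plain flags comparison:

```c
#include <assert.h>

/* Placeholder bit values; the real VM_* bits and prctl constants differ. */
#define VM_HUGEPAGE   (1UL << 0)
#define VM_NOHUGEPAGE (1UL << 1)

/* Hypothetical policy values for this sketch. */
enum toy_thp_policy { TOY_THP_SYSTEM, TOY_THP_HUGE, TOY_THP_NOHUGE };

/* Toy stand-in for struct mm_struct: only the field discussed here. */
struct toy_mm {
	unsigned long def_flags;
};

/* "Set policy": no new MMF flag, just adjust the default VMA flags. */
static inline void toy_set_thp_policy(struct toy_mm *mm, enum toy_thp_policy p)
{
	mm->def_flags &= ~(VM_HUGEPAGE | VM_NOHUGEPAGE);
	if (p == TOY_THP_HUGE)
		mm->def_flags |= VM_HUGEPAGE;
	else if (p == TOY_THP_NOHUGE)
		mm->def_flags |= VM_NOHUGEPAGE;
}

/* "Get policy": derive the state back from def_flags. */
static inline enum toy_thp_policy toy_get_thp_policy(const struct toy_mm *mm)
{
	if (mm->def_flags & VM_HUGEPAGE)
		return TOY_THP_HUGE;
	if (mm->def_flags & VM_NOHUGEPAGE)
		return TOY_THP_NOHUGE;
	return TOY_THP_SYSTEM;
}

/* Flags a new mapping would carry into the merge attempt. */
static inline unsigned long toy_new_vma_flags(const struct toy_mm *mm,
					      unsigned long vm_flags)
{
	return vm_flags | mm->def_flags;
}

/* Simplified merge condition: neighbouring VMAs need identical flags. */
static inline int toy_flags_mergeable(unsigned long a, unsigned long b)
{
	return a == b;
}
```

Because the default is already part of the flags when the merge is attempted, two adjacent mappings created under the same policy compare equal, which is the property the "set VM_HUGEPAGE after the merge" placement loses.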
Maybe then I could hold my nose and possibly live with truly the most Evil
Interface in the History of Computing (TM), prctl() ;)
>
> --
> Cheers,
>
> David / dhildenb
>
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 18:42 ` Zi Yan
@ 2025-05-15 21:04 ` Lorenzo Stoakes
0 siblings, 0 replies; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-05-15 21:04 UTC (permalink / raw)
To: Zi Yan
Cc: Liam R. Howlett, David Hildenbrand, Usama Arif, Andrew Morton,
linux-mm, hannes, shakeel.butt, riel, laoar.shao, baolin.wang,
npache, ryan.roberts, linux-kernel, linux-doc, kernel-team
On Thu, May 15, 2025 at 02:42:02PM -0400, Zi Yan wrote:
> On 15 May 2025, at 14:21, Lorenzo Stoakes wrote:
>
> > On Thu, May 15, 2025 at 02:09:56PM -0400, Liam R. Howlett wrote:
> >> * David Hildenbrand <david@redhat.com> [250515 13:30]:
> >>>>>
> >>>>
> >>>> Did we document all this? :)
> >>>>
> >>>> It'd be good to be super explicit about these sorts of 'dependency chains'.
> >>>>
> >>>
> >>> Documentation/admin-guide/mm/transhuge.rst has under "Global THP controls"
> >>> quite some stuff about all that, yes.
> >>>
> >>> The whole document needs an overhaul, to clarify on the whole terminology,
> >>> make it consistent, and better explain how the pagecache behaves etc. On my
> >>> todo list, but I'm afraid it will be a bit of work to get it right / please
> >>> most people.
> >>
> >> Yes, the whole thing is making me grumpy (more than my default state).
> >> The more I think about it, the more I don't like the prctl approach
> >> either...
> >
> > prctl() feels like it's literally never, ever the right choice.
> >
> > It feels like we shove all the dark stuff we want to put under the rug
> > there.
> >
> > Reading the man page is genuinely frightening. There's stuff about VMAs _I
> > wasn't aware of_.
> >
> > It's also never really the _right time_ to do it - it's not process
> > inception is it? It's when the process has started, now you suddenly fiddle
> > with it.
> >
> > Then relying on mm flags being propagated over fork/exec is just, it's a
> > hack really.
> >
> >>
> >> I more than dislike flags2... I hate it.
> >
> > Yeah, to be clear - I will NACK any series that tries to add flags2 unless
> > a VERY VERY good justification is given. It's horrid. And frankly this
> > feature doesn't warrant something as horrible.
> >
> > But making mm->flags 64-bit on 32-bit kernels (which are in effect
> > deprecated in my view) would fix this.
> >
> >>
> >> but no prctl, no cgroups, no bpf.. what is left? A new policy groups
> >> thing? No, not that either, please.
>
> BPF might be OK, as long as we provide right functions for BPF to manipulate
> system, process, MM, VMA level knobs. My only objection to Yafang's patch[1] is
> that the patch adds a VMA parameter to the global hugepage checking functions.
Yeah, that was a good point to raise :)
>
> My take on BPF approach is that it does not add new APIs, so we can change it
> at any time, assuming people are willing to accept that the functions instrumented
> by BPF can go away at any time and the corresponding BPF programs will not work
> forever. It allows us to explore various huge page policies without the burden
> of maintaining APIs. Eventually, huge page policies become transparent after
> we learn enough.
Yeah, to be honest I am quite worried about the consequences of letting BPF
infiltrate this far into THP. We do want to get to a future where THP is an
automated thing that people don't have to think about, and this feels like it
might put us in a position where we make that impossible?
It's interesting but I think needs really careful analysis rather than 'bpf all
the things'...
But we do have these awkward 'not really sure how to do this' scenarios that
fall between the gaps like the one here.
And I guess prctl() ends up being the catch-all because it saves us having to
create a new system call, etc.
>
> [1] https://lore.kernel.org/linux-mm/20250429024139.34365-1-laoar.shao@gmail.com/
>
>
>
> --
> Best Regards,
> Yan, Zi
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 13:33 ` [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process Usama Arif
2025-05-15 14:40 ` Lorenzo Stoakes
@ 2025-05-16 6:12 ` kernel test robot
1 sibling, 0 replies; 51+ messages in thread
From: kernel test robot @ 2025-05-16 6:12 UTC (permalink / raw)
To: Usama Arif, Andrew Morton, david
Cc: oe-kbuild-all, Linux Memory Management List, hannes, shakeel.butt,
riel, ziy, laoar.shao, baolin.wang, lorenzo.stoakes, Liam.Howlett,
npache, ryan.roberts, linux-kernel, linux-doc, kernel-team,
Usama Arif
Hi Usama,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on perf-tools-next/perf-tools-next tip/perf/core perf-tools/perf-tools linus/master v6.15-rc6]
[cannot apply to acme/perf/core next-20250515]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/prctl-introduce-PR_THP_POLICY_DEFAULT_HUGE-for-the-process/20250515-213850
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250515133519.2779639-2-usamaarif642%40gmail.com
patch subject: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
config: sparc-allnoconfig (https://download.01.org/0day-ci/archive/20250516/202505161340.aOL3UHxo-lkp@intel.com/config)
compiler: sparc-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250516/202505161340.aOL3UHxo-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505161340.aOL3UHxo-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/vma.c: In function '__mmap_new_vma':
>> mm/vma.c:2486:9: error: implicit declaration of function 'vma_set_thp_policy'; did you mean 'vma_dup_policy'? [-Wimplicit-function-declaration]
2486 | vma_set_thp_policy(vma);
| ^~~~~~~~~~~~~~~~~~
| vma_dup_policy
vim +2486 mm/vma.c
2472
2473 /* Lock the VMA since it is modified after insertion into VMA tree */
2474 vma_start_write(vma);
2475 vma_iter_store_new(vmi, vma);
2476 map->mm->map_count++;
2477 vma_link_file(vma);
2478
2479 /*
2480 * vma_merge_new_range() calls khugepaged_enter_vma() too, the below
2481 * call covers the non-merge case.
2482 */
2483 if (!vma_is_anonymous(vma))
2484 khugepaged_enter_vma(vma, map->flags);
2485 ksm_add_vma(vma);
> 2486 vma_set_thp_policy(vma);
2487 *vmap = vma;
2488 return 0;
2489
2490 free_iter_vma:
2491 vma_iter_free(vmi);
2492 free_vma:
2493 vm_area_free(vma);
2494 return error;
2495 }
2496
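A guess at the cause, not confirmed anywhere in this report: allnoconfig builds have CONFIG_TRANSPARENT_HUGEPAGE disabled, so if vma_set_thp_policy() is only declared when that option is set, the caller in mm/vma.c sees no prototype. The usual pattern for this is a no-op static inline stub for the disabled case. A userspace toy of that pattern follows; TOY_CONFIG_THP and the toy_* names are hypothetical and only stand in for the real config option and helper.

```c
#include <assert.h>

struct toy_vma { unsigned long vm_flags; };

/* Build with -DTOY_CONFIG_THP to model a THP-enabled configuration. */
#ifdef TOY_CONFIG_THP
void toy_vma_set_thp_policy(struct toy_vma *vma)
{
	vma->vm_flags |= 1UL;	/* pretend bit 0 is the huge page hint */
}
#else
/* Stub: callers compile unchanged when the feature is configured out. */
static inline void toy_vma_set_thp_policy(struct toy_vma *vma)
{
	(void)vma;
}
#endif

/* Stand-in for the caller; it needs no #ifdef of its own. */
int toy_mmap_new_vma(struct toy_vma *vma)
{
	toy_vma_set_thp_policy(vma);
	return 0;
}
```

With the stub in place the call site stays identical in both configurations, which is why this shape is generally preferred over wrapping each caller in #ifdef.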
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-15 20:35 ` Lorenzo Stoakes
@ 2025-05-16 7:45 ` David Hildenbrand
2025-05-16 10:57 ` Lorenzo Stoakes
0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-05-16 7:45 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Usama Arif, Andrew Morton, linux-mm, hannes, shakeel.butt, riel,
ziy, laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On 15.05.25 22:35, Lorenzo Stoakes wrote:
> On Thu, May 15, 2025 at 09:12:13PM +0200, David Hildenbrand wrote:
>> On 15.05.25 20:08, Lorenzo Stoakes wrote:
>>> On Thu, May 15, 2025 at 06:11:55PM +0200, David Hildenbrand wrote:
>>>>>>> So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise
>>>>>>> is to override global 'never'?
>>>>>>>
>>>>>>
>>>>>> Again, I am not overriding never.
>>>>>>
>>>>>> hugepage_global_always and hugepage_global_enabled will evaluate to false
>>>>>> and you will not get a hugepage.
>>>>>
>>>>> Yeah, again ack, but I kind of hate that we set VM_HUGEPAGE everywhere even
>>>>> if the policy is never.
>>>>
>>>> I think it should behave just as if someone does manually an madvise(). So
>>>> whatever we do here during an madvise, we should try to do the same thing
>>>> here.
>>>
>>> Ack I agree with this.
>>>
>>> It actually simplifies things a LOT to view it this way - we're saying 'by
>>> default apply madvise(...) to new VMAs'.
>>>
>>> Hm I wonder if we could have a more generic version of this...
>>>
>>> Note though that we're not _quite_ doing this.
>>>
>>> So in hugepage_madvise():
>>>
>>> int hugepage_madvise(struct vm_area_struct *vma,
>>> unsigned long *vm_flags, int advice)
>>> {
>>> ...
>>>
>>> switch (advice) {
>>> case MADV_HUGEPAGE:
>>> *vm_flags &= ~VM_NOHUGEPAGE;
>>> *vm_flags |= VM_HUGEPAGE;
>>>
>>> ...
>>>
>>> break;
>>>
>>> ...
>>> }
>>>
>>> ...
>>> }
>>>
>>> So here we're actually clearing VM_NOHUGEPAGE and overriding it, but in the
>>> proposed code we're not.
>>
>> Yeah, I think I suggested that, but probably we should just do exactly what
>> madvise() does.
>
> Yes, agreed.
>
> Usama - do you have any issue with us switching to how madvise() does it?
>
>>
>>>
>>> So we're back into confusing territory again :)
>>>
>>> I wonder if we could...
>>>
>>> 1. Add an MADV_xxx that mimics the desired behaviour here.
>>>
>>> 2. Add a generic 'madvise() by default' thing at a process level?
>>>
>>> Is this crazy?
>>
>> I think that's what I had in mind, just a bit twisted.
>>
>> What could work is
>>
>> 1) prctl to set the default
>>
>> 2) madvise() to adjust all existing VMAs
>>
>>
>> We might have to teach 2) to ignore non-compatible VMAs / holes. Maybe not,
>> worth an investigation.
>
> Yeah, I think it'd _probably_ be ok except on s390 (which can fail, and so
> we'd have to be able to say - skip on error, carry on).
>
> We'll just get an -ENOMEM at the end for the gaps (god how I hate
> that). Otherwise I don't think MADV_HUGEPAGE actually is really that
> restrictive.
>
> That would simplify :)
>
> But I still so hate using prctl()... this might be one of those cases where
> we simply figure out we have no other choice.
>
> But when you put it as simply as this maybe it's not so bad. With the
> flags2 gone by fixing this stupid 32-bit limit it's less awful.
>
> Perhaps worth seeing what an improved RFC of this series looks like with
> all the various bits fixed to give an idea.
Yes.
>
> But you do then wonder if we could make this _generic_ for _any_ madvise(),
> and how _that_ would look.
>
> But perhaps that's insane because many VMAs would simply not be suited to
> having certain madvise flags set hmm.
Same thinking. I think this is rather special.
In a perfect world not even the madvise(*HUGEPAGE) would exist.
But here we are ... 14 years (wow!) after
commit 0af4e98b6b095c74588af04872f83d333c958c32
Author: Andrea Arcangeli <aarcange@redhat.com>
Date: Thu Jan 13 15:46:55 2011 -0800
thp: madvise(MADV_HUGEPAGE)
(I'm surprised you don't complain about madvise(). IMHO, prctl() is even
a better interface than catch-all madvise(); a syscall where advice
might not be advice. I saw some funny rants about MADV_DONTNEED on
reddit at some point ... :) mctrl() would have been clearer, at least
for me :D )
--
Cheers,
David / dhildenb
* Re: [PATCH 2/6] prctl: introduce PR_THP_POLICY_DEFAULT_NOHUGE for the process
2025-05-15 13:33 ` [PATCH 2/6] prctl: introduce PR_THP_POLICY_DEFAULT_NOHUGE " Usama Arif
@ 2025-05-16 8:19 ` kernel test robot
0 siblings, 0 replies; 51+ messages in thread
From: kernel test robot @ 2025-05-16 8:19 UTC (permalink / raw)
To: Usama Arif, Andrew Morton, david
Cc: oe-kbuild-all, Linux Memory Management List, hannes, shakeel.butt,
riel, ziy, laoar.shao, baolin.wang, lorenzo.stoakes, Liam.Howlett,
npache, ryan.roberts, linux-kernel, linux-doc, kernel-team,
Usama Arif
Hi Usama,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on perf-tools-next/perf-tools-next tip/perf/core perf-tools/perf-tools linus/master v6.15-rc6]
[cannot apply to acme/perf/core next-20250515]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/prctl-introduce-PR_THP_POLICY_DEFAULT_HUGE-for-the-process/20250515-213850
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250515133519.2779639-3-usamaarif642%40gmail.com
patch subject: [PATCH 2/6] prctl: introduce PR_THP_POLICY_DEFAULT_NOHUGE for the process
config: m68k-allnoconfig (https://download.01.org/0day-ci/archive/20250516/202505161626.4OeUVh4j-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250516/202505161626.4OeUVh4j-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505161626.4OeUVh4j-lkp@intel.com/
All errors (new ones prefixed by >>):
kernel/sys.c: In function '__do_sys_prctl':
kernel/sys.c:2678:25: error: implicit declaration of function 'process_vmas_thp_default_huge' [-Wimplicit-function-declaration]
2678 | process_vmas_thp_default_huge(me->mm);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> kernel/sys.c:2683:25: error: implicit declaration of function 'process_vmas_thp_default_nohuge' [-Wimplicit-function-declaration]
2683 | process_vmas_thp_default_nohuge(me->mm);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
vim +/process_vmas_thp_default_nohuge +2683 kernel/sys.c
2472
2473 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
2474 unsigned long, arg4, unsigned long, arg5)
2475 {
2476 struct task_struct *me = current;
2477 unsigned char comm[sizeof(me->comm)];
2478 long error;
2479
2480 error = security_task_prctl(option, arg2, arg3, arg4, arg5);
2481 if (error != -ENOSYS)
2482 return error;
2483
2484 error = 0;
2485 switch (option) {
2486 case PR_SET_PDEATHSIG:
2487 if (!valid_signal(arg2)) {
2488 error = -EINVAL;
2489 break;
2490 }
2491 me->pdeath_signal = arg2;
2492 break;
2493 case PR_GET_PDEATHSIG:
2494 error = put_user(me->pdeath_signal, (int __user *)arg2);
2495 break;
2496 case PR_GET_DUMPABLE:
2497 error = get_dumpable(me->mm);
2498 break;
2499 case PR_SET_DUMPABLE:
2500 if (arg2 != SUID_DUMP_DISABLE && arg2 != SUID_DUMP_USER) {
2501 error = -EINVAL;
2502 break;
2503 }
2504 set_dumpable(me->mm, arg2);
2505 break;
2506
2507 case PR_SET_UNALIGN:
2508 error = SET_UNALIGN_CTL(me, arg2);
2509 break;
2510 case PR_GET_UNALIGN:
2511 error = GET_UNALIGN_CTL(me, arg2);
2512 break;
2513 case PR_SET_FPEMU:
2514 error = SET_FPEMU_CTL(me, arg2);
2515 break;
2516 case PR_GET_FPEMU:
2517 error = GET_FPEMU_CTL(me, arg2);
2518 break;
2519 case PR_SET_FPEXC:
2520 error = SET_FPEXC_CTL(me, arg2);
2521 break;
2522 case PR_GET_FPEXC:
2523 error = GET_FPEXC_CTL(me, arg2);
2524 break;
2525 case PR_GET_TIMING:
2526 error = PR_TIMING_STATISTICAL;
2527 break;
2528 case PR_SET_TIMING:
2529 if (arg2 != PR_TIMING_STATISTICAL)
2530 error = -EINVAL;
2531 break;
2532 case PR_SET_NAME:
2533 comm[sizeof(me->comm) - 1] = 0;
2534 if (strncpy_from_user(comm, (char __user *)arg2,
2535 sizeof(me->comm) - 1) < 0)
2536 return -EFAULT;
2537 set_task_comm(me, comm);
2538 proc_comm_connector(me);
2539 break;
2540 case PR_GET_NAME:
2541 get_task_comm(comm, me);
2542 if (copy_to_user((char __user *)arg2, comm, sizeof(comm)))
2543 return -EFAULT;
2544 break;
2545 case PR_GET_ENDIAN:
2546 error = GET_ENDIAN(me, arg2);
2547 break;
2548 case PR_SET_ENDIAN:
2549 error = SET_ENDIAN(me, arg2);
2550 break;
2551 case PR_GET_SECCOMP:
2552 error = prctl_get_seccomp();
2553 break;
2554 case PR_SET_SECCOMP:
2555 error = prctl_set_seccomp(arg2, (char __user *)arg3);
2556 break;
2557 case PR_GET_TSC:
2558 error = GET_TSC_CTL(arg2);
2559 break;
2560 case PR_SET_TSC:
2561 error = SET_TSC_CTL(arg2);
2562 break;
2563 case PR_TASK_PERF_EVENTS_DISABLE:
2564 error = perf_event_task_disable();
2565 break;
2566 case PR_TASK_PERF_EVENTS_ENABLE:
2567 error = perf_event_task_enable();
2568 break;
2569 case PR_GET_TIMERSLACK:
2570 if (current->timer_slack_ns > ULONG_MAX)
2571 error = ULONG_MAX;
2572 else
2573 error = current->timer_slack_ns;
2574 break;
2575 case PR_SET_TIMERSLACK:
2576 if (rt_or_dl_task_policy(current))
2577 break;
2578 if (arg2 <= 0)
2579 current->timer_slack_ns =
2580 current->default_timer_slack_ns;
2581 else
2582 current->timer_slack_ns = arg2;
2583 break;
2584 case PR_MCE_KILL:
2585 if (arg4 | arg5)
2586 return -EINVAL;
2587 switch (arg2) {
2588 case PR_MCE_KILL_CLEAR:
2589 if (arg3 != 0)
2590 return -EINVAL;
2591 current->flags &= ~PF_MCE_PROCESS;
2592 break;
2593 case PR_MCE_KILL_SET:
2594 current->flags |= PF_MCE_PROCESS;
2595 if (arg3 == PR_MCE_KILL_EARLY)
2596 current->flags |= PF_MCE_EARLY;
2597 else if (arg3 == PR_MCE_KILL_LATE)
2598 current->flags &= ~PF_MCE_EARLY;
2599 else if (arg3 == PR_MCE_KILL_DEFAULT)
2600 current->flags &=
2601 ~(PF_MCE_EARLY|PF_MCE_PROCESS);
2602 else
2603 return -EINVAL;
2604 break;
2605 default:
2606 return -EINVAL;
2607 }
2608 break;
2609 case PR_MCE_KILL_GET:
2610 if (arg2 | arg3 | arg4 | arg5)
2611 return -EINVAL;
2612 if (current->flags & PF_MCE_PROCESS)
2613 error = (current->flags & PF_MCE_EARLY) ?
2614 PR_MCE_KILL_EARLY : PR_MCE_KILL_LATE;
2615 else
2616 error = PR_MCE_KILL_DEFAULT;
2617 break;
2618 case PR_SET_MM:
2619 error = prctl_set_mm(arg2, arg3, arg4, arg5);
2620 break;
2621 case PR_GET_TID_ADDRESS:
2622 error = prctl_get_tid_address(me, (int __user * __user *)arg2);
2623 break;
2624 case PR_SET_CHILD_SUBREAPER:
2625 me->signal->is_child_subreaper = !!arg2;
2626 if (!arg2)
2627 break;
2628
2629 walk_process_tree(me, propagate_has_child_subreaper, NULL);
2630 break;
2631 case PR_GET_CHILD_SUBREAPER:
2632 error = put_user(me->signal->is_child_subreaper,
2633 (int __user *)arg2);
2634 break;
2635 case PR_SET_NO_NEW_PRIVS:
2636 if (arg2 != 1 || arg3 || arg4 || arg5)
2637 return -EINVAL;
2638
2639 task_set_no_new_privs(current);
2640 break;
2641 case PR_GET_NO_NEW_PRIVS:
2642 if (arg2 || arg3 || arg4 || arg5)
2643 return -EINVAL;
2644 return task_no_new_privs(current) ? 1 : 0;
2645 case PR_GET_THP_DISABLE:
2646 if (arg2 || arg3 || arg4 || arg5)
2647 return -EINVAL;
2648 error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
2649 break;
2650 case PR_SET_THP_DISABLE:
2651 if (arg3 || arg4 || arg5)
2652 return -EINVAL;
2653 if (mmap_write_lock_killable(me->mm))
2654 return -EINTR;
2655 if (arg2)
2656 set_bit(MMF_DISABLE_THP, &me->mm->flags);
2657 else
2658 clear_bit(MMF_DISABLE_THP, &me->mm->flags);
2659 mmap_write_unlock(me->mm);
2660 break;
2661 case PR_GET_THP_POLICY:
2662 if (arg2 || arg3 || arg4 || arg5)
2663 return -EINVAL;
2664 if (!!test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2))
2665 error = PR_THP_POLICY_DEFAULT_HUGE;
2666 else if (!!test_bit(MMF2_THP_VMA_DEFAULT_NOHUGE, &me->mm->flags2))
2667 error = PR_THP_POLICY_DEFAULT_NOHUGE;
2668 break;
2669 case PR_SET_THP_POLICY:
2670 if (arg3 || arg4 || arg5)
2671 return -EINVAL;
2672 if (mmap_write_lock_killable(me->mm))
2673 return -EINTR;
2674 switch (arg2) {
2675 case PR_THP_POLICY_DEFAULT_HUGE:
2676 set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
2677 clear_bit(MMF2_THP_VMA_DEFAULT_NOHUGE, &me->mm->flags2);
2678 process_vmas_thp_default_huge(me->mm);
2679 break;
2680 case PR_THP_POLICY_DEFAULT_NOHUGE:
2681 clear_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
2682 set_bit(MMF2_THP_VMA_DEFAULT_NOHUGE, &me->mm->flags2);
> 2683 process_vmas_thp_default_nohuge(me->mm);
2684 break;
2685 default:
2686 return -EINVAL;
2687 }
2688 mmap_write_unlock(me->mm);
2689 break;
2690 case PR_MPX_ENABLE_MANAGEMENT:
2691 case PR_MPX_DISABLE_MANAGEMENT:
2692 /* No longer implemented: */
2693 return -EINVAL;
2694 case PR_SET_FP_MODE:
2695 error = SET_FP_MODE(me, arg2);
2696 break;
2697 case PR_GET_FP_MODE:
2698 error = GET_FP_MODE(me);
2699 break;
2700 case PR_SVE_SET_VL:
2701 error = SVE_SET_VL(arg2);
2702 break;
2703 case PR_SVE_GET_VL:
2704 error = SVE_GET_VL();
2705 break;
2706 case PR_SME_SET_VL:
2707 error = SME_SET_VL(arg2);
2708 break;
2709 case PR_SME_GET_VL:
2710 error = SME_GET_VL();
2711 break;
2712 case PR_GET_SPECULATION_CTRL:
2713 if (arg3 || arg4 || arg5)
2714 return -EINVAL;
2715 error = arch_prctl_spec_ctrl_get(me, arg2);
2716 break;
2717 case PR_SET_SPECULATION_CTRL:
2718 if (arg4 || arg5)
2719 return -EINVAL;
2720 error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
2721 break;
2722 case PR_PAC_RESET_KEYS:
2723 if (arg3 || arg4 || arg5)
2724 return -EINVAL;
2725 error = PAC_RESET_KEYS(me, arg2);
2726 break;
2727 case PR_PAC_SET_ENABLED_KEYS:
2728 if (arg4 || arg5)
2729 return -EINVAL;
2730 error = PAC_SET_ENABLED_KEYS(me, arg2, arg3);
2731 break;
2732 case PR_PAC_GET_ENABLED_KEYS:
2733 if (arg2 || arg3 || arg4 || arg5)
2734 return -EINVAL;
2735 error = PAC_GET_ENABLED_KEYS(me);
2736 break;
2737 case PR_SET_TAGGED_ADDR_CTRL:
2738 if (arg3 || arg4 || arg5)
2739 return -EINVAL;
2740 error = SET_TAGGED_ADDR_CTRL(arg2);
2741 break;
2742 case PR_GET_TAGGED_ADDR_CTRL:
2743 if (arg2 || arg3 || arg4 || arg5)
2744 return -EINVAL;
2745 error = GET_TAGGED_ADDR_CTRL();
2746 break;
2747 case PR_SET_IO_FLUSHER:
2748 if (!capable(CAP_SYS_RESOURCE))
2749 return -EPERM;
2750
2751 if (arg3 || arg4 || arg5)
2752 return -EINVAL;
2753
2754 if (arg2 == 1)
2755 current->flags |= PR_IO_FLUSHER;
2756 else if (!arg2)
2757 current->flags &= ~PR_IO_FLUSHER;
2758 else
2759 return -EINVAL;
2760 break;
2761 case PR_GET_IO_FLUSHER:
2762 if (!capable(CAP_SYS_RESOURCE))
2763 return -EPERM;
2764
2765 if (arg2 || arg3 || arg4 || arg5)
2766 return -EINVAL;
2767
2768 error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
2769 break;
2770 case PR_SET_SYSCALL_USER_DISPATCH:
2771 error = set_syscall_user_dispatch(arg2, arg3, arg4,
2772 (char __user *) arg5);
2773 break;
2774 #ifdef CONFIG_SCHED_CORE
2775 case PR_SCHED_CORE:
2776 error = sched_core_share_pid(arg2, arg3, arg4, arg5);
2777 break;
2778 #endif
2779 case PR_SET_MDWE:
2780 error = prctl_set_mdwe(arg2, arg3, arg4, arg5);
2781 break;
2782 case PR_GET_MDWE:
2783 error = prctl_get_mdwe(arg2, arg3, arg4, arg5);
2784 break;
2785 case PR_PPC_GET_DEXCR:
2786 if (arg3 || arg4 || arg5)
2787 return -EINVAL;
2788 error = PPC_GET_DEXCR_ASPECT(me, arg2);
2789 break;
2790 case PR_PPC_SET_DEXCR:
2791 if (arg4 || arg5)
2792 return -EINVAL;
2793 error = PPC_SET_DEXCR_ASPECT(me, arg2, arg3);
2794 break;
2795 case PR_SET_VMA:
2796 error = prctl_set_vma(arg2, arg3, arg4, arg5);
2797 break;
2798 case PR_GET_AUXV:
2799 if (arg4 || arg5)
2800 return -EINVAL;
2801 error = prctl_get_auxv((void __user *)arg2, arg3);
2802 break;
2803 #ifdef CONFIG_KSM
2804 case PR_SET_MEMORY_MERGE:
2805 if (arg3 || arg4 || arg5)
2806 return -EINVAL;
2807 if (mmap_write_lock_killable(me->mm))
2808 return -EINTR;
2809
2810 if (arg2)
2811 error = ksm_enable_merge_any(me->mm);
2812 else
2813 error = ksm_disable_merge_any(me->mm);
2814 mmap_write_unlock(me->mm);
2815 break;
2816 case PR_GET_MEMORY_MERGE:
2817 if (arg2 || arg3 || arg4 || arg5)
2818 return -EINVAL;
2819
2820 error = !!test_bit(MMF_VM_MERGE_ANY, &me->mm->flags);
2821 break;
2822 #endif
2823 case PR_RISCV_V_SET_CONTROL:
2824 error = RISCV_V_SET_CONTROL(arg2);
2825 break;
2826 case PR_RISCV_V_GET_CONTROL:
2827 error = RISCV_V_GET_CONTROL();
2828 break;
2829 case PR_RISCV_SET_ICACHE_FLUSH_CTX:
2830 error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
2831 break;
2832 case PR_GET_SHADOW_STACK_STATUS:
2833 if (arg3 || arg4 || arg5)
2834 return -EINVAL;
2835 error = arch_get_shadow_stack_status(me, (unsigned long __user *) arg2);
2836 break;
2837 case PR_SET_SHADOW_STACK_STATUS:
2838 if (arg3 || arg4 || arg5)
2839 return -EINVAL;
2840 error = arch_set_shadow_stack_status(me, arg2);
2841 break;
2842 case PR_LOCK_SHADOW_STACK_STATUS:
2843 if (arg3 || arg4 || arg5)
2844 return -EINVAL;
2845 error = arch_lock_shadow_stack_status(me, arg2);
2846 break;
2847 case PR_TIMER_CREATE_RESTORE_IDS:
2848 if (arg3 || arg4 || arg5)
2849 return -EINVAL;
2850 error = posixtimer_create_prctl(arg2);
2851 break;
2852 default:
2853 trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
2854 error = -EINVAL;
2855 break;
2856 }
2857 return error;
2858 }
2859
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-16 7:45 ` David Hildenbrand
@ 2025-05-16 10:57 ` Lorenzo Stoakes
2025-05-16 11:24 ` David Hildenbrand
0 siblings, 1 reply; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-05-16 10:57 UTC (permalink / raw)
To: David Hildenbrand
Cc: Usama Arif, Andrew Morton, linux-mm, hannes, shakeel.butt, riel,
ziy, laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On Fri, May 16, 2025 at 09:45:17AM +0200, David Hildenbrand wrote:
> On 15.05.25 22:35, Lorenzo Stoakes wrote:
> > On Thu, May 15, 2025 at 09:12:13PM +0200, David Hildenbrand wrote:
> > > On 15.05.25 20:08, Lorenzo Stoakes wrote:
> > > > On Thu, May 15, 2025 at 06:11:55PM +0200, David Hildenbrand wrote:
> > > > > > > > So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise
> > > > > > > > is to override global 'never'?
> > > > > > > >
> > > > > > >
> > > > > > > Again, I am not overriding never.
> > > > > > >
> > > > > > > hugepage_global_always and hugepage_global_enabled will evaluate to false
> > > > > > > and you will not get a hugepage.
> > > > > >
> > > > > > Yeah, again ack, but I kind of hate that we set VM_HUGEPAGE everywhere even
> > > > > > if the policy is never.
> > > > >
> > > > > I think it should behave just as if someone does manually an madvise(). So
> > > > > whatever we do here during an madvise, we should try to do the same thing
> > > > > here.
> > > >
> > > > Ack I agree with this.
> > > >
> > > > It actually simplifies things a LOT to view it this way - we're saying 'by
> > > > default apply madvise(...) to new VMAs'.
> > > >
> > > > Hm I wonder if we could have a more generic version of this...
> > > >
> > > > Note though that we're not _quite_ doing this.
> > > >
> > > > So in hugepage_madvise():
> > > >
> > > > int hugepage_madvise(struct vm_area_struct *vma,
> > > > unsigned long *vm_flags, int advice)
> > > > {
> > > > ...
> > > >
> > > > switch (advice) {
> > > > case MADV_HUGEPAGE:
> > > > *vm_flags &= ~VM_NOHUGEPAGE;
> > > > *vm_flags |= VM_HUGEPAGE;
> > > >
> > > > ...
> > > >
> > > > break;
> > > >
> > > > ...
> > > > }
> > > >
> > > > ...
> > > > }
> > > >
> > > > So here we're actually clearing VM_NOHUGEPAGE and overriding it, but in the
> > > > proposed code we're not.
> > >
> > > Yeah, I think I suggested that, but probably we should just do exactly what
> > > madvise() does.
> >
> > Yes, agreed.
> >
> > Usama - do you have any issue with us switching to how madvise() does it?
> >
> > >
> > > >
> > > > So we're back into confusing territory again :)
> > > >
> > > > I wonder if we could...
> > > >
> > > > 1. Add an MADV_xxx that mimics the desired behaviour here.
> > > >
> > > > 2. Add a generic 'madvise() by default' thing at a process level?
> > > >
> > > > Is this crazy?
> > >
> > > I think that's what I had in mind, just a bit twisted.
> > >
> > > What could work is
> > >
> > > 1) prctl to set the default
> > >
> > > 2) madvise() to adjust all existing VMAs
> > >
> > >
> > > We might have to teach 2) to ignore non-compatible VMAs / holes. Maybe not,
> > > worth an investigation.
> >
> > Yeah, I think it'd _probably_ be ok except on s390 (which can fail, and so
> > we'd have to be able to say - skip on error, carry on).
> >
> > We'll just get an -ENOMEM at the end for the gaps (god how I hate
> > that). Otherwise I don't think MADV_HUGEPAGE actually is really that
> > restrictive.
> >
> > That would simplify :)
> >
> > But I still so hate using prctl()... this might be one of those cases where
> > we simply figure out we have no other choice.
> >
> > But when you put it as simply as this maybe it's not so bad. With the
> > flags2 gone by fixing this stupid 32-bit limit it's less awful.
> >
> > Perhaps worth seeing what an improved RFC of this series looks like with
> > all the various bits fixed to give an idea.
>
> Yes.
>
> >
> > But you do then wonder if we could make this _generic_ for _any_ madvise(),
> > and how _that_ would look.
> >
> > But perhaps that's insane because many VMAs would simply not be suited to
> > having certain madvise flags set hmm.
>
> Same thinking. I think this is rather special.
>
> In a perfect world not even the madvise(*HUGEPAGE) would exist.
>
> But here we are ... 14 years (wow!) after
This feels like the tale of the kernel :)
>
> commit 0af4e98b6b095c74588af04872f83d333c958c32
> Author: Andrea Arcangeli <aarcange@redhat.com>
> Date: Thu Jan 13 15:46:55 2011 -0800
>
> thp: madvise(MADV_HUGEPAGE)
>
>
>
> (I'm surprised you don't complain about madvise(). IMHO, prctl() is even a
> better interface than catch-all madvise(); a syscall where advice might
> not be advice. I saw some funny rants about MADV_DONTNEED on reddit at
> some point ... :) mctrl() would have been clearer, at least for me :D )
No I prefer madvise() massively, I mean yes in a way it's hacky, but prctl() is
the ultimate hack.
So as an interface it's actually kinda fine like 'virtual range X-Y, advise ZZZ
about it'.
(as for naming haha maybe you have a point actually, the 'advice' bit
has always been strange... :)
But.
The actual set of advice is bloody hideous and confusing and I've seen
first hand userspace people get very, very confused about what each thing
does. The naming is horrible, overloaded, overwrought.
And the weird behaviour with gaps is also horrible...
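For anyone following along, a minimal userspace sketch of the gap behaviour, assuming Linux and an anonymous private mapping. It uses MADV_DONTNEED rather than MADV_HUGEPAGE only because the effect is easy to observe; the helper name madvise_across_gap() is made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map three anonymous pages, unmap the middle one, then apply
 * MADV_DONTNEED across the whole range, hole included.
 *
 * Returns the errno reported by madvise() (0 on success), or -1 if the
 * setup itself failed. The first byte of each surviving page, read
 * after the call, is stored in *first and *last.
 */
int madvise_across_gap(unsigned char *first, unsigned char *last)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	if (pagesz <= 0)
		return -1;

	size_t pg = (size_t)pagesz;
	size_t len = 3 * pg;
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xaa, len);		/* dirty all three pages */
	if (munmap(p + pg, pg) != 0) {	/* punch the hole */
		munmap(p, len);
		return -1;
	}

	errno = 0;
	int err = madvise(p, len, MADV_DONTNEED) == 0 ? 0 : errno;

	*first = p[0];
	*last = p[2 * pg];

	munmap(p, pg);
	munmap(p + 2 * pg, pg);
	return err;
}
```

On the kernels I would expect this to run on, the call reports ENOMEM because of the hole, yet both surviving pages read back as zero, i.e. the advice was still applied on either side of the gap.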
So there's lots to moan about there, but saying madvise() is somehow worse
than the true evil of prctl() is going too far :P
I mean take a look at https://man7.org/linux/man-pages/man2/prctl.2.html
Things like:
PR_SET_MM
PR_SET_VMA
Are super worrying...
>
> --
> Cheers,
>
> David / dhildenb
>
I wonder if we just need a new syscall overall... *puts thinking cap on*.
A galaxy brained idea may be coming to me good sir :P
Watch this space...
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-16 10:57 ` Lorenzo Stoakes
@ 2025-05-16 11:24 ` David Hildenbrand
2025-05-16 12:57 ` Lorenzo Stoakes
0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-05-16 11:24 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Usama Arif, Andrew Morton, linux-mm, hannes, shakeel.butt, riel,
ziy, laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On 16.05.25 12:57, Lorenzo Stoakes wrote:
> On Fri, May 16, 2025 at 09:45:17AM +0200, David Hildenbrand wrote:
>> On 15.05.25 22:35, Lorenzo Stoakes wrote:
>>> On Thu, May 15, 2025 at 09:12:13PM +0200, David Hildenbrand wrote:
>>>> On 15.05.25 20:08, Lorenzo Stoakes wrote:
>>>>> On Thu, May 15, 2025 at 06:11:55PM +0200, David Hildenbrand wrote:
>>>>>>>>> So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise
>>>>>>>>> is to override global 'never'?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Again, I am not overriding never.
>>>>>>>>
>>>>>>>> hugepage_global_always and hugepage_global_enabled will evaluate to false
>>>>>>>> and you will not get a hugepage.
>>>>>>>
>>>>>>> Yeah, again ack, but I kind of hate that we set VM_HUGEPAGE everywhere even
>>>>>>> if the policy is never.
>>>>>>
>>>>>> I think it should behave just as if someone does manually an madvise(). So
>>>>>> whatever we do here during an madvise, we should try to do the same thing
>>>>>> here.
>>>>>
>>>>> Ack I agree with this.
>>>>>
>>>>> It actually simplifies things a LOT to view it this way - we're saying 'by
>>>>> default apply madvise(...) to new VMAs'.
>>>>>
>>>>> Hm I wonder if we could have a more generic version of this...
>>>>>
>>>>> Note though that we're not _quite_ doing this.
>>>>>
>>>>> So in hugepage_madvise():
>>>>>
>>>>> int hugepage_madvise(struct vm_area_struct *vma,
>>>>> unsigned long *vm_flags, int advice)
>>>>> {
>>>>> ...
>>>>>
>>>>> switch (advice) {
>>>>> case MADV_HUGEPAGE:
>>>>> *vm_flags &= ~VM_NOHUGEPAGE;
>>>>> *vm_flags |= VM_HUGEPAGE;
>>>>>
>>>>> ...
>>>>>
>>>>> break;
>>>>>
>>>>> ...
>>>>> }
>>>>>
>>>>> ...
>>>>> }
>>>>>
>>>>> So here we're actually clearing VM_NOHUGEPAGE and overriding it, but in the
>>>>> proposed code we're not.
>>>>
>>>> Yeah, I think I suggested that, but probably we should just do exactly what
>>>> madvise() does.
>>>
>>> Yes, agreed.
>>>
>>> Usama - do you have any issue with us switching to how madvise() does it?
>>>
>>>>
>>>>>
>>>>> So we're back into confusing territory again :)
>>>>>
>>>>> I wonder if we could...
>>>>>
>>>>> 1. Add an MADV_xxx that mimics the desired behaviour here.
>>>>>
>>>>> 2. Add a generic 'madvise() by default' thing at a process level?
>>>>>
>>>>> Is this crazy?
>>>>
>>>> I think that's what I had in mind, just a bit twisted.
>>>>
>>>> What could work is
>>>>
>>>> 1) prctl to set the default
>>>>
>>>> 2) madvise() to adjust all existing VMAs
>>>>
>>>>
>>>> We might have to teach 2) to ignore non-compatible VMAs / holes. Maybe not,
>>>> worth an investigation.
>>>
>>> Yeah, I think it'd _probably_ be ok except on s390 (which can fail, and so
>>> we'd have to be able to say - skip on error, carry on).
>>>
>>> We'll just get an -ENOMEM at the end for the gaps (god how I hate
>>> that). Otherwise I don't think MADV_HUGEPAGE actually is really that
>>> restrictive.
>>>
>>> That would simplify :)
>>>
>>> But I still so hate using prctl()... this might be one of those cases where
>>> we simply figure out we have no other choice.
>>> But when you put it as simply as this maybe it's not so bad. With the
>>> flags2 gone by fixing this stupid 32-bit limit it's less awful.
>>>
>>> Perhaps worth seeing what an improved RFC of this series looks like with
>>> all the various bits fixed to give an idea.
>>
>> Yes.
>>
>>>
>>> But you do then wonder if we could make this _generic_ for _any_ madvise(),
>>> and how _that_ would look.
>>>
>>> But perhaps that's insane because many VMAs would simply not be suited to
>>> having certain madvise flags set hmm.
>>
>> Same thinking. I think this is rather special.
>>
>> In a perfect world not even the madvise(*HUGEPAGE) would exist.
>>
>> But here we are ... 14 years (wow!) after
>
> This feels like the tale of the kernel :)
>
>>
>> commit 0af4e98b6b095c74588af04872f83d333c958c32
>> Author: Andrea Arcangeli <aarcange@redhat.com>
>> Date: Thu Jan 13 15:46:55 2011 -0800
>>
>> thp: madvise(MADV_HUGEPAGE)
>>
>>
>>
>> (I'm surprised you don't complain about madvise(). IMHO, prctl() is even a
>> better interface than catch-all madvise(); a syscall where an advise might
>> not be an advise. I saw some funny rants about MADV_DONTNEED on reddit at
>> some point ... :) mctrl() would have been clearer, at least for me :D )
>
> No I prefer madvise() massively, I mean yes in a way it's hacky, but prctl() is
> the ultimate hack.
>
> So as an interface it's actually kinda fine like 'virtual range X-Y, advise ZZZ
> about it'.
>
> (as for naming haha maybe you have a point actually, the 'advice' bit
> has always been strange... :)
>
> But.
>
> The actual set of advice is bloody hideous and confusing and I've seen
> first hand userspace people get very, very confused about what each thing
> does. The naming is horrible, overloaded, overwrought.
>
> And the weird behaviour with gaps is also horrible...
>
> So there's lots to moan about there, but saying the true evil of prctl() is
> somehow superior to madvise() is far too far :P
Haha :)
>
> I mean take a look at https://man7.org/linux/man-pages/man2/prctl.2.html
>
> Things like:
>
> PR_SET_MM
> PR_SET_VMA
>
> Are super worrying...
Don't get me wrong. I like the concept of prctl(), but not whatever
weird stuff we squeezed in there. And there is *a lot* of weird stuff in
there that probably shouldn't exist.

Similar to madvise(), where we squeezed in a lot of stuff ... but that
ship has sailed.

Looking forward to hearing what your magic thinking cap can do! :)

--
Cheers,
David / dhildenb
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-16 11:24 ` David Hildenbrand
@ 2025-05-16 12:57 ` Lorenzo Stoakes
2025-05-16 17:19 ` Usama Arif
` (2 more replies)
0 siblings, 3 replies; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-05-16 12:57 UTC (permalink / raw)
To: David Hildenbrand
Cc: Usama Arif, Andrew Morton, linux-mm, hannes, shakeel.butt, riel,
ziy, laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On Fri, May 16, 2025 at 01:24:18PM +0200, David Hildenbrand wrote:
> Looking forward to hearing what your magic thinking cap can do! :)

OK so just to say at the outset, this is purely playing around with a
theoretical idea here, so if it's crazy just let me know :))

Right now madvise() has limited utility because:

- You have little control over how the operation is done
- You get little feedback about what's actually succeeded or not
- While you can perform multiple operations at once via process_madvise(),
  even to the current process (after my changes to extend it), it's limited
  to a single advice over 8 ranges.
- You can't say 'ignore errors just try'
- You get the weird gap behaviour.

So the concept is - make everything explicit and add a new syscall that
wraps the existing madvise() stuff and addresses all the above issues.

Specifically pertinent to the case at hand - also add a 'set_default'
boolean (you'll see shortly exactly where) to also tell madvise() to make
all future VMAs default to the specified advice. We'll whitelist what we're
allowed to use here and should be able to use mm->def_flags.

So the idea is we'll use a helper struct-configured function (hey, it's me,
I <3 helper structs so of course) like:

int madvise_ranges(struct madvise_ranges_control *ctl);

With the data structures as follows (untested, etc. etc.):

enum madvise_range_type {
	MADVISE_RANGE_SINGLE,
	MADVISE_RANGE_MULTI,
	MADVISE_RANGE_ALL,
};

struct madvise_range {
	const void *addr;
	size_t size;
	int advice;
};

struct madvise_ranges {
	const struct madvise_range *arr;
	size_t count;
};

struct madvise_range_stats {
	struct madvise_range range;
	bool success;
	bool partial;
};

struct madvise_ranges_stats {
	unsigned long nr_mappings_advised;
	unsigned long nr_mappings_skipped;
	unsigned long nr_pages_advised;
	unsigned long nr_pages_skipped;
	unsigned long nr_gaps;

	/*
	 * Useful for madvise_ranges_control->ignore_errors:
	 *
	 * If non-NULL, points to an array of size equal to the number of ranges
	 * specified. Indicates the specified range, whether it succeeded, and
	 * whether that success was partial (that is, the range specified
	 * multiple mappings, only some of which had advice applied
	 * successfully).
	 *
	 * Not valid for MADVISE_RANGE_ALL.
	 */
	struct madvise_range_stats *per_range_stats;

	/* Error details. */
	int err;
	unsigned long failed_address;
	size_t offset; /* If multi, at which offset did this occur? */
};

struct madvise_ranges_control {
	int version; /* Allow future updates to API. */

	enum madvise_range_type type;

	union {
		struct madvise_range range; /* MADVISE_RANGE_SINGLE */
		struct madvise_ranges ranges; /* MADVISE_RANGE_MULTI */
		struct all { /* MADVISE_RANGE_ALL */
			int advice;
			/*
			 * If set, also have all future mappings have this applied by default.
			 *
			 * Only whitelisted advice may set this, otherwise -EINVAL will be returned.
			 */
			bool set_default;
		};
	};
	struct madvise_ranges_stats *stats; /* If non-NULL, report information about operation. */

	int pidfd; /* If is_remote set, the remote process. */

	/* Options. */
	bool is_remote :1; /* Target remote process as specified by pidfd. */
	bool ignore_errors :1; /* If error occurs applying advice, carry on to next VMA. */
	bool single_mapping_only :1; /* Error out if any range is not a single VMA. */
	bool stop_on_gap :1; /* Stop operation if input range includes unmapped memory. */
};

So the user can specify whether to apply advice to a single range,
multiple, or the whole address space, with real control over how the operation proceeds.

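To make the intended usage a bit more concrete, here is a minimal userspace
sketch of how a caller might fill in the control structure for the 'whole
address space plus default' case. Everything in it is hypothetical: no such
syscall exists, the types are a trimmed mirror of the ones above (with the
MADVISE_RANGE_ALL struct given a member name so it compiles as plain C11),
madvise_ranges_check(), advice_may_be_default() and thp_everywhere() are
made-up helper names, the version value of 1 is arbitrary, and the whitelist
of MADV_HUGEPAGE/MADV_NOHUGEPAGE is only an assumption about what might be
allowed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical userspace mirror of the proposed types (trimmed). */
enum madvise_range_type {
	MADVISE_RANGE_SINGLE,
	MADVISE_RANGE_MULTI,
	MADVISE_RANGE_ALL,
};

struct madvise_range {
	const void *addr;
	size_t size;
	int advice;
};

struct madvise_ranges {
	const struct madvise_range *arr;
	size_t count;
};

struct madvise_all {
	int advice;
	bool set_default;	/* Also apply to all future mappings. */
};

struct madvise_ranges_control {
	int version;
	enum madvise_range_type type;
	union {
		struct madvise_range range;	/* MADVISE_RANGE_SINGLE */
		struct madvise_ranges ranges;	/* MADVISE_RANGE_MULTI */
		struct madvise_all all;		/* MADVISE_RANGE_ALL */
	};
	bool ignore_errors;
	bool stop_on_gap;
};

/* Assumed whitelist: only the THP advice may become a process default. */
static bool advice_may_be_default(int advice)
{
	return advice == MADV_HUGEPAGE || advice == MADV_NOHUGEPAGE;
}

/* Model of the one rule stated above: non-whitelisted defaults get -EINVAL. */
static int madvise_ranges_check(const struct madvise_ranges_control *ctl)
{
	assert(ctl != NULL);

	if (ctl->type == MADVISE_RANGE_ALL && ctl->all.set_default &&
	    !advice_may_be_default(ctl->all.advice))
		return -EINVAL;
	return 0;
}

/* The request this series needs: THP advice everywhere, now and later. */
static struct madvise_ranges_control thp_everywhere(void)
{
	struct madvise_ranges_control ctl = {
		.version = 1,	/* Arbitrary illustrative value. */
		.type = MADVISE_RANGE_ALL,
		.all = { .advice = MADV_HUGEPAGE, .set_default = true },
		.ignore_errors = true,	/* Skip VMAs that reject the advice. */
	};

	return ctl;
}
```

A caller would build the request with thp_everywhere() and hand it to the
(hypothetical) syscall; the check function only models the whitelist rule,
not any of the actual VMA walking.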
This basically solves the problem this series tries to address while also
providing an improved madvise() API at the same time.
Thoughts? Have I finally completely lost my mind?
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-16 12:57 ` Lorenzo Stoakes
@ 2025-05-16 17:19 ` Usama Arif
2025-05-16 17:51 ` Lorenzo Stoakes
2025-05-17 16:20 ` Is number of process_madvise()-able ranges limited to 8? (was Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process) SeongJae Park
2025-05-17 19:01 ` [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process Lorenzo Stoakes
2 siblings, 1 reply; 51+ messages in thread
From: Usama Arif @ 2025-05-16 17:19 UTC (permalink / raw)
To: Lorenzo Stoakes, David Hildenbrand, Liam R. Howlett
Cc: Andrew Morton, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team
On 16/05/2025 13:57, Lorenzo Stoakes wrote:
> On Fri, May 16, 2025 at 01:24:18PM +0200, David Hildenbrand wrote:
>> Looking forward to hearing what your magic thinking cap can do! :)
>
> OK so just to say at the outset, this is purely playing around with a
> theoretical idea here, so if it's crazy just let me know :))
>
> Right now madvise() has limited utility because:
>
> - You have little control over how the operation is done
> - You get little feedback about what's actually succeeded or not
> - While you can perform multiple operations at once via process_madvise(),
> even to the current process (after my changes to extend it), it's limited
> to a single advice over 8 ranges.
> - You can't say 'ignore errors just try'
> - You get the weird gap behaviour.
>
> So the concept is - make everything explicit and add a new syscall that
> wraps the existing madvise() stuff and addresses all the above issues.
>
> Specifically pertinent to the case at hand - also add a 'set_default'
> boolean (you'll see shortly exactly where) to also tell madvise() to make
> all future VMAs default to the specified advice. We'll whitelist what we're
> allowed to use here and should be able to use mm->def_flags.
>
> So the idea is we'll use a helper struct-configured function (hey, it's me,
> I <3 helper structs so of course) like:
>
> int madvise_ranges(struct madvise_ranges_control *ctl);
>
> With the data structures as follows (untested, etc. etc.):
>
> enum madvise_range_type {
> MADVISE_RANGE_SINGLE,
> MADVISE_RANGE_MULTI,
> MADVISE_RANGE_ALL,
> };
>
> struct madvise_range {
> const void *addr;
> size_t size;
> int advice;
> };
>
> struct madvise_ranges {
> const struct madvise_range *arr;
> size_t count;
> };
>
> struct madvise_range_stats {
> struct madvise_range range;
> bool success;
> bool partial;
> };
>
> struct madvise_ranges_stats {
> unsigned long nr_mappings_advised;
> unsigned long nr_mappings_skipped;
> unsigned long nr_pages_advised;
> unsigned long nr_pages_skipped;
> unsigned long nr_gaps;
>
> /*
> * Useful for madvise_ranges_control->ignore_errors:
> *
> * If non-NULL, points to an array of size equal to the number of ranges
> * specified. Indicates the specified range, whether it succeeded, and
> * whether that success was partial (that is, the range specified
> * multiple mappings, only some of which had advice applied
> * successfully).
> *
> * Not valid for MADVISE_RANGE_ALL.
> */
> struct madvise_range_stats *per_range_stats;
>
> /* Error details. */
> int err;
> unsigned long failed_address;
> size_t offset; /* If multi, at which offset did this occur? */
> };
>
> struct madvise_ranges_control {
> int version; /* Allow future updates to API. */
>
> enum madvise_range_type type;
>
> union {
> struct madvise_range range; /* MADVISE_RANGE_SINGLE */
> struct madvise_ranges ranges; /* MADVISE_RANGE_MULTI */
> struct all { /* MADVISE_RANGE_ALL */
> int advice;
> /*
> * If set, also have all future mappings have this applied by default.
> *
> * Only whitelisted advice may set this, otherwise -EINVAL will be returned.
> */
> bool set_default;
> };
> };
> struct madvise_ranges_stats *stats; /* If non-NULL, report information about operation. */
>
> int pidfd; /* If is_remote set, the remote process. */
>
> /* Options. */
> bool is_remote :1; /* Target remote process as specified by pidfd. */
> bool ignore_errors :1; /* If error occurs applying advice, carry on to next VMA. */
> bool single_mapping_only :1; /* Error out if any range is not a single VMA. */
> bool stop_on_gap :1; /* Stop operation if input range includes unmapped memory. */
> };
>
> So the user can specify whether to apply advice to a single range,
> multiple, or the whole address space, with real control over how the operation proceeds.
>

For a single range we have madvise(), for multiple ranges we have process_madvise(),
and we can have a very very simple solution for the whole address space with prctl().

IMHO, the above is really not needed (but I might be wrong :)). It will introduce a
lot of code to solve something that can be done in a very very simple way, and it will
introduce another syscall when prctl() is designed for this. I understand that you
don't like prctl(), but it is there.

I have added below what patch 1 of 6 would look like after incorporating all your feedback.
(Thanks for all the feedback, really appreciate it!!)

Main differences from the current revision:
- no more flags2.
- no more MMF2_...
- renamed policy to PR_DEFAULT_MADV_HUGEPAGE
- mmap_write_lock_killable acquired in PR_GET_THP_POLICY
- mmap_write lock fixed in PR_SET_THP_POLICY
- check if hugepage_global_enabled is enabled in the call and account for s390
- set mm->def_flags VM_HUGEPAGE and VM_NOHUGEPAGE according to the policy in the way
  done by madvise(). I believe VMA merging will not be broken in this way, please let me
  know otherwise.
- process_default_madv_hugepage function that does for_each_vma and calls hugepage_madvise.
  (I can move it to vma.c or any other file you prefer).

Please let me know if this looks acceptable and I can send this as RFC v3 for all the
6 patches (the rest are done in a similar way to below)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..a8c3ce15a504 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -260,6 +260,8 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
return orders;
}
+void process_default_madv_hugepage(struct mm_struct *mm, int advice);
+
unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
unsigned long vm_flags,
unsigned long tva_flags,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 43748c8f3454..436f4588bce8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -466,7 +466,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB)
/* This mask defines which mm->def_flags a process can inherit its parent */
-#define VM_INIT_DEF_MASK VM_NOHUGEPAGE
+#define VM_INIT_DEF_MASK (VM_HUGEPAGE | VM_NOHUGEPAGE)
/* This mask represents all the VMA flag bits used by mlock */
#define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e76bade9ebb1..f1836b7c5704 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1703,6 +1703,7 @@ enum {
/* leave room for more dump flags */
#define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */
#define MMF_VM_HUGEPAGE 17 /* set when mm is available for khugepaged */
+#define MMF_VM_HUGEPAGE_MASK (1 << MMF_VM_HUGEPAGE)
/*
* This one-shot flag is dropped due to necessity of changing exe once again
@@ -1742,7 +1743,8 @@ enum {
#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
- MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
+ MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK |\
+ MMF_VM_HUGEPAGE_MASK)
static inline unsigned long mmf_init_flags(unsigned long flags)
{
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 15c18ef4eb11..15aaa4db5ff8 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -364,4 +364,8 @@ struct prctl_mm_map {
# define PR_TIMER_CREATE_RESTORE_IDS_ON 1
# define PR_TIMER_CREATE_RESTORE_IDS_GET 2
+#define PR_SET_THP_POLICY 78
+#define PR_GET_THP_POLICY 79
+#define PR_DEFAULT_MADV_HUGEPAGE 0
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index c434968e9f5d..4fe860b0ff25 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2658,6 +2658,44 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
clear_bit(MMF_DISABLE_THP, &me->mm->flags);
mmap_write_unlock(me->mm);
break;
+ case PR_GET_THP_POLICY:
+ if (arg2 || arg3 || arg4 || arg5)
+ return -EINVAL;
+ if (mmap_write_lock_killable(me->mm))
+ return -EINTR;
+ if (me->mm->def_flags & VM_HUGEPAGE)
+ error = PR_DEFAULT_MADV_HUGEPAGE;
+ mmap_write_unlock(me->mm);
+ break;
+ case PR_SET_THP_POLICY:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ if (mmap_write_lock_killable(me->mm))
+ return -EINTR;
+ switch (arg2) {
+ case PR_DEFAULT_MADV_HUGEPAGE:
+ if (!hugepage_global_enabled())
+ error = -EPERM;
+#ifdef CONFIG_S390
+ /*
+ * qemu blindly sets MADV_HUGEPAGE on all allocations, but s390
+ * can't handle this properly after s390_enable_sie, so we simply
+ * ignore the madvise to prevent qemu from causing a SIGSEGV.
+ */
+ else if (mm_has_pgste(vma->vm_mm))
+ error = -EPERM;
+#endif
+ else {
+ me->mm->def_flags &= ~VM_NOHUGEPAGE;
+ me->mm->def_flags |= VM_HUGEPAGE;
+ process_default_madv_hugepage(me->mm, MADV_HUGEPAGE);
+ }
+ break;
+ default:
+ error = -EINVAL;
+ }
+ mmap_write_unlock(me->mm);
+ break;
case PR_MPX_ENABLE_MANAGEMENT:
case PR_MPX_DISABLE_MANAGEMENT:
/* No longer implemented: */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2780a12b25f0..2b9a3e280ae4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -98,6 +98,18 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
}
+void process_default_madv_hugepage(struct mm_struct *mm, int advice)
+{
+ struct vm_area_struct *vma;
+ unsigned long vm_flags;
+
+ VMA_ITERATOR(vmi, mm, 0);
+ for_each_vma(vmi, vma) {
+ vm_flags = vma->vm_flags;
+ hugepage_madvise(vma, &vm_flags, advice);
+ }
+}
+
unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
unsigned long vm_flags,
unsigned long tva_flags,
> This basically solves the problem this series tries to address while also
> providing an improved madvise() API at the same time.
>
> Thoughts? Have I finally completely lost my mind?
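
For reference, the flag handling that PR_DEFAULT_MADV_HUGEPAGE is meant to
have (same as MADV_HUGEPAGE: clear VM_NOHUGEPAGE, then set VM_HUGEPAGE, for
both the process default and every existing VMA) can be modelled in plain
userspace C roughly as below. This is only a sketch under assumptions:
struct mock_mm and struct mock_vma are made-up stand-ins for the kernel
structures, the two flag values are illustrative constants defined locally,
and the s390 and hugepage_global_enabled() checks are left out. The model
stores the result back into each mock VMA; how the kernel side commits the
updated flags is outside its scope.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative bit values; the kernel defines the real ones in mm.h. */
#define VM_HUGEPAGE	0x20000000UL
#define VM_NOHUGEPAGE	0x40000000UL

/* Made-up stand-ins for vm_area_struct and mm_struct. */
struct mock_vma {
	unsigned long vm_flags;
};

struct mock_mm {
	unsigned long def_flags;	/* Default flags for new mappings. */
	struct mock_vma *vmas;		/* Existing mappings. */
	size_t nr_vmas;
};

/* What MADV_HUGEPAGE does to a flags word: override NOHUGEPAGE. */
static unsigned long advise_hugepage(unsigned long flags)
{
	flags &= ~VM_NOHUGEPAGE;
	flags |= VM_HUGEPAGE;
	return flags;
}

/* Model of the policy: update the default, then every existing VMA. */
static void set_default_madv_hugepage(struct mock_mm *mm)
{
	size_t i;

	assert(mm != NULL);

	mm->def_flags = advise_hugepage(mm->def_flags);
	for (i = 0; i < mm->nr_vmas; i++)
		mm->vmas[i].vm_flags = advise_hugepage(mm->vmas[i].vm_flags);
}
```

Bits other than the two THP flags are left untouched, which is the property
that matters for keeping this equivalent to a per-VMA madvise().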
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-16 17:19 ` Usama Arif
@ 2025-05-16 17:51 ` Lorenzo Stoakes
2025-05-16 19:34 ` Usama Arif
0 siblings, 1 reply; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-05-16 17:51 UTC (permalink / raw)
To: Usama Arif
Cc: David Hildenbrand, Liam R. Howlett, Andrew Morton, linux-mm,
hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang, npache,
ryan.roberts, linux-kernel, linux-doc, kernel-team
On Fri, May 16, 2025 at 06:19:32PM +0100, Usama Arif wrote:
>
>
> On 16/05/2025 13:57, Lorenzo Stoakes wrote:
> > On Fri, May 16, 2025 at 01:24:18PM +0200, David Hildenbrand wrote:
> >> Looking forward to hearing what your magic thinking cap can do! :)
> >
> > OK so just to say at the outset, this is purely playing around with a
> > theoretical idea here, so if it's crazy just let me know :))
> >
> > Right now madvise() has limited utility because:
> >
> > - You have little control over how the operation is done
> > - You get little feedback about what's actually succeeded or not
> > - While you can perform multiple operations at once via process_madvise(),
> > even to the current process (after my changes to extend it), it's limited
> > to a single advice over 8 ranges.
> > - You can't say 'ignore errors just try'
> > - You get the weird gap behaviour.
> >
> > So the concept is - make everything explicit and add a new syscall that
> > wraps the existing madvise() stuff and addresses all the above issues.
> >
> > Specifically pertinent to the case at hand - also add a 'set_default'
> > boolean (you'll see shortly exactly where) to also tell madvise() to make
> > all future VMAs default to the specified advice. We'll whitelist what we're
> > allowed to use here and should be able to use mm->def_flags.
> >
> > So the idea is we'll use a helper struct-configured function (hey, it's me,
> > I <3 helper structs so of course) like:
> >
> > int madvise_ranges(struct madvise_ranges_control *ctl);
> >
> > With the data structures as follows (untested, etc. etc.):
> >
> > enum madvise_range_type {
> > MADVISE_RANGE_SINGLE,
> > MADVISE_RANGE_MULTI,
> > MADVISE_RANGE_ALL,
> > };
> >
> > struct madvise_range {
> > const void *addr;
> > size_t size;
> > int advice;
> > };
> >
> > struct madvise_ranges {
> > const struct madvise_range *arr;
> > size_t count;
> > };
> >
> > struct madvise_range_stats {
> > struct madvise_range range;
> > bool success;
> > bool partial;
> > };
> >
> > struct madvise_ranges_stats {
> > unsigned long nr_mappings_advised;
> > unsigned long nr_mappings_skipped;
> > unsigned long nr_pages_advised;
> > unsigned long nr_pages_skipped;
> > unsigned long nr_gaps;
> >
> > /*
> > * Useful for madvise_ranges_control->ignore_errors:
> > *
> > * If non-NULL, points to an array of size equal to the number of ranges
> > * specified. Indicates the specified range, whether it succeeded, and
> > * whether that success was partial (that is, the range specified
> > * multiple mappings, only some of which had advice applied
> > * successfully).
> > *
> > * Not valid for MADVISE_RANGE_ALL.
> > */
> > struct madvise_range_stats *per_range_stats;
> >
> > /* Error details. */
> > int err;
> > unsigned long failed_address;
> > size_t offset; /* If multi, at which offset did this occur? */
> > };
> >
> > struct madvise_ranges_control {
> > int version; /* Allow future updates to API. */
> >
> > enum madvise_range_type type;
> >
> > union {
> > struct madvise_range range; /* MADVISE_RANGE_SINGLE */
> > struct madvise_ranges ranges; /* MADVISE_RANGE_MULTI */
> > struct all { /* MADVISE_RANGE_ALL */
> > int advice;
> > /*
> > * If set, also have all future mappings have this applied by default.
> > *
> > * Only whitelisted advice may set this, otherwise -EINVAL will be returned.
> > */
> > bool set_default;
> > };
> > };
> > struct madvise_ranges_stats *stats; /* If non-NULL, report information about operation. */
> >
> > int pidfd; /* If is_remote set, the remote process. */
> >
> > /* Options. */
> > bool is_remote :1; /* Target remote process as specified by pidfd. */
> > bool ignore_errors :1; /* If error occurs applying advice, carry on to next VMA. */
> > bool single_mapping_only :1; /* Error out if any range is not a single VMA. */
> > bool stop_on_gap :1; /* Stop operation if input range includes unmapped memory. */
> > };
> >
> > So the user can specify whether to apply advice to a single range,
> > multiple, or the whole address space, with real control over how the operation proceeds.
> >
>
> For single range, we have madvise, for multiple ranges we have process_madvise,
> we can have a very very simple solution for whole address space with prctl.

With respect, I suggest you read through my justifications a little more
carefully :)

What happens for a single range when you want to ignore errors? You just can't
do it. What happens if you want to actually determine if an error arose or
whether a gap appeared (-ENOMEM happens on gaps, regardless of whether any
operation failed or not)? You can't.

process_madvise(), a function I personally expanded very significantly and
actually made it possible to be used in this way, is limited in:

1. It only allows single advice to be applied to each range.
2. It's limited to 8 operations at a time.

Also neither allows you to sensibly apply something to the _entire address
space_, ignoring errors.

Also neither allows you to 'set default' in the all case.

Not to mention the ability to actually determine if gaps occurred, more details
about errors, etc.

I'm essentially talking about a fixed madvise().
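
To illustrate the gap behaviour mentioned above, here is a small userspace
sketch, assuming Linux and private anonymous memory. It maps three pages,
unmaps the middle one and then issues a single madvise(MADV_DONTNEED) over
the whole span. The helper name madvise_across_gap() and its return
convention are made up for the example. The point is that the caller only
gets ENOMEM back and cannot tell from the return value alone whether the
advice was applied to the mapped parts; the sketch checks that separately by
reading the page after the gap.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns the errno reported by madvise() (0 if it succeeded), or -1 if
 * the setup itself failed. *tail_zeroed is set to 1 if the page after the
 * gap reads back as zero afterwards, i.e. MADV_DONTNEED reached it.
 */
static int madvise_across_gap(int *tail_zeroed)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *base;
	int err = 0;

	assert(tail_zeroed != NULL);
	if (page <= 0)
		return -1;

	base = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	/* Dirty all three pages so that zeroing is observable. */
	memset(base, 0xAA, 3 * page);

	/* Punch a hole in the middle of the range. */
	if (munmap(base + page, page)) {
		munmap(base, 3 * page);
		return -1;
	}

	/* One call across [mapped][gap][mapped]. */
	if (madvise(base, 3 * page, MADV_DONTNEED))
		err = errno;

	*tail_zeroed = (base[2 * page] == 0);

	munmap(base, page);
	munmap(base + 2 * page, page);
	return err;
}
```

With the proposed nr_gaps/err split in the stats structure, a caller would
not need this kind of out-of-band probing to tell 'there was a hole' apart
from 'the advice failed'.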
>
> IMHO, above is really not needed (but I might be wrong :)), this will introduce a
> lot of code to solve something that can be done in a very very simple way and it will introduce
> another syscall when prctl is designed for this, I understand that you don't like prctl,
> but it is there.

By this argument we don't need any system calls relating to processes and
instead should use prctl()... I mean mmap() could be a prctl() right? munmap()?
mremap()? The list goes on...

So no, I don't think prctl() is 'designed for this' at all. I think it's a bad
generic interface used to brush stuff under the carpet that we don't want to put
anywhere else.

However in the case of the problem you're trying to solve, we might perhaps
decide prctl() is the only sensible (if yucky) place for it.

But with my proposal above, we can actually have two wins - we both enable your
use case and provide a general means of doing 'madvise by default' and 'mass
madvise'.

At any rate I'm not saying it's right, or sane, but what I am saying is I feel
you have not refuted this as a concept :)

>
> I have added below what patch 1 of 6 would look like after incorporating all your feedback.
> (Thanks for all the feedback, really appreciate it!!)
> Main difference from the current revisions:
> - no more flags2.
> - no more MMF2_...
> - renamed policy to PR_DEFAULT_MADV_HUGEPAGE
> - mmap_write_lock_killable acquired in PR_GET_THP_POLICY
> - mmap_write lock fixed in PR_SET_THP_POLICY
> - check if hugepage_global_enabled is enabled in the call and account for s390
> - set mm->def_flags VM_HUGEPAGE and VM_NOHUGEPAGE according to the policy in the way
> done by madvise(). I believe VM merge will not be broken in this way, please let me know
> otherwise.
> - process_default_madv_hugepage function that does for_each_vma and calls hugepage_madvise.
> (I can move it to vma.c or any other file you prefer).
Thanks for taking on board the review, it's much appreciated and I hope you can
agree this is a big improvement :>)
>
> Please let me know if this looks acceptable and I can send this as RFC v3 for all the
> 6 patches (the rest are done in a similar way to below)

I think it will be useful to see this as an RFC notwithstanding my idea above (I
was saying to David previously it'd be useful to just see how it is now with
these changes).

Then that gives us the basis for further conversation. Thanks for helping us
iterate towards a solution here!

I've commented inline below though you need to address the duplication issue.

Thanks!

>
>
>
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..a8c3ce15a504 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -260,6 +260,8 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
> return orders;
> }
>
> +void process_default_madv_hugepage(struct mm_struct *mm, int advice);
> +
> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> unsigned long vm_flags,
> unsigned long tva_flags,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 43748c8f3454..436f4588bce8 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -466,7 +466,7 @@ extern unsigned int kobjsize(const void *objp);
> #define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB)
>
> /* This mask defines which mm->def_flags a process can inherit its parent */
> -#define VM_INIT_DEF_MASK VM_NOHUGEPAGE
> +#define VM_INIT_DEF_MASK (VM_HUGEPAGE | VM_NOHUGEPAGE)
>
> /* This mask represents all the VMA flag bits used by mlock */
> #define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index e76bade9ebb1..f1836b7c5704 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1703,6 +1703,7 @@ enum {
> /* leave room for more dump flags */
> #define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */
> #define MMF_VM_HUGEPAGE 17 /* set when mm is available for khugepaged */
> +#define MMF_VM_HUGEPAGE_MASK (1 << MMF_VM_HUGEPAGE)
>
> /*
> * This one-shot flag is dropped due to necessity of changing exe once again
> @@ -1742,7 +1743,8 @@ enum {
>
> #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
> MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
> - MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
> + MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK |\
> + MMF_VM_HUGEPAGE_MASK)
>
> static inline unsigned long mmf_init_flags(unsigned long flags)
> {
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 15c18ef4eb11..15aaa4db5ff8 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -364,4 +364,8 @@ struct prctl_mm_map {
> # define PR_TIMER_CREATE_RESTORE_IDS_ON 1
> # define PR_TIMER_CREATE_RESTORE_IDS_GET 2
>
> +#define PR_SET_THP_POLICY 78
> +#define PR_GET_THP_POLICY 79
> +#define PR_DEFAULT_MADV_HUGEPAGE 0
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index c434968e9f5d..4fe860b0ff25 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2658,6 +2658,44 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> clear_bit(MMF_DISABLE_THP, &me->mm->flags);
> mmap_write_unlock(me->mm);
> break;
> + case PR_GET_THP_POLICY:
> + if (arg2 || arg3 || arg4 || arg5)
> + return -EINVAL;
> + if (mmap_write_lock_killable(me->mm))
> + return -EINTR;
> + if (me->mm->def_flags & VM_HUGEPAGE)
> + error = PR_DEFAULT_MADV_HUGEPAGE;
> + mmap_write_unlock(me->mm);
> + break;
> + case PR_SET_THP_POLICY:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + if (mmap_write_lock_killable(me->mm))
> + return -EINTR;
> + switch (arg2) {
> + case PR_DEFAULT_MADV_HUGEPAGE:
> + if (!hugepage_global_enabled())
> + error = -EPERM;
> +#ifdef CONFIG_S390
> + /*
> + * qemu blindly sets MADV_HUGEPAGE on all allocations, but s390
> + * can't handle this properly after s390_enable_sie, so we simply
> + * ignore the madvise to prevent qemu from causing a SIGSEGV.
> + */
> + else if (mm_has_pgste(vma->vm_mm))
> + error = -EPERM;
> +#endif
No, we definitely don't want to duplicate this. You need to share this code with
madvise(). This is classic duplication, and loathsome specialisation anyway
that we want to limit to one place _only_.
> + else {
> + me->mm->def_flags &= ~VM_NOHUGEPAGE;
> + me->mm->def_flags |= VM_HUGEPAGE;
> + process_default_madv_hugepage(me->mm, MADV_HUGEPAGE);
Nit, but let's at least abstract out the mm here.
> + }
> + break;
> + default:
> + error = -EINVAL;
Thanks for fixing this! But technically you should have a break here too.
> + }
> + mmap_write_unlock(me->mm);
> + break;
> case PR_MPX_ENABLE_MANAGEMENT:
> case PR_MPX_DISABLE_MANAGEMENT:
> /* No longer implemented: */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2780a12b25f0..2b9a3e280ae4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -98,6 +98,18 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
> return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
> }
>
> +void process_default_madv_hugepage(struct mm_struct *mm, int advice)
> +{
> + struct vm_area_struct *vma;
> + unsigned long vm_flags;
> +
Please add the discussed assert for the mmap lock :)
> + VMA_ITERATOR(vmi, mm, 0);
> + for_each_vma(vmi, vma) {
> + vm_flags = vma->vm_flags;
> + hugepage_madvise(vma, &vm_flags, advice);
> + }
> +}
> +
> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> unsigned long vm_flags,
> unsigned long tva_flags,
>
>
>
>
> > This basically solves the problem this series tries to address while also
> > providing an improved madvise() API at the same time.
> >
> > Thoughts? Have I finally completely lost my mind?
>
>
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-16 17:51 ` Lorenzo Stoakes
@ 2025-05-16 19:34 ` Usama Arif
0 siblings, 0 replies; 51+ messages in thread
From: Usama Arif @ 2025-05-16 19:34 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand, Liam R. Howlett, Andrew Morton, linux-mm,
hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang, npache,
ryan.roberts, linux-kernel, linux-doc, kernel-team
On 16/05/2025 18:51, Lorenzo Stoakes wrote:
> On Fri, May 16, 2025 at 06:19:32PM +0100, Usama Arif wrote:
>>
>>
>> On 16/05/2025 13:57, Lorenzo Stoakes wrote:
>>> On Fri, May 16, 2025 at 01:24:18PM +0200, David Hildenbrand wrote:
>>>> Looking forward to hearing what your magic thinking cap can do! :)
>>>
>>> OK so just to say at the outset, this is purely playing around with a
>>> theoretical idea here, so if it's crazy just let me know :))
>>>
>>> Right now madvise() has limited utility because:
>>>
>>> - You have little control over how the operation is done
>>> - You get little feedback about what's actually succeeded or not
>>> - While you can perform multiple operations at once via process_madvise(),
>>> even to the current process (after my changes to extend it), it's limited
>>> to a single advice over 8 ranges.
>>> - You can't say 'ignore errors just try'
>>> - You get the weird gap behaviour.
>>>
>>> So the concept is - make everything explicit and add a new syscall that
>>> wraps the existing madvise() stuff and addresses all the above issues.
>>>
>>> Specifically pertinent to the case at hand - also add a 'set_default'
>>> boolean (you'll see shortly exactly where) to also tell madvise() to make
>>> all future VMAs default to the specified advice. We'll whitelist what we're
>>> allowed to use here and should be able to use mm->def_flags.
>>>
>>> So the idea is we'll use a helper struct-configured function (hey, it's me,
>>> I <3 helper structs so of course) like:
>>>
>>> int madvise_ranges(struct madvise_ranges_control *ctl);
>>>
>>> With the data structures as follows (untested, etc. etc.):
>>>
>>> enum madvise_range_type {
>>> MADVISE_RANGE_SINGLE,
>>> MADVISE_RANGE_MULTI,
>>> MADVISE_RANGE_ALL,
>>> };
>>>
>>> struct madvise_range {
>>> const void *addr;
>>> size_t size;
>>> int advice;
>>> };
>>>
>>> struct madvise_ranges {
>>> const struct madvise_range *arr;
>>> size_t count;
>>> };
>>>
>>> struct madvise_range_stats {
>>> struct madvise_range range;
>>> bool success;
>>> bool partial;
>>> };
>>>
>>> struct madvise_ranges_stats {
>>> unsigned long nr_mappings_advised;
>>> unsigned long nr_mappings_skipped;
>>> unsigned long nr_pages_advised;
>>> unsigned long nr_pages_skipped;
>>> unsigned long nr_gaps;
>>>
>>> /*
>>> * Useful for madvise_ranges_control->ignore_errors:
>>> *
>>> * If non-NULL, points to an array of size equal to the number of ranges
>>> * specified. Indicates the specified range, whether it succeeded, and
>>> * whether that success was partial (that is, the range specified
>>> * multiple mappings, only some of which had advice applied
>>> * successfully).
>>> *
>>> * Not valid for MADVISE_RANGE_ALL.
>>> */
>>> struct madvise_range_stats *per_range_stats;
>>>
>>> /* Error details. */
>>> int err;
>>> unsigned long failed_address;
>>> size_t offset; /* If multi, at which offset did this occur? */
>>> };
>>>
>>> struct madvise_ranges_control {
>>> int version; /* Allow future updates to API. */
>>>
>>> enum madvise_range_type type;
>>>
>>> union {
>>> struct madvise_range range; /* MADVISE_RANGE_SINGLE */
>>> struct madvise_ranges ranges; /* MADVISE_RANGE_MULTI */
>>> struct all { /* MADVISE_RANGE_ALL */
>>> int advice;
>>> /*
>>> * If set, also have all future mappings have this applied by default.
>>> *
>>> * Only whitelisted advice may set this, otherwise -EINVAL will be returned.
>>> */
>>> bool set_default;
>>> };
>>> };
>>> struct madvise_ranges_stats *stats; /* If non-NULL, report information about operation. */
>>>
>>> int pidfd; /* If is_remote set, the remote process. */
>>>
>>> /* Options. */
>>> bool is_remote :1; /* Target remote process as specified by pidfd. */
>>> bool ignore_errors :1; /* If error occurs applying advice, carry on to next VMA. */
>>> bool single_mapping_only :1; /* Error out if any range is not a single VMA. */
>>> bool stop_on_gap :1; /* Stop operation if input range includes unmapped memory. */
>>> };
>>>
>>> So the user can specify whether to apply advice to a single range,
>>> multiple, or the whole address space, with real control over how the operation proceeds.
>>>
>>
>> For single range, we have madvise, for multiple ranges we have process_madvise,
>> we can have a very very simple solution for whole address space with prctl.
>
> With respect, I suggest you read through my justifications a little more
> carefully :)
>
> What happens for a single range when you want to ignore errors? You just can't
> do it. What happens if you want to actually determine if an error arose or
> whether a gap appeared (-ENOMEM happens on gaps, regardless of whether any
> operation failed or not)? You can't.
>
> process_madvise(), a function I personally expanded very significantly and
> actually made it possible to be used in this way, is limited in:
>
> 1. It only allows single advice to be applied to each range.
> 2. It's limited to 8 operations at a time.
>
> Also, neither allows you to sensibly apply something to the _entire address
> space_, ignoring errors.
>
> Also, neither allows you to 'set default' in the all case.
>
> Not to mention the ability to actually determine if gaps occurred, more details
> about errors, etc.
>
> I'm essentially talking about a fixed madvise().
>
>>
>> IMHO, the above is really not needed (but I might be wrong :)); this will introduce a
>> lot of code to solve something that can be done in a very very simple way, and it will introduce
>> another syscall when prctl is designed for this. I understand that you don't like prctl,
>> but it is there.
>
> By this argument we don't need any system calls relating to processes and
> instead should use prctl()... I mean mmap() could be a prctl() right? munmap()?
> mremap()? The list goes on...
>
> So no, I don't think prctl() is 'designed for this' at all. I think it's a bad
> generic interface used to brush stuff under the carpet that we don't want to put
> anywhere else.
>
> However in the case of the problem you're trying to solve, we might perhaps
> decide prctl() is the only sensible (if yucky) place for it.
>
> But with my proposal above, we can actually have two wins - we both enable your
> use case and provide a general means of doing 'madvise by default' and 'mass
> madvise'.
>
> At any rate I'm not saying it's right, or sane, but what I am saying is I feel
> you have not refuted this as a concept :)
>
>>
>> I have added below what patch 1 of 6 would look like after incorporating all your feedback.
>> (Thanks for all the feedback, really appreciate it!!)
>> Main difference from the current revisions:
>> - no more flags2.
>> - no more MMF2_...
>> - renamed policy to PR_DEFAULT_MADV_HUGEPAGE
>> - mmap_write_lock_killable acquired in PR_GET_THP_POLICY
>> - mmap_write lock fixed in PR_SET_THP_POLICY
>> - check if hugepage_global_enabled is enabled in the call and account for s390
>> - set mm->def_flags VM_HUGEPAGE and VM_NOHUGEPAGE according to the policy in the way
>> done by madvise(). I believe VM merge will not be broken in this way, please let me know
>> otherwise.
>> - process_default_madv_hugepage function that does for_each_vma and calls hugepage_madvise.
>> (I can move it to vma.c or any other file you prefer).
>
> Thanks for taking on board the review, it's much appreciated and I hope you can
> agree this is a big improvement :>)
>
>>
>> Please let me know if this looks acceptable and I can send this as RFC v3 for all the
>> 6 patches (the rest are done in a similar way to below)
>
> I think it will be useful to see this as an RFC notwithstanding my idea above (I
> was saying to David previously it'd be useful to just see how it is now with
> these changes).
>
> Then that gives us the basis for further conversation. Thanks for helping us
> iterate towards a solution here!
>
> I've commented inline below though you need to address the duplication issue.
>
> Thanks!
>
Ack on all the points you mentioned above.
Also agree on the review below. I made it as a draft to check if it would work
properly, but I need to fix the duplication, abstraction, break and the assert
(I went through all the comments while writing below but missed the assert
one, sorry about that!).
Let me clean up the whole thing with a much better cover letter and send an RFC v3
on Monday.
Thanks again for the quick and really valuable feedback, it's looking a lot better
than the current version!
Thanks,
Usama
>>
>>
>>
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2f190c90192d..a8c3ce15a504 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -260,6 +260,8 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
>> return orders;
>> }
>>
>> +void process_default_madv_hugepage(struct mm_struct *mm, int advice);
>> +
>> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>> unsigned long vm_flags,
>> unsigned long tva_flags,
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 43748c8f3454..436f4588bce8 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -466,7 +466,7 @@ extern unsigned int kobjsize(const void *objp);
>> #define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB)
>>
>> /* This mask defines which mm->def_flags a process can inherit its parent */
>> -#define VM_INIT_DEF_MASK VM_NOHUGEPAGE
>> +#define VM_INIT_DEF_MASK (VM_HUGEPAGE | VM_NOHUGEPAGE)
>>
>> /* This mask represents all the VMA flag bits used by mlock */
>> #define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT)
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index e76bade9ebb1..f1836b7c5704 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -1703,6 +1703,7 @@ enum {
>> /* leave room for more dump flags */
>> #define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */
>> #define MMF_VM_HUGEPAGE 17 /* set when mm is available for khugepaged */
>> +#define MMF_VM_HUGEPAGE_MASK (1 << MMF_VM_HUGEPAGE)
>>
>> /*
>> * This one-shot flag is dropped due to necessity of changing exe once again
>> @@ -1742,7 +1743,8 @@ enum {
>>
>> #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
>> MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
>> - MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
>> + MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK |\
>> + MMF_VM_HUGEPAGE_MASK)
>>
>> static inline unsigned long mmf_init_flags(unsigned long flags)
>> {
>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>> index 15c18ef4eb11..15aaa4db5ff8 100644
>> --- a/include/uapi/linux/prctl.h
>> +++ b/include/uapi/linux/prctl.h
>> @@ -364,4 +364,8 @@ struct prctl_mm_map {
>> # define PR_TIMER_CREATE_RESTORE_IDS_ON 1
>> # define PR_TIMER_CREATE_RESTORE_IDS_GET 2
>>
>> +#define PR_SET_THP_POLICY 78
>> +#define PR_GET_THP_POLICY 79
>> +#define PR_DEFAULT_MADV_HUGEPAGE 0
>> +
>> #endif /* _LINUX_PRCTL_H */
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index c434968e9f5d..4fe860b0ff25 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -2658,6 +2658,44 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>> clear_bit(MMF_DISABLE_THP, &me->mm->flags);
>> mmap_write_unlock(me->mm);
>> break;
>> + case PR_GET_THP_POLICY:
>> + if (arg2 || arg3 || arg4 || arg5)
>> + return -EINVAL;
>> + if (mmap_write_lock_killable(me->mm))
>> + return -EINTR;
>> + if (me->mm->def_flags & VM_HUGEPAGE)
>> + error = PR_DEFAULT_MADV_HUGEPAGE;
>> + mmap_write_unlock(me->mm);
>> + break;
>> + case PR_SET_THP_POLICY:
>> + if (arg3 || arg4 || arg5)
>> + return -EINVAL;
>> + if (mmap_write_lock_killable(me->mm))
>> + return -EINTR;
>> + switch (arg2) {
>> + case PR_DEFAULT_MADV_HUGEPAGE:
>> + if (!hugepage_global_enabled())
>> + error = -EPERM;
>> +#ifdef CONFIG_S390
>> + /*
>> + * qemu blindly sets MADV_HUGEPAGE on all allocations, but s390
>> + * can't handle this properly after s390_enable_sie, so we simply
>> + * ignore the madvise to prevent qemu from causing a SIGSEGV.
>> + */
>> + else if (mm_has_pgste(vma->vm_mm))
>> + error = -EPERM;
>> +#endif
>
> No, we definitely don't want to duplicate this. You need to share this code with
> madvise(). This is classic duplication, and loathsome specialisation anyway
> that we want to limit to one place _only_.
>
>> + else {
>> + me->mm->def_flags &= ~VM_NOHUGEPAGE;
>> + me->mm->def_flags |= VM_HUGEPAGE;
>> + process_default_madv_hugepage(me->mm, MADV_HUGEPAGE);
>
> Nit, but let's at least abstract out the mm here.
>
>> + }
>> + break;
>> + default:
>> + error = -EINVAL;
>
> Thanks for fixing this! But technically you should have a break here too.
>
>> + }
>> + mmap_write_unlock(me->mm);
>> + break;
>> case PR_MPX_ENABLE_MANAGEMENT:
>> case PR_MPX_DISABLE_MANAGEMENT:
>> /* No longer implemented: */
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 2780a12b25f0..2b9a3e280ae4 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -98,6 +98,18 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>> return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
>> }
>>
>> +void process_default_madv_hugepage(struct mm_struct *mm, int advice)
>> +{
>> + struct vm_area_struct *vma;
>> + unsigned long vm_flags;
>> +
>
> Please add the discussed assert for the mmap lock :)
>
>> + VMA_ITERATOR(vmi, mm, 0);
>> + for_each_vma(vmi, vma) {
>> + vm_flags = vma->vm_flags;
>> + hugepage_madvise(vma, &vm_flags, advice);
>> + }
>> +}
>> +
>> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>> unsigned long vm_flags,
>> unsigned long tva_flags,
>>
>>
>>
>>
>>> This basically solves the problem this series tries to address while also
>>> providing an improved madvise() API at the same time.
>>>
>>> Thoughts? Have I finally completely lost my mind?
>>
>>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Is number of process_madvise()-able ranges limited to 8? (was Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process)
2025-05-16 12:57 ` Lorenzo Stoakes
2025-05-16 17:19 ` Usama Arif
@ 2025-05-17 16:20 ` SeongJae Park
2025-05-17 18:50 ` Lorenzo Stoakes
2025-05-17 19:01 ` [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process Lorenzo Stoakes
2 siblings, 1 reply; 51+ messages in thread
From: SeongJae Park @ 2025-05-17 16:20 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: SeongJae Park, David Hildenbrand, Usama Arif, Andrew Morton,
linux-mm, hannes, shakeel.butt, riel, ziy, laoar.shao,
baolin.wang, Liam.Howlett, npache, ryan.roberts, linux-kernel,
linux-doc, kernel-team
Hi Lorenzo,
On Fri, 16 May 2025 13:57:18 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
[...]
> Right now madvise() has limited utility because:
[...]
> - While you can perform multiple operations at once via process_madvise(),
> even to the current process (after my changes to extend it), it's limited
> to a single advice over 8 ranges.
I'm a bit confused by the last part, since I understand your point to be that the
'vlen' parameter of process_madvise() is limited to 8, but my test code below
succeeds with a 'vlen' parameter value of 512. Could you please enlighten me?
Attaching my test code below. You could simply run it as below.
gcc test.c && ./a.out
==== Attachment 0 (test.c) ====
#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#define SZ_PAGE (4096)
#define NR_PAGES (512)
#define MMAP_SZ (SZ_PAGE * NR_PAGES)

int main(void)
{
	char *buf;
	unsigned int i;
	int ret;
	pid_t pid = getpid();
	int pidfd = syscall(SYS_pidfd_open, pid, 0);
	struct iovec *vec;

	buf = mmap(NULL, MMAP_SZ, PROT_READ | PROT_WRITE, MAP_PRIVATE |
			MAP_ANON, -1, 0);
	if (buf == MAP_FAILED) {
		printf("mmap fail\n");
		return -1;
	}

	for (i = 0; i < MMAP_SZ; i++)
		buf[i] = 123;

	vec = malloc(sizeof(*vec) * NR_PAGES);
	for (i = 0; i < NR_PAGES; i++) {
		vec[i].iov_base = &buf[i * SZ_PAGE];
		vec[i].iov_len = SZ_PAGE;
	}

	ret = syscall(SYS_process_madvise, pidfd, vec, NR_PAGES,
			MADV_DONTNEED, 0);
	if (ret != MMAP_SZ) {
		printf("process_madvise fail\n");
		return -1;
	}

	ret = munmap(buf, MMAP_SZ);
	if (ret) {
		printf("munmap failed\n");
		return -1;
	}

	close(pidfd);
	printf("good\n");
	return 0;
}
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Is number of process_madvise()-able ranges limited to 8? (was Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process)
2025-05-17 16:20 ` Is number of process_madvise()-able ranges limited to 8? (was Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process) SeongJae Park
@ 2025-05-17 18:50 ` Lorenzo Stoakes
2025-05-17 20:25 ` SeongJae Park
0 siblings, 1 reply; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-05-17 18:50 UTC (permalink / raw)
To: SeongJae Park
Cc: David Hildenbrand, Usama Arif, Andrew Morton, linux-mm, hannes,
shakeel.butt, riel, ziy, laoar.shao, baolin.wang, Liam.Howlett,
npache, ryan.roberts, linux-kernel, linux-doc, kernel-team
Hi SJ,
I'm happy to discuss this, and reply below, but I _think_ replying in this
thread is really not optimal, as we're digressing quite a bit from the
proposal/issue at hand and the cc's are now not quite aligned, potentially
creating confusion and noise.
I know you're in good faith based on your (excellent :) series in this
area, so I presume it was just to provide context -as to why you're raising
it- more than anything else.
This is more of a 'email development is sucky' comment, but I _think_ this
would be better as a [DISCUSSION] thread maybe linking this original one
back to it or something.
But anyway, getting to the point - my answer is simple so not much
discussion _really_ required here - you're right, I'm wrong! :)
I go into detail inline below:
On Sat, May 17, 2025 at 09:20:48AM -0700, SeongJae Park wrote:
> Hi Lorenzo,
>
> On Fri, 16 May 2025 13:57:18 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> [...]
> > Right now madvise() has limited utility because:
> [...]
> > - While you can perform multiple operations at once via process_madvise(),
> > even to the current process (after my changes to extend it), it's limited
> > to a single advice over 8 ranges.
>
> I'm a bit confused by the last part, since I understand your point to be that the
> 'vlen' parameter of process_madvise() is limited to 8, but my test code below
> succeeds with a 'vlen' parameter value of 512. Could you please enlighten me?
Let's keep this simple - I'm just wrong here :) apologies, entirely my
fault.
We have discussed this a few times before, where I suspect my incorrect
assertion on this has led to you also assuming the wrong thing (again,
apologies!).
But it does raise the important point - we need to re-examine your changes
(see [0]) where this assumption reduced the urgency of considering
contention issues.
Let's take a look at that again on Monday. Though I do strongly suspect
it's fine honestly. We just need to take a look...!
[0]: https://lore.kernel.org/linux-mm/5fc4e100-70d3-44c1-99f7-f8a5a6a0ba65@lucifer.local/
Anyway, let's dig into the code to get things right:
SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
		size_t, vlen, int, behavior, unsigned int, flags)
{
	...
	struct iovec iovstack[UIO_FASTIOV];
	struct iovec *iov = iovstack;
	struct iov_iter iter;
	...
	ret = import_iovec(ITER_DEST, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
	if (ret < 0)
		goto out;
	...
}
My mistake was assuming that UIO_FASTIOV was the hard limit on size. This
is not the case, it's just an optimisation - if the iovec is small enough
to fit in the on-stack array we use that, otherwise we allocate.
We can see this by examining the comment from import_iovec():
/*
 * ...
 * If the array pointed to by *@iov is large enough to hold all @nr_segs,
 * then this function places %NULL in *@iov on return. Otherwise, a new
 * array will be allocated and the result placed in *@iov. This means that
 * the caller may call kfree() on *@iov regardless of whether the small
 * on-stack array was used or not (and regardless of whether this function
 * returns an error or not).
 * ...
 */
ssize_t import_iovec(int type, const struct iovec __user *uvec,
		unsigned nr_segs, unsigned fast_segs,
		struct iovec **iovp, struct iov_iter *i)
{
	return __import_iovec(type, uvec, nr_segs, fast_segs, iovp, i,
			      in_compat_syscall());
}
Where nr_segs == vlen, fast_segs == UIO_FASTIOV (8), iovp is &iov, and I
think iov referred to by the comment is *iovp (only sensible conclusion,
really).
Looking into the code further we see:
ssize_t __import_iovec(int type, const struct iovec __user *uvec,
		unsigned nr_segs, unsigned fast_segs, struct iovec **iovp,
		struct iov_iter *i, bool compat)
{
	...
	iov = iovec_from_user(uvec, nr_segs, fast_segs, *iovp, compat);
	...
}

struct iovec *iovec_from_user(const struct iovec __user *uvec,
		unsigned long nr_segs, unsigned long fast_segs,
		struct iovec *fast_iov, bool compat)
{
	...
	if (nr_segs > UIO_MAXIOV)
		return ERR_PTR(-EINVAL);
	if (nr_segs > fast_segs) {
		iov = kmalloc_array(nr_segs, sizeof(struct iovec), GFP_KERNEL);
		if (!iov)
			return ERR_PTR(-ENOMEM);
	}
	...
}
So - this confirms it - we're fine, it just tries to use the stack-based
array if it can - otherwise it kmalloc()'s.
Of course, UIO_MAXIOV remains the _actual_ hard limit (hardcoded to 1,024
in include/uapi/linux/uio.h).
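To summarise the size checks walked through above, here is a minimal user-space
model of the decision. It is only a sketch: model_iov_storage() and the
MODEL_* names are made up for illustration and are not kernel symbols; only the
two constants (8 and 1,024) come from the code discussed above.

```c
#include <assert.h>
#include <stddef.h>

/* Same values as UIO_FASTIOV and UIO_MAXIOV in include/uapi/linux/uio.h. */
#define MODEL_UIO_FASTIOV 8
#define MODEL_UIO_MAXIOV  1024

enum iov_storage {
	IOV_ON_STACK,	/* fits in the caller's on-stack array */
	IOV_KMALLOCED,	/* too large for the stack array, so heap allocated */
	IOV_REJECTED,	/* more than UIO_MAXIOV segments, i.e. -EINVAL */
};

/*
 * Hypothetical helper: mirrors only the two size checks in
 * iovec_from_user(), nothing else the kernel does there.
 */
static enum iov_storage model_iov_storage(size_t nr_segs, size_t fast_segs)
{
	assert(fast_segs <= MODEL_UIO_MAXIOV);

	if (nr_segs > MODEL_UIO_MAXIOV)
		return IOV_REJECTED;
	if (nr_segs > fast_segs)
		return IOV_KMALLOCED;
	return IOV_ON_STACK;
}
```

With fast_segs == 8, a vlen of 512 (as in the repro) lands in the kmalloc'd
case, and only a vlen above 1,024 is rejected.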
The other points I made about the proposed interface remain, but I won't go
into more detail as we are obviously lacking that context here.
Thanks for bringing this up and correcting my misinterpretation, as well as
providing the below repro code, and let's revisit your old series... but on
Monday :)
I should really not be looking at work mail on a Saturday (mea culpa, once
again... :)
One small nit in the repro code below (hey I'm a kernel dev, can't help
myself... ;)
Cheers, Lorenzo
>
> Attaching my test code below. You could simply run it as below.
>
> gcc test.c && ./a.out
>
> ==== Attachment 0 (test.c) ====
> #define _GNU_SOURCE
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <sys/uio.h>
> #include <unistd.h>
>
> #define SZ_PAGE (4096)
> #define NR_PAGES (512)
> #define MMAP_SZ (SZ_PAGE * NR_PAGES)
>
> int main(void)
> {
> char *buf;
> unsigned int i;
> int ret;
> pid_t pid = getpid();
> int pidfd = syscall(SYS_pidfd_open, pid, 0);
> struct iovec *vec;
>
> buf = mmap(NULL, MMAP_SZ, PROT_READ | PROT_WRITE, MAP_PRIVATE |
> MAP_ANON, -1, 0);
> if (buf == MAP_FAILED) {
> printf("mmap fail\n");
> return -1;
> }
>
> for (i = 0; i < MMAP_SZ; i++)
> buf[i] = 123;
>
> vec = malloc(sizeof(*vec) * NR_PAGES);
> for (i = 0; i < NR_PAGES; i++) {
> vec[i].iov_base = &buf[i * SZ_PAGE];
> vec[i].iov_len = SZ_PAGE;
> }
>
> ret = syscall(SYS_process_madvise, pidfd, vec, NR_PAGES,
> MADV_DONTNEED, 0);
> if (ret != MMAP_SZ) {
> printf("process_madvise fail\n");
> return -1;
> }
To be pedantic, you are really only checking to see if an error was
returned, in theory no error might have been returned but the operation
might have not proceeded, so a more proper check here would be to populate
the anon memory with non-zero data, then check afterwards that it's zeroed.
Given this outcome would probably imply iovec issues, it's not likely, but
to really assert the point you'd probably want to do that!
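A rough sketch of what that stricter check could look like, written as a
helper rather than a full program. The function name dontneed_zeroes_pages()
is made up here, and it assumes a kernel that accepts MADV_DONTNEED via
process_madvise() on the calling process (older kernels will simply fail the
call, which the helper reports as -1):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Returns 1 if every byte reads back as zero after MADV_DONTNEED,
 * 0 if stale data survived, and -1 if the check could not be run
 * (setup failed or process_madvise() did not advise the whole range).
 */
static int dontneed_zeroes_pages(size_t nr_pages)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = nr_pages * page;
	unsigned char *buf;
	struct iovec *vec;
	int result = -1;
	int pidfd;
	long ret;
	size_t i;

	assert(nr_pages > 0 && nr_pages <= 1024);	/* stay within UIO_MAXIOV */

	pidfd = (int)syscall(SYS_pidfd_open, getpid(), 0);
	if (pidfd < 0)
		return -1;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		goto out_close;

	vec = malloc(nr_pages * sizeof(*vec));
	if (!vec)
		goto out_unmap;

	/* Populate with non-zero data, one iovec entry per page. */
	memset(buf, 123, len);
	for (i = 0; i < nr_pages; i++) {
		vec[i].iov_base = buf + i * page;
		vec[i].iov_len = page;
	}

	ret = syscall(SYS_process_madvise, pidfd, vec, nr_pages,
		      MADV_DONTNEED, 0);
	if (ret == (long)len) {
		/* Private anonymous pages must read back as zero now. */
		result = 1;
		for (i = 0; i < len; i++) {
			if (buf[i] != 0) {
				result = 0;
				break;
			}
		}
	}

	free(vec);
out_unmap:
	munmap(buf, len);
out_close:
	close(pidfd);
	return result;
}
```

Calling dontneed_zeroes_pages(512) mirrors the 512-entry case in the repro; a
return of 0 would mean the call reported success without discarding the pages.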
>
> ret = munmap(buf, MMAP_SZ);
> if (ret) {
> printf("munmap failed\n");
> return -1;
> }
>
> close(pidfd);
> printf("good\n");
> return 0;
> }
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
2025-05-16 12:57 ` Lorenzo Stoakes
2025-05-16 17:19 ` Usama Arif
2025-05-17 16:20 ` Is number of process_madvise()-able ranges limited to 8? (was Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process) SeongJae Park
@ 2025-05-17 19:01 ` Lorenzo Stoakes
2 siblings, 0 replies; 51+ messages in thread
From: Lorenzo Stoakes @ 2025-05-17 19:01 UTC (permalink / raw)
To: David Hildenbrand
Cc: Usama Arif, Andrew Morton, linux-mm, hannes, shakeel.butt, riel,
ziy, laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
linux-kernel, linux-doc, kernel-team, SeongJae Park
+cc SJ.
On Fri, May 16, 2025 at 01:57:18PM +0100, Lorenzo Stoakes wrote:
> On Fri, May 16, 2025 at 01:24:18PM +0200, David Hildenbrand wrote:
> > Looking forward to hearing what your magic thinking cap can do! :)
>
> OK so just to say at the outset, this is purely playing around with a
> theoretical idea here, so if it's crazy just let me know :))
>
> Right now madvise() has limited utility because:
>
> - You have little control over how the operation is done
> - You get little feedback about what's actually succeeded or not
> - While you can perform multiple operations at once via process_madvise(),
> even to the current process (after my changes to extend it), it's limited
> to a single advice over 8 ranges.
SJ raised the point that I am just wrong about this (see [0]).
So this makes the interface even more compelling in my view, we can either:
1. Simply remove the 'madvise_ranges' stuff below, and replace with an iovec,
and simply forward that to process_madvise() (simpler, probably preferable).
2. Keep it as-is so we get to perform _multiple_ advice operations in a batch.
3. Use an iovec but also have some other array specifying operations to get the
same thing, but maybe extend process_madvise()?
I really like the idea however of just - using the existing process_madvise()
code - picking up all the recent improvements - avoiding duplication, etc.
The key idea of this interface is more so being able to control certain
behaviours (such as stopping on gaps etc.)
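For anyone not familiar with the gap behaviour referred to here, a small
sketch that shows it from user space. The names gap_result, page_is_zero() and
madvise_across_gap() are invented for this example; the behaviour it probes is
the one described in madvise(2), where unmapped parts of the range are skipped,
the advice is applied to the rest, and ENOMEM is still returned:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

struct gap_result {
	int rc;			/* madvise() return value */
	int err;		/* errno captured right after madvise() */
	int head_zeroed;	/* page before the gap was discarded */
	int tail_zeroed;	/* page after the gap was discarded */
};

static int page_is_zero(const unsigned char *p, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (p[i] != 0)
			return 0;
	return 1;
}

/* Map three pages, unmap the middle one, then advise the whole span. */
static struct gap_result madvise_across_gap(void)
{
	struct gap_result res = { 0, 0, 0, 0 };
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *buf;
	int rc;

	buf = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	assert(buf != MAP_FAILED);
	memset(buf, 123, 3 * page);

	rc = munmap(buf + page, page);	/* create the gap */
	assert(rc == 0);

	errno = 0;
	res.rc = madvise(buf, 3 * page, MADV_DONTNEED);
	res.err = errno;
	res.head_zeroed = page_is_zero(buf, page);
	res.tail_zeroed = page_is_zero(buf + 2 * page, page);

	munmap(buf, page);
	munmap(buf + 2 * page, page);
	return res;
}
```

The caller sees a failure (-1 with ENOMEM) even though both mapped pages were
advised, which is exactly the ambiguity a 'stop on gap' style option would let
the caller resolve up front.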
Yet Another (TM) alternative would be to use the -currently unused-
process_madvise() flags field to specify options as mentioned here, including
the 'set default' thing.
Now _that_ is really interesting actually.
It will give us less flexibility, but require a much much less major change.
OK damn, that's quite compelling... maybe I will do an RFC patch for that... :)
Happy to hear thoughts on it...
[0]: https://lore.kernel.org/all/20250517162048.36347-1-sj@kernel.org/
> - You can't say 'ignore errors just try'
> - You get the weird gap behaviour.
>
> So the concept is - make everything explicit and add a new syscall that
> wraps the existing madvise() stuff and addresses all the above issues.
>
> Specifically pertinent to the case at hand - also add a 'set_default'
> boolean (you'll see shortly exactly where) to also tell madvise() to make
> all future VMAs default to the specified advice. We'll whitelist what we're
> allowed to use here and should be able to use mm->def_flags.
>
> So the idea is we'll use a helper struct-configured function (hey, it's me,
> I <3 helper structs so of course) like:
>
> int madvise_ranges(struct madvise_ranges_control *ctl);
>
> With the data structures as follows (untested, etc. etc.):
>
> enum madvise_range_type {
> MADVISE_RANGE_SINGLE,
> MADVISE_RANGE_MULTI,
> MADVISE_RANGE_ALL,
> };
>
> struct madvise_range {
> const void *addr;
> size_t size;
> int advice;
> };
>
> struct madvise_ranges {
> const struct madvise_range *arr;
> size_t count;
> };
>
> struct madvise_range_stats {
> struct madvise_range range;
> bool success;
> bool partial;
> };
>
> struct madvise_ranges_stats {
> unsigned long nr_mappings_advised;
> unsigned long nr_mappings_skipped;
> unsigned long nr_pages_advised;
> unsigned long nr_pages_skipped;
> unsigned long nr_gaps;
>
> /*
> * Useful for madvise_ranges_control->ignore_errors:
> *
> * If non-NULL, points to an array of size equal to the number of ranges
> * specified. Indicates the specified range, whether it succeeded, and
> * whether that success was partial (that is, the range specified
> * multiple mappings, only some of which had advice applied
> * successfully).
> *
> * Not valid for MADVISE_RANGE_ALL.
> */
> struct madvise_range_stats *per_range_stats;
>
> /* Error details. */
> int err;
> unsigned long failed_address;
> size_t offset; /* If multi, at which offset did this occur? */
> };
>
> struct madvise_ranges_control {
> int version; /* Allow future updates to API. */
>
> enum madvise_range_type type;
>
> union {
> struct madvise_range range; /* MADVISE_RANGE_SINGLE */
> struct madvise_ranges ranges; /* MADVISE_RANGE_MULTI */
> struct all { /* MADVISE_RANGE_ALL */
> int advice;
> /*
> * If set, also have all future mappings have this applied by default.
> *
> * Only whitelisted advice may set this, otherwise -EINVAL will be returned.
> */
> bool set_default;
> };
> };
> struct madvise_ranges_stats *stats; /* If non-NULL, report information about operation. */
>
> int pidfd; /* If is_remote set, the remote process. */
>
> /* Options. */
> bool is_remote :1; /* Target remote process as specified by pidfd. */
> bool ignore_errors :1; /* If error occurs applying advice, carry on to next VMA. */
> bool single_mapping_only :1; /* Error out if any range is not a single VMA. */
> bool stop_on_gap :1; /* Stop operation if input range includes unmapped memory. */
> };
>
> So the user can specify whether to apply advice to a single range,
> multiple, or the whole address space, with real control over how the operation proceeds.
>
> This basically solves the problem this series tries to address while also
> providing an improved madvise() API at the same time.
>
> Thoughts? Have I finally completely lost my mind?
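To make the proposal quoted above a little more concrete, here is a sketch of
how a caller might fill in the control struct for the 'whole address space,
and make it the default' case. This is purely illustrative: no kernel
implements madvise_ranges(), the structs are a trimmed copy of the untested
ones above, and the member name 'all', the version value 1, the void * type
for stats and the helper whole_mm_request() are all assumptions made for the
example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>	/* MADV_HUGEPAGE */

enum madvise_range_type {
	MADVISE_RANGE_SINGLE,
	MADVISE_RANGE_MULTI,
	MADVISE_RANGE_ALL,
};

struct madvise_range {
	const void *addr;
	size_t size;
	int advice;
};

struct madvise_ranges {
	const struct madvise_range *arr;
	size_t count;
};

struct madvise_ranges_control {
	int version;
	enum madvise_range_type type;
	union {
		struct madvise_range range;	/* MADVISE_RANGE_SINGLE */
		struct madvise_ranges ranges;	/* MADVISE_RANGE_MULTI */
		struct {			/* MADVISE_RANGE_ALL */
			int advice;
			bool set_default;
		} all;				/* member name is an assumption */
	};
	void *stats;	/* struct madvise_ranges_stats * in the proposal */
	int pidfd;
	bool is_remote :1;
	bool ignore_errors :1;
	bool single_mapping_only :1;
	bool stop_on_gap :1;
};

/* Hypothetical helper: build a request covering the whole address space. */
static struct madvise_ranges_control whole_mm_request(int advice,
						      bool set_default)
{
	struct madvise_ranges_control ctl;

	assert(advice >= 0);
	memset(&ctl, 0, sizeof(ctl));
	ctl.version = 1;		/* assumed first API version */
	ctl.type = MADVISE_RANGE_ALL;
	ctl.all.advice = advice;
	ctl.all.set_default = set_default;
	ctl.ignore_errors = true;	/* carry on past VMAs that refuse */
	return ctl;
}
```

Under these assumptions, whole_mm_request(MADV_HUGEPAGE, true) would be the
rough equivalent of what the prctl() in this series sets out to do.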
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Is number of process_madvise()-able ranges limited to 8? (was Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process)
2025-05-17 18:50 ` Lorenzo Stoakes
@ 2025-05-17 20:25 ` SeongJae Park
0 siblings, 0 replies; 51+ messages in thread
From: SeongJae Park @ 2025-05-17 20:25 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: SeongJae Park, David Hildenbrand, Usama Arif, Andrew Morton,
linux-mm, hannes, shakeel.butt, riel, ziy, laoar.shao,
baolin.wang, Liam.Howlett, npache, ryan.roberts, linux-kernel,
linux-doc, kernel-team
On Sat, 17 May 2025 19:50:34 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
[...]
> Let's keep this simple - I'm just wrong here :) apologies, entirely my
> fault.
No worry, appreciate your kind and detailed answer.
[...]
> Anyway, let's dig into the code to get things right:
[...]
> So - this confirms it - we're fine, it just tries to use the stack-based
> array if it can - otherwise it kmalloc()'s.
>
> Of course, UIO_MAXIOV remains the _actual_ hard limit (hardcoded to 1,024
> in include/uapi/linux/uio.h).
Thanks for the kind clarifications. All your explanations perfectly match
my understanding. I'm happy to be on the same page with you!
>
> The other points I made about the proposed interface remain, but I won't go
> into more detail as we are obviously lacking that context here.
>
> Thanks for bringing this up and correcting my misinterpretation, as well as
> providing the below repro code, and let's revisit your old series... but on
> Monday :)
Sure, and no worry, take your time :)
>
> I should really not be looking at work mail on a Saturday (mea culpa, once
> again... :)
I hope the rest of your weekend is calm and uninterrupted. Keeping you from
burning out is important for the community :)
>
> One small nit in the repro code below (hey I'm a kernel dev, can't help
> myself... ;)
To me, being a kernel programmer rather than a user-space C programmer is
a good excuse for asking you to be generous with my user-space bugs ;) Thank you
for your kind comment below, anyway :)
>
> Cheers, Lorenzo
>
> >
> > Attaching my test code below. You could simply run it as below.
> >
> > gcc test.c && ./a.out
> >
> > ==== Attachment 0 (test.c) ====
[...]
> > ret = syscall(SYS_process_madvise, pidfd, vec, NR_PAGES,
> > MADV_DONTNEED, 0);
> > if (ret != MMAP_SZ) {
> > printf("process_madvise fail\n");
> > return -1;
> > }
>
> To be pedantic, you are really only checking to see if an error was
> returned, in theory no error might have been returned but the operation
> might have not proceeded, so a more proper check here would be to populate
> the anon memory with non-zero data, then check afterwards that it's zeroed.
>
> Given this outcome would probably imply iovec issues, it's not likely, but
> to really assert the point you'd probably want to do that!
Good points! I once considered improving this test and posting it for
inclusion in the mm selftests, but have found no time to do that so far. The
above input will be very helpful if I (or someone else) find time to write
such a process_madvise() selftest.
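
For reference, the stronger check suggested above could look roughly like the
sketch below. This is not the original test.c (most of which is elided here),
only a minimal illustration under a few assumptions: the helper name
check_dontneed_zeroes() and the NR_PAGES value of 16 are hypothetical, the
fallback syscall numbers are the x86-64/arm64 ones, and the running kernel is
assumed to accept MADV_DONTNEED via process_madvise() on the calling process
(older kernels reject it, which the sketch reports as "skipped").

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fallbacks for older headers; assumed x86-64/arm64 syscall numbers. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif

/* Hypothetical value, chosen to exceed the 8-entry on-stack iovec array. */
#define NR_PAGES 16

/*
 * Returns  0 if the range reads back as zeroes after MADV_DONTNEED,
 *          1 if the check could not run (setup failed or the kernel
 *            rejected the call),
 *         -1 if the call did not cover the whole range or the data
 *            was not zeroed.
 */
static int check_dontneed_zeroes(void)
{
	long page_sz = sysconf(_SC_PAGESIZE);
	size_t len = (size_t)page_sz * NR_PAGES;
	struct iovec vec[NR_PAGES];
	unsigned char *buf;
	int pidfd, result = 0;
	long ret;
	size_t i;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* Populate the anon memory with non-zero data first. */
	memset(buf, 0xff, len);

	/* One iovec entry per page, so vlen is NR_PAGES. */
	for (i = 0; i < NR_PAGES; i++) {
		vec[i].iov_base = buf + i * (size_t)page_sz;
		vec[i].iov_len = (size_t)page_sz;
	}

	pidfd = (int)syscall(SYS_pidfd_open, getpid(), 0);
	if (pidfd < 0) {
		munmap(buf, len);
		return 1;
	}

	ret = syscall(SYS_process_madvise, pidfd, vec, (size_t)NR_PAGES,
		      MADV_DONTNEED, 0UL);
	if (ret < 0) {
		result = 1;		/* unsupported or not permitted */
	} else if ((size_t)ret != len) {
		result = -1;		/* only part of the range was advised */
	} else {
		/* Private anon pages must read back as zero afterwards. */
		for (i = 0; i < len; i++) {
			if (buf[i] != 0) {
				result = -1;
				break;
			}
		}
	}

	close(pidfd);
	munmap(buf, len);
	return result;
}
```

Calling check_dontneed_zeroes() from main() and treating a return of 1 as a
skip rather than a failure would keep such a test usable on kernels that do
not allow this advice through process_madvise().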
Thanks,
SJ
[...]
end of thread, other threads:[~2025-05-17 20:25 UTC | newest]
Thread overview: 51+ messages
2025-05-15 13:33 [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
2025-05-15 13:33 ` [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process Usama Arif
2025-05-15 14:40 ` Lorenzo Stoakes
2025-05-15 14:44 ` David Hildenbrand
2025-05-15 14:56 ` Usama Arif
2025-05-15 14:58 ` David Hildenbrand
2025-05-15 15:18 ` Lorenzo Stoakes
2025-05-15 15:45 ` Liam R. Howlett
2025-05-15 15:57 ` David Hildenbrand
2025-05-15 16:38 ` Lorenzo Stoakes
2025-05-15 17:29 ` David Hildenbrand
2025-05-15 18:09 ` Liam R. Howlett
2025-05-15 18:21 ` Lorenzo Stoakes
2025-05-15 18:42 ` Zi Yan
2025-05-15 21:04 ` Lorenzo Stoakes
2025-05-15 18:46 ` Usama Arif
2025-05-15 19:20 ` David Hildenbrand
2025-05-15 15:28 ` Usama Arif
2025-05-15 16:06 ` Lorenzo Stoakes
2025-05-15 16:11 ` David Hildenbrand
2025-05-15 18:08 ` Lorenzo Stoakes
2025-05-15 19:12 ` David Hildenbrand
2025-05-15 20:35 ` Lorenzo Stoakes
2025-05-16 7:45 ` David Hildenbrand
2025-05-16 10:57 ` Lorenzo Stoakes
2025-05-16 11:24 ` David Hildenbrand
2025-05-16 12:57 ` Lorenzo Stoakes
2025-05-16 17:19 ` Usama Arif
2025-05-16 17:51 ` Lorenzo Stoakes
2025-05-16 19:34 ` Usama Arif
2025-05-17 16:20 ` Is number of process_madvise()-able ranges limited to 8? (was Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process) SeongJae Park
2025-05-17 18:50 ` Lorenzo Stoakes
2025-05-17 20:25 ` SeongJae Park
2025-05-17 19:01 ` [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process Lorenzo Stoakes
2025-05-15 16:47 ` Usama Arif
2025-05-15 18:36 ` Lorenzo Stoakes
2025-05-15 19:17 ` David Hildenbrand
2025-05-15 20:42 ` Lorenzo Stoakes
2025-05-16 6:12 ` kernel test robot
2025-05-15 13:33 ` [PATCH 2/6] prctl: introduce PR_THP_POLICY_DEFAULT_NOHUGE " Usama Arif
2025-05-16 8:19 ` kernel test robot
2025-05-15 13:33 ` [PATCH 3/6] prctl: introduce PR_THP_POLICY_SYSTEM " Usama Arif
2025-05-15 13:33 ` [PATCH 4/6] selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_NOHUGE Usama Arif
2025-05-15 13:33 ` [PATCH 5/6] selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE Usama Arif
2025-05-15 13:33 ` [PATCH 6/6] docs: transhuge: document process level THP controls Usama Arif
2025-05-15 13:55 ` [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY Lorenzo Stoakes
2025-05-15 14:50 ` Usama Arif
2025-05-15 15:15 ` Lorenzo Stoakes
2025-05-15 15:54 ` Usama Arif
2025-05-15 16:04 ` David Hildenbrand
2025-05-15 16:24 ` Lorenzo Stoakes