linux-mm.kvack.org archive mirror
* [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY
@ 2025-05-19 22:29 Usama Arif
  2025-05-19 22:29 ` [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise Usama Arif
                   ` (9 more replies)
  0 siblings, 10 replies; 25+ messages in thread
From: Usama Arif @ 2025-05-19 22:29 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
	lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, vbabka,
	jannh, Arnd Bergmann, linux-kernel, linux-doc, kernel-team,
	Usama Arif

This series allows changing the THP policy of a process according to the
value set in arg2. All of the policies are inherited during fork+exec:
- PR_DEFAULT_MADV_HUGEPAGE: This will set VM_HUGEPAGE and clear VM_NOHUGEPAGE
  for the default VMA flags. It will also iterate through every VMA in the
  process and call hugepage_madvise on it, with MADV_HUGEPAGE policy.
  This effectively allows setting MADV_HUGEPAGE on the entire process.
  In an environment where different types of workloads are run on the
  same machine, this will allow workloads that benefit from always having
  hugepages to do so, without regressing those that don't.
- PR_DEFAULT_MADV_NOHUGEPAGE: This will set VM_NOHUGEPAGE and clear VM_HUGEPAGE
  for the default VMA flags. It will also iterate through every VMA in the
  process and call hugepage_madvise on it, with MADV_NOHUGEPAGE policy.
  This effectively allows setting MADV_NOHUGEPAGE on the entire process.
  In an environment where different types of workloads are run on the
  same machine, this will allow workloads that benefit from having
  hugepages on an madvise basis only to do so, without regressing those
  that benefit from having hugepages always.
- PR_THP_POLICY_SYSTEM: This will reset (clear) both VM_HUGEPAGE and
  VM_NOHUGEPAGE in the process's default VMA flags.
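
As a rough userspace sketch of the interface described above: the option
and policy values are copied from the uapi header changes in patches 2-4
and only have this meaning on a kernel carrying this series. The helper
names (thp_policy_set, thp_policy_get, thp_policy_name) are hypothetical
and not part of the series.

```c
#include <sys/prctl.h>

/*
 * Values proposed by this series (include/uapi/linux/prctl.h, patches 2-4).
 * Assumption: they are only meaningful on a kernel with these patches.
 */
#ifndef PR_SET_THP_POLICY
#define PR_SET_THP_POLICY		78
#define PR_GET_THP_POLICY		79
#define PR_DEFAULT_MADV_HUGEPAGE	0
#define PR_DEFAULT_MADV_NOHUGEPAGE	1
#define PR_THP_POLICY_SYSTEM		2
#endif

/* Hypothetical helper: set the THP policy of the calling process. */
int thp_policy_set(unsigned long policy)
{
	/* arg3..arg5 must be zero; the series rejects anything else. */
	return prctl(PR_SET_THP_POLICY, policy, 0UL, 0UL, 0UL);
}

/* Hypothetical helper: query the policy; arg2..arg5 must all be zero. */
int thp_policy_get(void)
{
	return prctl(PR_GET_THP_POLICY, 0UL, 0UL, 0UL, 0UL);
}

/* Hypothetical helper: printable name for a policy value. */
const char *thp_policy_name(int policy)
{
	switch (policy) {
	case PR_DEFAULT_MADV_HUGEPAGE:
		return "PR_DEFAULT_MADV_HUGEPAGE";
	case PR_DEFAULT_MADV_NOHUGEPAGE:
		return "PR_DEFAULT_MADV_NOHUGEPAGE";
	case PR_THP_POLICY_SYSTEM:
		return "PR_THP_POLICY_SYSTEM";
	default:
		return "unknown";
	}
}
```

A caller would typically invoke thp_policy_set() once, early in the
process, and could log thp_policy_name(thp_policy_get()) to confirm what
the kernel recorded.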

In hyperscalers, we have a single THP policy for the entire fleet.
We have different types of workloads (e.g. AI/compute/databases/etc)
running on a single server.
Some of these workloads will benefit from always getting THP at fault
(or collapsed by khugepaged), some of them will benefit by only getting
them at madvise.

This series is useful for 2 use cases:
1) global system policy = madvise, while we want some workloads to get THPs
at fault and by khugepaged: some processes (e.g. AI workloads) benefit
from getting THPs at fault (and collapsed by khugepaged). Other workloads
like databases will incur a regression (either a performance regression, or
they are completely memory bound and even a very slight increase in memory
will cause them to OOM). So what these patches will do is allow setting
prctl(PR_DEFAULT_MADV_HUGEPAGE) on the AI workloads. (This is how
workloads are deployed in our (Meta's/Facebook's) fleet at this moment.)

2) global system policy = always, while we want some workloads to get THPs
only on an madvise basis: same reason as 1). What these patches
will do is allow setting prctl(PR_DEFAULT_MADV_NOHUGEPAGE) on the database
workloads. (We hope this is us (Meta) in the near future: if a majority of
workloads show that they benefit from always, we flip the default host
setting to "always" across the fleet, and workloads that regress can opt out
and be "madvise". New services developed will then be tested with always by
default. "always" is also the default defconfig option upstream, so I would
imagine this is faced by others as well.)
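
Because the policy is inherited across fork+exec, either use case could be
served by a small launcher that sets the policy and then execs the
unmodified workload. The following is only a sketch under that assumption:
the function names and the "hugepage"/"nohugepage"/"system" keywords are
hypothetical, and the prctl values are the ones proposed in this series.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/prctl.h>

/* Values proposed by this series; assumed, not in released headers. */
#ifndef PR_SET_THP_POLICY
#define PR_SET_THP_POLICY		78
#define PR_GET_THP_POLICY		79
#define PR_DEFAULT_MADV_HUGEPAGE	0
#define PR_DEFAULT_MADV_NOHUGEPAGE	1
#define PR_THP_POLICY_SYSTEM		2
#endif

/* Hypothetical: map a launcher keyword to a policy value, -1 if unknown. */
int parse_thp_policy(const char *word)
{
	if (!strcmp(word, "hugepage"))
		return PR_DEFAULT_MADV_HUGEPAGE;
	if (!strcmp(word, "nohugepage"))
		return PR_DEFAULT_MADV_NOHUGEPAGE;
	if (!strcmp(word, "system"))
		return PR_THP_POLICY_SYSTEM;
	return -1;
}

/*
 * Hypothetical: set the policy on this process, then replace it with the
 * workload. Only returns (with -1) if prctl() or execvp() fails.
 */
int exec_with_thp_policy(int policy, char *const argv[])
{
	if (prctl(PR_SET_THP_POLICY, (unsigned long)policy, 0UL, 0UL, 0UL) != 0) {
		perror("prctl(PR_SET_THP_POLICY)");
		return -1;
	}
	execvp(argv[0], argv);
	perror("execvp");
	return -1;
}
```

With such a launcher, an AI workload on a "madvise" host would be started
with the "hugepage" keyword, and a database on an "always" host with
"nohugepage"; the workload binaries themselves need no changes.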

v2->v3: (Thanks Lorenzo for all the below feedback!)
v2: https://lore.kernel.org/all/20250515133519.2779639-1-usamaarif642@gmail.com/
- no more flags2.
- no more MMF2_...
- renamed policy to PR_DEFAULT_MADV_(NO)HUGEPAGE
- mmap_write_lock_killable acquired in PR_GET_THP_POLICY
- mmap_write lock fixed in PR_SET_THP_POLICY
- mmap assert check in process_default_madv_hugepage
- check if hugepage_global_enabled is enabled in the call and account for s390
- set mm->def_flags VM_HUGEPAGE and VM_NOHUGEPAGE according to the policy in
  the way done by madvise(). I believe VM merge will not be broken in
  this way.
- process_default_madv_hugepage function that does for_each_vma and calls
  hugepage_madvise.

v1->v2:
- change from modifying the THP decision making for the process, to modifying
  VMA flags only. This prevents further complicating the logic used to
  determine THP order (Thanks David!)
- change from using a prctl per policy change to just using PR_SET_THP_POLICY
  and arg2 to set the policy. (Zi Yan)
- Introduce PR_THP_POLICY_DEFAULT_NOHUGE and PR_THP_POLICY_DEFAULT_SYSTEM
- Add selftests and documentation.
 
Usama Arif (7):
  mm: khugepaged: extract vm flag setting outside of hugepage_madvise
  prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
  prctl: introduce PR_DEFAULT_MADV_NOHUGEPAGE for the process
  prctl: introduce PR_THP_POLICY_SYSTEM for the process
  selftests: prctl: introduce tests for PR_DEFAULT_MADV_NOHUGEPAGE
  selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
  docs: transhuge: document process level THP controls

 Documentation/admin-guide/mm/transhuge.rst    |  42 +++
 include/linux/huge_mm.h                       |   2 +
 include/linux/mm.h                            |   2 +-
 include/linux/mm_types.h                      |   4 +-
 include/uapi/linux/prctl.h                    |   6 +
 kernel/sys.c                                  |  53 ++++
 mm/huge_memory.c                              |  13 +
 mm/khugepaged.c                               |  26 +-
 tools/include/uapi/linux/prctl.h              |   6 +
 .../trace/beauty/include/uapi/linux/prctl.h   |   6 +
 tools/testing/selftests/prctl/Makefile        |   2 +-
 tools/testing/selftests/prctl/thp_policy.c    | 286 ++++++++++++++++++
 12 files changed, 436 insertions(+), 12 deletions(-)
 create mode 100644 tools/testing/selftests/prctl/thp_policy.c

-- 
2.47.1




* [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise
  2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
@ 2025-05-19 22:29 ` Usama Arif
  2025-05-20  9:51   ` kernel test robot
  2025-05-20 14:43   ` Lorenzo Stoakes
  2025-05-19 22:29 ` [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process Usama Arif
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 25+ messages in thread
From: Usama Arif @ 2025-05-19 22:29 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
	lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, vbabka,
	jannh, Arnd Bergmann, linux-kernel, linux-doc, kernel-team,
	Usama Arif

This is so that flag setting can be reused later in other functions,
to reduce code duplication (including the s390 exception).

No functional change intended with this patch.

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 include/linux/huge_mm.h |  1 +
 mm/khugepaged.c         | 26 +++++++++++++++++---------
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..23580a43787c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -431,6 +431,7 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			__split_huge_pud(__vma, __pud, __address);	\
 	}  while (0)
 
+int hugepage_set_vmflags(unsigned long *vm_flags, int advice);
 int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
 		     int advice);
 int madvise_collapse(struct vm_area_struct *vma,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b04b6a770afe..ab3427c87422 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -346,8 +346,7 @@ struct attribute_group khugepaged_attr_group = {
 };
 #endif /* CONFIG_SYSFS */
 
-int hugepage_madvise(struct vm_area_struct *vma,
-		     unsigned long *vm_flags, int advice)
+int hugepage_set_vmflags(unsigned long *vm_flags, int advice)
 {
 	switch (advice) {
 	case MADV_HUGEPAGE:
@@ -358,16 +357,10 @@ int hugepage_madvise(struct vm_area_struct *vma,
 		 * ignore the madvise to prevent qemu from causing a SIGSEGV.
 		 */
 		if (mm_has_pgste(vma->vm_mm))
-			return 0;
+			return -EPERM;
 #endif
 		*vm_flags &= ~VM_NOHUGEPAGE;
 		*vm_flags |= VM_HUGEPAGE;
-		/*
-		 * If the vma become good for khugepaged to scan,
-		 * register it here without waiting a page fault that
-		 * may not happen any time soon.
-		 */
-		khugepaged_enter_vma(vma, *vm_flags);
 		break;
 	case MADV_NOHUGEPAGE:
 		*vm_flags &= ~VM_HUGEPAGE;
@@ -383,6 +376,21 @@ int hugepage_madvise(struct vm_area_struct *vma,
 	return 0;
 }
 
+int hugepage_madvise(struct vm_area_struct *vma,
+		     unsigned long *vm_flags, int advice)
+{
+	if (advice == MADV_HUGEPAGE && !hugepage_set_vmflags(vm_flags, advice)) {
+		/*
+		 * If the vma become good for khugepaged to scan,
+		 * register it here without waiting a page fault that
+		 * may not happen any time soon.
+		 */
+		khugepaged_enter_vma(vma, *vm_flags);
+	}
+
+	return 0;
+}
+
 int __init khugepaged_init(void)
 {
 	mm_slot_cache = KMEM_CACHE(khugepaged_mm_slot, 0);
-- 
2.47.1




* [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
  2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
  2025-05-19 22:29 ` [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise Usama Arif
@ 2025-05-19 22:29 ` Usama Arif
  2025-05-19 23:01   ` Jann Horn
  2025-05-20  8:48   ` kernel test robot
  2025-05-19 22:29 ` [PATCH v3 3/7] prctl: introduce PR_DEFAULT_MADV_NOHUGEPAGE " Usama Arif
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 25+ messages in thread
From: Usama Arif @ 2025-05-19 22:29 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
	lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, vbabka,
	jannh, Arnd Bergmann, linux-kernel, linux-doc, kernel-team,
	Usama Arif

This is set via the new PR_SET_THP_POLICY prctl. It has 2 effects:
- It sets VM_HUGEPAGE and clears VM_NOHUGEPAGE on the default VMA flags
  (def_flags). This means that every new VMA will be considered for
  hugepage.
- It iterates through every VMA in the process and calls hugepage_madvise
  on it, with MADV_HUGEPAGE policy.
The policy is inherited during fork+exec.

This effectively allows setting MADV_HUGEPAGE on the entire process.
In an environment where different types of workloads are run on the
same machine, this will allow workloads that benefit from always having
hugepages to do so, without regressing those that don't.
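
The intended effect on the flags can be illustrated with a small userspace
model. This is not the kernel code: the toy_* names and bit values are made
up for the illustration, and the model only captures the behaviour described
in this message (update the defaults, update every existing VMA, and let new
VMAs start from the defaults).

```c
#include <stddef.h>

/* Illustrative bit values for this model only; not the kernel's. */
#define TOY_VM_HUGEPAGE		(1UL << 0)
#define TOY_VM_NOHUGEPAGE	(1UL << 1)
#define TOY_MAX_VMAS		8

enum toy_advice { TOY_MADV_HUGEPAGE, TOY_MADV_NOHUGEPAGE };

struct toy_mm {
	unsigned long def_flags;		/* flags given to new VMAs */
	unsigned long vma_flags[TOY_MAX_VMAS];	/* flags of existing VMAs */
	size_t nr_vmas;
};

/* Set one hint and clear the opposite one, the way madvise() does. */
void toy_set_vmflags(unsigned long *flags, enum toy_advice advice)
{
	if (advice == TOY_MADV_HUGEPAGE) {
		*flags &= ~TOY_VM_NOHUGEPAGE;
		*flags |= TOY_VM_HUGEPAGE;
	} else {
		*flags &= ~TOY_VM_HUGEPAGE;
		*flags |= TOY_VM_NOHUGEPAGE;
	}
}

/* Apply the advice to the defaults and to every existing VMA. */
void toy_set_process_policy(struct toy_mm *mm, enum toy_advice advice)
{
	toy_set_vmflags(&mm->def_flags, advice);
	for (size_t i = 0; i < mm->nr_vmas; i++)
		toy_set_vmflags(&mm->vma_flags[i], advice);
}

/* A new mapping starts from the process defaults; returns its index. */
int toy_mmap(struct toy_mm *mm)
{
	if (mm->nr_vmas == TOY_MAX_VMAS)
		return -1;
	mm->vma_flags[mm->nr_vmas] = mm->def_flags;
	return (int)mm->nr_vmas++;
}
```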

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 include/linux/huge_mm.h                       |  1 +
 include/linux/mm.h                            |  2 +-
 include/linux/mm_types.h                      |  4 ++-
 include/uapi/linux/prctl.h                    |  4 +++
 kernel/sys.c                                  | 29 +++++++++++++++++++
 mm/huge_memory.c                              | 13 +++++++++
 tools/include/uapi/linux/prctl.h              |  4 +++
 .../trace/beauty/include/uapi/linux/prctl.h   |  4 +++
 8 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 23580a43787c..b24a2e0ae642 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -431,6 +431,7 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			__split_huge_pud(__vma, __pud, __address);	\
 	}  while (0)
 
+void process_default_madv_hugepage(struct mm_struct *mm, int advice);
 int hugepage_set_vmflags(unsigned long *vm_flags, int advice);
 int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
 		     int advice);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 43748c8f3454..436f4588bce8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -466,7 +466,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB)
 
 /* This mask defines which mm->def_flags a process can inherit its parent */
-#define VM_INIT_DEF_MASK	VM_NOHUGEPAGE
+#define VM_INIT_DEF_MASK	(VM_HUGEPAGE | VM_NOHUGEPAGE)
 
 /* This mask represents all the VMA flag bits used by mlock */
 #define VM_LOCKED_MASK	(VM_LOCKED | VM_LOCKONFAULT)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e76bade9ebb1..f1836b7c5704 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1703,6 +1703,7 @@ enum {
 					/* leave room for more dump flags */
 #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
 #define MMF_VM_HUGEPAGE		17	/* set when mm is available for khugepaged */
+#define MMF_VM_HUGEPAGE_MASK	(1 << MMF_VM_HUGEPAGE)
 
 /*
  * This one-shot flag is dropped due to necessity of changing exe once again
@@ -1742,7 +1743,8 @@ enum {
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
 				 MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
-				 MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
+				 MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK |\
+				 MMF_VM_HUGEPAGE_MASK)
 
 static inline unsigned long mmf_init_flags(unsigned long flags)
 {
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 15c18ef4eb11..15aaa4db5ff8 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -364,4 +364,8 @@ struct prctl_mm_map {
 # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
 # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
 
+#define PR_SET_THP_POLICY		78
+#define PR_GET_THP_POLICY		79
+#define PR_DEFAULT_MADV_HUGEPAGE	0
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index c434968e9f5d..74397ace62f3 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2474,6 +2474,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
 	struct task_struct *me = current;
+	struct mm_struct *mm = me->mm;
 	unsigned char comm[sizeof(me->comm)];
 	long error;
 
@@ -2658,6 +2659,34 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
 		mmap_write_unlock(me->mm);
 		break;
+	case PR_GET_THP_POLICY:
+		if (arg2 || arg3 || arg4 || arg5)
+			return -EINVAL;
+		if (mmap_write_lock_killable(mm))
+			return -EINTR;
+		if (mm->def_flags & VM_HUGEPAGE)
+			error = PR_DEFAULT_MADV_HUGEPAGE;
+		mmap_write_unlock(mm);
+		break;
+	case PR_SET_THP_POLICY:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		if (mmap_write_lock_killable(mm))
+			return -EINTR;
+		switch (arg2) {
+		case PR_DEFAULT_MADV_HUGEPAGE:
+			error = hugepage_global_enabled() ?
+				hugepage_set_vmflags(&mm->def_flags, MADV_HUGEPAGE) :
+				-EPERM;
+			if (!error)
+				process_default_madv_hugepage(mm, MADV_HUGEPAGE);
+			break;
+		default:
+			error = -EINVAL;
+			break;
+		}
+		mmap_write_unlock(mm);
+		break;
 	case PR_MPX_ENABLE_MANAGEMENT:
 	case PR_MPX_DISABLE_MANAGEMENT:
 		/* No longer implemented: */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2780a12b25f0..72806fe772b5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -98,6 +98,19 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
 	return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
 }
 
+void process_default_madv_hugepage(struct mm_struct *mm, int advice)
+{
+	struct vm_area_struct *vma;
+	unsigned long vm_flags;
+
+	mmap_assert_write_locked(mm);
+	VMA_ITERATOR(vmi, mm, 0);
+	for_each_vma(vmi, vma) {
+		vm_flags = vma->vm_flags;
+		hugepage_madvise(vma, &vm_flags, advice);
+	}
+}
+
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 unsigned long vm_flags,
 					 unsigned long tva_flags,
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 35791791a879..f5945ebfe3f2 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -328,4 +328,8 @@ struct prctl_mm_map {
 # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC	0x10 /* Clear the aspect on exec */
 # define PR_PPC_DEXCR_CTRL_MASK		0x1f
 
+#define PR_SET_THP_POLICY		78
+#define PR_GET_THP_POLICY		79
+#define PR_THP_POLICY_DEFAULT_HUGE	0
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
index 15c18ef4eb11..325c72f40a93 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
@@ -364,4 +364,8 @@ struct prctl_mm_map {
 # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
 # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
 
+#define PR_SET_THP_POLICY		78
+#define PR_GET_THP_POLICY		79
+#define PR_THP_POLICY_DEFAULT_HUGE	0
+
 #endif /* _LINUX_PRCTL_H */
-- 
2.47.1




* [PATCH v3 3/7] prctl: introduce PR_DEFAULT_MADV_NOHUGEPAGE for the process
  2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
  2025-05-19 22:29 ` [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise Usama Arif
  2025-05-19 22:29 ` [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process Usama Arif
@ 2025-05-19 22:29 ` Usama Arif
  2025-05-19 22:29 ` [PATCH v3 4/7] prctl: introduce PR_THP_POLICY_SYSTEM " Usama Arif
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Usama Arif @ 2025-05-19 22:29 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
	lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, vbabka,
	jannh, Arnd Bergmann, linux-kernel, linux-doc, kernel-team,
	Usama Arif

This is set via the new PR_SET_THP_POLICY prctl. It has 2 effects:
- It sets VM_NOHUGEPAGE and clears VM_HUGEPAGE on the default VMA
  flags (def_flags). This means that every new VMA will not be
  considered for hugepage by default.
- It iterates through every VMA in the process and calls hugepage_madvise
  on it, with MADV_NOHUGEPAGE policy.
The policy is inherited during fork+exec.

This effectively allows setting MADV_NOHUGEPAGE on the entire process.
In an environment where different types of workloads are stacked on the
same machine, this will allow workloads that benefit from having
hugepages on an madvise basis only to do so, without regressing those
that benefit from having hugepages always.

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 include/uapi/linux/prctl.h                         | 1 +
 kernel/sys.c                                       | 7 +++++++
 tools/include/uapi/linux/prctl.h                   | 1 +
 tools/perf/trace/beauty/include/uapi/linux/prctl.h | 1 +
 4 files changed, 10 insertions(+)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 15aaa4db5ff8..33a6ef6a5a72 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -367,5 +367,6 @@ struct prctl_mm_map {
 #define PR_SET_THP_POLICY		78
 #define PR_GET_THP_POLICY		79
 #define PR_DEFAULT_MADV_HUGEPAGE	0
+#define PR_DEFAULT_MADV_NOHUGEPAGE	1
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 74397ace62f3..6bb28b3666f7 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2666,6 +2666,8 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			return -EINTR;
 		if (mm->def_flags & VM_HUGEPAGE)
 			error = PR_DEFAULT_MADV_HUGEPAGE;
+		else if (mm->def_flags & VM_NOHUGEPAGE)
+			error = PR_DEFAULT_MADV_NOHUGEPAGE;
 		mmap_write_unlock(mm);
 		break;
 	case PR_SET_THP_POLICY:
@@ -2681,6 +2683,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			if (!error)
 				process_default_madv_hugepage(mm, MADV_HUGEPAGE);
 			break;
+		case PR_DEFAULT_MADV_NOHUGEPAGE:
+			error = hugepage_set_vmflags(&mm->def_flags, MADV_NOHUGEPAGE);
+			if (!error)
+				process_default_madv_hugepage(mm, MADV_NOHUGEPAGE);
+			break;
 		default:
 			error = -EINVAL;
 			break;
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index f5945ebfe3f2..e03d0ed890c5 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -331,5 +331,6 @@ struct prctl_mm_map {
 #define PR_SET_THP_POLICY		78
 #define PR_GET_THP_POLICY		79
 #define PR_THP_POLICY_DEFAULT_HUGE	0
+#define PR_THP_POLICY_DEFAULT_NOHUGE	1
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
index 325c72f40a93..d25458f4db9e 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
@@ -367,5 +367,6 @@ struct prctl_mm_map {
 #define PR_SET_THP_POLICY		78
 #define PR_GET_THP_POLICY		79
 #define PR_THP_POLICY_DEFAULT_HUGE	0
+#define PR_THP_POLICY_DEFAULT_NOHUGE	1
 
 #endif /* _LINUX_PRCTL_H */
-- 
2.47.1




* [PATCH v3 4/7] prctl: introduce PR_THP_POLICY_SYSTEM for the process
  2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
                   ` (2 preceding siblings ...)
  2025-05-19 22:29 ` [PATCH v3 3/7] prctl: introduce PR_DEFAULT_MADV_NOHUGEPAGE " Usama Arif
@ 2025-05-19 22:29 ` Usama Arif
  2025-05-19 22:29 ` [PATCH v3 5/7] selftests: prctl: introduce tests for PR_DEFAULT_MADV_NOHUGEPAGE Usama Arif
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Usama Arif @ 2025-05-19 22:29 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
	lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, vbabka,
	jannh, Arnd Bergmann, linux-kernel, linux-doc, kernel-team,
	Usama Arif

This is set via the new PR_SET_THP_POLICY prctl.
This will clear VM_HUGEPAGE and VM_NOHUGEPAGE in mm->def_flags
to reset the VMA hugepage policy to the system-wide setting
(except in the case of s390 where pgstes are switched
on for the userspace process, in which case it will only
clear VM_HUGEPAGE).
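
The reset and the way PR_GET_THP_POLICY reports the result can be modelled
in userspace as follows. This is a sketch only: the toy_* names and bit
values are invented for the illustration, and has_pgste stands in for the
s390 mm_has_pgste() check.

```c
#include <stdbool.h>

/* Illustrative bit values for this model only; not the kernel's. */
#define TOY_VM_HUGEPAGE		(1UL << 0)
#define TOY_VM_NOHUGEPAGE	(1UL << 1)

enum toy_policy {
	TOY_POLICY_MADV_HUGEPAGE,
	TOY_POLICY_MADV_NOHUGEPAGE,
	TOY_POLICY_SYSTEM,
};

/* Mirror of the GET logic: VM_HUGEPAGE wins, then VM_NOHUGEPAGE. */
enum toy_policy toy_get_policy(unsigned long def_flags)
{
	if (def_flags & TOY_VM_HUGEPAGE)
		return TOY_POLICY_MADV_HUGEPAGE;
	if (def_flags & TOY_VM_NOHUGEPAGE)
		return TOY_POLICY_MADV_NOHUGEPAGE;
	return TOY_POLICY_SYSTEM;
}

/*
 * Reset to the system policy. With pgstes switched on (s390 KVM case),
 * VM_NOHUGEPAGE must stay set, so only VM_HUGEPAGE is cleared.
 */
unsigned long toy_reset_to_system(unsigned long def_flags, bool has_pgste)
{
	if (has_pgste)
		return def_flags & ~TOY_VM_HUGEPAGE;
	return def_flags & ~(TOY_VM_HUGEPAGE | TOY_VM_NOHUGEPAGE);
}
```

One consequence visible in the model: for a process with pgstes, a query
after the reset still reports the NOHUGEPAGE policy, because VM_NOHUGEPAGE
is deliberately left in place.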

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 include/uapi/linux/prctl.h                      |  1 +
 kernel/sys.c                                    | 17 +++++++++++++++++
 tools/include/uapi/linux/prctl.h                |  1 +
 .../trace/beauty/include/uapi/linux/prctl.h     |  1 +
 4 files changed, 20 insertions(+)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 33a6ef6a5a72..508d78bc3364 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -368,5 +368,6 @@ struct prctl_mm_map {
 #define PR_GET_THP_POLICY		79
 #define PR_DEFAULT_MADV_HUGEPAGE	0
 #define PR_DEFAULT_MADV_NOHUGEPAGE	1
+#define PR_THP_POLICY_SYSTEM		2
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 6bb28b3666f7..cffb60632d97 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2668,6 +2668,8 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			error = PR_DEFAULT_MADV_HUGEPAGE;
 		else if (mm->def_flags & VM_NOHUGEPAGE)
 			error = PR_DEFAULT_MADV_NOHUGEPAGE;
+		else
+			error = PR_THP_POLICY_SYSTEM;
 		mmap_write_unlock(mm);
 		break;
 	case PR_SET_THP_POLICY:
@@ -2688,6 +2690,21 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			if (!error)
 				process_default_madv_hugepage(mm, MADV_NOHUGEPAGE);
 			break;
+		case PR_THP_POLICY_SYSTEM:
+#ifdef CONFIG_S390
+			/*
+			 * When s390 switches on pgstes for its userspace
+			 * process (for kvm), it sets VM_NOHUGEPAGE.
+			 * Do not clear it with system policy.
+			 */
+			if (mm_has_pgste(mm))
+				mm->def_flags &= ~VM_HUGEPAGE;
+			else
+				mm->def_flags &= ~(VM_HUGEPAGE | VM_NOHUGEPAGE);
+#else
+			mm->def_flags &= ~(VM_HUGEPAGE | VM_NOHUGEPAGE);
+#endif
+			break;
 		default:
 			error = -EINVAL;
 			break;
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index e03d0ed890c5..cc209c9a8afb 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -332,5 +332,6 @@ struct prctl_mm_map {
 #define PR_GET_THP_POLICY		79
 #define PR_THP_POLICY_DEFAULT_HUGE	0
 #define PR_THP_POLICY_DEFAULT_NOHUGE	1
+#define PR_THP_POLICY_SYSTEM		2
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
index d25458f4db9e..340d5ff769a9 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
@@ -368,5 +368,6 @@ struct prctl_mm_map {
 #define PR_GET_THP_POLICY		79
 #define PR_THP_POLICY_DEFAULT_HUGE	0
 #define PR_THP_POLICY_DEFAULT_NOHUGE	1
+#define PR_THP_POLICY_SYSTEM		2
 
 #endif /* _LINUX_PRCTL_H */
-- 
2.47.1




* [PATCH v3 5/7] selftests: prctl: introduce tests for PR_DEFAULT_MADV_NOHUGEPAGE
  2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
                   ` (3 preceding siblings ...)
  2025-05-19 22:29 ` [PATCH v3 4/7] prctl: introduce PR_THP_POLICY_SYSTEM " Usama Arif
@ 2025-05-19 22:29 ` Usama Arif
  2025-05-19 22:29 ` [PATCH v3 6/7] selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE Usama Arif
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Usama Arif @ 2025-05-19 22:29 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
	lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, vbabka,
	jannh, Arnd Bergmann, linux-kernel, linux-doc, kernel-team,
	Usama Arif

The test is limited to 2M PMD THPs. It does not modify the system
settings in order to not disturb other processes running in the system.
It checks if the PMD size is 2M, if the 2M policy is set to inherit
and if the system global THP policy is set to "always", so that
the change in behaviour due to PR_DEFAULT_MADV_NOHUGEPAGE can
be seen.

This tests that:
- the process can successfully set the policy
- the policy is carried over to the new process with fork
- no hugepage is obtained when the process doesn't MADV_HUGEPAGE
- a hugepage is obtained when the process does MADV_HUGEPAGE
- the process can successfully reset the policy to PR_THP_POLICY_SYSTEM
- a hugepage is obtained after the policy reset
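
The sysfs and smaps checks the test relies on boil down to two small
parsing steps, sketched below. The function names are hypothetical and are
not the helpers used in thp_policy.c: the first extracts the bracketed
(active) choice from a line such as "always [madvise] never", the second
reads the AnonHugePages value of an smaps line in kB.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed (active) choice of a sysfs line such as
 * "always [madvise] never" into out. Returns 0 on success, -1 if the
 * line has no bracketed choice or out is too small.
 */
int thp_active_setting(const char *line, char *out, size_t out_len)
{
	const char *open = strchr(line, '[');
	const char *close = open ? strchr(open, ']') : NULL;
	size_t len;

	if (!open || !close || out_len == 0)
		return -1;
	len = (size_t)(close - open - 1);
	if (len >= out_len)
		return -1;
	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 0;
}

/*
 * Parse an smaps line of the form "AnonHugePages:     24576 kB".
 * Returns the size in kB, or -1 if the line is a different field.
 */
long smaps_anon_huge_kb(const char *line)
{
	long kb;

	if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
		return kb;
	return -1;
}
```

For the 12 x 2MB buffer used by the test, a fully THP-backed mapping shows
up as 24576 kB in AnonHugePages.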

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 tools/testing/selftests/prctl/Makefile     |   2 +-
 tools/testing/selftests/prctl/thp_policy.c | 214 +++++++++++++++++++++
 2 files changed, 215 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/prctl/thp_policy.c

diff --git a/tools/testing/selftests/prctl/Makefile b/tools/testing/selftests/prctl/Makefile
index 01dc90fbb509..ee8c98e45b53 100644
--- a/tools/testing/selftests/prctl/Makefile
+++ b/tools/testing/selftests/prctl/Makefile
@@ -5,7 +5,7 @@ ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)
 
 ifeq ($(ARCH),x86)
 TEST_PROGS := disable-tsc-ctxt-sw-stress-test disable-tsc-on-off-stress-test \
-		disable-tsc-test set-anon-vma-name-test set-process-name
+		disable-tsc-test set-anon-vma-name-test set-process-name thp_policy
 all: $(TEST_PROGS)
 
 include ../lib.mk
diff --git a/tools/testing/selftests/prctl/thp_policy.c b/tools/testing/selftests/prctl/thp_policy.c
new file mode 100644
index 000000000000..7791d282f7c8
--- /dev/null
+++ b/tools/testing/selftests/prctl/thp_policy.c
@@ -0,0 +1,214 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This test covers the PR_GET/SET_THP_POLICY functionality of prctl calls
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+
+#ifndef PR_SET_THP_POLICY
+#define PR_SET_THP_POLICY		78
+#define PR_GET_THP_POLICY		79
+#define PR_DEFAULT_MADV_HUGEPAGE	0
+#define PR_DEFAULT_MADV_NOHUGEPAGE	1
+#define PR_DEFAULT_SYSTEM		2
+#endif
+
+#define CONTENT_SIZE 256
+#define BUF_SIZE (12 * 2 * 1024 * 1024) // 12 x 2MB pages
+
+enum system_policy {
+	SYSTEM_POLICY_ALWAYS,
+	SYSTEM_POLICY_MADVISE,
+	SYSTEM_POLICY_NEVER,
+};
+
+int system_thp_policy;
+
+/* check if the sysfs file contains the expected substring */
+static int check_file_content(const char *file_path, const char *expected_substring)
+{
+	FILE *file = fopen(file_path, "r");
+	char buffer[CONTENT_SIZE];
+
+	if (!file) {
+		perror("Failed to open file");
+		return -1;
+	}
+	if (fgets(buffer, CONTENT_SIZE, file) == NULL) {
+		perror("Failed to read file");
+		fclose(file);
+		return -1;
+	}
+	fclose(file);
+	// Remove newline character from the buffer
+	buffer[strcspn(buffer, "\n")] = '\0';
+	if (strstr(buffer, expected_substring))
+		return 0;
+	else
+		return 1;
+}
+
+/*
+ * The test is designed for 2M hugepages only.
+ * Check if hugepage size is 2M, if 2M size inherits from global
+ * setting, and if the global setting is madvise or always.
+ */
+static int sysfs_check(void)
+{
+	int res = 0;
+
+	res = check_file_content("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "2097152");
+	if (res) {
+		printf("hpage_pmd_size is not set to 2MB. Skipping test.\n");
+		return -1;
+	}
+	res |= check_file_content("/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled",
+				  "[inherit]");
+	if (res) {
+		printf("hugepages-2048kB does not inherit global setting. Skipping test.\n");
+		return -1;
+	}
+
+	res = check_file_content("/sys/kernel/mm/transparent_hugepage/enabled", "[madvise]");
+	if (!res) {
+		system_thp_policy = SYSTEM_POLICY_MADVISE;
+		return 0;
+	}
+	res = check_file_content("/sys/kernel/mm/transparent_hugepage/enabled", "[always]");
+	if (!res) {
+		system_thp_policy = SYSTEM_POLICY_ALWAYS;
+		return 0;
+	}
+	printf("Global THP policy not set to madvise or always. Skipping test.\n");
+	return -1;
+}
+
+static int check_smaps_for_huge(void)
+{
+	FILE *file = fopen("/proc/self/smaps", "r");
+	int is_anonhuge = 0;
+	char line[256];
+
+	if (!file) {
+		perror("fopen");
+		return -1;
+	}
+
+	while (fgets(line, sizeof(line), file)) {
+		if (strstr(line, "AnonHugePages:") && strstr(line, "24576 kB")) {
+			is_anonhuge = 1;
+			break;
+		}
+	}
+	fclose(file);
+	return is_anonhuge;
+}
+
+static int test_mmap_thp(int madvise_buffer)
+{
+	int is_anonhuge;
+
+	char *buffer = (char *)mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
+				    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (buffer == MAP_FAILED) {
+		perror("mmap");
+		return -1;
+	}
+	if (madvise_buffer)
+		madvise(buffer, BUF_SIZE, MADV_HUGEPAGE);
+
+	// set memory to ensure it's allocated
+	memset(buffer, 0, BUF_SIZE);
+	is_anonhuge = check_smaps_for_huge();
+	munmap(buffer, BUF_SIZE);
+	return is_anonhuge;
+}
+
+/* Global policy is always, process is changed to NOHUGE (process becomes madvise) */
+static int test_global_always_process_nohuge(void)
+{
+	int is_anonhuge = 0, res = 0, status = 0;
+	pid_t pid;
+
+	if (prctl(PR_SET_THP_POLICY, PR_DEFAULT_MADV_NOHUGEPAGE, NULL, NULL, NULL) != 0) {
+		perror("prctl failed to set policy to madvise");
+		return -1;
+	}
+
+	/* Make sure prctl changes are carried across fork */
+	pid = fork();
+	if (pid < 0) {
+		perror("fork");
+		exit(EXIT_FAILURE);
+	}
+
+	res = prctl(PR_GET_THP_POLICY, NULL, NULL, NULL, NULL);
+	if (res != PR_DEFAULT_MADV_NOHUGEPAGE) {
+		printf("prctl PR_GET_THP_POLICY returned %d pid %d\n", res, pid);
+		goto err_out;
+	}
+
+	/* global = always, process = madvise, we shouldn't get HPs without madvise */
+	is_anonhuge = test_mmap_thp(0);
+	if (is_anonhuge) {
+		printf(
+		"PR_DEFAULT_MADV_NOHUGEPAGE set but still got hugepages without MADV_HUGEPAGE\n");
+		goto err_out;
+	}
+
+	is_anonhuge = test_mmap_thp(1);
+	if (!is_anonhuge) {
+		printf(
+		"PR_DEFAULT_MADV_NOHUGEPAGE set but didn't get hugepages with MADV_HUGEPAGE\n");
+		goto err_out;
+	}
+
+	/* Reset to system policy */
+	if (prctl(PR_SET_THP_POLICY, PR_DEFAULT_SYSTEM, NULL, NULL, NULL) != 0) {
+		perror("prctl failed to set policy to system");
+		goto err_out;
+	}
+
+	is_anonhuge = test_mmap_thp(0);
+	if (!is_anonhuge) {
+		printf("global policy is always but we still didn't get hugepages\n");
+		goto err_out;
+	}
+
+	is_anonhuge = test_mmap_thp(1);
+	if (!is_anonhuge) {
+		printf("global policy is always but we still didn't get hugepages\n");
+		goto err_out;
+	}
+
+	if (pid == 0) {
+		exit(EXIT_SUCCESS);
+	} else {
+		wait(&status);
+		if (WIFEXITED(status))
+			return 0;
+		else
+			return -1;
+	}
+
+err_out:
+	if (pid == 0)
+		exit(EXIT_FAILURE);
+	else
+		return -1;
+}
+
+int main(void)
+{
+	if (sysfs_check())
+		return 0;
+
+	if (system_thp_policy == SYSTEM_POLICY_ALWAYS)
+		return test_global_always_process_nohuge();
+
+}
-- 
2.47.1
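
A note on the smaps check used above: check_smaps_for_huge() matches the
literal string "24576 kB", so it depends on the mapping size and on the exact
field formatting. A minimal sketch of a numeric alternative follows, assuming
the usual "AnonHugePages:   N kB" line layout of /proc/self/smaps. The helper
name parse_anon_huge_kb is hypothetical and not part of the patch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: return the value in kB from an
 * smaps line of the form "AnonHugePages:   24576 kB", or -1 if the line is
 * not an AnonHugePages line or cannot be parsed.
 */
static long parse_anon_huge_kb(const char *line)
{
	static const char key[] = "AnonHugePages:";
	long kb;

	/* Only accept lines that begin with the AnonHugePages key */
	if (strncmp(line, key, sizeof(key) - 1) != 0)
		return -1;
	/* %ld skips the padding spaces that precede the number */
	if (sscanf(line + sizeof(key) - 1, "%ld kB", &kb) != 1)
		return -1;
	return kb;
}
```

A caller could compare the returned value against the buffer size in kB
instead of a fixed string. Since smaps prints one AnonHugePages line per VMA,
the scan would still have to pick out the VMA of interest.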



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 6/7] selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
  2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
                   ` (4 preceding siblings ...)
  2025-05-19 22:29 ` [PATCH v3 5/7] selftests: prctl: introduce tests for PR_DEFAULT_MADV_NOHUGEPAGE Usama Arif
@ 2025-05-19 22:29 ` Usama Arif
  2025-05-19 22:29 ` [PATCH v3 7/7] docs: transhuge: document process level THP controls Usama Arif
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Usama Arif @ 2025-05-19 22:29 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
	lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, vbabka,
	jannh, Arnd Bergmann, linux-kernel, linux-doc, kernel-team,
	Usama Arif

The test is limited to 2M PMD THPs. It does not modify the system
settings in order to not disturb other processes running in the
system.
It runs if the PMD size is 2M, if the 2M policy is set to inherit
and if the system global THP policy is set to "madvise", so that
the change in behaviour due to PR_THP_POLICY_DEFAULT_HUGE can
be seen.

This tests that:
- the process can successfully set the policy
- the policy is carried over to the new process with fork
- hugepages are obtained both with and without madvise
- the process can successfully reset the policy to
  PR_DEFAULT_SYSTEM
- after the policy reset, hugepages are obtained only with MADV_HUGEPAGE

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 tools/testing/selftests/prctl/thp_policy.c | 74 +++++++++++++++++++++-
 1 file changed, 73 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/prctl/thp_policy.c b/tools/testing/selftests/prctl/thp_policy.c
index 7791d282f7c8..62cf1fa6fd28 100644
--- a/tools/testing/selftests/prctl/thp_policy.c
+++ b/tools/testing/selftests/prctl/thp_policy.c
@@ -203,6 +203,77 @@ static int test_global_always_process_nohuge(void)
 		return -1;
 }
 
+/* Global policy is madvise, process is changed to HUGE (process becomes always) */
+static int test_global_madvise_process_huge(void)
+{
+	int is_anonhuge = 0, res = 0, status = 0;
+	pid_t pid;
+
+	if (prctl(PR_SET_THP_POLICY, PR_DEFAULT_MADV_HUGEPAGE, NULL, NULL, NULL) != 0) {
+		perror("prctl failed to set process policy to always");
+		return -1;
+	}
+
+	/* Make sure prctl changes are carried across fork */
+	pid = fork();
+	if (pid < 0) {
+		perror("fork");
+		exit(EXIT_FAILURE);
+	}
+
+	res = prctl(PR_GET_THP_POLICY, NULL, NULL, NULL, NULL);
+	if (res != PR_DEFAULT_MADV_HUGEPAGE) {
+		printf("prctl PR_GET_THP_POLICY returned %d pid %d\n", res, pid);
+		goto err_out;
+	}
+
+	/* global = madvise, process = always, we should get HPs irrespective of MADV_HUGEPAGE */
+	is_anonhuge = test_mmap_thp(0);
+	if (!is_anonhuge) {
+		printf("PR_DEFAULT_MADV_HUGEPAGE set but didn't get hugepages\n");
+		goto err_out;
+	}
+
+	is_anonhuge = test_mmap_thp(1);
+	if (!is_anonhuge) {
+		printf("PR_DEFAULT_MADV_HUGEPAGE set but didn't get hugepages\n");
+		goto err_out;
+	}
+
+	/* Reset to system policy */
+	if (prctl(PR_SET_THP_POLICY, PR_DEFAULT_SYSTEM, NULL, NULL, NULL) != 0) {
+		perror("prctl failed to set policy to system");
+		goto err_out;
+	}
+
+	is_anonhuge = test_mmap_thp(0);
+	if (is_anonhuge) {
+		printf("global policy is madvise but got hugepages without MADV_HUGEPAGE\n");
+		goto err_out;
+	}
+
+	is_anonhuge = test_mmap_thp(1);
+	if (!is_anonhuge) {
+		printf("global policy is madvise but didn't get hugepages with MADV_HUGEPAGE\n");
+		goto err_out;
+	}
+
+	if (pid == 0) {
+		exit(EXIT_SUCCESS);
+	} else {
+		wait(&status);
+		if (WIFEXITED(status))
+			return 0;
+		else
+			return -1;
+	}
+err_out:
+	if (pid == 0)
+		exit(EXIT_FAILURE);
+	else
+		return -1;
+}
+
 int main(void)
 {
 	if (sysfs_check())
@@ -210,5 +281,6 @@ int main(void)
 
 	if (system_thp_policy == SYSTEM_POLICY_ALWAYS)
 		return test_global_always_process_nohuge();
-
+	else
+		return test_global_madvise_process_huge();
 }
-- 
2.47.1
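
In both tests the parent returns 0 whenever WIFEXITED(status) is true, which
is also the case when the child took the err_out path and called
exit(EXIT_FAILURE). A minimal sketch of a stricter check is shown below. The
helper names child_succeeded and run_child are hypothetical and not taken from
the patch.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: true only if the child exited normally AND with
 * exit code 0. WIFEXITED() alone is also true for exit(EXIT_FAILURE).
 */
static bool child_succeeded(int status)
{
	return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

/*
 * Fork a child that exits immediately with 'code' and return its raw wait
 * status, or -1 if fork() or waitpid() fails.
 */
static int run_child(int code)
{
	int status = 0;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(code);
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return status;
}
```

With a check of this shape, a failure detected only in the forked child would
propagate to the parent's return value.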



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 7/7] docs: transhuge: document process level THP controls
  2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
                   ` (5 preceding siblings ...)
  2025-05-19 22:29 ` [PATCH v3 6/7] selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE Usama Arif
@ 2025-05-19 22:29 ` Usama Arif
  2025-05-20  5:14 ` [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Lorenzo Stoakes
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Usama Arif @ 2025-05-19 22:29 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
	lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, vbabka,
	jannh, Arnd Bergmann, linux-kernel, linux-doc, kernel-team,
	Usama Arif

This includes the already existing PR_GET/SET_THP_DISABLE policy,
as well as the newly introduced PR_GET/SET_THP_POLICY.

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 42 ++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index dff8d5985f0f..79983c20ae48 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -218,6 +218,48 @@ to "always" or "madvise"), and it'll be automatically shutdown when
 PMD-sized THP is disabled (when both the per-size anon control and the
 top-level control are "never")
 
+process THP controls
+--------------------
+
+Transparent Hugepage behaviour of a process can be modified/obtained by
+using the prctl system call. The following operations are supported:
+
+PR_SET_THP_DISABLE
+	This will set the MMF_DISABLE_THP process flag which will result
+	in no hugepages being faulted in or collapsed by khugepaged,
+	irrespective of global THP controls.
+
+PR_GET_THP_DISABLE
+	This will return the MMF_DISABLE_THP process flag, which will be
+	set if the process has previously been set with PR_SET_THP_DISABLE.
+
+PR_SET_THP_POLICY
+	This is used to change the behaviour of existing and future VMAs.
+	It has support for the following policies:
+
+	PR_DEFAULT_MADV_HUGEPAGE
+		This will set VM_HUGEPAGE and clear VM_NOHUGEPAGE for the default
+		VMA flags. It will also iterate through every VMA in the process
+		and call hugepage_madvise on it, with MADV_HUGEPAGE policy.
+		This effectively allows setting MADV_HUGEPAGE on the entire process.
+		The policy is inherited during fork+exec.
+
+	PR_DEFAULT_MADV_NOHUGEPAGE
+		This will set VM_NOHUGEPAGE and clear VM_HUGEPAGE for the default
+		VMA flags. It will also iterate through every VMA in the process
+		and call hugepage_madvise on it, with MADV_NOHUGEPAGE policy.
+		This effectively allows setting MADV_NOHUGEPAGE on the entire process.
+		The policy is inherited during fork+exec.
+
+	PR_THP_POLICY_SYSTEM
+		This will reset the process to the system policy by clearing both
+		VM_HUGEPAGE and VM_NOHUGEPAGE from the default VMA flags.
+
+PR_GET_THP_POLICY
+	This will return the current THP policy of the process, i.e.
+	PR_DEFAULT_MADV_HUGEPAGE, PR_DEFAULT_MADV_NOHUGEPAGE or
+	PR_THP_POLICY_SYSTEM.
+
 Khugepaged controls
 -------------------
 
-- 
2.47.1
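
For the two prctl operations that already exist upstream, a userspace round
trip might look like the sketch below. It uses only PR_SET_THP_DISABLE and
PR_GET_THP_DISABLE from <sys/prctl.h>. The PR_SET/GET_THP_POLICY constants
proposed in this series are deliberately left out, since they are not in
released headers. The helper name thp_disable_roundtrip is hypothetical.

```c
#include <stdio.h>
#include <sys/prctl.h>

/*
 * Hypothetical helper: set or clear the per-process THP-disable flag and
 * read it back. Returns the value reported by PR_GET_THP_DISABLE (0 or 1),
 * or -1 if either prctl call fails (e.g. unsupported kernel or sandbox).
 */
static int thp_disable_roundtrip(int disable)
{
	/* arg3..arg5 are passed as zero; the kernel rejects other values here */
	if (prctl(PR_SET_THP_DISABLE, disable ? 1UL : 0UL, 0UL, 0UL, 0UL) != 0) {
		perror("prctl(PR_SET_THP_DISABLE)");
		return -1;
	}
	/* The getter requires arg2..arg5 to all be zero */
	return prctl(PR_GET_THP_DISABLE, 0UL, 0UL, 0UL, 0UL);
}
```

Calling thp_disable_roundtrip(1) and then thp_disable_roundtrip(0) leaves the
process in its original state, which keeps the sketch safe to run inside a
larger test.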



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
  2025-05-19 22:29 ` [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process Usama Arif
@ 2025-05-19 23:01   ` Jann Horn
  2025-05-20  5:23     ` Lorenzo Stoakes
  2025-05-20  8:48   ` kernel test robot
  1 sibling, 1 reply; 25+ messages in thread
From: Jann Horn @ 2025-05-19 23:01 UTC (permalink / raw)
  To: Usama Arif, lorenzo.stoakes
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
	laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	vbabka, Arnd Bergmann, linux-kernel, linux-doc, kernel-team

On Tue, May 20, 2025 at 12:33 AM Usama Arif <usamaarif642@gmail.com> wrote:
> This is set via the new PR_SET_THP_POLICY prctl. It has 2 effects:
> - It sets VM_HUGEPAGE and clears VM_NOHUGEPAGE on the default VMA flags
>   (def_flags). This means that every new VMA will be considered for
>   hugepage.
> - Iterate through every VMA in the process and call hugepage_madvise
>   on it, with MADV_HUGEPAGE policy.
> The policy is inherited during fork+exec.

As I replied to Lorenzo's series
(https://lore.kernel.org/all/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com/),
it would be nice if you could avoid introducing new flags that have
the combination of all the following properties:

1. persists across exec
2. not cleared on secureexec execution
3. settable without ns_capable(CAP_SYS_ADMIN)
4. settable without NO_NEW_PRIVS

Flags that have all of these properties need to be reviewed extra
carefully to see if there is any way they could impact the security of
setuid binaries, for example by changing mmap() behavior in a way that
makes addresses significantly more predictable.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY
  2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
                   ` (6 preceding siblings ...)
  2025-05-19 22:29 ` [PATCH v3 7/7] docs: transhuge: document process level THP controls Usama Arif
@ 2025-05-20  5:14 ` Lorenzo Stoakes
  2025-05-20  7:46   ` Usama Arif
  2025-05-21  2:33 ` Liam R. Howlett
  2025-05-22 12:10 ` Mike Rapoport
  9 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20  5:14 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
	laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	vbabka, jannh, Arnd Bergmann, linux-kernel, linux-doc,
	kernel-team

NACK the whole series.

Usama - I explicitly said make this an RFC, so we can see what this
approach _looks like_ to further examine it, to which you agreed. And now
you've sent it non-RFC. That's not acceptable.

If you agree to something in review, it's not then optional as to whether
you do it.

Thanks.

On Mon, May 19, 2025 at 11:29:52PM +0100, Usama Arif wrote:
> This series allows to change the THP policy of a process, according to the
> value set in arg2, all of which will be inherited during fork+exec:
> - PR_DEFAULT_MADV_HUGEPAGE: This will set VM_HUGEPAGE and clear VM_NOHUGEPAGE
>   for the default VMA flags. It will also iterate through every VMA in the
>   process and call hugepage_madvise on it, with MADV_HUGEPAGE policy.
>   This effectively allows setting MADV_HUGEPAGE on the entire process.
>   In an environment where different types of workloads are run on the
>   same machine, this will allow workloads that benefit from always having
>   hugepages to do so, without regressing those that don't.
> - PR_DEFAULT_MADV_NOHUGEPAGE: This will set VM_NOHUGEPAGE and clear VM_HUGEPAGE
>   for the default VMA flags. It will also iterate through every VMA in the
>   process and call hugepage_madvise on it, with MADV_NOHUGEPAGE policy.
>   This effectively allows setting MADV_NOHUGEPAGE on the entire process.
>   In an environment where different types of workloads are run on the
>   same machine,this will allow workloads that benefit from having
>   hugepages on an madvise basis only to do so, without regressing those
>   that benefit from having hugepages always.
> - PR_THP_POLICY_SYSTEM: This will reset (clear) both VM_HUGEPAGE and
>   VM_NOHUGEPAGE process for the default flags.
>
> In hyperscalers, we have a single THP policy for the entire fleet.
> We have different types of workloads (e.g. AI/compute/databases/etc)
> running on a single server.
> Some of these workloads will benefit from always getting THP at fault
> (or collapsed by khugepaged), some of them will benefit by only getting
> them at madvise.
>
> This series is useful for 2 usecases:
> 1) global system policy = madvise, while we want some workloads to get THPs
> at fault and by khugepaged :- some processes (e.g. AI workloads) benefits
> from getting THPs at fault (and collapsed by khugepaged). Other workloads
> like databases will incur regression (either a performance regression or
> they are completely memory bound and even a very slight increase in memory
> will cause them to OOM). So what these patches will do is allow setting
> prctl(PR_DEFAULT_MADV_HUGEPAGE) on the AI workloads, (This is how
> workloads are deployed in our (Meta's/Facebook) fleet at this moment).
>
> 2) global system policy = always, while we want some workloads to get THPs
> only on madvise basis :- Same reason as 1). What these patches
> will do is allow setting prctl(PR_DEFAULT_MADV_NOHUGEPAGE) on the database
> workloads. (We hope this is us (Meta) in the near future, if a majority of
> workloads show that they benefit from always, we flip the default host
> setting to "always" across the fleet and workloads that regress can opt-out
> and be "madvise". New services developed will then be tested with always by
> default. "always" is also the default defconfig option upstream, so I would
> imagine this is faced by others as well.)
>
> v2->v3: (Thanks Lorenzo for all the below feedback!)
> v2: https://lore.kernel.org/all/20250515133519.2779639-1-usamaarif642@gmail.com/
> - no more flags2.
> - no more MMF2_...
> - renamed policy to PR_DEFAULT_MADV_(NO)HUGEPAGE
> - mmap_write_lock_killable acquired in PR_GET_THP_POLICY
> - mmap_write lock fixed in PR_SET_THP_POLICY
> - mmap assert check in process_default_madv_hugepage
> - check if hugepage_global_enabled is enabled in the call and account for s390
> - set mm->def_flags VM_HUGEPAGE and VM_NOHUGEPAGE according to the policy in
>   the way done by madvise(). I believe VM merge will not be broken in
>   this way.
> - process_default_madv_hugepage function that does for_each_vma and calls
>   hugepage_madvise.
>
> v1->v2:
> - change from modifying the THP decision making for the process, to modifying
>   VMA flags only. This prevents further complicating the logic used to
>   determine THP order (Thanks David!)
> - change from using a prctl per policy change to just using PR_SET_THP_POLICY
>   and arg2 to set the policy. (Zi Yan)
> - Introduce PR_THP_POLICY_DEFAULT_NOHUGE and PR_THP_POLICY_DEFAULT_SYSTEM
> - Add selftests and documentation.
>
> Usama Arif (7):
>   mm: khugepaged: extract vm flag setting outside of hugepage_madvise
>   prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
>   prctl: introduce PR_DEFAULT_MADV_NOHUGEPAGE for the process
>   prctl: introduce PR_THP_POLICY_SYSTEM for the process
>   selftests: prctl: introduce tests for PR_DEFAULT_MADV_NOHUGEPAGE
>   selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
>   docs: transhuge: document process level THP controls
>
>  Documentation/admin-guide/mm/transhuge.rst    |  42 +++
>  include/linux/huge_mm.h                       |   2 +
>  include/linux/mm.h                            |   2 +-
>  include/linux/mm_types.h                      |   4 +-
>  include/uapi/linux/prctl.h                    |   6 +
>  kernel/sys.c                                  |  53 ++++
>  mm/huge_memory.c                              |  13 +
>  mm/khugepaged.c                               |  26 +-
>  tools/include/uapi/linux/prctl.h              |   6 +
>  .../trace/beauty/include/uapi/linux/prctl.h   |   6 +
>  tools/testing/selftests/prctl/Makefile        |   2 +-
>  tools/testing/selftests/prctl/thp_policy.c    | 286 ++++++++++++++++++
>  12 files changed, 436 insertions(+), 12 deletions(-)
>  create mode 100644 tools/testing/selftests/prctl/thp_policy.c
>
> --
> 2.47.1
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
  2025-05-19 23:01   ` Jann Horn
@ 2025-05-20  5:23     ` Lorenzo Stoakes
  2025-05-20  9:09       ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20  5:23 UTC (permalink / raw)
  To: Jann Horn
  Cc: Usama Arif, Andrew Morton, david, linux-mm, hannes, shakeel.butt,
	riel, ziy, laoar.shao, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, vbabka, Arnd Bergmann, linux-kernel, linux-doc,
	kernel-team

On Tue, May 20, 2025 at 01:01:38AM +0200, Jann Horn wrote:
> On Tue, May 20, 2025 at 12:33 AM Usama Arif <usamaarif642@gmail.com> wrote:
> > This is set via the new PR_SET_THP_POLICY prctl. It has 2 effects:
> > - It sets VM_HUGEPAGE and clears VM_NOHUGEPAGE on the default VMA flags
> >   (def_flags). This means that every new VMA will be considered for
> >   hugepage.
> > - Iterate through every VMA in the process and call hugepage_madvise
> >   on it, with MADV_HUGEPAGE policy.
> > The policy is inherited during fork+exec.
>
> As I replied to Lorenzo's series
> (https://lore.kernel.org/all/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com/),
> it would be nice if you could avoid introducing new flags that have
> the combination of all the following properties:
>
> 1. persists across exec
> 2. not cleared on secureexec execution
> 3. settable without ns_capable(CAP_SYS_ADMIN)
> 4. settable without NO_NEW_PRIVS
>
> Flags that have all of these properties need to be reviewed extra
> carefully to see if there is any way they could impact the security of
> setuid binaries, for example by changing mmap() behavior in a way that
> makes addresses significantly more predictable.

Indeed, this series was meant to be as RFC as mine while we still figured this
out :) grr. Well, with the NACK it is - in effect - now an RFC.

Yes having something persistent like this is not great, the idea of
introducing this in my series was to provide an alternative generic version
of this approach that can be better controlled and isn't just a 'tacked on'
change specific to one company's needs but rather a more general idea of
'madvise() by default'.

I do wonder in this case, whether we need be so cautious however given the
_relatively_ safe nature of these flags?

I do absolutely agree we need to very carefully review whether:

1. It really even makes sense to do this
2. Any such restrictions need be made

I am weaker on the security side so very glad for your input here (thanks!)

I suspect probably we want ns_capable(CAP_SYS_ADMIN) _as a rule_ for this
kind of mm->def_flags change.

I also wanted to dig a little deeper into whether this was sensible as a
general approach.

I, however, do _very much_ prefer it to an mm->flags change (that'd
necessitate a prerequisite 'make mm->flags 64-bit on 32-bit kernels'
series anyway).


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY
  2025-05-20  5:14 ` [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Lorenzo Stoakes
@ 2025-05-20  7:46   ` Usama Arif
  2025-05-20  8:51     ` Lorenzo Stoakes
  0 siblings, 1 reply; 25+ messages in thread
From: Usama Arif @ 2025-05-20  7:46 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
	laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	vbabka, jannh, Arnd Bergmann, linux-kernel, linux-doc,
	kernel-team



On 20/05/2025 06:14, Lorenzo Stoakes wrote:
> NACK the whole series.
> 
> Usama - I explicitly said make this an RFC, so we can see what this
> approach _looks like_ to further examine it, to which you agreed. And now
> you've sent it non-RFC. That's not acceptable.
> 
> If you agree to something in review, it's not then optional as to whether
> you do it.

It was a bit late yesterday and I completely forgot to change --subject-prefix="PATCH v3" 
to --subject-prefix="RFC v3". Mistakes happen and I apologize.

I agreed to make it RFC and had full intention of doing that.
Would you like me to resend it with the RFC tag?

Thanks,
Usama


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
  2025-05-19 22:29 ` [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process Usama Arif
  2025-05-19 23:01   ` Jann Horn
@ 2025-05-20  8:48   ` kernel test robot
  1 sibling, 0 replies; 25+ messages in thread
From: kernel test robot @ 2025-05-20  8:48 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, david
  Cc: llvm, oe-kbuild-all, Linux Memory Management List, hannes,
	shakeel.butt, riel, ziy, laoar.shao, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, vbabka, jannh, Arnd Bergmann,
	linux-kernel, linux-doc, kernel-team, Usama Arif

Hi Usama,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on perf-tools-next/perf-tools-next tip/perf/core perf-tools/perf-tools linus/master v6.15-rc7]
[cannot apply to acme/perf/core next-20250516]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/mm-khugepaged-extract-vm-flag-setting-outside-of-hugepage_madvise/20250520-063452
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250519223307.3601786-3-usamaarif642%40gmail.com
patch subject: [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
config: s390-randconfig-001-20250520 (https://download.01.org/0day-ci/archive/20250520/202505201614.N4SXnAln-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project f819f46284f2a79790038e1f6649172789734ae8)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250520/202505201614.N4SXnAln-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505201614.N4SXnAln-lkp@intel.com/

All errors (new ones prefixed by >>):

>> kernel/sys.c:2678:9: error: call to undeclared function 'hugepage_global_enabled'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    2678 |                         if (!hugepage_global_enabled())
         |                              ^
>> kernel/sys.c:2680:12: error: call to undeclared function 'hugepage_set_vmflags'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    2680 |                         error = hugepage_set_vmflags(&mm->def_flags, MADV_HUGEPAGE);
         |                                 ^
>> kernel/sys.c:2682:5: error: call to undeclared function 'process_default_madv_hugepage'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    2682 |                                 process_default_madv_hugepage(mm, MADV_HUGEPAGE);
         |                                 ^
   3 errors generated.


vim +/hugepage_global_enabled +2678 kernel/sys.c

  2472	
  2473	SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
  2474			unsigned long, arg4, unsigned long, arg5)
  2475	{
  2476		struct task_struct *me = current;
  2477		struct mm_struct *mm = me->mm;
  2478		unsigned char comm[sizeof(me->comm)];
  2479		long error;
  2480	
  2481		error = security_task_prctl(option, arg2, arg3, arg4, arg5);
  2482		if (error != -ENOSYS)
  2483			return error;
  2484	
  2485		error = 0;
  2486		switch (option) {
  2487		case PR_SET_PDEATHSIG:
  2488			if (!valid_signal(arg2)) {
  2489				error = -EINVAL;
  2490				break;
  2491			}
  2492			me->pdeath_signal = arg2;
  2493			break;
  2494		case PR_GET_PDEATHSIG:
  2495			error = put_user(me->pdeath_signal, (int __user *)arg2);
  2496			break;
  2497		case PR_GET_DUMPABLE:
  2498			error = get_dumpable(me->mm);
  2499			break;
  2500		case PR_SET_DUMPABLE:
  2501			if (arg2 != SUID_DUMP_DISABLE && arg2 != SUID_DUMP_USER) {
  2502				error = -EINVAL;
  2503				break;
  2504			}
  2505			set_dumpable(me->mm, arg2);
  2506			break;
  2507	
  2508		case PR_SET_UNALIGN:
  2509			error = SET_UNALIGN_CTL(me, arg2);
  2510			break;
  2511		case PR_GET_UNALIGN:
  2512			error = GET_UNALIGN_CTL(me, arg2);
  2513			break;
  2514		case PR_SET_FPEMU:
  2515			error = SET_FPEMU_CTL(me, arg2);
  2516			break;
  2517		case PR_GET_FPEMU:
  2518			error = GET_FPEMU_CTL(me, arg2);
  2519			break;
  2520		case PR_SET_FPEXC:
  2521			error = SET_FPEXC_CTL(me, arg2);
  2522			break;
  2523		case PR_GET_FPEXC:
  2524			error = GET_FPEXC_CTL(me, arg2);
  2525			break;
  2526		case PR_GET_TIMING:
  2527			error = PR_TIMING_STATISTICAL;
  2528			break;
  2529		case PR_SET_TIMING:
  2530			if (arg2 != PR_TIMING_STATISTICAL)
  2531				error = -EINVAL;
  2532			break;
  2533		case PR_SET_NAME:
  2534			comm[sizeof(me->comm) - 1] = 0;
  2535			if (strncpy_from_user(comm, (char __user *)arg2,
  2536					      sizeof(me->comm) - 1) < 0)
  2537				return -EFAULT;
  2538			set_task_comm(me, comm);
  2539			proc_comm_connector(me);
  2540			break;
  2541		case PR_GET_NAME:
  2542			get_task_comm(comm, me);
  2543			if (copy_to_user((char __user *)arg2, comm, sizeof(comm)))
  2544				return -EFAULT;
  2545			break;
  2546		case PR_GET_ENDIAN:
  2547			error = GET_ENDIAN(me, arg2);
  2548			break;
  2549		case PR_SET_ENDIAN:
  2550			error = SET_ENDIAN(me, arg2);
  2551			break;
  2552		case PR_GET_SECCOMP:
  2553			error = prctl_get_seccomp();
  2554			break;
  2555		case PR_SET_SECCOMP:
  2556			error = prctl_set_seccomp(arg2, (char __user *)arg3);
  2557			break;
  2558		case PR_GET_TSC:
  2559			error = GET_TSC_CTL(arg2);
  2560			break;
  2561		case PR_SET_TSC:
  2562			error = SET_TSC_CTL(arg2);
  2563			break;
  2564		case PR_TASK_PERF_EVENTS_DISABLE:
  2565			error = perf_event_task_disable();
  2566			break;
  2567		case PR_TASK_PERF_EVENTS_ENABLE:
  2568			error = perf_event_task_enable();
  2569			break;
  2570		case PR_GET_TIMERSLACK:
  2571			if (current->timer_slack_ns > ULONG_MAX)
  2572				error = ULONG_MAX;
  2573			else
  2574				error = current->timer_slack_ns;
  2575			break;
  2576		case PR_SET_TIMERSLACK:
  2577			if (rt_or_dl_task_policy(current))
  2578				break;
  2579			if (arg2 <= 0)
  2580				current->timer_slack_ns =
  2581						current->default_timer_slack_ns;
  2582			else
  2583				current->timer_slack_ns = arg2;
  2584			break;
  2585		case PR_MCE_KILL:
  2586			if (arg4 | arg5)
  2587				return -EINVAL;
  2588			switch (arg2) {
  2589			case PR_MCE_KILL_CLEAR:
  2590				if (arg3 != 0)
  2591					return -EINVAL;
  2592				current->flags &= ~PF_MCE_PROCESS;
  2593				break;
  2594			case PR_MCE_KILL_SET:
  2595				current->flags |= PF_MCE_PROCESS;
  2596				if (arg3 == PR_MCE_KILL_EARLY)
  2597					current->flags |= PF_MCE_EARLY;
  2598				else if (arg3 == PR_MCE_KILL_LATE)
  2599					current->flags &= ~PF_MCE_EARLY;
  2600				else if (arg3 == PR_MCE_KILL_DEFAULT)
  2601					current->flags &=
  2602							~(PF_MCE_EARLY|PF_MCE_PROCESS);
  2603				else
  2604					return -EINVAL;
  2605				break;
  2606			default:
  2607				return -EINVAL;
  2608			}
  2609			break;
  2610		case PR_MCE_KILL_GET:
  2611			if (arg2 | arg3 | arg4 | arg5)
  2612				return -EINVAL;
  2613			if (current->flags & PF_MCE_PROCESS)
  2614				error = (current->flags & PF_MCE_EARLY) ?
  2615					PR_MCE_KILL_EARLY : PR_MCE_KILL_LATE;
  2616			else
  2617				error = PR_MCE_KILL_DEFAULT;
  2618			break;
  2619		case PR_SET_MM:
  2620			error = prctl_set_mm(arg2, arg3, arg4, arg5);
  2621			break;
  2622		case PR_GET_TID_ADDRESS:
  2623			error = prctl_get_tid_address(me, (int __user * __user *)arg2);
  2624			break;
  2625		case PR_SET_CHILD_SUBREAPER:
  2626			me->signal->is_child_subreaper = !!arg2;
  2627			if (!arg2)
  2628				break;
  2629	
  2630			walk_process_tree(me, propagate_has_child_subreaper, NULL);
  2631			break;
  2632		case PR_GET_CHILD_SUBREAPER:
  2633			error = put_user(me->signal->is_child_subreaper,
  2634					 (int __user *)arg2);
  2635			break;
  2636		case PR_SET_NO_NEW_PRIVS:
  2637			if (arg2 != 1 || arg3 || arg4 || arg5)
  2638				return -EINVAL;
  2639	
  2640			task_set_no_new_privs(current);
  2641			break;
  2642		case PR_GET_NO_NEW_PRIVS:
  2643			if (arg2 || arg3 || arg4 || arg5)
  2644				return -EINVAL;
  2645			return task_no_new_privs(current) ? 1 : 0;
  2646		case PR_GET_THP_DISABLE:
  2647			if (arg2 || arg3 || arg4 || arg5)
  2648				return -EINVAL;
  2649			error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
  2650			break;
  2651		case PR_SET_THP_DISABLE:
  2652			if (arg3 || arg4 || arg5)
  2653				return -EINVAL;
  2654			if (mmap_write_lock_killable(me->mm))
  2655				return -EINTR;
  2656			if (arg2)
  2657				set_bit(MMF_DISABLE_THP, &me->mm->flags);
  2658			else
  2659				clear_bit(MMF_DISABLE_THP, &me->mm->flags);
  2660			mmap_write_unlock(me->mm);
  2661			break;
  2662		case PR_GET_THP_POLICY:
  2663			if (arg2 || arg3 || arg4 || arg5)
  2664				return -EINVAL;
  2665			if (mmap_write_lock_killable(mm))
  2666				return -EINTR;
  2667			if (mm->def_flags & VM_HUGEPAGE)
  2668				error = PR_DEFAULT_MADV_HUGEPAGE;
  2669			mmap_write_unlock(mm);
  2670			break;
  2671		case PR_SET_THP_POLICY:
  2672			if (arg3 || arg4 || arg5)
  2673				return -EINVAL;
  2674			if (mmap_write_lock_killable(mm))
  2675				return -EINTR;
  2676			switch (arg2) {
  2677			case PR_DEFAULT_MADV_HUGEPAGE:
> 2678				if (!hugepage_global_enabled())
  2679					error = -EPERM;
> 2680				error = hugepage_set_vmflags(&mm->def_flags, MADV_HUGEPAGE);
  2681				if (!error)
> 2682					process_default_madv_hugepage(mm, MADV_HUGEPAGE);
  2683				break;
  2684			default:
  2685				error = -EINVAL;
  2686				break;
  2687			}
  2688			mmap_write_unlock(mm);
  2689			break;
  2690		case PR_MPX_ENABLE_MANAGEMENT:
  2691		case PR_MPX_DISABLE_MANAGEMENT:
  2692			/* No longer implemented: */
  2693			return -EINVAL;
  2694		case PR_SET_FP_MODE:
  2695			error = SET_FP_MODE(me, arg2);
  2696			break;
  2697		case PR_GET_FP_MODE:
  2698			error = GET_FP_MODE(me);
  2699			break;
  2700		case PR_SVE_SET_VL:
  2701			error = SVE_SET_VL(arg2);
  2702			break;
  2703		case PR_SVE_GET_VL:
  2704			error = SVE_GET_VL();
  2705			break;
  2706		case PR_SME_SET_VL:
  2707			error = SME_SET_VL(arg2);
  2708			break;
  2709		case PR_SME_GET_VL:
  2710			error = SME_GET_VL();
  2711			break;
  2712		case PR_GET_SPECULATION_CTRL:
  2713			if (arg3 || arg4 || arg5)
  2714				return -EINVAL;
  2715			error = arch_prctl_spec_ctrl_get(me, arg2);
  2716			break;
  2717		case PR_SET_SPECULATION_CTRL:
  2718			if (arg4 || arg5)
  2719				return -EINVAL;
  2720			error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
  2721			break;
  2722		case PR_PAC_RESET_KEYS:
  2723			if (arg3 || arg4 || arg5)
  2724				return -EINVAL;
  2725			error = PAC_RESET_KEYS(me, arg2);
  2726			break;
  2727		case PR_PAC_SET_ENABLED_KEYS:
  2728			if (arg4 || arg5)
  2729				return -EINVAL;
  2730			error = PAC_SET_ENABLED_KEYS(me, arg2, arg3);
  2731			break;
  2732		case PR_PAC_GET_ENABLED_KEYS:
  2733			if (arg2 || arg3 || arg4 || arg5)
  2734				return -EINVAL;
  2735			error = PAC_GET_ENABLED_KEYS(me);
  2736			break;
  2737		case PR_SET_TAGGED_ADDR_CTRL:
  2738			if (arg3 || arg4 || arg5)
  2739				return -EINVAL;
  2740			error = SET_TAGGED_ADDR_CTRL(arg2);
  2741			break;
  2742		case PR_GET_TAGGED_ADDR_CTRL:
  2743			if (arg2 || arg3 || arg4 || arg5)
  2744				return -EINVAL;
  2745			error = GET_TAGGED_ADDR_CTRL();
  2746			break;
  2747		case PR_SET_IO_FLUSHER:
  2748			if (!capable(CAP_SYS_RESOURCE))
  2749				return -EPERM;
  2750	
  2751			if (arg3 || arg4 || arg5)
  2752				return -EINVAL;
  2753	
  2754			if (arg2 == 1)
  2755				current->flags |= PR_IO_FLUSHER;
  2756			else if (!arg2)
  2757				current->flags &= ~PR_IO_FLUSHER;
  2758			else
  2759				return -EINVAL;
  2760			break;
  2761		case PR_GET_IO_FLUSHER:
  2762			if (!capable(CAP_SYS_RESOURCE))
  2763				return -EPERM;
  2764	
  2765			if (arg2 || arg3 || arg4 || arg5)
  2766				return -EINVAL;
  2767	
  2768			error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
  2769			break;
  2770		case PR_SET_SYSCALL_USER_DISPATCH:
  2771			error = set_syscall_user_dispatch(arg2, arg3, arg4,
  2772							  (char __user *) arg5);
  2773			break;
  2774	#ifdef CONFIG_SCHED_CORE
  2775		case PR_SCHED_CORE:
  2776			error = sched_core_share_pid(arg2, arg3, arg4, arg5);
  2777			break;
  2778	#endif
  2779		case PR_SET_MDWE:
  2780			error = prctl_set_mdwe(arg2, arg3, arg4, arg5);
  2781			break;
  2782		case PR_GET_MDWE:
  2783			error = prctl_get_mdwe(arg2, arg3, arg4, arg5);
  2784			break;
  2785		case PR_PPC_GET_DEXCR:
  2786			if (arg3 || arg4 || arg5)
  2787				return -EINVAL;
  2788			error = PPC_GET_DEXCR_ASPECT(me, arg2);
  2789			break;
  2790		case PR_PPC_SET_DEXCR:
  2791			if (arg4 || arg5)
  2792				return -EINVAL;
  2793			error = PPC_SET_DEXCR_ASPECT(me, arg2, arg3);
  2794			break;
  2795		case PR_SET_VMA:
  2796			error = prctl_set_vma(arg2, arg3, arg4, arg5);
  2797			break;
  2798		case PR_GET_AUXV:
  2799			if (arg4 || arg5)
  2800				return -EINVAL;
  2801			error = prctl_get_auxv((void __user *)arg2, arg3);
  2802			break;
  2803	#ifdef CONFIG_KSM
  2804		case PR_SET_MEMORY_MERGE:
  2805			if (arg3 || arg4 || arg5)
  2806				return -EINVAL;
  2807			if (mmap_write_lock_killable(me->mm))
  2808				return -EINTR;
  2809	
  2810			if (arg2)
  2811				error = ksm_enable_merge_any(me->mm);
  2812			else
  2813				error = ksm_disable_merge_any(me->mm);
  2814			mmap_write_unlock(me->mm);
  2815			break;
  2816		case PR_GET_MEMORY_MERGE:
  2817			if (arg2 || arg3 || arg4 || arg5)
  2818				return -EINVAL;
  2819	
  2820			error = !!test_bit(MMF_VM_MERGE_ANY, &me->mm->flags);
  2821			break;
  2822	#endif
  2823		case PR_RISCV_V_SET_CONTROL:
  2824			error = RISCV_V_SET_CONTROL(arg2);
  2825			break;
  2826		case PR_RISCV_V_GET_CONTROL:
  2827			error = RISCV_V_GET_CONTROL();
  2828			break;
  2829		case PR_RISCV_SET_ICACHE_FLUSH_CTX:
  2830			error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
  2831			break;
  2832		case PR_GET_SHADOW_STACK_STATUS:
  2833			if (arg3 || arg4 || arg5)
  2834				return -EINVAL;
  2835			error = arch_get_shadow_stack_status(me, (unsigned long __user *) arg2);
  2836			break;
  2837		case PR_SET_SHADOW_STACK_STATUS:
  2838			if (arg3 || arg4 || arg5)
  2839				return -EINVAL;
  2840			error = arch_set_shadow_stack_status(me, arg2);
  2841			break;
  2842		case PR_LOCK_SHADOW_STACK_STATUS:
  2843			if (arg3 || arg4 || arg5)
  2844				return -EINVAL;
  2845			error = arch_lock_shadow_stack_status(me, arg2);
  2846			break;
  2847		case PR_TIMER_CREATE_RESTORE_IDS:
  2848			if (arg3 || arg4 || arg5)
  2849				return -EINVAL;
  2850			error = posixtimer_create_prctl(arg2);
  2851			break;
  2852		default:
  2853			trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
  2854			error = -EINVAL;
  2855			break;
  2856		}
  2857		return error;
  2858	}
  2859	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY
  2025-05-20  7:46   ` Usama Arif
@ 2025-05-20  8:51     ` Lorenzo Stoakes
  0 siblings, 0 replies; 25+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20  8:51 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
	laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	vbabka, jannh, Arnd Bergmann, linux-kernel, linux-doc,
	kernel-team

On Tue, May 20, 2025 at 08:46:43AM +0100, Usama Arif wrote:
>
>
> On 20/05/2025 06:14, Lorenzo Stoakes wrote:
> > NACK the whole series.
> >
> > Usama - I explicitly said make this an RFC, so we can see what this
> > approach _looks like_ to further examine it, to which you agreed. And now
> > you've sent it non-RFC. That's not acceptable.
> >
> > If you agree to something in review, it's not then optional as to whether
> > you do it.
>
> It was a bit late yesterday and I completely forgot to change --subject-prefix="PATCH v3"
> to --subject-prefix="RFC v3". Mistakes happen and I apologize.

Ack, but in future please try to be careful about this! This obviously
changes the nature of the series, and it's important to highlight that we're
still in the planning stages here.

>
> I agreed to make it RFC and had full intention of doing that.
> Would you like me to resend it with the RFC tag?

There's no need; we've got discussion here already, so it's sensible to keep
things as-is. The series is in effect an RFC now, as it's NACK'd.

>
> Thanks,
> Usama

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
  2025-05-20  5:23     ` Lorenzo Stoakes
@ 2025-05-20  9:09       ` David Hildenbrand
  2025-05-20  9:16         ` Lorenzo Stoakes
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2025-05-20  9:09 UTC (permalink / raw)
  To: Lorenzo Stoakes, Jann Horn
  Cc: Usama Arif, Andrew Morton, linux-mm, hannes, shakeel.butt, riel,
	ziy, laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	vbabka, Arnd Bergmann, linux-kernel, linux-doc, kernel-team

On 20.05.25 07:23, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 01:01:38AM +0200, Jann Horn wrote:
>> On Tue, May 20, 2025 at 12:33 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>> This is set via the new PR_SET_THP_POLICY prctl. It has 2 effects:
>>> - It sets VM_HUGEPAGE and clears VM_NOHUGEPAGE on the default VMA flags
>>>    (def_flags). This means that every new VMA will be considered for
>>>    hugepage.
>>> - Iterate through every VMA in the process and call hugepage_madvise
>>>    on it, with MADV_HUGEPAGE policy.
>>> The policy is inherited during fork+exec.
>>
>> As I replied to Lorenzo's series
>> (https://lore.kernel.org/all/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com/),
>> it would be nice if you could avoid introducing new flags that have
>> the combination of all the following properties:
>>
>> 1. persists across exec
>> 2. not cleared on secureexec execution
>> 3. settable without ns_capable(CAP_SYS_ADMIN)
>> 4. settable without NO_NEW_PRIVS
>>
>> Flags that have all of these properties need to be reviewed extra
>> carefully to see if there is any way they could impact the security of
>> setuid binaries, for example by changing mmap() behavior in a way that
>> makes addresses significantly more predictable.
> 
> Indeed, this series was meant to be as RFC as mine while we still figured this
> out :) grr. Well, with the NACK it is - in effect - now an RFC.
> 
> Yes having something persistent like this is not great, the idea of
> introducing this in my series was to provide an alternative generic version
> of this approach that can be better controlled and isn't just a 'tacked on'
> change specific to one company's needs but rather a more general idea of
> 'madvise() by default'.
> 
> I do wonder in this case, whether we need be so cautious however given the
> _relatively_ safe nature of these flags?

Yes. Changing VM_HUGEPAGE / VM_NOHUGEPAGE defaults should have little
impact, but we'd better be careful.

setuid execution is certainly an interesting point. Maybe the general
rule should be that it is not inherited over secureexec unless
CAP_SYS_ADMIN?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
  2025-05-20  9:09       ` David Hildenbrand
@ 2025-05-20  9:16         ` Lorenzo Stoakes
  0 siblings, 0 replies; 25+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20  9:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jann Horn, Usama Arif, Andrew Morton, linux-mm, hannes,
	shakeel.butt, riel, ziy, laoar.shao, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, vbabka, Arnd Bergmann, linux-kernel,
	linux-doc, kernel-team

On Tue, May 20, 2025 at 11:09:05AM +0200, David Hildenbrand wrote:
> On 20.05.25 07:23, Lorenzo Stoakes wrote:
> > On Tue, May 20, 2025 at 01:01:38AM +0200, Jann Horn wrote:
> > > On Tue, May 20, 2025 at 12:33 AM Usama Arif <usamaarif642@gmail.com> wrote:
> > > > This is set via the new PR_SET_THP_POLICY prctl. It has 2 effects:
> > > > - It sets VM_HUGEPAGE and clears VM_NOHUGEPAGE on the default VMA flags
> > > >    (def_flags). This means that every new VMA will be considered for
> > > >    hugepage.
> > > > - Iterate through every VMA in the process and call hugepage_madvise
> > > >    on it, with MADV_HUGEPAGE policy.
> > > > The policy is inherited during fork+exec.
> > >
> > > As I replied to Lorenzo's series
> > > (https://lore.kernel.org/all/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com/),
> > > it would be nice if you could avoid introducing new flags that have
> > > the combination of all the following properties:
> > >
> > > 1. persists across exec
> > > 2. not cleared on secureexec execution
> > > 3. settable without ns_capable(CAP_SYS_ADMIN)
> > > 4. settable without NO_NEW_PRIVS
> > >
> > > Flags that have all of these properties need to be reviewed extra
> > > carefully to see if there is any way they could impact the security of
> > > setuid binaries, for example by changing mmap() behavior in a way that
> > > makes addresses significantly more predictable.
> >
> > Indeed, this series was meant to be as RFC as mine while we still figured this
> > out :) grr. Well, with the NACK it is - in effect - now an RFC.
> >
> > Yes having something persistent like this is not great, the idea of
> > introducing this in my series was to provide an alternative generic version
> > of this approach that can be better controlled and isn't just a 'tacked on'
> > change specific to one company's needs but rather a more general idea of
> > 'madvise() by default'.
> >
> > I do wonder in this case, whether we need be so cautious however given the
> > _relatively_ safe nature of these flags?
>
> Yes. Changing VM_HUGEPAGE / VM_NOHUGEPAGE defaults should have little
> impact, but we better be careful.
>
> setuid execution is certainly an interesting point. Maybe the general rule
> should be, that it is not inherited over secureexec unless CAP_SYS_ADMIN?

I think probably we should just restrict this operation to system admins
anyway. This will be the most cautious option, and simplifies things as we
then don't have to especially check for things at certain points?
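
The four properties listed above, and the effect of restricting the operation to admins, can be written down as a simple predicate. This is a minimal sketch, assuming a hypothetical struct and function names (everything prefixed sec_ is made up for illustration and is not kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of a per-process flag; not a kernel structure. */
struct sec_flag_props {
	bool persists_across_exec;	/* property 1 in the list above */
	bool cleared_on_secureexec;	/* property 2 is the negation of this */
	bool needs_cap_sys_admin;	/* property 3 is the negation of this */
	bool needs_no_new_privs;	/* property 4 is the negation of this */
};

/* True only when all four properties from the list hold together. */
static bool sec_needs_extra_review(const struct sec_flag_props *p)
{
	return p->persists_across_exec &&
	       !p->cleared_on_secureexec &&
	       !p->needs_cap_sys_admin &&
	       !p->needs_no_new_privs;
}

static void sec_demo(void)
{
	/* A flag with all four properties, which is the case the list warns about. */
	struct sec_flag_props flag = { true, false, false, false };

	assert(sec_needs_extra_review(&flag));

	/* Restricting the operation to admins removes property 3. */
	flag.needs_cap_sys_admin = true;
	assert(!sec_needs_extra_review(&flag));

	/* Clearing on secureexec instead removes property 2. */
	flag.needs_cap_sys_admin = false;
	flag.cleared_on_secureexec = true;
	assert(!sec_needs_extra_review(&flag));
}
```

Either change alone takes the flag out of the combination that was flagged; which one (if any) the series adopts is still open in this thread.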

>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise
  2025-05-19 22:29 ` [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise Usama Arif
@ 2025-05-20  9:51   ` kernel test robot
  2025-05-20 14:43   ` Lorenzo Stoakes
  1 sibling, 0 replies; 25+ messages in thread
From: kernel test robot @ 2025-05-20  9:51 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, david
  Cc: llvm, oe-kbuild-all, Linux Memory Management List, hannes,
	shakeel.butt, riel, ziy, laoar.shao, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, vbabka, jannh, Arnd Bergmann,
	linux-kernel, linux-doc, kernel-team, Usama Arif

Hi Usama,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on perf-tools-next/perf-tools-next tip/perf/core perf-tools/perf-tools linus/master acme/perf/core v6.15-rc7 next-20250516]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/mm-khugepaged-extract-vm-flag-setting-outside-of-hugepage_madvise/20250520-063452
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250519223307.3601786-2-usamaarif642%40gmail.com
patch subject: [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise
config: s390-randconfig-002-20250520 (https://download.01.org/0day-ci/archive/20250520/202505201734.8Fyk3qKi-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project f819f46284f2a79790038e1f6649172789734ae8)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250520/202505201734.8Fyk3qKi-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505201734.8Fyk3qKi-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/khugepaged.c:359:20: error: use of undeclared identifier 'vma'
     359 |                 if (mm_has_pgste(vma->vm_mm))
         |                                  ^
   1 error generated.


vim +/vma +359 mm/khugepaged.c

b46e756f5e4703 Kirill A. Shutemov 2016-07-26  348  
d2a8f83f11a4ba Usama Arif         2025-05-19  349  int hugepage_set_vmflags(unsigned long *vm_flags, int advice)
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  350  {
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  351  	switch (advice) {
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  352  	case MADV_HUGEPAGE:
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  353  #ifdef CONFIG_S390
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  354  		/*
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  355  		 * qemu blindly sets MADV_HUGEPAGE on all allocations, but s390
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  356  		 * can't handle this properly after s390_enable_sie, so we simply
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  357  		 * ignore the madvise to prevent qemu from causing a SIGSEGV.
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  358  		 */
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 @359  		if (mm_has_pgste(vma->vm_mm))
d2a8f83f11a4ba Usama Arif         2025-05-19  360  			return -EPERM;
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  361  #endif
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  362  		*vm_flags &= ~VM_NOHUGEPAGE;
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  363  		*vm_flags |= VM_HUGEPAGE;
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  364  		break;
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  365  	case MADV_NOHUGEPAGE:
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  366  		*vm_flags &= ~VM_HUGEPAGE;
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  367  		*vm_flags |= VM_NOHUGEPAGE;
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  368  		/*
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  369  		 * Setting VM_NOHUGEPAGE will prevent khugepaged from scanning
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  370  		 * this vma even if we leave the mm registered in khugepaged if
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  371  		 * it got registered before VM_NOHUGEPAGE was set.
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  372  		 */
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  373  		break;
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  374  	}
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  375  
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  376  	return 0;
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  377  }
b46e756f5e4703 Kirill A. Shutemov 2016-07-26  378  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise
  2025-05-19 22:29 ` [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise Usama Arif
  2025-05-20  9:51   ` kernel test robot
@ 2025-05-20 14:43   ` Lorenzo Stoakes
  2025-05-20 14:57     ` Usama Arif
  1 sibling, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 14:43 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
	laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	vbabka, jannh, Arnd Bergmann, linux-kernel, linux-doc,
	kernel-team

This commit message is really poor. You're also not mentioning that you're
changing s390 behaviour?

On Mon, May 19, 2025 at 11:29:53PM +0100, Usama Arif wrote:
> This is so that flag setting can be resused later in other functions,

Typo.

> to reduce code duplication (including the s390 exception).
>
> No functional change intended with this patch.

I'm pretty sure somebody reviewed that this should just be merged with whatever
uses this? I'm not sure this is all that valuable as you're not really changing
this structurally very much.

>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>

Yeah I'm not a fan of this patch, it's buggy and really unclear what the
purpose is here.

> ---
>  include/linux/huge_mm.h |  1 +
>  mm/khugepaged.c         | 26 +++++++++++++++++---------
>  2 files changed, 18 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..23580a43787c 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -431,6 +431,7 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  			__split_huge_pud(__vma, __pud, __address);	\
>  	}  while (0)
>
> +int hugepage_set_vmflags(unsigned long *vm_flags, int advice);
>  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
>  		     int advice);
>  int madvise_collapse(struct vm_area_struct *vma,
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b04b6a770afe..ab3427c87422 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -346,8 +346,7 @@ struct attribute_group khugepaged_attr_group = {
>  };
>  #endif /* CONFIG_SYSFS */
>
> -int hugepage_madvise(struct vm_area_struct *vma,
> -		     unsigned long *vm_flags, int advice)
> +int hugepage_set_vmflags(unsigned long *vm_flags, int advice)


>  {
>  	switch (advice) {
>  	case MADV_HUGEPAGE:
> @@ -358,16 +357,10 @@ int hugepage_madvise(struct vm_area_struct *vma,
>  		 * ignore the madvise to prevent qemu from causing a SIGSEGV.
>  		 */
>  		if (mm_has_pgste(vma->vm_mm))

This is broken, you refer to vma which doesn't exist.

As the kernel bots are telling you...

> -			return 0;
> +			return -EPERM;

Why are you now returning an error?

This seems like a super broken way of making the caller return 0. Just make this
whole thing a bool return if you're going to treat it like a boolean function.

>  #endif
>  		*vm_flags &= ~VM_NOHUGEPAGE;
>  		*vm_flags |= VM_HUGEPAGE;
> -		/*
> -		 * If the vma become good for khugepaged to scan,
> -		 * register it here without waiting a page fault that
> -		 * may not happen any time soon.
> -		 */
> -		khugepaged_enter_vma(vma, *vm_flags);
>  		break;
>  	case MADV_NOHUGEPAGE:
>  		*vm_flags &= ~VM_HUGEPAGE;
> @@ -383,6 +376,21 @@ int hugepage_madvise(struct vm_area_struct *vma,
>  	return 0;
>  }
>
> +int hugepage_madvise(struct vm_area_struct *vma,
> +		     unsigned long *vm_flags, int advice)
> +{
> +	if (advice == MADV_HUGEPAGE && !hugepage_set_vmflags(vm_flags, advice)) {

So now you've completely broken MADV_NOHUGEPAGE haven't you?

> +		/*
> +		 * If the vma become good for khugepaged to scan,
> +		 * register it here without waiting a page fault that
> +		 * may not happen any time soon.
> +		 */
> +		khugepaged_enter_vma(vma, *vm_flags);
> +	}
> +
> +	return 0;
> +}
> +
>  int __init khugepaged_init(void)
>  {
>  	mm_slot_cache = KMEM_CACHE(khugepaged_mm_slot, 0);
> --
> 2.47.1
>
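
The MADV_NOHUGEPAGE problem raised above follows from C's short-circuit evaluation: when the advice comparison is false, the right-hand side of && is not evaluated, so the flag helper is never called. This is a minimal userspace sketch, assuming stand-in constants and helper names (everything prefixed v3_ or V3_ is hypothetical and only mirrors the shape of the posted code):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in values for this sketch only; these are not the kernel's definitions. */
enum { V3_MADV_HUGEPAGE = 1, V3_MADV_NOHUGEPAGE = 2 };
#define V3_VM_HUGEPAGE   0x1UL
#define V3_VM_NOHUGEPAGE 0x2UL

static int v3_set_calls;	/* how many times the flag helper ran */

/* Same flag updates and 0-on-success return as the posted helper (s390 case omitted). */
static int v3_set_vmflags(unsigned long *vm_flags, int advice)
{
	v3_set_calls++;
	switch (advice) {
	case V3_MADV_HUGEPAGE:
		*vm_flags &= ~V3_VM_NOHUGEPAGE;
		*vm_flags |= V3_VM_HUGEPAGE;
		break;
	case V3_MADV_NOHUGEPAGE:
		*vm_flags &= ~V3_VM_HUGEPAGE;
		*vm_flags |= V3_VM_NOHUGEPAGE;
		break;
	}
	return 0;
}

/* Same condition shape as the posted wrapper; true means khugepaged registration would run. */
static bool v3_madvise(unsigned long *vm_flags, int advice)
{
	if (advice == V3_MADV_HUGEPAGE && !v3_set_vmflags(vm_flags, advice))
		return true;
	return false;
}

static void v3_demo(void)
{
	unsigned long flags = V3_VM_HUGEPAGE;

	v3_set_calls = 0;
	/* && short-circuits: the helper is never called for MADV_NOHUGEPAGE. */
	assert(!v3_madvise(&flags, V3_MADV_NOHUGEPAGE));
	assert(v3_set_calls == 0);
	assert(flags == V3_VM_HUGEPAGE);	/* VM_NOHUGEPAGE was never set */

	/* MADV_HUGEPAGE still reaches the helper. */
	assert(v3_madvise(&flags, V3_MADV_HUGEPAGE));
	assert(v3_set_calls == 1);
}
```

Calling v3_demo() passes all assertions, which shows that a MADV_NOHUGEPAGE request leaves the flags untouched with this condition order.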


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise
  2025-05-20 14:43   ` Lorenzo Stoakes
@ 2025-05-20 14:57     ` Usama Arif
  2025-05-20 15:13       ` Usama Arif
  2025-05-20 15:31       ` Lorenzo Stoakes
  0 siblings, 2 replies; 25+ messages in thread
From: Usama Arif @ 2025-05-20 14:57 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
	laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	vbabka, jannh, Arnd Bergmann, linux-kernel, linux-doc,
	kernel-team



On 20/05/2025 15:43, Lorenzo Stoakes wrote:
> This commit message is really poor. You're also not mentioning that you're
> changing s390 behaviour?
> 
> On Mon, May 19, 2025 at 11:29:53PM +0100, Usama Arif wrote:
>> This is so that flag setting can be resused later in other functions,
> 
> Typo.
> 
>> to reduce code duplication (including the s390 exception).
>>
>> No functional change intended with this patch.
> 
> I'm pretty sure somebody reviewed that this should just be merged with whatever
> uses this? I'm not sure this is all that valuable as you're not really changing
> this structurally very much.
> 

Please see patch 2 where hugepage_set_vmflags is reused.
I was just trying to follow your feedback from previous revision that the flag
setting and s390 code part is duplicate code and should be common in the prctl
and madvise function.

I realize I messed up by dropping the vma argument and by getting the order of the if condition wrong.

>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> 
> Yeah I'm not a fan of this patch, it's buggy and really unclear what the
> purpose is here.

No functional change was intended (I realize the order below broke it, but that can be fixed).

In the previous revision it was:
+       case PR_SET_THP_POLICY:
+               if (arg3 || arg4 || arg5)
+                       return -EINVAL;
+               if (mmap_write_lock_killable(me->mm))
+                       return -EINTR;
+               switch (arg2) {
+               case PR_DEFAULT_MADV_HUGEPAGE:
+                       if (!hugepage_global_enabled())
+                               error = -EPERM;
+#ifdef CONFIG_S390
+                       /*
+                       * qemu blindly sets MADV_HUGEPAGE on all allocations, but s390
+                       * can't handle this properly after s390_enable_sie, so we simply
+                       * ignore the madvise to prevent qemu from causing a SIGSEGV.
+                       */
+                       else if (mm_has_pgste(vma->vm_mm))
+                               error = -EPERM;
+#endif
+                       else {
+                               me->mm->def_flags &= ~VM_NOHUGEPAGE;
+                               me->mm->def_flags |= VM_HUGEPAGE;
+                               process_default_madv_hugepage(me->mm, MADV_HUGEPAGE);
+                       }
+                       break;
...

Now with this hugepage_set_vmflags, it would be 

+       case PR_SET_THP_POLICY:
+               if (arg3 || arg4 || arg5)
+                       return -EINVAL;
+               if (mmap_write_lock_killable(mm))
+                       return -EINTR;
+               switch (arg2) {
+               case PR_DEFAULT_MADV_HUGEPAGE:
+                       if (!hugepage_global_enabled())
+                               error = -EPERM;
+                       error = hugepage_set_vmflags(&mm->def_flags, MADV_HUGEPAGE);
+                       if (!error)
+                               process_default_madv_hugepage(mm, MADV_HUGEPAGE);
+                       break;


I am happy to go with either of the methods above, but was just trying to
incorporate your feedback :)

Would you like the method from previous version?
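
One difference between the two snippets above is worth spelling out: the previous revision used an if / else-if / else chain, while the hugepage_set_vmflags version assigns error unconditionally after the global-enabled check, so an earlier -EPERM is overwritten. This is a minimal userspace sketch, assuming stand-in helpers (all pol_ names are hypothetical, and the flag value 0x1 merely stands in for VM_HUGEPAGE):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for hugepage_set_vmflags(); it always succeeds here. */
static int pol_set_def_flags(unsigned long *def_flags)
{
	*def_flags |= 0x1UL;	/* stand-in for VM_HUGEPAGE */
	return 0;
}

/* Same statement order as the hugepage_set_vmflags variant quoted above. */
static int pol_prctl_posted(bool global_enabled, unsigned long *def_flags)
{
	int error = 0;

	if (!global_enabled)
		error = -EPERM;
	error = pol_set_def_flags(def_flags);	/* overwrites the -EPERM above */
	return error;
}

/* One possible ordering that keeps the -EPERM result (no locking modelled). */
static int pol_prctl_checked(bool global_enabled, unsigned long *def_flags)
{
	if (!global_enabled)
		return -EPERM;
	return pol_set_def_flags(def_flags);
}

static void pol_demo(void)
{
	unsigned long flags = 0;

	/* THP globally disabled, yet the posted order reports success and sets the flag. */
	assert(pol_prctl_posted(false, &flags) == 0);
	assert(flags == 0x1UL);

	/* The checked order rejects the request and leaves the flags alone. */
	flags = 0;
	assert(pol_prctl_checked(false, &flags) == -EPERM);
	assert(flags == 0);
}
```

The sketch models no locking. In the prctl path the mmap write lock is held at this point, so an else branch (as in the previous revision) would fit better there than an early return.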

> 
>> ---
>>  include/linux/huge_mm.h |  1 +
>>  mm/khugepaged.c         | 26 +++++++++++++++++---------
>>  2 files changed, 18 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2f190c90192d..23580a43787c 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -431,6 +431,7 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>  			__split_huge_pud(__vma, __pud, __address);	\
>>  	}  while (0)
>>
>> +int hugepage_set_vmflags(unsigned long *vm_flags, int advice);
>>  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
>>  		     int advice);
>>  int madvise_collapse(struct vm_area_struct *vma,
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index b04b6a770afe..ab3427c87422 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -346,8 +346,7 @@ struct attribute_group khugepaged_attr_group = {
>>  };
>>  #endif /* CONFIG_SYSFS */
>>
>> -int hugepage_madvise(struct vm_area_struct *vma,
>> -		     unsigned long *vm_flags, int advice)
>> +int hugepage_set_vmflags(unsigned long *vm_flags, int advice)
> 
> 
>>  {
>>  	switch (advice) {
>>  	case MADV_HUGEPAGE:
>> @@ -358,16 +357,10 @@ int hugepage_madvise(struct vm_area_struct *vma,
>>  		 * ignore the madvise to prevent qemu from causing a SIGSEGV.
>>  		 */
>>  		if (mm_has_pgste(vma->vm_mm))
> 
> This is broken, you refer to vma which doesn't exist.
> 
> As the kernel bots are telling you...
> 
>> -			return 0;
>> +			return -EPERM;
> 
> Why are you now returning an error?
> 
> This seems like a super broken way of making the caller return 0. Just make this
> whole thing a bool return if you're going to treat it like a boolean function.
> 
>>  #endif
>>  		*vm_flags &= ~VM_NOHUGEPAGE;
>>  		*vm_flags |= VM_HUGEPAGE;
>> -		/*
>> -		 * If the vma become good for khugepaged to scan,
>> -		 * register it here without waiting a page fault that
>> -		 * may not happen any time soon.
>> -		 */
>> -		khugepaged_enter_vma(vma, *vm_flags);
>>  		break;
>>  	case MADV_NOHUGEPAGE:
>>  		*vm_flags &= ~VM_HUGEPAGE;
>> @@ -383,6 +376,21 @@ int hugepage_madvise(struct vm_area_struct *vma,
>>  	return 0;
>>  }
>>
>> +int hugepage_madvise(struct vm_area_struct *vma,
>> +		     unsigned long *vm_flags, int advice)
>> +{
>> +	if (advice == MADV_HUGEPAGE && !hugepage_set_vmflags(vm_flags, advice)) {
> 
> So now you've completely broken MADV_NOHUGEPAGE haven't you?
> 

Yeah order needs to be reversed.

>> +		/*
>> +		 * If the vma become good for khugepaged to scan,
>> +		 * register it here without waiting a page fault that
>> +		 * may not happen any time soon.
>> +		 */
>> +		khugepaged_enter_vma(vma, *vm_flags);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>  int __init khugepaged_init(void)
>>  {
>>  	mm_slot_cache = KMEM_CACHE(khugepaged_mm_slot, 0);
>> --
>> 2.47.1
>>



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise
  2025-05-20 14:57     ` Usama Arif
@ 2025-05-20 15:13       ` Usama Arif
  2025-05-20 15:31       ` Lorenzo Stoakes
  1 sibling, 0 replies; 25+ messages in thread
From: Usama Arif @ 2025-05-20 15:13 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
	laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	vbabka, jannh, Arnd Bergmann, linux-kernel, linux-doc,
	kernel-team



On 20/05/2025 15:57, Usama Arif wrote:
> 
> 
> On 20/05/2025 15:43, Lorenzo Stoakes wrote:
>> This commit message is really poor. You're also not mentioning that you're
>> changing s390 behaviour?
>>
>> On Mon, May 19, 2025 at 11:29:53PM +0100, Usama Arif wrote:
>>> This is so that flag setting can be resused later in other functions,
>>
>> Typo.
>>
>>> to reduce code duplication (including the s390 exception).
>>>
>>> No functional change intended with this patch.
>>
>> I'm pretty sure somebody reviewed that this should just be merged with whatever
>> uses this? I'm not sure this is all that valuable as you're not really changing
>> this structurally very much.
>>
> 

Unfortunately I never tested the s390 build, which is what the kernel bot is
complaining about.

So if I want to reuse hugepage_set_vmflags in patches 2 and 3 for the prctls,
the fix for this patch is the diff at the end of this email.

If you don't like the approach of trying to abstract the flag setting away
and reusing it in prctl in this patch I can change it to the way in previous
revision and just do something like below. Happy with either approach and
can drop patch 1 if you prefer.


+       case PR_SET_THP_POLICY:
+               if (arg3 || arg4 || arg5)
+                       return -EINVAL;
+               if (mmap_write_lock_killable(me->mm))
+                       return -EINTR;
+               switch (arg2) {
+               case PR_DEFAULT_MADV_HUGEPAGE:
+                       if (!hugepage_global_enabled())
+                               error = -EPERM;
+#ifdef CONFIG_S390
+                       /*
+                       * qemu blindly sets MADV_HUGEPAGE on all allocations, but s390
+                       * can't handle this properly after s390_enable_sie, so we simply
+                       * ignore the madvise to prevent qemu from causing a SIGSEGV.
+                       */
+                       else if (mm_has_pgste(me->mm))
+                               error = -EPERM;
+#endif
+                       else {
+                               me->mm->def_flags &= ~VM_NOHUGEPAGE;
+                               me->mm->def_flags |= VM_HUGEPAGE;
+                               process_default_madv_hugepage(me->mm, MADV_HUGEPAGE);
+                       }
+                       break;
+               default:
+                       error = -EINVAL;
+               }
+               mmap_write_unlock(me->mm);
+               break;




Thanks!

diff for fixing this patch:

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b24a2e0ae642..e5176afaaffe 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -432,7 +432,7 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
        }  while (0)
 
 void process_default_madv_hugepage(struct mm_struct *mm, int advice);
-int hugepage_set_vmflags(unsigned long *vm_flags, int advice);
+int hugepage_set_vmflags(struct mm_struct *mm, unsigned long *vm_flags, int advice);
 int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
                     int advice);
 int madvise_collapse(struct vm_area_struct *vma,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ab3427c87422..b6c9ed6bb442 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -346,7 +346,7 @@ struct attribute_group khugepaged_attr_group = {
 };
 #endif /* CONFIG_SYSFS */
 
-int hugepage_set_vmflags(unsigned long *vm_flags, int advice)
+int hugepage_set_vmflags(struct mm_struct *mm, unsigned long *vm_flags, int advice)
 {
        switch (advice) {
        case MADV_HUGEPAGE:
@@ -356,8 +356,8 @@ int hugepage_set_vmflags(unsigned long *vm_flags, int advice)
                 * can't handle this properly after s390_enable_sie, so we simply
                 * ignore the madvise to prevent qemu from causing a SIGSEGV.
                 */
-               if (mm_has_pgste(vma->vm_mm))
-                       return -EPERM;
+               if (mm_has_pgste(mm))
+                       return 0;
 #endif
                *vm_flags &= ~VM_NOHUGEPAGE;
                *vm_flags |= VM_HUGEPAGE;
@@ -373,13 +373,14 @@ int hugepage_set_vmflags(unsigned long *vm_flags, int advice)
                break;
        }
 
-       return 0;
+       return 1;
 }
 
 int hugepage_madvise(struct vm_area_struct *vma,
                     unsigned long *vm_flags, int advice)
 {
-       if (advice == MADV_HUGEPAGE && !hugepage_set_vmflags(vm_flags, advice)) {
+       if (hugepage_set_vmflags(vma->vm_mm, vm_flags, advice)
+           && advice == MADV_HUGEPAGE) {
                /*
                 * If the vma become good for khugepaged to scan,
                 * register it here without waiting a page fault that
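
For illustration only, the ordering problem this diff addresses can be modelled in userspace. This is a sketch under stated assumptions, not kernel code: the demo_* names, the flag values and the registration counter are stand-ins, and the helper returns a bool where the diff returns 1 or 0.

```c
#include <stdbool.h>

/* Stand-in constants for illustration; the kernel defines the real ones. */
enum { DEMO_MADV_HUGEPAGE = 14, DEMO_MADV_NOHUGEPAGE = 15 };
#define DEMO_VM_HUGEPAGE   0x1UL
#define DEMO_VM_NOHUGEPAGE 0x2UL

/* Counts calls that stand in for khugepaged_enter_vma(). */
static int demo_khugepaged_registered;

/* Returns true when the flags were updated, false when the request is ignored. */
static bool demo_set_vmflags(bool has_pgste, unsigned long *vm_flags, int advice)
{
	switch (advice) {
	case DEMO_MADV_HUGEPAGE:
		if (has_pgste)	/* models the s390 exception */
			return false;
		*vm_flags &= ~DEMO_VM_NOHUGEPAGE;
		*vm_flags |= DEMO_VM_HUGEPAGE;
		break;
	case DEMO_MADV_NOHUGEPAGE:
		*vm_flags &= ~DEMO_VM_HUGEPAGE;
		*vm_flags |= DEMO_VM_NOHUGEPAGE;
		break;
	}
	return true;
}

/* v3 ordering: the advice test short-circuits, so NOHUGEPAGE never reaches the helper. */
static int demo_madvise_v3(bool has_pgste, unsigned long *vm_flags, int advice)
{
	if (advice == DEMO_MADV_HUGEPAGE &&
	    demo_set_vmflags(has_pgste, vm_flags, advice))
		demo_khugepaged_registered++;
	return 0;
}

/* Reordered: always update the flags first, then decide whether to register. */
static int demo_madvise_fixed(bool has_pgste, unsigned long *vm_flags, int advice)
{
	if (demo_set_vmflags(has_pgste, vm_flags, advice) &&
	    advice == DEMO_MADV_HUGEPAGE)
		demo_khugepaged_registered++;
	return 0;
}
```

With the v3 ordering, a DEMO_MADV_NOHUGEPAGE request leaves the flags untouched because the helper is never called. With the reordered condition the flags are updated first, and registration still happens only for DEMO_MADV_HUGEPAGE.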




* Re: [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise
  2025-05-20 14:57     ` Usama Arif
  2025-05-20 15:13       ` Usama Arif
@ 2025-05-20 15:31       ` Lorenzo Stoakes
  1 sibling, 0 replies; 25+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 15:31 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
	laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	vbabka, jannh, Arnd Bergmann, linux-kernel, linux-doc,
	kernel-team

On Tue, May 20, 2025 at 03:57:35PM +0100, Usama Arif wrote:
>
>
> On 20/05/2025 15:43, Lorenzo Stoakes wrote:
> > This commit message is really poor. You're also not mentioning that you're
> > changing s390 behaviour?
> >
> > On Mon, May 19, 2025 at 11:29:53PM +0100, Usama Arif wrote:
> >> This is so that flag setting can be resused later in other functions,
> >
> > Typo.
> >
> >> to reduce code duplication (including the s390 exception).
> >>
> >> No functional change intended with this patch.
> >
> > I'm pretty sure somebody reviewed that this should just be merged with whatever
> > uses this? I'm not sure this is all that valuable as you're not really changing
> > this structurally very much.
> >
>
> Please see patch 2 where hugepage_set_vmflags is reused.
> I was just trying to follow your feedback from previous revision that the flag
> setting and s390 code part is duplicate code and should be common in the prctl
> and madvise function.

Sure, but I think it'd be better as part of that patch probably. Perhaps I
was thinking of another comment in reference to a 'no function change'
remark.

>
> I realize I messed up the arg not having vma and the order of the if statement.

I am getting the strong impression here that you're rushing :)

I strongly suggest slowing things down here. We're in RC7, this is (or
should be) an RFC for us to explore concepts. There's no need for it.

I appreciate your input and enthusiasm, but clearly rushing is causing you
to make mistakes. I get it, we've all been there.

But right now we have, what, 5 maybe? THP series in-flight at the same time,
all touching similar stuff, and it'll make everybody's lives easier and
less chaotic if we take a little more time to assess.

We are ultimately going to choose what's best for the kernel, there's no
'race' as to which series is 'ready' first.

>
> >>
> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >
> > Yeah I'm not a fan of this patch, it's buggy and really unclear what the
> > purpose is here.
>
> No functional change was intended (I realized the order below broke it but can be fixed).
>
> In the previous revision it was:
> +       case PR_SET_THP_POLICY:
> +               if (arg3 || arg4 || arg5)
> +                       return -EINVAL;
> +               if (mmap_write_lock_killable(me->mm))
> +                       return -EINTR;
> +               switch (arg2) {
> +               case PR_DEFAULT_MADV_HUGEPAGE:
> +                       if (!hugepage_global_enabled())
> +                               error = -EPERM;
> +#ifdef CONFIG_S390
> +                       /*
> +                       * qemu blindly sets MADV_HUGEPAGE on all allocations, but s390
> +                       * can't handle this properly after s390_enable_sie, so we simply
> +                       * ignore the madvise to prevent qemu from causing a SIGSEGV.
> +                       */
> +                       else if (mm_has_pgste(vma->vm_mm))
> +                               error = -EPERM;
> +#endif
> +                       else {
> +                               me->mm->def_flags &= ~VM_NOHUGEPAGE;
> +                               me->mm->def_flags |= VM_HUGEPAGE;
> +                               process_default_madv_hugepage(me->mm, MADV_HUGEPAGE);
> +                       }
> +                       break;
> ...
>
> Now with this hugepage_set_vmflags, it would be
>
> +       case PR_SET_THP_POLICY:
> +               if (arg3 || arg4 || arg5)
> +                       return -EINVAL;
> +               if (mmap_write_lock_killable(mm))
> +                       return -EINTR;
> +               switch (arg2) {
> +               case PR_DEFAULT_MADV_HUGEPAGE:
> +                       if (!hugepage_global_enabled())
> +                               error = -EPERM;
> +                       else
> +                               error = hugepage_set_vmflags(&mm->def_flags, MADV_HUGEPAGE);
> +                       if (!error)
> +                               process_default_madv_hugepage(mm, MADV_HUGEPAGE);
> +                       break;
>
>
> I am happy to go with either of the methods above, but was just trying to
> incorporate your feedback :)
>
> Would you like the method from previous version?

I'm going to go ahead and overlook what would be in the UK 100% a
deployment of the finest British sarcasm here, and assume not intended :)

Very obviously we do not want to duplicate architecture-specific code. I'm
a little concerned you're ok with both (imagine if one changed but not the
other for instance), but clearly this series is unmergeable without
de-duplicating this.

My objections here are that you submitted a totally broken patch with a
poor commit message that seems that it could well be merged with the
subsequent patch.

I also have concerns about your levels of testing here - you completely
broke MADV_NOHUGEPAGE but didn't notice? Are you running self-tests? Do we
have one that'd pick that up? If not, can we have one like that?
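
As a rough idea of what such a check could look like (a sketch only, not an existing selftest): map an anonymous region, apply MADV_NOHUGEPAGE, and confirm that the "nh" mnemonic shows up in the VmFlags line of /proc/self/smaps for that mapping. The helper names vma_has_vmflag() and check_madv_nohugepage() are made up for illustration.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Look up the VMA containing addr in /proc/self/smaps and report whether
 * its VmFlags line carries the given two-letter mnemonic.
 * Returns 1 if present, 0 if absent, -1 if no VmFlags line was found.
 */
static int vma_has_vmflag(const void *addr, const char *mnemonic)
{
	unsigned long target = (unsigned long)addr, start, end;
	char line[1024], needle[8];
	int in_vma = 0, ret = -1;
	FILE *f = fopen("/proc/self/smaps", "r");

	if (!f)
		return -1;
	snprintf(needle, sizeof(needle), " %s", mnemonic);
	while (fgets(line, sizeof(line), f)) {
		/* A mapping header looks like "start-end perms offset dev inode". */
		if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
			in_vma = target >= start && target < end;
			continue;
		}
		if (in_vma && !strncmp(line, "VmFlags:", 8)) {
			ret = strstr(line + 8, needle) != NULL;
			break;
		}
	}
	fclose(f);
	return ret;
}

/* Returns 1 if MADV_NOHUGEPAGE took effect, 0 if not, -1 if it could not be checked. */
static int check_madv_nohugepage(void)
{
	size_t len = 4UL << 20;
	int ret;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return -1;
	if (madvise(p, len, MADV_NOHUGEPAGE))
		ret = -1;	/* e.g. kernel built without THP support */
	else
		ret = vma_has_vmflag(p, "nh");
	munmap(p, len);
	return ret;
}
```

A result of 0 from check_madv_nohugepage() would be the signal that the advice was accepted but the VMA flag was never set, which is the failure mode discussed above.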

Thanks!

>
> >
> >> ---
> >>  include/linux/huge_mm.h |  1 +
> >>  mm/khugepaged.c         | 26 +++++++++++++++++---------
> >>  2 files changed, 18 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> index 2f190c90192d..23580a43787c 100644
> >> --- a/include/linux/huge_mm.h
> >> +++ b/include/linux/huge_mm.h
> >> @@ -431,6 +431,7 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >>  			__split_huge_pud(__vma, __pud, __address);	\
> >>  	}  while (0)
> >>
> >> +int hugepage_set_vmflags(unsigned long *vm_flags, int advice);
> >>  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> >>  		     int advice);
> >>  int madvise_collapse(struct vm_area_struct *vma,
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index b04b6a770afe..ab3427c87422 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -346,8 +346,7 @@ struct attribute_group khugepaged_attr_group = {
> >>  };
> >>  #endif /* CONFIG_SYSFS */
> >>
> >> -int hugepage_madvise(struct vm_area_struct *vma,
> >> -		     unsigned long *vm_flags, int advice)
> >> +int hugepage_set_vmflags(unsigned long *vm_flags, int advice)
> >
> >
> >>  {
> >>  	switch (advice) {
> >>  	case MADV_HUGEPAGE:
> >> @@ -358,16 +357,10 @@ int hugepage_madvise(struct vm_area_struct *vma,
> >>  		 * ignore the madvise to prevent qemu from causing a SIGSEGV.
> >>  		 */
> >>  		if (mm_has_pgste(vma->vm_mm))
> >
> > This is broken, you refer to vma which doesn't exist.
> >
> > As the kernel bots are telling you...
> >
> >> -			return 0;
> >> +			return -EPERM;
> >
> > Why are you now returning an error?
> >
> > This seems like a super broken way of making the caller return 0. Just make this
> > whole thing a bool return if you're going to treat it like a boolean function.
> >
> >>  #endif
> >>  		*vm_flags &= ~VM_NOHUGEPAGE;
> >>  		*vm_flags |= VM_HUGEPAGE;
> >> -		/*
> >> -		 * If the vma become good for khugepaged to scan,
> >> -		 * register it here without waiting a page fault that
> >> -		 * may not happen any time soon.
> >> -		 */
> >> -		khugepaged_enter_vma(vma, *vm_flags);
> >>  		break;
> >>  	case MADV_NOHUGEPAGE:
> >>  		*vm_flags &= ~VM_HUGEPAGE;
> >> @@ -383,6 +376,21 @@ int hugepage_madvise(struct vm_area_struct *vma,
> >>  	return 0;
> >>  }
> >>
> >> +int hugepage_madvise(struct vm_area_struct *vma,
> >> +		     unsigned long *vm_flags, int advice)
> >> +{
> >> +	if (advice == MADV_HUGEPAGE && !hugepage_set_vmflags(vm_flags, advice)) {
> >
> > So now you've completely broken MADV_NOHUGEPAGE haven't you?
> >
>
> Yeah order needs to be reversed.
>
> >> +		/*
> >> +		 * If the vma become good for khugepaged to scan,
> >> +		 * register it here without waiting a page fault that
> >> +		 * may not happen any time soon.
> >> +		 */
> >> +		khugepaged_enter_vma(vma, *vm_flags);
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >> +
> >>  int __init khugepaged_init(void)
> >>  {
> >>  	mm_slot_cache = KMEM_CACHE(khugepaged_mm_slot, 0);
> >> --
> >> 2.47.1
> >>
>



* Re: [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY
  2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
                   ` (7 preceding siblings ...)
  2025-05-20  5:14 ` [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Lorenzo Stoakes
@ 2025-05-21  2:33 ` Liam R. Howlett
  2025-05-21  9:31   ` Usama Arif
  2025-05-22 12:10 ` Mike Rapoport
  9 siblings, 1 reply; 25+ messages in thread
From: Liam R. Howlett @ 2025-05-21  2:33 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
	laoar.shao, baolin.wang, lorenzo.stoakes, npache, ryan.roberts,
	vbabka, jannh, Arnd Bergmann, linux-kernel, linux-doc,
	kernel-team

* Usama Arif <usamaarif642@gmail.com> [250519 18:34]:
> This series allows to change the THP policy of a process, according to the
> value set in arg2, all of which will be inherited during fork+exec:
> - PR_DEFAULT_MADV_HUGEPAGE: This will set VM_HUGEPAGE and clear VM_NOHUGEPAGE
>   for the default VMA flags. It will also iterate through every VMA in the
>   process and call hugepage_madvise on it, with MADV_HUGEPAGE policy.
>   This effectively allows setting MADV_HUGEPAGE on the entire process.
>   In an environment where different types of workloads are run on the
>   same machine, this will allow workloads that benefit from always having
>   hugepages to do so, without regressing those that don't.
> - PR_DEFAULT_MADV_NOHUGEPAGE: This will set VM_NOHUGEPAGE and clear VM_HUGEPAGE
>   for the default VMA flags. It will also iterate through every VMA in the
>   process and call hugepage_madvise on it, with MADV_NOHUGEPAGE policy.
>   This effectively allows setting MADV_NOHUGEPAGE on the entire process.
>   In an environment where different types of workloads are run on the
>   same machine, this will allow workloads that benefit from having
>   hugepages on an madvise basis only to do so, without regressing those
>   that benefit from having hugepages always.
> - PR_THP_POLICY_SYSTEM: This will reset (clear) both VM_HUGEPAGE and
>   VM_NOHUGEPAGE process for the default flags.
> 

Subject seems outdated now?  PR_DEFAULT_ vs PR_SET/GET_THP ?

On that note, doesn't it make sense to change the default mm flag under
PR_SET_MM?  PR_SET_MM_FLAG maybe?

Thanks,
Liam



* Re: [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY
  2025-05-21  2:33 ` Liam R. Howlett
@ 2025-05-21  9:31   ` Usama Arif
  2025-05-21 16:37     ` Liam R. Howlett
  0 siblings, 1 reply; 25+ messages in thread
From: Usama Arif @ 2025-05-21  9:31 UTC (permalink / raw)
  To: Liam R. Howlett, Andrew Morton, david, linux-mm, hannes,
	shakeel.butt, riel, ziy, laoar.shao, baolin.wang, lorenzo.stoakes,
	npache, ryan.roberts, vbabka, jannh, Arnd Bergmann, linux-kernel,
	linux-doc, kernel-team



On 21/05/2025 03:33, Liam R. Howlett wrote:
> * Usama Arif <usamaarif642@gmail.com> [250519 18:34]:
>> This series allows to change the THP policy of a process, according to the
>> value set in arg2, all of which will be inherited during fork+exec:
>> - PR_DEFAULT_MADV_HUGEPAGE: This will set VM_HUGEPAGE and clear VM_NOHUGEPAGE
>>   for the default VMA flags. It will also iterate through every VMA in the
>>   process and call hugepage_madvise on it, with MADV_HUGEPAGE policy.
>>   This effectively allows setting MADV_HUGEPAGE on the entire process.
>>   In an environment where different types of workloads are run on the
>>   same machine, this will allow workloads that benefit from always having
>>   hugepages to do so, without regressing those that don't.
>> - PR_DEFAULT_MADV_NOHUGEPAGE: This will set VM_NOHUGEPAGE and clear VM_HUGEPAGE
>>   for the default VMA flags. It will also iterate through every VMA in the
>>   process and call hugepage_madvise on it, with MADV_NOHUGEPAGE policy.
>>   This effectively allows setting MADV_NOHUGEPAGE on the entire process.
>>   In an environment where different types of workloads are run on the
>>   same machine, this will allow workloads that benefit from having
>>   hugepages on an madvise basis only to do so, without regressing those
>>   that benefit from having hugepages always.
>> - PR_THP_POLICY_SYSTEM: This will reset (clear) both VM_HUGEPAGE and
>>   VM_NOHUGEPAGE process for the default flags.
>>
> 
> Subject seems outdated now?  PR_DEFAULT_ vs PR_SET/GET_THP ?

No, it's not.

prctl takes 5 args, the first 2 are relevant here.

The first arg is to decide the op. This series introduces 2 ops. PR_SET_THP_POLICY
and PR_GET_THP_POLICY to set and get the policy. This is the subject.

The 2nd arg describes the policies: PR_DEFAULT_MADV_HUGEPAGE, PR_DEFAULT_MADV_NOHUGEPAGE
and PR_THP_POLICY_SYSTEM.

The subject is correct.
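
To make the calling convention concrete, here is a minimal userspace sketch. The numeric values below are placeholders invented for illustration (the real ones would come from include/uapi/linux/prctl.h once the series is applied), and set_thp_policy() is a hypothetical wrapper. On a kernel without this series the call is expected to fail, typically with EINVAL.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * PLACEHOLDER values, invented for this sketch only. They are not the
 * numbers the series assigns in include/uapi/linux/prctl.h.
 */
#ifndef PR_SET_THP_POLICY
#define PR_SET_THP_POLICY		0x7fff0001	/* arg1: the op */
#define PR_GET_THP_POLICY		0x7fff0002	/* arg1: the op */
#define PR_THP_POLICY_SYSTEM		0		/* arg2: the policy */
#define PR_DEFAULT_MADV_HUGEPAGE	1		/* arg2: the policy */
#define PR_DEFAULT_MADV_NOHUGEPAGE	2		/* arg2: the policy */
#endif

/* Hypothetical wrapper: returns 0 on success or a negative errno value. */
static int set_thp_policy(unsigned long policy)
{
	/* arg3..arg5 are passed as zero; the handler rejects anything else. */
	if (prctl(PR_SET_THP_POLICY, policy, 0UL, 0UL, 0UL) == -1)
		return -errno;
	return 0;
}
```

A launcher would call set_thp_policy(PR_DEFAULT_MADV_HUGEPAGE) before exec'ing the workload, relying on the policy being inherited across fork+exec as described in the cover letter.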

> 
> On that note, doesn't it make sense to change the default mm flag under
> PR_SET_MM?  PR_SET_MM_FLAG maybe?

I don't think that's the right approach. PR_SET_MM is used to modify kernel
memory map descriptor fields. That's not what we are doing here.

I am not sure how the usecase in this series fits at all in the below 
switch statement for PR_SET_MM:

	switch (opt) {
	case PR_SET_MM_START_CODE:
		prctl_map.start_code = addr;
		break;
	case PR_SET_MM_END_CODE:
		prctl_map.end_code = addr;
		break;
	case PR_SET_MM_START_DATA:
		prctl_map.start_data = addr;
		break;
	case PR_SET_MM_END_DATA:
		prctl_map.end_data = addr;
		break;
	case PR_SET_MM_START_STACK:
		prctl_map.start_stack = addr;
		break;
	case PR_SET_MM_START_BRK:
		prctl_map.start_brk = addr;
		break;
	case PR_SET_MM_BRK:
		prctl_map.brk = addr;
		break;
	case PR_SET_MM_ARG_START:
		prctl_map.arg_start = addr;
		break;
	case PR_SET_MM_ARG_END:
		prctl_map.arg_end = addr;
		break;
	case PR_SET_MM_ENV_START:
		prctl_map.env_start = addr;
		break;
	case PR_SET_MM_ENV_END:
		prctl_map.env_end = addr;
		break;
	default:
		goto out;
	}


> 
> Thanks,
> Liam




* Re: [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY
  2025-05-21  9:31   ` Usama Arif
@ 2025-05-21 16:37     ` Liam R. Howlett
  0 siblings, 0 replies; 25+ messages in thread
From: Liam R. Howlett @ 2025-05-21 16:37 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
	laoar.shao, baolin.wang, lorenzo.stoakes, npache, ryan.roberts,
	vbabka, jannh, Arnd Bergmann, linux-kernel, linux-doc,
	kernel-team

* Usama Arif <usamaarif642@gmail.com> [250521 05:31]:
> 
> 
> On 21/05/2025 03:33, Liam R. Howlett wrote:
> > * Usama Arif <usamaarif642@gmail.com> [250519 18:34]:
> >> This series allows to change the THP policy of a process, according to the
> >> value set in arg2, all of which will be inherited during fork+exec:
> >> - PR_DEFAULT_MADV_HUGEPAGE: This will set VM_HUGEPAGE and clear VM_NOHUGEPAGE
> >>   for the default VMA flags. It will also iterate through every VMA in the
> >>   process and call hugepage_madvise on it, with MADV_HUGEPAGE policy.
> >>   This effectively allows setting MADV_HUGEPAGE on the entire process.
> >>   In an environment where different types of workloads are run on the
> >>   same machine, this will allow workloads that benefit from always having
> >>   hugepages to do so, without regressing those that don't.
> >> - PR_DEFAULT_MADV_NOHUGEPAGE: This will set VM_NOHUGEPAGE and clear VM_HUGEPAGE
> >>   for the default VMA flags. It will also iterate through every VMA in the
> >>   process and call hugepage_madvise on it, with MADV_NOHUGEPAGE policy.
> >>   This effectively allows setting MADV_NOHUGEPAGE on the entire process.
> >>   In an environment where different types of workloads are run on the
> >>   same machine, this will allow workloads that benefit from having
> >>   hugepages on an madvise basis only to do so, without regressing those
> >>   that benefit from having hugepages always.
> >> - PR_THP_POLICY_SYSTEM: This will reset (clear) both VM_HUGEPAGE and
> >>   VM_NOHUGEPAGE process for the default flags.
> >>
> > 
> > Subject seems outdated now?  PR_DEFAULT_ vs PR_SET/GET_THP ?
> 
> No, it's not.
> 
> prctl takes 5 args, the first 2 are relevant here.
> 
> The first arg is to decide the op. This series introduces 2 ops. PR_SET_THP_POLICY
> and PR_GET_THP_POLICY to set and get the policy. This is the subject.
> 
> The 2nd arg describes the policies: PR_DEFAULT_MADV_HUGEPAGE, PR_DEFAULT_MADV_NOHUGEPAGE
> and PR_THP_POLICY_SYSTEM.
> 
> The subject is correct.

Thanks, that makes sense.  You are adding an entire new configuration
item to the prctl fun.

> 
> > 
> > On that note, doesn't it make sense to change the default mm flag under
> > PR_SET_MM?  PR_SET_MM_FLAG maybe?
> 
> I don't think that's the right approach. PR_SET_MM is used to modify kernel
> memory map descriptor fields. That's not what we are doing here.

Fair enough, you are changing the memory map default flags for vmas.

So we are going to add another top level THP specific prctl that changes
flags, but now def_flags and that's communicated by the word POLICY.

I'm not sure this is the right approach either.

Thanks,
Liam



* Re: [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY
  2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
                   ` (8 preceding siblings ...)
  2025-05-21  2:33 ` Liam R. Howlett
@ 2025-05-22 12:10 ` Mike Rapoport
  9 siblings, 0 replies; 25+ messages in thread
From: Mike Rapoport @ 2025-05-22 12:10 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
	laoar.shao, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, vbabka, jannh, Arnd Bergmann, linux-kernel,
	linux-doc, kernel-team, linux-api

(cc'ing linux-api)

On Mon, May 19, 2025 at 11:29:52PM +0100, Usama Arif wrote:
> This series allows to change the THP policy of a process, according to the
> value set in arg2, all of which will be inherited during fork+exec:
> - PR_DEFAULT_MADV_HUGEPAGE: This will set VM_HUGEPAGE and clear VM_NOHUGEPAGE
>   for the default VMA flags. It will also iterate through every VMA in the
>   process and call hugepage_madvise on it, with MADV_HUGEPAGE policy.
>   This effectively allows setting MADV_HUGEPAGE on the entire process.
>   In an environment where different types of workloads are run on the
>   same machine, this will allow workloads that benefit from always having
>   hugepages to do so, without regressing those that don't.
> - PR_DEFAULT_MADV_NOHUGEPAGE: This will set VM_NOHUGEPAGE and clear VM_HUGEPAGE
>   for the default VMA flags. It will also iterate through every VMA in the
>   process and call hugepage_madvise on it, with MADV_NOHUGEPAGE policy.
>   This effectively allows setting MADV_NOHUGEPAGE on the entire process.
>   In an environment where different types of workloads are run on the
>   same machine, this will allow workloads that benefit from having
>   hugepages on an madvise basis only to do so, without regressing those
>   that benefit from having hugepages always.
> - PR_THP_POLICY_SYSTEM: This will reset (clear) both VM_HUGEPAGE and
>   VM_NOHUGEPAGE process for the default flags.
> 
> In hyperscalers, we have a single THP policy for the entire fleet.
> We have different types of workloads (e.g. AI/compute/databases/etc)
> running on a single server.
> Some of these workloads will benefit from always getting THP at fault
> (or collapsed by khugepaged), some of them will benefit by only getting
> them at madvise.
> 
> This series is useful for 2 use cases:
> 1) global system policy = madvise, while we want some workloads to get THPs
> at fault and by khugepaged :- some processes (e.g. AI workloads) benefit
> from getting THPs at fault (and collapsed by khugepaged). Other workloads
> like databases will incur a regression (either a performance regression, or
> they are completely memory bound and even a very slight increase in memory
> will cause them to OOM). So what these patches will do is allow setting
> prctl(PR_DEFAULT_MADV_HUGEPAGE) on the AI workloads. (This is how
> workloads are deployed in our (Meta's/Facebook) fleet at this moment.)
> 
> 2) global system policy = always, while we want some workloads to get THPs
> only on madvise basis :- Same reason as 1). What these patches
> will do is allow setting prctl(PR_DEFAULT_MADV_NOHUGEPAGE) on the database
> workloads. (We hope this is us (Meta) in the near future, if a majority of
> workloads show that they benefit from always, we flip the default host
> setting to "always" across the fleet and workloads that regress can opt-out
> and be "madvise". New services developed will then be tested with always by
> default. "always" is also the default defconfig option upstream, so I would
> imagine this is faced by others as well.)
> 
> v2->v3: (Thanks Lorenzo for all the below feedback!)
> v2: https://lore.kernel.org/all/20250515133519.2779639-1-usamaarif642@gmail.com/
> - no more flags2.
> - no more MMF2_...
> - renamed policy to PR_DEFAULT_MADV_(NO)HUGEPAGE
> - mmap_write_lock_killable acquired in PR_GET_THP_POLICY
> - mmap_write lock fixed in PR_SET_THP_POLICY
> - mmap assert check in process_default_madv_hugepage
> - check if hugepage_global_enabled is enabled in the call and account for s390
> - set mm->def_flags VM_HUGEPAGE and VM_NOHUGEPAGE according to the policy in
>   the way done by madvise(). I believe VM merge will not be broken in
>   this way.
> - process_default_madv_hugepage function that does for_each_vma and calls
>   hugepage_madvise.
> 
> v1->v2:
> - change from modifying the THP decision making for the process, to modifying
>   VMA flags only. This prevents further complicating the logic used to
>   determine THP order (Thanks David!)
> - change from using a prctl per policy change to just using PR_SET_THP_POLICY
>   and arg2 to set the policy. (Zi Yan)
> - Introduce PR_THP_POLICY_DEFAULT_NOHUGE and PR_THP_POLICY_DEFAULT_SYSTEM
> - Add selftests and documentation.
>  
> Usama Arif (7):
>   mm: khugepaged: extract vm flag setting outside of hugepage_madvise
>   prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
>   prctl: introduce PR_DEFAULT_MADV_NOHUGEPAGE for the process
>   prctl: introduce PR_THP_POLICY_SYSTEM for the process
>   selftests: prctl: introduce tests for PR_DEFAULT_MADV_NOHUGEPAGE
>   selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
>   docs: transhuge: document process level THP controls
> 
>  Documentation/admin-guide/mm/transhuge.rst    |  42 +++
>  include/linux/huge_mm.h                       |   2 +
>  include/linux/mm.h                            |   2 +-
>  include/linux/mm_types.h                      |   4 +-
>  include/uapi/linux/prctl.h                    |   6 +
>  kernel/sys.c                                  |  53 ++++
>  mm/huge_memory.c                              |  13 +
>  mm/khugepaged.c                               |  26 +-
>  tools/include/uapi/linux/prctl.h              |   6 +
>  .../trace/beauty/include/uapi/linux/prctl.h   |   6 +
>  tools/testing/selftests/prctl/Makefile        |   2 +-
>  tools/testing/selftests/prctl/thp_policy.c    | 286 ++++++++++++++++++
>  12 files changed, 436 insertions(+), 12 deletions(-)
>  create mode 100644 tools/testing/selftests/prctl/thp_policy.c
> 
> -- 
> 2.47.1
> 
> 

-- 
Sincerely yours,
Mike.



Thread overview: 25+ messages
2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
2025-05-19 22:29 ` [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise Usama Arif
2025-05-20  9:51   ` kernel test robot
2025-05-20 14:43   ` Lorenzo Stoakes
2025-05-20 14:57     ` Usama Arif
2025-05-20 15:13       ` Usama Arif
2025-05-20 15:31       ` Lorenzo Stoakes
2025-05-19 22:29 ` [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process Usama Arif
2025-05-19 23:01   ` Jann Horn
2025-05-20  5:23     ` Lorenzo Stoakes
2025-05-20  9:09       ` David Hildenbrand
2025-05-20  9:16         ` Lorenzo Stoakes
2025-05-20  8:48   ` kernel test robot
2025-05-19 22:29 ` [PATCH v3 3/7] prctl: introduce PR_DEFAULT_MADV_NOHUGEPAGE " Usama Arif
2025-05-19 22:29 ` [PATCH v3 4/7] prctl: introduce PR_THP_POLICY_SYSTEM " Usama Arif
2025-05-19 22:29 ` [PATCH v3 5/7] selftests: prctl: introduce tests for PR_DEFAULT_MADV_NOHUGEPAGE Usama Arif
2025-05-19 22:29 ` [PATCH v3 6/7] selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE Usama Arif
2025-05-19 22:29 ` [PATCH v3 7/7] docs: transhuge: document process level THP controls Usama Arif
2025-05-20  5:14 ` [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Lorenzo Stoakes
2025-05-20  7:46   ` Usama Arif
2025-05-20  8:51     ` Lorenzo Stoakes
2025-05-21  2:33 ` Liam R. Howlett
2025-05-21  9:31   ` Usama Arif
2025-05-21 16:37     ` Liam R. Howlett
2025-05-22 12:10 ` Mike Rapoport
