linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v4 0/7] prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised
@ 2025-08-13 13:55 Usama Arif
  2025-08-13 13:55 ` [PATCH v4 1/7] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE Usama Arif
                   ` (6 more replies)
  0 siblings, 7 replies; 34+ messages in thread
From: Usama Arif @ 2025-08-13 13:55 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
	shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
	npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts, vbabka,
	jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team,
	Usama Arif

This series allows individual processes to opt out of THP = "always" and
fall back to THP = "madvise" behaviour, without affecting other workloads
on the system. The approach has been discussed extensively on the mailing
list and is summarized by David in the first patch, which also links to
the alternatives that were considered. Please refer to the commit message
of the first patch for the motivation for this series.

Patch 1 adds the PR_THP_DISABLE_EXCEPT_ADVISED flag to implement this, along
with the MMF changes.
Patch 2 is a cleanup patch for tva_flags that will allow the forced collapse
case to be transmitted to vma_thp_disabled (which is done in patch 3).
Patch 4 adds documentation for PR_SET_THP_DISABLE/PR_GET_THP_DISABLE.
Patch 5 extracts the sz2ord function into vm_util.h.
Patches 6-7 implement the selftests for PR_SET_THP_DISABLE: disabling THPs
completely (old behaviour) and only enabling them when advised
(PR_THP_DISABLE_EXCEPT_ADVISED).

The patches are tested on top of 694c8e78f486b09137ee3efadae044d01aba971b
from mm-new.

v3 -> v4: https://lore.kernel.org/all/20250804154317.1648084-1-usamaarif642@gmail.com/
- rebase to latest mm-new (Aug 13), which includes the mm flag changes from Lorenzo.
- remove mention of MM flags from admin doc in transhuge.rst and other
  improvements to it (David and Lorenzo)
- extract size2ord into vm_util.h (David)
- check if the respective prctl can be set in the fixture setup instead of the fixture
  itself (David)

v2 -> v3: https://lore.kernel.org/all/20250731122825.2102184-1-usamaarif642@gmail.com/
- Fix sign off and added ack for patch 1 (Lorenzo and Zi Yan)
- Fix up commit message, comments and variable names in patch 2 and 3 (Lorenzo)
- Added documentation for PR_SET_THP_DISABLE/PR_GET_THP_DISABLE (Lorenzo)
- remove struct test_results and enum thp_policy for prctl tests (David)

v1 -> v2: https://lore.kernel.org/all/20250725162258.1043176-1-usamaarif642@gmail.com/
- Change thp_push_settings to thp_write_settings (David)
- Add tests for all the system policies for the prctl call (David)
- Small fixes and cleanups


David Hildenbrand (3):
  prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
  mm/huge_memory: convert "tva_flags" to "enum tva_type"
  mm/huge_memory: respect MADV_COLLAPSE with
    PR_THP_DISABLE_EXCEPT_ADVISED

Usama Arif (4):
  docs: transhuge: document process level THP controls
  selftest/mm: Extract sz2ord function into vm_util.h
  selftests: prctl: introduce tests for disabling THPs completely
  selftests: prctl: introduce tests for disabling THPs except for
    madvise

 Documentation/admin-guide/mm/transhuge.rst    |  37 +++
 Documentation/filesystems/proc.rst            |   5 +-
 fs/proc/array.c                               |   2 +-
 fs/proc/task_mmu.c                            |   4 +-
 include/linux/huge_mm.h                       |  60 ++--
 include/linux/mm_types.h                      |  14 +-
 include/uapi/linux/prctl.h                    |  10 +
 kernel/sys.c                                  |  59 +++-
 mm/huge_memory.c                              |  11 +-
 mm/khugepaged.c                               |  19 +-
 mm/memory.c                                   |  20 +-
 mm/shmem.c                                    |   2 +-
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   1 +
 tools/testing/selftests/mm/cow.c              |  12 +-
 .../testing/selftests/mm/prctl_thp_disable.c  | 275 ++++++++++++++++++
 tools/testing/selftests/mm/thp_settings.c     |   9 +-
 tools/testing/selftests/mm/thp_settings.h     |   1 +
 tools/testing/selftests/mm/uffd-wp-mremap.c   |   9 +-
 tools/testing/selftests/mm/vm_util.h          |   5 +
 20 files changed, 470 insertions(+), 86 deletions(-)
 create mode 100644 tools/testing/selftests/mm/prctl_thp_disable.c

-- 
2.47.3



* [PATCH v4 1/7] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
  2025-08-13 13:55 [PATCH v4 0/7] prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised Usama Arif
@ 2025-08-13 13:55 ` Usama Arif
  2025-08-13 13:55 ` [PATCH v4 2/7] mm/huge_memory: convert "tva_flags" to "enum tva_type" Usama Arif
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 34+ messages in thread
From: Usama Arif @ 2025-08-13 13:55 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
	shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
	npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts, vbabka,
	jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team,
	Usama Arif

From: David Hildenbrand <david@redhat.com>

People want to make use of more THPs, for example, moving from
the "never" system policy to "madvise", or from "madvise" to "always".

While this is great news for every THP desperately waiting to get
allocated out there, apparently there are some workloads that require a
bit of care during that transition: individual processes may need to
opt-out from this behavior for various reasons, and this should be
permitted without needing to make all other workloads on the system
similarly opt-out.

The following scenarios are imaginable:

(1) Switch from "never" system policy to "madvise"/"always", but keep THPs
    disabled for selected workloads.

(2) Stay at "never" system policy, but enable THPs for selected
    workloads, making only these workloads use the "madvise" or "always"
    policy.

(3) Switch from "madvise" system policy to "always", but keep the
    "madvise" policy for selected workloads: allocate THPs only when
    advised.

(4) Stay at "madvise" system policy, but enable THPs even when not advised
    for selected workloads -- "always" policy.

One can emulate (2) through (1), by setting the system policy to
"madvise"/"always" while disabling THPs for all processes that don't want
THPs. It requires configuring all workloads, but that is a user-space
problem to sort out.

(4) can be emulated through (3) in a similar way.

Back when (1) was relevant in the past, as people started enabling THPs,
we added PR_SET_THP_DISABLE, so relevant workloads that were not ready
yet (e.g., Redis) were able to just disable THPs completely. Redis
still implements the option to use this interface to disable THPs
completely.

With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
workload -- a process, including fork+exec'ed process hierarchy.
That essentially made us support (1): simply disable THPs for all workloads
that are not ready for THPs yet, while still enabling THPs system-wide.

The quest for handling (3) and (4) started, but current approaches
(completely new prctl, options to set other policies per process,
alternatives to prctl -- mctrl, cgroup handling) don't look particularly
promising. Likely, the future will use bpf or something similar to
implement better policies, in particular to also make better decisions
about THP sizes to use, but this will certainly take a while as that work
just started.

Long story short: a simple enable/disable is not really suitable for the
future, so we're not willing to add completely new toggles.

While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
completely for these processes, this is a step backwards, because these
processes can no longer allocate THPs in regions where THPs were
explicitly advised: regions flagged as VM_HUGEPAGE. Apparently, that
poses a problem for relevant workloads, because "no THPs" is certainly
worse than "THPs only when advised".

Could we simply relax PR_SET_THP_DISABLE to "disable THPs unless
explicitly advised by the app through MADV_HUGEPAGE"? *Maybe*, but this
would change the documented semantics quite a bit, and the versatility
to use it for debugging purposes, so I am not 100% sure that is what we
want -- although it would certainly be much easier.

So instead, as an easy way forward for (3) and (4), add an option to
make PR_SET_THP_DISABLE disable *fewer* THPs for a process.

In essence, this patch:

(A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
    of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).

    prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED).

(B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
    PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.

    Previously, it would return 1 if THPs were disabled completely. Now
    it returns the set flags as well: 3 if PR_THP_DISABLE_EXCEPT_ADVISED
    was set.

(C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
    the semantics clearly.

    Fortunately, there are only two instances outside of prctl() code.

(D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
    with VM_HUGEPAGE" -- essentially "thp=madvise" behavior

    Fortunately, we only have to extend vma_thp_disabled().

(E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are
    disabled completely

    Only indicating that THPs are disabled when they are really disabled
    completely, not only partially.

    For now, we don't add another interface to obtain whether THPs
    are disabled partially (PR_THP_DISABLE_EXCEPT_ADVISED was set). If
    ever required, we could add a new entry.
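
From user space, (A) and (B) could be exercised roughly as follows. This
is only a sketch and not part of the patch: the helper names
thp_disable_except_advised() and thp_disable_mode() are hypothetical, and
the fallback #define merely mirrors the value this patch adds to
include/uapi/linux/prctl.h so the code also builds against older headers.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallback for older UAPI headers; value taken from this patch. */
#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
#endif

/* Hypothetical decoding of the PR_GET_THP_DISABLE return value. */
enum thp_disable_mode {
	THP_MODE_ERROR = -1,		/* prctl() failed */
	THP_MODE_ENABLED = 0,		/* returned 0: nothing disabled */
	THP_MODE_DISABLED_COMPLETELY,	/* returned 1 */
	THP_MODE_DISABLED_EXCEPT_ADVISED, /* returned 1 | flag == 3 */
};

/*
 * Hypothetical helper: request "THPs only when advised" for the calling
 * process. Returns 0 on success, or -1 with errno set. Kernels without
 * this patch reject a non-zero arg3 with EINVAL.
 */
static int thp_disable_except_advised(void)
{
	return prctl(PR_SET_THP_DISABLE, 1UL,
		     (unsigned long)PR_THP_DISABLE_EXCEPT_ADVISED, 0UL, 0UL);
}

/* Hypothetical helper: map a PR_GET_THP_DISABLE result to a mode. */
static enum thp_disable_mode thp_disable_mode(int ret)
{
	if (ret < 0)
		return THP_MODE_ERROR;
	if (ret == 0)
		return THP_MODE_ENABLED;
	if (ret & PR_THP_DISABLE_EXCEPT_ADVISED)
		return THP_MODE_DISABLED_EXCEPT_ADVISED;
	return THP_MODE_DISABLED_COMPLETELY;
}
```

A caller would typically invoke thp_disable_except_advised() early, treat
EINVAL as "kernel too old", and pass the result of
prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0) to thp_disable_mode() when it wants
to report the effective setting.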

The documented semantics in the man page for PR_SET_THP_DISABLE
"is inherited by a child created via fork(2) and is preserved across
execve(2)" is maintained. This behavior, for example, allows for
disabling THPs for a workload through the launching process (e.g.,
systemd where we fork() a helper process to then exec()).
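
The launcher pattern described above might look like the following
sketch. It is not part of the patch: run_thp_except_advised() is a
hypothetical helper, the fallback #define mirrors the value added by this
patch, and continuing after a failed prctl() is just one possible policy
for kernels that lack the new flag.

```c
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallback for older UAPI headers; value taken from this patch. */
#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
#endif

/*
 * Hypothetical launcher: fork a helper, restrict THPs to advised regions
 * in the helper, then exec the workload. The setting is expected to be
 * preserved across execve(), as described in the commit message.
 * Returns the workload's exit status, or -1 on failure.
 */
static int run_thp_except_advised(char *const argv[])
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: set the policy before replacing the process image. */
		if (prctl(PR_SET_THP_DISABLE, 1UL,
			  (unsigned long)PR_THP_DISABLE_EXCEPT_ADVISED,
			  0UL, 0UL))
			perror("PR_SET_THP_DISABLE (continuing without it)");
		execvp(argv[0], argv);
		_exit(127);	/* only reached if exec failed */
	}
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

Because the flag is set in the forked child only, the launching process
itself keeps whatever THP setting it had before.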

For now, MADV_COLLAPSE will *fail* in regions without VM_HUGEPAGE and
VM_NOHUGEPAGE. As MADV_COLLAPSE is clear advice that user space
thinks a THP is a good idea, we'll enable that separately next
(requiring a bit of cleanup first).

There is currently no way to prevent a process from issuing
PR_SET_THP_DISABLE itself to re-enable THPs. There are no known
users for re-enabling them, and it's against the purpose of the original
interface. So if ever required, we could investigate just forbidding
re-enabling them, or making this somehow configurable.

Acked-by: Usama Arif <usamaarif642@gmail.com>
Tested-by: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Acked-by: Zi Yan <ziy@nvidia.com>
---
 Documentation/filesystems/proc.rst |  5 ++-
 fs/proc/array.c                    |  2 +-
 include/linux/huge_mm.h            | 20 +++++++---
 include/linux/mm_types.h           | 14 +++----
 include/uapi/linux/prctl.h         | 10 +++++
 kernel/sys.c                       | 59 ++++++++++++++++++++++++------
 mm/khugepaged.c                    |  2 +-
 7 files changed, 83 insertions(+), 29 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2971551b72353..915a3e44bc120 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -291,8 +291,9 @@ It's slow but very precise.
  HugetlbPages                size of hugetlb memory portions
  CoreDumping                 process's memory is currently being dumped
                              (killing the process may lead to a corrupted core)
- THP_enabled		     process is allowed to use THP (returns 0 when
-			     PR_SET_THP_DISABLE is set on the process
+ THP_enabled                 process is allowed to use THP (returns 0 when
+                             PR_SET_THP_DISABLE is set on the process to disable
+                             THP completely, not just partially)
  Threads                     number of threads
  SigQ                        number of signals queued/max. number for queue
  SigPnd                      bitmap of pending signals for the thread
diff --git a/fs/proc/array.c b/fs/proc/array.c
index c286dc12325ed..d84b291dd1ed8 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -422,7 +422,7 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
 	bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
 
 	if (thp_enabled)
-		thp_enabled = !mm_flags_test(MMF_DISABLE_THP, mm);
+		thp_enabled = !mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
 	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
 }
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 84b7eebe0d685..22b8b067b295e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -318,16 +318,26 @@ struct thpsize {
 	(transparent_hugepage_flags &					\
 	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
 
+/*
+ * Check whether THPs are explicitly disabled for this VMA, for example,
+ * through madvise or prctl.
+ */
 static inline bool vma_thp_disabled(struct vm_area_struct *vma,
 		vm_flags_t vm_flags)
 {
+	/* Are THPs disabled for this VMA? */
+	if (vm_flags & VM_NOHUGEPAGE)
+		return true;
+	/* Are THPs disabled for all VMAs in the whole process? */
+	if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, vma->vm_mm))
+		return true;
 	/*
-	 * Explicitly disabled through madvise or prctl, or some
-	 * architectures may disable THP for some mappings, for
-	 * example, s390 kvm.
+	 * Are THPs disabled only for VMAs where we didn't get an explicit
+	 * advise to use them?
 	 */
-	return (vm_flags & VM_NOHUGEPAGE) ||
-	       mm_flags_test(MMF_DISABLE_THP, vma->vm_mm);
+	if (vm_flags & VM_HUGEPAGE)
+		return false;
+	return mm_flags_test(MMF_DISABLE_THP_EXCEPT_ADVISED, vma->vm_mm);
 }
 
 static inline bool thp_disabled_by_hw(void)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 47d2e4598acd6..3b369dfbbedd6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1781,19 +1781,17 @@ enum {
 #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
 #define MMF_VM_HUGEPAGE		17	/* set when mm is available for khugepaged */
 
-/*
- * This one-shot flag is dropped due to necessity of changing exe once again
- * on NFS restore
- */
-//#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
+#define MMF_HUGE_ZERO_FOLIO	18      /* mm has ever used the global huge zero folio */
 
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
 #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
-#define MMF_HUGE_ZERO_FOLIO	23      /* mm has ever used the global huge zero folio */
-#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
-#define MMF_DISABLE_THP_MASK	_BITUL(MMF_DISABLE_THP)
+
+#define MMF_DISABLE_THP_EXCEPT_ADVISED	23	/* no THP except when advised (e.g., VM_HUGEPAGE) */
+#define MMF_DISABLE_THP_COMPLETELY	24	/* no THP for all VMAs */
+#define MMF_DISABLE_THP_MASK	(_BITUL(MMF_DISABLE_THP_COMPLETELY) | \
+				 _BITUL(MMF_DISABLE_THP_EXCEPT_ADVISED))
 #define MMF_OOM_REAP_QUEUED	25	/* mm was queued for oom_reaper */
 #define MMF_MULTIPROCESS	26	/* mm is shared between processes */
 /*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index ed3aed264aeb2..150b6deebfb1e 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -177,7 +177,17 @@ struct prctl_mm_map {
 
 #define PR_GET_TID_ADDRESS	40
 
+/*
+ * Flags for PR_SET_THP_DISABLE are only applicable when disabling. Bit 0
+ * is reserved, so PR_GET_THP_DISABLE can return "1 | flags", to effectively
+ * return "1" when no flags were specified for PR_SET_THP_DISABLE.
+ */
 #define PR_SET_THP_DISABLE	41
+/*
+ * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
+ * VM_HUGEPAGE).
+ */
+# define PR_THP_DISABLE_EXCEPT_ADVISED	(1 << 1)
 #define PR_GET_THP_DISABLE	42
 
 /*
diff --git a/kernel/sys.c b/kernel/sys.c
index 605f7fe9a1432..a46d9b75880b8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2452,6 +2452,51 @@ static int prctl_get_auxv(void __user *addr, unsigned long len)
 	return sizeof(mm->saved_auxv);
 }
 
+static int prctl_get_thp_disable(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5)
+{
+	struct mm_struct *mm = current->mm;
+
+	if (arg2 || arg3 || arg4 || arg5)
+		return -EINVAL;
+
+	/* If disabled, we return "1 | flags", otherwise 0. */
+	if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
+		return 1;
+	else if (mm_flags_test(MMF_DISABLE_THP_EXCEPT_ADVISED, mm))
+		return 1 | PR_THP_DISABLE_EXCEPT_ADVISED;
+	return 0;
+}
+
+static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
+				 unsigned long arg4, unsigned long arg5)
+{
+	struct mm_struct *mm = current->mm;
+
+	if (arg4 || arg5)
+		return -EINVAL;
+
+	/* Flags are only allowed when disabling. */
+	if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
+		return -EINVAL;
+	if (mmap_write_lock_killable(current->mm))
+		return -EINTR;
+	if (thp_disable) {
+		if (flags & PR_THP_DISABLE_EXCEPT_ADVISED) {
+			mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
+			mm_flags_set(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
+		} else {
+			mm_flags_set(MMF_DISABLE_THP_COMPLETELY, mm);
+			mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
+		}
+	} else {
+		mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
+		mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
+	}
+	mmap_write_unlock(current->mm);
+	return 0;
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2625,20 +2670,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			return -EINVAL;
 		return task_no_new_privs(current) ? 1 : 0;
 	case PR_GET_THP_DISABLE:
-		if (arg2 || arg3 || arg4 || arg5)
-			return -EINVAL;
-		error = !!mm_flags_test(MMF_DISABLE_THP, me->mm);
+		error = prctl_get_thp_disable(arg2, arg3, arg4, arg5);
 		break;
 	case PR_SET_THP_DISABLE:
-		if (arg3 || arg4 || arg5)
-			return -EINVAL;
-		if (mmap_write_lock_killable(me->mm))
-			return -EINTR;
-		if (arg2)
-			mm_flags_set(MMF_DISABLE_THP, me->mm);
-		else
-			mm_flags_clear(MMF_DISABLE_THP, me->mm);
-		mmap_write_unlock(me->mm);
+		error = prctl_set_thp_disable(arg2, arg3, arg4, arg5);
 		break;
 	case PR_MPX_ENABLE_MANAGEMENT:
 	case PR_MPX_DISABLE_MANAGEMENT:
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 550eb00116c51..1a416b8659972 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -410,7 +410,7 @@ static inline int hpage_collapse_test_exit(struct mm_struct *mm)
 static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
 {
 	return hpage_collapse_test_exit(mm) ||
-		mm_flags_test(MMF_DISABLE_THP, mm);
+		mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
 }
 
 static bool hugepage_pmd_enabled(void)
-- 
2.47.3



* [PATCH v4 2/7] mm/huge_memory: convert "tva_flags" to "enum tva_type"
  2025-08-13 13:55 [PATCH v4 0/7] prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised Usama Arif
  2025-08-13 13:55 ` [PATCH v4 1/7] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE Usama Arif
@ 2025-08-13 13:55 ` Usama Arif
  2025-08-14  3:07   ` Yafang Shao
  2025-08-14 14:59   ` Zi Yan
  2025-08-13 13:55 ` [PATCH v4 3/7] mm/huge_memory: respect MADV_COLLAPSE with PR_THP_DISABLE_EXCEPT_ADVISED Usama Arif
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 34+ messages in thread
From: Usama Arif @ 2025-08-13 13:55 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
	shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
	npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts, vbabka,
	jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team,
	Usama Arif

From: David Hildenbrand <david@redhat.com>

When determining which THP orders are eligible for a VMA mapping,
we have previously specified tva_flags, however it turns out it is
really not necessary to treat these as flags.

Rather, we distinguish between distinct modes.

The only case where we previously combined flags was with
TVA_ENFORCE_SYSFS, but we can avoid this by observing that this
is the default, except for MADV_COLLAPSE or edge cases in
collapse_pte_mapped_thp() and hugepage_vma_revalidate(), and
adding a mode specifically for this case - TVA_FORCED_COLLAPSE.

We have:
* smaps handling for showing "THPeligible"
* Pagefault handling
* khugepaged handling
* Forced collapse handling: primarily MADV_COLLAPSE, but also for
  an edge case in collapse_pte_mapped_thp()

Disregarding the edge cases, we want to ignore sysfs settings only
when we are forcing a collapse through MADV_COLLAPSE, otherwise we
want to enforce them, hence this patch does the following flag to enum
conversions:

* TVA_SMAPS | TVA_ENFORCE_SYSFS -> TVA_SMAPS
* TVA_IN_PF | TVA_ENFORCE_SYSFS -> TVA_PAGEFAULT
* TVA_ENFORCE_SYSFS             -> TVA_KHUGEPAGED
* 0                             -> TVA_FORCED_COLLAPSE

With this change, we immediately know if we are in the forced collapse
case, which will be valuable next.
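
The conversion table above can be restated as a small stand-alone
sketch. It is illustrative only: the OLD_* names and both helper
functions are hypothetical and do not exist in the kernel; the flag
values and the enum are copied from the defines this patch removes and
adds.

```c
#include <stdbool.h>

/* Old flag values, as removed by this patch (renamed for the sketch). */
#define OLD_TVA_SMAPS		(1 << 0)
#define OLD_TVA_IN_PF		(1 << 1)
#define OLD_TVA_ENFORCE_SYSFS	(1 << 2)

/* New enum, as added by this patch. */
enum tva_type {
	TVA_SMAPS,
	TVA_PAGEFAULT,
	TVA_KHUGEPAGED,
	TVA_FORCED_COLLAPSE,
};

/*
 * Hypothetical helper: translate an old flag combination into the new
 * type. Returns false for combinations that callers never used.
 */
static bool tva_type_from_old_flags(unsigned long flags, enum tva_type *type)
{
	switch (flags) {
	case OLD_TVA_SMAPS | OLD_TVA_ENFORCE_SYSFS:
		*type = TVA_SMAPS;
		return true;
	case OLD_TVA_IN_PF | OLD_TVA_ENFORCE_SYSFS:
		*type = TVA_PAGEFAULT;
		return true;
	case OLD_TVA_ENFORCE_SYSFS:
		*type = TVA_KHUGEPAGED;
		return true;
	case 0:
		*type = TVA_FORCED_COLLAPSE;
		return true;
	default:
		return false;
	}
}

/* Mirrors "enforce_sysfs = type != TVA_FORCED_COLLAPSE" from the patch. */
static bool tva_enforces_sysfs(enum tva_type type)
{
	return type != TVA_FORCED_COLLAPSE;
}
```

The point of the enum is visible in tva_enforces_sysfs(): whether sysfs
settings apply follows from the type alone, so no separate flag has to be
carried alongside it.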

Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 fs/proc/task_mmu.c      |  4 ++--
 include/linux/huge_mm.h | 30 ++++++++++++++++++------------
 mm/huge_memory.c        |  8 ++++----
 mm/khugepaged.c         | 17 ++++++++---------
 mm/memory.c             | 14 ++++++--------
 5 files changed, 38 insertions(+), 35 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e8e7bef345313..ced01cf3c5ab3 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1369,8 +1369,8 @@ static int show_smap(struct seq_file *m, void *v)
 	__show_smap(m, &mss, false);
 
 	seq_printf(m, "THPeligible:    %8u\n",
-		   !!thp_vma_allowable_orders(vma, vma->vm_flags,
-			   TVA_SMAPS | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL));
+		   !!thp_vma_allowable_orders(vma, vma->vm_flags, TVA_SMAPS,
+					      THP_ORDERS_ALL));
 
 	if (arch_pkeys_enabled())
 		seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 22b8b067b295e..92ea0b9771fae 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -94,12 +94,15 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 #define THP_ORDERS_ALL	\
 	(THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_SPECIAL | THP_ORDERS_ALL_FILE_DEFAULT)
 
-#define TVA_SMAPS		(1 << 0)	/* Will be used for procfs */
-#define TVA_IN_PF		(1 << 1)	/* Page fault handler */
-#define TVA_ENFORCE_SYSFS	(1 << 2)	/* Obey sysfs configuration */
+enum tva_type {
+	TVA_SMAPS,		/* Exposing "THPeligible:" in smaps. */
+	TVA_PAGEFAULT,		/* Serving a page fault. */
+	TVA_KHUGEPAGED,		/* Khugepaged collapse. */
+	TVA_FORCED_COLLAPSE,	/* Forced collapse (e.g. MADV_COLLAPSE). */
+};
 
-#define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
-	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
+#define thp_vma_allowable_order(vma, vm_flags, type, order) \
+	(!!thp_vma_allowable_orders(vma, vm_flags, type, BIT(order)))
 
 #define split_folio(f) split_folio_to_list(f, NULL)
 
@@ -264,14 +267,14 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
 
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 vm_flags_t vm_flags,
-					 unsigned long tva_flags,
+					 enum tva_type type,
 					 unsigned long orders);
 
 /**
  * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
  * @vma:  the vm area to check
  * @vm_flags: use these vm_flags instead of vma->vm_flags
- * @tva_flags: Which TVA flags to honour
+ * @type: TVA type
  * @orders: bitfield of all orders to consider
  *
  * Calculates the intersection of the requested hugepage orders and the allowed
@@ -285,11 +288,14 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 static inline
 unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 				       vm_flags_t vm_flags,
-				       unsigned long tva_flags,
+				       enum tva_type type,
 				       unsigned long orders)
 {
-	/* Optimization to check if required orders are enabled early. */
-	if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
+	/*
+	 * Optimization to check if required orders are enabled early. Only
+	 * forced collapse ignores sysfs configs.
+	 */
+	if (type != TVA_FORCED_COLLAPSE && vma_is_anonymous(vma)) {
 		unsigned long mask = READ_ONCE(huge_anon_orders_always);
 
 		if (vm_flags & VM_HUGEPAGE)
@@ -303,7 +309,7 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 			return 0;
 	}
 
-	return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
+	return __thp_vma_allowable_orders(vma, vm_flags, type, orders);
 }
 
 struct thpsize {
@@ -547,7 +553,7 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
 
 static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 					vm_flags_t vm_flags,
-					unsigned long tva_flags,
+					enum tva_type type,
 					unsigned long orders)
 {
 	return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6df1ed0cef5cf..9c716be949cbf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -99,12 +99,12 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
 
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 vm_flags_t vm_flags,
-					 unsigned long tva_flags,
+					 enum tva_type type,
 					 unsigned long orders)
 {
-	bool smaps = tva_flags & TVA_SMAPS;
-	bool in_pf = tva_flags & TVA_IN_PF;
-	bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS;
+	const bool smaps = type == TVA_SMAPS;
+	const bool in_pf = type == TVA_PAGEFAULT;
+	const bool enforce_sysfs = type != TVA_FORCED_COLLAPSE;
 	unsigned long supported_orders;
 
 	/* Check the intersection of requested and supported orders. */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1a416b8659972..d3d4f116e14b6 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -474,8 +474,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
 {
 	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
 	    hugepage_pmd_enabled()) {
-		if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
-					    PMD_ORDER))
+		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
 			__khugepaged_enter(vma->vm_mm);
 	}
 }
@@ -921,7 +920,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 				   struct collapse_control *cc)
 {
 	struct vm_area_struct *vma;
-	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
+	enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED :
+				 TVA_FORCED_COLLAPSE;
 
 	if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
 		return SCAN_ANY_PROCESS;
@@ -932,7 +932,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 
 	if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
 		return SCAN_ADDRESS_RANGE;
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
 		return SCAN_VMA_CHECK;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
@@ -1533,9 +1533,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	 * in the page cache with a single hugepage. If a mm were to fault-in
 	 * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
 	 * and map it by a PMD, regardless of sysfs THP settings. As such, let's
-	 * analogously elide sysfs THP settings here.
+	 * analogously elide sysfs THP settings here and force collapse.
 	 */
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
 		return SCAN_VMA_CHECK;
 
 	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
@@ -2432,8 +2432,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			progress++;
 			break;
 		}
-		if (!thp_vma_allowable_order(vma, vma->vm_flags,
-					TVA_ENFORCE_SYSFS, PMD_ORDER)) {
+		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
 skip:
 			progress++;
 			continue;
@@ -2767,7 +2766,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	BUG_ON(vma->vm_start > start);
 	BUG_ON(vma->vm_end < end);
 
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
 		return -EINVAL;
 
 	cc = kmalloc(sizeof(*cc), GFP_KERNEL);
diff --git a/mm/memory.c b/mm/memory.c
index 002c28795d8b7..7b1e8f137fa3f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4515,8 +4515,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
 	 * and suitable for swapping THP.
 	 */
-	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
-			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
+					  BIT(PMD_ORDER) - 1);
 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
 	orders = thp_swap_suitable_orders(swp_offset(entry),
 					  vmf->address, orders);
@@ -5063,8 +5063,8 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 	 * for this vma. Then filter out the orders that can't be allocated over
 	 * the faulting address and still be fully contained in the vma.
 	 */
-	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
-			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
+					  BIT(PMD_ORDER) - 1);
 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
 
 	if (!orders)
@@ -6254,8 +6254,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 retry_pud:
 	if (pud_none(*vmf.pud) &&
-	    thp_vma_allowable_order(vma, vm_flags,
-				TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) {
+	    thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PUD_ORDER)) {
 		ret = create_huge_pud(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
@@ -6289,8 +6288,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		goto retry_pud;
 
 	if (pmd_none(*vmf.pmd) &&
-	    thp_vma_allowable_order(vma, vm_flags,
-				TVA_IN_PF | TVA_ENFORCE_SYSFS, PMD_ORDER)) {
+	    thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) {
 		ret = create_huge_pmd(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
-- 
2.47.3



* [PATCH v4 3/7] mm/huge_memory: respect MADV_COLLAPSE with PR_THP_DISABLE_EXCEPT_ADVISED
  2025-08-13 13:55 [PATCH v4 0/7] prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised Usama Arif
  2025-08-13 13:55 ` [PATCH v4 1/7] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE Usama Arif
  2025-08-13 13:55 ` [PATCH v4 2/7] mm/huge_memory: convert "tva_flags" to "enum tva_type" Usama Arif
@ 2025-08-13 13:55 ` Usama Arif
  2025-08-14 15:14   ` Zi Yan
  2025-08-13 13:55 ` [PATCH v4 4/7] docs: transhuge: document process level THP controls Usama Arif
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Usama Arif @ 2025-08-13 13:55 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
	shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
	npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts, vbabka,
	jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team,
	Usama Arif

From: David Hildenbrand <david@redhat.com>

Let's allow MADV_COLLAPSE to succeed on areas that have neither
VM_HUGEPAGE nor VM_NOHUGEPAGE set when THPs are disabled unless
explicitly advised (PR_THP_DISABLE_EXCEPT_ADVISED).

MADV_COLLAPSE is clear advice that we want to collapse.

Note that we still respect the VM_NOHUGEPAGE flag, just like
MADV_COLLAPSE always does. Consequently, with
PR_THP_DISABLE_EXCEPT_ADVISED, MADV_COLLAPSE is now only refused on
areas with VM_NOHUGEPAGE set, including for shmem.
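
The resulting decision order can be sketched as a small userspace model.
This is an illustration only, not kernel code: the MODEL_* names and
model_vma_thp_disabled() are hypothetical, and the "disabled completely"
check is assumed from patch 1 of this series, since that part of the
function is not visible in the hunks below.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's per-VMA flags. */
#define MODEL_VM_HUGEPAGE	(1u << 0)
#define MODEL_VM_NOHUGEPAGE	(1u << 1)

/* Hypothetical stand-in for the per-process (mm) THP-disable state. */
enum model_mm_mode {
	MODEL_THP_NOT_DISABLED,
	MODEL_THP_DISABLED_COMPLETELY,		/* assumed from patch 1 */
	MODEL_THP_DISABLED_EXCEPT_ADVISED,
};

/* Returns true if THPs must not be used for the modelled area. */
static bool model_vma_thp_disabled(unsigned int vm_flags,
				   enum model_mm_mode mode,
				   bool forced_collapse)
{
	/* VM_NOHUGEPAGE always wins, even over a forced collapse. */
	if (vm_flags & MODEL_VM_NOHUGEPAGE)
		return true;
	/* A complete per-process disable also wins over any advice. */
	if (mode == MODEL_THP_DISABLED_COMPLETELY)
		return true;
	/* Explicit advice: VM_HUGEPAGE on the area ... */
	if (vm_flags & MODEL_VM_HUGEPAGE)
		return false;
	/* ... or a forced collapse such as MADV_COLLAPSE. */
	if (forced_collapse)
		return false;
	/* No advice: only the "except advised" mode disables THPs. */
	return mode == MODEL_THP_DISABLED_EXCEPT_ADVISED;
}
```

In this model, an unadvised area in "except advised" mode is disabled
for page faults (forced_collapse == false) but not for MADV_COLLAPSE
(forced_collapse == true).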

Co-developed-by: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/huge_mm.h    | 8 +++++++-
 include/uapi/linux/prctl.h | 2 +-
 mm/huge_memory.c           | 5 +++--
 mm/memory.c                | 6 ++++--
 mm/shmem.c                 | 2 +-
 5 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 92ea0b9771fae..1ac0d06fb3c1d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -329,7 +329,7 @@ struct thpsize {
  * through madvise or prctl.
  */
 static inline bool vma_thp_disabled(struct vm_area_struct *vma,
-		vm_flags_t vm_flags)
+		vm_flags_t vm_flags, bool forced_collapse)
 {
 	/* Are THPs disabled for this VMA? */
 	if (vm_flags & VM_NOHUGEPAGE)
@@ -343,6 +343,12 @@ static inline bool vma_thp_disabled(struct vm_area_struct *vma,
 	 */
 	if (vm_flags & VM_HUGEPAGE)
 		return false;
+	/*
+	 * Forcing a collapse (e.g., madv_collapse), is a clear advice to
+	 * use THPs.
+	 */
+	if (forced_collapse)
+		return false;
 	return mm_flags_test(MMF_DISABLE_THP_EXCEPT_ADVISED, vma->vm_mm);
 }
 
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 150b6deebfb1e..51c4e8c82b1e9 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -185,7 +185,7 @@ struct prctl_mm_map {
 #define PR_SET_THP_DISABLE	41
 /*
  * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
- * VM_HUGEPAGE).
+ * VM_HUGEPAGE, MADV_COLLAPSE).
  */
 # define PR_THP_DISABLE_EXCEPT_ADVISED	(1 << 1)
 #define PR_GET_THP_DISABLE	42
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9c716be949cbf..1eca2d543449c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -104,7 +104,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 {
 	const bool smaps = type == TVA_SMAPS;
 	const bool in_pf = type == TVA_PAGEFAULT;
-	const bool enforce_sysfs = type != TVA_FORCED_COLLAPSE;
+	const bool forced_collapse = type == TVA_FORCED_COLLAPSE;
+	const bool enforce_sysfs = !forced_collapse;
 	unsigned long supported_orders;
 
 	/* Check the intersection of requested and supported orders. */
@@ -122,7 +123,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 	if (!vma->vm_mm)		/* vdso */
 		return 0;
 
-	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags))
+	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags, forced_collapse))
 		return 0;
 
 	/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
diff --git a/mm/memory.c b/mm/memory.c
index 7b1e8f137fa3f..e4f533655305a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5332,9 +5332,11 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
 	 * It is too late to allocate a small folio, we already have a large
 	 * folio in the pagecache: especially s390 KVM cannot tolerate any
 	 * PMD mappings, but PTE-mapped THP are fine. So let's simply refuse any
-	 * PMD mappings if THPs are disabled.
+	 * PMD mappings if THPs are disabled. As we already have a THP ...
+	 * behave as if we are forcing a collapse.
 	 */
-	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags))
+	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags,
+						     /* forced_collapse=*/ true))
 		return ret;
 
 	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
diff --git a/mm/shmem.c b/mm/shmem.c
index e2c76a30802b6..d945de3a7f0e7 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1817,7 +1817,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
 	vm_flags_t vm_flags = vma ? vma->vm_flags : 0;
 	unsigned int global_orders;
 
-	if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags)))
+	if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags, shmem_huge_force)))
 		return 0;
 
 	global_orders = shmem_huge_global_enabled(inode, index, write_end,
-- 
2.47.3


* [PATCH v4 4/7] docs: transhuge: document process level THP controls
  2025-08-13 13:55 [PATCH v4 0/7] prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised Usama Arif
                   ` (2 preceding siblings ...)
  2025-08-13 13:55 ` [PATCH v4 3/7] mm/huge_memory: respect MADV_COLLAPSE with PR_THP_DISABLE_EXCEPT_ADVISED Usama Arif
@ 2025-08-13 13:55 ` Usama Arif
  2025-08-13 14:30   ` Lorenzo Stoakes
  2025-08-14 15:47   ` Zi Yan
  2025-08-13 13:55 ` [PATCH v4 5/7] selftest/mm: Extract sz2ord function into vm_util.h Usama Arif
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 34+ messages in thread
From: Usama Arif @ 2025-08-13 13:55 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
	shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
	npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts, vbabka,
	jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team,
	Usama Arif

This includes the PR_SET_THP_DISABLE/PR_GET_THP_DISABLE pair of
prctl calls as well as the newly introduced PR_THP_DISABLE_EXCEPT_ADVISED
flag for the PR_SET_THP_DISABLE prctl call.
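
As a rough illustration of the documented return values, the sketch
below maps a PR_GET_THP_DISABLE result to a description. The helper
thp_disable_describe() is hypothetical and not part of this series; the
fallback define assumes the value from include/uapi/linux/prctl.h in
patch 1.

```c
#include <stddef.h>

/* Assumed to match include/uapi/linux/prctl.h from this series. */
#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
#endif

/* Hypothetical helper: describe a PR_GET_THP_DISABLE return value. */
static const char *thp_disable_describe(int value)
{
	switch (value) {
	case 0:
		return "not disabled";
	case 1:
		return "disabled completely";
	case 1 | PR_THP_DISABLE_EXCEPT_ADVISED:
		return "disabled except when advised";
	default:
		/* Negative (error) or a value this sketch does not know. */
		return NULL;
	}
}
```

A caller would pass the result of prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0)
to the helper. Bit 1 alone (value 2) is not a documented state, so the
sketch treats it as unknown.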

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 37 ++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 370fba1134606..fa8242766e430 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -225,6 +225,43 @@ to "always" or "madvise"), and it'll be automatically shutdown when
 PMD-sized THP is disabled (when both the per-size anon control and the
 top-level control are "never")
 
+process THP controls
+--------------------
+
+A process can control its own THP behaviour using the ``PR_SET_THP_DISABLE``
+and ``PR_GET_THP_DISABLE`` pair of prctl(2) calls. The THP behaviour set using
+``PR_SET_THP_DISABLE`` is inherited across fork(2) and execve(2). These calls
+support the following arguments::
+
+	prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0):
+		This will disable THPs completely for the process, irrespective
+		of global THP controls or MADV_COLLAPSE.
+
+	prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, 0, 0):
+		This will disable THPs for the process except when the usage of THPs is
+		advised. Consequently, THPs will only be used when:
+		- Global THP controls are set to "always" or "madvise" and
+		  the area either has VM_HUGEPAGE set (e.g., due to MADV_HUGEPAGE) or
+		  MADV_COLLAPSE is used.
+		- Global THP controls are set to "never" and MADV_COLLAPSE is used. This
+		  is the same behavior as if THPs would not be disabled on a process
+		  level.
+		Note that MADV_COLLAPSE is currently always rejected if VM_NOHUGEPAGE is
+		set on an area.
+
+	prctl(PR_SET_THP_DISABLE, 0, 0, 0, 0):
+		This will re-enable THPs for the process, as if they would never have
+		been disabled. Whether THPs will actually be used depends on global THP
+		controls.
+
+	prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0):
+		This returns a value whose bits indicate how THP-disable is configured:
+		Bits
+		 1 0  Value  Description
+		|0|0|   0    No THP-disable behaviour specified.
+		|0|1|   1    THP is entirely disabled for this process.
+		|1|1|   3    THP-except-advised mode is set for this process.
+
 Khugepaged controls
 -------------------
 
-- 
2.47.3


* [PATCH v4 5/7] selftest/mm: Extract sz2ord function into vm_util.h
  2025-08-13 13:55 [PATCH v4 0/7] prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised Usama Arif
                   ` (3 preceding siblings ...)
  2025-08-13 13:55 ` [PATCH v4 4/7] docs: transhuge: document process level THP controls Usama Arif
@ 2025-08-13 13:55 ` Usama Arif
  2025-08-13 14:31   ` Lorenzo Stoakes
  2025-08-14 15:52   ` Zi Yan
  2025-08-13 13:55 ` [PATCH v4 6/7] selftests: prctl: introduce tests for disabling THPs completely Usama Arif
  2025-08-13 13:55 ` [PATCH v4 7/7] selftests: prctl: introduce tests for disabling THPs except for madvise Usama Arif
  6 siblings, 2 replies; 34+ messages in thread
From: Usama Arif @ 2025-08-13 13:55 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
	shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
	npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts, vbabka,
	jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team,
	Usama Arif

The function already has 2 uses and will have a 3rd one
in prctl selftests. The pagesize argument is added into
the function, as it's not a global variable anymore.
No functional change intended with this patch.
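
For reference, a worked example of what the helper computes. The body
is the same as the one added to vm_util.h below; the sizes used in the
note that follows (4 KiB base pages, 2 MiB PMD) are assumptions typical
of x86-64, not something this patch depends on.

```c
#include <stddef.h>

/*
 * Convert a folio size to its page order: the number of trailing zero
 * bits in size / pagesize. Both arguments are expected to be powers of
 * two with size >= pagesize; __builtin_ctzll(0) is undefined.
 */
static inline int sz2ord(size_t size, size_t pagesize)
{
	return __builtin_ctzll(size / pagesize);
}
```

With those assumed sizes, sz2ord(2097152, 4096) is 9, because
2 MiB / 4 KiB = 512 = 2^9.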

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 tools/testing/selftests/mm/cow.c            | 12 ++++--------
 tools/testing/selftests/mm/uffd-wp-mremap.c |  9 ++-------
 tools/testing/selftests/mm/vm_util.h        |  5 +++++
 3 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index 90ee5779662f3..a568fe629b094 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -41,10 +41,6 @@ static size_t hugetlbsizes[10];
 static int gup_fd;
 static bool has_huge_zeropage;
 
-static int sz2ord(size_t size)
-{
-	return __builtin_ctzll(size / pagesize);
-}
 
 static int detect_thp_sizes(size_t sizes[], int max)
 {
@@ -57,7 +53,7 @@ static int detect_thp_sizes(size_t sizes[], int max)
 	if (!pmdsize)
 		return 0;
 
-	orders = 1UL << sz2ord(pmdsize);
+	orders = 1UL << sz2ord(pmdsize, pagesize);
 	orders |= thp_supported_orders();
 
 	for (i = 0; orders && count < max; i++) {
@@ -1216,8 +1212,8 @@ static void run_anon_test_case(struct test_case const *test_case)
 		size_t size = thpsizes[i];
 		struct thp_settings settings = *thp_current_settings();
 
-		settings.hugepages[sz2ord(pmdsize)].enabled = THP_NEVER;
-		settings.hugepages[sz2ord(size)].enabled = THP_ALWAYS;
+		settings.hugepages[sz2ord(pmdsize, pagesize)].enabled = THP_NEVER;
+		settings.hugepages[sz2ord(size, pagesize)].enabled = THP_ALWAYS;
 		thp_push_settings(&settings);
 
 		if (size == pmdsize) {
@@ -1868,7 +1864,7 @@ int main(void)
 	if (pmdsize) {
 		/* Only if THP is supported. */
 		thp_read_settings(&default_settings);
-		default_settings.hugepages[sz2ord(pmdsize)].enabled = THP_INHERIT;
+		default_settings.hugepages[sz2ord(pmdsize, pagesize)].enabled = THP_INHERIT;
 		thp_save_settings();
 		thp_push_settings(&default_settings);
 
diff --git a/tools/testing/selftests/mm/uffd-wp-mremap.c b/tools/testing/selftests/mm/uffd-wp-mremap.c
index 13ceb56289701..b2b6116e65808 100644
--- a/tools/testing/selftests/mm/uffd-wp-mremap.c
+++ b/tools/testing/selftests/mm/uffd-wp-mremap.c
@@ -19,11 +19,6 @@ static size_t thpsizes[20];
 static int nr_hugetlbsizes;
 static size_t hugetlbsizes[10];
 
-static int sz2ord(size_t size)
-{
-	return __builtin_ctzll(size / pagesize);
-}
-
 static int detect_thp_sizes(size_t sizes[], int max)
 {
 	int count = 0;
@@ -87,9 +82,9 @@ static void *alloc_one_folio(size_t size, bool private, bool hugetlb)
 		struct thp_settings settings = *thp_current_settings();
 
 		if (private)
-			settings.hugepages[sz2ord(size)].enabled = THP_ALWAYS;
+			settings.hugepages[sz2ord(size, pagesize)].enabled = THP_ALWAYS;
 		else
-			settings.shmem_hugepages[sz2ord(size)].enabled = SHMEM_ALWAYS;
+			settings.shmem_hugepages[sz2ord(size, pagesize)].enabled = SHMEM_ALWAYS;
 
 		thp_push_settings(&settings);
 
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index 148b792cff0fc..e5cb72bf3a2ab 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -135,6 +135,11 @@ static inline void log_test_result(int result)
 	ksft_test_result_report(result, "%s\n", test_name);
 }
 
+static inline int sz2ord(size_t size, size_t pagesize)
+{
+	return __builtin_ctzll(size / pagesize);
+}
+
 void *sys_mremap(void *old_address, unsigned long old_size,
 		 unsigned long new_size, int flags, void *new_address);
 
-- 
2.47.3


* [PATCH v4 6/7] selftests: prctl: introduce tests for disabling THPs completely
  2025-08-13 13:55 [PATCH v4 0/7] prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised Usama Arif
                   ` (4 preceding siblings ...)
  2025-08-13 13:55 ` [PATCH v4 5/7] selftest/mm: Extract sz2ord function into vm_util.h Usama Arif
@ 2025-08-13 13:55 ` Usama Arif
  2025-08-13 14:54   ` Lorenzo Stoakes
  2025-08-13 13:55 ` [PATCH v4 7/7] selftests: prctl: introduce tests for disabling THPs except for madvise Usama Arif
  6 siblings, 1 reply; 34+ messages in thread
From: Usama Arif @ 2025-08-13 13:55 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
	shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
	npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts, vbabka,
	jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team,
	Usama Arif

The test will set the global system THP setting to never, madvise
or always depending on the fixture variant and the 2M setting to
inherit before it starts (and reset to original at teardown).
The fixture setup will also test if PR_SET_THP_DISABLE prctl call can
be made to disable all THPs and skip if it fails.

This tests if the process can:
- successfully get the policy to disable THPs completely.
- never get a hugepage when the THPs are completely disabled
  with the prctl, including with MADV_HUGEPAGE and MADV_COLLAPSE.
- successfully reset the policy of the process.
- after reset, only get hugepages with:
  - MADV_COLLAPSE when policy is set to never.
  - MADV_HUGEPAGE and MADV_COLLAPSE when policy is set to madvise.
  - always when policy is set to "always".
- repeat the above tests in a forked process to make sure
  the policy is carried across forks.
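
The expectations listed above can be summarised as a small truth
function. This is only a sketch mirroring the assertions in the test
below; the enum and function names are hypothetical and do not appear
in the selftest.

```c
#include <stdbool.h>

/* Hypothetical names for the global policy and the way memory is advised. */
enum global_policy { GLOBAL_NEVER, GLOBAL_MADVISE, GLOBAL_ALWAYS };
enum access_kind { ACCESS_PLAIN, ACCESS_MADV_HUGEPAGE, ACCESS_MADV_COLLAPSE };

/* While THPs are disabled completely, no combination yields a hugepage. */
static bool expect_thp_disabled_completely(enum global_policy policy,
					    enum access_kind kind)
{
	(void)policy;
	(void)kind;
	return false;
}

/* After the prctl is cleared, only the global policy decides. */
static bool expect_thp_after_reset(enum global_policy policy,
				   enum access_kind kind)
{
	switch (kind) {
	case ACCESS_MADV_COLLAPSE:
		return true;			/* works even with "never" */
	case ACCESS_MADV_HUGEPAGE:
		return policy != GLOBAL_NEVER;	/* "madvise" or "always" */
	default:
		return policy == GLOBAL_ALWAYS;	/* plain page fault */
	}
}
```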

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   1 +
 .../testing/selftests/mm/prctl_thp_disable.c  | 168 ++++++++++++++++++
 tools/testing/selftests/mm/thp_settings.c     |   9 +-
 tools/testing/selftests/mm/thp_settings.h     |   1 +
 5 files changed, 179 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/mm/prctl_thp_disable.c

diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index e7b23a8a05fe2..eb023ea857b31 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -58,3 +58,4 @@ pkey_sighandler_tests_32
 pkey_sighandler_tests_64
 guard-regions
 merge
+prctl_thp_disable
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index d75f1effcb791..bd5d17beafa64 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -87,6 +87,7 @@ TEST_GEN_FILES += on-fault-limit
 TEST_GEN_FILES += pagemap_ioctl
 TEST_GEN_FILES += pfnmap
 TEST_GEN_FILES += process_madv
+TEST_GEN_FILES += prctl_thp_disable
 TEST_GEN_FILES += thuge-gen
 TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += uffd-stress
diff --git a/tools/testing/selftests/mm/prctl_thp_disable.c b/tools/testing/selftests/mm/prctl_thp_disable.c
new file mode 100644
index 0000000000000..8845e9f414560
--- /dev/null
+++ b/tools/testing/selftests/mm/prctl_thp_disable.c
@@ -0,0 +1,168 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Basic tests for PR_GET/SET_THP_DISABLE prctl calls
+ *
+ * Author(s): Usama Arif <usamaarif642@gmail.com>
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+
+#include "../kselftest_harness.h"
+#include "thp_settings.h"
+#include "vm_util.h"
+
+enum thp_collapse_type {
+	THP_COLLAPSE_NONE,
+	THP_COLLAPSE_MADV_HUGEPAGE,	/* MADV_HUGEPAGE before access */
+	THP_COLLAPSE_MADV_COLLAPSE,	/* MADV_COLLAPSE after access */
+};
+
+/*
+ * Function to mmap a buffer, fault it in, madvise it appropriately (before
+ * page fault for MADV_HUGEPAGE, and after for MADV_COLLAPSE), and check if the
+ * mmap region is huge.
+ * Returns:
+ * 0 if test doesn't give hugepage
+ * 1 if test gives a hugepage
+ * -errno if mmap fails
+ */
+static int test_mmap_thp(enum thp_collapse_type madvise_buf, size_t pmdsize)
+{
+	char *mem, *mmap_mem;
+	size_t mmap_size;
+	int ret;
+
+	/* For alignment purposes, we need twice the THP size. */
+	mmap_size = 2 * pmdsize;
+	mmap_mem = (char *)mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
+				    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (mmap_mem == MAP_FAILED)
+		return -errno;
+
+	/* We need a THP-aligned memory area. */
+	mem = (char *)(((uintptr_t)mmap_mem + pmdsize) & ~(pmdsize - 1));
+
+	if (madvise_buf == THP_COLLAPSE_MADV_HUGEPAGE)
+		madvise(mem, pmdsize, MADV_HUGEPAGE);
+
+	/* Ensure memory is allocated */
+	memset(mem, 1, pmdsize);
+
+	if (madvise_buf == THP_COLLAPSE_MADV_COLLAPSE)
+		madvise(mem, pmdsize, MADV_COLLAPSE);
+
+	/* HACK: make sure we have a separate VMA that we can check reliably. */
+	mprotect(mem, pmdsize, PROT_READ);
+
+	ret = check_huge_anon(mem, 1, pmdsize);
+	munmap(mmap_mem, mmap_size);
+	return ret;
+}
+
+static void prctl_thp_disable_completely_test(struct __test_metadata *const _metadata,
+					      size_t pmdsize,
+					      enum thp_enabled thp_policy)
+{
+	ASSERT_EQ(prctl(PR_GET_THP_DISABLE, NULL, NULL, NULL, NULL), 1);
+
+	/* tests after prctl overrides global policy */
+	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_NONE, pmdsize), 0);
+
+	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_HUGEPAGE, pmdsize), 0);
+
+	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_COLLAPSE, pmdsize), 0);
+
+	/* Reset to global policy */
+	ASSERT_EQ(prctl(PR_SET_THP_DISABLE, 0, NULL, NULL, NULL), 0);
+
+	/* tests after prctl is cleared, and only global policy is effective */
+	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_NONE, pmdsize),
+		  thp_policy == THP_ALWAYS ? 1 : 0);
+
+	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_HUGEPAGE, pmdsize),
+		  thp_policy == THP_NEVER ? 0 : 1);
+
+	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_COLLAPSE, pmdsize), 1);
+}
+
+FIXTURE(prctl_thp_disable_completely)
+{
+	struct thp_settings settings;
+	size_t pmdsize;
+};
+
+FIXTURE_VARIANT(prctl_thp_disable_completely)
+{
+	enum thp_enabled thp_policy;
+};
+
+FIXTURE_VARIANT_ADD(prctl_thp_disable_completely, never)
+{
+	.thp_policy = THP_NEVER,
+};
+
+FIXTURE_VARIANT_ADD(prctl_thp_disable_completely, madvise)
+{
+	.thp_policy = THP_MADVISE,
+};
+
+FIXTURE_VARIANT_ADD(prctl_thp_disable_completely, always)
+{
+	.thp_policy = THP_ALWAYS,
+};
+
+FIXTURE_SETUP(prctl_thp_disable_completely)
+{
+	if (!thp_available())
+		SKIP(return, "Transparent Hugepages not available\n");
+
+	self->pmdsize = read_pmd_pagesize();
+	if (!self->pmdsize)
+		SKIP(return, "Unable to read PMD size\n");
+
+	if (prctl(PR_SET_THP_DISABLE, 1, NULL, NULL, NULL))
+		SKIP(return, "Unable to disable THPs completely for the process\n");
+
+	thp_save_settings();
+	thp_read_settings(&self->settings);
+	self->settings.thp_enabled = variant->thp_policy;
+	self->settings.hugepages[sz2ord(self->pmdsize, getpagesize())].enabled = THP_INHERIT;
+	thp_write_settings(&self->settings);
+}
+
+FIXTURE_TEARDOWN(prctl_thp_disable_completely)
+{
+	thp_restore_settings();
+}
+
+TEST_F(prctl_thp_disable_completely, nofork)
+{
+	prctl_thp_disable_completely_test(_metadata, self->pmdsize, variant->thp_policy);
+}
+
+TEST_F(prctl_thp_disable_completely, fork)
+{
+	int ret = 0;
+	pid_t pid;
+
+	/* Make sure prctl changes are carried across fork */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (!pid)
+		prctl_thp_disable_completely_test(_metadata, self->pmdsize, variant->thp_policy);
+
+	wait(&ret);
+	if (WIFEXITED(ret))
+		ret = WEXITSTATUS(ret);
+	else
+		ret = -EINVAL;
+	ASSERT_EQ(ret, 0);
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/mm/thp_settings.c b/tools/testing/selftests/mm/thp_settings.c
index bad60ac52874a..574bd0f8ae480 100644
--- a/tools/testing/selftests/mm/thp_settings.c
+++ b/tools/testing/selftests/mm/thp_settings.c
@@ -382,10 +382,17 @@ unsigned long thp_shmem_supported_orders(void)
 	return __thp_supported_orders(true);
 }
 
-bool thp_is_enabled(void)
+bool thp_available(void)
 {
 	if (access(THP_SYSFS, F_OK) != 0)
 		return false;
+	return true;
+}
+
+bool thp_is_enabled(void)
+{
+	if (!thp_available())
+		return false;
 
 	int mode = thp_read_string("enabled", thp_enabled_strings);
 
diff --git a/tools/testing/selftests/mm/thp_settings.h b/tools/testing/selftests/mm/thp_settings.h
index 6c07f70beee97..76eeb712e5f10 100644
--- a/tools/testing/selftests/mm/thp_settings.h
+++ b/tools/testing/selftests/mm/thp_settings.h
@@ -84,6 +84,7 @@ void thp_set_read_ahead_path(char *path);
 unsigned long thp_supported_orders(void);
 unsigned long thp_shmem_supported_orders(void);
 
+bool thp_available(void);
 bool thp_is_enabled(void);
 
 #endif /* __THP_SETTINGS_H__ */
-- 
2.47.3


* [PATCH v4 7/7] selftests: prctl: introduce tests for disabling THPs except for madvise
  2025-08-13 13:55 [PATCH v4 0/7] prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised Usama Arif
                   ` (5 preceding siblings ...)
  2025-08-13 13:55 ` [PATCH v4 6/7] selftests: prctl: introduce tests for disabling THPs completely Usama Arif
@ 2025-08-13 13:55 ` Usama Arif
  2025-08-13 15:13   ` Lorenzo Stoakes
  6 siblings, 1 reply; 34+ messages in thread
From: Usama Arif @ 2025-08-13 13:55 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
	shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
	npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts, vbabka,
	jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team,
	Usama Arif

The test will set the global system THP setting to never, madvise
or always depending on the fixture variant and the 2M setting to
inherit before it starts (and reset to original at teardown).
The fixture setup will also test if PR_SET_THP_DISABLE prctl call can
be made with PR_THP_DISABLE_EXCEPT_ADVISED and skip if it fails.

This tests if the process can:
- successfully get the policy to disable THPs except for madvise.
- get hugepages only on MADV_HUGEPAGE and MADV_COLLAPSE if the global policy
  is madvise/always and only with MADV_COLLAPSE if the global policy is
  never.
- successfully reset the policy of the process.
- after reset, only get hugepages with:
  - MADV_COLLAPSE when policy is set to never.
  - MADV_HUGEPAGE and MADV_COLLAPSE when policy is set to madvise.
  - always when policy is set to "always".
- repeat the above tests in a forked process to make sure the policy is
  carried across forks.
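
While PR_THP_DISABLE_EXCEPT_ADVISED is in effect, the expected outcome
per global policy can be sketched as follows. The names are
hypothetical and only mirror the assertions in the test added by this
patch.

```c
#include <stdbool.h>

/* Hypothetical names for the global policy and the hint applied to memory. */
enum sys_policy { SYS_NEVER, SYS_MADVISE, SYS_ALWAYS };
enum thp_hint { HINT_NONE, HINT_MADV_HUGEPAGE, HINT_MADV_COLLAPSE };

/* Expected result while the "except advised" mode is set on the process. */
static bool expect_thp_except_advised(enum sys_policy policy,
				      enum thp_hint hint)
{
	switch (hint) {
	case HINT_MADV_COLLAPSE:
		return true;			/* honoured for any policy */
	case HINT_MADV_HUGEPAGE:
		return policy != SYS_NEVER;	/* needs "madvise" or "always" */
	default:
		return false;			/* unadvised: never, even with "always" */
	}
}
```

The only difference from running without the prctl is the unadvised
case with the global policy set to "always", which no longer yields a
hugepage.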

Test results:
./prctl_thp_disable
TAP version 13
1..12
ok 1 prctl_thp_disable_completely.never.nofork
ok 2 prctl_thp_disable_completely.never.fork
ok 3 prctl_thp_disable_completely.madvise.nofork
ok 4 prctl_thp_disable_completely.madvise.fork
ok 5 prctl_thp_disable_completely.always.nofork
ok 6 prctl_thp_disable_completely.always.fork
ok 7 prctl_thp_disable_except_madvise.never.nofork
ok 8 prctl_thp_disable_except_madvise.never.fork
ok 9 prctl_thp_disable_except_madvise.madvise.nofork
ok 10 prctl_thp_disable_except_madvise.madvise.fork
ok 11 prctl_thp_disable_except_madvise.always.nofork
ok 12 prctl_thp_disable_except_madvise.always.fork

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 .../testing/selftests/mm/prctl_thp_disable.c  | 107 ++++++++++++++++++
 1 file changed, 107 insertions(+)

diff --git a/tools/testing/selftests/mm/prctl_thp_disable.c b/tools/testing/selftests/mm/prctl_thp_disable.c
index 8845e9f414560..9bfed4598a1a6 100644
--- a/tools/testing/selftests/mm/prctl_thp_disable.c
+++ b/tools/testing/selftests/mm/prctl_thp_disable.c
@@ -16,6 +16,10 @@
 #include "thp_settings.h"
 #include "vm_util.h"
 
+#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
+#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
+#endif
+
 enum thp_collapse_type {
 	THP_COLLAPSE_NONE,
 	THP_COLLAPSE_MADV_HUGEPAGE,	/* MADV_HUGEPAGE before access */
@@ -165,4 +169,107 @@ TEST_F(prctl_thp_disable_completely, fork)
 	ASSERT_EQ(ret, 0);
 }
 
+static void prctl_thp_disable_except_madvise_test(struct __test_metadata *const _metadata,
+						  size_t pmdsize,
+						  enum thp_enabled thp_policy)
+{
+	ASSERT_EQ(prctl(PR_GET_THP_DISABLE, NULL, NULL, NULL, NULL), 3);
+
+	/* tests after prctl overrides global policy */
+	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_NONE, pmdsize), 0);
+
+	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_HUGEPAGE, pmdsize),
+		  thp_policy == THP_NEVER ? 0 : 1);
+
+	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_COLLAPSE, pmdsize), 1);
+
+	/* Reset to global policy */
+	ASSERT_EQ(prctl(PR_SET_THP_DISABLE, 0, NULL, NULL, NULL), 0);
+
+	/* tests after prctl is cleared, and only global policy is effective */
+	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_NONE, pmdsize),
+		  thp_policy == THP_ALWAYS ? 1 : 0);
+
+	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_HUGEPAGE, pmdsize),
+		  thp_policy == THP_NEVER ? 0 : 1);
+
+	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_COLLAPSE, pmdsize), 1);
+}
+
+FIXTURE(prctl_thp_disable_except_madvise)
+{
+	struct thp_settings settings;
+	size_t pmdsize;
+};
+
+FIXTURE_VARIANT(prctl_thp_disable_except_madvise)
+{
+	enum thp_enabled thp_policy;
+};
+
+FIXTURE_VARIANT_ADD(prctl_thp_disable_except_madvise, never)
+{
+	.thp_policy = THP_NEVER,
+};
+
+FIXTURE_VARIANT_ADD(prctl_thp_disable_except_madvise, madvise)
+{
+	.thp_policy = THP_MADVISE,
+};
+
+FIXTURE_VARIANT_ADD(prctl_thp_disable_except_madvise, always)
+{
+	.thp_policy = THP_ALWAYS,
+};
+
+FIXTURE_SETUP(prctl_thp_disable_except_madvise)
+{
+	if (!thp_available())
+		SKIP(return, "Transparent Hugepages not available\n");
+
+	self->pmdsize = read_pmd_pagesize();
+	if (!self->pmdsize)
+		SKIP(return, "Unable to read PMD size\n");
+
+	if (prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, NULL, NULL))
+		SKIP(return, "Unable to set PR_THP_DISABLE_EXCEPT_ADVISED\n");
+
+	thp_save_settings();
+	thp_read_settings(&self->settings);
+	self->settings.thp_enabled = variant->thp_policy;
+	self->settings.hugepages[sz2ord(self->pmdsize, getpagesize())].enabled = THP_INHERIT;
+	thp_write_settings(&self->settings);
+}
+
+FIXTURE_TEARDOWN(prctl_thp_disable_except_madvise)
+{
+	thp_restore_settings();
+}
+
+TEST_F(prctl_thp_disable_except_madvise, nofork)
+{
+	prctl_thp_disable_except_madvise_test(_metadata, self->pmdsize, variant->thp_policy);
+}
+
+TEST_F(prctl_thp_disable_except_madvise, fork)
+{
+	int ret = 0;
+	pid_t pid;
+
+	/* Make sure prctl changes are carried across fork */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (!pid)
+		prctl_thp_disable_except_madvise_test(_metadata, self->pmdsize,
+						      variant->thp_policy);
+
+	wait(&ret);
+	if (WIFEXITED(ret))
+		ret = WEXITSTATUS(ret);
+	else
+		ret = -EINVAL;
+	ASSERT_EQ(ret, 0);
+}
+
 TEST_HARNESS_MAIN
-- 
2.47.3


* Re: [PATCH v4 4/7] docs: transhuge: document process level THP controls
  2025-08-13 13:55 ` [PATCH v4 4/7] docs: transhuge: document process level THP controls Usama Arif
@ 2025-08-13 14:30   ` Lorenzo Stoakes
  2025-08-14 15:47   ` Zi Yan
  1 sibling, 0 replies; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-08-13 14:30 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, linux-fsdevel, corbet, rppt,
	surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
	laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
	ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
	linux-doc, kernel-team

On Wed, Aug 13, 2025 at 02:55:39PM +0100, Usama Arif wrote:
> This includes the PR_SET_THP_DISABLE/PR_GET_THP_DISABLE pair of
> prctl calls as well as the newly introduced PR_THP_DISABLE_EXCEPT_ADVISED
> flag for the PR_SET_THP_DISABLE prctl call.
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  Documentation/admin-guide/mm/transhuge.rst | 37 ++++++++++++++++++++++
>  1 file changed, 37 insertions(+)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 370fba1134606..fa8242766e430 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -225,6 +225,43 @@ to "always" or "madvise"), and it'll be automatically shutdown when
>  PMD-sized THP is disabled (when both the per-size anon control and the
>  top-level control are "never")
>
> +process THP controls
> +--------------------
> +
> +A process can control its own THP behaviour using the ``PR_SET_THP_DISABLE``
> +and ``PR_GET_THP_DISABLE`` pair of prctl(2) calls. The THP behaviour set using
> +``PR_SET_THP_DISABLE`` is inherited across fork(2) and execve(2). These calls
> +support the following arguments::

Thanks that's an improvement putting the bit about fork/exec here.

> +
> +	prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0):
> +		This will disable THPs completely for the process, irrespective
> +		of global THP controls or MADV_COLLAPSE.

Thanks for including MADV_COLLAPSE aspect!

> +
> +	prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, 0, 0):
> +		This will disable THPs for the process except when the usage of THPs is
> +		advised. Consequently, THPs will only be used when:
> +		- Global THP controls are set to "always" or "madvise" and
> +		  the area either has VM_HUGEPAGE set (e.g., due to MADV_HUGEPAGE) or
> +		  MADV_COLLAPSE is used.
> +		- Global THP controls are set to "never" and MADV_COLLAPSE is used. This
> +		  is the same behavior as if THPs would not be disabled on a process
> +		  level.
> +		Note that MADV_COLLAPSE is currently always rejected if VM_NOHUGEPAGE is
> +		set on an area.
> +
> +	prctl(PR_SET_THP_DISABLE, 0, 0, 0, 0):
> +		This will re-enabled THPs for the process, as if they would never have

Real super nit, but could be 'as if they were never disabled'.

> +		been disabled. Whether THPs will actually be used depends on global THP
> +		controls.
> +
> +	prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0):
> +		This returns a value whose bits indicate how THP-disable is configured:
> +		Bits
> +		 1 0  Value  Description
> +		|0|0|   0    No THP-disable behaviour specified.
> +		|0|1|   1    THP is entirely disabled for this process.
> +		|1|1|   3    THP-except-advised mode is set for this process.

Thanks this looks great!
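
For illustration only, a minimal sketch of how userspace might interpret the
PR_GET_THP_DISABLE return value according to the table above. The enum and
function names are hypothetical and not part of the patch; values outside the
documented set (including a negative error return) are mapped to "unknown".

```c
/* Hypothetical names for the modes listed in the documented table. */
enum thp_disable_mode {
	THP_MODE_NONE,			/* value 0: no THP-disable behaviour */
	THP_MODE_DISABLED,		/* value 1: THP entirely disabled */
	THP_MODE_EXCEPT_ADVISED,	/* value 3: THP-except-advised mode */
	THP_MODE_UNKNOWN,		/* error or undocumented value */
};

/* Map the raw prctl() return value onto one of the documented modes. */
static enum thp_disable_mode decode_thp_disable(int val)
{
	switch (val) {
	case 0:
		return THP_MODE_NONE;
	case 1:
		return THP_MODE_DISABLED;
	case 3:
		return THP_MODE_EXCEPT_ADVISED;
	default:
		return THP_MODE_UNKNOWN;
	}
}
```

A caller would pass in the result of prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0).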

> +
>  Khugepaged controls
>  -------------------
>
> --
> 2.47.3
>


* Re: [PATCH v4 5/7] selftest/mm: Extract sz2ord function into vm_util.h
  2025-08-13 13:55 ` [PATCH v4 5/7] selftest/mm: Extract sz2ord function into vm_util.h Usama Arif
@ 2025-08-13 14:31   ` Lorenzo Stoakes
  2025-08-14 15:52   ` Zi Yan
  1 sibling, 0 replies; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-08-13 14:31 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, linux-fsdevel, corbet, rppt,
	surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
	laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
	ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
	linux-doc, kernel-team

On Wed, Aug 13, 2025 at 02:55:40PM +0100, Usama Arif wrote:
> The function already has 2 uses and will have a 3rd one
> in prctl selftests. The pagesize argument is added into
> the function, as it's not a global variable anymore.
> No functional change intended with this patch.
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
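
As a worked example of what the helper computes (a sketch; the function body
matches the one added to vm_util.h below, while the sizes in the comments are
just illustrative values): the order is log2 of the number of base pages in
the given size, so it depends on the base page size passed in.

```c
#include <stddef.h>

/*
 * Same body as the helper in the patch: count trailing zero bits of the
 * number of base pages, i.e. log2(size / pagesize) for power-of-two sizes.
 */
static inline int sz2ord(size_t size, size_t pagesize)
{
	return __builtin_ctzll(size / pagesize);
}

/*
 * Illustrative values:
 *   2 MiB  / 4 KiB  = 512 pages -> order 9
 *   64 KiB / 4 KiB  = 16 pages  -> order 4
 *   2 MiB  / 64 KiB = 32 pages  -> order 5
 */
```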

> ---
>  tools/testing/selftests/mm/cow.c            | 12 ++++--------
>  tools/testing/selftests/mm/uffd-wp-mremap.c |  9 ++-------
>  tools/testing/selftests/mm/vm_util.h        |  5 +++++
>  3 files changed, 11 insertions(+), 15 deletions(-)
>
> diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
> index 90ee5779662f3..a568fe629b094 100644
> --- a/tools/testing/selftests/mm/cow.c
> +++ b/tools/testing/selftests/mm/cow.c
> @@ -41,10 +41,6 @@ static size_t hugetlbsizes[10];
>  static int gup_fd;
>  static bool has_huge_zeropage;
>
> -static int sz2ord(size_t size)
> -{
> -	return __builtin_ctzll(size / pagesize);
> -}
>
>  static int detect_thp_sizes(size_t sizes[], int max)
>  {
> @@ -57,7 +53,7 @@ static int detect_thp_sizes(size_t sizes[], int max)
>  	if (!pmdsize)
>  		return 0;
>
> -	orders = 1UL << sz2ord(pmdsize);
> +	orders = 1UL << sz2ord(pmdsize, pagesize);
>  	orders |= thp_supported_orders();
>
>  	for (i = 0; orders && count < max; i++) {
> @@ -1216,8 +1212,8 @@ static void run_anon_test_case(struct test_case const *test_case)
>  		size_t size = thpsizes[i];
>  		struct thp_settings settings = *thp_current_settings();
>
> -		settings.hugepages[sz2ord(pmdsize)].enabled = THP_NEVER;
> -		settings.hugepages[sz2ord(size)].enabled = THP_ALWAYS;
> +		settings.hugepages[sz2ord(pmdsize, pagesize)].enabled = THP_NEVER;
> +		settings.hugepages[sz2ord(size, pagesize)].enabled = THP_ALWAYS;
>  		thp_push_settings(&settings);
>
>  		if (size == pmdsize) {
> @@ -1868,7 +1864,7 @@ int main(void)
>  	if (pmdsize) {
>  		/* Only if THP is supported. */
>  		thp_read_settings(&default_settings);
> -		default_settings.hugepages[sz2ord(pmdsize)].enabled = THP_INHERIT;
> +		default_settings.hugepages[sz2ord(pmdsize, pagesize)].enabled = THP_INHERIT;
>  		thp_save_settings();
>  		thp_push_settings(&default_settings);
>
> diff --git a/tools/testing/selftests/mm/uffd-wp-mremap.c b/tools/testing/selftests/mm/uffd-wp-mremap.c
> index 13ceb56289701..b2b6116e65808 100644
> --- a/tools/testing/selftests/mm/uffd-wp-mremap.c
> +++ b/tools/testing/selftests/mm/uffd-wp-mremap.c
> @@ -19,11 +19,6 @@ static size_t thpsizes[20];
>  static int nr_hugetlbsizes;
>  static size_t hugetlbsizes[10];
>
> -static int sz2ord(size_t size)
> -{
> -	return __builtin_ctzll(size / pagesize);
> -}
> -
>  static int detect_thp_sizes(size_t sizes[], int max)
>  {
>  	int count = 0;
> @@ -87,9 +82,9 @@ static void *alloc_one_folio(size_t size, bool private, bool hugetlb)
>  		struct thp_settings settings = *thp_current_settings();
>
>  		if (private)
> -			settings.hugepages[sz2ord(size)].enabled = THP_ALWAYS;
> +			settings.hugepages[sz2ord(size, pagesize)].enabled = THP_ALWAYS;
>  		else
> -			settings.shmem_hugepages[sz2ord(size)].enabled = SHMEM_ALWAYS;
> +			settings.shmem_hugepages[sz2ord(size, pagesize)].enabled = SHMEM_ALWAYS;
>
>  		thp_push_settings(&settings);
>
> diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
> index 148b792cff0fc..e5cb72bf3a2ab 100644
> --- a/tools/testing/selftests/mm/vm_util.h
> +++ b/tools/testing/selftests/mm/vm_util.h
> @@ -135,6 +135,11 @@ static inline void log_test_result(int result)
>  	ksft_test_result_report(result, "%s\n", test_name);
>  }
>
> +static inline int sz2ord(size_t size, size_t pagesize)
> +{
> +	return __builtin_ctzll(size / pagesize);
> +}
> +
>  void *sys_mremap(void *old_address, unsigned long old_size,
>  		 unsigned long new_size, int flags, void *new_address);
>
> --
> 2.47.3
>


* Re: [PATCH v4 6/7] selftests: prctl: introduce tests for disabling THPs completely
  2025-08-13 13:55 ` [PATCH v4 6/7] selftests: prctl: introduce tests for disabling THPs completely Usama Arif
@ 2025-08-13 14:54   ` Lorenzo Stoakes
  0 siblings, 0 replies; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-08-13 14:54 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, linux-fsdevel, corbet, rppt,
	surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
	laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
	ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
	linux-doc, kernel-team

On Wed, Aug 13, 2025 at 02:55:41PM +0100, Usama Arif wrote:
> The test will set the global system THP setting to never, madvise
> or always depending on the fixture variant and the 2M setting to
> inherit before it starts (and reset to original at teardown).
> The fixture setup will also test if PR_SET_THP_DISABLE prctl call can
> be made to disable all THPs and skip if it fails.
>
> This tests if the process can:
> - successfully get the policy to disable THPs completely.
> - never get a hugepage when the THPs are completely disabled
>   with the prctl, including with MADV_HUGE and MADV_COLLAPSE.
> - successfully reset the policy of the process.
> - after reset, only get hugepages with:
>   - MADV_COLLAPSE when policy is set to never.
>   - MADV_HUGE and MADV_COLLAPSE when policy is set to madvise.
>   - always when policy is set to "always".
> - repeat the above tests in a forked process to make sure
>   the policy is carried across forks.
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> Acked-by: David Hildenbrand <david@redhat.com>

Some nits below but this looks sensible, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  tools/testing/selftests/mm/.gitignore         |   1 +
>  tools/testing/selftests/mm/Makefile           |   1 +
>  .../testing/selftests/mm/prctl_thp_disable.c  | 168 ++++++++++++++++++
>  tools/testing/selftests/mm/thp_settings.c     |   9 +-
>  tools/testing/selftests/mm/thp_settings.h     |   1 +
>  5 files changed, 179 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/mm/prctl_thp_disable.c
>
> diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
> index e7b23a8a05fe2..eb023ea857b31 100644
> --- a/tools/testing/selftests/mm/.gitignore
> +++ b/tools/testing/selftests/mm/.gitignore
> @@ -58,3 +58,4 @@ pkey_sighandler_tests_32
>  pkey_sighandler_tests_64
>  guard-regions
>  merge
> +prctl_thp_disable
> diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> index d75f1effcb791..bd5d17beafa64 100644
> --- a/tools/testing/selftests/mm/Makefile
> +++ b/tools/testing/selftests/mm/Makefile
> @@ -87,6 +87,7 @@ TEST_GEN_FILES += on-fault-limit
>  TEST_GEN_FILES += pagemap_ioctl
>  TEST_GEN_FILES += pfnmap
>  TEST_GEN_FILES += process_madv
> +TEST_GEN_FILES += prctl_thp_disable
>  TEST_GEN_FILES += thuge-gen
>  TEST_GEN_FILES += transhuge-stress
>  TEST_GEN_FILES += uffd-stress
> diff --git a/tools/testing/selftests/mm/prctl_thp_disable.c b/tools/testing/selftests/mm/prctl_thp_disable.c
> new file mode 100644
> index 0000000000000..8845e9f414560
> --- /dev/null
> +++ b/tools/testing/selftests/mm/prctl_thp_disable.c
> @@ -0,0 +1,168 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Basic tests for PR_GET/SET_THP_DISABLE prctl calls
> + *
> + * Author(s): Usama Arif <usamaarif642@gmail.com>
> + */
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <sys/mman.h>
> +#include <sys/prctl.h>
> +#include <sys/wait.h>
> +
> +#include "../kselftest_harness.h"
> +#include "thp_settings.h"
> +#include "vm_util.h"
> +
> +enum thp_collapse_type {
> +	THP_COLLAPSE_NONE,
> +	THP_COLLAPSE_MADV_HUGEPAGE,	/* MADV_HUGEPAGE before access */
> +	THP_COLLAPSE_MADV_COLLAPSE,	/* MADV_COLLAPSE after access */
> +};
> +
> +/*
> + * Function to mmap a buffer, fault it in, madvise it appropriately (before
> + * page fault for MADV_HUGE, and after for MADV_COLLAPSE), and check if the
> + * mmap region is huge.
> + * Returns:
> + * 0 if test doesn't give hugepage
> + * 1 if test gives a hugepage
> + * -errno if mmap fails
> + */
> +static int test_mmap_thp(enum thp_collapse_type madvise_buf, size_t pmdsize)
> +{
> +	char *mem, *mmap_mem;
> +	size_t mmap_size;
> +	int ret;
> +
> +	/* For alignment purposes, we need twice the THP size. */
> +	mmap_size = 2 * pmdsize;
> +	mmap_mem = (char *)mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
> +				    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +	if (mmap_mem == MAP_FAILED)
> +		return -errno;
> +
> +	/* We need a THP-aligned memory area. */
> +	mem = (char *)(((uintptr_t)mmap_mem + pmdsize) & ~(pmdsize - 1));
> +
> +	if (madvise_buf == THP_COLLAPSE_MADV_HUGEPAGE)
> +		madvise(mem, pmdsize, MADV_HUGEPAGE);
> +
> +	/* Ensure memory is allocated */
> +	memset(mem, 1, pmdsize);
> +
> +	if (madvise_buf == THP_COLLAPSE_MADV_COLLAPSE)
> +		madvise(mem, pmdsize, MADV_COLLAPSE);
> +
> +	/* HACK: make sure we have a separate VMA that we can check reliably. */
> +	mprotect(mem, pmdsize, PROT_READ);

I mean you won't be _absolutely_ sure of this, as you might merge with an
adjacent read-only VMA.

The best way is always to map a PROT_NONE mapping first, then perform a
MAP_FIXED mapping into it.

Given 2 * PMD should guarantee at least 1 aligned PMD you can use, you could
do:

	char *reserve, *mem, *mmap_mem;

	...

	(set mmap_size)

	/* Reserve space so we don't get any unexpected merges around us. */

	reserve = mmap(NULL, 2 * pagesize + mmap_size, PROT_NONE, MAP_PRIVATE | MAP_ANON, -1, 0);
	if (reserve == MAP_FAILED)
		return -errno;

	mmap_mem = mmap(&reserve[pagesize], mmap_size, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0);
	...

You could then do your 'hack' (which is not really a hack, just fine I think).
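
Spelled out as a self-contained sketch (the helper name and the reserve_out
parameter are hypothetical, and the page size is assumed to come from
sysconf() rather than a test-global variable):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map 'size' bytes read/write with one PROT_NONE page
 * reserved on either side, so the usable range cannot merge with any
 * neighbouring VMA. On success, returns the usable mapping and stores the
 * start of the whole reservation (size + 2 pages) in *reserve_out for a
 * later munmap(). Returns NULL with errno set on failure.
 */
static void *map_with_guards(size_t size, void **reserve_out)
{
	size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
	size_t reserve_size = size + 2 * pagesize;
	char *reserve, *mem;

	/* Reserve address space so nothing unexpected sits next to us. */
	reserve = mmap(NULL, reserve_size, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (reserve == MAP_FAILED)
		return NULL;

	/* Place the real mapping inside, one page in from the start. */
	mem = mmap(reserve + pagesize, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	if (mem == MAP_FAILED) {
		int err = errno;

		munmap(reserve, reserve_size);
		errno = err;
		return NULL;
	}

	*reserve_out = reserve;
	return mem;
}
```

The PMD alignment step from the test would then be applied to the returned
pointer as before.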

> +
> +	ret = check_huge_anon(mem, 1, pmdsize);
> +	munmap(mmap_mem, mmap_size);
> +	return ret;
> +}
> +
> +static void prctl_thp_disable_completely_test(struct __test_metadata *const _metadata,
> +					      size_t pmdsize,
> +					      enum thp_enabled thp_policy)
> +{
> +	ASSERT_EQ(prctl(PR_GET_THP_DISABLE, NULL, NULL, NULL, NULL), 1);
> +
> +	/* tests after prctl overrides global policy */
> +	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_NONE, pmdsize), 0);
> +
> +	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_HUGEPAGE, pmdsize), 0);
> +
> +	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_COLLAPSE, pmdsize), 0);
> +
> +	/* Reset to global policy */
> +	ASSERT_EQ(prctl(PR_SET_THP_DISABLE, 0, NULL, NULL, NULL), 0);
> +
> +	/* tests after prctl is cleared, and only global policy is effective */
> +	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_NONE, pmdsize),
> +		  thp_policy == THP_ALWAYS ? 1 : 0);
> +
> +	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_HUGEPAGE, pmdsize),
> +		  thp_policy == THP_NEVER ? 0 : 1);
> +
> +	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_COLLAPSE, pmdsize), 1);
> +}
> +
> +FIXTURE(prctl_thp_disable_completely)
> +{
> +	struct thp_settings settings;
> +	size_t pmdsize;
> +};
> +
> +FIXTURE_VARIANT(prctl_thp_disable_completely)
> +{
> +	enum thp_enabled thp_policy;
> +};
> +
> +FIXTURE_VARIANT_ADD(prctl_thp_disable_completely, never)
> +{
> +	.thp_policy = THP_NEVER,
> +};
> +
> +FIXTURE_VARIANT_ADD(prctl_thp_disable_completely, madvise)
> +{
> +	.thp_policy = THP_MADVISE,
> +};
> +
> +FIXTURE_VARIANT_ADD(prctl_thp_disable_completely, always)
> +{
> +	.thp_policy = THP_ALWAYS,
> +};
> +

Nice!

> +FIXTURE_SETUP(prctl_thp_disable_completely)
> +{
> +	if (!thp_available())
> +		SKIP(return, "Transparent Hugepages not available\n");
> +
> +	self->pmdsize = read_pmd_pagesize();
> +	if (!self->pmdsize)
> +		SKIP(return, "Unable to read PMD size\n");
> +
> +	if (prctl(PR_SET_THP_DISABLE, 1, NULL, NULL, NULL))
> +		SKIP(return, "Unable to disable THPs completely for the process\n");

Hm, shouldn't this be a test failure?

> +
> +	thp_save_settings();
> +	thp_read_settings(&self->settings);
> +	self->settings.thp_enabled = variant->thp_policy;

Ugh this variable name is horrid, not your fault. I see you've renamed it at
least in the variant field.

That's not one for this series though, one for a follow up.


> +	self->settings.hugepages[sz2ord(self->pmdsize, getpagesize())].enabled = THP_INHERIT;
> +	thp_write_settings(&self->settings);
> +}
> +
> +FIXTURE_TEARDOWN(prctl_thp_disable_completely)
> +{
> +	thp_restore_settings();
> +}
> +
> +TEST_F(prctl_thp_disable_completely, nofork)
> +{
> +	prctl_thp_disable_completely_test(_metadata, self->pmdsize, variant->thp_policy);
> +}
> +
> +TEST_F(prctl_thp_disable_completely, fork)
> +{
> +	int ret = 0;
> +	pid_t pid;
> +
> +	/* Make sure prctl changes are carried across fork */
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	if (!pid)
> +		prctl_thp_disable_completely_test(_metadata, self->pmdsize, variant->thp_policy);
> +
> +	wait(&ret);
> +	if (WIFEXITED(ret))
> +		ret = WEXITSTATUS(ret);
> +	else
> +		ret = -EINVAL;
> +	ASSERT_EQ(ret, 0);
> +}
> +
> +TEST_HARNESS_MAIN
> diff --git a/tools/testing/selftests/mm/thp_settings.c b/tools/testing/selftests/mm/thp_settings.c
> index bad60ac52874a..574bd0f8ae480 100644
> --- a/tools/testing/selftests/mm/thp_settings.c
> +++ b/tools/testing/selftests/mm/thp_settings.c
> @@ -382,10 +382,17 @@ unsigned long thp_shmem_supported_orders(void)
>  	return __thp_supported_orders(true);
>  }
>
> -bool thp_is_enabled(void)
> +bool thp_available(void)
>  {
>  	if (access(THP_SYSFS, F_OK) != 0)
>  		return false;
> +	return true;
> +}
> +
> +bool thp_is_enabled(void)
> +{
> +	if (!thp_available())
> +		return false;
>
>  	int mode = thp_read_string("enabled", thp_enabled_strings);
>
> diff --git a/tools/testing/selftests/mm/thp_settings.h b/tools/testing/selftests/mm/thp_settings.h
> index 6c07f70beee97..76eeb712e5f10 100644
> --- a/tools/testing/selftests/mm/thp_settings.h
> +++ b/tools/testing/selftests/mm/thp_settings.h
> @@ -84,6 +84,7 @@ void thp_set_read_ahead_path(char *path);
>  unsigned long thp_supported_orders(void);
>  unsigned long thp_shmem_supported_orders(void);
>
> +bool thp_available(void);
>  bool thp_is_enabled(void);
>
>  #endif /* __THP_SETTINGS_H__ */
> --
> 2.47.3
>


* Re: [PATCH v4 7/7] selftests: prctl: introduce tests for disabling THPs except for madvise
  2025-08-13 13:55 ` [PATCH v4 7/7] selftests: prctl: introduce tests for disabling THPs except for madvise Usama Arif
@ 2025-08-13 15:13   ` Lorenzo Stoakes
  2025-08-13 16:24     ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-08-13 15:13 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, linux-fsdevel, corbet, rppt,
	surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
	laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
	ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
	linux-doc, kernel-team

On Wed, Aug 13, 2025 at 02:55:42PM +0100, Usama Arif wrote:
> The test will set the global system THP setting to never, madvise
> or always depending on the fixture variant and the 2M setting to
> inherit before it starts (and reset to original at teardown).
> The fixture setup will also test if PR_SET_THP_DISABLE prctl call can
> be made with PR_THP_DISABLE_EXCEPT_ADVISED and skip if it fails.
>
> This tests if the process can:
> - successfully get the policy to disable THPs except for madvise.
> - get hugepages only on MADV_HUGE and MADV_COLLAPSE if the global policy
>   is madvise/always and only with MADV_COLLAPSE if the global policy is
>   never.
> - successfully reset the policy of the process.
> - after reset, only get hugepages with:
>   - MADV_COLLAPSE when policy is set to never.
>   - MADV_HUGE and MADV_COLLAPSE when policy is set to madvise.
>   - always when policy is set to "always".
> - repeat the above tests in a forked process to make sure the policy is
>   carried across forks.
>
> Test results:
> ./prctl_thp_disable
> TAP version 13
> 1..12
> ok 1 prctl_thp_disable_completely.never.nofork
> ok 2 prctl_thp_disable_completely.never.fork
> ok 3 prctl_thp_disable_completely.madvise.nofork
> ok 4 prctl_thp_disable_completely.madvise.fork
> ok 5 prctl_thp_disable_completely.always.nofork
> ok 6 prctl_thp_disable_completely.always.fork
> ok 7 prctl_thp_disable_except_madvise.never.nofork
> ok 8 prctl_thp_disable_except_madvise.never.fork
> ok 9 prctl_thp_disable_except_madvise.madvise.nofork
> ok 10 prctl_thp_disable_except_madvise.madvise.fork
> ok 11 prctl_thp_disable_except_madvise.always.nofork
> ok 12 prctl_thp_disable_except_madvise.always.fork
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>

I don't see any tests asserting VM_NOHUGEPAGE behaviour; could you expand
this test to add that?

Other than that this is looking good.

Thanks!

> ---
>  .../testing/selftests/mm/prctl_thp_disable.c  | 107 ++++++++++++++++++
>  1 file changed, 107 insertions(+)
>
> diff --git a/tools/testing/selftests/mm/prctl_thp_disable.c b/tools/testing/selftests/mm/prctl_thp_disable.c
> index 8845e9f414560..9bfed4598a1a6 100644
> --- a/tools/testing/selftests/mm/prctl_thp_disable.c
> +++ b/tools/testing/selftests/mm/prctl_thp_disable.c
> @@ -16,6 +16,10 @@
>  #include "thp_settings.h"
>  #include "vm_util.h"
>
> +#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
> +#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
> +#endif
> +
>  enum thp_collapse_type {
>  	THP_COLLAPSE_NONE,
>  	THP_COLLAPSE_MADV_HUGEPAGE,	/* MADV_HUGEPAGE before access */
> @@ -165,4 +169,107 @@ TEST_F(prctl_thp_disable_completely, fork)
>  	ASSERT_EQ(ret, 0);
>  }
>
> +static void prctl_thp_disable_except_madvise_test(struct __test_metadata *const _metadata,
> +						  size_t pmdsize,
> +						  enum thp_enabled thp_policy)
> +{
> +	ASSERT_EQ(prctl(PR_GET_THP_DISABLE, NULL, NULL, NULL, NULL), 3);
> +
> +	/* tests after prctl overrides global policy */
> +	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_NONE, pmdsize), 0);
> +
> +	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_HUGEPAGE, pmdsize),
> +		  thp_policy == THP_NEVER ? 0 : 1);
> +
> +	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_COLLAPSE, pmdsize), 1);
> +
> +	/* Reset to global policy */
> +	ASSERT_EQ(prctl(PR_SET_THP_DISABLE, 0, NULL, NULL, NULL), 0);
> +
> +	/* tests after prctl is cleared, and only global policy is effective */
> +	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_NONE, pmdsize),
> +		  thp_policy == THP_ALWAYS ? 1 : 0);
> +
> +	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_HUGEPAGE, pmdsize),
> +		  thp_policy == THP_NEVER ? 0 : 1);
> +
> +	ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_COLLAPSE, pmdsize), 1);
> +}
> +
> +FIXTURE(prctl_thp_disable_except_madvise)
> +{
> +	struct thp_settings settings;
> +	size_t pmdsize;
> +};
> +
> +FIXTURE_VARIANT(prctl_thp_disable_except_madvise)
> +{
> +	enum thp_enabled thp_policy;
> +};
> +
> +FIXTURE_VARIANT_ADD(prctl_thp_disable_except_madvise, never)
> +{
> +	.thp_policy = THP_NEVER,
> +};
> +
> +FIXTURE_VARIANT_ADD(prctl_thp_disable_except_madvise, madvise)
> +{
> +	.thp_policy = THP_MADVISE,
> +};
> +
> +FIXTURE_VARIANT_ADD(prctl_thp_disable_except_madvise, always)
> +{
> +	.thp_policy = THP_ALWAYS,
> +};

again the kselftest_harness stuff is really useful!

> +
> +FIXTURE_SETUP(prctl_thp_disable_except_madvise)
> +{
> +	if (!thp_available())
> +		SKIP(return, "Transparent Hugepages not available\n");
> +
> +	self->pmdsize = read_pmd_pagesize();
> +	if (!self->pmdsize)
> +		SKIP(return, "Unable to read PMD size\n");
> +
> +	if (prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, NULL, NULL))
> +		SKIP(return, "Unable to set PR_THP_DISABLE_EXCEPT_ADVISED\n");

This should be a test fail I think, as the only ways this could fail are
invalid flags, or failure to obtain an mmap write lock.

> +
> +	thp_save_settings();
> +	thp_read_settings(&self->settings);
> +	self->settings.thp_enabled = variant->thp_policy;
> +	self->settings.hugepages[sz2ord(self->pmdsize, getpagesize())].enabled = THP_INHERIT;
> +	thp_write_settings(&self->settings);
> +}
> +
> +FIXTURE_TEARDOWN(prctl_thp_disable_except_madvise)
> +{
> +	thp_restore_settings();
> +}
> +
> +TEST_F(prctl_thp_disable_except_madvise, nofork)
> +{
> +	prctl_thp_disable_except_madvise_test(_metadata, self->pmdsize, variant->thp_policy);
> +}
> +
> +TEST_F(prctl_thp_disable_except_madvise, fork)
> +{
> +	int ret = 0;
> +	pid_t pid;
> +
> +	/* Make sure prctl changes are carried across fork */
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	if (!pid)
> +		prctl_thp_disable_except_madvise_test(_metadata, self->pmdsize,
> +						      variant->thp_policy);
> +
> +	wait(&ret);
> +	if (WIFEXITED(ret))
> +		ret = WEXITSTATUS(ret);
> +	else
> +		ret = -EINVAL;
> +	ASSERT_EQ(ret, 0);
> +}
> +
>  TEST_HARNESS_MAIN
> --
> 2.47.3
>


* Re: [PATCH v4 7/7] selftests: prctl: introduce tests for disabling THPs except for madvise
  2025-08-13 15:13   ` Lorenzo Stoakes
@ 2025-08-13 16:24     ` David Hildenbrand
  2025-08-13 18:52       ` Lorenzo Stoakes
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-08-13 16:24 UTC (permalink / raw)
  To: Lorenzo Stoakes, Usama Arif
  Cc: Andrew Morton, linux-mm, linux-fsdevel, corbet, rppt, surenb,
	mhocko, hannes, baohua, shakeel.butt, riel, ziy, laoar.shao,
	dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team

>> +
>> +FIXTURE_SETUP(prctl_thp_disable_except_madvise)
>> +{
>> +	if (!thp_available())
>> +		SKIP(return, "Transparent Hugepages not available\n");
>> +
>> +	self->pmdsize = read_pmd_pagesize();
>> +	if (!self->pmdsize)
>> +		SKIP(return, "Unable to read PMD size\n");
>> +
>> +	if (prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, NULL, NULL))
>> +		SKIP(return, "Unable to set PR_THP_DISABLE_EXCEPT_ADVISED\n");
> 
> This should be a test fail I think, as the only ways this could fail are
> invalid flags, or failure to obtain an mmap write lock.

Running a kernel that does not support it?

We could check the errno to distinguish I guess.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v4 7/7] selftests: prctl: introduce tests for disabling THPs except for madvise
  2025-08-13 16:24     ` David Hildenbrand
@ 2025-08-13 18:52       ` Lorenzo Stoakes
  2025-08-14  9:32         ` David Hildenbrand
  2025-08-14 10:36         ` Usama Arif
  0 siblings, 2 replies; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-08-13 18:52 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Usama Arif, Andrew Morton, linux-mm, linux-fsdevel, corbet, rppt,
	surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
	laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
	ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
	linux-doc, kernel-team

On Wed, Aug 13, 2025 at 06:24:11PM +0200, David Hildenbrand wrote:
> > > +
> > > +FIXTURE_SETUP(prctl_thp_disable_except_madvise)
> > > +{
> > > +	if (!thp_available())
> > > +		SKIP(return, "Transparent Hugepages not available\n");
> > > +
> > > +	self->pmdsize = read_pmd_pagesize();
> > > +	if (!self->pmdsize)
> > > +		SKIP(return, "Unable to read PMD size\n");
> > > +
> > > +	if (prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, NULL, NULL))
> > > +		SKIP(return, "Unable to set PR_THP_DISABLE_EXCEPT_ADVISED\n");
> >
> > This should be a test fail I think, as the only ways this could fail are
> > invalid flags, or failure to obtain an mmap write lock.
>
> Running a kernel that does not support it?

I can't see anything in the kernel to #ifdef it out so I suppose you mean
running these tests on an older kernel?

But this is an unsupported way of running self-tests, they are tied to the
kernel version in which they reside, and test that specific version.

Unless I'm missing something here?

>
> We could check the errno to distinguish I guess.

Which one? The manpage says -EINVAL, but that can also be due to incorrect
invocation, so a typo could mean the tests pass while actually testing nothing :)
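
To make the ambiguity concrete, a sketch of what an errno-based check might
look like (the enum and function are hypothetical, and EINTR is only assumed
here as the error for a failed killable mmap write lock). The EINVAL branch
cannot tell "kernel does not know the flag" apart from "the test passed bad
arguments":

```c
#include <errno.h>

/* Hypothetical outcome of probing the prctl in fixture setup. */
enum probe_result {
	PROBE_OK,	/* prctl succeeded, run the test */
	PROBE_SKIP,	/* treated as "not supported" */
	PROBE_FAIL,	/* treated as a real failure */
};

/*
 * Classify a prctl() result from its return value and the errno captured
 * right after the call. EINVAL is ambiguous: it covers both an unsupported
 * flag and a typo in the invocation, so skipping on it can hide a bug in
 * the test itself.
 */
static enum probe_result classify_prctl(int ret, int err)
{
	if (ret == 0)
		return PROBE_OK;
	if (err == EINVAL)
		return PROBE_SKIP;
	return PROBE_FAIL;
}
```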

>
> --
> Cheers,
>
> David / dhildenb
>


* Re: [PATCH v4 2/7] mm/huge_memory: convert "tva_flags" to "enum tva_type"
  2025-08-13 13:55 ` [PATCH v4 2/7] mm/huge_memory: convert "tva_flags" to "enum tva_type" Usama Arif
@ 2025-08-14  3:07   ` Yafang Shao
  2025-08-14 10:43     ` Usama Arif
  2025-08-14 14:59   ` Zi Yan
  1 sibling, 1 reply; 34+ messages in thread
From: Yafang Shao @ 2025-08-14  3:07 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, linux-fsdevel, corbet, rppt,
	surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy, dev.jain,
	baolin.wang, npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	vbabka, jannh, Arnd Bergmann, sj, linux-kernel, linux-doc,
	kernel-team

On Wed, Aug 13, 2025 at 9:57 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
> From: David Hildenbrand <david@redhat.com>
>
> When determining which THP orders are eligible for a VMA mapping,
> we have previously specified tva_flags, however it turns out it is
> really not necessary to treat these as flags.
>
> Rather, we distinguish between distinct modes.
>
> The only case where we previously combined flags was with
> TVA_ENFORCE_SYSFS, but we can avoid this by observing that this
> is the default, except for MADV_COLLAPSE or edge cases in
> collapse_pte_mapped_thp() and hugepage_vma_revalidate(), and
> adding a mode specifically for this case - TVA_FORCED_COLLAPSE.
>
> We have:
> * smaps handling for showing "THPeligible"
> * Pagefault handling
> * khugepaged handling
> * Forced collapse handling: primarily MADV_COLLAPSE, but also for
>   an edge case in collapse_pte_mapped_thp()
>
> Disregarding the edge cases, we want to ignore sysfs settings only
> when we are forcing a collapse through MADV_COLLAPSE, otherwise we
> want to enforce it, hence this patch does the following flag to enum
> conversions:
>
> * TVA_SMAPS | TVA_ENFORCE_SYSFS -> TVA_SMAPS
> * TVA_IN_PF | TVA_ENFORCE_SYSFS -> TVA_PAGEFAULT
> * TVA_ENFORCE_SYSFS             -> TVA_KHUGEPAGED
> * 0                             -> TVA_FORCED_COLLAPSE
>
> With this change, we immediately know if we are in the forced collapse
> case, which will be valuable next.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> Acked-by: Usama Arif <usamaarif642@gmail.com>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Acked-by: Yafang Shao <laoar.shao@gmail.com>

Hello Usama,

This change is also required by my BPF-based THP order selection
series [0]. Since this patch appears to be independent of the series,
could we merge it first into mm-new or mm-everything if the series
itself won't be merged shortly?

Link: https://lwn.net/Articles/1031829/ [0]

> ---
>  fs/proc/task_mmu.c      |  4 ++--
>  include/linux/huge_mm.h | 30 ++++++++++++++++++------------
>  mm/huge_memory.c        |  8 ++++----
>  mm/khugepaged.c         | 17 ++++++++---------
>  mm/memory.c             | 14 ++++++--------
>  5 files changed, 38 insertions(+), 35 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index e8e7bef345313..ced01cf3c5ab3 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1369,8 +1369,8 @@ static int show_smap(struct seq_file *m, void *v)
>         __show_smap(m, &mss, false);
>
>         seq_printf(m, "THPeligible:    %8u\n",
> -                  !!thp_vma_allowable_orders(vma, vma->vm_flags,
> -                          TVA_SMAPS | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL));
> +                  !!thp_vma_allowable_orders(vma, vma->vm_flags, TVA_SMAPS,
> +                                             THP_ORDERS_ALL));
>
>         if (arch_pkeys_enabled())
>                 seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 22b8b067b295e..92ea0b9771fae 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -94,12 +94,15 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>  #define THP_ORDERS_ALL \
>         (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_SPECIAL | THP_ORDERS_ALL_FILE_DEFAULT)
>
> -#define TVA_SMAPS              (1 << 0)        /* Will be used for procfs */
> -#define TVA_IN_PF              (1 << 1)        /* Page fault handler */
> -#define TVA_ENFORCE_SYSFS      (1 << 2)        /* Obey sysfs configuration */
> +enum tva_type {
> +       TVA_SMAPS,              /* Exposing "THPeligible:" in smaps. */
> +       TVA_PAGEFAULT,          /* Serving a page fault. */
> +       TVA_KHUGEPAGED,         /* Khugepaged collapse. */
> +       TVA_FORCED_COLLAPSE,    /* Forced collapse (e.g. MADV_COLLAPSE). */
> +};
>
> -#define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
> -       (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
> +#define thp_vma_allowable_order(vma, vm_flags, type, order) \
> +       (!!thp_vma_allowable_orders(vma, vm_flags, type, BIT(order)))
>
>  #define split_folio(f) split_folio_to_list(f, NULL)
>
> @@ -264,14 +267,14 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
>
>  unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>                                          vm_flags_t vm_flags,
> -                                        unsigned long tva_flags,
> +                                        enum tva_type type,
>                                          unsigned long orders);
>
>  /**
>   * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
>   * @vma:  the vm area to check
>   * @vm_flags: use these vm_flags instead of vma->vm_flags
> - * @tva_flags: Which TVA flags to honour
> + * @type: TVA type
>   * @orders: bitfield of all orders to consider
>   *
>   * Calculates the intersection of the requested hugepage orders and the allowed
> @@ -285,11 +288,14 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>  static inline
>  unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>                                        vm_flags_t vm_flags,
> -                                      unsigned long tva_flags,
> +                                      enum tva_type type,
>                                        unsigned long orders)
>  {
> -       /* Optimization to check if required orders are enabled early. */
> -       if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
> +       /*
> +        * Optimization to check if required orders are enabled early. Only
> +        * forced collapse ignores sysfs configs.
> +        */
> +       if (type != TVA_FORCED_COLLAPSE && vma_is_anonymous(vma)) {
>                 unsigned long mask = READ_ONCE(huge_anon_orders_always);
>
>                 if (vm_flags & VM_HUGEPAGE)
> @@ -303,7 +309,7 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>                         return 0;
>         }
>
> -       return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
> +       return __thp_vma_allowable_orders(vma, vm_flags, type, orders);
>  }
>
>  struct thpsize {
> @@ -547,7 +553,7 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
>
>  static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>                                         vm_flags_t vm_flags,
> -                                       unsigned long tva_flags,
> +                                       enum tva_type type,
>                                         unsigned long orders)
>  {
>         return 0;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6df1ed0cef5cf..9c716be949cbf 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -99,12 +99,12 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>
>  unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>                                          vm_flags_t vm_flags,
> -                                        unsigned long tva_flags,
> +                                        enum tva_type type,
>                                          unsigned long orders)
>  {
> -       bool smaps = tva_flags & TVA_SMAPS;
> -       bool in_pf = tva_flags & TVA_IN_PF;
> -       bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS;
> +       const bool smaps = type == TVA_SMAPS;
> +       const bool in_pf = type == TVA_PAGEFAULT;
> +       const bool enforce_sysfs = type != TVA_FORCED_COLLAPSE;
>         unsigned long supported_orders;
>
>         /* Check the intersection of requested and supported orders. */
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 1a416b8659972..d3d4f116e14b6 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -474,8 +474,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
>  {
>         if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
>             hugepage_pmd_enabled()) {
> -               if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
> -                                           PMD_ORDER))
> +               if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
>                         __khugepaged_enter(vma->vm_mm);
>         }
>  }
> @@ -921,7 +920,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>                                    struct collapse_control *cc)
>  {
>         struct vm_area_struct *vma;
> -       unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> +       enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED :
> +                                TVA_FORCED_COLLAPSE;
>
>         if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
>                 return SCAN_ANY_PROCESS;
> @@ -932,7 +932,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>
>         if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
>                 return SCAN_ADDRESS_RANGE;
> -       if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
> +       if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
>                 return SCAN_VMA_CHECK;
>         /*
>          * Anon VMA expected, the address may be unmapped then
> @@ -1533,9 +1533,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>          * in the page cache with a single hugepage. If a mm were to fault-in
>          * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
>          * and map it by a PMD, regardless of sysfs THP settings. As such, let's
> -        * analogously elide sysfs THP settings here.
> +        * analogously elide sysfs THP settings here and force collapse.
>          */
> -       if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
> +       if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
>                 return SCAN_VMA_CHECK;
>
>         /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
> @@ -2432,8 +2432,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>                         progress++;
>                         break;
>                 }
> -               if (!thp_vma_allowable_order(vma, vma->vm_flags,
> -                                       TVA_ENFORCE_SYSFS, PMD_ORDER)) {
> +               if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
>  skip:
>                         progress++;
>                         continue;
> @@ -2767,7 +2766,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>         BUG_ON(vma->vm_start > start);
>         BUG_ON(vma->vm_end < end);
>
> -       if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
> +       if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
>                 return -EINVAL;
>
>         cc = kmalloc(sizeof(*cc), GFP_KERNEL);
> diff --git a/mm/memory.c b/mm/memory.c
> index 002c28795d8b7..7b1e8f137fa3f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4515,8 +4515,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>          * Get a list of all the (large) orders below PMD_ORDER that are enabled
>          * and suitable for swapping THP.
>          */
> -       orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> -                       TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> +       orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
> +                                         BIT(PMD_ORDER) - 1);
>         orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>         orders = thp_swap_suitable_orders(swp_offset(entry),
>                                           vmf->address, orders);
> @@ -5063,8 +5063,8 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>          * for this vma. Then filter out the orders that can't be allocated over
>          * the faulting address and still be fully contained in the vma.
>          */
> -       orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> -                       TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> +       orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
> +                                         BIT(PMD_ORDER) - 1);
>         orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>
>         if (!orders)
> @@ -6254,8 +6254,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>                 return VM_FAULT_OOM;
>  retry_pud:
>         if (pud_none(*vmf.pud) &&
> -           thp_vma_allowable_order(vma, vm_flags,
> -                               TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) {
> +           thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PUD_ORDER)) {
>                 ret = create_huge_pud(&vmf);
>                 if (!(ret & VM_FAULT_FALLBACK))
>                         return ret;
> @@ -6289,8 +6288,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>                 goto retry_pud;
>
>         if (pmd_none(*vmf.pmd) &&
> -           thp_vma_allowable_order(vma, vm_flags,
> -                               TVA_IN_PF | TVA_ENFORCE_SYSFS, PMD_ORDER)) {
> +           thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) {
>                 ret = create_huge_pmd(&vmf);
>                 if (!(ret & VM_FAULT_FALLBACK))
>                         return ret;
> --
> 2.47.3
>
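
For anyone following the conversion table in the commit message, here is a
minimal standalone sketch of how the old flag combinations map onto the new
modes, and how the "enforce_sysfs" boolean in __thp_vma_allowable_orders()
falls out of the mode. The OLD_TVA_* bit values, the enum ordering and the
helper names tva_type_from_flags() and tva_enforce_sysfs() are assumptions
made for illustration only; they are not copied from the kernel headers.

```c
#include <stdbool.h>

/* Old-style flag bits; the exact values are assumed for this sketch. */
#define OLD_TVA_SMAPS          (1UL << 0)
#define OLD_TVA_IN_PF          (1UL << 1)
#define OLD_TVA_ENFORCE_SYSFS  (1UL << 2)

/* New-style modes named in the patch; the ordering here is assumed. */
enum tva_type {
	TVA_SMAPS,
	TVA_PAGEFAULT,
	TVA_KHUGEPAGED,
	TVA_FORCED_COLLAPSE,
};

/*
 * Hypothetical helper: translate an old flag combination into a mode.
 * Only the four combinations listed in the commit message are meaningful.
 */
static inline enum tva_type tva_type_from_flags(unsigned long flags)
{
	if (flags & OLD_TVA_SMAPS)
		return TVA_SMAPS;
	if (flags & OLD_TVA_IN_PF)
		return TVA_PAGEFAULT;
	if (flags & OLD_TVA_ENFORCE_SYSFS)
		return TVA_KHUGEPAGED;
	return TVA_FORCED_COLLAPSE;
}

/* Mirrors the derived boolean: only forced collapse ignores sysfs. */
static inline bool tva_enforce_sysfs(enum tva_type type)
{
	return type != TVA_FORCED_COLLAPSE;
}
```

With these definitions, tva_type_from_flags(0) yields TVA_FORCED_COLLAPSE,
which is the only mode for which tva_enforce_sysfs() returns false.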


-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 7/7] selftests: prctl: introduce tests for disabling THPs except for madvise
  2025-08-13 18:52       ` Lorenzo Stoakes
@ 2025-08-14  9:32         ` David Hildenbrand
  2025-08-14 10:49           ` Lorenzo Stoakes
  2025-08-14 10:36         ` Usama Arif
  1 sibling, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-08-14  9:32 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Usama Arif, Andrew Morton, linux-mm, linux-fsdevel, corbet, rppt,
	surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
	laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
	ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
	linux-doc, kernel-team

On 13.08.25 20:52, Lorenzo Stoakes wrote:
> On Wed, Aug 13, 2025 at 06:24:11PM +0200, David Hildenbrand wrote:
>>>> +
>>>> +FIXTURE_SETUP(prctl_thp_disable_except_madvise)
>>>> +{
>>>> +	if (!thp_available())
>>>> +		SKIP(return, "Transparent Hugepages not available\n");
>>>> +
>>>> +	self->pmdsize = read_pmd_pagesize();
>>>> +	if (!self->pmdsize)
>>>> +		SKIP(return, "Unable to read PMD size\n");
>>>> +
>>>> +	if (prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, NULL, NULL))
>>>> +		SKIP(return, "Unable to set PR_THP_DISABLE_EXCEPT_ADVISED\n");
>>>
>>> This should be a test fail I think, as the only ways this could fail are
>>> invalid flags, or failure to obtain an mmap write lock.
>>
>> Running a kernel that does not support it?
> 
> I can't see anything in the kernel to #ifdef it out so I suppose you mean
> running these tests on an older kernel?

Yes.

> 
> But this is an unsupported way of running self-tests, they are tied to the
> kernel version in which they reside, and test that specific version.
> 
> Unless I'm missing something here?

I remember we allow for a bit of flexibility when it is simple to handle.

Is that documented somewhere?

> 
>>
>> We could check the errno to distinguish I guess.
> 
> Which one? manpage says -EINVAL, but can also be due to incorrect invocation,
> which would mean a typo could mean tests pass but your tests do nothing :)

Right, no ENOSYS in that case to distinguish :(

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 7/7] selftests: prctl: introduce tests for disabling THPs except for madvise
  2025-08-13 18:52       ` Lorenzo Stoakes
  2025-08-14  9:32         ` David Hildenbrand
@ 2025-08-14 10:36         ` Usama Arif
  2025-08-14 10:53           ` Lorenzo Stoakes
  1 sibling, 1 reply; 34+ messages in thread
From: Usama Arif @ 2025-08-14 10:36 UTC (permalink / raw)
  To: Lorenzo Stoakes, David Hildenbrand
  Cc: Andrew Morton, linux-mm, linux-fsdevel, corbet, rppt, surenb,
	mhocko, hannes, baohua, shakeel.butt, riel, ziy, laoar.shao,
	dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
	jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team



On 13/08/2025 19:52, Lorenzo Stoakes wrote:
> On Wed, Aug 13, 2025 at 06:24:11PM +0200, David Hildenbrand wrote:
>>>> +
>>>> +FIXTURE_SETUP(prctl_thp_disable_except_madvise)
>>>> +{
>>>> +	if (!thp_available())
>>>> +		SKIP(return, "Transparent Hugepages not available\n");
>>>> +
>>>> +	self->pmdsize = read_pmd_pagesize();
>>>> +	if (!self->pmdsize)
>>>> +		SKIP(return, "Unable to read PMD size\n");
>>>> +
>>>> +	if (prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, NULL, NULL))
>>>> +		SKIP(return, "Unable to set PR_THP_DISABLE_EXCEPT_ADVISED\n");
>>>
>>> This should be a test fail I think, as the only ways this could fail are
>>> invalid flags, or failure to obtain an mmap write lock.
>>
>> Running a kernel that does not support it?
> 
> I can't see anything in the kernel to #ifdef it out so I suppose you mean
> running these tests on an older kernel?
> 

It was a fail in my previous revision
(https://lore.kernel.org/all/9bcb1dee-314e-4366-9bad-88a47d516c79@redhat.com/)

I do believe people (including me :)) get the latest kernel selftest and run it on
older kernels.
It might not be the right way to run selftests, but I do think it's done.

> But this is an unsupported way of running self-tests, they are tied to the
> kernel version in which they reside, and test that specific version.
> 
> Unless I'm missing something here?
> 
>>
>> We could check the errno to distinguish I guess.
> 
> Which one? manpage says -EINVAL, but can also be due to incorrect invocation,
> which would mean a typo could mean tests pass but your tests do nothing :)
> 

Yeah I don't think we can distinguish between the prctl not being available (i.e. older kernel)
and the prctl not working as it should.

We just need to decide whether to fail or skip.
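
To make the ambiguity concrete, a rough userspace sketch (not the selftest
code itself) of what a caller can learn from the return value. The helper
names classify_thp_prctl() and try_thp_disable_except_advised() are
hypothetical, and the fallback value for PR_THP_DISABLE_EXCEPT_ADVISED is an
assumption for headers that do not carry the flag yet:

```c
#include <errno.h>
#include <sys/prctl.h>

#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif

/* Flag added by this series; the fallback value is an assumption. */
#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
#endif

enum thp_prctl_outcome {
	THP_PRCTL_OK,
	THP_PRCTL_AMBIGUOUS,	/* EINVAL: older kernel or bad arguments */
	THP_PRCTL_OTHER_ERROR,	/* anything else, e.g. an interrupted lock */
};

/* Hypothetical helper: classify a prctl() return value and errno. */
static inline enum thp_prctl_outcome classify_thp_prctl(int ret, int err)
{
	if (ret == 0)
		return THP_PRCTL_OK;
	if (err == EINVAL)
		return THP_PRCTL_AMBIGUOUS;
	return THP_PRCTL_OTHER_ERROR;
}

/* Hypothetical wrapper around the call made in FIXTURE_SETUP. */
static inline enum thp_prctl_outcome try_thp_disable_except_advised(void)
{
	int ret = prctl(PR_SET_THP_DISABLE, 1UL,
			(unsigned long)PR_THP_DISABLE_EXCEPT_ADVISED, 0UL, 0UL);

	return classify_thp_prctl(ret, ret ? errno : 0);
}
```

THP_PRCTL_AMBIGUOUS is the case under discussion: nothing in the return
value tells an unsupported flag apart from a typo in the arguments.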

If the right way is to always run selftests from the same kernel version as the host
on which they are being run, we can just fail? I can go back to the older version of
doing things and move the failure from FIXTURE_SETUP to TEST_F?

>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/7] mm/huge_memory: convert "tva_flags" to "enum tva_type"
  2025-08-14  3:07   ` Yafang Shao
@ 2025-08-14 10:43     ` Usama Arif
  2025-08-15  1:11       ` Andrew Morton
  0 siblings, 1 reply; 34+ messages in thread
From: Usama Arif @ 2025-08-14 10:43 UTC (permalink / raw)
  To: Yafang Shao, Andrew Morton, david
  Cc: linux-mm, linux-fsdevel, corbet, rppt, surenb, mhocko, hannes,
	baohua, shakeel.butt, riel, ziy, dev.jain, baolin.wang, npache,
	lorenzo.stoakes, Liam.Howlett, ryan.roberts, vbabka, jannh,
	Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team



On 14/08/2025 04:07, Yafang Shao wrote:
> On Wed, Aug 13, 2025 at 9:57 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>> From: David Hildenbrand <david@redhat.com>
>>
>> When determining which THP orders are eligible for a VMA mapping,
>> we have previously specified tva_flags, however it turns out it is
>> really not necessary to treat these as flags.
>>
>> Rather, we distinguish between distinct modes.
>>
>> The only case where we previously combined flags was with
>> TVA_ENFORCE_SYSFS, but we can avoid this by observing that this
>> is the default, except for MADV_COLLAPSE or edge cases in
>> collapse_pte_mapped_thp() and hugepage_vma_revalidate(), and
>> adding a mode specifically for this case - TVA_FORCED_COLLAPSE.
>>
>> We have:
>> * smaps handling for showing "THPeligible"
>> * Pagefault handling
>> * khugepaged handling
>> * Forced collapse handling: primarily MADV_COLLAPSE, but also for
>>   an edge case in collapse_pte_mapped_thp()
>>
>> Disregarding the edge cases, we want to ignore sysfs settings only
>> when we are forcing a collapse through MADV_COLLAPSE, otherwise we
>> want to enforce it, hence this patch does the following flag to enum
>> conversions:
>>
>> * TVA_SMAPS | TVA_ENFORCE_SYSFS -> TVA_SMAPS
>> * TVA_IN_PF | TVA_ENFORCE_SYSFS -> TVA_PAGEFAULT
>> * TVA_ENFORCE_SYSFS             -> TVA_KHUGEPAGED
>> * 0                             -> TVA_FORCED_COLLAPSE
>>
>> With this change, we immediately know if we are in the forced collapse
>> case, which will be valuable next.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> Acked-by: Usama Arif <usamaarif642@gmail.com>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> 
> Acked-by: Yafang Shao <laoar.shao@gmail.com>
> 
> Hello Usama,
> 
> This change is also required by my BPF-based THP order selection
> series [0]. Since this patch appears to be independent of the series,
> could we merge it first into mm-new or mm-everything if the series
> itself won't be merged shortly?
> 
> Link: https://lwn.net/Articles/1031829/ [0]
> 

Thanks for reviewing!

All of the patches in the series have several acks/reviews. Only a small change
might be required in selftest, so hopefully the next revision is the last one.

Andrew - would it be ok to start including this entire series in the mm-new now?

Thanks!


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 7/7] selftests: prctl: introduce tests for disabling THPs except for madvise
  2025-08-14  9:32         ` David Hildenbrand
@ 2025-08-14 10:49           ` Lorenzo Stoakes
  2025-08-14 11:45             ` Mark Brown
  0 siblings, 1 reply; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-08-14 10:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Usama Arif, Andrew Morton, linux-mm, linux-fsdevel, corbet, rppt,
	surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
	laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
	ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
	linux-doc, kernel-team, Mark Brown

+cc Mark who might have insights here

On Thu, Aug 14, 2025 at 11:32:55AM +0200, David Hildenbrand wrote:
> On 13.08.25 20:52, Lorenzo Stoakes wrote:
> > On Wed, Aug 13, 2025 at 06:24:11PM +0200, David Hildenbrand wrote:
> > > > > +
> > > > > +FIXTURE_SETUP(prctl_thp_disable_except_madvise)
> > > > > +{
> > > > > +	if (!thp_available())
> > > > > +		SKIP(return, "Transparent Hugepages not available\n");
> > > > > +
> > > > > +	self->pmdsize = read_pmd_pagesize();
> > > > > +	if (!self->pmdsize)
> > > > > +		SKIP(return, "Unable to read PMD size\n");
> > > > > +
> > > > > +	if (prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, NULL, NULL))
> > > > > +		SKIP(return, "Unable to set PR_THP_DISABLE_EXCEPT_ADVISED\n");
> > > >
> > > > This should be a test fail I think, as the only ways this could fail are
> > > > invalid flags, or failure to obtain an mmap write lock.
> > >
> > > Running a kernel that does not support it?
> >
> > I can't see anything in the kernel to #ifdef it out so I suppose you mean
> > running these tests on an older kernel?
>
> Yes.
>
> >
> > But this is an unsupported way of running self-tests, they are tied to the
> > kernel version in which they reside, and test that specific version.
> >
> > Unless I'm missing something here?
>
> I remember we allow for a bit of flexibility when it is simple to handle.
>
> Is that documented somewhere?

Not sure if it's documented, but it'd make testing extremely egregious if
you had to consider all of the possible kernels and interactions and etc.

I think it's 'if it happens to work then fine' but otherwise it is expected
that the tests match the kernel.

It's also very neat that with a revision you get a set of (hopefully)
working tests for that revision :)

>
> >
> > >
> > > We could check the errno to distinguish I guess.
> >
> > Which one? manpage says -EINVAL, but can also be due to incorrect invocation,
> > which would mean a typo could mean tests pass but your tests do nothing :)
>
> Right, no ENOSYS in that case to distinguish :(

Yup sadly

>
> --
> Cheers
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 7/7] selftests: prctl: introduce tests for disabling THPs except for madvise
  2025-08-14 10:36         ` Usama Arif
@ 2025-08-14 10:53           ` Lorenzo Stoakes
  2025-08-14 11:51             ` Usama Arif
  0 siblings, 1 reply; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-08-14 10:53 UTC (permalink / raw)
  To: Usama Arif
  Cc: David Hildenbrand, Andrew Morton, linux-mm, linux-fsdevel, corbet,
	rppt, surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
	laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
	ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
	linux-doc, kernel-team, Mark Brown

On Thu, Aug 14, 2025 at 11:36:51AM +0100, Usama Arif wrote:
>
>
> On 13/08/2025 19:52, Lorenzo Stoakes wrote:
> > On Wed, Aug 13, 2025 at 06:24:11PM +0200, David Hildenbrand wrote:
> >>>> +
> >>>> +FIXTURE_SETUP(prctl_thp_disable_except_madvise)
> >>>> +{
> >>>> +	if (!thp_available())
> >>>> +		SKIP(return, "Transparent Hugepages not available\n");
> >>>> +
> >>>> +	self->pmdsize = read_pmd_pagesize();
> >>>> +	if (!self->pmdsize)
> >>>> +		SKIP(return, "Unable to read PMD size\n");
> >>>> +
> >>>> +	if (prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, NULL, NULL))
> >>>> +		SKIP(return, "Unable to set PR_THP_DISABLE_EXCEPT_ADVISED\n");
> >>>
> >>> This should be a test fail I think, as the only ways this could fail are
> >>> invalid flags, or failure to obtain an mmap write lock.
> >>
> >> Running a kernel that does not support it?
> >
> > I can't see anything in the kernel to #ifdef it out so I suppose you mean
> > running these tests on an older kernel?
> >
>
> It was a fail in my previous revision
> (https://lore.kernel.org/all/9bcb1dee-314e-4366-9bad-88a47d516c79@redhat.com/)

Well it seems it's a debate between me and David then haha :P sorry.

This is a bit of a trivial thing, I'm just keen that bugs don't get accidentally
missed because of skips, that's the most important thing I think.

>
> I do believe people (including me :)) get the latest kernel selftest and run it on
> older kernels.
> It might not be the right way to run selftests, but I do think it's done.

People can do unsupported things, but then if it breaks that's on them to live
with :)

>
> > But this is an unsupported way of running self-tests, they are tied to the
> > kernel version in which they reside, and test that specific version.
> >
> > Unless I'm missing something here?
> >
> >>
> >> We could check the errno to distinguish I guess.
> >
> > Which one? manpage says -EINVAL, but can also be due to incorrect invocation,
> > which would mean a typo could mean tests pass but your tests do nothing :)
> >
>
> Yeah I don't think we can distinguish between the prctl not being available (i.e. older kernel)
> and the prctl not working as it should.
>
> We just need to decide whether to fail or skip.

I really think it's far worse to miss a bug in the code (or testing) than to
account for people running with different kernels.

>
> If the right way is to always run selftests from the same kernel version as the host
> on which they are being run, we can just fail? I can go back to the older version of
> doing things and move the failure from FIXTURE_SETUP to TEST_F?

Yeah I think it simply should just be a fail.

Why would you move things around though? Think it's fine as-is, if something on
setup fails then all tests should fail.

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 7/7] selftests: prctl: introduce tests for disabling THPs except for madvise
  2025-08-14 10:49           ` Lorenzo Stoakes
@ 2025-08-14 11:45             ` Mark Brown
  2025-08-14 12:00               ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Mark Brown @ 2025-08-14 11:45 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, Usama Arif, Andrew Morton, linux-mm,
	linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
	shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, jannh, Arnd Bergmann,
	sj, linux-kernel, linux-doc, kernel-team

On Thu, Aug 14, 2025 at 11:49:15AM +0100, Lorenzo Stoakes wrote:
> On Thu, Aug 14, 2025 at 11:32:55AM +0200, David Hildenbrand wrote:
> > On 13.08.25 20:52, Lorenzo Stoakes wrote:

> > > I can't see anything in the kernel to #ifdef it out so I suppose you mean
> > > running these tests on an older kernel?

...

> > > But this is an unsupported way of running self-tests, they are tied to the
> > > kernel version in which they reside, and test that specific version.

> > > Unless I'm missing something here?

> > I remember we allow for a bit of flexibility when it is simple to handle.

> > Is that documented somewhere?

> Not sure if it's documented, but it'd make testing extremely egregious if
> you had to consider all of the possible kernels and interactions and etc.

> I think it's 'if it happens to work then fine' but otherwise it is expected
> that the tests match the kernel.

> It's also very neat that with a revision you get a set of (hopefully)
> working tests for that revision :)

Some people do try to run the selftests with older kernels, they're
trying to get better coverage for the stables.  For a lot of areas the
skipping falls out naturally since there's some optionality (so even with
the same kernel version you might not have the feature in the running
kernel) or it's a new API which has a discovery mechanism in the ABI
anyway.  OTOH some areas have been actively hostile to the idea of
running on older kernels so there are things that do break when you try.
TBH so long as the tests don't crash the system or something people are
probably just going to ignore any tests that have never passed.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 7/7] selftests: prctl: introduce tests for disabling THPs except for madvise
  2025-08-14 10:53           ` Lorenzo Stoakes
@ 2025-08-14 11:51             ` Usama Arif
  0 siblings, 0 replies; 34+ messages in thread
From: Usama Arif @ 2025-08-14 11:51 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, Andrew Morton, linux-mm, linux-fsdevel, corbet,
	rppt, surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
	laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
	ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
	linux-doc, kernel-team, Mark Brown



> 
> Why would you move things around though? Think it's fine as-is, if something on
> setup fails then all tests should fail.

If it's a "test" itself and not a check, I think it's better if it belongs in TEST_F and
not FIXTURE_SETUP.
But yeah this is of course going to be the first test, so if it fails the entire thing
is marked as a failure and we don't proceed.

> 
> Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 7/7] selftests: prctl: introduce tests for disabling THPs except for madvise
  2025-08-14 11:45             ` Mark Brown
@ 2025-08-14 12:00               ` David Hildenbrand
  2025-08-14 12:09                 ` Mark Brown
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-08-14 12:00 UTC (permalink / raw)
  To: Mark Brown, Lorenzo Stoakes
  Cc: Usama Arif, Andrew Morton, linux-mm, linux-fsdevel, corbet, rppt,
	surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
	laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
	ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
	linux-doc, kernel-team

On 14.08.25 13:45, Mark Brown wrote:
> On Thu, Aug 14, 2025 at 11:49:15AM +0100, Lorenzo Stoakes wrote:
>> On Thu, Aug 14, 2025 at 11:32:55AM +0200, David Hildenbrand wrote:
>>> On 13.08.25 20:52, Lorenzo Stoakes wrote:
> 
>>>> I can't see anything in the kernel to #ifdef it out so I suppose you mean
>>>> running these tests on an older kernel?
> 
> ...
> 
>>>> But this is an unsupported way of running self-tests, they are tied to the
>>>> kernel version in which they reside, and test that specific version.
> 
>>>> Unless I'm missing something here?
> 
>>> I remember we allow for a bit of flexibility when it is simple to handle.
> 
>>> Is that documented somewhere?
> 
>> Not sure if it's documented, but it'd make testing extremely egregious if
>> you had to consider all of the possible kernels and interactions and etc.
> 
>> I think it's 'if it happens to work then fine' but otherwise it is expected
>> that the tests match the kernel.
> 
>> It's also very neat that with a revision you get a set of (hopefully)
>> working tests for that revision :)
> 
> Some people do try to run the selftests with older kernels, they're
> trying to get better coverage for the stables.  For a lot of areas the
> skipping falls out naturally since there's some optionality (so even with
> the same kernel version you might not have the feature in the running
> kernel) or it's a new API which has a discovery mechanism in the ABI
> anyway.  OTOH some areas have been actively hostile to the idea of
> running on older kernels so there are things that do break when you try.
> TBH so long as the tests don't crash the system or something people are
> probably just going to ignore any tests that have never passed.

Some people (hello :) ) run tests against distro kernels ... shame that 
prctl just knows one sort of "EINVAL" so we cannot distinguish :(

But yeah, maybe one has to be more careful of filtering these failures 
out then.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 7/7] selftests: prctl: introduce tests for disabling THPs except for madvise
  2025-08-14 12:00               ` David Hildenbrand
@ 2025-08-14 12:09                 ` Mark Brown
  2025-08-14 12:59                   ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Mark Brown @ 2025-08-14 12:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Usama Arif, Andrew Morton, linux-mm,
	linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
	shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, jannh, Arnd Bergmann,
	sj, linux-kernel, linux-doc, kernel-team

On Thu, Aug 14, 2025 at 02:00:27PM +0200, David Hildenbrand wrote:

> Some people (hello :) ) run tests against distro kernels ... shame that
> prctl just knows one sort of "EINVAL" so we cannot distinguish :(

> But yeah, maybe one has to be more careful of filtering these failures out
> then.

Perhaps this is something that needs considering in the ABI, so
userspace can reasonably figure out if it failed to configure whatever
is being configured due to a missing feature (in which case it should
fall back to not using that feature somehow) or due to it messing
something else up?  We might be happy with the tests being version
specific but general userspace should be able to be a bit more robust.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 7/7] selftests: prctl: introduce tests for disabling THPs except for madvise
  2025-08-14 12:09                 ` Mark Brown
@ 2025-08-14 12:59                   ` David Hildenbrand
  2025-08-14 13:08                     ` Mark Brown
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-08-14 12:59 UTC (permalink / raw)
  To: Mark Brown
  Cc: Lorenzo Stoakes, Usama Arif, Andrew Morton, linux-mm,
	linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
	shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, jannh, Arnd Bergmann,
	sj, linux-kernel, linux-doc, kernel-team

On 14.08.25 14:09, Mark Brown wrote:
> On Thu, Aug 14, 2025 at 02:00:27PM +0200, David Hildenbrand wrote:
> 
>> Some people (hello :) ) run tests against distro kernels ... shame that
>> prctl just knows one sort of "EINVAL" so we cannot distinguish :(
> 
>> But yeah, maybe one has to be more careful of filtering these failures out
>> then.
> 
> Perhaps this is something that needs considering in the ABI, so
> userspace can reasonably figure out if it failed to configure whatever
> is being configured due to a missing feature (in which case it should
> fall back to not using that feature somehow) or due to it messing
> something else up?  We might be happy with the tests being version
> specific but general userspace should be able to be a bit more robust.

Yeah, the whole prctl() ship has sailed, unfortunately :(

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 7/7] selftests: prctl: introduce tests for disabling THPs except for madvise
  2025-08-14 12:59                   ` David Hildenbrand
@ 2025-08-14 13:08                     ` Mark Brown
  2025-08-14 15:02                       ` Lorenzo Stoakes
  0 siblings, 1 reply; 34+ messages in thread
From: Mark Brown @ 2025-08-14 13:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Usama Arif, Andrew Morton, linux-mm,
	linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
	shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, jannh, Arnd Bergmann,
	sj, linux-kernel, linux-doc, kernel-team

On Thu, Aug 14, 2025 at 02:59:13PM +0200, David Hildenbrand wrote:
> On 14.08.25 14:09, Mark Brown wrote:

> > Perhaps this is something that needs considering in the ABI, so
> > userspace can reasonably figure out if it failed to configure whatever
> > is being configured due to a missing feature (in which case it should
> > fall back to not using that feature somehow) or due to it messing
> > something else up?  We might be happy with the tests being version
> > specific but general userspace should be able to be a bit more robust.

> Yeah, the whole prctl() ship has sailed, unfortunately :(

Perhaps a second call or sysfs file or something that returns the
supported mask?  You'd still have a bootstrapping issue with existing
versions but at least any newer stuff would be helped.
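
To illustrate the ambiguity discussed above, here is a minimal userspace
sketch of the fallback a caller can attempt today. The helper name, the
enum, and the policy of falling back to disabling THPs completely are
assumptions made for illustration; the flag value is the one from the uapi
hunk in patch 3/7.

```c
#include <errno.h>
#include <sys/prctl.h>

#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif
/* Value from the uapi hunk in patch 3/7; older headers do not define it. */
#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
#endif

/* Hypothetical result type, not part of any kernel or libc API. */
enum thp_mode {
	THP_MODE_EXCEPT_ADVISED,	/* new flag accepted */
	THP_MODE_DISABLED,		/* fell back to disabling completely */
	THP_MODE_UNCHANGED,		/* nothing could be configured */
};

/* Hypothetical helper: try the new flag, then fall back on EINVAL. */
static enum thp_mode thp_disable_best_effort(void)
{
	if (prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, 0, 0) == 0)
		return THP_MODE_EXCEPT_ADVISED;

	/*
	 * EINVAL does not say whether the flag is unknown to this kernel or
	 * the arguments were otherwise wrong, so treating it as "feature
	 * missing" is only a guess.
	 */
	if (errno == EINVAL && prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) == 0)
		return THP_MODE_DISABLED;

	return THP_MODE_UNCHANGED;
}
```

Whether disabling THPs completely is an acceptable fallback depends on the
workload; a caller may equally prefer to leave the setting untouched when
the flag is rejected.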

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/7] mm/huge_memory: convert "tva_flags" to "enum tva_type"
  2025-08-13 13:55 ` [PATCH v4 2/7] mm/huge_memory: convert "tva_flags" to "enum tva_type" Usama Arif
  2025-08-14  3:07   ` Yafang Shao
@ 2025-08-14 14:59   ` Zi Yan
  1 sibling, 0 replies; 34+ messages in thread
From: Zi Yan @ 2025-08-14 14:59 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, linux-fsdevel, corbet, rppt,
	surenb, mhocko, hannes, baohua, shakeel.butt, riel, laoar.shao,
	dev.jain, baolin.wang, npache, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
	linux-doc, kernel-team

On 13 Aug 2025, at 9:55, Usama Arif wrote:

> From: David Hildenbrand <david@redhat.com>
>
> When determining which THP orders are eligible for a VMA mapping,
> we have previously specified tva_flags, however it turns out it is
> really not necessary to treat these as flags.
>
> Rather, we distinguish between distinct modes.
>
> The only case where we previously combined flags was with
> TVA_ENFORCE_SYSFS, but we can avoid this by observing that this
> is the default, except for MADV_COLLAPSE or edge cases in
> collapse_pte_mapped_thp() and hugepage_vma_revalidate(), and
> adding a mode specifically for this case - TVA_FORCED_COLLAPSE.
>
> We have:
> * smaps handling for showing "THPeligible"
> * Pagefault handling
> * khugepaged handling
> * Forced collapse handling: primarily MADV_COLLAPSE, but also for
>   an edge case in collapse_pte_mapped_thp()
>
> Disregarding the edge cases, we want to ignore sysfs settings only
> when we are forcing a collapse through MADV_COLLAPSE; otherwise we
> want to enforce them, hence this patch does the following flag to enum
> conversions:
>
> * TVA_SMAPS | TVA_ENFORCE_SYSFS -> TVA_SMAPS
> * TVA_IN_PF | TVA_ENFORCE_SYSFS -> TVA_PAGEFAULT
> * TVA_ENFORCE_SYSFS             -> TVA_KHUGEPAGED
> * 0                             -> TVA_FORCED_COLLAPSE
>
> With this change, we immediately know if we are in the forced collapse
> case, which will be valuable next.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> Acked-by: Usama Arif <usamaarif642@gmail.com>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  fs/proc/task_mmu.c      |  4 ++--
>  include/linux/huge_mm.h | 30 ++++++++++++++++++------------
>  mm/huge_memory.c        |  8 ++++----
>  mm/khugepaged.c         | 17 ++++++++---------
>  mm/memory.c             | 14 ++++++--------
>  5 files changed, 38 insertions(+), 35 deletions(-)
>

Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi
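
For readers following the conversion listed in the commit message, a
minimal standalone sketch of the resulting modes. The enumerator order and
the helper name are assumptions; the thread only names the four modes and
shows that sysfs enforcement applies to everything except the forced
collapse case.

```c
#include <stdbool.h>

/* Enumerator order is an assumption; the patch only names the four modes. */
enum tva_type {
	TVA_SMAPS,		/* was TVA_SMAPS | TVA_ENFORCE_SYSFS */
	TVA_PAGEFAULT,		/* was TVA_IN_PF | TVA_ENFORCE_SYSFS */
	TVA_KHUGEPAGED,		/* was TVA_ENFORCE_SYSFS */
	TVA_FORCED_COLLAPSE,	/* was 0 */
};

/* Hypothetical helper: sysfs settings are enforced unless a collapse is forced. */
static bool tva_enforce_sysfs(enum tva_type type)
{
	return type != TVA_FORCED_COLLAPSE;
}
```

With distinct modes, the forced collapse case becomes a single comparison
rather than the absence of a flag.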

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 7/7] selftests: prctl: introduce tests for disabling THPs except for madvise
  2025-08-14 13:08                     ` Mark Brown
@ 2025-08-14 15:02                       ` Lorenzo Stoakes
  2025-08-14 15:41                         ` Usama Arif
  0 siblings, 1 reply; 34+ messages in thread
From: Lorenzo Stoakes @ 2025-08-14 15:02 UTC (permalink / raw)
  To: Mark Brown
  Cc: David Hildenbrand, Usama Arif, Andrew Morton, linux-mm,
	linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
	shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
	npache, Liam.Howlett, ryan.roberts, vbabka, jannh, Arnd Bergmann,
	sj, linux-kernel, linux-doc, kernel-team

On Thu, Aug 14, 2025 at 02:08:57PM +0100, Mark Brown wrote:
> On Thu, Aug 14, 2025 at 02:59:13PM +0200, David Hildenbrand wrote:
> > On 14.08.25 14:09, Mark Brown wrote:
>
> > > Perhaps this is something that needs considering in the ABI, so
> > > userspace can reasonably figure out if it failed to configure whatever
> > > is being configured due to a missing feature (in which case it should
> > > fall back to not using that feature somehow) or due to it messing
> > > something else up?  We might be happy with the tests being version
> > > specific but general userspace should be able to be a bit more robust.
>
> > Yeah, the whole prctl() ship has sailed, unfortunately :(
>
> Perhaps a second call or sysfs file or something that returns the
> supported mask?  You'd still have a bootstrapping issue with existing
> versions but at least any newer stuff would be helped.

Ack yeah I do wish we had better APIs for expressing what was
available/not. Will put this sort of thing on the TODO...

Overall I don't want to hold this up unnecessarily, and I bow to the
consensus if others feel we ought not to _assume_ same kernel at least best
effort.

Usama - It's ok to leave it as is in this case since obviously only tip
kernel will have this feature.

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 3/7] mm/huge_memory: respect MADV_COLLAPSE with PR_THP_DISABLE_EXCEPT_ADVISED
  2025-08-13 13:55 ` [PATCH v4 3/7] mm/huge_memory: respect MADV_COLLAPSE with PR_THP_DISABLE_EXCEPT_ADVISED Usama Arif
@ 2025-08-14 15:14   ` Zi Yan
  0 siblings, 0 replies; 34+ messages in thread
From: Zi Yan @ 2025-08-14 15:14 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, linux-fsdevel, corbet, rppt,
	surenb, mhocko, hannes, baohua, shakeel.butt, riel, laoar.shao,
	dev.jain, baolin.wang, npache, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
	linux-doc, kernel-team

On 13 Aug 2025, at 9:55, Usama Arif wrote:

> From: David Hildenbrand <david@redhat.com>
>
> Let's allow for making MADV_COLLAPSE succeed on areas that neither have
> VM_HUGEPAGE nor VM_NOHUGEPAGE when we have THP disabled
> unless explicitly advised (PR_THP_DISABLE_EXCEPT_ADVISED).
>
> MADV_COLLAPSE is a clear advice that we want to collapse.
>
> Note that we still respect the VM_NOHUGEPAGE flag, just like
> MADV_COLLAPSE always does. So consequently, MADV_COLLAPSE is now only
> refused on VM_NOHUGEPAGE with PR_THP_DISABLE_EXCEPT_ADVISED,
> including for shmem.
>
> Co-developed-by: Usama Arif <usamaarif642@gmail.com>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  include/linux/huge_mm.h    | 8 +++++++-
>  include/uapi/linux/prctl.h | 2 +-
>  mm/huge_memory.c           | 5 +++--
>  mm/memory.c                | 6 ++++--
>  mm/shmem.c                 | 2 +-
>  5 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 92ea0b9771fae..1ac0d06fb3c1d 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -329,7 +329,7 @@ struct thpsize {
>   * through madvise or prctl.
>   */
>  static inline bool vma_thp_disabled(struct vm_area_struct *vma,
> -		vm_flags_t vm_flags)
> +		vm_flags_t vm_flags, bool forced_collapse)
>  {
>  	/* Are THPs disabled for this VMA? */
>  	if (vm_flags & VM_NOHUGEPAGE)
> @@ -343,6 +343,12 @@ static inline bool vma_thp_disabled(struct vm_area_struct *vma,
>  	 */
>  	if (vm_flags & VM_HUGEPAGE)
>  		return false;
> +	/*
> +	 * Forcing a collapse (e.g., madv_collapse), is a clear advice to
> +	 * use THPs.
> +	 */
> +	if (forced_collapse)
> +		return false;
>  	return mm_flags_test(MMF_DISABLE_THP_EXCEPT_ADVISED, vma->vm_mm);
>  }
>
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 150b6deebfb1e..51c4e8c82b1e9 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -185,7 +185,7 @@ struct prctl_mm_map {
>  #define PR_SET_THP_DISABLE	41
>  /*
>   * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
> - * VM_HUGEPAGE).
> + * VM_HUGEPAGE, MADV_COLLAPSE).
>   */
>  # define PR_THP_DISABLE_EXCEPT_ADVISED	(1 << 1)
>  #define PR_GET_THP_DISABLE	42
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9c716be949cbf..1eca2d543449c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -104,7 +104,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>  {
>  	const bool smaps = type == TVA_SMAPS;
>  	const bool in_pf = type == TVA_PAGEFAULT;
> -	const bool enforce_sysfs = type != TVA_FORCED_COLLAPSE;
> +	const bool forced_collapse = type == TVA_FORCED_COLLAPSE;
> +	const bool enforce_sysfs = !forced_collapse;
>  	unsigned long supported_orders;
>
>  	/* Check the intersection of requested and supported orders. */
> @@ -122,7 +123,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>  	if (!vma->vm_mm)		/* vdso */
>  		return 0;
>
> -	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags))
> +	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags, forced_collapse))
>  		return 0;
>
>  	/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
> diff --git a/mm/memory.c b/mm/memory.c
> index 7b1e8f137fa3f..e4f533655305a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5332,9 +5332,11 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
>  	 * It is too late to allocate a small folio, we already have a large
>  	 * folio in the pagecache: especially s390 KVM cannot tolerate any
>  	 * PMD mappings, but PTE-mapped THP are fine. So let's simply refuse any
> -	 * PMD mappings if THPs are disabled.
> +	 * PMD mappings if THPs are disabled. As we already have a THP ...
> +	 * behave as if we are forcing a collapse.

What does the “...” mean here?

Shouldn’t it be:

As we already have a THP,
behave as if we are forcing a collapse.

>  	 */
> -	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags))
> +	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags,
> +						     /* forced_collapse=*/ true))
>  		return ret;
>
>  	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
> diff --git a/mm/shmem.c b/mm/shmem.c
> index e2c76a30802b6..d945de3a7f0e7 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1817,7 +1817,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
>  	vm_flags_t vm_flags = vma ? vma->vm_flags : 0;
>  	unsigned int global_orders;
>
> -	if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags)))
> +	if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags, shmem_huge_force)))
>  		return 0;
>
>  	global_orders = shmem_huge_global_enabled(inode, index, write_end,
> -- 
> 2.47.3

Otherwise, LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>


Best Regards,
Yan, Zi
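
As an aid to reading the vma_thp_disabled() hunk above, a userspace model
of the decision order. All names here are illustrative stand-ins rather
than kernel definitions, and the "disabled completely" step is an
assumption: it sits in the lines between the two hunks, which the diff
does not show, and is inferred from the documented behaviour in patch 4/7.

```c
#include <stdbool.h>

/* Illustrative stand-ins, not the kernel's flag values. */
#define SK_VM_HUGEPAGE		(1u << 0)
#define SK_VM_NOHUGEPAGE	(1u << 1)

/* Hypothetical model of the per-process prctl state. */
enum sk_prctl_mode {
	SK_THP_NOT_DISABLED,
	SK_THP_DISABLED_COMPLETELY,
	SK_THP_DISABLED_EXCEPT_ADVISED,
};

static bool sk_vma_thp_disabled(unsigned int vm_flags, enum sk_prctl_mode mode,
				bool forced_collapse)
{
	/* VM_NOHUGEPAGE always wins, even over a forced collapse. */
	if (vm_flags & SK_VM_NOHUGEPAGE)
		return true;
	/* Assumed step: disabling completely overrides any advice. */
	if (mode == SK_THP_DISABLED_COMPLETELY)
		return true;
	/* Explicit advice on the area keeps THPs enabled. */
	if (vm_flags & SK_VM_HUGEPAGE)
		return false;
	/* A forced collapse counts as advice as well. */
	if (forced_collapse)
		return false;
	return mode == SK_THP_DISABLED_EXCEPT_ADVISED;
}
```

The ordering matters: the forced collapse check comes after the
VM_NOHUGEPAGE check, which matches the commit message's note that
MADV_COLLAPSE is still refused on VM_NOHUGEPAGE areas.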

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 7/7] selftests: prctl: introduce tests for disabling THPs except for madvise
  2025-08-14 15:02                       ` Lorenzo Stoakes
@ 2025-08-14 15:41                         ` Usama Arif
  0 siblings, 0 replies; 34+ messages in thread
From: Usama Arif @ 2025-08-14 15:41 UTC (permalink / raw)
  To: Lorenzo Stoakes, Mark Brown
  Cc: David Hildenbrand, Andrew Morton, linux-mm, linux-fsdevel, corbet,
	rppt, surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
	laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
	ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
	linux-doc, kernel-team



On 14/08/2025 16:02, Lorenzo Stoakes wrote:
> On Thu, Aug 14, 2025 at 02:08:57PM +0100, Mark Brown wrote:
>> On Thu, Aug 14, 2025 at 02:59:13PM +0200, David Hildenbrand wrote:
>>> On 14.08.25 14:09, Mark Brown wrote:
>>
>>>> Perhaps this is something that needs considering in the ABI, so
>>>> userspace can reasonably figure out if it failed to configure whatever
>>>> is being configured due to a missing feature (in which case it should
>>>> fall back to not using that feature somehow) or due to it messing
>>>> something else up?  We might be happy with the tests being version
>>>> specific but general userspace should be able to be a bit more robust.
>>
>>> Yeah, the whole prctl() ship has sailed, unfortunately :(
>>
>> Perhaps a second call or sysfs file or something that returns the
>> supported mask?  You'd still have a bootstrapping issue with existing
>> versions but at least any newer stuff would be helped.
> 
> Ack yeah I do wish we had better APIs for expressing what was
> available/not. Will put this sort of thing on the TODO...
> 
> Overall I don't want to hold this up unnecessarily, and I bow to the
> consensus if others feel we ought not to _assume_ same kernel at least best
> effort.
> 
> Usama - It's ok to leave it as is in this case since obviously only tip
> kernel will have this feature.

Ah ok, so I will keep skipping in the fixture if prctl doesn't work, as
in the current v4 version.

I only have the below diff and its equivalent for patch 7 as a difference over
this version. I will wait until tomorrow morning in case there are more comments
and hopefully send out a last revision!

Thanks!



diff --git a/tools/testing/selftests/mm/prctl_thp_disable.c b/tools/testing/selftests/mm/prctl_thp_disable.c
index 8845e9f414560..e9e519c85224c 100644
--- a/tools/testing/selftests/mm/prctl_thp_disable.c
+++ b/tools/testing/selftests/mm/prctl_thp_disable.c
@@ -18,6 +18,7 @@
 
 enum thp_collapse_type {
        THP_COLLAPSE_NONE,
+       THP_COLLAPSE_MADV_NOHUGEPAGE,
        THP_COLLAPSE_MADV_HUGEPAGE,     /* MADV_HUGEPAGE before access */
        THP_COLLAPSE_MADV_COLLAPSE,     /* MADV_COLLAPSE after access */
 };
@@ -49,6 +50,8 @@ static int test_mmap_thp(enum thp_collapse_type madvise_buf, size_t pmdsize)
 
        if (madvise_buf == THP_COLLAPSE_MADV_HUGEPAGE)
                madvise(mem, pmdsize, MADV_HUGEPAGE);
+       else if (madvise_buf == THP_COLLAPSE_MADV_NOHUGEPAGE)
+               madvise(mem, pmdsize, MADV_NOHUGEPAGE);
 
        /* Ensure memory is allocated */
        memset(mem, 1, pmdsize);
@@ -73,6 +76,8 @@ static void prctl_thp_disable_completely_test(struct __test_metadata *const _met
        /* tests after prctl overrides global policy */
        ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_NONE, pmdsize), 0);
 
+       ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_NOHUGEPAGE, pmdsize), 0);
+
        ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_HUGEPAGE, pmdsize), 0);
 
        ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_COLLAPSE, pmdsize), 0);
@@ -84,6 +89,8 @@ static void prctl_thp_disable_completely_test(struct __test_metadata *const _met
        ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_NONE, pmdsize),
                  thp_policy == THP_ALWAYS ? 1 : 0);
 
+       ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_NOHUGEPAGE, pmdsize), 0);
+
        ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_HUGEPAGE, pmdsize),
                  thp_policy == THP_NEVER ? 0 : 1);

> 
> Cheers, Lorenzo


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 4/7] docs: transhuge: document process level THP controls
  2025-08-13 13:55 ` [PATCH v4 4/7] docs: transhuge: document process level THP controls Usama Arif
  2025-08-13 14:30   ` Lorenzo Stoakes
@ 2025-08-14 15:47   ` Zi Yan
  1 sibling, 0 replies; 34+ messages in thread
From: Zi Yan @ 2025-08-14 15:47 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, linux-fsdevel, corbet, rppt,
	surenb, mhocko, hannes, baohua, shakeel.butt, riel, laoar.shao,
	dev.jain, baolin.wang, npache, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
	linux-doc, kernel-team

On 13 Aug 2025, at 9:55, Usama Arif wrote:

> This includes the PR_SET_THP_DISABLE/PR_GET_THP_DISABLE pair of
> prctl calls as well as the newly introduced PR_THP_DISABLE_EXCEPT_ADVISED
> flag for the PR_SET_THP_DISABLE prctl call.
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
>  Documentation/admin-guide/mm/transhuge.rst | 37 ++++++++++++++++++++++
>  1 file changed, 37 insertions(+)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 370fba1134606..fa8242766e430 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -225,6 +225,43 @@ to "always" or "madvise"), and it'll be automatically shutdown when
>  PMD-sized THP is disabled (when both the per-size anon control and the
>  top-level control are "never")
>
> +process THP controls
> +--------------------
> +
> +A process can control its own THP behaviour using the ``PR_SET_THP_DISABLE``
> +and ``PR_GET_THP_DISABLE`` pair of prctl(2) calls. The THP behaviour set using
> +``PR_SET_THP_DISABLE`` is inherited across fork(2) and execve(2). These calls
> +support the following arguments::
> +
> +	prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0):
> +		This will disable THPs completely for the process, irrespective
> +		of global THP controls or MADV_COLLAPSE.
> +
> +	prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, 0, 0):
> +		This will disable THPs for the process except when the usage of THPs is
> +		advised. Consequently, THPs will only be used when:
> +		- Global THP controls are set to "always" or "madvise" and

> +		  the area either has VM_HUGEPAGE set (e.g., due do MADV_HUGEPAGE) or
> +		  MADV_COLLAPSE is used.

It is better to change the above sentence to:

madvise(..., MADV_HUGEPAGE) or madvise(..., MADV_COLLAPSE) is used.

This document is for sysadmins, who do not need to know implementation
details like VM_HUGEPAGE, and I do not see any kernel internals mentioned
in the rest of the document.

> +		- Global THP controls are set to "never" and MADV_COLLAPSE is used. This
> +		  is the same behavior as if THPs would not be disabled on a process
> +		  level.

> +		Note that MADV_COLLAPSE is currently always rejected if VM_NOHUGEPAGE is
> +		set on an area.

The same goes for the above sentence.

Something like:

Note that MADV_COLLAPSE is always rejected if madvise(..., MADV_NOHUGEPAGE) is
used.



> +
> +	prctl(PR_SET_THP_DISABLE, 0, 0, 0, 0):
> +		This will re-enabled THPs for the process, as if they would never have

s/re-enabled/re-enable/

> +		been disabled. Whether THPs will actually be used depends on global THP
> +		controls.

and madvise() calls.

> +
> +	prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0):
> +		This returns a value whose bit indicate how THP-disable is configured:

s/bit/bits/

> +		Bits
> +		 1 0  Value  Description
> +		|0|0|   0    No THP-disable behaviour specified.
> +		|0|1|   1    THP is entirely disabled for this process.
> +		|1|1|   3    THP-except-advised mode is set for this process.
> +
>  Khugepaged controls
>  -------------------
>
> -- 
> 2.47.3

Otherwise, LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi
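
For reference, a minimal sketch of how userspace might decode the
PR_GET_THP_DISABLE return value according to the table in the proposed
documentation. The enum and function names are hypothetical; values not
listed in the table (including the -1 error return) are reported as
unknown rather than guessed at.

```c
#include <sys/prctl.h>

#ifndef PR_GET_THP_DISABLE
#define PR_GET_THP_DISABLE 42
#endif
/* Value from the uapi hunk in patch 3/7; older headers do not define it. */
#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
#endif

/* Hypothetical names for the three documented states. */
enum thp_disable_state {
	THP_STATE_NONE,			/* 0: no THP-disable behaviour specified */
	THP_STATE_COMPLETELY,		/* 1: THP entirely disabled */
	THP_STATE_EXCEPT_ADVISED,	/* 3: THP-except-advised mode */
	THP_STATE_UNKNOWN,		/* anything else, including errors */
};

/* Map a PR_GET_THP_DISABLE return value onto the documented table. */
static enum thp_disable_state thp_disable_decode(int val)
{
	switch (val) {
	case 0:
		return THP_STATE_NONE;
	case 1:
		return THP_STATE_COMPLETELY;
	case 1 | PR_THP_DISABLE_EXCEPT_ADVISED:
		return THP_STATE_EXCEPT_ADVISED;
	default:
		return THP_STATE_UNKNOWN;
	}
}

/* Query the calling process and decode the result. */
static enum thp_disable_state thp_disable_query(void)
{
	return thp_disable_decode(prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0));
}
```

On a kernel without this series, only the first two states can be
returned, so the decoder works unchanged there.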

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 5/7] selftest/mm: Extract sz2ord function into vm_util.h
  2025-08-13 13:55 ` [PATCH v4 5/7] selftest/mm: Extract sz2ord function into vm_util.h Usama Arif
  2025-08-13 14:31   ` Lorenzo Stoakes
@ 2025-08-14 15:52   ` Zi Yan
  1 sibling, 0 replies; 34+ messages in thread
From: Zi Yan @ 2025-08-14 15:52 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, linux-fsdevel, corbet, rppt,
	surenb, mhocko, hannes, baohua, shakeel.butt, riel, laoar.shao,
	dev.jain, baolin.wang, npache, lorenzo.stoakes, Liam.Howlett,
	ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
	linux-doc, kernel-team

On 13 Aug 2025, at 9:55, Usama Arif wrote:

> The function already has 2 uses and will have a 3rd one
> in prctl selftests. The pagesize argument is added into
> the function, as it's not a global variable anymore.
> No functional change intended with this patch.
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
>  tools/testing/selftests/mm/cow.c            | 12 ++++--------
>  tools/testing/selftests/mm/uffd-wp-mremap.c |  9 ++-------
>  tools/testing/selftests/mm/vm_util.h        |  5 +++++
>  3 files changed, 11 insertions(+), 15 deletions(-)
>

<snip>

> diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
> index 148b792cff0fc..e5cb72bf3a2ab 100644
> --- a/tools/testing/selftests/mm/vm_util.h
> +++ b/tools/testing/selftests/mm/vm_util.h
> @@ -135,6 +135,11 @@ static inline void log_test_result(int result)
>  	ksft_test_result_report(result, "%s\n", test_name);
>  }
>
> +static inline int sz2ord(size_t size, size_t pagesize)
> +{
> +	return __builtin_ctzll(size / pagesize);
> +}
> +

There is a psize() at the top of vm_util.h to get pagesize.
But I have no strong opinion on passing pagesize or not.

Anyway, Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi
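
As a quick worked example of what sz2ord() computes, here is a standalone
copy of the helper from the hunk above. The assert() is an addition for
this sketch only (it is not in the patch), since __builtin_ctzll() is
undefined for a zero argument; sizes are assumed to be power-of-two
multiples of the page size.

```c
#include <assert.h>
#include <stddef.h>

/* Order of a folio of 'size' bytes, given the base page size. */
static inline int sz2ord(size_t size, size_t pagesize)
{
	/* Added for this sketch: size / pagesize must be non-zero. */
	assert(pagesize != 0 && size >= pagesize);
	return __builtin_ctzll(size / pagesize);
}
```

For a 2 MiB PMD with 4 KiB pages, 2 MiB / 4 KiB = 512 = 2^9, so the order
is 9. A caller would typically pass the runtime page size, for instance
from sysconf(_SC_PAGESIZE) or the psize() helper mentioned above.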

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/7] mm/huge_memory: convert "tva_flags" to "enum tva_type"
  2025-08-14 10:43     ` Usama Arif
@ 2025-08-15  1:11       ` Andrew Morton
  2025-08-15  9:29         ` Usama Arif
  0 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2025-08-15  1:11 UTC (permalink / raw)
  To: Usama Arif
  Cc: Yafang Shao, david, linux-mm, linux-fsdevel, corbet, rppt, surenb,
	mhocko, hannes, baohua, shakeel.butt, riel, ziy, dev.jain,
	baolin.wang, npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	vbabka, jannh, Arnd Bergmann, sj, linux-kernel, linux-doc,
	kernel-team

On Thu, 14 Aug 2025 11:43:16 +0100 Usama Arif <usamaarif642@gmail.com> wrote:

> 
> 
> > Hello Usama,
> > 
> > This change is also required by my BPF-based THP order selection
> > series [0]. Since this patch appears to be independent of the series,
> > could we merge it first into mm-new or mm-everything if the series
> > itself won't be merged shortly?
> > 
> > Link: https://lwn.net/Articles/1031829/ [0]
> > 
> 
> Thanks for reviewing!
> 
> All of the patches in the series have several acks/reviews. Only a small change
> might be required in selftest, so hopefully the next revision is the last one.
> 
> Andrew - would it be ok to start including this entire series in the mm-new now?
> 

https://lkml.kernel.org/r/0879b2c9-3088-4f92-8d73-666493ec783a@gmail.com
led me to expect a v5 series?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/7] mm/huge_memory: convert "tva_flags" to "enum tva_type"
  2025-08-15  1:11       ` Andrew Morton
@ 2025-08-15  9:29         ` Usama Arif
  0 siblings, 0 replies; 34+ messages in thread
From: Usama Arif @ 2025-08-15  9:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Yafang Shao, david, linux-mm, linux-fsdevel, corbet, rppt, surenb,
	mhocko, hannes, baohua, shakeel.butt, riel, ziy, dev.jain,
	baolin.wang, npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts,
	vbabka, jannh, Arnd Bergmann, sj, linux-kernel, linux-doc,
	kernel-team



On 15/08/2025 02:11, Andrew Morton wrote:
> On Thu, 14 Aug 2025 11:43:16 +0100 Usama Arif <usamaarif642@gmail.com> wrote:
> 
>>
>>
>>> Hello Usama,
>>>
>>> This change is also required by my BPF-based THP order selection
>>> series [0]. Since this patch appears to be independent of the series,
>>> could we merge it first into mm-new or mm-everything if the series
>>> itself won't be merged shortly?
>>>
>>> Link: https://lwn.net/Articles/1031829/ [0]
>>>
>>
>> Thanks for reviewing!
>>
>> All of the patches in the series have several acks/reviews. Only a small change
>> might be required in selftest, so hopefully the next revision is the last one.
>>
>> Andrew - would it be ok to start including this entire series in the mm-new now?
>>
> 
> https://lkml.kernel.org/r/0879b2c9-3088-4f92-8d73-666493ec783a@gmail.com
> led me to expect a v5 series?

Yes, small changes are needed. I was thinking of doing a fixlet, but will
send a v5 for it. Thanks!

^ permalink raw reply	[flat|nested] 34+ messages in thread
