* [PATCH v2 0/5] prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised
@ 2025-07-31 12:27 Usama Arif
2025-07-31 12:27 ` [PATCH v2 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE Usama Arif
` (4 more replies)
0 siblings, 5 replies; 24+ messages in thread
From: Usama Arif @ 2025-07-31 12:27 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts, vbabka,
jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team,
Usama Arif
(Resending this as forgot to include PATCH v2 in subject prefix in
https://lore.kernel.org/all/20250731122150.2039342-1-usamaarif642@gmail.com/)
This will allow individual processes to opt-out of THP = "always"
into THP = "madvise", without affecting other workloads on the system.
This has been extensively discussed on the mailing list and has been
summarized very well by David in the first patch which also includes
the links to alternatives, please refer to the first patch commit message
for the motivation for this series.
Patch 1 adds the PR_THP_DISABLE_EXCEPT_ADVISED flag to implement this, along
with the MMF changes.
Patch 2 is a cleanup patch for tva_flags that will allow the forced collapse
case to be transmitted to vma_thp_disabled (which is done in patch 3).
Patches 4-5 implement the selftests for PR_SET_THP_DISABLE for completely
disabling THPs (old behaviour) and only enabling it at advise
(PR_THP_DISABLE_EXCEPT_ADVISED).
The patches are tested on top of 4ad831303eca6ae518c3b3d86838a2a04b90ec41
from mm-new.
v1 -> v2: https://lore.kernel.org/all/20250725162258.1043176-1-usamaarif642@gmail.com/
- Change thp_push_settings to thp_write_settings (David)
- Add tests for all the system policies for the prctl call (David)
- Small fixes and cleanups
David Hildenbrand (3):
prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
mm/huge_memory: convert "tva_flags" to "enum tva_type" for
thp_vma_allowable_order*()
mm/huge_memory: treat MADV_COLLAPSE as an advise with
PR_THP_DISABLE_EXCEPT_ADVISED
Usama Arif (2):
selftests: prctl: introduce tests for disabling THPs completely
selftests: prctl: introduce tests for disabling THPs except for
madvise
Documentation/filesystems/proc.rst | 5 +-
fs/proc/array.c | 2 +-
fs/proc/task_mmu.c | 4 +-
include/linux/huge_mm.h | 60 ++-
include/linux/mm_types.h | 13 +-
include/uapi/linux/prctl.h | 10 +
kernel/sys.c | 59 ++-
mm/huge_memory.c | 11 +-
mm/khugepaged.c | 20 +-
mm/memory.c | 20 +-
mm/shmem.c | 2 +-
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 1 +
.../testing/selftests/mm/prctl_thp_disable.c | 358 ++++++++++++++++++
tools/testing/selftests/mm/thp_settings.c | 9 +-
tools/testing/selftests/mm/thp_settings.h | 1 +
16 files changed, 505 insertions(+), 71 deletions(-)
create mode 100644 tools/testing/selftests/mm/prctl_thp_disable.c
--
2.47.3
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH v2 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
2025-07-31 12:27 [PATCH v2 0/5] prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised Usama Arif
@ 2025-07-31 12:27 ` Usama Arif
2025-07-31 12:40 ` Lorenzo Stoakes
2025-07-31 15:13 ` Zi Yan
2025-07-31 12:27 ` [PATCH v2 2/5] mm/huge_memory: convert "tva_flags" to "enum tva_type" for thp_vma_allowable_order*() Usama Arif
` (3 subsequent siblings)
4 siblings, 2 replies; 24+ messages in thread
From: Usama Arif @ 2025-07-31 12:27 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts, vbabka,
jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team,
Usama Arif, Matthew Wilcox
From: David Hildenbrand <david@redhat.com>
People want to make use of more THPs, for example, moving from
the "never" system policy to "madvise", or from "madvise" to "always".
While this is great news for every THP desperately waiting to get
allocated out there, apparently there are some workloads that require a
bit of care during that transition: individual processes may need to
opt-out from this behavior for various reasons, and this should be
permitted without needing to make all other workloads on the system
similarly opt-out.
The following scenarios are imaginable:
(1) Switch from "none" system policy to "madvise"/"always", but keep THPs
disabled for selected workloads.
(2) Stay at "none" system policy, but enable THPs for selected
workloads, making only these workloads use the "madvise" or "always"
policy.
(3) Switch from "madvise" system policy to "always", but keep the
"madvise" policy for selected workloads: allocate THPs only when
advised.
(4) Stay at "madvise" system policy, but enable THPs even when not advised
for selected workloads -- "always" policy.
Once can emulate (2) through (1), by setting the system policy to
"madvise"/"always" while disabling THPs for all processes that don't want
THPs. It requires configuring all workloads, but that is a user-space
problem to sort out.
(4) can be emulated through (3) in a similar way.
Back when (1) was relevant in the past, as people started enabling THPs,
we added PR_SET_THP_DISABLE, so relevant workloads that were not ready
yet (i.e., used by Redis) were able to just disable THPs completely. Redis
still implements the option to use this interface to disable THPs
completely.
With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
workload -- a process, including fork+exec'ed process hierarchy.
That essentially made us support (1): simply disable THPs for all workloads
that are not ready for THPs yet, while still enabling THPs system-wide.
The quest for handling (3) and (4) started, but current approaches
(completely new prctl, options to set other policies per process,
alternatives to prctl -- mctrl, cgroup handling) don't look particularly
promising. Likely, the future will use bpf or something similar to
implement better policies, in particular to also make better decisions
about THP sizes to use, but this will certainly take a while as that work
just started.
Long story short: a simple enable/disable is not really suitable for the
future, so we're not willing to add completely new toggles.
While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
completely for these processes, this is a step backwards, because these
processes can no longer allocate THPs in regions where THPs were
explicitly advised: regions flagged as VM_HUGEPAGE. Apparently, that
imposes a problem for relevant workloads, because "not THPs" is certainly
worse than "THPs only when advised".
Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not
explicitly advised by the app through MAD_HUGEPAGE"? *maybe*, but this
would change the documented semantics quite a bit, and the versatility
to use it for debugging purposes, so I am not 100% sure that is what we
want -- although it would certainly be much easier.
So instead, as an easy way forward for (3) and (4), add an option to
make PR_SET_THP_DISABLE disable *less* THPs for a process.
In essence, this patch:
(A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).
prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED).
(B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.
Previously, it would return 1 if THPs were disabled completely. Now
it returns the set flags as well: 3 if PR_THP_DISABLE_EXCEPT_ADVISED
was set.
(C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
the semantics clearly.
Fortunately, there are only two instances outside of prctl() code.
(D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
with VM_HUGEPAGE" -- essentially "thp=madvise" behavior
Fortunately, we only have to extend vma_thp_disabled().
(E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are
disabled completely
Only indicating that THPs are disabled when they are really disabled
completely, not only partially.
For now, we don't add another interface to obtained whether THPs
are disabled partially (PR_THP_DISABLE_EXCEPT_ADVISED was set). If
ever required, we could add a new entry.
The documented semantics in the man page for PR_SET_THP_DISABLE
"is inherited by a child created via fork(2) and is preserved across
execve(2)" is maintained. This behavior, for example, allows for
disabling THPs for a workload through the launching process (e.g.,
systemd where we fork() a helper process to then exec()).
For now, MADV_COLLAPSE will *fail* in regions without VM_HUGEPAGE and
VM_NOHUGEPAGE. As MADV_COLLAPSE is a clear advise that user space
thinks a THP is a good idea, we'll enable that separately next
(requiring a bit of cleanup first).
There is currently not way to prevent that a process will not issue
PR_SET_THP_DISABLE itself to re-enable THP. There are not really known
users for re-enabling it, and it's against the purpose of the original
interface. So if ever required, we could investigate just forbidding to
re-enable them, or make this somehow configurable.
Acked-by: Usama Arif <usamaarif642@gmail.com>
Tested-by: Usama Arif <usamaarif642@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
think there might be real use cases where we want to disable any THPs --
in particular also around debugging THP-related problems, and
"never" not meaning ... "never" anymore ever since we add MADV_COLLAPSE.
PR_SET_THP_DISABLE will also block MADV_COLLAPSE, which can be very
helpful for debugging purposes. Of course, I thought of having a
system-wide config option to modify PR_SET_THP_DISABLE behavior, but
I just don't like the semantics.
"prctl: allow overriding system THP policy to always"[1] proposed
"overriding policies to always", which is just the wrong way around: we
should not add mechanisms to "enable more" when we already have an
interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
weird otherwise.
"[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed
setting the default of the VM_HUGEPAGE, which is similarly the wrong way
around I think now.
The ideas explored by Lorenzo to extend process_madvise()[3] and mctrl()[4]
similarly were around the "default for VM_HUGEPAGE" idea, but after the
discussion, I think we should better leave VM_HUGEPAGE untouched.
Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
we essentially want to say "leave advised regions alone" -- "keep THP
enabled for advised regions",
The only thing I really dislike about this is using another MMF_* flag,
but well, no way around it -- and seems like we could easily support
more than 32 if we want to (most users already treat it like a proper
bitmap).
I think this here (modifying an existing toggle) is the only prctl()
extension that we might be willing to accept. In general, I agree like
most others, that prctl() is a very bad interface for that -- but
PR_SET_THP_DISABLE is already there and is getting used.
Long-term, I think the answer will be something based on bpf[5]. Maybe
in that context, I there could still be value in easily disabling THPs for
selected workloads (esp. debugging purposes).
Jann raised valid concerns[6] about new flags that are persistent across
exec[6]. As this here is a relaxation to existing PR_SET_THP_DISABLE I
consider it having a similar security risk as our existing
PR_SET_THP_DISABLE, but devil is in the detail.
[1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
[2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
[3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
[4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
[5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
[6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com
Signed-off-by: David Hildenbrand <david@redhat.com>
---
Documentation/filesystems/proc.rst | 5 ++-
fs/proc/array.c | 2 +-
include/linux/huge_mm.h | 20 +++++++---
include/linux/mm_types.h | 13 +++----
include/uapi/linux/prctl.h | 10 +++++
kernel/sys.c | 59 ++++++++++++++++++++++++------
mm/khugepaged.c | 2 +-
7 files changed, 82 insertions(+), 29 deletions(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2971551b7235..915a3e44bc12 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -291,8 +291,9 @@ It's slow but very precise.
HugetlbPages size of hugetlb memory portions
CoreDumping process's memory is currently being dumped
(killing the process may lead to a corrupted core)
- THP_enabled process is allowed to use THP (returns 0 when
- PR_SET_THP_DISABLE is set on the process
+ THP_enabled process is allowed to use THP (returns 0 when
+ PR_SET_THP_DISABLE is set on the process to disable
+ THP completely, not just partially)
Threads number of threads
SigQ number of signals queued/max. number for queue
SigPnd bitmap of pending signals for the thread
diff --git a/fs/proc/array.c b/fs/proc/array.c
index d6a0369caa93..c4f91a784104 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -422,7 +422,7 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
if (thp_enabled)
- thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
+ thp_enabled = !test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
}
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7748489fde1b..71db243a002e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -318,16 +318,26 @@ struct thpsize {
(transparent_hugepage_flags & \
(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
+/*
+ * Check whether THPs are explicitly disabled for this VMA, for example,
+ * through madvise or prctl.
+ */
static inline bool vma_thp_disabled(struct vm_area_struct *vma,
vm_flags_t vm_flags)
{
+ /* Are THPs disabled for this VMA? */
+ if (vm_flags & VM_NOHUGEPAGE)
+ return true;
+ /* Are THPs disabled for all VMAs in the whole process? */
+ if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags))
+ return true;
/*
- * Explicitly disabled through madvise or prctl, or some
- * architectures may disable THP for some mappings, for
- * example, s390 kvm.
+ * Are THPs disabled only for VMAs where we didn't get an explicit
+ * advise to use them?
*/
- return (vm_flags & VM_NOHUGEPAGE) ||
- test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
+ if (vm_flags & VM_HUGEPAGE)
+ return false;
+ return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
}
static inline bool thp_disabled_by_hw(void)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1ec273b06691..123fefaa4b98 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1743,19 +1743,16 @@ enum {
#define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */
#define MMF_VM_HUGEPAGE 17 /* set when mm is available for khugepaged */
-/*
- * This one-shot flag is dropped due to necessity of changing exe once again
- * on NFS restore
- */
-//#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */
+#define MMF_HUGE_ZERO_PAGE 18 /* mm has ever used the global huge zero page */
#define MMF_HAS_UPROBES 19 /* has uprobes */
#define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */
#define MMF_OOM_SKIP 21 /* mm is of no interest for the OOM killer */
#define MMF_UNSTABLE 22 /* mm is unstable for copy_from_user */
-#define MMF_HUGE_ZERO_PAGE 23 /* mm has ever used the global huge zero page */
-#define MMF_DISABLE_THP 24 /* disable THP for all VMAs */
-#define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP)
+#define MMF_DISABLE_THP_EXCEPT_ADVISED 23 /* no THP except when advised (e.g., VM_HUGEPAGE) */
+#define MMF_DISABLE_THP_COMPLETELY 24 /* no THP for all VMAs */
+#define MMF_DISABLE_THP_MASK ((1 << MMF_DISABLE_THP_COMPLETELY) |\
+ (1 << MMF_DISABLE_THP_EXCEPT_ADVISED))
#define MMF_OOM_REAP_QUEUED 25 /* mm was queued for oom_reaper */
#define MMF_MULTIPROCESS 26 /* mm is shared between processes */
/*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 43dec6eed559..9c1d6e49b8a9 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -177,7 +177,17 @@ struct prctl_mm_map {
#define PR_GET_TID_ADDRESS 40
+/*
+ * Flags for PR_SET_THP_DISABLE are only applicable when disabling. Bit 0
+ * is reserved, so PR_GET_THP_DISABLE can return "1 | flags", to effectively
+ * return "1" when no flags were specified for PR_SET_THP_DISABLE.
+ */
#define PR_SET_THP_DISABLE 41
+/*
+ * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
+ * VM_HUGEPAGE).
+ */
+# define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
#define PR_GET_THP_DISABLE 42
/*
diff --git a/kernel/sys.c b/kernel/sys.c
index b153fb345ada..932a8e637e78 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2423,6 +2423,51 @@ static int prctl_get_auxv(void __user *addr, unsigned long len)
return sizeof(mm->saved_auxv);
}
+static int prctl_get_thp_disable(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5)
+{
+ unsigned long *mm_flags = ¤t->mm->flags;
+
+ if (arg2 || arg3 || arg4 || arg5)
+ return -EINVAL;
+
+ /* If disabled, we return "1 | flags", otherwise 0. */
+ if (test_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags))
+ return 1;
+ else if (test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags))
+ return 1 | PR_THP_DISABLE_EXCEPT_ADVISED;
+ return 0;
+}
+
+static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
+ unsigned long arg4, unsigned long arg5)
+{
+ unsigned long *mm_flags = ¤t->mm->flags;
+
+ if (arg4 || arg5)
+ return -EINVAL;
+
+ /* Flags are only allowed when disabling. */
+ if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
+ return -EINVAL;
+ if (mmap_write_lock_killable(current->mm))
+ return -EINTR;
+ if (thp_disable) {
+ if (flags & PR_THP_DISABLE_EXCEPT_ADVISED) {
+ clear_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
+ set_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
+ } else {
+ set_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
+ clear_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
+ }
+ } else {
+ clear_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
+ clear_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
+ }
+ mmap_write_unlock(current->mm);
+ return 0;
+}
+
SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)
{
@@ -2596,20 +2641,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
return task_no_new_privs(current) ? 1 : 0;
case PR_GET_THP_DISABLE:
- if (arg2 || arg3 || arg4 || arg5)
- return -EINVAL;
- error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
+ error = prctl_get_thp_disable(arg2, arg3, arg4, arg5);
break;
case PR_SET_THP_DISABLE:
- if (arg3 || arg4 || arg5)
- return -EINVAL;
- if (mmap_write_lock_killable(me->mm))
- return -EINTR;
- if (arg2)
- set_bit(MMF_DISABLE_THP, &me->mm->flags);
- else
- clear_bit(MMF_DISABLE_THP, &me->mm->flags);
- mmap_write_unlock(me->mm);
+ error = prctl_set_thp_disable(arg2, arg3, arg4, arg5);
break;
case PR_MPX_ENABLE_MANAGEMENT:
case PR_MPX_DISABLE_MANAGEMENT:
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1ff0c7dd2be4..2c9008246785 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -410,7 +410,7 @@ static inline int hpage_collapse_test_exit(struct mm_struct *mm)
static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
{
return hpage_collapse_test_exit(mm) ||
- test_bit(MMF_DISABLE_THP, &mm->flags);
+ test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
}
static bool hugepage_pmd_enabled(void)
--
2.47.3
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v2 2/5] mm/huge_memory: convert "tva_flags" to "enum tva_type" for thp_vma_allowable_order*()
2025-07-31 12:27 [PATCH v2 0/5] prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised Usama Arif
2025-07-31 12:27 ` [PATCH v2 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE Usama Arif
@ 2025-07-31 12:27 ` Usama Arif
2025-07-31 14:00 ` Lorenzo Stoakes
2025-07-31 12:27 ` [PATCH v2 3/5] mm/huge_memory: treat MADV_COLLAPSE as an advise with PR_THP_DISABLE_EXCEPT_ADVISED Usama Arif
` (2 subsequent siblings)
4 siblings, 1 reply; 24+ messages in thread
From: Usama Arif @ 2025-07-31 12:27 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts, vbabka,
jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team,
Usama Arif
From: David Hildenbrand <david@redhat.com>
Describing the context through a type is much clearer, and good enough
for our case.
We have:
* smaps handling for showing "THPeligible"
* Pagefault handling
* khugepaged handling
* Forced collapse handling: primarily MADV_COLLAPSE, but one other odd case
Really, we want to ignore sysfs only when we are forcing a collapse
through MADV_COLLAPSE, otherwise we want to enforce.
With this change, we immediately know if we are in the forced collapse
case, which will be valuable next.
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
fs/proc/task_mmu.c | 4 ++--
include/linux/huge_mm.h | 30 ++++++++++++++++++------------
mm/huge_memory.c | 8 ++++----
mm/khugepaged.c | 18 +++++++++---------
mm/memory.c | 14 ++++++--------
5 files changed, 39 insertions(+), 35 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3d6d8a9f13fc..d440df7b3d59 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1293,8 +1293,8 @@ static int show_smap(struct seq_file *m, void *v)
__show_smap(m, &mss, false);
seq_printf(m, "THPeligible: %8u\n",
- !!thp_vma_allowable_orders(vma, vma->vm_flags,
- TVA_SMAPS | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL));
+ !!thp_vma_allowable_orders(vma, vma->vm_flags, TVA_SMAPS,
+ THP_ORDERS_ALL));
if (arch_pkeys_enabled())
seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma));
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 71db243a002e..b0ff54eee81c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -94,12 +94,15 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
#define THP_ORDERS_ALL \
(THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_SPECIAL | THP_ORDERS_ALL_FILE_DEFAULT)
-#define TVA_SMAPS (1 << 0) /* Will be used for procfs */
-#define TVA_IN_PF (1 << 1) /* Page fault handler */
-#define TVA_ENFORCE_SYSFS (1 << 2) /* Obey sysfs configuration */
+enum tva_type {
+ TVA_SMAPS, /* Exposing "THPeligible:" in smaps. */
+ TVA_PAGEFAULT, /* Serving a page fault. */
+ TVA_KHUGEPAGED, /* Khugepaged collapse. */
+ TVA_FORCED_COLLAPSE, /* Forced collapse (i.e., MADV_COLLAPSE). */
+};
-#define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
- (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
+#define thp_vma_allowable_order(vma, vm_flags, type, order) \
+ (!!thp_vma_allowable_orders(vma, vm_flags, type, BIT(order)))
#define split_folio(f) split_folio_to_list(f, NULL)
@@ -264,14 +267,14 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
vm_flags_t vm_flags,
- unsigned long tva_flags,
+ enum tva_type type,
unsigned long orders);
/**
* thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
* @vma: the vm area to check
* @vm_flags: use these vm_flags instead of vma->vm_flags
- * @tva_flags: Which TVA flags to honour
+ * @type: TVA type
* @orders: bitfield of all orders to consider
*
* Calculates the intersection of the requested hugepage orders and the allowed
@@ -285,11 +288,14 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
static inline
unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
vm_flags_t vm_flags,
- unsigned long tva_flags,
+ enum tva_type type,
unsigned long orders)
{
- /* Optimization to check if required orders are enabled early. */
- if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
+ /*
+ * Optimization to check if required orders are enabled early. Only
+ * forced collapse ignores sysfs configs.
+ */
+ if (type != TVA_FORCED_COLLAPSE && vma_is_anonymous(vma)) {
unsigned long mask = READ_ONCE(huge_anon_orders_always);
if (vm_flags & VM_HUGEPAGE)
@@ -303,7 +309,7 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
return 0;
}
- return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
+ return __thp_vma_allowable_orders(vma, vm_flags, type, orders);
}
struct thpsize {
@@ -536,7 +542,7 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
vm_flags_t vm_flags,
- unsigned long tva_flags,
+ enum tva_type type,
unsigned long orders)
{
return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2b4ea5a2ce7d..85252b468f80 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -99,12 +99,12 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
vm_flags_t vm_flags,
- unsigned long tva_flags,
+ enum tva_type type,
unsigned long orders)
{
- bool smaps = tva_flags & TVA_SMAPS;
- bool in_pf = tva_flags & TVA_IN_PF;
- bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS;
+ const bool smaps = type == TVA_SMAPS;
+ const bool in_pf = type == TVA_PAGEFAULT;
+ const bool enforce_sysfs = type != TVA_FORCED_COLLAPSE;
unsigned long supported_orders;
/* Check the intersection of requested and supported orders. */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2c9008246785..7a54b6f2a346 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -474,8 +474,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
{
if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
hugepage_pmd_enabled()) {
- if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
- PMD_ORDER))
+ if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
__khugepaged_enter(vma->vm_mm);
}
}
@@ -921,7 +920,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
struct collapse_control *cc)
{
struct vm_area_struct *vma;
- unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
+ enum tva_type tva_type = cc->is_khugepaged ? TVA_KHUGEPAGED :
+ TVA_FORCED_COLLAPSE;
if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
return SCAN_ANY_PROCESS;
@@ -932,7 +932,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
return SCAN_ADDRESS_RANGE;
- if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
+ if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_type, PMD_ORDER))
return SCAN_VMA_CHECK;
/*
* Anon VMA expected, the address may be unmapped then
@@ -1532,9 +1532,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
* in the page cache with a single hugepage. If a mm were to fault-in
* this memory (mapped by a suitably aligned VMA), we'd get the hugepage
* and map it by a PMD, regardless of sysfs THP settings. As such, let's
- * analogously elide sysfs THP settings here.
+ * analogously elide sysfs THP settings here and pretend we are
+ * collapsing.
*/
- if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
+ if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
return SCAN_VMA_CHECK;
/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
@@ -2431,8 +2432,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
progress++;
break;
}
- if (!thp_vma_allowable_order(vma, vma->vm_flags,
- TVA_ENFORCE_SYSFS, PMD_ORDER)) {
+ if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
skip:
progress++;
continue;
@@ -2766,7 +2766,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
BUG_ON(vma->vm_start > start);
BUG_ON(vma->vm_end < end);
- if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
+ if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
return -EINVAL;
cc = kmalloc(sizeof(*cc), GFP_KERNEL);
diff --git a/mm/memory.c b/mm/memory.c
index 92fd18a5d8d1..be761753f240 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4369,8 +4369,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
* Get a list of all the (large) orders below PMD_ORDER that are enabled
* and suitable for swapping THP.
*/
- orders = thp_vma_allowable_orders(vma, vma->vm_flags,
- TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+ orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
+ BIT(PMD_ORDER) - 1);
orders = thp_vma_suitable_orders(vma, vmf->address, orders);
orders = thp_swap_suitable_orders(swp_offset(entry),
vmf->address, orders);
@@ -4917,8 +4917,8 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
* for this vma. Then filter out the orders that can't be allocated over
* the faulting address and still be fully contained in the vma.
*/
- orders = thp_vma_allowable_orders(vma, vma->vm_flags,
- TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+ orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
+ BIT(PMD_ORDER) - 1);
orders = thp_vma_suitable_orders(vma, vmf->address, orders);
if (!orders)
@@ -6108,8 +6108,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
return VM_FAULT_OOM;
retry_pud:
if (pud_none(*vmf.pud) &&
- thp_vma_allowable_order(vma, vm_flags,
- TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) {
+ thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PUD_ORDER)) {
ret = create_huge_pud(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
@@ -6143,8 +6142,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
goto retry_pud;
if (pmd_none(*vmf.pmd) &&
- thp_vma_allowable_order(vma, vm_flags,
- TVA_IN_PF | TVA_ENFORCE_SYSFS, PMD_ORDER)) {
+ thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) {
ret = create_huge_pmd(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
--
2.47.3
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v2 3/5] mm/huge_memory: treat MADV_COLLAPSE as an advise with PR_THP_DISABLE_EXCEPT_ADVISED
2025-07-31 12:27 [PATCH v2 0/5] prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised Usama Arif
2025-07-31 12:27 ` [PATCH v2 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE Usama Arif
2025-07-31 12:27 ` [PATCH v2 2/5] mm/huge_memory: convert "tva_flags" to "enum tva_type" for thp_vma_allowable_order*() Usama Arif
@ 2025-07-31 12:27 ` Usama Arif
2025-07-31 14:38 ` Lorenzo Stoakes
2025-07-31 12:27 ` [PATCH v2 4/5] selftests: prctl: introduce tests for disabling THPs completely Usama Arif
2025-07-31 12:27 ` [PATCH v2 5/5] selftests: prctl: introduce tests for disabling THPs except for madvise Usama Arif
4 siblings, 1 reply; 24+ messages in thread
From: Usama Arif @ 2025-07-31 12:27 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts, vbabka,
jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team,
Usama Arif
From: David Hildenbrand <david@redhat.com>
Let's allow for making MADV_COLLAPSE succeed on areas that neither have
VM_HUGEPAGE nor VM_NOHUGEPAGE when we have THP disabled
unless explicitly advised (PR_THP_DISABLE_EXCEPT_ADVISED).
MADV_COLLAPSE is a clear advise that we want to collapse.
Note that we still respect the VM_NOHUGEPAGE flag, just like
MADV_COLLAPSE always does. So consequently, MADV_COLLAPSE is now only
refused on VM_NOHUGEPAGE with PR_THP_DISABLE_EXCEPT_ADVISED.
Co-developed-by: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/huge_mm.h | 8 +++++++-
include/uapi/linux/prctl.h | 2 +-
mm/huge_memory.c | 5 +++--
mm/memory.c | 6 ++++--
mm/shmem.c | 2 +-
5 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b0ff54eee81c..aeaf93f8ac2e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -329,7 +329,7 @@ struct thpsize {
* through madvise or prctl.
*/
static inline bool vma_thp_disabled(struct vm_area_struct *vma,
- vm_flags_t vm_flags)
+ vm_flags_t vm_flags, bool forced_collapse)
{
/* Are THPs disabled for this VMA? */
if (vm_flags & VM_NOHUGEPAGE)
@@ -343,6 +343,12 @@ static inline bool vma_thp_disabled(struct vm_area_struct *vma,
*/
if (vm_flags & VM_HUGEPAGE)
return false;
+ /*
+ * Forcing a collapse (e.g., madv_collapse), is a clear advise to
+ * use THPs.
+ */
+ if (forced_collapse)
+ return false;
return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
}
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 9c1d6e49b8a9..ee4165738779 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -185,7 +185,7 @@ struct prctl_mm_map {
#define PR_SET_THP_DISABLE 41
/*
* Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
- * VM_HUGEPAGE).
+ * VM_HUGEPAGE / MADV_COLLAPSE).
*/
# define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
#define PR_GET_THP_DISABLE 42
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 85252b468f80..ef5ccb0ec5d5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -104,7 +104,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
{
const bool smaps = type == TVA_SMAPS;
const bool in_pf = type == TVA_PAGEFAULT;
- const bool enforce_sysfs = type != TVA_FORCED_COLLAPSE;
+ const bool forced_collapse = type == TVA_FORCED_COLLAPSE;
+ const bool enforce_sysfs = !forced_collapse;
unsigned long supported_orders;
/* Check the intersection of requested and supported orders. */
@@ -122,7 +123,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
if (!vma->vm_mm) /* vdso */
return 0;
- if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags))
+ if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags, forced_collapse))
return 0;
/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
diff --git a/mm/memory.c b/mm/memory.c
index be761753f240..bd04212d6f79 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5186,9 +5186,11 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
* It is too late to allocate a small folio, we already have a large
* folio in the pagecache: especially s390 KVM cannot tolerate any
* PMD mappings, but PTE-mapped THP are fine. So let's simply refuse any
- * PMD mappings if THPs are disabled.
+ * PMD mappings if THPs are disabled. As we already have a THP ...
+ * behave as if we are forcing a collapse.
*/
- if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags))
+ if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags,
+ /* forced_collapse=*/ true))
return ret;
if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
diff --git a/mm/shmem.c b/mm/shmem.c
index e6cdfda08aed..30609197a266 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1816,7 +1816,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
vm_flags_t vm_flags = vma ? vma->vm_flags : 0;
unsigned int global_orders;
- if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags)))
+ if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags, shmem_huge_force)))
return 0;
global_orders = shmem_huge_global_enabled(inode, index, write_end,
--
2.47.3
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v2 4/5] selftests: prctl: introduce tests for disabling THPs completely
2025-07-31 12:27 [PATCH v2 0/5] prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised Usama Arif
` (2 preceding siblings ...)
2025-07-31 12:27 ` [PATCH v2 3/5] mm/huge_memory: treat MADV_COLLAPSE as an advise with PR_THP_DISABLE_EXCEPT_ADVISED Usama Arif
@ 2025-07-31 12:27 ` Usama Arif
2025-07-31 19:42 ` David Hildenbrand
2025-07-31 12:27 ` [PATCH v2 5/5] selftests: prctl: introduce tests for disabling THPs except for madvise Usama Arif
4 siblings, 1 reply; 24+ messages in thread
From: Usama Arif @ 2025-07-31 12:27 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts, vbabka,
jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team,
Usama Arif
The test will set the global system THP setting to never, madvise
or always depending on the fixture variant and the 2M setting to
inherit before it starts (and reset to original at teardown).
This tests if the process can:
- successfully set and get the policy to disable THPs completely.
- never get a hugepage when the THPs are completely disabled
with the prctl, including with MADV_HUGE and MADV_COLLAPSE.
- successfully reset the policy of the process.
- after reset, only get hugepages with:
- MADV_COLLAPSE when policy is set to never.
- MADV_HUGE and MADV_COLLAPSE when policy is set to madvise.
- always when policy is set to "always".
- repeat the above tests in a forked process to make sure
the policy is carried across forks.
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 1 +
.../testing/selftests/mm/prctl_thp_disable.c | 241 ++++++++++++++++++
tools/testing/selftests/mm/thp_settings.c | 9 +-
tools/testing/selftests/mm/thp_settings.h | 1 +
5 files changed, 252 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/mm/prctl_thp_disable.c
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index e7b23a8a05fe..eb023ea857b3 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -58,3 +58,4 @@ pkey_sighandler_tests_32
pkey_sighandler_tests_64
guard-regions
merge
+prctl_thp_disable
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index d13b3cef2a2b..2bb8d3ebc17c 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -86,6 +86,7 @@ TEST_GEN_FILES += on-fault-limit
TEST_GEN_FILES += pagemap_ioctl
TEST_GEN_FILES += pfnmap
TEST_GEN_FILES += process_madv
+TEST_GEN_FILES += prctl_thp_disable
TEST_GEN_FILES += thuge-gen
TEST_GEN_FILES += transhuge-stress
TEST_GEN_FILES += uffd-stress
diff --git a/tools/testing/selftests/mm/prctl_thp_disable.c b/tools/testing/selftests/mm/prctl_thp_disable.c
new file mode 100644
index 000000000000..2f54e5e52274
--- /dev/null
+++ b/tools/testing/selftests/mm/prctl_thp_disable.c
@@ -0,0 +1,241 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Basic tests for PR_GET/SET_THP_DISABLE prctl calls
+ *
+ * Author(s): Usama Arif <usamaarif642@gmail.com>
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+
+#include "../kselftest_harness.h"
+#include "thp_settings.h"
+#include "vm_util.h"
+
+static int sz2ord(size_t size, size_t pagesize)
+{
+ return __builtin_ctzll(size / pagesize);
+}
+
+enum thp_collapse_type {
+ THP_COLLAPSE_NONE,
+ THP_COLLAPSE_MADV_HUGEPAGE, /* MADV_HUGEPAGE before access */
+ THP_COLLAPSE_MADV_COLLAPSE, /* MADV_COLLAPSE after access */
+};
+
+enum thp_policy {
+ THP_POLICY_NEVER,
+ THP_POLICY_MADVISE,
+ THP_POLICY_ALWAYS,
+};
+
+struct test_results {
+ int prctl_get_thp_disable;
+ int prctl_applied_collapse_none;
+ int prctl_applied_collapse_madv_huge;
+ int prctl_applied_collapse_madv_collapse;
+ int prctl_removed_collapse_none;
+ int prctl_removed_collapse_madv_huge;
+ int prctl_removed_collapse_madv_collapse;
+};
+
+/*
+ * Function to mmap a buffer, fault it in, madvise it appropriately (before
+ * page fault for MADV_HUGE, and after for MADV_COLLAPSE), and check if the
+ * mmap region is huge.
+ * Returns:
+ * 0 if test doesn't give hugepage
+ * 1 if test gives a hugepage
+ * -errno if mmap fails
+ */
+static int test_mmap_thp(enum thp_collapse_type madvise_buf, size_t pmdsize)
+{
+ char *mem, *mmap_mem;
+ size_t mmap_size;
+ int ret;
+
+ /* For alignment purposes, we need twice the THP size. */
+ mmap_size = 2 * pmdsize;
+ mmap_mem = (char *)mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (mmap_mem == MAP_FAILED)
+ return -errno;
+
+ /* We need a THP-aligned memory area. */
+ mem = (char *)(((uintptr_t)mmap_mem + pmdsize) & ~(pmdsize - 1));
+
+ if (madvise_buf == THP_COLLAPSE_MADV_HUGEPAGE)
+ madvise(mem, pmdsize, MADV_HUGEPAGE);
+
+ /* Ensure memory is allocated */
+ memset(mem, 1, pmdsize);
+
+ if (madvise_buf == THP_COLLAPSE_MADV_COLLAPSE)
+ madvise(mem, pmdsize, MADV_COLLAPSE);
+
+ /*
+ * MADV_HUGEPAGE will create a new VMA at "mem", which is the address
+ * pattern we want to check for to detect the presence of hugepage in
+ * smaps.
+ * MADV_COLLAPSE will not create a new VMA, therefore we need to check
+ * for hugepage at "mmap_mem" in smaps.
+ * Check for hugepage at both locations to ensure that
+ * THP_COLLAPSE_NONE, THP_COLLAPSE_MADV_HUGEPAGE and
+ * THP_COLLAPSE_MADV_COLLAPSE only gives a THP when expected
+ * in the range [mmap_mem, mmap_mem + 2 * pmdsize].
+ */
+ ret = check_huge_anon(mem, 1, pmdsize) ||
+ check_huge_anon(mmap_mem, 1, pmdsize);
+ munmap(mmap_mem, mmap_size);
+ return ret;
+}
+
+static void prctl_thp_disable_test(struct __test_metadata *const _metadata,
+ size_t pmdsize, struct test_results *results)
+{
+
+ ASSERT_EQ(prctl(PR_GET_THP_DISABLE, NULL, NULL, NULL, NULL),
+ results->prctl_get_thp_disable);
+
+ /* tests after prctl overrides global policy */
+ ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_NONE, pmdsize),
+ results->prctl_applied_collapse_none);
+
+ ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_HUGEPAGE, pmdsize),
+ results->prctl_applied_collapse_madv_huge);
+
+ ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_COLLAPSE, pmdsize),
+ results->prctl_applied_collapse_madv_collapse);
+
+ /* Reset to global policy */
+ ASSERT_EQ(prctl(PR_SET_THP_DISABLE, 0, NULL, NULL, NULL), 0);
+
+ /* tests after prctl is cleared, and only global policy is effective */
+ ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_NONE, pmdsize),
+ results->prctl_removed_collapse_none);
+
+ ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_HUGEPAGE, pmdsize),
+ results->prctl_removed_collapse_madv_huge);
+
+ ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_COLLAPSE, pmdsize),
+ results->prctl_removed_collapse_madv_collapse);
+}
+
+FIXTURE(prctl_thp_disable_completely)
+{
+ struct thp_settings settings;
+ struct test_results results;
+ size_t pmdsize;
+};
+
+FIXTURE_VARIANT(prctl_thp_disable_completely)
+{
+ enum thp_policy thp_global_policy;
+};
+
+FIXTURE_VARIANT_ADD(prctl_thp_disable_completely, never)
+{
+ .thp_global_policy = THP_POLICY_NEVER,
+};
+
+FIXTURE_VARIANT_ADD(prctl_thp_disable_completely, madvise)
+{
+ .thp_global_policy = THP_POLICY_MADVISE,
+};
+
+FIXTURE_VARIANT_ADD(prctl_thp_disable_completely, always)
+{
+ .thp_global_policy = THP_POLICY_ALWAYS,
+};
+
+FIXTURE_SETUP(prctl_thp_disable_completely)
+{
+ if (!thp_available())
+ SKIP(return, "Transparent Hugepages not available\n");
+
+ self->pmdsize = read_pmd_pagesize();
+ if (!self->pmdsize)
+ SKIP(return, "Unable to read PMD size\n");
+
+ thp_save_settings();
+ thp_read_settings(&self->settings);
+ switch (variant->thp_global_policy) {
+ case THP_POLICY_NEVER:
+ self->settings.thp_enabled = THP_NEVER;
+ self->results = (struct test_results) {
+ .prctl_get_thp_disable = 1,
+ .prctl_applied_collapse_none = 0,
+ .prctl_applied_collapse_madv_huge = 0,
+ .prctl_applied_collapse_madv_collapse = 0,
+ .prctl_removed_collapse_none = 0,
+ .prctl_removed_collapse_madv_huge = 0,
+ .prctl_removed_collapse_madv_collapse = 1,
+ };
+ break;
+ case THP_POLICY_MADVISE:
+ self->settings.thp_enabled = THP_MADVISE;
+ self->results = (struct test_results) {
+ .prctl_get_thp_disable = 1,
+ .prctl_applied_collapse_none = 0,
+ .prctl_applied_collapse_madv_huge = 0,
+ .prctl_applied_collapse_madv_collapse = 0,
+ .prctl_removed_collapse_none = 0,
+ .prctl_removed_collapse_madv_huge = 1,
+ .prctl_removed_collapse_madv_collapse = 1,
+ };
+ break;
+ case THP_POLICY_ALWAYS:
+ self->settings.thp_enabled = THP_ALWAYS;
+ self->results = (struct test_results) {
+ .prctl_get_thp_disable = 1,
+ .prctl_applied_collapse_none = 0,
+ .prctl_applied_collapse_madv_huge = 0,
+ .prctl_applied_collapse_madv_collapse = 0,
+ .prctl_removed_collapse_none = 1,
+ .prctl_removed_collapse_madv_huge = 1,
+ .prctl_removed_collapse_madv_collapse = 1,
+ };
+ break;
+ }
+ self->settings.hugepages[sz2ord(self->pmdsize, getpagesize())].enabled = THP_INHERIT;
+ thp_write_settings(&self->settings);
+}
+
+FIXTURE_TEARDOWN(prctl_thp_disable_completely)
+{
+ thp_restore_settings();
+}
+
+TEST_F(prctl_thp_disable_completely, nofork)
+{
+ ASSERT_EQ(prctl(PR_SET_THP_DISABLE, 1, NULL, NULL, NULL), 0);
+ prctl_thp_disable_test(_metadata, self->pmdsize, &self->results);
+}
+
+TEST_F(prctl_thp_disable_completely, fork)
+{
+ int ret = 0;
+ pid_t pid;
+
+ ASSERT_EQ(prctl(PR_SET_THP_DISABLE, 1, NULL, NULL, NULL), 0);
+
+ /* Make sure prctl changes are carried across fork */
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (!pid)
+ prctl_thp_disable_test(_metadata, self->pmdsize, &self->results);
+
+ wait(&ret);
+ if (WIFEXITED(ret))
+ ret = WEXITSTATUS(ret);
+ else
+ ret = -EINVAL;
+ ASSERT_EQ(ret, 0);
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/mm/thp_settings.c b/tools/testing/selftests/mm/thp_settings.c
index bad60ac52874..574bd0f8ae48 100644
--- a/tools/testing/selftests/mm/thp_settings.c
+++ b/tools/testing/selftests/mm/thp_settings.c
@@ -382,10 +382,17 @@ unsigned long thp_shmem_supported_orders(void)
return __thp_supported_orders(true);
}
-bool thp_is_enabled(void)
+bool thp_available(void)
{
if (access(THP_SYSFS, F_OK) != 0)
return false;
+ return true;
+}
+
+bool thp_is_enabled(void)
+{
+ if (!thp_available())
+ return false;
int mode = thp_read_string("enabled", thp_enabled_strings);
diff --git a/tools/testing/selftests/mm/thp_settings.h b/tools/testing/selftests/mm/thp_settings.h
index 6c07f70beee9..76eeb712e5f1 100644
--- a/tools/testing/selftests/mm/thp_settings.h
+++ b/tools/testing/selftests/mm/thp_settings.h
@@ -84,6 +84,7 @@ void thp_set_read_ahead_path(char *path);
unsigned long thp_supported_orders(void);
unsigned long thp_shmem_supported_orders(void);
+bool thp_available(void);
bool thp_is_enabled(void);
#endif /* __THP_SETTINGS_H__ */
--
2.47.3
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v2 5/5] selftests: prctl: introduce tests for disabling THPs except for madvise
2025-07-31 12:27 [PATCH v2 0/5] prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised Usama Arif
` (3 preceding siblings ...)
2025-07-31 12:27 ` [PATCH v2 4/5] selftests: prctl: introduce tests for disabling THPs completely Usama Arif
@ 2025-07-31 12:27 ` Usama Arif
4 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2025-07-31 12:27 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts, vbabka,
jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team,
Usama Arif
The test will set the global system THP setting to never, madvise
or always depending on the fixture variant and the 2M setting to
inherit before it starts (and reset to original at teardown)
This tests if the process can:
- successfully set and get the policy to disable THPs expect for madvise.
- get hugepages only on MADV_HUGE and MADV_COLLAPSE if the global policy
is madvise/always and only with MADV_COLLAPSE if the global policy is
never.
- successfully reset the policy of the process.
- after reset, only get hugepages with:
- MADV_COLLAPSE when policy is set to never.
- MADV_HUGE and MADV_COLLAPSE when policy is set to madvise.
- always when policy is set to "always".
- repeat the above tests in a forked process to make sure the policy is
carried across forks.
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
.../testing/selftests/mm/prctl_thp_disable.c | 117 ++++++++++++++++++
1 file changed, 117 insertions(+)
diff --git a/tools/testing/selftests/mm/prctl_thp_disable.c b/tools/testing/selftests/mm/prctl_thp_disable.c
index 2f54e5e52274..3c34ac7e11f1 100644
--- a/tools/testing/selftests/mm/prctl_thp_disable.c
+++ b/tools/testing/selftests/mm/prctl_thp_disable.c
@@ -16,6 +16,10 @@
#include "thp_settings.h"
#include "vm_util.h"
+#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
+#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
+#endif
+
static int sz2ord(size_t size, size_t pagesize)
{
return __builtin_ctzll(size / pagesize);
@@ -238,4 +242,117 @@ TEST_F(prctl_thp_disable_completely, fork)
ASSERT_EQ(ret, 0);
}
+FIXTURE(prctl_thp_disable_except_madvise)
+{
+ struct thp_settings settings;
+ struct test_results results;
+ size_t pmdsize;
+};
+
+FIXTURE_VARIANT(prctl_thp_disable_except_madvise)
+{
+ enum thp_policy thp_global_policy;
+};
+
+FIXTURE_VARIANT_ADD(prctl_thp_disable_except_madvise, never)
+{
+ .thp_global_policy = THP_POLICY_NEVER,
+};
+
+FIXTURE_VARIANT_ADD(prctl_thp_disable_except_madvise, madvise)
+{
+ .thp_global_policy = THP_POLICY_MADVISE,
+};
+
+FIXTURE_VARIANT_ADD(prctl_thp_disable_except_madvise, always)
+{
+ .thp_global_policy = THP_POLICY_ALWAYS,
+};
+
+FIXTURE_SETUP(prctl_thp_disable_except_madvise)
+{
+ if (!thp_available())
+ SKIP(return, "Transparent Hugepages not available\n");
+
+ self->pmdsize = read_pmd_pagesize();
+ if (!self->pmdsize)
+ SKIP(return, "Unable to read PMD size\n");
+
+ thp_save_settings();
+ thp_read_settings(&self->settings);
+ switch (variant->thp_global_policy) {
+ case THP_POLICY_NEVER:
+ self->settings.thp_enabled = THP_NEVER;
+ self->results = (struct test_results) {
+ .prctl_get_thp_disable = 3,
+ .prctl_applied_collapse_none = 0,
+ .prctl_applied_collapse_madv_huge = 0,
+ .prctl_applied_collapse_madv_collapse = 1,
+ .prctl_removed_collapse_none = 0,
+ .prctl_removed_collapse_madv_huge = 0,
+ .prctl_removed_collapse_madv_collapse = 1,
+ };
+ break;
+ case THP_POLICY_MADVISE:
+ self->settings.thp_enabled = THP_MADVISE;
+ self->results = (struct test_results) {
+ .prctl_get_thp_disable = 3,
+ .prctl_applied_collapse_none = 0,
+ .prctl_applied_collapse_madv_huge = 1,
+ .prctl_applied_collapse_madv_collapse = 1,
+ .prctl_removed_collapse_none = 0,
+ .prctl_removed_collapse_madv_huge = 1,
+ .prctl_removed_collapse_madv_collapse = 1,
+ };
+ break;
+ case THP_POLICY_ALWAYS:
+ self->settings.thp_enabled = THP_ALWAYS;
+ self->results = (struct test_results) {
+ .prctl_get_thp_disable = 3,
+ .prctl_applied_collapse_none = 0,
+ .prctl_applied_collapse_madv_huge = 1,
+ .prctl_applied_collapse_madv_collapse = 1,
+ .prctl_removed_collapse_none = 1,
+ .prctl_removed_collapse_madv_huge = 1,
+ .prctl_removed_collapse_madv_collapse = 1,
+ };
+ break;
+ }
+ self->settings.hugepages[sz2ord(self->pmdsize, getpagesize())].enabled = THP_INHERIT;
+ thp_write_settings(&self->settings);
+}
+
+FIXTURE_TEARDOWN(prctl_thp_disable_except_madvise)
+{
+ thp_restore_settings();
+}
+
+TEST_F(prctl_thp_disable_except_madvise, nofork)
+{
+ ASSERT_EQ(prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, NULL, NULL), 0);
+ prctl_thp_disable_test(_metadata, self->pmdsize, &self->results);
+}
+
+TEST_F(prctl_thp_disable_except_madvise, fork)
+{
+ int ret = 0;
+ pid_t pid;
+
+ ASSERT_EQ(prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, NULL, NULL), 0);
+
+ /* Make sure prctl changes are carried across fork */
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (!pid)
+ prctl_thp_disable_test(_metadata, self->pmdsize, &self->results);
+
+ wait(&ret);
+ if (WIFEXITED(ret))
+ ret = WEXITSTATUS(ret);
+ else
+ ret = -EINVAL;
+ ASSERT_EQ(ret, 0);
+}
+
TEST_HARNESS_MAIN
--
2.47.3
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
2025-07-31 12:27 ` [PATCH v2 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE Usama Arif
@ 2025-07-31 12:40 ` Lorenzo Stoakes
2025-07-31 13:12 ` Usama Arif
2025-07-31 15:13 ` Zi Yan
1 sibling, 1 reply; 24+ messages in thread
From: Lorenzo Stoakes @ 2025-07-31 12:40 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, linux-fsdevel, corbet, rppt,
surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
linux-doc, kernel-team, Matthew Wilcox
On Thu, Jul 31, 2025 at 01:27:18PM +0100, Usama Arif wrote:
[snip]
> Acked-by: Usama Arif <usamaarif642@gmail.com>
> Tested-by: Usama Arif <usamaarif642@gmail.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Cc: SeongJae Park <sj@kernel.org>
> Cc: Jann Horn <jannh@google.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Yafang Shao <laoar.shao@gmail.com>
> Cc: Matthew Wilcox <willy@infradead.org>
You don't need to include these Cc's, Andrew will add them for you.
> Signed-off-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Shouldn't this also be signed off by you? 2/5 and 3/5 has S-o-b for both
David and yourself?
This is inconsistent at the very least.
>
> ---
>
Nothing below the --- will be included in the patch, so we can drop the
below, it's just noise that people can find easily if needed.
> At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
> think there might be real use cases where we want to disable any THPs --
> in particular also around debugging THP-related problems, and
> "never" not meaning ... "never" anymore ever since we add MADV_COLLAPSE.
> PR_SET_THP_DISABLE will also block MADV_COLLAPSE, which can be very
> helpful for debugging purposes. Of course, I thought of having a
> system-wide config option to modify PR_SET_THP_DISABLE behavior, but
> I just don't like the semantics.
[snip]
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
This S-o-b is weird, it's in a comment essentially. Let's drop that too
please.
> ---
> Documentation/filesystems/proc.rst | 5 ++-
> fs/proc/array.c | 2 +-
> include/linux/huge_mm.h | 20 +++++++---
> include/linux/mm_types.h | 13 +++----
> include/uapi/linux/prctl.h | 10 +++++
> kernel/sys.c | 59 ++++++++++++++++++++++++------
> mm/khugepaged.c | 2 +-
> 7 files changed, 82 insertions(+), 29 deletions(-)
>
[snip]
> +static int prctl_get_thp_disable(unsigned long arg2, unsigned long arg3,
> + unsigned long arg4, unsigned long arg5)
> +{
> + unsigned long *mm_flags = ¤t->mm->flags;
> +
> + if (arg2 || arg3 || arg4 || arg5)
> + return -EINVAL;
> +
> + /* If disabled, we return "1 | flags", otherwise 0. */
Thanks! LGTM.
> + if (test_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags))
> + return 1;
> + else if (test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags))
> + return 1 | PR_THP_DISABLE_EXCEPT_ADVISED;
> + return 0;
> +}
> +
[snip]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
2025-07-31 12:40 ` Lorenzo Stoakes
@ 2025-07-31 13:12 ` Usama Arif
2025-07-31 13:18 ` Lorenzo Stoakes
2025-07-31 13:20 ` David Hildenbrand
0 siblings, 2 replies; 24+ messages in thread
From: Usama Arif @ 2025-07-31 13:12 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, david, linux-mm, linux-fsdevel, corbet, rppt,
surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
linux-doc, kernel-team, Matthew Wilcox
On 31/07/2025 13:40, Lorenzo Stoakes wrote:
> On Thu, Jul 31, 2025 at 01:27:18PM +0100, Usama Arif wrote:
> [snip]
>> Acked-by: Usama Arif <usamaarif642@gmail.com>
>> Tested-by: Usama Arif <usamaarif642@gmail.com>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Mike Rapoport <rppt@kernel.org>
>> Cc: Suren Baghdasaryan <surenb@google.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Usama Arif <usamaarif642@gmail.com>
>> Cc: SeongJae Park <sj@kernel.org>
>> Cc: Jann Horn <jannh@google.com>
>> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
>> Cc: Yafang Shao <laoar.shao@gmail.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>
> You don't need to include these Cc's, Andrew will add them for you.
>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Shouldn't this also be signed off by you? 2/5 and 3/5 has S-o-b for both
> David and yourself?
>
> This is inconsistent at the very least.
>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
The Ccs were added by David, and I didn't want to remove them.
>>
>> ---
>>
>
> Nothing below the --- will be included in the patch, so we can drop the
> below, it's just noise that people can find easily if needed.
>
>> At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
>> think there might be real use cases where we want to disable any THPs --
>> in particular also around debugging THP-related problems, and
>> "never" not meaning ... "never" anymore ever since we add MADV_COLLAPSE.
>> PR_SET_THP_DISABLE will also block MADV_COLLAPSE, which can be very
>> helpful for debugging purposes. Of course, I thought of having a
>> system-wide config option to modify PR_SET_THP_DISABLE behavior, but
>> I just don't like the semantics.
>
> [snip]
>
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> This S-o-b is weird, it's in a comment essentially. Let's drop that too
> please.
Everything below --- was added by David I believe to provide further explanation that
doesn't need to be included in the commit message, and I didn't want to remove it
or his 2nd sign-off, as its discarded anyways. Its useful info that can just be
ignored.
>
>> ---
>> Documentation/filesystems/proc.rst | 5 ++-
>> fs/proc/array.c | 2 +-
>> include/linux/huge_mm.h | 20 +++++++---
>> include/linux/mm_types.h | 13 +++----
>> include/uapi/linux/prctl.h | 10 +++++
>> kernel/sys.c | 59 ++++++++++++++++++++++++------
>> mm/khugepaged.c | 2 +-
>> 7 files changed, 82 insertions(+), 29 deletions(-)
>>
>
> [snip]
>
>> +static int prctl_get_thp_disable(unsigned long arg2, unsigned long arg3,
>> + unsigned long arg4, unsigned long arg5)
>> +{
>> + unsigned long *mm_flags = ¤t->mm->flags;
>> +
>> + if (arg2 || arg3 || arg4 || arg5)
>> + return -EINVAL;
>> +
>> + /* If disabled, we return "1 | flags", otherwise 0. */
>
> Thanks! LGTM.
>
>> + if (test_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags))
>> + return 1;
>> + else if (test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags))
>> + return 1 | PR_THP_DISABLE_EXCEPT_ADVISED;
>> + return 0;
>> +}
>> +
>
> [snip]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
2025-07-31 13:12 ` Usama Arif
@ 2025-07-31 13:18 ` Lorenzo Stoakes
2025-07-31 13:20 ` David Hildenbrand
1 sibling, 0 replies; 24+ messages in thread
From: Lorenzo Stoakes @ 2025-07-31 13:18 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, linux-fsdevel, corbet, rppt,
surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
linux-doc, kernel-team, Matthew Wilcox
On Thu, Jul 31, 2025 at 02:12:44PM +0100, Usama Arif wrote:
>
>
> On 31/07/2025 13:40, Lorenzo Stoakes wrote:
> > On Thu, Jul 31, 2025 at 01:27:18PM +0100, Usama Arif wrote:
> > [snip]
> >> Acked-by: Usama Arif <usamaarif642@gmail.com>
> >> Tested-by: Usama Arif <usamaarif642@gmail.com>
> >> Cc: Jonathan Corbet <corbet@lwn.net>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >> Cc: Zi Yan <ziy@nvidia.com>
> >> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> >> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> >> Cc: Nico Pache <npache@redhat.com>
> >> Cc: Ryan Roberts <ryan.roberts@arm.com>
> >> Cc: Dev Jain <dev.jain@arm.com>
> >> Cc: Barry Song <baohua@kernel.org>
> >> Cc: Vlastimil Babka <vbabka@suse.cz>
> >> Cc: Mike Rapoport <rppt@kernel.org>
> >> Cc: Suren Baghdasaryan <surenb@google.com>
> >> Cc: Michal Hocko <mhocko@suse.com>
> >> Cc: Usama Arif <usamaarif642@gmail.com>
> >> Cc: SeongJae Park <sj@kernel.org>
> >> Cc: Jann Horn <jannh@google.com>
> >> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> >> Cc: Yafang Shao <laoar.shao@gmail.com>
> >> Cc: Matthew Wilcox <willy@infradead.org>
> >
> > You don't need to include these Cc's, Andrew will add them for you.
> >
> >> Signed-off-by: David Hildenbrand <david@redhat.com>
> >> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > Shouldn't this also be signed off by you? 2/5 and 3/5 has S-o-b for both
> > David and yourself?
> >
> > This is inconsistent at the very least.
> >
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>
> The Ccs were added by David, and I didn't want to remove them.
Let's not add this noise on respins please, it makes review harder. No
other patch in the series, including ones by David, do it.
>
> >>
> >> ---
> >>
> >
> > Nothing below the --- will be included in the patch, so we can drop the
> > below, it's just noise that people can find easily if needed.
> >
> >> At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
> >> think there might be real use cases where we want to disable any THPs --
> >> in particular also around debugging THP-related problems, and
> >> "never" not meaning ... "never" anymore ever since we add MADV_COLLAPSE.
> >> PR_SET_THP_DISABLE will also block MADV_COLLAPSE, which can be very
> >> helpful for debugging purposes. Of course, I thought of having a
> >> system-wide config option to modify PR_SET_THP_DISABLE behavior, but
> >> I just don't like the semantics.
> >
> > [snip]
> >
> >>
> >> Signed-off-by: David Hildenbrand <david@redhat.com>
> >
> > This S-o-b is weird, it's in a comment essentially. Let's drop that too
> > please.
>
>
> Everything below --- was added by David I believe to provide further explanation that
> doesn't need to be included in the commit message, and I didn't want to remove it
> or his 2nd sign-off, as its discarded anyways. Its useful info that can just be
> ignored.
Yup I get that, you've posted this several times now, anybody who needs it can
find it. Please remove on future respins to save me having to page through
noise.
The S-o-b was actively confusing, and this whole block of noise hid the
fact you got the S-o-b wrong on this patch.
Thanks
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
2025-07-31 13:12 ` Usama Arif
2025-07-31 13:18 ` Lorenzo Stoakes
@ 2025-07-31 13:20 ` David Hildenbrand
1 sibling, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2025-07-31 13:20 UTC (permalink / raw)
To: Usama Arif, Lorenzo Stoakes
Cc: Andrew Morton, linux-mm, linux-fsdevel, corbet, rppt, surenb,
mhocko, hannes, baohua, shakeel.butt, riel, ziy, laoar.shao,
dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team,
Matthew Wilcox
On 31.07.25 15:12, Usama Arif wrote:
>
>
> On 31/07/2025 13:40, Lorenzo Stoakes wrote:
>> On Thu, Jul 31, 2025 at 01:27:18PM +0100, Usama Arif wrote:
>> [snip]
>>> Acked-by: Usama Arif <usamaarif642@gmail.com>
>>> Tested-by: Usama Arif <usamaarif642@gmail.com>
>>> Cc: Jonathan Corbet <corbet@lwn.net>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>>> Cc: Nico Pache <npache@redhat.com>
>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>> Cc: Dev Jain <dev.jain@arm.com>
>>> Cc: Barry Song <baohua@kernel.org>
>>> Cc: Vlastimil Babka <vbabka@suse.cz>
>>> Cc: Mike Rapoport <rppt@kernel.org>
>>> Cc: Suren Baghdasaryan <surenb@google.com>
>>> Cc: Michal Hocko <mhocko@suse.com>
>>> Cc: Usama Arif <usamaarif642@gmail.com>
>>> Cc: SeongJae Park <sj@kernel.org>
>>> Cc: Jann Horn <jannh@google.com>
>>> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
>>> Cc: Yafang Shao <laoar.shao@gmail.com>
>>> Cc: Matthew Wilcox <willy@infradead.org>
>>
>> You don't need to include these Cc's, Andrew will add them for you.
>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>
>> Shouldn't this also be signed off by you? 2/5 and 3/5 has S-o-b for both
>> David and yourself?
>>
>> This is inconsistent at the very least.
>>
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>
> The Ccs were added by David, and I didn't want to remove them.
They were still part of the first submission without cover letter, so
you should drop them from here now that you are sending it as part of a
series.
>
>>>
>>> ---
>>>
>>
>> Nothing below the --- will be included in the patch, so we can drop the
>> below, it's just noise that people can find easily if needed.
>>
>>> At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
>>> think there might be real use cases where we want to disable any THPs --
>>> in particular also around debugging THP-related problems, and
>>> "never" not meaning ... "never" anymore ever since we add MADV_COLLAPSE.
>>> PR_SET_THP_DISABLE will also block MADV_COLLAPSE, which can be very
>>> helpful for debugging purposes. Of course, I thought of having a
>>> system-wide config option to modify PR_SET_THP_DISABLE behavior, but
>>> I just don't like the semantics.
>>
>> [snip]
>>
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>> This S-o-b is weird, it's in a comment essentially. Let's drop that too
>> please.
That just got added automatically while modifying the patch.
>
>
> Everything below --- was added by David I believe to provide further explanation that
> doesn't need to be included in the commit message, and I didn't want to remove it
> or his 2nd sign-off, as its discarded anyways. Its useful info that can just be
> ignored.
Best to drop under the "---" I think it was most important for the PoC
to give more context.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 2/5] mm/huge_memory: convert "tva_flags" to "enum tva_type" for thp_vma_allowable_order*()
2025-07-31 12:27 ` [PATCH v2 2/5] mm/huge_memory: convert "tva_flags" to "enum tva_type" for thp_vma_allowable_order*() Usama Arif
@ 2025-07-31 14:00 ` Lorenzo Stoakes
2025-07-31 15:19 ` Zi Yan
` (2 more replies)
0 siblings, 3 replies; 24+ messages in thread
From: Lorenzo Stoakes @ 2025-07-31 14:00 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, linux-fsdevel, corbet, rppt,
surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
linux-doc, kernel-team
On Thu, Jul 31, 2025 at 01:27:19PM +0100, Usama Arif wrote:
> From: David Hildenbrand <david@redhat.com>
>
> Describing the context through a type is much clearer, and good enough
> for our case.
This is pretty bare bones. What context, what type? Under what
circumstances?
This also is missing detail on the key difference here - that actually it
turns out we _don't_ need these to be flags, rather we can have _distinct_
modes which are clearer.
I'd say something like:
when determining which THP orders are eligiible for a VMA mapping,
we have previously specified tva_flags, however it turns out it is
really not necessary to treat these as flags.
Rather, we distinguish between distinct modes.
The only case where we previously combined flags was with
TVA_ENFORCE_SYSFS, but we can avoid this by observing that this is
the default, except for MADV_COLLAPSE or an edge cases in
collapse_pte_mapped_thp() and hugepage_vma_revalidate(), and adding
a mode specifically for this case - TVA_FORCED_COLLAPSE.
... stuff about the different modes...
>
> We have:
> * smaps handling for showing "THPeligible"
> * Pagefault handling
> * khugepaged handling
> * Forced collapse handling: primarily MADV_COLLAPSE, but one other odd case
Can we actually state what this case is? I mean I guess a handwave in the
form of 'an edge case in collapse_pte_mapped_thp()' will do also.
Hmm actually we do weird stuff with this so maybe just handwave.
Like uprobes calls collapse_pte_mapped_thp()... :/ I'm not sure this 'If we
are here, we've succeeded in replacing all the native pages in the page
cache with a single hugepage.' comment is even correct.
Anyway yeah, hand wave I guess...
>
> Really, we want to ignore sysfs only when we are forcing a collapse
> through MADV_COLLAPSE, otherwise we want to enforce.
I'd say 'ignoring this edge case, ...'
I think the clearest thing might be to literally list the before/after
like:
* TVA_SMAPS | TVA_ENFORCE_SYSFS -> TVA_SMAPS
* TVA_IN_PF | TVA_ENFORCE_SYSFS -> TVA_PAGEFAULT
* TVA_ENFORCE_SYSFS -> TVA_KHUGEPAGED
* 0 -> TVA_FORCED_COLLAPSE
>
> With this change, we immediately know if we are in the forced collapse
> case, which will be valuable next.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> Acked-by: Usama Arif <usamaarif642@gmail.com>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Overall this is a great cleanup, some various nits however.
> ---
> fs/proc/task_mmu.c | 4 ++--
> include/linux/huge_mm.h | 30 ++++++++++++++++++------------
> mm/huge_memory.c | 8 ++++----
> mm/khugepaged.c | 18 +++++++++---------
> mm/memory.c | 14 ++++++--------
> 5 files changed, 39 insertions(+), 35 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 3d6d8a9f13fc..d440df7b3d59 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1293,8 +1293,8 @@ static int show_smap(struct seq_file *m, void *v)
> __show_smap(m, &mss, false);
>
> seq_printf(m, "THPeligible: %8u\n",
> - !!thp_vma_allowable_orders(vma, vma->vm_flags,
> - TVA_SMAPS | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL));
> + !!thp_vma_allowable_orders(vma, vma->vm_flags, TVA_SMAPS,
> + THP_ORDERS_ALL));
This !! is so gross, wonder if we could have a bool wrapper. But not a big
deal.
I also sort of _hate_ the smaps flag anyway, invoking this 'allowable
orders' thing just for smaps reporting with maybe some minor delta is just
odd.
Something like `bool vma_has_thp_allowed_orders(struct vm_area_struct
*vma);` would be nicer.
Anyway thoughts for another time... :)
>
> if (arch_pkeys_enabled())
> seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma));
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 71db243a002e..b0ff54eee81c 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -94,12 +94,15 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
> #define THP_ORDERS_ALL \
> (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_SPECIAL | THP_ORDERS_ALL_FILE_DEFAULT)
>
> -#define TVA_SMAPS (1 << 0) /* Will be used for procfs */
Dumb question, but what does 'TVA' stand for? :P
> -#define TVA_IN_PF (1 << 1) /* Page fault handler */
> -#define TVA_ENFORCE_SYSFS (1 << 2) /* Obey sysfs configuration */
> +enum tva_type {
> + TVA_SMAPS, /* Exposing "THPeligible:" in smaps. */
How I hate this flag (just an observation...)
> + TVA_PAGEFAULT, /* Serving a page fault. */
> + TVA_KHUGEPAGED, /* Khugepaged collapse. */
This is equivalent to the TVA_ENFORCE_SYSFS case before, sort of a default
I guess, but actually quite nice to add the context that it's sourced from
khugepaged - I assume this will always be the case when specified?
> + TVA_FORCED_COLLAPSE, /* Forced collapse (i.e., MADV_COLLAPSE). */
Would put 'e.g.' here, then that allows 'space' for the edge case...
> +};
>
> -#define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
> - (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
> +#define thp_vma_allowable_order(vma, vm_flags, type, order) \
> + (!!thp_vma_allowable_orders(vma, vm_flags, type, BIT(order)))
Nit, but maybe worth keeping tva_ prefix - tva_type - here just so it's
clear what type it refers to.
But not end of the world.
Same comment goes for param names below etc.
>
> #define split_folio(f) split_folio_to_list(f, NULL)
>
> @@ -264,14 +267,14 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
>
> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> vm_flags_t vm_flags,
> - unsigned long tva_flags,
> + enum tva_type type,
> unsigned long orders);
>
> /**
> * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
> * @vma: the vm area to check
> * @vm_flags: use these vm_flags instead of vma->vm_flags
> - * @tva_flags: Which TVA flags to honour
> + * @type: TVA type
> * @orders: bitfield of all orders to consider
> *
> * Calculates the intersection of the requested hugepage orders and the allowed
> @@ -285,11 +288,14 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> static inline
> unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> vm_flags_t vm_flags,
> - unsigned long tva_flags,
> + enum tva_type type,
> unsigned long orders)
> {
> - /* Optimization to check if required orders are enabled early. */
> - if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
> + /*
> + * Optimization to check if required orders are enabled early. Only
> + * forced collapse ignores sysfs configs.
> + */
> + if (type != TVA_FORCED_COLLAPSE && vma_is_anonymous(vma)) {
> unsigned long mask = READ_ONCE(huge_anon_orders_always);
>
> if (vm_flags & VM_HUGEPAGE)
> @@ -303,7 +309,7 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> return 0;
> }
>
> - return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
> + return __thp_vma_allowable_orders(vma, vm_flags, type, orders);
> }
>
> struct thpsize {
> @@ -536,7 +542,7 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
>
> static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> vm_flags_t vm_flags,
> - unsigned long tva_flags,
> + enum tva_type type,
> unsigned long orders)
> {
> return 0;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2b4ea5a2ce7d..85252b468f80 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -99,12 +99,12 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>
> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> vm_flags_t vm_flags,
> - unsigned long tva_flags,
> + enum tva_type type,
> unsigned long orders)
> {
> - bool smaps = tva_flags & TVA_SMAPS;
> - bool in_pf = tva_flags & TVA_IN_PF;
> - bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS;
> + const bool smaps = type == TVA_SMAPS;
> + const bool in_pf = type == TVA_PAGEFAULT;
> + const bool enforce_sysfs = type != TVA_FORCED_COLLAPSE;
Some cheeky const-ifying, I like it :)
> unsigned long supported_orders;
>
> /* Check the intersection of requested and supported orders. */
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2c9008246785..7a54b6f2a346 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -474,8 +474,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
> {
> if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
> hugepage_pmd_enabled()) {
> - if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
> - PMD_ORDER))
> + if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
> __khugepaged_enter(vma->vm_mm);
> }
> }
> @@ -921,7 +920,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> struct collapse_control *cc)
> {
> struct vm_area_struct *vma;
> - unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> + enum tva_type tva_type = cc->is_khugepaged ? TVA_KHUGEPAGED :
> + TVA_FORCED_COLLAPSE;
This is great, this is so much clearer.
A nit though, I mean I come back to my 'type' vs 'tva_type' nit above, this
is inconsistent, so we should choose one approach and stick with it.
>
> if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> return SCAN_ANY_PROCESS;
> @@ -932,7 +932,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>
> if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
> return SCAN_ADDRESS_RANGE;
> - if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
> + if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_type, PMD_ORDER))
> return SCAN_VMA_CHECK;
> /*
> * Anon VMA expected, the address may be unmapped then
> @@ -1532,9 +1532,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> * in the page cache with a single hugepage. If a mm were to fault-in
> * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
> * and map it by a PMD, regardless of sysfs THP settings. As such, let's
> - * analogously elide sysfs THP settings here.
> + * analogously elide sysfs THP settings here and pretend we are
> + * collapsing.
I think saying pretending here is potentially confusing, maybe worth saying
'force collapse'?
> */
> - if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
> + if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
> return SCAN_VMA_CHECK;
Again, fantastically clearer.
>
> /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
> @@ -2431,8 +2432,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> progress++;
> break;
> }
> - if (!thp_vma_allowable_order(vma, vma->vm_flags,
> - TVA_ENFORCE_SYSFS, PMD_ORDER)) {
> + if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> skip:
> progress++;
> continue;
> @@ -2766,7 +2766,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> BUG_ON(vma->vm_start > start);
> BUG_ON(vma->vm_end < end);
>
> - if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
> + if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
> return -EINVAL;
>
> cc = kmalloc(sizeof(*cc), GFP_KERNEL);
> diff --git a/mm/memory.c b/mm/memory.c
> index 92fd18a5d8d1..be761753f240 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4369,8 +4369,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> * Get a list of all the (large) orders below PMD_ORDER that are enabled
> * and suitable for swapping THP.
> */
> - orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> - TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
> + BIT(PMD_ORDER) - 1);
> orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> orders = thp_swap_suitable_orders(swp_offset(entry),
> vmf->address, orders);
> @@ -4917,8 +4917,8 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> * for this vma. Then filter out the orders that can't be allocated over
> * the faulting address and still be fully contained in the vma.
> */
> - orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> - TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
> + BIT(PMD_ORDER) - 1);
> orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>
> if (!orders)
> @@ -6108,8 +6108,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> return VM_FAULT_OOM;
> retry_pud:
> if (pud_none(*vmf.pud) &&
> - thp_vma_allowable_order(vma, vm_flags,
> - TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) {
> + thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PUD_ORDER)) {
> ret = create_huge_pud(&vmf);
> if (!(ret & VM_FAULT_FALLBACK))
> return ret;
> @@ -6143,8 +6142,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> goto retry_pud;
>
> if (pmd_none(*vmf.pmd) &&
> - thp_vma_allowable_order(vma, vm_flags,
> - TVA_IN_PF | TVA_ENFORCE_SYSFS, PMD_ORDER)) {
> + thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) {
> ret = create_huge_pmd(&vmf);
> if (!(ret & VM_FAULT_FALLBACK))
> return ret;
> --
> 2.47.3
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 3/5] mm/huge_memory: treat MADV_COLLAPSE as an advise with PR_THP_DISABLE_EXCEPT_ADVISED
2025-07-31 12:27 ` [PATCH v2 3/5] mm/huge_memory: treat MADV_COLLAPSE as an advise with PR_THP_DISABLE_EXCEPT_ADVISED Usama Arif
@ 2025-07-31 14:38 ` Lorenzo Stoakes
2025-07-31 14:54 ` David Hildenbrand
0 siblings, 1 reply; 24+ messages in thread
From: Lorenzo Stoakes @ 2025-07-31 14:38 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, linux-fsdevel, corbet, rppt,
surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
linux-doc, kernel-team
Nits on subject:
- It's >75 chars
- advise is the verb, advice is the noun.
On Thu, Jul 31, 2025 at 01:27:20PM +0100, Usama Arif wrote:
> From: David Hildenbrand <david@redhat.com>
>
> Let's allow for making MADV_COLLAPSE succeed on areas that neither have
> VM_HUGEPAGE nor VM_NOHUGEPAGE when we have THP disabled
> unless explicitly advised (PR_THP_DISABLE_EXCEPT_ADVISED).
Hmm, I'm not sure about this.
So far this prctl() has been the only way to override MADV_COLLAPSE
behaviour, but now we're allowing for this one case to not.
I suppose the precedent is that MADV_COLLAPSE overrides 'madvise' sysfs
behaviour.
I suppose what saves us here is 'advised' can be read to mean either
MADV_HUGEPAGE or MADV_COLLAPSE.
And yes, MADV_COLLAPSE is clearly the user requesting this behaviour.
I think the vagueness here is one that already existed, because one could
perfectly one have expected MADV_COLLAPSE to obey sysfs and require
MADV_HUGEPAGE to have been applied, but of course this is not the case.
OK so fine.
BUT.
I think the MADV_COLLAPSE man page will need to be updated to mention this.
And I REALLY think we should update the THP doc too to mention all these
prctl() modes.
I'm not sure we cover that right now _at all_ and obviously we should
describe the new flags.
Usama - can you add a patch to this series to do that?
>
> MADV_COLLAPSE is a clear advise that we want to collapse.
advise -> advice.
>
> Note that we still respect the VM_NOHUGEPAGE flag, just like
> MADV_COLLAPSE always does. So consequently, MADV_COLLAPSE is now only
> refused on VM_NOHUGEPAGE with PR_THP_DISABLE_EXCEPT_ADVISED.
You also need to mention the shmem change you've made I think.
>
> Co-developed-by: Usama Arif <usamaarif642@gmail.com>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> include/linux/huge_mm.h | 8 +++++++-
> include/uapi/linux/prctl.h | 2 +-
> mm/huge_memory.c | 5 +++--
> mm/memory.c | 6 ++++--
> mm/shmem.c | 2 +-
> 5 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index b0ff54eee81c..aeaf93f8ac2e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -329,7 +329,7 @@ struct thpsize {
> * through madvise or prctl.
> */
> static inline bool vma_thp_disabled(struct vm_area_struct *vma,
> - vm_flags_t vm_flags)
> + vm_flags_t vm_flags, bool forced_collapse)
> {
> /* Are THPs disabled for this VMA? */
> if (vm_flags & VM_NOHUGEPAGE)
> @@ -343,6 +343,12 @@ static inline bool vma_thp_disabled(struct vm_area_struct *vma,
> */
> if (vm_flags & VM_HUGEPAGE)
> return false;
> + /*
> + * Forcing a collapse (e.g., madv_collapse), is a clear advise to
advise -> advice.
> + * use THPs.
> + */
> + if (forced_collapse)
> + return false;
> return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
> }
>
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 9c1d6e49b8a9..ee4165738779 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -185,7 +185,7 @@ struct prctl_mm_map {
> #define PR_SET_THP_DISABLE 41
> /*
> * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
> - * VM_HUGEPAGE).
> + * VM_HUGEPAGE / MADV_COLLAPSE).
This is confusing you're mixing VMA flags with MADV ones... maybe just
stick to madvise ones, or add extra context around VM_HUGEPAGE bit?
Would need to be fixed up in a prior commit obviously.
> */
> # define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
> #define PR_GET_THP_DISABLE 42
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 85252b468f80..ef5ccb0ec5d5 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -104,7 +104,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> {
> const bool smaps = type == TVA_SMAPS;
> const bool in_pf = type == TVA_PAGEFAULT;
> - const bool enforce_sysfs = type != TVA_FORCED_COLLAPSE;
> + const bool forced_collapse = type == TVA_FORCED_COLLAPSE;
> + const bool enforce_sysfs = !forced_collapse;
Can we just get rid of this enforce_sysfs altogether in patch 2/5 and use
forced_collapse?
The first place we use it we negate it:
return orders & shmem_allowable_huge_orders(file_inode(vma->vm_file),
vma, vma->vm_pgoff, 0,
!enforce_sysfs);
And the one other place we'd have to negate, but it actually helps document
behaviour I think.
> unsigned long supported_orders;
>
> /* Check the intersection of requested and supported orders. */
> @@ -122,7 +123,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> if (!vma->vm_mm) /* vdso */
> return 0;
>
> - if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags))
> + if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags, forced_collapse))
> return 0;
>
> /* khugepaged doesn't collapse DAX vma, but page fault is fine. */
> diff --git a/mm/memory.c b/mm/memory.c
> index be761753f240..bd04212d6f79 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5186,9 +5186,11 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
> * It is too late to allocate a small folio, we already have a large
> * folio in the pagecache: especially s390 KVM cannot tolerate any
> * PMD mappings, but PTE-mapped THP are fine. So let's simply refuse any
> - * PMD mappings if THPs are disabled.
> + * PMD mappings if THPs are disabled. As we already have a THP ...
> + * behave as if we are forcing a collapse.
> */
> - if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags))
> + if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags,
> + /* forced_collapse=*/ true))
> return ret;
>
> if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
> diff --git a/mm/shmem.c b/mm/shmem.c
> index e6cdfda08aed..30609197a266 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1816,7 +1816,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
> vm_flags_t vm_flags = vma ? vma->vm_flags : 0;
> unsigned int global_orders;
>
> - if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags)))
> + if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags, shmem_huge_force)))
Hm this is an extra bit of logic, as noted above, we definitely need to
mention we're changing this in the commit message also.
Since shmem_allowable_huge_orders() can only be invoked by
__thp_vma_allowable_orders() with shmem_huge_force set true, and it'd only
be the case should TVA_FORCED_COLLAPSE is set, this makes sense.
> return 0;
>
> global_orders = shmem_huge_global_enabled(inode, index, write_end,
> --
> 2.47.3
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 3/5] mm/huge_memory: treat MADV_COLLAPSE as an advise with PR_THP_DISABLE_EXCEPT_ADVISED
2025-07-31 14:38 ` Lorenzo Stoakes
@ 2025-07-31 14:54 ` David Hildenbrand
2025-08-01 10:32 ` Lorenzo Stoakes
2025-08-01 11:26 ` Usama Arif
0 siblings, 2 replies; 24+ messages in thread
From: David Hildenbrand @ 2025-07-31 14:54 UTC (permalink / raw)
To: Lorenzo Stoakes, Usama Arif
Cc: Andrew Morton, linux-mm, linux-fsdevel, corbet, rppt, surenb,
mhocko, hannes, baohua, shakeel.butt, riel, ziy, laoar.shao,
dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team
On 31.07.25 16:38, Lorenzo Stoakes wrote:
> Nits on subject:
>
> - It's >75 chars
No big deal. If we cna come up with something shorter, good.
> - advise is the verb, advice is the noun.
Yeah.
>
> On Thu, Jul 31, 2025 at 01:27:20PM +0100, Usama Arif wrote:
>> From: David Hildenbrand <david@redhat.com>
>>
>> Let's allow for making MADV_COLLAPSE succeed on areas that neither have
>> VM_HUGEPAGE nor VM_NOHUGEPAGE when we have THP disabled
>> unless explicitly advised (PR_THP_DISABLE_EXCEPT_ADVISED).
>
> Hmm, I'm not sure about this.
>
> So far this prctl() has been the only way to override MADV_COLLAPSE
> behaviour, but now we're allowing for this one case to not.
This is not an override really. prctl() disallowed MADV_COLLAPSE, but in
the new mode we don't want that anymore.
> > I suppose the precedent is that MADV_COLLAPSE overrides 'madvise' sysfs
> behaviour.
> > I suppose what saves us here is 'advised' can be read to mean either
> MADV_HUGEPAGE or MADV_COLLAPSE.
> > And yes, MADV_COLLAPSE is clearly the user requesting this behaviour.
Exactly.
>
> I think the vagueness here is one that already existed, because one could
> perfectly one have expected MADV_COLLAPSE to obey sysfs and require
> MADV_HUGEPAGE to have been applied, but of course this is not the case.
Yes.
>
> OK so fine.
>
> BUT.
>
> I think the MADV_COLLAPSE man page will need to be updated to mention this.
>
Yes.
> And I REALLY think we should update the THP doc too to mention all these
> prctl() modes.
>
> I'm not sure we cover that right now _at all_ and obviously we should
> describe the new flags.
>
> Usama - can you add a patch to this series to do that?
Good point, let's document the interaction with prctl().
>
>>
>> MADV_COLLAPSE is a clear advise that we want to collapse.
>
> advise -> advice.
>
>>
>> Note that we still respect the VM_NOHUGEPAGE flag, just like
>> MADV_COLLAPSE always does. So consequently, MADV_COLLAPSE is now only
>> refused on VM_NOHUGEPAGE with PR_THP_DISABLE_EXCEPT_ADVISED.
>
> You also need to mention the shmem change you've made I think.
Yes.
> >>
>> Co-developed-by: Usama Arif <usamaarif642@gmail.com>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>> include/linux/huge_mm.h | 8 +++++++-
>> include/uapi/linux/prctl.h | 2 +-
>> mm/huge_memory.c | 5 +++--
>> mm/memory.c | 6 ++++--
>> mm/shmem.c | 2 +-
>> 5 files changed, 16 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index b0ff54eee81c..aeaf93f8ac2e 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -329,7 +329,7 @@ struct thpsize {
>> * through madvise or prctl.
>> */
>> static inline bool vma_thp_disabled(struct vm_area_struct *vma,
>> - vm_flags_t vm_flags)
>> + vm_flags_t vm_flags, bool forced_collapse)
>> {
>> /* Are THPs disabled for this VMA? */
>> if (vm_flags & VM_NOHUGEPAGE)
>> @@ -343,6 +343,12 @@ static inline bool vma_thp_disabled(struct vm_area_struct *vma,
>> */
>> if (vm_flags & VM_HUGEPAGE)
>> return false;
>> + /*
>> + * Forcing a collapse (e.g., madv_collapse), is a clear advise to
>
> advise -> advice.
>
>> + * use THPs.
>> + */
>> + if (forced_collapse)
>> + return false;
>> return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
>> }
>>
>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>> index 9c1d6e49b8a9..ee4165738779 100644
>> --- a/include/uapi/linux/prctl.h
>> +++ b/include/uapi/linux/prctl.h
>> @@ -185,7 +185,7 @@ struct prctl_mm_map {
>> #define PR_SET_THP_DISABLE 41
>> /*
>> * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
>> - * VM_HUGEPAGE).
>> + * VM_HUGEPAGE / MADV_COLLAPSE).
>
> This is confusing you're mixing VMA flags with MADV ones... maybe just
> stick to madvise ones, or add extra context around VM_HUGEPAGE bit?
I don't see anything confusing here, really.
But if it helps you, we can do
(e.g., MADV_HUGEPAGE / VM_HUGEPAGE, MADV_COLLAPSE).
(reason VM_HUGEPAGE is spelled out is that there might be code where we
set VM_HUGEPAGE implicitly in the kernel)
>
> Would need to be fixed up in a prior commit obviously.
>
>> */
>> # define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
>> #define PR_GET_THP_DISABLE 42
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 85252b468f80..ef5ccb0ec5d5 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -104,7 +104,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>> {
>> const bool smaps = type == TVA_SMAPS;
>> const bool in_pf = type == TVA_PAGEFAULT;
>> - const bool enforce_sysfs = type != TVA_FORCED_COLLAPSE;
>> + const bool forced_collapse = type == TVA_FORCED_COLLAPSE;
>> + const bool enforce_sysfs = !forced_collapse;
>
> Can we just get rid of this enforce_sysfs altogether in patch 2/5 and use
> forced_collapse?
Let's do that as a separate cleanup on top. I want least churn in that
patch.
(had the same idea while writing that patch, but I have other things to
focus on than cleaning up all this mess)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
2025-07-31 12:27 ` [PATCH v2 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE Usama Arif
2025-07-31 12:40 ` Lorenzo Stoakes
@ 2025-07-31 15:13 ` Zi Yan
1 sibling, 0 replies; 24+ messages in thread
From: Zi Yan @ 2025-07-31 15:13 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, linux-fsdevel, corbet, rppt,
surenb, mhocko, hannes, baohua, shakeel.butt, riel, laoar.shao,
dev.jain, baolin.wang, npache, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
linux-doc, kernel-team, Matthew Wilcox
On 31 Jul 2025, at 8:27, Usama Arif wrote:
> From: David Hildenbrand <david@redhat.com>
>
> People want to make use of more THPs, for example, moving from
> the "never" system policy to "madvise", or from "madvise" to "always".
>
> While this is great news for every THP desperately waiting to get
> allocated out there, apparently there are some workloads that require a
> bit of care during that transition: individual processes may need to
> opt-out from this behavior for various reasons, and this should be
> permitted without needing to make all other workloads on the system
> similarly opt-out.
>
> The following scenarios are imaginable:
>
> (1) Switch from "none" system policy to "madvise"/"always", but keep THPs
> disabled for selected workloads.
>
> (2) Stay at "none" system policy, but enable THPs for selected
> workloads, making only these workloads use the "madvise" or "always"
> policy.
>
> (3) Switch from "madvise" system policy to "always", but keep the
> "madvise" policy for selected workloads: allocate THPs only when
> advised.
>
> (4) Stay at "madvise" system policy, but enable THPs even when not advised
> for selected workloads -- "always" policy.
>
> Once can emulate (2) through (1), by setting the system policy to
> "madvise"/"always" while disabling THPs for all processes that don't want
> THPs. It requires configuring all workloads, but that is a user-space
> problem to sort out.
>
> (4) can be emulated through (3) in a similar way.
>
> Back when (1) was relevant in the past, as people started enabling THPs,
> we added PR_SET_THP_DISABLE, so relevant workloads that were not ready
> yet (i.e., used by Redis) were able to just disable THPs completely. Redis
> still implements the option to use this interface to disable THPs
> completely.
>
> With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
> workload -- a process, including fork+exec'ed process hierarchy.
> That essentially made us support (1): simply disable THPs for all workloads
> that are not ready for THPs yet, while still enabling THPs system-wide.
>
> The quest for handling (3) and (4) started, but current approaches
> (completely new prctl, options to set other policies per process,
> alternatives to prctl -- mctrl, cgroup handling) don't look particularly
> promising. Likely, the future will use bpf or something similar to
> implement better policies, in particular to also make better decisions
> about THP sizes to use, but this will certainly take a while as that work
> just started.
>
> Long story short: a simple enable/disable is not really suitable for the
> future, so we're not willing to add completely new toggles.
>
> While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
> completely for these processes, this is a step backwards, because these
> processes can no longer allocate THPs in regions where THPs were
> explicitly advised: regions flagged as VM_HUGEPAGE. Apparently, that
> imposes a problem for relevant workloads, because "not THPs" is certainly
> worse than "THPs only when advised".
>
> Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not
> explicitly advised by the app through MAD_HUGEPAGE"? *maybe*, but this
> would change the documented semantics quite a bit, and the versatility
> to use it for debugging purposes, so I am not 100% sure that is what we
> want -- although it would certainly be much easier.
>
> So instead, as an easy way forward for (3) and (4), add an option to
> make PR_SET_THP_DISABLE disable *less* THPs for a process.
>
> In essence, this patch:
>
> (A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
> of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).
>
> prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED).
>
> (B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
> PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.
>
> Previously, it would return 1 if THPs were disabled completely. Now
> it returns the set flags as well: 3 if PR_THP_DISABLE_EXCEPT_ADVISED
> was set.
>
> (C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
> the semantics clearly.
>
> Fortunately, there are only two instances outside of prctl() code.
>
> (D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
> with VM_HUGEPAGE" -- essentially "thp=madvise" behavior
>
> Fortunately, we only have to extend vma_thp_disabled().
>
> (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are
> disabled completely
>
> Only indicating that THPs are disabled when they are really disabled
> completely, not only partially.
>
> For now, we don't add another interface to obtained whether THPs
> are disabled partially (PR_THP_DISABLE_EXCEPT_ADVISED was set). If
> ever required, we could add a new entry.
>
> The documented semantics in the man page for PR_SET_THP_DISABLE
> "is inherited by a child created via fork(2) and is preserved across
> execve(2)" is maintained. This behavior, for example, allows for
> disabling THPs for a workload through the launching process (e.g.,
> systemd where we fork() a helper process to then exec()).
>
> For now, MADV_COLLAPSE will *fail* in regions without VM_HUGEPAGE and
> VM_NOHUGEPAGE. As MADV_COLLAPSE is a clear advise that user space
> thinks a THP is a good idea, we'll enable that separately next
> (requiring a bit of cleanup first).
>
> There is currently not way to prevent that a process will not issue
> PR_SET_THP_DISABLE itself to re-enable THP. There are not really known
> users for re-enabling it, and it's against the purpose of the original
> interface. So if ever required, we could investigate just forbidding to
> re-enable them, or make this somehow configurable.
>
> Acked-by: Usama Arif <usamaarif642@gmail.com>
> Tested-by: Usama Arif <usamaarif642@gmail.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Cc: SeongJae Park <sj@kernel.org>
> Cc: Jann Horn <jannh@google.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Yafang Shao <laoar.shao@gmail.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> ---
>
> At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
> think there might be real use cases where we want to disable any THPs --
> in particular also around debugging THP-related problems, and
> "never" not meaning ... "never" anymore ever since we add MADV_COLLAPSE.
> PR_SET_THP_DISABLE will also block MADV_COLLAPSE, which can be very
> helpful for debugging purposes. Of course, I thought of having a
> system-wide config option to modify PR_SET_THP_DISABLE behavior, but
> I just don't like the semantics.
>
> "prctl: allow overriding system THP policy to always"[1] proposed
> "overriding policies to always", which is just the wrong way around: we
> should not add mechanisms to "enable more" when we already have an
> interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
> weird otherwise.
>
> "[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed
> setting the default of the VM_HUGEPAGE, which is similarly the wrong way
> around I think now.
>
> The ideas explored by Lorenzo to extend process_madvise()[3] and mctrl()[4]
> similarly were around the "default for VM_HUGEPAGE" idea, but after the
> discussion, I think we should better leave VM_HUGEPAGE untouched.
>
> Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
> we essentially want to say "leave advised regions alone" -- "keep THP
> enabled for advised regions",
>
> The only thing I really dislike about this is using another MMF_* flag,
> but well, no way around it -- and seems like we could easily support
> more than 32 if we want to (most users already treat it like a proper
> bitmap).
>
> I think this here (modifying an existing toggle) is the only prctl()
> extension that we might be willing to accept. In general, I agree like
> most others, that prctl() is a very bad interface for that -- but
> PR_SET_THP_DISABLE is already there and is getting used.
>
> Long-term, I think the answer will be something based on bpf[5]. Maybe
> in that context, I there could still be value in easily disabling THPs for
> selected workloads (esp. debugging purposes).
>
> Jann raised valid concerns[6] about new flags that are persistent across
> exec[6]. As this here is a relaxation to existing PR_SET_THP_DISABLE I
> consider it having a similar security risk as our existing
> PR_SET_THP_DISABLE, but devil is in the detail.
>
> [1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
> [2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
> [3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
> [4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
> [5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
> [6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> Documentation/filesystems/proc.rst | 5 ++-
> fs/proc/array.c | 2 +-
> include/linux/huge_mm.h | 20 +++++++---
> include/linux/mm_types.h | 13 +++----
> include/uapi/linux/prctl.h | 10 +++++
> kernel/sys.c | 59 ++++++++++++++++++++++++------
> mm/khugepaged.c | 2 +-
> 7 files changed, 82 insertions(+), 29 deletions(-)
>
The changes look good to me. Acked-by: Zi Yan <ziy@nvidia.com>
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 2/5] mm/huge_memory: convert "tva_flags" to "enum tva_type" for thp_vma_allowable_order*()
2025-07-31 14:00 ` Lorenzo Stoakes
@ 2025-07-31 15:19 ` Zi Yan
2025-07-31 16:15 ` David Hildenbrand
2025-07-31 19:20 ` Usama Arif
2 siblings, 0 replies; 24+ messages in thread
From: Zi Yan @ 2025-07-31 15:19 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Usama Arif, Andrew Morton, david, linux-mm, linux-fsdevel, corbet,
rppt, surenb, mhocko, hannes, baohua, shakeel.butt, riel,
laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
linux-doc, kernel-team
On 31 Jul 2025, at 10:00, Lorenzo Stoakes wrote:
> On Thu, Jul 31, 2025 at 01:27:19PM +0100, Usama Arif wrote:
>> From: David Hildenbrand <david@redhat.com>
>>
>> Describing the context through a type is much clearer, and good enough
>> for our case.
>
> This is pretty bare bones. What context, what type? Under what
> circumstances?
>
> This also is missing detail on the key difference here - that actually it
> turns out we _don't_ need these to be flags, rather we can have _distinct_
> modes which are clearer.
>
> I'd say something like:
>
> when determining which THP orders are eligiible for a VMA mapping,
> we have previously specified tva_flags, however it turns out it is
> really not necessary to treat these as flags.
>
> Rather, we distinguish between distinct modes.
>
> The only case where we previously combined flags was with
> TVA_ENFORCE_SYSFS, but we can avoid this by observing that this is
> the default, except for MADV_COLLAPSE or an edge cases in
> collapse_pte_mapped_thp() and hugepage_vma_revalidate(), and adding
> a mode specifically for this case - TVA_FORCED_COLLAPSE.
>
> ... stuff about the different modes...
>
>>
>> We have:
>> * smaps handling for showing "THPeligible"
>> * Pagefault handling
>> * khugepaged handling
>> * Forced collapse handling: primarily MADV_COLLAPSE, but one other odd case
>
> Can we actually state what this case is? I mean I guess a handwave in the
> form of 'an edge case in collapse_pte_mapped_thp()' will do also.
>
> Hmm actually we do weird stuff with this so maybe just handwave.
>
> Like uprobes calls collapse_pte_mapped_thp()... :/ I'm not sure this 'If we
> are here, we've succeeded in replacing all the native pages in the page
> cache with a single hugepage.' comment is even correct.
>
> Anyway yeah, hand wave I guess...
>
>>
>> Really, we want to ignore sysfs only when we are forcing a collapse
>> through MADV_COLLAPSE, otherwise we want to enforce.
>
> I'd say 'ignoring this edge case, ...'
>
> I think the clearest thing might be to literally list the before/after
> like:
>
> * TVA_SMAPS | TVA_ENFORCE_SYSFS -> TVA_SMAPS
> * TVA_IN_PF | TVA_ENFORCE_SYSFS -> TVA_PAGEFAULT
> * TVA_ENFORCE_SYSFS -> TVA_KHUGEPAGED
> * 0 -> TVA_FORCED_COLLAPSE
>
>>
>> With this change, we immediately know if we are in the forced collapse
>> case, which will be valuable next.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> Acked-by: Usama Arif <usamaarif642@gmail.com>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>
> Overall this is a great cleanup, some various nits however.
>
>> ---
>> fs/proc/task_mmu.c | 4 ++--
>> include/linux/huge_mm.h | 30 ++++++++++++++++++------------
>> mm/huge_memory.c | 8 ++++----
>> mm/khugepaged.c | 18 +++++++++---------
>> mm/memory.c | 14 ++++++--------
>> 5 files changed, 39 insertions(+), 35 deletions(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 3d6d8a9f13fc..d440df7b3d59 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -1293,8 +1293,8 @@ static int show_smap(struct seq_file *m, void *v)
>> __show_smap(m, &mss, false);
>>
>> seq_printf(m, "THPeligible: %8u\n",
>> - !!thp_vma_allowable_orders(vma, vma->vm_flags,
>> - TVA_SMAPS | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL));
>> + !!thp_vma_allowable_orders(vma, vma->vm_flags, TVA_SMAPS,
>> + THP_ORDERS_ALL));
>
> This !! is so gross, wonder if we could have a bool wrapper. But not a big
> deal.
>
> I also sort of _hate_ the smaps flag anyway, invoking this 'allowable
> orders' thing just for smaps reporting with maybe some minor delta is just
> odd.
>
> Something like `bool vma_has_thp_allowed_orders(struct vm_area_struct
> *vma);` would be nicer.
>
> Anyway thoughts for another time... :)
Or just
bool thp_eligible = thp_vma_allowable_orders(...);
seq_printf(m, "THPeligible: %8u\n", thp_eligible);
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 2/5] mm/huge_memory: convert "tva_flags" to "enum tva_type" for thp_vma_allowable_order*()
2025-07-31 14:00 ` Lorenzo Stoakes
2025-07-31 15:19 ` Zi Yan
@ 2025-07-31 16:15 ` David Hildenbrand
2025-08-01 10:08 ` Lorenzo Stoakes
2025-07-31 19:20 ` Usama Arif
2 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2025-07-31 16:15 UTC (permalink / raw)
To: Lorenzo Stoakes, Usama Arif
Cc: Andrew Morton, linux-mm, linux-fsdevel, corbet, rppt, surenb,
mhocko, hannes, baohua, shakeel.butt, riel, ziy, laoar.shao,
dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team
On 31.07.25 16:00, Lorenzo Stoakes wrote:
> On Thu, Jul 31, 2025 at 01:27:19PM +0100, Usama Arif wrote:
>> From: David Hildenbrand <david@redhat.com>
>>
>> Describing the context through a type is much clearer, and good enough
>> for our case.
Just for the other patch, I'll let Usama take it from here, just a bunch
of comments.
>
> This is pretty bare bones. What context, what type? Under what
> circumstances?
>
> This also is missing detail on the key difference here - that actually it
> turns out we _don't_ need these to be flags, rather we can have _distinct_
> modes which are clearer.
>
> I'd say something like:
>
> when determining which THP orders are eligiible for a VMA mapping,
> we have previously specified tva_flags, however it turns out it is
> really not necessary to treat these as flags.
>
> Rather, we distinguish between distinct modes.
>
> The only case where we previously combined flags was with
> TVA_ENFORCE_SYSFS, but we can avoid this by observing that this is
> the default, except for MADV_COLLAPSE or an edge cases in
> collapse_pte_mapped_thp() and hugepage_vma_revalidate(), and adding
> a mode specifically for this case - TVA_FORCED_COLLAPSE.
>
> ... stuff about the different modes...
>
>>
>> We have:
>> * smaps handling for showing "THPeligible"
>> * Pagefault handling
>> * khugepaged handling
>> * Forced collapse handling: primarily MADV_COLLAPSE, but one other odd case
>
> Can we actually state what this case is? I mean I guess a handwave in the
> form of 'an edge case in collapse_pte_mapped_thp()' will do also.
Yeah, something like that. I think we also call it when we previously
checked that there is a THP and that we might be allowed to collapse.
E.g., collapse_pte_mapped_thp() is also called from khugepaged code
where we already checked the allowed order.
>
> Hmm actually we do weird stuff with this so maybe just handwave.
>
> Like uprobes calls collapse_pte_mapped_thp()... :/ I'm not sure this 'If we
> are here, we've succeeded in replacing all the native pages in the page
> cache with a single hugepage.' comment is even correct.
I think in all these cases we already have a THP and want to force that
collapse in the page table.
[...]
>>
>> Really, we want to ignore sysfs only when we are forcing a collapse
>> through MADV_COLLAPSE, otherwise we want to enforce.
>
> I'd say 'ignoring this edge case, ...'
>
> I think the clearest thing might be to literally list the before/after
> like:
>
> * TVA_SMAPS | TVA_ENFORCE_SYSFS -> TVA_SMAPS
> * TVA_IN_PF | TVA_ENFORCE_SYSFS -> TVA_PAGEFAULT
> * TVA_ENFORCE_SYSFS -> TVA_KHUGEPAGED
> * 0 -> TVA_FORCED_COLLAPSE
>
That makes sense.
>>
>> With this change, we immediately know if we are in the forced collapse
>> case, which will be valuable next.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> Acked-by: Usama Arif <usamaarif642@gmail.com>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>
> Overall this is a great cleanup, some various nits however.
>
>> ---
>> fs/proc/task_mmu.c | 4 ++--
>> include/linux/huge_mm.h | 30 ++++++++++++++++++------------
>> mm/huge_memory.c | 8 ++++----
>> mm/khugepaged.c | 18 +++++++++---------
>> mm/memory.c | 14 ++++++--------
>> 5 files changed, 39 insertions(+), 35 deletions(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 3d6d8a9f13fc..d440df7b3d59 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -1293,8 +1293,8 @@ static int show_smap(struct seq_file *m, void *v)
>> __show_smap(m, &mss, false);
>>
>> seq_printf(m, "THPeligible: %8u\n",
>> - !!thp_vma_allowable_orders(vma, vma->vm_flags,
>> - TVA_SMAPS | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL));
>> + !!thp_vma_allowable_orders(vma, vma->vm_flags, TVA_SMAPS,
>> + THP_ORDERS_ALL));
>
> This !! is so gross, wonder if we could have a bool wrapper. But not a big
> deal.
>
> I also sort of _hate_ the smaps flag anyway, invoking this 'allowable
> orders' thing just for smaps reporting with maybe some minor delta is just
> odd.
>
> Something like `bool vma_has_thp_allowed_orders(struct vm_area_struct
> *vma);` would be nicer.
>
> Anyway thoughts for another time... :)
Yeah, that's not the only nasty bit here ... :)
>
>>
>> if (arch_pkeys_enabled())
>> seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma));
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 71db243a002e..b0ff54eee81c 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -94,12 +94,15 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>> #define THP_ORDERS_ALL \
>> (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_SPECIAL | THP_ORDERS_ALL_FILE_DEFAULT)
>>
>> -#define TVA_SMAPS (1 << 0) /* Will be used for procfs */
>
> Dumb question, but what does 'TVA' stand for? :P
Whoever came up with that probably used the function name where this is
passed in
thp_vma_allowable_orders()
>
>> -#define TVA_IN_PF (1 << 1) /* Page fault handler */
>> -#define TVA_ENFORCE_SYSFS (1 << 2) /* Obey sysfs configuration */
>> +enum tva_type {
>> + TVA_SMAPS, /* Exposing "THPeligible:" in smaps. */
>
> How I hate this flag (just an observation...)
>
>> + TVA_PAGEFAULT, /* Serving a page fault. */
>> + TVA_KHUGEPAGED, /* Khugepaged collapse. */
>
> This is equivalent to the TVA_ENFORCE_SYSFS case before, sort of a default
> I guess, but actually quite nice to add the context that it's sourced from
> khugepaged - I assume this will always be the case when specified?
>
>> + TVA_FORCED_COLLAPSE, /* Forced collapse (i.e., MADV_COLLAPSE). */
>
> Would put 'e.g.' here, then that allows 'space' for the edge case...
Makes sense.
>
>> +};
>>
>> -#define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
>> - (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
>> +#define thp_vma_allowable_order(vma, vm_flags, type, order) \
>> + (!!thp_vma_allowable_orders(vma, vm_flags, type, BIT(order)))
>
> Nit, but maybe worth keeping tva_ prefix - tva_type - here just so it's
> clear what type it refers to.
>
> But not end of the world.
>
> Same comment goes for param names below etc.
No strong opinion, but I prefer to drop the prefix when it can be
deduced from the type and we are inside of the very function that
essentially defines these types (tva prefix is implicit, no other type
applies).
These should probably just be inline functions at some point with proper
types and doc (separate patch uin the future, of course).
[...]
>> +++ b/mm/khugepaged.c
>> @@ -474,8 +474,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
>> {
>> if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
>> hugepage_pmd_enabled()) {
>> - if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
>> - PMD_ORDER))
>> + if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
>> __khugepaged_enter(vma->vm_mm);
>> }
>> }
>> @@ -921,7 +920,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>> struct collapse_control *cc)
>> {
>> struct vm_area_struct *vma;
>> - unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
>> + enum tva_type tva_type = cc->is_khugepaged ? TVA_KHUGEPAGED :
>> + TVA_FORCED_COLLAPSE;
>
> This is great, this is so much clearer.
>
> A nit though, I mean I come back to my 'type' vs 'tva_type' nit above, this
> is inconsistent, so we should choose one approach and stick with it.
This is outside of the function, so I would prefer to keep it here, but
no stong opinion.
>
>>
>> if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
>> return SCAN_ANY_PROCESS;
>> @@ -932,7 +932,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>>
>> if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
>> return SCAN_ADDRESS_RANGE;
>> - if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
>> + if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_type, PMD_ORDER))
>> return SCAN_VMA_CHECK;
>> /*
>> * Anon VMA expected, the address may be unmapped then
>> @@ -1532,9 +1532,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>> * in the page cache with a single hugepage. If a mm were to fault-in
>> * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
>> * and map it by a PMD, regardless of sysfs THP settings. As such, let's
>> - * analogously elide sysfs THP settings here.
>> + * analogously elide sysfs THP settings here and pretend we are
>> + * collapsing.
>
> I think saying pretending here is potentially confusing, maybe worth saying
> 'force collapse'?
Makes sense.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 2/5] mm/huge_memory: convert "tva_flags" to "enum tva_type" for thp_vma_allowable_order*()
2025-07-31 14:00 ` Lorenzo Stoakes
2025-07-31 15:19 ` Zi Yan
2025-07-31 16:15 ` David Hildenbrand
@ 2025-07-31 19:20 ` Usama Arif
2025-08-01 10:12 ` Lorenzo Stoakes
2 siblings, 1 reply; 24+ messages in thread
From: Usama Arif @ 2025-07-31 19:20 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, david, linux-mm, linux-fsdevel, corbet, rppt,
surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
linux-doc, kernel-team
On 31/07/2025 15:00, Lorenzo Stoakes wrote:
> On Thu, Jul 31, 2025 at 01:27:19PM +0100, Usama Arif wrote:
>> From: David Hildenbrand <david@redhat.com>
>>
>> Describing the context through a type is much clearer, and good enough
>> for our case.
>
> This is pretty bare bones. What context, what type? Under what
> circumstances?
>
> This also is missing detail on the key difference here - that actually it
> turns out we _don't_ need these to be flags, rather we can have _distinct_
> modes which are clearer.
>
> I'd say something like:
>
> when determining which THP orders are eligiible for a VMA mapping,
> we have previously specified tva_flags, however it turns out it is
> really not necessary to treat these as flags.
>
> Rather, we distinguish between distinct modes.
>
> The only case where we previously combined flags was with
> TVA_ENFORCE_SYSFS, but we can avoid this by observing that this is
> the default, except for MADV_COLLAPSE or an edge cases in
> collapse_pte_mapped_thp() and hugepage_vma_revalidate(), and adding
> a mode specifically for this case - TVA_FORCED_COLLAPSE.
>
> ... stuff about the different modes...
>
>>
>> We have:
>> * smaps handling for showing "THPeligible"
>> * Pagefault handling
>> * khugepaged handling
>> * Forced collapse handling: primarily MADV_COLLAPSE, but one other odd case
>
> Can we actually state what this case is? I mean I guess a handwave in the
> form of 'an edge case in collapse_pte_mapped_thp()' will do also.
>
> Hmm actually we do weird stuff with this so maybe just handwave.
>
> Like uprobes calls collapse_pte_mapped_thp()... :/ I'm not sure this 'If we
> are here, we've succeeded in replacing all the native pages in the page
> cache with a single hugepage.' comment is even correct.
>
> Anyway yeah, hand wave I guess...
>
>>
>> Really, we want to ignore sysfs only when we are forcing a collapse
>> through MADV_COLLAPSE, otherwise we want to enforce.
>
> I'd say 'ignoring this edge case, ...'
>
> I think the clearest thing might be to literally list the before/after
> like:
>
> * TVA_SMAPS | TVA_ENFORCE_SYSFS -> TVA_SMAPS
> * TVA_IN_PF | TVA_ENFORCE_SYSFS -> TVA_PAGEFAULT
> * TVA_ENFORCE_SYSFS -> TVA_KHUGEPAGED
> * 0 -> TVA_FORCED_COLLAPSE
>
>>
>> With this change, we immediately know if we are in the forced collapse
>> case, which will be valuable next.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> Acked-by: Usama Arif <usamaarif642@gmail.com>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>
> Overall this is a great cleanup, some various nits however.
>
Thanks for the feedback Lorenzo!
I have modified the commit message to be:
mm/huge_memory: convert "tva_flags" to "enum tva_type"
When determining which THP orders are eligible for a VMA mapping,
we have previously specified tva_flags, however it turns out it is
really not necessary to treat these as flags.
Rather, we distinguish between distinct modes.
The only case where we previously combined flags was with
TVA_ENFORCE_SYSFS, but we can avoid this by observing that this
is the default, except for MADV_COLLAPSE or an edge cases in
collapse_pte_mapped_thp() and hugepage_vma_revalidate(), and
adding a mode specifically for this case - TVA_FORCED_COLLAPSE.
We have:
* smaps handling for showing "THPeligible"
* Pagefault handling
* khugepaged handling
* Forced collapse handling: primarily MADV_COLLAPSE, but also for
an edge case in collapse_pte_mapped_thp()
Ignoring the collapse_pte_mapped_thp edgecase, we only want to
ignore sysfs only when we are forcing a collapse through
MADV_COLLAPSE, otherwise we want to enforce it, hence this patch
does the following flag to enum conversions:
* TVA_SMAPS | TVA_ENFORCE_SYSFS -> TVA_SMAPS
* TVA_IN_PF | TVA_ENFORCE_SYSFS -> TVA_PAGEFAULT
* TVA_ENFORCE_SYSFS -> TVA_KHUGEPAGED
* 0 -> TVA_FORCED_COLLAPSE
With this change, we immediately know if we are in the forced collapse
case, which will be valuable next.
>> ---
>> fs/proc/task_mmu.c | 4 ++--
>> include/linux/huge_mm.h | 30 ++++++++++++++++++------------
>> mm/huge_memory.c | 8 ++++----
>> mm/khugepaged.c | 18 +++++++++---------
>> mm/memory.c | 14 ++++++--------
>> 5 files changed, 39 insertions(+), 35 deletions(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 3d6d8a9f13fc..d440df7b3d59 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -1293,8 +1293,8 @@ static int show_smap(struct seq_file *m, void *v)
>> __show_smap(m, &mss, false);
>>
>> seq_printf(m, "THPeligible: %8u\n",
>> - !!thp_vma_allowable_orders(vma, vma->vm_flags,
>> - TVA_SMAPS | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL));
>> + !!thp_vma_allowable_orders(vma, vma->vm_flags, TVA_SMAPS,
>> + THP_ORDERS_ALL));
>
> This !! is so gross, wonder if we could have a bool wrapper. But not a big
> deal.
>
> I also sort of _hate_ the smaps flag anyway, invoking this 'allowable
> orders' thing just for smaps reporting with maybe some minor delta is just
> odd.
>
> Something like `bool vma_has_thp_allowed_orders(struct vm_area_struct
> *vma);` would be nicer.
>
> Anyway thoughts for another time... :)
>
>>
>> if (arch_pkeys_enabled())
>> seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma));
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 71db243a002e..b0ff54eee81c 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -94,12 +94,15 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>> #define THP_ORDERS_ALL \
>> (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_SPECIAL | THP_ORDERS_ALL_FILE_DEFAULT)
>>
>> -#define TVA_SMAPS (1 << 0) /* Will be used for procfs */
>
> Dumb question, but what does 'TVA' stand for? :P
>
>> -#define TVA_IN_PF (1 << 1) /* Page fault handler */
>> -#define TVA_ENFORCE_SYSFS (1 << 2) /* Obey sysfs configuration */
>> +enum tva_type {
>> + TVA_SMAPS, /* Exposing "THPeligible:" in smaps. */
>
> How I hate this flag (just an observation...)
>
>> + TVA_PAGEFAULT, /* Serving a page fault. */
>> + TVA_KHUGEPAGED, /* Khugepaged collapse. */
>
> This is equivalent to the TVA_ENFORCE_SYSFS case before, sort of a default
> I guess, but actually quite nice to add the context that it's sourced from
> khugepaged - I assume this will always be the case when specified?
>
>> + TVA_FORCED_COLLAPSE, /* Forced collapse (i.e., MADV_COLLAPSE). */
>
> Would put 'e.g.' here, then that allows 'space' for the edge case...
>
>> +};
>>
>> -#define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
>> - (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
>> +#define thp_vma_allowable_order(vma, vm_flags, type, order) \
>> + (!!thp_vma_allowable_orders(vma, vm_flags, type, BIT(order)))
>
> Nit, but maybe worth keeping tva_ prefix - tva_type - here just so it's
> clear what type it refers to.
>
> But not end of the world.
>
> Same comment goes for param names below etc.
>
>>
>> #define split_folio(f) split_folio_to_list(f, NULL)
>>
>> @@ -264,14 +267,14 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
>>
>> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>> vm_flags_t vm_flags,
>> - unsigned long tva_flags,
>> + enum tva_type type,
>> unsigned long orders);
>>
>> /**
>> * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
>> * @vma: the vm area to check
>> * @vm_flags: use these vm_flags instead of vma->vm_flags
>> - * @tva_flags: Which TVA flags to honour
>> + * @type: TVA type
>> * @orders: bitfield of all orders to consider
>> *
>> * Calculates the intersection of the requested hugepage orders and the allowed
>> @@ -285,11 +288,14 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>> static inline
>> unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>> vm_flags_t vm_flags,
>> - unsigned long tva_flags,
>> + enum tva_type type,
>> unsigned long orders)
>> {
>> - /* Optimization to check if required orders are enabled early. */
>> - if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
>> + /*
>> + * Optimization to check if required orders are enabled early. Only
>> + * forced collapse ignores sysfs configs.
>> + */
>> + if (type != TVA_FORCED_COLLAPSE && vma_is_anonymous(vma)) {
>> unsigned long mask = READ_ONCE(huge_anon_orders_always);
>>
>> if (vm_flags & VM_HUGEPAGE)
>> @@ -303,7 +309,7 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>> return 0;
>> }
>>
>> - return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
>> + return __thp_vma_allowable_orders(vma, vm_flags, type, orders);
>> }
>>
>> struct thpsize {
>> @@ -536,7 +542,7 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
>>
>> static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>> vm_flags_t vm_flags,
>> - unsigned long tva_flags,
>> + enum tva_type type,
>> unsigned long orders)
>> {
>> return 0;
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 2b4ea5a2ce7d..85252b468f80 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -99,12 +99,12 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>>
>> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>> vm_flags_t vm_flags,
>> - unsigned long tva_flags,
>> + enum tva_type type,
>> unsigned long orders)
>> {
>> - bool smaps = tva_flags & TVA_SMAPS;
>> - bool in_pf = tva_flags & TVA_IN_PF;
>> - bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS;
>> + const bool smaps = type == TVA_SMAPS;
>> + const bool in_pf = type == TVA_PAGEFAULT;
>> + const bool enforce_sysfs = type != TVA_FORCED_COLLAPSE;
>
> Some cheeky const-ifying, I like it :)
>
>> unsigned long supported_orders;
>>
>> /* Check the intersection of requested and supported orders. */
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 2c9008246785..7a54b6f2a346 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -474,8 +474,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
>> {
>> if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
>> hugepage_pmd_enabled()) {
>> - if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
>> - PMD_ORDER))
>> + if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
>> __khugepaged_enter(vma->vm_mm);
>> }
>> }
>> @@ -921,7 +920,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>> struct collapse_control *cc)
>> {
>> struct vm_area_struct *vma;
>> - unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
>> + enum tva_type tva_type = cc->is_khugepaged ? TVA_KHUGEPAGED :
>> + TVA_FORCED_COLLAPSE;
>
> This is great, this is so much clearer.
>
> A nit though, I mean I come back to my 'type' vs 'tva_type' nit above, this
> is inconsistent, so we should choose one approach and stick with it.
>
I dont exactly like the name "tva" (It has nothing to do with the fact it took
me more time than I would like to figure out that it meant THP VMA allowable :)),
so what I will do is use "type" everywhere if that is ok?
But no strong opinion and can change the variable/macro args to tva_type if that
is preferred.
The diff over v2 after taking the review comments into account looks quite trivial:
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b0ff54eee81c..bd4f9e6327e0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -98,7 +98,7 @@ enum tva_type {
TVA_SMAPS, /* Exposing "THPeligible:" in smaps. */
TVA_PAGEFAULT, /* Serving a page fault. */
TVA_KHUGEPAGED, /* Khugepaged collapse. */
- TVA_FORCED_COLLAPSE, /* Forced collapse (i.e., MADV_COLLAPSE). */
+ TVA_FORCED_COLLAPSE, /* Forced collapse (e.g. MADV_COLLAPSE). */
};
#define thp_vma_allowable_order(vma, vm_flags, type, order) \
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 7a54b6f2a346..88cb6339e910 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -920,7 +920,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
struct collapse_control *cc)
{
struct vm_area_struct *vma;
- enum tva_type tva_type = cc->is_khugepaged ? TVA_KHUGEPAGED :
+ enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED :
TVA_FORCED_COLLAPSE;
if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
@@ -932,7 +932,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
return SCAN_ADDRESS_RANGE;
- if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_type, PMD_ORDER))
+ if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
return SCAN_VMA_CHECK;
/*
* Anon VMA expected, the address may be unmapped then
@@ -1532,8 +1532,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
* in the page cache with a single hugepage. If a mm were to fault-in
* this memory (mapped by a suitably aligned VMA), we'd get the hugepage
* and map it by a PMD, regardless of sysfs THP settings. As such, let's
- * analogously elide sysfs THP settings here and pretend we are
- * collapsing.
+ * analogously elide sysfs THP settings here and force collapse.
*/
if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
return SCAN_VMA_CHECK;
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH v2 4/5] selftests: prctl: introduce tests for disabling THPs completely
2025-07-31 12:27 ` [PATCH v2 4/5] selftests: prctl: introduce tests for disabling THPs completely Usama Arif
@ 2025-07-31 19:42 ` David Hildenbrand
2025-08-01 11:42 ` Usama Arif
0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2025-07-31 19:42 UTC (permalink / raw)
To: Usama Arif, Andrew Morton, linux-mm
Cc: linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts, vbabka,
jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team
On 31.07.25 14:27, Usama Arif wrote:
> The test will set the global system THP setting to never, madvise
> or always depending on the fixture variant and the 2M setting to
> inherit before it starts (and reset to original at teardown).
>
> This tests if the process can:
> - successfully set and get the policy to disable THPs completely.
> - never get a hugepage when the THPs are completely disabled
> with the prctl, including with MADV_HUGE and MADV_COLLAPSE.
> - successfully reset the policy of the process.
> - after reset, only get hugepages with:
> - MADV_COLLAPSE when policy is set to never.
> - MADV_HUGE and MADV_COLLAPSE when policy is set to madvise.
> - always when policy is set to "always".
> - repeat the above tests in a forked process to make sure
> the policy is carried across forks.
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
[...]
Looks much better already. Some quirks.
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <sys/mman.h>
> +#include <sys/prctl.h>
> +#include <sys/wait.h>
> +
> +#include "../kselftest_harness.h"
> +#include "thp_settings.h"
> +#include "vm_util.h"
> +
> +static int sz2ord(size_t size, size_t pagesize)
> +{
> + return __builtin_ctzll(size / pagesize);
> +}
> +
> +enum thp_collapse_type {
> + THP_COLLAPSE_NONE,
> + THP_COLLAPSE_MADV_HUGEPAGE, /* MADV_HUGEPAGE before access */
> + THP_COLLAPSE_MADV_COLLAPSE, /* MADV_COLLAPSE after access */
> +};
> +
> +enum thp_policy {
> + THP_POLICY_NEVER,
> + THP_POLICY_MADVISE,
> + THP_POLICY_ALWAYS,
> +};
Couldn't you have reused "enum thp_enabled" end simply never specified
the "THP_INHERIT"? Then, you need to do less translation.
> +
> +struct test_results {
> + int prctl_get_thp_disable;
The result is always one, does that here make sense?
> + int prctl_applied_collapse_none;
"prctl_applied" is a bit confusing. And most of these always have the
same value.
Can't we special case the remaining two cases on the current policy and
avoid this struct compeltely?
> + int prctl_applied_collapse_madv_huge;
> + int prctl_applied_collapse_madv_collapse;
> + int prctl_removed_collapse_none;
> + int prctl_removed_collapse_madv_huge;
> + int prctl_removed_collapse_madv_collapse;
> +};
> +
> +/*
> + * Function to mmap a buffer, fault it in, madvise it appropriately (before
> + * page fault for MADV_HUGE, and after for MADV_COLLAPSE), and check if the
> + * mmap region is huge.
> + * Returns:
> + * 0 if test doesn't give hugepage
> + * 1 if test gives a hugepage
> + * -errno if mmap fails
> + */
> +static int test_mmap_thp(enum thp_collapse_type madvise_buf, size_t pmdsize)
> +{
> + char *mem, *mmap_mem;
> + size_t mmap_size;
> + int ret;
> +
> + /* For alignment purposes, we need twice the THP size. */
> + mmap_size = 2 * pmdsize;
> + mmap_mem = (char *)mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
> + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> + if (mmap_mem == MAP_FAILED)
> + return -errno;
> +
> + /* We need a THP-aligned memory area. */
> + mem = (char *)(((uintptr_t)mmap_mem + pmdsize) & ~(pmdsize - 1));
> +
> + if (madvise_buf == THP_COLLAPSE_MADV_HUGEPAGE)
> + madvise(mem, pmdsize, MADV_HUGEPAGE);
> +
> + /* Ensure memory is allocated */
> + memset(mem, 1, pmdsize);
> +
> + if (madvise_buf == THP_COLLAPSE_MADV_COLLAPSE)
> + madvise(mem, pmdsize, MADV_COLLAPSE);
> +
To avoid even mmap_mem to get merged with some other VMA, maybe just do
before reading the smap here:
/* HACK: make sure we have a separate VMA that we can check reliably. */
mprotect(mem, pmdsize, PROT_READ);
or
madvise(mem, pmdsize, MADV_DONTFORK);
before reading smaps.
That is probably the easiest approach. The you can drop the lengthy
comment and perform a single thp check.
[...]
> +
> +static void prctl_thp_disable_test(struct __test_metadata *const _metadata,
> + size_t pmdsize, struct test_results *results)
> +{
> +
> + ASSERT_EQ(prctl(PR_GET_THP_DISABLE, NULL, NULL, NULL, NULL),
> + results->prctl_get_thp_disable);
> +
> + /* tests after prctl overrides global policy */
> + ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_NONE, pmdsize),
> + results->prctl_applied_collapse_none);
> +
> + ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_HUGEPAGE, pmdsize),
> + results->prctl_applied_collapse_madv_huge);
> +
> + ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_COLLAPSE, pmdsize),
> + results->prctl_applied_collapse_madv_collapse);
> +
> + /* Reset to global policy */
> + ASSERT_EQ(prctl(PR_SET_THP_DISABLE, 0, NULL, NULL, NULL), 0);
> +
> + /* tests after prctl is cleared, and only global policy is effective */
> + ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_NONE, pmdsize),
> + results->prctl_removed_collapse_none);
> +
> + ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_HUGEPAGE, pmdsize),
> + results->prctl_removed_collapse_madv_huge);
> +
> + ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_COLLAPSE, pmdsize),
> + results->prctl_removed_collapse_madv_collapse);
> +}
> +
> +FIXTURE(prctl_thp_disable_completely)
> +{
> + struct thp_settings settings;
> + struct test_results results;
Is this "expected_results" ?
But again, hopefully we can remove that and instead just base it on the
polocy that we configured.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 2/5] mm/huge_memory: convert "tva_flags" to "enum tva_type" for thp_vma_allowable_order*()
2025-07-31 16:15 ` David Hildenbrand
@ 2025-08-01 10:08 ` Lorenzo Stoakes
0 siblings, 0 replies; 24+ messages in thread
From: Lorenzo Stoakes @ 2025-08-01 10:08 UTC (permalink / raw)
To: David Hildenbrand
Cc: Usama Arif, Andrew Morton, linux-mm, linux-fsdevel, corbet, rppt,
surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
linux-doc, kernel-team
On Thu, Jul 31, 2025 at 06:15:35PM +0200, David Hildenbrand wrote:
> On 31.07.25 16:00, Lorenzo Stoakes wrote:
> > On Thu, Jul 31, 2025 at 01:27:19PM +0100, Usama Arif wrote:
> > > From: David Hildenbrand <david@redhat.com>
> > >
> > > Describing the context through a type is much clearer, and good enough
> > > for our case.
>
> Just for the other patch, I'll let Usama take it from here, just a bunch of
> comments.
Ack!
>
> >
> > This is pretty bare bones. What context, what type? Under what
> > circumstances?
> >
> > This also is missing detail on the key difference here - that actually it
> > turns out we _don't_ need these to be flags, rather we can have _distinct_
> > modes which are clearer.
> >
> > I'd say something like:
> >
> > when determining which THP orders are eligiible for a VMA mapping,
> > we have previously specified tva_flags, however it turns out it is
> > really not necessary to treat these as flags.
> >
> > Rather, we distinguish between distinct modes.
> >
> > The only case where we previously combined flags was with
> > TVA_ENFORCE_SYSFS, but we can avoid this by observing that this is
> > the default, except for MADV_COLLAPSE or an edge cases in
> > collapse_pte_mapped_thp() and hugepage_vma_revalidate(), and adding
> > a mode specifically for this case - TVA_FORCED_COLLAPSE.
> >
> > ... stuff about the different modes...
> >
> > >
> > > We have:
> > > * smaps handling for showing "THPeligible"
> > > * Pagefault handling
> > > * khugepaged handling
> > > * Forced collapse handling: primarily MADV_COLLAPSE, but one other odd case
> >
> > Can we actually state what this case is? I mean I guess a handwave in the
> > form of 'an edge case in collapse_pte_mapped_thp()' will do also.
>
> Yeah, something like that. I think we also call it when we previously
> checked that there is a THP and that we might be allowed to collapse. E.g.,
> collapse_pte_mapped_thp() is also called from khugepaged code where we
> already checked the allowed order.
See below...
>
> >
> > Hmm actually we do weird stuff with this so maybe just handwave.
> >
> > Like uprobes calls collapse_pte_mapped_thp()... :/ I'm not sure this 'If we
> > are here, we've succeeded in replacing all the native pages in the page
> > cache with a single hugepage.' comment is even correct.
>
> I think in all these cases we already have a THP and want to force that
> collapse in the page table.
...Yeah, I remember looking at this before and thinking 'it makes sense to force it
here'.
But since we have some sort of random edge cases here probably best to hand wave
'and a few other places' or sth (I see Usama has proposed a commit msg, will
look shortly of course ::)
>
> [...]
>
> > >
> > > Really, we want to ignore sysfs only when we are forcing a collapse
> > > through MADV_COLLAPSE, otherwise we want to enforce.
> >
> > I'd say 'ignoring this edge case, ...'
> >
> > I think the clearest thing might be to literally list the before/after
> > like:
> >
> > * TVA_SMAPS | TVA_ENFORCE_SYSFS -> TVA_SMAPS
> > * TVA_IN_PF | TVA_ENFORCE_SYSFS -> TVA_PAGEFAULT
> > * TVA_ENFORCE_SYSFS -> TVA_KHUGEPAGED
> > * 0 -> TVA_FORCED_COLLAPSE
> >
>
> That makes sense.
Thanks!
>
> > >
> > > With this change, we immediately know if we are in the forced collapse
> > > case, which will be valuable next.
> > >
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > Acked-by: Usama Arif <usamaarif642@gmail.com>
> > > Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >
> > Overall this is a great cleanup, some various nits however.
> >
> > > ---
> > > fs/proc/task_mmu.c | 4 ++--
> > > include/linux/huge_mm.h | 30 ++++++++++++++++++------------
> > > mm/huge_memory.c | 8 ++++----
> > > mm/khugepaged.c | 18 +++++++++---------
> > > mm/memory.c | 14 ++++++--------
> > > 5 files changed, 39 insertions(+), 35 deletions(-)
> > >
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index 3d6d8a9f13fc..d440df7b3d59 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -1293,8 +1293,8 @@ static int show_smap(struct seq_file *m, void *v)
> > > __show_smap(m, &mss, false);
> > >
> > > seq_printf(m, "THPeligible: %8u\n",
> > > - !!thp_vma_allowable_orders(vma, vma->vm_flags,
> > > - TVA_SMAPS | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL));
> > > + !!thp_vma_allowable_orders(vma, vma->vm_flags, TVA_SMAPS,
> > > + THP_ORDERS_ALL));
> >
> > This !! is so gross, wonder if we could have a bool wrapper. But not a big
> > deal.
> >
> > I also sort of _hate_ the smaps flag anyway, invoking this 'allowable
> > orders' thing just for smaps reporting with maybe some minor delta is just
> > odd.
> >
> > Something like `bool vma_has_thp_allowed_orders(struct vm_area_struct
> > *vma);` would be nicer.
> >
> > Anyway thoughts for another time... :)
>
> Yeah, that's not the only nasty bit here ... :)
Indeed, sometimes I get distracted by this... but we'll fix it over time!
>
> >
> > >
> > > if (arch_pkeys_enabled())
> > > seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma));
> > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > index 71db243a002e..b0ff54eee81c 100644
> > > --- a/include/linux/huge_mm.h
> > > +++ b/include/linux/huge_mm.h
> > > @@ -94,12 +94,15 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
> > > #define THP_ORDERS_ALL \
> > > (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_SPECIAL | THP_ORDERS_ALL_FILE_DEFAULT)
> > >
> > > -#define TVA_SMAPS (1 << 0) /* Will be used for procfs */
> >
> > Dumb question, but what does 'TVA' stand for? :P
>
> Whoever came up with that probably used the function name where this is
> passed in
>
> thp_vma_allowable_orders()
Ahhh ok that makes sense, thanks!
'THP VMA Allowable' is kind of funny, like an abbreviation that abbreviates 2
abbreviations :P
>
> >
> > > -#define TVA_IN_PF (1 << 1) /* Page fault handler */
> > > -#define TVA_ENFORCE_SYSFS (1 << 2) /* Obey sysfs configuration */
> > > +enum tva_type {
> > > + TVA_SMAPS, /* Exposing "THPeligible:" in smaps. */
> >
> > How I hate this flag (just an observation...)
> >
> > > + TVA_PAGEFAULT, /* Serving a page fault. */
> > > + TVA_KHUGEPAGED, /* Khugepaged collapse. */
> >
> > This is equivalent to the TVA_ENFORCE_SYSFS case before, sort of a default
> > I guess, but actually quite nice to add the context that it's sourced from
> > khugepaged - I assume this will always be the case when specified?
> >
> > > + TVA_FORCED_COLLAPSE, /* Forced collapse (i.e., MADV_COLLAPSE). */
> >
> > Would put 'e.g.' here, then that allows 'space' for the edge case...
>
> Makes sense.
Thanks!
>
> >
> > > +};
> > >
> > > -#define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
> > > - (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
> > > +#define thp_vma_allowable_order(vma, vm_flags, type, order) \
> > > + (!!thp_vma_allowable_orders(vma, vm_flags, type, BIT(order)))
> >
> > Nit, but maybe worth keeping tva_ prefix - tva_type - here just so it's
> > clear what type it refers to.
> >
> > But not end of the world.
> >
> > Same comment goes for param names below etc.
>
> No strong opinion, but I prefer to drop the prefix when it can be deduced
> from the type and we are inside of the very function that essentially
> defines these types (tva prefix is implicit, no other type applies).
>
> These should probably just be inline functions at some point with proper
> types and doc (separate patch uin the future, of course).
Yeah I mean this is the most petty of petty comments. Just as long as it's
consistent it's ok.
Slightly an aside, but I'm motivated by previous case of passing around VMA
flags as a 'flags' parameter, when in some functions you also have mmap flags
and it suddenly becomes painful to know which is which.
Probably here, given limited scope it's not really likely to be an issue.
>
> [...]
>
> > > +++ b/mm/khugepaged.c
> > > @@ -474,8 +474,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
> > > {
> > > if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
> > > hugepage_pmd_enabled()) {
> > > - if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
> > > - PMD_ORDER))
> > > + if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
> > > __khugepaged_enter(vma->vm_mm);
> > > }
> > > }
> > > @@ -921,7 +920,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > > struct collapse_control *cc)
> > > {
> > > struct vm_area_struct *vma;
> > > - unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> > > + enum tva_type tva_type = cc->is_khugepaged ? TVA_KHUGEPAGED :
> > > + TVA_FORCED_COLLAPSE;
> >
> > This is great, this is so much clearer.
> >
> > A nit though, I mean I come back to my 'type' vs 'tva_type' nit above, this
> > is inconsistent, so we should choose one approach and stick with it.
>
> This is outside of the function, so I would prefer to keep it here, but no
> stong opinion.
I'd rather we be consistent, but this isn't a huge big deal, obviously.
>
> >
> > >
> > > if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> > > return SCAN_ANY_PROCESS;
> > > @@ -932,7 +932,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > >
> > > if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
> > > return SCAN_ADDRESS_RANGE;
> > > - if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
> > > + if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_type, PMD_ORDER))
> > > return SCAN_VMA_CHECK;
> > > /*
> > > * Anon VMA expected, the address may be unmapped then
> > > @@ -1532,9 +1532,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > > * in the page cache with a single hugepage. If a mm were to fault-in
> > > * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
> > > * and map it by a PMD, regardless of sysfs THP settings. As such, let's
> > > - * analogously elide sysfs THP settings here.
> > > + * analogously elide sysfs THP settings here and pretend we are
> > > + * collapsing.
> >
> > I think saying pretending here is potentially confusing, maybe worth saying
> > 'force collapse'?
>
> Makes sense.
Thanks!
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 2/5] mm/huge_memory: convert "tva_flags" to "enum tva_type" for thp_vma_allowable_order*()
2025-07-31 19:20 ` Usama Arif
@ 2025-08-01 10:12 ` Lorenzo Stoakes
0 siblings, 0 replies; 24+ messages in thread
From: Lorenzo Stoakes @ 2025-08-01 10:12 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, linux-fsdevel, corbet, rppt,
surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
linux-doc, kernel-team
On Thu, Jul 31, 2025 at 08:20:18PM +0100, Usama Arif wrote:
[snip]
> >> Signed-off-by: David Hildenbrand <david@redhat.com>
> >> Acked-by: Usama Arif <usamaarif642@gmail.com>
> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >
> > Overall this is a great cleanup, some various nits however.
> >
>
> Thanks for the feedback Lorenzo!
No problem :) I'm glad by the way we've found a solution that serves the
needs you and other's specified while not encountering the issues I raised
concerns with, the approach of extending this interface I think is a good
compromise.
>
> I have modified the commit message to be:
>
> mm/huge_memory: convert "tva_flags" to "enum tva_type"
>
> When determining which THP orders are eligible for a VMA mapping,
> we have previously specified tva_flags, however it turns out it is
> really not necessary to treat these as flags.
>
> Rather, we distinguish between distinct modes.
>
> The only case where we previously combined flags was with
> TVA_ENFORCE_SYSFS, but we can avoid this by observing that this
> is the default, except for MADV_COLLAPSE or an edge cases in
> collapse_pte_mapped_thp() and hugepage_vma_revalidate(), and
> adding a mode specifically for this case - TVA_FORCED_COLLAPSE.
>
> We have:
> * smaps handling for showing "THPeligible"
> * Pagefault handling
> * khugepaged handling
> * Forced collapse handling: primarily MADV_COLLAPSE, but also for
> an edge case in collapse_pte_mapped_thp()
>
> Ignoring the collapse_pte_mapped_thp edgecase, we only want to
I'd say 'disregarding the edge cases,' here as there's also
hugepage_vma_revalidate() above and being really really nitty we say
'ignore' a 2nd time below which reads less well.
> ignore sysfs only when we are forcing a collapse through
I'd say 'ignore sysfs settings' just to be clear.
> MADV_COLLAPSE, otherwise we want to enforce it, hence this patch
> does the following flag to enum conversions:
>
> * TVA_SMAPS | TVA_ENFORCE_SYSFS -> TVA_SMAPS
> * TVA_IN_PF | TVA_ENFORCE_SYSFS -> TVA_PAGEFAULT
> * TVA_ENFORCE_SYSFS -> TVA_KHUGEPAGED
> * 0 -> TVA_FORCED_COLLAPSE
>
> With this change, we immediately know if we are in the forced collapse
> case, which will be valuable next.
Other than nits above this looks really good, thanks!
[snip]
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 2b4ea5a2ce7d..85252b468f80 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
[snip]
> >> @@ -921,7 +920,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> >> struct collapse_control *cc)
> >> {
> >> struct vm_area_struct *vma;
> >> - unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> >> + enum tva_type tva_type = cc->is_khugepaged ? TVA_KHUGEPAGED :
> >> + TVA_FORCED_COLLAPSE;
> >
> > This is great, this is so much clearer.
> >
> > A nit though, I mean I come back to my 'type' vs 'tva_type' nit above, this
> > is inconsistent, so we should choose one approach and stick with it.
> >
>
> I dont exactly like the name "tva" (It has nothing to do with the fact it took
> me more time than I would like to figure out that it meant THP VMA allowable :)),
Hey, dude, at least you worked it out, I had to ask :P
> so what I will do is use "type" everywhere if that is ok?
Sure that's fine, it's not a big deal and I'd rather we just make it
consistent.
> But no strong opinion and can change the variable/macro args to tva_type if that
> is preferred.
>
> The diff over v2 after taking the review comments into account looks quite trivial:
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index b0ff54eee81c..bd4f9e6327e0 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -98,7 +98,7 @@ enum tva_type {
> TVA_SMAPS, /* Exposing "THPeligible:" in smaps. */
> TVA_PAGEFAULT, /* Serving a page fault. */
> TVA_KHUGEPAGED, /* Khugepaged collapse. */
> - TVA_FORCED_COLLAPSE, /* Forced collapse (i.e., MADV_COLLAPSE). */
> + TVA_FORCED_COLLAPSE, /* Forced collapse (e.g. MADV_COLLAPSE). */
> };
>
> #define thp_vma_allowable_order(vma, vm_flags, type, order) \
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 7a54b6f2a346..88cb6339e910 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -920,7 +920,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> struct collapse_control *cc)
> {
> struct vm_area_struct *vma;
> - enum tva_type tva_type = cc->is_khugepaged ? TVA_KHUGEPAGED :
> + enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED :
> TVA_FORCED_COLLAPSE;
>
> if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> @@ -932,7 +932,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>
> if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
> return SCAN_ADDRESS_RANGE;
> - if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_type, PMD_ORDER))
> + if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
> return SCAN_VMA_CHECK;
> /*
> * Anon VMA expected, the address may be unmapped then
> @@ -1532,8 +1532,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> * in the page cache with a single hugepage. If a mm were to fault-in
> * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
> * and map it by a PMD, regardless of sysfs THP settings. As such, let's
> - * analogously elide sysfs THP settings here and pretend we are
> - * collapsing.
> + * analogously elide sysfs THP settings here and force collapse.
> */
> if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
> return SCAN_VMA_CHECK;
Nice that's fine.
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 3/5] mm/huge_memory: treat MADV_COLLAPSE as an advise with PR_THP_DISABLE_EXCEPT_ADVISED
2025-07-31 14:54 ` David Hildenbrand
@ 2025-08-01 10:32 ` Lorenzo Stoakes
2025-08-01 11:26 ` Usama Arif
1 sibling, 0 replies; 24+ messages in thread
From: Lorenzo Stoakes @ 2025-08-01 10:32 UTC (permalink / raw)
To: David Hildenbrand
Cc: Usama Arif, Andrew Morton, linux-mm, linux-fsdevel, corbet, rppt,
surenb, mhocko, hannes, baohua, shakeel.butt, riel, ziy,
laoar.shao, dev.jain, baolin.wang, npache, Liam.Howlett,
ryan.roberts, vbabka, jannh, Arnd Bergmann, sj, linux-kernel,
linux-doc, kernel-team
On Thu, Jul 31, 2025 at 04:54:38PM +0200, David Hildenbrand wrote:
> On 31.07.25 16:38, Lorenzo Stoakes wrote:
> > Nits on subject:
> >
> > - It's >75 chars
>
> No big deal. If we cna come up with something shorter, good.
>
> > - advise is the verb, advice is the noun.
>
> Yeah.
>
> >
> > On Thu, Jul 31, 2025 at 01:27:20PM +0100, Usama Arif wrote:
> > > From: David Hildenbrand <david@redhat.com>
> > >
> > > Let's allow for making MADV_COLLAPSE succeed on areas that neither have
> > > VM_HUGEPAGE nor VM_NOHUGEPAGE when we have THP disabled
> > > unless explicitly advised (PR_THP_DISABLE_EXCEPT_ADVISED).
> >
> > Hmm, I'm not sure about this.
> >
> > So far this prctl() has been the only way to override MADV_COLLAPSE
> > behaviour, but now we're allowing for this one case to not.
>
> This is not an override really. prctl() disallowed MADV_COLLAPSE, but in the
> new mode we don't want that anymore.
Yeah see below, I sort of convinced myself this is OK.
>
> > > I suppose the precedent is that MADV_COLLAPSE overrides 'madvise' sysfs
> > behaviour.
> > > I suppose what saves us here is 'advised' can be read to mean either
> > MADV_HUGEPAGE or MADV_COLLAPSE.
> > > And yes, MADV_COLLAPSE is clearly the user requesting this behaviour.
>
> Exactly.
And really I guess it's logical right? If you MADV_COLLAPSE you are saying
'I want THPs', if you MADV_HUGEPAGE you are saying 'I want THPs', the
difference being on timing.
>
> >
> > I think the vagueness here is one that already existed, because one could
> > perfectly one have expected MADV_COLLAPSE to obey sysfs and require
> > MADV_HUGEPAGE to have been applied, but of course this is not the case.
>
> Yes.
>
> >
> > OK so fine.
> >
> > BUT.
> >
> > I think the MADV_COLLAPSE man page will need to be updated to mention this.
> >
>
> Yes.
>
> > And I REALLY think we should update the THP doc too to mention all these
> > prctl() modes.
> >
> > I'm not sure we cover that right now _at all_ and obviously we should
> > describe the new flags.
> >
> > Usama - can you add a patch to this series to do that?
>
> Good point, let's document the interaction with prctl().
Thanks, I think we haven't even spoken that much about MADV_COLLAPSE in the
docs, somebody had a patch but then it fizzled out.
But that's a separate task, will add to my TODOs in case nobody else picks
up.
>
> >
> > >
> > > MADV_COLLAPSE is a clear advise that we want to collapse.
> >
> > advise -> advice.
> >
> > >
> > > Note that we still respect the VM_NOHUGEPAGE flag, just like
> > > MADV_COLLAPSE always does. So consequently, MADV_COLLAPSE is now only
> > > refused on VM_NOHUGEPAGE with PR_THP_DISABLE_EXCEPT_ADVISED.
> >
> > You also need to mention the shmem change you've made I think.
>
> Yes.
>
> > >>
> > > Co-developed-by: Usama Arif <usamaarif642@gmail.com>
> > > Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > ---
> > > include/linux/huge_mm.h | 8 +++++++-
> > > include/uapi/linux/prctl.h | 2 +-
> > > mm/huge_memory.c | 5 +++--
> > > mm/memory.c | 6 ++++--
> > > mm/shmem.c | 2 +-
> > > 5 files changed, 16 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > index b0ff54eee81c..aeaf93f8ac2e 100644
> > > --- a/include/linux/huge_mm.h
> > > +++ b/include/linux/huge_mm.h
> > > @@ -329,7 +329,7 @@ struct thpsize {
> > > * through madvise or prctl.
> > > */
> > > static inline bool vma_thp_disabled(struct vm_area_struct *vma,
> > > - vm_flags_t vm_flags)
> > > + vm_flags_t vm_flags, bool forced_collapse)
> > > {
> > > /* Are THPs disabled for this VMA? */
> > > if (vm_flags & VM_NOHUGEPAGE)
> > > @@ -343,6 +343,12 @@ static inline bool vma_thp_disabled(struct vm_area_struct *vma,
> > > */
> > > if (vm_flags & VM_HUGEPAGE)
> > > return false;
> > > + /*
> > > + * Forcing a collapse (e.g., madv_collapse), is a clear advise to
> >
> > advise -> advice.
> >
> > > + * use THPs.
> > > + */
> > > + if (forced_collapse)
> > > + return false;
> > > return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
> > > }
> > >
> > > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> > > index 9c1d6e49b8a9..ee4165738779 100644
> > > --- a/include/uapi/linux/prctl.h
> > > +++ b/include/uapi/linux/prctl.h
> > > @@ -185,7 +185,7 @@ struct prctl_mm_map {
> > > #define PR_SET_THP_DISABLE 41
> > > /*
> > > * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
> > > - * VM_HUGEPAGE).
> > > + * VM_HUGEPAGE / MADV_COLLAPSE).
> >
> > This is confusing you're mixing VMA flags with MADV ones... maybe just
> > stick to madvise ones, or add extra context around VM_HUGEPAGE bit?
>
> I don't see anything confusing here, really.
>
> But if it helps you, we can do
> (e.g., MADV_HUGEPAGE / VM_HUGEPAGE, MADV_COLLAPSE).
>
> (reason VM_HUGEPAGE is spelled out is that there might be code where we set
> VM_HUGEPAGE implicitly in the kernel)
Yeah to be clear this is a pretty minor point, and I see why we'd want to
mention VM_HUGEPAGE explicitly, as it is in fact this that overrides THP
being disabled for the process.
>
> >
> > Would need to be fixed up in a prior commit obviously.
> >
> > > */
> > > # define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
> > > #define PR_GET_THP_DISABLE 42
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 85252b468f80..ef5ccb0ec5d5 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -104,7 +104,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> > > {
> > > const bool smaps = type == TVA_SMAPS;
> > > const bool in_pf = type == TVA_PAGEFAULT;
> > > - const bool enforce_sysfs = type != TVA_FORCED_COLLAPSE;
> > > + const bool forced_collapse = type == TVA_FORCED_COLLAPSE;
> > > + const bool enforce_sysfs = !forced_collapse;
> >
> > Can we just get rid of this enforce_sysfs altogether in patch 2/5 and use
> > forced_collapse?
>
> Let's do that as a separate cleanup on top. I want least churn in that
> patch.
>
> (had the same idea while writing that patch, but I have other things to
> focus on than cleaning up all this mess)
Ack, I'm fine with that.
>
> --
> Cheers,
>
> David / dhildenb
>
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 3/5] mm/huge_memory: treat MADV_COLLAPSE as an advise with PR_THP_DISABLE_EXCEPT_ADVISED
2025-07-31 14:54 ` David Hildenbrand
2025-08-01 10:32 ` Lorenzo Stoakes
@ 2025-08-01 11:26 ` Usama Arif
1 sibling, 0 replies; 24+ messages in thread
From: Usama Arif @ 2025-08-01 11:26 UTC (permalink / raw)
To: David Hildenbrand, Lorenzo Stoakes
Cc: Andrew Morton, linux-mm, linux-fsdevel, corbet, rppt, surenb,
mhocko, hannes, baohua, shakeel.butt, riel, ziy, laoar.shao,
dev.jain, baolin.wang, npache, Liam.Howlett, ryan.roberts, vbabka,
jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team
On 31/07/2025 15:54, David Hildenbrand wrote:
> On 31.07.25 16:38, Lorenzo Stoakes wrote:
>> Nits on subject:
>>
>> - It's >75 chars
>
> No big deal. If we cna come up with something shorter, good.
Changed it to "mm/huge_memory: respect MADV_COLLAPSE with PR_THP_DISABLE_EXCEPT_ADVISED"
for the next revision. That would be 73 chars :)
>
>> - advise is the verb, advice is the noun.
>
> Yeah.
>
>>
>> On Thu, Jul 31, 2025 at 01:27:20PM +0100, Usama Arif wrote:
>>> From: David Hildenbrand <david@redhat.com>
>>>
>>> Let's allow for making MADV_COLLAPSE succeed on areas that neither have
>>> VM_HUGEPAGE nor VM_NOHUGEPAGE when we have THP disabled
>>> unless explicitly advised (PR_THP_DISABLE_EXCEPT_ADVISED).
>>
>> Hmm, I'm not sure about this.
>>
>> So far this prctl() has been the only way to override MADV_COLLAPSE
>> behaviour, but now we're allowing for this one case to not.
>
> This is not an override really. prctl() disallowed MADV_COLLAPSE, but in the new mode we don't want that anymore.
>
>> > I suppose the precedent is that MADV_COLLAPSE overrides 'madvise' sysfs
>> behaviour.
>> > I suppose what saves us here is 'advised' can be read to mean either
>> MADV_HUGEPAGE or MADV_COLLAPSE.
>> > And yes, MADV_COLLAPSE is clearly the user requesting this behaviour.
>
> Exactly.
>
>>
>> I think the vagueness here is one that already existed, because one could
>> perfectly one have expected MADV_COLLAPSE to obey sysfs and require
>> MADV_HUGEPAGE to have been applied, but of course this is not the case.
>
> Yes.
>
>>
>> OK so fine.
>>
>> BUT.
>>
>> I think the MADV_COLLAPSE man page will need to be updated to mention this.
>>
>
> Yes.
>
Thanks, yes will do and send this along with changes to prctl man page after this
makes into mm-stable.
>> And I REALLY think we should update the THP doc too to mention all these
>> prctl() modes.
>>
>> I'm not sure we cover that right now _at all_ and obviously we should
>> describe the new flags.
>>
>> Usama - can you add a patch to this series to do that?
>
> Good point, let's document the interaction with prctl().
>
I have added the following patch for the next revision. I know that a lot
of this will be in the man page as well, but I have gone the way of being very
very explicit of what are all the possible calls that can be made (hopefully thats
the right approach :))
commit 5f290d29741a514d0861d0f99c8b860ba6af9c37
Author: Usama Arif <usamaarif642@gmail.com>
Date: Fri Aug 1 12:05:49 2025 +0100
docs: transhuge: document process level THP controls
This includes the PR_SET_THP_DISABLE/PR_GET_THP_DISABLE pair of
prctl calls as well the newly introduced PR_THP_DISABLE_EXCEPT_ADVISED
flag for the PR_SET_THP_DISABLE prctl call.
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 370fba113460..cce0a99beac8 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -225,6 +225,45 @@ to "always" or "madvise"), and it'll be automatically shutdown when
PMD-sized THP is disabled (when both the per-size anon control and the
top-level control are "never")
+process THP controls
+--------------------
+
+A process can control its own THP behaviour using the ``PR_SET_THP_DISABLE``
+and ``PR_GET_THP_DISABLE`` pair of prctl(2) calls. These calls support the
+following arguments::
+
+ prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0):
+ This will set the MMF_DISABLE_THP_COMPLETELY mm flag which will
+ result in no THPs being faulted in or collapsed, irrespective
+ of global THP controls. This flag and hence the behaviour is
+ inherited across fork(2) and execve(2).
+
+ prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, 0, 0):
+ This will set the MMF_DISABLE_THP_EXCEPT_ADVISED mm flag which
+ will result in THPs being faulted in or collapsed only for
+ the following cases:
+ - Global THP controls are set to "always" or "madvise" and
+ the process has madvised the region with either MADV_HUGEPAGE
+ or MADV_COLLAPSE.
+ - Global THP controls is set to "never" and the process has
+ madvised the region with MADV_COLLAPSE.
+ This flag and hence the behaviour is inherited across fork(2)
+ and execve(2).
+
+ prctl(PR_SET_THP_DISABLE, 0, 0, 0, 0):
+ This will clear the MMF_DISABLE_THP_COMPLETELY and
+ MMF_DISABLE_THP_EXCEPT_ADVISED mm flags. The process will
+ behave according to the global THP controls. This behaviour
+ will be inherited across fork(2) and execve(2).
+
+ prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0):
+ This will return the THP disable mm flag status of the process
+ that was set by prctl(PR_SET_THP_DISABLE, ...).
+ i.e.
+ - 1 if MMF_DISABLE_THP_COMPLETELY flag is set
+ - 3 if MMF_DISABLE_THP_EXCEPT_ADVISED flag is set
+ - 0 otherwise.
+
Khugepaged controls
-------------------
>>
>>>
>>> MADV_COLLAPSE is a clear advise that we want to collapse.
>>
>> advise -> advice.
>>
>>>
>>> Note that we still respect the VM_NOHUGEPAGE flag, just like
>>> MADV_COLLAPSE always does. So consequently, MADV_COLLAPSE is now only
>>> refused on VM_NOHUGEPAGE with PR_THP_DISABLE_EXCEPT_ADVISED.
>>
>> You also need to mention the shmem change you've made I think.
>
> Yes.
>
>> >>
>>> Co-developed-by: Usama Arif <usamaarif642@gmail.com>
>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>> include/linux/huge_mm.h | 8 +++++++-
>>> include/uapi/linux/prctl.h | 2 +-
>>> mm/huge_memory.c | 5 +++--
>>> mm/memory.c | 6 ++++--
>>> mm/shmem.c | 2 +-
>>> 5 files changed, 16 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index b0ff54eee81c..aeaf93f8ac2e 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -329,7 +329,7 @@ struct thpsize {
>>> * through madvise or prctl.
>>> */
>>> static inline bool vma_thp_disabled(struct vm_area_struct *vma,
>>> - vm_flags_t vm_flags)
>>> + vm_flags_t vm_flags, bool forced_collapse)
>>> {
>>> /* Are THPs disabled for this VMA? */
>>> if (vm_flags & VM_NOHUGEPAGE)
>>> @@ -343,6 +343,12 @@ static inline bool vma_thp_disabled(struct vm_area_struct *vma,
>>> */
>>> if (vm_flags & VM_HUGEPAGE)
>>> return false;
>>> + /*
>>> + * Forcing a collapse (e.g., madv_collapse), is a clear advise to
>>
>> advise -> advice.
>>
>>> + * use THPs.
>>> + */
>>> + if (forced_collapse)
>>> + return false;
>>> return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
>>> }
>>>
>>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>>> index 9c1d6e49b8a9..ee4165738779 100644
>>> --- a/include/uapi/linux/prctl.h
>>> +++ b/include/uapi/linux/prctl.h
>>> @@ -185,7 +185,7 @@ struct prctl_mm_map {
>>> #define PR_SET_THP_DISABLE 41
>>> /*
>>> * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
>>> - * VM_HUGEPAGE).
>>> + * VM_HUGEPAGE / MADV_COLLAPSE).
>>
>> This is confusing you're mixing VMA flags with MADV ones... maybe just
>> stick to madvise ones, or add extra context around VM_HUGEPAGE bit?
>
> I don't see anything confusing here, really.
>
> But if it helps you, we can do
> (e.g., MADV_HUGEPAGE / VM_HUGEPAGE, MADV_COLLAPSE).
>
> (reason VM_HUGEPAGE is spelled out is that there might be code where we set VM_HUGEPAGE implicitly in the kernel)
>
>>
>> Would need to be fixed up in a prior commit obviously.
>>
>>> */
>>> # define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
>>> #define PR_GET_THP_DISABLE 42
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 85252b468f80..ef5ccb0ec5d5 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -104,7 +104,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>>> {
>>> const bool smaps = type == TVA_SMAPS;
>>> const bool in_pf = type == TVA_PAGEFAULT;
>>> - const bool enforce_sysfs = type != TVA_FORCED_COLLAPSE;
>>> + const bool forced_collapse = type == TVA_FORCED_COLLAPSE;
>>> + const bool enforce_sysfs = !forced_collapse;
>>
>> Can we just get rid of this enforce_sysfs altogether in patch 2/5 and use
>> forced_collapse?
>
> Let's do that as a separate cleanup on top. I want least churn in that patch.
>
> (had the same idea while writing that patch, but I have other things to focus on than cleaning up all this mess)
>
I am happy to send this cleanup once this series makes it to mm-new and a new revision is not expected.
Thanks!
Usama
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH v2 4/5] selftests: prctl: introduce tests for disabling THPs completely
2025-07-31 19:42 ` David Hildenbrand
@ 2025-08-01 11:42 ` Usama Arif
2025-08-01 12:53 ` David Hildenbrand
0 siblings, 1 reply; 24+ messages in thread
From: Usama Arif @ 2025-08-01 11:42 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, linux-mm
Cc: linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts, vbabka,
jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team
On 31/07/2025 20:42, David Hildenbrand wrote:
> On 31.07.25 14:27, Usama Arif wrote:
>> The test will set the global system THP setting to never, madvise
>> or always depending on the fixture variant and the 2M setting to
>> inherit before it starts (and reset to original at teardown).
>>
>> This tests if the process can:
>> - successfully set and get the policy to disable THPs completely.
>> - never get a hugepage when the THPs are completely disabled
>> with the prctl, including with MADV_HUGE and MADV_COLLAPSE.
>> - successfully reset the policy of the process.
>> - after reset, only get hugepages with:
>> - MADV_COLLAPSE when policy is set to never.
>> - MADV_HUGE and MADV_COLLAPSE when policy is set to madvise.
>> - always when policy is set to "always".
>> - repeat the above tests in a forked process to make sure
>> the policy is carried across forks.
>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> ---
>
> [...]
>
> Looks much better already. Some quirks.
>
>> +#include <stdio.h>
>> +#include <stdlib.h>
>> +#include <string.h>
>> +#include <unistd.h>
>> +#include <sys/mman.h>
>> +#include <sys/prctl.h>
>> +#include <sys/wait.h>
>> +
>> +#include "../kselftest_harness.h"
>> +#include "thp_settings.h"
>> +#include "vm_util.h"
>> +
>> +static int sz2ord(size_t size, size_t pagesize)
>> +{
>> + return __builtin_ctzll(size / pagesize);
>> +}
>> +
>> +enum thp_collapse_type {
>> + THP_COLLAPSE_NONE,
>> + THP_COLLAPSE_MADV_HUGEPAGE, /* MADV_HUGEPAGE before access */
>> + THP_COLLAPSE_MADV_COLLAPSE, /* MADV_COLLAPSE after access */
>> +};
>> +
>> +enum thp_policy {
>> + THP_POLICY_NEVER,
>> + THP_POLICY_MADVISE,
>> + THP_POLICY_ALWAYS,
>> +};
>
> Couldn't you have reused "enum thp_enabled" end simply never specified the "THP_INHERIT"? Then, you need to do less translation.
yes, introducing this enum was silly. Have removed it for next revision.>
>> +
>> +struct test_results {
>> + int prctl_get_thp_disable;
>
> The result is always one, does that here make sense?
Its 3 in the next patch for PR_THP_DISABLE_EXCEPT_ADVISED :)
I will remove this struct, but I think maybe it might have been a good idea to squash this
with the next patch to show why the struct was useful.
>
>> + int prctl_applied_collapse_none;
>
> "prctl_applied" is a bit confusing. And most of these always have the same value.
>
> Can't we special case the remaining two cases on the current policy and avoid this struct compeltely?
>
The values are different in the next patch when PR_THP_DISABLE_EXCEPT_ADVISED is used.
Just to explain how I came about using this struct test_results (though it doesnt matter as
I will remove it for the next revision):
I wanted to maximise code reuse and only wanted to have one instance of prctl_thp_disable_test.
I actually started with special casing, but went the brute force way of adding too many if else
statements and it was looking quite messy after I added the tests for PR_THP_DISABLE_EXCEPT_ADVISED.
I saw this struct test_results in another kselftest and thought this should make it much better and
extendable.
I have removed struct test_results and changed prctl_thp_disable_test to the following for next revision:
static void prctl_thp_disable_test(struct __test_metadata *const _metadata,
size_t pmdsize, enum thp_enabled thp_policy,
int prctl_flags)
{
ASSERT_EQ(prctl(PR_GET_THP_DISABLE, NULL, NULL, NULL, NULL),
prctl_flags & PR_THP_DISABLE_EXCEPT_ADVISED ? 3 : 1);
/* tests after prctl overrides global policy */
ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_NONE, pmdsize), 0);
ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_HUGEPAGE, pmdsize),
thp_policy == THP_NEVER || !prctl_flags ? 0 : 1);
ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_COLLAPSE, pmdsize),
!prctl_flags ? 0 : 1);
/* Reset to global policy */
ASSERT_EQ(prctl(PR_SET_THP_DISABLE, 0, NULL, NULL, NULL), 0);
/* tests after prctl is cleared, and only global policy is effective */
ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_NONE, pmdsize),
thp_policy == THP_ALWAYS ? 1 : 0);
ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_HUGEPAGE, pmdsize),
thp_policy == THP_NEVER ? 0 : 1);
ASSERT_EQ(test_mmap_thp(THP_COLLAPSE_MADV_COLLAPSE, pmdsize), 1);
}
>
>> + int prctl_applied_collapse_madv_huge;
>> + int prctl_applied_collapse_madv_collapse;
>> + int prctl_removed_collapse_none;
>> + int prctl_removed_collapse_madv_huge;
>> + int prctl_removed_collapse_madv_collapse;
>> +};
>> +
>> +/*
>> + * Function to mmap a buffer, fault it in, madvise it appropriately (before
>> + * page fault for MADV_HUGE, and after for MADV_COLLAPSE), and check if the
>> + * mmap region is huge.
>> + * Returns:
>> + * 0 if test doesn't give hugepage
>> + * 1 if test gives a hugepage
>> + * -errno if mmap fails
>> + */
>> +static int test_mmap_thp(enum thp_collapse_type madvise_buf, size_t pmdsize)
>> +{
>> + char *mem, *mmap_mem;
>> + size_t mmap_size;
>> + int ret;
>> +
>> + /* For alignment purposes, we need twice the THP size. */
>> + mmap_size = 2 * pmdsize;
>> + mmap_mem = (char *)mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
>> + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> + if (mmap_mem == MAP_FAILED)
>> + return -errno;
>> +
>> + /* We need a THP-aligned memory area. */
>> + mem = (char *)(((uintptr_t)mmap_mem + pmdsize) & ~(pmdsize - 1));
>> +
>> + if (madvise_buf == THP_COLLAPSE_MADV_HUGEPAGE)
>> + madvise(mem, pmdsize, MADV_HUGEPAGE);
>> +
>> + /* Ensure memory is allocated */
>> + memset(mem, 1, pmdsize);
>> +
>> + if (madvise_buf == THP_COLLAPSE_MADV_COLLAPSE)
>> + madvise(mem, pmdsize, MADV_COLLAPSE);
>> +
>
> To avoid even mmap_mem to get merged with some other VMA, maybe just do
> before reading the smap here:
>
> /* HACK: make sure we have a separate VMA that we can check reliably. */
> mprotect(mem, pmdsize, PROT_READ);
>
Thanks! Yeah this is a nice hack, have used it in the next revision.
> or
>
> madvise(mem, pmdsize, MADV_DONTFORK);
>
> before reading smaps.
>
> That is probably the easiest approach. The you can drop the lengthy comment and perform a single thp check.
>
>
[...]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 4/5] selftests: prctl: introduce tests for disabling THPs completely
2025-08-01 11:42 ` Usama Arif
@ 2025-08-01 12:53 ` David Hildenbrand
0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2025-08-01 12:53 UTC (permalink / raw)
To: Usama Arif, Andrew Morton, linux-mm
Cc: linux-fsdevel, corbet, rppt, surenb, mhocko, hannes, baohua,
shakeel.butt, riel, ziy, laoar.shao, dev.jain, baolin.wang,
npache, lorenzo.stoakes, Liam.Howlett, ryan.roberts, vbabka,
jannh, Arnd Bergmann, sj, linux-kernel, linux-doc, kernel-team
>>> +
>>> +struct test_results {
>>> + int prctl_get_thp_disable;
>>
>> The result is always one, does that here make sense?
>
> Its 3 in the next patch for PR_THP_DISABLE_EXCEPT_ADVISED :)
>
> I will remove this struct, but I think maybe it might have been a good idea to squash this
> with the next patch to show why the struct was useful.
I think it's reasonable to keep them separate.
>
>>
>>> + int prctl_applied_collapse_none;
>>
>> "prctl_applied" is a bit confusing. And most of these always have the same value.
>>
>> Can't we special case the remaining two cases on the current policy and avoid this struct compeltely?
>>
>
> The values are different in the next patch when PR_THP_DISABLE_EXCEPT_ADVISED is used.
>
> Just to explain how I came about using this struct test_results (though it doesnt matter as
> I will remove it for the next revision):
> I wanted to maximise code reuse and only wanted to have one instance of prctl_thp_disable_test.
> I actually started with special casing, but went the brute force way of adding too many if else
> statements and it was looking quite messy after I added the tests for PR_THP_DISABLE_EXCEPT_ADVISED.
> I saw this struct test_results in another kselftest and thought this should make it much better and
> extendable.
>
> I have removed struct test_results and changed prctl_thp_disable_test to the following for next revision:
Yeah, or just duplicate that function and call it
prctl_thp_disable_unless_advised_test() in the next patch.
Makes the code easier to read and the duplication is limited.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2025-08-01 12:53 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-31 12:27 [PATCH v2 0/5] prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised Usama Arif
2025-07-31 12:27 ` [PATCH v2 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE Usama Arif
2025-07-31 12:40 ` Lorenzo Stoakes
2025-07-31 13:12 ` Usama Arif
2025-07-31 13:18 ` Lorenzo Stoakes
2025-07-31 13:20 ` David Hildenbrand
2025-07-31 15:13 ` Zi Yan
2025-07-31 12:27 ` [PATCH v2 2/5] mm/huge_memory: convert "tva_flags" to "enum tva_type" for thp_vma_allowable_order*() Usama Arif
2025-07-31 14:00 ` Lorenzo Stoakes
2025-07-31 15:19 ` Zi Yan
2025-07-31 16:15 ` David Hildenbrand
2025-08-01 10:08 ` Lorenzo Stoakes
2025-07-31 19:20 ` Usama Arif
2025-08-01 10:12 ` Lorenzo Stoakes
2025-07-31 12:27 ` [PATCH v2 3/5] mm/huge_memory: treat MADV_COLLAPSE as an advise with PR_THP_DISABLE_EXCEPT_ADVISED Usama Arif
2025-07-31 14:38 ` Lorenzo Stoakes
2025-07-31 14:54 ` David Hildenbrand
2025-08-01 10:32 ` Lorenzo Stoakes
2025-08-01 11:26 ` Usama Arif
2025-07-31 12:27 ` [PATCH v2 4/5] selftests: prctl: introduce tests for disabling THPs completely Usama Arif
2025-07-31 19:42 ` David Hildenbrand
2025-08-01 11:42 ` Usama Arif
2025-08-01 12:53 ` David Hildenbrand
2025-07-31 12:27 ` [PATCH v2 5/5] selftests: prctl: introduce tests for disabling THPs except for madvise Usama Arif
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).