linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
@ 2025-07-21  9:09 David Hildenbrand
  2025-07-21 10:10 ` David Hildenbrand
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: David Hildenbrand @ 2025-07-21  9:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-fsdevel, linux-doc, David Hildenbrand,
	Jonathan Corbet, Andrew Morton, Lorenzo Stoakes, Zi Yan,
	Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Usama Arif, SeongJae Park, Jann Horn, Yafang Shao,
	Matthew Wilcox

People want to make use of more THPs, for example, moving from
THP=never to THP=madvise, or from THP=madvise to THP=never.

While this is great news for every THP desperately waiting to get
allocated out there, apparently there are some workloads that require a
bit of care during that transition: once problems are detected, these
workloads should be started with the old behavior, without making all
other workloads on the system go back to the old behavior as well.

In essence, the following scenarios are imaginable:

(1) Switch from THP=none to THP=madvise or THP=always, but keep the old
    behavior (no THP) for selected workloads.

(2) Stay at THP=none, but have "madvise" or "always" behavior for
    selected workloads.

(3) Switch from THP=madvise to THP=always, but keep the old behavior
    (THP only when advised) for selected workloads.

(4) Stay at THP=madvise, but have "always" behavior for selected
    workloads.

In essence, (2) can be emulated through (1), by setting THP!=none while
disabling THPs for all processes that don't want THPs. It requires
configuring all workloads, but that is a user-space problem to sort out.

(4) can be emulated through (3) in a similar way.

Back when (1) was relevant in the past, as people started enabling THPs,
we added PR_SET_THP_DISABLE, so relevant workloads that were not ready
yet (i.e., used by Redis) were able to just disable THPs completely. Redis
still implements the option to use this interface to disable THPs
completely.

With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
workload -- a process, including fork+exec'ed process hierarchy.
That essentially made us support (1): simply disable THPs for all workloads
that are not ready for THPs yet, while still enabling THPs system-wide.

The quest for handling (3) and (4) started, but current approaches
(completely new prctl, options to set other policies per processm,
 alternatives to prctl -- mctrl, cgroup handling) don't look particularly
promising. Likely, the future will use bpf or something similar to
implement better policies, in particular to also make better decisions
about THP sizes to use, but this will certainly take a while as that work
just started.

Long story short: a simple enable/disable is not really suitable for the
future, so we're not willing to add completely new toggles.

While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
completely for these processes, this scares many THPs in our system
because they could no longer get allocated where they used to be allocated
for: regions flagged as VM_HUGEPAGE. Apparently, that imposes a
problem for relevant workloads, because "not THPs" is certainly worse
than "THPs only when advised".

Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not
explicitly advised by the app through MAD_HUGEPAGE"? *maybe*, but this
would change the documented semantics quite a bit, and the versatility
to use it for debugging purposes, so I am not 100% sure that is what we
want -- although it would certainly be much easier.

So instead, as an easy way forward for (3) and (4), an option to
make PR_SET_THP_DISABLE disable *less* THPs for a process.

In essence, this patch:

(A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
    of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).

    For now, arg3 was not allowed to be set (-EINVAL). Now it holds
    flags.

(B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
    PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.

    For now, it would return 1 if THPs were disabled completely. Now
    it essentially returns the set flags as well.

(C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
    the semantics clearly.

    Fortunately, there are only two instances outside of prctl() code.

(D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
    with VM_HUGEPAGE" -- essentially "thp=madvise" behavior

    Fortunately, we only have to extend vma_thp_disabled().

(E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are not
    disabled completely

    Only indicating that THPs are disabled when they are really disabled
    completely, not only partially.

The documented semantics in the man page for PR_SET_THP_DISABLE
"is inherited by a child created via fork(2) and is preserved across
execve(2)" is maintained. This behavior, for example, allows for
disabling THPs for a workload through the launching process (e.g.,
systemd where we fork() a helper process to then exec()).

There is currently not way to prevent that a process will not issue
PR_SET_THP_DISABLE itself to re-enable THP. We could add a "seal" option
to PR_SET_THP_DISABLE through another flag if ever required. The known
users (such as redis) really use PR_SET_THP_DISABLE to disable THPs, so
that is not added for now.

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Hildenbrand <david@redhat.com>

---

At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
think there might be real use cases where we want to disable any THPs --
in particular also around debugging THP-related problems, and
"THP=never" not meaning ... "never" anymore. PR_SET_THP_DISABLE will
also block MADV_COLLAPSE, which can be very helpful. Of course, I thought
of having a system-wide config to change PR_SET_THP_DISABLE behavior, but
I just don't like the semantics.

"prctl: allow overriding system THP policy to always"[1] proposed
"overriding policies to always", which is just the wrong way around: we
should not add mechanisms to "enable more" when we already have an
interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
weird otherwise.

"[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed
setting the default of the VM_HUGEPAGE, which is similarly the wrong way
around I think now.

The proposals by Lorenzo to extend process_madvise()[3] and mctrl()[4]
similarly were around the "default for VM_HUGEPAGE" idea, but after the
discussion, I think we should better leave VM_HUGEPAGE untouched.

Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
we essentially want to say "leave advised regions alone" -- "keep THP
enabled for advised regions",

The only thing I really dislike about this is using another MMF_* flag,
but well, no way around it -- and seems like we could easily support
more than 32 if we want to, or storing this thp information elsewhere.

I think this here (modifying an existing toggle) is the only prctl()
extension that we might be willing to accept. In general, I agree like
most others, that prctl() is a very bad interface for that -- but
PR_SET_THP_DISABLE is already there and is getting used.

Long-term, I think the answer will be something based on bpf[5]. Maybe
in that context, I there could still be value in easily disabling THPs for
selected workloads (esp. debugging purposes).

Jann raised valid concerns[6] about new flags that are persistent across
exec[6]. As this here is a relaxation to existing PR_SET_THP_DISABLE I
consider it having a similar security risk as our existing
PR_SET_THP_DISABLE, but devil is in the detail.

This is *completely* untested and might be utterly broken. It merely
serves as a PoC of what I think could be done. If this ever goes upstream,
we need some kselftests for it, and extensive tests.

[1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
[2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
[3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
[4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
[5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
[6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com

---
 Documentation/filesystems/proc.rst |  5 +--
 fs/proc/array.c                    |  2 +-
 include/linux/huge_mm.h            | 20 ++++++++---
 include/linux/mm_types.h           | 13 +++----
 include/uapi/linux/prctl.h         |  7 ++++
 kernel/sys.c                       | 58 +++++++++++++++++++++++-------
 mm/khugepaged.c                    |  2 +-
 7 files changed, 78 insertions(+), 29 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2971551b72353..915a3e44bc120 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -291,8 +291,9 @@ It's slow but very precise.
  HugetlbPages                size of hugetlb memory portions
  CoreDumping                 process's memory is currently being dumped
                              (killing the process may lead to a corrupted core)
- THP_enabled		     process is allowed to use THP (returns 0 when
-			     PR_SET_THP_DISABLE is set on the process
+ THP_enabled                 process is allowed to use THP (returns 0 when
+                             PR_SET_THP_DISABLE is set on the process to disable
+                             THP completely, not just partially)
  Threads                     number of threads
  SigQ                        number of signals queued/max. number for queue
  SigPnd                      bitmap of pending signals for the thread
diff --git a/fs/proc/array.c b/fs/proc/array.c
index d6a0369caa931..c4f91a784104f 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -422,7 +422,7 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
 	bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
 
 	if (thp_enabled)
-		thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
+		thp_enabled = !test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
 	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
 }
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e0a27f80f390d..c4127104d9bc3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -323,16 +323,26 @@ struct thpsize {
 	(transparent_hugepage_flags &					\
 	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
 
+/*
+ * Check whether THPs are explicitly disabled through madvise or prctl, or some
+ * architectures may disable THP for some mappings, for example, s390 kvm.
+ */
 static inline bool vma_thp_disabled(struct vm_area_struct *vma,
 		vm_flags_t vm_flags)
 {
+	/* Are THPs disabled for this VMA? */
+	if (vm_flags & VM_NOHUGEPAGE)
+		return true;
+	/* Are THPs disabled for all VMAs in the whole process? */
+	if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags))
+		return true;
 	/*
-	 * Explicitly disabled through madvise or prctl, or some
-	 * architectures may disable THP for some mappings, for
-	 * example, s390 kvm.
+	 * Are THPs disabled only for VMAs where we didn't get an explicit
+	 * advise to use them?
 	 */
-	return (vm_flags & VM_NOHUGEPAGE) ||
-	       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
+	if (vm_flags & VM_HUGEPAGE)
+		return false;
+	return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
 }
 
 static inline bool thp_disabled_by_hw(void)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1ec273b066915..a999f2d352648 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1743,19 +1743,16 @@ enum {
 #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
 #define MMF_VM_HUGEPAGE		17	/* set when mm is available for khugepaged */
 
-/*
- * This one-shot flag is dropped due to necessity of changing exe once again
- * on NFS restore
- */
-//#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
+#define MMF_HUGE_ZERO_PAGE	18      /* mm has ever used the global huge zero page */
 
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
 #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
-#define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
-#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
-#define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
+#define MMF_DISABLE_THP_EXCEPT_ADVISED	23	/* no THP except for VMAs with VM_HUGEPAGE */
+#define MMF_DISABLE_THP_COMPLETELY	24	/* no THP for all VMAs */
+#define MMF_DISABLE_THP_MASK	((1 << MMF_DISABLE_THP_COMPLETELY) |\
+				 (1 << MMF_DISABLE_THP_EXCEPT_ADVISED))
 #define MMF_OOM_REAP_QUEUED	25	/* mm was queued for oom_reaper */
 #define MMF_MULTIPROCESS	26	/* mm is shared between processes */
 /*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 43dec6eed559a..1949bb9270d48 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -177,7 +177,14 @@ struct prctl_mm_map {
 
 #define PR_GET_TID_ADDRESS	40
 
+/*
+ * Flags for PR_SET_THP_DISABLE are only applicable when disabling. Bit 0
+ * is reserved, so PR_GET_THP_DISABLE can return 1 when no other flags were
+ * specified for PR_SET_THP_DISABLE.
+ */
 #define PR_SET_THP_DISABLE	41
+/* Don't disable THPs when explicitly advised (MADV_HUGEPAGE / VM_HUGEPAGE). */
+# define PR_THP_DISABLE_EXCEPT_ADVISED	(1 << 1)
 #define PR_GET_THP_DISABLE	42
 
 /*
diff --git a/kernel/sys.c b/kernel/sys.c
index b153fb345ada2..2a34b2f708900 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2423,6 +2423,50 @@ static int prctl_get_auxv(void __user *addr, unsigned long len)
 	return sizeof(mm->saved_auxv);
 }
 
+static int prctl_get_thp_disable(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5)
+{
+	unsigned long *mm_flags = &current->mm->flags;
+
+	if (arg2 || arg3 || arg4 || arg5)
+		return -EINVAL;
+
+	if (test_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags))
+		return 1;
+	else if (test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags))
+		return 1 | PR_THP_DISABLE_EXCEPT_ADVISED;
+	return 0;
+}
+
+static int prctl_set_thp_disable(unsigned long thp_disable, unsigned long flags,
+				 unsigned long arg4, unsigned long arg5)
+{
+	unsigned long *mm_flags = &current->mm->flags;
+
+	if (arg4 || arg5)
+		return -EINVAL;
+
+	/* Flags are only allowed when disabling. */
+	if (!thp_disable || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
+		return -EINVAL;
+	if (mmap_write_lock_killable(current->mm))
+		return -EINTR;
+	if (thp_disable) {
+		if (flags & PR_THP_DISABLE_EXCEPT_ADVISED) {
+			clear_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
+			set_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
+		} else {
+			set_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
+			clear_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
+		}
+	} else {
+		clear_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
+		clear_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
+	}
+	mmap_write_unlock(current->mm);
+	return 0;
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2596,20 +2640,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			return -EINVAL;
 		return task_no_new_privs(current) ? 1 : 0;
 	case PR_GET_THP_DISABLE:
-		if (arg2 || arg3 || arg4 || arg5)
-			return -EINVAL;
-		error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
+		error = prctl_get_thp_disable(arg2, arg3, arg4, arg5);
 		break;
 	case PR_SET_THP_DISABLE:
-		if (arg3 || arg4 || arg5)
-			return -EINVAL;
-		if (mmap_write_lock_killable(me->mm))
-			return -EINTR;
-		if (arg2)
-			set_bit(MMF_DISABLE_THP, &me->mm->flags);
-		else
-			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
-		mmap_write_unlock(me->mm);
+		error = prctl_set_thp_disable(arg2, arg3, arg4, arg5);
 		break;
 	case PR_MPX_ENABLE_MANAGEMENT:
 	case PR_MPX_DISABLE_MANAGEMENT:
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8a5873d0a23a7..a685077644b4e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -427,7 +427,7 @@ static inline int collapse_test_exit(struct mm_struct *mm)
 static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
 {
 	return collapse_test_exit(mm) ||
-	       test_bit(MMF_DISABLE_THP, &mm->flags);
+	       test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
 }
 
 static bool hugepage_enabled(void)

base-commit: 760b462b3921c5dc8bfa151d2d27a944e4e96081
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
  2025-07-21  9:09 [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE David Hildenbrand
@ 2025-07-21 10:10 ` David Hildenbrand
  2025-07-21 11:28 ` Lorenzo Stoakes
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2025-07-21 10:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-fsdevel, linux-doc, Jonathan Corbet,
	Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Usama Arif, SeongJae Park, Jann Horn, Yafang Shao, Matthew Wilcox

Two additions:

> 
> (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are not
>      disabled completely


As raised off-list, this should be "only if THPs are disabled completely"

> 
>      Only indicating that THPs are disabled when they are really disabled
>      completely, not only partially.
 > > The documented semantics in the man page for PR_SET_THP_DISABLE
> "is inherited by a child created via fork(2) and is preserved across
> execve(2)" is maintained. This behavior, for example, allows for
> disabling THPs for a workload through the launching process (e.g.,
> systemd where we fork() a helper process to then exec()).
> 
> There is currently not way to prevent that a process will not issue
> PR_SET_THP_DISABLE itself to re-enable THP. We could add a "seal" option
> to PR_SET_THP_DISABLE through another flag if ever required. The known
> users (such as redis) really use PR_SET_THP_DISABLE to disable THPs, so
> that is not added for now.

I don't think there is any user that would try re-enabling THPs through 
that interface. It's kind-of against the original purpose (man page): 
"Setting  this  flag  provides a method for disabling transparent huge 
pages for jobs where the code cannot be modified ..."

So if ever really required, one could investigate forbidding re-enabling 
once disabled. But that obviously needs more investigation.

(also, if a workload ever enables THPs through that mechanism, probably 
it is to blame)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
  2025-07-21  9:09 [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE David Hildenbrand
  2025-07-21 10:10 ` David Hildenbrand
@ 2025-07-21 11:28 ` Lorenzo Stoakes
  2025-07-21 11:45   ` David Hildenbrand
  2025-07-21 14:32 ` Usama Arif
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Lorenzo Stoakes @ 2025-07-21 11:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-fsdevel, linux-doc, Jonathan Corbet,
	Andrew Morton, Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Usama Arif,
	SeongJae Park, Jann Horn, Yafang Shao, Matthew Wilcox

Overall, while I HATE this interface (as y'know, everyone knows :P), since
it _already_ exists and fulfils a real need (and we _have_ to keep
supporting that need) I'm open to us solving the issue this way.

So this might be a way for us to achieve what Usama + others need without
having to splice in horridness.

This as a proof-of-concept is obviously not for 6.17 (and late in the day
anyway :P), so we have at least 6.18 cycle to discuss.

On Mon, Jul 21, 2025 at 11:09:42AM +0200, David Hildenbrand wrote:
> People want to make use of more THPs, for example, moving from
> THP=never to THP=madvise, or from THP=madvise to THP=never.

Nitty, but sort of vague as to what THP= means here, I'd just say 'from
never to madvise, or from madvise to never' - it's pretty clear what you
mean and keeps enough 'flexibility' of interpretation to cover off the
various ways you can do this in the sysfs interfaces.

Same comment for simlar below.

>
> While this is great news for every THP desperately waiting to get
> allocated out there, apparently there are some workloads that require a
> bit of care during that transition: once problems are detected, these
> workloads should be started with the old behavior, without making all
> other workloads on the system go back to the old behavior as well.

I'm confused about what 'old behavior' is here. Also it's not always
necessarily due to problems, there can be a desire to treat THPs as a
resource to be distributed as desired.

So I'd say something like '... transition: individual processes may need to
opt-out from this behaviour for various reasons, and this should be
permitted without needing to make all other workloads on the system
similarly opt-out'.

>
> In essence, the following scenarios are imaginable:
>
> (1) Switch from THP=none to THP=madvise or THP=always, but keep the old
>     behavior (no THP) for selected workloads.

I'd remove 'old behavior' here as it's confusing, and simply refer to THP
being disabled for selected  workloads.

>
> (2) Stay at THP=none, but have "madvise" or "always" behavior for
>     selected workloads.
>
> (3) Switch from THP=madvise to THP=always, but keep the old behavior
>     (THP only when advised) for selected workloads.
>
> (4) Stay at THP=madvise, but have "always" behavior for selected
>     workloads.
>
> In essence, (2) can be emulated through (1), by setting THP!=none while
> disabling THPs for all processes that don't want THPs. It requires
> configuring all workloads, but that is a user-space problem to sort out.

NIT: Delete 'In essence' here.

>
> (4) can be emulated through (3) in a similar way.
>
> Back when (1) was relevant in the past, as people started enabling THPs,
> we added PR_SET_THP_DISABLE, so relevant workloads that were not ready
> yet (i.e., used by Redis) were able to just disable THPs completely. Redis
> still implements the option to use this interface to disable THPs
> completely.
>
> With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
> workload -- a process, including fork+exec'ed process hierarchy.
> That essentially made us support (1): simply disable THPs for all workloads
> that are not ready for THPs yet, while still enabling THPs system-wide.
>
> The quest for handling (3) and (4) started, but current approaches
> (completely new prctl, options to set other policies per processm,
>  alternatives to prctl -- mctrl, cgroup handling) don't look particularly
> promising. Likely, the future will use bpf or something similar to
> implement better policies, in particular to also make better decisions
> about THP sizes to use, but this will certainly take a while as that work
> just started.

Ack.

>
> Long story short: a simple enable/disable is not really suitable for the
> future, so we're not willing to add completely new toggles.

Yes this is the crux of the problem.

>
> While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
> completely for these processes, this scares many THPs in our system
> because they could no longer get allocated where they used to be allocated
> for: regions flagged as VM_HUGEPAGE. Apparently, that imposes a
> problem for relevant workloads, because "not THPs" is certainly worse
> than "THPs only when advised".

I don't know what you mean by 'scares' many THPs? :P

>
> Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not
> explicitly advised by the app through MAD_HUGEPAGE"? *maybe*, but this

MAD_HUGEPAGE -> MADV_HUGEPAGE

I'm confused by 'unless not explicitly advised' do you mean 'disable THPs
unless explicitly advised by the app through MADV_HUGEPAGE'?

> would change the documented semantics quite a bit, and the versatility
> to use it for debugging purposes, so I am not 100% sure that is what we
> want -- although it would certainly be much easier.
>
> So instead, as an easy way forward for (3) and (4), an option to
> make PR_SET_THP_DISABLE disable *less* THPs for a process.
>
> In essence, this patch:
>
> (A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
>     of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).

prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED)?

>
>     For now, arg3 was not allowed to be set (-EINVAL). Now it holds
>     flags.

This sentence is redundant.

>
> (B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
>     PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.
>
>     For now, it would return 1 if THPs were disabled completely. Now
>     it essentially returns the set flags as well.

For now as in 'previously'. I guess right now it's just used as a boolean,
so this seems fine.

>
> (C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
>     the semantics clearly.
>
>     Fortunately, there are only two instances outside of prctl() code.
>
> (D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
>     with VM_HUGEPAGE" -- essentially "thp=madvise" behavior
>
>     Fortunately, we only have to extend vma_thp_disabled().
>
> (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are not
>     disabled completely

You mean 'are disabled completely' but this has been covered already :P

>
>     Only indicating that THPs are disabled when they are really disabled
>     completely, not only partially.


So the really interesting part in the above is the small delta this change
represents... which makes it a lot more compelling as a solution.

>
> The documented semantics in the man page for PR_SET_THP_DISABLE
> "is inherited by a child created via fork(2) and is preserved across
> execve(2)" is maintained. This behavior, for example, allows for
> disabling THPs for a workload through the launching process (e.g.,
> systemd where we fork() a helper process to then exec()).

Yeah, this is something I REALLY don't want us to perpetuate, as it's
adding a now policy method by the 'back door'.

I had actually come to the conclusion that we absolutely should NOT
implement anything like this, as discussed in the THP cabal meeting.

HOWEVER, since this mechanism ALREADY EXISTS for this case, I am much more
ok with this.

We already perpetuate state for this across fork/exec.

>
> There is currently not way to prevent that a process will not issue
> PR_SET_THP_DISABLE itself to re-enable THP. We could add a "seal" option
> to PR_SET_THP_DISABLE through another flag if ever required. The known
> users (such as redis) really use PR_SET_THP_DISABLE to disable THPs, so
> that is not added for now.

Yeah not a fan of the seal idea, that will add a bunch of complexity and
state that I would rather not have.

I'd far prefer just disallowing re-enabling THP. We could allow
re-disabling with different flags though.

Be good to get some examples of the old + new prctl() invocations in the
commit message too.

>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Cc: SeongJae Park <sj@kernel.org>
> Cc: Jann Horn <jannh@google.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Yafang Shao <laoar.shao@gmail.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> ---
>
> At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
> think there might be real use cases where we want to disable any THPs --
> in particular also around debugging THP-related problems, and
> "THP=never" not meaning ... "never" anymore. PR_SET_THP_DISABLE will

Well, not quite anymore :) it's been this way for a while right? Even since
MADV_COLLAPSE was introduced.

> also block MADV_COLLAPSE, which can be very helpful. Of course, I thought

Yes.

I mean I hate, hate, HATE this interface. But it exists and we have to
support it anyway.

> of having a system-wide config to change PR_SET_THP_DISABLE behavior, but
> I just don't like the semantics.

What do you mean?

>
> "prctl: allow overriding system THP policy to always"[1] proposed
> "overriding policies to always", which is just the wrong way around: we
> should not add mechanisms to "enable more" when we already have an
> interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
> weird otherwise.

Yes. A 'disable but' is more logical.

>
> "[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed
> setting the default of the VM_HUGEPAGE, which is similarly the wrong way
> around I think now.

Yes.

>
> The proposals by Lorenzo to extend process_madvise()[3] and mctrl()[4]
> similarly were around the "default for VM_HUGEPAGE" idea, but after the
> discussion, I think we should better leave VM_HUGEPAGE untouched.

Well, to be clear, these were more 'if we HAVE to do this, what is the
least awful way of doing this?' rather than proposals per se :P and mctrl()
was really taking existing discussed ideas and -simply seeing how it looked
in code- though in the end we decided better to spell out in words how it
would look.

At least now I'm not in favour of us implementing policy this way (but
again, am open to us extending an _existing_ abomination :)

>
> Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
> we essentially want to say "leave advised regions alone" -- "keep THP
> enabled for advised regions",

Seems OK to me. I guess the one point of confusion could be people being
confused between the THP toggle 'madvise' vs. actually having MADV_HUGEPAGE
set, but that's moot, because 'madvise' mode only enables THP if the region
has had MADV_HUGEPAGE set.

>
> The only thing I really dislike about this is using another MMF_* flag,
> but well, no way around it -- and seems like we could easily support
> more than 32 if we want to, or storing this thp information elsewhere.

Yes my exact thoughts. But I will be adding a series to change this for VMA
flags soon enough, and can potentially do mm flags at the same time...

So this shouldn't in the end be as much of a problem.

Maybe it's worth holding off on this until I've done that? But at any rate
I intend to do those changes next cycle, and this will be a next cycle
thing at the earliest anyway.

>
> I think this here (modifying an existing toggle) is the only prctl()
> extension that we might be willing to accept. In general, I agree like
> most others, that prctl() is a very bad interface for that -- but
> PR_SET_THP_DISABLE is already there and is getting used.

Yes.

>
> Long-term, I think the answer will be something based on bpf[5]. Maybe
> in that context, I there could still be value in easily disabling THPs for
> selected workloads (esp. debugging purposes).
>
> Jann raised valid concerns[6] about new flags that are persistent across
> exec[6]. As this here is a relaxation to existing PR_SET_THP_DISABLE I
> consider it having a similar security risk as our existing
> PR_SET_THP_DISABLE, but devil is in the detail.

Yes...

>
> This is *completely* untested and might be utterly broken. It merely
> serves as a PoC of what I think could be done. If this ever goes upstream,
> we need some kselftests for it, and extensive tests.

Well :) I mean we should definitely try this out in anger and it _MUST_
have self tests and put under some pressure.

Usama, can you attack this and see?

>
> [1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
> [2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
> [3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
> [4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
> [5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
> [6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com
>
> ---
>  Documentation/filesystems/proc.rst |  5 +--
>  fs/proc/array.c                    |  2 +-
>  include/linux/huge_mm.h            | 20 ++++++++---
>  include/linux/mm_types.h           | 13 +++----
>  include/uapi/linux/prctl.h         |  7 ++++
>  kernel/sys.c                       | 58 +++++++++++++++++++++++-------
>  mm/khugepaged.c                    |  2 +-
>  7 files changed, 78 insertions(+), 29 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 2971551b72353..915a3e44bc120 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -291,8 +291,9 @@ It's slow but very precise.
>   HugetlbPages                size of hugetlb memory portions
>   CoreDumping                 process's memory is currently being dumped
>                               (killing the process may lead to a corrupted core)
> - THP_enabled		     process is allowed to use THP (returns 0 when
> -			     PR_SET_THP_DISABLE is set on the process
> + THP_enabled                 process is allowed to use THP (returns 0 when
> +                             PR_SET_THP_DISABLE is set on the process to disable
> +                             THP completely, not just partially)

Hmm but this means we have no way of knowing if it's set for partial

>   Threads                     number of threads
>   SigQ                        number of signals queued/max. number for queue
>   SigPnd                      bitmap of pending signals for the thread
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index d6a0369caa931..c4f91a784104f 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -422,7 +422,7 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
>  	bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
>
>  	if (thp_enabled)
> -		thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
> +		thp_enabled = !test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
>  	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
>  }
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index e0a27f80f390d..c4127104d9bc3 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -323,16 +323,26 @@ struct thpsize {
>  	(transparent_hugepage_flags &					\
>  	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
>
> +/*
> + * Check whether THPs are explicitly disabled through madvise or prctl, or some
> + * architectures may disable THP for some mappings, for example, s390 kvm.
> + */
>  static inline bool vma_thp_disabled(struct vm_area_struct *vma,
>  		vm_flags_t vm_flags)

This _should_ work as we set/clear VM_HUGEPAGE, VM_NOHUGEPAGE in madvise()
unconditionally without bothering to check available THP orders first so no
chicken-and-egg here.

>  {
> +	/* Are THPs disabled for this VMA? */
> +	if (vm_flags & VM_NOHUGEPAGE)
> +		return true;
> +	/* Are THPs disabled for all VMAs in the whole process? */
> +	if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags))
> +		return true;
>  	/*
> -	 * Explicitly disabled through madvise or prctl, or some
> -	 * architectures may disable THP for some mappings, for
> -	 * example, s390 kvm.
> +	 * Are THPs disabled only for VMAs where we didn't get an explicit
> +	 * advise to use them?

Probably fine to drop the rather pernickety reference to s390 kvm here, I
mean we don't need to spell out massively specific details in a general
handler.

>  	 */
> -	return (vm_flags & VM_NOHUGEPAGE) ||
> -	       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
> +	if (vm_flags & VM_HUGEPAGE)
> +		return false;
> +	return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
>  }
>
>  static inline bool thp_disabled_by_hw(void)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 1ec273b066915..a999f2d352648 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1743,19 +1743,16 @@ enum {
>  #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
>  #define MMF_VM_HUGEPAGE		17	/* set when mm is available for khugepaged */
>
> -/*
> - * This one-shot flag is dropped due to necessity of changing exe once again
> - * on NFS restore
> - */
> -//#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
> +#define MMF_HUGE_ZERO_PAGE	18      /* mm has ever used the global huge zero page */
>
>  #define MMF_HAS_UPROBES		19	/* has uprobes */
>  #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
>  #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
>  #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
> -#define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
> -#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
> -#define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
> +#define MMF_DISABLE_THP_EXCEPT_ADVISED	23	/* no THP except for VMAs with VM_HUGEPAGE */
> +#define MMF_DISABLE_THP_COMPLETELY	24	/* no THP for all VMAs */
> +#define MMF_DISABLE_THP_MASK	((1 << MMF_DISABLE_THP_COMPLETELY) |\
> +				 (1 << MMF_DISABLE_THP_EXCEPT_ADVISED))

It feels a bit sigh to have to use up low-supply mm flags for this. But
again, I should be attacking this shortage soon enough.

Are we not going ahead with Barry's series that was going to use one of
these in the end?

>  #define MMF_OOM_REAP_QUEUED	25	/* mm was queued for oom_reaper */
>  #define MMF_MULTIPROCESS	26	/* mm is shared between processes */
>  /*
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 43dec6eed559a..1949bb9270d48 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -177,7 +177,14 @@ struct prctl_mm_map {
>
>  #define PR_GET_TID_ADDRESS	40
>
> +/*
> + * Flags for PR_SET_THP_DISABLE are only applicable when disabling. Bit 0
> + * is reserved, so PR_GET_THP_DISABLE can return 1 when no other flags were
> + * specified for PR_SET_THP_DISABLE.
> + */

Probably worth specifying that you're just returning the flags here.

>  #define PR_SET_THP_DISABLE	41
> +/* Don't disable THPs when explicitly advised (MADV_HUGEPAGE / VM_HUGEPAGE). */
> +# define PR_THP_DISABLE_EXCEPT_ADVISED	(1 << 1)
>  #define PR_GET_THP_DISABLE	42
>
>  /*
> diff --git a/kernel/sys.c b/kernel/sys.c
> index b153fb345ada2..2a34b2f708900 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2423,6 +2423,50 @@ static int prctl_get_auxv(void __user *addr, unsigned long len)
>  	return sizeof(mm->saved_auxv);
>  }
>
> +static int prctl_get_thp_disable(unsigned long arg2, unsigned long arg3,
> +				 unsigned long arg4, unsigned long arg5)
> +{
> +	unsigned long *mm_flags = &current->mm->flags;
> +
> +	if (arg2 || arg3 || arg4 || arg5)
> +		return -EINVAL;
> +
> +	if (test_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags))
> +		return 1;
> +	else if (test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags))
> +		return 1 | PR_THP_DISABLE_EXCEPT_ADVISED;
> +	return 0;
> +}
> +
> +static int prctl_set_thp_disable(unsigned long thp_disable, unsigned long flags,
> +				 unsigned long arg4, unsigned long arg5)
> +{
> +	unsigned long *mm_flags = &current->mm->flags;
> +
> +	if (arg4 || arg5)
> +		return -EINVAL;
> +
> +	/* Flags are only allowed when disabling. */
> +	if (!thp_disable || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
> +		return -EINVAL;
> +	if (mmap_write_lock_killable(current->mm))
> +		return -EINTR;
> +	if (thp_disable) {
> +		if (flags & PR_THP_DISABLE_EXCEPT_ADVISED) {
> +			clear_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
> +			set_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
> +		} else {
> +			set_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
> +			clear_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
> +		}
> +	} else {
> +		clear_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
> +		clear_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
> +	}
> +	mmap_write_unlock(current->mm);
> +	return 0;
> +}
> +
>  SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		unsigned long, arg4, unsigned long, arg5)
>  {
> @@ -2596,20 +2640,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  			return -EINVAL;
>  		return task_no_new_privs(current) ? 1 : 0;
>  	case PR_GET_THP_DISABLE:
> -		if (arg2 || arg3 || arg4 || arg5)
> -			return -EINVAL;
> -		error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
> +		error = prctl_get_thp_disable(arg2, arg3, arg4, arg5);
>  		break;
>  	case PR_SET_THP_DISABLE:
> -		if (arg3 || arg4 || arg5)
> -			return -EINVAL;
> -		if (mmap_write_lock_killable(me->mm))
> -			return -EINTR;
> -		if (arg2)
> -			set_bit(MMF_DISABLE_THP, &me->mm->flags);
> -		else
> -			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
> -		mmap_write_unlock(me->mm);

I HATE that these were implemented here before.

> +		error = prctl_set_thp_disable(arg2, arg3, arg4, arg5);
>  		break;
>  	case PR_MPX_ENABLE_MANAGEMENT:
>  	case PR_MPX_DISABLE_MANAGEMENT:
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 8a5873d0a23a7..a685077644b4e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -427,7 +427,7 @@ static inline int collapse_test_exit(struct mm_struct *mm)
>  static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
>  {
>  	return collapse_test_exit(mm) ||
> -	       test_bit(MMF_DISABLE_THP, &mm->flags);
> +	       test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
>  }
>
>  static bool hugepage_enabled(void)
>
> base-commit: 760b462b3921c5dc8bfa151d2d27a944e4e96081
> --
> 2.50.1
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
  2025-07-21 11:28 ` Lorenzo Stoakes
@ 2025-07-21 11:45   ` David Hildenbrand
  2025-07-21 13:15     ` Lorenzo Stoakes
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2025-07-21 11:45 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, linux-fsdevel, linux-doc, Jonathan Corbet,
	Andrew Morton, Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Usama Arif,
	SeongJae Park, Jann Horn, Yafang Shao, Matthew Wilcox

On 21.07.25 13:28, Lorenzo Stoakes wrote:
> Overall, while I HATE this interface (as y'know, everyone knows :P), since
> it _already_ exists and fulfils a real need (and we _have_ to keep
> supporting that need) I'm open to us solving the issue this way.
> 
> So this might be a way for us to achieve what Usama + others need without
> having to splice in horridness.
> 
> This as a proof-of-concept is obviously not for 6.17 (and late in the day
> anyway :P), so we have at least 6.18 cycle to discuss.
> 
> On Mon, Jul 21, 2025 at 11:09:42AM +0200, David Hildenbrand wrote:
>> People want to make use of more THPs, for example, moving from
>> THP=never to THP=madvise, or from THP=madvise to THP=never.
> 
> Nitty, but sort of vague as to what THP= means here, I'd just say 'from
> never to madvise, or from madvise to never' - it's pretty clear what you
> mean and keeps enough 'flexibility' of interpretation to cover off the
> various ways you can do this in the sysfs interfaces.
 > > Same comment for simlar below.
> 
>>
>> While this is great news for every THP desperately waiting to get
>> allocated out there, apparently there are some workloads that require a
>> bit of care during that transition: once problems are detected, these
>> workloads should be started with the old behavior, without making all
>> other workloads on the system go back to the old behavior as well.
> 
> I'm confused about what 'old behavior' is here. Also it's not always
> necessarily due to problems, there can be a desire to treat THPs as a
> resource to be distributed as desired.
> 
> So I'd say something like '... transition: individual processes may need to
> opt-out from this behaviour for various reasons, and this should be
> permitted without needing to make all other workloads on the system
> similarly opt-out'.

No strong opinion.

> 
>>
>> In essence, the following scenarios are imaginable:
>>
>> (1) Switch from THP=none to THP=madvise or THP=always, but keep the old
>>      behavior (no THP) for selected workloads.
> 
> I'd remove 'old behavior' here as it's confusing, and simply refer to THP
> being disabled for selected  workloads.

Yes.

> 
>>
>> (2) Stay at THP=none, but have "madvise" or "always" behavior for
>>      selected workloads.
>>
>> (3) Switch from THP=madvise to THP=always, but keep the old behavior
>>      (THP only when advised) for selected workloads.
>>
>> (4) Stay at THP=madvise, but have "always" behavior for selected
>>      workloads.
>>
>> In essence, (2) can be emulated through (1), by setting THP!=none while
>> disabling THPs for all processes that don't want THPs. It requires
>> configuring all workloads, but that is a user-space problem to sort out.
> 
> NIT: Delete 'In essence' here.

I wanted "something" there to not make it look like the list keeps going 
on in a weird way ;)

> 
>>
>> (4) can be emulated through (3) in a similar way.
>>
>> Back when (1) was relevant in the past, as people started enabling THPs,
>> we added PR_SET_THP_DISABLE, so relevant workloads that were not ready
>> yet (i.e., used by Redis) were able to just disable THPs completely. Redis
>> still implements the option to use this interface to disable THPs
>> completely.
>>
>> With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
>> workload -- a process, including fork+exec'ed process hierarchy.
>> That essentially made us support (1): simply disable THPs for all workloads
>> that are not ready for THPs yet, while still enabling THPs system-wide.
>>
>> The quest for handling (3) and (4) started, but current approaches
>> (completely new prctl, options to set other policies per processm,
>>   alternatives to prctl -- mctrl, cgroup handling) don't look particularly
>> promising. Likely, the future will use bpf or something similar to
>> implement better policies, in particular to also make better decisions
>> about THP sizes to use, but this will certainly take a while as that work
>> just started.
> 
> Ack.
> 
>>
>> Long story short: a simple enable/disable is not really suitable for the
>> future, so we're not willing to add completely new toggles.
> 
> Yes this is the crux of the problem.
> 
>>
>> While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
>> completely for these processes, this scares many THPs in our system
>> because they could no longer get allocated where they used to be allocated
>> for: regions flagged as VM_HUGEPAGE. Apparently, that imposes a
>> problem for relevant workloads, because "not THPs" is certainly worse
>> than "THPs only when advised".
> 
> I don't know what you mean by 'scares' many THPs? :P

They are very afraid of not getting allocated :)

> 
>>
>> Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not
>> explicitly advised by the app through MAD_HUGEPAGE"? *maybe*, but this
> 
> MAD_HUGEPAGE -> MADV_HUGEPAGE
> 
> I'm confused by 'unless not explicitly advised' do you mean 'disable THPs
> unless explicitly advised by the app through MADV_HUGEPAGE'?

Yes.

> 
>> would change the documented semantics quite a bit, and the versatility
>> to use it for debugging purposes, so I am not 100% sure that is what we
>> want -- although it would certainly be much easier.
>>
>> So instead, as an easy way forward for (3) and (4), an option to
>> make PR_SET_THP_DISABLE disable *less* THPs for a process.
>>
>> In essence, this patch:
>>
>> (A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
>>      of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).
> 
> prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED)?
> 
>>
>>      For now, arg3 was not allowed to be set (-EINVAL). Now it holds
>>      flags.
> 
> This sentence is redundant.
> 
>>
>> (B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
>>      PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.
>>
>>      For now, it would return 1 if THPs were disabled completely. Now
>>      it essentially returns the set flags as well.
> 
> For now as in 'previously'. I guess right now it's just used as a boolean,
> so this seems fine.
> 
>>
>> (C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
>>      the semantics clearly.
>>
>>      Fortunately, there are only two instances outside of prctl() code.
>>
>> (D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
>>      with VM_HUGEPAGE" -- essentially "thp=madvise" behavior
>>
>>      Fortunately, we only have to extend vma_thp_disabled().
>>
>> (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are not
>>      disabled completely
> 
> You mean 'are disabled completely' but this has been covered already :P

Yeah, see my self-reply.

> 
>>
>>      Only indicating that THPs are disabled when they are really disabled
>>      completely, not only partially.
> 
> 
> So the really interesting part in the above is the small delta this change
> represents... which makes it a lot more compelling as a solution.
> 
>>
>> The documented semantics in the man page for PR_SET_THP_DISABLE
>> "is inherited by a child created via fork(2) and is preserved across
>> execve(2)" is maintained. This behavior, for example, allows for
>> disabling THPs for a workload through the launching process (e.g.,
>> systemd where we fork() a helper process to then exec()).
> 
> Yeah, this is something I REALLY don't want us to perpetuate, as it's
> adding a now policy method by the 'back door'.
> 
> I had actually come to the conclusion that we absolutely should NOT
> implement anything like this, as discussed in the THP cabal meeting.
> 
> HOWEVER, since this mechanism ALREADY EXISTS for this case, I am much more
> ok with this.
> 
> We already perpetuate state for this across fork/exec.
> 
>>
>> There is currently not way to prevent that a process will not issue
>> PR_SET_THP_DISABLE itself to re-enable THP. We could add a "seal" option
>> to PR_SET_THP_DISABLE through another flag if ever required. The known
>> users (such as redis) really use PR_SET_THP_DISABLE to disable THPs, so
>> that is not added for now.
> 
> Yeah not a fan of the seal idea, that will add a bunch of complexity and
> state that I would rather not have.
> 
> I'd far prefer just disallowing re-enabling THP. We could allow
> re-disabling with different flags though.
> 
> Be good to get some examples of the old + new prctl() invocations in the
> commit message too.
> 
>>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Mike Rapoport <rppt@kernel.org>
>> Cc: Suren Baghdasaryan <surenb@google.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Usama Arif <usamaarif642@gmail.com>
>> Cc: SeongJae Park <sj@kernel.org>
>> Cc: Jann Horn <jannh@google.com>
>> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
>> Cc: Yafang Shao <laoar.shao@gmail.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>> ---
>>
>> At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
>> think there might be real use cases where we want to disable any THPs --
>> in particular also around debugging THP-related problems, and
>> "THP=never" not meaning ... "never" anymore. PR_SET_THP_DISABLE will
> 
> Well, not quite anymore :) it's been this way for a while right? Even since
> MADV_COLLAPSE was introduced.

It goes back to 3.15, yes ...

> 
>> also block MADV_COLLAPSE, which can be very helpful. Of course, I thought
> 
> Yes.
> 
> I mean I hate, hate, HATE this interface. But it exists and we have to
> support it anyway.
> 
>> of having a system-wide config to change PR_SET_THP_DISABLE behavior, but
>> I just don't like the semantics.
> 
> What do you mean?

Kconfig option to change the behavior etc. In summary, I don't want to 
go down that path, it all gets weird.

> 
>>
>> "prctl: allow overriding system THP policy to always"[1] proposed
>> "overriding policies to always", which is just the wrong way around: we
>> should not add mechanisms to "enable more" when we already have an
>> interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
>> weird otherwise.
> 
> Yes. A 'disable but' is more logical.
> 
>>
>> "[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed
>> setting the default of the VM_HUGEPAGE, which is similarly the wrong way
>> around I think now.
> 
> Yes.
> 
>>
>> The proposals by Lorenzo to extend process_madvise()[3] and mctrl()[4]
>> similarly were around the "default for VM_HUGEPAGE" idea, but after the
>> discussion, I think we should better leave VM_HUGEPAGE untouched.
> 
> Well, to be clear, these were more 'if we HAVE to do this, what is the
> least awful way of doing this?' rather than proposals per se :P and mctrl()
> was really taking existing discussed ideas and -simply seeing how it looked
> in code- though in the end we decided better to spell out in words how it
> would look.
> 
> At least now I'm not in favour of us implementing policy this way (but
> again, am open to us extending an _existing_ abomination :)
> 
>>
>> Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
>> we essentially want to say "leave advised regions alone" -- "keep THP
>> enabled for advised regions",
> 
> Seems OK to me. I guess the one point of confusion could be people being
> confused between the THP toggle 'madvise' vs. actually having MADV_HUGEPAGE
> set, but that's moot, because 'madvise' mode only enables THP if the region
> has had MADV_HUGEPAGE set.

Right, whatever ends up setting VM_HUGEPAGE.

> 
>>
>> The only thing I really dislike about this is using another MMF_* flag,
>> but well, no way around it -- and seems like we could easily support
>> more than 32 if we want to, or storing this thp information elsewhere.
> 
> Yes my exact thoughts. But I will be adding a series to change this for VMA
> flags soon enough, and can potentially do mm flags at the same time...
> 
> So this shouldn't in the end be as much of a problem.
> 
> Maybe it's worth holding off on this until I've done that? But at any rate
> I intend to do those changes next cycle, and this will be a next cycle
> thing at the earliest anyway.

I don't think this series must be blocked by that. Using a bitmap 
instead of a single "unsigned long" should be fairly easy later -- I did 
not identify any big blockers.

> 
>>
>> I think this here (modifying an existing toggle) is the only prctl()
>> extension that we might be willing to accept. In general, I agree like
>> most others, that prctl() is a very bad interface for that -- but
>> PR_SET_THP_DISABLE is already there and is getting used.
> 
> Yes.
> 
>>
>> Long-term, I think the answer will be something based on bpf[5]. Maybe
>> in that context, I there could still be value in easily disabling THPs for
>> selected workloads (esp. debugging purposes).
>>
>> Jann raised valid concerns[6] about new flags that are persistent across
>> exec[6]. As this here is a relaxation to existing PR_SET_THP_DISABLE I
>> consider it having a similar security risk as our existing
>> PR_SET_THP_DISABLE, but devil is in the detail.
> 
> Yes...
> 
>>
>> This is *completely* untested and might be utterly broken. It merely
>> serves as a PoC of what I think could be done. If this ever goes upstream,
>> we need some kselftests for it, and extensive tests.
> 
> Well :) I mean we should definitely try this out in anger and it _MUST_
> have self tests and put under some pressure.
> 
> Usama, can you attack this and see?

Yes, that's what I am hoping for.

> 
>>
>> [1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
>> [2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
>> [3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
>> [4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
>> [5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
>> [6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com
>>
>> ---
>>   Documentation/filesystems/proc.rst |  5 +--
>>   fs/proc/array.c                    |  2 +-
>>   include/linux/huge_mm.h            | 20 ++++++++---
>>   include/linux/mm_types.h           | 13 +++----
>>   include/uapi/linux/prctl.h         |  7 ++++
>>   kernel/sys.c                       | 58 +++++++++++++++++++++++-------
>>   mm/khugepaged.c                    |  2 +-
>>   7 files changed, 78 insertions(+), 29 deletions(-)
>>
>> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
>> index 2971551b72353..915a3e44bc120 100644
>> --- a/Documentation/filesystems/proc.rst
>> +++ b/Documentation/filesystems/proc.rst
>> @@ -291,8 +291,9 @@ It's slow but very precise.
>>    HugetlbPages                size of hugetlb memory portions
>>    CoreDumping                 process's memory is currently being dumped
>>                                (killing the process may lead to a corrupted core)
>> - THP_enabled		     process is allowed to use THP (returns 0 when
>> -			     PR_SET_THP_DISABLE is set on the process
>> + THP_enabled                 process is allowed to use THP (returns 0 when
>> +                             PR_SET_THP_DISABLE is set on the process to disable
>> +                             THP completely, not just partially)
> 
> Hmm but this means we have no way of knowing if it's set for partial

Yes. I briefly thought about indicating another member, but then I 
thought (a) it's ugly and (b) "who cares".

I also thought about just printing "partial" instead of "1", but not 
sure if that would break any parser.

> 
>>    Threads                     number of threads
>>    SigQ                        number of signals queued/max. number for queue
>>    SigPnd                      bitmap of pending signals for the thread
>> diff --git a/fs/proc/array.c b/fs/proc/array.c
>> index d6a0369caa931..c4f91a784104f 100644
>> --- a/fs/proc/array.c
>> +++ b/fs/proc/array.c
>> @@ -422,7 +422,7 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
>>   	bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
>>
>>   	if (thp_enabled)
>> -		thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
>> +		thp_enabled = !test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
>>   	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
>>   }
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index e0a27f80f390d..c4127104d9bc3 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -323,16 +323,26 @@ struct thpsize {
>>   	(transparent_hugepage_flags &					\
>>   	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
>>
>> +/*
>> + * Check whether THPs are explicitly disabled through madvise or prctl, or some
>> + * architectures may disable THP for some mappings, for example, s390 kvm.
>> + */
>>   static inline bool vma_thp_disabled(struct vm_area_struct *vma,
>>   		vm_flags_t vm_flags)
> 
> This _should_ work as we set/clear VM_HUGEPAGE, VM_NOHUGEPAGE in madvise()
> unconditionally without bothering to check available THP orders first so no
> chicken-and-egg here.
> 
>>   {
>> +	/* Are THPs disabled for this VMA? */
>> +	if (vm_flags & VM_NOHUGEPAGE)
>> +		return true;
>> +	/* Are THPs disabled for all VMAs in the whole process? */
>> +	if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags))
>> +		return true;
>>   	/*
>> -	 * Explicitly disabled through madvise or prctl, or some
>> -	 * architectures may disable THP for some mappings, for
>> -	 * example, s390 kvm.
>> +	 * Are THPs disabled only for VMAs where we didn't get an explicit
>> +	 * advise to use them?
> 
> Probably fine to drop the rather pernickety reference to s390 kvm here, I
> mean we don't need to spell out massively specific details in a general
> handler.

No strong opinion.

> 
>>   	 */
>> -	return (vm_flags & VM_NOHUGEPAGE) ||
>> -	       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
>> +	if (vm_flags & VM_HUGEPAGE)
>> +		return false;
>> +	return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
>>   }
>>
>>   static inline bool thp_disabled_by_hw(void)
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 1ec273b066915..a999f2d352648 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -1743,19 +1743,16 @@ enum {
>>   #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
>>   #define MMF_VM_HUGEPAGE		17	/* set when mm is available for khugepaged */
>>
>> -/*
>> - * This one-shot flag is dropped due to necessity of changing exe once again
>> - * on NFS restore
>> - */
>> -//#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
>> +#define MMF_HUGE_ZERO_PAGE	18      /* mm has ever used the global huge zero page */
>>
>>   #define MMF_HAS_UPROBES		19	/* has uprobes */
>>   #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
>>   #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
>>   #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
>> -#define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
>> -#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
>> -#define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
>> +#define MMF_DISABLE_THP_EXCEPT_ADVISED	23	/* no THP except for VMAs with VM_HUGEPAGE */
>> +#define MMF_DISABLE_THP_COMPLETELY	24	/* no THP for all VMAs */
>> +#define MMF_DISABLE_THP_MASK	((1 << MMF_DISABLE_THP_COMPLETELY) |\
>> +				 (1 << MMF_DISABLE_THP_EXCEPT_ADVISED))
> 
> It feels a bit sigh to have to use up low-supply mm flags for this. But
> again, I should be attacking this shortage soon enough.
> 
> Are we not going ahead with Barry's series that was going to use one of
> these in the end?

Whoever gets acked first ;)

> 
>>   #define MMF_OOM_REAP_QUEUED	25	/* mm was queued for oom_reaper */
>>   #define MMF_MULTIPROCESS	26	/* mm is shared between processes */
>>   /*
>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>> index 43dec6eed559a..1949bb9270d48 100644
>> --- a/include/uapi/linux/prctl.h
>> +++ b/include/uapi/linux/prctl.h
>> @@ -177,7 +177,14 @@ struct prctl_mm_map {
>>
>>   #define PR_GET_TID_ADDRESS	40
>>
>> +/*
>> + * Flags for PR_SET_THP_DISABLE are only applicable when disabling. Bit 0
>> + * is reserved, so PR_GET_THP_DISABLE can return 1 when no other flags were
>> + * specified for PR_SET_THP_DISABLE.
>> + */
> 
> Probably worth specifying that you're just returning the flags here.

Yes.

Thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
  2025-07-21 11:45   ` David Hildenbrand
@ 2025-07-21 13:15     ` Lorenzo Stoakes
  0 siblings, 0 replies; 18+ messages in thread
From: Lorenzo Stoakes @ 2025-07-21 13:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-fsdevel, linux-doc, Jonathan Corbet,
	Andrew Morton, Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Usama Arif,
	SeongJae Park, Jann Horn, Yafang Shao, Matthew Wilcox

On Mon, Jul 21, 2025 at 01:45:18PM +0200, David Hildenbrand wrote:
> > >
> > > (2) Stay at THP=none, but have "madvise" or "always" behavior for
> > >      selected workloads.
> > >
> > > (3) Switch from THP=madvise to THP=always, but keep the old behavior
> > >      (THP only when advised) for selected workloads.
> > >
> > > (4) Stay at THP=madvise, but have "always" behavior for selected
> > >      workloads.
> > >
> > > In essence, (2) can be emulated through (1), by setting THP!=none while
> > > disabling THPs for all processes that don't want THPs. It requires
> > > configuring all workloads, but that is a user-space problem to sort out.
> >
> > NIT: Delete 'In essence' here.
>
> I wanted "something" there to not make it look like the list keeps going on
> in a weird way ;)

I mean it's not a big deal :P just have memories of English teachers telling me
off for repetition of such phrases...

> > > While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
> > > completely for these processes, this scares many THPs in our system
> > > because they could no longer get allocated where they used to be allocated
> > > for: regions flagged as VM_HUGEPAGE. Apparently, that imposes a
> > > problem for relevant workloads, because "not THPs" is certainly worse
> > > than "THPs only when advised".
> >
> > I don't know what you mean by 'scares' many THPs? :P
>
> They are very afraid of not getting allocated :)

Haha they sound cute!

> >
> > > of having a system-wide config to change PR_SET_THP_DISABLE behavior, but
> > > I just don't like the semantics.
> >
> > What do you mean?
>
> Kconfig option to change the behavior etc. In summary, I don't want to go
> down that path, it all gets weird.

Yeah please don't! :>)


> > > Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
> > > we essentially want to say "leave advised regions alone" -- "keep THP
> > > enabled for advised regions",
> >
> > Seems OK to me. I guess the one point of confusion could be people being
> > confused between the THP toggle 'madvise' vs. actually having MADV_HUGEPAGE
> > set, but that's moot, because 'madvise' mode only enables THP if the region
> > has had MADV_HUGEPAGE set.
>
> Right, whatever ends up setting VM_HUGEPAGE.

Yeah this naming is fine iMO.

>
> >
> > >
> > > The only thing I really dislike about this is using another MMF_* flag,
> > > but well, no way around it -- and seems like we could easily support
> > > more than 32 if we want to, or storing this thp information elsewhere.
> >
> > Yes my exact thoughts. But I will be adding a series to change this for VMA
> > flags soon enough, and can potentially do mm flags at the same time...
> >
> > So this shouldn't in the end be as much of a problem.
> >
> > Maybe it's worth holding off on this until I've done that? But at any rate
> > I intend to do those changes next cycle, and this will be a next cycle
> > thing at the earliest anyway.
>
> I don't think this series must be blocked by that. Using a bitmap instead of
> a single "unsigned long" should be fairly easy later -- I did not identify
> any big blockers.

Yeah that's fine. And I don't know when I will get the bitmap changes done, so
let's not block this with that!

> > > This is *completely* untested and might be utterly broken. It merely
> > > serves as a PoC of what I think could be done. If this ever goes upstream,
> > > we need some kselftests for it, and extensive tests.
> >
> > Well :) I mean we should definitely try this out in anger and it _MUST_
> > have self tests and put under some pressure.
> >
> > Usama, can you attack this and see?
>
> Yes, that's what I am hoping for.

Cool. And of course Usama is best placed to experiment with this approach,
as he can experiment with workloads relevant to this requirement.

>
> >
> > >
> > > [1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
> > > [2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
> > > [3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
> > > [4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
> > > [5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
> > > [6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com
> > >
> > > ---
> > >   Documentation/filesystems/proc.rst |  5 +--
> > >   fs/proc/array.c                    |  2 +-
> > >   include/linux/huge_mm.h            | 20 ++++++++---
> > >   include/linux/mm_types.h           | 13 +++----
> > >   include/uapi/linux/prctl.h         |  7 ++++
> > >   kernel/sys.c                       | 58 +++++++++++++++++++++++-------
> > >   mm/khugepaged.c                    |  2 +-
> > >   7 files changed, 78 insertions(+), 29 deletions(-)
> > >
> > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > > index 2971551b72353..915a3e44bc120 100644
> > > --- a/Documentation/filesystems/proc.rst
> > > +++ b/Documentation/filesystems/proc.rst
> > > @@ -291,8 +291,9 @@ It's slow but very precise.
> > >    HugetlbPages                size of hugetlb memory portions
> > >    CoreDumping                 process's memory is currently being dumped
> > >                                (killing the process may lead to a corrupted core)
> > > - THP_enabled		     process is allowed to use THP (returns 0 when
> > > -			     PR_SET_THP_DISABLE is set on the process
> > > + THP_enabled                 process is allowed to use THP (returns 0 when
> > > +                             PR_SET_THP_DISABLE is set on the process to disable
> > > +                             THP completely, not just partially)
> >
> > Hmm but this means we have no way of knowing if it's set for partial
>
> Yes. I briefly thought about indicating another member, but then I thought
> (a) it's ugly and (b) "who cares".
>
> I also thought about just printing "partial" instead of "1", but not sure if
> that would break any parser.

Hm and >1 could break a user who expects this to be 0, 1. We can always add
a new entry if needed.

> > >   {
> > > +	/* Are THPs disabled for this VMA? */
> > > +	if (vm_flags & VM_NOHUGEPAGE)
> > > +		return true;
> > > +	/* Are THPs disabled for all VMAs in the whole process? */
> > > +	if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags))
> > > +		return true;
> > >   	/*
> > > -	 * Explicitly disabled through madvise or prctl, or some
> > > -	 * architectures may disable THP for some mappings, for
> > > -	 * example, s390 kvm.
> > > +	 * Are THPs disabled only for VMAs where we didn't get an explicit
> > > +	 * advise to use them?
> >
> > Probably fine to drop the rather pernickety reference to s390 kvm here, I
> > mean we don't need to spell out massively specific details in a general
> > handler.
>
> No strong opinion.

I mean what I'm saying is this is fine :P Got no problem wtih removing this
bit of the comment.

>
> >
> > >   	 */
> > > -	return (vm_flags & VM_NOHUGEPAGE) ||
> > > -	       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
> > > +	if (vm_flags & VM_HUGEPAGE)
> > > +		return false;
> > > +	return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
> > >   }
> > >
> > >   static inline bool thp_disabled_by_hw(void)
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 1ec273b066915..a999f2d352648 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -1743,19 +1743,16 @@ enum {
> > >   #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
> > >   #define MMF_VM_HUGEPAGE		17	/* set when mm is available for khugepaged */
> > >
> > > -/*
> > > - * This one-shot flag is dropped due to necessity of changing exe once again
> > > - * on NFS restore
> > > - */
> > > -//#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
> > > +#define MMF_HUGE_ZERO_PAGE	18      /* mm has ever used the global huge zero page */
> > >
> > >   #define MMF_HAS_UPROBES		19	/* has uprobes */
> > >   #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
> > >   #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
> > >   #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
> > > -#define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
> > > -#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
> > > -#define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
> > > +#define MMF_DISABLE_THP_EXCEPT_ADVISED	23	/* no THP except for VMAs with VM_HUGEPAGE */
> > > +#define MMF_DISABLE_THP_COMPLETELY	24	/* no THP for all VMAs */
> > > +#define MMF_DISABLE_THP_MASK	((1 << MMF_DISABLE_THP_COMPLETELY) |\
> > > +				 (1 << MMF_DISABLE_THP_EXCEPT_ADVISED))
> >
> > It feels a bit sigh to have to use up low-supply mm flags for this. But
> > again, I should be attacking this shortage soon enough.
> >
> > Are we not going ahead with Barry's series that was going to use one of
> > these in the end?
>
> Whoever gets acked first ;)

;)

>
> >
> > >   #define MMF_OOM_REAP_QUEUED	25	/* mm was queued for oom_reaper */
> > >   #define MMF_MULTIPROCESS	26	/* mm is shared between processes */
> > >   /*
> > > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> > > index 43dec6eed559a..1949bb9270d48 100644
> > > --- a/include/uapi/linux/prctl.h
> > > +++ b/include/uapi/linux/prctl.h
> > > @@ -177,7 +177,14 @@ struct prctl_mm_map {
> > >
> > >   #define PR_GET_TID_ADDRESS	40
> > >
> > > +/*
> > > + * Flags for PR_SET_THP_DISABLE are only applicable when disabling. Bit 0
> > > + * is reserved, so PR_GET_THP_DISABLE can return 1 when no other flags were
> > > + * specified for PR_SET_THP_DISABLE.
> > > + */
> >
> > Probably worth specifying that you're just returning the flags here.
>
> Yes.
>
> Thanks!

Cheers!

>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
  2025-07-21  9:09 [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE David Hildenbrand
  2025-07-21 10:10 ` David Hildenbrand
  2025-07-21 11:28 ` Lorenzo Stoakes
@ 2025-07-21 14:32 ` Usama Arif
  2025-07-21 14:39   ` David Hildenbrand
  2025-07-21 17:27 ` Usama Arif
  2025-07-24 18:57 ` Usama Arif
  4 siblings, 1 reply; 18+ messages in thread
From: Usama Arif @ 2025-07-21 14:32 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, linux-fsdevel, linux-doc, Jonathan Corbet,
	Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	SeongJae Park, Jann Horn, Yafang Shao, Matthew Wilcox



On 21/07/2025 10:09, David Hildenbrand wrote:
> People want to make use of more THPs, for example, moving from
> THP=never to THP=madvise, or from THP=madvise to THP=never.
> 
> While this is great news for every THP desperately waiting to get
> allocated out there, apparently there are some workloads that require a
> bit of care during that transition: once problems are detected, these
> workloads should be started with the old behavior, without making all
> other workloads on the system go back to the old behavior as well.
> 
> In essence, the following scenarios are imaginable:
> 
> (1) Switch from THP=none to THP=madvise or THP=always, but keep the old
>     behavior (no THP) for selected workloads.
> 
> (2) Stay at THP=none, but have "madvise" or "always" behavior for
>     selected workloads.
> 
> (3) Switch from THP=madvise to THP=always, but keep the old behavior
>     (THP only when advised) for selected workloads.
> 
> (4) Stay at THP=madvise, but have "always" behavior for selected
>     workloads.
> 
> In essence, (2) can be emulated through (1), by setting THP!=none while
> disabling THPs for all processes that don't want THPs. It requires
> configuring all workloads, but that is a user-space problem to sort out.
> 
> (4) can be emulated through (3) in a similar way.
> 
> Back when (1) was relevant in the past, as people started enabling THPs,
> we added PR_SET_THP_DISABLE, so relevant workloads that were not ready
> yet (i.e., used by Redis) were able to just disable THPs completely. Redis
> still implements the option to use this interface to disable THPs
> completely.
> 
> With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
> workload -- a process, including fork+exec'ed process hierarchy.
> That essentially made us support (1): simply disable THPs for all workloads
> that are not ready for THPs yet, while still enabling THPs system-wide.
> 
> The quest for handling (3) and (4) started, but current approaches
> (completely new prctl, options to set other policies per processm,
>  alternatives to prctl -- mctrl, cgroup handling) don't look particularly
> promising. Likely, the future will use bpf or something similar to
> implement better policies, in particular to also make better decisions
> about THP sizes to use, but this will certainly take a while as that work
> just started.
> 
> Long story short: a simple enable/disable is not really suitable for the
> future, so we're not willing to add completely new toggles.
> 
> While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
> completely for these processes, this scares many THPs in our system
> because they could no longer get allocated where they used to be allocated
> for: regions flagged as VM_HUGEPAGE. Apparently, that imposes a
> problem for relevant workloads, because "not THPs" is certainly worse
> than "THPs only when advised".
> 
> Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not
> explicitly advised by the app through MAD_HUGEPAGE"? *maybe*, but this
> would change the documented semantics quite a bit, and the versatility
> to use it for debugging purposes, so I am not 100% sure that is what we
> want -- although it would certainly be much easier.
> 
> So instead, as an easy way forward for (3) and (4), an option to
> make PR_SET_THP_DISABLE disable *less* THPs for a process.
> 
> In essence, this patch:
> 
> (A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
>     of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).
> 
>     For now, arg3 was not allowed to be set (-EINVAL). Now it holds
>     flags.
> 
> (B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
>     PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.
> 
>     For now, it would return 1 if THPs were disabled completely. Now
>     it essentially returns the set flags as well.
> 

No strong opinion, but maybe we have it return 2 (i.e. bit 1 set)?

I know that you are returning bit 1 set to indicate the flag, and I know that
everyone dislikes prctl so its likely no more flags will be added :),
but in the off chance there are extra flags, than it can make the return
value weird?
If instead we return a value with only a single bit set, might be better?

Again, no strong opinion here.

> (C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
>     the semantics clearly.
> 
>     Fortunately, there are only two instances outside of prctl() code.
> 
> (D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
>     with VM_HUGEPAGE" -- essentially "thp=madvise" behavior
> 
>     Fortunately, we only have to extend vma_thp_disabled().
> 
> (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are not
>     disabled completely
> 
>     Only indicating that THPs are disabled when they are really disabled
>     completely, not only partially.
> 
> The documented semantics in the man page for PR_SET_THP_DISABLE
> "is inherited by a child created via fork(2) and is preserved across
> execve(2)" is maintained. This behavior, for example, allows for
> disabling THPs for a workload through the launching process (e.g.,
> systemd where we fork() a helper process to then exec()).
> 
> There is currently not way to prevent that a process will not issue
> PR_SET_THP_DISABLE itself to re-enable THP. We could add a "seal" option
> to PR_SET_THP_DISABLE through another flag if ever required. The known
> users (such as redis) really use PR_SET_THP_DISABLE to disable THPs, so
> that is not added for now.
> 
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Cc: SeongJae Park <sj@kernel.org>
> Cc: Jann Horn <jannh@google.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Yafang Shao <laoar.shao@gmail.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> ---
> 
> At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
> think there might be real use cases where we want to disable any THPs --
> in particular also around debugging THP-related problems, and
> "THP=never" not meaning ... "never" anymore. PR_SET_THP_DISABLE will
> also block MADV_COLLAPSE, which can be very helpful. Of course, I thought
> of having a system-wide config to change PR_SET_THP_DISABLE behavior, but
> I just don't like the semantics.
> 
> "prctl: allow overriding system THP policy to always"[1] proposed
> "overriding policies to always", which is just the wrong way around: we
> should not add mechanisms to "enable more" when we already have an
> interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
> weird otherwise.
> 
> "[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed
> setting the default of the VM_HUGEPAGE, which is similarly the wrong way
> around I think now.
> 
> The proposals by Lorenzo to extend process_madvise()[3] and mctrl()[4]
> similarly were around the "default for VM_HUGEPAGE" idea, but after the
> discussion, I think we should better leave VM_HUGEPAGE untouched.
> 
> Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
> we essentially want to say "leave advised regions alone" -- "keep THP
> enabled for advised regions",
> 
> The only thing I really dislike about this is using another MMF_* flag,
> but well, no way around it -- and seems like we could easily support
> more than 32 if we want to, or storing this thp information elsewhere.
> 
> I think this here (modifying an existing toggle) is the only prctl()
> extension that we might be willing to accept. In general, I agree like
> most others, that prctl() is a very bad interface for that -- but
> PR_SET_THP_DISABLE is already there and is getting used.
> 
> Long-term, I think the answer will be something based on bpf[5]. Maybe
> in that context, I there could still be value in easily disabling THPs for
> selected workloads (esp. debugging purposes).
> 
> Jann raised valid concerns[6] about new flags that are persistent across
> exec[6]. As this here is a relaxation to existing PR_SET_THP_DISABLE I
> consider it having a similar security risk as our existing
> PR_SET_THP_DISABLE, but devil is in the detail.
> 
> This is *completely* untested and might be utterly broken. It merely
> serves as a PoC of what I think could be done. If this ever goes upstream,
> we need some kselftests for it, and extensive tests.
> 
> [1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
> [2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
> [3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
> [4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
> [5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
> [6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com
> 
> ---
>  Documentation/filesystems/proc.rst |  5 +--
>  fs/proc/array.c                    |  2 +-
>  include/linux/huge_mm.h            | 20 ++++++++---
>  include/linux/mm_types.h           | 13 +++----
>  include/uapi/linux/prctl.h         |  7 ++++
>  kernel/sys.c                       | 58 +++++++++++++++++++++++-------
>  mm/khugepaged.c                    |  2 +-
>  7 files changed, 78 insertions(+), 29 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 2971551b72353..915a3e44bc120 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -291,8 +291,9 @@ It's slow but very precise.
>   HugetlbPages                size of hugetlb memory portions
>   CoreDumping                 process's memory is currently being dumped
>                               (killing the process may lead to a corrupted core)
> - THP_enabled		     process is allowed to use THP (returns 0 when
> -			     PR_SET_THP_DISABLE is set on the process
> + THP_enabled                 process is allowed to use THP (returns 0 when
> +                             PR_SET_THP_DISABLE is set on the process to disable
> +                             THP completely, not just partially)
>   Threads                     number of threads
>   SigQ                        number of signals queued/max. number for queue
>   SigPnd                      bitmap of pending signals for the thread
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index d6a0369caa931..c4f91a784104f 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -422,7 +422,7 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
>  	bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
>  
>  	if (thp_enabled)
> -		thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
> +		thp_enabled = !test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
>  	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
>  }
>  
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index e0a27f80f390d..c4127104d9bc3 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -323,16 +323,26 @@ struct thpsize {
>  	(transparent_hugepage_flags &					\
>  	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
>  
> +/*
> + * Check whether THPs are explicitly disabled through madvise or prctl, or some
> + * architectures may disable THP for some mappings, for example, s390 kvm.
> + */
>  static inline bool vma_thp_disabled(struct vm_area_struct *vma,
>  		vm_flags_t vm_flags)
>  {
> +	/* Are THPs disabled for this VMA? */
> +	if (vm_flags & VM_NOHUGEPAGE)
> +		return true;
> +	/* Are THPs disabled for all VMAs in the whole process? */
> +	if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags))
> +		return true;
>  	/*
> -	 * Explicitly disabled through madvise or prctl, or some
> -	 * architectures may disable THP for some mappings, for
> -	 * example, s390 kvm.
> +	 * Are THPs disabled only for VMAs where we didn't get an explicit
> +	 * advise to use them?
>  	 */
> -	return (vm_flags & VM_NOHUGEPAGE) ||
> -	       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
> +	if (vm_flags & VM_HUGEPAGE)
> +		return false;
> +	return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
>  }
>  
>  static inline bool thp_disabled_by_hw(void)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 1ec273b066915..a999f2d352648 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1743,19 +1743,16 @@ enum {
>  #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
>  #define MMF_VM_HUGEPAGE		17	/* set when mm is available for khugepaged */
>  
> -/*
> - * This one-shot flag is dropped due to necessity of changing exe once again
> - * on NFS restore
> - */
> -//#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
> +#define MMF_HUGE_ZERO_PAGE	18      /* mm has ever used the global huge zero page */
>  
>  #define MMF_HAS_UPROBES		19	/* has uprobes */
>  #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
>  #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
>  #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
> -#define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
> -#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
> -#define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
> +#define MMF_DISABLE_THP_EXCEPT_ADVISED	23	/* no THP except for VMAs with VM_HUGEPAGE */
> +#define MMF_DISABLE_THP_COMPLETELY	24	/* no THP for all VMAs */
> +#define MMF_DISABLE_THP_MASK	((1 << MMF_DISABLE_THP_COMPLETELY) |\
> +				 (1 << MMF_DISABLE_THP_EXCEPT_ADVISED))
>  #define MMF_OOM_REAP_QUEUED	25	/* mm was queued for oom_reaper */
>  #define MMF_MULTIPROCESS	26	/* mm is shared between processes */
>  /*
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 43dec6eed559a..1949bb9270d48 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -177,7 +177,14 @@ struct prctl_mm_map {
>  
>  #define PR_GET_TID_ADDRESS	40
>  
> +/*
> + * Flags for PR_SET_THP_DISABLE are only applicable when disabling. Bit 0
> + * is reserved, so PR_GET_THP_DISABLE can return 1 when no other flags were
> + * specified for PR_SET_THP_DISABLE.
> + */
>  #define PR_SET_THP_DISABLE	41
> +/* Don't disable THPs when explicitly advised (MADV_HUGEPAGE / VM_HUGEPAGE). */
> +# define PR_THP_DISABLE_EXCEPT_ADVISED	(1 << 1)
>  #define PR_GET_THP_DISABLE	42
>  
>  /*
> diff --git a/kernel/sys.c b/kernel/sys.c
> index b153fb345ada2..2a34b2f708900 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2423,6 +2423,50 @@ static int prctl_get_auxv(void __user *addr, unsigned long len)
>  	return sizeof(mm->saved_auxv);
>  }
>  
> +static int prctl_get_thp_disable(unsigned long arg2, unsigned long arg3,
> +				 unsigned long arg4, unsigned long arg5)
> +{
> +	unsigned long *mm_flags = &current->mm->flags;
> +
> +	if (arg2 || arg3 || arg4 || arg5)
> +		return -EINVAL;
> +
> +	if (test_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags))
> +		return 1;
> +	else if (test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags))
> +		return 1 | PR_THP_DISABLE_EXCEPT_ADVISED;
> +	return 0;
> +}
> +
> +static int prctl_set_thp_disable(unsigned long thp_disable, unsigned long flags,
> +				 unsigned long arg4, unsigned long arg5)
> +{
> +	unsigned long *mm_flags = &current->mm->flags;
> +
> +	if (arg4 || arg5)
> +		return -EINVAL;
> +
> +	/* Flags are only allowed when disabling. */
> +	if (!thp_disable || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))


I think you meant over here?

	if (!thp_disable && (flags & PR_THP_DISABLE_EXCEPT_ADVISED))

> +		return -EINVAL;
> +	if (mmap_write_lock_killable(current->mm))
> +		return -EINTR;
> +	if (thp_disable) {
> +		if (flags & PR_THP_DISABLE_EXCEPT_ADVISED) {
> +			clear_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
> +			set_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
> +		} else {
> +			set_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
> +			clear_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
> +		}
> +	} else {
> +		clear_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
> +		clear_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
> +	}
> +	mmap_write_unlock(current->mm);
> +	return 0;
> +}
> +
>  SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		unsigned long, arg4, unsigned long, arg5)
>  {
> @@ -2596,20 +2640,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  			return -EINVAL;
>  		return task_no_new_privs(current) ? 1 : 0;
>  	case PR_GET_THP_DISABLE:
> -		if (arg2 || arg3 || arg4 || arg5)
> -			return -EINVAL;
> -		error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
> +		error = prctl_get_thp_disable(arg2, arg3, arg4, arg5);
>  		break;
>  	case PR_SET_THP_DISABLE:
> -		if (arg3 || arg4 || arg5)
> -			return -EINVAL;
> -		if (mmap_write_lock_killable(me->mm))
> -			return -EINTR;
> -		if (arg2)
> -			set_bit(MMF_DISABLE_THP, &me->mm->flags);
> -		else
> -			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
> -		mmap_write_unlock(me->mm);
> +		error = prctl_set_thp_disable(arg2, arg3, arg4, arg5);
>  		break;
>  	case PR_MPX_ENABLE_MANAGEMENT:
>  	case PR_MPX_DISABLE_MANAGEMENT:
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 8a5873d0a23a7..a685077644b4e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -427,7 +427,7 @@ static inline int collapse_test_exit(struct mm_struct *mm)
>  static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
>  {
>  	return collapse_test_exit(mm) ||
> -	       test_bit(MMF_DISABLE_THP, &mm->flags);
> +	       test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
>  }
>  
>  static bool hugepage_enabled(void)
> 
> base-commit: 760b462b3921c5dc8bfa151d2d27a944e4e96081


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
  2025-07-21 14:32 ` Usama Arif
@ 2025-07-21 14:39   ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2025-07-21 14:39 UTC (permalink / raw)
  To: Usama Arif, linux-kernel
  Cc: linux-mm, linux-fsdevel, linux-doc, Jonathan Corbet,
	Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	SeongJae Park, Jann Horn, Yafang Shao, Matthew Wilcox


>> (B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
>>      PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.
>>
>>      For now, it would return 1 if THPs were disabled completely. Now
>>      it essentially returns the set flags as well.
>>
> 
> No strong opinion, but maybe we have it return 2 (i.e. bit 1 set)?
> 
> I know that you are returning bit 1 set to indicate the flag, and I know that
> everyone dislikes prctl so its likely no more flags will be added :),
> but in the off chance there are extra flags, than it can make the return
> value weird?

Well, never say never, so I decided to just return the set flags :)

> If instead we return a value with only a single bit set, might be better?
> 
> Again, no strong opinion here.

I prefer the current approach, until there is good reason not do do it.

IOW, set bit-0 if something is disabled, and specify using flag what 
exectly (provided flags).

> 
>> (C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
>>      the semantics clearly.
>>
>>      Fortunately, there are only two instances outside of prctl() code.
>>
>> (D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
>>      with VM_HUGEPAGE" -- essentially "thp=madvise" behavior
>>
>>      Fortunately, we only have to extend vma_thp_disabled().
>>
>> (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are not
>>      disabled completely
>>
>>      Only indicating that THPs are disabled when they are really disabled
>>      completely, not only partially.
>>
>> The documented semantics in the man page for PR_SET_THP_DISABLE
>> "is inherited by a child created via fork(2) and is preserved across
>> execve(2)" is maintained. This behavior, for example, allows for
>> disabling THPs for a workload through the launching process (e.g.,
>> systemd where we fork() a helper process to then exec()).
>>
>> There is currently not way to prevent that a process will not issue
>> PR_SET_THP_DISABLE itself to re-enable THP. We could add a "seal" option
>> to PR_SET_THP_DISABLE through another flag if ever required. The known
>> users (such as redis) really use PR_SET_THP_DISABLE to disable THPs, so
>> that is not added for now.
>>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Mike Rapoport <rppt@kernel.org>
>> Cc: Suren Baghdasaryan <surenb@google.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Usama Arif <usamaarif642@gmail.com>
>> Cc: SeongJae Park <sj@kernel.org>
>> Cc: Jann Horn <jannh@google.com>
>> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
>> Cc: Yafang Shao <laoar.shao@gmail.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>> ---
>>
>> At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
>> think there might be real use cases where we want to disable any THPs --
>> in particular also around debugging THP-related problems, and
>> "THP=never" not meaning ... "never" anymore. PR_SET_THP_DISABLE will
>> also block MADV_COLLAPSE, which can be very helpful. Of course, I thought
>> of having a system-wide config to change PR_SET_THP_DISABLE behavior, but
>> I just don't like the semantics.
>>
>> "prctl: allow overriding system THP policy to always"[1] proposed
>> "overriding policies to always", which is just the wrong way around: we
>> should not add mechanisms to "enable more" when we already have an
>> interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
>> weird otherwise.
>>
>> "[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed
>> setting the default of the VM_HUGEPAGE, which is similarly the wrong way
>> around I think now.
>>
>> The proposals by Lorenzo to extend process_madvise()[3] and mctrl()[4]
>> similarly were around the "default for VM_HUGEPAGE" idea, but after the
>> discussion, I think we should better leave VM_HUGEPAGE untouched.
>>
>> Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
>> we essentially want to say "leave advised regions alone" -- "keep THP
>> enabled for advised regions",
>>
>> The only thing I really dislike about this is using another MMF_* flag,
>> but well, no way around it -- and seems like we could easily support
>> more than 32 if we want to, or storing this thp information elsewhere.
>>
>> I think this here (modifying an existing toggle) is the only prctl()
>> extension that we might be willing to accept. In general, I agree like
>> most others, that prctl() is a very bad interface for that -- but
>> PR_SET_THP_DISABLE is already there and is getting used.
>>
>> Long-term, I think the answer will be something based on bpf[5]. Maybe
>> in that context, I there could still be value in easily disabling THPs for
>> selected workloads (esp. debugging purposes).
>>
>> Jann raised valid concerns[6] about new flags that are persistent across
>> exec[6]. As this here is a relaxation to existing PR_SET_THP_DISABLE I
>> consider it having a similar security risk as our existing
>> PR_SET_THP_DISABLE, but devil is in the detail.
>>
>> This is *completely* untested and might be utterly broken. It merely
>> serves as a PoC of what I think could be done. If this ever goes upstream,
>> we need some kselftests for it, and extensive tests.
>>
>> [1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
>> [2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
>> [3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
>> [4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
>> [5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
>> [6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com
>>
>> ---
>>   Documentation/filesystems/proc.rst |  5 +--
>>   fs/proc/array.c                    |  2 +-
>>   include/linux/huge_mm.h            | 20 ++++++++---
>>   include/linux/mm_types.h           | 13 +++----
>>   include/uapi/linux/prctl.h         |  7 ++++
>>   kernel/sys.c                       | 58 +++++++++++++++++++++++-------
>>   mm/khugepaged.c                    |  2 +-
>>   7 files changed, 78 insertions(+), 29 deletions(-)
>>
>> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
>> index 2971551b72353..915a3e44bc120 100644
>> --- a/Documentation/filesystems/proc.rst
>> +++ b/Documentation/filesystems/proc.rst
>> @@ -291,8 +291,9 @@ It's slow but very precise.
>>    HugetlbPages                size of hugetlb memory portions
>>    CoreDumping                 process's memory is currently being dumped
>>                                (killing the process may lead to a corrupted core)
>> - THP_enabled		     process is allowed to use THP (returns 0 when
>> -			     PR_SET_THP_DISABLE is set on the process
>> + THP_enabled                 process is allowed to use THP (returns 0 when
>> +                             PR_SET_THP_DISABLE is set on the process to disable
>> +                             THP completely, not just partially)
>>    Threads                     number of threads
>>    SigQ                        number of signals queued/max. number for queue
>>    SigPnd                      bitmap of pending signals for the thread
>> diff --git a/fs/proc/array.c b/fs/proc/array.c
>> index d6a0369caa931..c4f91a784104f 100644
>> --- a/fs/proc/array.c
>> +++ b/fs/proc/array.c
>> @@ -422,7 +422,7 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
>>   	bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
>>   
>>   	if (thp_enabled)
>> -		thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
>> +		thp_enabled = !test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
>>   	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
>>   }
>>   
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index e0a27f80f390d..c4127104d9bc3 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -323,16 +323,26 @@ struct thpsize {
>>   	(transparent_hugepage_flags &					\
>>   	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
>>   
>> +/*
>> + * Check whether THPs are explicitly disabled through madvise or prctl, or some
>> + * architectures may disable THP for some mappings, for example, s390 kvm.
>> + */
>>   static inline bool vma_thp_disabled(struct vm_area_struct *vma,
>>   		vm_flags_t vm_flags)
>>   {
>> +	/* Are THPs disabled for this VMA? */
>> +	if (vm_flags & VM_NOHUGEPAGE)
>> +		return true;
>> +	/* Are THPs disabled for all VMAs in the whole process? */
>> +	if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags))
>> +		return true;
>>   	/*
>> -	 * Explicitly disabled through madvise or prctl, or some
>> -	 * architectures may disable THP for some mappings, for
>> -	 * example, s390 kvm.
>> +	 * Are THPs disabled only for VMAs where we didn't get an explicit
>> +	 * advise to use them?
>>   	 */
>> -	return (vm_flags & VM_NOHUGEPAGE) ||
>> -	       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
>> +	if (vm_flags & VM_HUGEPAGE)
>> +		return false;
>> +	return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
>>   }
>>   
>>   static inline bool thp_disabled_by_hw(void)
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 1ec273b066915..a999f2d352648 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -1743,19 +1743,16 @@ enum {
>>   #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
>>   #define MMF_VM_HUGEPAGE		17	/* set when mm is available for khugepaged */
>>   
>> -/*
>> - * This one-shot flag is dropped due to necessity of changing exe once again
>> - * on NFS restore
>> - */
>> -//#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
>> +#define MMF_HUGE_ZERO_PAGE	18      /* mm has ever used the global huge zero page */
>>   
>>   #define MMF_HAS_UPROBES		19	/* has uprobes */
>>   #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
>>   #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
>>   #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
>> -#define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
>> -#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
>> -#define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
>> +#define MMF_DISABLE_THP_EXCEPT_ADVISED	23	/* no THP except for VMAs with VM_HUGEPAGE */
>> +#define MMF_DISABLE_THP_COMPLETELY	24	/* no THP for all VMAs */
>> +#define MMF_DISABLE_THP_MASK	((1 << MMF_DISABLE_THP_COMPLETELY) |\
>> +				 (1 << MMF_DISABLE_THP_EXCEPT_ADVISED))
>>   #define MMF_OOM_REAP_QUEUED	25	/* mm was queued for oom_reaper */
>>   #define MMF_MULTIPROCESS	26	/* mm is shared between processes */
>>   /*
>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>> index 43dec6eed559a..1949bb9270d48 100644
>> --- a/include/uapi/linux/prctl.h
>> +++ b/include/uapi/linux/prctl.h
>> @@ -177,7 +177,14 @@ struct prctl_mm_map {
>>   
>>   #define PR_GET_TID_ADDRESS	40
>>   
>> +/*
>> + * Flags for PR_SET_THP_DISABLE are only applicable when disabling. Bit 0
>> + * is reserved, so PR_GET_THP_DISABLE can return 1 when no other flags were
>> + * specified for PR_SET_THP_DISABLE.
>> + */
>>   #define PR_SET_THP_DISABLE	41
>> +/* Don't disable THPs when explicitly advised (MADV_HUGEPAGE / VM_HUGEPAGE). */
>> +# define PR_THP_DISABLE_EXCEPT_ADVISED	(1 << 1)
>>   #define PR_GET_THP_DISABLE	42
>>   
>>   /*
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index b153fb345ada2..2a34b2f708900 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -2423,6 +2423,50 @@ static int prctl_get_auxv(void __user *addr, unsigned long len)
>>   	return sizeof(mm->saved_auxv);
>>   }
>>   
>> +static int prctl_get_thp_disable(unsigned long arg2, unsigned long arg3,
>> +				 unsigned long arg4, unsigned long arg5)
>> +{
>> +	unsigned long *mm_flags = &current->mm->flags;
>> +
>> +	if (arg2 || arg3 || arg4 || arg5)
>> +		return -EINVAL;
>> +
>> +	if (test_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags))
>> +		return 1;
>> +	else if (test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags))
>> +		return 1 | PR_THP_DISABLE_EXCEPT_ADVISED;
>> +	return 0;
>> +}
>> +
>> +static int prctl_set_thp_disable(unsigned long thp_disable, unsigned long flags,
>> +				 unsigned long arg4, unsigned long arg5)
>> +{
>> +	unsigned long *mm_flags = &current->mm->flags;
>> +
>> +	if (arg4 || arg5)
>> +		return -EINVAL;
>> +
>> +	/* Flags are only allowed when disabling. */
>> +	if (!thp_disable || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
> 
> 
> I think you meant over here?
> 
> 	if (!thp_disable && (flags & PR_THP_DISABLE_EXCEPT_ADVISED))
> 


When re-enabling, we don't allow flags, otherwise we only allow the 
supported (PR_THP_DISABLE_EXCEPT_ADVISED) flag.

So I think it should probably be something like

if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
  2025-07-21  9:09 [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE David Hildenbrand
                   ` (2 preceding siblings ...)
  2025-07-21 14:32 ` Usama Arif
@ 2025-07-21 17:27 ` Usama Arif
  2025-07-21 19:35   ` SeongJae Park
  2025-07-22 10:23   ` David Hildenbrand
  2025-07-24 18:57 ` Usama Arif
  4 siblings, 2 replies; 18+ messages in thread
From: Usama Arif @ 2025-07-21 17:27 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, linux-fsdevel, linux-doc, Jonathan Corbet,
	Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	SeongJae Park, Jann Horn, Yafang Shao, Matthew Wilcox



On 21/07/2025 10:09, David Hildenbrand wrote:
> People want to make use of more THPs, for example, moving from
> THP=never to THP=madvise, or from THP=madvise to THP=never.
> 
> While this is great news for every THP desperately waiting to get
> allocated out there, apparently there are some workloads that require a
> bit of care during that transition: once problems are detected, these
> workloads should be started with the old behavior, without making all
> other workloads on the system go back to the old behavior as well.
> 
> In essence, the following scenarios are imaginable:
> 
> (1) Switch from THP=none to THP=madvise or THP=always, but keep the old
>     behavior (no THP) for selected workloads.
> 
> (2) Stay at THP=none, but have "madvise" or "always" behavior for
>     selected workloads.
> 
> (3) Switch from THP=madvise to THP=always, but keep the old behavior
>     (THP only when advised) for selected workloads.
> 
> (4) Stay at THP=madvise, but have "always" behavior for selected
>     workloads.
> 
> In essence, (2) can be emulated through (1), by setting THP!=none while
> disabling THPs for all processes that don't want THPs. It requires
> configuring all workloads, but that is a user-space problem to sort out.
> 
> (4) can be emulated through (3) in a similar way.
> 
> Back when (1) was relevant in the past, as people started enabling THPs,
> we added PR_SET_THP_DISABLE, so relevant workloads that were not ready
> yet (i.e., used by Redis) were able to just disable THPs completely. Redis
> still implements the option to use this interface to disable THPs
> completely.
> 
> With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
> workload -- a process, including fork+exec'ed process hierarchy.
> That essentially made us support (1): simply disable THPs for all workloads
> that are not ready for THPs yet, while still enabling THPs system-wide.
> 
> The quest for handling (3) and (4) started, but current approaches
> (completely new prctl, options to set other policies per processm,
>  alternatives to prctl -- mctrl, cgroup handling) don't look particularly
> promising. Likely, the future will use bpf or something similar to
> implement better policies, in particular to also make better decisions
> about THP sizes to use, but this will certainly take a while as that work
> just started.
> 
> Long story short: a simple enable/disable is not really suitable for the
> future, so we're not willing to add completely new toggles.
> 
> While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
> completely for these processes, this scares many THPs in our system
> because they could no longer get allocated where they used to be allocated
> for: regions flagged as VM_HUGEPAGE. Apparently, that imposes a
> problem for relevant workloads, because "not THPs" is certainly worse
> than "THPs only when advised".
> 
> Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not
> explicitly advised by the app through MAD_HUGEPAGE"? *maybe*, but this
> would change the documented semantics quite a bit, and the versatility
> to use it for debugging purposes, so I am not 100% sure that is what we
> want -- although it would certainly be much easier.
> 
> So instead, as an easy way forward for (3) and (4), an option to
> make PR_SET_THP_DISABLE disable *less* THPs for a process.
> 
> In essence, this patch:
> 
> (A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
>     of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).
> 
>     For now, arg3 was not allowed to be set (-EINVAL). Now it holds
>     flags.
> 
> (B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
>     PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.
> 
>     For now, it would return 1 if THPs were disabled completely. Now
>     it essentially returns the set flags as well.
> 
> (C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
>     the semantics clearly.
> 
>     Fortunately, there are only two instances outside of prctl() code.
> 
> (D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
>     with VM_HUGEPAGE" -- essentially "thp=madvise" behavior
> 
>     Fortunately, we only have to extend vma_thp_disabled().
> 
> (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are not
>     disabled completely
> 
>     Only indicating that THPs are disabled when they are really disabled
>     completely, not only partially.
> 
> The documented semantics in the man page for PR_SET_THP_DISABLE
> "is inherited by a child created via fork(2) and is preserved across
> execve(2)" is maintained. This behavior, for example, allows for
> disabling THPs for a workload through the launching process (e.g.,
> systemd where we fork() a helper process to then exec()).
> 
> There is currently not way to prevent that a process will not issue
> PR_SET_THP_DISABLE itself to re-enable THP. We could add a "seal" option
> to PR_SET_THP_DISABLE through another flag if ever required. The known
> users (such as redis) really use PR_SET_THP_DISABLE to disable THPs, so
> that is not added for now.
> 
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Cc: SeongJae Park <sj@kernel.org>
> Cc: Jann Horn <jannh@google.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Yafang Shao <laoar.shao@gmail.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> ---
> 
> At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
> think there might be real use cases where we want to disable any THPs --
> in particular also around debugging THP-related problems, and
> "THP=never" not meaning ... "never" anymore. PR_SET_THP_DISABLE will
> also block MADV_COLLAPSE, which can be very helpful. Of course, I thought
> of having a system-wide config to change PR_SET_THP_DISABLE behavior, but
> I just don't like the semantics.
> 
> "prctl: allow overriding system THP policy to always"[1] proposed
> "overriding policies to always", which is just the wrong way around: we
> should not add mechanisms to "enable more" when we already have an
> interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
> weird otherwise.
> 
> "[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed
> setting the default of the VM_HUGEPAGE, which is similarly the wrong way
> around I think now.
> 
> The proposals by Lorenzo to extend process_madvise()[3] and mctrl()[4]
> similarly were around the "default for VM_HUGEPAGE" idea, but after the
> discussion, I think we should better leave VM_HUGEPAGE untouched.
> 
> Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
> we essentially want to say "leave advised regions alone" -- "keep THP
> enabled for advised regions",
> 
> The only thing I really dislike about this is using another MMF_* flag,
> but well, no way around it -- and seems like we could easily support
> more than 32 if we want to, or storing this thp information elsewhere.
> 
> I think this here (modifying an existing toggle) is the only prctl()
> extension that we might be willing to accept. In general, I agree like
> most others, that prctl() is a very bad interface for that -- but
> PR_SET_THP_DISABLE is already there and is getting used.
> 
> Long-term, I think the answer will be something based on bpf[5]. Maybe
> in that context, I there could still be value in easily disabling THPs for
> selected workloads (esp. debugging purposes).
> 
> Jann raised valid concerns[6] about new flags that are persistent across
> exec[6]. As this here is a relaxation to existing PR_SET_THP_DISABLE I
> consider it having a similar security risk as our existing
> PR_SET_THP_DISABLE, but devil is in the detail.
> 
> This is *completely* untested and might be utterly broken. It merely
> serves as a PoC of what I think could be done. If this ever goes upstream,
> we need some kselftests for it, and extensive tests.
> 
> [1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
> [2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
> [3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
> [4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
> [5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
> [6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com
> 
> ---
>  Documentation/filesystems/proc.rst |  5 +--
>  fs/proc/array.c                    |  2 +-
>  include/linux/huge_mm.h            | 20 ++++++++---
>  include/linux/mm_types.h           | 13 +++----
>  include/uapi/linux/prctl.h         |  7 ++++
>  kernel/sys.c                       | 58 +++++++++++++++++++++++-------
>  mm/khugepaged.c                    |  2 +-
>  7 files changed, 78 insertions(+), 29 deletions(-)


Thanks for the patch David!

As discussed in the other thread, with the below diff

diff --git a/kernel/sys.c b/kernel/sys.c
index 2a34b2f70890..3912f5b6a02d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2447,7 +2447,7 @@ static int prctl_set_thp_disable(unsigned long thp_disable, unsigned long flags,
                return -EINVAL;
 
        /* Flags are only allowed when disabling. */
-       if (!thp_disable || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
+       if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
                return -EINVAL;
        if (mmap_write_lock_killable(current->mm))
                return -EINTR;


I tested with the below selftest, and it works. It hopefully covers
majority of the cases including fork and re-enabling THPs. 
Let me know if it looks ok and please feel free to add this in the
next revision you send.


Once the above diff is included, please feel free to add

Acked-by: Usama Arif <usamaarif642@gmail.com>
Tested-by: Usama Arif <usamaarif642@gmail.com>


Thanks!

From ee9004e7d34511a79726ee1314aec0503e6351d4 Mon Sep 17 00:00:00 2001
From: Usama Arif <usamaarif642@gmail.com>
Date: Thu, 15 May 2025 14:33:33 +0100
Subject: [PATCH] selftests: prctl: introduce tests for
 PR_THP_DISABLE_EXCEPT_ADVISED

The test is limited to 2M PMD THPs. It does not modify the system
settings in order to not disturb other process running in the system.
It checks if the PMD size is 2M, if the 2M policy is set to inherit
and if the system global THP policy is set to "always", so that
the change in behaviour due to PR_THP_DISABLE_EXCEPT_ADVISED can
be seen.

This tests if:
- the process can successfully set the policy
- carry it over to the new process with fork
- if no hugepage is gotten when the process doesn't MADV_HUGEPAGE
- if hugepage is gotten when the process does MADV_HUGEPAGE
- the process can successfully reset the policy to PR_THP_POLICY_SYSTEM
- if hugepage is gotten after the policy reset

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 tools/testing/selftests/prctl/Makefile      |   2 +-
 tools/testing/selftests/prctl/thp_disable.c | 207 ++++++++++++++++++++
 2 files changed, 208 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/prctl/thp_disable.c

diff --git a/tools/testing/selftests/prctl/Makefile b/tools/testing/selftests/prctl/Makefile
index 01dc90fbb509..a3cf76585c48 100644
--- a/tools/testing/selftests/prctl/Makefile
+++ b/tools/testing/selftests/prctl/Makefile
@@ -5,7 +5,7 @@ ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)
 
 ifeq ($(ARCH),x86)
 TEST_PROGS := disable-tsc-ctxt-sw-stress-test disable-tsc-on-off-stress-test \
-		disable-tsc-test set-anon-vma-name-test set-process-name
+		disable-tsc-test set-anon-vma-name-test set-process-name thp_disable
 all: $(TEST_PROGS)
 
 include ../lib.mk
diff --git a/tools/testing/selftests/prctl/thp_disable.c b/tools/testing/selftests/prctl/thp_disable.c
new file mode 100644
index 000000000000..e524723b3313
--- /dev/null
+++ b/tools/testing/selftests/prctl/thp_disable.c
@@ -0,0 +1,207 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This test covers the PR_GET/SET_THP_DISABLE functionality of prctl calls
+ * for PR_THP_DISABLE_EXCEPT_ADVISED
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+
+#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
+#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
+#endif
+
+#define CONTENT_SIZE 256
+#define BUF_SIZE (12 * 2 * 1024 * 1024) // 12 x 2MB pages
+
+enum system_policy {
+	SYSTEM_POLICY_ALWAYS,
+	SYSTEM_POLICY_MADVISE,
+	SYSTEM_POLICY_NEVER,
+};
+
+int system_thp_policy;
+
+/* check if the sysfs file contains the expected substring */
+static int check_file_content(const char *file_path, const char *expected_substring)
+{
+	FILE *file = fopen(file_path, "r");
+	char buffer[CONTENT_SIZE];
+
+	if (!file) {
+		perror("Failed to open file");
+		return -1;
+	}
+	if (fgets(buffer, CONTENT_SIZE, file) == NULL) {
+		perror("Failed to read file");
+		fclose(file);
+		return -1;
+	}
+	fclose(file);
+	// Remove newline character from the buffer
+	buffer[strcspn(buffer, "\n")] = '\0';
+	if (strstr(buffer, expected_substring))
+		return 0;
+	else
+		return 1;
+}
+
+/*
+ * The test is designed for 2M hugepages only.
+ * Check if hugepage size is 2M, if 2M size inherits from global
+ * setting, and if the global setting is always.
+ */
+static int sysfs_check(void)
+{
+	int res = 0;
+
+	res = check_file_content("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "2097152");
+	if (res) {
+		printf("hpage_pmd_size is not set to 2MB. Skipping test.\n");
+		return -1;
+	}
+	res |= check_file_content("/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled",
+				  "[inherit]");
+	if (res) {
+		printf("hugepages-2048kB does not inherit global setting. Skipping test.\n");
+		return -1;
+	}
+
+	res = check_file_content("/sys/kernel/mm/transparent_hugepage/enabled", "[always]");
+	if (!res) {
+		system_thp_policy = SYSTEM_POLICY_ALWAYS;
+		return 0;
+	}
+	printf("Global THP policy not set to always. Skipping test.\n");
+	return -1;
+}
+
+static int check_smaps_for_huge(void)
+{
+	FILE *file = fopen("/proc/self/smaps", "r");
+	int is_anonhuge = 0;
+	char line[256];
+
+	if (!file) {
+		perror("fopen");
+		return -1;
+	}
+
+	while (fgets(line, sizeof(line), file)) {
+		if (strstr(line, "AnonHugePages:") && strstr(line, "24576 kB")) {
+			is_anonhuge = 1;
+			break;
+		}
+	}
+	fclose(file);
+	return is_anonhuge;
+}
+
+static int test_mmap_thp(int madvise_buffer)
+{
+	int is_anonhuge;
+
+	char *buffer = (char *)mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
+				    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (buffer == MAP_FAILED) {
+		perror("mmap");
+		return -1;
+	}
+	if (madvise_buffer)
+		madvise(buffer, BUF_SIZE, MADV_HUGEPAGE);
+
+	// set memory to ensure it's allocated
+	memset(buffer, 0, BUF_SIZE);
+	is_anonhuge = check_smaps_for_huge();
+	munmap(buffer, BUF_SIZE);
+	return is_anonhuge;
+}
+
+/* Global policy is always, process is changed to "madvise only" */
+static int test_global_always_process_madvise(void)
+{
+	int is_anonhuge = 0, res = 0, status = 0;
+	pid_t pid;
+
+	if (prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, NULL, NULL) != 0) {
+		perror("prctl failed to set policy to madvise");
+		return -1;
+	}
+
+	/* Make sure prctl changes are carried across fork */
+	pid = fork();
+	if (pid < 0) {
+		perror("fork");
+		exit(EXIT_FAILURE);
+	}
+
+	res = prctl(PR_GET_THP_DISABLE, NULL, NULL, NULL, NULL);
+	if (res != 3) {
+		printf("prctl PR_GET_THP_POLICY returned %d pid %d\n", res, pid);
+		goto err_out;
+	}
+
+	/* global = always, process = madvise, we shouldn't get HPs without madvise */
+	is_anonhuge = test_mmap_thp(0);
+	if (is_anonhuge) {
+		printf(
+		"PR_THP_POLICY_DEFAULT_NOHUGE set but still got hugepages without MADV_HUGEPAGE\n");
+		goto err_out;
+	}
+
+	is_anonhuge = test_mmap_thp(1);
+	if (!is_anonhuge) {
+		printf(
+		"PR_THP_POLICY_DEFAULT_NOHUGE set but did't get hugepages with MADV_HUGEPAGE\n");
+		goto err_out;
+	}
+
+	/* Reset to system policy */
+	if (prctl(PR_SET_THP_DISABLE, 0, NULL, NULL, NULL) != 0) {
+		perror("prctl failed to set policy to system");
+		goto err_out;
+	}
+
+	is_anonhuge = test_mmap_thp(0);
+	if (!is_anonhuge) {
+		printf("global policy is always but we still didn't get hugepages\n");
+		goto err_out;
+	}
+
+	is_anonhuge = test_mmap_thp(1);
+	if (!is_anonhuge) {
+		printf("global policy is always but we still didn't get hugepages\n");
+		goto err_out;
+	}
+	printf("PASS\n");
+
+	if (pid == 0) {
+		exit(EXIT_SUCCESS);
+	} else {
+		wait(&status);
+		if (WIFEXITED(status))
+			return 0;
+		else
+			return -1;
+	}
+
+err_out:
+	if (pid == 0)
+		exit(EXIT_FAILURE);
+	else
+		return -1;
+}
+
+int main(void)
+{
+	if (sysfs_check())
+		return 0;
+
+	if (system_thp_policy == SYSTEM_POLICY_ALWAYS)
+		return test_global_always_process_madvise();
+
+}
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
  2025-07-21 17:27 ` Usama Arif
@ 2025-07-21 19:35   ` SeongJae Park
  2025-07-22 10:23   ` David Hildenbrand
  1 sibling, 0 replies; 18+ messages in thread
From: SeongJae Park @ 2025-07-21 19:35 UTC (permalink / raw)
  To: Usama Arif
  Cc: SeongJae Park, David Hildenbrand, linux-kernel, linux-mm,
	linux-fsdevel, linux-doc, Jonathan Corbet, Andrew Morton,
	Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Yafang Shao, Matthew Wilcox

On Mon, 21 Jul 2025 18:27:38 +0100 Usama Arif <usamaarif642@gmail.com> wrote:

[...]
> >From ee9004e7d34511a79726ee1314aec0503e6351d4 Mon Sep 17 00:00:00 2001
> From: Usama Arif <usamaarif642@gmail.com>
> Date: Thu, 15 May 2025 14:33:33 +0100
> Subject: [PATCH] selftests: prctl: introduce tests for
>  PR_THP_DISABLE_EXCEPT_ADVISED
> 
> The test is limited to 2M PMD THPs. It does not modify the system
> settings in order to not disturb other process running in the system.
> It checks if the PMD size is 2M, if the 2M policy is set to inherit
> and if the system global THP policy is set to "always", so that
> the change in behaviour due to PR_THP_DISABLE_EXCEPT_ADVISED can
> be seen.
> 
> This tests if:
> - the process can successfully set the policy
> - carry it over to the new process with fork
> - if no hugepage is gotten when the process doesn't MADV_HUGEPAGE
> - if hugepage is gotten when the process does MADV_HUGEPAGE
> - the process can successfully reset the policy to PR_THP_POLICY_SYSTEM
> - if hugepage is gotten after the policy reset

Nice!  I added a few trivial comments below, though.

> 
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>

Acked-by: SeongJae Park <sj@kernel.org>

> ---
>  tools/testing/selftests/prctl/Makefile      |   2 +-
>  tools/testing/selftests/prctl/thp_disable.c | 207 ++++++++++++++++++++

I once thought this might better fit on selftests/mm/, but I found we already
have selftests/prctl/set-anon-vma-name-tests.c, no no strong opinion from my
side.

>  2 files changed, 208 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/prctl/thp_disable.c
> 
> diff --git a/tools/testing/selftests/prctl/Makefile b/tools/testing/selftests/prctl/Makefile
> index 01dc90fbb509..a3cf76585c48 100644
> --- a/tools/testing/selftests/prctl/Makefile
> +++ b/tools/testing/selftests/prctl/Makefile
> @@ -5,7 +5,7 @@ ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)
>  
>  ifeq ($(ARCH),x86)
>  TEST_PROGS := disable-tsc-ctxt-sw-stress-test disable-tsc-on-off-stress-test \
> -		disable-tsc-test set-anon-vma-name-test set-process-name
> +		disable-tsc-test set-anon-vma-name-test set-process-name thp_disable
>  all: $(TEST_PROGS)
>  
>  include ../lib.mk
> diff --git a/tools/testing/selftests/prctl/thp_disable.c b/tools/testing/selftests/prctl/thp_disable.c
> new file mode 100644
> index 000000000000..e524723b3313
> --- /dev/null
> +++ b/tools/testing/selftests/prctl/thp_disable.c
> @@ -0,0 +1,207 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * This test covers the PR_GET/SET_THP_DISABLE functionality of prctl calls
> + * for PR_THP_DISABLE_EXCEPT_ADVISED
> + */
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <sys/mman.h>
> +#include <sys/prctl.h>
> +#include <sys/wait.h>
> +
> +#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
> +#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
> +#endif
> +
> +#define CONTENT_SIZE 256
> +#define BUF_SIZE (12 * 2 * 1024 * 1024) // 12 x 2MB pages
> +
> +enum system_policy {
> +	SYSTEM_POLICY_ALWAYS,
> +	SYSTEM_POLICY_MADVISE,
> +	SYSTEM_POLICY_NEVER,
> +};
> +
> +int system_thp_policy;
> +
> +/* check if the sysfs file contains the expected substring */
> +static int check_file_content(const char *file_path, const char *expected_substring)
> +{
> +	FILE *file = fopen(file_path, "r");
> +	char buffer[CONTENT_SIZE];
> +
> +	if (!file) {
> +		perror("Failed to open file");
> +		return -1;
> +	}
> +	if (fgets(buffer, CONTENT_SIZE, file) == NULL) {
> +		perror("Failed to read file");
> +		fclose(file);
> +		return -1;
> +	}
> +	fclose(file);
> +	// Remove newline character from the buffer

Nit.  I'd suggest using "/* */" consisetntly.

> +	buffer[strcspn(buffer, "\n")] = '\0';
> +	if (strstr(buffer, expected_substring))
> +		return 0;
> +	else
> +		return 1;
> +}
> +
> +/*
> + * The test is designed for 2M hugepages only.
> + * Check if hugepage size is 2M, if 2M size inherits from global
> + * setting, and if the global setting is always.
> + */
> +static int sysfs_check(void)
> +{
> +	int res = 0;
> +
> +	res = check_file_content("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "2097152");
> +	if (res) {
> +		printf("hpage_pmd_size is not set to 2MB. Skipping test.\n");
> +		return -1;

Nit.  Skipping is done by the caller, right?  I think it is more natural to say
"Skipping test" from the caller.

> +	}
> +	res |= check_file_content("/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled",
> +				  "[inherit]");

Nit.  I think we can drop '|' and just do 'res = '.

> +	if (res) {
> +		printf("hugepages-2048kB does not inherit global setting. Skipping test.\n");
> +		return -1;
> +	}
> +
> +	res = check_file_content("/sys/kernel/mm/transparent_hugepage/enabled", "[always]");
> +	if (!res) {

Seems 'res' is being used for only checking whether it is zero.  Maybe doing
'if (check_file_content(...))' and removing 'res' can make code simpler?

> +		system_thp_policy = SYSTEM_POLICY_ALWAYS;
> +		return 0;
> +	}

Also, system_thp_policy is set only here, so we know 'system_thp_policy ==
SYSTEM_POLICY_ALWAYS' if sysfs_check() returned zero.  Maybe system_thp_policy
is not really required?

> +	printf("Global THP policy not set to always. Skipping test.\n");
> +	return -1;
> +}
> +
> +static int check_smaps_for_huge(void)
> +{
> +	FILE *file = fopen("/proc/self/smaps", "r");
> +	int is_anonhuge = 0;
> +	char line[256];
> +
> +	if (!file) {
> +		perror("fopen");
> +		return -1;
> +	}
> +
> +	while (fgets(line, sizeof(line), file)) {
> +		if (strstr(line, "AnonHugePages:") && strstr(line, "24576 kB")) {
> +			is_anonhuge = 1;
> +			break;
> +		}
> +	}
> +	fclose(file);
> +	return is_anonhuge;
> +}
> +
> +static int test_mmap_thp(int madvise_buffer)
> +{
> +	int is_anonhuge;
> +
> +	char *buffer = (char *)mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
> +				    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +	if (buffer == MAP_FAILED) {
> +		perror("mmap");
> +		return -1;
> +	}
> +	if (madvise_buffer)
> +		madvise(buffer, BUF_SIZE, MADV_HUGEPAGE);
> +
> +	// set memory to ensure it's allocated

'/* */' for consistency?

> +	memset(buffer, 0, BUF_SIZE);
> +	is_anonhuge = check_smaps_for_huge();
> +	munmap(buffer, BUF_SIZE);
> +	return is_anonhuge;
> +}
> +
> +/* Global policy is always, process is changed to "madvise only" */
> +static int test_global_always_process_madvise(void)
> +{
> +	int is_anonhuge = 0, res = 0, status = 0;
> +	pid_t pid;
> +
> +	if (prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, NULL, NULL) != 0) {

Nit.  '!= 0' can be dropped.

> +		perror("prctl failed to set policy to madvise");
> +		return -1;
> +	}
> +
> +	/* Make sure prctl changes are carried across fork */
> +	pid = fork();
> +	if (pid < 0) {
> +		perror("fork");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	res = prctl(PR_GET_THP_DISABLE, NULL, NULL, NULL, NULL);
> +	if (res != 3) {
> +		printf("prctl PR_GET_THP_POLICY returned %d pid %d\n", res, pid);
> +		goto err_out;
> +	}
> +
> +	/* global = always, process = madvise, we shouldn't get HPs without madvise */
> +	is_anonhuge = test_mmap_thp(0);
> +	if (is_anonhuge) {
> +		printf(
> +		"PR_THP_POLICY_DEFAULT_NOHUGE set but still got hugepages without MADV_HUGEPAGE\n");
> +		goto err_out;
> +	}
> +
> +	is_anonhuge = test_mmap_thp(1);
> +	if (!is_anonhuge) {
> +		printf(
> +		"PR_THP_POLICY_DEFAULT_NOHUGE set but did't get hugepages with MADV_HUGEPAGE\n");
> +		goto err_out;
> +	}
> +
> +	/* Reset to system policy */
> +	if (prctl(PR_SET_THP_DISABLE, 0, NULL, NULL, NULL) != 0) {
> +		perror("prctl failed to set policy to system");
> +		goto err_out;
> +	}
> +
> +	is_anonhuge = test_mmap_thp(0);
> +	if (!is_anonhuge) {
> +		printf("global policy is always but we still didn't get hugepages\n");
> +		goto err_out;
> +	}
> +
> +	is_anonhuge = test_mmap_thp(1);
> +	if (!is_anonhuge) {
> +		printf("global policy is always but we still didn't get hugepages\n");
> +		goto err_out;
> +	}

Seems is_anonhugepage is used for only whether it is zero or not, just after
being assigned from test_mmap_thp().  How about removing the variable?

> +	printf("PASS\n");
> +
> +	if (pid == 0) {
> +		exit(EXIT_SUCCESS);
> +	} else {
> +		wait(&status);
> +		if (WIFEXITED(status))
> +			return 0;
> +		else
> +			return -1;
> +	}
> +
> +err_out:
> +	if (pid == 0)
> +		exit(EXIT_FAILURE);
> +	else
> +		return -1;
> +}
> +
> +int main(void)
> +{
> +	if (sysfs_check())
> +		return 0;

May better to return KSFT_SKIP insted of 0?

> +
> +	if (system_thp_policy == SYSTEM_POLICY_ALWAYS)

This should be always true, since sysfs_check() returned zero, right?  I think
we can remove this check.

> +		return test_global_always_process_madvise();
> +

Nit.  Unnecessary blank line.

> +}
> -- 
> 2.47.1
> 
> 
> 


Thanks,
SJ

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
  2025-07-21 17:27 ` Usama Arif
  2025-07-21 19:35   ` SeongJae Park
@ 2025-07-22 10:23   ` David Hildenbrand
  2025-07-22 10:27     ` Lorenzo Stoakes
  2025-07-23 17:07     ` Usama Arif
  1 sibling, 2 replies; 18+ messages in thread
From: David Hildenbrand @ 2025-07-22 10:23 UTC (permalink / raw)
  To: Usama Arif, linux-kernel
  Cc: linux-mm, linux-fsdevel, linux-doc, Jonathan Corbet,
	Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	SeongJae Park, Jann Horn, Yafang Shao, Matthew Wilcox

On 21.07.25 19:27, Usama Arif wrote:
> 
> 
> On 21/07/2025 10:09, David Hildenbrand wrote:
>> People want to make use of more THPs, for example, moving from
>> THP=never to THP=madvise, or from THP=madvise to THP=never.
>>
>> While this is great news for every THP desperately waiting to get
>> allocated out there, apparently there are some workloads that require a
>> bit of care during that transition: once problems are detected, these
>> workloads should be started with the old behavior, without making all
>> other workloads on the system go back to the old behavior as well.
>>
>> In essence, the following scenarios are imaginable:
>>
>> (1) Switch from THP=none to THP=madvise or THP=always, but keep the old
>>      behavior (no THP) for selected workloads.
>>
>> (2) Stay at THP=none, but have "madvise" or "always" behavior for
>>      selected workloads.
>>
>> (3) Switch from THP=madvise to THP=always, but keep the old behavior
>>      (THP only when advised) for selected workloads.
>>
>> (4) Stay at THP=madvise, but have "always" behavior for selected
>>      workloads.
>>
>> In essence, (2) can be emulated through (1), by setting THP!=none while
>> disabling THPs for all processes that don't want THPs. It requires
>> configuring all workloads, but that is a user-space problem to sort out.
>>
>> (4) can be emulated through (3) in a similar way.
>>
>> Back when (1) was relevant in the past, as people started enabling THPs,
>> we added PR_SET_THP_DISABLE, so relevant workloads that were not ready
>> yet (i.e., used by Redis) were able to just disable THPs completely. Redis
>> still implements the option to use this interface to disable THPs
>> completely.
>>
>> With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
>> workload -- a process, including fork+exec'ed process hierarchy.
>> That essentially made us support (1): simply disable THPs for all workloads
>> that are not ready for THPs yet, while still enabling THPs system-wide.
>>
>> The quest for handling (3) and (4) started, but current approaches
>> (completely new prctl, options to set other policies per processm,
>>   alternatives to prctl -- mctrl, cgroup handling) don't look particularly
>> promising. Likely, the future will use bpf or something similar to
>> implement better policies, in particular to also make better decisions
>> about THP sizes to use, but this will certainly take a while as that work
>> just started.
>>
>> Long story short: a simple enable/disable is not really suitable for the
>> future, so we're not willing to add completely new toggles.
>>
>> While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
>> completely for these processes, this scares many THPs in our system
>> because they could no longer get allocated where they used to be allocated
>> for: regions flagged as VM_HUGEPAGE. Apparently, that imposes a
>> problem for relevant workloads, because "not THPs" is certainly worse
>> than "THPs only when advised".
>>
>> Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not
>> explicitly advised by the app through MAD_HUGEPAGE"? *maybe*, but this
>> would change the documented semantics quite a bit, and the versatility
>> to use it for debugging purposes, so I am not 100% sure that is what we
>> want -- although it would certainly be much easier.
>>
>> So instead, as an easy way forward for (3) and (4), an option to
>> make PR_SET_THP_DISABLE disable *less* THPs for a process.
>>
>> In essence, this patch:
>>
>> (A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
>>      of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).
>>
>>      For now, arg3 was not allowed to be set (-EINVAL). Now it holds
>>      flags.
>>
>> (B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
>>      PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.
>>
>>      For now, it would return 1 if THPs were disabled completely. Now
>>      it essentially returns the set flags as well.
>>
>> (C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
>>      the semantics clearly.
>>
>>      Fortunately, there are only two instances outside of prctl() code.
>>
>> (D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
>>      with VM_HUGEPAGE" -- essentially "thp=madvise" behavior
>>
>>      Fortunately, we only have to extend vma_thp_disabled().
>>
>> (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are not
>>      disabled completely
>>
>>      Only indicating that THPs are disabled when they are really disabled
>>      completely, not only partially.
>>
>> The documented semantics in the man page for PR_SET_THP_DISABLE
>> "is inherited by a child created via fork(2) and is preserved across
>> execve(2)" is maintained. This behavior, for example, allows for
>> disabling THPs for a workload through the launching process (e.g.,
>> systemd where we fork() a helper process to then exec()).
>>
>> There is currently not way to prevent that a process will not issue
>> PR_SET_THP_DISABLE itself to re-enable THP. We could add a "seal" option
>> to PR_SET_THP_DISABLE through another flag if ever required. The known
>> users (such as redis) really use PR_SET_THP_DISABLE to disable THPs, so
>> that is not added for now.
>>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Mike Rapoport <rppt@kernel.org>
>> Cc: Suren Baghdasaryan <surenb@google.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Usama Arif <usamaarif642@gmail.com>
>> Cc: SeongJae Park <sj@kernel.org>
>> Cc: Jann Horn <jannh@google.com>
>> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
>> Cc: Yafang Shao <laoar.shao@gmail.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>> ---
>>
>> At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
>> think there might be real use cases where we want to disable any THPs --
>> in particular also around debugging THP-related problems, and
>> "THP=never" not meaning ... "never" anymore. PR_SET_THP_DISABLE will
>> also block MADV_COLLAPSE, which can be very helpful. Of course, I thought
>> of having a system-wide config to change PR_SET_THP_DISABLE behavior, but
>> I just don't like the semantics.
>>
>> "prctl: allow overriding system THP policy to always"[1] proposed
>> "overriding policies to always", which is just the wrong way around: we
>> should not add mechanisms to "enable more" when we already have an
>> interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
>> weird otherwise.
>>
>> "[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed
>> setting the default of the VM_HUGEPAGE, which is similarly the wrong way
>> around I think now.
>>
>> The proposals by Lorenzo to extend process_madvise()[3] and mctrl()[4]
>> similarly were around the "default for VM_HUGEPAGE" idea, but after the
>> discussion, I think we should better leave VM_HUGEPAGE untouched.
>>
>> Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
>> we essentially want to say "leave advised regions alone" -- "keep THP
>> enabled for advised regions",
>>
>> The only thing I really dislike about this is using another MMF_* flag,
>> but well, no way around it -- and seems like we could easily support
>> more than 32 if we want to, or storing this thp information elsewhere.
>>
>> I think this here (modifying an existing toggle) is the only prctl()
>> extension that we might be willing to accept. In general, I agree like
>> most others, that prctl() is a very bad interface for that -- but
>> PR_SET_THP_DISABLE is already there and is getting used.
>>
>> Long-term, I think the answer will be something based on bpf[5]. Maybe
>> in that context, I there could still be value in easily disabling THPs for
>> selected workloads (esp. debugging purposes).
>>
>> Jann raised valid concerns[6] about new flags that are persistent across
>> exec[6]. As this here is a relaxation to existing PR_SET_THP_DISABLE I
>> consider it having a similar security risk as our existing
>> PR_SET_THP_DISABLE, but devil is in the detail.
>>
>> This is *completely* untested and might be utterly broken. It merely
>> serves as a PoC of what I think could be done. If this ever goes upstream,
>> we need some kselftests for it, and extensive tests.
>>
>> [1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
>> [2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
>> [3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
>> [4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
>> [5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
>> [6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com
>>
>> ---
>>   Documentation/filesystems/proc.rst |  5 +--
>>   fs/proc/array.c                    |  2 +-
>>   include/linux/huge_mm.h            | 20 ++++++++---
>>   include/linux/mm_types.h           | 13 +++----
>>   include/uapi/linux/prctl.h         |  7 ++++
>>   kernel/sys.c                       | 58 +++++++++++++++++++++++-------
>>   mm/khugepaged.c                    |  2 +-
>>   7 files changed, 78 insertions(+), 29 deletions(-)
> 
> 
> Thanks for the patch David!
> 
> As discussed in the other thread, with the below diff
> 
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 2a34b2f70890..3912f5b6a02d 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2447,7 +2447,7 @@ static int prctl_set_thp_disable(unsigned long thp_disable, unsigned long flags,
>                  return -EINVAL;
>   
>          /* Flags are only allowed when disabling. */
> -       if (!thp_disable || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
> +       if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
>                  return -EINVAL;
>          if (mmap_write_lock_killable(current->mm))
>                  return -EINTR;
> 
> 
> I tested with the below selftest, and it works. It hopefully covers
> majority of the cases including fork and re-enabling THPs.
> Let me know if it looks ok and please feel free to add this in the
> next revision you send.
> 
> 
> Once the above diff is included, please feel free to add
> 
> Acked-by: Usama Arif <usamaarif642@gmail.com>
> Tested-by: Usama Arif <usamaarif642@gmail.com>

Thanks!

The latest version lives at

   https://github.com/davidhildenbrand/linux/tree/PR_SET_THP_DISABLE

With all current review feedback addressed (primarily around 
description+comments) + that one fix.


> 
> 
> Thanks!
> 
>  From ee9004e7d34511a79726ee1314aec0503e6351d4 Mon Sep 17 00:00:00 2001
> From: Usama Arif <usamaarif642@gmail.com>
> Date: Thu, 15 May 2025 14:33:33 +0100
> Subject: [PATCH] selftests: prctl: introduce tests for
>   PR_THP_DISABLE_EXCEPT_ADVISED
> 
> The test is limited to 2M PMD THPs. It does not modify the system
> settings in order to not disturb other process running in the system.
> It checks if the PMD size is 2M, if the 2M policy is set to inherit
> and if the system global THP policy is set to "always", so that
> the change in behaviour due to PR_THP_DISABLE_EXCEPT_ADVISED can
> be seen.
> 
> This tests if:
> - the process can successfully set the policy
> - carry it over to the new process with fork
> - if no hugepage is gotten when the process doesn't MADV_HUGEPAGE
> - if hugepage is gotten when the process does MADV_HUGEPAGE
> - the process can successfully reset the policy to PR_THP_POLICY_SYSTEM
> - if hugepage is gotten after the policy reset
> 
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
>   tools/testing/selftests/prctl/Makefile      |   2 +-
>   tools/testing/selftests/prctl/thp_disable.c | 207 ++++++++++++++++++++

Like SJ says, this should better live under mm, then we can also make 
use of check_huge_anon() and vm_utils.c and probably also THP helpers 
from thp_settings.h. Most of the helpers you use should be available in 
some form there already.

With THP helpers in thp_settings.h, you can explicitly set the system 
policy, to then reset to eh previous version IIRC.

Further, can you make sure to use kselftest infrastructure for the test, 
preferrably kselftest_harness.h? (see pfnmap.c on one of my latest 
selftests)

I also wonder if we want to test old behavior, without the flag set. We 
could also test that MADV_COLLAPSE doesn't succeed in either case.

Ideally, you'd send my patch (see above) along with the selftest, as I 
suspect there will be more review+changes to the selftest (and only 
smaller changes to my patch).

Thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
  2025-07-22 10:23   ` David Hildenbrand
@ 2025-07-22 10:27     ` Lorenzo Stoakes
  2025-07-23 17:07     ` Usama Arif
  1 sibling, 0 replies; 18+ messages in thread
From: Lorenzo Stoakes @ 2025-07-22 10:27 UTC (permalink / raw)
  To: Usama Arif
  Cc: David Hildenbrand, linux-kernel, linux-mm, linux-fsdevel,
	linux-doc, Jonathan Corbet, Andrew Morton, Zi Yan, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	SeongJae Park, Jann Horn, Yafang Shao, Matthew Wilcox

On Tue, Jul 22, 2025 at 12:23:04PM +0200, David Hildenbrand wrote:
> On 21.07.25 19:27, Usama Arif wrote:
> >   tools/testing/selftests/prctl/Makefile      |   2 +-
> >   tools/testing/selftests/prctl/thp_disable.c | 207 ++++++++++++++++++++
>
> Like SJ says, this should better live under mm, then we can also make use of
> check_huge_anon() and vm_utils.c and probably also THP helpers from
> thp_settings.h. Most of the helpers you use should be available in some form
> there already.

*A wild Lorenzo appears*

I mean everyone's saying the same thing, but you'd almost be disappointed
if I didn't say here 'pleeeease keep as much of this in mm as possible'
also :P

Thanks for taking this on and writing the test Usama! :)

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
  2025-07-22 10:23   ` David Hildenbrand
  2025-07-22 10:27     ` Lorenzo Stoakes
@ 2025-07-23 17:07     ` Usama Arif
  2025-07-23 18:02       ` David Hildenbrand
  1 sibling, 1 reply; 18+ messages in thread
From: Usama Arif @ 2025-07-23 17:07 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, linux-fsdevel, linux-doc, Jonathan Corbet,
	Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	SeongJae Park, Jann Horn, Yafang Shao, Matthew Wilcox

>> Thanks for the patch David!
>>
>> As discussed in the other thread, with the below diff
>>
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index 2a34b2f70890..3912f5b6a02d 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -2447,7 +2447,7 @@ static int prctl_set_thp_disable(unsigned long thp_disable, unsigned long flags,
>>                  return -EINVAL;
>>            /* Flags are only allowed when disabling. */
>> -       if (!thp_disable || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
>> +       if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
>>                  return -EINVAL;
>>          if (mmap_write_lock_killable(current->mm))
>>                  return -EINTR;
>>
>>
>> I tested with the below selftest, and it works. It hopefully covers
>> majority of the cases including fork and re-enabling THPs.
>> Let me know if it looks ok and please feel free to add this in the
>> next revision you send.
>>
>>
>> Once the above diff is included, please feel free to add
>>
>> Acked-by: Usama Arif <usamaarif642@gmail.com>
>> Tested-by: Usama Arif <usamaarif642@gmail.com>
> 
> Thanks!
> 
> The latest version lives at
> 
>   https://github.com/davidhildenbrand/linux/tree/PR_SET_THP_DISABLE
> 
> With all current review feedback addressed (primarily around description+comments) + that one fix.
> 
> 

Hi David,

Just wanted to check if the above branch is up to date?

I didn't check the description/comments, but it still has [1]:

if (!thp_disable || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))

and not 

if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))

in prctl_set_thp_disable, which is causing the reset to system policy case in my selftest to fail.

[1] https://github.com/davidhildenbrand/linux/commit/5711cdf5dfe65ca28dac2a57d62e18f1475dac57#diff-dc9985831020a20a54baf023fec641593d0d4e75a78988c3b35a176aff1c0321R2450


Thanks,
Usama


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
  2025-07-23 17:07     ` Usama Arif
@ 2025-07-23 18:02       ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2025-07-23 18:02 UTC (permalink / raw)
  To: Usama Arif, linux-kernel
  Cc: linux-mm, linux-fsdevel, linux-doc, Jonathan Corbet,
	Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	SeongJae Park, Jann Horn, Yafang Shao, Matthew Wilcox

On 23.07.25 19:07, Usama Arif wrote:
>>> Thanks for the patch David!
>>>
>>> As discussed in the other thread, with the below diff
>>>
>>> diff --git a/kernel/sys.c b/kernel/sys.c
>>> index 2a34b2f70890..3912f5b6a02d 100644
>>> --- a/kernel/sys.c
>>> +++ b/kernel/sys.c
>>> @@ -2447,7 +2447,7 @@ static int prctl_set_thp_disable(unsigned long thp_disable, unsigned long flags,
>>>                   return -EINVAL;
>>>             /* Flags are only allowed when disabling. */
>>> -       if (!thp_disable || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
>>> +       if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
>>>                   return -EINVAL;
>>>           if (mmap_write_lock_killable(current->mm))
>>>                   return -EINTR;
>>>
>>>
>>> I tested with the below selftest, and it works. It hopefully covers
>>> majority of the cases including fork and re-enabling THPs.
>>> Let me know if it looks ok and please feel free to add this in the
>>> next revision you send.
>>>
>>>
>>> Once the above diff is included, please feel free to add
>>>
>>> Acked-by: Usama Arif <usamaarif642@gmail.com>
>>> Tested-by: Usama Arif <usamaarif642@gmail.com>
>>
>> Thanks!
>>
>> The latest version lives at
>>
>>    https://github.com/davidhildenbrand/linux/tree/PR_SET_THP_DISABLE
>>
>> With all current review feedback addressed (primarily around description+comments) + that one fix.
>>
>>
> 
> Hi David,
> 
> Just wanted to check if the above branch is up to date?
> 

No, I forgot to push. Now pushed.

This is the diff (excluding description changes):

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c4127104d9bc3..527aa4c9645fd 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -324,8 +324,8 @@ struct thpsize {
          (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
  
  /*
- * Check whether THPs are explicitly disabled through madvise or prctl, or some
- * architectures may disable THP for some mappings, for example, s390 kvm.
+ * Check whether THPs are explicitly disabled for this VMA, for example,
+ * through madvise or prctl.
   */
  static inline bool vma_thp_disabled(struct vm_area_struct *vma,
                 vm_flags_t vm_flags)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 1949bb9270d48..60e496ecabe04 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -179,8 +179,8 @@ struct prctl_mm_map {
  
  /*
   * Flags for PR_SET_THP_DISABLE are only applicable when disabling. Bit 0
- * is reserved, so PR_GET_THP_DISABLE can return 1 when no other flags were
- * specified for PR_SET_THP_DISABLE.
+ * is reserved, so PR_GET_THP_DISABLE can return "1 | flags", to effectively
+ * return "1" when no flags were specified for PR_SET_THP_DISABLE.
   */
  #define PR_SET_THP_DISABLE     41
  /* Don't disable THPs when explicitly advised (MADV_HUGEPAGE / VM_HUGEPAGE). */
diff --git a/kernel/sys.c b/kernel/sys.c
index 2a34b2f708900..b87d0acaab0be 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2438,7 +2438,7 @@ static int prctl_get_thp_disable(unsigned long arg2, unsigned long arg3,
         return 0;
  }
  
-static int prctl_set_thp_disable(unsigned long thp_disable, unsigned long flags,
+static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
                                  unsigned long arg4, unsigned long arg5)
  {
         unsigned long *mm_flags = &current->mm->flags;
@@ -2447,7 +2447,7 @@ static int prctl_set_thp_disable(unsigned long thp_disable, unsigned long flags,
                 return -EINVAL;
  
         /* Flags are only allowed when disabling. */
-       if (!thp_disable || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
+       if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
                 return -EINVAL;
         if (mmap_write_lock_killable(current->mm))
                 return -EINTR;


-- 
Cheers,

David / dhildenb


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
  2025-07-21  9:09 [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE David Hildenbrand
                   ` (3 preceding siblings ...)
  2025-07-21 17:27 ` Usama Arif
@ 2025-07-24 18:57 ` Usama Arif
  2025-07-24 19:07   ` David Hildenbrand
  4 siblings, 1 reply; 18+ messages in thread
From: Usama Arif @ 2025-07-24 18:57 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, linux-fsdevel, linux-doc, Jonathan Corbet,
	Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	SeongJae Park, Jann Horn, Yafang Shao, Matthew Wilcox,
	Johannes Weiner



On 21/07/2025 10:09, David Hildenbrand wrote:
> People want to make use of more THPs, for example, moving from
> THP=never to THP=madvise, or from THP=madvise to THP=never.
> 
> While this is great news for every THP desperately waiting to get
> allocated out there, apparently there are some workloads that require a
> bit of care during that transition: once problems are detected, these
> workloads should be started with the old behavior, without making all
> other workloads on the system go back to the old behavior as well.
> 
> In essence, the following scenarios are imaginable:
> 
> (1) Switch from THP=none to THP=madvise or THP=always, but keep the old
>     behavior (no THP) for selected workloads.
> 
> (2) Stay at THP=none, but have "madvise" or "always" behavior for
>     selected workloads.
> 
> (3) Switch from THP=madvise to THP=always, but keep the old behavior
>     (THP only when advised) for selected workloads.
> 
> (4) Stay at THP=madvise, but have "always" behavior for selected
>     workloads.
> 
> In essence, (2) can be emulated through (1), by setting THP!=none while
> disabling THPs for all processes that don't want THPs. It requires
> configuring all workloads, but that is a user-space problem to sort out.
> 
> (4) can be emulated through (3) in a similar way.
> 
> Back when (1) was relevant in the past, as people started enabling THPs,
> we added PR_SET_THP_DISABLE, so relevant workloads that were not ready
> yet (i.e., used by Redis) were able to just disable THPs completely. Redis
> still implements the option to use this interface to disable THPs
> completely.
> 
> With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
> workload -- a process, including fork+exec'ed process hierarchy.
> That essentially made us support (1): simply disable THPs for all workloads
> that are not ready for THPs yet, while still enabling THPs system-wide.
> 
> The quest for handling (3) and (4) started, but current approaches
> (completely new prctl, options to set other policies per processm,
>  alternatives to prctl -- mctrl, cgroup handling) don't look particularly
> promising. Likely, the future will use bpf or something similar to
> implement better policies, in particular to also make better decisions
> about THP sizes to use, but this will certainly take a while as that work
> just started.
> 
> Long story short: a simple enable/disable is not really suitable for the
> future, so we're not willing to add completely new toggles.
> 
> While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
> completely for these processes, this scares many THPs in our system
> because they could no longer get allocated where they used to be allocated
> for: regions flagged as VM_HUGEPAGE. Apparently, that imposes a
> problem for relevant workloads, because "not THPs" is certainly worse
> than "THPs only when advised".
> 
> Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not
> explicitly advised by the app through MAD_HUGEPAGE"? *maybe*, but this
> would change the documented semantics quite a bit, and the versatility
> to use it for debugging purposes, so I am not 100% sure that is what we
> want -- although it would certainly be much easier.
> 
> So instead, as an easy way forward for (3) and (4), an option to
> make PR_SET_THP_DISABLE disable *less* THPs for a process.
> 
> In essence, this patch:
> 
> (A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
>     of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).
> 
>     For now, arg3 was not allowed to be set (-EINVAL). Now it holds
>     flags.
> 
> (B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
>     PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.
> 
>     For now, it would return 1 if THPs were disabled completely. Now
>     it essentially returns the set flags as well.
> 
> (C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
>     the semantics clearly.
> 
>     Fortunately, there are only two instances outside of prctl() code.
> 
> (D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
>     with VM_HUGEPAGE" -- essentially "thp=madvise" behavior
> 
>     Fortunately, we only have to extend vma_thp_disabled().
> 
> (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are not
>     disabled completely
> 
>     Only indicating that THPs are disabled when they are really disabled
>     completely, not only partially.
> 
> The documented semantics in the man page for PR_SET_THP_DISABLE
> "is inherited by a child created via fork(2) and is preserved across
> execve(2)" is maintained. This behavior, for example, allows for
> disabling THPs for a workload through the launching process (e.g.,
> systemd where we fork() a helper process to then exec()).
> 
> There is currently not way to prevent that a process will not issue
> PR_SET_THP_DISABLE itself to re-enable THP. We could add a "seal" option
> to PR_SET_THP_DISABLE through another flag if ever required. The known
> users (such as redis) really use PR_SET_THP_DISABLE to disable THPs, so
> that is not added for now.
> 
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Cc: SeongJae Park <sj@kernel.org>
> Cc: Jann Horn <jannh@google.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Yafang Shao <laoar.shao@gmail.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> ---
> 
> At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
> think there might be real use cases where we want to disable any THPs --
> in particular also around debugging THP-related problems, and
> "THP=never" not meaning ... "never" anymore. PR_SET_THP_DISABLE will
> also block MADV_COLLAPSE, which can be very helpful. Of course, I thought
> of having a system-wide config to change PR_SET_THP_DISABLE behavior, but
> I just don't like the semantics.
> 
> "prctl: allow overriding system THP policy to always"[1] proposed
> "overriding policies to always", which is just the wrong way around: we
> should not add mechanisms to "enable more" when we already have an
> interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
> weird otherwise.
> 
> "[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed
> setting the default of the VM_HUGEPAGE, which is similarly the wrong way
> around I think now.
> 
> The proposals by Lorenzo to extend process_madvise()[3] and mctrl()[4]
> similarly were around the "default for VM_HUGEPAGE" idea, but after the
> discussion, I think we should better leave VM_HUGEPAGE untouched.
> 
> Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
> we essentially want to say "leave advised regions alone" -- "keep THP
> enabled for advised regions",
> 
> The only thing I really dislike about this is using another MMF_* flag,
> but well, no way around it -- and seems like we could easily support
> more than 32 if we want to, or storing this thp information elsewhere.
> 
> I think this here (modifying an existing toggle) is the only prctl()
> extension that we might be willing to accept. In general, I agree like
> most others, that prctl() is a very bad interface for that -- but
> PR_SET_THP_DISABLE is already there and is getting used.
> 
> Long-term, I think the answer will be something based on bpf[5]. Maybe
> in that context, I there could still be value in easily disabling THPs for
> selected workloads (esp. debugging purposes).
> 
> Jann raised valid concerns[6] about new flags that are persistent across
> exec[6]. As this here is a relaxation to existing PR_SET_THP_DISABLE I
> consider it having a similar security risk as our existing
> PR_SET_THP_DISABLE, but devil is in the detail.
> 
> This is *completely* untested and might be utterly broken. It merely
> serves as a PoC of what I think could be done. If this ever goes upstream,
> we need some kselftests for it, and extensive tests.
> 
> [1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
> [2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
> [3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
> [4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
> [5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
> [6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com
> 
> ---
>  Documentation/filesystems/proc.rst |  5 +--
>  fs/proc/array.c                    |  2 +-
>  include/linux/huge_mm.h            | 20 ++++++++---
>  include/linux/mm_types.h           | 13 +++----
>  include/uapi/linux/prctl.h         |  7 ++++
>  kernel/sys.c                       | 58 +++++++++++++++++++++++-------
>  mm/khugepaged.c                    |  2 +-
>  7 files changed, 78 insertions(+), 29 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 2971551b72353..915a3e44bc120 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -291,8 +291,9 @@ It's slow but very precise.
>   HugetlbPages                size of hugetlb memory portions
>   CoreDumping                 process's memory is currently being dumped
>                               (killing the process may lead to a corrupted core)
> - THP_enabled		     process is allowed to use THP (returns 0 when
> -			     PR_SET_THP_DISABLE is set on the process
> + THP_enabled                 process is allowed to use THP (returns 0 when
> +                             PR_SET_THP_DISABLE is set on the process to disable
> +                             THP completely, not just partially)
>   Threads                     number of threads
>   SigQ                        number of signals queued/max. number for queue
>   SigPnd                      bitmap of pending signals for the thread
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index d6a0369caa931..c4f91a784104f 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -422,7 +422,7 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
>  	bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
>  
>  	if (thp_enabled)
> -		thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
> +		thp_enabled = !test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
>  	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
>  }
>  
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index e0a27f80f390d..c4127104d9bc3 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -323,16 +323,26 @@ struct thpsize {
>  	(transparent_hugepage_flags &					\
>  	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
>  
> +/*
> + * Check whether THPs are explicitly disabled through madvise or prctl, or some
> + * architectures may disable THP for some mappings, for example, s390 kvm.
> + */
>  static inline bool vma_thp_disabled(struct vm_area_struct *vma,
>  		vm_flags_t vm_flags)
>  {
> +	/* Are THPs disabled for this VMA? */
> +	if (vm_flags & VM_NOHUGEPAGE)
> +		return true;
> +	/* Are THPs disabled for all VMAs in the whole process? */
> +	if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags))
> +		return true;
>  	/*
> -	 * Explicitly disabled through madvise or prctl, or some
> -	 * architectures may disable THP for some mappings, for
> -	 * example, s390 kvm.
> +	 * Are THPs disabled only for VMAs where we didn't get an explicit
> +	 * advise to use them?
>  	 */
> -	return (vm_flags & VM_NOHUGEPAGE) ||
> -	       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
> +	if (vm_flags & VM_HUGEPAGE)
> +		return false;
> +	return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
>  }


Hi David,

Over here, with MMF_DISABLE_THP_EXCEPT_ADVISED, MADV_HUGEPAGE will succeed as vm_flags has
VM_HUGEPAGE set, but MADV_COLLAPSE will fail to give a hugepage (as VM_HUGEPAGE is not set
and MMF_DISABLE_THP_EXCEPT_ADVISED is set) which I feel might not be the right behaviour
as MADV_COLLAPSE is "advise" and the prctl flag is PR_THP_DISABLE_EXCEPT_ADVISED?

This will be checked in multiple places in madvise_collapse: thp_vma_allowable_order,
hugepage_vma_revalidate which calls thp_vma_allowable_order and hpage_collapse_scan_pmd
which also ends up calling hugepage_vma_revalidate. 

A hacky way would be to save and overwrite vma->vm_flags with VM_HUGEPAGE at the start of madvise_collapse
if VM_NOHUGEPAGE is not set, and reset vma->vm_flags to its original value at the end of madvise_collapse
(Not something I am recommending, just throwing it out there).

Another possibility is to pass the fact that you are in madvise_collapse to these functions 
as an argument, this might look ugly, although maybe not as ugly as hugepage_vma_revalidate
already has collapse control arg, so just need to take care of thp_vma_allowable_orders.

Any preference or better suggestions?

Thanks!
Usama


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
  2025-07-24 18:57 ` Usama Arif
@ 2025-07-24 19:07   ` David Hildenbrand
  2025-07-24 22:27     ` Usama Arif
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2025-07-24 19:07 UTC (permalink / raw)
  To: Usama Arif, linux-kernel
  Cc: linux-mm, linux-fsdevel, linux-doc, Jonathan Corbet,
	Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	SeongJae Park, Jann Horn, Yafang Shao, Matthew Wilcox,
	Johannes Weiner

On 24.07.25 20:57, Usama Arif wrote:
> 
> 
> On 21/07/2025 10:09, David Hildenbrand wrote:
>> People want to make use of more THPs, for example, moving from
>> THP=never to THP=madvise, or from THP=madvise to THP=never.
>>
>> While this is great news for every THP desperately waiting to get
>> allocated out there, apparently there are some workloads that require a
>> bit of care during that transition: once problems are detected, these
>> workloads should be started with the old behavior, without making all
>> other workloads on the system go back to the old behavior as well.
>>
>> In essence, the following scenarios are imaginable:
>>
>> (1) Switch from THP=none to THP=madvise or THP=always, but keep the old
>>      behavior (no THP) for selected workloads.
>>
>> (2) Stay at THP=none, but have "madvise" or "always" behavior for
>>      selected workloads.
>>
>> (3) Switch from THP=madvise to THP=always, but keep the old behavior
>>      (THP only when advised) for selected workloads.
>>
>> (4) Stay at THP=madvise, but have "always" behavior for selected
>>      workloads.
>>
>> In essence, (2) can be emulated through (1), by setting THP!=none while
>> disabling THPs for all processes that don't want THPs. It requires
>> configuring all workloads, but that is a user-space problem to sort out.
>>
>> (4) can be emulated through (3) in a similar way.
>>
>> Back when (1) was relevant in the past, as people started enabling THPs,
>> we added PR_SET_THP_DISABLE, so relevant workloads that were not ready
>> yet (i.e., used by Redis) were able to just disable THPs completely. Redis
>> still implements the option to use this interface to disable THPs
>> completely.
>>
>> With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
>> workload -- a process, including fork+exec'ed process hierarchy.
>> That essentially made us support (1): simply disable THPs for all workloads
>> that are not ready for THPs yet, while still enabling THPs system-wide.
>>
>> The quest for handling (3) and (4) started, but current approaches
>> (completely new prctl, options to set other policies per processm,
>>   alternatives to prctl -- mctrl, cgroup handling) don't look particularly
>> promising. Likely, the future will use bpf or something similar to
>> implement better policies, in particular to also make better decisions
>> about THP sizes to use, but this will certainly take a while as that work
>> just started.
>>
>> Long story short: a simple enable/disable is not really suitable for the
>> future, so we're not willing to add completely new toggles.
>>
>> While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
>> completely for these processes, this scares many THPs in our system
>> because they could no longer get allocated where they used to be allocated
>> for: regions flagged as VM_HUGEPAGE. Apparently, that imposes a
>> problem for relevant workloads, because "not THPs" is certainly worse
>> than "THPs only when advised".
>>
>> Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not
>> explicitly advised by the app through MAD_HUGEPAGE"? *maybe*, but this
>> would change the documented semantics quite a bit, and the versatility
>> to use it for debugging purposes, so I am not 100% sure that is what we
>> want -- although it would certainly be much easier.
>>
>> So instead, as an easy way forward for (3) and (4), an option to
>> make PR_SET_THP_DISABLE disable *less* THPs for a process.
>>
>> In essence, this patch:
>>
>> (A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
>>      of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).
>>
>>      For now, arg3 was not allowed to be set (-EINVAL). Now it holds
>>      flags.
>>
>> (B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
>>      PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.
>>
>>      For now, it would return 1 if THPs were disabled completely. Now
>>      it essentially returns the set flags as well.
>>
>> (C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
>>      the semantics clearly.
>>
>>      Fortunately, there are only two instances outside of prctl() code.
>>
>> (D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
>>      with VM_HUGEPAGE" -- essentially "thp=madvise" behavior
>>
>>      Fortunately, we only have to extend vma_thp_disabled().
>>
>> (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are not
>>      disabled completely
>>
>>      Only indicating that THPs are disabled when they are really disabled
>>      completely, not only partially.
>>
>> The documented semantics in the man page for PR_SET_THP_DISABLE
>> "is inherited by a child created via fork(2) and is preserved across
>> execve(2)" is maintained. This behavior, for example, allows for
>> disabling THPs for a workload through the launching process (e.g.,
>> systemd where we fork() a helper process to then exec()).
>>
>> There is currently not way to prevent that a process will not issue
>> PR_SET_THP_DISABLE itself to re-enable THP. We could add a "seal" option
>> to PR_SET_THP_DISABLE through another flag if ever required. The known
>> users (such as redis) really use PR_SET_THP_DISABLE to disable THPs, so
>> that is not added for now.
>>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Mike Rapoport <rppt@kernel.org>
>> Cc: Suren Baghdasaryan <surenb@google.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Usama Arif <usamaarif642@gmail.com>
>> Cc: SeongJae Park <sj@kernel.org>
>> Cc: Jann Horn <jannh@google.com>
>> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
>> Cc: Yafang Shao <laoar.shao@gmail.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>> ---
>>
>> At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
>> think there might be real use cases where we want to disable any THPs --
>> in particular also around debugging THP-related problems, and
>> "THP=never" not meaning ... "never" anymore. PR_SET_THP_DISABLE will
>> also block MADV_COLLAPSE, which can be very helpful. Of course, I thought
>> of having a system-wide config to change PR_SET_THP_DISABLE behavior, but
>> I just don't like the semantics.
>>
>> "prctl: allow overriding system THP policy to always"[1] proposed
>> "overriding policies to always", which is just the wrong way around: we
>> should not add mechanisms to "enable more" when we already have an
>> interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
>> weird otherwise.
>>
>> "[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed
>> setting the default of the VM_HUGEPAGE, which is similarly the wrong way
>> around I think now.
>>
>> The proposals by Lorenzo to extend process_madvise()[3] and mctrl()[4]
>> similarly were around the "default for VM_HUGEPAGE" idea, but after the
>> discussion, I think we should better leave VM_HUGEPAGE untouched.
>>
>> Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
>> we essentially want to say "leave advised regions alone" -- "keep THP
>> enabled for advised regions",
>>
>> The only thing I really dislike about this is using another MMF_* flag,
>> but well, no way around it -- and seems like we could easily support
>> more than 32 if we want to, or storing this thp information elsewhere.
>>
>> I think this here (modifying an existing toggle) is the only prctl()
>> extension that we might be willing to accept. In general, I agree like
>> most others, that prctl() is a very bad interface for that -- but
>> PR_SET_THP_DISABLE is already there and is getting used.
>>
>> Long-term, I think the answer will be something based on bpf[5]. Maybe
>> in that context, I there could still be value in easily disabling THPs for
>> selected workloads (esp. debugging purposes).
>>
>> Jann raised valid concerns[6] about new flags that are persistent across
>> exec[6]. As this here is a relaxation to existing PR_SET_THP_DISABLE I
>> consider it having a similar security risk as our existing
>> PR_SET_THP_DISABLE, but devil is in the detail.
>>
>> This is *completely* untested and might be utterly broken. It merely
>> serves as a PoC of what I think could be done. If this ever goes upstream,
>> we need some kselftests for it, and extensive tests.
>>
>> [1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
>> [2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
>> [3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
>> [4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
>> [5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
>> [6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com
>>
>> ---
>>   Documentation/filesystems/proc.rst |  5 +--
>>   fs/proc/array.c                    |  2 +-
>>   include/linux/huge_mm.h            | 20 ++++++++---
>>   include/linux/mm_types.h           | 13 +++----
>>   include/uapi/linux/prctl.h         |  7 ++++
>>   kernel/sys.c                       | 58 +++++++++++++++++++++++-------
>>   mm/khugepaged.c                    |  2 +-
>>   7 files changed, 78 insertions(+), 29 deletions(-)
>>
>> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
>> index 2971551b72353..915a3e44bc120 100644
>> --- a/Documentation/filesystems/proc.rst
>> +++ b/Documentation/filesystems/proc.rst
>> @@ -291,8 +291,9 @@ It's slow but very precise.
>>    HugetlbPages                size of hugetlb memory portions
>>    CoreDumping                 process's memory is currently being dumped
>>                                (killing the process may lead to a corrupted core)
>> - THP_enabled		     process is allowed to use THP (returns 0 when
>> -			     PR_SET_THP_DISABLE is set on the process
>> + THP_enabled                 process is allowed to use THP (returns 0 when
>> +                             PR_SET_THP_DISABLE is set on the process to disable
>> +                             THP completely, not just partially)
>>    Threads                     number of threads
>>    SigQ                        number of signals queued/max. number for queue
>>    SigPnd                      bitmap of pending signals for the thread
>> diff --git a/fs/proc/array.c b/fs/proc/array.c
>> index d6a0369caa931..c4f91a784104f 100644
>> --- a/fs/proc/array.c
>> +++ b/fs/proc/array.c
>> @@ -422,7 +422,7 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
>>   	bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
>>   
>>   	if (thp_enabled)
>> -		thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
>> +		thp_enabled = !test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
>>   	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
>>   }
>>   
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index e0a27f80f390d..c4127104d9bc3 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -323,16 +323,26 @@ struct thpsize {
>>   	(transparent_hugepage_flags &					\
>>   	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
>>   
>> +/*
>> + * Check whether THPs are explicitly disabled through madvise or prctl, or some
>> + * architectures may disable THP for some mappings, for example, s390 kvm.
>> + */
>>   static inline bool vma_thp_disabled(struct vm_area_struct *vma,
>>   		vm_flags_t vm_flags)
>>   {
>> +	/* Are THPs disabled for this VMA? */
>> +	if (vm_flags & VM_NOHUGEPAGE)
>> +		return true;
>> +	/* Are THPs disabled for all VMAs in the whole process? */
>> +	if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags))
>> +		return true;
>>   	/*
>> -	 * Explicitly disabled through madvise or prctl, or some
>> -	 * architectures may disable THP for some mappings, for
>> -	 * example, s390 kvm.
>> +	 * Are THPs disabled only for VMAs where we didn't get an explicit
>> +	 * advise to use them?
>>   	 */
>> -	return (vm_flags & VM_NOHUGEPAGE) ||
>> -	       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
>> +	if (vm_flags & VM_HUGEPAGE)
>> +		return false;
>> +	return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
>>   }
> 
> 
> Hi David,

Hi!

> 
> Over here, with MMF_DISABLE_THP_EXCEPT_ADVISED, MADV_HUGEPAGE will succeed as vm_flags has
> VM_HUGEPAGE set, but MADV_COLLAPSE will fail to give a hugepage (as VM_HUGEPAGE is not set
> and MMF_DISABLE_THP_EXCEPT_ADVISED is set) which I feel might not be the right behaviour
> as MADV_COLLAPSE is "advise" and the prctl flag is PR_THP_DISABLE_EXCEPT_ADVISED?

THPs are disabled for these regions, so it's at least consistent with 
the "disable all", but ...

> 
> This will be checked in multiple places in madvise_collapse: thp_vma_allowable_order,
> hugepage_vma_revalidate which calls thp_vma_allowable_order and hpage_collapse_scan_pmd
> which also ends up calling hugepage_vma_revalidate.
 > > A hacky way would be to save and overwrite vma->vm_flags with 
VM_HUGEPAGE at the start of madvise_collapse
> if VM_NOHUGEPAGE is not set, and reset vma->vm_flags to its original value at the end of madvise_collapse
> (Not something I am recommending, just throwing it out there).

Gah.

> 
> Another possibility is to pass the fact that you are in madvise_collapse to these functions
> as an argument, this might look ugly, although maybe not as ugly as hugepage_vma_revalidate
> already has collapse control arg, so just need to take care of thp_vma_allowable_orders.

Likely this.

> 
> Any preference or better suggestions?

What you are asking for is not MMF_DISABLE_THP_EXCEPT_ADVISED as I 
planned it, but MMF_DISABLE_THP_EXCEPT_ADVISED_OR_MADV_COLLAPSE.

Now, one could consider MADV_COLLAPSE an "advise". (I am not opposed to 
that change)

Indeed, the right way might be telling vma_thp_disabled() whether we are 
in collapse.

Can you try implementing that on top of my patch to see how it looks?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
  2025-07-24 19:07   ` David Hildenbrand
@ 2025-07-24 22:27     ` Usama Arif
  2025-07-25 13:08       ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Usama Arif @ 2025-07-24 22:27 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, linux-fsdevel, linux-doc, Jonathan Corbet,
	Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	SeongJae Park, Jann Horn, Yafang Shao, Matthew Wilcox,
	Johannes Weiner


> Hi!
> 
>>
>> Over here, with MMF_DISABLE_THP_EXCEPT_ADVISED, MADV_HUGEPAGE will succeed as vm_flags has
>> VM_HUGEPAGE set, but MADV_COLLAPSE will fail to give a hugepage (as VM_HUGEPAGE is not set
>> and MMF_DISABLE_THP_EXCEPT_ADVISED is set) which I feel might not be the right behaviour
>> as MADV_COLLAPSE is "advise" and the prctl flag is PR_THP_DISABLE_EXCEPT_ADVISED?
> 
> THPs are disabled for these regions, so it's at least consistent with the "disable all", but ...
> 
>>
>> This will be checked in multiple places in madvise_collapse: thp_vma_allowable_order,
>> hugepage_vma_revalidate which calls thp_vma_allowable_order and hpage_collapse_scan_pmd
>> which also ends up calling hugepage_vma_revalidate.
>> > A hacky way would be to save and overwrite vma->vm_flags with VM_HUGEPAGE at the start of madvise_collapse
>> if VM_NOHUGEPAGE is not set, and reset vma->vm_flags to its original value at the end of madvise_collapse
>> (Not something I am recommending, just throwing it out there).
> 
> Gah.
> 
>>
>> Another possibility is to pass the fact that you are in madvise_collapse to these functions
>> as an argument, this might look ugly, although maybe not as ugly as hugepage_vma_revalidate
>> already has collapse control arg, so just need to take care of thp_vma_allowable_orders.
> 
> Likely this.
> 
>>
>> Any preference or better suggestions?
> 
> What you are asking for is not MMF_DISABLE_THP_EXCEPT_ADVISED as I planned it, but MMF_DISABLE_THP_EXCEPT_ADVISED_OR_MADV_COLLAPSE.
> 
> Now, one could consider MADV_COLLAPSE an "advise". (I am not opposed to that change)
> 

lol yeah I always think of MADV_COLLAPSE as an extreme version of MADV_HUGE (more of a demand
than an advice :)), eventhough its not persistant.
Which is why I think might be unexpected if MADV_HUGE gives hugepages but MADV_COLLAPSE doesn't
(But could just be my opinion).

> Indeed, the right way might be telling vma_thp_disabled() whether we are in collapse.
> 
> Can you try implementing that on top of my patch to see how it looks?
> 

My reasoning is that a process that is running with system policy always but with
PR_THP_DISABLE_EXCEPT_ADVISED gets THPs in exactly the same behaviour as a process that is running
with system policy madvise. This will help us achieve (3) that you mentioned in the
commit message:
(3) Switch from THP=madvise to THP=always, but keep the old behavior
     (THP only when advised) for selected workloads.


I have written quite a few selftests now for prctl SET_THP_DISABLE, both with and without
PR_THP_DISABLE_EXCEPT_ADVISED set incorporating your feedback on it. I have all of them passing
with the below diff. The diff is slightly ugly, but very simple and hopefully acceptable. If it
looks good, I can send a series with everything. Probably make the below diff as a separate patch
on top of this patch as its mostly adding an extra arg to functions and would keep the review easier?
I can squash it with this patch as well if thats better.

Thanks!


diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3d6d8a9f13fc..bb5f1dedbd2c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1294,7 +1294,7 @@ static int show_smap(struct seq_file *m, void *v)
 
 	seq_printf(m, "THPeligible:    %8u\n",
 		   !!thp_vma_allowable_orders(vma, vma->vm_flags,
-			   TVA_SMAPS | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL));
+			   TVA_SMAPS | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL, 0));
 
 	if (arch_pkeys_enabled())
 		seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 71db243a002e..82066721b161 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -98,8 +98,8 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 #define TVA_IN_PF		(1 << 1)	/* Page fault handler */
 #define TVA_ENFORCE_SYSFS	(1 << 2)	/* Obey sysfs configuration */
 
-#define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
-	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
+#define thp_vma_allowable_order(vma, vm_flags, tva_flags, order, in_collapse) \
+	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order), in_collapse))
 
 #define split_folio(f) split_folio_to_list(f, NULL)
 
@@ -265,7 +265,8 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 vm_flags_t vm_flags,
 					 unsigned long tva_flags,
-					 unsigned long orders);
+					 unsigned long orders,
+					 bool in_collapse);
 
 /**
  * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
@@ -273,6 +274,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
  * @vm_flags: use these vm_flags instead of vma->vm_flags
  * @tva_flags: Which TVA flags to honour
  * @orders: bitfield of all orders to consider
+ * @in_collapse: whether we are being called from MADV_COLLAPSE
  *
  * Calculates the intersection of the requested hugepage orders and the allowed
  * hugepage orders for the provided vma. Permitted orders are encoded as a set
@@ -286,7 +288,8 @@ static inline
 unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 				       vm_flags_t vm_flags,
 				       unsigned long tva_flags,
-				       unsigned long orders)
+				       unsigned long orders,
+				       bool in_collapse)
 {
 	/* Optimization to check if required orders are enabled early. */
 	if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
@@ -303,7 +306,7 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 			return 0;
 	}
 
-	return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
+	return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders, in_collapse);
 }
 
 struct thpsize {
@@ -323,7 +326,7 @@ struct thpsize {
  * through madvise or prctl.
  */
 static inline bool vma_thp_disabled(struct vm_area_struct *vma,
-		vm_flags_t vm_flags)
+		vm_flags_t vm_flags, bool in_collapse)
 {
 	/* Are THPs disabled for this VMA? */
 	if (vm_flags & VM_NOHUGEPAGE)
@@ -331,6 +334,9 @@ static inline bool vma_thp_disabled(struct vm_area_struct *vma,
 	/* Are THPs disabled for all VMAs in the whole process? */
 	if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags))
 		return true;
+	/* Are we being called from madvise_collapse? */
+	if (in_collapse)
+		return false;
 	/*
 	 * Are THPs disabled only for VMAs where we didn't get an explicit
 	 * advise to use them?
@@ -537,7 +543,8 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
 static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 					vm_flags_t vm_flags,
 					unsigned long tva_flags,
-					unsigned long orders)
+					unsigned long orders,
+					bool in_collapse)
 {
 	return 0;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2b4ea5a2ce7d..ecf48a922530 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -100,7 +100,8 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 vm_flags_t vm_flags,
 					 unsigned long tva_flags,
-					 unsigned long orders)
+					 unsigned long orders,
+					 bool in_collapse)
 {
 	bool smaps = tva_flags & TVA_SMAPS;
 	bool in_pf = tva_flags & TVA_IN_PF;
@@ -122,7 +123,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 	if (!vma->vm_mm)		/* vdso */
 		return 0;
 
-	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags))
+	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags, in_collapse))
 		return 0;
 
 	/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2c9008246785..ba707ce5a00a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -475,7 +475,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
 	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
 	    hugepage_pmd_enabled()) {
 		if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
-					    PMD_ORDER))
+					    PMD_ORDER, 0))
 			__khugepaged_enter(vma->vm_mm);
 	}
 }
@@ -932,7 +932,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 
 	if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
 		return SCAN_ADDRESS_RANGE;
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER, 1))
 		return SCAN_VMA_CHECK;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
@@ -1534,7 +1534,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	 * and map it by a PMD, regardless of sysfs THP settings. As such, let's
 	 * analogously elide sysfs THP settings here.
 	 */
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER, 1))
 		return SCAN_VMA_CHECK;
 
 	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
@@ -2432,7 +2432,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			break;
 		}
 		if (!thp_vma_allowable_order(vma, vma->vm_flags,
-					TVA_ENFORCE_SYSFS, PMD_ORDER)) {
+					TVA_ENFORCE_SYSFS, PMD_ORDER, 0)) {
 skip:
 			progress++;
 			continue;
@@ -2766,7 +2766,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	BUG_ON(vma->vm_start > start);
 	BUG_ON(vma->vm_end < end);
 
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER, 1))
 		return -EINVAL;
 
 	cc = kmalloc(sizeof(*cc), GFP_KERNEL);
diff --git a/mm/memory.c b/mm/memory.c
index 92fd18a5d8d1..da5ab2dc1797 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4370,7 +4370,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	 * and suitable for swapping THP.
 	 */
 	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
-			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1, 0);
 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
 	orders = thp_swap_suitable_orders(swp_offset(entry),
 					  vmf->address, orders);
@@ -4918,7 +4918,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 	 * the faulting address and still be fully contained in the vma.
 	 */
 	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
-			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1, 0);
 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
 
 	if (!orders)
@@ -5188,7 +5188,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
 	 * PMD mappings, but PTE-mapped THP are fine. So let's simply refuse any
 	 * PMD mappings if THPs are disabled.
 	 */
-	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags))
+	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags, 0))
 		return ret;
 
 	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
@@ -6109,7 +6109,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 retry_pud:
 	if (pud_none(*vmf.pud) &&
 	    thp_vma_allowable_order(vma, vm_flags,
-				TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) {
+				TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER, 0)) {
 		ret = create_huge_pud(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
@@ -6144,7 +6144,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 
 	if (pmd_none(*vmf.pmd) &&
 	    thp_vma_allowable_order(vma, vm_flags,
-				TVA_IN_PF | TVA_ENFORCE_SYSFS, PMD_ORDER)) {
+				TVA_IN_PF | TVA_ENFORCE_SYSFS, PMD_ORDER, 0)) {
 		ret = create_huge_pmd(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
diff --git a/mm/shmem.c b/mm/shmem.c
index e6cdfda08aed..1960cf87b077 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1816,7 +1816,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
 	vm_flags_t vm_flags = vma ? vma->vm_flags : 0;
 	unsigned int global_orders;
 
-	if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags)))
+	if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags, 0)))
 		return 0;
 
 	global_orders = shmem_huge_global_enabled(inode, index, write_end,






^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
  2025-07-24 22:27     ` Usama Arif
@ 2025-07-25 13:08       ` David Hildenbrand
  2025-07-25 16:26         ` Usama Arif
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2025-07-25 13:08 UTC (permalink / raw)
  To: Usama Arif, linux-kernel
  Cc: linux-mm, linux-fsdevel, linux-doc, Jonathan Corbet,
	Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	SeongJae Park, Jann Horn, Yafang Shao, Matthew Wilcox,
	Johannes Weiner

On 25.07.25 00:27, Usama Arif wrote:
> 
>> Hi!
>>
>>>
>>> Over here, with MMF_DISABLE_THP_EXCEPT_ADVISED, MADV_HUGEPAGE will succeed as vm_flags has
>>> VM_HUGEPAGE set, but MADV_COLLAPSE will fail to give a hugepage (as VM_HUGEPAGE is not set
>>> and MMF_DISABLE_THP_EXCEPT_ADVISED is set) which I feel might not be the right behaviour
>>> as MADV_COLLAPSE is "advise" and the prctl flag is PR_THP_DISABLE_EXCEPT_ADVISED?
>>
>> THPs are disabled for these regions, so it's at least consistent with the "disable all", but ...
>>
>>>
>>> This will be checked in multiple places in madvise_collapse: thp_vma_allowable_order,
>>> hugepage_vma_revalidate which calls thp_vma_allowable_order and hpage_collapse_scan_pmd
>>> which also ends up calling hugepage_vma_revalidate.
>>>> A hacky way would be to save and overwrite vma->vm_flags with VM_HUGEPAGE at the start of madvise_collapse
>>> if VM_NOHUGEPAGE is not set, and reset vma->vm_flags to its original value at the end of madvise_collapse
>>> (Not something I am recommending, just throwing it out there).
>>
>> Gah.
>>
>>>
>>> Another possibility is to pass the fact that you are in madvise_collapse to these functions
>>> as an argument, this might look ugly, although maybe not as ugly as hugepage_vma_revalidate
>>> already has collapse control arg, so just need to take care of thp_vma_allowable_orders.
>>
>> Likely this.
>>
>>>
>>> Any preference or better suggestions?
>>
>> What you are asking for is not MMF_DISABLE_THP_EXCEPT_ADVISED as I planned it, but MMF_DISABLE_THP_EXCEPT_ADVISED_OR_MADV_COLLAPSE.
>>
>> Now, one could consider MADV_COLLAPSE an "advise". (I am not opposed to that change)
>>
> 
> lol yeah I always think of MADV_COLLAPSE as an extreme version of MADV_HUGE (more of a demand
> than an advice :)), eventhough its not persistant.
> Which is why I think might be unexpected if MADV_HUGE gives hugepages but MADV_COLLAPSE doesn't
> (But could just be my opinion).
> 
>> Indeed, the right way might be telling vma_thp_disabled() whether we are in collapse.
>>
>> Can you try implementing that on top of my patch to see how it looks?
>>
> 
> My reasoning is that a process that is running with system policy always but with
> PR_THP_DISABLE_EXCEPT_ADVISED gets THPs in exactly the same behaviour as a process that is running
> with system policy madvise. This will help us achieve (3) that you mentioned in the
> commit message:
> (3) Switch from THP=madvise to THP=always, but keep the old behavior
>       (THP only when advised) for selected workloads.
> 
> 
> I have written quite a few selftests now for prctl SET_THP_DISABLE, both with and without
> PR_THP_DISABLE_EXCEPT_ADVISED set incorporating your feedback on it. I have all of them passing
> with the below diff. The diff is slightly ugly, but very simple and hopefully acceptable. If it
> looks good, I can send a series with everything. Probably make the below diff as a separate patch
> on top of this patch as its mostly adding an extra arg to functions and would keep the review easier?

Yes, we should do it as a separate patch, makes our life easier, because 
that requires more work.

We require a cleanup first, the boolean parameter for 
__thp_vma_allowable_orders() is no good.

I just pushed something untested to my branch (slightly adjusted patch#1 
+ 2 more patches), can you have a look at that? (untested ... :) )

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
  2025-07-25 13:08       ` David Hildenbrand
@ 2025-07-25 16:26         ` Usama Arif
  0 siblings, 0 replies; 18+ messages in thread
From: Usama Arif @ 2025-07-25 16:26 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, linux-fsdevel, linux-doc, Jonathan Corbet,
	Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	SeongJae Park, Jann Horn, Yafang Shao, Matthew Wilcox,
	Johannes Weiner



On 25/07/2025 14:08, David Hildenbrand wrote:
> On 25.07.25 00:27, Usama Arif wrote:
>>
>>> Hi!
>>>
>>>>
>>>> Over here, with MMF_DISABLE_THP_EXCEPT_ADVISED, MADV_HUGEPAGE will succeed as vm_flags has
>>>> VM_HUGEPAGE set, but MADV_COLLAPSE will fail to give a hugepage (as VM_HUGEPAGE is not set
>>>> and MMF_DISABLE_THP_EXCEPT_ADVISED is set) which I feel might not be the right behaviour
>>>> as MADV_COLLAPSE is "advise" and the prctl flag is PR_THP_DISABLE_EXCEPT_ADVISED?
>>>
>>> THPs are disabled for these regions, so it's at least consistent with the "disable all", but ...
>>>
>>>>
>>>> This will be checked in multiple places in madvise_collapse: thp_vma_allowable_order,
>>>> hugepage_vma_revalidate which calls thp_vma_allowable_order and hpage_collapse_scan_pmd
>>>> which also ends up calling hugepage_vma_revalidate.
>>>>> A hacky way would be to save and overwrite vma->vm_flags with VM_HUGEPAGE at the start of madvise_collapse
>>>> if VM_NOHUGEPAGE is not set, and reset vma->vm_flags to its original value at the end of madvise_collapse
>>>> (Not something I am recommending, just throwing it out there).
>>>
>>> Gah.
>>>
>>>>
>>>> Another possibility is to pass the fact that you are in madvise_collapse to these functions
>>>> as an argument, this might look ugly, although maybe not as ugly as hugepage_vma_revalidate
>>>> already has collapse control arg, so just need to take care of thp_vma_allowable_orders.
>>>
>>> Likely this.
>>>
>>>>
>>>> Any preference or better suggestions?
>>>
>>> What you are asking for is not MMF_DISABLE_THP_EXCEPT_ADVISED as I planned it, but MMF_DISABLE_THP_EXCEPT_ADVISED_OR_MADV_COLLAPSE.
>>>
>>> Now, one could consider MADV_COLLAPSE an "advise". (I am not opposed to that change)
>>>
>>
>> lol yeah I always think of MADV_COLLAPSE as an extreme version of MADV_HUGE (more of a demand
>> than an advice :)), eventhough its not persistant.
>> Which is why I think might be unexpected if MADV_HUGE gives hugepages but MADV_COLLAPSE doesn't
>> (But could just be my opinion).
>>
>>> Indeed, the right way might be telling vma_thp_disabled() whether we are in collapse.
>>>
>>> Can you try implementing that on top of my patch to see how it looks?
>>>
>>
>> My reasoning is that a process that is running with system policy always but with
>> PR_THP_DISABLE_EXCEPT_ADVISED gets THPs in exactly the same behaviour as a process that is running
>> with system policy madvise. This will help us achieve (3) that you mentioned in the
>> commit message:
>> (3) Switch from THP=madvise to THP=always, but keep the old behavior
>>       (THP only when advised) for selected workloads.
>>
>>
>> I have written quite a few selftests now for prctl SET_THP_DISABLE, both with and without
>> PR_THP_DISABLE_EXCEPT_ADVISED set incorporating your feedback on it. I have all of them passing
>> with the below diff. The diff is slightly ugly, but very simple and hopefully acceptable. If it
>> looks good, I can send a series with everything. Probably make the below diff as a separate patch
>> on top of this patch as its mostly adding an extra arg to functions and would keep the review easier?
> 
> Yes, we should do it as a separate patch, makes our life easier, because that requires more work.
> 
> We require a cleanup first, the boolean parameter for __thp_vma_allowable_orders() is no good.
> 
> I just pushed something untested to my branch (slightly adjusted patch#1 + 2 more patches), can you have a look at that? (untested ... :) )
> 

Thanks for this!
I tested it and its good, have sent it for review.



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-07-25 16:26 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-21  9:09 [PATCH POC] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE David Hildenbrand
2025-07-21 10:10 ` David Hildenbrand
2025-07-21 11:28 ` Lorenzo Stoakes
2025-07-21 11:45   ` David Hildenbrand
2025-07-21 13:15     ` Lorenzo Stoakes
2025-07-21 14:32 ` Usama Arif
2025-07-21 14:39   ` David Hildenbrand
2025-07-21 17:27 ` Usama Arif
2025-07-21 19:35   ` SeongJae Park
2025-07-22 10:23   ` David Hildenbrand
2025-07-22 10:27     ` Lorenzo Stoakes
2025-07-23 17:07     ` Usama Arif
2025-07-23 18:02       ` David Hildenbrand
2025-07-24 18:57 ` Usama Arif
2025-07-24 19:07   ` David Hildenbrand
2025-07-24 22:27     ` Usama Arif
2025-07-25 13:08       ` David Hildenbrand
2025-07-25 16:26         ` Usama Arif

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).