[PATCH 0/1] prctl: allow overriding system THP policy to always

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/1] prctl: allow overriding system THP policy to always
@ 2025-05-07 14:00 Usama Arif
  2025-05-07 14:00 ` [PATCH 1/1] prctl: allow overriding system THP policy to always per process Usama Arif
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Usama Arif @ 2025-05-07 14:00 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: hannes, shakeel.butt, riel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team,
	Usama Arif

Allowing override of global THP policy per process allows workloads
that have shown to benefit from hugepages to do so, without regressing
workloads that wouldn't benefit. This will allow such types of
workloads to be run/stacked on the same machine.

It also helps in rolling out hugepages in hyperscaler configurations
for workloads that benefit from them, where a single THP policy is
likely to be used across the entire fleet, and prctl will help override it.

An advantage of doing it via prctl vs creating a cgroup specific
option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
that this will work even when there are no cgroups present, and my
understanding is there is a strong preference of cgroups controls being
hierarchical which usually means them having a numerical value.

The output and code of test program is below:

[root@vm4 vmuser]# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
[root@vm4 vmuser]# echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
[root@vm4 vmuser]# ./a.out
Default THP setting:
THP is not set to 'always'.
PR_SET_THP_ALWAYS = 1
THP is set to 'always'.
PR_SET_THP_ALWAYS = 0
THP is not set to 'always'.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/prctl.h>

#define PR_SET_THP_ALWAYS 78
#define SIZE 12 * (2 * 1024 * 1024) // 24 MB

void check_smaps(void) {
    FILE *file = fopen("/proc/self/smaps", "r");
    if (!file) {
        perror("fopen");
        return;
    }

    char line[256];
    int is_hugepage = 0;
    while (fgets(line, sizeof(line), file)) {
        // if (strstr(line, "AnonHugePages:"))
        //     printf("%s\n", line);
        if (strstr(line, "AnonHugePages:") && strstr(line, "24576 kB"))
{
            // printf("%s\n", line);
            is_hugepage = 1;
            break;
        }
    }
    fclose(file);
    if (is_hugepage) {
        printf("THP is set to 'always'.\n");
    } else {
        printf("THP is not set to 'always'.\n");
    }
}

void test_mmap_thp(void) {
    char *buffer = (char *)mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buffer == MAP_FAILED) {
        perror("mmap");
        return;
    }
    // Touch the memory to ensure it's allocated
    memset(buffer, 0, SIZE);
    check_smaps();
    munmap(buffer, SIZE);
}

int main() {
    printf("Default THP setting: \n");
    test_mmap_thp();
    printf("PR_SET_THP_ALWAYS = 1 \n");
    prctl(PR_SET_THP_ALWAYS, 1, NULL, NULL, NULL);
    test_mmap_thp();
    printf("PR_SET_THP_ALWAYS = 0 \n");
    prctl(PR_SET_THP_ALWAYS, 0, NULL, NULL, NULL);
    test_mmap_thp();

    return 0;
}

Usama Arif (1):
  prctl: allow overriding system THP policy to always per process

 include/linux/huge_mm.h                          |  3 ++-
 include/linux/mm_types.h                         |  7 ++-----
 include/uapi/linux/prctl.h                       |  3 +++
 kernel/sys.c                                     | 16 ++++++++++++++++
 tools/include/uapi/linux/prctl.h                 |  3 +++
 .../perf/trace/beauty/include/uapi/linux/prctl.h |  3 +++
 6 files changed, 29 insertions(+), 6 deletions(-)

-- 
2.47.1

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/1] prctl: allow overriding system THP policy to always per process
  2025-05-07 14:00 [PATCH 0/1] prctl: allow overriding system THP policy to always Usama Arif
@ 2025-05-07 14:00 ` Usama Arif
  2025-05-07 15:02   ` Usama Arif
  2025-05-07 20:14   ` Zi Yan
  2025-05-07 14:57 ` [PATCH 0/1] prctl: allow overriding system THP policy to always Zi Yan
  2025-05-08 11:06 ` David Hildenbrand
  2 siblings, 2 replies; 29+ messages in thread
From: Usama Arif @ 2025-05-07 14:00 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: hannes, shakeel.butt, riel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team,
	Usama Arif

Allowing override of global THP policy per process allows workloads
that have shown to benefit from hugepages to do so, without regressing
workloads that wouldn't benefit. This will allow such types of workloads
to be run/stacked on the same machine.

It also helps in rolling out hugepages in hyperscaler configurations
for workloads that benefit from them, where a single THP policy is likely
to be used across the entire fleet, and prctl will help override it.

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 include/linux/huge_mm.h                          |  3 ++-
 include/linux/mm_types.h                         |  7 ++-----
 include/uapi/linux/prctl.h                       |  3 +++
 kernel/sys.c                                     | 16 ++++++++++++++++
 tools/include/uapi/linux/prctl.h                 |  3 +++
 .../perf/trace/beauty/include/uapi/linux/prctl.h |  3 +++
 6 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..0587dc4b8e2d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -293,7 +293,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 		if (vm_flags & VM_HUGEPAGE)
 			mask |= READ_ONCE(huge_anon_orders_madvise);
 		if (hugepage_global_always() ||
-		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
+		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()) ||
+		    test_bit(MMF_THP_ALWAYS, &vma->vm_mm->flags))
 			mask |= READ_ONCE(huge_anon_orders_inherit);
 
 		orders &= mask;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e76bade9ebb1..9bcd72b2b191 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1704,11 +1704,8 @@ enum {
 #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
 #define MMF_VM_HUGEPAGE		17	/* set when mm is available for khugepaged */
 
-/*
- * This one-shot flag is dropped due to necessity of changing exe once again
- * on NFS restore
- */
-//#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
+/* override inherited page sizes to always for the entire process */
+ #define MMF_THP_ALWAYS	18
 
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 15c18ef4eb11..22c526681562 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -364,4 +364,7 @@ struct prctl_mm_map {
 # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
 # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
 
+#define PR_SET_THP_ALWAYS	78
+#define PR_GET_THP_ALWAYS	79
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index c434968e9f5d..ee56b059ff1f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2658,6 +2658,22 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
 		mmap_write_unlock(me->mm);
 		break;
+	case PR_GET_THP_ALWAYS:
+		if (arg2 || arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = !!test_bit(MMF_THP_ALWAYS, &me->mm->flags);
+		break;
+	case PR_SET_THP_ALWAYS:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		if (mmap_write_lock_killable(me->mm))
+			return -EINTR;
+		if (arg2)
+			set_bit(MMF_THP_ALWAYS, &me->mm->flags);
+		else
+			clear_bit(MMF_THP_ALWAYS, &me->mm->flags);
+		mmap_write_unlock(me->mm);
+		break;
 	case PR_MPX_ENABLE_MANAGEMENT:
 	case PR_MPX_DISABLE_MANAGEMENT:
 		/* No longer implemented: */
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 35791791a879..f5f6cff42b3f 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -328,4 +328,7 @@ struct prctl_mm_map {
 # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC	0x10 /* Clear the aspect on exec */
 # define PR_PPC_DEXCR_CTRL_MASK		0x1f
 
+#define PR_GET_THP_ALWAYS	78
+#define PR_SET_THP_ALWAYS	79
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
index 15c18ef4eb11..680996d56faf 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
@@ -364,4 +364,7 @@ struct prctl_mm_map {
 # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
 # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
 
+#define PR_GET_THP_ALWAYS	78
+#define PR_SET_THP_ALWAYS	79
+
 #endif /* _LINUX_PRCTL_H */
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-07 14:00 [PATCH 0/1] prctl: allow overriding system THP policy to always Usama Arif
  2025-05-07 14:00 ` [PATCH 1/1] prctl: allow overriding system THP policy to always per process Usama Arif
@ 2025-05-07 14:57 ` Zi Yan
  2025-05-07 15:12   ` Usama Arif
  2025-05-08 11:06 ` David Hildenbrand
  2 siblings, 1 reply; 29+ messages in thread
From: Zi Yan @ 2025-05-07 14:57 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	linux-kernel, kernel-team, Yafang Shao

+Yafang, who is also looking at changing THP config at cgroup/container level.

On 7 May 2025, at 10:00, Usama Arif wrote:

> Allowing override of global THP policy per process allows workloads
> that have shown to benefit from hugepages to do so, without regressing
> workloads that wouldn't benefit. This will allow such types of
> workloads to be run/stacked on the same machine.
>
> It also helps in rolling out hugepages in hyperscaler configurations
> for workloads that benefit from them, where a single THP policy is
> likely to be used across the entire fleet, and prctl will help override it.
>
> An advantage of doing it via prctl vs creating a cgroup specific
> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
> that this will work even when there are no cgroups present, and my
> understanding is there is a strong preference of cgroups controls being
> hierarchical which usually means them having a numerical value.

Hi Usama,

Do you mind giving an example on how to change THP policy for a set of
processes running in a container (under a cgroup)?

Yafang mentioned that the prctl approach would require restarting all running
services[1] and other inflexiblities, so he proposed to use BPF to change THP
policy[2]. I wonder if Yafang's issues also apply to your case and if you
have a solution to them.

Thanks.

[1] https://lore.kernel.org/linux-mm/CALOAHbCXMi2GaZdHJaNLXxGsJf-hkDTrztsQiceaBcJ8d8p3cA@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/20250429024139.34365-1-laoar.shao@gmail.com/
>
>
> The output and code of test program is below:
>
> [root@vm4 vmuser]# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
> [root@vm4 vmuser]# echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> [root@vm4 vmuser]# ./a.out
> Default THP setting:
> THP is not set to 'always'.
> PR_SET_THP_ALWAYS = 1
> THP is set to 'always'.
> PR_SET_THP_ALWAYS = 0
> THP is not set to 'always'.
>
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/mman.h>
> #include <sys/prctl.h>
>
> #define PR_SET_THP_ALWAYS 78
> #define SIZE 12 * (2 * 1024 * 1024) // 24 MB
>
> void check_smaps(void) {
>     FILE *file = fopen("/proc/self/smaps", "r");
>     if (!file) {
>         perror("fopen");
>         return;
>     }
>
>     char line[256];
>     int is_hugepage = 0;
>     while (fgets(line, sizeof(line), file)) {
>         // if (strstr(line, "AnonHugePages:"))
>         //     printf("%s\n", line);
>         if (strstr(line, "AnonHugePages:") && strstr(line, "24576 kB"))
> {
>             // printf("%s\n", line);
>             is_hugepage = 1;
>             break;
>         }
>     }
>     fclose(file);
>     if (is_hugepage) {
>         printf("THP is set to 'always'.\n");
>     } else {
>         printf("THP is not set to 'always'.\n");
>     }
> }
>
> void test_mmap_thp(void) {
>     char *buffer = (char *)mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>     if (buffer == MAP_FAILED) {
>         perror("mmap");
>         return;
>     }
>     // Touch the memory to ensure it's allocated
>     memset(buffer, 0, SIZE);
>     check_smaps();
>     munmap(buffer, SIZE);
> }
>
> int main() {
>     printf("Default THP setting: \n");
>     test_mmap_thp();
>     printf("PR_SET_THP_ALWAYS = 1 \n");
>     prctl(PR_SET_THP_ALWAYS, 1, NULL, NULL, NULL);
>     test_mmap_thp();
>     printf("PR_SET_THP_ALWAYS = 0 \n");
>     prctl(PR_SET_THP_ALWAYS, 0, NULL, NULL, NULL);
>     test_mmap_thp();
>
>     return 0;
> }
>
>
> Usama Arif (1):
>   prctl: allow overriding system THP policy to always per process
>
>  include/linux/huge_mm.h                          |  3 ++-
>  include/linux/mm_types.h                         |  7 ++-----
>  include/uapi/linux/prctl.h                       |  3 +++
>  kernel/sys.c                                     | 16 ++++++++++++++++
>  tools/include/uapi/linux/prctl.h                 |  3 +++
>  .../perf/trace/beauty/include/uapi/linux/prctl.h |  3 +++
>  6 files changed, 29 insertions(+), 6 deletions(-)
>
> -- 
> 2.47.1


--
Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/1] prctl: allow overriding system THP policy to always per process
  2025-05-07 14:00 ` [PATCH 1/1] prctl: allow overriding system THP policy to always per process Usama Arif
@ 2025-05-07 15:02   ` Usama Arif
  2025-05-07 20:14   ` Zi Yan
  1 sibling, 0 replies; 29+ messages in thread
From: Usama Arif @ 2025-05-07 15:02 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: hannes, shakeel.butt, riel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team



On 07/05/2025 15:00, Usama Arif wrote:
> Allowing override of global THP policy per process allows workloads
> that have shown to benefit from hugepages to do so, without regressing
> workloads that wouldn't benefit. This will allow such types of workloads
> to be run/stacked on the same machine.
> 
> It also helps in rolling out hugepages in hyperscaler configurations
> for workloads that benefit from them, where a single THP policy is likely
> to be used across the entire fleet, and prctl will help override it.
> 
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
>  include/linux/huge_mm.h                          |  3 ++-
>  include/linux/mm_types.h                         |  7 ++-----
>  include/uapi/linux/prctl.h                       |  3 +++
>  kernel/sys.c                                     | 16 ++++++++++++++++
>  tools/include/uapi/linux/prctl.h                 |  3 +++
>  .../perf/trace/beauty/include/uapi/linux/prctl.h |  3 +++
>  6 files changed, 29 insertions(+), 6 deletions(-)
> 


I forgot to include the change for non-anon VMA, which Johannes pointed out (Thanks!)

The patch will require the below fixlet on top of it:

From b719cd28ae78de699c0f801a4283449f6ac767ad Mon Sep 17 00:00:00 2001
From: Usama Arif <usamaarif642@gmail.com>
Date: Wed, 7 May 2025 15:58:39 +0100
Subject: [PATCH] prctl: Allowing override of global THP policy per process

This is a fixlet for doing it for non-anon VMAs as well.

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 mm/huge_memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2780a12b25f0..c4bf8eae420c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -177,7 +177,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 		 */
 		if (enforce_sysfs &&
 		    (!hugepage_global_enabled() || (!(vm_flags & VM_HUGEPAGE) &&
-						    !hugepage_global_always())))
+						    !hugepage_global_always()) ||
+		     test_bit(MMF_THP_ALWAYS, &vma->vm_mm->flags)))
 			return 0;
 
 		/*
-- 
2.47.1






^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-07 14:57 ` [PATCH 0/1] prctl: allow overriding system THP policy to always Zi Yan
@ 2025-05-07 15:12   ` Usama Arif
  2025-05-07 15:57     ` Zi Yan
  0 siblings, 1 reply; 29+ messages in thread
From: Usama Arif @ 2025-05-07 15:12 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	linux-kernel, kernel-team, Yafang Shao



On 07/05/2025 15:57, Zi Yan wrote:
> +Yafang, who is also looking at changing THP config at cgroup/container level.
> 
> On 7 May 2025, at 10:00, Usama Arif wrote:
> 
>> Allowing override of global THP policy per process allows workloads
>> that have shown to benefit from hugepages to do so, without regressing
>> workloads that wouldn't benefit. This will allow such types of
>> workloads to be run/stacked on the same machine.
>>
>> It also helps in rolling out hugepages in hyperscaler configurations
>> for workloads that benefit from them, where a single THP policy is
>> likely to be used across the entire fleet, and prctl will help override it.
>>
>> An advantage of doing it via prctl vs creating a cgroup specific
>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
>> that this will work even when there are no cgroups present, and my
>> understanding is there is a strong preference of cgroups controls being
>> hierarchical which usually means them having a numerical value.
> 
> Hi Usama,
> 
> Do you mind giving an example on how to change THP policy for a set of
> processes running in a container (under a cgroup)?

Hi Zi,

In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
for processes in a cgroup is in the same way we enable KSM for the cgroup.
The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
in exec-invoke.
This is at the start of the process, but you would already know at the start of the process
whether you want THP=always for it or not.

[1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045

Thanks,
Usama

> 
> Yafang mentioned that the prctl approach would require restarting all running
> services[1] and other inflexiblities, so he proposed to use BPF to change THP
> policy[2]. I wonder if Yafang's issues also apply to your case and if you
> have a solution to them.
> 
> Thanks.
> 
> [1] https://lore.kernel.org/linux-mm/CALOAHbCXMi2GaZdHJaNLXxGsJf-hkDTrztsQiceaBcJ8d8p3cA@mail.gmail.com/
> [2] https://lore.kernel.org/linux-mm/20250429024139.34365-1-laoar.shao@gmail.com/
>>
>>
>> The output and code of test program is below:
>>
>> [root@vm4 vmuser]# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
>> [root@vm4 vmuser]# echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>> [root@vm4 vmuser]# ./a.out
>> Default THP setting:
>> THP is not set to 'always'.
>> PR_SET_THP_ALWAYS = 1
>> THP is set to 'always'.
>> PR_SET_THP_ALWAYS = 0
>> THP is not set to 'always'.
>>
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <string.h>
>> #include <unistd.h>
>> #include <sys/mman.h>
>> #include <sys/prctl.h>
>>
>> #define PR_SET_THP_ALWAYS 78
>> #define SIZE 12 * (2 * 1024 * 1024) // 24 MB
>>
>> void check_smaps(void) {
>>     FILE *file = fopen("/proc/self/smaps", "r");
>>     if (!file) {
>>         perror("fopen");
>>         return;
>>     }
>>
>>     char line[256];
>>     int is_hugepage = 0;
>>     while (fgets(line, sizeof(line), file)) {
>>         // if (strstr(line, "AnonHugePages:"))
>>         //     printf("%s\n", line);
>>         if (strstr(line, "AnonHugePages:") && strstr(line, "24576 kB"))
>> {
>>             // printf("%s\n", line);
>>             is_hugepage = 1;
>>             break;
>>         }
>>     }
>>     fclose(file);
>>     if (is_hugepage) {
>>         printf("THP is set to 'always'.\n");
>>     } else {
>>         printf("THP is not set to 'always'.\n");
>>     }
>> }
>>
>> void test_mmap_thp(void) {
>>     char *buffer = (char *)mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>>                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>     if (buffer == MAP_FAILED) {
>>         perror("mmap");
>>         return;
>>     }
>>     // Touch the memory to ensure it's allocated
>>     memset(buffer, 0, SIZE);
>>     check_smaps();
>>     munmap(buffer, SIZE);
>> }
>>
>> int main() {
>>     printf("Default THP setting: \n");
>>     test_mmap_thp();
>>     printf("PR_SET_THP_ALWAYS = 1 \n");
>>     prctl(PR_SET_THP_ALWAYS, 1, NULL, NULL, NULL);
>>     test_mmap_thp();
>>     printf("PR_SET_THP_ALWAYS = 0 \n");
>>     prctl(PR_SET_THP_ALWAYS, 0, NULL, NULL, NULL);
>>     test_mmap_thp();
>>
>>     return 0;
>> }
>>
>>
>> Usama Arif (1):
>>   prctl: allow overriding system THP policy to always per process
>>
>>  include/linux/huge_mm.h                          |  3 ++-
>>  include/linux/mm_types.h                         |  7 ++-----
>>  include/uapi/linux/prctl.h                       |  3 +++
>>  kernel/sys.c                                     | 16 ++++++++++++++++
>>  tools/include/uapi/linux/prctl.h                 |  3 +++
>>  .../perf/trace/beauty/include/uapi/linux/prctl.h |  3 +++
>>  6 files changed, 29 insertions(+), 6 deletions(-)
>>
>> -- 
>> 2.47.1
> 
> 
> --
> Best Regards,
> Yan, Zi


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-07 15:12   ` Usama Arif
@ 2025-05-07 15:57     ` Zi Yan
  2025-05-07 16:09       ` Usama Arif
  0 siblings, 1 reply; 29+ messages in thread
From: Zi Yan @ 2025-05-07 15:57 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	linux-kernel, kernel-team, Yafang Shao

On 7 May 2025, at 11:12, Usama Arif wrote:

> On 07/05/2025 15:57, Zi Yan wrote:
>> +Yafang, who is also looking at changing THP config at cgroup/container level.
>>
>> On 7 May 2025, at 10:00, Usama Arif wrote:
>>
>>> Allowing override of global THP policy per process allows workloads
>>> that have shown to benefit from hugepages to do so, without regressing
>>> workloads that wouldn't benefit. This will allow such types of
>>> workloads to be run/stacked on the same machine.
>>>
>>> It also helps in rolling out hugepages in hyperscaler configurations
>>> for workloads that benefit from them, where a single THP policy is
>>> likely to be used across the entire fleet, and prctl will help override it.
>>>
>>> An advantage of doing it via prctl vs creating a cgroup specific
>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
>>> that this will work even when there are no cgroups present, and my
>>> understanding is there is a strong preference of cgroups controls being
>>> hierarchical which usually means them having a numerical value.
>>
>> Hi Usama,
>>
>> Do you mind giving an example on how to change THP policy for a set of
>> processes running in a container (under a cgroup)?
>
> Hi Zi,
>
> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
> for processes in a cgroup is in the same way we enable KSM for the cgroup.
> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
> in exec-invoke.
> This is at the start of the process, but you would already know at the start of the process
> whether you want THP=always for it or not.
>
> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045

You also need to add a new systemd.directives, e.g., MemoryTHP, to
pass the THP enablement or disablement info from a systemd config file.
And if you find those processes do not benefit from using THPs,
you can just change the new "MemoryTHP" config and restart the processes.

Am I getting it? Thanks.

>>
>> Yafang mentioned that the prctl approach would require restarting all running
>> services[1] and other inflexiblities, so he proposed to use BPF to change THP
>> policy[2]. I wonder if Yafang's issues also apply to your case and if you
>> have a solution to them.
>>
>> Thanks.
>>
>> [1] https://lore.kernel.org/linux-mm/CALOAHbCXMi2GaZdHJaNLXxGsJf-hkDTrztsQiceaBcJ8d8p3cA@mail.gmail.com/
>> [2] https://lore.kernel.org/linux-mm/20250429024139.34365-1-laoar.shao@gmail.com/
>>>
>>>
>>> The output and code of test program is below:
>>>
>>> [root@vm4 vmuser]# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
>>> [root@vm4 vmuser]# echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>>> [root@vm4 vmuser]# ./a.out
>>> Default THP setting:
>>> THP is not set to 'always'.
>>> PR_SET_THP_ALWAYS = 1
>>> THP is set to 'always'.
>>> PR_SET_THP_ALWAYS = 0
>>> THP is not set to 'always'.
>>>
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <string.h>
>>> #include <unistd.h>
>>> #include <sys/mman.h>
>>> #include <sys/prctl.h>
>>>
>>> #define PR_SET_THP_ALWAYS 78
>>> #define SIZE 12 * (2 * 1024 * 1024) // 24 MB
>>>
>>> void check_smaps(void) {
>>>     FILE *file = fopen("/proc/self/smaps", "r");
>>>     if (!file) {
>>>         perror("fopen");
>>>         return;
>>>     }
>>>
>>>     char line[256];
>>>     int is_hugepage = 0;
>>>     while (fgets(line, sizeof(line), file)) {
>>>         // if (strstr(line, "AnonHugePages:"))
>>>         //     printf("%s\n", line);
>>>         if (strstr(line, "AnonHugePages:") && strstr(line, "24576 kB"))
>>> {
>>>             // printf("%s\n", line);
>>>             is_hugepage = 1;
>>>             break;
>>>         }
>>>     }
>>>     fclose(file);
>>>     if (is_hugepage) {
>>>         printf("THP is set to 'always'.\n");
>>>     } else {
>>>         printf("THP is not set to 'always'.\n");
>>>     }
>>> }
>>>
>>> void test_mmap_thp(void) {
>>>     char *buffer = (char *)mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>>>                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>     if (buffer == MAP_FAILED) {
>>>         perror("mmap");
>>>         return;
>>>     }
>>>     // Touch the memory to ensure it's allocated
>>>     memset(buffer, 0, SIZE);
>>>     check_smaps();
>>>     munmap(buffer, SIZE);
>>> }
>>>
>>> int main() {
>>>     printf("Default THP setting: \n");
>>>     test_mmap_thp();
>>>     printf("PR_SET_THP_ALWAYS = 1 \n");
>>>     prctl(PR_SET_THP_ALWAYS, 1, NULL, NULL, NULL);
>>>     test_mmap_thp();
>>>     printf("PR_SET_THP_ALWAYS = 0 \n");
>>>     prctl(PR_SET_THP_ALWAYS, 0, NULL, NULL, NULL);
>>>     test_mmap_thp();
>>>
>>>     return 0;
>>> }
>>>
>>>
>>> Usama Arif (1):
>>>   prctl: allow overriding system THP policy to always per process
>>>
>>>  include/linux/huge_mm.h                          |  3 ++-
>>>  include/linux/mm_types.h                         |  7 ++-----
>>>  include/uapi/linux/prctl.h                       |  3 +++
>>>  kernel/sys.c                                     | 16 ++++++++++++++++
>>>  tools/include/uapi/linux/prctl.h                 |  3 +++
>>>  .../perf/trace/beauty/include/uapi/linux/prctl.h |  3 +++
>>>  6 files changed, 29 insertions(+), 6 deletions(-)
>>>
>>> -- 
>>> 2.47.1
>>
>>
>> --
>> Best Regards,
>> Yan, Zi


--
Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-07 15:57     ` Zi Yan
@ 2025-05-07 16:09       ` Usama Arif
  2025-05-08  5:41         ` Yafang Shao
  0 siblings, 1 reply; 29+ messages in thread
From: Usama Arif @ 2025-05-07 16:09 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	linux-kernel, kernel-team, Yafang Shao



On 07/05/2025 16:57, Zi Yan wrote:
> On 7 May 2025, at 11:12, Usama Arif wrote:
> 
>> On 07/05/2025 15:57, Zi Yan wrote:
>>> +Yafang, who is also looking at changing THP config at cgroup/container level.
>>>
>>> On 7 May 2025, at 10:00, Usama Arif wrote:
>>>
>>>> Allowing override of global THP policy per process allows workloads
>>>> that have shown to benefit from hugepages to do so, without regressing
>>>> workloads that wouldn't benefit. This will allow such types of
>>>> workloads to be run/stacked on the same machine.
>>>>
>>>> It also helps in rolling out hugepages in hyperscaler configurations
>>>> for workloads that benefit from them, where a single THP policy is
>>>> likely to be used across the entire fleet, and prctl will help override it.
>>>>
>>>> An advantage of doing it via prctl vs creating a cgroup specific
>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
>>>> that this will work even when there are no cgroups present, and my
>>>> understanding is there is a strong preference of cgroups controls being
>>>> hierarchical which usually means them having a numerical value.
>>>
>>> Hi Usama,
>>>
>>> Do you mind giving an example on how to change THP policy for a set of
>>> processes running in a container (under a cgroup)?
>>
>> Hi Zi,
>>
>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
>> for processes in a cgroup is in the same way we enable KSM for the cgroup.
>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
>> in exec-invoke.
>> This is at the start of the process, but you would already know at the start of the process
>> whether you want THP=always for it or not.
>>
>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> 
> You also need to add a new systemd.directives, e.g., MemoryTHP, to
> pass the THP enablement or disablement info from a systemd config file.
> And if you find those processes do not benefit from using THPs,
> you can just change the new "MemoryTHP" config and restart the processes.
> 
> Am I getting it? Thanks.
> 

Yes, thats right. They would exactly the same as what we (Meta) do
for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set,
the ExecContext->memory_thp would be set similar to memory_ksm [2], and when
that is set, the prctl will be called at exec_invoke of the process [3].

The systemd changes should be quite simple to do.

[1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978 
[2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151
[3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
>>>
>>> Yafang mentioned that the prctl approach would require restarting all running
>>> services[1] and other inflexiblities, so he proposed to use BPF to change THP
>>> policy[2]. I wonder if Yafang's issues also apply to your case and if you
>>> have a solution to them.
>>>
>>> Thanks.
>>>
>>> [1] https://lore.kernel.org/linux-mm/CALOAHbCXMi2GaZdHJaNLXxGsJf-hkDTrztsQiceaBcJ8d8p3cA@mail.gmail.com/
>>> [2] https://lore.kernel.org/linux-mm/20250429024139.34365-1-laoar.shao@gmail.com/
>>>>
>>>>
>>>> The output and code of test program is below:
>>>>
>>>> [root@vm4 vmuser]# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
>>>> [root@vm4 vmuser]# echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>>>> [root@vm4 vmuser]# ./a.out
>>>> Default THP setting:
>>>> THP is not set to 'always'.
>>>> PR_SET_THP_ALWAYS = 1
>>>> THP is set to 'always'.
>>>> PR_SET_THP_ALWAYS = 0
>>>> THP is not set to 'always'.
>>>>
>>>>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <string.h>
>>>> #include <unistd.h>
>>>> #include <sys/mman.h>
>>>> #include <sys/prctl.h>
>>>>
>>>> #define PR_SET_THP_ALWAYS 78
>>>> #define SIZE 12 * (2 * 1024 * 1024) // 24 MB
>>>>
>>>> void check_smaps(void) {
>>>>     FILE *file = fopen("/proc/self/smaps", "r");
>>>>     if (!file) {
>>>>         perror("fopen");
>>>>         return;
>>>>     }
>>>>
>>>>     char line[256];
>>>>     int is_hugepage = 0;
>>>>     while (fgets(line, sizeof(line), file)) {
>>>>         // if (strstr(line, "AnonHugePages:"))
>>>>         //     printf("%s\n", line);
>>>>         if (strstr(line, "AnonHugePages:") && strstr(line, "24576 kB"))
>>>> {
>>>>             // printf("%s\n", line);
>>>>             is_hugepage = 1;
>>>>             break;
>>>>         }
>>>>     }
>>>>     fclose(file);
>>>>     if (is_hugepage) {
>>>>         printf("THP is set to 'always'.\n");
>>>>     } else {
>>>>         printf("THP is not set to 'always'.\n");
>>>>     }
>>>> }
>>>>
>>>> void test_mmap_thp(void) {
>>>>     char *buffer = (char *)mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>>>>                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>>     if (buffer == MAP_FAILED) {
>>>>         perror("mmap");
>>>>         return;
>>>>     }
>>>>     // Touch the memory to ensure it's allocated
>>>>     memset(buffer, 0, SIZE);
>>>>     check_smaps();
>>>>     munmap(buffer, SIZE);
>>>> }
>>>>
>>>> int main() {
>>>>     printf("Default THP setting: \n");
>>>>     test_mmap_thp();
>>>>     printf("PR_SET_THP_ALWAYS = 1 \n");
>>>>     prctl(PR_SET_THP_ALWAYS, 1, NULL, NULL, NULL);
>>>>     test_mmap_thp();
>>>>     printf("PR_SET_THP_ALWAYS = 0 \n");
>>>>     prctl(PR_SET_THP_ALWAYS, 0, NULL, NULL, NULL);
>>>>     test_mmap_thp();
>>>>
>>>>     return 0;
>>>> }
>>>>
>>>>
>>>> Usama Arif (1):
>>>>   prctl: allow overriding system THP policy to always per process
>>>>
>>>>  include/linux/huge_mm.h                          |  3 ++-
>>>>  include/linux/mm_types.h                         |  7 ++-----
>>>>  include/uapi/linux/prctl.h                       |  3 +++
>>>>  kernel/sys.c                                     | 16 ++++++++++++++++
>>>>  tools/include/uapi/linux/prctl.h                 |  3 +++
>>>>  .../perf/trace/beauty/include/uapi/linux/prctl.h |  3 +++
>>>>  6 files changed, 29 insertions(+), 6 deletions(-)
>>>>
>>>> -- 
>>>> 2.47.1
>>>
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
> 
> 
> --
> Best Regards,
> Yan, Zi


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/1] prctl: allow overriding system THP policy to always per process
  2025-05-07 14:00 ` [PATCH 1/1] prctl: allow overriding system THP policy to always per process Usama Arif
  2025-05-07 15:02   ` Usama Arif
@ 2025-05-07 20:14   ` Zi Yan
  2025-05-08 10:53     ` Usama Arif
  1 sibling, 1 reply; 29+ messages in thread
From: Zi Yan @ 2025-05-07 20:14 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	linux-kernel, kernel-team

On 7 May 2025, at 10:00, Usama Arif wrote:

> Allowing override of global THP policy per process allows workloads
> that have shown to benefit from hugepages to do so, without regressing
> workloads that wouldn't benefit. This will allow such types of workloads
> to be run/stacked on the same machine.
>
> It also helps in rolling out hugepages in hyperscaler configurations
> for workloads that benefit from them, where a single THP policy is likely
> to be used across the entire fleet, and prctl will help override it.
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
>  include/linux/huge_mm.h                          |  3 ++-
>  include/linux/mm_types.h                         |  7 ++-----
>  include/uapi/linux/prctl.h                       |  3 +++
>  kernel/sys.c                                     | 16 ++++++++++++++++
>  tools/include/uapi/linux/prctl.h                 |  3 +++
>  .../perf/trace/beauty/include/uapi/linux/prctl.h |  3 +++
>  6 files changed, 29 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..0587dc4b8e2d 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -293,7 +293,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>  		if (vm_flags & VM_HUGEPAGE)
>  			mask |= READ_ONCE(huge_anon_orders_madvise);
>  		if (hugepage_global_always() ||
> -		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
> +		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()) ||
> +		    test_bit(MMF_THP_ALWAYS, &vma->vm_mm->flags))
>  			mask |= READ_ONCE(huge_anon_orders_inherit);
>
>  		orders &= mask;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index e76bade9ebb1..9bcd72b2b191 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1704,11 +1704,8 @@ enum {
>  #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
>  #define MMF_VM_HUGEPAGE		17	/* set when mm is available for khugepaged */
>
> -/*
> - * This one-shot flag is dropped due to necessity of changing exe once again
> - * on NFS restore
> - */
> -//#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
> +/* override inherited page sizes to always for the entire process */
> + #define MMF_THP_ALWAYS	18

Could we have something like MMF_THP_POLICY_SET and another field
for "always"? Otherwise, how are we going to set MMF_THP_MADVISE if
we want it in the future?

>
>  #define MMF_HAS_UPROBES		19	/* has uprobes */
>  #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 15c18ef4eb11..22c526681562 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -364,4 +364,7 @@ struct prctl_mm_map {
>  # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
>  # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
>
> +#define PR_SET_THP_ALWAYS	78
> +#define PR_GET_THP_ALWAYS	79
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index c434968e9f5d..ee56b059ff1f 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2658,6 +2658,22 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
>  		mmap_write_unlock(me->mm);
>  		break;
> +	case PR_GET_THP_ALWAYS:
> +		if (arg2 || arg3 || arg4 || arg5)
> +			return -EINVAL;
> +		error = !!test_bit(MMF_THP_ALWAYS, &me->mm->flags);
> +		break;
> +	case PR_SET_THP_ALWAYS:
> +		if (arg3 || arg4 || arg5)
> +			return -EINVAL;
> +		if (mmap_write_lock_killable(me->mm))
> +			return -EINTR;
> +		if (arg2)
> +			set_bit(MMF_THP_ALWAYS, &me->mm->flags);
> +		else
> +			clear_bit(MMF_THP_ALWAYS, &me->mm->flags);
> +		mmap_write_unlock(me->mm);
> +		break;

prctl can take more than one arguments. Would it be better to add
PR_SET_THP_POLICY and PR_GET_THP_POLICY and specify PR_THP_POLICY_ALWAYS
in the second argument? So that in the future, if we want to add
more THP policies, we do not need to keep piling up PR_{GET,SET}_THP_*?

>  	case PR_MPX_ENABLE_MANAGEMENT:
>  	case PR_MPX_DISABLE_MANAGEMENT:
>  		/* No longer implemented: */
> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
> index 35791791a879..f5f6cff42b3f 100644
> --- a/tools/include/uapi/linux/prctl.h
> +++ b/tools/include/uapi/linux/prctl.h
> @@ -328,4 +328,7 @@ struct prctl_mm_map {
>  # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC	0x10 /* Clear the aspect on exec */
>  # define PR_PPC_DEXCR_CTRL_MASK		0x1f
>
> +#define PR_GET_THP_ALWAYS	78
> +#define PR_SET_THP_ALWAYS	79
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
> index 15c18ef4eb11..680996d56faf 100644
> --- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
> +++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
> @@ -364,4 +364,7 @@ struct prctl_mm_map {
>  # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
>  # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
>
> +#define PR_GET_THP_ALWAYS	78
> +#define PR_SET_THP_ALWAYS	79
> +
>  #endif /* _LINUX_PRCTL_H */
> -- 
> 2.47.1


--
Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-07 16:09       ` Usama Arif
@ 2025-05-08  5:41         ` Yafang Shao
  2025-05-08 16:04           ` Usama Arif
  0 siblings, 1 reply; 29+ messages in thread
From: Yafang Shao @ 2025-05-08  5:41 UTC (permalink / raw)
  To: Usama Arif
  Cc: Zi Yan, Andrew Morton, david, linux-mm, hannes, shakeel.butt,
	riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, linux-kernel, kernel-team

On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 07/05/2025 16:57, Zi Yan wrote:
> > On 7 May 2025, at 11:12, Usama Arif wrote:
> >
> >> On 07/05/2025 15:57, Zi Yan wrote:
> >>> +Yafang, who is also looking at changing THP config at cgroup/container level.

Thanks

> >>>
> >>> On 7 May 2025, at 10:00, Usama Arif wrote:
> >>>
> >>>> Allowing override of global THP policy per process allows workloads
> >>>> that have shown to benefit from hugepages to do so, without regressing
> >>>> workloads that wouldn't benefit. This will allow such types of
> >>>> workloads to be run/stacked on the same machine.
> >>>>
> >>>> It also helps in rolling out hugepages in hyperscaler configurations
> >>>> for workloads that benefit from them, where a single THP policy is
> >>>> likely to be used across the entire fleet, and prctl will help override it.
> >>>>
> >>>> An advantage of doing it via prctl vs creating a cgroup specific
> >>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
> >>>> that this will work even when there are no cgroups present, and my
> >>>> understanding is there is a strong preference of cgroups controls being
> >>>> hierarchical which usually means them having a numerical value.
> >>>
> >>> Hi Usama,
> >>>
> >>> Do you mind giving an example on how to change THP policy for a set of
> >>> processes running in a container (under a cgroup)?
> >>
> >> Hi Zi,
> >>
> >> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
> >> for processes in a cgroup is in the same way we enable KSM for the cgroup.
> >> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
> >> in exec-invoke.
> >> This is at the start of the process, but you would already know at the start of the process
> >> whether you want THP=always for it or not.
> >>
> >> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> >
> > You also need to add a new systemd.directives, e.g., MemoryTHP, to
> > pass the THP enablement or disablement info from a systemd config file.
> > And if you find those processes do not benefit from using THPs,
> > you can just change the new "MemoryTHP" config and restart the processes.
> >
> > Am I getting it? Thanks.
> >
>
> Yes, thats right. They would exactly the same as what we (Meta) do
> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set,
> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when
> that is set, the prctl will be called at exec_invoke of the process [3].
>
> The systemd changes should be quite simple to do.
>
> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978
> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151
> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045

This solution carries a risk: since prctl() does not require any
capabilities, the task itself could call it and override your memory
policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that
capability is typically enabled by default in containers, leaving them
still vulnerable.

This approach might work for Kubernetes/container environments, but it
would require substantial code changes to implement securely.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/1] prctl: allow overriding system THP policy to always per process
  2025-05-07 20:14   ` Zi Yan
@ 2025-05-08 10:53     ` Usama Arif
  2025-05-08 20:29       ` Zi Yan
  0 siblings, 1 reply; 29+ messages in thread
From: Usama Arif @ 2025-05-08 10:53 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	linux-kernel, kernel-team



On 07/05/2025 21:14, Zi Yan wrote:
> On 7 May 2025, at 10:00, Usama Arif wrote:
> 
>> Allowing override of global THP policy per process allows workloads
>> that have shown to benefit from hugepages to do so, without regressing
>> workloads that wouldn't benefit. This will allow such types of workloads
>> to be run/stacked on the same machine.
>>
>> It also helps in rolling out hugepages in hyperscaler configurations
>> for workloads that benefit from them, where a single THP policy is likely
>> to be used across the entire fleet, and prctl will help override it.
>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> ---
>>  include/linux/huge_mm.h                          |  3 ++-
>>  include/linux/mm_types.h                         |  7 ++-----
>>  include/uapi/linux/prctl.h                       |  3 +++
>>  kernel/sys.c                                     | 16 ++++++++++++++++
>>  tools/include/uapi/linux/prctl.h                 |  3 +++
>>  .../perf/trace/beauty/include/uapi/linux/prctl.h |  3 +++
>>  6 files changed, 29 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2f190c90192d..0587dc4b8e2d 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -293,7 +293,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>>  		if (vm_flags & VM_HUGEPAGE)
>>  			mask |= READ_ONCE(huge_anon_orders_madvise);
>>  		if (hugepage_global_always() ||
>> -		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
>> +		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()) ||
>> +		    test_bit(MMF_THP_ALWAYS, &vma->vm_mm->flags))
>>  			mask |= READ_ONCE(huge_anon_orders_inherit);
>>
>>  		orders &= mask;
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index e76bade9ebb1..9bcd72b2b191 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -1704,11 +1704,8 @@ enum {
>>  #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
>>  #define MMF_VM_HUGEPAGE		17	/* set when mm is available for khugepaged */
>>
>> -/*
>> - * This one-shot flag is dropped due to necessity of changing exe once again
>> - * on NFS restore
>> - */
>> -//#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
>> +/* override inherited page sizes to always for the entire process */
>> + #define MMF_THP_ALWAYS	18
> 
> Could we have something like MMF_THP_POLICY_SET and another field
> for "always"? Otherwise, how are we going to set MMF_THP_MADVISE if
> we want it in the future?
> 

So we actually need MMF_THP_MADVISE as well. If we have the entire fleet system
policy set to always, but if certain workloads regress with it, then we need to
prctl MADVISE them.

As you said, either we have:
- MMF_THP_MADVISE as another flag. The issue with this is we have run out
of bits in mm->flags for 32 bit machines. So either we introduce another member
in mm_struct (maybe mm_struct->flags2?), or we start using bits 32 and above of
flags field, limit this to 64 bit architectures and wrap the code with #ifdefs.
This is probably ugly, but if its ok for upstream, happy to do that.
- MMF_THP_POLICY_SET + another field to specify what the policy is, the issue
with this might be another field per mm_struct/process. Having an entire field
just for this might be wasteful, so I think I would prefer having just
mm_struct->flags2 and use up only 1 bit of it, and the rest can be used for
anything else in the future.

I think the flags2 approach is likely the best, but let me know what you think.

>>
>>  #define MMF_HAS_UPROBES		19	/* has uprobes */
>>  #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>> index 15c18ef4eb11..22c526681562 100644
>> --- a/include/uapi/linux/prctl.h
>> +++ b/include/uapi/linux/prctl.h
>> @@ -364,4 +364,7 @@ struct prctl_mm_map {
>>  # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
>>  # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
>>
>> +#define PR_SET_THP_ALWAYS	78
>> +#define PR_GET_THP_ALWAYS	79
>> +
>>  #endif /* _LINUX_PRCTL_H */
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index c434968e9f5d..ee56b059ff1f 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -2658,6 +2658,22 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>>  			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
>>  		mmap_write_unlock(me->mm);
>>  		break;
>> +	case PR_GET_THP_ALWAYS:
>> +		if (arg2 || arg3 || arg4 || arg5)
>> +			return -EINVAL;
>> +		error = !!test_bit(MMF_THP_ALWAYS, &me->mm->flags);
>> +		break;
>> +	case PR_SET_THP_ALWAYS:
>> +		if (arg3 || arg4 || arg5)
>> +			return -EINVAL;
>> +		if (mmap_write_lock_killable(me->mm))
>> +			return -EINTR;
>> +		if (arg2)
>> +			set_bit(MMF_THP_ALWAYS, &me->mm->flags);
>> +		else
>> +			clear_bit(MMF_THP_ALWAYS, &me->mm->flags);
>> +		mmap_write_unlock(me->mm);
>> +		break;
> 
> prctl can take more than one arguments. Would it be better to add
> PR_SET_THP_POLICY and PR_GET_THP_POLICY and specify PR_THP_POLICY_ALWAYS
> in the second argument? So that in the future, if we want to add
> more THP policies, we do not need to keep piling up PR_{GET,SET}_THP_*?

I am ok with either, with prctl values there isnt any limit like MMF.
I went with how the other prctls are used, i.e. PR_SET_THP_DISABLE uses
arg2 to check if it needs to be enabled (non 0)/disabled (0 value).

I think having PR_GET/SET_THP_ALWAYS/MADVISE could be better as it would
work similar to how existing PR_* (PR_SET_THP_DISABLE) works with arg2
used for enabling/disabling and we dont have any limits for the number
of PR_* that can exist. But also happy to change it to PR_GET/SET_THP_POLICY.

> 
>>  	case PR_MPX_ENABLE_MANAGEMENT:
>>  	case PR_MPX_DISABLE_MANAGEMENT:
>>  		/* No longer implemented: */
>> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
>> index 35791791a879..f5f6cff42b3f 100644
>> --- a/tools/include/uapi/linux/prctl.h
>> +++ b/tools/include/uapi/linux/prctl.h
>> @@ -328,4 +328,7 @@ struct prctl_mm_map {
>>  # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC	0x10 /* Clear the aspect on exec */
>>  # define PR_PPC_DEXCR_CTRL_MASK		0x1f
>>
>> +#define PR_GET_THP_ALWAYS	78
>> +#define PR_SET_THP_ALWAYS	79
>> +
>>  #endif /* _LINUX_PRCTL_H */
>> diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
>> index 15c18ef4eb11..680996d56faf 100644
>> --- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
>> +++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
>> @@ -364,4 +364,7 @@ struct prctl_mm_map {
>>  # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
>>  # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
>>
>> +#define PR_GET_THP_ALWAYS	78
>> +#define PR_SET_THP_ALWAYS	79
>> +
>>  #endif /* _LINUX_PRCTL_H */
>> -- 
>> 2.47.1
> 
> 
> --
> Best Regards,
> Yan, Zi


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-07 14:00 [PATCH 0/1] prctl: allow overriding system THP policy to always Usama Arif
  2025-05-07 14:00 ` [PATCH 1/1] prctl: allow overriding system THP policy to always per process Usama Arif
  2025-05-07 14:57 ` [PATCH 0/1] prctl: allow overriding system THP policy to always Zi Yan
@ 2025-05-08 11:06 ` David Hildenbrand
  2025-05-08 16:35   ` Usama Arif
  2 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2025-05-08 11:06 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, linux-mm
  Cc: hannes, shakeel.butt, riel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team

On 07.05.25 16:00, Usama Arif wrote:
> Allowing override of global THP policy per process allows workloads
> that have shown to benefit from hugepages to do so, without regressing
> workloads that wouldn't benefit. This will allow such types of
> workloads to be run/stacked on the same machine.
> 
> It also helps in rolling out hugepages in hyperscaler configurations
> for workloads that benefit from them, where a single THP policy is
> likely to be used across the entire fleet, and prctl will help override it.
> 
> An advantage of doing it via prctl vs creating a cgroup specific
> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
> that this will work even when there are no cgroups present, and my
> understanding is there is a strong preference of cgroups controls being
> hierarchical which usually means them having a numerical value.
> 
> 
> The output and code of test program is below:
> 
> [root@vm4 vmuser]# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
> [root@vm4 vmuser]# echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> [root@vm4 vmuser]# ./a.out
> Default THP setting:
> THP is not set to 'always'.
> PR_SET_THP_ALWAYS = 1
> THP is set to 'always'.
> PR_SET_THP_ALWAYS = 0
> THP is not set to 'always'.

Some quick feedback:

(1) The "always" in PR_SET_THP_ALWAYS does not look future proof. Why 
wouldn't someone want to force-disable them for a process (-> "never") 
or set it to some other new mode ("-> defer" that is currently on the list).

(2) In your example, is the toggle specific to 2M THP or the global 
toggle ...? Unclear. And that makes this interface also suboptimal.

(3) I'm a bit concerned about interaction with per-VMA settings (the one 
we already have, and order-specific ones that people were discussing). 
It's going to be a mess if we have global, per-process, per-vma and then 
some other policies (ebpf? whatever else?) on top.


The low-hanging fruit would be a per-process toggle that only controls 
the existing per-VMA toggle: for example, with the semantics that

(1) All new (applicable) VMAs start with VM_HUGEPAGE
(2) All existing (applicable) VMAs that are *not* VM_NOHUGEPAGE become 
VM_HUGEPAGE.


We did something similar with PR_SET_MEMORY_MERGE.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-08  5:41         ` Yafang Shao
@ 2025-05-08 16:04           ` Usama Arif
  2025-05-09  2:15             ` Yafang Shao
  0 siblings, 1 reply; 29+ messages in thread
From: Usama Arif @ 2025-05-08 16:04 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Zi Yan, Andrew Morton, david, linux-mm, hannes, shakeel.butt,
	riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, linux-kernel, kernel-team



On 08/05/2025 06:41, Yafang Shao wrote:
> On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 07/05/2025 16:57, Zi Yan wrote:
>>> On 7 May 2025, at 11:12, Usama Arif wrote:
>>>
>>>> On 07/05/2025 15:57, Zi Yan wrote:
>>>>> +Yafang, who is also looking at changing THP config at cgroup/container level.
> 
> Thanks
> 
>>>>>
>>>>> On 7 May 2025, at 10:00, Usama Arif wrote:
>>>>>
>>>>>> Allowing override of global THP policy per process allows workloads
>>>>>> that have shown to benefit from hugepages to do so, without regressing
>>>>>> workloads that wouldn't benefit. This will allow such types of
>>>>>> workloads to be run/stacked on the same machine.
>>>>>>
>>>>>> It also helps in rolling out hugepages in hyperscaler configurations
>>>>>> for workloads that benefit from them, where a single THP policy is
>>>>>> likely to be used across the entire fleet, and prctl will help override it.
>>>>>>
>>>>>> An advantage of doing it via prctl vs creating a cgroup specific
>>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
>>>>>> that this will work even when there are no cgroups present, and my
>>>>>> understanding is there is a strong preference of cgroups controls being
>>>>>> hierarchical which usually means them having a numerical value.
>>>>>
>>>>> Hi Usama,
>>>>>
>>>>> Do you mind giving an example on how to change THP policy for a set of
>>>>> processes running in a container (under a cgroup)?
>>>>
>>>> Hi Zi,
>>>>
>>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
>>>> for processes in a cgroup is in the same way we enable KSM for the cgroup.
>>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
>>>> in exec-invoke.
>>>> This is at the start of the process, but you would already know at the start of the process
>>>> whether you want THP=always for it or not.
>>>>
>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
>>>
>>> You also need to add a new systemd.directives, e.g., MemoryTHP, to
>>> pass the THP enablement or disablement info from a systemd config file.
>>> And if you find those processes do not benefit from using THPs,
>>> you can just change the new "MemoryTHP" config and restart the processes.
>>>
>>> Am I getting it? Thanks.
>>>
>>
>> Yes, thats right. They would exactly the same as what we (Meta) do
>> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set,
>> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when
>> that is set, the prctl will be called at exec_invoke of the process [3].
>>
>> The systemd changes should be quite simple to do.
>>
>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978
>> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151
>> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> 
> This solution carries a risk: since prctl() does not require any
> capabilities, the task itself could call it and override your memory
> policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that
> capability is typically enabled by default in containers, leaving them
> still vulnerable.
> 
> This approach might work for Kubernetes/container environments, but it
> would require substantial code changes to implement securely.
> 

You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE
already exists and the someone could use this to slow the process down. So the
approach this patch takes shouldn't be anymore of a security fix then what is already
exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE
should be used to restrict this. 

In terms of security vulnerability of prctl, I feel like there are a lot of others
that can be a much much bigger issue? I just had a look and you can change the
seccomp, reset PAC keys(!) even speculation control(!!), so I dont think the security
argument would be valid.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-08 11:06 ` David Hildenbrand
@ 2025-05-08 16:35   ` Usama Arif
  2025-05-08 17:39     ` David Hildenbrand
  0 siblings, 1 reply; 29+ messages in thread
From: Usama Arif @ 2025-05-08 16:35 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, linux-mm
  Cc: hannes, shakeel.butt, riel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team



On 08/05/2025 12:06, David Hildenbrand wrote:
> On 07.05.25 16:00, Usama Arif wrote:
>> Allowing override of global THP policy per process allows workloads
>> that have shown to benefit from hugepages to do so, without regressing
>> workloads that wouldn't benefit. This will allow such types of
>> workloads to be run/stacked on the same machine.
>>
>> It also helps in rolling out hugepages in hyperscaler configurations
>> for workloads that benefit from them, where a single THP policy is
>> likely to be used across the entire fleet, and prctl will help override it.
>>
>> An advantage of doing it via prctl vs creating a cgroup specific
>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
>> that this will work even when there are no cgroups present, and my
>> understanding is there is a strong preference of cgroups controls being
>> hierarchical which usually means them having a numerical value.
>>
>>
>> The output and code of test program is below:
>>
>> [root@vm4 vmuser]# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
>> [root@vm4 vmuser]# echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>> [root@vm4 vmuser]# ./a.out
>> Default THP setting:
>> THP is not set to 'always'.
>> PR_SET_THP_ALWAYS = 1
>> THP is set to 'always'.
>> PR_SET_THP_ALWAYS = 0
>> THP is not set to 'always'.
> 
> Some quick feedback:
> 
> (1) The "always" in PR_SET_THP_ALWAYS does not look future proof. Why wouldn't someone want to force-disable them for a process (-> "never") or set it to some other new mode ("-> defer" that is currently on the list).

Yes agree with this, I think there are 3 different possible ways forward for this which I outlined in [1] in reply to Zi Yan.
I like flags2 approach, but let me know what you think.

[1] https://lore.kernel.org/all/9ed673ad-764f-4f46-84a7-ef98b19d22ec@gmail.com/
> 
> (2) In your example, is the toggle specific to 2M THP or the global toggle ...? Unclear. And that makes this interface also suboptimal.

In this approach you would overwrite inherit folio sizes, and I think thats the right approach. So if you have for e.g. 2M and 16K set to inherit,
and the global one is set to madvise, doing PR_SET_THP_ALWAYS would change those folio to always.

> 
> (3) I'm a bit concerned about interaction with per-VMA settings (the one we already have, and order-specific ones that people were discussing). It's going to be a mess if we have global, per-process, per-vma and then some other policies (ebpf? whatever else?) on top.
> 
> 
> The low-hanging fruit would be a per-process toggle that only controls the existing per-VMA toggle: for example, with the semantics that
> 
> (1) All new (applicable) VMAs start with VM_HUGEPAGE
> (2) All existing (applicable) VMAs that are *not* VM_NOHUGEPAGE become VM_HUGEPAGE.
> 
> 
> We did something similar with PR_SET_MEMORY_MERGE.
> 

For this you mean the prctl command would do for_each_vma and set VM_HUGEPAGE to implement point 2.
For having point 1, I think we will still need extra mm->flags, i.e. MMF_VM_THP_MADVISE/DEFER/ALWAYS/NEVER.

I think it would have the same affect as what this patch is trying to do? But would be just more
expensive in terms of both code changes and the cost of the actual call as you now have to walk
all vmas. On the other hand you wont need the below diff in from v1. I do feel the current approach
in the patch is simpler? But if your point 3 is better in terms of code maintainability, happy to make
it the change to it in v2. 


diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..0587dc4b8e2d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -293,7 +293,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
                if (vm_flags & VM_HUGEPAGE)
                        mask |= READ_ONCE(huge_anon_orders_madvise);
                if (hugepage_global_always() ||
-                   ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
+                   ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()) ||
+                   test_bit(MMF_THP_ALWAYS, &vma->vm_mm->flags))
                        mask |= READ_ONCE(huge_anon_orders_inherit);
 
                orders &= mask; 

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-08 16:35   ` Usama Arif
@ 2025-05-08 17:39     ` David Hildenbrand
  2025-05-08 18:05       ` Usama Arif
  0 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2025-05-08 17:39 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, linux-mm
  Cc: hannes, shakeel.butt, riel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team

>> We did something similar with PR_SET_MEMORY_MERGE.
>>
> 
> For this you mean the prctl command would do for_each_vma and set VM_HUGEPAGE to implement point 2.

Yes. The expectation would be that the prctl is either a) triggered 
early when there are not that many VMAs yet or b) inherited during 
fork+exec such that the target process immediately gets it enabled for 
all VMAs.

> For having point 1, I think we will still need extra mm->flags, i.e. MMF_VM_THP_MADVISE/DEFER/ALWAYS/NEVER.

Yes, and it's unclear what we really want here.

> 
> I think it would have the same affect as what this patch is trying to do? But would be just more
> expensive in terms of both code changes and the cost of the actual call as you now have to walk
> all vmas. On the other hand you wont need the below diff in from v1. I do feel the current approach
> in the patch is simpler? But if your point 3 is better in terms of code maintainability, happy to make
> it the change to it in v2.

Yes. Having a toggle that

a) Changes the default of new VMAs to be VM_HUGEPAGE

b) Modifies all existing VMAs that are not VM_NOHUGEPAGE to be VM_HUGEPAGE

c) Is inherited during fork+exec

Would not add any additional checks in our already-complicated THP 
allocation logic, it simply changes the default/value of VM_HUGEPAGE.


At least to me, this sounds like better semantics. Would that make your 
use case happy?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-08 17:39     ` David Hildenbrand
@ 2025-05-08 18:05       ` Usama Arif
  0 siblings, 0 replies; 29+ messages in thread
From: Usama Arif @ 2025-05-08 18:05 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, linux-mm
  Cc: hannes, shakeel.butt, riel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team



On 08/05/2025 18:39, David Hildenbrand wrote:
>>> We did something similar with PR_SET_MEMORY_MERGE.
>>>
>>
>> For this you mean the prctl command would do for_each_vma and set VM_HUGEPAGE to implement point 2.
> 
> Yes. The expectation would be that the prctl is either a) triggered early when there are not that many VMAs yet or b) inherited during fork+exec such that the target process immediately gets it enabled for all VMAs.

Yes that is the case for us as well, For us its mostly going to be a, i.e. exec_invoke called by systemd. But b will be good as well.
> 
>> For having point 1, I think we will still need extra mm->flags, i.e. MMF_VM_THP_MADVISE/DEFER/ALWAYS/NEVER.
> 
> Yes, and it's unclear what we really want here.

Let me make a v2 and we might be able to have a better discussion once its done. I will move it to RFC.
> 
>>
>> I think it would have the same affect as what this patch is trying to do? But would be just more
>> expensive in terms of both code changes and the cost of the actual call as you now have to walk
>> all vmas. On the other hand you wont need the below diff in from v1. I do feel the current approach
>> in the patch is simpler? But if your point 3 is better in terms of code maintainability, happy to make
>> it the change to it in v2.
> 
> Yes. Having a toggle that
> 
> a) Changes the default of new VMAs to be VM_HUGEPAGE
> 
> b) Modifies all existing VMAs that are not VM_NOHUGEPAGE to be VM_HUGEPAGE
> 
> c) Is inherited during fork+exec
> 
> Would not add any additional checks in our already-complicated THP allocation logic, it simply changes the default/value of VM_HUGEPAGE.
> 
> 
> At least to me, this sounds like better semantics. Would that make your use case happy?
> 

Yes, this would work for us. Let me try and make this work in the way you described and also make it work for
MADVISE and NEVER, will send that in v2.

Thanks for the feedback!!




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/1] prctl: allow overriding system THP policy to always per process
  2025-05-08 10:53     ` Usama Arif
@ 2025-05-08 20:29       ` Zi Yan
  0 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2025-05-08 20:29 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	linux-kernel, kernel-team

On 8 May 2025, at 6:53, Usama Arif wrote:

> On 07/05/2025 21:14, Zi Yan wrote:
>> On 7 May 2025, at 10:00, Usama Arif wrote:
>>
>>> Allowing override of global THP policy per process allows workloads
>>> that have shown to benefit from hugepages to do so, without regressing
>>> workloads that wouldn't benefit. This will allow such types of workloads
>>> to be run/stacked on the same machine.
>>>
>>> It also helps in rolling out hugepages in hyperscaler configurations
>>> for workloads that benefit from them, where a single THP policy is likely
>>> to be used across the entire fleet, and prctl will help override it.
>>>
>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>> ---
>>>  include/linux/huge_mm.h                          |  3 ++-
>>>  include/linux/mm_types.h                         |  7 ++-----
>>>  include/uapi/linux/prctl.h                       |  3 +++
>>>  kernel/sys.c                                     | 16 ++++++++++++++++
>>>  tools/include/uapi/linux/prctl.h                 |  3 +++
>>>  .../perf/trace/beauty/include/uapi/linux/prctl.h |  3 +++
>>>  6 files changed, 29 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 2f190c90192d..0587dc4b8e2d 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -293,7 +293,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>  		if (vm_flags & VM_HUGEPAGE)
>>>  			mask |= READ_ONCE(huge_anon_orders_madvise);
>>>  		if (hugepage_global_always() ||
>>> -		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
>>> +		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()) ||
>>> +		    test_bit(MMF_THP_ALWAYS, &vma->vm_mm->flags))
>>>  			mask |= READ_ONCE(huge_anon_orders_inherit);
>>>
>>>  		orders &= mask;
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index e76bade9ebb1..9bcd72b2b191 100644
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -1704,11 +1704,8 @@ enum {
>>>  #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
>>>  #define MMF_VM_HUGEPAGE		17	/* set when mm is available for khugepaged */
>>>
>>> -/*
>>> - * This one-shot flag is dropped due to necessity of changing exe once again
>>> - * on NFS restore
>>> - */
>>> -//#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
>>> +/* override inherited page sizes to always for the entire process */
>>> + #define MMF_THP_ALWAYS	18
>>
>> Could we have something like MMF_THP_POLICY_SET and another field
>> for "always"? Otherwise, how are we going to set MMF_THP_MADVISE if
>> we want it in the future?
>>
>
> So we actually need MMF_THP_MADVISE as well. If we have the entire fleet system
> policy set to always, but if certain workloads regress with it, then we need to
> prctl MADVISE them.
>
> As you said, either we have:
> - MMF_THP_MADVISE as another flag. The issue with this is we have run out
> of bits in mm->flags for 32 bit machines. So either we introduce another member
> in mm_struct (maybe mm_struct->flags2?), or we start using bits 32 and above of
> flags field, limit this to 64 bit architectures and wrap the code with #ifdefs.
> This is probably ugly, but if its ok for upstream, happy to do that.
> - MMF_THP_POLICY_SET + another field to specify what the policy is, the issue
> with this might be another field per mm_struct/process. Having an entire field
> just for this might be wasteful, so I think I would prefer having just
> mm_struct->flags2 and use up only 1 bit of it, and the rest can be used for
> anything else in the future.
>
> I think the flags2 approach is likely the best, but let me know what you think.

Sounds good to me. If you are adding flags2, maybe put both MMF_THP_ALWAYS and
MMF_THP_MADVISE in flags2 to make code look more consistent? Just a thought.

>>>
>>>  #define MMF_HAS_UPROBES		19	/* has uprobes */
>>>  #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
>>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>>> index 15c18ef4eb11..22c526681562 100644
>>> --- a/include/uapi/linux/prctl.h
>>> +++ b/include/uapi/linux/prctl.h
>>> @@ -364,4 +364,7 @@ struct prctl_mm_map {
>>>  # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
>>>  # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
>>>
>>> +#define PR_SET_THP_ALWAYS	78
>>> +#define PR_GET_THP_ALWAYS	79
>>> +
>>>  #endif /* _LINUX_PRCTL_H */
>>> diff --git a/kernel/sys.c b/kernel/sys.c
>>> index c434968e9f5d..ee56b059ff1f 100644
>>> --- a/kernel/sys.c
>>> +++ b/kernel/sys.c
>>> @@ -2658,6 +2658,22 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>>>  			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
>>>  		mmap_write_unlock(me->mm);
>>>  		break;
>>> +	case PR_GET_THP_ALWAYS:
>>> +		if (arg2 || arg3 || arg4 || arg5)
>>> +			return -EINVAL;
>>> +		error = !!test_bit(MMF_THP_ALWAYS, &me->mm->flags);
>>> +		break;
>>> +	case PR_SET_THP_ALWAYS:
>>> +		if (arg3 || arg4 || arg5)
>>> +			return -EINVAL;
>>> +		if (mmap_write_lock_killable(me->mm))
>>> +			return -EINTR;
>>> +		if (arg2)
>>> +			set_bit(MMF_THP_ALWAYS, &me->mm->flags);
>>> +		else
>>> +			clear_bit(MMF_THP_ALWAYS, &me->mm->flags);
>>> +		mmap_write_unlock(me->mm);
>>> +		break;
>>
>> prctl can take more than one arguments. Would it be better to add
>> PR_SET_THP_POLICY and PR_GET_THP_POLICY and specify PR_THP_POLICY_ALWAYS
>> in the second argument? So that in the future, if we want to add
>> more THP policies, we do not need to keep piling up PR_{GET,SET}_THP_*?
>
> I am ok with either, with prctl values there isnt any limit like MMF.
> I went with how the other prctls are used, i.e. PR_SET_THP_DISABLE uses
> arg2 to check if it needs to be enabled (non 0)/disabled (0 value).
>
> I think having PR_GET/SET_THP_ALWAYS/MADVISE could be better as it would
> work similar to how existing PR_* (PR_SET_THP_DISABLE) works with arg2
> used for enabling/disabling and we dont have any limits for the number
> of PR_* that can exist. But also happy to change it to PR_GET/SET_THP_POLICY.

OK, if the number of PR_* is not a concern, I do not have a strong opinion on
this one.

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-08 16:04           ` Usama Arif
@ 2025-05-09  2:15             ` Yafang Shao
  2025-05-09  5:13               ` Johannes Weiner
  0 siblings, 1 reply; 29+ messages in thread
From: Yafang Shao @ 2025-05-09  2:15 UTC (permalink / raw)
  To: Usama Arif
  Cc: Zi Yan, Andrew Morton, david, linux-mm, hannes, shakeel.butt,
	riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, linux-kernel, kernel-team

On Fri, May 9, 2025 at 12:04 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 08/05/2025 06:41, Yafang Shao wrote:
> > On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>
> >>
> >>
> >> On 07/05/2025 16:57, Zi Yan wrote:
> >>> On 7 May 2025, at 11:12, Usama Arif wrote:
> >>>
> >>>> On 07/05/2025 15:57, Zi Yan wrote:
> >>>>> +Yafang, who is also looking at changing THP config at cgroup/container level.
> >
> > Thanks
> >
> >>>>>
> >>>>> On 7 May 2025, at 10:00, Usama Arif wrote:
> >>>>>
> >>>>>> Allowing override of global THP policy per process allows workloads
> >>>>>> that have shown to benefit from hugepages to do so, without regressing
> >>>>>> workloads that wouldn't benefit. This will allow such types of
> >>>>>> workloads to be run/stacked on the same machine.
> >>>>>>
> >>>>>> It also helps in rolling out hugepages in hyperscaler configurations
> >>>>>> for workloads that benefit from them, where a single THP policy is
> >>>>>> likely to be used across the entire fleet, and prctl will help override it.
> >>>>>>
> >>>>>> An advantage of doing it via prctl vs creating a cgroup specific
> >>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
> >>>>>> that this will work even when there are no cgroups present, and my
> >>>>>> understanding is there is a strong preference of cgroups controls being
> >>>>>> hierarchical which usually means them having a numerical value.
> >>>>>
> >>>>> Hi Usama,
> >>>>>
> >>>>> Do you mind giving an example on how to change THP policy for a set of
> >>>>> processes running in a container (under a cgroup)?
> >>>>
> >>>> Hi Zi,
> >>>>
> >>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
> >>>> for processes in a cgroup is in the same way we enable KSM for the cgroup.
> >>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
> >>>> in exec-invoke.
> >>>> This is at the start of the process, but you would already know at the start of the process
> >>>> whether you want THP=always for it or not.
> >>>>
> >>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> >>>
> >>> You also need to add a new systemd.directives, e.g., MemoryTHP, to
> >>> pass the THP enablement or disablement info from a systemd config file.
> >>> And if you find those processes do not benefit from using THPs,
> >>> you can just change the new "MemoryTHP" config and restart the processes.
> >>>
> >>> Am I getting it? Thanks.
> >>>
> >>
> >> Yes, thats right. They would exactly the same as what we (Meta) do
> >> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set,
> >> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when
> >> that is set, the prctl will be called at exec_invoke of the process [3].
> >>
> >> The systemd changes should be quite simple to do.
> >>
> >> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978
> >> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151
> >> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> >
> > This solution carries a risk: since prctl() does not require any
> > capabilities, the task itself could call it and override your memory
> > policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that
> > capability is typically enabled by default in containers, leaving them
> > still vulnerable.
> >
> > This approach might work for Kubernetes/container environments, but it
> > would require substantial code changes to implement securely.
> >
>
> You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE
> already exists and the someone could use this to slow the process down. So the
> approach this patch takes shouldn't be anymore of a security fix then what is already
> exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE
> should be used to restrict this.

I believe we should at least require CAP_SYS_RESOURCE to enable THP,
since it overrides global system settings. Alternatively,
CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely
certain.

>
> In terms of security vulnerability of prctl, I feel like there are a lot of others
> that can be a much much bigger issue? I just had a look and you can change the
> seccomp, reset PAC keys(!) even speculation control(!!), so I dont think the security
> argument would be valid.

I was surprised to discover that none of these operations require any
capabilities to execute.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-09  2:15             ` Yafang Shao
@ 2025-05-09  5:13               ` Johannes Weiner
  2025-05-09  9:24                 ` Yafang Shao
  0 siblings, 1 reply; 29+ messages in thread
From: Johannes Weiner @ 2025-05-09  5:13 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Usama Arif, Zi Yan, Andrew Morton, david, linux-mm, shakeel.butt,
	riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, linux-kernel, kernel-team

On Fri, May 09, 2025 at 10:15:08AM +0800, Yafang Shao wrote:
> On Fri, May 9, 2025 at 12:04 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >
> >
> >
> > On 08/05/2025 06:41, Yafang Shao wrote:
> > > On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@gmail.com> wrote:
> > >>
> > >>
> > >>
> > >> On 07/05/2025 16:57, Zi Yan wrote:
> > >>> On 7 May 2025, at 11:12, Usama Arif wrote:
> > >>>
> > >>>> On 07/05/2025 15:57, Zi Yan wrote:
> > >>>>> +Yafang, who is also looking at changing THP config at cgroup/container level.
> > >
> > > Thanks
> > >
> > >>>>>
> > >>>>> On 7 May 2025, at 10:00, Usama Arif wrote:
> > >>>>>
> > >>>>>> Allowing override of global THP policy per process allows workloads
> > >>>>>> that have shown to benefit from hugepages to do so, without regressing
> > >>>>>> workloads that wouldn't benefit. This will allow such types of
> > >>>>>> workloads to be run/stacked on the same machine.
> > >>>>>>
> > >>>>>> It also helps in rolling out hugepages in hyperscaler configurations
> > >>>>>> for workloads that benefit from them, where a single THP policy is
> > >>>>>> likely to be used across the entire fleet, and prctl will help override it.
> > >>>>>>
> > >>>>>> An advantage of doing it via prctl vs creating a cgroup specific
> > >>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
> > >>>>>> that this will work even when there are no cgroups present, and my
> > >>>>>> understanding is there is a strong preference of cgroups controls being
> > >>>>>> hierarchical which usually means them having a numerical value.
> > >>>>>
> > >>>>> Hi Usama,
> > >>>>>
> > >>>>> Do you mind giving an example on how to change THP policy for a set of
> > >>>>> processes running in a container (under a cgroup)?
> > >>>>
> > >>>> Hi Zi,
> > >>>>
> > >>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
> > >>>> for processes in a cgroup is in the same way we enable KSM for the cgroup.
> > >>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
> > >>>> in exec-invoke.
> > >>>> This is at the start of the process, but you would already know at the start of the process
> > >>>> whether you want THP=always for it or not.
> > >>>>
> > >>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> > >>>
> > >>> You also need to add a new systemd.directives, e.g., MemoryTHP, to
> > >>> pass the THP enablement or disablement info from a systemd config file.
> > >>> And if you find those processes do not benefit from using THPs,
> > >>> you can just change the new "MemoryTHP" config and restart the processes.
> > >>>
> > >>> Am I getting it? Thanks.
> > >>>
> > >>
> > >> Yes, thats right. They would exactly the same as what we (Meta) do
> > >> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set,
> > >> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when
> > >> that is set, the prctl will be called at exec_invoke of the process [3].
> > >>
> > >> The systemd changes should be quite simple to do.
> > >>
> > >> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978
> > >> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151
> > >> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> > >
> > > This solution carries a risk: since prctl() does not require any
> > > capabilities, the task itself could call it and override your memory
> > > policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that
> > > capability is typically enabled by default in containers, leaving them
> > > still vulnerable.
> > >
> > > This approach might work for Kubernetes/container environments, but it
> > > would require substantial code changes to implement securely.
> > >
> >
> > You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE
> > already exists and the someone could use this to slow the process down. So the
> > approach this patch takes shouldn't be anymore of a security fix then what is already
> > exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE
> > should be used to restrict this.
> 
> I believe we should at least require CAP_SYS_RESOURCE to enable THP,
> since it overrides global system settings. Alternatively,
> CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely
> certain.

Hm, could you verbalize a concrete security concern?

I've never really looked at the global settings as a hard policy, more
as picking a default for the workloads in the system. It's usually
`madvise' or `always', and MADV_HUGEPAGE and MADV_NOHUGEPAGE have long
existed to give applications the ability to refine the global choice.

The prctl should probably respect `never' for consistency, but beyond
that I don't really see the concern, or how this would allow something
that isn't already possible.

> > In terms of security vulnerability of prctl, I feel like there are a lot of others
> > that can be a much much bigger issue? I just had a look and you can change the
> > seccomp, reset PAC keys(!) even speculation control(!!), so I dont think the security
> > argument would be valid.
> 
> I was surprised to discover that none of these operations require any
> capabilities to execute.

seccomp enabling is a one-way street, PR_SPEC_FORCE_DISABLE is as
well. You can reset PAC keys, but presumably, unless you also switch
to a new execution context with entirely new PAC/AUT pairs, this would
just crash the application on the next AUT?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-09  5:13               ` Johannes Weiner
@ 2025-05-09  9:24                 ` Yafang Shao
  2025-05-09  9:30                   ` David Hildenbrand
  0 siblings, 1 reply; 29+ messages in thread
From: Yafang Shao @ 2025-05-09  9:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Usama Arif, Zi Yan, Andrew Morton, david, linux-mm, shakeel.butt,
	riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, linux-kernel, kernel-team

On Fri, May 9, 2025 at 1:13 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, May 09, 2025 at 10:15:08AM +0800, Yafang Shao wrote:
> > On Fri, May 9, 2025 at 12:04 AM Usama Arif <usamaarif642@gmail.com> wrote:
> > >
> > >
> > >
> > > On 08/05/2025 06:41, Yafang Shao wrote:
> > > > On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@gmail.com> wrote:
> > > >>
> > > >>
> > > >>
> > > >> On 07/05/2025 16:57, Zi Yan wrote:
> > > >>> On 7 May 2025, at 11:12, Usama Arif wrote:
> > > >>>
> > > >>>> On 07/05/2025 15:57, Zi Yan wrote:
> > > >>>>> +Yafang, who is also looking at changing THP config at cgroup/container level.
> > > >
> > > > Thanks
> > > >
> > > >>>>>
> > > >>>>> On 7 May 2025, at 10:00, Usama Arif wrote:
> > > >>>>>
> > > >>>>>> Allowing override of global THP policy per process allows workloads
> > > >>>>>> that have shown to benefit from hugepages to do so, without regressing
> > > >>>>>> workloads that wouldn't benefit. This will allow such types of
> > > >>>>>> workloads to be run/stacked on the same machine.
> > > >>>>>>
> > > >>>>>> It also helps in rolling out hugepages in hyperscaler configurations
> > > >>>>>> for workloads that benefit from them, where a single THP policy is
> > > >>>>>> likely to be used across the entire fleet, and prctl will help override it.
> > > >>>>>>
> > > >>>>>> An advantage of doing it via prctl vs creating a cgroup specific
> > > >>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
> > > >>>>>> that this will work even when there are no cgroups present, and my
> > > >>>>>> understanding is there is a strong preference of cgroups controls being
> > > >>>>>> hierarchical which usually means them having a numerical value.
> > > >>>>>
> > > >>>>> Hi Usama,
> > > >>>>>
> > > >>>>> Do you mind giving an example on how to change THP policy for a set of
> > > >>>>> processes running in a container (under a cgroup)?
> > > >>>>
> > > >>>> Hi Zi,
> > > >>>>
> > > >>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
> > > >>>> for processes in a cgroup is in the same way we enable KSM for the cgroup.
> > > >>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
> > > >>>> in exec-invoke.
> > > >>>> This is at the start of the process, but you would already know at the start of the process
> > > >>>> whether you want THP=always for it or not.
> > > >>>>
> > > >>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> > > >>>
> > > >>> You also need to add a new systemd.directives, e.g., MemoryTHP, to
> > > >>> pass the THP enablement or disablement info from a systemd config file.
> > > >>> And if you find those processes do not benefit from using THPs,
> > > >>> you can just change the new "MemoryTHP" config and restart the processes.
> > > >>>
> > > >>> Am I getting it? Thanks.
> > > >>>
> > > >>
> > > >> Yes, thats right. They would exactly the same as what we (Meta) do
> > > >> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set,
> > > >> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when
> > > >> that is set, the prctl will be called at exec_invoke of the process [3].
> > > >>
> > > >> The systemd changes should be quite simple to do.
> > > >>
> > > >> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978
> > > >> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151
> > > >> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> > > >
> > > > This solution carries a risk: since prctl() does not require any
> > > > capabilities, the task itself could call it and override your memory
> > > > policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that
> > > > capability is typically enabled by default in containers, leaving them
> > > > still vulnerable.
> > > >
> > > > This approach might work for Kubernetes/container environments, but it
> > > > would require substantial code changes to implement securely.
> > > >
> > >
> > > You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE
> > > already exists and the someone could use this to slow the process down. So the
> > > approach this patch takes shouldn't be anymore of a security fix then what is already
> > > exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE
> > > should be used to restrict this.
> >
> > I believe we should at least require CAP_SYS_RESOURCE to enable THP,
> > since it overrides global system settings. Alternatively,
> > CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely
> > certain.
>
> Hm, could you verbalize a concrete security concern?
>
> I've never really looked at the global settings as a hard policy, more
> as picking a default for the workloads in the system. It's usually
> `madvise' or `always', and MADV_HUGEPAGE and MADV_NOHUGEPAGE have long
> existed to give applications the ability to refine the global choice.
>
> The prctl should probably respect `never' for consistency, but beyond
> that I don't really see the concern, or how this would allow something
> that isn't already possible.

I would interpret the always, madvise, and never options as follows:
- always
  The sysadmin strongly recommends using THP. If a user does not
want to use it, they must explicitly disable it.
- madvise
 The sysadmin gently encourages the use of THP, but it is only
enabled when explicitly requested by the application.
- never
  The sysadmin discourages the use of THP, and "its use is only permitted
with explicit approval" .

>
> > > In terms of security vulnerability of prctl, I feel like there are a lot of others
> > > that can be a much much bigger issue? I just had a look and you can change the
> > > seccomp, reset PAC keys(!) even speculation control(!!), so I dont think the security
> > > argument would be valid.
> >
> > I was surprised to discover that none of these operations require any
> > capabilities to execute.
>
> seccomp enabling is a one-way street, PR_SPEC_FORCE_DISABLE is as
> well. You can reset PAC keys, but presumably, unless you also switch
> to a new execution context with entirely new PAC/AUT pairs, this would
> just crash the application on the next AUT?

It appears so—thank you for the clarification.


--
Regards

Yafang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-09  9:24                 ` Yafang Shao
@ 2025-05-09  9:30                   ` David Hildenbrand
  2025-05-09  9:43                     ` Yafang Shao
  0 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2025-05-09  9:30 UTC (permalink / raw)
  To: Yafang Shao, Johannes Weiner
  Cc: Usama Arif, Zi Yan, Andrew Morton, linux-mm, shakeel.butt, riel,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	linux-kernel, kernel-team

On 09.05.25 11:24, Yafang Shao wrote:
> On Fri, May 9, 2025 at 1:13 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>
>> On Fri, May 09, 2025 at 10:15:08AM +0800, Yafang Shao wrote:
>>> On Fri, May 9, 2025 at 12:04 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> On 08/05/2025 06:41, Yafang Shao wrote:
>>>>> On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 07/05/2025 16:57, Zi Yan wrote:
>>>>>>> On 7 May 2025, at 11:12, Usama Arif wrote:
>>>>>>>
>>>>>>>> On 07/05/2025 15:57, Zi Yan wrote:
>>>>>>>>> +Yafang, who is also looking at changing THP config at cgroup/container level.
>>>>>
>>>>> Thanks
>>>>>
>>>>>>>>>
>>>>>>>>> On 7 May 2025, at 10:00, Usama Arif wrote:
>>>>>>>>>
>>>>>>>>>> Allowing override of global THP policy per process allows workloads
>>>>>>>>>> that have shown to benefit from hugepages to do so, without regressing
>>>>>>>>>> workloads that wouldn't benefit. This will allow such types of
>>>>>>>>>> workloads to be run/stacked on the same machine.
>>>>>>>>>>
>>>>>>>>>> It also helps in rolling out hugepages in hyperscaler configurations
>>>>>>>>>> for workloads that benefit from them, where a single THP policy is
>>>>>>>>>> likely to be used across the entire fleet, and prctl will help override it.
>>>>>>>>>>
>>>>>>>>>> An advantage of doing it via prctl vs creating a cgroup specific
>>>>>>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
>>>>>>>>>> that this will work even when there are no cgroups present, and my
>>>>>>>>>> understanding is there is a strong preference of cgroups controls being
>>>>>>>>>> hierarchical which usually means them having a numerical value.
>>>>>>>>>
>>>>>>>>> Hi Usama,
>>>>>>>>>
>>>>>>>>> Do you mind giving an example on how to change THP policy for a set of
>>>>>>>>> processes running in a container (under a cgroup)?
>>>>>>>>
>>>>>>>> Hi Zi,
>>>>>>>>
>>>>>>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
>>>>>>>> for processes in a cgroup is in the same way we enable KSM for the cgroup.
>>>>>>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
>>>>>>>> in exec-invoke.
>>>>>>>> This is at the start of the process, but you would already know at the start of the process
>>>>>>>> whether you want THP=always for it or not.
>>>>>>>>
>>>>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
>>>>>>>
>>>>>>> You also need to add a new systemd.directives, e.g., MemoryTHP, to
>>>>>>> pass the THP enablement or disablement info from a systemd config file.
>>>>>>> And if you find those processes do not benefit from using THPs,
>>>>>>> you can just change the new "MemoryTHP" config and restart the processes.
>>>>>>>
>>>>>>> Am I getting it? Thanks.
>>>>>>>
>>>>>>
>>>>>> Yes, thats right. They would exactly the same as what we (Meta) do
>>>>>> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set,
>>>>>> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when
>>>>>> that is set, the prctl will be called at exec_invoke of the process [3].
>>>>>>
>>>>>> The systemd changes should be quite simple to do.
>>>>>>
>>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978
>>>>>> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151
>>>>>> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
>>>>>
>>>>> This solution carries a risk: since prctl() does not require any
>>>>> capabilities, the task itself could call it and override your memory
>>>>> policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that
>>>>> capability is typically enabled by default in containers, leaving them
>>>>> still vulnerable.
>>>>>
>>>>> This approach might work for Kubernetes/container environments, but it
>>>>> would require substantial code changes to implement securely.
>>>>>
>>>>
>>>> You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE
>>>> already exists and the someone could use this to slow the process down. So the
>>>> approach this patch takes shouldn't be anymore of a security fix then what is already
>>>> exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE
>>>> should be used to restrict this.
>>>
>>> I believe we should at least require CAP_SYS_RESOURCE to enable THP,
>>> since it overrides global system settings. Alternatively,
>>> CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely
>>> certain.
>>
>> Hm, could you verbalize a concrete security concern?
>>
>> I've never really looked at the global settings as a hard policy, more
>> as picking a default for the workloads in the system. It's usually
>> `madvise' or `always', and MADV_HUGEPAGE and MADV_NOHUGEPAGE have long
>> existed to give applications the ability to refine the global choice.
>>
>> The prctl should probably respect `never' for consistency, but beyond
>> that I don't really see the concern, or how this would allow something
>> that isn't already possible.
> 
> I would interpret the always, madvise, and never options as follows:
> - always
>    The sysadmin strongly recommends using THP. If a user does not
> want to use it, they must explicitly disable it.
> - madvise
>   The sysadmin gently encourages the use of THP, but it is only
> enabled when explicitly requested by the application.
> - never
>    The sysadmin discourages the use of THP, and "its use is only permitted
> with explicit approval" .

"never" so far means "no thps, no exceptions". We've had serious THP 
issues in the past, where our workaround until we sorted out the issue 
for affected customers was to force-disable THPs on that system during boot.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-09  9:30                   ` David Hildenbrand
@ 2025-05-09  9:43                     ` Yafang Shao
  2025-05-09 16:46                       ` Johannes Weiner
  0 siblings, 1 reply; 29+ messages in thread
From: Yafang Shao @ 2025-05-09  9:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Johannes Weiner, Usama Arif, Zi Yan, Andrew Morton, linux-mm,
	shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, linux-kernel, kernel-team

On Fri, May 9, 2025 at 5:31 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 09.05.25 11:24, Yafang Shao wrote:
> > On Fri, May 9, 2025 at 1:13 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >>
> >> On Fri, May 09, 2025 at 10:15:08AM +0800, Yafang Shao wrote:
> >>> On Fri, May 9, 2025 at 12:04 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 08/05/2025 06:41, Yafang Shao wrote:
> >>>>> On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 07/05/2025 16:57, Zi Yan wrote:
> >>>>>>> On 7 May 2025, at 11:12, Usama Arif wrote:
> >>>>>>>
> >>>>>>>> On 07/05/2025 15:57, Zi Yan wrote:
> >>>>>>>>> +Yafang, who is also looking at changing THP config at cgroup/container level.
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>>>>>>
> >>>>>>>>> On 7 May 2025, at 10:00, Usama Arif wrote:
> >>>>>>>>>
> >>>>>>>>>> Allowing override of global THP policy per process allows workloads
> >>>>>>>>>> that have shown to benefit from hugepages to do so, without regressing
> >>>>>>>>>> workloads that wouldn't benefit. This will allow such types of
> >>>>>>>>>> workloads to be run/stacked on the same machine.
> >>>>>>>>>>
> >>>>>>>>>> It also helps in rolling out hugepages in hyperscaler configurations
> >>>>>>>>>> for workloads that benefit from them, where a single THP policy is
> >>>>>>>>>> likely to be used across the entire fleet, and prctl will help override it.
> >>>>>>>>>>
> >>>>>>>>>> An advantage of doing it via prctl vs creating a cgroup specific
> >>>>>>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
> >>>>>>>>>> that this will work even when there are no cgroups present, and my
> >>>>>>>>>> understanding is there is a strong preference of cgroups controls being
> >>>>>>>>>> hierarchical which usually means them having a numerical value.
> >>>>>>>>>
> >>>>>>>>> Hi Usama,
> >>>>>>>>>
> >>>>>>>>> Do you mind giving an example on how to change THP policy for a set of
> >>>>>>>>> processes running in a container (under a cgroup)?
> >>>>>>>>
> >>>>>>>> Hi Zi,
> >>>>>>>>
> >>>>>>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
> >>>>>>>> for processes in a cgroup is in the same way we enable KSM for the cgroup.
> >>>>>>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
> >>>>>>>> in exec-invoke.
> >>>>>>>> This is at the start of the process, but you would already know at the start of the process
> >>>>>>>> whether you want THP=always for it or not.
> >>>>>>>>
> >>>>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> >>>>>>>
> >>>>>>> You also need to add a new systemd.directives, e.g., MemoryTHP, to
> >>>>>>> pass the THP enablement or disablement info from a systemd config file.
> >>>>>>> And if you find those processes do not benefit from using THPs,
> >>>>>>> you can just change the new "MemoryTHP" config and restart the processes.
> >>>>>>>
> >>>>>>> Am I getting it? Thanks.
> >>>>>>>
> >>>>>>
> >>>>>> Yes, thats right. They would exactly the same as what we (Meta) do
> >>>>>> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set,
> >>>>>> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when
> >>>>>> that is set, the prctl will be called at exec_invoke of the process [3].
> >>>>>>
> >>>>>> The systemd changes should be quite simple to do.
> >>>>>>
> >>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978
> >>>>>> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151
> >>>>>> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> >>>>>
> >>>>> This solution carries a risk: since prctl() does not require any
> >>>>> capabilities, the task itself could call it and override your memory
> >>>>> policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that
> >>>>> capability is typically enabled by default in containers, leaving them
> >>>>> still vulnerable.
> >>>>>
> >>>>> This approach might work for Kubernetes/container environments, but it
> >>>>> would require substantial code changes to implement securely.
> >>>>>
> >>>>
> >>>> You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE
> >>>> already exists and the someone could use this to slow the process down. So the
> >>>> approach this patch takes shouldn't be anymore of a security fix then what is already
> >>>> exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE
> >>>> should be used to restrict this.
> >>>
> >>> I believe we should at least require CAP_SYS_RESOURCE to enable THP,
> >>> since it overrides global system settings. Alternatively,
> >>> CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely
> >>> certain.
> >>
> >> Hm, could you verbalize a concrete security concern?
> >>
> >> I've never really looked at the global settings as a hard policy, more
> >> as picking a default for the workloads in the system. It's usually
> >> `madvise' or `always', and MADV_HUGEPAGE and MADV_NOHUGEPAGE have long
> >> existed to give applications the ability to refine the global choice.
> >>
> >> The prctl should probably respect `never' for consistency, but beyond
> >> that I don't really see the concern, or how this would allow something
> >> that isn't already possible.
> >
> > I would interpret the always, madvise, and never options as follows:
> > - always
> >    The sysadmin strongly recommends using THP. If a user does not
> > want to use it, they must explicitly disable it.
> > - madvise
> >   The sysadmin gently encourages the use of THP, but it is only
> > enabled when explicitly requested by the application.
> > - never
> >    The sysadmin discourages the use of THP, and "its use is only permitted
> > with explicit approval" .
>
> "never" so far means "no thps, no exceptions". We've had serious THP
> issues in the past, where our workaround until we sorted out the issue
> for affected customers was to force-disable THPs on that system during boot.

Right, that reflects the current behavior. What we aim to enhance is
by adding the requirement that "its use is only permitted with
explicit approval."

This would be a small but meaningful step toward making THP truly
usable in a transparent way in production environments.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-09  9:43                     ` Yafang Shao
@ 2025-05-09 16:46                       ` Johannes Weiner
  2025-05-09 22:42                         ` David Hildenbrand
  2025-05-11  2:08                         ` Yafang Shao
  0 siblings, 2 replies; 29+ messages in thread
From: Johannes Weiner @ 2025-05-09 16:46 UTC (permalink / raw)
  To: Yafang Shao
  Cc: David Hildenbrand, Usama Arif, Zi Yan, Andrew Morton, linux-mm,
	shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, linux-kernel, kernel-team

On Fri, May 09, 2025 at 05:43:10PM +0800, Yafang Shao wrote:
> On Fri, May 9, 2025 at 5:31 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 09.05.25 11:24, Yafang Shao wrote:
> > > On Fri, May 9, 2025 at 1:13 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >>
> > >> On Fri, May 09, 2025 at 10:15:08AM +0800, Yafang Shao wrote:
> > >>> On Fri, May 9, 2025 at 12:04 AM Usama Arif <usamaarif642@gmail.com> wrote:
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 08/05/2025 06:41, Yafang Shao wrote:
> > >>>>> On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@gmail.com> wrote:
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> On 07/05/2025 16:57, Zi Yan wrote:
> > >>>>>>> On 7 May 2025, at 11:12, Usama Arif wrote:
> > >>>>>>>
> > >>>>>>>> On 07/05/2025 15:57, Zi Yan wrote:
> > >>>>>>>>> +Yafang, who is also looking at changing THP config at cgroup/container level.
> > >>>>>
> > >>>>> Thanks
> > >>>>>
> > >>>>>>>>>
> > >>>>>>>>> On 7 May 2025, at 10:00, Usama Arif wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Allowing override of global THP policy per process allows workloads
> > >>>>>>>>>> that have shown to benefit from hugepages to do so, without regressing
> > >>>>>>>>>> workloads that wouldn't benefit. This will allow such types of
> > >>>>>>>>>> workloads to be run/stacked on the same machine.
> > >>>>>>>>>>
> > >>>>>>>>>> It also helps in rolling out hugepages in hyperscaler configurations
> > >>>>>>>>>> for workloads that benefit from them, where a single THP policy is
> > >>>>>>>>>> likely to be used across the entire fleet, and prctl will help override it.
> > >>>>>>>>>>
> > >>>>>>>>>> An advantage of doing it via prctl vs creating a cgroup specific
> > >>>>>>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
> > >>>>>>>>>> that this will work even when there are no cgroups present, and my
> > >>>>>>>>>> understanding is there is a strong preference of cgroups controls being
> > >>>>>>>>>> hierarchical which usually means them having a numerical value.
> > >>>>>>>>>
> > >>>>>>>>> Hi Usama,
> > >>>>>>>>>
> > >>>>>>>>> Do you mind giving an example on how to change THP policy for a set of
> > >>>>>>>>> processes running in a container (under a cgroup)?
> > >>>>>>>>
> > >>>>>>>> Hi Zi,
> > >>>>>>>>
> > >>>>>>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
> > >>>>>>>> for processes in a cgroup is in the same way we enable KSM for the cgroup.
> > >>>>>>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
> > >>>>>>>> in exec-invoke.
> > >>>>>>>> This is at the start of the process, but you would already know at the start of the process
> > >>>>>>>> whether you want THP=always for it or not.
> > >>>>>>>>
> > >>>>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> > >>>>>>>
> > >>>>>>> You also need to add a new systemd.directives, e.g., MemoryTHP, to
> > >>>>>>> pass the THP enablement or disablement info from a systemd config file.
> > >>>>>>> And if you find those processes do not benefit from using THPs,
> > >>>>>>> you can just change the new "MemoryTHP" config and restart the processes.
> > >>>>>>>
> > >>>>>>> Am I getting it? Thanks.
> > >>>>>>>
> > >>>>>>
> > >>>>>> Yes, thats right. They would exactly the same as what we (Meta) do
> > >>>>>> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set,
> > >>>>>> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when
> > >>>>>> that is set, the prctl will be called at exec_invoke of the process [3].
> > >>>>>>
> > >>>>>> The systemd changes should be quite simple to do.
> > >>>>>>
> > >>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978
> > >>>>>> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151
> > >>>>>> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> > >>>>>
> > >>>>> This solution carries a risk: since prctl() does not require any
> > >>>>> capabilities, the task itself could call it and override your memory
> > >>>>> policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that
> > >>>>> capability is typically enabled by default in containers, leaving them
> > >>>>> still vulnerable.
> > >>>>>
> > >>>>> This approach might work for Kubernetes/container environments, but it
> > >>>>> would require substantial code changes to implement securely.
> > >>>>>
> > >>>>
> > >>>> You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE
> > >>>> already exists and the someone could use this to slow the process down. So the
> > >>>> approach this patch takes shouldn't be anymore of a security fix then what is already
> > >>>> exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE
> > >>>> should be used to restrict this.
> > >>>
> > >>> I believe we should at least require CAP_SYS_RESOURCE to enable THP,
> > >>> since it overrides global system settings. Alternatively,
> > >>> CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely
> > >>> certain.
> > >>
> > >> Hm, could you verbalize a concrete security concern?
> > >>
> > >> I've never really looked at the global settings as a hard policy, more
> > >> as picking a default for the workloads in the system. It's usually
> > >> `madvise' or `always', and MADV_HUGEPAGE and MADV_NOHUGEPAGE have long
> > >> existed to give applications the ability to refine the global choice.
> > >>
> > >> The prctl should probably respect `never' for consistency, but beyond
> > >> that I don't really see the concern, or how this would allow something
> > >> that isn't already possible.
> > >
> > > I would interpret the always, madvise, and never options as follows:
> > > - always
> > >    The sysadmin strongly recommends using THP. If a user does not
> > > want to use it, they must explicitly disable it.

I would call this "kernel mode" or "auto mode", where userspace should
*generally* not have to worry about huge pages, but with an option for
declaring the odd exceptional case.

Both madvise() and unprivileged prctl() currently work, and IMO should
continue to work, for declaring exceptions.

> > > - madvise
> > >   The sysadmin gently encourages the use of THP, but it is only
> > > enabled when explicitly requested by the application.

And this "user mode" or "manual mode", where applications self-manage
which parts of userspace they want to enroll.

Both madvise() and unprivileged prctl() should work here as well,
IMO. There is no policy or security difference between them, it's just
about granularity and usability.

> > > - never
> > >    The sysadmin discourages the use of THP, and "its use is only permitted
> > > with explicit approval" .

This one I don't quite agree with, and IMO conflicts with what David
is saying as well.

> > "never" so far means "no thps, no exceptions". We've had serious THP
> > issues in the past, where our workaround until we sorted out the issue
> > for affected customers was to force-disable THPs on that system during boot.
> 
> Right, that reflects the current behavior. What we aim to enhance is
> by adding the requirement that "its use is only permitted with
> explicit approval."

I think you're conflating a safety issue with a security issue.

David is saying there can be cases where the kernel is broken, and
"never" is a production escape hatch to disable the feature until a
kernel upgrade for the fix is possible. In such a case, it doesn't
make sense to override this decision based on any sort of workload
policy, privileged or not.

The way I understand you is that you want enrollment (and/or
self-management) only for blessed applications. Because you don't
generally trust workloads in the wild enough to switch the global
default away from "never", given the semantics of always/madvise.

To me this sounds like you'd need a different mode, call it "blessed";
with a privileged interface to control which applications are allowed
to madvise/prctl-enable.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-09 16:46                       ` Johannes Weiner
@ 2025-05-09 22:42                         ` David Hildenbrand
  2025-05-09 23:34                           ` Zi Yan
  2025-05-11  2:08                         ` Yafang Shao
  1 sibling, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2025-05-09 22:42 UTC (permalink / raw)
  To: Johannes Weiner, Yafang Shao
  Cc: Usama Arif, Zi Yan, Andrew Morton, linux-mm, shakeel.butt, riel,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	linux-kernel, kernel-team

>>>> - madvise
>>>>    The sysadmin gently encourages the use of THP, but it is only
>>>> enabled when explicitly requested by the application.
> 
> And this "user mode" or "manual mode", where applications self-manage
> which parts of userspace they want to enroll.
> 
> Both madvise() and unprivileged prctl() should work here as well,
> IMO. There is no policy or security difference between them, it's just
> about granularity and usability.
> 
>>>> - never
>>>>     The sysadmin discourages the use of THP, and "its use is only permitted
>>>> with explicit approval" .
> 
> This one I don't quite agree with, and IMO conflicts with what David
> is saying as well.

Yeah ... "never" does not mean "sometimes" in my reality :)

> 
>>> "never" so far means "no thps, no exceptions". We've had serious THP
>>> issues in the past, where our workaround until we sorted out the issue
>>> for affected customers was to force-disable THPs on that system during boot.
>>
>> Right, that reflects the current behavior. What we aim to enhance is
>> by adding the requirement that "its use is only permitted with
>> explicit approval."
> 
> I think you're conflating a safety issue with a security issue.
> 
> David is saying there can be cases where the kernel is broken, and
> "never" is a production escape hatch to disable the feature until a
> kernel upgrade for the fix is possible. In such a case, it doesn't
> make sense to override this decision based on any sort of workload
> policy, privileged or not.
> 
> The way I understand you is that you want enrollment (and/or
> self-management) only for blessed applications. Because you don't
> generally trust workloads in the wild enough to switch the global
> default away from "never", given the semantics of always/madvise.

Assuming "never" means "never" and "always" means "always" ( crazy, 
right? :) ), could be make use of "madvise" mode, which essentially 
means "VM_HUGEPAGE" takes control?

We'd need

a) A way to enable THP for a process. Changing the default/vma settings 
to VM_HUGEPAGE as discussed using a prctl could work.

b) A way to ignore VM_HUGEPAGE for processes. Maybe the existing prctl 
to force-disable THPs could work?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-09 22:42                         ` David Hildenbrand
@ 2025-05-09 23:34                           ` Zi Yan
  2025-05-11  8:15                             ` David Hildenbrand
  0 siblings, 1 reply; 29+ messages in thread
From: Zi Yan @ 2025-05-09 23:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Johannes Weiner, Yafang Shao, Usama Arif, Andrew Morton, linux-mm,
	shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, linux-kernel, kernel-team

On 9 May 2025, at 18:42, David Hildenbrand wrote:

>>>>> - madvise
>>>>>    The sysadmin gently encourages the use of THP, but it is only
>>>>> enabled when explicitly requested by the application.
>>
>> And this "user mode" or "manual mode", where applications self-manage
>> which parts of userspace they want to enroll.
>>
>> Both madvise() and unprivileged prctl() should work here as well,
>> IMO. There is no policy or security difference between them, it's just
>> about granularity and usability.
>>
>>>>> - never
>>>>>     The sysadmin discourages the use of THP, and "its use is only permitted
>>>>> with explicit approval" .
>>
>> This one I don't quite agree with, and IMO conflicts with what David
>> is saying as well.
>
> Yeah ... "never" does not mean "sometimes" in my reality :)
>
>>
>>>> "never" so far means "no thps, no exceptions". We've had serious THP
>>>> issues in the past, where our workaround until we sorted out the issue
>>>> for affected customers was to force-disable THPs on that system during boot.
>>>
>>> Right, that reflects the current behavior. What we aim to enhance is
>>> by adding the requirement that "its use is only permitted with
>>> explicit approval."
>>
>> I think you're conflating a safety issue with a security issue.
>>
>> David is saying there can be cases where the kernel is broken, and
>> "never" is a production escape hatch to disable the feature until a
>> kernel upgrade for the fix is possible. In such a case, it doesn't
>> make sense to override this decision based on any sort of workload
>> policy, privileged or not.
>>
>> The way I understand you is that you want enrollment (and/or
>> self-management) only for blessed applications. Because you don't
>> generally trust workloads in the wild enough to switch the global
>> default away from "never", given the semantics of always/madvise.
>
> Assuming "never" means "never" and "always" means "always" ( crazy, right? :) ), could be make use of "madvise" mode, which essentially means "VM_HUGEPAGE" takes control?
>
> We'd need
>
> a) A way to enable THP for a process. Changing the default/vma settings to VM_HUGEPAGE as discussed using a prctl could work.
>
> b) A way to ignore VM_HUGEPAGE for processes. Maybe the existing prctl to force-disable THPs could work?

This means process level control overrides VMA level control, which
overrides global control, right?

Intuitively, it should be that VMA level control overrides process level
control, which overrides global control, namely finer granularly control
precedes coarse one. But some apps might not use VMA level control
(e.g., madvise) carefully, we want to override that. Maybe ignoring VMA
level control is what we want?


--
Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-09 16:46                       ` Johannes Weiner
  2025-05-09 22:42                         ` David Hildenbrand
@ 2025-05-11  2:08                         ` Yafang Shao
  1 sibling, 0 replies; 29+ messages in thread
From: Yafang Shao @ 2025-05-11  2:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Hildenbrand, Usama Arif, Zi Yan, Andrew Morton, linux-mm,
	shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, linux-kernel, kernel-team

On Sat, May 10, 2025 at 12:47 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, May 09, 2025 at 05:43:10PM +0800, Yafang Shao wrote:
> > On Fri, May 9, 2025 at 5:31 PM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > On 09.05.25 11:24, Yafang Shao wrote:
> > > > On Fri, May 9, 2025 at 1:13 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >>
> > > >> On Fri, May 09, 2025 at 10:15:08AM +0800, Yafang Shao wrote:
> > > >>> On Fri, May 9, 2025 at 12:04 AM Usama Arif <usamaarif642@gmail.com> wrote:
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On 08/05/2025 06:41, Yafang Shao wrote:
> > > >>>>> On Thu, May 8, 2025 at 12:09 AM Usama Arif <usamaarif642@gmail.com> wrote:
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On 07/05/2025 16:57, Zi Yan wrote:
> > > >>>>>>> On 7 May 2025, at 11:12, Usama Arif wrote:
> > > >>>>>>>
> > > >>>>>>>> On 07/05/2025 15:57, Zi Yan wrote:
> > > >>>>>>>>> +Yafang, who is also looking at changing THP config at cgroup/container level.
> > > >>>>>
> > > >>>>> Thanks
> > > >>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> On 7 May 2025, at 10:00, Usama Arif wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Allowing override of global THP policy per process allows workloads
> > > >>>>>>>>>> that have shown to benefit from hugepages to do so, without regressing
> > > >>>>>>>>>> workloads that wouldn't benefit. This will allow such types of
> > > >>>>>>>>>> workloads to be run/stacked on the same machine.
> > > >>>>>>>>>>
> > > >>>>>>>>>> It also helps in rolling out hugepages in hyperscaler configurations
> > > >>>>>>>>>> for workloads that benefit from them, where a single THP policy is
> > > >>>>>>>>>> likely to be used across the entire fleet, and prctl will help override it.
> > > >>>>>>>>>>
> > > >>>>>>>>>> An advantage of doing it via prctl vs creating a cgroup specific
> > > >>>>>>>>>> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
> > > >>>>>>>>>> that this will work even when there are no cgroups present, and my
> > > >>>>>>>>>> understanding is there is a strong preference of cgroups controls being
> > > >>>>>>>>>> hierarchical which usually means them having a numerical value.
> > > >>>>>>>>>
> > > >>>>>>>>> Hi Usama,
> > > >>>>>>>>>
> > > >>>>>>>>> Do you mind giving an example on how to change THP policy for a set of
> > > >>>>>>>>> processes running in a container (under a cgroup)?
> > > >>>>>>>>
> > > >>>>>>>> Hi Zi,
> > > >>>>>>>>
> > > >>>>>>>> In our case, we create the processes in the cgroup via systemd. The way we will enable THP=always
> > > >>>>>>>> for processes in a cgroup is in the same way we enable KSM for the cgroup.
> > > >>>>>>>> The change in systemd would be very similar to the line in [1], where we would set prctl PR_SET_THP_ALWAYS
> > > >>>>>>>> in exec-invoke.
> > > >>>>>>>> This is at the start of the process, but you would already know at the start of the process
> > > >>>>>>>> whether you want THP=always for it or not.
> > > >>>>>>>>
> > > >>>>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> > > >>>>>>>
> > > >>>>>>> You also need to add a new systemd.directives, e.g., MemoryTHP, to
> > > >>>>>>> pass the THP enablement or disablement info from a systemd config file.
> > > >>>>>>> And if you find those processes do not benefit from using THPs,
> > > >>>>>>> you can just change the new "MemoryTHP" config and restart the processes.
> > > >>>>>>>
> > > >>>>>>> Am I getting it? Thanks.
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>> Yes, thats right. They would exactly the same as what we (Meta) do
> > > >>>>>> for KSM. So have MemoryTHP similar to MemroryKSM [1] and if MemoryTHP is set,
> > > >>>>>> the ExecContext->memory_thp would be set similar to memory_ksm [2], and when
> > > >>>>>> that is set, the prctl will be called at exec_invoke of the process [3].
> > > >>>>>>
> > > >>>>>> The systemd changes should be quite simple to do.
> > > >>>>>>
> > > >>>>>> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/man/systemd.exec.xml#L1978
> > > >>>>>> [2] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/dbus-execute.c#L2151
> > > >>>>>> [3] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5045
> > > >>>>>
> > > >>>>> This solution carries a risk: since prctl() does not require any
> > > >>>>> capabilities, the task itself could call it and override your memory
> > > >>>>> policy. While we could enforce CAP_SYS_RESOURCE to restrict this, that
> > > >>>>> capability is typically enabled by default in containers, leaving them
> > > >>>>> still vulnerable.
> > > >>>>>
> > > >>>>> This approach might work for Kubernetes/container environments, but it
> > > >>>>> would require substantial code changes to implement securely.
> > > >>>>>
> > > >>>>
> > > >>>> You can already change the memory policy with prctl, for e.g. PR_SET_THP_DISABLE
> > > >>>> already exists and the someone could use this to slow the process down. So the
> > > >>>> approach this patch takes shouldn't be anymore of a security fix then what is already
> > > >>>> exposed by the kernel. I think as you mentioned, if prctl is an issue CAP_SYS_RESOURCE
> > > >>>> should be used to restrict this.
> > > >>>
> > > >>> I believe we should at least require CAP_SYS_RESOURCE to enable THP,
> > > >>> since it overrides global system settings. Alternatively,
> > > >>> CAP_SYS_ADMIN might be even more appropriate, though I'm not entirely
> > > >>> certain.
> > > >>
> > > >> Hm, could you verbalize a concrete security concern?
> > > >>
> > > >> I've never really looked at the global settings as a hard policy, more
> > > >> as picking a default for the workloads in the system. It's usually
> > > >> `madvise' or `always', and MADV_HUGEPAGE and MADV_NOHUGEPAGE have long
> > > >> existed to give applications the ability to refine the global choice.
> > > >>
> > > >> The prctl should probably respect `never' for consistency, but beyond
> > > >> that I don't really see the concern, or how this would allow something
> > > >> that isn't already possible.
> > > >
> > > > I would interpret the always, madvise, and never options as follows:
> > > > - always
> > > >    The sysadmin strongly recommends using THP. If a user does not
> > > > want to use it, they must explicitly disable it.
>
> I would call this "kernel mode" or "auto mode", where userspace should
> *generally* not have to worry about huge pages, but with an option for
> declaring the odd exceptional case.
>
> Both madvise() and unprivileged prctl() currently work, and IMO should
> continue to work, for declaring exceptions.
>
> > > > - madvise
> > > >   The sysadmin gently encourages the use of THP, but it is only
> > > > enabled when explicitly requested by the application.
>
> And this "user mode" or "manual mode", where applications self-manage
> which parts of userspace they want to enroll.
>
> Both madvise() and unprivileged prctl() should work here as well,
> IMO. There is no policy or security difference between them, it's just
> about granularity and usability.
>
> > > > - never
> > > >    The sysadmin discourages the use of THP, and "its use is only permitted
> > > > with explicit approval" .
>
> This one I don't quite agree with, and IMO conflicts with what David
> is saying as well.
>
> > > "never" so far means "no thps, no exceptions". We've had serious THP
> > > issues in the past, where our workaround until we sorted out the issue
> > > for affected customers was to force-disable THPs on that system during boot.
> >
> > Right, that reflects the current behavior. What we aim to enhance is
> > by adding the requirement that "its use is only permitted with
> > explicit approval."
>
> I think you're conflating a safety issue with a security issue.

I appreciate the corrections. English isn't my first language, so I
occasionally don't use words as precisely as I'd like.

>
> David is saying there can be cases where the kernel is broken, and
> "never" is a production escape hatch to disable the feature until a
> kernel upgrade for the fix is possible. In such a case, it doesn't
> make sense to override this decision based on any sort of workload
> policy, privileged or not.
>
> The way I understand you is that you want enrollment (and/or
> self-management) only for blessed applications.

Right.

> Because you don't
> generally trust workloads in the wild enough to switch the global
> default away from "never", given the semantics of always/madvise.

Historically, we have always set it to 'never.' Due to concerns
stemming from past incidents, the sysadmins have been hesitant to
switch it to 'madvise.' However, we’ve now discovered that AI services
can gain significant performance benefits from it. As a solution, we
propose enabling THP exclusively for AI services while maintaining the
global setting as 'never.'

>
> To me this sounds like you'd need a different mode, call it "blessed";
> with a privileged interface to control which applications are allowed
> to madvise/prctl-enable.

This appears to be a viable solution.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-09 23:34                           ` Zi Yan
@ 2025-05-11  8:15                             ` David Hildenbrand
  2025-05-11 14:08                               ` Usama Arif
  0 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2025-05-11  8:15 UTC (permalink / raw)
  To: Zi Yan
  Cc: Johannes Weiner, Yafang Shao, Usama Arif, Andrew Morton, linux-mm,
	shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, linux-kernel, kernel-team

On 10.05.25 01:34, Zi Yan wrote:
> On 9 May 2025, at 18:42, David Hildenbrand wrote:
> 
>>>>>> - madvise
>>>>>>     The sysadmin gently encourages the use of THP, but it is only
>>>>>> enabled when explicitly requested by the application.
>>>
>>> And this "user mode" or "manual mode", where applications self-manage
>>> which parts of userspace they want to enroll.
>>>
>>> Both madvise() and unprivileged prctl() should work here as well,
>>> IMO. There is no policy or security difference between them, it's just
>>> about granularity and usability.
>>>
>>>>>> - never
>>>>>>      The sysadmin discourages the use of THP, and "its use is only permitted
>>>>>> with explicit approval" .
>>>
>>> This one I don't quite agree with, and IMO conflicts with what David
>>> is saying as well.
>>
>> Yeah ... "never" does not mean "sometimes" in my reality :)
>>
>>>
>>>>> "never" so far means "no thps, no exceptions". We've had serious THP
>>>>> issues in the past, where our workaround until we sorted out the issue
>>>>> for affected customers was to force-disable THPs on that system during boot.
>>>>
>>>> Right, that reflects the current behavior. What we aim to enhance is
>>>> by adding the requirement that "its use is only permitted with
>>>> explicit approval."
>>>
>>> I think you're conflating a safety issue with a security issue.
>>>
>>> David is saying there can be cases where the kernel is broken, and
>>> "never" is a production escape hatch to disable the feature until a
>>> kernel upgrade for the fix is possible. In such a case, it doesn't
>>> make sense to override this decision based on any sort of workload
>>> policy, privileged or not.
>>>
>>> The way I understand you is that you want enrollment (and/or
>>> self-management) only for blessed applications. Because you don't
>>> generally trust workloads in the wild enough to switch the global
>>> default away from "never", given the semantics of always/madvise.
>>
>> Assuming "never" means "never" and "always" means "always" ( crazy, right? :) ), could be make use of "madvise" mode, which essentially means "VM_HUGEPAGE" takes control?
>>
>> We'd need
>>
>> a) A way to enable THP for a process. Changing the default/vma settings to VM_HUGEPAGE as discussed using a prctl could work.
>>
>> b) A way to ignore VM_HUGEPAGE for processes. Maybe the existing prctl to force-disable THPs could work?
> 
> This means process level control overrides VMA level control, which
> overrides global control, right?
> 
> Intuitively, it should be that VMA level control overrides process level
> control, which overrides global control, namely finer granularly control
> precedes coarse one. But some apps might not use VMA level control
> (e.g., madvise) carefully, we want to override that. Maybe ignoring VMA
> level control is what we want?

Let's take a step back:

Current behavior is

1) If anybody (global / process / VM) says "never" 
(never/PR_SET_THP_DISABLE/VM_NOHUGEPAGE), the behavior is "never".

2) In "madvise" mode, only "VM_HUGEPAGE" gets THP unless 
PR_SET_THP_DISABLE is set. So per-process overrides per-VMA.

3) In "always" mode, everything gets THP unless per-VMA (VM_NOHUGEPAGE) 
or per-process (PR_SET_THP_DISABLE) disables it.


Interestingly, PR_SET_THP_DISABLE used to mimic exactly what I proposed 
for the other direction (change default of VM_HUGEPAGE), except that it 
wouldn't modify already existing mappings. Worth looking at 
1860033237d4. Not sure if that commit was the right call, but it's the 
semantics we have today.

That commit notes:

"It should be noted, that the new implementation makes 
PR_SET_THP_DISABLE master override to any per-VMA setting, which was not 
the case previously."


Whatever we do, we have to be careful to not create more mess or 
inconsistency.

Especially, if anybody sets VM_NOHUGEPAGE or PR_SET_THP_DISABLE, we must 
not use THPs, ever.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-11  8:15                             ` David Hildenbrand
@ 2025-05-11 14:08                               ` Usama Arif
  2025-05-13 11:43                                 ` Yafang Shao
  2025-05-13 12:04                                 ` David Hildenbrand
  0 siblings, 2 replies; 29+ messages in thread
From: Usama Arif @ 2025-05-11 14:08 UTC (permalink / raw)
  To: David Hildenbrand, Zi Yan
  Cc: Johannes Weiner, Yafang Shao, Andrew Morton, linux-mm,
	shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, linux-kernel, kernel-team

On 11/05/2025 09:15, David Hildenbrand wrote:
> On 10.05.25 01:34, Zi Yan wrote:
>> On 9 May 2025, at 18:42, David Hildenbrand wrote:
>>
>>>>>>> - madvise
>>>>>>>     The sysadmin gently encourages the use of THP, but it is only
>>>>>>> enabled when explicitly requested by the application.
>>>>
>>>> And this "user mode" or "manual mode", where applications self-manage
>>>> which parts of userspace they want to enroll.
>>>>
>>>> Both madvise() and unprivileged prctl() should work here as well,
>>>> IMO. There is no policy or security difference between them, it's just
>>>> about granularity and usability.
>>>>
>>>>>>> - never
>>>>>>>      The sysadmin discourages the use of THP, and "its use is only permitted
>>>>>>> with explicit approval" .
>>>>
>>>> This one I don't quite agree with, and IMO conflicts with what David
>>>> is saying as well.
>>>
>>> Yeah ... "never" does not mean "sometimes" in my reality :)
>>>
>>>>
>>>>>> "never" so far means "no thps, no exceptions". We've had serious THP
>>>>>> issues in the past, where our workaround until we sorted out the issue
>>>>>> for affected customers was to force-disable THPs on that system during boot.
>>>>>
>>>>> Right, that reflects the current behavior. What we aim to enhance is
>>>>> by adding the requirement that "its use is only permitted with
>>>>> explicit approval."
>>>>
>>>> I think you're conflating a safety issue with a security issue.
>>>>
>>>> David is saying there can be cases where the kernel is broken, and
>>>> "never" is a production escape hatch to disable the feature until a
>>>> kernel upgrade for the fix is possible. In such a case, it doesn't
>>>> make sense to override this decision based on any sort of workload
>>>> policy, privileged or not.
>>>>
>>>> The way I understand you is that you want enrollment (and/or
>>>> self-management) only for blessed applications. Because you don't
>>>> generally trust workloads in the wild enough to switch the global
>>>> default away from "never", given the semantics of always/madvise.
>>>
>>> Assuming "never" means "never" and "always" means "always" ( crazy, right? :) ), could be make use of "madvise" mode, which essentially means "VM_HUGEPAGE" takes control?
>>>
>>> We'd need
>>>
>>> a) A way to enable THP for a process. Changing the default/vma settings to VM_HUGEPAGE as discussed using a prctl could work.
>>>
>>> b) A way to ignore VM_HUGEPAGE for processes. Maybe the existing prctl to force-disable THPs could work?
>>
>> This means process level control overrides VMA level control, which
>> overrides global control, right?
>>
>> Intuitively, it should be that VMA level control overrides process level
>> control, which overrides global control, namely finer granularly control
>> precedes coarse one. But some apps might not use VMA level control
>> (e.g., madvise) carefully, we want to override that. Maybe ignoring VMA
>> level control is what we want?
> 
> Let's take a step back:
> 
> Current behavior is
> 
> 1) If anybody (global / process / VM) says "never" (never/PR_SET_THP_DISABLE/VM_NOHUGEPAGE), the behavior is "never".

Just to add here to the current behavior for completeness, if we have the global system setting set to never,
but the global hugepage level setting set to madvise, we do still get a THP, i.e. if I have:

[root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
[root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
always inherit [madvise] never

And then MADV_HUGEPAGE some region, it still gives me a THP.

> 
> 2) In "madvise" mode, only "VM_HUGEPAGE" gets THP unless PR_SET_THP_DISABLE is set. So per-process overrides per-VMA.
> 
> 3) In "always" mode, everything gets THP unless per-VMA (VM_NOHUGEPAGE) or per-process (PR_SET_THP_DISABLE) disables it.
> 
> 
> Interestingly, PR_SET_THP_DISABLE used to mimic exactly what I proposed for the other direction (change default of VM_HUGEPAGE), except that it wouldn't modify already existing mappings. Worth looking at 1860033237d4. Not sure if that commit was the right call, but it's the semantics we have today.
> 
> That commit notes:
> 
> "It should be noted, that the new implementation makes PR_SET_THP_DISABLE master override to any per-VMA setting, which was not the case previously."
> 
> 
> Whatever we do, we have to be careful to not create more mess or inconsistency.
> 
> Especially, if anybody sets VM_NOHUGEPAGE or PR_SET_THP_DISABLE, we must not use THPs, ever.
> 

I thought I will also summarize what the real world usecases are that we want to solve:

1) global system policy=madvise, process wants "always" policy for itself: We can have different types of workloads stacked on the same host, some of them benefit from always having THPs,
others will incur a regression (either its a performance regression or they are completely memory bound and even a very slight increase in memory will cause them to OOM).
So we want to selectively have "always" set for just those workloads (processes). (This is how workloads are deployed in our (Metas) fleet at this moment.)

2) global system policy=always, process wants "madvise" policy for itself: Same reasoning as 1, just that the host has a different default policy and we don't want the workloads (processes) that regress with always getting THPs to do so.
(We hope this is us (meta) in the future, if a majority of workloads show that they benefit from always, we flip the default host setting to "always" and workloads that regress can opt-out and be "madvise".
New services developed will then be tested with always by default. Always is also the default defconfig option upstream, so I would imagine this is faced by others as well.)

3) global system policy=never, process wants "madvise" policy for itself: This is what Yafang mentioned in [1]. sysadmins dont want to switch the global policy to madvise, but are willing to accept certain processes to madvise.
But David mentioned in [2] that never means no thps, no exceptions and the only way to solve some issues in the past has been to disable THPs completely.

Please feel free to add to the above list. I thought it would be good to list them out so that the solution can be derived with them in mind.

In terms of doing this with prctl, I was able to make prototypes for the 2 approaches that have been discussed:

a) have prctl change how existing and new VMAs have VM_HUGEPAGE set for the process including after fork+exec, as proposed by David. This prototype is available at [3]. This will solve problem 1 discussed above, but I don't think this
approach can be used to solve problems 2 and 3? There isnt a way where we can have a process change VMA setting so that after prctl, all future allocations are on madvise basis and not global policy (even if always). IOW we will need
some change in thp_vma_allowable_orders to have it done on process level basis.

b) have prctl override global policy *only* for hugepages that "inherit" global and only if global is "madvise" or "always". This prototype is available at [4]. The way I did this will achieve usecase 1 and 2, but not 3 (It can very easily be modified to get 3, but didn't do it as there maybe still is a discussion on what should be allowed when global=never?). I do prefer this method as I think it might be simpler overall and achieves both usecases.

[1] https://lore.kernel.org/all/CALOAHbBAVELx-fwyoQUH_ypFvT_Zd5ZLjSkAPXxShgCua8ifpA@mail.gmail.com/
[2] https://lore.kernel.org/all/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/
[3] https://github.com/uarif1/linux/commit/209373cdeda93a55a699e2eee29f88f4e64ac8a5
[4] https://github.com/uarif1/linux/commit/e85c8edbcb4165c00026f0058b71e85f77da23f4

Thanks,
Usama

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-11 14:08                               ` Usama Arif
@ 2025-05-13 11:43                                 ` Yafang Shao
  2025-05-13 12:04                                 ` David Hildenbrand
  1 sibling, 0 replies; 29+ messages in thread
From: Yafang Shao @ 2025-05-13 11:43 UTC (permalink / raw)
  To: Usama Arif
  Cc: David Hildenbrand, Zi Yan, Johannes Weiner, Andrew Morton,
	linux-mm, shakeel.butt, riel, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, linux-kernel, kernel-team

On Sun, May 11, 2025 at 10:08 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 11/05/2025 09:15, David Hildenbrand wrote:
> > On 10.05.25 01:34, Zi Yan wrote:
> >> On 9 May 2025, at 18:42, David Hildenbrand wrote:
> >>
> >>>>>>> - madvise
> >>>>>>>     The sysadmin gently encourages the use of THP, but it is only
> >>>>>>> enabled when explicitly requested by the application.
> >>>>
> >>>> And this "user mode" or "manual mode", where applications self-manage
> >>>> which parts of userspace they want to enroll.
> >>>>
> >>>> Both madvise() and unprivileged prctl() should work here as well,
> >>>> IMO. There is no policy or security difference between them, it's just
> >>>> about granularity and usability.
> >>>>
> >>>>>>> - never
> >>>>>>>      The sysadmin discourages the use of THP, and "its use is only permitted
> >>>>>>> with explicit approval" .
> >>>>
> >>>> This one I don't quite agree with, and IMO conflicts with what David
> >>>> is saying as well.
> >>>
> >>> Yeah ... "never" does not mean "sometimes" in my reality :)
> >>>
> >>>>
> >>>>>> "never" so far means "no thps, no exceptions". We've had serious THP
> >>>>>> issues in the past, where our workaround until we sorted out the issue
> >>>>>> for affected customers was to force-disable THPs on that system during boot.
> >>>>>
> >>>>> Right, that reflects the current behavior. What we aim to enhance is
> >>>>> by adding the requirement that "its use is only permitted with
> >>>>> explicit approval."
> >>>>
> >>>> I think you're conflating a safety issue with a security issue.
> >>>>
> >>>> David is saying there can be cases where the kernel is broken, and
> >>>> "never" is a production escape hatch to disable the feature until a
> >>>> kernel upgrade for the fix is possible. In such a case, it doesn't
> >>>> make sense to override this decision based on any sort of workload
> >>>> policy, privileged or not.
> >>>>
> >>>> The way I understand you is that you want enrollment (and/or
> >>>> self-management) only for blessed applications. Because you don't
> >>>> generally trust workloads in the wild enough to switch the global
> >>>> default away from "never", given the semantics of always/madvise.
> >>>
> >>> Assuming "never" means "never" and "always" means "always" ( crazy, right? :) ), could be make use of "madvise" mode, which essentially means "VM_HUGEPAGE" takes control?
> >>>
> >>> We'd need
> >>>
> >>> a) A way to enable THP for a process. Changing the default/vma settings to VM_HUGEPAGE as discussed using a prctl could work.
> >>>
> >>> b) A way to ignore VM_HUGEPAGE for processes. Maybe the existing prctl to force-disable THPs could work?
> >>
> >> This means process level control overrides VMA level control, which
> >> overrides global control, right?
> >>
> >> Intuitively, it should be that VMA level control overrides process level
> >> control, which overrides global control, namely finer granularly control
> >> precedes coarse one. But some apps might not use VMA level control
> >> (e.g., madvise) carefully, we want to override that. Maybe ignoring VMA
> >> level control is what we want?
> >
> > Let's take a step back:
> >
> > Current behavior is
> >
> > 1) If anybody (global / process / VM) says "never" (never/PR_SET_THP_DISABLE/VM_NOHUGEPAGE), the behavior is "never".
>
> Just to add here to the current behavior for completeness, if we have the global system setting set to never,
> but the global hugepage level setting set to madvise, we do still get a THP, i.e. if I have:
>
> [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
> [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> always inherit [madvise] never
>
> And then MADV_HUGEPAGE some region, it still gives me a THP.
>
> >
> > 2) In "madvise" mode, only "VM_HUGEPAGE" gets THP unless PR_SET_THP_DISABLE is set. So per-process overrides per-VMA.
> >
> > 3) In "always" mode, everything gets THP unless per-VMA (VM_NOHUGEPAGE) or per-process (PR_SET_THP_DISABLE) disables it.
> >
> >
> > Interestingly, PR_SET_THP_DISABLE used to mimic exactly what I proposed for the other direction (change default of VM_HUGEPAGE), except that it wouldn't modify already existing mappings. Worth looking at 1860033237d4. Not sure if that commit was the right call, but it's the semantics we have today.
> >
> > That commit notes:
> >
> > "It should be noted, that the new implementation makes PR_SET_THP_DISABLE master override to any per-VMA setting, which was not the case previously."
> >
> >
> > Whatever we do, we have to be careful to not create more mess or inconsistency.
> >
> > Especially, if anybody sets VM_NOHUGEPAGE or PR_SET_THP_DISABLE, we must not use THPs, ever.
> >
>
>
> I thought I will also summarize what the real world usecases are that we want to solve:
>
> 1) global system policy=madvise, process wants "always" policy for itself: We can have different types of workloads stacked on the same host, some of them benefit from always having THPs,
> others will incur a regression (either its a performance regression or they are completely memory bound and even a very slight increase in memory will cause them to OOM).
> So we want to selectively have "always" set for just those workloads (processes). (This is how workloads are deployed in our (Metas) fleet at this moment.)
>
> 2) global system policy=always, process wants "madvise" policy for itself: Same reasoning as 1, just that the host has a different default policy and we don't want the workloads (processes) that regress with always getting THPs to do so.
> (We hope this is us (meta) in the future, if a majority of workloads show that they benefit from always, we flip the default host setting to "always" and workloads that regress can opt-out and be "madvise".
> New services developed will then be tested with always by default. Always is also the default defconfig option upstream, so I would imagine this is faced by others as well.)
>
> 3) global system policy=never, process wants "madvise" policy for itself: This is what Yafang mentioned in [1]. sysadmins dont want to switch the global policy to madvise, but are willing to accept certain processes to madvise.
> But David mentioned in [2] that never means no thps, no exceptions and the only way to solve some issues in the past has been to disable THPs completely.

The interpretation of never as "disable THPs completely" makes sense
to me, as sysadmins need a guaranteed way to disable THP. If
applications could still allocate THPs in never mode, it would
frustrate sysadmins.

For case 3), I agree with Johannes that introducing a new mode (e.g.,
"blessed" or "bpf") would be ideal. This mode would allow per-task THP
policy adjustments via BPF programs, defaulting to never when no BPF
program is attached.

[0]. https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
  2025-05-11 14:08                               ` Usama Arif
  2025-05-13 11:43                                 ` Yafang Shao
@ 2025-05-13 12:04                                 ` David Hildenbrand
  1 sibling, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-05-13 12:04 UTC (permalink / raw)
  To: Usama Arif, Zi Yan
  Cc: Johannes Weiner, Yafang Shao, Andrew Morton, linux-mm,
	shakeel.butt, riel, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, linux-kernel, kernel-team

>>> This means process level control overrides VMA level control, which
>>> overrides global control, right?
>>>
>>> Intuitively, it should be that VMA level control overrides process level
>>> control, which overrides global control, namely finer granularly control
>>> precedes coarse one. But some apps might not use VMA level control
>>> (e.g., madvise) carefully, we want to override that. Maybe ignoring VMA
>>> level control is what we want?
>>
>> Let's take a step back:
>>
>> Current behavior is
>>
>> 1) If anybody (global / process / VM) says "never" (never/PR_SET_THP_DISABLE/VM_NOHUGEPAGE), the behavior is "never".
> 
> Just to add here to the current behavior for completeness, if we have the global system setting set to never,
> but the global hugepage level setting set to madvise, we do still get a THP, i.e. if I have:
> 
> [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
> [root@vm4 vmuser]# cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> always inherit [madvise] never
> 
> And then MADV_HUGEPAGE some region, it still gives me a THP.

Yes. These "global" / "system" toggles are to be considered "one 
setting". If you set a per-size toggle to something that is not 
"inherit", then you're telling the system that you want more fine-grain 
control.

In retrospective, we could maybe have made 
"/sys/kernel/mm/transparent_hugepage/enabled" a symlink to 
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled, but we 
didn't think of that option back then and decided for the "inherit" 
solution.

> 
>>
>> 2) In "madvise" mode, only "VM_HUGEPAGE" gets THP unless PR_SET_THP_DISABLE is set. So per-process overrides per-VMA.
>>
>> 3) In "always" mode, everything gets THP unless per-VMA (VM_NOHUGEPAGE) or per-process (PR_SET_THP_DISABLE) disables it.
>>
>>
>> Interestingly, PR_SET_THP_DISABLE used to mimic exactly what I proposed for the other direction (change default of VM_HUGEPAGE), except that it wouldn't modify already existing mappings. Worth looking at 1860033237d4. Not sure if that commit was the right call, but it's the semantics we have today.
>>
>> That commit notes:
>>
>> "It should be noted, that the new implementation makes PR_SET_THP_DISABLE master override to any per-VMA setting, which was not the case previously."
>>
>>
>> Whatever we do, we have to be careful to not create more mess or inconsistency.
>>
>> Especially, if anybody sets VM_NOHUGEPAGE or PR_SET_THP_DISABLE, we must not use THPs, ever.
>>
> 
> 
> I thought I will also summarize what the real world usecases are that we want to solve:
> 
> 1) global system policy=madvise, process wants "always" policy for itself: We can have different types of workloads stacked on the same host, some of them benefit from always having THPs,
> others will incur a regression (either its a performance regression or they are completely memory bound and even a very slight increase in memory will cause them to OOM).
> So we want to selectively have "always" set for just those workloads (processes). (This is how workloads are deployed in our (Metas) fleet at this moment.)

Agreed.

Similar to the process setting VM_HUGEPAGE on all VMAs where we do want 
VMAs.

> 
> 2) global system policy=always, process wants "madvise" policy for itself: Same reasoning as 1, just that the host has a different default policy and we don't want the workloads (processes) that regress with always getting THPs to do so.
> (We hope this is us (meta) in the future, if a majority of workloads show that they benefit from always, we flip the default host setting to "always" and workloads that regress can opt-out and be "madvise".
> New services developed will then be tested with always by default. Always is also the default defconfig option upstream, so I would imagine this is faced by others as well.)
> 

Understood.

Similar to the process setting VM_NOHUGEPAGE on all VMAs where we don't 
want THPs.

> 3) global system policy=never, process wants "madvise" policy for itself: This is what Yafang mentioned in [1]. sysadmins dont want to switch the global policy to madvise, but are willing to accept certain processes to madvise.
> But David mentioned in [2] that never means no thps, no exceptions and the only way to solve some issues in the past has been to disable THPs completely.

Yes.

Similar to setting on all processes where we don't want any THPs 
PR_SET_THP_DISABLE.

> 
> Please feel free to add to the above list. I thought it would be good to list them out so that the solution can be derived with them in mind.
> 
> In terms of doing this with prctl, I was able to make prototypes for the 2 approaches that have been discussed:
> 
> a) have prctl change how existing and new VMAs have VM_HUGEPAGE set for the process including after fork+exec, as proposed by David. This prototype is available at [3]. This will solve problem 1 discussed above, but I don't think this
> approach can be used to solve problems 2 and 3? There isnt a way where we can have a process change VMA setting so that after prctl, all future allocations are on madvise basis and not global policy (even if always). IOW we will need
> some change in thp_vma_allowable_orders to have it done on process level basis.

Let's assume we change PR_SET_THP_DISABLE to per-VMA handling as well, 
we could have for the use cases

1) system policy=madvise. Set (new) PR_SET_THP_ENABLE on the process.

Afterwards, MADV_NOHUGEPAGE could be used to fine-tune the VMAs. (e.g., 
explicitly temporarily disable on some areas, for example, required with 
some uffd scenarios)

2) system policy=always. Set PR_SET_THP_DISABLE, then only enable it for 
the VMAs using MADV_HUGEPAGE.

3) system policy=madvise/always. Set PR_SET_THP_DISABLE on all processes 
where we don't want THPs.


In case of 3) nothing would stop the process from re-enabling THPs 
either using the prctl or madvise(). If that's a problem, 
PR_SET_THP_DISABLE_LOCKED or sth. like that could be used.


But again, just a thought on how to work with what we already have, 
trying not to create an absolute mess.

> 
> b) have prctl override global policy *only* for hugepages that "inherit" global and only if global is "madvise" or "always". This prototype is available at [4]. The way I did this will achieve usecase 1 and 2, but not 3 (It can very easily be modified to get 3, but didn't do it as there maybe still is a discussion on what should be allowed when global=never?). I do prefer this method as I think it might be simpler overall and achieves both usecases.


I'm afraid that will create a mess. :/


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2025-05-13 12:05 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-07 14:00 [PATCH 0/1] prctl: allow overriding system THP policy to always Usama Arif
2025-05-07 14:00 ` [PATCH 1/1] prctl: allow overriding system THP policy to always per process Usama Arif
2025-05-07 15:02   ` Usama Arif
2025-05-07 20:14   ` Zi Yan
2025-05-08 10:53     ` Usama Arif
2025-05-08 20:29       ` Zi Yan
2025-05-07 14:57 ` [PATCH 0/1] prctl: allow overriding system THP policy to always Zi Yan
2025-05-07 15:12   ` Usama Arif
2025-05-07 15:57     ` Zi Yan
2025-05-07 16:09       ` Usama Arif
2025-05-08  5:41         ` Yafang Shao
2025-05-08 16:04           ` Usama Arif
2025-05-09  2:15             ` Yafang Shao
2025-05-09  5:13               ` Johannes Weiner
2025-05-09  9:24                 ` Yafang Shao
2025-05-09  9:30                   ` David Hildenbrand
2025-05-09  9:43                     ` Yafang Shao
2025-05-09 16:46                       ` Johannes Weiner
2025-05-09 22:42                         ` David Hildenbrand
2025-05-09 23:34                           ` Zi Yan
2025-05-11  8:15                             ` David Hildenbrand
2025-05-11 14:08                               ` Usama Arif
2025-05-13 11:43                                 ` Yafang Shao
2025-05-13 12:04                                 ` David Hildenbrand
2025-05-11  2:08                         ` Yafang Shao
2025-05-08 11:06 ` David Hildenbrand
2025-05-08 16:35   ` Usama Arif
2025-05-08 17:39     ` David Hildenbrand
2025-05-08 18:05       ` Usama Arif

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).