* [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise
2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
@ 2025-05-19 22:29 ` Usama Arif
2025-05-20 9:51 ` kernel test robot
2025-05-20 14:43 ` Lorenzo Stoakes
2025-05-19 22:29 ` [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process Usama Arif
` (8 subsequent siblings)
9 siblings, 2 replies; 25+ messages in thread
From: Usama Arif @ 2025-05-19 22:29 UTC
To: Andrew Morton, david, linux-mm
Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, vbabka,
jannh, Arnd Bergmann, linux-kernel, linux-doc, kernel-team,
Usama Arif
This is so that flag setting can be resused later in other functions,
to reduce code duplication (including the s390 exception).
No functional change intended with this patch.
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
include/linux/huge_mm.h | 1 +
mm/khugepaged.c | 26 +++++++++++++++++---------
2 files changed, 18 insertions(+), 9 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..23580a43787c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -431,6 +431,7 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
__split_huge_pud(__vma, __pud, __address); \
} while (0)
+int hugepage_set_vmflags(unsigned long *vm_flags, int advice);
int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
int advice);
int madvise_collapse(struct vm_area_struct *vma,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b04b6a770afe..ab3427c87422 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -346,8 +346,7 @@ struct attribute_group khugepaged_attr_group = {
};
#endif /* CONFIG_SYSFS */
-int hugepage_madvise(struct vm_area_struct *vma,
- unsigned long *vm_flags, int advice)
+int hugepage_set_vmflags(unsigned long *vm_flags, int advice)
{
switch (advice) {
case MADV_HUGEPAGE:
@@ -358,16 +357,10 @@ int hugepage_madvise(struct vm_area_struct *vma,
* ignore the madvise to prevent qemu from causing a SIGSEGV.
*/
if (mm_has_pgste(vma->vm_mm))
- return 0;
+ return -EPERM;
#endif
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;
- /*
- * If the vma become good for khugepaged to scan,
- * register it here without waiting a page fault that
- * may not happen any time soon.
- */
- khugepaged_enter_vma(vma, *vm_flags);
break;
case MADV_NOHUGEPAGE:
*vm_flags &= ~VM_HUGEPAGE;
@@ -383,6 +376,21 @@ int hugepage_madvise(struct vm_area_struct *vma,
return 0;
}
+int hugepage_madvise(struct vm_area_struct *vma,
+ unsigned long *vm_flags, int advice)
+{
+ if (advice == MADV_HUGEPAGE && !hugepage_set_vmflags(vm_flags, advice)) {
+ /*
+ * If the vma become good for khugepaged to scan,
+ * register it here without waiting a page fault that
+ * may not happen any time soon.
+ */
+ khugepaged_enter_vma(vma, *vm_flags);
+ }
+
+ return 0;
+}
+
int __init khugepaged_init(void)
{
mm_slot_cache = KMEM_CACHE(khugepaged_mm_slot, 0);
--
2.47.1
* Re: [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise
2025-05-19 22:29 ` [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise Usama Arif
@ 2025-05-20 9:51 ` kernel test robot
2025-05-20 14:43 ` Lorenzo Stoakes
1 sibling, 0 replies; 25+ messages in thread
From: kernel test robot @ 2025-05-20 9:51 UTC
To: Usama Arif, Andrew Morton, david
Cc: llvm, oe-kbuild-all, Linux Memory Management List, hannes,
shakeel.butt, riel, ziy, laoar.shao, baolin.wang, lorenzo.stoakes,
Liam.Howlett, npache, ryan.roberts, vbabka, jannh, Arnd Bergmann,
linux-kernel, linux-doc, kernel-team, Usama Arif
Hi Usama,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on perf-tools-next/perf-tools-next tip/perf/core perf-tools/perf-tools linus/master acme/perf/core v6.15-rc7 next-20250516]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/mm-khugepaged-extract-vm-flag-setting-outside-of-hugepage_madvise/20250520-063452
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250519223307.3601786-2-usamaarif642%40gmail.com
patch subject: [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise
config: s390-randconfig-002-20250520 (https://download.01.org/0day-ci/archive/20250520/202505201734.8Fyk3qKi-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project f819f46284f2a79790038e1f6649172789734ae8)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250520/202505201734.8Fyk3qKi-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505201734.8Fyk3qKi-lkp@intel.com/
All errors (new ones prefixed by >>):
>> mm/khugepaged.c:359:20: error: use of undeclared identifier 'vma'
359 | if (mm_has_pgste(vma->vm_mm))
| ^
1 error generated.
vim +/vma +359 mm/khugepaged.c
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 348
d2a8f83f11a4ba Usama Arif 2025-05-19 349 int hugepage_set_vmflags(unsigned long *vm_flags, int advice)
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 350 {
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 351 switch (advice) {
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 352 case MADV_HUGEPAGE:
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 353 #ifdef CONFIG_S390
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 354 /*
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 355 * qemu blindly sets MADV_HUGEPAGE on all allocations, but s390
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 356 * can't handle this properly after s390_enable_sie, so we simply
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 357 * ignore the madvise to prevent qemu from causing a SIGSEGV.
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 358 */
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 @359 if (mm_has_pgste(vma->vm_mm))
d2a8f83f11a4ba Usama Arif 2025-05-19 360 return -EPERM;
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 361 #endif
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 362 *vm_flags &= ~VM_NOHUGEPAGE;
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 363 *vm_flags |= VM_HUGEPAGE;
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 364 break;
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 365 case MADV_NOHUGEPAGE:
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 366 *vm_flags &= ~VM_HUGEPAGE;
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 367 *vm_flags |= VM_NOHUGEPAGE;
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 368 /*
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 369 * Setting VM_NOHUGEPAGE will prevent khugepaged from scanning
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 370 * this vma even if we leave the mm registered in khugepaged if
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 371 * it got registered before VM_NOHUGEPAGE was set.
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 372 */
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 373 break;
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 374 }
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 375
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 376 return 0;
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 377 }
b46e756f5e4703 Kirill A. Shutemov 2016-07-26 378
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise
2025-05-19 22:29 ` [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise Usama Arif
2025-05-20 9:51 ` kernel test robot
@ 2025-05-20 14:43 ` Lorenzo Stoakes
2025-05-20 14:57 ` Usama Arif
1 sibling, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 14:43 UTC
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
vbabka, jannh, Arnd Bergmann, linux-kernel, linux-doc,
kernel-team
This commit message is really poor. You're also not mentioning that you're
changing s390 behaviour?
On Mon, May 19, 2025 at 11:29:53PM +0100, Usama Arif wrote:
> This is so that flag setting can be resused later in other functions,
Typo.
> to reduce code duplication (including the s390 exception).
>
> No functional change intended with this patch.
I'm pretty sure somebody reviewed that this should just be merged with whatever
uses this? I'm not sure this is all that valuable as you're not really changing
this structurally very much.
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Yeah I'm not a fan of this patch, it's buggy and really unclear what the
purpose is here.
> ---
> include/linux/huge_mm.h | 1 +
> mm/khugepaged.c | 26 +++++++++++++++++---------
> 2 files changed, 18 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..23580a43787c 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -431,6 +431,7 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
> __split_huge_pud(__vma, __pud, __address); \
> } while (0)
>
> +int hugepage_set_vmflags(unsigned long *vm_flags, int advice);
> int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> int advice);
> int madvise_collapse(struct vm_area_struct *vma,
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b04b6a770afe..ab3427c87422 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -346,8 +346,7 @@ struct attribute_group khugepaged_attr_group = {
> };
> #endif /* CONFIG_SYSFS */
>
> -int hugepage_madvise(struct vm_area_struct *vma,
> - unsigned long *vm_flags, int advice)
> +int hugepage_set_vmflags(unsigned long *vm_flags, int advice)
> {
> switch (advice) {
> case MADV_HUGEPAGE:
> @@ -358,16 +357,10 @@ int hugepage_madvise(struct vm_area_struct *vma,
> * ignore the madvise to prevent qemu from causing a SIGSEGV.
> */
> if (mm_has_pgste(vma->vm_mm))
This is broken, you refer to vma which doesn't exist.
As the kernel bots are telling you...
> - return 0;
> + return -EPERM;
Why are you now returning an error?
This seems like a super broken way of making the caller return 0. Just make this
whole thing a bool return if you're going to treat it like a boolean function.
> #endif
> *vm_flags &= ~VM_NOHUGEPAGE;
> *vm_flags |= VM_HUGEPAGE;
> - /*
> - * If the vma become good for khugepaged to scan,
> - * register it here without waiting a page fault that
> - * may not happen any time soon.
> - */
> - khugepaged_enter_vma(vma, *vm_flags);
> break;
> case MADV_NOHUGEPAGE:
> *vm_flags &= ~VM_HUGEPAGE;
> @@ -383,6 +376,21 @@ int hugepage_madvise(struct vm_area_struct *vma,
> return 0;
> }
>
> +int hugepage_madvise(struct vm_area_struct *vma,
> + unsigned long *vm_flags, int advice)
> +{
> + if (advice == MADV_HUGEPAGE && !hugepage_set_vmflags(vm_flags, advice)) {
So now you've completely broken MADV_NOHUGEPAGE haven't you?
> + /*
> + * If the vma become good for khugepaged to scan,
> + * register it here without waiting a page fault that
> + * may not happen any time soon.
> + */
> + khugepaged_enter_vma(vma, *vm_flags);
> + }
> +
> + return 0;
> +}
> +
> int __init khugepaged_init(void)
> {
> mm_slot_cache = KMEM_CACHE(khugepaged_mm_slot, 0);
> --
> 2.47.1
>
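The MADV_NOHUGEPAGE breakage raised in the review above follows from C's short-circuit evaluation of `&&`: when `advice == MADV_HUGEPAGE` is false, the right-hand operand is never evaluated, so hugepage_set_vmflags() is never called and the flags stay untouched. The user-space model below is only a sketch of that effect. The MODEL_* names and flag values are invented for illustration and are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in flag values; the real VM_* constants live in kernel headers. */
#define MODEL_VM_HUGEPAGE   0x1UL
#define MODEL_VM_NOHUGEPAGE 0x2UL

enum model_advice { MODEL_MADV_HUGEPAGE, MODEL_MADV_NOHUGEPAGE };

/* Same shape as hugepage_set_vmflags() in the patch: returns 0 on success. */
static int model_set_vmflags(unsigned long *vm_flags, enum model_advice advice)
{
	assert(vm_flags != NULL);

	switch (advice) {
	case MODEL_MADV_HUGEPAGE:
		*vm_flags &= ~MODEL_VM_NOHUGEPAGE;
		*vm_flags |= MODEL_VM_HUGEPAGE;
		break;
	case MODEL_MADV_NOHUGEPAGE:
		*vm_flags &= ~MODEL_VM_HUGEPAGE;
		*vm_flags |= MODEL_VM_NOHUGEPAGE;
		break;
	}
	return 0;
}

/* Condition order as posted in v3: the advice check comes first. */
static unsigned long model_madvise_v3(unsigned long vm_flags,
				      enum model_advice advice)
{
	/* For MODEL_MADV_NOHUGEPAGE the helper is skipped entirely. */
	if (advice == MODEL_MADV_HUGEPAGE &&
	    !model_set_vmflags(&vm_flags, advice)) {
		/* The kernel would call khugepaged_enter_vma() here. */
	}
	return vm_flags;
}

/* Helper called first, so the flags are updated for either advice. */
static unsigned long model_madvise_reordered(unsigned long vm_flags,
					     enum model_advice advice)
{
	if (!model_set_vmflags(&vm_flags, advice) &&
	    advice == MODEL_MADV_HUGEPAGE) {
		/* The kernel would call khugepaged_enter_vma() here. */
	}
	return vm_flags;
}
```

In this model, model_madvise_v3(0, MODEL_MADV_NOHUGEPAGE) hands back the flags unchanged, while model_madvise_reordered() sets MODEL_VM_NOHUGEPAGE as expected.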
* Re: [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise
2025-05-20 14:43 ` Lorenzo Stoakes
@ 2025-05-20 14:57 ` Usama Arif
2025-05-20 15:13 ` Usama Arif
2025-05-20 15:31 ` Lorenzo Stoakes
0 siblings, 2 replies; 25+ messages in thread
From: Usama Arif @ 2025-05-20 14:57 UTC
To: Lorenzo Stoakes
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
vbabka, jannh, Arnd Bergmann, linux-kernel, linux-doc,
kernel-team
On 20/05/2025 15:43, Lorenzo Stoakes wrote:
> This commit message is really poor. You're also not mentioning that you're
> changing s390 behaviour?
>
> On Mon, May 19, 2025 at 11:29:53PM +0100, Usama Arif wrote:
>> This is so that flag setting can be resused later in other functions,
>
> Typo.
>
>> to reduce code duplication (including the s390 exception).
>>
>> No functional change intended with this patch.
>
> I'm pretty sure somebody reviewed that this should just be merged with whatever
> uses this? I'm not sure this is all that valuable as you're not really changing
> this structurally very much.
>
Please see patch 2 where hugepage_set_vmflags is reused.
I was just trying to follow your feedback from previous revision that the flag
setting and s390 code part is duplicate code and should be common in the prctl
and madvise function.
I realize I messed up the arg not having vma and the order of the if statement.
>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>
> Yeah I'm not a fan of this patch, it's buggy and really unclear what the
> purpose is here.
No functional change was intended (I realized the order below broke it but can be fixed).
In the previous revision it was:
+ case PR_SET_THP_POLICY:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ if (mmap_write_lock_killable(me->mm))
+ return -EINTR;
+ switch (arg2) {
+ case PR_DEFAULT_MADV_HUGEPAGE:
+ if (!hugepage_global_enabled())
+ error = -EPERM;
+#ifdef CONFIG_S390
+ /*
+ * qemu blindly sets MADV_HUGEPAGE on all allocations, but s390
+ * can't handle this properly after s390_enable_sie, so we simply
+ * ignore the madvise to prevent qemu from causing a SIGSEGV.
+ */
+ else if (mm_has_pgste(vma->vm_mm))
+ error = -EPERM;
+#endif
+ else {
+ me->mm->def_flags &= ~VM_NOHUGEPAGE;
+ me->mm->def_flags |= VM_HUGEPAGE;
+ process_default_madv_hugepage(me->mm, MADV_HUGEPAGE);
+ }
+ break;
...
Now with this hugepage_set_vmflags, it would be
+ case PR_SET_THP_POLICY:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ if (mmap_write_lock_killable(mm))
+ return -EINTR;
+ switch (arg2) {
+ case PR_DEFAULT_MADV_HUGEPAGE:
+ if (!hugepage_global_enabled())
+ error = -EPERM;
+ error = hugepage_set_vmflags(&mm->def_flags, MADV_HUGEPAGE);
+ if (!error)
+ process_default_madv_hugepage(mm, MADV_HUGEPAGE);
+ break;
I am happy to go with either of the methods above, but was just trying to
incorporate your feedback :)
Would you like the method from previous version?
>
>> ---
>> include/linux/huge_mm.h | 1 +
>> mm/khugepaged.c | 26 +++++++++++++++++---------
>> 2 files changed, 18 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2f190c90192d..23580a43787c 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -431,6 +431,7 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> __split_huge_pud(__vma, __pud, __address); \
>> } while (0)
>>
>> +int hugepage_set_vmflags(unsigned long *vm_flags, int advice);
>> int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
>> int advice);
>> int madvise_collapse(struct vm_area_struct *vma,
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index b04b6a770afe..ab3427c87422 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -346,8 +346,7 @@ struct attribute_group khugepaged_attr_group = {
>> };
>> #endif /* CONFIG_SYSFS */
>>
>> -int hugepage_madvise(struct vm_area_struct *vma,
>> - unsigned long *vm_flags, int advice)
>> +int hugepage_set_vmflags(unsigned long *vm_flags, int advice)
>
>
>> {
>> switch (advice) {
>> case MADV_HUGEPAGE:
>> @@ -358,16 +357,10 @@ int hugepage_madvise(struct vm_area_struct *vma,
>> * ignore the madvise to prevent qemu from causing a SIGSEGV.
>> */
>> if (mm_has_pgste(vma->vm_mm))
>
> This is broken, you refer to vma which doesn't exist.
>
> As the kernel bots are telling you...
>
>> - return 0;
>> + return -EPERM;
>
> Why are you now returning an error?
>
> This seems like a super broken way of making the caller return 0. Just make this
> whole thing a bool return if you're going to treat it like a boolean function.
>
>> #endif
>> *vm_flags &= ~VM_NOHUGEPAGE;
>> *vm_flags |= VM_HUGEPAGE;
>> - /*
>> - * If the vma become good for khugepaged to scan,
>> - * register it here without waiting a page fault that
>> - * may not happen any time soon.
>> - */
>> - khugepaged_enter_vma(vma, *vm_flags);
>> break;
>> case MADV_NOHUGEPAGE:
>> *vm_flags &= ~VM_HUGEPAGE;
>> @@ -383,6 +376,21 @@ int hugepage_madvise(struct vm_area_struct *vma,
>> return 0;
>> }
>>
>> +int hugepage_madvise(struct vm_area_struct *vma,
>> + unsigned long *vm_flags, int advice)
>> +{
>> + if (advice == MADV_HUGEPAGE && !hugepage_set_vmflags(vm_flags, advice)) {
>
> So now you've completely broken MADV_NOHUGEPAGE haven't you?
>
Yeah order needs to be reversed.
>> + /*
>> + * If the vma become good for khugepaged to scan,
>> + * register it here without waiting a page fault that
>> + * may not happen any time soon.
>> + */
>> + khugepaged_enter_vma(vma, *vm_flags);
>> + }
>> +
>> + return 0;
>> +}
>> +
>> int __init khugepaged_init(void)
>> {
>> mm_slot_cache = KMEM_CACHE(khugepaged_mm_slot, 0);
>> --
>> 2.47.1
>>
* Re: [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise
2025-05-20 14:57 ` Usama Arif
@ 2025-05-20 15:13 ` Usama Arif
2025-05-20 15:31 ` Lorenzo Stoakes
1 sibling, 0 replies; 25+ messages in thread
From: Usama Arif @ 2025-05-20 15:13 UTC
To: Lorenzo Stoakes
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
vbabka, jannh, Arnd Bergmann, linux-kernel, linux-doc,
kernel-team
On 20/05/2025 15:57, Usama Arif wrote:
>
>
> On 20/05/2025 15:43, Lorenzo Stoakes wrote:
>> This commit message is really poor. You're also not mentioning that you're
>> changing s390 behaviour?
>>
>> On Mon, May 19, 2025 at 11:29:53PM +0100, Usama Arif wrote:
>>> This is so that flag setting can be resused later in other functions,
>>
>> Typo.
>>
>>> to reduce code duplication (including the s390 exception).
>>>
>>> No functional change intended with this patch.
>>
>> I'm pretty sure somebody reviewed that this should just be merged with whatever
>> uses this? I'm not sure this is all that valuable as you're not really changing
>> this structurally very much.
>>
>
Unfortunately I never tested the s390 build, which is what the kernel test robot
is complaining about. If I want to reuse hugepage_set_vmflags in patches 2 and 3
for the prctls, the fix for this patch is the diff at the end of this mail.
If you don't like the approach of trying to abstract the flag setting away
and reusing it in prctl in this patch I can change it to the way in previous
revision and just do something like below. Happy with either approach and
can drop patch 1 if you prefer.
+ case PR_SET_THP_POLICY:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ if (mmap_write_lock_killable(me->mm))
+ return -EINTR;
+ switch (arg2) {
+ case PR_DEFAULT_MADV_HUGEPAGE:
+ if (!hugepage_global_enabled())
+ error = -EPERM;
+#ifdef CONFIG_S390
+ /*
+ * qemu blindly sets MADV_HUGEPAGE on all allocations, but s390
+ * can't handle this properly after s390_enable_sie, so we simply
+ * ignore the madvise to prevent qemu from causing a SIGSEGV.
+ */
+ else if (mm_has_pgste(vma->vm_mm))
+ error = -EPERM;
+#endif
+ else {
+ me->mm->def_flags &= ~VM_NOHUGEPAGE;
+ me->mm->def_flags |= VM_HUGEPAGE;
+ process_default_madv_hugepage(me->mm, MADV_HUGEPAGE);
+ }
+ break;
+ default:
+ error = -EINVAL;
+ }
+ mmap_write_unlock(me->mm);
+ break;
Thanks!
diff for fixing this patch:
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b24a2e0ae642..e5176afaaffe 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -432,7 +432,7 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
} while (0)
void process_default_madv_hugepage(struct mm_struct *mm, int advice);
-int hugepage_set_vmflags(unsigned long *vm_flags, int advice);
+int hugepage_set_vmflags(struct mm_struct* mm, unsigned long *vm_flags, int advice);
int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
int advice);
int madvise_collapse(struct vm_area_struct *vma,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ab3427c87422..b6c9ed6bb442 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -346,7 +346,7 @@ struct attribute_group khugepaged_attr_group = {
};
#endif /* CONFIG_SYSFS */
-int hugepage_set_vmflags(unsigned long *vm_flags, int advice)
+int hugepage_set_vmflags(struct mm_struct * mm, unsigned long *vm_flags, int advice)
{
switch (advice) {
case MADV_HUGEPAGE:
@@ -356,8 +356,8 @@ int hugepage_set_vmflags(unsigned long *vm_flags, int advice)
* can't handle this properly after s390_enable_sie, so we simply
* ignore the madvise to prevent qemu from causing a SIGSEGV.
*/
- if (mm_has_pgste(vma->vm_mm))
- return -EPERM;
+ if (mm_has_pgste(mm))
+ return 0;
#endif
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;
@@ -373,13 +373,14 @@ int hugepage_set_vmflags(unsigned long *vm_flags, int advice)
break;
}
- return 0;
+ return 1;
}
int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
{
- if (advice == MADV_HUGEPAGE && !hugepage_set_vmflags(vm_flags, advice)) {
+ if (hugepage_set_vmflags(vma->vm_mm, vm_flags, advice)
+ && advice == MADV_HUGEPAGE) {
/*
* If the vma become good for khugepaged to scan,
* register it here without waiting a page fault that
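The fix diff above keeps an int return that is used as a truth value. The earlier review suggested a bool return instead, and the two ideas can be combined as in the user-space sketch below. Everything named sketch_* is hypothetical: struct sketch_mm stands in for struct mm_struct, its has_pgste field models mm_has_pgste(), and the counter models calls to khugepaged_enter_vma(). It is an illustration of the control flow under those assumptions, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in flag values; not the kernel's VM_* constants. */
#define SKETCH_VM_HUGEPAGE   0x1UL
#define SKETCH_VM_NOHUGEPAGE 0x2UL

enum sketch_advice { SKETCH_MADV_HUGEPAGE, SKETCH_MADV_NOHUGEPAGE };

/* Hypothetical stand-in for struct mm_struct. */
struct sketch_mm {
	bool has_pgste;               /* models mm_has_pgste(mm) on s390 */
	int khugepaged_registrations; /* models khugepaged_enter_vma() calls */
};

/* Returns true if the flags were updated, false if the advice was ignored. */
static bool sketch_set_vmflags(const struct sketch_mm *mm,
			       unsigned long *vm_flags,
			       enum sketch_advice advice)
{
	assert(mm != NULL && vm_flags != NULL);

	switch (advice) {
	case SKETCH_MADV_HUGEPAGE:
		/* s390 exception: silently ignore the advice. */
		if (mm->has_pgste)
			return false;
		*vm_flags &= ~SKETCH_VM_NOHUGEPAGE;
		*vm_flags |= SKETCH_VM_HUGEPAGE;
		break;
	case SKETCH_MADV_NOHUGEPAGE:
		*vm_flags &= ~SKETCH_VM_HUGEPAGE;
		*vm_flags |= SKETCH_VM_NOHUGEPAGE;
		break;
	}
	return true;
}

/* Always returns 0, matching the original hugepage_madvise() contract. */
static int sketch_madvise(struct sketch_mm *mm, unsigned long *vm_flags,
			  enum sketch_advice advice)
{
	/* Helper first, so MADV_NOHUGEPAGE still updates the flags. */
	if (sketch_set_vmflags(mm, vm_flags, advice) &&
	    advice == SKETCH_MADV_HUGEPAGE)
		mm->khugepaged_registrations++;

	return 0;
}
```

With a bool return the caller no longer needs to invert an error code to decide whether to register the VMA, and the ignored s390 case still yields 0 from the madvise path.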
* Re: [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise
2025-05-20 14:57 ` Usama Arif
2025-05-20 15:13 ` Usama Arif
@ 2025-05-20 15:31 ` Lorenzo Stoakes
1 sibling, 0 replies; 25+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 15:31 UTC
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
vbabka, jannh, Arnd Bergmann, linux-kernel, linux-doc,
kernel-team
On Tue, May 20, 2025 at 03:57:35PM +0100, Usama Arif wrote:
>
>
> On 20/05/2025 15:43, Lorenzo Stoakes wrote:
> > This commit message is really poor. You're also not mentioning that you're
> > changing s390 behaviour?
> >
> > On Mon, May 19, 2025 at 11:29:53PM +0100, Usama Arif wrote:
> >> This is so that flag setting can be resused later in other functions,
> >
> > Typo.
> >
> >> to reduce code duplication (including the s390 exception).
> >>
> >> No functional change intended with this patch.
> >
> > I'm pretty sure somebody reviewed that this should just be merged with whatever
> > uses this? I'm not sure this is all that valuable as you're not really changing
> > this structurally very much.
> >
>
> Please see patch 2 where hugepage_set_vmflags is reused.
> I was just trying to follow your feedback from previous revision that the flag
> setting and s390 code part is duplicate code and should be common in the prctl
> and madvise function.
Sure, but I think it'd probably be better as part of that patch. Perhaps I
was thinking of another comment in reference to a 'no functional change'
remark.
>
> I realize I messed up the arg not having vma and the order of the if statement.
I am getting the strong impression here that you're rushing :)
I strongly suggest slowing things down here. We're in RC7, and this is (or
should be) an RFC for us to explore concepts. There's no need to rush.
I appreciate your input and enthusiasm, but clearly rushing is causing you
to make mistakes. I get it, we've all been there.
But right now we have (what, 5 maybe?) THP series in flight at the same time,
all touching similar stuff, and it'll make everybody's lives easier and
less chaotic if we take a little more time to assess.
We are ultimately going to choose what's best for the kernel, there's no
'race' as to which series is 'ready' first.
>
> >>
> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >
> > Yeah I'm not a fan of this patch, it's buggy and really unclear what the
> > purpose is here.
>
> No functional change was intended (I realized the order below broke it but can be fixed).
>
> In the previous revision it was:
> + case PR_SET_THP_POLICY:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + if (mmap_write_lock_killable(me->mm))
> + return -EINTR;
> + switch (arg2) {
> + case PR_DEFAULT_MADV_HUGEPAGE:
> + if (!hugepage_global_enabled())
> + error = -EPERM;
> +#ifdef CONFIG_S390
> + /*
> + * qemu blindly sets MADV_HUGEPAGE on all allocations, but s390
> + * can't handle this properly after s390_enable_sie, so we simply
> + * ignore the madvise to prevent qemu from causing a SIGSEGV.
> + */
> + else if (mm_has_pgste(vma->vm_mm))
> + error = -EPERM;
> +#endif
> + else {
> + me->mm->def_flags &= ~VM_NOHUGEPAGE;
> + me->mm->def_flags |= VM_HUGEPAGE;
> + process_default_madv_hugepage(me->mm, MADV_HUGEPAGE);
> + }
> + break;
> ...
>
> Now with this hugepage_set_vmflags, it would be
>
> + case PR_SET_THP_POLICY:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + if (mmap_write_lock_killable(mm))
> + return -EINTR;
> + switch (arg2) {
> + case PR_DEFAULT_MADV_HUGEPAGE:
> + if (!hugepage_global_enabled())
> + error = -EPERM;
> + error = hugepage_set_vmflags(&mm->def_flags, MADV_HUGEPAGE);
> + if (!error)
> + process_default_madv_hugepage(mm, MADV_HUGEPAGE);
> + break;
>
>
> I am happy to go with either of the methods above, but was just trying to
> incorporate your feedback :)
>
> Would you like the method from previous version?
I'm going to go ahead and overlook what would, in the UK, 100% be a
deployment of the finest British sarcasm here, and assume it was not intended :)
Very obviously we do not want to duplicate architecture-specific code. I'm
a little concerned you're ok with both (imagine if one changed but not the
other for instance), but clearly this series is unmergeable without
de-duplicating this.
My objections here are that you submitted a totally broken patch with a
poor commit message, one that looks like it could well be merged with the
subsequent patch.
I also have concerns about your levels of testing here - you completely
broke MADV_NOHUGEPAGE but didn't notice? Are you running self-tests? Do we
have one that'd pick that up? If not, can we have one like that?
Thanks!
>
> >
> >> ---
> >> include/linux/huge_mm.h | 1 +
> >> mm/khugepaged.c | 26 +++++++++++++++++---------
> >> 2 files changed, 18 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> index 2f190c90192d..23580a43787c 100644
> >> --- a/include/linux/huge_mm.h
> >> +++ b/include/linux/huge_mm.h
> >> @@ -431,6 +431,7 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >> __split_huge_pud(__vma, __pud, __address); \
> >> } while (0)
> >>
> >> +int hugepage_set_vmflags(unsigned long *vm_flags, int advice);
> >> int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> >> int advice);
> >> int madvise_collapse(struct vm_area_struct *vma,
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index b04b6a770afe..ab3427c87422 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -346,8 +346,7 @@ struct attribute_group khugepaged_attr_group = {
> >> };
> >> #endif /* CONFIG_SYSFS */
> >>
> >> -int hugepage_madvise(struct vm_area_struct *vma,
> >> - unsigned long *vm_flags, int advice)
> >> +int hugepage_set_vmflags(unsigned long *vm_flags, int advice)
> >
> >
> >> {
> >> switch (advice) {
> >> case MADV_HUGEPAGE:
> >> @@ -358,16 +357,10 @@ int hugepage_madvise(struct vm_area_struct *vma,
> >> * ignore the madvise to prevent qemu from causing a SIGSEGV.
> >> */
> >> if (mm_has_pgste(vma->vm_mm))
> >
> > This is broken, you refer to vma which doesn't exist.
> >
> > As the kernel bots are telling you...
> >
> >> - return 0;
> >> + return -EPERM;
> >
> > Why are you now returning an error?
> >
> > This seems like a super broken way of making the caller return 0. Just make this
> > whole thing a bool return if you're going to treat it like a boolean function.
> >
> >> #endif
> >> *vm_flags &= ~VM_NOHUGEPAGE;
> >> *vm_flags |= VM_HUGEPAGE;
> >> - /*
> >> - * If the vma become good for khugepaged to scan,
> >> - * register it here without waiting a page fault that
> >> - * may not happen any time soon.
> >> - */
> >> - khugepaged_enter_vma(vma, *vm_flags);
> >> break;
> >> case MADV_NOHUGEPAGE:
> >> *vm_flags &= ~VM_HUGEPAGE;
> >> @@ -383,6 +376,21 @@ int hugepage_madvise(struct vm_area_struct *vma,
> >> return 0;
> >> }
> >>
> >> +int hugepage_madvise(struct vm_area_struct *vma,
> >> + unsigned long *vm_flags, int advice)
> >> +{
> >> + if (advice == MADV_HUGEPAGE && !hugepage_set_vmflags(vm_flags, advice)) {
> >
> > So now you've completely broken MADV_NOHUGEPAGE haven't you?
> >
>
> Yeah order needs to be reversed.
>
> >> + /*
> >> + * If the vma become good for khugepaged to scan,
> >> + * register it here without waiting a page fault that
> >> + * may not happen any time soon.
> >> + */
> >> + khugepaged_enter_vma(vma, *vm_flags);
> >> + }
> >> +
> >> + return 0;
> >> +}
> >> +
> >> int __init khugepaged_init(void)
> >> {
> >> mm_slot_cache = KMEM_CACHE(khugepaged_mm_slot, 0);
> >> --
> >> 2.47.1
> >>
>
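On the question of a self-test that would catch a broken MADV_NOHUGEPAGE: one possible user-space check is to map anonymous memory, apply the advice, and look for the "nh" entry on the VmFlags line of /proc/self/smaps. The sketch below is a hypothetical stand-alone check, not an existing kernel selftest. The function names and the CHECK_* result codes are made up here, and it reports a skip when THP or smaps VmFlags are unavailable.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

enum { CHECK_PASS = 0, CHECK_FAIL = 1, CHECK_SKIP = 2 };

/*
 * Returns 1 if the VmFlags line of the mapping containing addr lists flag,
 * 0 if it does not, and -1 if smaps or the VmFlags line is unavailable.
 */
static int vma_has_flag(const void *addr, const char *flag)
{
	unsigned long target = (unsigned long)addr;
	char line[512];
	int in_target = 0, found = -1;
	FILE *f;

	assert(flag != NULL);

	f = fopen("/proc/self/smaps", "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		unsigned long start, end;

		if (sscanf(line, "%lx-%lx ", &start, &end) == 2) {
			/* Header line of a mapping: "start-end perms ..." */
			in_target = (start <= target && target < end);
		} else if (in_target && strncmp(line, "VmFlags:", 8) == 0) {
			found = 0;
			for (char *tok = strtok(line + 8, " \n"); tok;
			     tok = strtok(NULL, " \n"))
				if (strcmp(tok, flag) == 0)
					found = 1;
			break;
		}
	}
	fclose(f);
	return found;
}

/* Applies MADV_NOHUGEPAGE and verifies that the VMA reports "nh". */
static int check_madv_nohugepage(void)
{
	const size_t len = 4UL << 20;
	int has, result;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return CHECK_SKIP;

	if (madvise(p, len, MADV_NOHUGEPAGE) != 0) {
		/* For example EINVAL when THP is not configured. */
		result = CHECK_SKIP;
	} else {
		has = vma_has_flag(p, "nh");
		result = has < 0 ? CHECK_SKIP : (has ? CHECK_PASS : CHECK_FAIL);
	}

	munmap(p, len);
	return result;
}
```

A wrapper that never reaches the flag-setting code for MADV_NOHUGEPAGE would leave "nh" absent, so check_madv_nohugepage() would return CHECK_FAIL on a kernel with that bug.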
* [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
2025-05-19 22:29 ` [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise Usama Arif
@ 2025-05-19 22:29 ` Usama Arif
2025-05-19 23:01 ` Jann Horn
2025-05-20 8:48 ` kernel test robot
2025-05-19 22:29 ` [PATCH v3 3/7] prctl: introduce PR_DEFAULT_MADV_NOHUGEPAGE " Usama Arif
` (7 subsequent siblings)
9 siblings, 2 replies; 25+ messages in thread
From: Usama Arif @ 2025-05-19 22:29 UTC
To: Andrew Morton, david, linux-mm
Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, vbabka,
jannh, Arnd Bergmann, linux-kernel, linux-doc, kernel-team,
Usama Arif
This is set via the new PR_SET_THP_POLICY prctl. It has two effects:
- It sets VM_HUGEPAGE and clears VM_NOHUGEPAGE on the default VMA flags
(def_flags). This means that every new VMA will be considered for
hugepages.
- It iterates through every VMA in the process and calls hugepage_madvise
on it with the MADV_HUGEPAGE policy.
The policy is inherited during fork+exec.
This effectively allows setting MADV_HUGEPAGE on the entire process.
In an environment where different types of workloads are run on the
same machine, this will allow workloads that benefit from always having
hugepages to do so, without regressing those that don't.
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
include/linux/huge_mm.h | 1 +
include/linux/mm.h | 2 +-
include/linux/mm_types.h | 4 ++-
include/uapi/linux/prctl.h | 4 +++
kernel/sys.c | 29 +++++++++++++++++++
mm/huge_memory.c | 13 +++++++++
tools/include/uapi/linux/prctl.h | 4 +++
.../trace/beauty/include/uapi/linux/prctl.h | 4 +++
8 files changed, 59 insertions(+), 2 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 23580a43787c..b24a2e0ae642 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -431,6 +431,7 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
__split_huge_pud(__vma, __pud, __address); \
} while (0)
+void process_default_madv_hugepage(struct mm_struct *mm, int advice);
int hugepage_set_vmflags(unsigned long *vm_flags, int advice);
int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
int advice);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 43748c8f3454..436f4588bce8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -466,7 +466,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB)
/* This mask defines which mm->def_flags a process can inherit its parent */
-#define VM_INIT_DEF_MASK VM_NOHUGEPAGE
+#define VM_INIT_DEF_MASK (VM_HUGEPAGE | VM_NOHUGEPAGE)
/* This mask represents all the VMA flag bits used by mlock */
#define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e76bade9ebb1..f1836b7c5704 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1703,6 +1703,7 @@ enum {
/* leave room for more dump flags */
#define MMF_VM_MERGEABLE 16 /* KSM may merge identical pages */
#define MMF_VM_HUGEPAGE 17 /* set when mm is available for khugepaged */
+#define MMF_VM_HUGEPAGE_MASK (1 << MMF_VM_HUGEPAGE)
/*
* This one-shot flag is dropped due to necessity of changing exe once again
@@ -1742,7 +1743,8 @@ enum {
#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
- MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
+ MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK |\
+ MMF_VM_HUGEPAGE_MASK)
static inline unsigned long mmf_init_flags(unsigned long flags)
{
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 15c18ef4eb11..15aaa4db5ff8 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -364,4 +364,8 @@ struct prctl_mm_map {
# define PR_TIMER_CREATE_RESTORE_IDS_ON 1
# define PR_TIMER_CREATE_RESTORE_IDS_GET 2
+#define PR_SET_THP_POLICY 78
+#define PR_GET_THP_POLICY 79
+#define PR_DEFAULT_MADV_HUGEPAGE 0
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index c434968e9f5d..74397ace62f3 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2474,6 +2474,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)
{
struct task_struct *me = current;
+ struct mm_struct *mm = me->mm;
unsigned char comm[sizeof(me->comm)];
long error;
@@ -2658,6 +2659,34 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
clear_bit(MMF_DISABLE_THP, &me->mm->flags);
mmap_write_unlock(me->mm);
break;
+ case PR_GET_THP_POLICY:
+ if (arg2 || arg3 || arg4 || arg5)
+ return -EINVAL;
+ if (mmap_write_lock_killable(mm))
+ return -EINTR;
+ if (mm->def_flags & VM_HUGEPAGE)
+ error = PR_DEFAULT_MADV_HUGEPAGE;
+ mmap_write_unlock(mm);
+ break;
+ case PR_SET_THP_POLICY:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ if (mmap_write_lock_killable(mm))
+ return -EINTR;
+ switch (arg2) {
+ case PR_DEFAULT_MADV_HUGEPAGE:
+ if (!hugepage_global_enabled())
+ error = -EPERM;
+ error = hugepage_set_vmflags(&mm->def_flags, MADV_HUGEPAGE);
+ if (!error)
+ process_default_madv_hugepage(mm, MADV_HUGEPAGE);
+ break;
+ default:
+ error = -EINVAL;
+ break;
+ }
+ mmap_write_unlock(mm);
+ break;
case PR_MPX_ENABLE_MANAGEMENT:
case PR_MPX_DISABLE_MANAGEMENT:
/* No longer implemented: */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2780a12b25f0..72806fe772b5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -98,6 +98,19 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
}
+void process_default_madv_hugepage(struct mm_struct *mm, int advice)
+{
+ struct vm_area_struct *vma;
+ unsigned long vm_flags;
+
+ mmap_assert_write_locked(mm);
+ VMA_ITERATOR(vmi, mm, 0);
+ for_each_vma(vmi, vma) {
+ vm_flags = vma->vm_flags;
+ hugepage_madvise(vma, &vm_flags, advice);
+ }
+}
+
unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
unsigned long vm_flags,
unsigned long tva_flags,
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 35791791a879..f5945ebfe3f2 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -328,4 +328,8 @@ struct prctl_mm_map {
# define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */
# define PR_PPC_DEXCR_CTRL_MASK 0x1f
+#define PR_SET_THP_POLICY 78
+#define PR_GET_THP_POLICY 79
+#define PR_THP_POLICY_DEFAULT_HUGE 0
+
#endif /* _LINUX_PRCTL_H */
diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
index 15c18ef4eb11..325c72f40a93 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
@@ -364,4 +364,8 @@ struct prctl_mm_map {
# define PR_TIMER_CREATE_RESTORE_IDS_ON 1
# define PR_TIMER_CREATE_RESTORE_IDS_GET 2
+#define PR_SET_THP_POLICY 78
+#define PR_GET_THP_POLICY 79
+#define PR_THP_POLICY_DEFAULT_HUGE 0
+
#endif /* _LINUX_PRCTL_H */
--
2.47.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
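The two effects described in the commit message above can be sketched as a small userspace model. This is only an illustration under stated assumptions: struct model_mm, struct model_vma and the model_* functions are hypothetical stand-ins, the flag values are illustrative, and the khugepaged registration and s390 handling done by the real hugepage_madvise() are left out. The model returns as soon as the global check fails, whereas the hunk as posted assigns `error` again on the line after setting -EPERM.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values; the real definitions live in include/linux/mm.h. */
#define VM_HUGEPAGE   0x20000000UL
#define VM_NOHUGEPAGE 0x40000000UL

/* Hypothetical stand-ins for struct vm_area_struct and struct mm_struct. */
struct model_vma {
	unsigned long vm_flags;
};

struct model_mm {
	unsigned long def_flags;
	struct model_vma *vmas;
	size_t nr_vmas;
};

/* Models the MADV_HUGEPAGE case: set VM_HUGEPAGE, clear VM_NOHUGEPAGE. */
static void model_set_hugepage(unsigned long *vm_flags)
{
	*vm_flags &= ~VM_NOHUGEPAGE;
	*vm_flags |= VM_HUGEPAGE;
}

/*
 * Models PR_SET_THP_POLICY with PR_DEFAULT_MADV_HUGEPAGE.
 * Returns 0 on success, -EPERM if THP is globally disabled.
 */
static int model_set_default_hugepage(struct model_mm *mm,
				      bool thp_globally_enabled)
{
	size_t i;

	if (!thp_globally_enabled)
		return -EPERM;	/* return early so the error is not overwritten */

	/* Effect 1: VMAs created from now on start with the new default. */
	model_set_hugepage(&mm->def_flags);

	/* Effect 2: every existing VMA gets the same advice applied. */
	for (i = 0; i < mm->nr_vmas; i++)
		model_set_hugepage(&mm->vmas[i].vm_flags);

	return 0;
}
```

In this model a failed global check leaves both def_flags and the existing VMAs untouched, which is one plausible reading of what the -EPERM path is meant to do.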
* Re: [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
2025-05-19 22:29 ` [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process Usama Arif
@ 2025-05-19 23:01 ` Jann Horn
2025-05-20 5:23 ` Lorenzo Stoakes
2025-05-20 8:48 ` kernel test robot
1 sibling, 1 reply; 25+ messages in thread
From: Jann Horn @ 2025-05-19 23:01 UTC (permalink / raw)
To: Usama Arif, lorenzo.stoakes
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
vbabka, Arnd Bergmann, linux-kernel, linux-doc, kernel-team
On Tue, May 20, 2025 at 12:33 AM Usama Arif <usamaarif642@gmail.com> wrote:
> This is set via the new PR_SET_THP_POLICY prctl. It has 2 effects:
> - It sets VM_HUGEPAGE and clears VM_NOHUGEPAGE on the default VMA flags
> (def_flags). This means that every new VMA will be considered for
> hugepage.
> - Iterate through every VMA in the process and call hugepage_madvise
> on it, with MADV_HUGEPAGE policy.
> The policy is inherited during fork+exec.
As I replied to Lorenzo's series
(https://lore.kernel.org/all/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com/),
it would be nice if you could avoid introducing new flags that have
the combination of all the following properties:
1. persists across exec
2. not cleared on secureexec execution
3. settable without ns_capable(CAP_SYS_ADMIN)
4. settable without NO_NEW_PRIVS
Flags that have all of these properties need to be reviewed extra
carefully to see if there is any way they could impact the security of
setuid binaries, for example by changing mmap() behavior in a way that
makes addresses significantly more predictable.
^ permalink raw reply [flat|nested] 25+ messages in thread
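The four-property rule above can be written down as a simple predicate. This is a minimal sketch for illustration only: struct flag_props and needs_extra_security_review() are hypothetical names, not kernel interfaces, and the field values for any real flag would have to be filled in by reading the code.

```c
#include <stdbool.h>

/* Hypothetical description of a per-process flag, for illustration only. */
struct flag_props {
	bool persists_across_exec;
	bool cleared_on_secureexec;
	bool requires_cap_sys_admin;	/* i.e. ns_capable(CAP_SYS_ADMIN) */
	bool requires_no_new_privs;
};

/* True when all four properties from the list above hold together. */
static bool needs_extra_security_review(const struct flag_props *p)
{
	return p->persists_across_exec &&
	       !p->cleared_on_secureexec &&
	       !p->requires_cap_sys_admin &&
	       !p->requires_no_new_privs;
}
```

Changing any single property, for example requiring CAP_SYS_ADMIN to set the flag, is enough to make the predicate false.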
* Re: [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
2025-05-19 23:01 ` Jann Horn
@ 2025-05-20 5:23 ` Lorenzo Stoakes
2025-05-20 9:09 ` David Hildenbrand
0 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 5:23 UTC (permalink / raw)
To: Jann Horn
Cc: Usama Arif, Andrew Morton, david, linux-mm, hannes, shakeel.butt,
riel, ziy, laoar.shao, baolin.wang, Liam.Howlett, npache,
ryan.roberts, vbabka, Arnd Bergmann, linux-kernel, linux-doc,
kernel-team
On Tue, May 20, 2025 at 01:01:38AM +0200, Jann Horn wrote:
> On Tue, May 20, 2025 at 12:33 AM Usama Arif <usamaarif642@gmail.com> wrote:
> > This is set via the new PR_SET_THP_POLICY prctl. It has 2 effects:
> > - It sets VM_HUGEPAGE and clears VM_NOHUGEPAGE on the default VMA flags
> > (def_flags). This means that every new VMA will be considered for
> > hugepage.
> > - Iterate through every VMA in the process and call hugepage_madvise
> > on it, with MADV_HUGEPAGE policy.
> > The policy is inherited during fork+exec.
>
> As I replied to Lorenzo's series
> (https://lore.kernel.org/all/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com/),
> it would be nice if you could avoid introducing new flags that have
> the combination of all the following properties:
>
> 1. persists across exec
> 2. not cleared on secureexec execution
> 3. settable without ns_capable(CAP_SYS_ADMIN)
> 4. settable without NO_NEW_PRIVS
>
> Flags that have all of these properties need to be reviewed extra
> carefully to see if there is any way they could impact the security of
> setuid binaries, for example by changing mmap() behavior in a way that
> makes addresses significantly more predictable.
Indeed, this series was meant to be as RFC as mine while we still figured this
out :) grr. Well, with the NACK it is - in effect - now an RFC.
Yes having something persistent like this is not great, the idea of
introducing this in my series was to provide an alternative generic version
of this approach that can be better controlled and isn't just a 'tacked on'
change specific to one company's needs but rather a more general idea of
'madvise() by default'.
I do wonder in this case, whether we need be so cautious however given the
_relatively_ safe nature of these flags?
I do absolutely agree we need to very carefully review whether:
1. It really even makes sense to do this
2. Any such restrictions need be made
I am weaker on the security side so very glad for your input here (thanks!)
I suspect probably we want ns_capable(CAP_SYS_ADMIN) _as a rule_ for this
kind of mm->def_flags change.
I also wanted to dig a little deeper into whether this was sensible as a
general approach.
I, however, do _very much_ prefer it to an mm->flags change (that'd
necessitate a pre-requisite 'make mm->flags 64-bit on 32-bit kernels'
series anyway).
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
2025-05-20 5:23 ` Lorenzo Stoakes
@ 2025-05-20 9:09 ` David Hildenbrand
2025-05-20 9:16 ` Lorenzo Stoakes
0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2025-05-20 9:09 UTC (permalink / raw)
To: Lorenzo Stoakes, Jann Horn
Cc: Usama Arif, Andrew Morton, linux-mm, hannes, shakeel.butt, riel,
ziy, laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
vbabka, Arnd Bergmann, linux-kernel, linux-doc, kernel-team
On 20.05.25 07:23, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 01:01:38AM +0200, Jann Horn wrote:
>> On Tue, May 20, 2025 at 12:33 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>> This is set via the new PR_SET_THP_POLICY prctl. It has 2 effects:
>>> - It sets VM_HUGEPAGE and clears VM_NOHUGEPAGE on the default VMA flags
>>> (def_flags). This means that every new VMA will be considered for
>>> hugepage.
>>> - Iterate through every VMA in the process and call hugepage_madvise
>>> on it, with MADV_HUGEPAGE policy.
>>> The policy is inherited during fork+exec.
>>
>> As I replied to Lorenzo's series
>> (https://lore.kernel.org/all/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com/),
>> it would be nice if you could avoid introducing new flags that have
>> the combination of all the following properties:
>>
>> 1. persists across exec
>> 2. not cleared on secureexec execution
>> 3. settable without ns_capable(CAP_SYS_ADMIN)
>> 4. settable without NO_NEW_PRIVS
>>
>> Flags that have all of these properties need to be reviewed extra
>> carefully to see if there is any way they could impact the security of
>> setuid binaries, for example by changing mmap() behavior in a way that
>> makes addresses significantly more predictable.
>
> Indeed, this series was meant to be as RFC as mine while we still figured this
> out :) grr. Well, with the NACK it is - in effect - now an RFC.
>
> Yes having something persistent like this is not great, the idea of
> introducing this in my series was to provide an alternative generic version
> of this approach that can be better controlled and isn't just a 'tacked on'
> change specific to one company's needs but rather a more general idea of
> 'madvise() by default'.
>
> I do wonder in this case, whether we need be so cautious however given the
> _relatively_ safe nature of these flags?
Yes. Changing VM_HUGEPAGE / VM_NOHUGEPAGE defaults should have little
impact, but we better be careful.
setuid execution is certainly an interesting point. Maybe the general
rule should be, that it is not inherited over secureexec unless
CAP_SYS_ADMIN?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
2025-05-20 9:09 ` David Hildenbrand
@ 2025-05-20 9:16 ` Lorenzo Stoakes
0 siblings, 0 replies; 25+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 9:16 UTC (permalink / raw)
To: David Hildenbrand
Cc: Jann Horn, Usama Arif, Andrew Morton, linux-mm, hannes,
shakeel.butt, riel, ziy, laoar.shao, baolin.wang, Liam.Howlett,
npache, ryan.roberts, vbabka, Arnd Bergmann, linux-kernel,
linux-doc, kernel-team
On Tue, May 20, 2025 at 11:09:05AM +0200, David Hildenbrand wrote:
> On 20.05.25 07:23, Lorenzo Stoakes wrote:
> > On Tue, May 20, 2025 at 01:01:38AM +0200, Jann Horn wrote:
> > > On Tue, May 20, 2025 at 12:33 AM Usama Arif <usamaarif642@gmail.com> wrote:
> > > > This is set via the new PR_SET_THP_POLICY prctl. It has 2 effects:
> > > > - It sets VM_HUGEPAGE and clears VM_NOHUGEPAGE on the default VMA flags
> > > > (def_flags). This means that every new VMA will be considered for
> > > > hugepage.
> > > > - Iterate through every VMA in the process and call hugepage_madvise
> > > > on it, with MADV_HUGEPAGE policy.
> > > > The policy is inherited during fork+exec.
> > >
> > > As I replied to Lorenzo's series
> > > (https://lore.kernel.org/all/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com/),
> > > it would be nice if you could avoid introducing new flags that have
> > > the combination of all the following properties:
> > >
> > > 1. persists across exec
> > > 2. not cleared on secureexec execution
> > > 3. settable without ns_capable(CAP_SYS_ADMIN)
> > > 4. settable without NO_NEW_PRIVS
> > >
> > > Flags that have all of these properties need to be reviewed extra
> > > carefully to see if there is any way they could impact the security of
> > > setuid binaries, for example by changing mmap() behavior in a way that
> > > makes addresses significantly more predictable.
> >
> > Indeed, this series was meant to be as RFC as mine while we still figured this
> > out :) grr. Well, with the NACK it is - in effect - now an RFC.
> >
> > Yes having something persistent like this is not great, the idea of
> > introducing this in my series was to provide an alternative generic version
> > of this approach that can be better controlled and isn't just a 'tacked on'
> > change specific to one company's needs but rather a more general idea of
> > 'madvise() by default'.
> >
> > I do wonder in this case, whether we need be so cautious however given the
> > _relatively_ safe nature of these flags?
>
> Yes. Changing VM_HUGEPAGE / VM_NOHUGEPAGE defaults should have little
> impact, but we better be careful.
>
> setuid execution is certainly an interesting point. Maybe the general rule
> should be, that it is not inherited over secureexec unless CAP_SYS_ADMIN?
I think probably we should just restrict this operation to system admins
anyway. This will be the most cautious option, and simplifies things as we
then don't have to especially check for things at certain points?
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
2025-05-19 22:29 ` [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process Usama Arif
2025-05-19 23:01 ` Jann Horn
@ 2025-05-20 8:48 ` kernel test robot
1 sibling, 0 replies; 25+ messages in thread
From: kernel test robot @ 2025-05-20 8:48 UTC (permalink / raw)
To: Usama Arif, Andrew Morton, david
Cc: llvm, oe-kbuild-all, Linux Memory Management List, hannes,
shakeel.butt, riel, ziy, laoar.shao, baolin.wang, lorenzo.stoakes,
Liam.Howlett, npache, ryan.roberts, vbabka, jannh, Arnd Bergmann,
linux-kernel, linux-doc, kernel-team, Usama Arif
Hi Usama,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on perf-tools-next/perf-tools-next tip/perf/core perf-tools/perf-tools linus/master v6.15-rc7]
[cannot apply to acme/perf/core next-20250516]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/mm-khugepaged-extract-vm-flag-setting-outside-of-hugepage_madvise/20250520-063452
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250519223307.3601786-3-usamaarif642%40gmail.com
patch subject: [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
config: s390-randconfig-001-20250520 (https://download.01.org/0day-ci/archive/20250520/202505201614.N4SXnAln-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project f819f46284f2a79790038e1f6649172789734ae8)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250520/202505201614.N4SXnAln-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505201614.N4SXnAln-lkp@intel.com/
All errors (new ones prefixed by >>):
>> kernel/sys.c:2678:9: error: call to undeclared function 'hugepage_global_enabled'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
2678 | if (!hugepage_global_enabled())
| ^
>> kernel/sys.c:2680:12: error: call to undeclared function 'hugepage_set_vmflags'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
2680 | error = hugepage_set_vmflags(&mm->def_flags, MADV_HUGEPAGE);
| ^
>> kernel/sys.c:2682:5: error: call to undeclared function 'process_default_madv_hugepage'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
2682 | process_default_madv_hugepage(mm, MADV_HUGEPAGE);
| ^
3 errors generated.
vim +/hugepage_global_enabled +2678 kernel/sys.c
2472
2473 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
2474 unsigned long, arg4, unsigned long, arg5)
2475 {
2476 struct task_struct *me = current;
2477 struct mm_struct *mm = me->mm;
2478 unsigned char comm[sizeof(me->comm)];
2479 long error;
2480
2481 error = security_task_prctl(option, arg2, arg3, arg4, arg5);
2482 if (error != -ENOSYS)
2483 return error;
2484
2485 error = 0;
2486 switch (option) {
2487 case PR_SET_PDEATHSIG:
2488 if (!valid_signal(arg2)) {
2489 error = -EINVAL;
2490 break;
2491 }
2492 me->pdeath_signal = arg2;
2493 break;
2494 case PR_GET_PDEATHSIG:
2495 error = put_user(me->pdeath_signal, (int __user *)arg2);
2496 break;
2497 case PR_GET_DUMPABLE:
2498 error = get_dumpable(me->mm);
2499 break;
2500 case PR_SET_DUMPABLE:
2501 if (arg2 != SUID_DUMP_DISABLE && arg2 != SUID_DUMP_USER) {
2502 error = -EINVAL;
2503 break;
2504 }
2505 set_dumpable(me->mm, arg2);
2506 break;
2507
2508 case PR_SET_UNALIGN:
2509 error = SET_UNALIGN_CTL(me, arg2);
2510 break;
2511 case PR_GET_UNALIGN:
2512 error = GET_UNALIGN_CTL(me, arg2);
2513 break;
2514 case PR_SET_FPEMU:
2515 error = SET_FPEMU_CTL(me, arg2);
2516 break;
2517 case PR_GET_FPEMU:
2518 error = GET_FPEMU_CTL(me, arg2);
2519 break;
2520 case PR_SET_FPEXC:
2521 error = SET_FPEXC_CTL(me, arg2);
2522 break;
2523 case PR_GET_FPEXC:
2524 error = GET_FPEXC_CTL(me, arg2);
2525 break;
2526 case PR_GET_TIMING:
2527 error = PR_TIMING_STATISTICAL;
2528 break;
2529 case PR_SET_TIMING:
2530 if (arg2 != PR_TIMING_STATISTICAL)
2531 error = -EINVAL;
2532 break;
2533 case PR_SET_NAME:
2534 comm[sizeof(me->comm) - 1] = 0;
2535 if (strncpy_from_user(comm, (char __user *)arg2,
2536 sizeof(me->comm) - 1) < 0)
2537 return -EFAULT;
2538 set_task_comm(me, comm);
2539 proc_comm_connector(me);
2540 break;
2541 case PR_GET_NAME:
2542 get_task_comm(comm, me);
2543 if (copy_to_user((char __user *)arg2, comm, sizeof(comm)))
2544 return -EFAULT;
2545 break;
2546 case PR_GET_ENDIAN:
2547 error = GET_ENDIAN(me, arg2);
2548 break;
2549 case PR_SET_ENDIAN:
2550 error = SET_ENDIAN(me, arg2);
2551 break;
2552 case PR_GET_SECCOMP:
2553 error = prctl_get_seccomp();
2554 break;
2555 case PR_SET_SECCOMP:
2556 error = prctl_set_seccomp(arg2, (char __user *)arg3);
2557 break;
2558 case PR_GET_TSC:
2559 error = GET_TSC_CTL(arg2);
2560 break;
2561 case PR_SET_TSC:
2562 error = SET_TSC_CTL(arg2);
2563 break;
2564 case PR_TASK_PERF_EVENTS_DISABLE:
2565 error = perf_event_task_disable();
2566 break;
2567 case PR_TASK_PERF_EVENTS_ENABLE:
2568 error = perf_event_task_enable();
2569 break;
2570 case PR_GET_TIMERSLACK:
2571 if (current->timer_slack_ns > ULONG_MAX)
2572 error = ULONG_MAX;
2573 else
2574 error = current->timer_slack_ns;
2575 break;
2576 case PR_SET_TIMERSLACK:
2577 if (rt_or_dl_task_policy(current))
2578 break;
2579 if (arg2 <= 0)
2580 current->timer_slack_ns =
2581 current->default_timer_slack_ns;
2582 else
2583 current->timer_slack_ns = arg2;
2584 break;
2585 case PR_MCE_KILL:
2586 if (arg4 | arg5)
2587 return -EINVAL;
2588 switch (arg2) {
2589 case PR_MCE_KILL_CLEAR:
2590 if (arg3 != 0)
2591 return -EINVAL;
2592 current->flags &= ~PF_MCE_PROCESS;
2593 break;
2594 case PR_MCE_KILL_SET:
2595 current->flags |= PF_MCE_PROCESS;
2596 if (arg3 == PR_MCE_KILL_EARLY)
2597 current->flags |= PF_MCE_EARLY;
2598 else if (arg3 == PR_MCE_KILL_LATE)
2599 current->flags &= ~PF_MCE_EARLY;
2600 else if (arg3 == PR_MCE_KILL_DEFAULT)
2601 current->flags &=
2602 ~(PF_MCE_EARLY|PF_MCE_PROCESS);
2603 else
2604 return -EINVAL;
2605 break;
2606 default:
2607 return -EINVAL;
2608 }
2609 break;
2610 case PR_MCE_KILL_GET:
2611 if (arg2 | arg3 | arg4 | arg5)
2612 return -EINVAL;
2613 if (current->flags & PF_MCE_PROCESS)
2614 error = (current->flags & PF_MCE_EARLY) ?
2615 PR_MCE_KILL_EARLY : PR_MCE_KILL_LATE;
2616 else
2617 error = PR_MCE_KILL_DEFAULT;
2618 break;
2619 case PR_SET_MM:
2620 error = prctl_set_mm(arg2, arg3, arg4, arg5);
2621 break;
2622 case PR_GET_TID_ADDRESS:
2623 error = prctl_get_tid_address(me, (int __user * __user *)arg2);
2624 break;
2625 case PR_SET_CHILD_SUBREAPER:
2626 me->signal->is_child_subreaper = !!arg2;
2627 if (!arg2)
2628 break;
2629
2630 walk_process_tree(me, propagate_has_child_subreaper, NULL);
2631 break;
2632 case PR_GET_CHILD_SUBREAPER:
2633 error = put_user(me->signal->is_child_subreaper,
2634 (int __user *)arg2);
2635 break;
2636 case PR_SET_NO_NEW_PRIVS:
2637 if (arg2 != 1 || arg3 || arg4 || arg5)
2638 return -EINVAL;
2639
2640 task_set_no_new_privs(current);
2641 break;
2642 case PR_GET_NO_NEW_PRIVS:
2643 if (arg2 || arg3 || arg4 || arg5)
2644 return -EINVAL;
2645 return task_no_new_privs(current) ? 1 : 0;
2646 case PR_GET_THP_DISABLE:
2647 if (arg2 || arg3 || arg4 || arg5)
2648 return -EINVAL;
2649 error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
2650 break;
2651 case PR_SET_THP_DISABLE:
2652 if (arg3 || arg4 || arg5)
2653 return -EINVAL;
2654 if (mmap_write_lock_killable(me->mm))
2655 return -EINTR;
2656 if (arg2)
2657 set_bit(MMF_DISABLE_THP, &me->mm->flags);
2658 else
2659 clear_bit(MMF_DISABLE_THP, &me->mm->flags);
2660 mmap_write_unlock(me->mm);
2661 break;
2662 case PR_GET_THP_POLICY:
2663 if (arg2 || arg3 || arg4 || arg5)
2664 return -EINVAL;
2665 if (mmap_write_lock_killable(mm))
2666 return -EINTR;
2667 if (mm->def_flags & VM_HUGEPAGE)
2668 error = PR_DEFAULT_MADV_HUGEPAGE;
2669 mmap_write_unlock(mm);
2670 break;
2671 case PR_SET_THP_POLICY:
2672 if (arg3 || arg4 || arg5)
2673 return -EINVAL;
2674 if (mmap_write_lock_killable(mm))
2675 return -EINTR;
2676 switch (arg2) {
2677 case PR_DEFAULT_MADV_HUGEPAGE:
> 2678 if (!hugepage_global_enabled())
2679 error = -EPERM;
> 2680 error = hugepage_set_vmflags(&mm->def_flags, MADV_HUGEPAGE);
2681 if (!error)
> 2682 process_default_madv_hugepage(mm, MADV_HUGEPAGE);
2683 break;
2684 default:
2685 error = -EINVAL;
2686 break;
2687 }
2688 mmap_write_unlock(mm);
2689 break;
2690 case PR_MPX_ENABLE_MANAGEMENT:
2691 case PR_MPX_DISABLE_MANAGEMENT:
2692 /* No longer implemented: */
2693 return -EINVAL;
2694 case PR_SET_FP_MODE:
2695 error = SET_FP_MODE(me, arg2);
2696 break;
2697 case PR_GET_FP_MODE:
2698 error = GET_FP_MODE(me);
2699 break;
2700 case PR_SVE_SET_VL:
2701 error = SVE_SET_VL(arg2);
2702 break;
2703 case PR_SVE_GET_VL:
2704 error = SVE_GET_VL();
2705 break;
2706 case PR_SME_SET_VL:
2707 error = SME_SET_VL(arg2);
2708 break;
2709 case PR_SME_GET_VL:
2710 error = SME_GET_VL();
2711 break;
2712 case PR_GET_SPECULATION_CTRL:
2713 if (arg3 || arg4 || arg5)
2714 return -EINVAL;
2715 error = arch_prctl_spec_ctrl_get(me, arg2);
2716 break;
2717 case PR_SET_SPECULATION_CTRL:
2718 if (arg4 || arg5)
2719 return -EINVAL;
2720 error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
2721 break;
2722 case PR_PAC_RESET_KEYS:
2723 if (arg3 || arg4 || arg5)
2724 return -EINVAL;
2725 error = PAC_RESET_KEYS(me, arg2);
2726 break;
2727 case PR_PAC_SET_ENABLED_KEYS:
2728 if (arg4 || arg5)
2729 return -EINVAL;
2730 error = PAC_SET_ENABLED_KEYS(me, arg2, arg3);
2731 break;
2732 case PR_PAC_GET_ENABLED_KEYS:
2733 if (arg2 || arg3 || arg4 || arg5)
2734 return -EINVAL;
2735 error = PAC_GET_ENABLED_KEYS(me);
2736 break;
2737 case PR_SET_TAGGED_ADDR_CTRL:
2738 if (arg3 || arg4 || arg5)
2739 return -EINVAL;
2740 error = SET_TAGGED_ADDR_CTRL(arg2);
2741 break;
2742 case PR_GET_TAGGED_ADDR_CTRL:
2743 if (arg2 || arg3 || arg4 || arg5)
2744 return -EINVAL;
2745 error = GET_TAGGED_ADDR_CTRL();
2746 break;
2747 case PR_SET_IO_FLUSHER:
2748 if (!capable(CAP_SYS_RESOURCE))
2749 return -EPERM;
2750
2751 if (arg3 || arg4 || arg5)
2752 return -EINVAL;
2753
2754 if (arg2 == 1)
2755 current->flags |= PR_IO_FLUSHER;
2756 else if (!arg2)
2757 current->flags &= ~PR_IO_FLUSHER;
2758 else
2759 return -EINVAL;
2760 break;
2761 case PR_GET_IO_FLUSHER:
2762 if (!capable(CAP_SYS_RESOURCE))
2763 return -EPERM;
2764
2765 if (arg2 || arg3 || arg4 || arg5)
2766 return -EINVAL;
2767
2768 error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
2769 break;
2770 case PR_SET_SYSCALL_USER_DISPATCH:
2771 error = set_syscall_user_dispatch(arg2, arg3, arg4,
2772 (char __user *) arg5);
2773 break;
2774 #ifdef CONFIG_SCHED_CORE
2775 case PR_SCHED_CORE:
2776 error = sched_core_share_pid(arg2, arg3, arg4, arg5);
2777 break;
2778 #endif
2779 case PR_SET_MDWE:
2780 error = prctl_set_mdwe(arg2, arg3, arg4, arg5);
2781 break;
2782 case PR_GET_MDWE:
2783 error = prctl_get_mdwe(arg2, arg3, arg4, arg5);
2784 break;
2785 case PR_PPC_GET_DEXCR:
2786 if (arg3 || arg4 || arg5)
2787 return -EINVAL;
2788 error = PPC_GET_DEXCR_ASPECT(me, arg2);
2789 break;
2790 case PR_PPC_SET_DEXCR:
2791 if (arg4 || arg5)
2792 return -EINVAL;
2793 error = PPC_SET_DEXCR_ASPECT(me, arg2, arg3);
2794 break;
2795 case PR_SET_VMA:
2796 error = prctl_set_vma(arg2, arg3, arg4, arg5);
2797 break;
2798 case PR_GET_AUXV:
2799 if (arg4 || arg5)
2800 return -EINVAL;
2801 error = prctl_get_auxv((void __user *)arg2, arg3);
2802 break;
2803 #ifdef CONFIG_KSM
2804 case PR_SET_MEMORY_MERGE:
2805 if (arg3 || arg4 || arg5)
2806 return -EINVAL;
2807 if (mmap_write_lock_killable(me->mm))
2808 return -EINTR;
2809
2810 if (arg2)
2811 error = ksm_enable_merge_any(me->mm);
2812 else
2813 error = ksm_disable_merge_any(me->mm);
2814 mmap_write_unlock(me->mm);
2815 break;
2816 case PR_GET_MEMORY_MERGE:
2817 if (arg2 || arg3 || arg4 || arg5)
2818 return -EINVAL;
2819
2820 error = !!test_bit(MMF_VM_MERGE_ANY, &me->mm->flags);
2821 break;
2822 #endif
2823 case PR_RISCV_V_SET_CONTROL:
2824 error = RISCV_V_SET_CONTROL(arg2);
2825 break;
2826 case PR_RISCV_V_GET_CONTROL:
2827 error = RISCV_V_GET_CONTROL();
2828 break;
2829 case PR_RISCV_SET_ICACHE_FLUSH_CTX:
2830 error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
2831 break;
2832 case PR_GET_SHADOW_STACK_STATUS:
2833 if (arg3 || arg4 || arg5)
2834 return -EINVAL;
2835 error = arch_get_shadow_stack_status(me, (unsigned long __user *) arg2);
2836 break;
2837 case PR_SET_SHADOW_STACK_STATUS:
2838 if (arg3 || arg4 || arg5)
2839 return -EINVAL;
2840 error = arch_set_shadow_stack_status(me, arg2);
2841 break;
2842 case PR_LOCK_SHADOW_STACK_STATUS:
2843 if (arg3 || arg4 || arg5)
2844 return -EINVAL;
2845 error = arch_lock_shadow_stack_status(me, arg2);
2846 break;
2847 case PR_TIMER_CREATE_RESTORE_IDS:
2848 if (arg3 || arg4 || arg5)
2849 return -EINVAL;
2850 error = posixtimer_create_prctl(arg2);
2851 break;
2852 default:
2853 trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
2854 error = -EINVAL;
2855 break;
2856 }
2857 return error;
2858 }
2859
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 25+ messages in thread
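The three errors in the report above are consistent with the new declarations not being visible to kernel/sys.c in this randconfig, for example if they sit under a config-dependent #ifdef in the header. That is an assumption; the report itself does not state the cause. A common way to keep such callers building is to pair each declaration with a static inline stub, sketched here with hypothetical names (MODEL_THP stands in for the real config symbol, and model_set_vmflags() for the real helper):

```c
#include <errno.h>

/* MODEL_THP is a hypothetical stand-in for the real config symbol. */
#ifdef MODEL_THP
int model_set_vmflags(unsigned long *vm_flags, int advice);
#else
/* Stub so callers still compile when the feature is configured out. */
static inline int model_set_vmflags(unsigned long *vm_flags, int advice)
{
	(void)vm_flags;
	(void)advice;
	return -EINVAL;
}
#endif

/* A caller in the style of the prctl handler; builds in both configurations. */
static int model_prctl_set_policy(unsigned long *def_flags, int advice)
{
	return model_set_vmflags(def_flags, advice);
}
```

With the stub in place the caller needs no #ifdef of its own, and the request simply fails with -EINVAL when the feature is not built in.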
* [PATCH v3 3/7] prctl: introduce PR_DEFAULT_MADV_NOHUGEPAGE for the process
2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
2025-05-19 22:29 ` [PATCH v3 1/7] mm: khugepaged: extract vm flag setting outside of hugepage_madvise Usama Arif
2025-05-19 22:29 ` [PATCH v3 2/7] prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process Usama Arif
@ 2025-05-19 22:29 ` Usama Arif
2025-05-19 22:29 ` [PATCH v3 4/7] prctl: introduce PR_THP_POLICY_SYSTEM " Usama Arif
` (6 subsequent siblings)
9 siblings, 0 replies; 25+ messages in thread
From: Usama Arif @ 2025-05-19 22:29 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, vbabka,
jannh, Arnd Bergmann, linux-kernel, linux-doc, kernel-team,
Usama Arif
This is set via the new PR_SET_THP_POLICY prctl. It has 2 effects:
- It sets VM_NOHUGEPAGE and clears VM_HUGEPAGE on the default VMA
flags (def_flags). This means that every new VMA will not be
considered for hugepage by default.
- It iterates through every VMA in the process and calls hugepage_madvise
on it, with MADV_NOHUGEPAGE policy.
The policy is inherited during fork+exec.
This effectively allows setting MADV_NOHUGEPAGE on the entire process.
In an environment where different types of workloads are stacked on the
same machine, this will allow workloads that benefit from having
hugepages on an madvise basis only to do so, without regressing those
that benefit from having hugepages always.
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
include/uapi/linux/prctl.h | 1 +
kernel/sys.c | 7 +++++++
tools/include/uapi/linux/prctl.h | 1 +
tools/perf/trace/beauty/include/uapi/linux/prctl.h | 1 +
4 files changed, 10 insertions(+)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 15aaa4db5ff8..33a6ef6a5a72 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -367,5 +367,6 @@ struct prctl_mm_map {
#define PR_SET_THP_POLICY 78
#define PR_GET_THP_POLICY 79
#define PR_DEFAULT_MADV_HUGEPAGE 0
+#define PR_DEFAULT_MADV_NOHUGEPAGE 1
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 74397ace62f3..6bb28b3666f7 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2666,6 +2666,8 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINTR;
if (mm->def_flags & VM_HUGEPAGE)
error = PR_DEFAULT_MADV_HUGEPAGE;
+ else if (mm->def_flags & VM_NOHUGEPAGE)
+ error = PR_DEFAULT_MADV_NOHUGEPAGE;
mmap_write_unlock(mm);
break;
case PR_SET_THP_POLICY:
@@ -2681,6 +2683,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
if (!error)
process_default_madv_hugepage(mm, MADV_HUGEPAGE);
break;
+ case PR_DEFAULT_MADV_NOHUGEPAGE:
+ error = hugepage_set_vmflags(&mm->def_flags, MADV_NOHUGEPAGE);
+ if (!error)
+ process_default_madv_hugepage(mm, MADV_NOHUGEPAGE);
+ break;
default:
error = -EINVAL;
break;
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index f5945ebfe3f2..e03d0ed890c5 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -331,5 +331,6 @@ struct prctl_mm_map {
#define PR_SET_THP_POLICY 78
#define PR_GET_THP_POLICY 79
#define PR_THP_POLICY_DEFAULT_HUGE 0
+#define PR_THP_POLICY_DEFAULT_NOHUGE 1
#endif /* _LINUX_PRCTL_H */
diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
index 325c72f40a93..d25458f4db9e 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
@@ -367,5 +367,6 @@ struct prctl_mm_map {
#define PR_SET_THP_POLICY 78
#define PR_GET_THP_POLICY 79
#define PR_THP_POLICY_DEFAULT_HUGE 0
+#define PR_THP_POLICY_DEFAULT_NOHUGE 1
#endif /* _LINUX_PRCTL_H */
--
2.47.1
* [PATCH v3 4/7] prctl: introduce PR_THP_POLICY_SYSTEM for the process
2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
` (2 preceding siblings ...)
2025-05-19 22:29 ` [PATCH v3 3/7] prctl: introduce PR_DEFAULT_MADV_NOHUGEPAGE " Usama Arif
@ 2025-05-19 22:29 ` Usama Arif
2025-05-19 22:29 ` [PATCH v3 5/7] selftests: prctl: introduce tests for PR_DEFAULT_MADV_NOHUGEPAGE Usama Arif
` (5 subsequent siblings)
9 siblings, 0 replies; 25+ messages in thread
From: Usama Arif @ 2025-05-19 22:29 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, vbabka,
jannh, Arnd Bergmann, linux-kernel, linux-doc, kernel-team,
Usama Arif
This is set via the new PR_SET_THP_POLICY prctl.
This clears VM_HUGEPAGE and VM_NOHUGEPAGE in mm->def_flags, which
resets the default VMA hugepage policy to the system-wide setting.
The exception is s390 when pgstes are switched on for the
userspace process: VM_NOHUGEPAGE was set on purpose there, so
only VM_HUGEPAGE is cleared.
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
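For illustration only, not part of the patch: a minimal userspace model of
the def_flags handling in this patch and of the PR_GET_THP_POLICY decoding
from the earlier patches. The flag values are stand-ins, the has_pgste
parameter is a hypothetical replacement for mm_has_pgste(mm), and the
function names are made up for this sketch.

```c
#include <assert.h>

/* Stand-in values for illustration; the kernel headers define the real ones. */
#define VM_HUGEPAGE			0x20000000UL
#define VM_NOHUGEPAGE			0x40000000UL

#define PR_DEFAULT_MADV_HUGEPAGE	0
#define PR_DEFAULT_MADV_NOHUGEPAGE	1
#define PR_THP_POLICY_SYSTEM		2

/*
 * Model of the PR_THP_POLICY_SYSTEM case. has_pgste is a hypothetical
 * stand-in for mm_has_pgste(mm) on s390; pass 0 on other architectures.
 */
unsigned long thp_reset_to_system(unsigned long def_flags, int has_pgste)
{
	if (has_pgste)
		return def_flags & ~VM_HUGEPAGE;
	return def_flags & ~(VM_HUGEPAGE | VM_NOHUGEPAGE);
}

/* Model of how PR_GET_THP_POLICY derives the policy from def_flags. */
int thp_policy_from_flags(unsigned long def_flags)
{
	if (def_flags & VM_HUGEPAGE)
		return PR_DEFAULT_MADV_HUGEPAGE;
	if (def_flags & VM_NOHUGEPAGE)
		return PR_DEFAULT_MADV_NOHUGEPAGE;
	return PR_THP_POLICY_SYSTEM;
}

/* Usage example: walk through the common case and the pgste case. */
void thp_model_selfcheck(void)
{
	unsigned long flags = VM_NOHUGEPAGE;

	/* Without pgstes, a reset leaves neither bit set. */
	flags = thp_reset_to_system(flags, 0);
	assert(thp_policy_from_flags(flags) == PR_THP_POLICY_SYSTEM);

	/* With pgstes, VM_NOHUGEPAGE survives the reset. */
	flags = thp_reset_to_system(VM_NOHUGEPAGE, 1);
	assert(flags == VM_NOHUGEPAGE);
}
```

Under this model, a process with pgstes enabled would still read back
PR_DEFAULT_MADV_NOHUGEPAGE after a reset to the system policy, because
VM_NOHUGEPAGE stays set in def_flags.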
include/uapi/linux/prctl.h | 1 +
kernel/sys.c | 17 +++++++++++++++++
tools/include/uapi/linux/prctl.h | 1 +
.../trace/beauty/include/uapi/linux/prctl.h | 1 +
4 files changed, 20 insertions(+)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 33a6ef6a5a72..508d78bc3364 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -368,5 +368,6 @@ struct prctl_mm_map {
#define PR_GET_THP_POLICY 79
#define PR_DEFAULT_MADV_HUGEPAGE 0
#define PR_DEFAULT_MADV_NOHUGEPAGE 1
+#define PR_THP_POLICY_SYSTEM 2
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 6bb28b3666f7..cffb60632d97 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2668,6 +2668,8 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = PR_DEFAULT_MADV_HUGEPAGE;
else if (mm->def_flags & VM_NOHUGEPAGE)
error = PR_DEFAULT_MADV_NOHUGEPAGE;
+ else
+ error = PR_THP_POLICY_SYSTEM;
mmap_write_unlock(mm);
break;
case PR_SET_THP_POLICY:
@@ -2688,6 +2690,21 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
if (!error)
process_default_madv_hugepage(mm, MADV_NOHUGEPAGE);
break;
+ case PR_THP_POLICY_SYSTEM:
+#ifdef CONFIG_S390
+ /*
+ * When s390 switches on pgstes for its userspace
+ * process (for kvm), it sets VM_NOHUGEPAGE.
+ * Do not clear it with system policy.
+ */
+ if (mm_has_pgste(mm))
+ mm->def_flags &= ~VM_HUGEPAGE;
+ else
+ mm->def_flags &= ~(VM_HUGEPAGE | VM_NOHUGEPAGE);
+#else
+ mm->def_flags &= ~(VM_HUGEPAGE | VM_NOHUGEPAGE);
+#endif
+ break;
default:
error = -EINVAL;
break;
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index e03d0ed890c5..cc209c9a8afb 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -332,5 +332,6 @@ struct prctl_mm_map {
#define PR_GET_THP_POLICY 79
#define PR_THP_POLICY_DEFAULT_HUGE 0
#define PR_THP_POLICY_DEFAULT_NOHUGE 1
+#define PR_THP_POLICY_SYSTEM 2
#endif /* _LINUX_PRCTL_H */
diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
index d25458f4db9e..340d5ff769a9 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h
@@ -368,5 +368,6 @@ struct prctl_mm_map {
#define PR_GET_THP_POLICY 79
#define PR_THP_POLICY_DEFAULT_HUGE 0
#define PR_THP_POLICY_DEFAULT_NOHUGE 1
+#define PR_THP_POLICY_SYSTEM 2
#endif /* _LINUX_PRCTL_H */
--
2.47.1
* [PATCH v3 5/7] selftests: prctl: introduce tests for PR_DEFAULT_MADV_NOHUGEPAGE
2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
` (3 preceding siblings ...)
2025-05-19 22:29 ` [PATCH v3 4/7] prctl: introduce PR_THP_POLICY_SYSTEM " Usama Arif
@ 2025-05-19 22:29 ` Usama Arif
2025-05-19 22:29 ` [PATCH v3 6/7] selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE Usama Arif
` (4 subsequent siblings)
9 siblings, 0 replies; 25+ messages in thread
From: Usama Arif @ 2025-05-19 22:29 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, vbabka,
jannh, Arnd Bergmann, linux-kernel, linux-doc, kernel-team,
Usama Arif
The test is limited to 2M PMD THPs. It does not modify the system
settings, so as not to disturb other processes running on the system.
It only runs if the PMD size is 2M, the 2M policy is set to inherit
and the system global THP policy is set to "always", so that
the change in behaviour due to PR_DEFAULT_MADV_NOHUGEPAGE can
be seen.
This tests that:
- the process can successfully set the policy
- the policy is carried over to the new process with fork
- no hugepages are allocated when the process doesn't MADV_HUGEPAGE
- hugepages are allocated when the process does MADV_HUGEPAGE
- the process can successfully reset the policy to PR_DEFAULT_SYSTEM
- hugepages are allocated after the policy reset
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
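For illustration only, not part of the patch: a minimal sketch of parsing
the AnonHugePages value numerically instead of matching the substring
"24576 kB", which would also match a line such as "124576 kB". The
function names are hypothetical and the expected size is passed in by the
caller.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse one line of /proc/self/smaps. Returns 1 and stores the size in
 * *kb if the line is an "AnonHugePages:" entry, 0 otherwise.
 */
int parse_anon_huge_kb(const char *line, unsigned long *kb)
{
	return sscanf(line, "AnonHugePages: %lu kB", kb) == 1;
}

/* Returns 1 if the line reports exactly expected_kb of anonymous THP. */
int line_has_anon_huge_kb(const char *line, unsigned long expected_kb)
{
	unsigned long kb;

	return parse_anon_huge_kb(line, &kb) && kb == expected_kb;
}

/* Usage example with the 12 x 2MB buffer size used by the test. */
void smaps_parse_selfcheck(void)
{
	assert(line_has_anon_huge_kb("AnonHugePages:     24576 kB\n", 24576));
	assert(!line_has_anon_huge_kb("AnonHugePages:    124576 kB\n", 24576));
}
```

The check still looks at one line at a time, so it keeps the assumption
that the whole buffer shows up as a single smaps entry.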
tools/testing/selftests/prctl/Makefile | 2 +-
tools/testing/selftests/prctl/thp_policy.c | 214 +++++++++++++++++++++
2 files changed, 215 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/prctl/thp_policy.c
diff --git a/tools/testing/selftests/prctl/Makefile b/tools/testing/selftests/prctl/Makefile
index 01dc90fbb509..ee8c98e45b53 100644
--- a/tools/testing/selftests/prctl/Makefile
+++ b/tools/testing/selftests/prctl/Makefile
@@ -5,7 +5,7 @@ ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)
ifeq ($(ARCH),x86)
TEST_PROGS := disable-tsc-ctxt-sw-stress-test disable-tsc-on-off-stress-test \
- disable-tsc-test set-anon-vma-name-test set-process-name
+ disable-tsc-test set-anon-vma-name-test set-process-name thp_policy
all: $(TEST_PROGS)
include ../lib.mk
diff --git a/tools/testing/selftests/prctl/thp_policy.c b/tools/testing/selftests/prctl/thp_policy.c
new file mode 100644
index 000000000000..7791d282f7c8
--- /dev/null
+++ b/tools/testing/selftests/prctl/thp_policy.c
@@ -0,0 +1,214 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This test covers the PR_GET/SET_THP_POLICY functionality of prctl calls
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+
+#ifndef PR_SET_THP_POLICY
+#define PR_SET_THP_POLICY 78
+#define PR_GET_THP_POLICY 79
+#define PR_DEFAULT_MADV_HUGEPAGE 0
+#define PR_DEFAULT_MADV_NOHUGEPAGE 1
+#define PR_DEFAULT_SYSTEM 2
+#endif
+
+#define CONTENT_SIZE 256
+#define BUF_SIZE (12 * 2 * 1024 * 1024) // 12 x 2MB pages
+
+enum system_policy {
+ SYSTEM_POLICY_ALWAYS,
+ SYSTEM_POLICY_MADVISE,
+ SYSTEM_POLICY_NEVER,
+};
+
+int system_thp_policy;
+
+/* check if the sysfs file contains the expected substring */
+static int check_file_content(const char *file_path, const char *expected_substring)
+{
+ FILE *file = fopen(file_path, "r");
+ char buffer[CONTENT_SIZE];
+
+ if (!file) {
+ perror("Failed to open file");
+ return -1;
+ }
+ if (fgets(buffer, CONTENT_SIZE, file) == NULL) {
+ perror("Failed to read file");
+ fclose(file);
+ return -1;
+ }
+ fclose(file);
+ // Remove newline character from the buffer
+ buffer[strcspn(buffer, "\n")] = '\0';
+ if (strstr(buffer, expected_substring))
+ return 0;
+ else
+ return 1;
+}
+
+/*
+ * The test is designed for 2M hugepages only.
+ * Check if hugepage size is 2M, if 2M size inherits from global
+ * setting, and if the global setting is madvise or always.
+ */
+static int sysfs_check(void)
+{
+ int res = 0;
+
+ res = check_file_content("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "2097152");
+ if (res) {
+ printf("hpage_pmd_size is not set to 2MB. Skipping test.\n");
+ return -1;
+ }
+ res |= check_file_content("/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled",
+ "[inherit]");
+ if (res) {
+ printf("hugepages-2048kB does not inherit global setting. Skipping test.\n");
+ return -1;
+ }
+
+ res = check_file_content("/sys/kernel/mm/transparent_hugepage/enabled", "[madvise]");
+ if (!res) {
+ system_thp_policy = SYSTEM_POLICY_MADVISE;
+ return 0;
+ }
+ res = check_file_content("/sys/kernel/mm/transparent_hugepage/enabled", "[always]");
+ if (!res) {
+ system_thp_policy = SYSTEM_POLICY_ALWAYS;
+ return 0;
+ }
+ printf("Global THP policy not set to madvise or always. Skipping test.\n");
+ return -1;
+}
+
+static int check_smaps_for_huge(void)
+{
+ FILE *file = fopen("/proc/self/smaps", "r");
+ int is_anonhuge = 0;
+ char line[256];
+
+ if (!file) {
+ perror("fopen");
+ return -1;
+ }
+
+ while (fgets(line, sizeof(line), file)) {
+ if (strstr(line, "AnonHugePages:") && strstr(line, "24576 kB")) {
+ is_anonhuge = 1;
+ break;
+ }
+ }
+ fclose(file);
+ return is_anonhuge;
+}
+
+static int test_mmap_thp(int madvise_buffer)
+{
+ int is_anonhuge;
+
+ char *buffer = (char *)mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (buffer == MAP_FAILED) {
+ perror("mmap");
+ return -1;
+ }
+ if (madvise_buffer)
+ madvise(buffer, BUF_SIZE, MADV_HUGEPAGE);
+
+ // set memory to ensure it's allocated
+ memset(buffer, 0, BUF_SIZE);
+ is_anonhuge = check_smaps_for_huge();
+ munmap(buffer, BUF_SIZE);
+ return is_anonhuge;
+}
+
+/* Global policy is always, process is changed to NOHUGE (process becomes madvise) */
+static int test_global_always_process_nohuge(void)
+{
+ int is_anonhuge = 0, res = 0, status = 0;
+ pid_t pid;
+
+ if (prctl(PR_SET_THP_POLICY, PR_DEFAULT_MADV_NOHUGEPAGE, NULL, NULL, NULL) != 0) {
+ perror("prctl failed to set policy to madvise");
+ return -1;
+ }
+
+ /* Make sure prctl changes are carried across fork */
+ pid = fork();
+ if (pid < 0) {
+ perror("fork");
+ exit(EXIT_FAILURE);
+ }
+
+ res = prctl(PR_GET_THP_POLICY, NULL, NULL, NULL, NULL);
+ if (res != PR_DEFAULT_MADV_NOHUGEPAGE) {
+ printf("prctl PR_GET_THP_POLICY returned %d pid %d\n", res, pid);
+ goto err_out;
+ }
+
+ /* global = always, process = madvise, we shouldn't get HPs without madvise */
+ is_anonhuge = test_mmap_thp(0);
+ if (is_anonhuge) {
+ printf(
+ "PR_DEFAULT_MADV_NOHUGEPAGE set but still got hugepages without MADV_HUGEPAGE\n");
+ goto err_out;
+ }
+
+ is_anonhuge = test_mmap_thp(1);
+ if (!is_anonhuge) {
+ printf(
+ "PR_DEFAULT_MADV_NOHUGEPAGE set but didn't get hugepages with MADV_HUGEPAGE\n");
+ goto err_out;
+ }
+
+ /* Reset to system policy */
+ if (prctl(PR_SET_THP_POLICY, PR_DEFAULT_SYSTEM, NULL, NULL, NULL) != 0) {
+ perror("prctl failed to set policy to system");
+ goto err_out;
+ }
+
+ is_anonhuge = test_mmap_thp(0);
+ if (!is_anonhuge) {
+ printf("global policy is always but we still didn't get hugepages\n");
+ goto err_out;
+ }
+
+ is_anonhuge = test_mmap_thp(1);
+ if (!is_anonhuge) {
+ printf("global policy is always but we still didn't get hugepages\n");
+ goto err_out;
+ }
+
+ if (pid == 0) {
+ exit(EXIT_SUCCESS);
+ } else {
+ wait(&status);
+ if (WIFEXITED(status))
+ return 0;
+ else
+ return -1;
+ }
+
+err_out:
+ if (pid == 0)
+ exit(EXIT_FAILURE);
+ else
+ return -1;
+}
+
+int main(void)
+{
+ if (sysfs_check())
+ return 0;
+
+ if (system_thp_policy == SYSTEM_POLICY_ALWAYS)
+ return test_global_always_process_nohuge();
+
+}
--
2.47.1
* [PATCH v3 6/7] selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
` (4 preceding siblings ...)
2025-05-19 22:29 ` [PATCH v3 5/7] selftests: prctl: introduce tests for PR_DEFAULT_MADV_NOHUGEPAGE Usama Arif
@ 2025-05-19 22:29 ` Usama Arif
2025-05-19 22:29 ` [PATCH v3 7/7] docs: transhuge: document process level THP controls Usama Arif
` (3 subsequent siblings)
9 siblings, 0 replies; 25+ messages in thread
From: Usama Arif @ 2025-05-19 22:29 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, vbabka,
jannh, Arnd Bergmann, linux-kernel, linux-doc, kernel-team,
Usama Arif
The test is limited to 2M PMD THPs. It does not modify the system
settings, so as not to disturb other processes running on the
system.
It only runs if the PMD size is 2M, the 2M policy is set to inherit
and the system global THP policy is set to "madvise", so that
the change in behaviour due to PR_THP_POLICY_DEFAULT_HUGE can
be seen.
This tests that:
- the process can successfully set the policy
- the policy is carried over to the new process with fork
- hugepages are allocated both with and without madvise
- the process can successfully reset the policy to
PR_DEFAULT_SYSTEM
- after the policy reset, hugepages are allocated only with MADV_HUGEPAGE
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
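For illustration only, not part of the patch: WIFEXITED() is also true
when the child calls exit(EXIT_FAILURE), so a parent that only checks
WIFEXITED() would not notice a failure in the forked child. A minimal
sketch of a stricter check follows; both helper names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Wait for pid; return 0 only if it exited normally with EXIT_SUCCESS. */
int wait_child_ok(pid_t pid)
{
	int status;

	if (waitpid(pid, &status, 0) != pid)
		return -1;
	if (WIFEXITED(status) && WEXITSTATUS(status) == EXIT_SUCCESS)
		return 0;
	return -1;
}

/*
 * Hypothetical helper for the usage example: fork a child that exits
 * immediately with the given code and report what the parent observes.
 */
int run_child_with_exit_code(int code)
{
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(code);
	return wait_child_ok(pid);
}

/* Usage example: a failing child is reported as a failure. */
void wait_child_selfcheck(void)
{
	assert(run_child_with_exit_code(EXIT_SUCCESS) == 0);
	assert(run_child_with_exit_code(EXIT_FAILURE) == -1);
}
```

In the selftest, the parent runs the same checks as the child, so in
practice most failures would probably still be caught on the parent side.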
tools/testing/selftests/prctl/thp_policy.c | 74 +++++++++++++++++++++-
1 file changed, 73 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/prctl/thp_policy.c b/tools/testing/selftests/prctl/thp_policy.c
index 7791d282f7c8..62cf1fa6fd28 100644
--- a/tools/testing/selftests/prctl/thp_policy.c
+++ b/tools/testing/selftests/prctl/thp_policy.c
@@ -203,6 +203,77 @@ static int test_global_always_process_nohuge(void)
return -1;
}
+/* Global policy is madvise, process is changed to HUGE (process becomes always) */
+static int test_global_madvise_process_huge(void)
+{
+ int is_anonhuge = 0, res = 0, status = 0;
+ pid_t pid;
+
+ if (prctl(PR_SET_THP_POLICY, PR_DEFAULT_MADV_HUGEPAGE, NULL, NULL, NULL) != 0) {
+ perror("prctl failed to set process policy to always");
+ return -1;
+ }
+
+ /* Make sure prctl changes are carried across fork */
+ pid = fork();
+ if (pid < 0) {
+ perror("fork");
+ exit(EXIT_FAILURE);
+ }
+
+ res = prctl(PR_GET_THP_POLICY, NULL, NULL, NULL, NULL);
+ if (res != PR_DEFAULT_MADV_HUGEPAGE) {
+ printf("prctl PR_GET_THP_POLICY returned %d pid %d\n", res, pid);
+ goto err_out;
+ }
+
+ /* global = madvise, process = always, we should get HPs irrespective of MADV_HUGEPAGE */
+ is_anonhuge = test_mmap_thp(0);
+ if (!is_anonhuge) {
+ printf("PR_DEFAULT_MADV_HUGEPAGE set but didn't get hugepages\n");
+ goto err_out;
+ }
+
+ is_anonhuge = test_mmap_thp(1);
+ if (!is_anonhuge) {
+ printf("PR_DEFAULT_MADV_HUGEPAGE set but didn't get hugepages\n");
+ goto err_out;
+ }
+
+ /* Reset to system policy */
+ if (prctl(PR_SET_THP_POLICY, PR_DEFAULT_SYSTEM, NULL, NULL, NULL) != 0) {
+ perror("prctl failed to set policy to system");
+ goto err_out;
+ }
+
+ is_anonhuge = test_mmap_thp(0);
+ if (is_anonhuge) {
+ printf("global policy is madvise but got hugepages without MADV_HUGEPAGE\n");
+ goto err_out;
+ }
+
+ is_anonhuge = test_mmap_thp(1);
+ if (!is_anonhuge) {
+ printf("global policy is madvise but didn't get hugepages with MADV_HUGEPAGE\n");
+ goto err_out;
+ }
+
+ if (pid == 0) {
+ exit(EXIT_SUCCESS);
+ } else {
+ wait(&status);
+ if (WIFEXITED(status))
+ return 0;
+ else
+ return -1;
+ }
+err_out:
+ if (pid == 0)
+ exit(EXIT_FAILURE);
+ else
+ return -1;
+}
+
int main(void)
{
if (sysfs_check())
@@ -210,5 +281,6 @@ int main(void)
if (system_thp_policy == SYSTEM_POLICY_ALWAYS)
return test_global_always_process_nohuge();
-
+ else
+ return test_global_madvise_process_huge();
}
--
2.47.1
* [PATCH v3 7/7] docs: transhuge: document process level THP controls
2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
` (5 preceding siblings ...)
2025-05-19 22:29 ` [PATCH v3 6/7] selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE Usama Arif
@ 2025-05-19 22:29 ` Usama Arif
2025-05-20 5:14 ` [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Lorenzo Stoakes
` (2 subsequent siblings)
9 siblings, 0 replies; 25+ messages in thread
From: Usama Arif @ 2025-05-19 22:29 UTC (permalink / raw)
To: Andrew Morton, david, linux-mm
Cc: hannes, shakeel.butt, riel, ziy, laoar.shao, baolin.wang,
lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, vbabka,
jannh, Arnd Bergmann, linux-kernel, linux-doc, kernel-team,
Usama Arif
This includes the already existing PR_GET/SET_THP_DISABLE policy,
as well as the newly introduced PR_GET/SET_THP_POLICY.
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
Documentation/admin-guide/mm/transhuge.rst | 42 ++++++++++++++++++++++
1 file changed, 42 insertions(+)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index dff8d5985f0f..79983c20ae48 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -218,6 +218,48 @@ to "always" or "madvise"), and it'll be automatically shutdown when
PMD-sized THP is disabled (when both the per-size anon control and the
top-level control are "never")
+process THP controls
+--------------------
+
+Transparent Hugepage behaviour of a process can be modified/obtained by
+using the prctl system call. The following operations are supported:
+
+PR_SET_THP_DISABLE
+ This will set the MMF_DISABLE_THP process flag which will result
+ in no hugepages being faulted in or collapsed by khugepaged,
+ irrespective of global THP controls.
+
+PR_GET_THP_DISABLE
+ This will return the MMF_DISABLE_THP process flag, which will be
+ set if the process has previously been set with PR_SET_THP_DISABLE.
+
+PR_SET_THP_POLICY
+ This is used to change the behaviour of existing and future VMAs.
+ It has support for the following policies:
+
+ PR_DEFAULT_MADV_HUGEPAGE
+ This will set VM_HUGEPAGE and clear VM_NOHUGEPAGE for the default
+ VMA flags. It will also iterate through every VMA in the process
+ and call hugepage_madvise on it, with MADV_HUGEPAGE policy.
+ This effectively allows setting MADV_HUGEPAGE on the entire process.
+ The policy is inherited during fork+exec.
+
+ PR_DEFAULT_MADV_NOHUGEPAGE
+ This will set VM_NOHUGEPAGE and clear VM_HUGEPAGE for the default
+ VMA flags. It will also iterate through every VMA in the process
+ and call hugepage_madvise on it, with MADV_NOHUGEPAGE policy.
+ This effectively allows setting MADV_NOHUGEPAGE on the entire process.
+ The policy is inherited during fork+exec.
+
+ PR_THP_POLICY_SYSTEM
+ This will reset (clear) both VM_HUGEPAGE and VM_NOHUGEPAGE in the
+ default VMA flags of the process.
+
+PR_GET_THP_POLICY
+ This will return the current THP policy of the process, i.e.
+ PR_DEFAULT_MADV_HUGEPAGE, PR_DEFAULT_MADV_NOHUGEPAGE or
+ PR_THP_POLICY_SYSTEM.
+
Khugepaged controls
-------------------
--
2.47.1
* Re: [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY
2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
` (6 preceding siblings ...)
2025-05-19 22:29 ` [PATCH v3 7/7] docs: transhuge: document process level THP controls Usama Arif
@ 2025-05-20 5:14 ` Lorenzo Stoakes
2025-05-20 7:46 ` Usama Arif
2025-05-21 2:33 ` Liam R. Howlett
2025-05-22 12:10 ` Mike Rapoport
9 siblings, 1 reply; 25+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 5:14 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
vbabka, jannh, Arnd Bergmann, linux-kernel, linux-doc,
kernel-team
NACK the whole series.
Usama - I explicitly said make this an RFC, so we can see what this
approach _looks like_ to further examine it, to which you agreed. And now
you've sent it non-RFC. That's not acceptable.
If you agree to something in review, it's not then optional as to whether
you do it.
Thanks.
On Mon, May 19, 2025 at 11:29:52PM +0100, Usama Arif wrote:
> This series allows to change the THP policy of a process, according to the
> value set in arg2, all of which will be inherited during fork+exec:
> - PR_DEFAULT_MADV_HUGEPAGE: This will set VM_HUGEPAGE and clear VM_NOHUGEPAGE
> for the default VMA flags. It will also iterate through every VMA in the
> process and call hugepage_madvise on it, with MADV_HUGEPAGE policy.
> This effectively allows setting MADV_HUGEPAGE on the entire process.
> In an environment where different types of workloads are run on the
> same machine, this will allow workloads that benefit from always having
> hugepages to do so, without regressing those that don't.
> - PR_DEFAULT_MADV_NOHUGEPAGE: This will set VM_NOHUGEPAGE and clear VM_HUGEPAGE
> for the default VMA flags. It will also iterate through every VMA in the
> process and call hugepage_madvise on it, with MADV_NOHUGEPAGE policy.
> This effectively allows setting MADV_NOHUGEPAGE on the entire process.
> In an environment where different types of workloads are run on the
> same machine,this will allow workloads that benefit from having
> hugepages on an madvise basis only to do so, without regressing those
> that benefit from having hugepages always.
> - PR_THP_POLICY_SYSTEM: This will reset (clear) both VM_HUGEPAGE and
> VM_NOHUGEPAGE process for the default flags.
>
> In hyperscalers, we have a single THP policy for the entire fleet.
> We have different types of workloads (e.g. AI/compute/databases/etc)
> running on a single server.
> Some of these workloads will benefit from always getting THP at fault
> (or collapsed by khugepaged), some of them will benefit by only getting
> them at madvise.
>
> This series is useful for 2 usecases:
> 1) global system policy = madvise, while we want some workloads to get THPs
> at fault and by khugepaged :- some processes (e.g. AI workloads) benefits
> from getting THPs at fault (and collapsed by khugepaged). Other workloads
> like databases will incur regression (either a performance regression or
> they are completely memory bound and even a very slight increase in memory
> will cause them to OOM). So what these patches will do is allow setting
> prctl(PR_DEFAULT_MADV_HUGEPAGE) on the AI workloads, (This is how
> workloads are deployed in our (Meta's/Facebook) fleet at this moment).
>
> 2) global system policy = always, while we want some workloads to get THPs
> only on madvise basis :- Same reason as 1). What these patches
> will do is allow setting prctl(PR_DEFAULT_MADV_NOHUGEPAGE) on the database
> workloads. (We hope this is us (Meta) in the near future, if a majority of
> workloads show that they benefit from always, we flip the default host
> setting to "always" across the fleet and workloads that regress can opt-out
> and be "madvise". New services developed will then be tested with always by
> default. "always" is also the default defconfig option upstream, so I would
> imagine this is faced by others as well.)
>
> v2->v3: (Thanks Lorenzo for all the below feedback!)
> v2: https://lore.kernel.org/all/20250515133519.2779639-1-usamaarif642@gmail.com/
> - no more flags2.
> - no more MMF2_...
> - renamed policy to PR_DEFAULT_MADV_(NO)HUGEPAGE
> - mmap_write_lock_killable acquired in PR_GET_THP_POLICY
> - mmap_write lock fixed in PR_SET_THP_POLICY
> - mmap assert check in process_default_madv_hugepage
> - check if hugepage_global_enabled is enabled in the call and account for s390
> - set mm->def_flags VM_HUGEPAGE and VM_NOHUGEPAGE according to the policy in
> the way done by madvise(). I believe VM merge will not be broken in
> this way.
> - process_default_madv_hugepage function that does for_each_vma and calls
> hugepage_madvise.
>
> v1->v2:
> - change from modifying the THP decision making for the process, to modifying
> VMA flags only. This prevents further complicating the logic used to
> determine THP order (Thanks David!)
> - change from using a prctl per policy change to just using PR_SET_THP_POLICY
> and arg2 to set the policy. (Zi Yan)
> - Introduce PR_THP_POLICY_DEFAULT_NOHUGE and PR_THP_POLICY_DEFAULT_SYSTEM
> - Add selftests and documentation.
>
> Usama Arif (7):
> mm: khugepaged: extract vm flag setting outside of hugepage_madvise
> prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
> prctl: introduce PR_DEFAULT_MADV_NOHUGEPAGE for the process
> prctl: introduce PR_THP_POLICY_SYSTEM for the process
> selftests: prctl: introduce tests for PR_DEFAULT_MADV_NOHUGEPAGE
> selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
> docs: transhuge: document process level THP controls
>
> Documentation/admin-guide/mm/transhuge.rst | 42 +++
> include/linux/huge_mm.h | 2 +
> include/linux/mm.h | 2 +-
> include/linux/mm_types.h | 4 +-
> include/uapi/linux/prctl.h | 6 +
> kernel/sys.c | 53 ++++
> mm/huge_memory.c | 13 +
> mm/khugepaged.c | 26 +-
> tools/include/uapi/linux/prctl.h | 6 +
> .../trace/beauty/include/uapi/linux/prctl.h | 6 +
> tools/testing/selftests/prctl/Makefile | 2 +-
> tools/testing/selftests/prctl/thp_policy.c | 286 ++++++++++++++++++
> 12 files changed, 436 insertions(+), 12 deletions(-)
> create mode 100644 tools/testing/selftests/prctl/thp_policy.c
>
> --
> 2.47.1
>
* Re: [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY
2025-05-20 5:14 ` [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Lorenzo Stoakes
@ 2025-05-20 7:46 ` Usama Arif
2025-05-20 8:51 ` Lorenzo Stoakes
0 siblings, 1 reply; 25+ messages in thread
From: Usama Arif @ 2025-05-20 7:46 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
vbabka, jannh, Arnd Bergmann, linux-kernel, linux-doc,
kernel-team
On 20/05/2025 06:14, Lorenzo Stoakes wrote:
> NACK the whole series.
>
> Usama - I explicitly said make this an RFC, so we can see what this
> approach _looks like_ to further examine it, to which you agreed. And now
> you've sent it non-RFC. That's not acceptable.
>
> If you agree to something in review, it's not then optional as to whether
> you do it.
It was a bit late yesterday and I completely forgot to change --subject-prefix="PATCH v3"
to --subject-prefix="RFC v3". Mistakes happen and I apologize.
I agreed to make it RFC and had full intention of doing that.
Would you like me to resend it with the RFC tag?
Thanks,
Usama
* Re: [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY
2025-05-20 7:46 ` Usama Arif
@ 2025-05-20 8:51 ` Lorenzo Stoakes
0 siblings, 0 replies; 25+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 8:51 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, Liam.Howlett, npache, ryan.roberts,
vbabka, jannh, Arnd Bergmann, linux-kernel, linux-doc,
kernel-team
On Tue, May 20, 2025 at 08:46:43AM +0100, Usama Arif wrote:
>
>
> On 20/05/2025 06:14, Lorenzo Stoakes wrote:
> > NACK the whole series.
> >
> > Usama - I explicitly said make this an RFC, so we can see what this
> > approach _looks like_ to further examine it, to which you agreed. And now
> > you've sent it non-RFC. That's not acceptable.
> >
> > If you agree to something in review, it's not then optional as to whether
> > you do it.
>
> It was a bit late yesterday and I completely forgot to change --subject-prefix="PATCH v3"
> to --subject-prefix="RFC v3". Mistakes happen and I apologize.
Ack, but in future please try to be careful about this! This obviously
changes the nature of the series, and it's important to highlight that
we're still in the planning stages here.
>
> I agreed to make it RFC and had full intention of doing that.
> Would you like me to resend it with the RFC tag?
There's no need, we've got discussion here already so it's sensible to keep
things as-is; the series is in effect an RFC now as it's NACK'd.
>
> Thanks,
> Usama
Cheers, Lorenzo
* Re: [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY
2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
` (7 preceding siblings ...)
2025-05-20 5:14 ` [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Lorenzo Stoakes
@ 2025-05-21 2:33 ` Liam R. Howlett
2025-05-21 9:31 ` Usama Arif
2025-05-22 12:10 ` Mike Rapoport
9 siblings, 1 reply; 25+ messages in thread
From: Liam R. Howlett @ 2025-05-21 2:33 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, lorenzo.stoakes, npache, ryan.roberts,
vbabka, jannh, Arnd Bergmann, linux-kernel, linux-doc,
kernel-team
* Usama Arif <usamaarif642@gmail.com> [250519 18:34]:
> This series allows to change the THP policy of a process, according to the
> value set in arg2, all of which will be inherited during fork+exec:
> - PR_DEFAULT_MADV_HUGEPAGE: This will set VM_HUGEPAGE and clear VM_NOHUGEPAGE
> for the default VMA flags. It will also iterate through every VMA in the
> process and call hugepage_madvise on it, with MADV_HUGEPAGE policy.
> This effectively allows setting MADV_HUGEPAGE on the entire process.
> In an environment where different types of workloads are run on the
> same machine, this will allow workloads that benefit from always having
> hugepages to do so, without regressing those that don't.
> - PR_DEFAULT_MADV_NOHUGEPAGE: This will set VM_NOHUGEPAGE and clear VM_HUGEPAGE
> for the default VMA flags. It will also iterate through every VMA in the
> process and call hugepage_madvise on it, with MADV_NOHUGEPAGE policy.
> This effectively allows setting MADV_NOHUGEPAGE on the entire process.
> In an environment where different types of workloads are run on the
> same machine,this will allow workloads that benefit from having
> hugepages on an madvise basis only to do so, without regressing those
> that benefit from having hugepages always.
> - PR_THP_POLICY_SYSTEM: This will reset (clear) both VM_HUGEPAGE and
> VM_NOHUGEPAGE process for the default flags.
>
Subject seems outdated now? PR_DEFAULT_ vs PR_SET/GET_THP ?
On that note, doesn't it make sense to change the default mm flag under
PR_SET_MM? PR_SET_MM_FLAG maybe?
Thanks,
Liam
* Re: [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY
2025-05-21 2:33 ` Liam R. Howlett
@ 2025-05-21 9:31 ` Usama Arif
2025-05-21 16:37 ` Liam R. Howlett
0 siblings, 1 reply; 25+ messages in thread
From: Usama Arif @ 2025-05-21 9:31 UTC (permalink / raw)
To: Liam R. Howlett, Andrew Morton, david, linux-mm, hannes,
shakeel.butt, riel, ziy, laoar.shao, baolin.wang, lorenzo.stoakes,
npache, ryan.roberts, vbabka, jannh, Arnd Bergmann, linux-kernel,
linux-doc, kernel-team
On 21/05/2025 03:33, Liam R. Howlett wrote:
> * Usama Arif <usamaarif642@gmail.com> [250519 18:34]:
>> This series allows to change the THP policy of a process, according to the
>> value set in arg2, all of which will be inherited during fork+exec:
>> - PR_DEFAULT_MADV_HUGEPAGE: This will set VM_HUGEPAGE and clear VM_NOHUGEPAGE
>> for the default VMA flags. It will also iterate through every VMA in the
>> process and call hugepage_madvise on it, with MADV_HUGEPAGE policy.
>> This effectively allows setting MADV_HUGEPAGE on the entire process.
>> In an environment where different types of workloads are run on the
>> same machine, this will allow workloads that benefit from always having
>> hugepages to do so, without regressing those that don't.
>> - PR_DEFAULT_MADV_NOHUGEPAGE: This will set VM_NOHUGEPAGE and clear VM_HUGEPAGE
>> for the default VMA flags. It will also iterate through every VMA in the
>> process and call hugepage_madvise on it, with MADV_NOHUGEPAGE policy.
>> This effectively allows setting MADV_NOHUGEPAGE on the entire process.
>> In an environment where different types of workloads are run on the
>> same machine, this will allow workloads that benefit from having
>> hugepages on an madvise basis only to do so, without regressing those
>> that benefit from having hugepages always.
>> - PR_THP_POLICY_SYSTEM: This will reset (clear) both VM_HUGEPAGE and
>> VM_NOHUGEPAGE in the process default flags.
>>
>
> Subject seems outdated now? PR_DEFAULT_ vs PR_SET/GET_THP ?
No, it's not.
prctl takes 5 args; the first 2 are relevant here.
The first arg selects the op. This series introduces 2 ops, PR_SET_THP_POLICY
and PR_GET_THP_POLICY, to set and get the policy. This is the subject.
The 2nd arg describes the policies: PR_DEFAULT_MADV_HUGEPAGE, PR_DEFAULT_MADV_NOHUGEPAGE
and PR_THP_POLICY_SYSTEM.
The subject is correct.
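As a rough illustration of the argument layout described above, the sketch below packs a PR_SET_THP_POLICY request the way userspace would hand it to prctl(2): the op in the first argument, the policy in the second, zeros for the rest. The numeric values of the constants are placeholders invented for this example (the real ones come from the series' change to include/uapi/linux/prctl.h, which is not shown in this thread), and struct prctl_call, thp_set_policy_call() and issue_prctl_call() are hypothetical helper names. Passing zeros in the three unused arguments is also an assumption. PR_GET_THP_POLICY is left out because the thread does not say how the policy is returned.

```c
#include <sys/prctl.h>

/* Placeholder numbers for illustration only; NOT the values from the series. */
#ifndef PR_SET_THP_POLICY
#define PR_SET_THP_POLICY		0x54485001
#endif
#ifndef PR_DEFAULT_MADV_HUGEPAGE
#define PR_DEFAULT_MADV_HUGEPAGE	1
#define PR_DEFAULT_MADV_NOHUGEPAGE	2
#define PR_THP_POLICY_SYSTEM		3
#endif

/* Hypothetical container for the five prctl arguments. */
struct prctl_call {
	int op;			/* arg1: selects the operation */
	unsigned long args[4];	/* arg2..arg5 */
};

/* Build a "set policy" request: op in arg1, policy in arg2, rest zero. */
static inline struct prctl_call thp_set_policy_call(unsigned long policy)
{
	struct prctl_call c = {
		.op = PR_SET_THP_POLICY,
		.args = { policy, 0, 0, 0 },
	};
	return c;
}

/*
 * Hand the request to the kernel. On a kernel without this series (or with
 * the placeholder numbers above) this is expected to fail with -1.
 */
static inline int issue_prctl_call(const struct prctl_call *c)
{
	return prctl(c->op, c->args[0], c->args[1], c->args[2], c->args[3]);
}
```

A caller would build the request with thp_set_policy_call(PR_DEFAULT_MADV_HUGEPAGE) and pass its address to issue_prctl_call(), checking the return value and errno as with any other prctl op.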
>
> On that note, doesn't it make sense to change the default mm flag under
> PR_SET_MM? PR_SET_MM_FLAG maybe?
I don't think that's the right approach. PR_SET_MM is used to modify kernel
memory map descriptor fields. That's not what we are doing here.
I am not sure how the use case in this series fits at all in the below
switch statement for PR_SET_MM:
	switch (opt) {
	case PR_SET_MM_START_CODE:
		prctl_map.start_code = addr;
		break;
	case PR_SET_MM_END_CODE:
		prctl_map.end_code = addr;
		break;
	case PR_SET_MM_START_DATA:
		prctl_map.start_data = addr;
		break;
	case PR_SET_MM_END_DATA:
		prctl_map.end_data = addr;
		break;
	case PR_SET_MM_START_STACK:
		prctl_map.start_stack = addr;
		break;
	case PR_SET_MM_START_BRK:
		prctl_map.start_brk = addr;
		break;
	case PR_SET_MM_BRK:
		prctl_map.brk = addr;
		break;
	case PR_SET_MM_ARG_START:
		prctl_map.arg_start = addr;
		break;
	case PR_SET_MM_ARG_END:
		prctl_map.arg_end = addr;
		break;
	case PR_SET_MM_ENV_START:
		prctl_map.env_start = addr;
		break;
	case PR_SET_MM_ENV_END:
		prctl_map.env_end = addr;
		break;
	default:
		goto out;
	}
>
> Thanks,
> Liam
* Re: [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY
2025-05-21 9:31 ` Usama Arif
@ 2025-05-21 16:37 ` Liam R. Howlett
0 siblings, 0 replies; 25+ messages in thread
From: Liam R. Howlett @ 2025-05-21 16:37 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, lorenzo.stoakes, npache, ryan.roberts,
vbabka, jannh, Arnd Bergmann, linux-kernel, linux-doc,
kernel-team
* Usama Arif <usamaarif642@gmail.com> [250521 05:31]:
>
>
> On 21/05/2025 03:33, Liam R. Howlett wrote:
> > * Usama Arif <usamaarif642@gmail.com> [250519 18:34]:
> >> This series allows changing the THP policy of a process, according to the
> >> value set in arg2, all of which will be inherited during fork+exec:
> >> - PR_DEFAULT_MADV_HUGEPAGE: This will set VM_HUGEPAGE and clear VM_NOHUGEPAGE
> >> for the default VMA flags. It will also iterate through every VMA in the
> >> process and call hugepage_madvise on it, with MADV_HUGEPAGE policy.
> >> This effectively allows setting MADV_HUGEPAGE on the entire process.
> >> In an environment where different types of workloads are run on the
> >> same machine, this will allow workloads that benefit from always having
> >> hugepages to do so, without regressing those that don't.
> >> - PR_DEFAULT_MADV_NOHUGEPAGE: This will set VM_NOHUGEPAGE and clear VM_HUGEPAGE
> >> for the default VMA flags. It will also iterate through every VMA in the
> >> process and call hugepage_madvise on it, with MADV_NOHUGEPAGE policy.
> >> This effectively allows setting MADV_NOHUGEPAGE on the entire process.
> >> In an environment where different types of workloads are run on the
> >> same machine, this will allow workloads that benefit from having
> >> hugepages on an madvise basis only to do so, without regressing those
> >> that benefit from having hugepages always.
> >> - PR_THP_POLICY_SYSTEM: This will reset (clear) both VM_HUGEPAGE and
> >> VM_NOHUGEPAGE in the process default flags.
> >>
> >
> > Subject seems outdated now? PR_DEFAULT_ vs PR_SET/GET_THP ?
>
> No, it's not.
>
> prctl takes 5 args; the first 2 are relevant here.
>
> The first arg selects the op. This series introduces 2 ops, PR_SET_THP_POLICY
> and PR_GET_THP_POLICY, to set and get the policy. This is the subject.
>
> The 2nd arg describes the policies: PR_DEFAULT_MADV_HUGEPAGE, PR_DEFAULT_MADV_NOHUGEPAGE
> and PR_THP_POLICY_SYSTEM.
>
> The subject is correct.
Thanks, that makes sense. You are adding an entire new configuration
item to the prctl fun.
>
> >
> > On that note, doesn't it make sense to change the default mm flag under
> > PR_SET_MM? PR_SET_MM_FLAG maybe?
>
> I don't think that's the right approach. PR_SET_MM is used to modify kernel
> memory map descriptor fields. That's not what we are doing here.
Fair enough, you are changing the memory map default flags for vmas.
So we are going to add another top level THP specific prctl that changes
flags, but now def_flags and that's communicated by the word POLICY.
I'm not sure this is the right approach either.
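For reference, the def_flags change under discussion can be modelled in a few lines, going by the cover letter's description of the three policies. This is only a userspace sketch: enum thp_policy and model_thp_def_flags() are hypothetical names, the two VM_* values are copied here as plain constants and are believed to match include/linux/mm.h at the time of this series, and the sketch ignores the per-VMA walk, the s390 exception and the hugepage_global_enabled check mentioned in the changelog.

```c
/* Constants copied for illustration; assumed to match include/linux/mm.h. */
#define VM_HUGEPAGE	0x20000000UL
#define VM_NOHUGEPAGE	0x40000000UL

/* Hypothetical stand-ins for the three arg2 policy values. */
enum thp_policy {
	THP_POLICY_MADV_HUGEPAGE,	/* PR_DEFAULT_MADV_HUGEPAGE */
	THP_POLICY_MADV_NOHUGEPAGE,	/* PR_DEFAULT_MADV_NOHUGEPAGE */
	THP_POLICY_SYSTEM,		/* PR_THP_POLICY_SYSTEM */
};

/* Return the new default VMA flags after applying a policy. */
static inline unsigned long model_thp_def_flags(unsigned long def_flags,
						 enum thp_policy policy)
{
	switch (policy) {
	case THP_POLICY_MADV_HUGEPAGE:
		/* set VM_HUGEPAGE, clear VM_NOHUGEPAGE */
		return (def_flags & ~VM_NOHUGEPAGE) | VM_HUGEPAGE;
	case THP_POLICY_MADV_NOHUGEPAGE:
		/* set VM_NOHUGEPAGE, clear VM_HUGEPAGE */
		return (def_flags & ~VM_HUGEPAGE) | VM_NOHUGEPAGE;
	case THP_POLICY_SYSTEM:
		/* clear both, falling back to the system-wide setting */
		return def_flags & ~(VM_HUGEPAGE | VM_NOHUGEPAGE);
	}
	return def_flags;
}
```

In this model the two flags are mutually exclusive after any call, and unrelated bits in def_flags are left untouched.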
Thanks,
Liam
* Re: [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY
2025-05-19 22:29 [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY Usama Arif
` (8 preceding siblings ...)
2025-05-21 2:33 ` Liam R. Howlett
@ 2025-05-22 12:10 ` Mike Rapoport
9 siblings, 0 replies; 25+ messages in thread
From: Mike Rapoport @ 2025-05-22 12:10 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
laoar.shao, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, vbabka, jannh, Arnd Bergmann, linux-kernel,
linux-doc, kernel-team, linux-api
(cc'ing linux-api)
On Mon, May 19, 2025 at 11:29:52PM +0100, Usama Arif wrote:
> This series allows changing the THP policy of a process, according to the
> value set in arg2, all of which will be inherited during fork+exec:
> - PR_DEFAULT_MADV_HUGEPAGE: This will set VM_HUGEPAGE and clear VM_NOHUGEPAGE
> for the default VMA flags. It will also iterate through every VMA in the
> process and call hugepage_madvise on it, with MADV_HUGEPAGE policy.
> This effectively allows setting MADV_HUGEPAGE on the entire process.
> In an environment where different types of workloads are run on the
> same machine, this will allow workloads that benefit from always having
> hugepages to do so, without regressing those that don't.
> - PR_DEFAULT_MADV_NOHUGEPAGE: This will set VM_NOHUGEPAGE and clear VM_HUGEPAGE
> for the default VMA flags. It will also iterate through every VMA in the
> process and call hugepage_madvise on it, with MADV_NOHUGEPAGE policy.
> This effectively allows setting MADV_NOHUGEPAGE on the entire process.
> In an environment where different types of workloads are run on the
> same machine, this will allow workloads that benefit from having
> hugepages on an madvise basis only to do so, without regressing those
> that benefit from having hugepages always.
> - PR_THP_POLICY_SYSTEM: This will reset (clear) both VM_HUGEPAGE and
> VM_NOHUGEPAGE in the process default flags.
>
> In hyperscalers, we have a single THP policy for the entire fleet.
> We have different types of workloads (e.g. AI/compute/databases/etc)
> running on a single server.
> Some of these workloads will benefit from always getting THP at fault
> (or collapsed by khugepaged), some of them will benefit by only getting
> them at madvise.
>
> This series is useful for 2 use cases:
> 1) global system policy = madvise, while we want some workloads to get THPs
> at fault and by khugepaged :- some processes (e.g. AI workloads) benefit
> from getting THPs at fault (and collapsed by khugepaged). Other workloads
> like databases will incur a regression (either a performance regression or
> they are completely memory bound and even a very slight increase in memory
> will cause them to OOM). So what these patches will do is allow setting
> prctl(PR_DEFAULT_MADV_HUGEPAGE) on the AI workloads. (This is how
> workloads are deployed in our (Meta's/Facebook) fleet at this moment).
>
> 2) global system policy = always, while we want some workloads to get THPs
> only on an madvise basis :- Same reason as 1). What these patches
> will do is allow setting prctl(PR_DEFAULT_MADV_NOHUGEPAGE) on the database
> workloads. (We hope this is us (Meta) in the near future, if a majority of
> workloads show that they benefit from always, we flip the default host
> setting to "always" across the fleet and workloads that regress can opt-out
> and be "madvise". New services developed will then be tested with always by
> default. "always" is also the default defconfig option upstream, so I would
> imagine this is faced by others as well.)
>
> v2->v3: (Thanks Lorenzo for all the below feedback!)
> v2: https://lore.kernel.org/all/20250515133519.2779639-1-usamaarif642@gmail.com/
> - no more flags2.
> - no more MMF2_...
> - renamed policy to PR_DEFAULT_MADV_(NO)HUGEPAGE
> - mmap_write_lock_killable acquired in PR_GET_THP_POLICY
> - mmap_write lock fixed in PR_SET_THP_POLICY
> - mmap assert check in process_default_madv_hugepage
> - check if hugepage_global_enabled is enabled in the call and account for s390
> - set mm->def_flags VM_HUGEPAGE and VM_NOHUGEPAGE according to the policy in
> the way done by madvise(). I believe VM merge will not be broken in
> this way.
> - process_default_madv_hugepage function that does for_each_vma and calls
> hugepage_madvise.
>
> v1->v2:
> - change from modifying the THP decision making for the process, to modifying
> VMA flags only. This prevents further complicating the logic used to
> determine THP order (Thanks David!)
> - change from using a prctl per policy change to just using PR_SET_THP_POLICY
> and arg2 to set the policy. (Zi Yan)
> - Introduce PR_THP_POLICY_DEFAULT_NOHUGE and PR_THP_POLICY_DEFAULT_SYSTEM
> - Add selftests and documentation.
>
> Usama Arif (7):
> mm: khugepaged: extract vm flag setting outside of hugepage_madvise
> prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
> prctl: introduce PR_DEFAULT_MADV_NOHUGEPAGE for the process
> prctl: introduce PR_THP_POLICY_SYSTEM for the process
> selftests: prctl: introduce tests for PR_DEFAULT_MADV_NOHUGEPAGE
> selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
> docs: transhuge: document process level THP controls
>
> Documentation/admin-guide/mm/transhuge.rst | 42 +++
> include/linux/huge_mm.h | 2 +
> include/linux/mm.h | 2 +-
> include/linux/mm_types.h | 4 +-
> include/uapi/linux/prctl.h | 6 +
> kernel/sys.c | 53 ++++
> mm/huge_memory.c | 13 +
> mm/khugepaged.c | 26 +-
> tools/include/uapi/linux/prctl.h | 6 +
> .../trace/beauty/include/uapi/linux/prctl.h | 6 +
> tools/testing/selftests/prctl/Makefile | 2 +-
> tools/testing/selftests/prctl/thp_policy.c | 286 ++++++++++++++++++
> 12 files changed, 436 insertions(+), 12 deletions(-)
> create mode 100644 tools/testing/selftests/prctl/thp_policy.c
>
> --
> 2.47.1
>
>
--
Sincerely yours,
Mike.