* [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
@ 2025-05-20 6:04 Yafang Shao
2025-05-20 6:04 ` [RFC PATCH v2 1/5] mm: thp: Add a new mode "bpf" Yafang Shao
` (7 more replies)
0 siblings, 8 replies; 52+ messages in thread
From: Yafang Shao @ 2025-05-20 6:04 UTC (permalink / raw)
To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
npache, ryan.roberts, dev.jain, hannes, usamaarif642,
gutierrez.asier, willy, ast, daniel, andrii
Cc: bpf, linux-mm, Yafang Shao
Background
----------
At my current employer, PDD, we have consistently configured THP to "never"
on our production servers due to past incidents caused by its behavior:
- Increased memory consumption
THP significantly raises overall memory usage.
- Latency spikes
Random latency spikes occur due to more frequent memory compaction
activity triggered by THP.
These issues have made sysadmins hesitant to switch to "madvise" or
"always" modes.
New Motivation
--------------
We have now identified that certain AI workloads achieve substantial
performance gains with THP enabled. However, we’ve also verified that some
workloads see little to no benefit—or are even negatively impacted—by THP.
In our Kubernetes environment, we deploy mixed workloads on a single server
to maximize resource utilization. Our goal is to selectively enable THP for
services that benefit from it while keeping it disabled for others. This
approach allows us to incrementally enable THP for additional services and
assess how to make it more viable in production.
Proposed Solution
-----------------
For this use case, Johannes suggested introducing a dedicated mode [0]. In
this new mode, we could implement BPF-based THP adjustment for fine-grained
control over tasks or cgroups. If no BPF program is attached, THP remains
in "never" mode. This solution elegantly meets our needs while avoiding the
complexity of managing BPF alongside other THP modes.
A selftest example demonstrates how to enable THP for the current task
while keeping it disabled for others.
Alternative Proposals
---------------------
- Gutierrez’s cgroup-based approach [1]
- Proposed adding a new cgroup file to control THP policy.
- However, as Johannes noted, cgroups are designed for hierarchical
resource allocation, not arbitrary policy settings [2].
- Usama’s per-task THP proposal based on prctl() [3]:
- Enabling THP per task via prctl().
- As David pointed out, neither madvise() nor prctl() works in "never"
mode [4], making this solution insufficient for our needs.
Conclusion
----------
Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
most effective solution for our requirements. This approach represents a
small but meaningful step toward making THP truly usable—and manageable—in
production environments.
This is currently a PoC implementation. Feedback of any kind is welcome.
Link: https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/ [0]
Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]
Link: https://lore.kernel.org/linux-mm/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/ [4]
RFC v1->v2:
The main changes are as follows,
- Use struct_ops instead of fmod_ret (Alexei)
- Introduce a new THP mode (Johannes)
- Introduce new helpers for BPF hook (Zi)
- Refine the commit log
RFC v1: https://lwn.net/Articles/1019290/
Yafang Shao (5):
mm: thp: Add a new mode "bpf"
mm: thp: Add hook for BPF based THP adjustment
mm: thp: add struct ops for BPF based THP adjustment
bpf: Add get_current_comm to bpf_base_func_proto
selftests/bpf: Add selftest for THP adjustment
include/linux/huge_mm.h | 15 +-
kernel/bpf/cgroup.c | 2 -
kernel/bpf/helpers.c | 2 +
mm/Makefile | 3 +
mm/bpf_thp.c | 120 ++++++++++++
mm/huge_memory.c | 65 ++++++-
mm/khugepaged.c | 3 +
tools/testing/selftests/bpf/config | 1 +
.../selftests/bpf/prog_tests/thp_adjust.c | 175 ++++++++++++++++++
.../selftests/bpf/progs/test_thp_adjust.c | 39 ++++
10 files changed, 414 insertions(+), 11 deletions(-)
create mode 100644 mm/bpf_thp.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
--
2.43.5
^ permalink raw reply [flat|nested] 52+ messages in thread
* [RFC PATCH v2 1/5] mm: thp: Add a new mode "bpf"
2025-05-20 6:04 [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment Yafang Shao
@ 2025-05-20 6:04 ` Yafang Shao
2025-05-20 6:05 ` [RFC PATCH v2 2/5] mm: thp: Add hook for BPF based THP adjustment Yafang Shao
` (6 subsequent siblings)
7 siblings, 0 replies; 52+ messages in thread
From: Yafang Shao @ 2025-05-20 6:04 UTC (permalink / raw)
To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
npache, ryan.roberts, dev.jain, hannes, usamaarif642,
gutierrez.asier, willy, ast, daniel, andrii
Cc: bpf, linux-mm, Yafang Shao
Background
----------
Historically, our production environment has always configured THP to never
due to past incidents. This has made system administrators hesitant to
switch to madvise.
New Motivation
--------------
We’ve now identified that AI workloads can achieve significant performance
gains with THP enabled. To balance safety and performance, we aim to allow
THP only for AI services while keeping the global system setting at never.
Proposed Solution
-----------------
Johannes suggested introducing a dedicated mode for this use case [0]. This
approach elegantly solves our problem while avoiding the complexity of
managing BPF alongside other THP modes.
Link: https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/ [0]
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
include/linux/huge_mm.h | 2 ++
mm/huge_memory.c | 65 ++++++++++++++++++++++++++++++++++++-----
2 files changed, 59 insertions(+), 8 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e893d546a49f..3b5429f73e6e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -54,6 +54,7 @@ enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
+ TRANSPARENT_HUGEPAGE_REQ_BPF_FLAG, /* "bpf" mode */
};
struct kobject;
@@ -174,6 +175,7 @@ static inline void count_mthp_stat(int order, enum mthp_stat_item item)
extern unsigned long transparent_hugepage_flags;
extern unsigned long huge_anon_orders_always;
+extern unsigned long huge_anon_orders_bpf;
extern unsigned long huge_anon_orders_madvise;
extern unsigned long huge_anon_orders_inherit;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 47d76d03ce30..8af56ee8d979 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -79,6 +79,7 @@ static atomic_t huge_zero_refcount;
struct folio *huge_zero_folio __read_mostly;
unsigned long huge_zero_pfn __read_mostly = ~0UL;
unsigned long huge_anon_orders_always __read_mostly;
+unsigned long huge_anon_orders_bpf __read_mostly;
unsigned long huge_anon_orders_madvise __read_mostly;
unsigned long huge_anon_orders_inherit __read_mostly;
static bool anon_orders_configured __initdata;
@@ -297,12 +298,15 @@ static ssize_t enabled_show(struct kobject *kobj,
const char *output;
if (test_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags))
- output = "[always] madvise never";
+ output = "[always] bpf madvise never";
+ else if (test_bit(TRANSPARENT_HUGEPAGE_REQ_BPF_FLAG,
+ &transparent_hugepage_flags))
+ output = "always [bpf] madvise never";
else if (test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
&transparent_hugepage_flags))
- output = "always [madvise] never";
+ output = "always bpf [madvise] never";
else
- output = "always madvise [never]";
+ output = "always bpf madvise [never]";
return sysfs_emit(buf, "%s\n", output);
}
@@ -315,13 +319,20 @@ static ssize_t enabled_store(struct kobject *kobj,
if (sysfs_streq(buf, "always")) {
clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_REQ_BPF_FLAG, &transparent_hugepage_flags);
set_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+ } else if (sysfs_streq(buf, "bpf")) {
+ clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+ set_bit(TRANSPARENT_HUGEPAGE_REQ_BPF_FLAG, &transparent_hugepage_flags);
} else if (sysfs_streq(buf, "madvise")) {
clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_REQ_BPF_FLAG, &transparent_hugepage_flags);
set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
} else if (sysfs_streq(buf, "never")) {
clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_REQ_BPF_FLAG, &transparent_hugepage_flags);
} else
ret = -EINVAL;
@@ -495,13 +506,15 @@ static ssize_t anon_enabled_show(struct kobject *kobj,
const char *output;
if (test_bit(order, &huge_anon_orders_always))
- output = "[always] inherit madvise never";
+ output = "[always] bpf inherit madvise never";
+ else if (test_bit(order, &huge_anon_orders_bpf))
+ output = "always [bpf] inherit madvise never";
else if (test_bit(order, &huge_anon_orders_inherit))
- output = "always [inherit] madvise never";
+ output = "always bpf [inherit] madvise never";
else if (test_bit(order, &huge_anon_orders_madvise))
- output = "always inherit [madvise] never";
+ output = "always bpf inherit [madvise] never";
else
- output = "always inherit madvise [never]";
+ output = "always bpf inherit madvise [never]";
return sysfs_emit(buf, "%s\n", output);
}
@@ -515,25 +528,36 @@ static ssize_t anon_enabled_store(struct kobject *kobj,
if (sysfs_streq(buf, "always")) {
spin_lock(&huge_anon_orders_lock);
+ clear_bit(order, &huge_anon_orders_bpf);
clear_bit(order, &huge_anon_orders_inherit);
clear_bit(order, &huge_anon_orders_madvise);
set_bit(order, &huge_anon_orders_always);
spin_unlock(&huge_anon_orders_lock);
+ } else if (sysfs_streq(buf, "bpf")) {
+ spin_lock(&huge_anon_orders_lock);
+ clear_bit(order, &huge_anon_orders_always);
+ clear_bit(order, &huge_anon_orders_inherit);
+ clear_bit(order, &huge_anon_orders_madvise);
+ set_bit(order, &huge_anon_orders_bpf);
+ spin_unlock(&huge_anon_orders_lock);
} else if (sysfs_streq(buf, "inherit")) {
spin_lock(&huge_anon_orders_lock);
clear_bit(order, &huge_anon_orders_always);
+ clear_bit(order, &huge_anon_orders_bpf);
clear_bit(order, &huge_anon_orders_madvise);
set_bit(order, &huge_anon_orders_inherit);
spin_unlock(&huge_anon_orders_lock);
} else if (sysfs_streq(buf, "madvise")) {
spin_lock(&huge_anon_orders_lock);
clear_bit(order, &huge_anon_orders_always);
+ clear_bit(order, &huge_anon_orders_bpf);
clear_bit(order, &huge_anon_orders_inherit);
set_bit(order, &huge_anon_orders_madvise);
spin_unlock(&huge_anon_orders_lock);
} else if (sysfs_streq(buf, "never")) {
spin_lock(&huge_anon_orders_lock);
clear_bit(order, &huge_anon_orders_always);
+ clear_bit(order, &huge_anon_orders_bpf);
clear_bit(order, &huge_anon_orders_inherit);
clear_bit(order, &huge_anon_orders_madvise);
spin_unlock(&huge_anon_orders_lock);
@@ -943,10 +967,22 @@ static int __init setup_transparent_hugepage(char *str)
&transparent_hugepage_flags);
clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
&transparent_hugepage_flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_REQ_BPF_FLAG,
+ &transparent_hugepage_flags);
+ ret = 1;
+ } else if (!strcmp(str, "bpf")) {
+ clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
+ &transparent_hugepage_flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+ &transparent_hugepage_flags);
+ set_bit(TRANSPARENT_HUGEPAGE_REQ_BPF_FLAG,
+ &transparent_hugepage_flags);
ret = 1;
} else if (!strcmp(str, "madvise")) {
clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
&transparent_hugepage_flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_REQ_BPF_FLAG,
+ &transparent_hugepage_flags);
set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
&transparent_hugepage_flags);
ret = 1;
@@ -955,6 +991,8 @@ static int __init setup_transparent_hugepage(char *str)
&transparent_hugepage_flags);
clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
&transparent_hugepage_flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_REQ_BPF_FLAG,
+ &transparent_hugepage_flags);
ret = 1;
}
out:
@@ -967,8 +1005,8 @@ __setup("transparent_hugepage=", setup_transparent_hugepage);
static char str_dup[PAGE_SIZE] __initdata;
static int __init setup_thp_anon(char *str)
{
+ unsigned long always, bpf, inherit, madvise;
char *token, *range, *policy, *subtoken;
- unsigned long always, inherit, madvise;
char *start_size, *end_size;
int start, end, nr;
char *p;
@@ -978,6 +1016,7 @@ static int __init setup_thp_anon(char *str)
strscpy(str_dup, str);
always = huge_anon_orders_always;
+ bpf = huge_anon_orders_bpf;
madvise = huge_anon_orders_madvise;
inherit = huge_anon_orders_inherit;
p = str_dup;
@@ -1019,18 +1058,27 @@ static int __init setup_thp_anon(char *str)
bitmap_set(&always, start, nr);
bitmap_clear(&inherit, start, nr);
bitmap_clear(&madvise, start, nr);
+ bitmap_clear(&bpf, start, nr);
+ } else if (!strcmp(policy, "bpf")) {
+ bitmap_set(&bpf, start, nr);
+ bitmap_clear(&inherit, start, nr);
+ bitmap_clear(&always, start, nr);
+ bitmap_clear(&madvise, start, nr);
} else if (!strcmp(policy, "madvise")) {
bitmap_set(&madvise, start, nr);
bitmap_clear(&inherit, start, nr);
bitmap_clear(&always, start, nr);
+ bitmap_clear(&bpf, start, nr);
} else if (!strcmp(policy, "inherit")) {
bitmap_set(&inherit, start, nr);
bitmap_clear(&madvise, start, nr);
bitmap_clear(&always, start, nr);
+ bitmap_clear(&bpf, start, nr);
} else if (!strcmp(policy, "never")) {
bitmap_clear(&inherit, start, nr);
bitmap_clear(&madvise, start, nr);
bitmap_clear(&always, start, nr);
+ bitmap_clear(&bpf, start, nr);
} else {
pr_err("invalid policy %s in thp_anon boot parameter\n", policy);
goto err;
@@ -1041,6 +1089,7 @@ static int __init setup_thp_anon(char *str)
huge_anon_orders_always = always;
huge_anon_orders_madvise = madvise;
huge_anon_orders_inherit = inherit;
+ huge_anon_orders_bpf = bpf;
anon_orders_configured = true;
return 1;
--
2.43.5
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [RFC PATCH v2 2/5] mm: thp: Add hook for BPF based THP adjustment
2025-05-20 6:04 [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment Yafang Shao
2025-05-20 6:04 ` [RFC PATCH v2 1/5] mm: thp: Add a new mode "bpf" Yafang Shao
@ 2025-05-20 6:05 ` Yafang Shao
2025-05-20 6:05 ` [RFC PATCH v2 3/5] mm: thp: add struct ops " Yafang Shao
` (5 subsequent siblings)
7 siblings, 0 replies; 52+ messages in thread
From: Yafang Shao @ 2025-05-20 6:05 UTC (permalink / raw)
To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
npache, ryan.roberts, dev.jain, hannes, usamaarif642,
gutierrez.asier, willy, ast, daniel, andrii
Cc: bpf, linux-mm, Yafang Shao
This patch introduces new hooks for BPF program attachment and adds a flag
to indicate when a BPF program is attached. The program only functions when
"bpf" mode is enabled.
Per task THP policy based on BPF will be added in the followup patch.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
include/linux/huge_mm.h | 24 +++++++++++++++++++++++-
mm/khugepaged.c | 3 +++
2 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3b5429f73e6e..fedb5b014d9a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -55,6 +55,7 @@ enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
TRANSPARENT_HUGEPAGE_REQ_BPF_FLAG, /* "bpf" mode */
+ TRANSPARENT_HUGEPAGE_BPF_ATTACHED, /* BPF program is attached */
};
struct kobject;
@@ -192,6 +193,26 @@ static inline bool hugepage_global_always(void)
(1<<TRANSPARENT_HUGEPAGE_FLAG);
}
+static inline bool hugepage_bpf_allowable(void)
+{
+ /* Works only for BPF mode */
+ if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_REQ_BPF_FLAG)))
+ return 0;
+
+ /* No BPF program is attached */
+ if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_BPF_ATTACHED)))
+ return 0;
+ /* We will add struct ops in the future */
+ return 1;
+}
+
+static inline bool hugepaged_bpf_allowable(void)
+{
+ if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_BPF_ATTACHED)))
+ return 0;
+ return 1;
+}
+
static inline int highest_order(unsigned long orders)
{
return fls_long(orders) - 1;
@@ -295,7 +316,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
if (vm_flags & VM_HUGEPAGE)
mask |= READ_ONCE(huge_anon_orders_madvise);
if (hugepage_global_always() ||
- ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
+ ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()) ||
+ hugepage_bpf_allowable())
mask |= READ_ONCE(huge_anon_orders_inherit);
orders &= mask;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cc945c6ab3bd..762e03b50bca 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -432,6 +432,9 @@ static bool hugepage_pmd_enabled(void)
if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
hugepage_global_enabled())
return true;
+ if (test_bit(PMD_ORDER, &huge_anon_orders_bpf) &&
+ hugepaged_bpf_allowable())
+ return true;
if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
return true;
return false;
--
2.43.5
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [RFC PATCH v2 3/5] mm: thp: add struct ops for BPF based THP adjustment
2025-05-20 6:04 [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment Yafang Shao
2025-05-20 6:04 ` [RFC PATCH v2 1/5] mm: thp: Add a new mode "bpf" Yafang Shao
2025-05-20 6:05 ` [RFC PATCH v2 2/5] mm: thp: Add hook for BPF based THP adjustment Yafang Shao
@ 2025-05-20 6:05 ` Yafang Shao
2025-05-20 6:05 ` [RFC PATCH v2 4/5] bpf: Add get_current_comm to bpf_base_func_proto Yafang Shao
` (4 subsequent siblings)
7 siblings, 0 replies; 52+ messages in thread
From: Yafang Shao @ 2025-05-20 6:05 UTC (permalink / raw)
To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
npache, ryan.roberts, dev.jain, hannes, usamaarif642,
gutierrez.asier, willy, ast, daniel, andrii
Cc: bpf, linux-mm, Yafang Shao
This patch introduces a minimal struct_ops for BPF-based THP adjustment.
Currently, only a single BPF program can be attached. Support for multiple
attachments will be added in the future.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
include/linux/huge_mm.h | 13 +----
mm/Makefile | 3 +
mm/bpf_thp.c | 120 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 124 insertions(+), 12 deletions(-)
create mode 100644 mm/bpf_thp.c
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fedb5b014d9a..a75f5f902af0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -193,18 +193,7 @@ static inline bool hugepage_global_always(void)
(1<<TRANSPARENT_HUGEPAGE_FLAG);
}
-static inline bool hugepage_bpf_allowable(void)
-{
- /* Works only for BPF mode */
- if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_REQ_BPF_FLAG)))
- return 0;
-
- /* No BPF program is attached */
- if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_BPF_ATTACHED)))
- return 0;
- /* We will add struct ops in the future */
- return 1;
-}
+bool hugepage_bpf_allowable(void);
static inline bool hugepaged_bpf_allowable(void)
{
diff --git a/mm/Makefile b/mm/Makefile
index e7f6bbf8ae5f..c355f9426c93 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,6 +99,9 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_NUMA) += memory-tiers.o
obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+ifdef CONFIG_BPF_SYSCALL
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += bpf_thp.o
+endif
obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
new file mode 100644
index 000000000000..690980cdb4da
--- /dev/null
+++ b/mm/bpf_thp.c
@@ -0,0 +1,120 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/huge_mm.h>
+
+struct bpf_thp_ops {
+ /**
+ * @thp_bpf_allowable: Determines whether a task is permitted to
+ * allocate a THP when it is allocating anon memory.
+ *
+ * Return: %true if THP allocation is allowed, %false otherwise.
+ */
+ bool (*thp_bpf_allowable)(void);
+};
+
+static struct bpf_thp_ops bpf_thp;
+
+bool hugepage_bpf_allowable(void)
+{
+ /* Works only for "bpf" mode */
+ if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_REQ_BPF_FLAG)))
+ return 0;
+
+ /* No BPF program is attached */
+ if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_BPF_ATTACHED)))
+ return 0;
+
+ /* BPF adjustment hook */
+ if (bpf_thp.thp_bpf_allowable)
+ return bpf_thp.thp_bpf_allowable();
+ return 0;
+}
+
+static bool bpf_thp_ops_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_func_proto *
+bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ return bpf_base_func_proto(func_id, prog);
+}
+
+static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
+ .get_func_proto = bpf_thp_get_func_proto,
+ .is_valid_access = bpf_thp_ops_is_valid_access,
+};
+
+static int bpf_thp_reg(void *kdata, struct bpf_link *link)
+{
+ struct bpf_thp_ops *ops = kdata;
+
+ /* TODO: add support for multiple attaches */
+ if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+ &transparent_hugepage_flags))
+ return -EOPNOTSUPP;
+ bpf_thp.thp_bpf_allowable = ops->thp_bpf_allowable;
+ return 0;
+}
+
+static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
+{
+ clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
+ bpf_thp.thp_bpf_allowable = NULL;
+}
+
+static int bpf_thp_check_member(const struct btf_type *t,
+ const struct btf_member *member,
+ const struct bpf_prog *prog)
+{
+ return 0;
+}
+
+static int bpf_thp_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ return 0;
+}
+
+static int bpf_thp_init(struct btf *btf)
+{
+ return 0;
+}
+
+static bool thp_bpf_allowable(void)
+{
+ return 0;
+}
+
+static struct bpf_thp_ops __bpf_thp_ops = {
+ .thp_bpf_allowable = thp_bpf_allowable,
+};
+
+static struct bpf_struct_ops bpf_bpf_thp_ops = {
+ .verifier_ops = &thp_bpf_verifier_ops,
+ .init = bpf_thp_init,
+ .check_member = bpf_thp_check_member,
+ .init_member = bpf_thp_init_member,
+ .reg = bpf_thp_reg,
+ .unreg = bpf_thp_unreg,
+ .name = "bpf_thp_ops",
+ .cfi_stubs = &__bpf_thp_ops,
+ .owner = THIS_MODULE,
+};
+
+static int __init bpf_thp_ops_init(void)
+{
+ int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
+
+ if (err)
+ pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
+ return err;
+}
+late_initcall(bpf_thp_ops_init);
--
2.43.5
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [RFC PATCH v2 4/5] bpf: Add get_current_comm to bpf_base_func_proto
2025-05-20 6:04 [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment Yafang Shao
` (2 preceding siblings ...)
2025-05-20 6:05 ` [RFC PATCH v2 3/5] mm: thp: add struct ops " Yafang Shao
@ 2025-05-20 6:05 ` Yafang Shao
2025-05-20 23:32 ` Andrii Nakryiko
2025-05-20 6:05 ` [RFC PATCH v2 5/5] selftests/bpf: Add selftest for THP adjustment Yafang Shao
` (3 subsequent siblings)
7 siblings, 1 reply; 52+ messages in thread
From: Yafang Shao @ 2025-05-20 6:05 UTC (permalink / raw)
To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
npache, ryan.roberts, dev.jain, hannes, usamaarif642,
gutierrez.asier, willy, ast, daniel, andrii
Cc: bpf, linux-mm, Yafang Shao
While testing the BPF based THP adjustment feature, I noticed
bpf_get_current_comm() isn't available in bpf_base_func_proto. As this is a
commonly used helper, we should add it to bpf_base_func_proto.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
kernel/bpf/cgroup.c | 2 --
kernel/bpf/helpers.c | 2 ++
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 84f58f3d028a..22cd4f54d023 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -2609,8 +2609,6 @@ cgroup_current_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
switch (func_id) {
case BPF_FUNC_get_current_uid_gid:
return &bpf_get_current_uid_gid_proto;
- case BPF_FUNC_get_current_comm:
- return &bpf_get_current_comm_proto;
#ifdef CONFIG_CGROUP_NET_CLASSID
case BPF_FUNC_get_cgroup_classid:
return &bpf_get_cgroup_classid_curr_proto;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index e3a2662f4e33..2a60522cd66f 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1965,6 +1965,8 @@ bpf_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_get_current_pid_tgid_proto;
case BPF_FUNC_get_ns_current_pid_tgid:
return &bpf_get_ns_current_pid_tgid_proto;
+ case BPF_FUNC_get_current_comm:
+ return &bpf_get_current_comm_proto;
default:
break;
}
--
2.43.5
^ permalink raw reply related [flat|nested] 52+ messages in thread
* [RFC PATCH v2 5/5] selftests/bpf: Add selftest for THP adjustment
2025-05-20 6:04 [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment Yafang Shao
` (3 preceding siblings ...)
2025-05-20 6:05 ` [RFC PATCH v2 4/5] bpf: Add get_current_comm to bpf_base_func_proto Yafang Shao
@ 2025-05-20 6:05 ` Yafang Shao
2025-05-20 6:52 ` [RFC PATCH v2 0/5] mm, bpf: BPF based " Nico Pache
` (2 subsequent siblings)
7 siblings, 0 replies; 52+ messages in thread
From: Yafang Shao @ 2025-05-20 6:05 UTC (permalink / raw)
To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
npache, ryan.roberts, dev.jain, hannes, usamaarif642,
gutierrez.asier, willy, ast, daniel, andrii
Cc: bpf, linux-mm, Yafang Shao
This test case uses a BPF program to enforce the following THP allocation
policy:
- Only the current task is permitted to allocate THP.
- All other tasks are denied.
The expected behavior:
- Before the BPF prog is attached
No tasks can allocate THP.
- After the BPF prog is attached
Only the current task can allocate THP.
- Switch to "never" mode after the BPF prog is attached
THP allocation is not allowed even for the current task.
The result is as follows,
$ ./test_progs --name="thp_adjust"
#437 thp_adjust:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
CONFIG_TRANSPARENT_HUGEPAGE=y is required for this test.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
tools/testing/selftests/bpf/config | 1 +
.../selftests/bpf/prog_tests/thp_adjust.c | 175 ++++++++++++++++++
.../selftests/bpf/progs/test_thp_adjust.c | 39 ++++
3 files changed, 215 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index c378d5d07e02..bb8a8a9d77a2 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -113,3 +113,4 @@ CONFIG_XDP_SOCKETS=y
CONFIG_XFRM_INTERFACE=y
CONFIG_TCP_CONG_DCTCP=y
CONFIG_TCP_CONG_BBR=y
+CONFIG_TRANSPARENT_HUGEPAGE=y
diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
new file mode 100644
index 000000000000..6accd110d8ea
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
@@ -0,0 +1,175 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <sys/mman.h>
+#include <test_progs.h>
+#include "test_thp_adjust.skel.h"
+
+#define LEN (4 * 1024 * 1024) /* 4MB */
+#define THP_ENABLED_PATH "/sys/kernel/mm/transparent_hugepage/enabled"
+#define SMAPS_PATH "/proc/self/smaps"
+#define ANON_HUGE_PAGES "AnonHugePages:"
+
+static char *thp_addr;
+static char old_mode[32];
+
+int thp_mode_save(void)
+{
+ const char *start, *end;
+ char buf[128];
+ int fd, err;
+ size_t len;
+
+ fd = open(THP_ENABLED_PATH, O_RDONLY);
+ if (fd == -1)
+ return -1;
+
+ err = read(fd, buf, sizeof(buf) - 1);
+ if (err == -1)
+ goto close;
+
+ start = strchr(buf, '[');
+ end = start ? strchr(start, ']') : NULL;
+ if (!start || !end || end <= start) {
+ err = -1;
+ goto close;
+ }
+
+ len = end - start - 1;
+ if (len >= sizeof(old_mode))
+ len = sizeof(old_mode) - 1;
+ strncpy(old_mode, start + 1, len);
+ old_mode[len] = '\0';
+
+close:
+ close(fd);
+ return err;
+}
+
+int thp_set(const char *desired_mode)
+{
+ int fd, err;
+
+ fd = open(THP_ENABLED_PATH, O_RDWR);
+ if (fd == -1)
+ return -1;
+
+ err = write(fd, desired_mode, strlen(desired_mode));
+ close(fd);
+ return err;
+}
+
+int thp_reset(void)
+{
+ int fd, err;
+
+ fd = open(THP_ENABLED_PATH, O_WRONLY);
+ if (fd == -1)
+ return -1;
+
+ err = write(fd, old_mode, strlen(old_mode));
+ close(fd);
+ return err;
+}
+
+int thp_alloc(void)
+{
+ int err, i;
+
+ thp_addr = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (thp_addr == MAP_FAILED)
+ return -1;
+
+ err = madvise(thp_addr, LEN, MADV_HUGEPAGE);
+ if (err == -1)
+ goto unmap;
+
+ for (i = 0; i < LEN; i += 4096)
+ thp_addr[i] = 1;
+ return 0;
+
+unmap:
+ munmap(thp_addr, LEN);
+ return -1;
+}
+
+void thp_free(void)
+{
+ if (!thp_addr)
+ return;
+ munmap(thp_addr, LEN);
+}
+
+void test_thp_adjust(void)
+{
+ struct bpf_link *fentry_link, *ops_link;
+ struct test_thp_adjust *skel;
+ int err, first_calls;
+
+ if (!ASSERT_NEQ(thp_mode_save(), -1, "THP mode save"))
+ return;
+ if (!ASSERT_GE(thp_set("bpf"), 0, "THP mode set"))
+ return;
+
+ skel = test_thp_adjust__open();
+ if (!ASSERT_OK_PTR(skel, "open"))
+ goto thp_reset;
+
+ skel->bss->target_pid = getpid();
+
+ err = test_thp_adjust__load(skel);
+ if (!ASSERT_OK(err, "load"))
+ goto destroy;
+
+ fentry_link = bpf_program__attach_trace(skel->progs.thp_run);
+ if (!ASSERT_OK_PTR(fentry_link, "attach fentry"))
+ goto destroy;
+
+ if (!ASSERT_NEQ(thp_alloc(), -1, "THP alloc"))
+ goto destroy;
+
+ /* Before attaching struct_ops, THP won't be allocated. */
+ if (!ASSERT_EQ(skel->bss->thp_calls, 0, "THP calls"))
+ goto thp_free;
+
+ if (!ASSERT_EQ(skel->bss->thp_wrong_calls, 0, "THP calls"))
+ goto thp_free;
+
+ thp_free();
+
+ ops_link = bpf_map__attach_struct_ops(skel->maps.thp);
+ if (!ASSERT_OK_PTR(ops_link, "attach struct_ops"))
+ goto destroy;
+
+ if (!ASSERT_NEQ(thp_alloc(), -1, "THP alloc"))
+ goto destroy;
+
+ /* After attaching struct_ops, THP will be allocated. */
+ if (!ASSERT_GT(skel->bss->thp_calls, 0, "THP calls"))
+ goto thp_free;
+
+ first_calls = skel->bss->thp_calls;
+
+ if (!ASSERT_EQ(skel->bss->thp_wrong_calls, 0, "THP calls"))
+ goto thp_free;
+
+ thp_free();
+
+ if (!ASSERT_GE(thp_set("never"), 0, "THP set"))
+ goto destroy;
+
+ if (!ASSERT_NEQ(thp_alloc(), -1, "THP alloc"))
+ goto destroy;
+
+ /* In "never" mode, THP won't be allocated even if the prog is attached. */
+ if (!ASSERT_EQ(skel->bss->thp_calls, first_calls, "THP calls"))
+ goto thp_free;
+
+ ASSERT_EQ(skel->bss->thp_wrong_calls, 0, "THP calls");
+
+thp_free:
+ thp_free();
+destroy:
+ test_thp_adjust__destroy(skel);
+thp_reset:
+ ASSERT_GE(thp_reset(), 0, "THP mode reset");
+}
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust.c b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
new file mode 100644
index 000000000000..69135380853c
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+int target_pid;
+int thp_calls;
+int thp_wrong_calls;
+
+SEC("fentry/do_huge_pmd_anonymous_page")
+int BPF_PROG(thp_run)
+{
+ struct task_struct *current = bpf_get_current_task_btf();
+
+ if (current->pid == target_pid)
+ thp_calls++;
+ else
+ thp_wrong_calls++;
+ return 0;
+}
+
+SEC("struct_ops/thp_bpf_allowable")
+bool BPF_PROG(thp_bpf_allowable)
+{
+ struct task_struct *current = bpf_get_current_task_btf();
+
+ /* Permit the current task to allocate memory using THP. */
+ if (current->pid == target_pid)
+ return true;
+ return false;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops thp = {
+ .thp_bpf_allowable = (void *)thp_bpf_allowable,
+};
--
2.43.5
^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 6:04 [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment Yafang Shao
` (4 preceding siblings ...)
2025-05-20 6:05 ` [RFC PATCH v2 5/5] selftests/bpf: Add selftest for THP adjustment Yafang Shao
@ 2025-05-20 6:52 ` Nico Pache
2025-05-20 7:25 ` Yafang Shao
2025-05-20 9:43 ` David Hildenbrand
2025-05-25 3:01 ` Yafang Shao
7 siblings, 1 reply; 52+ messages in thread
From: Nico Pache @ 2025-05-20 6:52 UTC (permalink / raw)
To: Yafang Shao
Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On Tue, May 20, 2025 at 12:06 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> Background
> ----------
>
> At my current employer, PDD, we have consistently configured THP to "never"
> on our production servers due to past incidents caused by its behavior:
>
> - Increased memory consumption
> THP significantly raises overall memory usage.
>
> - Latency spikes
> Random latency spikes occur due to more frequent memory compaction
> activity triggered by THP.
>
> These issues have made sysadmins hesitant to switch to "madvise" or
> "always" modes.
>
> New Motivation
> --------------
>
> We have now identified that certain AI workloads achieve substantial
> performance gains with THP enabled. However, we’ve also verified that some
> workloads see little to no benefit—or are even negatively impacted—by THP.
>
> In our Kubernetes environment, we deploy mixed workloads on a single server
> to maximize resource utilization. Our goal is to selectively enable THP for
> services that benefit from it while keeping it disabled for others. This
> approach allows us to incrementally enable THP for additional services and
> assess how to make it more viable in production.
>
> Proposed Solution
> -----------------
>
> For this use case, Johannes suggested introducing a dedicated mode [0]. In
> this new mode, we could implement BPF-based THP adjustment for fine-grained
> control over tasks or cgroups. If no BPF program is attached, THP remains
> in "never" mode. This solution elegantly meets our needs while avoiding the
> complexity of managing BPF alongside other THP modes.
>
> A selftest example demonstrates how to enable THP for the current task
> while keeping it disabled for others.
>
> Alternative Proposals
> ---------------------
>
> - Gutierrez’s cgroup-based approach [1]
> - Proposed adding a new cgroup file to control THP policy.
> - However, as Johannes noted, cgroups are designed for hierarchical
> resource allocation, not arbitrary policy settings [2].
>
> - Usama’s per-task THP proposal based on prctl() [3]:
> - Enabling THP per task via prctl().
> - As David pointed out, neither madvise() nor prctl() works in "never"
> mode [4], making this solution insufficient for our needs.
Hi Yafang Shao,
I believe you would have to invert your logic and disable the
processes you dont want using THPs, and have THP="madvise"|"always". I
have yet to look over Usama's solution in detail but I believe this is
possible based on his cover letter.
I also have an alternative solution proposed here!
https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/
It's different in the sense it doesn't give you granular control per
process, cgroup, or BPF programmability, but it "may" suit your needs
by taming the THP waste and removing the latency spikes of PF time THP
compactions/allocations.
Cheers,
-- Nico
>
> Conclusion
> ----------
>
> Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
> most effective solution for our requirements. This approach represents a
> small but meaningful step toward making THP truly usable—and manageable—in
> production environments.
>
> This is currently a PoC implementation. Feedback of any kind is welcome.
>
> Link: https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/ [0]
> Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
> Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
> Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]
> Link: https://lore.kernel.org/linux-mm/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/ [4]
>
> RFC v1->v2:
> The main changes are as follows,
> - Use struct_ops instead of fmod_ret (Alexei)
> - Introduce a new THP mode (Johannes)
> - Introduce new helpers for BPF hook (Zi)
> - Refine the commit log
>
> RFC v1: https://lwn.net/Articles/1019290/
>
> Yafang Shao (5):
> mm: thp: Add a new mode "bpf"
> mm: thp: Add hook for BPF based THP adjustment
> mm: thp: add struct ops for BPF based THP adjustment
> bpf: Add get_current_comm to bpf_base_func_proto
> selftests/bpf: Add selftest for THP adjustment
>
> include/linux/huge_mm.h | 15 +-
> kernel/bpf/cgroup.c | 2 -
> kernel/bpf/helpers.c | 2 +
> mm/Makefile | 3 +
> mm/bpf_thp.c | 120 ++++++++++++
> mm/huge_memory.c | 65 ++++++-
> mm/khugepaged.c | 3 +
> tools/testing/selftests/bpf/config | 1 +
> .../selftests/bpf/prog_tests/thp_adjust.c | 175 ++++++++++++++++++
> .../selftests/bpf/progs/test_thp_adjust.c | 39 ++++
> 10 files changed, 414 insertions(+), 11 deletions(-)
> create mode 100644 mm/bpf_thp.c
> create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
> create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
>
> --
> 2.43.5
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 6:52 ` [RFC PATCH v2 0/5] mm, bpf: BPF based " Nico Pache
@ 2025-05-20 7:25 ` Yafang Shao
2025-05-20 13:10 ` Matthew Wilcox
0 siblings, 1 reply; 52+ messages in thread
From: Yafang Shao @ 2025-05-20 7:25 UTC (permalink / raw)
To: Nico Pache
Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On Tue, May 20, 2025 at 2:52 PM Nico Pache <npache@redhat.com> wrote:
>
> On Tue, May 20, 2025 at 12:06 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > Background
> > ----------
> >
> > At my current employer, PDD, we have consistently configured THP to "never"
> > on our production servers due to past incidents caused by its behavior:
> >
> > - Increased memory consumption
> > THP significantly raises overall memory usage.
> >
> > - Latency spikes
> > Random latency spikes occur due to more frequent memory compaction
> > activity triggered by THP.
> >
> > These issues have made sysadmins hesitant to switch to "madvise" or
> > "always" modes.
> >
> > New Motivation
> > --------------
> >
> > We have now identified that certain AI workloads achieve substantial
> > performance gains with THP enabled. However, we’ve also verified that some
> > workloads see little to no benefit—or are even negatively impacted—by THP.
> >
> > In our Kubernetes environment, we deploy mixed workloads on a single server
> > to maximize resource utilization. Our goal is to selectively enable THP for
> > services that benefit from it while keeping it disabled for others. This
> > approach allows us to incrementally enable THP for additional services and
> > assess how to make it more viable in production.
> >
> > Proposed Solution
> > -----------------
> >
> > For this use case, Johannes suggested introducing a dedicated mode [0]. In
> > this new mode, we could implement BPF-based THP adjustment for fine-grained
> > control over tasks or cgroups. If no BPF program is attached, THP remains
> > in "never" mode. This solution elegantly meets our needs while avoiding the
> > complexity of managing BPF alongside other THP modes.
> >
> > A selftest example demonstrates how to enable THP for the current task
> > while keeping it disabled for others.
> >
> > Alternative Proposals
> > ---------------------
> >
> > - Gutierrez’s cgroup-based approach [1]
> > - Proposed adding a new cgroup file to control THP policy.
> > - However, as Johannes noted, cgroups are designed for hierarchical
> > resource allocation, not arbitrary policy settings [2].
> >
> > - Usama’s per-task THP proposal based on prctl() [3]:
> > - Enabling THP per task via prctl().
> > - As David pointed out, neither madvise() nor prctl() works in "never"
> > mode [4], making this solution insufficient for our needs.
> Hi Yafang Shao,
>
> I believe you would have to invert your logic and disable the
> processes you dont want using THPs, and have THP="madvise"|"always". I
> have yet to look over Usama's solution in detail but I believe this is
> possible based on his cover letter.
>
> I also have an alternative solution proposed here!
> https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/
>
> It's different in the sense it doesn't give you granular control per
> process, cgroup, or BPF programmability, but it "may" suit your needs
> by taming the THP waste and removing the latency spikes of PF time THP
> compactions/allocations.
Thank you for developing this feature. I'll review it carefully.
The challenge we face is that our system administration team doesn't
permit enabling THP globally in production by setting it to "madvise"
or "always". As a result, we can only experiment with your feature on
our test servers at this stage.
Therefore, our immediate priority isn't THP optimization, but rather
finding a way to safely enable THP in production first. The kernel
team needs a solution that addresses this fundamental deployment
hurdle before we can consider performance improvements.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 6:04 [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment Yafang Shao
` (5 preceding siblings ...)
2025-05-20 6:52 ` [RFC PATCH v2 0/5] mm, bpf: BPF based " Nico Pache
@ 2025-05-20 9:43 ` David Hildenbrand
2025-05-20 9:49 ` Lorenzo Stoakes
2025-05-20 11:59 ` Yafang Shao
2025-05-25 3:01 ` Yafang Shao
7 siblings, 2 replies; 52+ messages in thread
From: David Hildenbrand @ 2025-05-20 9:43 UTC (permalink / raw)
To: Yafang Shao, akpm, ziy, baolin.wang, lorenzo.stoakes,
Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
usamaarif642, gutierrez.asier, willy, ast, daniel, andrii
Cc: bpf, linux-mm
> Conclusion
> ----------
>
> Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
> most effective solution for our requirements. This approach represents a
> small but meaningful step toward making THP truly usable—and manageable—in
> production environments.
A new "bpf" mode sounds way too special.
We currently have:
never -> never
madvise -> MADV_HUGEPAGE, except PR_SET_THP_DISABLE
always -> always, except PR_SET_THP_DISABLE and MADV_NOHUGEPAGE
Whatever new mode we add, it should honor PR_SET_THP_DISABLE +
MADV_NOHUGEPAGE.
So, if we want another way to enable things, it would live between
"never" and "madvise".
I'm wondering how we could make that generic: likely we want this new
mechanism to *not* be triggerable by the process itself (madvise).
I am not convinced bpf is the answer here ...
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 9:43 ` David Hildenbrand
@ 2025-05-20 9:49 ` Lorenzo Stoakes
2025-05-20 12:06 ` Yafang Shao
2025-05-20 11:59 ` Yafang Shao
1 sibling, 1 reply; 52+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 9:49 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yafang Shao, akpm, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On Tue, May 20, 2025 at 11:43:11AM +0200, David Hildenbrand wrote:
> > Conclusion
> > ----------
> >
> > Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
> > most effective solution for our requirements. This approach represents a
> > small but meaningful step toward making THP truly usable—and manageable—in
> > production environments.
> A new "bpf" mode sounds way too special.
>
> We currently have:
>
> never -> never
> madvise -> MADV_HUGEPAGE, except PR_SET_THP_DISABLE
> always -> always, except PR_SET_THP_DISABLE and MADV_NOHUGEPAGE
>
> Whatever new mode we add, it should honor PR_SET_THP_DISABLE +
> MADV_NOHUGEPAGE.
>
> So, if we want another way to enable things, it would live between "never"
> and "madvise".
>
> I'm wondering how we could make that generic: likely we want this new
> mechanism to *not* be triggerable by the process itself (madvise).
>
> I am not convinced bpf is the answer here ...
Agreed.
I am also very concerned with us inserting BPF bits here - are we not then
ensuring that we cannot in any way move towards a future where we
'automagically' determine what to do?
I don't know what is claimed about BPF, but it strikes me that we're
establishing a permanent uABI (uAPI?) if we do that and essentially
promising that THP will continue to operate in a fashion similar to how it
does now.
While BPF is a wonderful technology, I thik we have to be very very careful
about inserting it in places that consist of -implementation details- that
we in mm already are planning to move away from.
It's one thing adding BPF in the oomk (simple interface, unlikely to
change, doesn't really constrain us) or the scheduler (again the hooks are
by nature reasonably stable), it's quite another sticking it in the heart
of a part of mm that is undergoing _constant_ change, partly as evidenced
by the sheer number of series related to THP that are currently on-list.
So while BPF may be the best solution for your needs _right now_, we need
be concerned with how things affect the kernel in the future.
I think we really do have to tread very carefully here.
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 9:43 ` David Hildenbrand
2025-05-20 9:49 ` Lorenzo Stoakes
@ 2025-05-20 11:59 ` Yafang Shao
1 sibling, 0 replies; 52+ messages in thread
From: Yafang Shao @ 2025-05-20 11:59 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On Tue, May 20, 2025 at 5:44 PM David Hildenbrand <david@redhat.com> wrote:
>
> > Conclusion
> > ----------
> >
> > Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
> > most effective solution for our requirements. This approach represents a
> > small but meaningful step toward making THP truly usable—and manageable—in
> > production environments.
> A new "bpf" mode sounds way too special.
Alternatively, we could simply hook 'madvise' to define a BPF-based policy.
>
> We currently have:
>
> never -> never
> madvise -> MADV_HUGEPAGE, except PR_SET_THP_DISABLE
> always -> always, except PR_SET_THP_DISABLE and MADV_NOHUGEPAGE
If BPF had been invented before THP, we likely would have only three
modes—without PR_SET_THP_DISABLE, MADV_NOHUGEPAGE, or MADV_HUGEPAGE;-)
never -> never
user -> user defined per task or per vma THP mode selector, based on BPF
We can select "never" or "always" for a specific task or vma
The API is as follows,
bpf->per_task_mode_selector(task);
bpf->per_vma_mode_selecor(vma);
always -> always
However, it’s not too late to introduce a new BPF-based mode for THP,
especially since future adjustments to THP policies are still
expected. Regardless of the specific policy, two fundamental
principles apply:
1. Selective Benefit: Some tasks benefit from THP, while others do not.
2. Conditional Safety: THP allocation is safe under certain conditions
but not others.
Given these constraints, we could abstract stable APIs that allow
users to define custom THP policies tailored to their needs.
>
> Whatever new mode we add, it should honor PR_SET_THP_DISABLE +
> MADV_NOHUGEPAGE.
Yes, the BPF only selects different THP modes for different tasks,
nothing else won't be changed.
>
> So, if we want another way to enable things, it would live between
> "never" and "madvise".
Yes, BPF only selects the appropriate THP mode for each task—nothing
else is modified.
>
> I'm wondering how we could make that generic: likely we want this new
> mechanism to *not* be triggerable by the process itself (madvise).
>
> I am not convinced bpf is the answer here ...
I believe the key insight is that we should define a generic, stable
API for BPF-based THP mode selection.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 9:49 ` Lorenzo Stoakes
@ 2025-05-20 12:06 ` Yafang Shao
2025-05-20 13:45 ` Lorenzo Stoakes
0 siblings, 1 reply; 52+ messages in thread
From: Yafang Shao @ 2025-05-20 12:06 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand, akpm, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On Tue, May 20, 2025 at 5:49 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, May 20, 2025 at 11:43:11AM +0200, David Hildenbrand wrote:
> > > Conclusion
> > > ----------
> > >
> > > Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
> > > most effective solution for our requirements. This approach represents a
> > > small but meaningful step toward making THP truly usable—and manageable—in
> > > production environments.
> > A new "bpf" mode sounds way too special.
> >
> > We currently have:
> >
> > never -> never
> > madvise -> MADV_HUGEPAGE, except PR_SET_THP_DISABLE
> > always -> always, except PR_SET_THP_DISABLE and MADV_NOHUGEPAGE
> >
> > Whatever new mode we add, it should honor PR_SET_THP_DISABLE +
> > MADV_NOHUGEPAGE.
> >
> > So, if we want another way to enable things, it would live between "never"
> > and "madvise".
> >
> > I'm wondering how we could make that generic: likely we want this new
> > mechanism to *not* be triggerable by the process itself (madvise).
> >
> > I am not convinced bpf is the answer here ...
>
> Agreed.
>
> I am also very concerned with us inserting BPF bits here - are we not then
> ensuring that we cannot in any way move towards a future where we
> 'automagically' determine what to do?
>
> I don't know what is claimed about BPF, but it strikes me that we're
> establishing a permanent uABI (uAPI?) if we do that and essentially
> promising that THP will continue to operate in a fashion similar to how it
> does now.
>
> While BPF is a wonderful technology, I thik we have to be very very careful
> about inserting it in places that consist of -implementation details- that
> we in mm already are planning to move away from.
>
> It's one thing adding BPF in the oomk (simple interface, unlikely to
> change, doesn't really constrain us) or the scheduler (again the hooks are
> by nature reasonably stable), it's quite another sticking it in the heart
> of a part of mm that is undergoing _constant_ change, partly as evidenced
> by the sheer number of series related to THP that are currently on-list.
>
> So while BPF may be the best solution for your needs _right now_, we need
> be concerned with how things affect the kernel in the future.
>
> I think we really do have to tread very carefully here.
I totally agree with you that the key point here is how to define the
API. As I replied to David, I believe we have two fundamental
principles to adjust the THP policies:
1. Selective Benefit: Some tasks benefit from THP, while others do not.
2. Conditional Safety: THP allocation is safe under certain conditions
but not others.
Therefore, I believe we can define these APIs based on the established
principles - everything else constitutes implementation details, even
if core MM internals need to change.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 7:25 ` Yafang Shao
@ 2025-05-20 13:10 ` Matthew Wilcox
2025-05-20 14:08 ` Yafang Shao
0 siblings, 1 reply; 52+ messages in thread
From: Matthew Wilcox @ 2025-05-20 13:10 UTC (permalink / raw)
To: Yafang Shao
Cc: Nico Pache, akpm, david, ziy, baolin.wang, lorenzo.stoakes,
Liam.Howlett, ryan.roberts, dev.jain, hannes, usamaarif642,
gutierrez.asier, ast, daniel, andrii, bpf, linux-mm
On Tue, May 20, 2025 at 03:25:07PM +0800, Yafang Shao wrote:
> The challenge we face is that our system administration team doesn't
> permit enabling THP globally in production by setting it to "madvise"
> or "always". As a result, we can only experiment with your feature on
> our test servers at this stage.
That's a you problem. You need to figure out how to influence your
sysadmin team to change their mind; whether it's by talking to their
superiors or persuading them directly. It's not a justification for why
upstream should take this patch.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 12:06 ` Yafang Shao
@ 2025-05-20 13:45 ` Lorenzo Stoakes
2025-05-20 15:54 ` David Hildenbrand
2025-05-21 3:52 ` Yafang Shao
0 siblings, 2 replies; 52+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 13:45 UTC (permalink / raw)
To: Yafang Shao
Cc: David Hildenbrand, akpm, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On Tue, May 20, 2025 at 08:06:21PM +0800, Yafang Shao wrote:
> On Tue, May 20, 2025 at 5:49 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Tue, May 20, 2025 at 11:43:11AM +0200, David Hildenbrand wrote:
> > > > Conclusion
> > > > ----------
> > > >
> > > > Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
> > > > most effective solution for our requirements. This approach represents a
> > > > small but meaningful step toward making THP truly usable—and manageable—in
> > > > production environments.
> > > A new "bpf" mode sounds way too special.
> > >
> > > We currently have:
> > >
> > > never -> never
> > > madvise -> MADV_HUGEPAGE, except PR_SET_THP_DISABLE
> > > always -> always, except PR_SET_THP_DISABLE and MADV_NOHUGEPAGE
> > >
> > > Whatever new mode we add, it should honor PR_SET_THP_DISABLE +
> > > MADV_NOHUGEPAGE.
> > >
> > > So, if we want another way to enable things, it would live between "never"
> > > and "madvise".
> > >
> > > I'm wondering how we could make that generic: likely we want this new
> > > mechanism to *not* be triggerable by the process itself (madvise).
> > >
> > > I am not convinced bpf is the answer here ...
> >
> > Agreed.
> >
> > I am also very concerned with us inserting BPF bits here - are we not then
> > ensuring that we cannot in any way move towards a future where we
> > 'automagically' determine what to do?
> >
> > I don't know what is claimed about BPF, but it strikes me that we're
> > establishing a permanent uABI (uAPI?) if we do that and essentially
> > promising that THP will continue to operate in a fashion similar to how it
> > does now.
> >
> > While BPF is a wonderful technology, I thik we have to be very very careful
> > about inserting it in places that consist of -implementation details- that
> > we in mm already are planning to move away from.
> >
> > It's one thing adding BPF in the oomk (simple interface, unlikely to
> > change, doesn't really constrain us) or the scheduler (again the hooks are
> > by nature reasonably stable), it's quite another sticking it in the heart
> > of a part of mm that is undergoing _constant_ change, partly as evidenced
> > by the sheer number of series related to THP that are currently on-list.
> >
> > So while BPF may be the best solution for your needs _right now_, we need
> > be concerned with how things affect the kernel in the future.
> >
> > I think we really do have to tread very carefully here.
>
> I totally agree with you that the key point here is how to define the
> API. As I replied to David, I believe we have two fundamental
> principles to adjust the THP policies:
> 1. Selective Benefit: Some tasks benefit from THP, while others do not.
> 2. Conditional Safety: THP allocation is safe under certain conditions
> but not others.
>
> Therefore, I believe we can define these APIs based on the established
> principles - everything else constitutes implementation details, even
> if core MM internals need to change.
But if we're looking to make the concept of THP go away, we really need to
go further than this.
The second we have 'bpf program that figures out whether THP should be
used' we are permanently tied to the idea of THP on/off being a thing.
I mean any future stuff that makes THP more automagic will probably involve
having new modes for the legacy THP
/sys/kernel/mm/transparent_hugepage/enabled and
/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/enabled
But if people are super reliant on this stuff it's potentially really
limiting.
I think you said in another post here that you were toying with the notion
of exposing somehow the madvise() interface and having that be the 'stable
API' of sorts?
That definitely sounds more sensible than something that very explicitly
interacts with THP.
Of course we have Usama's series and my proposed series for extending
process_madvise() along those lines also.
>
> --
> Regards
> Yafang
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 13:10 ` Matthew Wilcox
@ 2025-05-20 14:08 ` Yafang Shao
2025-05-20 14:22 ` Lorenzo Stoakes
0 siblings, 1 reply; 52+ messages in thread
From: Yafang Shao @ 2025-05-20 14:08 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Nico Pache, akpm, david, ziy, baolin.wang, lorenzo.stoakes,
Liam.Howlett, ryan.roberts, dev.jain, hannes, usamaarif642,
gutierrez.asier, ast, daniel, andrii, bpf, linux-mm
On Tue, May 20, 2025 at 9:10 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, May 20, 2025 at 03:25:07PM +0800, Yafang Shao wrote:
> > The challenge we face is that our system administration team doesn't
> > permit enabling THP globally in production by setting it to "madvise"
> > or "always". As a result, we can only experiment with your feature on
> > our test servers at this stage.
>
> That's a you problem.
perhaps.
> You need to figure out how to influence your
> sysadmin team to change their mind; whether it's by talking to their
> superiors or persuading them directly.
I believe that "practicing" matters more than "talking" or "persuading".
I’m surprised your suggestion relies on "talking" ;-)
If I understand correctly, we all agree that "talk is cheap", right?
> It's not a justification for why
> upstream should take this patch.
I believe Johannes has clearly explained the challenges the community
is currently facing [0].
[0]. https://lore.kernel.org/linux-mm/20250430174521.GC2020@cmpxchg.org/
--
Regards
Yafang
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 14:08 ` Yafang Shao
@ 2025-05-20 14:22 ` Lorenzo Stoakes
2025-05-20 14:32 ` Usama Arif
0 siblings, 1 reply; 52+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 14:22 UTC (permalink / raw)
To: Yafang Shao
Cc: Matthew Wilcox, Nico Pache, akpm, david, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, hannes, usamaarif642,
gutierrez.asier, ast, daniel, andrii, bpf, linux-mm
On Tue, May 20, 2025 at 10:08:03PM +0800, Yafang Shao wrote:
> On Tue, May 20, 2025 at 9:10 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, May 20, 2025 at 03:25:07PM +0800, Yafang Shao wrote:
> > > The challenge we face is that our system administration team doesn't
> > > permit enabling THP globally in production by setting it to "madvise"
> > > or "always". As a result, we can only experiment with your feature on
> > > our test servers at this stage.
> >
> > That's a you problem.
>
> perhaps.
>
> > You need to figure out how to influence your
> > sysadmin team to change their mind; whether it's by talking to their
> > superiors or persuading them directly.
>
> I believe that "practicing" matters more than "talking" or "persuading".
> I’m surprised your suggestion relies on "talking" ;-)
> If I understand correctly, we all agree that "talk is cheap", right?
>
> > It's not a justification for why
> > upstream should take this patch.
>
> I believe Johannes has clearly explained the challenges the community
> is currently facing [0].
>
> [0]. https://lore.kernel.org/linux-mm/20250430174521.GC2020@cmpxchg.org/
(Sorry to interject on your conversation, but :)
I don't think anybody denies we have issues in configuring this stuff
sensibly. A global-only control isn't going to cut it in the real world it
seems.
To me as you say yourself, definining the ABI/API here is what really matters,
and we're right now inundated with several series all at once (you wait for one
bus then 3 come at once... :).
So this I think, should be the question.
I like the idea of just exposing something like madvise(), which is something
we're going to maintain indefinitely.
Though any such exposure would in my view would need to be opt-in i.e. have a
list of MADV_... options that are accepted, as we'd need to very cautiously
determine which are safe from this context.
Of course then this leads to the whole thing (and I really know very little
about BPF internals - obviously happy to understand more) of whether we can just
use the madvise() code direct or what locking we can do or how all that works.
At any rate, a custom thing that is specific as 'switch mode for mTHP pages of
size X to Y' is just something I'd rather us not tie ourselves to.
>
>
> --
> Regards
>
> Yafang
What do you think re: bpf vs. something like my proposed process_madvise()
extensions or Usama's proposed prctl()?
Simpler, but really just using madvise functionality and having a means of
defaulting across fork/exec (notwithstanding Jann's concerns in this area).
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 14:22 ` Lorenzo Stoakes
@ 2025-05-20 14:32 ` Usama Arif
2025-05-20 14:35 ` Lorenzo Stoakes
2025-05-20 15:00 ` David Hildenbrand
0 siblings, 2 replies; 52+ messages in thread
From: Usama Arif @ 2025-05-20 14:32 UTC (permalink / raw)
To: Lorenzo Stoakes, Yafang Shao
Cc: Matthew Wilcox, Nico Pache, akpm, david, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, hannes, gutierrez.asier,
ast, daniel, andrii, bpf, linux-mm
On 20/05/2025 15:22, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 10:08:03PM +0800, Yafang Shao wrote:
>> On Tue, May 20, 2025 at 9:10 PM Matthew Wilcox <willy@infradead.org> wrote:
>>>
>>> On Tue, May 20, 2025 at 03:25:07PM +0800, Yafang Shao wrote:
>>>> The challenge we face is that our system administration team doesn't
>>>> permit enabling THP globally in production by setting it to "madvise"
>>>> or "always". As a result, we can only experiment with your feature on
>>>> our test servers at this stage.
>>>
>>> That's a you problem.
>>
>> perhaps.
>>
>>> You need to figure out how to influence your
>>> sysadmin team to change their mind; whether it's by talking to their
>>> superiors or persuading them directly.
>>
>> I believe that "practicing" matters more than "talking" or "persuading".
>> I’m surprised your suggestion relies on "talking" ;-)
>> If I understand correctly, we all agree that "talk is cheap", right?
>>
>>> It's not a justification for why
>>> upstream should take this patch.
>>
>> I believe Johannes has clearly explained the challenges the community
>> is currently facing [0].
>>
>> [0]. https://lore.kernel.org/linux-mm/20250430174521.GC2020@cmpxchg.org/
>
> (Sorry to interject on your conversation, but :)
>
> I don't think anybody denies we have issues in configuring this stuff
> sensibly. A global-only control isn't going to cut it in the real world it
> seems.
>
> To me as you say yourself, definining the ABI/API here is what really matters,
> and we're right now inundated with several series all at once (you wait for one
> bus then 3 come at once... :).
>
> So this I think, should be the question.
>
> I like the idea of just exposing something like madvise(), which is something
> we're going to maintain indefinitely.
>
> Though any such exposure would in my view would need to be opt-in i.e. have a
> list of MADV_... options that are accepted, as we'd need to very cautiously
> determine which are safe from this context.
>
> Of course then this leads to the whole thing (and I really know very little
> about BPF internals - obviously happy to understand more) of whether we can just
> use the madvise() code direct or what locking we can do or how all that works.
>
> At any rate, a custom thing that is specific as 'switch mode for mTHP pages of
> size X to Y' is just something I'd rather us not tie ourselves to.
>
>>
>>
>> --
>> Regards
>>
>> Yafang
>
> What do you think re: bpf vs. something like my proposed process_madvise()
> extensions or Usama's proposed prctl()?
>
> Simpler, but really just using madvise functionality and having a means of
> defaulting across fork/exec (notwithstanding Jann's concerns in this area).
Unfortunately I think the issue is that neither prctl or process_madvise would work
for Yafangs usecase? Its usecase 3 mentioned in [1], i.e.
global system policy=never, process wants "madvise" policy for itself.
Will let Yafang confirm.
[1] https://lore.kernel.org/all/13b68fa0-8755-43d8-8504-d181c2d46134@gmail.com/
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 14:32 ` Usama Arif
@ 2025-05-20 14:35 ` Lorenzo Stoakes
2025-05-20 14:42 ` Matthew Wilcox
2025-05-20 14:46 ` Usama Arif
2025-05-20 15:00 ` David Hildenbrand
1 sibling, 2 replies; 52+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 14:35 UTC (permalink / raw)
To: Usama Arif
Cc: Yafang Shao, Matthew Wilcox, Nico Pache, akpm, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, hannes,
gutierrez.asier, ast, daniel, andrii, bpf, linux-mm
On Tue, May 20, 2025 at 03:32:16PM +0100, Usama Arif wrote:
>
>
> On 20/05/2025 15:22, Lorenzo Stoakes wrote:
> > On Tue, May 20, 2025 at 10:08:03PM +0800, Yafang Shao wrote:
> >> On Tue, May 20, 2025 at 9:10 PM Matthew Wilcox <willy@infradead.org> wrote:
> >>>
> >>> On Tue, May 20, 2025 at 03:25:07PM +0800, Yafang Shao wrote:
> >>>> The challenge we face is that our system administration team doesn't
> >>>> permit enabling THP globally in production by setting it to "madvise"
> >>>> or "always". As a result, we can only experiment with your feature on
> >>>> our test servers at this stage.
> >>>
> >>> That's a you problem.
> >>
> >> perhaps.
> >>
> >>> You need to figure out how to influence your
> >>> sysadmin team to change their mind; whether it's by talking to their
> >>> superiors or persuading them directly.
> >>
> >> I believe that "practicing" matters more than "talking" or "persuading".
> >> I’m surprised your suggestion relies on "talking" ;-)
> >> If I understand correctly, we all agree that "talk is cheap", right?
> >>
> >>> It's not a justification for why
> >>> upstream should take this patch.
> >>
> >> I believe Johannes has clearly explained the challenges the community
> >> is currently facing [0].
> >>
> >> [0]. https://lore.kernel.org/linux-mm/20250430174521.GC2020@cmpxchg.org/
> >
> > (Sorry to interject on your conversation, but :)
> >
> > I don't think anybody denies we have issues in configuring this stuff
> > sensibly. A global-only control isn't going to cut it in the real world it
> > seems.
> >
> > To me as you say yourself, definining the ABI/API here is what really matters,
> > and we're right now inundated with several series all at once (you wait for one
> > bus then 3 come at once... :).
> >
> > So this I think, should be the question.
> >
> > I like the idea of just exposing something like madvise(), which is something
> > we're going to maintain indefinitely.
> >
> > Though any such exposure would in my view would need to be opt-in i.e. have a
> > list of MADV_... options that are accepted, as we'd need to very cautiously
> > determine which are safe from this context.
> >
> > Of course then this leads to the whole thing (and I really know very little
> > about BPF internals - obviously happy to understand more) of whether we can just
> > use the madvise() code direct or what locking we can do or how all that works.
> >
> > At any rate, a custom thing that is specific as 'switch mode for mTHP pages of
> > size X to Y' is just something I'd rather us not tie ourselves to.
> >
> >>
> >>
> >> --
> >> Regards
> >>
> >> Yafang
> >
> > What do you think re: bpf vs. something like my proposed process_madvise()
> > extensions or Usama's proposed prctl()?
> >
> > Simpler, but really just using madvise functionality and having a means of
> > defaulting across fork/exec (notwithstanding Jann's concerns in this area).
>
> Unfortunately I think the issue is that neither prctl or process_madvise would work
> for Yafangs usecase? Its usecase 3 mentioned in [1], i.e.
> global system policy=never, process wants "madvise" policy for itself.
> Will let Yafang confirm.
>
> [1] https://lore.kernel.org/all/13b68fa0-8755-43d8-8504-d181c2d46134@gmail.com/
>
Yeah I really object to that case. I explicitly said on your series I
object to it, I believe David did too.
Never should mean never.
It's a NACK if that's what this is about unless I'm missing something here.
I agree global settings are not fine-grained enough, but 'sys admins refuse
to do X so we want to ignore what they do' is... really not right at all.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 14:35 ` Lorenzo Stoakes
@ 2025-05-20 14:42 ` Matthew Wilcox
2025-05-20 14:56 ` David Hildenbrand
2025-05-21 4:28 ` Yafang Shao
2025-05-20 14:46 ` Usama Arif
1 sibling, 2 replies; 52+ messages in thread
From: Matthew Wilcox @ 2025-05-20 14:42 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Usama Arif, Yafang Shao, Nico Pache, akpm, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, hannes,
gutierrez.asier, ast, daniel, andrii, bpf, linux-mm
On Tue, May 20, 2025 at 03:35:49PM +0100, Lorenzo Stoakes wrote:
> I agree global settings are not fine-grained enough, but 'sys admins refuse
> to do X so we want to ignore what they do' is... really not right at all.
Oh, we do that all the time, Leave the interface around but document
it's now a no-op. For example, file-backed memory ignores the THP
settings completely. And mounting an NFS filesystem as "intr" has
been a no-op for over a decade.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 14:35 ` Lorenzo Stoakes
2025-05-20 14:42 ` Matthew Wilcox
@ 2025-05-20 14:46 ` Usama Arif
1 sibling, 0 replies; 52+ messages in thread
From: Usama Arif @ 2025-05-20 14:46 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Yafang Shao, Matthew Wilcox, Nico Pache, akpm, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, hannes,
gutierrez.asier, ast, daniel, andrii, bpf, linux-mm
On 20/05/2025 15:35, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 03:32:16PM +0100, Usama Arif wrote:
>>
>>
>> On 20/05/2025 15:22, Lorenzo Stoakes wrote:
>>> On Tue, May 20, 2025 at 10:08:03PM +0800, Yafang Shao wrote:
>>>> On Tue, May 20, 2025 at 9:10 PM Matthew Wilcox <willy@infradead.org> wrote:
>>>>>
>>>>> On Tue, May 20, 2025 at 03:25:07PM +0800, Yafang Shao wrote:
>>>>>> The challenge we face is that our system administration team doesn't
>>>>>> permit enabling THP globally in production by setting it to "madvise"
>>>>>> or "always". As a result, we can only experiment with your feature on
>>>>>> our test servers at this stage.
>>>>>
>>>>> That's a you problem.
>>>>
>>>> perhaps.
>>>>
>>>>> You need to figure out how to influence your
>>>>> sysadmin team to change their mind; whether it's by talking to their
>>>>> superiors or persuading them directly.
>>>>
>>>> I believe that "practicing" matters more than "talking" or "persuading".
>>>> I’m surprised your suggestion relies on "talking" ;-)
>>>> If I understand correctly, we all agree that "talk is cheap", right?
>>>>
>>>>> It's not a justification for why
>>>>> upstream should take this patch.
>>>>
>>>> I believe Johannes has clearly explained the challenges the community
>>>> is currently facing [0].
>>>>
>>>> [0]. https://lore.kernel.org/linux-mm/20250430174521.GC2020@cmpxchg.org/
>>>
>>> (Sorry to interject on your conversation, but :)
>>>
>>> I don't think anybody denies we have issues in configuring this stuff
>>> sensibly. A global-only control isn't going to cut it in the real world it
>>> seems.
>>>
>>> To me as you say yourself, definining the ABI/API here is what really matters,
>>> and we're right now inundated with several series all at once (you wait for one
>>> bus then 3 come at once... :).
>>>
>>> So this I think, should be the question.
>>>
>>> I like the idea of just exposing something like madvise(), which is something
>>> we're going to maintain indefinitely.
>>>
>>> Though any such exposure would in my view would need to be opt-in i.e. have a
>>> list of MADV_... options that are accepted, as we'd need to very cautiously
>>> determine which are safe from this context.
>>>
>>> Of course then this leads to the whole thing (and I really know very little
>>> about BPF internals - obviously happy to understand more) of whether we can just
>>> use the madvise() code direct or what locking we can do or how all that works.
>>>
>>> At any rate, a custom thing that is specific as 'switch mode for mTHP pages of
>>> size X to Y' is just something I'd rather us not tie ourselves to.
>>>
>>>>
>>>>
>>>> --
>>>> Regards
>>>>
>>>> Yafang
>>>
>>> What do you think re: bpf vs. something like my proposed process_madvise()
>>> extensions or Usama's proposed prctl()?
>>>
>>> Simpler, but really just using madvise functionality and having a means of
>>> defaulting across fork/exec (notwithstanding Jann's concerns in this area).
>>
>> Unfortunately I think the issue is that neither prctl or process_madvise would work
>> for Yafangs usecase? Its usecase 3 mentioned in [1], i.e.
>> global system policy=never, process wants "madvise" policy for itself.
>> Will let Yafang confirm.
>>
>> [1] https://lore.kernel.org/all/13b68fa0-8755-43d8-8504-d181c2d46134@gmail.com/
>>
>
> Yeah I really object to that case. I explicitly said on your series I
> object to it, I believe David did too.
Yes, I am not for it as well, which is why my series never tried to do it :)
As I mentioned in my series several times (unfortunately too many to count)
hugepage_global_enabled always evaluated to false when THP is never.
>
> Never should mean never.
>
> It's a NACK if that's what this is about unless I'm missing something here.
>
> I agree global settings are not fine-grained enough, but 'sys admins refuse
> to do X so we want to ignore what they do' is... really not right at all.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 14:42 ` Matthew Wilcox
@ 2025-05-20 14:56 ` David Hildenbrand
2025-05-21 4:28 ` Yafang Shao
1 sibling, 0 replies; 52+ messages in thread
From: David Hildenbrand @ 2025-05-20 14:56 UTC (permalink / raw)
To: Matthew Wilcox, Lorenzo Stoakes
Cc: Usama Arif, Yafang Shao, Nico Pache, akpm, ziy, baolin.wang,
Liam.Howlett, ryan.roberts, dev.jain, hannes, gutierrez.asier,
ast, daniel, andrii, bpf, linux-mm
On 20.05.25 16:42, Matthew Wilcox wrote:
> On Tue, May 20, 2025 at 03:35:49PM +0100, Lorenzo Stoakes wrote:
>> I agree global settings are not fine-grained enough, but 'sys admins refuse
>> to do X so we want to ignore what they do' is... really not right at all.
>
> Oh, we do that all the time, Leave the interface around but document
> it's now a no-op. For example, file-backed memory ignores the THP
> settings completely.
IIRC, it never honored it. Like shmem it also never honored it an
instead used it's own toggle (not sure what to think about that ...).
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 14:32 ` Usama Arif
2025-05-20 14:35 ` Lorenzo Stoakes
@ 2025-05-20 15:00 ` David Hildenbrand
1 sibling, 0 replies; 52+ messages in thread
From: David Hildenbrand @ 2025-05-20 15:00 UTC (permalink / raw)
To: Usama Arif, Lorenzo Stoakes, Yafang Shao
Cc: Matthew Wilcox, Nico Pache, akpm, ziy, baolin.wang, Liam.Howlett,
ryan.roberts, dev.jain, hannes, gutierrez.asier, ast, daniel,
andrii, bpf, linux-mm
On 20.05.25 16:32, Usama Arif wrote:
>
>
> On 20/05/2025 15:22, Lorenzo Stoakes wrote:
>> On Tue, May 20, 2025 at 10:08:03PM +0800, Yafang Shao wrote:
>>> On Tue, May 20, 2025 at 9:10 PM Matthew Wilcox <willy@infradead.org> wrote:
>>>>
>>>> On Tue, May 20, 2025 at 03:25:07PM +0800, Yafang Shao wrote:
>>>>> The challenge we face is that our system administration team doesn't
>>>>> permit enabling THP globally in production by setting it to "madvise"
>>>>> or "always". As a result, we can only experiment with your feature on
>>>>> our test servers at this stage.
>>>>
>>>> That's a you problem.
>>>
>>> perhaps.
>>>
>>>> You need to figure out how to influence your
>>>> sysadmin team to change their mind; whether it's by talking to their
>>>> superiors or persuading them directly.
>>>
>>> I believe that "practicing" matters more than "talking" or "persuading".
>>> I’m surprised your suggestion relies on "talking" ;-)
>>> If I understand correctly, we all agree that "talk is cheap", right?
>>>
>>>> It's not a justification for why
>>>> upstream should take this patch.
>>>
>>> I believe Johannes has clearly explained the challenges the community
>>> is currently facing [0].
>>>
>>> [0]. https://lore.kernel.org/linux-mm/20250430174521.GC2020@cmpxchg.org/
>>
>> (Sorry to interject on your conversation, but :)
>>
>> I don't think anybody denies we have issues in configuring this stuff
>> sensibly. A global-only control isn't going to cut it in the real world it
>> seems.
>>
>> To me as you say yourself, definining the ABI/API here is what really matters,
>> and we're right now inundated with several series all at once (you wait for one
>> bus then 3 come at once... :).
>>
>> So this I think, should be the question.
>>
>> I like the idea of just exposing something like madvise(), which is something
>> we're going to maintain indefinitely.
>>
>> Though any such exposure would in my view would need to be opt-in i.e. have a
>> list of MADV_... options that are accepted, as we'd need to very cautiously
>> determine which are safe from this context.
>>
>> Of course then this leads to the whole thing (and I really know very little
>> about BPF internals - obviously happy to understand more) of whether we can just
>> use the madvise() code direct or what locking we can do or how all that works.
>>
>> At any rate, a custom thing that is specific as 'switch mode for mTHP pages of
>> size X to Y' is just something I'd rather us not tie ourselves to.
>>
>>>
>>>
>>> --
>>> Regards
>>>
>>> Yafang
>>
>> What do you think re: bpf vs. something like my proposed process_madvise()
>> extensions or Usama's proposed prctl()?
>>
>> Simpler, but really just using madvise functionality and having a means of
>> defaulting across fork/exec (notwithstanding Jann's concerns in this area).
>
> Unfortunately I think the issue is that neither prctl or process_madvise would work
> for Yafangs usecase? Its usecase 3 mentioned in [1], i.e.
> global system policy=never, process wants "madvise" policy for itself.
If the global system policy would be "madvise", you'd need a way to just
disable it for processes where you wouldn't ever want them.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 13:45 ` Lorenzo Stoakes
@ 2025-05-20 15:54 ` David Hildenbrand
2025-05-21 4:02 ` Yafang Shao
2025-05-21 3:52 ` Yafang Shao
1 sibling, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2025-05-20 15:54 UTC (permalink / raw)
To: Lorenzo Stoakes, Yafang Shao
Cc: akpm, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
daniel, andrii, bpf, linux-mm
>> I totally agree with you that the key point here is how to define the
>> API. As I replied to David, I believe we have two fundamental
>> principles to adjust the THP policies:
>> 1. Selective Benefit: Some tasks benefit from THP, while others do not.
>> 2. Conditional Safety: THP allocation is safe under certain conditions
>> but not others.
>>
>> Therefore, I believe we can define these APIs based on the established
>> principles - everything else constitutes implementation details, even
>> if core MM internals need to change.
>
> But if we're looking to make the concept of THP go away, we really need to
> go further than this.
Yeah. I might be wrong, but I also don't think doing control on a
per-process level etc would be the right solution long-term.
In a world where we do stuff automatically ("auto" mode), we would be
much smarter about where to place a (m)THP, and which size we would use.
One might use bpf to control the allocation policy. But I don't think
this would be per-process or even per-VMA etc. Sure, we might give
hints, but placement decisions should happen on another level (e.g.,
during page faults, during khugepaged etc).
>
> The second we have 'bpf program that figures out whether THP should be
> used' we are permanently tied to the idea of THP on/off being a thing.
>
> I mean any future stuff that makes THP more automagic will probably involve
> having new modes for the legacy THP
> /sys/kernel/mm/transparent_hugepage/enabled and
> /sys/kernel/mm/transparent_hugepage/hugepages-xxkB/enabled
Yeah, the plan is to have "auto" in
/sys/kernel/mm/transparent_hugepage/enabled and just have all other
sizes "inherit" that option. And have a Kconfig that just enables that
as default. Once we're there, just phase out the interface long-term.
That's the plan. Now we "only" have to figure out how to make the
placement actually better ;)
>
> But if people are super reliant on this stuff it's potentially really
> limiting.
>
> I think you said in another post here that you were toying with the notion
> of exposing somehow the madvise() interface and having that be the 'stable
> API' of sorts?
>
> That definitely sounds more sensible than something that very explicitly
> interacts with THP.
>
> Of course we have Usama's series and my proposed series for extending
> process_madvise() along those lines also.
Yes.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 4/5] bpf: Add get_current_comm to bpf_base_func_proto
2025-05-20 6:05 ` [RFC PATCH v2 4/5] bpf: Add get_current_comm to bpf_base_func_proto Yafang Shao
@ 2025-05-20 23:32 ` Andrii Nakryiko
0 siblings, 0 replies; 52+ messages in thread
From: Andrii Nakryiko @ 2025-05-20 23:32 UTC (permalink / raw)
To: Yafang Shao
Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
npache, ryan.roberts, dev.jain, hannes, usamaarif642,
gutierrez.asier, willy, ast, daniel, andrii, bpf, linux-mm
On Mon, May 19, 2025 at 11:06 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> While testing the BPF based THP adjustment feature, I noticed
> bpf_get_current_comm() isn't available in bpf_base_func_proto. As this is a
> commonly used helper, we should add it to bpf_base_func_proto.
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
> kernel/bpf/cgroup.c | 2 --
> kernel/bpf/helpers.c | 2 ++
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
please rebase, there were changes in this area and
bpf_get_current_comm is already in bpf_base_func_proto (and
cgroup_current_func_proto is gone)
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 84f58f3d028a..22cd4f54d023 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -2609,8 +2609,6 @@ cgroup_current_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> switch (func_id) {
> case BPF_FUNC_get_current_uid_gid:
> return &bpf_get_current_uid_gid_proto;
> - case BPF_FUNC_get_current_comm:
> - return &bpf_get_current_comm_proto;
> #ifdef CONFIG_CGROUP_NET_CLASSID
> case BPF_FUNC_get_cgroup_classid:
> return &bpf_get_cgroup_classid_curr_proto;
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index e3a2662f4e33..2a60522cd66f 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1965,6 +1965,8 @@ bpf_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> return &bpf_get_current_pid_tgid_proto;
> case BPF_FUNC_get_ns_current_pid_tgid:
> return &bpf_get_ns_current_pid_tgid_proto;
> + case BPF_FUNC_get_current_comm:
> + return &bpf_get_current_comm_proto;
> default:
> break;
> }
> --
> 2.43.5
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 13:45 ` Lorenzo Stoakes
2025-05-20 15:54 ` David Hildenbrand
@ 2025-05-21 3:52 ` Yafang Shao
1 sibling, 0 replies; 52+ messages in thread
From: Yafang Shao @ 2025-05-21 3:52 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand, akpm, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On Tue, May 20, 2025 at 9:45 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, May 20, 2025 at 08:06:21PM +0800, Yafang Shao wrote:
> > On Tue, May 20, 2025 at 5:49 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Tue, May 20, 2025 at 11:43:11AM +0200, David Hildenbrand wrote:
> > > > > Conclusion
> > > > > ----------
> > > > >
> > > > > Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
> > > > > most effective solution for our requirements. This approach represents a
> > > > > small but meaningful step toward making THP truly usable—and manageable—in
> > > > > production environments.
> > > > A new "bpf" mode sounds way too special.
> > > >
> > > > We currently have:
> > > >
> > > > never -> never
> > > > madvise -> MADV_HUGEPAGE, except PR_SET_THP_DISABLE
> > > > always -> always, except PR_SET_THP_DISABLE and MADV_NOHUGEPAGE
> > > >
> > > > Whatever new mode we add, it should honor PR_SET_THP_DISABLE +
> > > > MADV_NOHUGEPAGE.
> > > >
> > > > So, if we want another way to enable things, it would live between "never"
> > > > and "madvise".
> > > >
> > > > I'm wondering how we could make that generic: likely we want this new
> > > > mechanism to *not* be triggerable by the process itself (madvise).
> > > >
> > > > I am not convinced bpf is the answer here ...
> > >
> > > Agreed.
> > >
> > > I am also very concerned with us inserting BPF bits here - are we not then
> > > ensuring that we cannot in any way move towards a future where we
> > > 'automagically' determine what to do?
> > >
> > > I don't know what is claimed about BPF, but it strikes me that we're
> > > establishing a permanent uABI (uAPI?) if we do that and essentially
> > > promising that THP will continue to operate in a fashion similar to how it
> > > does now.
> > >
> > > While BPF is a wonderful technology, I thik we have to be very very careful
> > > about inserting it in places that consist of -implementation details- that
> > > we in mm already are planning to move away from.
> > >
> > > It's one thing adding BPF in the oomk (simple interface, unlikely to
> > > change, doesn't really constrain us) or the scheduler (again the hooks are
> > > by nature reasonably stable), it's quite another sticking it in the heart
> > > of a part of mm that is undergoing _constant_ change, partly as evidenced
> > > by the sheer number of series related to THP that are currently on-list.
> > >
> > > So while BPF may be the best solution for your needs _right now_, we need
> > > be concerned with how things affect the kernel in the future.
> > >
> > > I think we really do have to tread very carefully here.
> >
> > I totally agree with you that the key point here is how to define the
> > API. As I replied to David, I believe we have two fundamental
> > principles to adjust the THP policies:
> > 1. Selective Benefit: Some tasks benefit from THP, while others do not.
> > 2. Conditional Safety: THP allocation is safe under certain conditions
> > but not others.
> >
> > Therefore, I believe we can define these APIs based on the established
> > principles - everything else constitutes implementation details, even
> > if core MM internals need to change.
>
> But if we're looking to make the concept of THP go away, we really need to
> go further than this.
>
> The second we have 'bpf program that figures out whether THP should be
> used' we are permanently tied to the idea of THP on/off being a thing.
>
> I mean any future stuff that makes THP more automagic will probably involve
> having new modes for the legacy THP
> /sys/kernel/mm/transparent_hugepage/enabled and
> /sys/kernel/mm/transparent_hugepage/hugepages-xxkB/enabled
>
> But if people are super reliant on this stuff it's potentially really
> limiting.
>
> I think you said in another post here that you were toying with the notion
> of exposing somehow the madvise() interface and having that be the 'stable
> API' of sorts?
Yes, I have a BPF program that hooks into madvise() to selectively
enforce THP policies—allowing it for certain tasks while blocking it
for others. However, this violates the semantic guarantee of
madvise(). For instance, if a user sees THP configured in madvise
mode, they’d expect madvise() to reliably enable it. But with this BPF
logic, such calls might silently fail, creating inconsistency. This is
why we propose introducing a dedicated BPF-controlled mode, or
alternatively extending the semantics of the existing "never" mode.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 15:54 ` David Hildenbrand
@ 2025-05-21 4:02 ` Yafang Shao
0 siblings, 0 replies; 52+ messages in thread
From: Yafang Shao @ 2025-05-21 4:02 UTC (permalink / raw)
To: David Hildenbrand
Cc: Lorenzo Stoakes, akpm, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On Tue, May 20, 2025 at 11:54 PM David Hildenbrand <david@redhat.com> wrote:
>
> >> I totally agree with you that the key point here is how to define the
> >> API. As I replied to David, I believe we have two fundamental
> >> principles to adjust the THP policies:
> >> 1. Selective Benefit: Some tasks benefit from THP, while others do not.
> >> 2. Conditional Safety: THP allocation is safe under certain conditions
> >> but not others.
> >>
> >> Therefore, I believe we can define these APIs based on the established
> >> principles - everything else constitutes implementation details, even
> >> if core MM internals need to change.
> >
> > But if we're looking to make the concept of THP go away, we really need to
> > go further than this.
>
> Yeah. I might be wrong, but I also don't think doing control on a
> per-process level etc would be the right solution long-term.
The reality is that achieving truly 'automatic' THP behavior requires
process-level control. Given that THP provides no benefit for certain
workloads, there's no justification for incurring the overhead of
allocating higher-order pages in those cases.
>
> In a world where we do stuff automatically ("auto" mode), we would be
> much smarter about where to place a (m)THP, and which size we would use.
We still have considerable ground to cover before reaching this goal.
>
> One might use bpf to control the allocation policy. But I don't think
> this would be per-process or even per-VMA etc. Sure, we might give
> hints, but placement decisions should happen on another level (e.g.,
> during page faults, during khugepaged etc).
Nico has proposed introducing a new 'defer' mode to address this.
However, I argue that we could achieve the same functionality through
BPF instead of adding a dedicated policy mode. [0]
[0]. https://lore.kernel.org/linux-mm/CALOAHbAa7DY6+hO4RJtjg-MS+cnUmsiPXX8KS1MKSfgy6HLYAQ@mail.gmail.com/
>
> >
> > The second we have 'bpf program that figures out whether THP should be
> > used' we are permanently tied to the idea of THP on/off being a thing.
> >
> > I mean any future stuff that makes THP more automagic will probably involve
> > having new modes for the legacy THP
> > /sys/kernel/mm/transparent_hugepage/enabled and
> > /sys/kernel/mm/transparent_hugepage/hugepages-xxkB/enabled
>
> Yeah, the plan is to have "auto" in
> /sys/kernel/mm/transparent_hugepage/enabled and just have all other
> sizes "inherit" that option. And have a Kconfig that just enables that
> as default. Once we're there, just phase out the interface long-term.
>
> That's the plan. Now we "only" have to figure out how to make the
> placement actually better ;)
>
> >
> > But if people are super reliant on this stuff it's potentially really
> > limiting.
> >
> > I think you said in another post here that you were toying with the notion
> > of exposing somehow the madvise() interface and having that be the 'stable
> > API' of sorts?
> >
> > That definitely sounds more sensible than something that very explicitly
> > interacts with THP.
> >
> > Of course we have Usama's series and my proposed series for extending
> > process_madvise() along those lines also.
>
> Yes.
>
--
Regards
Yafang
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 14:42 ` Matthew Wilcox
2025-05-20 14:56 ` David Hildenbrand
@ 2025-05-21 4:28 ` Yafang Shao
1 sibling, 0 replies; 52+ messages in thread
From: Yafang Shao @ 2025-05-21 4:28 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Lorenzo Stoakes, Usama Arif, Nico Pache, akpm, david, ziy,
baolin.wang, Liam.Howlett, ryan.roberts, dev.jain, hannes,
gutierrez.asier, ast, daniel, andrii, bpf, linux-mm
On Tue, May 20, 2025 at 10:43 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, May 20, 2025 at 03:35:49PM +0100, Lorenzo Stoakes wrote:
> > I agree global settings are not fine-grained enough, but 'sys admins refuse
> > to do X so we want to ignore what they do' is... really not right at all.
>
> Oh, we do that all the time, Leave the interface around but document
> it's now a no-op. For example, file-backed memory ignores the THP
> settings completely.
This essentially invites downstream kernel developers to implement
their own "file-enabled" solutions ;-)
If you haven't yet encountered reports of file-backed THP causing
performance regressions for specific workloads, you may be missing
something. Our testing has confirmed performance degradation with
certain HDFS workloads, even on the 6.12.y kernel - though I've
prioritized discussing BPF-based THP control with you over
investigating those specific cases.
> And mounting an NFS filesystem as "intr" has
> been a no-op for over a decade.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-20 6:04 [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment Yafang Shao
` (6 preceding siblings ...)
2025-05-20 9:43 ` David Hildenbrand
@ 2025-05-25 3:01 ` Yafang Shao
2025-05-26 7:41 ` Gutierrez Asier
` (2 more replies)
7 siblings, 3 replies; 52+ messages in thread
From: Yafang Shao @ 2025-05-25 3:01 UTC (permalink / raw)
To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
npache, ryan.roberts, dev.jain, hannes, usamaarif642,
gutierrez.asier, willy, ast, daniel, andrii
Cc: bpf, linux-mm
On Tue, May 20, 2025 at 2:05 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> Background
> ----------
>
> At my current employer, PDD, we have consistently configured THP to "never"
> on our production servers due to past incidents caused by its behavior:
>
> - Increased memory consumption
> THP significantly raises overall memory usage.
>
> - Latency spikes
> Random latency spikes occur due to more frequent memory compaction
> activity triggered by THP.
>
> These issues have made sysadmins hesitant to switch to "madvise" or
> "always" modes.
>
> New Motivation
> --------------
>
> We have now identified that certain AI workloads achieve substantial
> performance gains with THP enabled. However, we’ve also verified that some
> workloads see little to no benefit—or are even negatively impacted—by THP.
>
> In our Kubernetes environment, we deploy mixed workloads on a single server
> to maximize resource utilization. Our goal is to selectively enable THP for
> services that benefit from it while keeping it disabled for others. This
> approach allows us to incrementally enable THP for additional services and
> assess how to make it more viable in production.
>
> Proposed Solution
> -----------------
>
> For this use case, Johannes suggested introducing a dedicated mode [0]. In
> this new mode, we could implement BPF-based THP adjustment for fine-grained
> control over tasks or cgroups. If no BPF program is attached, THP remains
> in "never" mode. This solution elegantly meets our needs while avoiding the
> complexity of managing BPF alongside other THP modes.
>
> A selftest example demonstrates how to enable THP for the current task
> while keeping it disabled for others.
>
> Alternative Proposals
> ---------------------
>
> - Gutierrez’s cgroup-based approach [1]
> - Proposed adding a new cgroup file to control THP policy.
> - However, as Johannes noted, cgroups are designed for hierarchical
> resource allocation, not arbitrary policy settings [2].
>
> - Usama’s per-task THP proposal based on prctl() [3]:
> - Enabling THP per task via prctl().
> - As David pointed out, neither madvise() nor prctl() works in "never"
> mode [4], making this solution insufficient for our needs.
>
> Conclusion
> ----------
>
> Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
> most effective solution for our requirements. This approach represents a
> small but meaningful step toward making THP truly usable—and manageable—in
> production environments.
>
> This is currently a PoC implementation. Feedback of any kind is welcome.
>
> Link: https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/ [0]
> Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
> Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
> Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]
> Link: https://lore.kernel.org/linux-mm/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/ [4]
>
> RFC v1->v2:
> The main changes are as follows,
> - Use struct_ops instead of fmod_ret (Alexei)
> - Introduce a new THP mode (Johannes)
> - Introduce new helpers for BPF hook (Zi)
> - Refine the commit log
>
> RFC v1: https://lwn.net/Articles/1019290/
>
> Yafang Shao (5):
> mm: thp: Add a new mode "bpf"
> mm: thp: Add hook for BPF based THP adjustment
> mm: thp: add struct ops for BPF based THP adjustment
> bpf: Add get_current_comm to bpf_base_func_proto
> selftests/bpf: Add selftest for THP adjustment
>
> include/linux/huge_mm.h | 15 +-
> kernel/bpf/cgroup.c | 2 -
> kernel/bpf/helpers.c | 2 +
> mm/Makefile | 3 +
> mm/bpf_thp.c | 120 ++++++++++++
> mm/huge_memory.c | 65 ++++++-
> mm/khugepaged.c | 3 +
> tools/testing/selftests/bpf/config | 1 +
> .../selftests/bpf/prog_tests/thp_adjust.c | 175 ++++++++++++++++++
> .../selftests/bpf/progs/test_thp_adjust.c | 39 ++++
> 10 files changed, 414 insertions(+), 11 deletions(-)
> create mode 100644 mm/bpf_thp.c
> create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
> create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
>
> --
> 2.43.5
>
Hi all,
Let’s summarize the current state of the discussion and identify how
to move forward.
- Global-Only Control is Not Viable
We all seem to agree that a global-only control for THP is unwise. In
practice, some workloads benefit from THP while others do not, so a
one-size-fits-all approach doesn’t work.
- Should We Use "Always" or "Madvise"?
I suspect no one would choose 'always' in its current state. ;)
Both Lorenzo and David propose relying on the madvise mode. However,
since madvise is an unprivileged userspace mechanism, any user can
freely adjust their THP policy. This makes fine-grained control
impossible without breaking userspace compatibility—an undesirable
tradeoff.
Given these limitations, the community should consider introducing a
new "admin" mode for privileged THP policy management.
- Can the Kernel Automatically Manage THP Without User Input?
In practice, users define their own success metrics—such as latency
(RT), queries per second (QPS), or throughput—to evaluate a feature’s
usefulness. If a feature fails to improve these metrics, it provides
no practical value.
Currently, the kernel lacks visibility into user-defined metrics,
making fully automated optimization impossible (at least without user
input). More importantly, automatic management offers no benefit if it
doesn’t align with user needs.
Exception: For kernel-enforced changes (e.g., the page-to-folio
transition), users must adapt regardless. But THP tuning requires
flexibility—forcing automation without measurable gains is
counterproductive.
(Please correct me if I’ve overlooked anything.)
--
Regards
Yafang
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-25 3:01 ` Yafang Shao
@ 2025-05-26 7:41 ` Gutierrez Asier
2025-05-26 9:37 ` Yafang Shao
2025-05-26 8:14 ` David Hildenbrand
2025-05-26 14:32 ` Zi Yan
2 siblings, 1 reply; 52+ messages in thread
From: Gutierrez Asier @ 2025-05-26 7:41 UTC (permalink / raw)
To: Yafang Shao, akpm, david, ziy, baolin.wang, lorenzo.stoakes,
Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
usamaarif642, willy, ast, daniel, andrii
Cc: bpf, linux-mm
On 5/25/2025 6:01 AM, Yafang Shao wrote:
> On Tue, May 20, 2025 at 2:05 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>
>> Background
>> ----------
>>
>> At my current employer, PDD, we have consistently configured THP to "never"
>> on our production servers due to past incidents caused by its behavior:
>>
>> - Increased memory consumption
>> THP significantly raises overall memory usage.
>>
>> - Latency spikes
>> Random latency spikes occur due to more frequent memory compaction
>> activity triggered by THP.
>>
>> These issues have made sysadmins hesitant to switch to "madvise" or
>> "always" modes.
>>
>> New Motivation
>> --------------
>>
>> We have now identified that certain AI workloads achieve substantial
>> performance gains with THP enabled. However, we’ve also verified that some
>> workloads see little to no benefit—or are even negatively impacted—by THP.
>>
>> In our Kubernetes environment, we deploy mixed workloads on a single server
>> to maximize resource utilization. Our goal is to selectively enable THP for
>> services that benefit from it while keeping it disabled for others. This
>> approach allows us to incrementally enable THP for additional services and
>> assess how to make it more viable in production.
>>
>> Proposed Solution
>> -----------------
>>
>> For this use case, Johannes suggested introducing a dedicated mode [0]. In
>> this new mode, we could implement BPF-based THP adjustment for fine-grained
>> control over tasks or cgroups. If no BPF program is attached, THP remains
>> in "never" mode. This solution elegantly meets our needs while avoiding the
>> complexity of managing BPF alongside other THP modes.
>>
>> A selftest example demonstrates how to enable THP for the current task
>> while keeping it disabled for others.
>>
>> Alternative Proposals
>> ---------------------
>>
>> - Gutierrez’s cgroup-based approach [1]
>> - Proposed adding a new cgroup file to control THP policy.
>> - However, as Johannes noted, cgroups are designed for hierarchical
>> resource allocation, not arbitrary policy settings [2].
>>
>> - Usama’s per-task THP proposal based on prctl() [3]:
>> - Enabling THP per task via prctl().
>> - As David pointed out, neither madvise() nor prctl() works in "never"
>> mode [4], making this solution insufficient for our needs.
>>
>> Conclusion
>> ----------
>>
>> Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
>> most effective solution for our requirements. This approach represents a
>> small but meaningful step toward making THP truly usable—and manageable—in
>> production environments.
>>
>> This is currently a PoC implementation. Feedback of any kind is welcome.
>>
>> Link: https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/ [0]
>> Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
>> Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
>> Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]
>> Link: https://lore.kernel.org/linux-mm/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/ [4]
>>
>> RFC v1->v2:
>> The main changes are as follows,
>> - Use struct_ops instead of fmod_ret (Alexei)
>> - Introduce a new THP mode (Johannes)
>> - Introduce new helpers for BPF hook (Zi)
>> - Refine the commit log
>>
>> RFC v1: https://lwn.net/Articles/1019290/
>>
>> Yafang Shao (5):
>> mm: thp: Add a new mode "bpf"
>> mm: thp: Add hook for BPF based THP adjustment
>> mm: thp: add struct ops for BPF based THP adjustment
>> bpf: Add get_current_comm to bpf_base_func_proto
>> selftests/bpf: Add selftest for THP adjustment
>>
>> include/linux/huge_mm.h | 15 +-
>> kernel/bpf/cgroup.c | 2 -
>> kernel/bpf/helpers.c | 2 +
>> mm/Makefile | 3 +
>> mm/bpf_thp.c | 120 ++++++++++++
>> mm/huge_memory.c | 65 ++++++-
>> mm/khugepaged.c | 3 +
>> tools/testing/selftests/bpf/config | 1 +
>> .../selftests/bpf/prog_tests/thp_adjust.c | 175 ++++++++++++++++++
>> .../selftests/bpf/progs/test_thp_adjust.c | 39 ++++
>> 10 files changed, 414 insertions(+), 11 deletions(-)
>> create mode 100644 mm/bpf_thp.c
>> create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
>> create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
>>
>> --
>> 2.43.5
>>
>
> Hi all,
>
> Let’s summarize the current state of the discussion and identify how
> to move forward.
>
> - Global-Only Control is Not Viable
> We all seem to agree that a global-only control for THP is unwise. In
> practice, some workloads benefit from THP while others do not, so a
> one-size-fits-all approach doesn’t work.
>
> - Should We Use "Always" or "Madvise"?
> I suspect no one would choose 'always' in its current state. ;)
> Both Lorenzo and David propose relying on the madvise mode. However,
> since madvise is an unprivileged userspace mechanism, any user can
> freely adjust their THP policy. This makes fine-grained control
> impossible without breaking userspace compatibility—an undesirable
> tradeoff.
> Given these limitations, the community should consider introducing a
> new "admin" mode for privileged THP policy management.
>
> - Can the Kernel Automatically Manage THP Without User Input?
> In practice, users define their own success metrics—such as latency
> (RT), queries per second (QPS), or throughput—to evaluate a feature’s
> usefulness. If a feature fails to improve these metrics, it provides
> no practical value.
> Currently, the kernel lacks visibility into user-defined metrics,
> making fully automated optimization impossible (at least without user
> input). More importantly, automatic management offers no benefit if it
> doesn’t align with user needs.
I don't think that using things like RPS or QPS is the right way.
These metrics can be affected by many factors like network issues,
garbage collectors in the user space (JVM, golang, etc.) and many other
things out of our control. Even noisy neighbors can slow down a service.
> Exception: For kernel-enforced changes (e.g., the page-to-folio
> transition), users must adapt regardless. But THP tuning requires
> flexibility—forcing automation without measurable gains is
> counterproductive.
> (Please correct me if I’ve overlooked anything.)
>
--
Asier Gutierrez
Huawei
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-25 3:01 ` Yafang Shao
2025-05-26 7:41 ` Gutierrez Asier
@ 2025-05-26 8:14 ` David Hildenbrand
2025-05-26 9:37 ` Yafang Shao
2025-05-26 14:32 ` Zi Yan
2 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2025-05-26 8:14 UTC (permalink / raw)
To: Yafang Shao, akpm, ziy, baolin.wang, lorenzo.stoakes,
Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
usamaarif642, gutierrez.asier, willy, ast, daniel, andrii
Cc: bpf, linux-mm
> Hi all,
>
> Let’s summarize the current state of the discussion and identify how
> to move forward.
>
> - Global-Only Control is Not Viable
> We all seem to agree that a global-only control for THP is unwise. In
> practice, some workloads benefit from THP while others do not, so a
> one-size-fits-all approach doesn’t work.
>
> - Should We Use "Always" or "Madvise"?
> I suspect no one would choose 'always' in its current state. ;)
IIRC, RHEL9 has the default set to "always" for a long time.
I guess it really depends on how different the workloads are that you
are running on the same machine.
> Both Lorenzo and David propose relying on the madvise mode. However,>
since madvise is an unprivileged userspace mechanism, any user can
> freely adjust their THP policy. This makes fine-grained control
> impossible without breaking userspace compatibility—an undesirable
> tradeoff.
If required, we could look into a "sealing" mechanism, that would
essentially lock modification attempts performed by the process (i.e.,
MADV_HUGEPAGE).
The could be added on top of the current proposals that are flying
around, and could be done e.g., per-process.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-26 8:14 ` David Hildenbrand
@ 2025-05-26 9:37 ` Yafang Shao
2025-05-26 10:49 ` David Hildenbrand
0 siblings, 1 reply; 52+ messages in thread
From: Yafang Shao @ 2025-05-26 9:37 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On Mon, May 26, 2025 at 4:14 PM David Hildenbrand <david@redhat.com> wrote:
>
> > Hi all,
> >
> > Let’s summarize the current state of the discussion and identify how
> > to move forward.
> >
> > - Global-Only Control is Not Viable
> > We all seem to agree that a global-only control for THP is unwise. In
> > practice, some workloads benefit from THP while others do not, so a
> > one-size-fits-all approach doesn’t work.
> >
> > - Should We Use "Always" or "Madvise"?
> > I suspect no one would choose 'always' in its current state. ;)
>
> IIRC, RHEL9 has the default set to "always" for a long time.
good to know.
>
> I guess it really depends on how different the workloads are that you
> are running on the same machine.
Correct. If we want to enable THP for specific workloads without
modifying the kernel, we must isolate them on dedicated servers.
However, this approach wastes resources and is not an acceptable
solution.
>
> > Both Lorenzo and David propose relying on the madvise mode. However,>
> since madvise is an unprivileged userspace mechanism, any user can
> > freely adjust their THP policy. This makes fine-grained control
> > impossible without breaking userspace compatibility—an undesirable
> > tradeoff.
>
> If required, we could look into a "sealing" mechanism, that would
> essentially lock modification attempts performed by the process (i.e.,
> MADV_HUGEPAGE).
If we don’t introduce a new THP mode and instead rely solely on
madvise, the "sealing" mechanism could either violate the intended
semantics of madvise(), or simply break madvise() entirely, right?
>
> The could be added on top of the current proposals that are flying
> around, and could be done e.g., per-process.
How about introducing a dedicated "process" mode? This would allow
each process to use different THP modes—some in "always," others in
"madvise," and the rest in "never." Future THP modes could also be
added to this framework.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-26 7:41 ` Gutierrez Asier
@ 2025-05-26 9:37 ` Yafang Shao
0 siblings, 0 replies; 52+ messages in thread
From: Yafang Shao @ 2025-05-26 9:37 UTC (permalink / raw)
To: Gutierrez Asier
Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
npache, ryan.roberts, dev.jain, hannes, usamaarif642, willy, ast,
daniel, andrii, bpf, linux-mm
On Mon, May 26, 2025 at 3:41 PM Gutierrez Asier
<gutierrez.asier@huawei-partners.com> wrote:
>
>
>
> On 5/25/2025 6:01 AM, Yafang Shao wrote:
> > On Tue, May 20, 2025 at 2:05 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>
> >> Background
> >> ----------
> >>
> >> At my current employer, PDD, we have consistently configured THP to "never"
> >> on our production servers due to past incidents caused by its behavior:
> >>
> >> - Increased memory consumption
> >> THP significantly raises overall memory usage.
> >>
> >> - Latency spikes
> >> Random latency spikes occur due to more frequent memory compaction
> >> activity triggered by THP.
> >>
> >> These issues have made sysadmins hesitant to switch to "madvise" or
> >> "always" modes.
> >>
> >> New Motivation
> >> --------------
> >>
> >> We have now identified that certain AI workloads achieve substantial
> >> performance gains with THP enabled. However, we’ve also verified that some
> >> workloads see little to no benefit—or are even negatively impacted—by THP.
> >>
> >> In our Kubernetes environment, we deploy mixed workloads on a single server
> >> to maximize resource utilization. Our goal is to selectively enable THP for
> >> services that benefit from it while keeping it disabled for others. This
> >> approach allows us to incrementally enable THP for additional services and
> >> assess how to make it more viable in production.
> >>
> >> Proposed Solution
> >> -----------------
> >>
> >> For this use case, Johannes suggested introducing a dedicated mode [0]. In
> >> this new mode, we could implement BPF-based THP adjustment for fine-grained
> >> control over tasks or cgroups. If no BPF program is attached, THP remains
> >> in "never" mode. This solution elegantly meets our needs while avoiding the
> >> complexity of managing BPF alongside other THP modes.
> >>
> >> A selftest example demonstrates how to enable THP for the current task
> >> while keeping it disabled for others.
> >>
> >> Alternative Proposals
> >> ---------------------
> >>
> >> - Gutierrez’s cgroup-based approach [1]
> >> - Proposed adding a new cgroup file to control THP policy.
> >> - However, as Johannes noted, cgroups are designed for hierarchical
> >> resource allocation, not arbitrary policy settings [2].
> >>
> >> - Usama’s per-task THP proposal based on prctl() [3]:
> >> - Enabling THP per task via prctl().
> >> - As David pointed out, neither madvise() nor prctl() works in "never"
> >> mode [4], making this solution insufficient for our needs.
> >>
> >> Conclusion
> >> ----------
> >>
> >> Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
> >> most effective solution for our requirements. This approach represents a
> >> small but meaningful step toward making THP truly usable—and manageable—in
> >> production environments.
> >>
> >> This is currently a PoC implementation. Feedback of any kind is welcome.
> >>
> >> Link: https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/ [0]
> >> Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
> >> Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
> >> Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]
> >> Link: https://lore.kernel.org/linux-mm/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/ [4]
> >>
> >> RFC v1->v2:
> >> The main changes are as follows,
> >> - Use struct_ops instead of fmod_ret (Alexei)
> >> - Introduce a new THP mode (Johannes)
> >> - Introduce new helpers for BPF hook (Zi)
> >> - Refine the commit log
> >>
> >> RFC v1: https://lwn.net/Articles/1019290/
> >>
> >> Yafang Shao (5):
> >> mm: thp: Add a new mode "bpf"
> >> mm: thp: Add hook for BPF based THP adjustment
> >> mm: thp: add struct ops for BPF based THP adjustment
> >> bpf: Add get_current_comm to bpf_base_func_proto
> >> selftests/bpf: Add selftest for THP adjustment
> >>
> >> include/linux/huge_mm.h | 15 +-
> >> kernel/bpf/cgroup.c | 2 -
> >> kernel/bpf/helpers.c | 2 +
> >> mm/Makefile | 3 +
> >> mm/bpf_thp.c | 120 ++++++++++++
> >> mm/huge_memory.c | 65 ++++++-
> >> mm/khugepaged.c | 3 +
> >> tools/testing/selftests/bpf/config | 1 +
> >> .../selftests/bpf/prog_tests/thp_adjust.c | 175 ++++++++++++++++++
> >> .../selftests/bpf/progs/test_thp_adjust.c | 39 ++++
> >> 10 files changed, 414 insertions(+), 11 deletions(-)
> >> create mode 100644 mm/bpf_thp.c
> >> create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
> >> create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
> >>
> >> --
> >> 2.43.5
> >>
> >
> > Hi all,
> >
> > Let’s summarize the current state of the discussion and identify how
> > to move forward.
> >
> > - Global-Only Control is Not Viable
> > We all seem to agree that a global-only control for THP is unwise. In
> > practice, some workloads benefit from THP while others do not, so a
> > one-size-fits-all approach doesn’t work.
> >
> > - Should We Use "Always" or "Madvise"?
> > I suspect no one would choose 'always' in its current state. ;)
> > Both Lorenzo and David propose relying on the madvise mode. However,
> > since madvise is an unprivileged userspace mechanism, any user can
> > freely adjust their THP policy. This makes fine-grained control
> > impossible without breaking userspace compatibility—an undesirable
> > tradeoff.
> > Given these limitations, the community should consider introducing a
> > new "admin" mode for privileged THP policy management.
> >
> > - Can the Kernel Automatically Manage THP Without User Input?
> > In practice, users define their own success metrics—such as latency
> > (RT), queries per second (QPS), or throughput—to evaluate a feature’s
> > usefulness. If a feature fails to improve these metrics, it provides
> > no practical value.
> > Currently, the kernel lacks visibility into user-defined metrics,
> > making fully automated optimization impossible (at least without user
> > input). More importantly, automatic management offers no benefit if it
> > doesn’t align with user needs.
>
> I don't think that using things like RPS or QPS is the right way.
> These metrics can be affected by many factors like network issues,
> garbage collectors in the user space (JVM, golang, etc.) and many other
> things out of our control. Even noisy neighbors can slow down a service.
This is an example of how to measure whether apps can benefit from a
new feature.
Please review the A/B test details here:
https://en.wikipedia.org/wiki/A/B_testing
--
Regards
Yafang
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-26 9:37 ` Yafang Shao
@ 2025-05-26 10:49 ` David Hildenbrand
2025-05-26 14:53 ` Liam R. Howlett
2025-05-27 5:46 ` Yafang Shao
0 siblings, 2 replies; 52+ messages in thread
From: David Hildenbrand @ 2025-05-26 10:49 UTC (permalink / raw)
To: Yafang Shao
Cc: akpm, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On 26.05.25 11:37, Yafang Shao wrote:
> On Mon, May 26, 2025 at 4:14 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>> Hi all,
>>>
>>> Let’s summarize the current state of the discussion and identify how
>>> to move forward.
>>>
>>> - Global-Only Control is Not Viable
>>> We all seem to agree that a global-only control for THP is unwise. In
>>> practice, some workloads benefit from THP while others do not, so a
>>> one-size-fits-all approach doesn’t work.
>>>
>>> - Should We Use "Always" or "Madvise"?
>>> I suspect no one would choose 'always' in its current state. ;)
>>
>> IIRC, RHEL9 has the default set to "always" for a long time.
>
> good to know.
>
>>
>> I guess it really depends on how different the workloads are that you
>> are running on the same machine.
>
> Correct. If we want to enable THP for specific workloads without
> modifying the kernel, we must isolate them on dedicated servers.
> However, this approach wastes resources and is not an acceptable
> solution.
>
>>
>> > Both Lorenzo and David propose relying on the madvise mode. However,>
>> since madvise is an unprivileged userspace mechanism, any user can
>>> freely adjust their THP policy. This makes fine-grained control
>>> impossible without breaking userspace compatibility—an undesirable
>>> tradeoff.
>>
>> If required, we could look into a "sealing" mechanism, that would
>> essentially lock modification attempts performed by the process (i.e.,
>> MADV_HUGEPAGE).
>
> If we don’t introduce a new THP mode and instead rely solely on
> madvise, the "sealing" mechanism could either violate the intended
> semantics of madvise(), or simply break madvise() entirely, right?
We would have to be a bit careful, yes.
Errors from MADV_HUGEPAGE/MADV_NOHUGEPAGE are often ignored, because
these options also fail with -EINVAL on kernels without THP support.
Ignoring MADV_NOHUGEPAGE can be problematic with userfaultfd.
What you likely really want to do is seal when you configured
MADV_NOHUGEPAGE to be the default, and fail MADV_HUGEPAGE later.
>>
>> The could be added on top of the current proposals that are flying
>> around, and could be done e.g., per-process.
>
> How about introducing a dedicated "process" mode? This would allow
> each process to use different THP modes—some in "always," others in
> "madvise," and the rest in "never." Future THP modes could also be
> added to this framework.
We have to be really careful about not creating even more mess with more
modes.
How would that design look like in detail (how would we set it per
process etc?)?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-25 3:01 ` Yafang Shao
2025-05-26 7:41 ` Gutierrez Asier
2025-05-26 8:14 ` David Hildenbrand
@ 2025-05-26 14:32 ` Zi Yan
2025-05-27 5:53 ` Yafang Shao
2 siblings, 1 reply; 52+ messages in thread
From: Zi Yan @ 2025-05-26 14:32 UTC (permalink / raw)
To: Yafang Shao
Cc: akpm, david, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On 24 May 2025, at 23:01, Yafang Shao wrote:
> On Tue, May 20, 2025 at 2:05 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>
>> Background
>> ----------
>>
>> At my current employer, PDD, we have consistently configured THP to "never"
>> on our production servers due to past incidents caused by its behavior:
>>
>> - Increased memory consumption
>> THP significantly raises overall memory usage.
>>
>> - Latency spikes
>> Random latency spikes occur due to more frequent memory compaction
>> activity triggered by THP.
>>
>> These issues have made sysadmins hesitant to switch to "madvise" or
>> "always" modes.
>>
>> New Motivation
>> --------------
>>
>> We have now identified that certain AI workloads achieve substantial
>> performance gains with THP enabled. However, we’ve also verified that some
>> workloads see little to no benefit—or are even negatively impacted—by THP.
>>
>> In our Kubernetes environment, we deploy mixed workloads on a single server
>> to maximize resource utilization. Our goal is to selectively enable THP for
>> services that benefit from it while keeping it disabled for others. This
>> approach allows us to incrementally enable THP for additional services and
>> assess how to make it more viable in production.
>>
>> Proposed Solution
>> -----------------
>>
>> For this use case, Johannes suggested introducing a dedicated mode [0]. In
>> this new mode, we could implement BPF-based THP adjustment for fine-grained
>> control over tasks or cgroups. If no BPF program is attached, THP remains
>> in "never" mode. This solution elegantly meets our needs while avoiding the
>> complexity of managing BPF alongside other THP modes.
>>
>> A selftest example demonstrates how to enable THP for the current task
>> while keeping it disabled for others.
>>
>> Alternative Proposals
>> ---------------------
>>
>> - Gutierrez’s cgroup-based approach [1]
>> - Proposed adding a new cgroup file to control THP policy.
>> - However, as Johannes noted, cgroups are designed for hierarchical
>> resource allocation, not arbitrary policy settings [2].
>>
>> - Usama’s per-task THP proposal based on prctl() [3]:
>> - Enabling THP per task via prctl().
>> - As David pointed out, neither madvise() nor prctl() works in "never"
>> mode [4], making this solution insufficient for our needs.
>>
>> Conclusion
>> ----------
>>
>> Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
>> most effective solution for our requirements. This approach represents a
>> small but meaningful step toward making THP truly usable—and manageable—in
>> production environments.
>>
>> This is currently a PoC implementation. Feedback of any kind is welcome.
>>
>> Link: https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/ [0]
>> Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
>> Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
>> Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]
>> Link: https://lore.kernel.org/linux-mm/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/ [4]
>>
>> RFC v1->v2:
>> The main changes are as follows,
>> - Use struct_ops instead of fmod_ret (Alexei)
>> - Introduce a new THP mode (Johannes)
>> - Introduce new helpers for BPF hook (Zi)
>> - Refine the commit log
>>
>> RFC v1: https://lwn.net/Articles/1019290/
>>
>> Yafang Shao (5):
>> mm: thp: Add a new mode "bpf"
>> mm: thp: Add hook for BPF based THP adjustment
>> mm: thp: add struct ops for BPF based THP adjustment
>> bpf: Add get_current_comm to bpf_base_func_proto
>> selftests/bpf: Add selftest for THP adjustment
>>
>> include/linux/huge_mm.h | 15 +-
>> kernel/bpf/cgroup.c | 2 -
>> kernel/bpf/helpers.c | 2 +
>> mm/Makefile | 3 +
>> mm/bpf_thp.c | 120 ++++++++++++
>> mm/huge_memory.c | 65 ++++++-
>> mm/khugepaged.c | 3 +
>> tools/testing/selftests/bpf/config | 1 +
>> .../selftests/bpf/prog_tests/thp_adjust.c | 175 ++++++++++++++++++
>> .../selftests/bpf/progs/test_thp_adjust.c | 39 ++++
>> 10 files changed, 414 insertions(+), 11 deletions(-)
>> create mode 100644 mm/bpf_thp.c
>> create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
>> create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
>>
>> --
>> 2.43.5
>>
>
> Hi all,
>
> Let’s summarize the current state of the discussion and identify how
> to move forward.
>
> - Global-Only Control is Not Viable
> We all seem to agree that a global-only control for THP is unwise. In
> practice, some workloads benefit from THP while others do not, so a
> one-size-fits-all approach doesn’t work.
>
> - Should We Use "Always" or "Madvise"?
> I suspect no one would choose 'always' in its current state. ;)
> Both Lorenzo and David propose relying on the madvise mode. However,
> since madvise is an unprivileged userspace mechanism, any user can
> freely adjust their THP policy. This makes fine-grained control
> impossible without breaking userspace compatibility—an undesirable
> tradeoff.
> Given these limitations, the community should consider introducing a
> new "admin" mode for privileged THP policy management.
>
I agree with the above two points.
> - Can the Kernel Automatically Manage THP Without User Input?
> In practice, users define their own success metrics—such as latency
> (RT), queries per second (QPS), or throughput—to evaluate a feature’s
> usefulness. If a feature fails to improve these metrics, it provides
> no practical value.
> Currently, the kernel lacks visibility into user-defined metrics,
> making fully automated optimization impossible (at least without user
> input). More importantly, automatic management offers no benefit if it
> doesn’t align with user needs.
Yes, kernel is basically guessing what userspace wants with some hints
like MADV_HUGEPAGE/MADV_NOHUGEPAGE. But kernel has the global view
of memory fragmentation, which userspace cannot get easily. I wonder
if it is possible that userspace tuning might benefit one set of
applications but hurt others or overall performance. Right now,
THP tuning is 0 or 1, either an application wants THPs or not.
We might need a way of ranking THP requests from userspace to
let kernel prioritize them (I am not sure if we can add another
user input parameter, like THP_nice, to get this done, since
apparently everyone will set THP_nice to -100 to get themselves
at the top of the list).
> Exception: For kernel-enforced changes (e.g., the page-to-folio
> transition), users must adapt regardless. But THP tuning requires
> flexibility—forcing automation without measurable gains is
> counterproductive.
> (Please correct me if I’ve overlooked anything.)
>
> --
> Regards
> Yafang
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-26 10:49 ` David Hildenbrand
@ 2025-05-26 14:53 ` Liam R. Howlett
2025-05-26 15:54 ` Liam R. Howlett
2025-05-27 5:46 ` Yafang Shao
1 sibling, 1 reply; 52+ messages in thread
From: Liam R. Howlett @ 2025-05-26 14:53 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yafang Shao, akpm, ziy, baolin.wang, lorenzo.stoakes, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
* David Hildenbrand <david@redhat.com> [250526 06:49]:
> On 26.05.25 11:37, Yafang Shao wrote:
> > On Mon, May 26, 2025 at 4:14 PM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Let’s summarize the current state of the discussion and identify how
> > > > to move forward.
> > > >
> > > > - Global-Only Control is Not Viable
> > > > We all seem to agree that a global-only control for THP is unwise. In
> > > > practice, some workloads benefit from THP while others do not, so a
> > > > one-size-fits-all approach doesn’t work.
> > > >
> > > > - Should We Use "Always" or "Madvise"?
> > > > I suspect no one would choose 'always' in its current state. ;)
> > >
> > > IIRC, RHEL9 has the default set to "always" for a long time.
> >
> > good to know.
> >
> > >
> > > I guess it really depends on how different the workloads are that you
> > > are running on the same machine.
> >
> > Correct. If we want to enable THP for specific workloads without
> > modifying the kernel, we must isolate them on dedicated servers.
> > However, this approach wastes resources and is not an acceptable
> > solution.
> >
> > >
> > > > Both Lorenzo and David propose relying on the madvise mode. However,>
> > > since madvise is an unprivileged userspace mechanism, any user can
> > > > freely adjust their THP policy. This makes fine-grained control
> > > > impossible without breaking userspace compatibility—an undesirable
> > > > tradeoff.
> > >
> > > If required, we could look into a "sealing" mechanism, that would
> > > essentially lock modification attempts performed by the process (i.e.,
> > > MADV_HUGEPAGE).
> >
> > If we don’t introduce a new THP mode and instead rely solely on
> > madvise, the "sealing" mechanism could either violate the intended
> > semantics of madvise(), or simply break madvise() entirely, right?
>
> We would have to be a bit careful, yes.
>
> Errors from MADV_HUGEPAGE/MADV_NOHUGEPAGE are often ignored, because these
> options also fail with -EINVAL on kernels without THP support.
>
> Ignoring MADV_NOHUGEPAGE can be problematic with userfaultfd.
>
> What you likely really want to do is seal when you configured
> MADV_NOHUGEPAGE to be the default, and fail MADV_HUGEPAGE later.
I think this works. Take the example from a previous thread where
containers are differentiated by allowing or not allowing THP. If you
set a container MADV_HOHUGEPAGE (or whatever flag we used for the same
meaning), then if a library uses that call and it fails do we want to
report it as a failure? I would reason that the library shouldn't hard
fail if its unable to use THP, so it's okay to return the failure.
Alternatively, if it is a hard requirement, then that container
shouldn't be allowed to continue in such a state and should verify the
return. (If this is even a possibility?)
>
> > >
> > > The could be added on top of the current proposals that are flying
> > > around, and could be done e.g., per-process.
> >
> > How about introducing a dedicated "process" mode? This would allow
> > each process to use different THP modes—some in "always," others in
> > "madvise," and the rest in "never." Future THP modes could also be
> > added to this framework.
>
> We have to be really careful about not creating even more mess with more
> modes.
Yes, and clarity would depend on the mode name, imo. Never meaning
never, for example.
So we'd need an answer to David's question below before agreeing on
"process". If it survives across fork and exec calls, is it really a
"process" setting?
I believe you are seeing it as "setting default" really doesn't mean
setting a default if you cannot overwrite it, and if you can overwrite
the "default" then it's not going to work for all use cases.
>
> How would that design look like in detail (how would we set it per process
> etc?)?
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-26 14:53 ` Liam R. Howlett
@ 2025-05-26 15:54 ` Liam R. Howlett
2025-05-26 16:51 ` David Hildenbrand
0 siblings, 1 reply; 52+ messages in thread
From: Liam R. Howlett @ 2025-05-26 15:54 UTC (permalink / raw)
To: David Hildenbrand, Yafang Shao, akpm, ziy, baolin.wang,
lorenzo.stoakes, npache, ryan.roberts, dev.jain, hannes,
usamaarif642, gutierrez.asier, willy, ast, daniel, andrii, bpf,
linux-mm
* Liam R. Howlett <Liam.Howlett@oracle.com> [250526 10:54]:
> * David Hildenbrand <david@redhat.com> [250526 06:49]:
> > On 26.05.25 11:37, Yafang Shao wrote:
> > > On Mon, May 26, 2025 at 4:14 PM David Hildenbrand <david@redhat.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Let’s summarize the current state of the discussion and identify how
> > > > > to move forward.
> > > > >
> > > > > - Global-Only Control is Not Viable
> > > > > We all seem to agree that a global-only control for THP is unwise. In
> > > > > practice, some workloads benefit from THP while others do not, so a
> > > > > one-size-fits-all approach doesn’t work.
> > > > >
> > > > > - Should We Use "Always" or "Madvise"?
> > > > > I suspect no one would choose 'always' in its current state. ;)
> > > >
> > > > IIRC, RHEL9 has the default set to "always" for a long time.
> > >
> > > good to know.
> > >
> > > >
> > > > I guess it really depends on how different the workloads are that you
> > > > are running on the same machine.
> > >
> > > Correct. If we want to enable THP for specific workloads without
> > > modifying the kernel, we must isolate them on dedicated servers.
> > > However, this approach wastes resources and is not an acceptable
> > > solution.
> > >
> > > >
> > > > > Both Lorenzo and David propose relying on the madvise mode. However,>
> > > > since madvise is an unprivileged userspace mechanism, any user can
> > > > > freely adjust their THP policy. This makes fine-grained control
> > > > > impossible without breaking userspace compatibility—an undesirable
> > > > > tradeoff.
> > > >
> > > > If required, we could look into a "sealing" mechanism, that would
> > > > essentially lock modification attempts performed by the process (i.e.,
> > > > MADV_HUGEPAGE).
> > >
> > > If we don’t introduce a new THP mode and instead rely solely on
> > > madvise, the "sealing" mechanism could either violate the intended
> > > semantics of madvise(), or simply break madvise() entirely, right?
> >
> > We would have to be a bit careful, yes.
> >
> > Errors from MADV_HUGEPAGE/MADV_NOHUGEPAGE are often ignored, because these
> > options also fail with -EINVAL on kernels without THP support.
> >
> > Ignoring MADV_NOHUGEPAGE can be problematic with userfaultfd.
> >
> > What you likely really want to do is seal when you configured
> > MADV_NOHUGEPAGE to be the default, and fail MADV_HUGEPAGE later.
I am also not entirely sure how sealing a non-existing vma would work.
We'd have to seal the default flags, but sealing is one way and this
surely shouldn't be one way?
>
> I think this works. Take the example from a previous thread where
> containers are differentiated by allowing or not allowing THP. If you
> set a container MADV_HOHUGEPAGE (or whatever flag we used for the same
> meaning), then if a library uses that call and it fails do we want to
> report it as a failure? I would reason that the library shouldn't hard
> fail if its unable to use THP, so it's okay to return the failure.
>
> Alternatively, if it is a hard requirement, then that container
> shouldn't be allowed to continue in such a state and should verify the
> return. (If this is even a possibility?)
>
> >
> > > >
> > > > The could be added on top of the current proposals that are flying
> > > > around, and could be done e.g., per-process.
> > >
> > > How about introducing a dedicated "process" mode? This would allow
> > > each process to use different THP modes—some in "always," others in
> > > "madvise," and the rest in "never." Future THP modes could also be
> > > added to this framework.
> >
> > We have to be really careful about not creating even more mess with more
> > modes.
>
> Yes, and clarity would depend on the mode name, imo. Never meaning
> never, for example.
>
> So we'd need an answer to David's question below before agreeing on
> "process". If it survives across fork and exec calls, is it really a
> "process" setting?
>
> I believe you are seeing it as "setting default" really doesn't mean
> setting a default if you cannot overwrite it, and if you can overwrite
> the "default" then it's not going to work for all use cases.
>
> >
> > How would that design look like in detail (how would we set it per process
> > etc?)?
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-26 15:54 ` Liam R. Howlett
@ 2025-05-26 16:51 ` David Hildenbrand
2025-05-26 17:07 ` Liam R. Howlett
2025-05-26 20:30 ` Gutierrez Asier
0 siblings, 2 replies; 52+ messages in thread
From: David Hildenbrand @ 2025-05-26 16:51 UTC (permalink / raw)
To: Liam R. Howlett, Yafang Shao, akpm, ziy, baolin.wang,
lorenzo.stoakes, npache, ryan.roberts, dev.jain, hannes,
usamaarif642, gutierrez.asier, willy, ast, daniel, andrii, bpf,
linux-mm
On 26.05.25 17:54, Liam R. Howlett wrote:
> * Liam R. Howlett <Liam.Howlett@oracle.com> [250526 10:54]:
>> * David Hildenbrand <david@redhat.com> [250526 06:49]:
>>> On 26.05.25 11:37, Yafang Shao wrote:
>>>> On Mon, May 26, 2025 at 4:14 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Let’s summarize the current state of the discussion and identify how
>>>>>> to move forward.
>>>>>>
>>>>>> - Global-Only Control is Not Viable
>>>>>> We all seem to agree that a global-only control for THP is unwise. In
>>>>>> practice, some workloads benefit from THP while others do not, so a
>>>>>> one-size-fits-all approach doesn’t work.
>>>>>>
>>>>>> - Should We Use "Always" or "Madvise"?
>>>>>> I suspect no one would choose 'always' in its current state. ;)
>>>>>
>>>>> IIRC, RHEL9 has the default set to "always" for a long time.
>>>>
>>>> good to know.
>>>>
>>>>>
>>>>> I guess it really depends on how different the workloads are that you
>>>>> are running on the same machine.
>>>>
>>>> Correct. If we want to enable THP for specific workloads without
>>>> modifying the kernel, we must isolate them on dedicated servers.
>>>> However, this approach wastes resources and is not an acceptable
>>>> solution.
>>>>
>>>>>
>>>>> > Both Lorenzo and David propose relying on the madvise mode. However,>
>>>>> since madvise is an unprivileged userspace mechanism, any user can
>>>>>> freely adjust their THP policy. This makes fine-grained control
>>>>>> impossible without breaking userspace compatibility—an undesirable
>>>>>> tradeoff.
>>>>>
>>>>> If required, we could look into a "sealing" mechanism, that would
>>>>> essentially lock modification attempts performed by the process (i.e.,
>>>>> MADV_HUGEPAGE).
>>>>
>>>> If we don’t introduce a new THP mode and instead rely solely on
>>>> madvise, the "sealing" mechanism could either violate the intended
>>>> semantics of madvise(), or simply break madvise() entirely, right?
>>>
>>> We would have to be a bit careful, yes.
>>>
>>> Errors from MADV_HUGEPAGE/MADV_NOHUGEPAGE are often ignored, because these
>>> options also fail with -EINVAL on kernels without THP support.
>>>
>>> Ignoring MADV_NOHUGEPAGE can be problematic with userfaultfd.
>>>
>>> What you likely really want to do is seal when you configured
>>> MADV_NOHUGEPAGE to be the default, and fail MADV_HUGEPAGE later.
>
> I am also not entirely sure how sealing a non-existing vma would work.
> We'd have to seal the default flags, but sealing is one way and this
> surely shouldn't be one way?
You probably have mseal() in mind. Just like we wouldn't be using
madvise(), we also wouldn't be using mseal().
It could be a simple mctrl()/whatever option/flag to set the default and
no longer allow changing the default and per-VMA flags, unless
CAP_SYS_ADMIN or sth like that.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-26 16:51 ` David Hildenbrand
@ 2025-05-26 17:07 ` Liam R. Howlett
2025-05-26 17:12 ` David Hildenbrand
2025-05-26 20:30 ` Gutierrez Asier
1 sibling, 1 reply; 52+ messages in thread
From: Liam R. Howlett @ 2025-05-26 17:07 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yafang Shao, akpm, ziy, baolin.wang, lorenzo.stoakes, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
* David Hildenbrand <david@redhat.com> [250526 12:51]:
> On 26.05.25 17:54, Liam R. Howlett wrote:
> > * Liam R. Howlett <Liam.Howlett@oracle.com> [250526 10:54]:
> > > * David Hildenbrand <david@redhat.com> [250526 06:49]:
> > > > On 26.05.25 11:37, Yafang Shao wrote:
> > > > > On Mon, May 26, 2025 at 4:14 PM David Hildenbrand <david@redhat.com> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Let’s summarize the current state of the discussion and identify how
> > > > > > > to move forward.
> > > > > > >
> > > > > > > - Global-Only Control is Not Viable
> > > > > > > We all seem to agree that a global-only control for THP is unwise. In
> > > > > > > practice, some workloads benefit from THP while others do not, so a
> > > > > > > one-size-fits-all approach doesn’t work.
> > > > > > >
> > > > > > > - Should We Use "Always" or "Madvise"?
> > > > > > > I suspect no one would choose 'always' in its current state. ;)
> > > > > >
> > > > > > IIRC, RHEL9 has the default set to "always" for a long time.
> > > > >
> > > > > good to know.
> > > > >
> > > > > >
> > > > > > I guess it really depends on how different the workloads are that you
> > > > > > are running on the same machine.
> > > > >
> > > > > Correct. If we want to enable THP for specific workloads without
> > > > > modifying the kernel, we must isolate them on dedicated servers.
> > > > > However, this approach wastes resources and is not an acceptable
> > > > > solution.
> > > > >
> > > > > >
> > > > > > > Both Lorenzo and David propose relying on the madvise mode. However,>
> > > > > > since madvise is an unprivileged userspace mechanism, any user can
> > > > > > > freely adjust their THP policy. This makes fine-grained control
> > > > > > > impossible without breaking userspace compatibility—an undesirable
> > > > > > > tradeoff.
> > > > > >
> > > > > > If required, we could look into a "sealing" mechanism, that would
> > > > > > essentially lock modification attempts performed by the process (i.e.,
> > > > > > MADV_HUGEPAGE).
> > > > >
> > > > > If we don’t introduce a new THP mode and instead rely solely on
> > > > > madvise, the "sealing" mechanism could either violate the intended
> > > > > semantics of madvise(), or simply break madvise() entirely, right?
> > > >
> > > > We would have to be a bit careful, yes.
> > > >
> > > > Errors from MADV_HUGEPAGE/MADV_NOHUGEPAGE are often ignored, because these
> > > > options also fail with -EINVAL on kernels without THP support.
> > > >
> > > > Ignoring MADV_NOHUGEPAGE can be problematic with userfaultfd.
> > > >
> > > > What you likely really want to do is seal when you configured
> > > > MADV_NOHUGEPAGE to be the default, and fail MADV_HUGEPAGE later.
> >
> > I am also not entirely sure how sealing a non-existing vma would work.
> > We'd have to seal the default flags, but sealing is one way and this
> > surely shouldn't be one way?
>
> You probably have mseal() in mind. Just like we wouldn't be using
> madvise(), we also wouldn't be using mseal().
yes, I do - but mostly in terms of the language used and not the code as
that can't be used here.
Do we use the term seal somewhere that allows undoing the sealed thing?
I'm _really_ hoping we don't, but am almost sure you're going to say
we do.
>
> It could be a simple mctrl()/whatever option/flag to set the default and no
> longer allow changing the default and per-VMA flags, unless CAP_SYS_ADMIN or
> sth like that.
Ah... okay, as long as we have testing I guess. I can see that getting
confusing.
Cheers,
Liam
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-26 17:07 ` Liam R. Howlett
@ 2025-05-26 17:12 ` David Hildenbrand
0 siblings, 0 replies; 52+ messages in thread
From: David Hildenbrand @ 2025-05-26 17:12 UTC (permalink / raw)
To: Liam R. Howlett, Yafang Shao, akpm, ziy, baolin.wang,
lorenzo.stoakes, npache, ryan.roberts, dev.jain, hannes,
usamaarif642, gutierrez.asier, willy, ast, daniel, andrii, bpf,
linux-mm
On 26.05.25 19:07, Liam R. Howlett wrote:
> * David Hildenbrand <david@redhat.com> [250526 12:51]:
>> On 26.05.25 17:54, Liam R. Howlett wrote:
>>> * Liam R. Howlett <Liam.Howlett@oracle.com> [250526 10:54]:
>>>> * David Hildenbrand <david@redhat.com> [250526 06:49]:
>>>>> On 26.05.25 11:37, Yafang Shao wrote:
>>>>>> On Mon, May 26, 2025 at 4:14 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Let’s summarize the current state of the discussion and identify how
>>>>>>>> to move forward.
>>>>>>>>
>>>>>>>> - Global-Only Control is Not Viable
>>>>>>>> We all seem to agree that a global-only control for THP is unwise. In
>>>>>>>> practice, some workloads benefit from THP while others do not, so a
>>>>>>>> one-size-fits-all approach doesn’t work.
>>>>>>>>
>>>>>>>> - Should We Use "Always" or "Madvise"?
>>>>>>>> I suspect no one would choose 'always' in its current state. ;)
>>>>>>>
>>>>>>> IIRC, RHEL9 has the default set to "always" for a long time.
>>>>>>
>>>>>> good to know.
>>>>>>
>>>>>>>
>>>>>>> I guess it really depends on how different the workloads are that you
>>>>>>> are running on the same machine.
>>>>>>
>>>>>> Correct. If we want to enable THP for specific workloads without
>>>>>> modifying the kernel, we must isolate them on dedicated servers.
>>>>>> However, this approach wastes resources and is not an acceptable
>>>>>> solution.
>>>>>>
>>>>>>>
>>>>>>> > Both Lorenzo and David propose relying on the madvise mode. However,>
>>>>>>> since madvise is an unprivileged userspace mechanism, any user can
>>>>>>>> freely adjust their THP policy. This makes fine-grained control
>>>>>>>> impossible without breaking userspace compatibility—an undesirable
>>>>>>>> tradeoff.
>>>>>>>
>>>>>>> If required, we could look into a "sealing" mechanism, that would
>>>>>>> essentially lock modification attempts performed by the process (i.e.,
>>>>>>> MADV_HUGEPAGE).
>>>>>>
>>>>>> If we don’t introduce a new THP mode and instead rely solely on
>>>>>> madvise, the "sealing" mechanism could either violate the intended
>>>>>> semantics of madvise(), or simply break madvise() entirely, right?
>>>>>
>>>>> We would have to be a bit careful, yes.
>>>>>
>>>>> Errors from MADV_HUGEPAGE/MADV_NOHUGEPAGE are often ignored, because these
>>>>> options also fail with -EINVAL on kernels without THP support.
>>>>>
>>>>> Ignoring MADV_NOHUGEPAGE can be problematic with userfaultfd.
>>>>>
>>>>> What you likely really want to do is seal when you configured
>>>>> MADV_NOHUGEPAGE to be the default, and fail MADV_HUGEPAGE later.
>>>
>>> I am also not entirely sure how sealing a non-existing vma would work.
>>> We'd have to seal the default flags, but sealing is one way and this
>>> surely shouldn't be one way?
>>
>> You probably have mseal() in mind. Just like we wouldn't be using
>> madvise(), we also wouldn't be using mseal().
>
> yes, I do - but mostly in terms of the language used and not the code as
> that can't be used here.
>
> Do we use the term seal somewhere that allows undoing the sealed thing?
> I'm _really_ hoping we don't, but am almost sure you're going to say
> we do.
The other place that comes to mind is memfd_create() with
MFD_ALLOW_SEALING and fcntl() with F_ADD_SEALS.
And yes, there is no "F_DEL_SEALS" :)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-26 16:51 ` David Hildenbrand
2025-05-26 17:07 ` Liam R. Howlett
@ 2025-05-26 20:30 ` Gutierrez Asier
2025-05-26 20:37 ` David Hildenbrand
1 sibling, 1 reply; 52+ messages in thread
From: Gutierrez Asier @ 2025-05-26 20:30 UTC (permalink / raw)
To: David Hildenbrand, Liam R. Howlett, Yafang Shao, akpm, ziy,
baolin.wang, lorenzo.stoakes, npache, ryan.roberts, dev.jain,
hannes, usamaarif642, willy, ast, daniel, andrii, bpf, linux-mm
On 5/26/2025 7:51 PM, David Hildenbrand wrote:
> On 26.05.25 17:54, Liam R. Howlett wrote:
>> * Liam R. Howlett <Liam.Howlett@oracle.com> [250526 10:54]:
>>> * David Hildenbrand <david@redhat.com> [250526 06:49]:
>>>> On 26.05.25 11:37, Yafang Shao wrote:
>>>>> On Mon, May 26, 2025 at 4:14 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Let’s summarize the current state of the discussion and identify how
>>>>>>> to move forward.
>>>>>>>
>>>>>>> - Global-Only Control is Not Viable
>>>>>>> We all seem to agree that a global-only control for THP is unwise. In
>>>>>>> practice, some workloads benefit from THP while others do not, so a
>>>>>>> one-size-fits-all approach doesn’t work.
>>>>>>>
>>>>>>> - Should We Use "Always" or "Madvise"?
>>>>>>> I suspect no one would choose 'always' in its current state. ;)
>>>>>>
>>>>>> IIRC, RHEL9 has the default set to "always" for a long time.
>>>>>
>>>>> good to know.
>>>>>
>>>>>>
>>>>>> I guess it really depends on how different the workloads are that you
>>>>>> are running on the same machine.
>>>>>
>>>>> Correct. If we want to enable THP for specific workloads without
>>>>> modifying the kernel, we must isolate them on dedicated servers.
>>>>> However, this approach wastes resources and is not an acceptable
>>>>> solution.
>>>>>
>>>>>>
>>>>>> > Both Lorenzo and David propose relying on the madvise mode. However,>
>>>>>> since madvise is an unprivileged userspace mechanism, any user can
>>>>>>> freely adjust their THP policy. This makes fine-grained control
>>>>>>> impossible without breaking userspace compatibility—an undesirable
>>>>>>> tradeoff.
>>>>>>
>>>>>> If required, we could look into a "sealing" mechanism, that would
>>>>>> essentially lock modification attempts performed by the process (i.e.,
>>>>>> MADV_HUGEPAGE).
>>>>>
>>>>> If we don’t introduce a new THP mode and instead rely solely on
>>>>> madvise, the "sealing" mechanism could either violate the intended
>>>>> semantics of madvise(), or simply break madvise() entirely, right?
>>>>
>>>> We would have to be a bit careful, yes.
>>>>
>>>> Errors from MADV_HUGEPAGE/MADV_NOHUGEPAGE are often ignored, because these
>>>> options also fail with -EINVAL on kernels without THP support.
>>>>
>>>> Ignoring MADV_NOHUGEPAGE can be problematic with userfaultfd.
>>>>
>>>> What you likely really want to do is seal when you configured
>>>> MADV_NOHUGEPAGE to be the default, and fail MADV_HUGEPAGE later.
>>
>> I am also not entirely sure how sealing a non-existing vma would work.
>> We'd have to seal the default flags, but sealing is one way and this
>> surely shouldn't be one way?
>
> You probably have mseal() in mind. Just like we wouldn't be using madvise(), we also wouldn't be using mseal().
>
> It could be a simple mctrl()/whatever option/flag to set the default and no longer allow changing the default and per-VMA flags, unless CAP_SYS_ADMIN or sth like that.
>
This isn't really TRANSPARENT Huge Pages, since we will require
the application to determine which memory range will be mapped with
huge pages.
--
Asier Gutierrez
Huawei
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-26 20:30 ` Gutierrez Asier
@ 2025-05-26 20:37 ` David Hildenbrand
0 siblings, 0 replies; 52+ messages in thread
From: David Hildenbrand @ 2025-05-26 20:37 UTC (permalink / raw)
To: Gutierrez Asier, Liam R. Howlett, Yafang Shao, akpm, ziy,
baolin.wang, lorenzo.stoakes, npache, ryan.roberts, dev.jain,
hannes, usamaarif642, willy, ast, daniel, andrii, bpf, linux-mm
On 26.05.25 22:30, Gutierrez Asier wrote:
>
>
> On 5/26/2025 7:51 PM, David Hildenbrand wrote:
>> On 26.05.25 17:54, Liam R. Howlett wrote:
>>> * Liam R. Howlett <Liam.Howlett@oracle.com> [250526 10:54]:
>>>> * David Hildenbrand <david@redhat.com> [250526 06:49]:
>>>>> On 26.05.25 11:37, Yafang Shao wrote:
>>>>>> On Mon, May 26, 2025 at 4:14 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Let’s summarize the current state of the discussion and identify how
>>>>>>>> to move forward.
>>>>>>>>
>>>>>>>> - Global-Only Control is Not Viable
>>>>>>>> We all seem to agree that a global-only control for THP is unwise. In
>>>>>>>> practice, some workloads benefit from THP while others do not, so a
>>>>>>>> one-size-fits-all approach doesn’t work.
>>>>>>>>
>>>>>>>> - Should We Use "Always" or "Madvise"?
>>>>>>>> I suspect no one would choose 'always' in its current state. ;)
>>>>>>>
>>>>>>> IIRC, RHEL9 has the default set to "always" for a long time.
>>>>>>
>>>>>> good to know.
>>>>>>
>>>>>>>
>>>>>>> I guess it really depends on how different the workloads are that you
>>>>>>> are running on the same machine.
>>>>>>
>>>>>> Correct. If we want to enable THP for specific workloads without
>>>>>> modifying the kernel, we must isolate them on dedicated servers.
>>>>>> However, this approach wastes resources and is not an acceptable
>>>>>> solution.
>>>>>>
>>>>>>>
>>>>>>> > Both Lorenzo and David propose relying on the madvise mode. However,>
>>>>>>> since madvise is an unprivileged userspace mechanism, any user can
>>>>>>>> freely adjust their THP policy. This makes fine-grained control
>>>>>>>> impossible without breaking userspace compatibility—an undesirable
>>>>>>>> tradeoff.
>>>>>>>
>>>>>>> If required, we could look into a "sealing" mechanism, that would
>>>>>>> essentially lock modification attempts performed by the process (i.e.,
>>>>>>> MADV_HUGEPAGE).
>>>>>>
>>>>>> If we don’t introduce a new THP mode and instead rely solely on
>>>>>> madvise, the "sealing" mechanism could either violate the intended
>>>>>> semantics of madvise(), or simply break madvise() entirely, right?
>>>>>
>>>>> We would have to be a bit careful, yes.
>>>>>
>>>>> Errors from MADV_HUGEPAGE/MADV_NOHUGEPAGE are often ignored, because these
>>>>> options also fail with -EINVAL on kernels without THP support.
>>>>>
>>>>> Ignoring MADV_NOHUGEPAGE can be problematic with userfaultfd.
>>>>>
>>>>> What you likely really want to do is seal when you configured
>>>>> MADV_NOHUGEPAGE to be the default, and fail MADV_HUGEPAGE later.
>>>
>>> I am also not entirely sure how sealing a non-existing vma would work.
>>> We'd have to seal the default flags, but sealing is one way and this
>>> surely shouldn't be one way?
>>
>> You probably have mseal() in mind. Just like we wouldn't be using madvise(), we also wouldn't be using mseal().
>>
>> It could be a simple mctrl()/whatever option/flag to set the default and no longer allow changing the default and per-VMA flags, unless CAP_SYS_ADMIN or sth like that.
>>
>
> This isn't really TRANSPARENT Huge Pages, since we will require
> the application to determine which memory range will be mapped with
> huge pages.
Huh? No idea how you concluded that. Can you elaborate?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-26 10:49 ` David Hildenbrand
2025-05-26 14:53 ` Liam R. Howlett
@ 2025-05-27 5:46 ` Yafang Shao
2025-05-27 7:57 ` David Hildenbrand
1 sibling, 1 reply; 52+ messages in thread
From: Yafang Shao @ 2025-05-27 5:46 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On Mon, May 26, 2025 at 6:49 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 26.05.25 11:37, Yafang Shao wrote:
> > On Mon, May 26, 2025 at 4:14 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >>> Hi all,
> >>>
> >>> Let’s summarize the current state of the discussion and identify how
> >>> to move forward.
> >>>
> >>> - Global-Only Control is Not Viable
> >>> We all seem to agree that a global-only control for THP is unwise. In
> >>> practice, some workloads benefit from THP while others do not, so a
> >>> one-size-fits-all approach doesn’t work.
> >>>
> >>> - Should We Use "Always" or "Madvise"?
> >>> I suspect no one would choose 'always' in its current state. ;)
> >>
> >> IIRC, RHEL9 has the default set to "always" for a long time.
> >
> > good to know.
> >
> >>
> >> I guess it really depends on how different the workloads are that you
> >> are running on the same machine.
> >
> > Correct. If we want to enable THP for specific workloads without
> > modifying the kernel, we must isolate them on dedicated servers.
> > However, this approach wastes resources and is not an acceptable
> > solution.
> >
> >>
> >> > Both Lorenzo and David propose relying on the madvise mode. However,>
> >> since madvise is an unprivileged userspace mechanism, any user can
> >>> freely adjust their THP policy. This makes fine-grained control
> >>> impossible without breaking userspace compatibility—an undesirable
> >>> tradeoff.
> >>
> >> If required, we could look into a "sealing" mechanism, that would
> >> essentially lock modification attempts performed by the process (i.e.,
> >> MADV_HUGEPAGE).
> >
> > If we don’t introduce a new THP mode and instead rely solely on
> > madvise, the "sealing" mechanism could either violate the intended
> > semantics of madvise(), or simply break madvise() entirely, right?
>
> We would have to be a bit careful, yes.
>
> Errors from MADV_HUGEPAGE/MADV_NOHUGEPAGE are often ignored, because
> these options also fail with -EINVAL on kernels without THP support.
>
> Ignoring MADV_NOHUGEPAGE can be problematic with userfaultfd.
>
> What you likely really want to do is seal when you configured
> MADV_NOHUGEPAGE to be the default, and fail MADV_HUGEPAGE later.
>
> >>
> >> The could be added on top of the current proposals that are flying
> >> around, and could be done e.g., per-process.
> >
> > How about introducing a dedicated "process" mode? This would allow
> > each process to use different THP modes—some in "always," others in
> > "madvise," and the rest in "never." Future THP modes could also be
> > added to this framework.
>
> We have to be really careful about not creating even more mess with more
> modes.
>
> How would that design look like in detail (how would we set it per
> process etc?)?
I have a preliminary idea to implement this using BPF. We could define
the API as follows:
struct bpf_thp_ops {
/**
* @task_thp_mode: Get the THP mode for a specific task
*
* Return:
* - TASK_THP_ALWAYS: "always" mode
* - TASK_THP_MADVISE: "madvise" mode
* - TASK_THP_NEVER: "never" mode
* Future modes can also be added.
*/
int (*task_thp_mode)(struct task_struct *p);
};
For observability, we could add a "THP mode" field to
/proc/[pid]/status. For example:
$ grep "THP mode" /proc/123/status
always
$ grep "THP mode" /proc/456/status
madvise
$ grep "THP mode" /proc/789/status
never
The THP mode for each task would be determined by the attached BPF
program based on the task's attributes. We would place the BPF hook in
appropriate kernel functions. Note that this setting wouldn't be
inherited during fork/exec - the BPF program would make the decision
dynamically for each task.
This approach also enables runtime adjustments to THP modes based on
system-wide conditions, such as memory fragmentation or other
performance overheads. The BPF program could adapt policies
dynamically, optimizing THP behavior in response to changing
workloads.
As Liam pointed out in another thread, naming is challenging here -
"process" might not be the most accurate term for this context.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-26 14:32 ` Zi Yan
@ 2025-05-27 5:53 ` Yafang Shao
0 siblings, 0 replies; 52+ messages in thread
From: Yafang Shao @ 2025-05-27 5:53 UTC (permalink / raw)
To: Zi Yan
Cc: akpm, david, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On Mon, May 26, 2025 at 10:32 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 24 May 2025, at 23:01, Yafang Shao wrote:
>
> > On Tue, May 20, 2025 at 2:05 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>
> >> Background
> >> ----------
> >>
> >> At my current employer, PDD, we have consistently configured THP to "never"
> >> on our production servers due to past incidents caused by its behavior:
> >>
> >> - Increased memory consumption
> >> THP significantly raises overall memory usage.
> >>
> >> - Latency spikes
> >> Random latency spikes occur due to more frequent memory compaction
> >> activity triggered by THP.
> >>
> >> These issues have made sysadmins hesitant to switch to "madvise" or
> >> "always" modes.
> >>
> >> New Motivation
> >> --------------
> >>
> >> We have now identified that certain AI workloads achieve substantial
> >> performance gains with THP enabled. However, we’ve also verified that some
> >> workloads see little to no benefit—or are even negatively impacted—by THP.
> >>
> >> In our Kubernetes environment, we deploy mixed workloads on a single server
> >> to maximize resource utilization. Our goal is to selectively enable THP for
> >> services that benefit from it while keeping it disabled for others. This
> >> approach allows us to incrementally enable THP for additional services and
> >> assess how to make it more viable in production.
> >>
> >> Proposed Solution
> >> -----------------
> >>
> >> For this use case, Johannes suggested introducing a dedicated mode [0]. In
> >> this new mode, we could implement BPF-based THP adjustment for fine-grained
> >> control over tasks or cgroups. If no BPF program is attached, THP remains
> >> in "never" mode. This solution elegantly meets our needs while avoiding the
> >> complexity of managing BPF alongside other THP modes.
> >>
> >> A selftest example demonstrates how to enable THP for the current task
> >> while keeping it disabled for others.
> >>
> >> Alternative Proposals
> >> ---------------------
> >>
> >> - Gutierrez’s cgroup-based approach [1]
> >> - Proposed adding a new cgroup file to control THP policy.
> >> - However, as Johannes noted, cgroups are designed for hierarchical
> >> resource allocation, not arbitrary policy settings [2].
> >>
> >> - Usama’s per-task THP proposal based on prctl() [3]:
> >> - Enabling THP per task via prctl().
> >> - As David pointed out, neither madvise() nor prctl() works in "never"
> >> mode [4], making this solution insufficient for our needs.
> >>
> >> Conclusion
> >> ----------
> >>
> >> Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
> >> most effective solution for our requirements. This approach represents a
> >> small but meaningful step toward making THP truly usable—and manageable—in
> >> production environments.
> >>
> >> This is currently a PoC implementation. Feedback of any kind is welcome.
> >>
> >> Link: https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/ [0]
> >> Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
> >> Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
> >> Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]
> >> Link: https://lore.kernel.org/linux-mm/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/ [4]
> >>
> >> RFC v1->v2:
> >> The main changes are as follows,
> >> - Use struct_ops instead of fmod_ret (Alexei)
> >> - Introduce a new THP mode (Johannes)
> >> - Introduce new helpers for BPF hook (Zi)
> >> - Refine the commit log
> >>
> >> RFC v1: https://lwn.net/Articles/1019290/
> >>
> >> Yafang Shao (5):
> >> mm: thp: Add a new mode "bpf"
> >> mm: thp: Add hook for BPF based THP adjustment
> >> mm: thp: add struct ops for BPF based THP adjustment
> >> bpf: Add get_current_comm to bpf_base_func_proto
> >> selftests/bpf: Add selftest for THP adjustment
> >>
> >> include/linux/huge_mm.h | 15 +-
> >> kernel/bpf/cgroup.c | 2 -
> >> kernel/bpf/helpers.c | 2 +
> >> mm/Makefile | 3 +
> >> mm/bpf_thp.c | 120 ++++++++++++
> >> mm/huge_memory.c | 65 ++++++-
> >> mm/khugepaged.c | 3 +
> >> tools/testing/selftests/bpf/config | 1 +
> >> .../selftests/bpf/prog_tests/thp_adjust.c | 175 ++++++++++++++++++
> >> .../selftests/bpf/progs/test_thp_adjust.c | 39 ++++
> >> 10 files changed, 414 insertions(+), 11 deletions(-)
> >> create mode 100644 mm/bpf_thp.c
> >> create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
> >> create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
> >>
> >> --
> >> 2.43.5
> >>
> >
> > Hi all,
> >
> > Let’s summarize the current state of the discussion and identify how
> > to move forward.
> >
> > - Global-Only Control is Not Viable
> > We all seem to agree that a global-only control for THP is unwise. In
> > practice, some workloads benefit from THP while others do not, so a
> > one-size-fits-all approach doesn’t work.
> >
> > - Should We Use "Always" or "Madvise"?
> > I suspect no one would choose 'always' in its current state. ;)
> > Both Lorenzo and David propose relying on the madvise mode. However,
> > since madvise is an unprivileged userspace mechanism, any user can
> > freely adjust their THP policy. This makes fine-grained control
> > impossible without breaking userspace compatibility—an undesirable
> > tradeoff.
> > Given these limitations, the community should consider introducing a
> > new "admin" mode for privileged THP policy management.
> >
>
> I agree with the above two points.
>
> > - Can the Kernel Automatically Manage THP Without User Input?
> > In practice, users define their own success metrics—such as latency
> > (RT), queries per second (QPS), or throughput—to evaluate a feature’s
> > usefulness. If a feature fails to improve these metrics, it provides
> > no practical value.
> > Currently, the kernel lacks visibility into user-defined metrics,
> > making fully automated optimization impossible (at least without user
> > input). More importantly, automatic management offers no benefit if it
> > doesn’t align with user needs.
>
> Yes, kernel is basically guessing what userspace wants with some hints
> like MADV_HUGEPAGE/MADV_NOHUGEPAGE. But kernel has the global view
> of memory fragmentation, which userspace cannot get easily.
Correct, memory fragmentation is another critical factor in
determining whether to allocate THP.
> I wonder
> if it is possible that userspace tuning might benefit one set of
> applications but hurt others or overall performance. Right now,
> THP tuning is 0 or 1, either an application wants THPs or not.
> We might need a way of ranking THP requests from userspace to
> let kernel prioritize them (I am not sure if we can add another
> user input parameter, like THP_nice, to get this done, since
> apparently everyone will set THP_nice to -100 to get themselves
> at the top of the list).
Interesting idea. Perhaps we could make this configurable only by sysadmins.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-27 5:46 ` Yafang Shao
@ 2025-05-27 7:57 ` David Hildenbrand
2025-05-27 8:13 ` Yafang Shao
0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2025-05-27 7:57 UTC (permalink / raw)
To: Yafang Shao
Cc: akpm, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On 27.05.25 07:46, Yafang Shao wrote:
> On Mon, May 26, 2025 at 6:49 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 26.05.25 11:37, Yafang Shao wrote:
>>> On Mon, May 26, 2025 at 4:14 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Let’s summarize the current state of the discussion and identify how
>>>>> to move forward.
>>>>>
>>>>> - Global-Only Control is Not Viable
>>>>> We all seem to agree that a global-only control for THP is unwise. In
>>>>> practice, some workloads benefit from THP while others do not, so a
>>>>> one-size-fits-all approach doesn’t work.
>>>>>
>>>>> - Should We Use "Always" or "Madvise"?
>>>>> I suspect no one would choose 'always' in its current state. ;)
>>>>
>>>> IIRC, RHEL9 has the default set to "always" for a long time.
>>>
>>> good to know.
>>>
>>>>
>>>> I guess it really depends on how different the workloads are that you
>>>> are running on the same machine.
>>>
>>> Correct. If we want to enable THP for specific workloads without
>>> modifying the kernel, we must isolate them on dedicated servers.
>>> However, this approach wastes resources and is not an acceptable
>>> solution.
>>>
>>>>
>>>> > Both Lorenzo and David propose relying on the madvise mode. However,>
>>>> since madvise is an unprivileged userspace mechanism, any user can
>>>>> freely adjust their THP policy. This makes fine-grained control
>>>>> impossible without breaking userspace compatibility—an undesirable
>>>>> tradeoff.
>>>>
>>>> If required, we could look into a "sealing" mechanism, that would
>>>> essentially lock modification attempts performed by the process (i.e.,
>>>> MADV_HUGEPAGE).
>>>
>>> If we don’t introduce a new THP mode and instead rely solely on
>>> madvise, the "sealing" mechanism could either violate the intended
>>> semantics of madvise(), or simply break madvise() entirely, right?
>>
>> We would have to be a bit careful, yes.
>>
>> Errors from MADV_HUGEPAGE/MADV_NOHUGEPAGE are often ignored, because
>> these options also fail with -EINVAL on kernels without THP support.
>>
>> Ignoring MADV_NOHUGEPAGE can be problematic with userfaultfd.
>>
>> What you likely really want to do is seal when you configured
>> MADV_NOHUGEPAGE to be the default, and fail MADV_HUGEPAGE later.
>>
>>>>
>>>> The could be added on top of the current proposals that are flying
>>>> around, and could be done e.g., per-process.
>>>
>>> How about introducing a dedicated "process" mode? This would allow
>>> each process to use different THP modes—some in "always," others in
>>> "madvise," and the rest in "never." Future THP modes could also be
>>> added to this framework.
>>
>> We have to be really careful about not creating even more mess with more
>> modes.
>>
>> How would that design look like in detail (how would we set it per
>> process etc?)?
>
> I have a preliminary idea to implement this using BPF.
I don't think we want to add such a mechanism (new mode) where the
primary configuration mechanism is through bpf.
Maybe bpf could be used as an alternative, but we should look into a
reasonable alternative first, like the discussed mctrl()/.../ raised in
the process_madvise() series.
No "bpf" mode in disguise, please :)
> We could define
> the API as follows:
>
> struct bpf_thp_ops {
> /**
> * @task_thp_mode: Get the THP mode for a specific task
> *
> * Return:
> * - TASK_THP_ALWAYS: "always" mode
> * - TASK_THP_MADVISE: "madvise" mode
> * - TASK_THP_NEVER: "never" mode
> * Future modes can also be added.
> */
> int (*task_thp_mode)(struct task_struct *p);
> };
>
> For observability, we could add a "THP mode" field to
> /proc/[pid]/status. For example:
>
> $ grep "THP mode" /proc/123/status
> always
> $ grep "THP mode" /proc/456/status
> madvise
> $ grep "THP mode" /proc/789/status
> never
>
> The THP mode for each task would be determined by the attached BPF
> program based on the task's attributes. We would place the BPF hook in
> appropriate kernel functions. Note that this setting wouldn't be
> inherited during fork/exec - the BPF program would make the decision
> dynamically for each task.
What would be the mode (default) when the bpf program would not be active?
> This approach also enables runtime adjustments to THP modes based on
> system-wide conditions, such as memory fragmentation or other
> performance overheads. The BPF program could adapt policies
> dynamically, optimizing THP behavior in response to changing
> workloads.
I am not sure that is the proper way to handle these scenarios: I never
heard that people would be adjusting the system-wide policy dynamically
in that way either.
Whatever we do, we have to make sure that what we add won't
over-complicate things in the future. Having tooling dynamically adjust
the THP policy of processes that coarsely sounds ... very wrong long-term.
> > As Liam pointed out in another thread, naming is challenging here -
> "process" might not be the most accurate term for this context.
No, it's not even a per-process thing. It is per MM, and a MM might be
used by multiple processes ...
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-27 7:57 ` David Hildenbrand
@ 2025-05-27 8:13 ` Yafang Shao
2025-05-27 8:30 ` David Hildenbrand
0 siblings, 1 reply; 52+ messages in thread
From: Yafang Shao @ 2025-05-27 8:13 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On Tue, May 27, 2025 at 3:58 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 27.05.25 07:46, Yafang Shao wrote:
> > On Mon, May 26, 2025 at 6:49 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 26.05.25 11:37, Yafang Shao wrote:
> >>> On Mon, May 26, 2025 at 4:14 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> Let’s summarize the current state of the discussion and identify how
> >>>>> to move forward.
> >>>>>
> >>>>> - Global-Only Control is Not Viable
> >>>>> We all seem to agree that a global-only control for THP is unwise. In
> >>>>> practice, some workloads benefit from THP while others do not, so a
> >>>>> one-size-fits-all approach doesn’t work.
> >>>>>
> >>>>> - Should We Use "Always" or "Madvise"?
> >>>>> I suspect no one would choose 'always' in its current state. ;)
> >>>>
> >>>> IIRC, RHEL9 has the default set to "always" for a long time.
> >>>
> >>> good to know.
> >>>
> >>>>
> >>>> I guess it really depends on how different the workloads are that you
> >>>> are running on the same machine.
> >>>
> >>> Correct. If we want to enable THP for specific workloads without
> >>> modifying the kernel, we must isolate them on dedicated servers.
> >>> However, this approach wastes resources and is not an acceptable
> >>> solution.
> >>>
> >>>>
> >>>> > Both Lorenzo and David propose relying on the madvise mode. However,>
> >>>> since madvise is an unprivileged userspace mechanism, any user can
> >>>>> freely adjust their THP policy. This makes fine-grained control
> >>>>> impossible without breaking userspace compatibility—an undesirable
> >>>>> tradeoff.
> >>>>
> >>>> If required, we could look into a "sealing" mechanism, that would
> >>>> essentially lock modification attempts performed by the process (i.e.,
> >>>> MADV_HUGEPAGE).
> >>>
> >>> If we don’t introduce a new THP mode and instead rely solely on
> >>> madvise, the "sealing" mechanism could either violate the intended
> >>> semantics of madvise(), or simply break madvise() entirely, right?
> >>
> >> We would have to be a bit careful, yes.
> >>
> >> Errors from MADV_HUGEPAGE/MADV_NOHUGEPAGE are often ignored, because
> >> these options also fail with -EINVAL on kernels without THP support.
> >>
> >> Ignoring MADV_NOHUGEPAGE can be problematic with userfaultfd.
> >>
> >> What you likely really want to do is seal when you configured
> >> MADV_NOHUGEPAGE to be the default, and fail MADV_HUGEPAGE later.
> >>
> >>>>
> >>>> The could be added on top of the current proposals that are flying
> >>>> around, and could be done e.g., per-process.
> >>>
> >>> How about introducing a dedicated "process" mode? This would allow
> >>> each process to use different THP modes—some in "always," others in
> >>> "madvise," and the rest in "never." Future THP modes could also be
> >>> added to this framework.
> >>
> >> We have to be really careful about not creating even more mess with more
> >> modes.
> >>
> >> How would that design look like in detail (how would we set it per
> >> process etc?)?
> >
> > I have a preliminary idea to implement this using BPF.
>
> I don't think we want to add such a mechanism (new mode) where the
> primary configuration mechanism is through bpf.
>
> Maybe bpf could be used as an alternative, but we should look into a
> reasonable alternative first, like the discussed mctrl()/.../ raised in
> the process_madvise() series.
>
> No "bpf" mode in disguise, please :)
This goal can be readily achieved using a BPF program. In any case, it
is a feasible solution.
>
> > We could define
> > the API as follows:
> >
> > struct bpf_thp_ops {
> > /**
> > * @task_thp_mode: Get the THP mode for a specific task
> > *
> > * Return:
> > * - TASK_THP_ALWAYS: "always" mode
> > * - TASK_THP_MADVISE: "madvise" mode
> > * - TASK_THP_NEVER: "never" mode
> > * Future modes can also be added.
> > */
> > int (*task_thp_mode)(struct task_struct *p);
> > };
> >
> > For observability, we could add a "THP mode" field to
> > /proc/[pid]/status. For example:
> >
> > $ grep "THP mode" /proc/123/status
> > always
> > $ grep "THP mode" /proc/456/status
> > madvise
> > $ grep "THP mode" /proc/789/status
> > never
> >
> > The THP mode for each task would be determined by the attached BPF
> > program based on the task's attributes. We would place the BPF hook in
> > appropriate kernel functions. Note that this setting wouldn't be
> > inherited during fork/exec - the BPF program would make the decision
> > dynamically for each task.
>
> What would be the mode (default) when the bpf program would not be active?
>
> > This approach also enables runtime adjustments to THP modes based on
> > system-wide conditions, such as memory fragmentation or other
> > performance overheads. The BPF program could adapt policies
> > dynamically, optimizing THP behavior in response to changing
> > workloads.
>
> I am not sure that is the proper way to handle these scenarios: I never
> heard that people would be adjusting the system-wide policy dynamically
> in that way either.
>
> Whatever we do, we have to make sure that what we add won't
> over-complicate things in the future. Having tooling dynamically adjust
> the THP policy of processes that coarsely sounds ... very wrong long-term.
This is just an example demonstrating how BPF can be used to adjust
its flexibility. Notably, all these policies can be implemented
without modifying the kernel.
>
> > > As Liam pointed out in another thread, naming is challenging here -
> > "process" might not be the most accurate term for this context.
>
> No, it's not even a per-process thing. It is per MM, and a MM might be
> used by multiple processes ...
I consistently use 'thread' for the latter case. Additionally, this
can be implemented per-MM without kernel code modifications.
With a well-designed API, users can even implement custom THP
policies—all without altering kernel code.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-27 8:13 ` Yafang Shao
@ 2025-05-27 8:30 ` David Hildenbrand
2025-05-27 8:40 ` Yafang Shao
0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2025-05-27 8:30 UTC (permalink / raw)
To: Yafang Shao
Cc: akpm, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
>> I don't think we want to add such a mechanism (new mode) where the
>> primary configuration mechanism is through bpf.
>>
>> Maybe bpf could be used as an alternative, but we should look into a
>> reasonable alternative first, like the discussed mctrl()/.../ raised in
>> the process_madvise() series.
>>
>> No "bpf" mode in disguise, please :)
>
> This goal can be readily achieved using a BPF program. In any case, it
> is a feasible solution.
No BPF-only solution.
>
>>
>>> We could define
>>> the API as follows:
>>>
>>> struct bpf_thp_ops {
>>> /**
>>> * @task_thp_mode: Get the THP mode for a specific task
>>> *
>>> * Return:
>>> * - TASK_THP_ALWAYS: "always" mode
>>> * - TASK_THP_MADVISE: "madvise" mode
>>> * - TASK_THP_NEVER: "never" mode
>>> * Future modes can also be added.
>>> */
>>> int (*task_thp_mode)(struct task_struct *p);
>>> };
>>>
>>> For observability, we could add a "THP mode" field to
>>> /proc/[pid]/status. For example:
>>>
>>> $ grep "THP mode" /proc/123/status
>>> always
>>> $ grep "THP mode" /proc/456/status
>>> madvise
>>> $ grep "THP mode" /proc/789/status
>>> never
>>>
>>> The THP mode for each task would be determined by the attached BPF
>>> program based on the task's attributes. We would place the BPF hook in
>>> appropriate kernel functions. Note that this setting wouldn't be
>>> inherited during fork/exec - the BPF program would make the decision
>>> dynamically for each task.
>>
>> What would be the mode (default) when the bpf program would not be active?
>>
>>> This approach also enables runtime adjustments to THP modes based on
>>> system-wide conditions, such as memory fragmentation or other
>>> performance overheads. The BPF program could adapt policies
>>> dynamically, optimizing THP behavior in response to changing
>>> workloads.
>>
>> I am not sure that is the proper way to handle these scenarios: I never
>> heard that people would be adjusting the system-wide policy dynamically
>> in that way either.
>>
>> Whatever we do, we have to make sure that what we add won't
>> over-complicate things in the future. Having tooling dynamically adjust
>> the THP policy of processes that coarsely sounds ... very wrong long-term.
>
> This is just an example demonstrating how BPF can be used to adjust
> its flexibility. Notably, all these policies can be implemented
> without modifying the kernel.
See below on "policy".
>
>>
>> > > As Liam pointed out in another thread, naming is challenging here -
>>> "process" might not be the most accurate term for this context.
>>
>> No, it's not even a per-process thing. It is per MM, and a MM might be
>> used by multiple processes ...
>
> I consistently use 'thread' for the latter case.
You can use CLONE_VM without CLONE_THREAD ...
Additionally, this
> can be implemented per-MM without kernel code modifications.
> With a well-designed API, users can even implement custom THP
> policies—all without altering kernel code.
You can switch between modes, that' all you can do. I wouldn't really
call that "custom policy" as it is extremely limited.
And that's exactly my point: it's basic switching between modes ... a
reasonable policy in the future will make placement decisions and not
just state "always/never/madvise".
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-27 8:30 ` David Hildenbrand
@ 2025-05-27 8:40 ` Yafang Shao
2025-05-27 9:27 ` David Hildenbrand
0 siblings, 1 reply; 52+ messages in thread
From: Yafang Shao @ 2025-05-27 8:40 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On Tue, May 27, 2025 at 4:30 PM David Hildenbrand <david@redhat.com> wrote:
>
> >> I don't think we want to add such a mechanism (new mode) where the
> >> primary configuration mechanism is through bpf.
> >>
> >> Maybe bpf could be used as an alternative, but we should look into a
> >> reasonable alternative first, like the discussed mctrl()/.../ raised in
> >> the process_madvise() series.
> >>
> >> No "bpf" mode in disguise, please :)
> >
> > This goal can be readily achieved using a BPF program. In any case, it
> > is a feasible solution.
>
> No BPF-only solution.
>
> >
> >>
> >>> We could define
> >>> the API as follows:
> >>>
> >>> struct bpf_thp_ops {
> >>> /**
> >>> * @task_thp_mode: Get the THP mode for a specific task
> >>> *
> >>> * Return:
> >>> * - TASK_THP_ALWAYS: "always" mode
> >>> * - TASK_THP_MADVISE: "madvise" mode
> >>> * - TASK_THP_NEVER: "never" mode
> >>> * Future modes can also be added.
> >>> */
> >>> int (*task_thp_mode)(struct task_struct *p);
> >>> };
> >>>
> >>> For observability, we could add a "THP mode" field to
> >>> /proc/[pid]/status. For example:
> >>>
> >>> $ grep "THP mode" /proc/123/status
> >>> always
> >>> $ grep "THP mode" /proc/456/status
> >>> madvise
> >>> $ grep "THP mode" /proc/789/status
> >>> never
> >>>
> >>> The THP mode for each task would be determined by the attached BPF
> >>> program based on the task's attributes. We would place the BPF hook in
> >>> appropriate kernel functions. Note that this setting wouldn't be
> >>> inherited during fork/exec - the BPF program would make the decision
> >>> dynamically for each task.
> >>
> >> What would be the mode (default) when the bpf program would not be active?
> >>
> >>> This approach also enables runtime adjustments to THP modes based on
> >>> system-wide conditions, such as memory fragmentation or other
> >>> performance overheads. The BPF program could adapt policies
> >>> dynamically, optimizing THP behavior in response to changing
> >>> workloads.
> >>
> >> I am not sure that is the proper way to handle these scenarios: I never
> >> heard that people would be adjusting the system-wide policy dynamically
> >> in that way either.
> >>
> >> Whatever we do, we have to make sure that what we add won't
> >> over-complicate things in the future. Having tooling dynamically adjust
> >> the THP policy of processes that coarsely sounds ... very wrong long-term.
> >
> > This is just an example demonstrating how BPF can be used to adjust
> > its flexibility. Notably, all these policies can be implemented
> > without modifying the kernel.
>
> See below on "policy".
>
> >
> >>
> >> > > As Liam pointed out in another thread, naming is challenging here -
> >>> "process" might not be the most accurate term for this context.
> >>
> >> No, it's not even a per-process thing. It is per MM, and a MM might be
> >> used by multiple processes ...
> >
> > I consistently use 'thread' for the latter case.
>
> You can use CLONE_VM without CLONE_THREAD ...
If I understand correctly, this can only occur for shared THP but not
anonymous THP. For instance, if either process allocates an anonymous
THP, it would trigger the creation of a new MM. Please correct me if
I'm mistaken.
>
> Additionally, this
> > can be implemented per-MM without kernel code modifications.
> > With a well-designed API, users can even implement custom THP
> > policies—all without altering kernel code.
>
> You can switch between modes, that' all you can do. I wouldn't really
> call that "custom policy" as it is extremely limited.
>
> And that's exactly my point: it's basic switching between modes ... a
> reasonable policy in the future will make placement decisions and not
> just state "always/never/madvise".
Could you please elaborate further on 'make placement decisions'? As
previously mentioned, we (including the broader community) really need
the user input to determine whether THP allocation is appropriate in a
given case.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-27 8:40 ` Yafang Shao
@ 2025-05-27 9:27 ` David Hildenbrand
2025-05-27 9:43 ` Yafang Shao
0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2025-05-27 9:27 UTC (permalink / raw)
To: Yafang Shao
Cc: akpm, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On 27.05.25 10:40, Yafang Shao wrote:
> On Tue, May 27, 2025 at 4:30 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>>> I don't think we want to add such a mechanism (new mode) where the
>>>> primary configuration mechanism is through bpf.
>>>>
>>>> Maybe bpf could be used as an alternative, but we should look into a
>>>> reasonable alternative first, like the discussed mctrl()/.../ raised in
>>>> the process_madvise() series.
>>>>
>>>> No "bpf" mode in disguise, please :)
>>>
>>> This goal can be readily achieved using a BPF program. In any case, it
>>> is a feasible solution.
>>
>> No BPF-only solution.
>>
>>>
>>>>
>>>>> We could define
>>>>> the API as follows:
>>>>>
>>>>> struct bpf_thp_ops {
>>>>> /**
>>>>> * @task_thp_mode: Get the THP mode for a specific task
>>>>> *
>>>>> * Return:
>>>>> * - TASK_THP_ALWAYS: "always" mode
>>>>> * - TASK_THP_MADVISE: "madvise" mode
>>>>> * - TASK_THP_NEVER: "never" mode
>>>>> * Future modes can also be added.
>>>>> */
>>>>> int (*task_thp_mode)(struct task_struct *p);
>>>>> };
>>>>>
>>>>> For observability, we could add a "THP mode" field to
>>>>> /proc/[pid]/status. For example:
>>>>>
>>>>> $ grep "THP mode" /proc/123/status
>>>>> always
>>>>> $ grep "THP mode" /proc/456/status
>>>>> madvise
>>>>> $ grep "THP mode" /proc/789/status
>>>>> never
>>>>>
>>>>> The THP mode for each task would be determined by the attached BPF
>>>>> program based on the task's attributes. We would place the BPF hook in
>>>>> appropriate kernel functions. Note that this setting wouldn't be
>>>>> inherited during fork/exec - the BPF program would make the decision
>>>>> dynamically for each task.
>>>>
>>>> What would be the mode (default) when the bpf program would not be active?
>>>>
>>>>> This approach also enables runtime adjustments to THP modes based on
>>>>> system-wide conditions, such as memory fragmentation or other
>>>>> performance overheads. The BPF program could adapt policies
>>>>> dynamically, optimizing THP behavior in response to changing
>>>>> workloads.
>>>>
>>>> I am not sure that is the proper way to handle these scenarios: I never
>>>> heard that people would be adjusting the system-wide policy dynamically
>>>> in that way either.
>>>>
>>>> Whatever we do, we have to make sure that what we add won't
>>>> over-complicate things in the future. Having tooling dynamically adjust
>>>> the THP policy of processes that coarsely sounds ... very wrong long-term.
>>>
>>> This is just an example demonstrating how BPF can be used to adjust
>>> its flexibility. Notably, all these policies can be implemented
>>> without modifying the kernel.
>>
>> See below on "policy".
>>
>>>
>>>>
>>>> > > As Liam pointed out in another thread, naming is challenging here -
>>>>> "process" might not be the most accurate term for this context.
>>>>
>>>> No, it's not even a per-process thing. It is per MM, and a MM might be
>>>> used by multiple processes ...
>>>
>>> I consistently use 'thread' for the latter case.
>>
>> You can use CLONE_VM without CLONE_THREAD ...
>
> If I understand correctly, this can only occur for shared THP but not
> anonymous THP. For instance, if either process allocates an anonymous
> THP, it would trigger the creation of a new MM. Please correct me if
> I'm mistaken.
What clone(CLONE_VM) will do is essentially create a new process, that
shares the MM with the original process. Similar to a thread, just that
the new process will show up in /proc/ as ... a new process, not as a
thread under /prod/$pid/tasks of the original process.
Both processes will operate on the shared MM struct as if they were
ordinary threads. No Copy-on-Write involved.
One example use case I've been involved in is async teardown in QEMU [1].
[1] https://kvm-forum.qemu.org/2022/ibm_async_destroy.pdf
>
>>
>> Additionally, this
>>> can be implemented per-MM without kernel code modifications.
>>> With a well-designed API, users can even implement custom THP
>>> policies—all without altering kernel code.
>>
>> You can switch between modes, that' all you can do. I wouldn't really
>> call that "custom policy" as it is extremely limited.
>>
>> And that's exactly my point: it's basic switching between modes ... a
>> reasonable policy in the future will make placement decisions and not
>> just state "always/never/madvise".
>
> Could you please elaborate further on 'make placement decisions'? As
> previously mentioned, we (including the broader community) really need
> the user input to determine whether THP allocation is appropriate in a
> given case.
The glorious future were we make smarter decisions where to actually
place THPs even in the "always" mode.
E.g., just because we enable "always" for a process does not mean that
we really want a THP everywhere; quite the opposite.
Treat the "always"/"madvise"/"never" as a rough mode, not a future-proof
policy that we would want to fine-tune dynamically ... that would be
very limiting.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-27 9:27 ` David Hildenbrand
@ 2025-05-27 9:43 ` Yafang Shao
2025-05-27 12:19 ` David Hildenbrand
0 siblings, 1 reply; 52+ messages in thread
From: Yafang Shao @ 2025-05-27 9:43 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On Tue, May 27, 2025 at 5:27 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 27.05.25 10:40, Yafang Shao wrote:
> > On Tue, May 27, 2025 at 4:30 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >>>> I don't think we want to add such a mechanism (new mode) where the
> >>>> primary configuration mechanism is through bpf.
> >>>>
> >>>> Maybe bpf could be used as an alternative, but we should look into a
> >>>> reasonable alternative first, like the discussed mctrl()/.../ raised in
> >>>> the process_madvise() series.
> >>>>
> >>>> No "bpf" mode in disguise, please :)
> >>>
> >>> This goal can be readily achieved using a BPF program. In any case, it
> >>> is a feasible solution.
> >>
> >> No BPF-only solution.
> >>
> >>>
> >>>>
> >>>>> We could define
> >>>>> the API as follows:
> >>>>>
> >>>>> struct bpf_thp_ops {
> >>>>> /**
> >>>>> * @task_thp_mode: Get the THP mode for a specific task
> >>>>> *
> >>>>> * Return:
> >>>>> * - TASK_THP_ALWAYS: "always" mode
> >>>>> * - TASK_THP_MADVISE: "madvise" mode
> >>>>> * - TASK_THP_NEVER: "never" mode
> >>>>> * Future modes can also be added.
> >>>>> */
> >>>>> int (*task_thp_mode)(struct task_struct *p);
> >>>>> };
> >>>>>
> >>>>> For observability, we could add a "THP mode" field to
> >>>>> /proc/[pid]/status. For example:
> >>>>>
> >>>>> $ grep "THP mode" /proc/123/status
> >>>>> always
> >>>>> $ grep "THP mode" /proc/456/status
> >>>>> madvise
> >>>>> $ grep "THP mode" /proc/789/status
> >>>>> never
> >>>>>
> >>>>> The THP mode for each task would be determined by the attached BPF
> >>>>> program based on the task's attributes. We would place the BPF hook in
> >>>>> appropriate kernel functions. Note that this setting wouldn't be
> >>>>> inherited during fork/exec - the BPF program would make the decision
> >>>>> dynamically for each task.
> >>>>
> >>>> What would be the mode (default) when the bpf program would not be active?
> >>>>
> >>>>> This approach also enables runtime adjustments to THP modes based on
> >>>>> system-wide conditions, such as memory fragmentation or other
> >>>>> performance overheads. The BPF program could adapt policies
> >>>>> dynamically, optimizing THP behavior in response to changing
> >>>>> workloads.
> >>>>
> >>>> I am not sure that is the proper way to handle these scenarios: I never
> >>>> heard that people would be adjusting the system-wide policy dynamically
> >>>> in that way either.
> >>>>
> >>>> Whatever we do, we have to make sure that what we add won't
> >>>> over-complicate things in the future. Having tooling dynamically adjust
> >>>> the THP policy of processes that coarsely sounds ... very wrong long-term.
> >>>
> >>> This is just an example demonstrating how BPF can be used to adjust
> >>> its flexibility. Notably, all these policies can be implemented
> >>> without modifying the kernel.
> >>
> >> See below on "policy".
> >>
> >>>
> >>>>
> >>>> > > As Liam pointed out in another thread, naming is challenging here -
> >>>>> "process" might not be the most accurate term for this context.
> >>>>
> >>>> No, it's not even a per-process thing. It is per MM, and a MM might be
> >>>> used by multiple processes ...
> >>>
> >>> I consistently use 'thread' for the latter case.
> >>
> >> You can use CLONE_VM without CLONE_THREAD ...
> >
> > If I understand correctly, this can only occur for shared THP but not
> > anonymous THP. For instance, if either process allocates an anonymous
> > THP, it would trigger the creation of a new MM. Please correct me if
> > I'm mistaken.
>
> What clone(CLONE_VM) will do is essentially create a new process, that
> shares the MM with the original process. Similar to a thread, just that
> the new process will show up in /proc/ as ... a new process, not as a
> thread under /prod/$pid/tasks of the original process.
>
> Both processes will operate on the shared MM struct as if they were
> ordinary threads. No Copy-on-Write involved.
>
> One example use case I've been involved in is async teardown in QEMU [1].
>
> [1] https://kvm-forum.qemu.org/2022/ibm_async_destroy.pdf
I understand what you mean, but what I'm really confused about is how
this relates to allocating anonymous THP. If either one allocates
anon THP, it will definitely create a new MM, right ?
>
> >
> >>
> >> Additionally, this
> >>> can be implemented per-MM without kernel code modifications.
> >>> With a well-designed API, users can even implement custom THP
> >>> policies—all without altering kernel code.
> >>
> >> You can switch between modes, that' all you can do. I wouldn't really
> >> call that "custom policy" as it is extremely limited.
> >>
> >> And that's exactly my point: it's basic switching between modes ... a
> >> reasonable policy in the future will make placement decisions and not
> >> just state "always/never/madvise".
> >
> > Could you please elaborate further on 'make placement decisions'? As
> > previously mentioned, we (including the broader community) really need
> > the user input to determine whether THP allocation is appropriate in a
> > given case.
>
> The glorious future were we make smarter decisions where to actually
> place THPs even in the "always" mode.
>
> E.g., just because we enable "always" for a process does not mean that
> we really want a THP everywhere; quite the opposite.
So 'always' simply means "the system doesn't guarantee THP allocation
will succeed" ? If that's the case, we should revisit RFC v1 [0],
where we proposed rejecting THP allocations in certain scenarios for
specific tasks.
[0] https://lwn.net/Articles/1019290/
>
> Treat the "always"/"madvise"/"never" as a rough mode, not a future-proof
> policy that we would want to fine-tune dynamically ... that would be
> very limiting.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-27 9:43 ` Yafang Shao
@ 2025-05-27 12:19 ` David Hildenbrand
2025-05-28 2:04 ` Yafang Shao
0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2025-05-27 12:19 UTC (permalink / raw)
To: Yafang Shao
Cc: akpm, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On 27.05.25 11:43, Yafang Shao wrote:
> On Tue, May 27, 2025 at 5:27 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 27.05.25 10:40, Yafang Shao wrote:
>>> On Tue, May 27, 2025 at 4:30 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>>>> I don't think we want to add such a mechanism (new mode) where the
>>>>>> primary configuration mechanism is through bpf.
>>>>>>
>>>>>> Maybe bpf could be used as an alternative, but we should look into a
>>>>>> reasonable alternative first, like the discussed mctrl()/.../ raised in
>>>>>> the process_madvise() series.
>>>>>>
>>>>>> No "bpf" mode in disguise, please :)
>>>>>
>>>>> This goal can be readily achieved using a BPF program. In any case, it
>>>>> is a feasible solution.
>>>>
>>>> No BPF-only solution.
>>>>
>>>>>
>>>>>>
>>>>>>> We could define
>>>>>>> the API as follows:
>>>>>>>
>>>>>>> struct bpf_thp_ops {
>>>>>>> /**
>>>>>>> * @task_thp_mode: Get the THP mode for a specific task
>>>>>>> *
>>>>>>> * Return:
>>>>>>> * - TASK_THP_ALWAYS: "always" mode
>>>>>>> * - TASK_THP_MADVISE: "madvise" mode
>>>>>>> * - TASK_THP_NEVER: "never" mode
>>>>>>> * Future modes can also be added.
>>>>>>> */
>>>>>>> int (*task_thp_mode)(struct task_struct *p);
>>>>>>> };
>>>>>>>
>>>>>>> For observability, we could add a "THP mode" field to
>>>>>>> /proc/[pid]/status. For example:
>>>>>>>
>>>>>>> $ grep "THP mode" /proc/123/status
>>>>>>> always
>>>>>>> $ grep "THP mode" /proc/456/status
>>>>>>> madvise
>>>>>>> $ grep "THP mode" /proc/789/status
>>>>>>> never
>>>>>>>
>>>>>>> The THP mode for each task would be determined by the attached BPF
>>>>>>> program based on the task's attributes. We would place the BPF hook in
>>>>>>> appropriate kernel functions. Note that this setting wouldn't be
>>>>>>> inherited during fork/exec - the BPF program would make the decision
>>>>>>> dynamically for each task.
>>>>>>
>>>>>> What would be the mode (default) when the bpf program would not be active?
>>>>>>
>>>>>>> This approach also enables runtime adjustments to THP modes based on
>>>>>>> system-wide conditions, such as memory fragmentation or other
>>>>>>> performance overheads. The BPF program could adapt policies
>>>>>>> dynamically, optimizing THP behavior in response to changing
>>>>>>> workloads.
>>>>>>
>>>>>> I am not sure that is the proper way to handle these scenarios: I never
>>>>>> heard that people would be adjusting the system-wide policy dynamically
>>>>>> in that way either.
>>>>>>
>>>>>> Whatever we do, we have to make sure that what we add won't
>>>>>> over-complicate things in the future. Having tooling dynamically adjust
>>>>>> the THP policy of processes that coarsely sounds ... very wrong long-term.
>>>>>
>>>>> This is just an example demonstrating how BPF can be used to adjust
>>>>> its flexibility. Notably, all these policies can be implemented
>>>>> without modifying the kernel.
>>>>
>>>> See below on "policy".
>>>>
>>>>>
>>>>>>
>>>>>> > > As Liam pointed out in another thread, naming is challenging here -
>>>>>>> "process" might not be the most accurate term for this context.
>>>>>>
>>>>>> No, it's not even a per-process thing. It is per MM, and a MM might be
>>>>>> used by multiple processes ...
>>>>>
>>>>> I consistently use 'thread' for the latter case.
>>>>
>>>> You can use CLONE_VM without CLONE_THREAD ...
>>>
>>> If I understand correctly, this can only occur for shared THP but not
>>> anonymous THP. For instance, if either process allocates an anonymous
>>> THP, it would trigger the creation of a new MM. Please correct me if
>>> I'm mistaken.
>>
>> What clone(CLONE_VM) will do is essentially create a new process, that
>> shares the MM with the original process. Similar to a thread, just that
>> the new process will show up in /proc/ as ... a new process, not as a
>> thread under /prod/$pid/tasks of the original process.
>>
>> Both processes will operate on the shared MM struct as if they were
>> ordinary threads. No Copy-on-Write involved.
>>
>> One example use case I've been involved in is async teardown in QEMU [1].
>>
>> [1] https://kvm-forum.qemu.org/2022/ibm_async_destroy.pdf
>
> I understand what you mean, but what I'm really confused about is how
> this relates to allocating anonymous THP. If either one allocates
> anon THP, it will definitely create a new MM, right ?
No. They work on the same address space - same MM. Either can allocate a
new anon THP and the other one would be able to modify it. No fork/CoW.
I only bring it up because it's two "processes" sharing the same MM. And
the THP mode in your proposal would actually be per-MM and not per process.
It's confusing ... :)
>
>>
>>>
>>>>
>>>> Additionally, this
>>>>> can be implemented per-MM without kernel code modifications.
>>>>> With a well-designed API, users can even implement custom THP
>>>>> policies—all without altering kernel code.
>>>>
>>>> You can switch between modes, that' all you can do. I wouldn't really
>>>> call that "custom policy" as it is extremely limited.
>>>>
>>>> And that's exactly my point: it's basic switching between modes ... a
>>>> reasonable policy in the future will make placement decisions and not
>>>> just state "always/never/madvise".
>>>
>>> Could you please elaborate further on 'make placement decisions'? As
>>> previously mentioned, we (including the broader community) really need
>>> the user input to determine whether THP allocation is appropriate in a
>>> given case.
>>
>> The glorious future were we make smarter decisions where to actually
>> place THPs even in the "always" mode.
>>
>> E.g., just because we enable "always" for a process does not mean that
>> we really want a THP everywhere; quite the opposite.
>
> So 'always' simply means "the system doesn't guarantee THP allocation
> will succeed" ?
I mean, with THPs, there are no guarantees, ever :(
> If that's the case, we should revisit RFC v1 [0],
> where we proposed rejecting THP allocations in certain scenarios for
> specific tasks.
Hooking into actual page allocation during page faults (e.g., THP size,
khugepaged collapse decisions) is IMHO a much better application of ebpf
than setting a THP mode per process (or MM ... ) using epbf.
So yes, you could drive the system in "always" mode and decide to not
allocate THPs during page faults / khugepaged for specific processes.
IMHO that also does not contradict the VM_HUGEPAGE / VM_NOHUGEPAGE
default setting proposal: VM_HUGEPAGE could feed into the epbf program
as yet another parameter to make a decision.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-27 12:19 ` David Hildenbrand
@ 2025-05-28 2:04 ` Yafang Shao
2025-05-28 20:32 ` David Hildenbrand
0 siblings, 1 reply; 52+ messages in thread
From: Yafang Shao @ 2025-05-28 2:04 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
On Tue, May 27, 2025 at 8:19 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 27.05.25 11:43, Yafang Shao wrote:
> > On Tue, May 27, 2025 at 5:27 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 27.05.25 10:40, Yafang Shao wrote:
> >>> On Tue, May 27, 2025 at 4:30 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>>>> I don't think we want to add such a mechanism (new mode) where the
> >>>>>> primary configuration mechanism is through bpf.
> >>>>>>
> >>>>>> Maybe bpf could be used as an alternative, but we should look into a
> >>>>>> reasonable alternative first, like the discussed mctrl()/.../ raised in
> >>>>>> the process_madvise() series.
> >>>>>>
> >>>>>> No "bpf" mode in disguise, please :)
> >>>>>
> >>>>> This goal can be readily achieved using a BPF program. In any case, it
> >>>>> is a feasible solution.
> >>>>
> >>>> No BPF-only solution.
> >>>>
> >>>>>
> >>>>>>
> >>>>>>> We could define
> >>>>>>> the API as follows:
> >>>>>>>
> >>>>>>> struct bpf_thp_ops {
> >>>>>>> /**
> >>>>>>> * @task_thp_mode: Get the THP mode for a specific task
> >>>>>>> *
> >>>>>>> * Return:
> >>>>>>> * - TASK_THP_ALWAYS: "always" mode
> >>>>>>> * - TASK_THP_MADVISE: "madvise" mode
> >>>>>>> * - TASK_THP_NEVER: "never" mode
> >>>>>>> * Future modes can also be added.
> >>>>>>> */
> >>>>>>> int (*task_thp_mode)(struct task_struct *p);
> >>>>>>> };
> >>>>>>>
> >>>>>>> For observability, we could add a "THP mode" field to
> >>>>>>> /proc/[pid]/status. For example:
> >>>>>>>
> >>>>>>> $ grep "THP mode" /proc/123/status
> >>>>>>> always
> >>>>>>> $ grep "THP mode" /proc/456/status
> >>>>>>> madvise
> >>>>>>> $ grep "THP mode" /proc/789/status
> >>>>>>> never
> >>>>>>>
> >>>>>>> The THP mode for each task would be determined by the attached BPF
> >>>>>>> program based on the task's attributes. We would place the BPF hook in
> >>>>>>> appropriate kernel functions. Note that this setting wouldn't be
> >>>>>>> inherited during fork/exec - the BPF program would make the decision
> >>>>>>> dynamically for each task.
> >>>>>>
> >>>>>> What would be the mode (default) when the bpf program would not be active?
> >>>>>>
> >>>>>>> This approach also enables runtime adjustments to THP modes based on
> >>>>>>> system-wide conditions, such as memory fragmentation or other
> >>>>>>> performance overheads. The BPF program could adapt policies
> >>>>>>> dynamically, optimizing THP behavior in response to changing
> >>>>>>> workloads.
> >>>>>>
> >>>>>> I am not sure that is the proper way to handle these scenarios: I never
> >>>>>> heard that people would be adjusting the system-wide policy dynamically
> >>>>>> in that way either.
> >>>>>>
> >>>>>> Whatever we do, we have to make sure that what we add won't
> >>>>>> over-complicate things in the future. Having tooling dynamically adjust
> >>>>>> the THP policy of processes that coarsely sounds ... very wrong long-term.
> >>>>>
> >>>>> This is just an example demonstrating how BPF can be used to adjust
> >>>>> its flexibility. Notably, all these policies can be implemented
> >>>>> without modifying the kernel.
> >>>>
> >>>> See below on "policy".
> >>>>
> >>>>>
> >>>>>>
> >>>>>> > > As Liam pointed out in another thread, naming is challenging here -
> >>>>>>> "process" might not be the most accurate term for this context.
> >>>>>>
> >>>>>> No, it's not even a per-process thing. It is per MM, and a MM might be
> >>>>>> used by multiple processes ...
> >>>>>
> >>>>> I consistently use 'thread' for the latter case.
> >>>>
> >>>> You can use CLONE_VM without CLONE_THREAD ...
> >>>
> >>> If I understand correctly, this can only occur for shared THP but not
> >>> anonymous THP. For instance, if either process allocates an anonymous
> >>> THP, it would trigger the creation of a new MM. Please correct me if
> >>> I'm mistaken.
> >>
> >> What clone(CLONE_VM) will do is essentially create a new process, that
> >> shares the MM with the original process. Similar to a thread, just that
> >> the new process will show up in /proc/ as ... a new process, not as a
> >> thread under /prod/$pid/tasks of the original process.
> >>
> >> Both processes will operate on the shared MM struct as if they were
> >> ordinary threads. No Copy-on-Write involved.
> >>
> >> One example use case I've been involved in is async teardown in QEMU [1].
> >>
> >> [1] https://kvm-forum.qemu.org/2022/ibm_async_destroy.pdf
> >
> > I understand what you mean, but what I'm really confused about is how
> > this relates to allocating anonymous THP. If either one allocates
> > anon THP, it will definitely create a new MM, right ?
>
> No. They work on the same address space - same MM. Either can allocate a
> new anon THP and the other one would be able to modify it. No fork/CoW.
>
> I only bring it up because it's two "processes" sharing the same MM. And
> the THP mode in your proposal would actually be per-MM and not per process.
>
> It's confusing ... :)
Thanks for the explanation.
>
> >
> >>
> >>>
> >>>>
> >>>> Additionally, this
> >>>>> can be implemented per-MM without kernel code modifications.
> >>>>> With a well-designed API, users can even implement custom THP
> >>>>> policies—all without altering kernel code.
> >>>>
> >>>> You can switch between modes, that' all you can do. I wouldn't really
> >>>> call that "custom policy" as it is extremely limited.
> >>>>
> >>>> And that's exactly my point: it's basic switching between modes ... a
> >>>> reasonable policy in the future will make placement decisions and not
> >>>> just state "always/never/madvise".
> >>>
> >>> Could you please elaborate further on 'make placement decisions'? As
> >>> previously mentioned, we (including the broader community) really need
> >>> the user input to determine whether THP allocation is appropriate in a
> >>> given case.
> >>
> >> The glorious future were we make smarter decisions where to actually
> >> place THPs even in the "always" mode.
> >>
> >> E.g., just because we enable "always" for a process does not mean that
> >> we really want a THP everywhere; quite the opposite.
> >
> > So 'always' simply means "the system doesn't guarantee THP allocation
> > will succeed" ?
>
> I mean, with THPs, there are no guarantees, ever :(
>
> > If that's the case, we should revisit RFC v1 [0],
> > where we proposed rejecting THP allocations in certain scenarios for
> > specific tasks.
>
> Hooking into actual page allocation during page faults (e.g., THP size,
> khugepaged collapse decisions) is IMHO a much better application of ebpf
> than setting a THP mode per process (or MM ... ) using epbf.
>
> So yes, you could drive the system in "always" mode and decide to not
> allocate THPs during page faults / khugepaged for specific processes.
>
> IMHO that also does not contradict the VM_HUGEPAGE / VM_NOHUGEPAGE
> default setting proposal: VM_HUGEPAGE could feed into the epbf program
> as yet another parameter to make a decision.
That seems like a viable solution. Thank you for your help.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
2025-05-28 2:04 ` Yafang Shao
@ 2025-05-28 20:32 ` David Hildenbrand
0 siblings, 0 replies; 52+ messages in thread
From: David Hildenbrand @ 2025-05-28 20:32 UTC (permalink / raw)
To: Yafang Shao
Cc: akpm, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, bpf, linux-mm
>>
>> I mean, with THPs, there are no guarantees, ever :(
>>
>>> If that's the case, we should revisit RFC v1 [0],
>>> where we proposed rejecting THP allocations in certain scenarios for
>>> specific tasks.
>>
>> Hooking into actual page allocation during page faults (e.g., THP size,
>> khugepaged collapse decisions) is IMHO a much better application of ebpf
>> than setting a THP mode per process (or MM ... ) using epbf.
>>
>> So yes, you could drive the system in "always" mode and decide to not
>> allocate THPs during page faults / khugepaged for specific processes.
>>
>> IMHO that also does not contradict the VM_HUGEPAGE / VM_NOHUGEPAGE
>> default setting proposal: VM_HUGEPAGE could feed into the epbf program
>> as yet another parameter to make a decision.
>
> That seems like a viable solution. Thank you for your help.
Good! And the required ebpf hooks would probably come in handy to write
more advanced policies / allocation logic than just "give it THPs" vs.
"don't give it THPs".
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 52+ messages in thread
end of thread, other threads:[~2025-05-28 20:32 UTC | newest]
Thread overview: 52+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-20 6:04 [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment Yafang Shao
2025-05-20 6:04 ` [RFC PATCH v2 1/5] mm: thp: Add a new mode "bpf" Yafang Shao
2025-05-20 6:05 ` [RFC PATCH v2 2/5] mm: thp: Add hook for BPF based THP adjustment Yafang Shao
2025-05-20 6:05 ` [RFC PATCH v2 3/5] mm: thp: add struct ops " Yafang Shao
2025-05-20 6:05 ` [RFC PATCH v2 4/5] bpf: Add get_current_comm to bpf_base_func_proto Yafang Shao
2025-05-20 23:32 ` Andrii Nakryiko
2025-05-20 6:05 ` [RFC PATCH v2 5/5] selftests/bpf: Add selftest for THP adjustment Yafang Shao
2025-05-20 6:52 ` [RFC PATCH v2 0/5] mm, bpf: BPF based " Nico Pache
2025-05-20 7:25 ` Yafang Shao
2025-05-20 13:10 ` Matthew Wilcox
2025-05-20 14:08 ` Yafang Shao
2025-05-20 14:22 ` Lorenzo Stoakes
2025-05-20 14:32 ` Usama Arif
2025-05-20 14:35 ` Lorenzo Stoakes
2025-05-20 14:42 ` Matthew Wilcox
2025-05-20 14:56 ` David Hildenbrand
2025-05-21 4:28 ` Yafang Shao
2025-05-20 14:46 ` Usama Arif
2025-05-20 15:00 ` David Hildenbrand
2025-05-20 9:43 ` David Hildenbrand
2025-05-20 9:49 ` Lorenzo Stoakes
2025-05-20 12:06 ` Yafang Shao
2025-05-20 13:45 ` Lorenzo Stoakes
2025-05-20 15:54 ` David Hildenbrand
2025-05-21 4:02 ` Yafang Shao
2025-05-21 3:52 ` Yafang Shao
2025-05-20 11:59 ` Yafang Shao
2025-05-25 3:01 ` Yafang Shao
2025-05-26 7:41 ` Gutierrez Asier
2025-05-26 9:37 ` Yafang Shao
2025-05-26 8:14 ` David Hildenbrand
2025-05-26 9:37 ` Yafang Shao
2025-05-26 10:49 ` David Hildenbrand
2025-05-26 14:53 ` Liam R. Howlett
2025-05-26 15:54 ` Liam R. Howlett
2025-05-26 16:51 ` David Hildenbrand
2025-05-26 17:07 ` Liam R. Howlett
2025-05-26 17:12 ` David Hildenbrand
2025-05-26 20:30 ` Gutierrez Asier
2025-05-26 20:37 ` David Hildenbrand
2025-05-27 5:46 ` Yafang Shao
2025-05-27 7:57 ` David Hildenbrand
2025-05-27 8:13 ` Yafang Shao
2025-05-27 8:30 ` David Hildenbrand
2025-05-27 8:40 ` Yafang Shao
2025-05-27 9:27 ` David Hildenbrand
2025-05-27 9:43 ` Yafang Shao
2025-05-27 12:19 ` David Hildenbrand
2025-05-28 2:04 ` Yafang Shao
2025-05-28 20:32 ` David Hildenbrand
2025-05-26 14:32 ` Zi Yan
2025-05-27 5:53 ` Yafang Shao
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).