* [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
@ 2025-04-29 2:41 Yafang Shao
2025-04-29 2:41 ` [RFC PATCH 1/4] mm: move hugepage_global_{enabled,always}() to internal.h Yafang Shao
` (5 more replies)
0 siblings, 6 replies; 41+ messages in thread
From: Yafang Shao @ 2025-04-29 2:41 UTC (permalink / raw)
To: akpm, ast, daniel, andrii; +Cc: bpf, linux-mm, Yafang Shao
In our container environment, we aim to enable THP selectively—allowing
specific services to use it while restricting others. This approach is
driven by the following considerations:
1. Memory Fragmentation
THP can lead to increased memory fragmentation, so we want to limit its
use across services.
2. Performance Impact
Some services see no benefit from THP, making its usage unnecessary.
3. Performance Gains
Certain workloads, such as machine learning services, experience
significant performance improvements with THP, so we enable it for them
specifically.
Since multiple services run on a single host in a containerized environment,
enabling THP globally is not ideal. Previously, we set THP to madvise,
allowing selected services to opt in via MADV_HUGEPAGE. However, this
approach had a limitation:
- Some services inadvertently used madvise(MADV_HUGEPAGE) through
third-party libraries, bypassing our restrictions.
To address this issue, we initially hooked the __x64_sys_madvise() syscall,
which is error-injectable, to blacklist unwanted services. While this
worked, it was error-prone and ineffective for services needing always mode,
as modifying their code to use madvise was impractical.
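For reference, that earlier hook looked roughly like the sketch below (a
simplified illustration, not the exact program we deployed; the map and
program names are made up for this example):

#include "vmlinux.h"
#include <errno.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

char _license[] SEC("license") = "GPL";

#define MADV_HUGEPAGE 14 /* from uapi/asm-generic/mman-common.h */

/* tgids of services that must not opt in to THP */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, pid_t);
	__type(value, u8);
} blocked_tgids SEC(".maps");

SEC("fmod_ret/__x64_sys_madvise")
int BPF_PROG(block_madv_hugepage, struct pt_regs *regs)
{
	int advice = PT_REGS_PARM3_CORE_SYSCALL(regs);
	pid_t tgid = bpf_get_current_pid_tgid() >> 32;

	if (advice != MADV_HUGEPAGE)
		return 0;
	if (!bpf_map_lookup_elem(&blocked_tgids, &tgid))
		return 0;
	/* A non-zero fmod_ret return value is injected as the syscall's return. */
	return -EPERM;
}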
To achieve finer-grained control, we introduced an fmod_ret-based solution.
Now, we dynamically adjust THP settings per service by hooking
hugepage_global_{enabled,always}() via BPF. This allows us to set THP to
enable or disable on a per-service basis without global impact.
The hugepage_global_{enabled,always}() functions currently share the same
BPF hook, which limits THP configuration to either always or never. While
this suffices for our specific use cases, full support for all three modes
(always, madvise, and never) would require splitting them into separate
hooks.
This is the initial RFC patch—feedback is welcome!
Yafang Shao (4):
mm: move hugepage_global_{enabled,always}() to internal.h
mm: pass VMA parameter to hugepage_global_{enabled,always}()
mm: add BPF hook for THP adjustment
selftests/bpf: Add selftest for THP adjustment
include/linux/huge_mm.h | 54 +-----
mm/Makefile | 3 +
mm/bpf.c | 36 ++++
mm/bpf.h | 21 +++
mm/huge_memory.c | 50 ++++-
mm/internal.h | 21 +++
mm/khugepaged.c | 18 +-
tools/testing/selftests/bpf/config | 1 +
.../selftests/bpf/prog_tests/thp_adjust.c | 176 ++++++++++++++++++
.../selftests/bpf/progs/test_thp_adjust.c | 32 ++++
10 files changed, 344 insertions(+), 68 deletions(-)
create mode 100644 mm/bpf.c
create mode 100644 mm/bpf.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
--
2.43.5
* [RFC PATCH 1/4] mm: move hugepage_global_{enabled,always}() to internal.h
2025-04-29 2:41 [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment Yafang Shao
@ 2025-04-29 2:41 ` Yafang Shao
2025-04-29 15:13 ` Zi Yan
2025-04-29 2:41 ` [RFC PATCH 2/4] mm: pass VMA parameter to hugepage_global_{enabled,always}() Yafang Shao
` (4 subsequent siblings)
5 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2025-04-29 2:41 UTC (permalink / raw)
To: akpm, ast, daniel, andrii; +Cc: bpf, linux-mm, Yafang Shao
The functions hugepage_global_{enabled,always}() are currently only used in
mm/huge_memory.c, so we can move them to mm/internal.h. They will also be
exposed for BPF hooking in a future change.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
include/linux/huge_mm.h | 54 +----------------------------------------
mm/huge_memory.c | 46 ++++++++++++++++++++++++++++++++---
mm/internal.h | 14 +++++++++++
3 files changed, 57 insertions(+), 57 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e893d546a49f..5e92db48fc99 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -177,19 +177,6 @@ extern unsigned long huge_anon_orders_always;
extern unsigned long huge_anon_orders_madvise;
extern unsigned long huge_anon_orders_inherit;
-static inline bool hugepage_global_enabled(void)
-{
- return transparent_hugepage_flags &
- ((1<<TRANSPARENT_HUGEPAGE_FLAG) |
- (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG));
-}
-
-static inline bool hugepage_global_always(void)
-{
- return transparent_hugepage_flags &
- (1<<TRANSPARENT_HUGEPAGE_FLAG);
-}
-
static inline int highest_order(unsigned long orders)
{
return fls_long(orders) - 1;
@@ -260,49 +247,10 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
return orders;
}
-unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
- unsigned long vm_flags,
- unsigned long tva_flags,
- unsigned long orders);
-
-/**
- * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
- * @vma: the vm area to check
- * @vm_flags: use these vm_flags instead of vma->vm_flags
- * @tva_flags: Which TVA flags to honour
- * @orders: bitfield of all orders to consider
- *
- * Calculates the intersection of the requested hugepage orders and the allowed
- * hugepage orders for the provided vma. Permitted orders are encoded as a set
- * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3
- * corresponds to order-3, etc). Order-0 is never considered a hugepage order.
- *
- * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage
- * orders are allowed.
- */
-static inline
unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
unsigned long vm_flags,
unsigned long tva_flags,
- unsigned long orders)
-{
- /* Optimization to check if required orders are enabled early. */
- if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
- unsigned long mask = READ_ONCE(huge_anon_orders_always);
-
- if (vm_flags & VM_HUGEPAGE)
- mask |= READ_ONCE(huge_anon_orders_madvise);
- if (hugepage_global_always() ||
- ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
- mask |= READ_ONCE(huge_anon_orders_inherit);
-
- orders &= mask;
- if (!orders)
- return 0;
- }
-
- return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
-}
+ unsigned long orders);
struct thpsize {
struct kobject kobj;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a47682d1ab7..39afa14af2f2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -98,10 +98,10 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
}
-unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
- unsigned long vm_flags,
- unsigned long tva_flags,
- unsigned long orders)
+static unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
+ unsigned long vm_flags,
+ unsigned long tva_flags,
+ unsigned long orders)
{
bool smaps = tva_flags & TVA_SMAPS;
bool in_pf = tva_flags & TVA_IN_PF;
@@ -208,6 +208,44 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
return orders;
}
+/**
+ * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
+ * @vma: the vm area to check
+ * @vm_flags: use these vm_flags instead of vma->vm_flags
+ * @tva_flags: Which TVA flags to honour
+ * @orders: bitfield of all orders to consider
+ *
+ * Calculates the intersection of the requested hugepage orders and the allowed
+ * hugepage orders for the provided vma. Permitted orders are encoded as a set
+ * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3
+ * corresponds to order-3, etc). Order-0 is never considered a hugepage order.
+ *
+ * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage
+ * orders are allowed.
+ */
+unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
+ unsigned long vm_flags,
+ unsigned long tva_flags,
+ unsigned long orders)
+{
+ /* Optimization to check if required orders are enabled early. */
+ if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
+ unsigned long mask = READ_ONCE(huge_anon_orders_always);
+
+ if (vm_flags & VM_HUGEPAGE)
+ mask |= READ_ONCE(huge_anon_orders_madvise);
+ if (hugepage_global_always() ||
+ ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
+ mask |= READ_ONCE(huge_anon_orders_inherit);
+
+ orders &= mask;
+ if (!orders)
+ return 0;
+ }
+
+ return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
+}
+
static bool get_huge_zero_page(void)
{
struct folio *zero_folio;
diff --git a/mm/internal.h b/mm/internal.h
index e9695baa5922..462d85c2ba7b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1625,5 +1625,19 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
}
#endif /* CONFIG_PT_RECLAIM */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline bool hugepage_global_enabled(void)
+{
+ return transparent_hugepage_flags &
+ ((1<<TRANSPARENT_HUGEPAGE_FLAG) |
+ (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG));
+}
+
+static inline bool hugepage_global_always(void)
+{
+ return transparent_hugepage_flags &
+ (1<<TRANSPARENT_HUGEPAGE_FLAG);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif /* __MM_INTERNAL_H */
--
2.43.5
* [RFC PATCH 2/4] mm: pass VMA parameter to hugepage_global_{enabled,always}()
2025-04-29 2:41 [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment Yafang Shao
2025-04-29 2:41 ` [RFC PATCH 1/4] mm: move hugepage_global_{enabled,always}() to internal.h Yafang Shao
@ 2025-04-29 2:41 ` Yafang Shao
2025-04-29 15:31 ` Zi Yan
2025-04-29 2:41 ` [RFC PATCH 3/4] mm: add BPF hook for THP adjustment Yafang Shao
` (3 subsequent siblings)
5 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2025-04-29 2:41 UTC (permalink / raw)
To: akpm, ast, daniel, andrii; +Cc: bpf, linux-mm, Yafang Shao
We will use the new @vma parameter to determine whether THP can be used.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
mm/huge_memory.c | 8 ++++----
mm/internal.h | 8 ++++++--
mm/khugepaged.c | 18 +++++++++---------
3 files changed, 19 insertions(+), 15 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 39afa14af2f2..7a4a968c7874 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -176,8 +176,8 @@ static unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
* were already handled in thp_vma_allowable_orders().
*/
if (enforce_sysfs &&
- (!hugepage_global_enabled() || (!(vm_flags & VM_HUGEPAGE) &&
- !hugepage_global_always())))
+ (!hugepage_global_enabled(vma) || (!(vm_flags & VM_HUGEPAGE) &&
+ !hugepage_global_always(vma))))
return 0;
/*
@@ -234,8 +234,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
if (vm_flags & VM_HUGEPAGE)
mask |= READ_ONCE(huge_anon_orders_madvise);
- if (hugepage_global_always() ||
- ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
+ if (hugepage_global_always(vma) ||
+ ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled(vma)))
mask |= READ_ONCE(huge_anon_orders_inherit);
orders &= mask;
diff --git a/mm/internal.h b/mm/internal.h
index 462d85c2ba7b..aa698a11dd68 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1626,14 +1626,18 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
#endif /* CONFIG_PT_RECLAIM */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline bool hugepage_global_enabled(void)
+/*
+ * Checks whether a given @vma can use THP. If @vma is NULL, the check is
+ * performed globally by khugepaged during a system-wide scan.
+ */
+static inline bool hugepage_global_enabled(struct vm_area_struct *vma)
{
return transparent_hugepage_flags &
((1<<TRANSPARENT_HUGEPAGE_FLAG) |
(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG));
}
-static inline bool hugepage_global_always(void)
+static inline bool hugepage_global_always(struct vm_area_struct *vma)
{
return transparent_hugepage_flags &
(1<<TRANSPARENT_HUGEPAGE_FLAG);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cc945c6ab3bd..b85e36ddd7db 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -413,7 +413,7 @@ static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
test_bit(MMF_DISABLE_THP, &mm->flags);
}
-static bool hugepage_pmd_enabled(void)
+static bool hugepage_pmd_enabled(struct vm_area_struct *vma)
{
/*
* We cover the anon, shmem and the file-backed case here; file-backed
@@ -423,14 +423,14 @@ static bool hugepage_pmd_enabled(void)
* except when the global shmem_huge is set to SHMEM_HUGE_DENY.
*/
if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
- hugepage_global_enabled())
+ hugepage_global_enabled(vma))
return true;
if (test_bit(PMD_ORDER, &huge_anon_orders_always))
return true;
if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
return true;
if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
- hugepage_global_enabled())
+ hugepage_global_enabled(vma))
return true;
if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
return true;
@@ -473,7 +473,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
unsigned long vm_flags)
{
if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
- hugepage_pmd_enabled()) {
+ hugepage_pmd_enabled(vma)) {
if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
PMD_ORDER))
__khugepaged_enter(vma->vm_mm);
@@ -2516,7 +2516,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
static int khugepaged_has_work(void)
{
- return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
+ return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled(NULL);
}
static int khugepaged_wait_event(void)
@@ -2589,7 +2589,7 @@ static void khugepaged_wait_work(void)
return;
}
- if (hugepage_pmd_enabled())
+ if (hugepage_pmd_enabled(NULL))
wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
}
@@ -2620,7 +2620,7 @@ static void set_recommended_min_free_kbytes(void)
int nr_zones = 0;
unsigned long recommended_min;
- if (!hugepage_pmd_enabled()) {
+ if (!hugepage_pmd_enabled(NULL)) {
calculate_min_free_kbytes();
goto update_wmarks;
}
@@ -2670,7 +2670,7 @@ int start_stop_khugepaged(void)
int err = 0;
mutex_lock(&khugepaged_mutex);
- if (hugepage_pmd_enabled()) {
+ if (hugepage_pmd_enabled(NULL)) {
if (!khugepaged_thread)
khugepaged_thread = kthread_run(khugepaged, NULL,
"khugepaged");
@@ -2696,7 +2696,7 @@ int start_stop_khugepaged(void)
void khugepaged_min_free_kbytes_update(void)
{
mutex_lock(&khugepaged_mutex);
- if (hugepage_pmd_enabled() && khugepaged_thread)
+ if (hugepage_pmd_enabled(NULL) && khugepaged_thread)
set_recommended_min_free_kbytes();
mutex_unlock(&khugepaged_mutex);
}
--
2.43.5
* [RFC PATCH 3/4] mm: add BPF hook for THP adjustment
2025-04-29 2:41 [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment Yafang Shao
2025-04-29 2:41 ` [RFC PATCH 1/4] mm: move hugepage_global_{enabled,always}() to internal.h Yafang Shao
2025-04-29 2:41 ` [RFC PATCH 2/4] mm: pass VMA parameter to hugepage_global_{enabled,always}() Yafang Shao
@ 2025-04-29 2:41 ` Yafang Shao
2025-04-29 15:19 ` Alexei Starovoitov
2025-04-29 2:41 ` [RFC PATCH 4/4] selftests/bpf: Add selftest " Yafang Shao
` (2 subsequent siblings)
5 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2025-04-29 2:41 UTC (permalink / raw)
To: akpm, ast, daniel, andrii; +Cc: bpf, linux-mm, Yafang Shao
We will use the @vma parameter in BPF programs to determine whether THP can
be used. The typical workflow is as follows:
1. Retrieve the mm_struct from the given @vma.
2. Obtain the task_struct associated with that mm_struct.
   This depends on CONFIG_MEMCG, since mm->owner is only available when it
   is enabled.
3. Adjust THP behavior dynamically based on task attributes,
   e.g., based on the task's cgroup.
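
A sketch of such a BPF program (illustrative only; the selftest added in the
next patch uses a simpler pid-based check instead of a cgroup check):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

u64 allowed_cgrp_id; /* set from user space before attaching */

SEC("fmod_ret/mm_bpf_thp_vma_allowable")
int BPF_PROG(thp_vma_allowable, struct vm_area_struct *vma)
{
	struct task_struct *p;
	struct mm_struct *mm;
	u64 cgrp_id;

	if (!vma)
		return 0;
	/* 1. mm_struct from the vma */
	mm = vma->vm_mm;
	if (!mm)
		return 0;
	/* 2. owning task; mm->owner requires CONFIG_MEMCG */
	p = mm->owner;
	if (!p)
		return 0;
	/* 3. policy based on task attributes, here its memory cgroup */
	cgrp_id = p->cgroups->subsys[memory_cgrp_id]->cgroup->kn->id;
	if (cgrp_id == allowed_cgrp_id)
		return 1;	/* MM_BPF_ALLOWABLE */
	return -1;		/* MM_BPF_NOT_ALLOWABLE */
}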
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
mm/Makefile | 3 +++
mm/bpf.c | 36 ++++++++++++++++++++++++++++++++++++
mm/bpf.h | 21 +++++++++++++++++++++
mm/internal.h | 3 +++
4 files changed, 63 insertions(+)
create mode 100644 mm/bpf.c
create mode 100644 mm/bpf.h
diff --git a/mm/Makefile b/mm/Makefile
index e7f6bbf8ae5f..97055da04746 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,6 +99,9 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_NUMA) += memory-tiers.o
obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+ifdef CONFIG_BPF_SYSCALL
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += bpf.o
+endif
obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
diff --git a/mm/bpf.c b/mm/bpf.c
new file mode 100644
index 000000000000..72eebcdbad56
--- /dev/null
+++ b/mm/bpf.c
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Author: Yafang Shao <laoar.shao@gmail.com>
+ */
+
+#include <linux/bpf.h>
+#include <linux/mm_types.h>
+
+__bpf_hook_start();
+
+/* Checks if this @vma can use THP. */
+__weak noinline int
+mm_bpf_thp_vma_allowable(struct vm_area_struct *vma)
+{
+ /* At present, fmod_ret exclusively uses 0 to signify that the return
+ * value remains unchanged.
+ */
+ return 0;
+}
+
+__bpf_hook_end();
+
+BTF_SET8_START(mm_bpf_fmod_ret_ids)
+BTF_ID_FLAGS(func, mm_bpf_thp_vma_allowable)
+BTF_SET8_END(mm_bpf_fmod_ret_ids)
+
+static const struct btf_kfunc_id_set mm_bpf_fmodret_set = {
+ .owner = THIS_MODULE,
+ .set = &mm_bpf_fmod_ret_ids,
+};
+
+static int __init bpf_mm_kfunc_init(void)
+{
+ return register_btf_fmodret_id_set(&mm_bpf_fmodret_set);
+}
+late_initcall(bpf_mm_kfunc_init);
diff --git a/mm/bpf.h b/mm/bpf.h
new file mode 100644
index 000000000000..e03a38084b08
--- /dev/null
+++ b/mm/bpf.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef __MM_BPF_H
+#define __MM_BPF_H
+
+#define MM_BPF_ALLOWABLE (1)
+#define MM_BPF_NOT_ALLOWABLE (-1)
+
+#define MM_BPF_ALLOWABLE_HOOK(func, args...) { \
+ int ret = func(args); \
+ \
+ if (ret == MM_BPF_ALLOWABLE) \
+ return 1; \
+ if (ret == MM_BPF_NOT_ALLOWABLE) \
+ return 0; \
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int mm_bpf_thp_vma_allowable(struct vm_area_struct *vma);
+#endif
+
+#endif
diff --git a/mm/internal.h b/mm/internal.h
index aa698a11dd68..c8bf405fa581 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -21,6 +21,7 @@
/* Internal core VMA manipulation functions. */
#include "vma.h"
+#include "bpf.h"
struct folio_batch;
@@ -1632,6 +1633,7 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
*/
static inline bool hugepage_global_enabled(struct vm_area_struct *vma)
{
+ MM_BPF_ALLOWABLE_HOOK(mm_bpf_thp_vma_allowable, vma);
return transparent_hugepage_flags &
((1<<TRANSPARENT_HUGEPAGE_FLAG) |
(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG));
@@ -1639,6 +1641,7 @@ static inline bool hugepage_global_enabled(struct vm_area_struct *vma)
static inline bool hugepage_global_always(struct vm_area_struct *vma)
{
+ MM_BPF_ALLOWABLE_HOOK(mm_bpf_thp_vma_allowable, vma);
return transparent_hugepage_flags &
(1<<TRANSPARENT_HUGEPAGE_FLAG);
}
--
2.43.5
* [RFC PATCH 4/4] selftests/bpf: Add selftest for THP adjustment
2025-04-29 2:41 [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment Yafang Shao
` (2 preceding siblings ...)
2025-04-29 2:41 ` [RFC PATCH 3/4] mm: add BPF hook for THP adjustment Yafang Shao
@ 2025-04-29 2:41 ` Yafang Shao
2025-04-29 3:11 ` [RFC PATCH 0/4] mm, bpf: BPF based " Matthew Wilcox
2025-04-29 15:09 ` Zi Yan
5 siblings, 0 replies; 41+ messages in thread
From: Yafang Shao @ 2025-04-29 2:41 UTC (permalink / raw)
To: akpm, ast, daniel, andrii; +Cc: bpf, linux-mm, Yafang Shao
In this test case, we intentionally reject THP allocations via
madvise(MADV_HUGEPAGE) when THP is configured in either 'madvise' or
'always' mode. To prevent spurious failures (e.g., due to insufficient
memory for THP allocation), we deliberately omit testing the THP allocation
path when the system is configured with THP 'never' mode.
The result is as follows:
$ ./test_progs --name="thp_adjust"
#437 thp_adjust:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
CONFIG_TRANSPARENT_HUGEPAGE=y is required for this test.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
tools/testing/selftests/bpf/config | 1 +
.../selftests/bpf/prog_tests/thp_adjust.c | 176 ++++++++++++++++++
.../selftests/bpf/progs/test_thp_adjust.c | 32 ++++
3 files changed, 209 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index c378d5d07e02..bb8a8a9d77a2 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -113,3 +113,4 @@ CONFIG_XDP_SOCKETS=y
CONFIG_XFRM_INTERFACE=y
CONFIG_TCP_CONG_DCTCP=y
CONFIG_TCP_CONG_BBR=y
+CONFIG_TRANSPARENT_HUGEPAGE=y
diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
new file mode 100644
index 000000000000..bc307dac5bda
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
@@ -0,0 +1,176 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <sys/mman.h>
+#include <test_progs.h>
+#include "test_thp_adjust.skel.h"
+
+#define LEN (4 * 1024 * 1024) /* 4MB */
+#define THP_ENABLED_PATH "/sys/kernel/mm/transparent_hugepage/enabled"
+#define SMAPS_PATH "/proc/self/smaps"
+#define ANON_HUGE_PAGES "AnonHugePages:"
+
+static bool need_reset;
+static char *thp_addr;
+
+int parse_thp_setting(const char *buf)
+{
+ const char *start = strchr(buf, '[');
+ const char *end = start ? strchr(start, ']') : NULL;
+ char setting[32] = {0};
+ size_t len;
+
+ if (!start || !end || end <= start)
+ return -1;
+
+ len = end - start - 1;
+ if (len >= sizeof(setting))
+ len = sizeof(setting) - 1;
+
+ strncpy(setting, start + 1, len);
+ setting[len] = '\0';
+
+ if (strcmp(setting, "madvise") == 0 || strcmp(setting, "always") == 0)
+ return 0;
+ return 1;
+}
+
+int thp_set(void)
+{
+ const char *desired_value = "madvise";
+ char buf[32] = {0};
+ int fd, err;
+
+ fd = open(THP_ENABLED_PATH, O_RDWR);
+ if (fd == -1)
+ return -1;
+
+ err = read(fd, buf, sizeof(buf) - 1);
+ if (err == -1)
+ goto close_fd;
+
+ err = parse_thp_setting(buf);
+ if (err == -1 || err == 0)
+ goto close_fd;
+
+ err = lseek(fd, 0, SEEK_SET);
+ if (err == -1)
+ goto close_fd;
+
+ err = write(fd, desired_value, strlen(desired_value));
+ if (err == -1)
+ goto close_fd;
+ need_reset = true;
+
+close_fd:
+ close(fd);
+ return err;
+}
+
+int thp_reset(void)
+{
+ int fd, err;
+
+ if (!need_reset)
+ return 0;
+
+ fd = open(THP_ENABLED_PATH, O_WRONLY);
+ if (fd == -1)
+ return -1;
+
+ err = write(fd, "never", strlen("never"));
+ close(fd);
+ return err;
+}
+
+int thp_alloc(void)
+{
+ int err, i;
+
+ thp_addr = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (thp_addr == MAP_FAILED)
+ return -1;
+
+ err = madvise(thp_addr, LEN, MADV_HUGEPAGE);
+ if (err == -1)
+ goto unmap;
+
+ for (i = 0; i < LEN; i += 4096)
+ thp_addr[i] = 1;
+ return 0;
+
+unmap:
+ munmap(thp_addr, LEN);
+ return -1;
+}
+
+void thp_free(void)
+{
+ if (!thp_addr)
+ return;
+ munmap(thp_addr, LEN);
+}
+
+int thp_size(void)
+{
+ unsigned long total_kb = 0;
+ char *line, *saveptr;
+ ssize_t bytes_read;
+ char buf[4096];
+ int fd;
+
+ fd = open(SMAPS_PATH, O_RDONLY);
+ if (fd == -1)
+ return -1;
+
+ while ((bytes_read = read(fd, buf, sizeof(buf) - 1)) > 0) {
+ buf[bytes_read] = '\0';
+ line = strtok_r(buf, "\n", &saveptr);
+ while (line) {
+ if (strstr(line, ANON_HUGE_PAGES)) {
+ unsigned long kb;
+
+ if (sscanf(line + strlen(ANON_HUGE_PAGES), "%lu", &kb) == 1)
+ total_kb += kb;
+ }
+ line = strtok_r(NULL, "\n", &saveptr);
+ }
+ }
+
+ if (bytes_read == -1)
+ total_kb = -1;
+
+ close(fd);
+ return total_kb;
+}
+
+void test_thp_adjust(void)
+{
+ struct test_thp_adjust *skel;
+ int err;
+
+ skel = test_thp_adjust__open();
+ if (!ASSERT_OK_PTR(skel, "open"))
+ return;
+
+ skel->bss->target_pid = getpid();
+
+ err = test_thp_adjust__load(skel);
+ if (!ASSERT_OK(err, "load"))
+ goto destroy;
+
+ err = test_thp_adjust__attach(skel);
+ if (!ASSERT_OK(err, "attach"))
+ goto destroy;
+
+ if (!ASSERT_NEQ(thp_set(), -1, "THP set"))
+ goto destroy;
+ if (!ASSERT_NEQ(thp_alloc(), -1, "THP alloc"))
+ goto thp_reset;
+ ASSERT_EQ(thp_size(), 0, "THP size");
+ thp_free();
+
+thp_reset:
+ ASSERT_NEQ(thp_reset(), -1, "THP reset");
+destroy:
+ test_thp_adjust__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust.c b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
new file mode 100644
index 000000000000..45026bba2c8d
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+#define MM_BPF_ALLOWABLE (1)
+#define MM_BPF_NOT_ALLOWABLE (-1)
+
+int target_pid;
+
+SEC("fmod_ret/mm_bpf_thp_vma_allowable")
+int BPF_PROG(thp_vma_allowable, struct vm_area_struct *vma)
+{
+ struct task_struct *p;
+ struct mm_struct *mm;
+
+ if (!vma)
+ return 0;
+
+ mm = vma->vm_mm;
+ if (!mm)
+ return 0;
+
+ p = mm->owner;
+ /* The target task is not allowed to use THP. */
+ if (p->pid == target_pid)
+ return MM_BPF_NOT_ALLOWABLE;
+ return 0;
+}
--
2.43.5
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-04-29 2:41 [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment Yafang Shao
` (3 preceding siblings ...)
2025-04-29 2:41 ` [RFC PATCH 4/4] selftests/bpf: Add selftest " Yafang Shao
@ 2025-04-29 3:11 ` Matthew Wilcox
2025-04-29 4:53 ` Yafang Shao
2025-04-29 15:09 ` Zi Yan
5 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2025-04-29 3:11 UTC (permalink / raw)
To: Yafang Shao; +Cc: akpm, ast, daniel, andrii, bpf, linux-mm
On Tue, Apr 29, 2025 at 10:41:35AM +0800, Yafang Shao wrote:
> In our container environment, we aim to enable THP selectively—allowing
> specific services to use it while restricting others. This approach is
> driven by the following considerations:
>
> 1. Memory Fragmentation
> THP can lead to increased memory fragmentation, so we want to limit its
> use across services.
What? That's precisely wrong. _not_ using THPs increases
fragmentation.
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-04-29 3:11 ` [RFC PATCH 0/4] mm, bpf: BPF based " Matthew Wilcox
@ 2025-04-29 4:53 ` Yafang Shao
0 siblings, 0 replies; 41+ messages in thread
From: Yafang Shao @ 2025-04-29 4:53 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: akpm, ast, daniel, andrii, bpf, linux-mm
On Tue, Apr 29, 2025 at 11:11 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Apr 29, 2025 at 10:41:35AM +0800, Yafang Shao wrote:
> > In our container environment, we aim to enable THP selectively—allowing
> > specific services to use it while restricting others. This approach is
> > driven by the following considerations:
> >
> > 1. Memory Fragmentation
> > THP can lead to increased memory fragmentation, so we want to limit its
> > use across services.
>
> What? That's precisely wrong. _not_ using THPs increases
> fragmentation.
It appears my previous explanation about memory fragmentation wasn't
clear enough.
To clarify, when I mention "memory fragmentation" in the context of
THP, I'm specifically referring to how it can increase memory
compaction activity. Additionally, I should have mentioned another
significant drawback of THP: memory wastage.
--
Regards
Yafang
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-04-29 2:41 [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment Yafang Shao
` (4 preceding siblings ...)
2025-04-29 3:11 ` [RFC PATCH 0/4] mm, bpf: BPF based " Matthew Wilcox
@ 2025-04-29 15:09 ` Zi Yan
2025-04-30 2:33 ` Yafang Shao
5 siblings, 1 reply; 41+ messages in thread
From: Zi Yan @ 2025-04-29 15:09 UTC (permalink / raw)
To: Yafang Shao, akpm, ast, daniel, andrii, David Hildenbrand,
Baolin Wang, Lorenzo Stoakes, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain
Cc: bpf, linux-mm
Hi Yafang,
We recently added a new THP entry to the MAINTAINERS file [1]; do you mind
ccing the people listed there in your next version? (I added them here)
[1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/MAINTAINERS?h=mm-everything#n15589
On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
> In our container environment, we aim to enable THP selectively—allowing
> specific services to use it while restricting others. This approach is
> driven by the following considerations:
>
> 1. Memory Fragmentation
> THP can lead to increased memory fragmentation, so we want to limit its
> use across services.
> 2. Performance Impact
> Some services see no benefit from THP, making its usage unnecessary.
> 3. Performance Gains
> Certain workloads, such as machine learning services, experience
> significant performance improvements with THP, so we enable it for them
> specifically.
>
> Since multiple services run on a single host in a containerized environment,
> enabling THP globally is not ideal. Previously, we set THP to madvise,
> allowing selected services to opt in via MADV_HUGEPAGE. However, this
> approach had limitation:
>
> - Some services inadvertently used madvise(MADV_HUGEPAGE) through
> third-party libraries, bypassing our restrictions.
Basically, you want more precise control over THP enablement and the
ability to override madvise() from userspace.
In terms of overriding madvise(), do you have any concrete example of
these third-party libraries? madvise() users are supposed to know what
they are doing, so I wonder why they are causing trouble in your
environment.
>
> To address this issue, we initially hooked the __x64_sys_madvise() syscall,
> which is error-injectable, to blacklist unwanted services. While this
> worked, it was error-prone and ineffective for services needing always mode,
> as modifying their code to use madvise was impractical.
>
> To achieve finer-grained control, we introduced an fmod_ret-based solution.
> Now, we dynamically adjust THP settings per service by hooking
> hugepage_global_{enabled,always}() via BPF. This allows us to set THP to
> enable or disable on a per-service basis without global impact.
hugepage_global_*() are system-wide knobs. How did you use them to
achieve per-service control? In terms of per-service, does it mean
you need per-memcg (I assume each service has its own memcg) THP
configuration?
>
> The hugepage_global_{enabled,always}() functions currently share the same
> BPF hook, which limits THP configuration to either always or never. While
> this suffices for our specific use cases, full support for all three modes
> (always, madvise, and never) would require splitting them into separate
> hooks.
>
> This is the initial RFC patch—feedback is welcome!
>
> Yafang Shao (4):
> mm: move hugepage_global_{enabled,always}() to internal.h
> mm: pass VMA parameter to hugepage_global_{enabled,always}()
> mm: add BPF hook for THP adjustment
> selftests/bpf: Add selftest for THP adjustment
>
> include/linux/huge_mm.h | 54 +-----
> mm/Makefile | 3 +
> mm/bpf.c | 36 ++++
> mm/bpf.h | 21 +++
> mm/huge_memory.c | 50 ++++-
> mm/internal.h | 21 +++
> mm/khugepaged.c | 18 +-
> tools/testing/selftests/bpf/config | 1 +
> .../selftests/bpf/prog_tests/thp_adjust.c | 176 ++++++++++++++++++
> .../selftests/bpf/progs/test_thp_adjust.c | 32 ++++
> 10 files changed, 344 insertions(+), 68 deletions(-)
> create mode 100644 mm/bpf.c
> create mode 100644 mm/bpf.h
> create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
> create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
--
Best Regards,
Yan, Zi
* Re: [RFC PATCH 1/4] mm: move hugepage_global_{enabled,always}() to internal.h
2025-04-29 2:41 ` [RFC PATCH 1/4] mm: move hugepage_global_{enabled,always}() to internal.h Yafang Shao
@ 2025-04-29 15:13 ` Zi Yan
2025-04-30 2:40 ` Yafang Shao
0 siblings, 1 reply; 41+ messages in thread
From: Zi Yan @ 2025-04-29 15:13 UTC (permalink / raw)
To: Yafang Shao, akpm, ast, daniel, andrii; +Cc: bpf, linux-mm
On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
> The functions hugepage_global_{enabled,always}() are currently only used in
> mm/huge_memory.c, so we can move them to mm/internal.h. They will also be
> exposed for BPF hooking in a future change.
Why cannot BPF include huge_mm.h instead?
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
> include/linux/huge_mm.h | 54 +----------------------------------------
> mm/huge_memory.c | 46 ++++++++++++++++++++++++++++++++---
> mm/internal.h | 14 +++++++++++
> 3 files changed, 57 insertions(+), 57 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index e893d546a49f..5e92db48fc99 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -177,19 +177,6 @@ extern unsigned long huge_anon_orders_always;
> extern unsigned long huge_anon_orders_madvise;
> extern unsigned long huge_anon_orders_inherit;
>
> -static inline bool hugepage_global_enabled(void)
> -{
> - return transparent_hugepage_flags &
> - ((1<<TRANSPARENT_HUGEPAGE_FLAG) |
> - (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG));
> -}
> -
> -static inline bool hugepage_global_always(void)
> -{
> - return transparent_hugepage_flags &
> - (1<<TRANSPARENT_HUGEPAGE_FLAG);
> -}
> -
> static inline int highest_order(unsigned long orders)
> {
> return fls_long(orders) - 1;
> @@ -260,49 +247,10 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
> return orders;
> }
>
> -unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> - unsigned long vm_flags,
> - unsigned long tva_flags,
> - unsigned long orders);
> -
> -/**
> - * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
> - * @vma: the vm area to check
> - * @vm_flags: use these vm_flags instead of vma->vm_flags
> - * @tva_flags: Which TVA flags to honour
> - * @orders: bitfield of all orders to consider
> - *
> - * Calculates the intersection of the requested hugepage orders and the allowed
> - * hugepage orders for the provided vma. Permitted orders are encoded as a set
> - * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3
> - * corresponds to order-3, etc). Order-0 is never considered a hugepage order.
> - *
> - * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage
> - * orders are allowed.
> - */
> -static inline
> unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> unsigned long vm_flags,
> unsigned long tva_flags,
> - unsigned long orders)
> -{
> - /* Optimization to check if required orders are enabled early. */
> - if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
> - unsigned long mask = READ_ONCE(huge_anon_orders_always);
> -
> - if (vm_flags & VM_HUGEPAGE)
> - mask |= READ_ONCE(huge_anon_orders_madvise);
> - if (hugepage_global_always() ||
> - ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
> - mask |= READ_ONCE(huge_anon_orders_inherit);
> -
> - orders &= mask;
> - if (!orders)
> - return 0;
> - }
> -
> - return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
> -}
> + unsigned long orders);
>
> struct thpsize {
> struct kobject kobj;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2a47682d1ab7..39afa14af2f2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -98,10 +98,10 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
> return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
> }
>
> -unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> - unsigned long vm_flags,
> - unsigned long tva_flags,
> - unsigned long orders)
> +static unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> + unsigned long vm_flags,
> + unsigned long tva_flags,
> + unsigned long orders)
> {
> bool smaps = tva_flags & TVA_SMAPS;
> bool in_pf = tva_flags & TVA_IN_PF;
> @@ -208,6 +208,44 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> return orders;
> }
>
> +/**
> + * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
> + * @vma: the vm area to check
> + * @vm_flags: use these vm_flags instead of vma->vm_flags
> + * @tva_flags: Which TVA flags to honour
> + * @orders: bitfield of all orders to consider
> + *
> + * Calculates the intersection of the requested hugepage orders and the allowed
> + * hugepage orders for the provided vma. Permitted orders are encoded as a set
> + * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3
> + * corresponds to order-3, etc). Order-0 is never considered a hugepage order.
> + *
> + * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage
> + * orders are allowed.
> + */
> +unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> + unsigned long vm_flags,
> + unsigned long tva_flags,
> + unsigned long orders)
> +{
> + /* Optimization to check if required orders are enabled early. */
> + if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
> + unsigned long mask = READ_ONCE(huge_anon_orders_always);
> +
> + if (vm_flags & VM_HUGEPAGE)
> + mask |= READ_ONCE(huge_anon_orders_madvise);
> + if (hugepage_global_always() ||
> + ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
> + mask |= READ_ONCE(huge_anon_orders_inherit);
> +
> + orders &= mask;
> + if (!orders)
> + return 0;
> + }
> +
> + return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
> +}
> +
> static bool get_huge_zero_page(void)
> {
> struct folio *zero_folio;
> diff --git a/mm/internal.h b/mm/internal.h
> index e9695baa5922..462d85c2ba7b 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1625,5 +1625,19 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
> }
> #endif /* CONFIG_PT_RECLAIM */
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static inline bool hugepage_global_enabled(void)
> +{
> + return transparent_hugepage_flags &
> + ((1<<TRANSPARENT_HUGEPAGE_FLAG) |
> + (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG));
> +}
> +
> +static inline bool hugepage_global_always(void)
> +{
> + return transparent_hugepage_flags &
> + (1<<TRANSPARENT_HUGEPAGE_FLAG);
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> #endif /* __MM_INTERNAL_H */
--
Best Regards,
Yan, Zi
* Re: [RFC PATCH 3/4] mm: add BPF hook for THP adjustment
2025-04-29 2:41 ` [RFC PATCH 3/4] mm: add BPF hook for THP adjustment Yafang Shao
@ 2025-04-29 15:19 ` Alexei Starovoitov
2025-04-30 2:48 ` Yafang Shao
0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2025-04-29 15:19 UTC (permalink / raw)
To: Yafang Shao
Cc: Andrew Morton, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, bpf, linux-mm
On Mon, Apr 28, 2025 at 7:42 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> We will use the @vma parameter in BPF programs to determine whether THP can
> be used. The typical workflow is as follows:
>
> 1. Retrieve the mm_struct from the given @vma.
> 2. Obtain the task_struct associated with the mm_struct
> It depends on CONFIG_MEMCG.
> 3. Adjust THP behavior dynamically based on task attributes
> E.g., based on the task’s cgroup
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
> mm/Makefile | 3 +++
> mm/bpf.c | 36 ++++++++++++++++++++++++++++++++++++
> mm/bpf.h | 21 +++++++++++++++++++++
> mm/internal.h | 3 +++
> 4 files changed, 63 insertions(+)
> create mode 100644 mm/bpf.c
> create mode 100644 mm/bpf.h
>
> diff --git a/mm/Makefile b/mm/Makefile
> index e7f6bbf8ae5f..97055da04746 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -99,6 +99,9 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> obj-$(CONFIG_NUMA) += memory-tiers.o
> obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> +ifdef CONFIG_BPF_SYSCALL
> +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += bpf.o
> +endif
> obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> diff --git a/mm/bpf.c b/mm/bpf.c
> new file mode 100644
> index 000000000000..72eebcdbad56
> --- /dev/null
> +++ b/mm/bpf.c
> @@ -0,0 +1,36 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Author: Yafang Shao <laoar.shao@gmail.com>
> + */
> +
> +#include <linux/bpf.h>
> +#include <linux/mm_types.h>
> +
> +__bpf_hook_start();
> +
> +/* Checks if this @vma can use THP. */
> +__weak noinline int
> +mm_bpf_thp_vma_allowable(struct vm_area_struct *vma)
> +{
> + /* At present, fmod_ret exclusively uses 0 to signify that the return
> + * value remains unchanged.
> + */
> + return 0;
> +}
> +
> +__bpf_hook_end();
> +
> +BTF_SET8_START(mm_bpf_fmod_ret_ids)
> +BTF_ID_FLAGS(func, mm_bpf_thp_vma_allowable)
> +BTF_SET8_END(mm_bpf_fmod_ret_ids)
> +
> +static const struct btf_kfunc_id_set mm_bpf_fmodret_set = {
> + .owner = THIS_MODULE,
> + .set = &mm_bpf_fmod_ret_ids,
> +};
> +
> +static int __init bpf_mm_kfunc_init(void)
> +{
> + return register_btf_fmodret_id_set(&mm_bpf_fmodret_set);
> +}
> +late_initcall(bpf_mm_kfunc_init);
> diff --git a/mm/bpf.h b/mm/bpf.h
> new file mode 100644
> index 000000000000..e03a38084b08
> --- /dev/null
> +++ b/mm/bpf.h
> @@ -0,0 +1,21 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +#ifndef __MM_BPF_H
> +#define __MM_BPF_H
> +
> +#define MM_BPF_ALLOWABLE (1)
> +#define MM_BPF_NOT_ALLOWABLE (-1)
> +
> +#define MM_BPF_ALLOWABLE_HOOK(func, args...) { \
> + int ret = func(args); \
> + \
> + if (ret == MM_BPF_ALLOWABLE) \
> + return 1; \
> + if (ret == MM_BPF_NOT_ALLOWABLE) \
> + return 0; \
> +}
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +int mm_bpf_thp_vma_allowable(struct vm_area_struct *vma);
> +#endif
> +
> +#endif
> diff --git a/mm/internal.h b/mm/internal.h
> index aa698a11dd68..c8bf405fa581 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -21,6 +21,7 @@
>
> /* Internal core VMA manipulation functions. */
> #include "vma.h"
> +#include "bpf.h"
>
> struct folio_batch;
>
> @@ -1632,6 +1633,7 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
> */
> static inline bool hugepage_global_enabled(struct vm_area_struct *vma)
> {
> + MM_BPF_ALLOWABLE_HOOK(mm_bpf_thp_vma_allowable, vma);
> return transparent_hugepage_flags &
> ((1<<TRANSPARENT_HUGEPAGE_FLAG) |
> (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG));
> @@ -1639,6 +1641,7 @@ static inline bool hugepage_global_enabled(struct vm_area_struct *vma)
>
> static inline bool hugepage_global_always(struct vm_area_struct *vma)
> {
> + MM_BPF_ALLOWABLE_HOOK(mm_bpf_thp_vma_allowable, vma);
Please define a clean struct_ops based interface and demonstrate
the generality of the api with both bpf prog and a kernel module.
Do not use fmod_ret since it's global while struct_ops can be made
scoped for use case. Ex: per cgroup.
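Roughly like the sketch below (all names are placeholders, just to show the
shape of such an interface; the kernel side would register it through the
struct_ops infrastructure and consult the attached ops where the fmod_ret
hook sits today):

/* kernel side */
struct bpf_thp_ops {
	/* > 0: allow THP for @vma, < 0: deny, 0: follow the sysfs settings */
	int (*vma_allowable)(struct vm_area_struct *vma);
};

/* BPF side */
SEC("struct_ops/vma_allowable")
int BPF_PROG(vma_allowable, struct vm_area_struct *vma)
{
	return 0; /* policy goes here */
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_policy = {
	.vma_allowable = (void *)vma_allowable,
};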
* Re: [RFC PATCH 2/4] mm: pass VMA parameter to hugepage_global_{enabled,always}()
2025-04-29 2:41 ` [RFC PATCH 2/4] mm: pass VMA parameter to hugepage_global_{enabled,always}() Yafang Shao
@ 2025-04-29 15:31 ` Zi Yan
2025-04-30 2:46 ` Yafang Shao
0 siblings, 1 reply; 41+ messages in thread
From: Zi Yan @ 2025-04-29 15:31 UTC (permalink / raw)
To: Yafang Shao, akpm, ast, daniel, andrii; +Cc: bpf, linux-mm
On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
> We will use the new @vma parameter to determine whether THP can be used.
This is wrong and a complete hack. hugepage_global_*() are system-wide
functions, so they do not take VMAs. Furthermore, the VMA passed in
is not used at all. I notice that in a later patch the VMA is used by BPF
hooks, but that does not justify the addition.
If you really want to do this, you can add new functions that take VMA
as an input and check hugepage_global_*() to replace some of the if
conditions below. Something like hugepage_vma_{enable,always}.
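For example (untested, just to show the idea; hugepage_global_*() stay
system-wide and the VMA-aware wrappers are what the callers and your BPF
hook would use):

static inline bool hugepage_vma_enable(struct vm_area_struct *vma)
{
	/* a per-VMA override (e.g. your BPF hook) could be checked here */
	return hugepage_global_enabled();
}

static inline bool hugepage_vma_always(struct vm_area_struct *vma)
{
	return hugepage_global_always();
}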
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
> mm/huge_memory.c | 8 ++++----
> mm/internal.h | 8 ++++++--
> mm/khugepaged.c | 18 +++++++++---------
> 3 files changed, 19 insertions(+), 15 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 39afa14af2f2..7a4a968c7874 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -176,8 +176,8 @@ static unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> * were already handled in thp_vma_allowable_orders().
> */
> if (enforce_sysfs &&
> - (!hugepage_global_enabled() || (!(vm_flags & VM_HUGEPAGE) &&
> - !hugepage_global_always())))
> + (!hugepage_global_enabled(vma) || (!(vm_flags & VM_HUGEPAGE) &&
> + !hugepage_global_always(vma))))
> return 0;
>
> /*
> @@ -234,8 +234,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>
> if (vm_flags & VM_HUGEPAGE)
> mask |= READ_ONCE(huge_anon_orders_madvise);
> - if (hugepage_global_always() ||
> - ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
> + if (hugepage_global_always(vma) ||
> + ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled(vma)))
> mask |= READ_ONCE(huge_anon_orders_inherit);
>
> orders &= mask;
> diff --git a/mm/internal.h b/mm/internal.h
> index 462d85c2ba7b..aa698a11dd68 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1626,14 +1626,18 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
> #endif /* CONFIG_PT_RECLAIM */
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -static inline bool hugepage_global_enabled(void)
> +/*
> + * Checks whether a given @vma can use THP. If @vma is NULL, the check is
> + * performed globally by khugepaged during a system-wide scan.
> + */
> +static inline bool hugepage_global_enabled(struct vm_area_struct *vma)
> {
> return transparent_hugepage_flags &
> ((1<<TRANSPARENT_HUGEPAGE_FLAG) |
> (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG));
> }
>
> -static inline bool hugepage_global_always(void)
> +static inline bool hugepage_global_always(struct vm_area_struct *vma)
> {
> return transparent_hugepage_flags &
> (1<<TRANSPARENT_HUGEPAGE_FLAG);
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index cc945c6ab3bd..b85e36ddd7db 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -413,7 +413,7 @@ static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
> test_bit(MMF_DISABLE_THP, &mm->flags);
> }
>
> -static bool hugepage_pmd_enabled(void)
> +static bool hugepage_pmd_enabled(struct vm_area_struct *vma)
> {
> /*
> * We cover the anon, shmem and the file-backed case here; file-backed
> @@ -423,14 +423,14 @@ static bool hugepage_pmd_enabled(void)
> * except when the global shmem_huge is set to SHMEM_HUGE_DENY.
> */
> if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
> - hugepage_global_enabled())
> + hugepage_global_enabled(vma))
> return true;
> if (test_bit(PMD_ORDER, &huge_anon_orders_always))
> return true;
> if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
> return true;
> if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
> - hugepage_global_enabled())
> + hugepage_global_enabled(vma))
> return true;
> if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
> return true;
> @@ -473,7 +473,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
> unsigned long vm_flags)
> {
> if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
> - hugepage_pmd_enabled()) {
> + hugepage_pmd_enabled(vma)) {
> if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
> PMD_ORDER))
> __khugepaged_enter(vma->vm_mm);
> @@ -2516,7 +2516,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>
> static int khugepaged_has_work(void)
> {
> - return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
> + return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled(NULL);
> }
>
> static int khugepaged_wait_event(void)
> @@ -2589,7 +2589,7 @@ static void khugepaged_wait_work(void)
> return;
> }
>
> - if (hugepage_pmd_enabled())
> + if (hugepage_pmd_enabled(NULL))
> wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
> }
>
> @@ -2620,7 +2620,7 @@ static void set_recommended_min_free_kbytes(void)
> int nr_zones = 0;
> unsigned long recommended_min;
>
> - if (!hugepage_pmd_enabled()) {
> + if (!hugepage_pmd_enabled(NULL)) {
> calculate_min_free_kbytes();
> goto update_wmarks;
> }
> @@ -2670,7 +2670,7 @@ int start_stop_khugepaged(void)
> int err = 0;
>
> mutex_lock(&khugepaged_mutex);
> - if (hugepage_pmd_enabled()) {
> + if (hugepage_pmd_enabled(NULL)) {
> if (!khugepaged_thread)
> khugepaged_thread = kthread_run(khugepaged, NULL,
> "khugepaged");
> @@ -2696,7 +2696,7 @@ int start_stop_khugepaged(void)
> void khugepaged_min_free_kbytes_update(void)
> {
> mutex_lock(&khugepaged_mutex);
> - if (hugepage_pmd_enabled() && khugepaged_thread)
> + if (hugepage_pmd_enabled(NULL) && khugepaged_thread)
> set_recommended_min_free_kbytes();
> mutex_unlock(&khugepaged_mutex);
> }
--
Best Regards,
Yan, Zi
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-04-29 15:09 ` Zi Yan
@ 2025-04-30 2:33 ` Yafang Shao
2025-04-30 13:19 ` Zi Yan
2025-04-30 14:40 ` Liam R. Howlett
0 siblings, 2 replies; 41+ messages in thread
From: Yafang Shao @ 2025-04-30 2:33 UTC (permalink / raw)
To: Zi Yan
Cc: akpm, ast, daniel, andrii, David Hildenbrand, Baolin Wang,
Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, bpf, linux-mm
On Tue, Apr 29, 2025 at 11:09 PM Zi Yan <ziy@nvidia.com> wrote:
>
> Hi Yafang,
>
> We recently added a new THP entry in MAINTAINERS file[1], do you mind ccing
> people there in your next version? (I added them here)
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/MAINTAINERS?h=mm-everything#n15589
Thanks for your reminder.
I will add the maintainers and reviewers in the next version.
>
> On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
> > In our container environment, we aim to enable THP selectively—allowing
> > specific services to use it while restricting others. This approach is
> > driven by the following considerations:
> >
> > 1. Memory Fragmentation
> > THP can lead to increased memory fragmentation, so we want to limit its
> > use across services.
> > 2. Performance Impact
> > Some services see no benefit from THP, making its usage unnecessary.
> > 3. Performance Gains
> > Certain workloads, such as machine learning services, experience
> > significant performance improvements with THP, so we enable it for them
> > specifically.
> >
> > Since multiple services run on a single host in a containerized environment,
> > enabling THP globally is not ideal. Previously, we set THP to madvise,
> > allowing selected services to opt in via MADV_HUGEPAGE. However, this
> > approach had limitation:
> >
> > - Some services inadvertently used madvise(MADV_HUGEPAGE) through
> > third-party libraries, bypassing our restrictions.
>
> Basically, you want more precise control of THP enablement and the
> ability of overriding madvise() from userspace.
>
> In terms of overriding madvise(), do you have any concrete example of
> these third-party libraries? madvise() users are supposed to know what
> they are doing, so I wonder why they are causing trouble in your
> environment.
To my knowledge, jemalloc [0] supports THP.
Applications using jemalloc typically rely on its default
configurations rather than explicitly enabling or disabling THP. If
the system is configured with THP=madvise, these applications may
automatically leverage THP where appropriate.

[0] https://github.com/jemalloc/jemalloc
>
> >
> > To address this issue, we initially hooked the __x64_sys_madvise() syscall,
> > which is error-injectable, to blacklist unwanted services. While this
> > worked, it was error-prone and ineffective for services needing always mode,
> > as modifying their code to use madvise was impractical.
> >
> > To achieve finer-grained control, we introduced an fmod_ret-based solution.
> > Now, we dynamically adjust THP settings per service by hooking
> > hugepage_global_{enabled,always}() via BPF. This allows us to set THP to
> > enable or disable on a per-service basis without global impact.
>
> hugepage_global_*() are whole system knobs. How did you use it to
> achieve per-service control? In terms of per-service, does it mean
> you need per-memcg group (I assume each service has its own memcg) THP
> configuration?
With this new BPF hook, we can manage THP behavior either per-service
or per-memcg.
In our use case, we’ve chosen memcg-based control for finer-grained
management. Below is a simplified example of our implementation:
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 4096); /* usually there won't be too many cgroups */
	__type(key, u64);
	__type(value, u32);
	__uint(map_flags, BPF_F_NO_PREALLOC);
} thp_whitelist SEC(".maps");

SEC("fmod_ret/mm_bpf_thp_vma_allowable")
int BPF_PROG(thp_vma_allowable, struct vm_area_struct *vma)
{
	struct css_set *cgroups;
	struct mm_struct *mm;
	struct cgroup *cgroup;
	struct task_struct *p;
	u64 cgrp_id;

	if (!vma)
		return 0;
	mm = vma->vm_mm;
	if (!mm)
		return 0;

	p = mm->owner;
	cgroups = p->cgroups;
	cgroup = cgroups->subsys[memory_cgrp_id]->cgroup;
	cgrp_id = cgroup->kn->id;
	/* Allow the tasks in the thp_whitelist to use THP. */
	if (bpf_map_lookup_elem(&thp_whitelist, &cgrp_id))
		return 1;
	return 0;
}
I chose not to include this in the self-tests to avoid the complexity
of setting up cgroups for testing purposes. However, in patch #4 of
this series, I've included a simpler example demonstrating task-level
control.
For service-level control, we could potentially utilize BPF task local
storage as an alternative approach.
--
Regards
Yafang
* Re: [RFC PATCH 1/4] mm: move hugepage_global_{enabled,always}() to internal.h
2025-04-29 15:13 ` Zi Yan
@ 2025-04-30 2:40 ` Yafang Shao
2025-04-30 12:11 ` Zi Yan
0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2025-04-30 2:40 UTC (permalink / raw)
To: Zi Yan; +Cc: akpm, ast, daniel, andrii, bpf, linux-mm
On Tue, Apr 29, 2025 at 11:13 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
> > The functions hugepage_global_{enabled,always}() are currently only used in
> > mm/huge_memory.c, so we can move them to mm/internal.h. They will also be
> > exposed for BPF hooking in a future change.
>
> Why cannot BPF include huge_mm.h instead?
To keep the code better organized, I would rather keep the BPF-related
logic in dedicated files. That prevents overlap with other components and
improves long-term maintainability.
--
Regards
Yafang
* Re: [RFC PATCH 2/4] mm: pass VMA parameter to hugepage_global_{enabled,always}()
2025-04-29 15:31 ` Zi Yan
@ 2025-04-30 2:46 ` Yafang Shao
0 siblings, 0 replies; 41+ messages in thread
From: Yafang Shao @ 2025-04-30 2:46 UTC (permalink / raw)
To: Zi Yan; +Cc: akpm, ast, daniel, andrii, bpf, linux-mm
On Tue, Apr 29, 2025 at 11:31 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
> > We will use the new @vma parameter to determine whether THP can be used.
>
> This is wrong and a complete hack. hugepage_global_*() are system-wide
> functions, so they do not take VMAs.
I modified hugepage_global_*() to enable BPF programs to bypass the
global THP settings.
> Furthermore, the VMAs passed in
> are not used at all. I notice that in the later patch VMA is used by BPF
> hooks, but that does not justify the addition.
>
> If you really want to do this, you can add new functions that take VMA
> as an input and check hugepage_global_*() to replace some of the if
> conditions below. Something like hugepage_vma_{enable,always}.
Thanks for your suggestion.
I'll proceed with adding the necessary helper functions for this feature.
--
Regards
Yafang
* Re: [RFC PATCH 3/4] mm: add BPF hook for THP adjustment
2025-04-29 15:19 ` Alexei Starovoitov
@ 2025-04-30 2:48 ` Yafang Shao
0 siblings, 0 replies; 41+ messages in thread
From: Yafang Shao @ 2025-04-30 2:48 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Andrew Morton, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, bpf, linux-mm
On Tue, Apr 29, 2025 at 11:27 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Apr 28, 2025 at 7:42 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > We will use the @vma parameter in BPF programs to determine whether THP can
> > be used. The typical workflow is as follows:
> >
> > 1. Retrieve the mm_struct from the given @vma.
> > 2. Obtain the task_struct associated with the mm_struct
> > It depends on CONFIG_MEMCG.
> > 3. Adjust THP behavior dynamically based on task attributes
> > E.g., based on the task’s cgroup
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> > mm/Makefile | 3 +++
> > mm/bpf.c | 36 ++++++++++++++++++++++++++++++++++++
> > mm/bpf.h | 21 +++++++++++++++++++++
> > mm/internal.h | 3 +++
> > 4 files changed, 63 insertions(+)
> > create mode 100644 mm/bpf.c
> > create mode 100644 mm/bpf.h
> >
> > diff --git a/mm/Makefile b/mm/Makefile
> > index e7f6bbf8ae5f..97055da04746 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -99,6 +99,9 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> > obj-$(CONFIG_NUMA) += memory-tiers.o
> > obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> > obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> > +ifdef CONFIG_BPF_SYSCALL
> > +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += bpf.o
> > +endif
> > obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> > obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> > obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> > diff --git a/mm/bpf.c b/mm/bpf.c
> > new file mode 100644
> > index 000000000000..72eebcdbad56
> > --- /dev/null
> > +++ b/mm/bpf.c
> > @@ -0,0 +1,36 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Author: Yafang Shao <laoar.shao@gmail.com>
> > + */
> > +
> > +#include <linux/bpf.h>
> > +#include <linux/mm_types.h>
> > +
> > +__bpf_hook_start();
> > +
> > +/* Checks if this @vma can use THP. */
> > +__weak noinline int
> > +mm_bpf_thp_vma_allowable(struct vm_area_struct *vma)
> > +{
> > + /* At present, fmod_ret exclusively uses 0 to signify that the return
> > + * value remains unchanged.
> > + */
> > + return 0;
> > +}
> > +
> > +__bpf_hook_end();
> > +
> > +BTF_SET8_START(mm_bpf_fmod_ret_ids)
> > +BTF_ID_FLAGS(func, mm_bpf_thp_vma_allowable)
> > +BTF_SET8_END(mm_bpf_fmod_ret_ids)
> > +
> > +static const struct btf_kfunc_id_set mm_bpf_fmodret_set = {
> > + .owner = THIS_MODULE,
> > + .set = &mm_bpf_fmod_ret_ids,
> > +};
> > +
> > +static int __init bpf_mm_kfunc_init(void)
> > +{
> > + return register_btf_fmodret_id_set(&mm_bpf_fmodret_set);
> > +}
> > +late_initcall(bpf_mm_kfunc_init);
> > diff --git a/mm/bpf.h b/mm/bpf.h
> > new file mode 100644
> > index 000000000000..e03a38084b08
> > --- /dev/null
> > +++ b/mm/bpf.h
> > @@ -0,0 +1,21 @@
> > +/* SPDX-License-Identifier: GPL-2.0-or-later */
> > +#ifndef __MM_BPF_H
> > +#define __MM_BPF_H
> > +
> > +#define MM_BPF_ALLOWABLE (1)
> > +#define MM_BPF_NOT_ALLOWABLE (-1)
> > +
> > +#define MM_BPF_ALLOWABLE_HOOK(func, args...) { \
> > + int ret = func(args); \
> > + \
> > + if (ret == MM_BPF_ALLOWABLE) \
> > + return 1; \
> > + if (ret == MM_BPF_NOT_ALLOWABLE) \
> > + return 0; \
> > +}
> > +
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +int mm_bpf_thp_vma_allowable(struct vm_area_struct *vma);
> > +#endif
> > +
> > +#endif
> > diff --git a/mm/internal.h b/mm/internal.h
> > index aa698a11dd68..c8bf405fa581 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -21,6 +21,7 @@
> >
> > /* Internal core VMA manipulation functions. */
> > #include "vma.h"
> > +#include "bpf.h"
> >
> > struct folio_batch;
> >
> > @@ -1632,6 +1633,7 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
> > */
> > static inline bool hugepage_global_enabled(struct vm_area_struct *vma)
> > {
> > + MM_BPF_ALLOWABLE_HOOK(mm_bpf_thp_vma_allowable, vma);
> > return transparent_hugepage_flags &
> > ((1<<TRANSPARENT_HUGEPAGE_FLAG) |
> > (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG));
> > @@ -1639,6 +1641,7 @@ static inline bool hugepage_global_enabled(struct vm_area_struct *vma)
> >
> > static inline bool hugepage_global_always(struct vm_area_struct *vma)
> > {
> > + MM_BPF_ALLOWABLE_HOOK(mm_bpf_thp_vma_allowable, vma);
>
> Please define a clean struct_ops based interface and demonstrate
> the generality of the api with both bpf prog and a kernel module.
> Do not use fmod_ret since it's global while struct_ops can be made
> scoped for use case. Ex: per cgroup.
Thank you for the suggestion. I'll give this careful consideration.
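As a first, purely conceptual sketch of what a scoped interface could look
like (illustrative names only, with the actual bpf_struct_ops registration
and any per-cgroup scoping left out), the hook site could consult an ops
table that either a BPF struct_ops program or a kernel module installs:

struct thp_policy_ops {
        /* > 0: allow THP for @vma, < 0: deny, 0: fall back to global settings */
        int (*vma_allowable)(struct vm_area_struct *vma);
};

static struct thp_policy_ops __rcu *thp_policy;

static int thp_policy_vma_allowable(struct vm_area_struct *vma)
{
        struct thp_policy_ops *ops;
        int ret = 0;

        rcu_read_lock();
        ops = rcu_dereference(thp_policy);
        if (ops && ops->vma_allowable)
                ret = ops->vma_allowable(vma);
        rcu_read_unlock();

        return ret;
}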
--
Regards
Yafang
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 1/4] mm: move hugepage_global_{enabled,always}() to internal.h
2025-04-30 2:40 ` Yafang Shao
@ 2025-04-30 12:11 ` Zi Yan
2025-04-30 14:43 ` Yafang Shao
0 siblings, 1 reply; 41+ messages in thread
From: Zi Yan @ 2025-04-30 12:11 UTC (permalink / raw)
To: Yafang Shao; +Cc: akpm, ast, daniel, andrii, bpf, linux-mm
On 29 Apr 2025, at 22:40, Yafang Shao wrote:
> On Tue, Apr 29, 2025 at 11:13 PM Zi Yan <ziy@nvidia.com> wrote:
>>
>> On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
>>> The functions hugepage_global_{enabled,always}() are currently only used in
>>> mm/huge_memory.c, so we can move them to mm/internal.h. They will also be
>>> exposed for BPF hooking in a future change.
>>
>> Why cannot BPF include huge_mm.h instead?
>
> To maintain better code organization, it would be better to separate
> the BPF-related logic into dedicated files. It will prevent overlap
> with other components and improve long-term maintainability.
But at the cost of mm code maintainability? It sets a precedent that one
could grow mm/internal.h very large by moving code to it. I do not think
it is the right way to go.
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-04-30 2:33 ` Yafang Shao
@ 2025-04-30 13:19 ` Zi Yan
2025-04-30 14:38 ` Yafang Shao
2025-04-30 14:40 ` Liam R. Howlett
1 sibling, 1 reply; 41+ messages in thread
From: Zi Yan @ 2025-04-30 13:19 UTC (permalink / raw)
To: Yafang Shao
Cc: akpm, ast, daniel, andrii, David Hildenbrand, Baolin Wang,
Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, bpf, linux-mm, Johannes Weiner, Michal Hocko
On 29 Apr 2025, at 22:33, Yafang Shao wrote:
> On Tue, Apr 29, 2025 at 11:09 PM Zi Yan <ziy@nvidia.com> wrote:
>>
>> Hi Yafang,
>>
>> We recently added a new THP entry in MAINTAINERS file[1], do you mind ccing
>> people there in your next version? (I added them here)
>>
>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/MAINTAINERS?h=mm-everything#n15589
>
> Thanks for your reminder.
> I will add the maintainers and reviewers in the next version.
>
>>
>> On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
>>> In our container environment, we aim to enable THP selectively—allowing
>>> specific services to use it while restricting others. This approach is
>>> driven by the following considerations:
>>>
>>> 1. Memory Fragmentation
>>> THP can lead to increased memory fragmentation, so we want to limit its
>>> use across services.
>>> 2. Performance Impact
>>> Some services see no benefit from THP, making its usage unnecessary.
>>> 3. Performance Gains
>>> Certain workloads, such as machine learning services, experience
>>> significant performance improvements with THP, so we enable it for them
>>> specifically.
>>>
>>> Since multiple services run on a single host in a containerized environment,
>>> enabling THP globally is not ideal. Previously, we set THP to madvise,
>>> allowing selected services to opt in via MADV_HUGEPAGE. However, this
>>> approach had limitation:
>>>
>>> - Some services inadvertently used madvise(MADV_HUGEPAGE) through
>>> third-party libraries, bypassing our restrictions.
>>
>> Basically, you want more precise control of THP enablement and the
>> ability of overriding madvise() from userspace.
>>
>> In terms of overriding madvise(), do you have any concrete example of
>> these third-party libraries? madvise() users are supposed to know what
>> they are doing, so I wonder why they are causing trouble in your
>> environment.
>
> To my knowledge, jemalloc [0] supports THP.
> Applications using jemalloc typically rely on its default
> configurations rather than explicitly enabling or disabling THP. If
> the system is configured with THP=madvise, these applications may
> automatically leverage THP where appropriate
>
> [0]. https://github.com/jemalloc/jemalloc
It sounds like a userspace issue. For jemalloc, if applications require
it, can't you replace the jemalloc with one compiled with --disable-thp
to work around the issue?
>
>>
>>>
>>> To address this issue, we initially hooked the __x64_sys_madvise() syscall,
>>> which is error-injectable, to blacklist unwanted services. While this
>>> worked, it was error-prone and ineffective for services needing always mode,
>>> as modifying their code to use madvise was impractical.
>>>
>>> To achieve finer-grained control, we introduced an fmod_ret-based solution.
>>> Now, we dynamically adjust THP settings per service by hooking
>>> hugepage_global_{enabled,always}() via BPF. This allows us to set THP to
>>> enable or disable on a per-service basis without global impact.
>>
>> hugepage_global_*() are whole system knobs. How did you use it to
>> achieve per-service control? In terms of per-service, does it mean
>> you need per-memcg group (I assume each service has its own memcg) THP
>> configuration?
>
> With this new BPF hook, we can manage THP behavior either per-service
> or per-memory.
> In our use case, we’ve chosen memcg-based control for finer-grained
> management. Below is a simplified example of our implementation:
>
> struct{
> __uint(type, BPF_MAP_TYPE_HASH);
> __uint(max_entries, 4096); /* usually there won't too
> many cgroups */
> __type(key, u64);
> __type(value, u32);
> __uint(map_flags, BPF_F_NO_PREALLOC);
> } thp_whitelist SEC(".maps");
>
> SEC("fmod_ret/mm_bpf_thp_vma_allowable")
> int BPF_PROG(thp_vma_allowable, struct vm_area_struct *vma)
> {
> struct cgroup_subsys_state *css;
> struct css_set *cgroups;
> struct mm_struct *mm;
> struct cgroup *cgroup;
> struct cgroup *parent;
> struct task_struct *p;
> u64 cgrp_id;
>
> if (!vma)
> return 0;
>
> mm = vma->vm_mm;
> if (!mm)
> return 0;
>
> p = mm->owner;
> cgroups = p->cgroups;
> cgroup = cgroups->subsys[memory_cgrp_id]->cgroup;
> cgrp_id = cgroup->kn->id;
>
> /* Allow the tasks in the thp_whiltelist to use THP. */
> if (bpf_map_lookup_elem(&thp_whitelist, &cgrp_id))
> return 1;
> return 0;
> }
>
> I chose not to include this in the self-tests to avoid the complexity
> of setting up cgroups for testing purposes. However, in patch #4 of
> this series, I've included a simpler example demonstrating task-level
> control.
For task-level control, why not use prctl(PR_SET_THP_DISABLE)?
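(For reference, that is a single call in the application itself, per
prctl(2); the setting is inherited over fork() and preserved across
execve(). A minimal example:)

#include <stdio.h>
#include <sys/prctl.h>

int main(void)
{
        /* Disable THP for this task and everything it forks/execs. */
        if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0)) {
                perror("PR_SET_THP_DISABLE");
                return 1;
        }
        /* ... start the real workload here ... */
        return 0;
}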
> For service-level control, we could potentially utilize BPF task local
> storage as an alternative approach.
+cgroup people
For service-level control, there was a proposal to add cgroup-based
THP control [1]. You might need a strong use case to convince people.
[1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-04-30 13:19 ` Zi Yan
@ 2025-04-30 14:38 ` Yafang Shao
2025-04-30 15:00 ` Zi Yan
2025-04-30 17:59 ` Johannes Weiner
0 siblings, 2 replies; 41+ messages in thread
From: Yafang Shao @ 2025-04-30 14:38 UTC (permalink / raw)
To: Zi Yan
Cc: akpm, ast, daniel, andrii, David Hildenbrand, Baolin Wang,
Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, bpf, linux-mm, Johannes Weiner, Michal Hocko
On Wed, Apr 30, 2025 at 9:19 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 29 Apr 2025, at 22:33, Yafang Shao wrote:
>
> > On Tue, Apr 29, 2025 at 11:09 PM Zi Yan <ziy@nvidia.com> wrote:
> >>
> >> Hi Yafang,
> >>
> >> We recently added a new THP entry in MAINTAINERS file[1], do you mind ccing
> >> people there in your next version? (I added them here)
> >>
> >> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/MAINTAINERS?h=mm-everything#n15589
> >
> > Thanks for your reminder.
> > I will add the maintainers and reviewers in the next version.
> >
> >>
> >> On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
> >>> In our container environment, we aim to enable THP selectively—allowing
> >>> specific services to use it while restricting others. This approach is
> >>> driven by the following considerations:
> >>>
> >>> 1. Memory Fragmentation
> >>> THP can lead to increased memory fragmentation, so we want to limit its
> >>> use across services.
> >>> 2. Performance Impact
> >>> Some services see no benefit from THP, making its usage unnecessary.
> >>> 3. Performance Gains
> >>> Certain workloads, such as machine learning services, experience
> >>> significant performance improvements with THP, so we enable it for them
> >>> specifically.
> >>>
> >>> Since multiple services run on a single host in a containerized environment,
> >>> enabling THP globally is not ideal. Previously, we set THP to madvise,
> >>> allowing selected services to opt in via MADV_HUGEPAGE. However, this
> >>> approach had limitation:
> >>>
> >>> - Some services inadvertently used madvise(MADV_HUGEPAGE) through
> >>> third-party libraries, bypassing our restrictions.
> >>
> >> Basically, you want more precise control of THP enablement and the
> >> ability of overriding madvise() from userspace.
> >>
> >> In terms of overriding madvise(), do you have any concrete example of
> >> these third-party libraries? madvise() users are supposed to know what
> >> they are doing, so I wonder why they are causing trouble in your
> >> environment.
> >
> > To my knowledge, jemalloc [0] supports THP.
> > Applications using jemalloc typically rely on its default
> > configurations rather than explicitly enabling or disabling THP. If
> > the system is configured with THP=madvise, these applications may
> > automatically leverage THP where appropriate
> >
> > [0]. https://github.com/jemalloc/jemalloc
>
> It sounds like a userspace issue. For jemalloc, if applications require
> it, can't you replace the jemalloc with a one compiled with --disable-thp
> to work around the issue?
That’s not the issue this patchset is trying to address or work
around. I believe we should focus on the actual problem it's meant to
solve.
By the way, you probably wouldn't raise this question if you were managing a
large fleet of servers. We're a platform provider, but we don’t
maintain all the packages ourselves. Users make their own choices
based on their specific requirements. It's not a feasible solution for
us to develop and maintain every package.
>
> >
> >>
> >>>
> >>> To address this issue, we initially hooked the __x64_sys_madvise() syscall,
> >>> which is error-injectable, to blacklist unwanted services. While this
> >>> worked, it was error-prone and ineffective for services needing always mode,
> >>> as modifying their code to use madvise was impractical.
> >>>
> >>> To achieve finer-grained control, we introduced an fmod_ret-based solution.
> >>> Now, we dynamically adjust THP settings per service by hooking
> >>> hugepage_global_{enabled,always}() via BPF. This allows us to set THP to
> >>> enable or disable on a per-service basis without global impact.
> >>
> >> hugepage_global_*() are whole system knobs. How did you use it to
> >> achieve per-service control? In terms of per-service, does it mean
> >> you need per-memcg group (I assume each service has its own memcg) THP
> >> configuration?
> >
> > With this new BPF hook, we can manage THP behavior either per-service
> > or per-memory.
> > In our use case, we’ve chosen memcg-based control for finer-grained
> > management. Below is a simplified example of our implementation:
> >
> > struct{
> > __uint(type, BPF_MAP_TYPE_HASH);
> > __uint(max_entries, 4096); /* usually there won't too
> > many cgroups */
> > __type(key, u64);
> > __type(value, u32);
> > __uint(map_flags, BPF_F_NO_PREALLOC);
> > } thp_whitelist SEC(".maps");
> >
> > SEC("fmod_ret/mm_bpf_thp_vma_allowable")
> > int BPF_PROG(thp_vma_allowable, struct vm_area_struct *vma)
> > {
> > struct cgroup_subsys_state *css;
> > struct css_set *cgroups;
> > struct mm_struct *mm;
> > struct cgroup *cgroup;
> > struct cgroup *parent;
> > struct task_struct *p;
> > u64 cgrp_id;
> >
> > if (!vma)
> > return 0;
> >
> > mm = vma->vm_mm;
> > if (!mm)
> > return 0;
> >
> > p = mm->owner;
> > cgroups = p->cgroups;
> > cgroup = cgroups->subsys[memory_cgrp_id]->cgroup;
> > cgrp_id = cgroup->kn->id;
> >
> > /* Allow the tasks in the thp_whiltelist to use THP. */
> > if (bpf_map_lookup_elem(&thp_whitelist, &cgrp_id))
> > return 1;
> > return 0;
> > }
> >
> > I chose not to include this in the self-tests to avoid the complexity
> > of setting up cgroups for testing purposes. However, in patch #4 of
> > this series, I've included a simpler example demonstrating task-level
> > control.
>
> For task-level control, why not using prctl(PR_SET_THP_DISABLE)?
You’ll need to modify the user-space code—and again, this likely
wouldn’t be a concern if you were managing a large fleet of servers.
>
> > For service-level control, we could potentially utilize BPF task local
> > storage as an alternative approach.
>
> +cgroup people
>
> For service-level control, there was a proposal of adding cgroup based
> THP control[1]. You might need a strong use case to convince people.
>
> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
Thanks for the reference. I've reviewed the related discussion, and if
I understand correctly, the proposal was rejected by the maintainers.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-04-30 2:33 ` Yafang Shao
2025-04-30 13:19 ` Zi Yan
@ 2025-04-30 14:40 ` Liam R. Howlett
2025-04-30 14:49 ` Yafang Shao
1 sibling, 1 reply; 41+ messages in thread
From: Liam R. Howlett @ 2025-04-30 14:40 UTC (permalink / raw)
To: Yafang Shao
Cc: Zi Yan, akpm, ast, daniel, andrii, David Hildenbrand, Baolin Wang,
Lorenzo Stoakes, Nico Pache, Ryan Roberts, Dev Jain, bpf,
linux-mm
* Yafang Shao <laoar.shao@gmail.com> [250429 22:34]:
> On Tue, Apr 29, 2025 at 11:09 PM Zi Yan <ziy@nvidia.com> wrote:
> >
> > Hi Yafang,
> >
> > We recently added a new THP entry in MAINTAINERS file[1], do you mind ccing
> > people there in your next version? (I added them here)
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/MAINTAINERS?h=mm-everything#n15589
>
> Thanks for your reminder.
> I will add the maintainers and reviewers in the next version.
>
> >
> > On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
> > > In our container environment, we aim to enable THP selectively—allowing
> > > specific services to use it while restricting others. This approach is
> > > driven by the following considerations:
> > >
> > > 1. Memory Fragmentation
> > > THP can lead to increased memory fragmentation, so we want to limit its
> > > use across services.
> > > 2. Performance Impact
> > > Some services see no benefit from THP, making its usage unnecessary.
> > > 3. Performance Gains
> > > Certain workloads, such as machine learning services, experience
> > > significant performance improvements with THP, so we enable it for them
> > > specifically.
> > >
> > > Since multiple services run on a single host in a containerized environment,
> > > enabling THP globally is not ideal. Previously, we set THP to madvise,
> > > allowing selected services to opt in via MADV_HUGEPAGE. However, this
> > > approach had limitation:
> > >
> > > - Some services inadvertently used madvise(MADV_HUGEPAGE) through
> > > third-party libraries, bypassing our restrictions.
> >
> > Basically, you want more precise control of THP enablement and the
> > ability of overriding madvise() from userspace.
> >
> > In terms of overriding madvise(), do you have any concrete example of
> > these third-party libraries? madvise() users are supposed to know what
> > they are doing, so I wonder why they are causing trouble in your
> > environment.
>
> To my knowledge, jemalloc [0] supports THP.
> Applications using jemalloc typically rely on its default
> configurations rather than explicitly enabling or disabling THP. If
> the system is configured with THP=madvise, these applications may
> automatically leverage THP where appropriate
Isn't jemalloc THP aware, and can't it be configured to always, never, or
"default to the system setting" use THP for both metadata and
allocations? It seems like this is an example of a third-party library
that knows what it is doing with regard to THP. [1]
If jemalloc is not following its own settings then it is an issue in
jemalloc and not a reason for a kernel change.
If you are relying on the default configuration of jemalloc and it
doesn't work as you expect, then maybe try the thp settings?
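(For example, assuming a jemalloc build with THP support, the opt.thp
setting can be forced at startup with MALLOC_CONF="thp:never", without
rebuilding the application.)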
>
> [0]. https://github.com/jemalloc/jemalloc
...
Thanks,
Liam
[1]. https://jemalloc.net/jemalloc.3.html
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 1/4] mm: move hugepage_global_{enabled,always}() to internal.h
2025-04-30 12:11 ` Zi Yan
@ 2025-04-30 14:43 ` Yafang Shao
0 siblings, 0 replies; 41+ messages in thread
From: Yafang Shao @ 2025-04-30 14:43 UTC (permalink / raw)
To: Zi Yan; +Cc: akpm, ast, daniel, andrii, bpf, linux-mm
On Wed, Apr 30, 2025 at 8:11 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 29 Apr 2025, at 22:40, Yafang Shao wrote:
>
> > On Tue, Apr 29, 2025 at 11:13 PM Zi Yan <ziy@nvidia.com> wrote:
> >>
> >> On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
> >>> The functions hugepage_global_{enabled,always}() are currently only used in
> >>> mm/huge_memory.c, so we can move them to mm/internal.h. They will also be
> >>> exposed for BPF hooking in a future change.
> >>
> >> Why cannot BPF include huge_mm.h instead?
> >
> > To maintain better code organization, it would be better to separate
> > the BPF-related logic into dedicated files. It will prevent overlap
> > with other components and improve long-term maintainability.
>
> But at the cost of mm code maintainability? It sets a precedent that one
> could grow mm/internal.h very large by moving code to it. I do not think
> it is the right way to go.
I believe the helpers used exclusively by mm/ should be moved to
mm/internal.h. If the size of mm/internal.h is a concern, we could
consider introducing a separate mm/huge_internal.h.
That said, I won’t proceed with moving them to mm/internal.h if you
strongly believe it’s not appropriate—though personally, I don't think
avoiding it is the right direction.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-04-30 14:40 ` Liam R. Howlett
@ 2025-04-30 14:49 ` Yafang Shao
0 siblings, 0 replies; 41+ messages in thread
From: Yafang Shao @ 2025-04-30 14:49 UTC (permalink / raw)
To: Liam R. Howlett, Yafang Shao, Zi Yan, akpm, ast, daniel, andrii,
David Hildenbrand, Baolin Wang, Lorenzo Stoakes, Nico Pache,
Ryan Roberts, Dev Jain, bpf, linux-mm
On Wed, Apr 30, 2025 at 10:40 PM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:
>
> * Yafang Shao <laoar.shao@gmail.com> [250429 22:34]:
> > On Tue, Apr 29, 2025 at 11:09 PM Zi Yan <ziy@nvidia.com> wrote:
> > >
> > > Hi Yafang,
> > >
> > > We recently added a new THP entry in MAINTAINERS file[1], do you mind ccing
> > > people there in your next version? (I added them here)
> > >
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/MAINTAINERS?h=mm-everything#n15589
> >
> > Thanks for your reminder.
> > I will add the maintainers and reviewers in the next version.
> >
> > >
> > > On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
> > > > In our container environment, we aim to enable THP selectively—allowing
> > > > specific services to use it while restricting others. This approach is
> > > > driven by the following considerations:
> > > >
> > > > 1. Memory Fragmentation
> > > > THP can lead to increased memory fragmentation, so we want to limit its
> > > > use across services.
> > > > 2. Performance Impact
> > > > Some services see no benefit from THP, making its usage unnecessary.
> > > > 3. Performance Gains
> > > > Certain workloads, such as machine learning services, experience
> > > > significant performance improvements with THP, so we enable it for them
> > > > specifically.
> > > >
> > > > Since multiple services run on a single host in a containerized environment,
> > > > enabling THP globally is not ideal. Previously, we set THP to madvise,
> > > > allowing selected services to opt in via MADV_HUGEPAGE. However, this
> > > > approach had limitation:
> > > >
> > > > - Some services inadvertently used madvise(MADV_HUGEPAGE) through
> > > > third-party libraries, bypassing our restrictions.
> > >
> > > Basically, you want more precise control of THP enablement and the
> > > ability of overriding madvise() from userspace.
> > >
> > > In terms of overriding madvise(), do you have any concrete example of
> > > these third-party libraries? madvise() users are supposed to know what
> > > they are doing, so I wonder why they are causing trouble in your
> > > environment.
> >
> > To my knowledge, jemalloc [0] supports THP.
> > Applications using jemalloc typically rely on its default
> > configurations rather than explicitly enabling or disabling THP. If
> > the system is configured with THP=madvise, these applications may
> > automatically leverage THP where appropriate
>
> Isn't jemalloc THP aware and can be configured to always, never, or
> "default to the system setting" use THP for both metadata and
> allocations? It seems like this is an example of a thrid party library
> that knows what it is doing in regards to THP. [1]
Thanks for your explanation.
>
> If jemalloc is not following its own settings then it is an issue in
> jemalloc and not a reason for a kernel change.
We don’t change the kernel to accommodate specific userspace
settings—we change it only when it benefits users more broadly.
By the way, this patchset isn’t intended to address that issue. If
it’s causing confusion about the problem this patchset is trying to
solve, I’ll remove that part from the commit log in the next version.
>
> If you are relying on the default configuration of jemalloc and it
> doesn't work as you expect, then maybe try the thp settings?
>
> >
> > [0]. https://github.com/jemalloc/jemalloc
>
> ...
>
> Thanks,
> Liam
>
> [1]. https://jemalloc.net/jemalloc.3.html
--
Regards
Yafang
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-04-30 14:38 ` Yafang Shao
@ 2025-04-30 15:00 ` Zi Yan
2025-04-30 15:16 ` Yafang Shao
2025-04-30 15:21 ` Liam R. Howlett
2025-04-30 17:59 ` Johannes Weiner
1 sibling, 2 replies; 41+ messages in thread
From: Zi Yan @ 2025-04-30 15:00 UTC (permalink / raw)
To: Yafang Shao
Cc: akpm, ast, daniel, andrii, David Hildenbrand, Baolin Wang,
Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, bpf, linux-mm, Johannes Weiner, Michal Hocko
On 30 Apr 2025, at 10:38, Yafang Shao wrote:
> On Wed, Apr 30, 2025 at 9:19 PM Zi Yan <ziy@nvidia.com> wrote:
>>
>> On 29 Apr 2025, at 22:33, Yafang Shao wrote:
>>
>>> On Tue, Apr 29, 2025 at 11:09 PM Zi Yan <ziy@nvidia.com> wrote:
>>>>
>>>> Hi Yafang,
>>>>
>>>> We recently added a new THP entry in MAINTAINERS file[1], do you mind ccing
>>>> people there in your next version? (I added them here)
>>>>
>>>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/MAINTAINERS?h=mm-everything#n15589
>>>
>>> Thanks for your reminder.
>>> I will add the maintainers and reviewers in the next version.
>>>
>>>>
>>>> On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
>>>>> In our container environment, we aim to enable THP selectively—allowing
>>>>> specific services to use it while restricting others. This approach is
>>>>> driven by the following considerations:
>>>>>
>>>>> 1. Memory Fragmentation
>>>>> THP can lead to increased memory fragmentation, so we want to limit its
>>>>> use across services.
>>>>> 2. Performance Impact
>>>>> Some services see no benefit from THP, making its usage unnecessary.
>>>>> 3. Performance Gains
>>>>> Certain workloads, such as machine learning services, experience
>>>>> significant performance improvements with THP, so we enable it for them
>>>>> specifically.
>>>>>
>>>>> Since multiple services run on a single host in a containerized environment,
>>>>> enabling THP globally is not ideal. Previously, we set THP to madvise,
>>>>> allowing selected services to opt in via MADV_HUGEPAGE. However, this
>>>>> approach had limitation:
>>>>>
>>>>> - Some services inadvertently used madvise(MADV_HUGEPAGE) through
>>>>> third-party libraries, bypassing our restrictions.
>>>>
>>>> Basically, you want more precise control of THP enablement and the
>>>> ability of overriding madvise() from userspace.
>>>>
>>>> In terms of overriding madvise(), do you have any concrete example of
>>>> these third-party libraries? madvise() users are supposed to know what
>>>> they are doing, so I wonder why they are causing trouble in your
>>>> environment.
>>>
>>> To my knowledge, jemalloc [0] supports THP.
>>> Applications using jemalloc typically rely on its default
>>> configurations rather than explicitly enabling or disabling THP. If
>>> the system is configured with THP=madvise, these applications may
>>> automatically leverage THP where appropriate
>>>
>>> [0]. https://github.com/jemalloc/jemalloc
>>
>> It sounds like a userspace issue. For jemalloc, if applications require
>> it, can't you replace the jemalloc with a one compiled with --disable-thp
>> to work around the issue?
>
> That’s not the issue this patchset is trying to address or work
> around. I believe we should focus on the actual problem it's meant to
> solve.
>
> By the way, you might not raise this question if you were managing a
> large fleet of servers. We're a platform provider, but we don’t
> maintain all the packages ourselves. Users make their own choices
> based on their specific requirements. It's not a feasible solution for
> us to develop and maintain every package.
Basically, the user wants to use THP, but as a service provider you think
differently, so you want to override the userspace choice. Am I getting it right?
>
>>
>>>
>>>>
>>>>>
>>>>> To address this issue, we initially hooked the __x64_sys_madvise() syscall,
>>>>> which is error-injectable, to blacklist unwanted services. While this
>>>>> worked, it was error-prone and ineffective for services needing always mode,
>>>>> as modifying their code to use madvise was impractical.
>>>>>
>>>>> To achieve finer-grained control, we introduced an fmod_ret-based solution.
>>>>> Now, we dynamically adjust THP settings per service by hooking
>>>>> hugepage_global_{enabled,always}() via BPF. This allows us to set THP to
>>>>> enable or disable on a per-service basis without global impact.
>>>>
>>>> hugepage_global_*() are whole system knobs. How did you use it to
>>>> achieve per-service control? In terms of per-service, does it mean
>>>> you need per-memcg group (I assume each service has its own memcg) THP
>>>> configuration?
>>>
>>> With this new BPF hook, we can manage THP behavior either per-service
>>> or per-memory.
>>> In our use case, we’ve chosen memcg-based control for finer-grained
>>> management. Below is a simplified example of our implementation:
>>>
>>> struct{
>>> __uint(type, BPF_MAP_TYPE_HASH);
>>> __uint(max_entries, 4096); /* usually there won't too
>>> many cgroups */
>>> __type(key, u64);
>>> __type(value, u32);
>>> __uint(map_flags, BPF_F_NO_PREALLOC);
>>> } thp_whitelist SEC(".maps");
>>>
>>> SEC("fmod_ret/mm_bpf_thp_vma_allowable")
>>> int BPF_PROG(thp_vma_allowable, struct vm_area_struct *vma)
>>> {
>>> struct cgroup_subsys_state *css;
>>> struct css_set *cgroups;
>>> struct mm_struct *mm;
>>> struct cgroup *cgroup;
>>> struct cgroup *parent;
>>> struct task_struct *p;
>>> u64 cgrp_id;
>>>
>>> if (!vma)
>>> return 0;
>>>
>>> mm = vma->vm_mm;
>>> if (!mm)
>>> return 0;
>>>
>>> p = mm->owner;
>>> cgroups = p->cgroups;
>>> cgroup = cgroups->subsys[memory_cgrp_id]->cgroup;
>>> cgrp_id = cgroup->kn->id;
>>>
>>> /* Allow the tasks in the thp_whiltelist to use THP. */
>>> if (bpf_map_lookup_elem(&thp_whitelist, &cgrp_id))
>>> return 1;
>>> return 0;
>>> }
>>>
>>> I chose not to include this in the self-tests to avoid the complexity
>>> of setting up cgroups for testing purposes. However, in patch #4 of
>>> this series, I've included a simpler example demonstrating task-level
>>> control.
>>
>> For task-level control, why not using prctl(PR_SET_THP_DISABLE)?
>
> You’ll need to modify the user-space code—and again, this likely
> wouldn’t be a concern if you were managing a large fleet of servers.
>
>>
>>> For service-level control, we could potentially utilize BPF task local
>>> storage as an alternative approach.
>>
>> +cgroup people
>>
>> For service-level control, there was a proposal of adding cgroup based
>> THP control[1]. You might need a strong use case to convince people.
>>
>> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
>
> Thanks for the reference. I've reviewed the related discussion, and if
> I understand correctly, the proposal was rejected by the maintainers.
I wonder why your approach is better than the cgroup based THP control proposal.
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-04-30 15:00 ` Zi Yan
@ 2025-04-30 15:16 ` Yafang Shao
2025-04-30 15:21 ` Liam R. Howlett
1 sibling, 0 replies; 41+ messages in thread
From: Yafang Shao @ 2025-04-30 15:16 UTC (permalink / raw)
To: Zi Yan
Cc: akpm, ast, daniel, andrii, David Hildenbrand, Baolin Wang,
Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, bpf, linux-mm, Johannes Weiner, Michal Hocko
On Wed, Apr 30, 2025 at 11:00 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 30 Apr 2025, at 10:38, Yafang Shao wrote:
>
> > On Wed, Apr 30, 2025 at 9:19 PM Zi Yan <ziy@nvidia.com> wrote:
> >>
> >> On 29 Apr 2025, at 22:33, Yafang Shao wrote:
> >>
> >>> On Tue, Apr 29, 2025 at 11:09 PM Zi Yan <ziy@nvidia.com> wrote:
> >>>>
> >>>> Hi Yafang,
> >>>>
> >>>> We recently added a new THP entry in MAINTAINERS file[1], do you mind ccing
> >>>> people there in your next version? (I added them here)
> >>>>
> >>>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/MAINTAINERS?h=mm-everything#n15589
> >>>
> >>> Thanks for your reminder.
> >>> I will add the maintainers and reviewers in the next version.
> >>>
> >>>>
> >>>> On Mon Apr 28, 2025 at 10:41 PM EDT, Yafang Shao wrote:
> >>>>> In our container environment, we aim to enable THP selectively—allowing
> >>>>> specific services to use it while restricting others. This approach is
> >>>>> driven by the following considerations:
> >>>>>
> >>>>> 1. Memory Fragmentation
> >>>>> THP can lead to increased memory fragmentation, so we want to limit its
> >>>>> use across services.
> >>>>> 2. Performance Impact
> >>>>> Some services see no benefit from THP, making its usage unnecessary.
> >>>>> 3. Performance Gains
> >>>>> Certain workloads, such as machine learning services, experience
> >>>>> significant performance improvements with THP, so we enable it for them
> >>>>> specifically.
> >>>>>
> >>>>> Since multiple services run on a single host in a containerized environment,
> >>>>> enabling THP globally is not ideal. Previously, we set THP to madvise,
> >>>>> allowing selected services to opt in via MADV_HUGEPAGE. However, this
> >>>>> approach had limitation:
> >>>>>
> >>>>> - Some services inadvertently used madvise(MADV_HUGEPAGE) through
> >>>>> third-party libraries, bypassing our restrictions.
> >>>>
> >>>> Basically, you want more precise control of THP enablement and the
> >>>> ability of overriding madvise() from userspace.
> >>>>
> >>>> In terms of overriding madvise(), do you have any concrete example of
> >>>> these third-party libraries? madvise() users are supposed to know what
> >>>> they are doing, so I wonder why they are causing trouble in your
> >>>> environment.
> >>>
> >>> To my knowledge, jemalloc [0] supports THP.
> >>> Applications using jemalloc typically rely on its default
> >>> configurations rather than explicitly enabling or disabling THP. If
> >>> the system is configured with THP=madvise, these applications may
> >>> automatically leverage THP where appropriate
> >>>
> >>> [0]. https://github.com/jemalloc/jemalloc
> >>
> >> It sounds like a userspace issue. For jemalloc, if applications require
> >> it, can't you replace the jemalloc with a one compiled with --disable-thp
> >> to work around the issue?
> >
> > That’s not the issue this patchset is trying to address or work
> > around. I believe we should focus on the actual problem it's meant to
> > solve.
> >
> > By the way, you might not raise this question if you were managing a
> > large fleet of servers. We're a platform provider, but we don’t
> > maintain all the packages ourselves. Users make their own choices
> > based on their specific requirements. It's not a feasible solution for
> > us to develop and maintain every package.
>
> Basically, user wants to use THP, but as a service provider, you think
> differently, so want to override userspace choice. Am I getting it right?
No—the users aren’t specifically concerned with THP. They just copied
a configuration from the internet and deployed it in the production
environment.
>
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>> To address this issue, we initially hooked the __x64_sys_madvise() syscall,
> >>>>> which is error-injectable, to blacklist unwanted services. While this
> >>>>> worked, it was error-prone and ineffective for services needing always mode,
> >>>>> as modifying their code to use madvise was impractical.
> >>>>>
> >>>>> To achieve finer-grained control, we introduced an fmod_ret-based solution.
> >>>>> Now, we dynamically adjust THP settings per service by hooking
> >>>>> hugepage_global_{enabled,always}() via BPF. This allows us to set THP to
> >>>>> enable or disable on a per-service basis without global impact.
> >>>>
> >>>> hugepage_global_*() are whole system knobs. How did you use it to
> >>>> achieve per-service control? In terms of per-service, does it mean
> >>>> you need per-memcg group (I assume each service has its own memcg) THP
> >>>> configuration?
> >>>
> >>> With this new BPF hook, we can manage THP behavior either per-service
> >>> or per-memory.
> >>> In our use case, we’ve chosen memcg-based control for finer-grained
> >>> management. Below is a simplified example of our implementation:
> >>>
> >>> struct{
> >>> __uint(type, BPF_MAP_TYPE_HASH);
> >>> __uint(max_entries, 4096); /* usually there won't too
> >>> many cgroups */
> >>> __type(key, u64);
> >>> __type(value, u32);
> >>> __uint(map_flags, BPF_F_NO_PREALLOC);
> >>> } thp_whitelist SEC(".maps");
> >>>
> >>> SEC("fmod_ret/mm_bpf_thp_vma_allowable")
> >>> int BPF_PROG(thp_vma_allowable, struct vm_area_struct *vma)
> >>> {
> >>> struct cgroup_subsys_state *css;
> >>> struct css_set *cgroups;
> >>> struct mm_struct *mm;
> >>> struct cgroup *cgroup;
> >>> struct cgroup *parent;
> >>> struct task_struct *p;
> >>> u64 cgrp_id;
> >>>
> >>> if (!vma)
> >>> return 0;
> >>>
> >>> mm = vma->vm_mm;
> >>> if (!mm)
> >>> return 0;
> >>>
> >>> p = mm->owner;
> >>> cgroups = p->cgroups;
> >>> cgroup = cgroups->subsys[memory_cgrp_id]->cgroup;
> >>> cgrp_id = cgroup->kn->id;
> >>>
> >>> /* Allow the tasks in the thp_whiltelist to use THP. */
> >>> if (bpf_map_lookup_elem(&thp_whitelist, &cgrp_id))
> >>> return 1;
> >>> return 0;
> >>> }
> >>>
> >>> I chose not to include this in the self-tests to avoid the complexity
> >>> of setting up cgroups for testing purposes. However, in patch #4 of
> >>> this series, I've included a simpler example demonstrating task-level
> >>> control.
> >>
> >> For task-level control, why not using prctl(PR_SET_THP_DISABLE)?
> >
> > You’ll need to modify the user-space code—and again, this likely
> > wouldn’t be a concern if you were managing a large fleet of servers.
> >
> >>
> >>> For service-level control, we could potentially utilize BPF task local
> >>> storage as an alternative approach.
> >>
> >> +cgroup people
> >>
> >> For service-level control, there was a proposal of adding cgroup based
> >> THP control[1]. You might need a strong use case to convince people.
> >>
> >> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
> >
> > Thanks for the reference. I've reviewed the related discussion, and if
> > I understand correctly, the proposal was rejected by the maintainers.
>
> I wonder why your approach is better than the cgroup based THP control proposal.
It’s more flexible, and you can still use it even without cgroups.
One limitation is that CONFIG_MEMCG must be enabled due to the use of
mm_struct::owner. I'm wondering if it would be feasible to decouple
mm_struct::owner from CONFIG_MEMCG. Alternatively, if there’s another
reliable way to retrieve the task_struct without relying on
mm_struct::owner, we could consider adding BPF kfuncs to support it.
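One illustrative direction (a sketch only, reusing the thp_whitelist map
from the earlier example): when the hook is reached from the faulting
task itself rather than from khugepaged, the program could take the task
from the current context and use the default-hierarchy cgroup id, which
avoids mm_struct::owner and therefore the CONFIG_MEMCG dependency:

SEC("fmod_ret/mm_bpf_thp_vma_allowable")
int BPF_PROG(thp_vma_allowable_cur, struct vm_area_struct *vma)
{
        struct task_struct *p;
        u64 cgrp_id;

        if (!vma || !vma->vm_mm)
                return 0;

        p = bpf_get_current_task_btf();
        /* the hook may also fire from khugepaged working on another mm */
        if (p->mm != vma->vm_mm)
                return 0;

        /* default-hierarchy cgroup id, no memcg subsystem required */
        cgrp_id = p->cgroups->dfl_cgrp->kn->id;
        if (bpf_map_lookup_elem(&thp_whitelist, &cgrp_id))
                return 1;
        return 0;
}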
--
Regards
Yafang
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-04-30 15:00 ` Zi Yan
2025-04-30 15:16 ` Yafang Shao
@ 2025-04-30 15:21 ` Liam R. Howlett
2025-04-30 15:37 ` Yafang Shao
1 sibling, 1 reply; 41+ messages in thread
From: Liam R. Howlett @ 2025-04-30 15:21 UTC (permalink / raw)
To: Zi Yan
Cc: Yafang Shao, akpm, ast, daniel, andrii, David Hildenbrand,
Baolin Wang, Lorenzo Stoakes, Nico Pache, Ryan Roberts, Dev Jain,
bpf, linux-mm, Johannes Weiner, Michal Hocko
* Zi Yan <ziy@nvidia.com> [250430 11:01]:
...
> >>>>> Since multiple services run on a single host in a containerized environment,
> >>>>> enabling THP globally is not ideal. Previously, we set THP to madvise,
> >>>>> allowing selected services to opt in via MADV_HUGEPAGE. However, this
> >>>>> approach had limitation:
> >>>>>
> >>>>> - Some services inadvertently used madvise(MADV_HUGEPAGE) through
> >>>>> third-party libraries, bypassing our restrictions.
> >>>>
> >>>> Basically, you want more precise control of THP enablement and the
> >>>> ability of overriding madvise() from userspace.
> >>>>
> >>>> In terms of overriding madvise(), do you have any concrete example of
> >>>> these third-party libraries? madvise() users are supposed to know what
> >>>> they are doing, so I wonder why they are causing trouble in your
> >>>> environment.
> >>>
> >>> To my knowledge, jemalloc [0] supports THP.
> >>> Applications using jemalloc typically rely on its default
> >>> configurations rather than explicitly enabling or disabling THP. If
> >>> the system is configured with THP=madvise, these applications may
> >>> automatically leverage THP where appropriate
> >>>
> >>> [0]. https://github.com/jemalloc/jemalloc
> >>
> >> It sounds like a userspace issue. For jemalloc, if applications require
> >> it, can't you replace the jemalloc with a one compiled with --disable-thp
> >> to work around the issue?
> >
> > That’s not the issue this patchset is trying to address or work
> > around. I believe we should focus on the actual problem it's meant to
> > solve.
> >
> > By the way, you might not raise this question if you were managing a
> > large fleet of servers. We're a platform provider, but we don’t
> > maintain all the packages ourselves. Users make their own choices
> > based on their specific requirements. It's not a feasible solution for
> > us to develop and maintain every package.
>
> Basically, user wants to use THP, but as a service provider, you think
> differently, so want to override userspace choice. Am I getting it right?
Who is the platform provider in question? It makes me uneasy to see
such claims coming from an @gmail account, given current world events.
...
> >>>
> >>> I chose not to include this in the self-tests to avoid the complexity
> >>> of setting up cgroups for testing purposes. However, in patch #4 of
> >>> this series, I've included a simpler example demonstrating task-level
> >>> control.
> >>
> >> For task-level control, why not using prctl(PR_SET_THP_DISABLE)?
> >
> > You’ll need to modify the user-space code—and again, this likely
> > wouldn’t be a concern if you were managing a large fleet of servers.
> >
> >>
> >>> For service-level control, we could potentially utilize BPF task local
> >>> storage as an alternative approach.
> >>
> >> +cgroup people
> >>
> >> For service-level control, there was a proposal of adding cgroup based
> >> THP control[1]. You might need a strong use case to convince people.
> >>
> >> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
> >
> > Thanks for the reference. I've reviewed the related discussion, and if
> > I understand correctly, the proposal was rejected by the maintainers.
More to the point is why it was rejected. Why is your motivation different?
>
> I wonder why your approach is better than the cgroup based THP control proposal.
I think Matthew's response in that thread is pretty clear and still
relevant. If it isn't, can you state why?
The main difference is that you are saying it's in a container that you
don't control. Your plan is to violate the control the internal
applications have over THP because you know better. I'm not sure how
people might feel about you messing with workloads, but beyond that, you
are fundamentally fixing things at a sysadmin level because programmers
have made errors. You state as much in the cover letter, yes?
Thanks,
Liam
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-04-30 15:21 ` Liam R. Howlett
@ 2025-04-30 15:37 ` Yafang Shao
2025-04-30 15:53 ` Liam R. Howlett
0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2025-04-30 15:37 UTC (permalink / raw)
To: Liam R. Howlett, Zi Yan, Yafang Shao, akpm, ast, daniel, andrii,
David Hildenbrand, Baolin Wang, Lorenzo Stoakes, Nico Pache,
Ryan Roberts, Dev Jain, bpf, linux-mm, Johannes Weiner,
Michal Hocko
On Wed, Apr 30, 2025 at 11:21 PM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:
>
> * Zi Yan <ziy@nvidia.com> [250430 11:01]:
>
> ...
>
> > >>>>> Since multiple services run on a single host in a containerized environment,
> > >>>>> enabling THP globally is not ideal. Previously, we set THP to madvise,
> > >>>>> allowing selected services to opt in via MADV_HUGEPAGE. However, this
> > >>>>> approach had limitation:
> > >>>>>
> > >>>>> - Some services inadvertently used madvise(MADV_HUGEPAGE) through
> > >>>>> third-party libraries, bypassing our restrictions.
> > >>>>
> > >>>> Basically, you want more precise control of THP enablement and the
> > >>>> ability of overriding madvise() from userspace.
> > >>>>
> > >>>> In terms of overriding madvise(), do you have any concrete example of
> > >>>> these third-party libraries? madvise() users are supposed to know what
> > >>>> they are doing, so I wonder why they are causing trouble in your
> > >>>> environment.
> > >>>
> > >>> To my knowledge, jemalloc [0] supports THP.
> > >>> Applications using jemalloc typically rely on its default
> > >>> configurations rather than explicitly enabling or disabling THP. If
> > >>> the system is configured with THP=madvise, these applications may
> > >>> automatically leverage THP where appropriate
> > >>>
> > >>> [0]. https://github.com/jemalloc/jemalloc
> > >>
> > >> It sounds like a userspace issue. For jemalloc, if applications require
> > >> it, can't you replace the jemalloc with a one compiled with --disable-thp
> > >> to work around the issue?
> > >
> > > That’s not the issue this patchset is trying to address or work
> > > around. I believe we should focus on the actual problem it's meant to
> > > solve.
> > >
> > > By the way, you might not raise this question if you were managing a
> > > large fleet of servers. We're a platform provider, but we don’t
> > > maintain all the packages ourselves. Users make their own choices
> > > based on their specific requirements. It's not a feasible solution for
> > > us to develop and maintain every package.
> >
> > Basically, user wants to use THP, but as a service provider, you think
> > differently, so want to override userspace choice. Am I getting it right?
>
> Who is the platform provider in question? It makes me uneasy to have
> such claims from an @gmail account with current world events..
It’s a small company based in China, called PDD—if that information is helpful.
>
> ...
>
> > >>>
> > >>> I chose not to include this in the self-tests to avoid the complexity
> > >>> of setting up cgroups for testing purposes. However, in patch #4 of
> > >>> this series, I've included a simpler example demonstrating task-level
> > >>> control.
> > >>
> > >> For task-level control, why not using prctl(PR_SET_THP_DISABLE)?
> > >
> > > You’ll need to modify the user-space code—and again, this likely
> > > wouldn’t be a concern if you were managing a large fleet of servers.
> > >
> > >>
> > >>> For service-level control, we could potentially utilize BPF task local
> > >>> storage as an alternative approach.
> > >>
> > >> +cgroup people
> > >>
> > >> For service-level control, there was a proposal of adding cgroup based
> > >> THP control[1]. You might need a strong use case to convince people.
> > >>
> > >> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
> > >
> > > Thanks for the reference. I've reviewed the related discussion, and if
> > > I understand correctly, the proposal was rejected by the maintainers.
>
> More of the point is why it was rejected. Why is your motive different?
>
> >
> > I wonder why your approach is better than the cgroup based THP control proposal.
>
> I think Matthew's response in that thread is pretty clear and still
> relevant.
Are you referring to
https://lore.kernel.org/linux-mm/ZyT7QebITxOKNi_c@casper.infradead.org/
or https://lore.kernel.org/linux-mm/ZyIxRExcJvKKv4JW@casper.infradead.org/
?
If it’s the latter, then this patchset aims to make sysadmins' lives easier.
> If it isn't, can you state why?
>
> The main difference is that you are saying it's in a container that you
> don't control. Your plan is to violate the control the internal
> applications have over THP because you know better. I'm not sure how
> people might feel about you messing with workloads,
It’s not a mess. They have the option to deploy their services on
dedicated servers, but they would need to pay more for that choice.
This is a two-way decision.
> but beyond that, you
> are fundamentally fixing things at a sysadmin level because programmers
> have made errors.
No, they’re not making mistakes—they simply focus on the
implementation details of their own services and don’t find it
worthwhile to dive into kernel internals. Their services run perfectly
well with or without THP.
> You state as much in the cover letter, yes?
I’ll try to explain it in more detail in the next version if that
would be helpful.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-04-30 15:37 ` Yafang Shao
@ 2025-04-30 15:53 ` Liam R. Howlett
2025-04-30 16:06 ` Yafang Shao
0 siblings, 1 reply; 41+ messages in thread
From: Liam R. Howlett @ 2025-04-30 15:53 UTC (permalink / raw)
To: Yafang Shao
Cc: Zi Yan, akpm, ast, daniel, andrii, David Hildenbrand, Baolin Wang,
Lorenzo Stoakes, Nico Pache, Ryan Roberts, Dev Jain, bpf,
linux-mm, Johannes Weiner, Michal Hocko
* Yafang Shao <laoar.shao@gmail.com> [250430 11:37]:
> On Wed, Apr 30, 2025 at 11:21 PM Liam R. Howlett
> <Liam.Howlett@oracle.com> wrote:
> >
> > * Zi Yan <ziy@nvidia.com> [250430 11:01]:
> >
> > ...
> >
> > > >>>>> Since multiple services run on a single host in a containerized environment,
> > > >>>>> enabling THP globally is not ideal. Previously, we set THP to madvise,
> > > >>>>> allowing selected services to opt in via MADV_HUGEPAGE. However, this
> > > >>>>> approach had limitation:
> > > >>>>>
> > > >>>>> - Some services inadvertently used madvise(MADV_HUGEPAGE) through
> > > >>>>> third-party libraries, bypassing our restrictions.
> > > >>>>
> > > >>>> Basically, you want more precise control of THP enablement and the
> > > >>>> ability of overriding madvise() from userspace.
> > > >>>>
> > > >>>> In terms of overriding madvise(), do you have any concrete example of
> > > >>>> these third-party libraries? madvise() users are supposed to know what
> > > >>>> they are doing, so I wonder why they are causing trouble in your
> > > >>>> environment.
> > > >>>
> > > >>> To my knowledge, jemalloc [0] supports THP.
> > > >>> Applications using jemalloc typically rely on its default
> > > >>> configurations rather than explicitly enabling or disabling THP. If
> > > >>> the system is configured with THP=madvise, these applications may
> > > >>> automatically leverage THP where appropriate
> > > >>>
> > > >>> [0]. https://github.com/jemalloc/jemalloc
> > > >>
> > > >> It sounds like a userspace issue. For jemalloc, if applications require
> > > >> it, can't you replace the jemalloc with a one compiled with --disable-thp
> > > >> to work around the issue?
> > > >
> > > > That’s not the issue this patchset is trying to address or work
> > > > around. I believe we should focus on the actual problem it's meant to
> > > > solve.
> > > >
> > > > By the way, you might not raise this question if you were managing a
> > > > large fleet of servers. We're a platform provider, but we don’t
> > > > maintain all the packages ourselves. Users make their own choices
> > > > based on their specific requirements. It's not a feasible solution for
> > > > us to develop and maintain every package.
> > >
> > > Basically, user wants to use THP, but as a service provider, you think
> > > differently, so want to override userspace choice. Am I getting it right?
> >
> > Who is the platform provider in question? It makes me uneasy to have
> > such claims from an @gmail account with current world events..
>
> It’s a small company based in China, called PDD—if that information is helpful.
Thanks.
>
> >
> > ...
> >
> > > >>>
> > > >>> I chose not to include this in the self-tests to avoid the complexity
> > > >>> of setting up cgroups for testing purposes. However, in patch #4 of
> > > >>> this series, I've included a simpler example demonstrating task-level
> > > >>> control.
> > > >>
> > > >> For task-level control, why not using prctl(PR_SET_THP_DISABLE)?
> > > >
> > > > You’ll need to modify the user-space code—and again, this likely
> > > > wouldn’t be a concern if you were managing a large fleet of servers.
> > > >
> > > >>
> > > >>> For service-level control, we could potentially utilize BPF task local
> > > >>> storage as an alternative approach.
> > > >>
> > > >> +cgroup people
> > > >>
> > > >> For service-level control, there was a proposal of adding cgroup based
> > > >> THP control[1]. You might need a strong use case to convince people.
> > > >>
> > > >> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
> > > >
> > > > Thanks for the reference. I've reviewed the related discussion, and if
> > > > I understand correctly, the proposal was rejected by the maintainers.
> >
> > More of the point is why it was rejected. Why is your motive different?
> >
> > >
> > > I wonder why your approach is better than the cgroup based THP control proposal.
> >
> > I think Matthew's response in that thread is pretty clear and still
> > relevant.
>
> Are you refering
> https://lore.kernel.org/linux-mm/ZyT7QebITxOKNi_c@casper.infradead.org/
> or https://lore.kernel.org/linux-mm/ZyIxRExcJvKKv4JW@casper.infradead.org/
> ?
>
> If it’s the latter, then this patchset aims to make sysadmins' lives easier.
Both, really. Your patch gives the sysadmin another knob to turn and the
burden of knowing when to turn it. Matthew is suggesting we should know
when to do the right thing and avoid a knob in the first place.
>
> > If it isn't, can you state why?
> >
> > The main difference is that you are saying it's in a container that you
> > don't control. Your plan is to violate the control the internal
> > applications have over THP because you know better. I'm not sure how
> > people might feel about you messing with workloads,
>
> It’s not a mess. They have the option to deploy their services on
> dedicated servers, but they would need to pay more for that choice.
> This is a two-way decision.
This implies you want a container-level way of controlling the setting
and not a system service-level one? I guess I find the wording of the
problem statement unclear.
>
> > but beyond that, you
> > are fundamentally fixing things at a sysadmin level because programmers
> > have made errors.
>
> No, they’re not making mistakes—they simply focus on the
> implementation details of their own services and don’t find it
> worthwhile to dive into kernel internals. Their services run perfectly
> well with or without THP.
>
> > You state as much in the cover letter, yes?
>
> I’ll try to explain it in more detail in the next version if that
> would be helpful.
Yes, I think so.
Thanks,
Liam
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-04-30 15:53 ` Liam R. Howlett
@ 2025-04-30 16:06 ` Yafang Shao
2025-04-30 17:45 ` Johannes Weiner
0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2025-04-30 16:06 UTC (permalink / raw)
To: Liam R. Howlett, Yafang Shao, Zi Yan, akpm, ast, daniel, andrii,
David Hildenbrand, Baolin Wang, Lorenzo Stoakes, Nico Pache,
Ryan Roberts, Dev Jain, bpf, linux-mm, Johannes Weiner,
Michal Hocko
On Wed, Apr 30, 2025 at 11:53 PM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:
>
> * Yafang Shao <laoar.shao@gmail.com> [250430 11:37]:
> > On Wed, Apr 30, 2025 at 11:21 PM Liam R. Howlett
> > <Liam.Howlett@oracle.com> wrote:
> > >
> > > * Zi Yan <ziy@nvidia.com> [250430 11:01]:
> > >
> > > ...
> > >
> > > > >>>>> Since multiple services run on a single host in a containerized environment,
> > > > >>>>> enabling THP globally is not ideal. Previously, we set THP to madvise,
> > > > >>>>> allowing selected services to opt in via MADV_HUGEPAGE. However, this
> > > > >>>>> approach had limitation:
> > > > >>>>>
> > > > >>>>> - Some services inadvertently used madvise(MADV_HUGEPAGE) through
> > > > >>>>> third-party libraries, bypassing our restrictions.
> > > > >>>>
> > > > >>>> Basically, you want more precise control of THP enablement and the
> > > > >>>> ability of overriding madvise() from userspace.
> > > > >>>>
> > > > >>>> In terms of overriding madvise(), do you have any concrete example of
> > > > >>>> these third-party libraries? madvise() users are supposed to know what
> > > > >>>> they are doing, so I wonder why they are causing trouble in your
> > > > >>>> environment.
> > > > >>>
> > > > >>> To my knowledge, jemalloc [0] supports THP.
> > > > >>> Applications using jemalloc typically rely on its default
> > > > >>> configurations rather than explicitly enabling or disabling THP. If
> > > > >>> the system is configured with THP=madvise, these applications may
> > > > >>> automatically leverage THP where appropriate
> > > > >>>
> > > > >>> [0]. https://github.com/jemalloc/jemalloc
> > > > >>
> > > > >> It sounds like a userspace issue. For jemalloc, if applications require
> > > > >> it, can't you replace the jemalloc with a one compiled with --disable-thp
> > > > >> to work around the issue?
> > > > >
> > > > > That’s not the issue this patchset is trying to address or work
> > > > > around. I believe we should focus on the actual problem it's meant to
> > > > > solve.
> > > > >
> > > > > By the way, you might not raise this question if you were managing a
> > > > > large fleet of servers. We're a platform provider, but we don’t
> > > > > maintain all the packages ourselves. Users make their own choices
> > > > > based on their specific requirements. It's not a feasible solution for
> > > > > us to develop and maintain every package.
> > > >
> > > > Basically, user wants to use THP, but as a service provider, you think
> > > > differently, so want to override userspace choice. Am I getting it right?
> > >
> > > Who is the platform provider in question? It makes me uneasy to have
> > > such claims from an @gmail account with current world events..
> >
> > It’s a small company based in China, called PDD—if that information is helpful.
>
> Thanks.
>
> >
> > >
> > > ...
> > >
> > > > >>>
> > > > >>> I chose not to include this in the self-tests to avoid the complexity
> > > > >>> of setting up cgroups for testing purposes. However, in patch #4 of
> > > > >>> this series, I've included a simpler example demonstrating task-level
> > > > >>> control.
> > > > >>
> > > > >> For task-level control, why not using prctl(PR_SET_THP_DISABLE)?
> > > > >
> > > > > You’ll need to modify the user-space code—and again, this likely
> > > > > wouldn’t be a concern if you were managing a large fleet of servers.
> > > > >
> > > > >>
> > > > >>> For service-level control, we could potentially utilize BPF task local
> > > > >>> storage as an alternative approach.
> > > > >>
> > > > >> +cgroup people
> > > > >>
> > > > >> For service-level control, there was a proposal of adding cgroup based
> > > > >> THP control[1]. You might need a strong use case to convince people.
> > > > >>
> > > > >> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
> > > > >
> > > > > Thanks for the reference. I've reviewed the related discussion, and if
> > > > > I understand correctly, the proposal was rejected by the maintainers.
> > >
> > > More of the point is why it was rejected. Why is your motive different?
> > >
> > > >
> > > > I wonder why your approach is better than the cgroup based THP control proposal.
> > >
> > > I think Matthew's response in that thread is pretty clear and still
> > > relevant.
> >
> > Are you refering
> > https://lore.kernel.org/linux-mm/ZyT7QebITxOKNi_c@casper.infradead.org/
> > or https://lore.kernel.org/linux-mm/ZyIxRExcJvKKv4JW@casper.infradead.org/
> > ?
> >
> > If it’s the latter, then this patchset aims to make sysadmins' lives easier.
>
> Both, really. Your patch gives the sysadm another knob to turn and know
> when to turn it. Matthew is suggesting we should know when to do the
> right thing and avoid a knob in the first place.
The problem is that there's no proper mechanism to control THP at the
container level. From the moment we introduced containers and cgroups,
the goal has been to manage all resources through cgroups. Of course,
implementing everything at once wasn’t feasible, so we added
controllers incrementally—and we’re still introducing new ones even
today, aren’t we? Now, with BPF, we have a more flexible way to
achieve this—so why not use it?
I believe we should focus on making life easier for users, not just
sysadmins. That philosophy has been a driving force behind the
continued development of the Linux kernel.
>
> >
> > > If it isn't, can you state why?
> > >
> > > The main difference is that you are saying it's in a container that you
> > > don't control. Your plan is to violate the control the internal
> > > applications have over THP because you know better. I'm not sure how
> > > people might feel about you messing with workloads,
> >
> > It’s not a mess. They have the option to deploy their services on
> > dedicated servers, but they would need to pay more for that choice.
> > This is a two-way decision.
>
> This implies you want a container-level way of controlling the setting
> and not a system service-level?
Right. We want to control the THP per container.
> I guess I find the wording of the
> problem statement unclear.
>
> >
> > > but beyond that, you
> > > are fundamentally fixing things at a sysadmin level because programmers
> > > have made errors.
> >
> > No, they’re not making mistakes—they simply focus on the
> > implementation details of their own services and don’t find it
> > worthwhile to dive into kernel internals. Their services run perfectly
> > well with or without THP.
> >
> > > You state as much in the cover letter, yes?
> >
> > I’ll try to explain it in more detail in the next version if that
> > would be helpful.
>
> Yes, I think so.
>
> Thanks,
> Liam
--
Regards
Yafang
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-04-30 16:06 ` Yafang Shao
@ 2025-04-30 17:45 ` Johannes Weiner
2025-04-30 17:53 ` Zi Yan
0 siblings, 1 reply; 41+ messages in thread
From: Johannes Weiner @ 2025-04-30 17:45 UTC (permalink / raw)
To: Yafang Shao
Cc: Liam R. Howlett, Zi Yan, akpm, ast, daniel, andrii,
David Hildenbrand, Baolin Wang, Lorenzo Stoakes, Nico Pache,
Ryan Roberts, Dev Jain, bpf, linux-mm, Michal Hocko
On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
> > > > If it isn't, can you state why?
> > > >
> > > > The main difference is that you are saying it's in a container that you
> > > > don't control. Your plan is to violate the control the internal
> > > > applications have over THP because you know better. I'm not sure how
> > > > people might feel about you messing with workloads,
> > >
> > > It’s not a mess. They have the option to deploy their services on
> > > dedicated servers, but they would need to pay more for that choice.
> > > This is a two-way decision.
> >
> > This implies you want a container-level way of controlling the setting
> > and not a system service-level?
>
> Right. We want to control the THP per container.
This does strike me as a reasonable usecase.
I think there is consensus that in the long-term we want this stuff to
just work and truly be transparent to userspace.
In the short-to-medium term, however, there are still quite a few
caveats. thp=always can significantly increase the memory footprint of
sparse virtual regions. Huge allocations are not as cheap and reliable
as we would like them to be, which for real production systems means
having to make workload-specific choices and tradeoffs.
There is ongoing work in these areas, but we do have a bit of a
chicken-and-egg problem: on the one hand, huge page adoption is slow
due to limitations in how they can be deployed. For example, we can't
do thp=always on a DC node that runs arbitrary combinations of jobs
from a wide array of services. Some might benefit, some might hurt.
Yet, it's much easier to improve the kernel based on exactly such
production experience and data from real-world usecases. We can't
improve the THP shrinker if we can't run THP.
So I don't see it as overriding whoever wrote the software running
inside the container. They don't know, and they shouldn't have to care
about page sizes. It's about letting admins and kernel teams get
started on using and experimenting with this stuff, given the very
real constraints right now, so we can get the feedback necessary to
improve the situation.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-04-30 17:45 ` Johannes Weiner
@ 2025-04-30 17:53 ` Zi Yan
2025-05-01 19:36 ` Gutierrez Asier
0 siblings, 1 reply; 41+ messages in thread
From: Zi Yan @ 2025-04-30 17:53 UTC (permalink / raw)
To: Johannes Weiner
Cc: Yafang Shao, Liam R. Howlett, akpm, ast, daniel, andrii,
David Hildenbrand, Baolin Wang, Lorenzo Stoakes, Nico Pache,
Ryan Roberts, Dev Jain, bpf, linux-mm, Michal Hocko,
Asier Gutierrez
On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
>>>>> If it isn't, can you state why?
>>>>>
>>>>> The main difference is that you are saying it's in a container that you
>>>>> don't control. Your plan is to violate the control the internal
>>>>> applications have over THP because you know better. I'm not sure how
>>>>> people might feel about you messing with workloads,
>>>>
>>>> It’s not a mess. They have the option to deploy their services on
>>>> dedicated servers, but they would need to pay more for that choice.
>>>> This is a two-way decision.
>>>
>>> This implies you want a container-level way of controlling the setting
>>> and not a system service-level?
>>
>> Right. We want to control the THP per container.
>
> This does strike me as a reasonable usecase.
>
> I think there is consensus that in the long-term we want this stuff to
> just work and truly be transparent to userspace.
>
> In the short-to-medium term, however, there are still quite a few
> caveats. thp=always can significantly increase the memory footprint of
> sparse virtual regions. Huge allocations are not as cheap and reliable
> as we would like them to be, which for real production systems means
> having to make workload-specifcic choices and tradeoffs.
>
> There is ongoing work in these areas, but we do have a bit of a
> chicken-and-egg problem: on the one hand, huge page adoption is slow
> due to limitations in how they can be deployed. For example, we can't
> do thp=always on a DC node that runs arbitary combinations of jobs
> from a wide array of services. Some might benefit, some might hurt.
>
> Yet, it's much easier to improve the kernel based on exactly such
> production experience and data from real-world usecases. We can't
> improve the THP shrinker if we can't run THP.
>
> So I don't see it as overriding whoever wrote the software running
> inside the container. They don't know, and they shouldn't have to care
> about page sizes. It's about letting admins and kernel teams get
> started on using and experimenting with this stuff, given the very
> real constraints right now, so we can get the feedback necessary to
> improve the situation.
Since you think it is reasonable to control THP at the container level,
namely per-cgroup, should we reconsider cgroup-based THP control[1]?
(Asier cc'd)
In this patchset, Yafang uses BPF to adjust THP global configs based
on the VMA, which does not look like a good approach to me. WDYT?
[1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-04-30 14:38 ` Yafang Shao
2025-04-30 15:00 ` Zi Yan
@ 2025-04-30 17:59 ` Johannes Weiner
2025-05-01 0:40 ` Yafang Shao
1 sibling, 1 reply; 41+ messages in thread
From: Johannes Weiner @ 2025-04-30 17:59 UTC (permalink / raw)
To: Yafang Shao
Cc: Zi Yan, akpm, ast, daniel, andrii, David Hildenbrand, Baolin Wang,
Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, bpf, linux-mm, Michal Hocko
On Wed, Apr 30, 2025 at 10:38:10PM +0800, Yafang Shao wrote:
> On Wed, Apr 30, 2025 at 9:19 PM Zi Yan <ziy@nvidia.com> wrote:
> > For task-level control, why not using prctl(PR_SET_THP_DISABLE)?
>
> You’ll need to modify the user-space code—and again, this likely
> wouldn’t be a concern if you were managing a large fleet of servers.
These flags are propagated along the process tree, so you only need to
tweak the management software that launches the container
workload. Which is presumably the same entity that would tweak cgroup
settings.
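As a rough illustration (a sketch only; the wrapper and its argument
handling below are made up for this example, not something provided by
this series), the launcher-side change can be as small as:

  #include <stdio.h>
  #include <unistd.h>
  #include <sys/prctl.h>

  #ifndef PR_SET_THP_DISABLE
  #define PR_SET_THP_DISABLE 41
  #endif

  /* Wrapper sketch: disable THP for this process, then exec the
   * container workload. The per-process flag is inherited by every
   * child, so nothing inside the container has to change.
   */
  int main(int argc, char **argv)
  {
          if (argc < 2) {
                  fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
                  return 1;
          }
          if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
                  perror("prctl(PR_SET_THP_DISABLE)");
          execvp(argv[1], &argv[1]);
          perror("execvp");
          return 127;
  }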
> > For service-level control, there was a proposal of adding cgroup based
> > THP control[1]. You might need a strong use case to convince people.
> >
> > [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
>
> Thanks for the reference. I've reviewed the related discussion, and if
> I understand correctly, the proposal was rejected by the maintainers.
Cgroups are for nested trees dividing up resources. They're not a good
fit for arbitrary, non-hierarchical policy settings.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-04-30 17:59 ` Johannes Weiner
@ 2025-05-01 0:40 ` Yafang Shao
0 siblings, 0 replies; 41+ messages in thread
From: Yafang Shao @ 2025-05-01 0:40 UTC (permalink / raw)
To: Johannes Weiner
Cc: Zi Yan, akpm, ast, daniel, andrii, David Hildenbrand, Baolin Wang,
Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, bpf, linux-mm, Michal Hocko
On Thu, May 1, 2025 at 2:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Apr 30, 2025 at 10:38:10PM +0800, Yafang Shao wrote:
> > On Wed, Apr 30, 2025 at 9:19 PM Zi Yan <ziy@nvidia.com> wrote:
> > > For task-level control, why not using prctl(PR_SET_THP_DISABLE)?
> >
> > You’ll need to modify the user-space code—and again, this likely
> > wouldn’t be a concern if you were managing a large fleet of servers.
>
> These flags are propagated along the process tree, so you only need to
> tweak the management software that launches the container
> workload. Which is presumably the same entity that would tweak cgroup
> settings.
Right, we can modify the parent process code. In a containerized
environment, that would mean modifying the containerd source code.
However, deploying such changes is far from simple:
1. We'd need to deploy the modified containerd to our production servers.
2. All running services would need to be restarted.
3. We'd have to coordinate with teams whose services don’t benefit
from THP to ensure their cooperation.
4. Only then could we set “thp=always” across our production servers.
5. Next, we’d annotate the services that do benefit from THP, restart
them, and monitor behavior.
6. If anything goes wrong, we may have to repeat the process.
7. For systemd-managed services, we might need to implement a separate solution.
This is a painful and disruptive process. In contrast, with the
BPF-based solution, we simply introduce a plugin. Only the services
that want to use THP need to restart, and they’re generally willing to
do so since they benefit from it.
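To make it concrete, here is a sketch of such a plugin (illustrative
only: the hook name, the program signature, and the meaning of the
return value are placeholders I'm using for this example, not
necessarily the exact interface of this series):

  /* thp_policy.bpf.c -- sketch only */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char LICENSE[] SEC("license") = "GPL";

  /* Set by the loader to the tgid of the service that should get THP;
   * every other task keeps the global default.
   */
  const volatile pid_t thp_tgid;

  /* Hypothetical fmod_ret attachment to a THP policy helper. */
  SEC("fmod_ret/hugepage_global_always")
  int BPF_PROG(thp_policy, int ret)
  {
          pid_t tgid = bpf_get_current_pid_tgid() >> 32;

          /* Non-zero return: treat THP as "always" for the chosen
           * service; zero: fall back to the global setting.
           */
          return tgid == thp_tgid ? 1 : 0;
  }

The loader only needs to fill in thp_tgid and attach the program;
detaching it restores the global policy, which is what makes rollout
and rollback cheap compared to redeploying containerd.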
As you mentioned in another email, this entire process is
experimental, so it’s very likely that we’ll encounter unexpected
issues. That’s why flexibility and ease of adjustment are critical.
By the way, I have another draft that hooks into the fork() procedure
using BPF to adjust per-process attributes of services, aiming to
simplify deployment—but that’s a separate topic.
>
> > > For service-level control, there was a proposal of adding cgroup based
> > > THP control[1]. You might need a strong use case to convince people.
> > >
> > > [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
> >
> > Thanks for the reference. I've reviewed the related discussion, and if
> > I understand correctly, the proposal was rejected by the maintainers.
>
> Cgroups are for nested trees dividing up resources. They're not a good
> fit for arbitrary, non-hierarchical policy settings.
Thanks for the explanation.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-04-30 17:53 ` Zi Yan
@ 2025-05-01 19:36 ` Gutierrez Asier
2025-05-02 5:48 ` Yafang Shao
0 siblings, 1 reply; 41+ messages in thread
From: Gutierrez Asier @ 2025-05-01 19:36 UTC (permalink / raw)
To: Zi Yan, Johannes Weiner
Cc: Yafang Shao, Liam R. Howlett, akpm, ast, daniel, andrii,
David Hildenbrand, Baolin Wang, Lorenzo Stoakes, Nico Pache,
Ryan Roberts, Dev Jain, bpf, linux-mm, Michal Hocko
On 4/30/2025 8:53 PM, Zi Yan wrote:
> On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
>
>> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
>>>>>> If it isn't, can you state why?
>>>>>>
>>>>>> The main difference is that you are saying it's in a container that you
>>>>>> don't control. Your plan is to violate the control the internal
>>>>>> applications have over THP because you know better. I'm not sure how
>>>>>> people might feel about you messing with workloads,
>>>>>
>>>>> It’s not a mess. They have the option to deploy their services on
>>>>> dedicated servers, but they would need to pay more for that choice.
>>>>> This is a two-way decision.
>>>>
>>>> This implies you want a container-level way of controlling the setting
>>>> and not a system service-level?
>>>
>>> Right. We want to control the THP per container.
>>
>> This does strike me as a reasonable usecase.
>>
>> I think there is consensus that in the long-term we want this stuff to
>> just work and truly be transparent to userspace.
>>
>> In the short-to-medium term, however, there are still quite a few
>> caveats. thp=always can significantly increase the memory footprint of
>> sparse virtual regions. Huge allocations are not as cheap and reliable
>> as we would like them to be, which for real production systems means
>> having to make workload-specifcic choices and tradeoffs.
>>
>> There is ongoing work in these areas, but we do have a bit of a
>> chicken-and-egg problem: on the one hand, huge page adoption is slow
>> due to limitations in how they can be deployed. For example, we can't
>> do thp=always on a DC node that runs arbitary combinations of jobs
>> from a wide array of services. Some might benefit, some might hurt.
>>
>> Yet, it's much easier to improve the kernel based on exactly such
>> production experience and data from real-world usecases. We can't
>> improve the THP shrinker if we can't run THP.
>>
>> So I don't see it as overriding whoever wrote the software running
>> inside the container. They don't know, and they shouldn't have to care
>> about page sizes. It's about letting admins and kernel teams get
>> started on using and experimenting with this stuff, given the very
>> real constraints right now, so we can get the feedback necessary to
>> improve the situation.
>
> Since you think it is reasonable to control THP at container-level,
> namely per-cgroup. Should we reconsider cgroup-based THP control[1]?
> (Asier cc'd)
>
> In this patchset, Yafang uses BPF to adjust THP global configs based
> on VMA, which does not look a good approach to me. WDYT?
>
>
> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
>
> --
> Best Regards,
> Yan, Zi
Hi,
I believe cgroup is a better approach for containers, since this
approach can be easily integrated with the user-space stack, such as
containerd and Kubernetes, which use cgroups to control system resources.
However, as I pointed out earlier, the approach I suggested has some
flaws:
1. Potential pollution of cgroup with a large number of knobs
2. Requires configuration by the admin
Ideally, as Matthew W. mentioned, there should be an automatic system.
Anyway, regarding containers, I believe cgroup is a good approach
given that the admin or the container management system uses cgroups
to set up the containers.
--
Asier Gutierrez
Huawei
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-05-01 19:36 ` Gutierrez Asier
@ 2025-05-02 5:48 ` Yafang Shao
2025-05-02 12:00 ` Zi Yan
2025-05-05 9:11 ` Gutierrez Asier
0 siblings, 2 replies; 41+ messages in thread
From: Yafang Shao @ 2025-05-02 5:48 UTC (permalink / raw)
To: Gutierrez Asier
Cc: Zi Yan, Johannes Weiner, Liam R. Howlett, akpm, ast, daniel,
andrii, David Hildenbrand, Baolin Wang, Lorenzo Stoakes,
Nico Pache, Ryan Roberts, Dev Jain, bpf, linux-mm, Michal Hocko
On Fri, May 2, 2025 at 3:36 AM Gutierrez Asier
<gutierrez.asier@huawei-partners.com> wrote:
>
>
> On 4/30/2025 8:53 PM, Zi Yan wrote:
> > On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
> >
> >> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
> >>>>>> If it isn't, can you state why?
> >>>>>>
> >>>>>> The main difference is that you are saying it's in a container that you
> >>>>>> don't control. Your plan is to violate the control the internal
> >>>>>> applications have over THP because you know better. I'm not sure how
> >>>>>> people might feel about you messing with workloads,
> >>>>>
> >>>>> It’s not a mess. They have the option to deploy their services on
> >>>>> dedicated servers, but they would need to pay more for that choice.
> >>>>> This is a two-way decision.
> >>>>
> >>>> This implies you want a container-level way of controlling the setting
> >>>> and not a system service-level?
> >>>
> >>> Right. We want to control the THP per container.
> >>
> >> This does strike me as a reasonable usecase.
> >>
> >> I think there is consensus that in the long-term we want this stuff to
> >> just work and truly be transparent to userspace.
> >>
> >> In the short-to-medium term, however, there are still quite a few
> >> caveats. thp=always can significantly increase the memory footprint of
> >> sparse virtual regions. Huge allocations are not as cheap and reliable
> >> as we would like them to be, which for real production systems means
> >> having to make workload-specifcic choices and tradeoffs.
> >>
> >> There is ongoing work in these areas, but we do have a bit of a
> >> chicken-and-egg problem: on the one hand, huge page adoption is slow
> >> due to limitations in how they can be deployed. For example, we can't
> >> do thp=always on a DC node that runs arbitary combinations of jobs
> >> from a wide array of services. Some might benefit, some might hurt.
> >>
> >> Yet, it's much easier to improve the kernel based on exactly such
> >> production experience and data from real-world usecases. We can't
> >> improve the THP shrinker if we can't run THP.
> >>
> >> So I don't see it as overriding whoever wrote the software running
> >> inside the container. They don't know, and they shouldn't have to care
> >> about page sizes. It's about letting admins and kernel teams get
> >> started on using and experimenting with this stuff, given the very
> >> real constraints right now, so we can get the feedback necessary to
> >> improve the situation.
> >
> > Since you think it is reasonable to control THP at container-level,
> > namely per-cgroup. Should we reconsider cgroup-based THP control[1]?
> > (Asier cc'd)
> >
> > In this patchset, Yafang uses BPF to adjust THP global configs based
> > on VMA, which does not look a good approach to me. WDYT?
> >
> >
> > [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
> >
> > --
> > Best Regards,
> > Yan, Zi
>
> Hi,
>
> I believe cgroup is a better approach for containers, since this
> approach can be easily integrated with the user space stack like
> containerd and kubernets, which use cgroup to control system resources.
The integration of BPF with containerd and Kubernetes is emerging as a
clear trend.
>
> However, I pointed out earlier, the approach I suggested has some
> flaws:
> 1. Potential polution of cgroup with a big number of knobs
Right, the memcg maintainers once told me that introducing a new
cgroup file means committing to maintaining it indefinitely, as these
interface files are treated as part of the ABI.
In contrast, BPF kfuncs are considered an unstable API, giving you the
flexibility to modify them later if needed.
> 2. Requires configuration by the admin
>
> Ideally, as Matthew W. mentioned, there should be an automatic system.
Take Matthew’s XFS large folio feature as an example—it was enabled
automatically. A few years ago, when we upgraded to the 6.1.y stable
kernel, we noticed this new feature. Since it was enabled by default,
we assumed the author was confident in its stability. Unfortunately,
it led to severe issues in our production environment: servers crashed
randomly, and in some cases, we experienced data loss without
understanding the root cause.
We began disabling various kernel configurations in an attempt to
isolate the issue, and eventually, the problem disappeared after
disabling CONFIG_TRANSPARENT_HUGEPAGE. As a result, we released a new
kernel version with THP disabled and had to restart hundreds of
thousands of production servers. It was a nightmare for both us and
our sysadmins.
Last year, we discovered that the initial issue had been resolved by this patch:
https://lore.kernel.org/stable/20241001210625.95825-1-ryncsn@gmail.com/.
We backported the fix and re-enabled XFS large folios—only to face a
new nightmare. One of our services began crashing sporadically with
core dumps. It took us several months to trace the issue back to the
re-enabled XFS large folio feature. Fortunately, we were able to
disable it using livepatch, avoiding another round of mass server
restarts. To this day, the root cause remains unknown. The good news
is that the issue appears to be resolved in the 6.12.y stable kernel.
We're still trying to bisect which commit fixed it, though progress is
slow because the issue is not reliably reproducible.
In theory, new features should be enabled automatically. But in
practice, every new feature should come with a tunable knob. That’s a
lesson we learned the hard way from this experience—and perhaps
Matthew did too.
>
> Anyway, regarding containers, I believe cgroup is a good approach
> given that the admin or the container management system uses cgroups
> to set up the containers.
>
> --
> Asier Gutierrez
> Huawei
>
--
Regards
Yafang
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-05-02 5:48 ` Yafang Shao
@ 2025-05-02 12:00 ` Zi Yan
2025-05-02 12:18 ` Yafang Shao
2025-05-05 9:11 ` Gutierrez Asier
1 sibling, 1 reply; 41+ messages in thread
From: Zi Yan @ 2025-05-02 12:00 UTC (permalink / raw)
To: Yafang Shao
Cc: Gutierrez Asier, Johannes Weiner, Liam R. Howlett, akpm, ast,
daniel, andrii, David Hildenbrand, Baolin Wang, Lorenzo Stoakes,
Nico Pache, Ryan Roberts, Dev Jain, bpf, linux-mm, Michal Hocko
On 2 May 2025, at 1:48, Yafang Shao wrote:
> On Fri, May 2, 2025 at 3:36 AM Gutierrez Asier
> <gutierrez.asier@huawei-partners.com> wrote:
>>
>>
>> On 4/30/2025 8:53 PM, Zi Yan wrote:
>>> On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
>>>
>>>> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
>>>>>>>> If it isn't, can you state why?
>>>>>>>>
>>>>>>>> The main difference is that you are saying it's in a container that you
>>>>>>>> don't control. Your plan is to violate the control the internal
>>>>>>>> applications have over THP because you know better. I'm not sure how
>>>>>>>> people might feel about you messing with workloads,
>>>>>>>
>>>>>>> It’s not a mess. They have the option to deploy their services on
>>>>>>> dedicated servers, but they would need to pay more for that choice.
>>>>>>> This is a two-way decision.
>>>>>>
>>>>>> This implies you want a container-level way of controlling the setting
>>>>>> and not a system service-level?
>>>>>
>>>>> Right. We want to control the THP per container.
>>>>
>>>> This does strike me as a reasonable usecase.
>>>>
>>>> I think there is consensus that in the long-term we want this stuff to
>>>> just work and truly be transparent to userspace.
>>>>
>>>> In the short-to-medium term, however, there are still quite a few
>>>> caveats. thp=always can significantly increase the memory footprint of
>>>> sparse virtual regions. Huge allocations are not as cheap and reliable
>>>> as we would like them to be, which for real production systems means
>>>> having to make workload-specifcic choices and tradeoffs.
>>>>
>>>> There is ongoing work in these areas, but we do have a bit of a
>>>> chicken-and-egg problem: on the one hand, huge page adoption is slow
>>>> due to limitations in how they can be deployed. For example, we can't
>>>> do thp=always on a DC node that runs arbitary combinations of jobs
>>>> from a wide array of services. Some might benefit, some might hurt.
>>>>
>>>> Yet, it's much easier to improve the kernel based on exactly such
>>>> production experience and data from real-world usecases. We can't
>>>> improve the THP shrinker if we can't run THP.
>>>>
>>>> So I don't see it as overriding whoever wrote the software running
>>>> inside the container. They don't know, and they shouldn't have to care
>>>> about page sizes. It's about letting admins and kernel teams get
>>>> started on using and experimenting with this stuff, given the very
>>>> real constraints right now, so we can get the feedback necessary to
>>>> improve the situation.
>>>
>>> Since you think it is reasonable to control THP at container-level,
>>> namely per-cgroup. Should we reconsider cgroup-based THP control[1]?
>>> (Asier cc'd)
>>>
>>> In this patchset, Yafang uses BPF to adjust THP global configs based
>>> on VMA, which does not look a good approach to me. WDYT?
>>>
>>>
>>> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
>>
>> Hi,
>>
>> I believe cgroup is a better approach for containers, since this
>> approach can be easily integrated with the user space stack like
>> containerd and kubernets, which use cgroup to control system resources.
>
> The integration of BPF with containerd and Kubernetes is emerging as a
> clear trend.
>
>>
>> However, I pointed out earlier, the approach I suggested has some
>> flaws:
>> 1. Potential polution of cgroup with a big number of knobs
>
> Right, the memcg maintainers once told me that introducing a new
> cgroup file means committing to maintaining it indefinitely, as these
> interface files are treated as part of the ABI.
> In contrast, BPF kfuncs are considered an unstable API, giving you the
> flexibility to modify them later if needed.
>
>> 2. Requires configuration by the admin
>>
>> Ideally, as Matthew W. mentioned, there should be an automatic system.
>
> Take Matthew’s XFS large folio feature as an example—it was enabled
> automatically. A few years ago, when we upgraded to the 6.1.y stable
> kernel, we noticed this new feature. Since it was enabled by default,
> we assumed the author was confident in its stability. Unfortunately,
> it led to severe issues in our production environment: servers crashed
> randomly, and in some cases, we experienced data loss without
> understanding the root cause.
>
> We began disabling various kernel configurations in an attempt to
> isolate the issue, and eventually, the problem disappeared after
> disabling CONFIG_TRANSPARENT_HUGEPAGE. As a result, we released a new
> kernel version with THP disabled and had to restart hundreds of
> thousands of production servers. It was a nightmare for both us and
> our sysadmins.
>
> Last year, we discovered that the initial issue had been resolved by this patch:
> https://lore.kernel.org/stable/20241001210625.95825-1-ryncsn@gmail.com/.
> We backported the fix and re-enabled XFS large folios—only to face a
> new nightmare. One of our services began crashing sporadically with
> core dumps. It took us several months to trace the issue back to the
> re-enabled XFS large folio feature. Fortunately, we were able to
> disable it using livepatch, avoiding another round of mass server
> restarts. To this day, the root cause remains unknown. The good news
> is that the issue appears to be resolved in the 6.12.y stable kernel.
> We're still trying to bisect which commit fixed it, though progress is
> slow because the issue is not reliably reproducible.
This is a very wrong attitude towards open source projects. You sounded
as if, whether intended or not, the Linux community should provide
issue-free kernels and is responsible for fixing all issues. But that is
wrong.
Since you are using the kernel, you could help improve it like Kairui
is doing, instead of waiting for others to fix the issue.
>
> In theory, new features should be enabled automatically. But in
> practice, every new feature should come with a tunable knob. That’s a
> lesson we learned the hard way from this experience—and perhaps
> Matthew did too.
That means new features will not get enough testing. People like you
will just disable all new features and wait until they are stable. That
stability will never come without testing and bug fixes.
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-05-02 12:00 ` Zi Yan
@ 2025-05-02 12:18 ` Yafang Shao
2025-05-02 13:04 ` David Hildenbrand
0 siblings, 1 reply; 41+ messages in thread
From: Yafang Shao @ 2025-05-02 12:18 UTC (permalink / raw)
To: Zi Yan
Cc: Gutierrez Asier, Johannes Weiner, Liam R. Howlett, akpm, ast,
daniel, andrii, David Hildenbrand, Baolin Wang, Lorenzo Stoakes,
Nico Pache, Ryan Roberts, Dev Jain, bpf, linux-mm, Michal Hocko
On Fri, May 2, 2025 at 8:00 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 2 May 2025, at 1:48, Yafang Shao wrote:
>
> > On Fri, May 2, 2025 at 3:36 AM Gutierrez Asier
> > <gutierrez.asier@huawei-partners.com> wrote:
> >>
> >>
> >> On 4/30/2025 8:53 PM, Zi Yan wrote:
> >>> On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
> >>>
> >>>> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
> >>>>>>>> If it isn't, can you state why?
> >>>>>>>>
> >>>>>>>> The main difference is that you are saying it's in a container that you
> >>>>>>>> don't control. Your plan is to violate the control the internal
> >>>>>>>> applications have over THP because you know better. I'm not sure how
> >>>>>>>> people might feel about you messing with workloads,
> >>>>>>>
> >>>>>>> It’s not a mess. They have the option to deploy their services on
> >>>>>>> dedicated servers, but they would need to pay more for that choice.
> >>>>>>> This is a two-way decision.
> >>>>>>
> >>>>>> This implies you want a container-level way of controlling the setting
> >>>>>> and not a system service-level?
> >>>>>
> >>>>> Right. We want to control the THP per container.
> >>>>
> >>>> This does strike me as a reasonable usecase.
> >>>>
> >>>> I think there is consensus that in the long-term we want this stuff to
> >>>> just work and truly be transparent to userspace.
> >>>>
> >>>> In the short-to-medium term, however, there are still quite a few
> >>>> caveats. thp=always can significantly increase the memory footprint of
> >>>> sparse virtual regions. Huge allocations are not as cheap and reliable
> >>>> as we would like them to be, which for real production systems means
> >>>> having to make workload-specifcic choices and tradeoffs.
> >>>>
> >>>> There is ongoing work in these areas, but we do have a bit of a
> >>>> chicken-and-egg problem: on the one hand, huge page adoption is slow
> >>>> due to limitations in how they can be deployed. For example, we can't
> >>>> do thp=always on a DC node that runs arbitary combinations of jobs
> >>>> from a wide array of services. Some might benefit, some might hurt.
> >>>>
> >>>> Yet, it's much easier to improve the kernel based on exactly such
> >>>> production experience and data from real-world usecases. We can't
> >>>> improve the THP shrinker if we can't run THP.
> >>>>
> >>>> So I don't see it as overriding whoever wrote the software running
> >>>> inside the container. They don't know, and they shouldn't have to care
> >>>> about page sizes. It's about letting admins and kernel teams get
> >>>> started on using and experimenting with this stuff, given the very
> >>>> real constraints right now, so we can get the feedback necessary to
> >>>> improve the situation.
> >>>
> >>> Since you think it is reasonable to control THP at container-level,
> >>> namely per-cgroup. Should we reconsider cgroup-based THP control[1]?
> >>> (Asier cc'd)
> >>>
> >>> In this patchset, Yafang uses BPF to adjust THP global configs based
> >>> on VMA, which does not look a good approach to me. WDYT?
> >>>
> >>>
> >>> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
> >>>
> >>> --
> >>> Best Regards,
> >>> Yan, Zi
> >>
> >> Hi,
> >>
> >> I believe cgroup is a better approach for containers, since this
> >> approach can be easily integrated with the user space stack like
> >> containerd and kubernets, which use cgroup to control system resources.
> >
> > The integration of BPF with containerd and Kubernetes is emerging as a
> > clear trend.
> >
> >>
> >> However, I pointed out earlier, the approach I suggested has some
> >> flaws:
> >> 1. Potential polution of cgroup with a big number of knobs
> >
> > Right, the memcg maintainers once told me that introducing a new
> > cgroup file means committing to maintaining it indefinitely, as these
> > interface files are treated as part of the ABI.
> > In contrast, BPF kfuncs are considered an unstable API, giving you the
> > flexibility to modify them later if needed.
> >
> >> 2. Requires configuration by the admin
> >>
> >> Ideally, as Matthew W. mentioned, there should be an automatic system.
> >
> > Take Matthew’s XFS large folio feature as an example—it was enabled
> > automatically. A few years ago, when we upgraded to the 6.1.y stable
> > kernel, we noticed this new feature. Since it was enabled by default,
> > we assumed the author was confident in its stability. Unfortunately,
> > it led to severe issues in our production environment: servers crashed
> > randomly, and in some cases, we experienced data loss without
> > understanding the root cause.
> >
> > We began disabling various kernel configurations in an attempt to
> > isolate the issue, and eventually, the problem disappeared after
> > disabling CONFIG_TRANSPARENT_HUGEPAGE. As a result, we released a new
> > kernel version with THP disabled and had to restart hundreds of
> > thousands of production servers. It was a nightmare for both us and
> > our sysadmins.
> >
> > Last year, we discovered that the initial issue had been resolved by this patch:
> > https://lore.kernel.org/stable/20241001210625.95825-1-ryncsn@gmail.com/.
> > We backported the fix and re-enabled XFS large folios—only to face a
> > new nightmare. One of our services began crashing sporadically with
> > core dumps. It took us several months to trace the issue back to the
> > re-enabled XFS large folio feature. Fortunately, we were able to
> > disable it using livepatch, avoiding another round of mass server
> > restarts. To this day, the root cause remains unknown. The good news
> > is that the issue appears to be resolved in the 6.12.y stable kernel.
> > We're still trying to bisect which commit fixed it, though progress is
> > slow because the issue is not reliably reproducible.
>
> This is a very wrong attitude towards open source projects. You sounded
> like, whether intended or not, Linux community should provide issue-free
> kernels and is responsible for fixing all issues. But that is wrong.
> Since you are using the kernel, you could help improve it like Kairong
> is doing instead of waiting for others to fix the issue.
>
> >
> > In theory, new features should be enabled automatically. But in
> > practice, every new feature should come with a tunable knob. That’s a
> > lesson we learned the hard way from this experience—and perhaps
> > Matthew did too.
>
> That means new features will not get enough testing. People like you
> will just simply disable all new features and wait for they are stable.
> It would never come without testing and bug fixes.
Pardon me?
This discussion has taken such an unexpected turn that I don’t feel
the need to explain what I’ve contributed to the Linux community over
the past few years.
That said, you're free to express yourself as you wish—even if it
comes across as unnecessarily rude toward someone who has been
participating in the community voluntarily for many years.
Best of luck in your first maintainer role.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-05-02 12:18 ` Yafang Shao
@ 2025-05-02 13:04 ` David Hildenbrand
2025-05-02 13:06 ` Matthew Wilcox
` (2 more replies)
0 siblings, 3 replies; 41+ messages in thread
From: David Hildenbrand @ 2025-05-02 13:04 UTC (permalink / raw)
To: Yafang Shao, Zi Yan
Cc: Gutierrez Asier, Johannes Weiner, Liam R. Howlett, akpm, ast,
daniel, andrii, Baolin Wang, Lorenzo Stoakes, Nico Pache,
Ryan Roberts, Dev Jain, bpf, linux-mm, Michal Hocko
On 02.05.25 14:18, Yafang Shao wrote:
> On Fri, May 2, 2025 at 8:00 PM Zi Yan <ziy@nvidia.com> wrote:
>>
>> On 2 May 2025, at 1:48, Yafang Shao wrote:
>>
>>> On Fri, May 2, 2025 at 3:36 AM Gutierrez Asier
>>> <gutierrez.asier@huawei-partners.com> wrote:
>>>>
>>>>
>>>> On 4/30/2025 8:53 PM, Zi Yan wrote:
>>>>> On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
>>>>>
>>>>>> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
>>>>>>>>>> If it isn't, can you state why?
>>>>>>>>>>
>>>>>>>>>> The main difference is that you are saying it's in a container that you
>>>>>>>>>> don't control. Your plan is to violate the control the internal
>>>>>>>>>> applications have over THP because you know better. I'm not sure how
>>>>>>>>>> people might feel about you messing with workloads,
>>>>>>>>>
>>>>>>>>> It’s not a mess. They have the option to deploy their services on
>>>>>>>>> dedicated servers, but they would need to pay more for that choice.
>>>>>>>>> This is a two-way decision.
>>>>>>>>
>>>>>>>> This implies you want a container-level way of controlling the setting
>>>>>>>> and not a system service-level?
>>>>>>>
>>>>>>> Right. We want to control the THP per container.
>>>>>>
>>>>>> This does strike me as a reasonable usecase.
>>>>>>
>>>>>> I think there is consensus that in the long-term we want this stuff to
>>>>>> just work and truly be transparent to userspace.
>>>>>>
>>>>>> In the short-to-medium term, however, there are still quite a few
>>>>>> caveats. thp=always can significantly increase the memory footprint of
>>>>>> sparse virtual regions. Huge allocations are not as cheap and reliable
>>>>>> as we would like them to be, which for real production systems means
>>>>>> having to make workload-specifcic choices and tradeoffs.
>>>>>>
>>>>>> There is ongoing work in these areas, but we do have a bit of a
>>>>>> chicken-and-egg problem: on the one hand, huge page adoption is slow
>>>>>> due to limitations in how they can be deployed. For example, we can't
>>>>>> do thp=always on a DC node that runs arbitary combinations of jobs
>>>>>> from a wide array of services. Some might benefit, some might hurt.
>>>>>>
>>>>>> Yet, it's much easier to improve the kernel based on exactly such
>>>>>> production experience and data from real-world usecases. We can't
>>>>>> improve the THP shrinker if we can't run THP.
>>>>>>
>>>>>> So I don't see it as overriding whoever wrote the software running
>>>>>> inside the container. They don't know, and they shouldn't have to care
>>>>>> about page sizes. It's about letting admins and kernel teams get
>>>>>> started on using and experimenting with this stuff, given the very
>>>>>> real constraints right now, so we can get the feedback necessary to
>>>>>> improve the situation.
>>>>>
>>>>> Since you think it is reasonable to control THP at container-level,
>>>>> namely per-cgroup. Should we reconsider cgroup-based THP control[1]?
>>>>> (Asier cc'd)
>>>>>
>>>>> In this patchset, Yafang uses BPF to adjust THP global configs based
>>>>> on VMA, which does not look a good approach to me. WDYT?
>>>>>
>>>>>
>>>>> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Yan, Zi
>>>>
>>>> Hi,
>>>>
>>>> I believe cgroup is a better approach for containers, since this
>>>> approach can be easily integrated with the user space stack like
>>>> containerd and kubernets, which use cgroup to control system resources.
>>>
>>> The integration of BPF with containerd and Kubernetes is emerging as a
>>> clear trend.
>>>
>>>>
>>>> However, I pointed out earlier, the approach I suggested has some
>>>> flaws:
>>>> 1. Potential polution of cgroup with a big number of knobs
>>>
>>> Right, the memcg maintainers once told me that introducing a new
>>> cgroup file means committing to maintaining it indefinitely, as these
>>> interface files are treated as part of the ABI.
>>> In contrast, BPF kfuncs are considered an unstable API, giving you the
>>> flexibility to modify them later if needed.
>>>
>>>> 2. Requires configuration by the admin
>>>>
>>>> Ideally, as Matthew W. mentioned, there should be an automatic system.
>>>
>>> Take Matthew’s XFS large folio feature as an example—it was enabled
>>> automatically. A few years ago, when we upgraded to the 6.1.y stable
>>> kernel, we noticed this new feature. Since it was enabled by default,
>>> we assumed the author was confident in its stability. Unfortunately,
>>> it led to severe issues in our production environment: servers crashed
>>> randomly, and in some cases, we experienced data loss without
>>> understanding the root cause.
>>>
>>> We began disabling various kernel configurations in an attempt to
>>> isolate the issue, and eventually, the problem disappeared after
>>> disabling CONFIG_TRANSPARENT_HUGEPAGE. As a result, we released a new
>>> kernel version with THP disabled and had to restart hundreds of
>>> thousands of production servers. It was a nightmare for both us and
>>> our sysadmins.
>>>
>>> Last year, we discovered that the initial issue had been resolved by this patch:
>>> https://lore.kernel.org/stable/20241001210625.95825-1-ryncsn@gmail.com/.
>>> We backported the fix and re-enabled XFS large folios—only to face a
>>> new nightmare. One of our services began crashing sporadically with
>>> core dumps. It took us several months to trace the issue back to the
>>> re-enabled XFS large folio feature. Fortunately, we were able to
>>> disable it using livepatch, avoiding another round of mass server
>>> restarts. To this day, the root cause remains unknown. The good news
>>> is that the issue appears to be resolved in the 6.12.y stable kernel.
>>> We're still trying to bisect which commit fixed it, though progress is
>>> slow because the issue is not reliably reproducible.
>>
>> This is a very wrong attitude towards open source projects. You sounded
>> like, whether intended or not, Linux community should provide issue-free
>> kernels and is responsible for fixing all issues. But that is wrong.
>> Since you are using the kernel, you could help improve it like Kairong
>> is doing instead of waiting for others to fix the issue.
>>
>>>
>>> In theory, new features should be enabled automatically. But in
>>> practice, every new feature should come with a tunable knob. That’s a
>>> lesson we learned the hard way from this experience—and perhaps
>>> Matthew did too.
>>
>> That means new features will not get enough testing. People like you
>> will just simply disable all new features and wait for they are stable.
>> It would never come without testing and bug fixes.
We do have the concept of EXPERIMENTAL kernel configs that are either
expected to get removed completely ("always enabled") or turned into
actual long-term kernel options. But yeah, it's always tricky to decide
what we actually want to put behind such options.
I mean, READ_ONLY_THP_FOR_FS is still around and still EXPERIMENTAL ...
Distro kernels are usually very careful about what to backport and what
to support. Once we (working for a distro) do backport + test, we
usually find some additional things that upstream hasn't spotted yet: in
particular, because some workloads are only run in that form on distro
kernels. We also ran into some issues with large folios (e.g., me
personally with s390x KVM guests) and are trying our best to fix them.
It can be quite time-consuming, so I can understand that not everybody
has the time to invest in heavy debugging, especially if it's
extremely hard to reproduce (or even corrupts data :( ).
I agree that adding a toggle after the fact to work around issues is
not the right approach. Introducing an EXPERIMENTAL toggle early,
because one suspects complicated interactions, is a different story.
It's absolutely not trivial to make that decision.
>
> Pardon me?
> This discussion has taken such an unexpected turn that I don’t feel
> the need to explain what I’ve contributed to the Linux community over
> the past few years.
I'm sure Zi Yan didn't mean to insult you. I would have phrased it as:
"It's difficult to decide which toggles make sense. There is a fine line
between adding a toggle and not getting people actually testing it to
stabilize it vs. not adding a toggle and forcing people to test it and
fix it/report issues."
Ideally, we'd find most issues in the RC phase or at least shortly after.
You've been active in the kernel for a long time; please don't feel
like the community does not appreciate that.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-05-02 13:04 ` David Hildenbrand
@ 2025-05-02 13:06 ` Matthew Wilcox
2025-05-02 13:34 ` Zi Yan
2025-05-05 2:35 ` Yafang Shao
2 siblings, 0 replies; 41+ messages in thread
From: Matthew Wilcox @ 2025-05-02 13:06 UTC (permalink / raw)
To: David Hildenbrand
Cc: Yafang Shao, Zi Yan, Gutierrez Asier, Johannes Weiner,
Liam R. Howlett, akpm, ast, daniel, andrii, Baolin Wang,
Lorenzo Stoakes, Nico Pache, Ryan Roberts, Dev Jain, bpf,
linux-mm, Michal Hocko
On Fri, May 02, 2025 at 03:04:12PM +0200, David Hildenbrand wrote:
> I mean, READ_ONLY_THP_FOR_FS is still around and still EXPERIMENTAL ...
It's going away RSN.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-05-02 13:04 ` David Hildenbrand
2025-05-02 13:06 ` Matthew Wilcox
@ 2025-05-02 13:34 ` Zi Yan
2025-05-05 2:35 ` Yafang Shao
2 siblings, 0 replies; 41+ messages in thread
From: Zi Yan @ 2025-05-02 13:34 UTC (permalink / raw)
To: Yafang Shao, David Hildenbrand
Cc: Gutierrez Asier, Johannes Weiner, Liam R. Howlett, akpm, ast,
daniel, andrii, Baolin Wang, Lorenzo Stoakes, Nico Pache,
Ryan Roberts, Dev Jain, bpf, linux-mm, Michal Hocko
On 2 May 2025, at 9:04, David Hildenbrand wrote:
> On 02.05.25 14:18, Yafang Shao wrote:
>> On Fri, May 2, 2025 at 8:00 PM Zi Yan <ziy@nvidia.com> wrote:
>>>
>>> On 2 May 2025, at 1:48, Yafang Shao wrote:
>>>
>>>> On Fri, May 2, 2025 at 3:36 AM Gutierrez Asier
>>>> <gutierrez.asier@huawei-partners.com> wrote:
>>>>>
>>>>>
>>>>> On 4/30/2025 8:53 PM, Zi Yan wrote:
>>>>>> On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
>>>>>>
>>>>>>> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
>>>>>>>>>>> If it isn't, can you state why?
>>>>>>>>>>>
>>>>>>>>>>> The main difference is that you are saying it's in a container that you
>>>>>>>>>>> don't control. Your plan is to violate the control the internal
>>>>>>>>>>> applications have over THP because you know better. I'm not sure how
>>>>>>>>>>> people might feel about you messing with workloads,
>>>>>>>>>>
>>>>>>>>>> It’s not a mess. They have the option to deploy their services on
>>>>>>>>>> dedicated servers, but they would need to pay more for that choice.
>>>>>>>>>> This is a two-way decision.
>>>>>>>>>
>>>>>>>>> This implies you want a container-level way of controlling the setting
>>>>>>>>> and not a system service-level?
>>>>>>>>
>>>>>>>> Right. We want to control the THP per container.
>>>>>>>
>>>>>>> This does strike me as a reasonable usecase.
>>>>>>>
>>>>>>> I think there is consensus that in the long-term we want this stuff to
>>>>>>> just work and truly be transparent to userspace.
>>>>>>>
>>>>>>> In the short-to-medium term, however, there are still quite a few
>>>>>>> caveats. thp=always can significantly increase the memory footprint of
>>>>>>> sparse virtual regions. Huge allocations are not as cheap and reliable
>>>>>>> as we would like them to be, which for real production systems means
>>>>>>> having to make workload-specifcic choices and tradeoffs.
>>>>>>>
>>>>>>> There is ongoing work in these areas, but we do have a bit of a
>>>>>>> chicken-and-egg problem: on the one hand, huge page adoption is slow
>>>>>>> due to limitations in how they can be deployed. For example, we can't
>>>>>>> do thp=always on a DC node that runs arbitary combinations of jobs
>>>>>>> from a wide array of services. Some might benefit, some might hurt.
>>>>>>>
>>>>>>> Yet, it's much easier to improve the kernel based on exactly such
>>>>>>> production experience and data from real-world usecases. We can't
>>>>>>> improve the THP shrinker if we can't run THP.
>>>>>>>
>>>>>>> So I don't see it as overriding whoever wrote the software running
>>>>>>> inside the container. They don't know, and they shouldn't have to care
>>>>>>> about page sizes. It's about letting admins and kernel teams get
>>>>>>> started on using and experimenting with this stuff, given the very
>>>>>>> real constraints right now, so we can get the feedback necessary to
>>>>>>> improve the situation.
>>>>>>
>>>>>> Since you think it is reasonable to control THP at container-level,
>>>>>> namely per-cgroup. Should we reconsider cgroup-based THP control[1]?
>>>>>> (Asier cc'd)
>>>>>>
>>>>>> In this patchset, Yafang uses BPF to adjust THP global configs based
>>>>>> on VMA, which does not look a good approach to me. WDYT?
>>>>>>
>>>>>>
>>>>>> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Yan, Zi
>>>>>
>>>>> Hi,
>>>>>
>>>>> I believe cgroup is a better approach for containers, since this
>>>>> approach can be easily integrated with the user space stack like
>>>>> containerd and kubernets, which use cgroup to control system resources.
>>>>
>>>> The integration of BPF with containerd and Kubernetes is emerging as a
>>>> clear trend.
>>>>
>>>>>
>>>>> However, I pointed out earlier, the approach I suggested has some
>>>>> flaws:
>>>>> 1. Potential polution of cgroup with a big number of knobs
>>>>
>>>> Right, the memcg maintainers once told me that introducing a new
>>>> cgroup file means committing to maintaining it indefinitely, as these
>>>> interface files are treated as part of the ABI.
>>>> In contrast, BPF kfuncs are considered an unstable API, giving you the
>>>> flexibility to modify them later if needed.
>>>>
>>>>> 2. Requires configuration by the admin
>>>>>
>>>>> Ideally, as Matthew W. mentioned, there should be an automatic system.
>>>>
>>>> Take Matthew’s XFS large folio feature as an example—it was enabled
>>>> automatically. A few years ago, when we upgraded to the 6.1.y stable
>>>> kernel, we noticed this new feature. Since it was enabled by default,
>>>> we assumed the author was confident in its stability. Unfortunately,
>>>> it led to severe issues in our production environment: servers crashed
>>>> randomly, and in some cases, we experienced data loss without
>>>> understanding the root cause.
>>>>
>>>> We began disabling various kernel configurations in an attempt to
>>>> isolate the issue, and eventually, the problem disappeared after
>>>> disabling CONFIG_TRANSPARENT_HUGEPAGE. As a result, we released a new
>>>> kernel version with THP disabled and had to restart hundreds of
>>>> thousands of production servers. It was a nightmare for both us and
>>>> our sysadmins.
>>>>
>>>> Last year, we discovered that the initial issue had been resolved by this patch:
>>>> https://lore.kernel.org/stable/20241001210625.95825-1-ryncsn@gmail.com/.
>>>> We backported the fix and re-enabled XFS large folios—only to face a
>>>> new nightmare. One of our services began crashing sporadically with
>>>> core dumps. It took us several months to trace the issue back to the
>>>> re-enabled XFS large folio feature. Fortunately, we were able to
>>>> disable it using livepatch, avoiding another round of mass server
>>>> restarts. To this day, the root cause remains unknown. The good news
>>>> is that the issue appears to be resolved in the 6.12.y stable kernel.
>>>> We're still trying to bisect which commit fixed it, though progress is
>>>> slow because the issue is not reliably reproducible.
>>>
>>> This is a very wrong attitude towards open source projects. You sounded
>>> like, whether intended or not, Linux community should provide issue-free
>>> kernels and is responsible for fixing all issues. But that is wrong.
>>> Since you are using the kernel, you could help improve it like Kairong
>>> is doing instead of waiting for others to fix the issue.
>>>
>>>>
>>>> In theory, new features should be enabled automatically. But in
>>>> practice, every new feature should come with a tunable knob. That’s a
>>>> lesson we learned the hard way from this experience—and perhaps
>>>> Matthew did too.
>>>
>>> That means new features will not get enough testing. People like you
>>> will just simply disable all new features and wait for they are stable.
>>> It would never come without testing and bug fixes.
>
> We do have the concept of EXPERIMENTAL kernel configs, that are either expected get removed completely ("always enabled") or get turned into actual long-term kernel options. But yeah, it's always tricky what we actually want to put behind such options.
>
> I mean, READ_ONLY_THP_FOR_FS is still around and still EXPERIMENTAL ...
>
> Distro kernels are usually very careful about what to backport and what to support. Once we (working for a distro) do backport + test, we usually find some additional things that upstream hasn't spotted yet: in particular, because some workloads are only run in that form on distro kernels. We also ran into some issues with large folios (e.g., me personally with s390x KVM guests) and trying our best to fix them.
>
> It can be quite time consuming, so I can understand that not everybody has the time to invest into heavy debugging, especially if it's extremely hard to reproduce (or even corrupts data :( ).
>
> I agree that adding a toggle after the effects to work around issues is not the right approach. Introducing a EXPERIMENTAL toggle early because one suspects complicated interactions in a different story. It's absolutely not trivial to make that decision.
>
>>
>> Pardon me?
>> This discussion has taken such an unexpected turn that I don’t feel
>> the need to explain what I’ve contributed to the Linux community over
>> the past few years.
>
> I'm sure Zi Yan didn't mean to insult you. I would have phrased it as:
Hi Yafang,
I do apologize if you felt insulted. I am aware that you have contributed
a lot to the community. My point is that disabling new features and waiting
for them to become stable might not work.
>
> "It's difficult to decide which toggles make sense. There is a fine line between adding a toggle and not getting people actually testing it to stabilize it vs. not adding a toggle and forcing people to test it and fix it/report issues."
>
> Ideally, we'd find most issue in the RC phase or at least shortly after.
>
> You've been active in the kernel for a long time, please don't feel like the community is not appreciating that.
David, thank you for the clarification.
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-05-02 13:04 ` David Hildenbrand
2025-05-02 13:06 ` Matthew Wilcox
2025-05-02 13:34 ` Zi Yan
@ 2025-05-05 2:35 ` Yafang Shao
2 siblings, 0 replies; 41+ messages in thread
From: Yafang Shao @ 2025-05-05 2:35 UTC (permalink / raw)
To: David Hildenbrand
Cc: Zi Yan, Gutierrez Asier, Johannes Weiner, Liam R. Howlett, akpm,
ast, daniel, andrii, Baolin Wang, Lorenzo Stoakes, Nico Pache,
Ryan Roberts, Dev Jain, bpf, linux-mm, Michal Hocko
On Fri, May 2, 2025 at 9:04 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 02.05.25 14:18, Yafang Shao wrote:
> > On Fri, May 2, 2025 at 8:00 PM Zi Yan <ziy@nvidia.com> wrote:
> >>
> >> On 2 May 2025, at 1:48, Yafang Shao wrote:
> >>
> >>> On Fri, May 2, 2025 at 3:36 AM Gutierrez Asier
> >>> <gutierrez.asier@huawei-partners.com> wrote:
> >>>>
> >>>>
> >>>> On 4/30/2025 8:53 PM, Zi Yan wrote:
> >>>>> On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
> >>>>>
> >>>>>> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
> >>>>>>>>>> If it isn't, can you state why?
> >>>>>>>>>>
> >>>>>>>>>> The main difference is that you are saying it's in a container that you
> >>>>>>>>>> don't control. Your plan is to violate the control the internal
> >>>>>>>>>> applications have over THP because you know better. I'm not sure how
> >>>>>>>>>> people might feel about you messing with workloads,
> >>>>>>>>>
> >>>>>>>>> It’s not a mess. They have the option to deploy their services on
> >>>>>>>>> dedicated servers, but they would need to pay more for that choice.
> >>>>>>>>> This is a two-way decision.
> >>>>>>>>
> >>>>>>>> This implies you want a container-level way of controlling the setting
> >>>>>>>> and not a system service-level?
> >>>>>>>
> >>>>>>> Right. We want to control the THP per container.
> >>>>>>
> >>>>>> This does strike me as a reasonable usecase.
> >>>>>>
> >>>>>> I think there is consensus that in the long-term we want this stuff to
> >>>>>> just work and truly be transparent to userspace.
> >>>>>>
> >>>>>> In the short-to-medium term, however, there are still quite a few
> >>>>>> caveats. thp=always can significantly increase the memory footprint of
> >>>>>> sparse virtual regions. Huge allocations are not as cheap and reliable
> >>>>>> as we would like them to be, which for real production systems means
> >>>>>> having to make workload-specifcic choices and tradeoffs.
> >>>>>>
> >>>>>> There is ongoing work in these areas, but we do have a bit of a
> >>>>>> chicken-and-egg problem: on the one hand, huge page adoption is slow
> >>>>>> due to limitations in how they can be deployed. For example, we can't
> >>>>>> do thp=always on a DC node that runs arbitary combinations of jobs
> >>>>>> from a wide array of services. Some might benefit, some might hurt.
> >>>>>>
> >>>>>> Yet, it's much easier to improve the kernel based on exactly such
> >>>>>> production experience and data from real-world usecases. We can't
> >>>>>> improve the THP shrinker if we can't run THP.
> >>>>>>
> >>>>>> So I don't see it as overriding whoever wrote the software running
> >>>>>> inside the container. They don't know, and they shouldn't have to care
> >>>>>> about page sizes. It's about letting admins and kernel teams get
> >>>>>> started on using and experimenting with this stuff, given the very
> >>>>>> real constraints right now, so we can get the feedback necessary to
> >>>>>> improve the situation.
> >>>>>
> >>>>> Since you think it is reasonable to control THP at container-level,
> >>>>> namely per-cgroup. Should we reconsider cgroup-based THP control[1]?
> >>>>> (Asier cc'd)
> >>>>>
> >>>>> In this patchset, Yafang uses BPF to adjust THP global configs based
> >>>>> on VMA, which does not look a good approach to me. WDYT?
> >>>>>
> >>>>>
> >>>>> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
> >>>>>
> >>>>> --
> >>>>> Best Regards,
> >>>>> Yan, Zi
> >>>>
> >>>> Hi,
> >>>>
> >>>> I believe cgroup is a better approach for containers, since this
> >>>> approach can be easily integrated with the user space stack like
> >>>> containerd and kubernets, which use cgroup to control system resources.
> >>>
> >>> The integration of BPF with containerd and Kubernetes is emerging as a
> >>> clear trend.
> >>>
> >>>>
> >>>> However, I pointed out earlier, the approach I suggested has some
> >>>> flaws:
> >>>> 1. Potential polution of cgroup with a big number of knobs
> >>>
> >>> Right, the memcg maintainers once told me that introducing a new
> >>> cgroup file means committing to maintaining it indefinitely, as these
> >>> interface files are treated as part of the ABI.
> >>> In contrast, BPF kfuncs are considered an unstable API, giving you the
> >>> flexibility to modify them later if needed.
> >>>
> >>>> 2. Requires configuration by the admin
> >>>>
> >>>> Ideally, as Matthew W. mentioned, there should be an automatic system.
> >>>
> >>> Take Matthew’s XFS large folio feature as an example—it was enabled
> >>> automatically. A few years ago, when we upgraded to the 6.1.y stable
> >>> kernel, we noticed this new feature. Since it was enabled by default,
> >>> we assumed the author was confident in its stability. Unfortunately,
> >>> it led to severe issues in our production environment: servers crashed
> >>> randomly, and in some cases, we experienced data loss without
> >>> understanding the root cause.
> >>>
> >>> We began disabling various kernel configurations in an attempt to
> >>> isolate the issue, and eventually, the problem disappeared after
> >>> disabling CONFIG_TRANSPARENT_HUGEPAGE. As a result, we released a new
> >>> kernel version with THP disabled and had to restart hundreds of
> >>> thousands of production servers. It was a nightmare for both us and
> >>> our sysadmins.
> >>>
> >>> Last year, we discovered that the initial issue had been resolved by this patch:
> >>> https://lore.kernel.org/stable/20241001210625.95825-1-ryncsn@gmail.com/.
> >>> We backported the fix and re-enabled XFS large folios—only to face a
> >>> new nightmare. One of our services began crashing sporadically with
> >>> core dumps. It took us several months to trace the issue back to the
> >>> re-enabled XFS large folio feature. Fortunately, we were able to
> >>> disable it using livepatch, avoiding another round of mass server
> >>> restarts. To this day, the root cause remains unknown. The good news
> >>> is that the issue appears to be resolved in the 6.12.y stable kernel.
> >>> We're still trying to bisect which commit fixed it, though progress is
> >>> slow because the issue is not reliably reproducible.
> >>
> >> This is a very wrong attitude towards open source projects. You sounded
> >> like, whether intended or not, Linux community should provide issue-free
> >> kernels and is responsible for fixing all issues. But that is wrong.
> >> Since you are using the kernel, you could help improve it like Kairong
> >> is doing instead of waiting for others to fix the issue.
> >>
> >>>
> >>> In theory, new features should be enabled automatically. But in
> >>> practice, every new feature should come with a tunable knob. That’s a
> >>> lesson we learned the hard way from this experience—and perhaps
> >>> Matthew did too.
> >>
> >> That means new features will not get enough testing. People like you
> >> will just simply disable all new features and wait for they are stable.
> >> It would never come without testing and bug fixes.
>
Hello David,
Thanks for your reply.
> We do have the concept of EXPERIMENTAL kernel configs, that are either
> expected get removed completely ("always enabled") or get turned into
> actual long-term kernel options. But yeah, it's always tricky what we
> actually want to put behind such options.
>
> I mean, READ_ONLY_THP_FOR_FS is still around and still EXPERIMENTAL ...
READ_ONLY_THP_FOR_FS is not enabled in our 6.1 kernel, as we are
cautious about enabling any EXPERIMENTAL feature. XFS large folio
support, however, operates independently of READ_ONLY_THP_FOR_FS: it is
enabled automatically whenever CONFIG_TRANSPARENT_HUGEPAGE is set, as
can be seen from mapping_large_folio_support() in the 6.1.y stable kernel.
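For reference, the 6.1.y helper looks roughly like the following (quoted
from memory, so the exact form may differ slightly); the gate is
CONFIG_TRANSPARENT_HUGEPAGE plus a per-mapping flag, and
READ_ONLY_THP_FOR_FS plays no part in it:

/* Approximate form of the helper in include/linux/pagemap.h (6.1.y).
 * XFS sets AS_LARGE_FOLIO_SUPPORT on its mappings, so large folios are
 * used whenever THP is compiled in, regardless of READ_ONLY_THP_FOR_FS.
 */
static inline bool mapping_large_folio_support(struct address_space *mapping)
{
	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
	       test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
}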
>
> Distro kernels are usually very careful about what to backport and what
> to support. Once we (working for a distro) do backport + test, we
> usually find some additional things that upstream hasn't spotted yet: in
> particular, because some workloads are only run in that form on distro
> kernels. We also ran into some issues with large folios (e.g., me
> personally with s390x KVM guests) and trying our best to fix them.
We also worked on this. As you may recall, I previously fixed a large
folio bug, which was merged into the 6.1.y stable kernel [0].
[0]. https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-6.1.y&id=a3f8ee15228c89ce3713ee7e8e82f6d8a13fdb4b
>
> It can be quite time consuming, so I can understand that not everybody
> has the time to invest into heavy debugging, especially if it's
> extremely hard to reproduce (or even corrupts data :( ).
Correct. If the vmcore is incomplete, it is nearly impossible to
reliably determine the root cause. The best approach is to isolate the
issue as quickly as possible.
>
> I agree that adding a toggle after the effects to work around issues is
> not the right approach. Introducing a EXPERIMENTAL toggle early because
> one suspects complicated interactions in a different story. It's
> absolutely not trivial to make that decision.
In this patchset, we are not introducing a toggle as a workaround.
Rather, the change reflects the fact that some workloads benefit from
THP while others are negatively impacted. Therefore, it makes sense to
enable THP selectively based on workload characteristics.
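To make this concrete, below is a rough, untested sketch of the kind of
policy program we have in mind. It assumes the hook is exposed as an
fmod_ret attach point on hugepage_global_enabled() (with the VMA
parameter added by patch 2 of this series); the map name, the policy
encoding, and the exact signature are illustrative only, not the final
interface:

/* Illustrative sketch only: the attach point, signature and policy
 * encoding are assumptions about this series, not settled kernel API.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* cgroup id -> 1 to force THP "enabled" for tasks in that container */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u64);
	__type(value, __u32);
} thp_whitelist SEC(".maps");

SEC("fmod_ret/hugepage_global_enabled")
int BPF_PROG(thp_adjust, struct vm_area_struct *vma, int ret)
{
	__u64 cgid = bpf_get_current_cgroup_id();
	__u32 *on = bpf_map_lookup_elem(&thp_whitelist, &cgid);

	/* Returning 0 lets the original function run, i.e. the global
	 * setting applies; a non-zero return skips the function and is
	 * used as its return value, treating THP as enabled for this
	 * task's faults.
	 */
	return on ? *on : 0;
}

User space (for example a container lifecycle hook) would populate
thp_whitelist with the cgroup ids of the containers that should get THP.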
>
> >
> > Pardon me?
> > This discussion has taken such an unexpected turn that I don’t feel
> > the need to explain what I’ve contributed to the Linux community over
> > the past few years.
>
> I'm sure Zi Yan didn't mean to insult you. I would have phrased it as:
>
> "It's difficult to decide which toggles make sense. There is a fine line
> between adding a toggle and not getting people actually testing it to
> stabilize it vs. not adding a toggle and forcing people to test it and
> fix it/report issues."
>
> Ideally, we'd find most issue in the RC phase or at least shortly after.
>
> You've been active in the kernel for a long time, please don't feel like
> the community is not appreciating that.
Thank you for the clarification. I truly appreciate your patience and
thoroughness.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-05-02 5:48 ` Yafang Shao
2025-05-02 12:00 ` Zi Yan
@ 2025-05-05 9:11 ` Gutierrez Asier
2025-05-05 9:38 ` Yafang Shao
1 sibling, 1 reply; 41+ messages in thread
From: Gutierrez Asier @ 2025-05-05 9:11 UTC (permalink / raw)
To: Yafang Shao
Cc: Zi Yan, Johannes Weiner, Liam R. Howlett, akpm, ast, daniel,
andrii, David Hildenbrand, Baolin Wang, Lorenzo Stoakes,
Nico Pache, Ryan Roberts, Dev Jain, bpf, linux-mm, Michal Hocko
On 5/2/2025 8:48 AM, Yafang Shao wrote:
> On Fri, May 2, 2025 at 3:36 AM Gutierrez Asier
> <gutierrez.asier@huawei-partners.com> wrote:
>>
>>
>> On 4/30/2025 8:53 PM, Zi Yan wrote:
>>> On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
>>>
>>>> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
>>>>>>>> If it isn't, can you state why?
>>>>>>>>
>>>>>>>> The main difference is that you are saying it's in a container that you
>>>>>>>> don't control. Your plan is to violate the control the internal
>>>>>>>> applications have over THP because you know better. I'm not sure how
>>>>>>>> people might feel about you messing with workloads,
>>>>>>>
>>>>>>> It’s not a mess. They have the option to deploy their services on
>>>>>>> dedicated servers, but they would need to pay more for that choice.
>>>>>>> This is a two-way decision.
>>>>>>
>>>>>> This implies you want a container-level way of controlling the setting
>>>>>> and not a system service-level?
>>>>>
>>>>> Right. We want to control the THP per container.
>>>>
>>>> This does strike me as a reasonable usecase.
>>>>
>>>> I think there is consensus that in the long-term we want this stuff to
>>>> just work and truly be transparent to userspace.
>>>>
>>>> In the short-to-medium term, however, there are still quite a few
>>>> caveats. thp=always can significantly increase the memory footprint of
>>>> sparse virtual regions. Huge allocations are not as cheap and reliable
>>>> as we would like them to be, which for real production systems means
>>>> having to make workload-specifcic choices and tradeoffs.
>>>>
>>>> There is ongoing work in these areas, but we do have a bit of a
>>>> chicken-and-egg problem: on the one hand, huge page adoption is slow
>>>> due to limitations in how they can be deployed. For example, we can't
>>>> do thp=always on a DC node that runs arbitary combinations of jobs
>>>> from a wide array of services. Some might benefit, some might hurt.
>>>>
>>>> Yet, it's much easier to improve the kernel based on exactly such
>>>> production experience and data from real-world usecases. We can't
>>>> improve the THP shrinker if we can't run THP.
>>>>
>>>> So I don't see it as overriding whoever wrote the software running
>>>> inside the container. They don't know, and they shouldn't have to care
>>>> about page sizes. It's about letting admins and kernel teams get
>>>> started on using and experimenting with this stuff, given the very
>>>> real constraints right now, so we can get the feedback necessary to
>>>> improve the situation.
>>>
>>> Since you think it is reasonable to control THP at container-level,
>>> namely per-cgroup. Should we reconsider cgroup-based THP control[1]?
>>> (Asier cc'd)
>>>
>>> In this patchset, Yafang uses BPF to adjust THP global configs based
>>> on VMA, which does not look a good approach to me. WDYT?
>>>
>>>
>>> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
>>
>> Hi,
>>
>> I believe cgroup is a better approach for containers, since this
>> approach can be easily integrated with the user space stack like
>> containerd and kubernets, which use cgroup to control system resources.
>
> The integration of BPF with containerd and Kubernetes is emerging as a
> clear trend.
>
No, eBPF is not used for resource management; it is mainly used by the
network stack (CNI), monitoring, and security. All resource management
in Kubernetes is done using cgroups. You are very unlikely to convince
the Kubernetes community to manage memory resources using eBPF.
>>
>> However, I pointed out earlier, the approach I suggested has some
>> flaws:
>> 1. Potential polution of cgroup with a big number of knobs
>
> Right, the memcg maintainers once told me that introducing a new
> cgroup file means committing to maintaining it indefinitely, as these
> interface files are treated as part of the ABI.
> In contrast, BPF kfuncs are considered an unstable API, giving you the
> flexibility to modify them later if needed.
>
>> 2. Requires configuration by the admin
>>
>> Ideally, as Matthew W. mentioned, there should be an automatic system.
>
> Take Matthew’s XFS large folio feature as an example—it was enabled
> automatically. A few years ago, when we upgraded to the 6.1.y stable
> kernel, we noticed this new feature. Since it was enabled by default,
> we assumed the author was confident in its stability. Unfortunately,
> it led to severe issues in our production environment: servers crashed
> randomly, and in some cases, we experienced data loss without
> understanding the root cause.
>
> We began disabling various kernel configurations in an attempt to
> isolate the issue, and eventually, the problem disappeared after
> disabling CONFIG_TRANSPARENT_HUGEPAGE. As a result, we released a new
> kernel version with THP disabled and had to restart hundreds of
> thousands of production servers. It was a nightmare for both us and
> our sysadmins.
>
> Last year, we discovered that the initial issue had been resolved by this patch:
> https://lore.kernel.org/stable/20241001210625.95825-1-ryncsn@gmail.com/.
> We backported the fix and re-enabled XFS large folios—only to face a
> new nightmare. One of our services began crashing sporadically with
> core dumps. It took us several months to trace the issue back to the
> re-enabled XFS large folio feature. Fortunately, we were able to
> disable it using livepatch, avoiding another round of mass server
> restarts. To this day, the root cause remains unknown. The good news
> is that the issue appears to be resolved in the 6.12.y stable kernel.
> We're still trying to bisect which commit fixed it, though progress is
> slow because the issue is not reliably reproducible.
>
> In theory, new features should be enabled automatically. But in
> practice, every new feature should come with a tunable knob. That’s a
> lesson we learned the hard way from this experience—and perhaps
> Matthew did too.
>
>>
>> Anyway, regarding containers, I believe cgroup is a good approach
>> given that the admin or the container management system uses cgroups
>> to set up the containers.
>>
>> --
>> Asier Gutierrez
>> Huawei
>>
>
--
Asier Gutierrez
Huawei
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
2025-05-05 9:11 ` Gutierrez Asier
@ 2025-05-05 9:38 ` Yafang Shao
0 siblings, 0 replies; 41+ messages in thread
From: Yafang Shao @ 2025-05-05 9:38 UTC (permalink / raw)
To: Gutierrez Asier
Cc: Zi Yan, Johannes Weiner, Liam R. Howlett, akpm, ast, daniel,
andrii, David Hildenbrand, Baolin Wang, Lorenzo Stoakes,
Nico Pache, Ryan Roberts, Dev Jain, bpf, linux-mm, Michal Hocko
On Mon, May 5, 2025 at 5:11 PM Gutierrez Asier
<gutierrez.asier@huawei-partners.com> wrote:
>
>
>
> On 5/2/2025 8:48 AM, Yafang Shao wrote:
> > On Fri, May 2, 2025 at 3:36 AM Gutierrez Asier
> > <gutierrez.asier@huawei-partners.com> wrote:
> >>
> >>
> >> On 4/30/2025 8:53 PM, Zi Yan wrote:
> >>> On 30 Apr 2025, at 13:45, Johannes Weiner wrote:
> >>>
> >>>> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote:
> >>>>>>>> If it isn't, can you state why?
> >>>>>>>>
> >>>>>>>> The main difference is that you are saying it's in a container that you
> >>>>>>>> don't control. Your plan is to violate the control the internal
> >>>>>>>> applications have over THP because you know better. I'm not sure how
> >>>>>>>> people might feel about you messing with workloads,
> >>>>>>>
> >>>>>>> It’s not a mess. They have the option to deploy their services on
> >>>>>>> dedicated servers, but they would need to pay more for that choice.
> >>>>>>> This is a two-way decision.
> >>>>>>
> >>>>>> This implies you want a container-level way of controlling the setting
> >>>>>> and not a system service-level?
> >>>>>
> >>>>> Right. We want to control the THP per container.
> >>>>
> >>>> This does strike me as a reasonable usecase.
> >>>>
> >>>> I think there is consensus that in the long-term we want this stuff to
> >>>> just work and truly be transparent to userspace.
> >>>>
> >>>> In the short-to-medium term, however, there are still quite a few
> >>>> caveats. thp=always can significantly increase the memory footprint of
> >>>> sparse virtual regions. Huge allocations are not as cheap and reliable
> >>>> as we would like them to be, which for real production systems means
> >>>> having to make workload-specifcic choices and tradeoffs.
> >>>>
> >>>> There is ongoing work in these areas, but we do have a bit of a
> >>>> chicken-and-egg problem: on the one hand, huge page adoption is slow
> >>>> due to limitations in how they can be deployed. For example, we can't
> >>>> do thp=always on a DC node that runs arbitary combinations of jobs
> >>>> from a wide array of services. Some might benefit, some might hurt.
> >>>>
> >>>> Yet, it's much easier to improve the kernel based on exactly such
> >>>> production experience and data from real-world usecases. We can't
> >>>> improve the THP shrinker if we can't run THP.
> >>>>
> >>>> So I don't see it as overriding whoever wrote the software running
> >>>> inside the container. They don't know, and they shouldn't have to care
> >>>> about page sizes. It's about letting admins and kernel teams get
> >>>> started on using and experimenting with this stuff, given the very
> >>>> real constraints right now, so we can get the feedback necessary to
> >>>> improve the situation.
> >>>
> >>> Since you think it is reasonable to control THP at container-level,
> >>> namely per-cgroup. Should we reconsider cgroup-based THP control[1]?
> >>> (Asier cc'd)
> >>>
> >>> In this patchset, Yafang uses BPF to adjust THP global configs based
> >>> on VMA, which does not look a good approach to me. WDYT?
> >>>
> >>>
> >>> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
> >>>
> >>> --
> >>> Best Regards,
> >>> Yan, Zi
> >>
> >> Hi,
> >>
> >> I believe cgroup is a better approach for containers, since this
> >> approach can be easily integrated with the user space stack like
> >> containerd and kubernets, which use cgroup to control system resources.
> >
> > The integration of BPF with containerd and Kubernetes is emerging as a
> > clear trend.
> >
>
> No, eBPF is not used for resource management, it is mainly used by the
> network stack (CNI), monitoring and security.
This is the most well-known use case of BPF in Kubernetes, thanks to Cilium.
> All the resource
> management by Kubernetes is done using cgroups.
The landscape has shifted. As Johannes (the memcg maintainer)
noted[0], "Cgroups are for nested trees dividing up resources. They're
not a good fit for arbitrary, non-hierarchical policy settings."
[0]. https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/
> You are very unlikely
> to convince the Kubernetes community to manage memory resources using
> eBPF.
Kubernetes already natively supports this capability. As documented in
the Container Lifecycle Hooks guide[1], you can easily load BPF
programs as plugins using these hooks. This is exactly the approach
we've successfully implemented in our production environments.
[1]. https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/
--
Regards
Yafang
^ permalink raw reply [flat|nested] 41+ messages in thread
end of thread, other threads:[~2025-05-05 9:39 UTC | newest]
Thread overview: 41+ messages
-- links below jump to the message on this page --
2025-04-29 2:41 [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment Yafang Shao
2025-04-29 2:41 ` [RFC PATCH 1/4] mm: move hugepage_global_{enabled,always}() to internal.h Yafang Shao
2025-04-29 15:13 ` Zi Yan
2025-04-30 2:40 ` Yafang Shao
2025-04-30 12:11 ` Zi Yan
2025-04-30 14:43 ` Yafang Shao
2025-04-29 2:41 ` [RFC PATCH 2/4] mm: pass VMA parameter to hugepage_global_{enabled,always}() Yafang Shao
2025-04-29 15:31 ` Zi Yan
2025-04-30 2:46 ` Yafang Shao
2025-04-29 2:41 ` [RFC PATCH 3/4] mm: add BPF hook for THP adjustment Yafang Shao
2025-04-29 15:19 ` Alexei Starovoitov
2025-04-30 2:48 ` Yafang Shao
2025-04-29 2:41 ` [RFC PATCH 4/4] selftests/bpf: Add selftest " Yafang Shao
2025-04-29 3:11 ` [RFC PATCH 0/4] mm, bpf: BPF based " Matthew Wilcox
2025-04-29 4:53 ` Yafang Shao
2025-04-29 15:09 ` Zi Yan
2025-04-30 2:33 ` Yafang Shao
2025-04-30 13:19 ` Zi Yan
2025-04-30 14:38 ` Yafang Shao
2025-04-30 15:00 ` Zi Yan
2025-04-30 15:16 ` Yafang Shao
2025-04-30 15:21 ` Liam R. Howlett
2025-04-30 15:37 ` Yafang Shao
2025-04-30 15:53 ` Liam R. Howlett
2025-04-30 16:06 ` Yafang Shao
2025-04-30 17:45 ` Johannes Weiner
2025-04-30 17:53 ` Zi Yan
2025-05-01 19:36 ` Gutierrez Asier
2025-05-02 5:48 ` Yafang Shao
2025-05-02 12:00 ` Zi Yan
2025-05-02 12:18 ` Yafang Shao
2025-05-02 13:04 ` David Hildenbrand
2025-05-02 13:06 ` Matthew Wilcox
2025-05-02 13:34 ` Zi Yan
2025-05-05 2:35 ` Yafang Shao
2025-05-05 9:11 ` Gutierrez Asier
2025-05-05 9:38 ` Yafang Shao
2025-04-30 17:59 ` Johannes Weiner
2025-05-01 0:40 ` Yafang Shao
2025-04-30 14:40 ` Liam R. Howlett
2025-04-30 14:49 ` Yafang Shao