linux-kselftest.vger.kernel.org archive mirror
* [PATCH v6 0/4] mm: introduce THP deferred setting
@ 2025-05-15  3:38 Nico Pache
  2025-05-15  3:38 ` [PATCH v6 1/4] mm: defer THP insertion to khugepaged Nico Pache
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Nico Pache @ 2025-05-15  3:38 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-kselftest
  Cc: rientjes, hannes, lorenzo.stoakes, rdunlap, mhocko, Liam.Howlett,
	zokeefe, surenb, jglisse, cl, jack, dave.hansen, will, tiwai,
	catalin.marinas, anshuman.khandual, dev.jain, raquini, aarcange,
	kirill.shutemov, yang, thomas.hellstrom, vishal.moola, sunnanyong,
	usamaarif642, wangkefeng.wang, ziy, shuah, peterx, willy,
	ryan.roberts, baolin.wang, baohua, david, mathieu.desnoyers,
	mhiramat, rostedt, corbet, akpm, npache

This series is a follow-up to [1], which adds mTHP support to khugepaged.
mTHP khugepaged support is a "loose" dependency for the sysfs/sysctl
configs to make sense. Without it, the global="defer" and mTHP="inherit"
case is "undefined" behavior.

We've seen cases where customers switching from RHEL7 to RHEL8 see a
significant increase in the memory footprint for the same workloads.

Through our investigations we found that a large contributing factor to
the increase in RSS was an increase in THP usage.

For workloads like MySQL, or when using allocators like jemalloc, it is
often recommended to set /transparent_hugepages/enabled=never. This is
in part due to performance degradations and increased memory waste.

This series introduces enabled=defer. This setting acts as a middle
ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
page fault handler will act normally, making a hugepage if possible. If
the mapping is not MADV_HUGEPAGE, then the page fault handler will
default to the base page size allocation. The caveat is that khugepaged
can still operate on pages that are not MADV_HUGEPAGE.
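
As a rough illustration of the paragraph above, here is a small userspace
model of the policy. It is only a sketch: the enum and function names are
made up for this note and do not exist in the kernel, and it ignores
everything else the real code checks (MADV_NOHUGEPAGE, per-size mTHP
settings, VMA alignment, and so on).

```c
#include <stdbool.h>

/* Hypothetical names for this sketch; not kernel identifiers. */
enum thp_mode { THP_MODE_NEVER, THP_MODE_MADVISE, THP_MODE_DEFER, THP_MODE_ALWAYS };
enum thp_caller { CALLER_PAGE_FAULT, CALLER_KHUGEPAGED };

/* Return true if the model would allow a THP for this mode, caller and mapping. */
static bool thp_model_allowed(enum thp_mode mode, enum thp_caller caller,
			      bool madv_hugepage)
{
	switch (mode) {
	case THP_MODE_ALWAYS:
		/* Both the fault handler and khugepaged may use THPs. */
		return true;
	case THP_MODE_MADVISE:
		/* Only MADV_HUGEPAGE mappings are eligible, for either caller. */
		return madv_hugepage;
	case THP_MODE_DEFER:
		/*
		 * The fault handler uses THPs only for MADV_HUGEPAGE mappings,
		 * while khugepaged may still collapse any mapping later.
		 */
		return caller == CALLER_KHUGEPAGED || madv_hugepage;
	case THP_MODE_NEVER:
	default:
		return false;
	}
}
```

In this model, defer differs from always only for page faults on mappings
without MADV_HUGEPAGE, and differs from madvise only for khugepaged on those
same mappings.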

This allows for three things. First, applications specifically designed to
use hugepages will get them. Second, applications that don't use
hugepages can still benefit from them without aggressively inserting
THPs at every possible chance. This curbs the memory waste, and defers
the use of hugepages to khugepaged. Khugepaged can then scan the memory
for ranges eligible for collapse. Lastly, there is the added benefit for
those who want THPs but experience higher latency PFs. Now you can get
base page performance at the PF handler and hugepage performance for those
mappings after they collapse.

Admins may want to lower max_ptes_none; otherwise, khugepaged may
aggressively collapse single allocations into hugepages.
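
To make the max_ptes_none trade-off concrete, the arithmetic can be sketched
as below. This assumes a PMD-sized collapse on a 4 KiB base page system
(512 PTEs per 2 MiB PMD) and uses helper names invented for this note; it is
not how khugepaged is implemented, only the counting behind the limit.

```c
#include <stddef.h>

/* Number of PTEs covered by one PMD-sized THP, e.g. 2 MiB / 4 KiB = 512. */
static unsigned int ptes_per_pmd(size_t pmd_size, size_t page_size)
{
	return (unsigned int)(pmd_size / page_size);
}

/* Fewest populated PTEs a range needs to stay within max_ptes_none. */
static unsigned int min_present_ptes(unsigned int nr_ptes,
				     unsigned int max_ptes_none)
{
	return max_ptes_none >= nr_ptes ? 0 : nr_ptes - max_ptes_none;
}

/* Upper bound on memory a single collapse can add by filling empty PTEs. */
static size_t worst_case_added_bytes(unsigned int nr_ptes,
				     unsigned int max_ptes_none,
				     size_t page_size)
{
	unsigned int present = min_present_ptes(nr_ptes, max_ptes_none);

	return (size_t)(nr_ptes - present) * page_size;
}
```

With max_ptes_none=511, a range with a single populated PTE is within the
limit and a collapse can add up to 511 * 4 KiB of memory. With
max_ptes_none=64, at least 448 PTEs must be populated and a collapse adds at
most 256 KiB.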

TESTING:
- Built for x86_64, aarch64, ppc64le, and s390x
- selftests mm
- In [1] I provided a script [2] that has multiple access patterns
- lots of general use.
- redis testing. This test was my original use case for the defer mode. What I
   was able to prove was that THP=always leads to increased max_latency
   cases, which is why it is recommended to disable THPs for redis servers.
   However, with 'defer' we don't have the max_latency spikes and can still
   get the system to utilize THPs. I further tested this with the mTHP
   defer setting and found that redis (and probably other jemalloc users)
   can utilize THPs via defer (+mTHP defer) without a large latency
   penalty and with some potential gains. I uploaded some mmtest results
   here [3], which compare:
       stock+thp=never
       stock+(m)thp=always
       khugepaged-mthp + defer (max_ptes_none=64)

  The results show that (m)THPs can cause some throughput regression in
  some cases, but also have gains in other cases. The mTHP+defer results
  show more gains and fewer losses than the (m)THP=always case.

V6 Changes:
- nits
- rebased dependent series and added review tags

V5 Changes:
- rebased dependent series
- added reviewed-by tag on 2/4

V4 Changes:
- Minor Documentation fixes
- rebased the dependent series [1] onto mm-unstable
    commit 0e68b850b1d3 ("vmalloc: use atomic_long_add_return_relaxed()")

V3 Changes:
- Combined the documentation commits into one, and moved a section to the
  khugepaged mthp patchset

V2 Changes:
- base changes on mTHP khugepaged support
- Fix selftests parsing issue
- add mTHP defer option
- add mTHP defer Documentation

[1] - https://lore.kernel.org/all/20250515032226.128900-1-npache@redhat.com/
[2] - https://gitlab.com/npache/khugepaged_mthp_test
[3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.html

Nico Pache (4):
  mm: defer THP insertion to khugepaged
  mm: document (m)THP defer usage
  khugepaged: add defer option to mTHP options
  selftests: mm: add defer to thp setting parser

 Documentation/admin-guide/mm/transhuge.rst | 31 +++++++---
 include/linux/huge_mm.h                    | 18 +++++-
 mm/huge_memory.c                           | 69 +++++++++++++++++++---
 mm/khugepaged.c                            |  8 +--
 tools/testing/selftests/mm/thp_settings.c  |  1 +
 tools/testing/selftests/mm/thp_settings.h  |  1 +
 6 files changed, 106 insertions(+), 22 deletions(-)

-- 
2.49.0



* [PATCH v6 1/4] mm: defer THP insertion to khugepaged
  2025-05-15  3:38 [PATCH v6 0/4] mm: introduce THP deferred setting Nico Pache
@ 2025-05-15  3:38 ` Nico Pache
  2025-05-20  7:43   ` Yafang Shao
  2025-06-14 11:25   ` Klara Modin
  2025-05-15  3:38 ` [PATCH v6 2/4] mm: document (m)THP defer usage Nico Pache
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 20+ messages in thread
From: Nico Pache @ 2025-05-15  3:38 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-kselftest
  Cc: rientjes, hannes, lorenzo.stoakes, rdunlap, mhocko, Liam.Howlett,
	zokeefe, surenb, jglisse, cl, jack, dave.hansen, will, tiwai,
	catalin.marinas, anshuman.khandual, dev.jain, raquini, aarcange,
	kirill.shutemov, yang, thomas.hellstrom, vishal.moola, sunnanyong,
	usamaarif642, wangkefeng.wang, ziy, shuah, peterx, willy,
	ryan.roberts, baolin.wang, baohua, david, mathieu.desnoyers,
	mhiramat, rostedt, corbet, akpm, npache

Setting /transparent_hugepages/enabled=always allows applications
to benefit from THPs without having to madvise. However, the page fault
handler weighs very few considerations when deciding whether or not to
actually use a THP. This can lead to a lot of wasted memory. khugepaged only
operates on memory that was either allocated with enabled=always or
MADV_HUGEPAGE.

Introduce the ability to set enabled=defer, which will prevent THPs from
being allocated by the page fault handler unless madvise is set,
leaving it up to khugepaged to decide which allocations will collapse to a
THP. This should allow applications to benefit from THPs, while curbing
some of the memory waste.

Acked-by: Zi Yan <ziy@nvidia.com>
Co-developed-by: Rafael Aquini <raquini@redhat.com>
Signed-off-by: Rafael Aquini <raquini@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/linux/huge_mm.h | 15 +++++++++++++--
 mm/huge_memory.c        | 31 +++++++++++++++++++++++++++----
 2 files changed, 40 insertions(+), 6 deletions(-)
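
For reference, "unless madvise is set" in the commit message refers to
mappings marked with madvise(MADV_HUGEPAGE). A minimal userspace sketch of
such an opt-in follows. The helper name is made up for this note, and whether
a THP is actually allocated at fault time still depends on the kernel
configuration and on memory availability.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map an anonymous region and hint that it should be backed by THPs.
 * Returns the mapping, or NULL if mmap() fails. *advised is set to 1 if
 * madvise(MADV_HUGEPAGE) succeeded and 0 otherwise, for example on a kernel
 * built without THP support.
 */
static void *map_with_thp_hint(size_t len, int *advised)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	*advised = (madvise(p, len, MADV_HUGEPAGE) == 0);
	return p;
}
```

A mapping set up this way keeps its fault-time THP behavior under
enabled=defer, while mappings without the hint fall back to base pages at
fault time and are left to khugepaged.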

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e3d15c737008..02038e3db829 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -48,6 +48,7 @@ enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_UNSUPPORTED,
 	TRANSPARENT_HUGEPAGE_FLAG,
 	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+	TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG,
@@ -186,6 +187,7 @@ static inline bool hugepage_global_enabled(void)
 {
 	return transparent_hugepage_flags &
 			((1<<TRANSPARENT_HUGEPAGE_FLAG) |
+			(1<<TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG) |
 			(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG));
 }
 
@@ -195,6 +197,12 @@ static inline bool hugepage_global_always(void)
 			(1<<TRANSPARENT_HUGEPAGE_FLAG);
 }
 
+static inline bool hugepage_global_defer(void)
+{
+	return transparent_hugepage_flags &
+			(1<<TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG);
+}
+
 static inline int highest_order(unsigned long orders)
 {
 	return fls_long(orders) - 1;
@@ -291,13 +299,16 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 				       unsigned long tva_flags,
 				       unsigned long orders)
 {
+	if ((tva_flags & TVA_IN_PF) && hugepage_global_defer() &&
+			!(vm_flags & VM_HUGEPAGE))
+		return 0;
+
 	/* Optimization to check if required orders are enabled early. */
 	if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
 		unsigned long mask = READ_ONCE(huge_anon_orders_always);
-
 		if (vm_flags & VM_HUGEPAGE)
 			mask |= READ_ONCE(huge_anon_orders_madvise);
-		if (hugepage_global_always() ||
+		if (hugepage_global_always() || hugepage_global_defer() ||
 		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
 			mask |= READ_ONCE(huge_anon_orders_inherit);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 700988a0d5cf..ce0ee74753af 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -297,12 +297,15 @@ static ssize_t enabled_show(struct kobject *kobj,
 	const char *output;
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags))
-		output = "[always] madvise never";
+		output = "[always] madvise defer never";
 	else if (test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
 			  &transparent_hugepage_flags))
-		output = "always [madvise] never";
+		output = "always [madvise] defer never";
+	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG,
+			  &transparent_hugepage_flags))
+		output = "always madvise [defer] never";
 	else
-		output = "always madvise [never]";
+		output = "always madvise defer [never]";
 
 	return sysfs_emit(buf, "%s\n", output);
 }
@@ -315,13 +318,20 @@ static ssize_t enabled_store(struct kobject *kobj,
 
 	if (sysfs_streq(buf, "always")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG, &transparent_hugepage_flags);
 		set_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+	} else if (sysfs_streq(buf, "defer")) {
+		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG, &transparent_hugepage_flags);
 	} else if (sysfs_streq(buf, "madvise")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG, &transparent_hugepage_flags);
 		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
 	} else if (sysfs_streq(buf, "never")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG, &transparent_hugepage_flags);
 	} else
 		ret = -EINVAL;
 
@@ -954,18 +964,32 @@ static int __init setup_transparent_hugepage(char *str)
 			&transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
 			  &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG,
+			  &transparent_hugepage_flags);
 		ret = 1;
+	} else if (!strcmp(str, "defer")) {
+		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
+			  &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+			  &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG,
+			  &transparent_hugepage_flags);
+		ret = 1;
 	} else if (!strcmp(str, "madvise")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
 			  &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG,
+			  &transparent_hugepage_flags);
 		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
-			&transparent_hugepage_flags);
+			  &transparent_hugepage_flags);
 		ret = 1;
 	} else if (!strcmp(str, "never")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
 			  &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
 			  &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG,
+			  &transparent_hugepage_flags);
 		ret = 1;
 	}
 out:
-- 
2.49.0



* [PATCH v6 2/4] mm: document (m)THP defer usage
  2025-05-15  3:38 [PATCH v6 0/4] mm: introduce THP deferred setting Nico Pache
  2025-05-15  3:38 ` [PATCH v6 1/4] mm: defer THP insertion to khugepaged Nico Pache
@ 2025-05-15  3:38 ` Nico Pache
  2025-05-15  3:38 ` [PATCH v6 3/4] khugepaged: add defer option to mTHP options Nico Pache
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Nico Pache @ 2025-05-15  3:38 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-kselftest
  Cc: rientjes, hannes, lorenzo.stoakes, rdunlap, mhocko, Liam.Howlett,
	zokeefe, surenb, jglisse, cl, jack, dave.hansen, will, tiwai,
	catalin.marinas, anshuman.khandual, dev.jain, raquini, aarcange,
	kirill.shutemov, yang, thomas.hellstrom, vishal.moola, sunnanyong,
	usamaarif642, wangkefeng.wang, ziy, shuah, peterx, willy,
	ryan.roberts, baolin.wang, baohua, david, mathieu.desnoyers,
	mhiramat, rostedt, corbet, akpm, npache, Bagas Sanjaya

The new defer option for (m)THPs allows for a more conservative
approach to (m)THPs. Document its usage in the transhuge admin-guide.

Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 31 ++++++++++++++++------
 1 file changed, 23 insertions(+), 8 deletions(-)
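
The documented `echo defer >.../enabled` commands can also be issued from a
program. The following is a sketch with a hypothetical helper name; it
assumes the caller has permission to write the sysfs file and that the
running kernel includes this series, otherwise the write is expected to fail.

```c
#include <stdio.h>

/*
 * Write a THP policy string such as "defer" to a sysfs-style control file.
 * Returns 0 on success and -1 on failure.
 */
static int thp_write_setting(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;

	ok = fputs(value, f) >= 0;
	/* The buffered write reaches the kernel at flush time, so check fclose(). */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

For example,
thp_write_setting("/sys/kernel/mm/transparent_hugepage/enabled", "defer")
corresponds to the top-level command added to the documentation in this
patch.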

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 5c63fe51b3ad..7e87ef317add 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -88,8 +88,9 @@ In certain cases when hugepages are enabled system wide, application
 may end up allocating more memory resources. An application may mmap a
 large region but only touch 1 byte of it, in that case a 2M page might
 be allocated instead of a 4k page for no good. This is why it's
-possible to disable hugepages system-wide and to only have them inside
-MADV_HUGEPAGE madvise regions.
+possible to disable hugepages system-wide, only have them inside
+MADV_HUGEPAGE madvise regions, or defer them away from the page fault
+handler to khugepaged.
 
 Embedded systems should enable hugepages only inside madvise regions
 to eliminate any risk of wasting any precious byte of memory and to
@@ -99,6 +100,15 @@ Applications that gets a lot of benefit from hugepages and that don't
 risk to lose memory by using hugepages, should use
 madvise(MADV_HUGEPAGE) on their critical mmapped regions.
 
+Applications that would like to benefit from THPs but would still like a
+more memory conservative approach can choose 'defer'. This avoids
+inserting THPs at the page fault handler unless they are MADV_HUGEPAGE.
+Khugepaged will then scan all mappings, even those not explicitly marked
+with MADV_HUGEPAGE, for potential collapses into (m)THPs. Admins using
+the 'defer' setting should consider tweaking max_ptes_none. The
+current default of 511 may aggressively collapse your PTEs into PMDs.
+Lower this value to conserve more memory (e.g., max_ptes_none=64).
+
 .. _thp_sysfs:
 
 sysfs
@@ -109,11 +119,14 @@ Global THP controls
 
 Transparent Hugepage Support for anonymous memory can be entirely disabled
 (mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
-regions (to avoid the risk of consuming more memory resources) or enabled
-system wide. This can be achieved per-supported-THP-size with one of::
+regions (to avoid the risk of consuming more memory resources), deferred to
+khugepaged, or enabled system wide.
+
+This can be achieved per-supported-THP-size with one of::
 
 	echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
 	echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
+	echo defer >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
 	echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
 
 where <size> is the hugepage size being addressed, the available sizes
@@ -136,6 +149,7 @@ The top-level setting (for use with "inherit") can be set by issuing
 one of the following commands::
 
 	echo always >/sys/kernel/mm/transparent_hugepage/enabled
+	echo defer >/sys/kernel/mm/transparent_hugepage/enabled
 	echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
 	echo never >/sys/kernel/mm/transparent_hugepage/enabled
 
@@ -286,7 +300,8 @@ of small pages into one large page::
 A higher value leads to use additional memory for programs.
 A lower value leads to gain less thp performance. Value of
 max_ptes_none can waste cpu time very little, you can
-ignore it.
+ignore it. Consider lowering this value when using
+``transparent_hugepage=defer``.
 
 ``max_ptes_swap`` specifies how many pages can be brought in from
 swap when collapsing a group of pages into a transparent huge page::
@@ -311,14 +326,14 @@ Boot parameters
 
 You can change the sysfs boot time default for the top-level "enabled"
 control by passing the parameter ``transparent_hugepage=always`` or
-``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the
-kernel command line.
+``transparent_hugepage=madvise`` or ``transparent_hugepage=defer`` or
+``transparent_hugepage=never`` to the kernel command line.
 
 Alternatively, each supported anonymous THP size can be controlled by
 passing ``thp_anon=<size>[KMG],<size>[KMG]:<state>;<size>[KMG]-<size>[KMG]:<state>``,
 where ``<size>`` is the THP size (must be a power of 2 of PAGE_SIZE and
 supported anonymous THP)  and ``<state>`` is one of ``always``, ``madvise``,
-``never`` or ``inherit``.
+``defer``, ``never`` or ``inherit``.
 
 For example, the following will set 16K, 32K, 64K THP to ``always``,
 set 128K, 512K to ``inherit``, set 256K to ``madvise`` and 1M, 2M
-- 
2.49.0



* [PATCH v6 3/4] khugepaged: add defer option to mTHP options
  2025-05-15  3:38 [PATCH v6 0/4] mm: introduce THP deferred setting Nico Pache
  2025-05-15  3:38 ` [PATCH v6 1/4] mm: defer THP insertion to khugepaged Nico Pache
  2025-05-15  3:38 ` [PATCH v6 2/4] mm: document (m)THP defer usage Nico Pache
@ 2025-05-15  3:38 ` Nico Pache
  2025-05-15  3:38 ` [PATCH v6 4/4] selftests: mm: add defer to thp setting parser Nico Pache
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Nico Pache @ 2025-05-15  3:38 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-kselftest
  Cc: rientjes, hannes, lorenzo.stoakes, rdunlap, mhocko, Liam.Howlett,
	zokeefe, surenb, jglisse, cl, jack, dave.hansen, will, tiwai,
	catalin.marinas, anshuman.khandual, dev.jain, raquini, aarcange,
	kirill.shutemov, yang, thomas.hellstrom, vishal.moola, sunnanyong,
	usamaarif642, wangkefeng.wang, ziy, shuah, peterx, willy,
	ryan.roberts, baolin.wang, baohua, david, mathieu.desnoyers,
	mhiramat, rostedt, corbet, akpm, npache

Now that we have defer to globally disable THPs at fault time, let's add
a defer setting to the mTHP options. This will allow khugepaged to
operate at that order, while avoiding it at PF time.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/linux/huge_mm.h |  5 +++++
 mm/huge_memory.c        | 38 +++++++++++++++++++++++++++++++++-----
 mm/khugepaged.c         |  8 ++++----
 3 files changed, 42 insertions(+), 9 deletions(-)
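
A note on the flag encoding used below: TVA_IN_KHUGEPAGE includes the
TVA_ENFORCE_SYSFS bit, so existing TVA_ENFORCE_SYSFS checks still match
khugepaged callers, while the defer orders are only added when both bits are
set. The standalone sketch below copies the flag values from this patch; the
helper names are made up for this note, and the mask function leaves out the
"inherit" handling for brevity.

```c
#include <stdbool.h>

/* Flag values as defined in include/linux/huge_mm.h by this patch. */
#define TVA_SMAPS		(1 << 0)
#define TVA_IN_PF		(1 << 1)
#define TVA_ENFORCE_SYSFS	(1 << 2)
#define TVA_IN_KHUGEPAGE	((1 << 2) | (1 << 3))

/* True only when every bit of TVA_IN_KHUGEPAGE is present. */
static bool tva_is_khugepaged(unsigned long tva_flags)
{
	return (tva_flags & TVA_IN_KHUGEPAGE) == TVA_IN_KHUGEPAGE;
}

/* Simplified model of the per-order mask; "inherit" is left out. */
static unsigned long model_anon_mask(unsigned long tva_flags, bool madv_hugepage,
				     unsigned long always, unsigned long madvise,
				     unsigned long defer)
{
	unsigned long mask = always;

	if (tva_is_khugepaged(tva_flags))
		mask |= defer;		/* defer orders are visible to khugepaged only */
	if (madv_hugepage)
		mask |= madvise;

	return mask;
}
```

A caller passing only TVA_ENFORCE_SYSFS, or TVA_IN_PF | TVA_ENFORCE_SYSFS,
never sees the orders that are set to defer.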

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 02038e3db829..71a1edb5062e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -96,6 +96,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 #define TVA_SMAPS		(1 << 0)	/* Will be used for procfs */
 #define TVA_IN_PF		(1 << 1)	/* Page fault handler */
 #define TVA_ENFORCE_SYSFS	(1 << 2)	/* Obey sysfs configuration */
+#define TVA_IN_KHUGEPAGE	((1 << 2) | (1 << 3)) /* Khugepaged defer support */
 
 #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
 	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
@@ -182,6 +183,7 @@ extern unsigned long transparent_hugepage_flags;
 extern unsigned long huge_anon_orders_always;
 extern unsigned long huge_anon_orders_madvise;
 extern unsigned long huge_anon_orders_inherit;
+extern unsigned long huge_anon_orders_defer;
 
 static inline bool hugepage_global_enabled(void)
 {
@@ -306,6 +308,9 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 	/* Optimization to check if required orders are enabled early. */
 	if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
 		unsigned long mask = READ_ONCE(huge_anon_orders_always);
+
+		if ((tva_flags & TVA_IN_KHUGEPAGE) == TVA_IN_KHUGEPAGE)
+			mask |= READ_ONCE(huge_anon_orders_defer);
 		if (vm_flags & VM_HUGEPAGE)
 			mask |= READ_ONCE(huge_anon_orders_madvise);
 		if (hugepage_global_always() || hugepage_global_defer() ||
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ce0ee74753af..addf4c16c91d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -81,6 +81,7 @@ unsigned long huge_zero_pfn __read_mostly = ~0UL;
 unsigned long huge_anon_orders_always __read_mostly;
 unsigned long huge_anon_orders_madvise __read_mostly;
 unsigned long huge_anon_orders_inherit __read_mostly;
+unsigned long huge_anon_orders_defer __read_mostly;
 static bool anon_orders_configured __initdata;
 
 static inline bool file_thp_enabled(struct vm_area_struct *vma)
@@ -505,13 +506,15 @@ static ssize_t anon_enabled_show(struct kobject *kobj,
 	const char *output;
 
 	if (test_bit(order, &huge_anon_orders_always))
-		output = "[always] inherit madvise never";
+		output = "[always] inherit madvise defer never";
 	else if (test_bit(order, &huge_anon_orders_inherit))
-		output = "always [inherit] madvise never";
+		output = "always [inherit] madvise defer never";
 	else if (test_bit(order, &huge_anon_orders_madvise))
-		output = "always inherit [madvise] never";
+		output = "always inherit [madvise] defer never";
+	else if (test_bit(order, &huge_anon_orders_defer))
+		output = "always inherit madvise [defer] never";
 	else
-		output = "always inherit madvise [never]";
+		output = "always inherit madvise defer [never]";
 
 	return sysfs_emit(buf, "%s\n", output);
 }
@@ -527,25 +530,36 @@ static ssize_t anon_enabled_store(struct kobject *kobj,
 		spin_lock(&huge_anon_orders_lock);
 		clear_bit(order, &huge_anon_orders_inherit);
 		clear_bit(order, &huge_anon_orders_madvise);
+		clear_bit(order, &huge_anon_orders_defer);
 		set_bit(order, &huge_anon_orders_always);
 		spin_unlock(&huge_anon_orders_lock);
 	} else if (sysfs_streq(buf, "inherit")) {
 		spin_lock(&huge_anon_orders_lock);
 		clear_bit(order, &huge_anon_orders_always);
 		clear_bit(order, &huge_anon_orders_madvise);
+		clear_bit(order, &huge_anon_orders_defer);
 		set_bit(order, &huge_anon_orders_inherit);
 		spin_unlock(&huge_anon_orders_lock);
 	} else if (sysfs_streq(buf, "madvise")) {
 		spin_lock(&huge_anon_orders_lock);
 		clear_bit(order, &huge_anon_orders_always);
 		clear_bit(order, &huge_anon_orders_inherit);
+		clear_bit(order, &huge_anon_orders_defer);
 		set_bit(order, &huge_anon_orders_madvise);
 		spin_unlock(&huge_anon_orders_lock);
+	} else if (sysfs_streq(buf, "defer")) {
+		spin_lock(&huge_anon_orders_lock);
+		clear_bit(order, &huge_anon_orders_always);
+		clear_bit(order, &huge_anon_orders_inherit);
+		clear_bit(order, &huge_anon_orders_madvise);
+		set_bit(order, &huge_anon_orders_defer);
+		spin_unlock(&huge_anon_orders_lock);
 	} else if (sysfs_streq(buf, "never")) {
 		spin_lock(&huge_anon_orders_lock);
 		clear_bit(order, &huge_anon_orders_always);
 		clear_bit(order, &huge_anon_orders_inherit);
 		clear_bit(order, &huge_anon_orders_madvise);
+		clear_bit(order, &huge_anon_orders_defer);
 		spin_unlock(&huge_anon_orders_lock);
 	} else
 		ret = -EINVAL;
@@ -1002,7 +1016,7 @@ static char str_dup[PAGE_SIZE] __initdata;
 static int __init setup_thp_anon(char *str)
 {
 	char *token, *range, *policy, *subtoken;
-	unsigned long always, inherit, madvise;
+	unsigned long always, inherit, madvise, defer;
 	char *start_size, *end_size;
 	int start, end, nr;
 	char *p;
@@ -1014,6 +1028,8 @@ static int __init setup_thp_anon(char *str)
 	always = huge_anon_orders_always;
 	madvise = huge_anon_orders_madvise;
 	inherit = huge_anon_orders_inherit;
+	defer = huge_anon_orders_defer;
+
 	p = str_dup;
 	while ((token = strsep(&p, ";")) != NULL) {
 		range = strsep(&token, ":");
@@ -1053,18 +1069,28 @@ static int __init setup_thp_anon(char *str)
 				bitmap_set(&always, start, nr);
 				bitmap_clear(&inherit, start, nr);
 				bitmap_clear(&madvise, start, nr);
+				bitmap_clear(&defer, start, nr);
 			} else if (!strcmp(policy, "madvise")) {
 				bitmap_set(&madvise, start, nr);
 				bitmap_clear(&inherit, start, nr);
 				bitmap_clear(&always, start, nr);
+				bitmap_clear(&defer, start, nr);
 			} else if (!strcmp(policy, "inherit")) {
 				bitmap_set(&inherit, start, nr);
 				bitmap_clear(&madvise, start, nr);
 				bitmap_clear(&always, start, nr);
+				bitmap_clear(&defer, start, nr);
+			} else if (!strcmp(policy, "defer")) {
+				bitmap_set(&defer, start, nr);
+				bitmap_clear(&madvise, start, nr);
+				bitmap_clear(&always, start, nr);
+				bitmap_clear(&inherit, start, nr);
 			} else if (!strcmp(policy, "never")) {
 				bitmap_clear(&inherit, start, nr);
 				bitmap_clear(&madvise, start, nr);
 				bitmap_clear(&always, start, nr);
+				bitmap_clear(&defer, start, nr);
+
 			} else {
 				pr_err("invalid policy %s in thp_anon boot parameter\n", policy);
 				goto err;
@@ -1075,6 +1101,8 @@ static int __init setup_thp_anon(char *str)
 	huge_anon_orders_always = always;
 	huge_anon_orders_madvise = madvise;
 	huge_anon_orders_inherit = inherit;
+	huge_anon_orders_defer = defer;
+
 	anon_orders_configured = true;
 	return 1;
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0723b184c7a4..428060495c49 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -491,7 +491,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
 {
 	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
 	    hugepage_pmd_enabled()) {
-		if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
+		if (thp_vma_allowable_order(vma, vm_flags, TVA_IN_KHUGEPAGE,
 					    PMD_ORDER))
 			__khugepaged_enter(vma->vm_mm);
 	}
@@ -955,7 +955,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 				   struct collapse_control *cc, int order)
 {
 	struct vm_area_struct *vma;
-	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
+	unsigned long tva_flags = cc->is_khugepaged ? TVA_IN_KHUGEPAGE : 0;
 
 	if (unlikely(khugepaged_test_exit_or_disable(mm)))
 		return SCAN_ANY_PROCESS;
@@ -1434,7 +1434,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	bool writable = false;
 	int chunk_none_count = 0;
 	int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER);
-	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
+	unsigned long tva_flags = cc->is_khugepaged ? TVA_IN_KHUGEPAGE : 0;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
 	result = find_pmd_or_thp_or_none(mm, address, &pmd);
@@ -2626,7 +2626,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			break;
 		}
 		if (!thp_vma_allowable_order(vma, vma->vm_flags,
-					TVA_ENFORCE_SYSFS, PMD_ORDER)) {
+					TVA_IN_KHUGEPAGE, PMD_ORDER)) {
 skip:
 			progress++;
 			continue;
-- 
2.49.0



* [PATCH v6 4/4] selftests: mm: add defer to thp setting parser
  2025-05-15  3:38 [PATCH v6 0/4] mm: introduce THP deferred setting Nico Pache
                   ` (2 preceding siblings ...)
  2025-05-15  3:38 ` [PATCH v6 3/4] khugepaged: add defer option to mTHP options Nico Pache
@ 2025-05-15  3:38 ` Nico Pache
  2025-05-20  9:24 ` [PATCH v6 0/4] mm: introduce THP deferred setting Yafang Shao
  2025-05-20  9:42 ` Lorenzo Stoakes
  5 siblings, 0 replies; 20+ messages in thread
From: Nico Pache @ 2025-05-15  3:38 UTC (permalink / raw)
  To: linux-mm, linux-doc, linux-kernel, linux-kselftest
  Cc: rientjes, hannes, lorenzo.stoakes, rdunlap, mhocko, Liam.Howlett,
	zokeefe, surenb, jglisse, cl, jack, dave.hansen, will, tiwai,
	catalin.marinas, anshuman.khandual, dev.jain, raquini, aarcange,
	kirill.shutemov, yang, thomas.hellstrom, vishal.moola, sunnanyong,
	usamaarif642, wangkefeng.wang, ziy, shuah, peterx, willy,
	ryan.roberts, baolin.wang, baohua, david, mathieu.desnoyers,
	mhiramat, rostedt, corbet, akpm, npache

Add the defer setting to the selftests library for reading thp settings.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 tools/testing/selftests/mm/thp_settings.c | 1 +
 tools/testing/selftests/mm/thp_settings.h | 1 +
 2 files changed, 2 insertions(+)
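
For context, the sysfs "enabled" files report the active setting in square
brackets, e.g. "always madvise [defer] never". The selftests library matches
that token against its string table, which is why "defer" needs an entry
there. The snippet below is a standalone sketch of extracting the bracketed
token. The function name is made up for this note and this is not the
selftests implementation.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed token of a sysfs "enabled" line into out.
 * Returns 0 on success, -1 if there is no bracketed token or it does not fit.
 */
static int thp_selected_setting(const char *line, char *out, size_t out_len)
{
	const char *start = strchr(line, '[');
	const char *end = start ? strchr(start, ']') : NULL;
	size_t len;

	if (!end)
		return -1;

	len = (size_t)(end - start - 1);
	if (len + 1 > out_len)
		return -1;

	memcpy(out, start + 1, len);
	out[len] = '\0';
	return 0;
}
```

Given "always madvise [defer] never", this yields "defer", which a caller
could then look up in a table such as thp_enabled_strings.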

diff --git a/tools/testing/selftests/mm/thp_settings.c b/tools/testing/selftests/mm/thp_settings.c
index ad872af1c81a..b2f9f62b302a 100644
--- a/tools/testing/selftests/mm/thp_settings.c
+++ b/tools/testing/selftests/mm/thp_settings.c
@@ -20,6 +20,7 @@ static const char * const thp_enabled_strings[] = {
 	"always",
 	"inherit",
 	"madvise",
+	"defer",
 	NULL
 };
 
diff --git a/tools/testing/selftests/mm/thp_settings.h b/tools/testing/selftests/mm/thp_settings.h
index fc131d23d593..0d52e6d4f754 100644
--- a/tools/testing/selftests/mm/thp_settings.h
+++ b/tools/testing/selftests/mm/thp_settings.h
@@ -11,6 +11,7 @@ enum thp_enabled {
 	THP_ALWAYS,
 	THP_INHERIT,
 	THP_MADVISE,
+	THP_DEFER,
 };
 
 enum thp_defrag {
-- 
2.49.0



* Re: [PATCH v6 1/4] mm: defer THP insertion to khugepaged
  2025-05-15  3:38 ` [PATCH v6 1/4] mm: defer THP insertion to khugepaged Nico Pache
@ 2025-05-20  7:43   ` Yafang Shao
  2025-06-14 11:25   ` Klara Modin
  1 sibling, 0 replies; 20+ messages in thread
From: Yafang Shao @ 2025-05-20  7:43 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-kselftest, rientjes,
	hannes, lorenzo.stoakes, rdunlap, mhocko, Liam.Howlett, zokeefe,
	surenb, jglisse, cl, jack, dave.hansen, will, tiwai,
	catalin.marinas, anshuman.khandual, dev.jain, raquini, aarcange,
	kirill.shutemov, yang, thomas.hellstrom, vishal.moola, sunnanyong,
	usamaarif642, wangkefeng.wang, ziy, shuah, peterx, willy,
	ryan.roberts, baolin.wang, baohua, david, mathieu.desnoyers,
	mhiramat, rostedt, corbet, akpm

On Thu, May 15, 2025 at 12:39 PM Nico Pache <npache@redhat.com> wrote:
>
> Setting /transparent_hugepages/enabled=always allows applications
> to benefit from THPs without having to madvise. However, the page fault
> handler weighs very few considerations when deciding whether or not to
> actually use a THP. This can lead to a lot of wasted memory. khugepaged only
> operates on memory that was either allocated with enabled=always or
> MADV_HUGEPAGE.
>
> Introduce the ability to set enabled=defer, which will prevent THPs from
> being allocated by the page fault handler unless madvise is set,
> leaving it up to khugepaged to decide which allocations will collapse to a
> THP. This should allow applications to benefit from THPs, while curbing
> some of the memory waste.
>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Co-developed-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  include/linux/huge_mm.h | 15 +++++++++++++--
>  mm/huge_memory.c        | 31 +++++++++++++++++++++++++++----
>  2 files changed, 40 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index e3d15c737008..02038e3db829 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -48,6 +48,7 @@ enum transparent_hugepage_flag {
>         TRANSPARENT_HUGEPAGE_UNSUPPORTED,
>         TRANSPARENT_HUGEPAGE_FLAG,
>         TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +       TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG,
>         TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
>         TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
>         TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG,
> @@ -186,6 +187,7 @@ static inline bool hugepage_global_enabled(void)
>  {
>         return transparent_hugepage_flags &
>                         ((1<<TRANSPARENT_HUGEPAGE_FLAG) |
> +                       (1<<TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG) |
>                         (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG));
>  }
>
> @@ -195,6 +197,12 @@ static inline bool hugepage_global_always(void)
>                         (1<<TRANSPARENT_HUGEPAGE_FLAG);
>  }
>
> +static inline bool hugepage_global_defer(void)
> +{
> +       return transparent_hugepage_flags &
> +                       (1<<TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG);
> +}
> +
>  static inline int highest_order(unsigned long orders)
>  {
>         return fls_long(orders) - 1;
> @@ -291,13 +299,16 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>                                        unsigned long tva_flags,
>                                        unsigned long orders)
>  {
> +       if ((tva_flags & TVA_IN_PF) && hugepage_global_defer() &&
> +                       !(vm_flags & VM_HUGEPAGE))
> +               return 0;
> +
>         /* Optimization to check if required orders are enabled early. */
>         if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
>                 unsigned long mask = READ_ONCE(huge_anon_orders_always);
> -
>                 if (vm_flags & VM_HUGEPAGE)
>                         mask |= READ_ONCE(huge_anon_orders_madvise);
> -               if (hugepage_global_always() ||
> +               if (hugepage_global_always() || hugepage_global_defer() ||
>                     ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
>                         mask |= READ_ONCE(huge_anon_orders_inherit);
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 700988a0d5cf..ce0ee74753af 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -297,12 +297,15 @@ static ssize_t enabled_show(struct kobject *kobj,
>         const char *output;
>
>         if (test_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags))
> -               output = "[always] madvise never";
> +               output = "[always] madvise defer never";

A small nit: listing the modes in alphabetical order might improve
readability here.
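
To illustrate the nit, here is a minimal userspace sketch (not the kernel's
enabled_show() implementation) that prints the mode list alphabetically and
brackets the active mode. The helper name format_enabled() and the
thp_modes[] table are hypothetical, and it assumes "alphabetical" means
"always defer madvise never".

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical table: THP "enabled" modes in alphabetical order. */
static const char *const thp_modes[] = { "always", "defer", "madvise", "never" };

/*
 * Write the mode list into buf, bracketing the active mode.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int format_enabled(char *buf, size_t len, const char *active)
{
	size_t n = sizeof(thp_modes) / sizeof(thp_modes[0]);
	size_t off = 0;

	for (size_t i = 0; i < n; i++) {
		int is_active = strcmp(thp_modes[i], active) == 0;
		int w = snprintf(buf + off, len - off, "%s%s%s%s",
				 i ? " " : "",
				 is_active ? "[" : "",
				 thp_modes[i],
				 is_active ? "]" : "");

		/* snprintf reports the length it wanted; reject truncation. */
		if (w < 0 || (size_t)w >= len - off)
			return -1;
		off += (size_t)w;
	}
	return (int)off;
}
```

With a sufficiently large buffer, calling format_enabled(buf, sizeof(buf),
"defer") fills buf with "always [defer] madvise never".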

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/4] mm: introduce THP deferred setting
  2025-05-15  3:38 [PATCH v6 0/4] mm: introduce THP deferred setting Nico Pache
                   ` (3 preceding siblings ...)
  2025-05-15  3:38 ` [PATCH v6 4/4] selftests: mm: add defer to thp setting parser Nico Pache
@ 2025-05-20  9:24 ` Yafang Shao
  2025-05-21 10:19   ` Nico Pache
  2025-05-20  9:42 ` Lorenzo Stoakes
  5 siblings, 1 reply; 20+ messages in thread
From: Yafang Shao @ 2025-05-20  9:24 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-kselftest, rientjes,
	hannes, lorenzo.stoakes, rdunlap, mhocko, Liam.Howlett, zokeefe,
	surenb, jglisse, cl, jack, dave.hansen, will, tiwai,
	catalin.marinas, anshuman.khandual, dev.jain, raquini, aarcange,
	kirill.shutemov, yang, thomas.hellstrom, vishal.moola, sunnanyong,
	usamaarif642, wangkefeng.wang, ziy, shuah, peterx, willy,
	ryan.roberts, baolin.wang, baohua, david, mathieu.desnoyers,
	mhiramat, rostedt, corbet, akpm

On Thu, May 15, 2025 at 11:41 AM Nico Pache <npache@redhat.com> wrote:
>
> This series is a follow-up to [1], which adds mTHP support to khugepaged.
> mTHP khugepaged support is a "loose" dependency for the sysfs/sysctl
> configs to make sense. Without it global="defer" and  mTHP="inherit" case
> is "undefined" behavior.
>
> We've seen cases were customers switching from RHEL7 to RHEL8 see a
> significant increase in the memory footprint for the same workloads.
>
> Through our investigations we found that a large contributing factor to
> the increase in RSS was an increase in THP usage.
>
> For workloads like MySQL, or when using allocators like jemalloc, it is
> often recommended to set /transparent_hugepages/enabled=never. This is
> in part due to performance degradations and increased memory waste.
>
> This series introduces enabled=defer, this setting acts as a middle
> ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
> page fault handler will act normally, making a hugepage if possible. If
> the allocation is not MADV_HUGEPAGE, then the page fault handler will
> default to the base size allocation. The caveat is that khugepaged can
> still operate on pages that are not MADV_HUGEPAGE.
>
> This allows for three things... one, applications specifically designed to
> use hugepages will get them, and two, applications that don't use
> hugepages can still benefit from them without aggressively inserting
> THPs at every possible chance. This curbs the memory waste, and defers
> the use of hugepages to khugepaged. Khugepaged can then scan the memory
> for eligible collapsing. Lastly there is the added benefit for those who
> want THPs but experience higher latency PFs. Now you can get base page
> performance at the PF handler and Hugepage performance for those mappings
> after they collapse.
>
> Admins may want to lower max_ptes_none, if not, khugepaged may
> aggressively collapse single allocations into hugepages.
>
> TESTING:
> - Built for x86_64, aarch64, ppc64le, and s390x
> - selftests mm
> - In [1] I provided a script [2] that has multiple access patterns
> - lots of general use.
> - redis testing. This test was my original case for the defer mode. What I
>    was able to prove was that THP=always leads to increased max_latency
>    cases; hence why it is recommended to disable THPs for redis servers.
>    However with 'defer' we dont have the max_latency spikes and can still
>    get the system to utilize THPs. I further tested this with the mTHP
>    defer setting and found that redis (and probably other jmalloc users)
>    can utilize THPs via defer (+mTHP defer) without a large latency
>    penalty and some potential gains. I uploaded some mmtest results
>    here[3] which compares:
>        stock+thp=never
>        stock+(m)thp=always
>        khugepaged-mthp + defer (max_ptes_none=64)
>
>   The results show that (m)THPs can cause some throughput regression in
>   some cases, but also has gains in other cases. The mTHP+defer results
>   have more gains and less losses over the (m)THP=always case.
>
> V6 Changes:
> - nits
> - rebased dependent series and added review tags
>
> V5 Changes:
> - rebased dependent series
> - added reviewed-by tag on 2/4
>
> V4 Changes:
> - Minor Documentation fixes
> - rebased the dependent series [1] onto mm-unstable
>     commit 0e68b850b1d3 ("vmalloc: use atomic_long_add_return_relaxed()")
>
> V3 Changes:
> - Combined the documentation commits into one, and moved a section to the
>   khugepaged mthp patchset
>
> V2 Changes:
> - base changes on mTHP khugepaged support
> - Fix selftests parsing issue
> - add mTHP defer option
> - add mTHP defer Documentation
>
> [1] - https://lore.kernel.org/all/20250515032226.128900-1-npache@redhat.com/
> [2] - https://gitlab.com/npache/khugepaged_mthp_test
> [3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.html
>
> Nico Pache (4):
>   mm: defer THP insertion to khugepaged
>   mm: document (m)THP defer usage
>   khugepaged: add defer option to mTHP options
>   selftests: mm: add defer to thp setting parser
>
>  Documentation/admin-guide/mm/transhuge.rst | 31 +++++++---
>  include/linux/huge_mm.h                    | 18 +++++-
>  mm/huge_memory.c                           | 69 +++++++++++++++++++---
>  mm/khugepaged.c                            |  8 +--
>  tools/testing/selftests/mm/thp_settings.c  |  1 +
>  tools/testing/selftests/mm/thp_settings.h  |  1 +
>  6 files changed, 106 insertions(+), 22 deletions(-)
>
> --
> 2.49.0
>
>

Hello Nico,

Upon reviewing the series, it occurred to me that BPF could solve this
more cleanly. Adding a 'tva_flags' parameter to the BPF hook would
handle this case and future scenarios without requiring new modes. The
BPF mode could then serve as a unified solution.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/4] mm: introduce THP deferred setting
  2025-05-15  3:38 [PATCH v6 0/4] mm: introduce THP deferred setting Nico Pache
                   ` (4 preceding siblings ...)
  2025-05-20  9:24 ` [PATCH v6 0/4] mm: introduce THP deferred setting Yafang Shao
@ 2025-05-20  9:42 ` Lorenzo Stoakes
  2025-05-21 10:41   ` Nico Pache
  5 siblings, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20  9:42 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-kselftest, rientjes,
	hannes, rdunlap, mhocko, Liam.Howlett, zokeefe, surenb, jglisse,
	cl, jack, dave.hansen, will, tiwai, catalin.marinas,
	anshuman.khandual, dev.jain, raquini, aarcange, kirill.shutemov,
	yang, thomas.hellstrom, vishal.moola, sunnanyong, usamaarif642,
	wangkefeng.wang, ziy, shuah, peterx, willy, ryan.roberts,
	baolin.wang, baohua, david, mathieu.desnoyers, mhiramat, rostedt,
	corbet, akpm

On Wed, May 14, 2025 at 09:38:53PM -0600, Nico Pache wrote:
> This series is a follow-up to [1], which adds mTHP support to khugepaged.
> mTHP khugepaged support is a "loose" dependency for the sysfs/sysctl
> configs to make sense. Without it global="defer" and  mTHP="inherit" case
> is "undefined" behavior.

How can this be a follow up to an unmerged series? I'm confused by that.

And you're saying that you're introducing 'undefined behaviour' on the
assumption that another series, which seems to have quite a bit of
discussion left to run, will be merged?

While I'd understand if this was an RFC just to put the idea out there,
you're not proposing it as such?

Unless there's a really good reason we're doing it this way (I may be
missing something), can we just have this as an RFC until the series it
depends on is settled?

>
> We've seen cases were customers switching from RHEL7 to RHEL8 see a
> significant increase in the memory footprint for the same workloads.
>
> Through our investigations we found that a large contributing factor to
> the increase in RSS was an increase in THP usage.
>
> For workloads like MySQL, or when using allocators like jemalloc, it is
> often recommended to set /transparent_hugepages/enabled=never. This is
> in part due to performance degradations and increased memory waste.
>
> This series introduces enabled=defer, this setting acts as a middle
> ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
> page fault handler will act normally, making a hugepage if possible. If
> the allocation is not MADV_HUGEPAGE, then the page fault handler will
> default to the base size allocation. The caveat is that khugepaged can
> still operate on pages that are not MADV_HUGEPAGE.
>
> This allows for three things... one, applications specifically designed to
> use hugepages will get them, and two, applications that don't use
> hugepages can still benefit from them without aggressively inserting
> THPs at every possible chance. This curbs the memory waste, and defers
> the use of hugepages to khugepaged. Khugepaged can then scan the memory
> for eligible collapsing. Lastly there is the added benefit for those who
> want THPs but experience higher latency PFs. Now you can get base page
> performance at the PF handler and Hugepage performance for those mappings
> after they collapse.
>
> Admins may want to lower max_ptes_none, if not, khugepaged may
> aggressively collapse single allocations into hugepages.
>
> TESTING:
> - Built for x86_64, aarch64, ppc64le, and s390x
> - selftests mm
> - In [1] I provided a script [2] that has multiple access patterns
> - lots of general use.

OK so this truly is dependent on the unmerged series? Or isn't it?

Is your testing based on that?

Because again... that surely makes this series a no-go until we land the
prior (which might be changed, and thus necessitate re-testing).

Are you going to provide any of these numbers/data anywhere?

> - redis testing. This test was my original case for the defer mode. What I
>    was able to prove was that THP=always leads to increased max_latency
>    cases; hence why it is recommended to disable THPs for redis servers.
>    However with 'defer' we dont have the max_latency spikes and can still
>    get the system to utilize THPs. I further tested this with the mTHP
>    defer setting and found that redis (and probably other jmalloc users)
>    can utilize THPs via defer (+mTHP defer) without a large latency
>    penalty and some potential gains. I uploaded some mmtest results
>    here[3] which compares:
>        stock+thp=never
>        stock+(m)thp=always
>        khugepaged-mthp + defer (max_ptes_none=64)
>
>   The results show that (m)THPs can cause some throughput regression in
>   some cases, but also has gains in other cases. The mTHP+defer results
>   have more gains and less losses over the (m)THP=always case.
>
> V6 Changes:
> - nits
> - rebased dependent series and added review tags
>
> V5 Changes:
> - rebased dependent series
> - added reviewed-by tag on 2/4
>
> V4 Changes:
> - Minor Documentation fixes
> - rebased the dependent series [1] onto mm-unstable
>     commit 0e68b850b1d3 ("vmalloc: use atomic_long_add_return_relaxed()")
>
> V3 Changes:
> - Combined the documentation commits into one, and moved a section to the
>   khugepaged mthp patchset
>
> V2 Changes:
> - base changes on mTHP khugepaged support
> - Fix selftests parsing issue
> - add mTHP defer option
> - add mTHP defer Documentation
>
> [1] - https://lore.kernel.org/all/20250515032226.128900-1-npache@redhat.com/
> [2] - https://gitlab.com/npache/khugepaged_mthp_test
> [3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.html
>
> Nico Pache (4):
>   mm: defer THP insertion to khugepaged
>   mm: document (m)THP defer usage
>   khugepaged: add defer option to mTHP options
>   selftests: mm: add defer to thp setting parser
>
>  Documentation/admin-guide/mm/transhuge.rst | 31 +++++++---
>  include/linux/huge_mm.h                    | 18 +++++-
>  mm/huge_memory.c                           | 69 +++++++++++++++++++---
>  mm/khugepaged.c                            |  8 +--
>  tools/testing/selftests/mm/thp_settings.c  |  1 +
>  tools/testing/selftests/mm/thp_settings.h  |  1 +
>  6 files changed, 106 insertions(+), 22 deletions(-)
>
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/4] mm: introduce THP deferred setting
  2025-05-20  9:24 ` [PATCH v6 0/4] mm: introduce THP deferred setting Yafang Shao
@ 2025-05-21 10:19   ` Nico Pache
  2025-05-21 11:35     ` Yafang Shao
  0 siblings, 1 reply; 20+ messages in thread
From: Nico Pache @ 2025-05-21 10:19 UTC (permalink / raw)
  To: Yafang Shao
  Cc: linux-mm, linux-doc, linux-kernel, linux-kselftest, rientjes,
	hannes, lorenzo.stoakes, rdunlap, mhocko, Liam.Howlett, zokeefe,
	surenb, jglisse, cl, jack, dave.hansen, will, tiwai,
	catalin.marinas, anshuman.khandual, dev.jain, raquini, aarcange,
	kirill.shutemov, yang, thomas.hellstrom, vishal.moola, sunnanyong,
	usamaarif642, wangkefeng.wang, ziy, shuah, peterx, willy,
	ryan.roberts, baolin.wang, baohua, david, mathieu.desnoyers,
	mhiramat, rostedt, corbet, akpm

On Tue, May 20, 2025 at 3:25 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Thu, May 15, 2025 at 11:41 AM Nico Pache <npache@redhat.com> wrote:
> >
> > This series is a follow-up to [1], which adds mTHP support to khugepaged.
> > mTHP khugepaged support is a "loose" dependency for the sysfs/sysctl
> > configs to make sense. Without it global="defer" and  mTHP="inherit" case
> > is "undefined" behavior.
> >
> > We've seen cases were customers switching from RHEL7 to RHEL8 see a
> > significant increase in the memory footprint for the same workloads.
> >
> > Through our investigations we found that a large contributing factor to
> > the increase in RSS was an increase in THP usage.
> >
> > For workloads like MySQL, or when using allocators like jemalloc, it is
> > often recommended to set /transparent_hugepages/enabled=never. This is
> > in part due to performance degradations and increased memory waste.
> >
> > This series introduces enabled=defer, this setting acts as a middle
> > ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
> > page fault handler will act normally, making a hugepage if possible. If
> > the allocation is not MADV_HUGEPAGE, then the page fault handler will
> > default to the base size allocation. The caveat is that khugepaged can
> > still operate on pages that are not MADV_HUGEPAGE.
> >
> > This allows for three things... one, applications specifically designed to
> > use hugepages will get them, and two, applications that don't use
> > hugepages can still benefit from them without aggressively inserting
> > THPs at every possible chance. This curbs the memory waste, and defers
> > the use of hugepages to khugepaged. Khugepaged can then scan the memory
> > for eligible collapsing. Lastly there is the added benefit for those who
> > want THPs but experience higher latency PFs. Now you can get base page
> > performance at the PF handler and Hugepage performance for those mappings
> > after they collapse.
> >
> > Admins may want to lower max_ptes_none, if not, khugepaged may
> > aggressively collapse single allocations into hugepages.
> >
> > TESTING:
> > - Built for x86_64, aarch64, ppc64le, and s390x
> > - selftests mm
> > - In [1] I provided a script [2] that has multiple access patterns
> > - lots of general use.
> > - redis testing. This test was my original case for the defer mode. What I
> >    was able to prove was that THP=always leads to increased max_latency
> >    cases; hence why it is recommended to disable THPs for redis servers.
> >    However with 'defer' we dont have the max_latency spikes and can still
> >    get the system to utilize THPs. I further tested this with the mTHP
> >    defer setting and found that redis (and probably other jmalloc users)
> >    can utilize THPs via defer (+mTHP defer) without a large latency
> >    penalty and some potential gains. I uploaded some mmtest results
> >    here[3] which compares:
> >        stock+thp=never
> >        stock+(m)thp=always
> >        khugepaged-mthp + defer (max_ptes_none=64)
> >
> >   The results show that (m)THPs can cause some throughput regression in
> >   some cases, but also has gains in other cases. The mTHP+defer results
> >   have more gains and less losses over the (m)THP=always case.
> >
> > V6 Changes:
> > - nits
> > - rebased dependent series and added review tags
> >
> > V5 Changes:
> > - rebased dependent series
> > - added reviewed-by tag on 2/4
> >
> > V4 Changes:
> > - Minor Documentation fixes
> > - rebased the dependent series [1] onto mm-unstable
> >     commit 0e68b850b1d3 ("vmalloc: use atomic_long_add_return_relaxed()")
> >
> > V3 Changes:
> > - Combined the documentation commits into one, and moved a section to the
> >   khugepaged mthp patchset
> >
> > V2 Changes:
> > - base changes on mTHP khugepaged support
> > - Fix selftests parsing issue
> > - add mTHP defer option
> > - add mTHP defer Documentation
> >
> > [1] - https://lore.kernel.org/all/20250515032226.128900-1-npache@redhat.com/
> > [2] - https://gitlab.com/npache/khugepaged_mthp_test
> > [3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.html
> >
> > Nico Pache (4):
> >   mm: defer THP insertion to khugepaged
> >   mm: document (m)THP defer usage
> >   khugepaged: add defer option to mTHP options
> >   selftests: mm: add defer to thp setting parser
> >
> >  Documentation/admin-guide/mm/transhuge.rst | 31 +++++++---
> >  include/linux/huge_mm.h                    | 18 +++++-
> >  mm/huge_memory.c                           | 69 +++++++++++++++++++---
> >  mm/khugepaged.c                            |  8 +--
> >  tools/testing/selftests/mm/thp_settings.c  |  1 +
> >  tools/testing/selftests/mm/thp_settings.h  |  1 +
> >  6 files changed, 106 insertions(+), 22 deletions(-)
> >
> > --
> > 2.49.0
> >
> >
>
> Hello Nico,
>
> Upon reviewing the series, it occurred to me that BPF could solve this
> more cleanly. Adding a 'tva_flags' parameter to the BPF hook would
> handle this case and future scenarios without requiring new modes. The
> BPF mode could then serve as a unified solution.
Hi Yafang,

I don't see how this is the case. This would require users to
modify/add functionality rather than configuring the system in this
manner. What if BPF is not configured or being used? Having to use an
additional technology that requires precise configuration doesn't seem
cleaner.

Either way, thank you for taking a look at the series!

-- Nico
>
> --
> Regards
> Yafang
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/4] mm: introduce THP deferred setting
  2025-05-20  9:42 ` Lorenzo Stoakes
@ 2025-05-21 10:41   ` Nico Pache
  2025-05-21 11:24     ` Lorenzo Stoakes
  0 siblings, 1 reply; 20+ messages in thread
From: Nico Pache @ 2025-05-21 10:41 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-mm, linux-doc, linux-kernel, linux-kselftest, rientjes,
	hannes, rdunlap, mhocko, Liam.Howlett, zokeefe, surenb, jglisse,
	cl, jack, dave.hansen, will, tiwai, catalin.marinas,
	anshuman.khandual, dev.jain, raquini, aarcange, kirill.shutemov,
	yang, thomas.hellstrom, vishal.moola, sunnanyong, usamaarif642,
	wangkefeng.wang, ziy, shuah, peterx, willy, ryan.roberts,
	baolin.wang, baohua, david, mathieu.desnoyers, mhiramat, rostedt,
	corbet, akpm

On Tue, May 20, 2025 at 3:43 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, May 14, 2025 at 09:38:53PM -0600, Nico Pache wrote:
> > This series is a follow-up to [1], which adds mTHP support to khugepaged.
> > mTHP khugepaged support is a "loose" dependency for the sysfs/sysctl
> > configs to make sense. Without it global="defer" and  mTHP="inherit" case
> > is "undefined" behavior.
>
> How can this be a follow up to an unmerged series? I'm confused by that.
Hi Lorenzo,

Follow-up or loose dependency; I'm not sure of the correct terminology.

Either way, as I was developing this as a potential solution for the
THP internal fragmentation issue, upstream was working on adding
mTHPs. When adding a new THP sysctl entry, I noticed mTHP would now be
missing the same entry. Furthermore, I was told that mTHP support for
khugepaged was desired, so I began working on it in conjunction. Since
the behavior of a global defer setting combined with any mix of mTHP
settings is undefined, this series became dependent on the khugepaged
support. In any case, patch 1 of this series is the core functionality;
the rest fills the undefined behavior gap.
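
As a rough illustration of the core behaviour described in the cover
letter, here is a small userspace model (not kernel code). The enum and
function names are hypothetical, and the model deliberately ignores the
per-size mTHP controls and the other checks that
thp_vma_allowable_orders() performs.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the global THP "enabled" modes. */
enum thp_mode { THP_NEVER, THP_MADVISE, THP_DEFER, THP_ALWAYS };

/*
 * Model: may the page fault handler install a THP for this mapping?
 * Under defer, only MADV_HUGEPAGE mappings get a THP at fault time.
 */
static bool pf_may_use_thp(enum thp_mode mode, bool vma_madv_hugepage)
{
	switch (mode) {
	case THP_ALWAYS:
		return true;
	case THP_MADVISE:
	case THP_DEFER:
		return vma_madv_hugepage;
	default:
		return false;
	}
}

/*
 * Model: may khugepaged later collapse this mapping into a THP?
 * Under defer, khugepaged may operate on non-MADV_HUGEPAGE mappings too.
 */
static bool khugepaged_may_collapse(enum thp_mode mode, bool vma_madv_hugepage)
{
	switch (mode) {
	case THP_ALWAYS:
	case THP_DEFER:
		return true;
	case THP_MADVISE:
		return vma_madv_hugepage;
	default:
		return false;
	}
}
```

In this model, defer matches madvise at page fault time and matches always
for khugepaged, which is the "middle ground" the cover letter describes.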
>
> And you're saying that you're introducing 'undefined behaviour' on the
> assumption that another series which seems to have quite a bit of
> discussion let to run will be merged?
This could technically get merged without the mTHP khugepaged changes,
but then the reviews would probably all be pointing out what I pointed
out above. Chicken or Egg problem...
>
> While I'd understand if this was an RFC just to put the idea out there,
> you're not proposing it as such?
Nope, we've already discussed this in both the MM alignment and THP
upstream meetings; no one opposed it, and a lot of testing was done
by me, RH's CI, and our perf teams. I've posted several RFCs
before posting a patchset.
>
> Unless there's a really good reason we're doing this way (I may be missing
> something), can we just have this as an RFC until the series it depends on
> is settled?
Hopefully paragraph one clears this up! They were built in
conjunction, but posting them as one series didn't feel right (and
IIRC this was also discussed, and this was decided).
>
> >
> > We've seen cases were customers switching from RHEL7 to RHEL8 see a
> > significant increase in the memory footprint for the same workloads.
> >
> > Through our investigations we found that a large contributing factor to
> > the increase in RSS was an increase in THP usage.
> >
> > For workloads like MySQL, or when using allocators like jemalloc, it is
> > often recommended to set /transparent_hugepages/enabled=never. This is
> > in part due to performance degradations and increased memory waste.
> >
> > This series introduces enabled=defer, this setting acts as a middle
> > ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
> > page fault handler will act normally, making a hugepage if possible. If
> > the allocation is not MADV_HUGEPAGE, then the page fault handler will
> > default to the base size allocation. The caveat is that khugepaged can
> > still operate on pages that are not MADV_HUGEPAGE.
> >
> > This allows for three things... one, applications specifically designed to
> > use hugepages will get them, and two, applications that don't use
> > hugepages can still benefit from them without aggressively inserting
> > THPs at every possible chance. This curbs the memory waste, and defers
> > the use of hugepages to khugepaged. Khugepaged can then scan the memory
> > for eligible collapsing. Lastly there is the added benefit for those who
> > want THPs but experience higher latency PFs. Now you can get base page
> > performance at the PF handler and Hugepage performance for those mappings
> > after they collapse.
> >
> > Admins may want to lower max_ptes_none, if not, khugepaged may
> > aggressively collapse single allocations into hugepages.
> >
> > TESTING:
> > - Built for x86_64, aarch64, ppc64le, and s390x
> > - selftests mm
> > - In [1] I provided a script [2] that has multiple access patterns
> > - lots of general use.
>
> OK so this truly is dependent on the unmerged series? Or isn't it?
>
> Is your testing based on that?
Most of the testing was done in conjunction, but independent testing
was also done on this series (including by a large customer that was
itching to try the changes, and they were very satisfied with the
results).
>
> Because again... that surely makes this series a no-go until we land the
> prior (which might be changed, and thus necessitate re-testing).
>
> Are you going to provide any of these numbers/data anywhere?
There is a link to the results in this cover letter
[3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.html
>
> > - redis testing. This test was my original case for the defer mode. What I
> >    was able to prove was that THP=always leads to increased max_latency
> >    cases; hence why it is recommended to disable THPs for redis servers.
> >    However with 'defer' we dont have the max_latency spikes and can still
> >    get the system to utilize THPs. I further tested this with the mTHP
> >    defer setting and found that redis (and probably other jmalloc users)
> >    can utilize THPs via defer (+mTHP defer) without a large latency
> >    penalty and some potential gains. I uploaded some mmtest results
> >    here[3] which compares:
> >        stock+thp=never
> >        stock+(m)thp=always
> >        khugepaged-mthp + defer (max_ptes_none=64)
> >
> >   The results show that (m)THPs can cause some throughput regression in
> >   some cases, but also has gains in other cases. The mTHP+defer results
> >   have more gains and less losses over the (m)THP=always case.
> >
> > V6 Changes:
> > - nits
> > - rebased dependent series and added review tags
> >
> > V5 Changes:
> > - rebased dependent series
> > - added reviewed-by tag on 2/4
> >
> > V4 Changes:
> > - Minor Documentation fixes
> > - rebased the dependent series [1] onto mm-unstable
> >     commit 0e68b850b1d3 ("vmalloc: use atomic_long_add_return_relaxed()")
> >
> > V3 Changes:
> > - Combined the documentation commits into one, and moved a section to the
> >   khugepaged mthp patchset
> >
> > V2 Changes:
> > - base changes on mTHP khugepaged support
> > - Fix selftests parsing issue
> > - add mTHP defer option
> > - add mTHP defer Documentation
> >
> > [1] - https://lore.kernel.org/all/20250515032226.128900-1-npache@redhat.com/
> > [2] - https://gitlab.com/npache/khugepaged_mthp_test
> > [3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.html
> >
> > Nico Pache (4):
> >   mm: defer THP insertion to khugepaged
> >   mm: document (m)THP defer usage
> >   khugepaged: add defer option to mTHP options
> >   selftests: mm: add defer to thp setting parser
> >
> >  Documentation/admin-guide/mm/transhuge.rst | 31 +++++++---
> >  include/linux/huge_mm.h                    | 18 +++++-
> >  mm/huge_memory.c                           | 69 +++++++++++++++++++---
> >  mm/khugepaged.c                            |  8 +--
> >  tools/testing/selftests/mm/thp_settings.c  |  1 +
> >  tools/testing/selftests/mm/thp_settings.h  |  1 +
> >  6 files changed, 106 insertions(+), 22 deletions(-)
> >
> > --
> > 2.49.0
> >
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/4] mm: introduce THP deferred setting
  2025-05-21 10:41   ` Nico Pache
@ 2025-05-21 11:24     ` Lorenzo Stoakes
  2025-05-21 11:46       ` David Hildenbrand
  2025-05-29  4:26       ` Nico Pache
  0 siblings, 2 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-05-21 11:24 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-kselftest, rientjes,
	hannes, rdunlap, mhocko, Liam.Howlett, zokeefe, surenb, jglisse,
	cl, jack, dave.hansen, will, tiwai, catalin.marinas,
	anshuman.khandual, dev.jain, raquini, aarcange, kirill.shutemov,
	yang, thomas.hellstrom, vishal.moola, sunnanyong, usamaarif642,
	wangkefeng.wang, ziy, shuah, peterx, willy, ryan.roberts,
	baolin.wang, baohua, david, mathieu.desnoyers, mhiramat, rostedt,
	corbet, akpm

To start with, I do apologise for coming to this at v6; I realise it's
irritating to have pushback at this late stage. This is more so my attempt
to understand where this series -sits- so I can properly review it.

So please bear with me here :)

So, I remain very confused. This may just be a _me_ thing here :)

So let me check my understanding:

1. This series introduces this new THP deferred mode.
2. By 'follow-up' really you mean 'inspired by' or 'related to' right?
3. If this series lands before [1], commits 2 - 4 are 'undefined
   behaviour'.

In my view if 3 is true this series should be RFC until [1] merges.

If I've got it wrong and this needs to land first, we should RFC [1].

That way we can un-RFC once the dependency is met.

We have about 5 [m]THP series in flight at the moment, all touching at
least vaguely related stuff, so any help for reviewers would be hugely
appreciated, thanks :)

On Wed, May 21, 2025 at 04:41:54AM -0600, Nico Pache wrote:
> On Tue, May 20, 2025 at 3:43 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Wed, May 14, 2025 at 09:38:53PM -0600, Nico Pache wrote:
> > > This series is a follow-up to [1], which adds mTHP support to khugepaged.
> > > mTHP khugepaged support is a "loose" dependency for the sysfs/sysctl
> > > configs to make sense. Without it global="defer" and  mTHP="inherit" case
> > > is "undefined" behavior.
> >
> > How can this be a follow up to an unmerged series? I'm confused by that.
> Hi Lorenzo,
>
> follow up or loose dependency. Not sure the correct terminology.
>

See above. Let's nail this down please.

> Either way, as I was developing this as a potential solution for the
> THP internal fragmentation issue, upstream was working on adding
> mTHPs. By adding a new THP sysctl entry I noticed mTHP would now be
> missing the same entry. Furthermore I was told mTHP support for
> khugepaged was a desire, so I began working on it in conjunction. So
> given the undefined behavior of defer globally while any mix of mTHP
> settings, it became dependent on the khugepaged support. Either way
> patch 1 of this series is the core functionality. The rest is to fill
> the undefined behavior gap.
> >
> > And you're saying that you're introducing 'undefined behaviour' on the
> > assumption that another series which seems to have quite a bit of
> > discussion let to run will be merged?
> This could technically get merged without the mTHP khugepaged changes,
> but then the reviews would probably all be pointing out what I pointed
> out above. Chicken or Egg problem...
> >
> > While I'd understand if this was an RFC just to put the idea out there,
> > you're not proposing it as such?
> Nope we've already discussed this in both the MM alignment and thp
> upstream meetings, no one was opposing it, and a lot of testing was
> done-- by me, RH's CI, and our perf teams. Ive posted several RFCs
> before posting a patchset.
> >
> > Unless there's a really good reason we're doing this way (I may be missing
> > something), can we just have this as an RFC until the series it depends on
> > is settled?
> Hopefully paragraph one clears this up! They were built in
> conjunction, but posting them as one series didn't feel right (and
> IIRC this was also discussed, and this was decided).

'This was also discussed and this was decided' :)

I'm guessing rather you mean discussion was had with other reviewers and of
course our erstwhile THP maintainer David, and you guys decided this made
more sense?

Obviously upstream discussion is what counts, but as annoying as it is, one
does have to address the concerns of reviewers even if late to a series
(again, apologies for this).

So, to be clear - I'm not intending to hold up or block the series, I just
want to understand how things are, this is the purpose here.

Thanks!

> >
> > >
> > > We've seen cases were customers switching from RHEL7 to RHEL8 see a
> > > significant increase in the memory footprint for the same workloads.
> > >
> > > Through our investigations we found that a large contributing factor to
> > > the increase in RSS was an increase in THP usage.
> > >
> > > For workloads like MySQL, or when using allocators like jemalloc, it is
> > > often recommended to set /transparent_hugepages/enabled=never. This is
> > > in part due to performance degradations and increased memory waste.
> > >
> > > This series introduces enabled=defer, this setting acts as a middle
> > > ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
> > > page fault handler will act normally, making a hugepage if possible. If
> > > the allocation is not MADV_HUGEPAGE, then the page fault handler will
> > > default to the base size allocation. The caveat is that khugepaged can
> > > still operate on pages that are not MADV_HUGEPAGE.
> > >
> > > This allows for three things... one, applications specifically designed to
> > > use hugepages will get them, and two, applications that don't use
> > > hugepages can still benefit from them without aggressively inserting
> > > THPs at every possible chance. This curbs the memory waste, and defers
> > > the use of hugepages to khugepaged. Khugepaged can then scan the memory
> > > for eligible collapsing. Lastly there is the added benefit for those who
> > > want THPs but experience higher latency PFs. Now you can get base page
> > > performance at the PF handler and Hugepage performance for those mappings
> > > after they collapse.
> > >
> > > Admins may want to lower max_ptes_none, if not, khugepaged may
> > > aggressively collapse single allocations into hugepages.
> > >
> > > TESTING:
> > > - Built for x86_64, aarch64, ppc64le, and s390x
> > > - selftests mm
> > > - In [1] I provided a script [2] that has multiple access patterns
> > > - lots of general use.
> >
> > OK so this truly is dependent on the unmerged series? Or isn't it?
> >
> > Is your testing based on that?
> Most of the testing was done in conjunction, but independent testing
> was also done on this series (including by a large customer that was
> itching to try the changes, and they were very satisfied with the
> results).

You should make this very clear in the cover letter.

> >
> > Because again... that surely makes this series a no-go until we land the
> > prior (which might be changed, and thus necessitate re-testing).
> >
> > Are you going to provide any of these numbers/data anywhere?
> There is a link to the results in this cover letter
> [3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.html
> >

Ultimately it's not ok in mm to have a link to a website that might go away
any time, these cover letters are 'baked in' to the commit log. Are you
sure this website with 'testoutput2' will exist in 10 years time? :)

You should at the very least add a summary of this data in the cover
letter, perhaps referring back to this link as 'at the time of writing full
results are available at' or something like this.

> > > - redis testing. This test was my original case for the defer mode. What I
> > >    was able to prove was that THP=always leads to increased max_latency
> > >    cases; hence why it is recommended to disable THPs for redis servers.
> > >    However with 'defer' we dont have the max_latency spikes and can still
> > >    get the system to utilize THPs. I further tested this with the mTHP
> > >    defer setting and found that redis (and probably other jmalloc users)
> > >    can utilize THPs via defer (+mTHP defer) without a large latency
> > >    penalty and some potential gains. I uploaded some mmtest results
> > >    here[3] which compares:
> > >        stock+thp=never
> > >        stock+(m)thp=always
> > >        khugepaged-mthp + defer (max_ptes_none=64)
> > >
> > >   The results show that (m)THPs can cause some throughput regression in
> > >   some cases, but also has gains in other cases. The mTHP+defer results
> > >   have more gains and less losses over the (m)THP=always case.
> > >
> > > V6 Changes:
> > > - nits
> > > - rebased dependent series and added review tags
> > >
> > > V5 Changes:
> > > - rebased dependent series
> > > - added reviewed-by tag on 2/4
> > >
> > > V4 Changes:
> > > - Minor Documentation fixes
> > > - rebased the dependent series [1] onto mm-unstable
> > >     commit 0e68b850b1d3 ("vmalloc: use atomic_long_add_return_relaxed()")
> > >
> > > V3 Changes:
> > > - Combined the documentation commits into one, and moved a section to the
> > >   khugepaged mthp patchset
> > >
> > > V2 Changes:
> > > - base changes on mTHP khugepaged support
> > > - Fix selftests parsing issue
> > > - add mTHP defer option
> > > - add mTHP defer Documentation
> > >
> > > [1] - https://lore.kernel.org/all/20250515032226.128900-1-npache@redhat.com/
> > > [2] - https://gitlab.com/npache/khugepaged_mthp_test
> > > [3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.html
> > >
> > > Nico Pache (4):
> > >   mm: defer THP insertion to khugepaged
> > >   mm: document (m)THP defer usage
> > >   khugepaged: add defer option to mTHP options
> > >   selftests: mm: add defer to thp setting parser
> > >
> > >  Documentation/admin-guide/mm/transhuge.rst | 31 +++++++---
> > >  include/linux/huge_mm.h                    | 18 +++++-
> > >  mm/huge_memory.c                           | 69 +++++++++++++++++++---
> > >  mm/khugepaged.c                            |  8 +--
> > >  tools/testing/selftests/mm/thp_settings.c  |  1 +
> > >  tools/testing/selftests/mm/thp_settings.h  |  1 +
> > >  6 files changed, 106 insertions(+), 22 deletions(-)
> > >
> > > --
> > > 2.49.0
> > >
> >
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/4] mm: introduce THP deferred setting
  2025-05-21 10:19   ` Nico Pache
@ 2025-05-21 11:35     ` Yafang Shao
  0 siblings, 0 replies; 20+ messages in thread
From: Yafang Shao @ 2025-05-21 11:35 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-kselftest, rientjes,
	hannes, lorenzo.stoakes, rdunlap, mhocko, Liam.Howlett, zokeefe,
	surenb, jglisse, cl, jack, dave.hansen, will, tiwai,
	catalin.marinas, anshuman.khandual, dev.jain, raquini, aarcange,
	kirill.shutemov, yang, thomas.hellstrom, vishal.moola, sunnanyong,
	usamaarif642, wangkefeng.wang, ziy, shuah, peterx, willy,
	ryan.roberts, baolin.wang, baohua, david, mathieu.desnoyers,
	mhiramat, rostedt, corbet, akpm

On Wed, May 21, 2025 at 6:19 PM Nico Pache <npache@redhat.com> wrote:
>
> On Tue, May 20, 2025 at 3:25 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Thu, May 15, 2025 at 11:41 AM Nico Pache <npache@redhat.com> wrote:
> > >
> > > This series is a follow-up to [1], which adds mTHP support to khugepaged.
> > > mTHP khugepaged support is a "loose" dependency for the sysfs/sysctl
> > > configs to make sense. Without it global="defer" and  mTHP="inherit" case
> > > is "undefined" behavior.
> > >
> > > We've seen cases were customers switching from RHEL7 to RHEL8 see a
> > > significant increase in the memory footprint for the same workloads.
> > >
> > > Through our investigations we found that a large contributing factor to
> > > the increase in RSS was an increase in THP usage.
> > >
> > > For workloads like MySQL, or when using allocators like jemalloc, it is
> > > often recommended to set /transparent_hugepages/enabled=never. This is
> > > in part due to performance degradations and increased memory waste.
> > >
> > > This series introduces enabled=defer, this setting acts as a middle
> > > ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
> > > page fault handler will act normally, making a hugepage if possible. If
> > > the allocation is not MADV_HUGEPAGE, then the page fault handler will
> > > default to the base size allocation. The caveat is that khugepaged can
> > > still operate on pages that are not MADV_HUGEPAGE.
> > >
> > > This allows for three things... one, applications specifically designed to
> > > use hugepages will get them, and two, applications that don't use
> > > hugepages can still benefit from them without aggressively inserting
> > > THPs at every possible chance. This curbs the memory waste, and defers
> > > the use of hugepages to khugepaged. Khugepaged can then scan the memory
> > > for eligible collapsing. Lastly there is the added benefit for those who
> > > want THPs but experience higher latency PFs. Now you can get base page
> > > performance at the PF handler and Hugepage performance for those mappings
> > > after they collapse.
> > >
> > > Admins may want to lower max_ptes_none, if not, khugepaged may
> > > aggressively collapse single allocations into hugepages.
> > >
> > > TESTING:
> > > - Built for x86_64, aarch64, ppc64le, and s390x
> > > - selftests mm
> > > - In [1] I provided a script [2] that has multiple access patterns
> > > - lots of general use.
> > > - redis testing. This test was my original case for the defer mode. What I
> > >    was able to prove was that THP=always leads to increased max_latency
> > >    cases; hence why it is recommended to disable THPs for redis servers.
> > >    However with 'defer' we dont have the max_latency spikes and can still
> > >    get the system to utilize THPs. I further tested this with the mTHP
> > >    defer setting and found that redis (and probably other jmalloc users)
> > >    can utilize THPs via defer (+mTHP defer) without a large latency
> > >    penalty and some potential gains. I uploaded some mmtest results
> > >    here[3] which compares:
> > >        stock+thp=never
> > >        stock+(m)thp=always
> > >        khugepaged-mthp + defer (max_ptes_none=64)
> > >
> > >   The results show that (m)THPs can cause some throughput regression in
> > >   some cases, but also has gains in other cases. The mTHP+defer results
> > >   have more gains and less losses over the (m)THP=always case.
> > >
> > > V6 Changes:
> > > - nits
> > > - rebased dependent series and added review tags
> > >
> > > V5 Changes:
> > > - rebased dependent series
> > > - added reviewed-by tag on 2/4
> > >
> > > V4 Changes:
> > > - Minor Documentation fixes
> > > - rebased the dependent series [1] onto mm-unstable
> > >     commit 0e68b850b1d3 ("vmalloc: use atomic_long_add_return_relaxed()")
> > >
> > > V3 Changes:
> > > - Combined the documentation commits into one, and moved a section to the
> > >   khugepaged mthp patchset
> > >
> > > V2 Changes:
> > > - base changes on mTHP khugepaged support
> > > - Fix selftests parsing issue
> > > - add mTHP defer option
> > > - add mTHP defer Documentation
> > >
> > > [1] - https://lore.kernel.org/all/20250515032226.128900-1-npache@redhat.com/
> > > [2] - https://gitlab.com/npache/khugepaged_mthp_test
> > > [3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.html
> > >
> > > Nico Pache (4):
> > >   mm: defer THP insertion to khugepaged
> > >   mm: document (m)THP defer usage
> > >   khugepaged: add defer option to mTHP options
> > >   selftests: mm: add defer to thp setting parser
> > >
> > >  Documentation/admin-guide/mm/transhuge.rst | 31 +++++++---
> > >  include/linux/huge_mm.h                    | 18 +++++-
> > >  mm/huge_memory.c                           | 69 +++++++++++++++++++---
> > >  mm/khugepaged.c                            |  8 +--
> > >  tools/testing/selftests/mm/thp_settings.c  |  1 +
> > >  tools/testing/selftests/mm/thp_settings.h  |  1 +
> > >  6 files changed, 106 insertions(+), 22 deletions(-)
> > >
> > > --
> > > 2.49.0
> > >
> > >
> >
> > Hello Nico,
> >
> > Upon reviewing the series, it occurred to me that BPF could solve this
> > more cleanly. Adding a 'tva_flags' parameter to the BPF hook would
> > handle this case and future scenarios without requiring new modes. The
> > BPF mode could then serve as a unified solution.
> Hi Yafang,
>
> I dont see how this is the case? This would require users to
> modify/add functionality rather than configuring the system in this
> manner. What if BPF is not configured or being used? Having to use an
> additional technology that requires precise configuration doesn't seem
> cleaner.

The core challenge remains: while certain tasks benefit from this new
mode, others see no improvement—or may even regress.
For that reason, implementing it globally seems unwise—per-task
control would be far more effective.
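
To make the "global" point concrete, here is a rough model of the modes as
the cover letter describes them. It is only a sketch: the names
(thp_decision, ThpDecision) are made up for illustration and are not kernel
code, and it ignores MADV_NOHUGEPAGE, per-size mTHP settings and all the
usual eligibility checks.

```python
from dataclasses import dataclass

# "defer" is the mode proposed by this series; the other three exist today.
MODES = ("always", "madvise", "defer", "never")


@dataclass(frozen=True)
class ThpDecision:
    fault_may_use_thp: bool        # page fault handler may allocate a huge page
    khugepaged_may_collapse: bool  # khugepaged may later collapse the range


def thp_decision(mode: str, madv_hugepage: bool) -> ThpDecision:
    """Model the global 'enabled' modes for a single mapping (illustrative only)."""
    if mode not in MODES:
        raise ValueError(f"unknown THP mode: {mode!r}")
    if mode == "always":
        return ThpDecision(True, True)
    if mode == "madvise":
        # Only MADV_HUGEPAGE mappings get THPs, at fault time or via khugepaged.
        return ThpDecision(madv_hugepage, madv_hugepage)
    if mode == "defer":
        # Fault path behaves like "madvise"; khugepaged may still collapse
        # mappings that were never marked MADV_HUGEPAGE.
        return ThpDecision(madv_hugepage, True)
    return ThpDecision(False, False)  # "never"
```

In this model the mode and the madvise flag are the only inputs, so every
task that does not call madvise() gets the same deferred behaviour, whether
or not it benefits from it.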

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/4] mm: introduce THP deferred setting
  2025-05-21 11:24     ` Lorenzo Stoakes
@ 2025-05-21 11:46       ` David Hildenbrand
  2025-05-21 12:00         ` Lorenzo Stoakes
  2025-05-29  4:26       ` Nico Pache
  1 sibling, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2025-05-21 11:46 UTC (permalink / raw)
  To: Lorenzo Stoakes, Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-kselftest, rientjes,
	hannes, rdunlap, mhocko, Liam.Howlett, zokeefe, surenb, jglisse,
	cl, jack, dave.hansen, will, tiwai, catalin.marinas,
	anshuman.khandual, dev.jain, raquini, aarcange, kirill.shutemov,
	yang, thomas.hellstrom, vishal.moola, sunnanyong, usamaarif642,
	wangkefeng.wang, ziy, shuah, peterx, willy, ryan.roberts,
	baolin.wang, baohua, mathieu.desnoyers, mhiramat, rostedt, corbet,
	akpm

On 21.05.25 13:24, Lorenzo Stoakes wrote:
> To start with I do apologise for coming to this at v6, I realise it's
> irritating to have push back at this late stage. This is more so my attempt
> to understand where this series -sits- so I can properly review it.
> 
> So please bear with me here :)
> 
> So, I remain very confused. This may just be a _me_ thing here :)
> 
> So let me check my understanding:
> 
> 1. This series introduces this new THP deferred mode.
> 2. By 'follow-up' really you mean 'inspired by' or 'related to' right?
> 3. If this series lands before [1], commits 2 - 4 are 'undefined
>     behaviour'.
> 
> In my view if 3 is true this series should be RFC until [1] merges.
> 
> If I've got it wrong and this needs to land first, we should RFC [1].
> 
> That way we can un-RFC once the dependency is met.

I really don't have a strong opinion on the RFC vs. !RFC like others 
here -- as long as the dependency is obvious. I treat RFC more as a 
"rough idea" than well tested work.

Anyhow, to me the dependency is obvious, but I've followed the MM 
meeting discussions, development etc.

I interpret "follow up" as "depends on" here. Likely we should have 
spelled out "This series depends on the patch series X that was not 
merged yet, and likely a new version will be required once merged.".

In this particular case, maybe we should just have sent one initial RFC, 
and then rebased it on top of the other work on a public git branch 
(linked from the RFC cover letter).

Once the dependency gets merged, we could just resend the series. 
Looking at the changelog, only minor stuff changed (mostly rebasing etc).

Moving forward, I don't think there is the need to resend as long as the 
dependency isn't merged upstream (or close to being merged upstream) yet.

>>> Because again... that surely makes this series a no-go until we land the
>>> prior (which might be changed, and thus necessitate re-testing).
>>>
>>> Are you going to provide any of these numbers/data anywhere?
>> There is a link to the results in this cover letter
>> [3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.html
>>>
> 
> Ultimately it's not ok in mm to have a link to a website that might go away
> any time, these cover letters are 'baked in' to the commit log. Are you
> sure this website with 'testoutput2' will exist in 10 years time? :)
> 
> You should at the very least add a summary of this data in the cover
> letter, perhaps referring back to this link as 'at the time of writing full
> results are available at' or something like this.

Yeah, or if they were included in some other mail, we can link to that
mail in lore.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/4] mm: introduce THP deferred setting
  2025-05-21 11:46       ` David Hildenbrand
@ 2025-05-21 12:00         ` Lorenzo Stoakes
  2025-05-21 12:24           ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-05-21 12:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-kselftest,
	rientjes, hannes, rdunlap, mhocko, Liam.Howlett, zokeefe, surenb,
	jglisse, cl, jack, dave.hansen, will, tiwai, catalin.marinas,
	anshuman.khandual, dev.jain, raquini, aarcange, kirill.shutemov,
	yang, thomas.hellstrom, vishal.moola, sunnanyong, usamaarif642,
	wangkefeng.wang, ziy, shuah, peterx, willy, ryan.roberts,
	baolin.wang, baohua, mathieu.desnoyers, mhiramat, rostedt, corbet,
	akpm

I think the TL;DR here to avoid too much back and forth is - let's please
make this super super simple :)

I would prefer anything that has a dependency to just sit in RFC until the
dependency is merged.

Or, alternatively, to have a big note at the top:

ANDREW - Please do not merge in mm-unstable until series [1] is merged, and
when that is merged please ping for a resend.

Or whatever it might be.

On Wed, May 21, 2025 at 01:46:38PM +0200, David Hildenbrand wrote:
> On 21.05.25 13:24, Lorenzo Stoakes wrote:
> > To start with I do apologise for coming to this at v6, I realise it's
> > irritating to have push back at this late stage. This is more so my attempt
> > to understand where this series -sits- so I can properly review it.
> >
> > So please bear with me here :)
> >
> > So, I remain very confused. This may just be a _me_ thing here :)
> >
> > So let me check my understanding:
> >
> > 1. This series introduces this new THP deferred mode.
> > 2. By 'follow-up' really you mean 'inspired by' or 'related to' right?
> > 3. If this series lands before [1], commits 2 - 4 are 'undefined
> >     behaviour'.
> >
> > In my view if 3 is true this series should be RFC until [1] merges.
> >
> > If I've got it wrong and this needs to land first, we should RFC [1].
> >
> > That way we can un-RFC once the dependency is met.
>
> I really don't have a strong opinion on the RFC vs. !RFC like others here --
> as long as the dependency is obvious. I treat RFC more as a "rough idea"
> than well tested work.
>
> Anyhow, to me the dependency is obvious, but I've followed the MM meeting
> discussions, development etc.

Right but is it clear to Andrew? I mean the cover letter was super unclear
to me.

What's to prevent things getting merged out of order? And do people 'just
have to remember' to resend? And a resend doesn't necessarily mean patch
set X will come after patch set Y.

If there's a requirement related to the ordering of these series it really
has to be expressed very clearly.

(By the way, I feel expressing things like this is an area where we have
_some kind_ of breakdown in kernel process. It'd be nice to have tags or
something to properly express this sort of thing. But maybe that's another
discussion :)

>
> I interpret "follow up" as "depends on" here. Likely we should have spelled
> out "This series depends on the patch series X that was not merged yet, and
> likely a new version will be required once merged.".
>
> In this particular case, maybe we should just have sent one initial RFC, and
> then rebased it on top of the other work on a public git branch (linked from
> the RFC cover letter).
>
> Once the dependency gets merged, we could just resend the series. Looking at
> the changelog, only minor stuff changed (mostly rebasing etc).
>
> Moving forward, I don't think there is the need to resend as long as the
> dependency isn't merged upstream (or close to being merged upstream) yet.

I mean this is still 'just have to remember' stuff :)

Do we need patches 2-4 if the dependency isn't merged? That was unclear to
me.

>
> > > > Because again... that surely makes this series a no-go until we land the
> > > > prior (which might be changed, and thus necessitate re-testing).
> > > >
> > > > Are you going to provide any of these numbers/data anywhere?
> > > There is a link to the results in this cover letter
> > > [3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.html
> > > >
> >
> > Ultimately it's not ok in mm to have a link to a website that might go away
> > any time, these cover letters are 'baked in' to the commit log. Are you
> > sure this website with 'testoutput2' will exist in 10 years time? :)
> >
> > You should at the very least add a summary of this data in the cover
> > letter, perhaps referring back to this link as 'at the time of writing full
> > results are available at' or something like this.
>
> Yeah, or if they were included in some other mail, we can link to that mail
> in lore.
>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/4] mm: introduce THP deferred setting
  2025-05-21 12:00         ` Lorenzo Stoakes
@ 2025-05-21 12:24           ` David Hildenbrand
  2025-05-21 12:33             ` Lorenzo Stoakes
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2025-05-21 12:24 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-kselftest,
	rientjes, hannes, rdunlap, mhocko, Liam.Howlett, zokeefe, surenb,
	jglisse, cl, jack, dave.hansen, will, tiwai, catalin.marinas,
	anshuman.khandual, dev.jain, raquini, aarcange, kirill.shutemov,
	yang, thomas.hellstrom, vishal.moola, sunnanyong, usamaarif642,
	wangkefeng.wang, ziy, shuah, peterx, willy, ryan.roberts,
	baolin.wang, baohua, mathieu.desnoyers, mhiramat, rostedt, corbet,
	akpm

>>
>> Anyhow, to me the dependency is obvious, but I've followed the MM meeting
>> discussions, development etc.
> 
> Right but is it clear to Andrew? I mean the cover letter was super unclear
> to me.

I mean, assuming that it would not be clear to Andrew (and I think it is
clear to Andrew), we would get CCed on these emails and could
immediately scream STOOOOOP :)

And until this would hit mm-stable, a bit more time would pass.

> 
> What's to prevent things getting merged out of order?

Fortunately, there are still people working here and not machines (at 
least, that's what I hope).

> And do people 'just
> have to remember' to resend?

Yes, in this case Nico wants to get his stuff upstream and must drive it 
once the dependencies are met IMHO.

> 
> If there's a requirement related to the ordering of these series it really
> has to be expressed very clearly.

Jup. I'll note that for now there was no strict rule what to tag as RFC 
and what not that I know of. Of course, if people send broken, 
half-implemented, untested ... crap, it should *clearly* be RFC.

People should be spelling out dependencies in any case (especially for 
non-RFC versions) clearly.

I'll note that even if there would be a rule, I'm afraid we don't have a 
good place to document it (and not sure if people would find it or even 
try finding it ...) :/

A big problem is when some subsystems have their own rules for how to 
handle such things. That causes major pain for contributors ...

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/4] mm: introduce THP deferred setting
  2025-05-21 12:24           ` David Hildenbrand
@ 2025-05-21 12:33             ` Lorenzo Stoakes
  2025-05-21 12:40               ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-05-21 12:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-kselftest,
	rientjes, hannes, rdunlap, mhocko, Liam.Howlett, zokeefe, surenb,
	jglisse, cl, jack, dave.hansen, will, tiwai, catalin.marinas,
	anshuman.khandual, dev.jain, raquini, aarcange, kirill.shutemov,
	yang, thomas.hellstrom, vishal.moola, sunnanyong, usamaarif642,
	wangkefeng.wang, ziy, shuah, peterx, willy, ryan.roberts,
	baolin.wang, baohua, mathieu.desnoyers, mhiramat, rostedt, corbet,
	akpm

Fundamentally I trust you to make sure this all goes correctly so let's not
belabour the point or delay things here :)

So in that vein, Nico - I would suggest for future respins adding a really
clear bit to the header as David suggested :) Also update the cover letter
tests so it isn't reliant on a possibly ephemeral web link.

But otherwise let's proceed as was.

On Wed, May 21, 2025 at 02:24:45PM +0200, David Hildenbrand wrote:
> > >
> > > Anyhow, to me the dependency is obvious, but I've followed the MM meeting
> > > discussions, development etc.
> >
> > Right but is it clear to Andrew? I mean the cover letter was super unclear
> > to me.
>
> I mean, assuming that it would not be clear to Andrew (and I think it is
> clear to Andrew), we would get CCed on these emails and could immediately
> scream STOOOOOP :)
>
> And until this would hit mm-stable, a bit more time would pass.
>
> >
> > What's to prevent things getting merged out of order?
>
> Fortunately, there are still people working here and not machines (at least,
> that's what I hope).

Obligatory link to this :P

https://www.youtube.com/watch?v=5lsExRvJTAI

>
> > And do people 'just
> > have to remember' to resend?
>
> Yes, in this case Nico wants to get his stuff upstream and must drive it
> once the dependencies are met IMHO.
>
> >
> > If there's a requirement related to the ordering of these series it really
> > has to be expressed very clearly.
>
> Jup. I'll note that for now there was no strict rule what to tag as RFC and
> what not that I know of. Of course, if people send broken, half-implemented,
> untested ... crap, it should *clearly* be RFC.
>
> People should be spelling out dependencies in any case (especially for
> non-RFC versions) clearly.
>
> I'll note that even if there would be a rule, I'm afraid we don't have a
> good place to document it (and not sure if people would find it or even try
> finding it ...) :/

Yeah... :)

>
> A big problem is when some subsystems have their own rules for how to handle
> such things. That causes major pain for contributors ...

Yeah, I wish there was something more general.

>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/4] mm: introduce THP deferred setting
  2025-05-21 12:33             ` Lorenzo Stoakes
@ 2025-05-21 12:40               ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2025-05-21 12:40 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Nico Pache, linux-mm, linux-doc, linux-kernel, linux-kselftest,
	rientjes, hannes, rdunlap, mhocko, Liam.Howlett, zokeefe, surenb,
	jglisse, cl, jack, dave.hansen, will, tiwai, catalin.marinas,
	anshuman.khandual, dev.jain, raquini, aarcange, kirill.shutemov,
	yang, thomas.hellstrom, vishal.moola, sunnanyong, usamaarif642,
	wangkefeng.wang, ziy, shuah, peterx, willy, ryan.roberts,
	baolin.wang, baohua, mathieu.desnoyers, mhiramat, rostedt, corbet,
	akpm

On 21.05.25 14:33, Lorenzo Stoakes wrote:
> Fundamentally I trust you to make sure this all goes correctly so let's not
> belabour the point or delay things here :)
> 
> So in that vein, Nico - I would suggest for future respins adding a really
> clear bit to the header as David suggested :) also update the cover letter
> tests so it isn't reliant on a possibly ephemeral web link.
> 
> But otherwise let's proceed as was.

Right, and maybe only post this series if there was a major change,
otherwise wait until the other thing is on its way upstream.

> 
> On Wed, May 21, 2025 at 02:24:45PM +0200, David Hildenbrand wrote:
>>>>
>>>> Anyhow, to me the dependency is obvious, but I've followed the MM meeting
>>>> discussions, development etc.
>>>
>>> Right but is it clear to Andrew? I mean the cover letter was super unclear
>>> to me.
>>
>> I mean, assuming that it would not be clear to Andrew (and I think it is
>> clear to Andrew), we would get CCed on these emails and could immediately
>> scream STOOOOOP :)
>>
>> And until this would hit mm-stable, a bit more time would pass.
>>
>>>
>>> What's to prevent things getting merged out of order?
>>
>> Fortunately, there are still people working here and not machines (at least,
>> that's what I hope).
> 
> Obligatory link to this :P
> 

It's scary how relevant that has become lately :D

> https://www.youtube.com/watch?v=5lsExRvJTAI

... fortunately, whenever I tell the chatbots that they are wrong (IOW,
every time I use them) they reply with "Oh yes, you are right." ... so
far ...

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 0/4] mm: introduce THP deferred setting
  2025-05-21 11:24     ` Lorenzo Stoakes
  2025-05-21 11:46       ` David Hildenbrand
@ 2025-05-29  4:26       ` Nico Pache
  1 sibling, 0 replies; 20+ messages in thread
From: Nico Pache @ 2025-05-29  4:26 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-mm, linux-doc, linux-kernel, linux-kselftest, rientjes,
	hannes, rdunlap, mhocko, Liam.Howlett, zokeefe, surenb, jglisse,
	cl, jack, dave.hansen, will, tiwai, catalin.marinas,
	anshuman.khandual, dev.jain, raquini, aarcange, kirill.shutemov,
	yang, thomas.hellstrom, vishal.moola, sunnanyong, usamaarif642,
	wangkefeng.wang, ziy, shuah, peterx, willy, ryan.roberts,
	baolin.wang, baohua, david, mathieu.desnoyers, mhiramat, rostedt,
	corbet, akpm

On Wed, May 21, 2025 at 5:25 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> To start with I do apologise for coming to this at v6, I realise it's
> irritating to have push back at this late stage. This is more so my attempt
> to understand where this series -sits- so I can properly review it.

No worries at all! The only thing that frustrates/upsets me in
upstream mailing lists is unprovoked rudeness (which you have not
been).
>
> So please bear with me here :)
>
> So, I remain very confused. This may just be a _me_ thing here :)
>
> So let me check my understanding:
>
> 1. This series introduces this new THP deferred mode.
> 2. By 'follow-up' really you mean 'inspired by' or 'related to' right?
> 3. If this series lands before [1], commits 2 - 4 are 'undefined
>    behaviour'.
The khugepaged mTHP support should land first: without it, adding a
defer option to the global parameters makes for undefined behavior in
the sysctls from an admin perspective.
>
> In my view if 3 is true this series should be RFC until [1] merges.
Ideally I was trying to get them merged together (Andrew actually had
them both in mm-new a few weeks ago, but a bug was found that got them
pulled; that is fixed now). The two series complement each other
nicely.
>
> If I've got it wrong and this needs to land first, we should RFC [1].
The khugepaged series [1] should get merged first, but I was shooting
for both at the same time.
>
> That way we can un-RFC once the dependency is met.
>
> We have about 5 [m]THP series in flight at the moment, all touching at
> least vaguely related stuff, so any help for reviewers would be hugely
> appreciated thanks :)
>
> On Wed, May 21, 2025 at 04:41:54AM -0600, Nico Pache wrote:
> > On Tue, May 20, 2025 at 3:43 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Wed, May 14, 2025 at 09:38:53PM -0600, Nico Pache wrote:
> > > > This series is a follow-up to [1], which adds mTHP support to khugepaged.
> > > > mTHP khugepaged support is a "loose" dependency for the sysfs/sysctl
> > > > configs to make sense. Without it global="defer" and  mTHP="inherit" case
> > > > is "undefined" behavior.
> > >
> > > How can this be a follow up to an unmerged series? I'm confused by that.
> > Hi Lorenzo,
> >
> > follow up or loose dependency. Not sure the correct terminology.
> >
>
> See above. Let's nail this down please.
>
> > Either way, as I was developing this as a potential solution for the
> > THP internal fragmentation issue, upstream was working on adding
> > mTHPs. By adding a new THP sysctl entry I noticed mTHP would now be
> > missing the same entry. Furthermore I was told mTHP support for
> > khugepaged was a desire, so I began working on it in conjunction. So
> > given the undefined behavior of defer globally while any mix of mTHP
> > settings, it became dependent on the khugepaged support. Either way
> > patch 1 of this series is the core functionality. The rest is to fill
> > the undefined behavior gap.
> > >
> > > And you're saying that you're introducing 'undefined behaviour' on the
> > > assumption that another series which seems to have quite a bit of
> > > discussion let to run will be merged?
> > This could technically get merged without the mTHP khugepaged changes,
> > but then the reviews would probably all be pointing out what I pointed
> > out above. Chicken or Egg problem...
> > >
> > > While I'd understand if this was an RFC just to put the idea out there,
> > > you're not proposing it as such?
> > Nope we've already discussed this in both the MM alignment and thp
> > upstream meetings, no one was opposing it, and a lot of testing was
> > done-- by me, RH's CI, and our perf teams. Ive posted several RFCs
> > before posting a patchset.
> > >
> > > Unless there's a really good reason we're doing this way (I may be missing
> > > something), can we just have this as an RFC until the series it depends on
> > > is settled?
> > Hopefully paragraph one clears this up! They were built in
> > conjunction, but posting them as one series didn't feel right (and
> > IIRC this was also discussed, and this was decided).
>
> 'This was also discussed and this was decided' :)
>
> I'm guessing rather you mean discussion was had with other reviewers and of
> course our erstwhile THP maintainer David, and you guys decided this made
> more sense?
>
> Obviously upstream discussion is what counts, but as annoying as it is, one
> does have to address the concerns of reviewers even if late to a series
> (again, apologies for this).
>
> So, to be clear - I'm not intending to hold up or block the series, I just
> want to understand how things are, this is the purpose here.

Thanks, I do appreciate the discussion around the process, as I am
fairly new to upstream work (at least at this magnitude). I have been
mostly downstream focused for the last 6 years and I'm trying to shift
upstream as much as possible. So please bear with me as I learn all
the minor undocumented caveats!
>
> Thanks!
>
> > >
> > > >
> > > > We've seen cases were customers switching from RHEL7 to RHEL8 see a
> > > > significant increase in the memory footprint for the same workloads.
> > > >
> > > > Through our investigations we found that a large contributing factor to
> > > > the increase in RSS was an increase in THP usage.
> > > >
> > > > For workloads like MySQL, or when using allocators like jemalloc, it is
> > > > often recommended to set /transparent_hugepages/enabled=never. This is
> > > > in part due to performance degradations and increased memory waste.
> > > >
> > > > This series introduces enabled=defer, this setting acts as a middle
> > > > ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
> > > > page fault handler will act normally, making a hugepage if possible. If
> > > > the allocation is not MADV_HUGEPAGE, then the page fault handler will
> > > > default to the base size allocation. The caveat is that khugepaged can
> > > > still operate on pages that are not MADV_HUGEPAGE.
> > > >
> > > > This allows for three things... one, applications specifically designed to
> > > > use hugepages will get them, and two, applications that don't use
> > > > hugepages can still benefit from them without aggressively inserting
> > > > THPs at every possible chance. This curbs the memory waste, and defers
> > > > the use of hugepages to khugepaged. Khugepaged can then scan the memory
> > > > for eligible collapsing. Lastly there is the added benefit for those who
> > > > want THPs but experience higher latency PFs. Now you can get base page
> > > > performance at the PF handler and Hugepage performance for those mappings
> > > > after they collapse.
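As a rough illustration of the behaviour described above, the fault-time
and khugepaged decisions for each global setting could be modelled as
below. This is only a userspace sketch, not the kernel implementation:
the enum and both function names are made up for this example, and it
ignores per-size mTHP settings, VM_NOHUGEPAGE and the other eligibility
checks.

```c
#include <stdbool.h>

/* Hypothetical model of the global "enabled" setting; not kernel code. */
enum thp_mode { THP_NEVER, THP_MADVISE, THP_DEFER, THP_ALWAYS };

/* May the page fault handler allocate a THP for this mapping? */
static bool pf_may_use_thp(enum thp_mode mode, bool madv_hugepage)
{
	switch (mode) {
	case THP_ALWAYS:
		return true;
	case THP_MADVISE:
	case THP_DEFER:
		/* At fault time, defer behaves like madvise. */
		return madv_hugepage;
	default:
		return false;
	}
}

/* May khugepaged later collapse this mapping into a THP? */
static bool khugepaged_may_collapse(enum thp_mode mode, bool madv_hugepage)
{
	switch (mode) {
	case THP_ALWAYS:
	case THP_DEFER:
		/* defer leaves non-madvised memory to khugepaged. */
		return true;
	case THP_MADVISE:
		return madv_hugepage;
	default:
		return false;
	}
}
```

The only row where the two answers differ is defer without
MADV_HUGEPAGE: the fault handler falls back to base pages, but
khugepaged may still collapse the range later.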
> > > >
> > > > Admins may want to lower max_ptes_none, if not, khugepaged may
> > > > aggressively collapse single allocations into hugepages.
> > > >
> > > > TESTING:
> > > > - Built for x86_64, aarch64, ppc64le, and s390x
> > > > - selftests mm
> > > > - In [1] I provided a script [2] that has multiple access patterns
> > > > - lots of general use.
> > >
> > > OK so this truly is dependent on the unmerged series? Or isn't it?
> > >
> > > Is your testing based on that?
> > Most of the testing was done in conjunction, but independent testing
> > was also done on this series (including by a large customer that was
> > itching to try the changes, and they were very satisfied with the
> > results).
>
> You should make this very clear in the cover letter.
I will try to do better at updating and providing more information in
my cover letters and patches. I was never sure how much information to
include! I guess the more the merrier.
>
> > >
> > > Because again... that surely makes this series a no-go until we land the
> > > prior (which might be changed, and thus necessitate re-testing).
> > >
> > > Are you going to provide any of these numbers/data anywhere?
> > There is a link to the results in this cover letter
> > [3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.html
> > >
>
> Ultimately it's not ok in mm to have a link to a website that might go away
> any time, these cover letters are 'baked in' to the commit log. Are you
> sure this website with 'testoutput2' will exist in 10 years time? :)
>
> You should at the very least add a summary of this data in the cover
> letter, perhaps referring back to this link as 'at the time of writing full
> results are available at' or something like this.

OK, good to know. I will find a way to summarize the performance and
result changes more cleanly in the cover letter.
>
> > > > - redis testing. This test was my original case for the defer mode. What I
> > > >    was able to prove was that THP=always leads to increased max_latency
> > > >    cases; hence why it is recommended to disable THPs for redis servers.
> > > >    However with 'defer' we dont have the max_latency spikes and can still
> > > >    get the system to utilize THPs. I further tested this with the mTHP
> > > >    defer setting and found that redis (and probably other jmalloc users)
> > > >    can utilize THPs via defer (+mTHP defer) without a large latency
> > > >    penalty and some potential gains. I uploaded some mmtest results
> > > >    here[3] which compares:
> > > >        stock+thp=never
> > > >        stock+(m)thp=always
> > > >        khugepaged-mthp + defer (max_ptes_none=64)
> > > >
> > > >   The results show that (m)THPs can cause some throughput regression in
> > > >   some cases, but also has gains in other cases. The mTHP+defer results
> > > >   have more gains and less losses over the (m)THP=always case.
> > > >
> > > > V6 Changes:
> > > > - nits
> > > > - rebased dependent series and added review tags
> > > >
> > > > V5 Changes:
> > > > - rebased dependent series
> > > > - added reviewed-by tag on 2/4
> > > >
> > > > V4 Changes:
> > > > - Minor Documentation fixes
> > > > - rebased the dependent series [1] onto mm-unstable
> > > >     commit 0e68b850b1d3 ("vmalloc: use atomic_long_add_return_relaxed()")
> > > >
> > > > V3 Changes:
> > > > - Combined the documentation commits into one, and moved a section to the
> > > >   khugepaged mthp patchset
> > > >
> > > > V2 Changes:
> > > > - base changes on mTHP khugepaged support
> > > > - Fix selftests parsing issue
> > > > - add mTHP defer option
> > > > - add mTHP defer Documentation
> > > >
> > > > [1] - https://lore.kernel.org/all/20250515032226.128900-1-npache@redhat.com/
> > > > [2] - https://gitlab.com/npache/khugepaged_mthp_test
> > > > [3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.html
> > > >
> > > > Nico Pache (4):
> > > >   mm: defer THP insertion to khugepaged
> > > >   mm: document (m)THP defer usage
> > > >   khugepaged: add defer option to mTHP options
> > > >   selftests: mm: add defer to thp setting parser
> > > >
> > > >  Documentation/admin-guide/mm/transhuge.rst | 31 +++++++---
> > > >  include/linux/huge_mm.h                    | 18 +++++-
> > > >  mm/huge_memory.c                           | 69 +++++++++++++++++++---
> > > >  mm/khugepaged.c                            |  8 +--
> > > >  tools/testing/selftests/mm/thp_settings.c  |  1 +
> > > >  tools/testing/selftests/mm/thp_settings.h  |  1 +
> > > >  6 files changed, 106 insertions(+), 22 deletions(-)
> > > >
> > > > --
> > > > 2.49.0
> > > >
> > >
> >
>
> Cheers, Lorenzo
>



* Re: [PATCH v6 1/4] mm: defer THP insertion to khugepaged
  2025-05-15  3:38 ` [PATCH v6 1/4] mm: defer THP insertion to khugepaged Nico Pache
  2025-05-20  7:43   ` Yafang Shao
@ 2025-06-14 11:25   ` Klara Modin
  2025-06-17 17:52     ` Nico Pache
  1 sibling, 1 reply; 20+ messages in thread
From: Klara Modin @ 2025-06-14 11:25 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-doc, linux-kernel, linux-kselftest, rientjes,
	hannes, lorenzo.stoakes, rdunlap, mhocko, Liam.Howlett, zokeefe,
	surenb, jglisse, cl, jack, dave.hansen, will, tiwai,
	catalin.marinas, anshuman.khandual, dev.jain, raquini, aarcange,
	kirill.shutemov, yang, thomas.hellstrom, vishal.moola, sunnanyong,
	usamaarif642, wangkefeng.wang, ziy, shuah, peterx, willy,
	ryan.roberts, baolin.wang, baohua, david, mathieu.desnoyers,
	mhiramat, rostedt, corbet, akpm

Hi,

On 2025-05-14 21:38:54 -0600, Nico Pache wrote:
> setting /transparent_hugepages/enabled=always allows applications
> to benefit from THPs without having to madvise. However, the page fault
> handler takes very few considerations to decide whether or not to actually
> use a THP. This can lead to a lot of wasted memory. khugepaged only
> operates on memory that was either allocated with enabled=always or
> MADV_HUGEPAGE.
> 
> Introduce the ability to set enabled=defer, which will prevent THPs from
> being allocated by the page fault handler unless madvise is set,
> leaving it up to khugepaged to decide which allocations will collapse to a
> THP. This should allow applications to benefits from THPs, while curbing
> some of the memory waste.
> 
> Acked-by: Zi Yan <ziy@nvidia.com>
> Co-developed-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Rafael Aquini <raquini@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>

...

> @@ -315,13 +318,20 @@ static ssize_t enabled_store(struct kobject *kobj,
>  
>  	if (sysfs_streq(buf, "always")) {
>  		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG, &transparent_hugepage_flags);
>  		set_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
> +	} else if (sysfs_streq(buf, "defer")) {
> +		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
> +		set_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG, &transparent_hugepage_flags);
>  	} else if (sysfs_streq(buf, "madvise")) {
>  		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG, &transparent_hugepage_flags);
>  		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
>  	} else if (sysfs_streq(buf, "never")) {
>  		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
>  		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG, &transparent_hugepage_flags);
>  	} else
>  		ret = -EINVAL;
>  
> @@ -954,18 +964,31 @@ static int __init setup_transparent_hugepage(char *str)
>  			&transparent_hugepage_flags);
>  		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>  			  &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG,
> +			  &transparent_hugepage_flags);
>  		ret = 1;
> +	} else if (!strcmp(str, "defer")) {
> +		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
> +			  &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +			  &transparent_hugepage_flags);
> +		set_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG,
> +			  &transparent_hugepage_flags);

There should probably be a corresponding
		ret = 1;
here. Otherwise the "cannot parse" message will be displayed even if
defer was set.
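
To illustrate, here is a small userspace sketch of the parser with the
return value set in the defer branch. It is not the kernel code: the
flag names and the parse_thp_param() helper are made up for this
example, and the set_bit()/clear_bit() calls are collapsed into plain
assignments.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the transparent_hugepage_flags bits. */
enum {
	THP_FLAG          = 1u << 0,	/* enabled=always */
	THP_REQ_MADV_FLAG = 1u << 1,	/* enabled=madvise */
	THP_DEFER_PF_FLAG = 1u << 2,	/* enabled=defer */
};

static unsigned int thp_flags;

/*
 * Userspace model of the boot-parameter parser: returns 1 when the
 * string was recognised and 0 otherwise, warning in the latter case.
 */
static int parse_thp_param(const char *str)
{
	int ret = 0;

	if (!str)
		goto out;
	if (!strcmp(str, "always")) {
		thp_flags = THP_FLAG;
		ret = 1;
	} else if (!strcmp(str, "defer")) {
		thp_flags = THP_DEFER_PF_FLAG;
		ret = 1;	/* the line missing from the posted hunk */
	} else if (!strcmp(str, "madvise")) {
		thp_flags = THP_REQ_MADV_FLAG;
		ret = 1;
	} else if (!strcmp(str, "never")) {
		thp_flags = 0;
		ret = 1;
	}
out:
	if (!ret)
		fprintf(stderr, "transparent_hugepage= cannot parse, ignored\n");
	return ret;
}
```

Without the marked line, parse_thp_param("defer") would still set the
flag but fall through to the warning and return 0.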

>  	} else if (!strcmp(str, "madvise")) {
>  		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
>  			  &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG,
> +			  &transparent_hugepage_flags);
>  		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> -			&transparent_hugepage_flags);
> +			  &transparent_hugepage_flags);
>  		ret = 1;
>  	} else if (!strcmp(str, "never")) {
>  		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
>  			  &transparent_hugepage_flags);
>  		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>  			  &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG,
> +			  &transparent_hugepage_flags);
>  		ret = 1;
>  	}
>  out:
> -- 
> 2.49.0
> 

Regards,
Klara Modin


* Re: [PATCH v6 1/4] mm: defer THP insertion to khugepaged
  2025-06-14 11:25   ` Klara Modin
@ 2025-06-17 17:52     ` Nico Pache
  0 siblings, 0 replies; 20+ messages in thread
From: Nico Pache @ 2025-06-17 17:52 UTC (permalink / raw)
  To: Klara Modin
  Cc: linux-mm, linux-doc, linux-kernel, linux-kselftest, rientjes,
	hannes, lorenzo.stoakes, rdunlap, mhocko, Liam.Howlett, zokeefe,
	surenb, jglisse, cl, jack, dave.hansen, will, tiwai,
	catalin.marinas, anshuman.khandual, dev.jain, raquini, aarcange,
	kirill.shutemov, yang, thomas.hellstrom, vishal.moola, sunnanyong,
	usamaarif642, wangkefeng.wang, ziy, shuah, peterx, willy,
	ryan.roberts, baolin.wang, baohua, david, mathieu.desnoyers,
	mhiramat, rostedt, corbet, akpm

On Sat, Jun 14, 2025 at 5:25 AM Klara Modin <klarasmodin@gmail.com> wrote:
>
> Hi,
>
> On 2025-05-14 21:38:54 -0600, Nico Pache wrote:
> > setting /transparent_hugepages/enabled=always allows applications
> > to benefit from THPs without having to madvise. However, the page fault
> > handler takes very few considerations to decide whether or not to actually
> > use a THP. This can lead to a lot of wasted memory. khugepaged only
> > operates on memory that was either allocated with enabled=always or
> > MADV_HUGEPAGE.
> >
> > Introduce the ability to set enabled=defer, which will prevent THPs from
> > being allocated by the page fault handler unless madvise is set,
> > leaving it up to khugepaged to decide which allocations will collapse to a
> > THP. This should allow applications to benefits from THPs, while curbing
> > some of the memory waste.
> >
> > Acked-by: Zi Yan <ziy@nvidia.com>
> > Co-developed-by: Rafael Aquini <raquini@redhat.com>
> > Signed-off-by: Rafael Aquini <raquini@redhat.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
>
> ...
>
> > @@ -315,13 +318,20 @@ static ssize_t enabled_store(struct kobject *kobj,
> >
> >       if (sysfs_streq(buf, "always")) {
> >               clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
> > +             clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG, &transparent_hugepage_flags);
> >               set_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
> > +     } else if (sysfs_streq(buf, "defer")) {
> > +             clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
> > +             clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
> > +             set_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG, &transparent_hugepage_flags);
> >       } else if (sysfs_streq(buf, "madvise")) {
> >               clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
> > +             clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG, &transparent_hugepage_flags);
> >               set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
> >       } else if (sysfs_streq(buf, "never")) {
> >               clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
> >               clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
> > +             clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG, &transparent_hugepage_flags);
> >       } else
> >               ret = -EINVAL;
> >
> > @@ -954,18 +964,31 @@ static int __init setup_transparent_hugepage(char *str)
> >                       &transparent_hugepage_flags);
> >               clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> >                         &transparent_hugepage_flags);
> > +             clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG,
> > +                       &transparent_hugepage_flags);
> >               ret = 1;
> > +     } else if (!strcmp(str, "defer")) {
> > +             clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
> > +                       &transparent_hugepage_flags);
> > +             clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> > +                       &transparent_hugepage_flags);
> > +             set_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG,
> > +                       &transparent_hugepage_flags);
>
> There should probably be a corresponding
>                 ret = 1;
> here. Otherwise the "cannot parse" message will be displayed even if
> defer was set.
Thanks, Klara! I will make sure to add it in the next version.
>
> >       } else if (!strcmp(str, "madvise")) {
> >               clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
> >                         &transparent_hugepage_flags);
> > +             clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG,
> > +                       &transparent_hugepage_flags);
> >               set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> > -                     &transparent_hugepage_flags);
> > +                       &transparent_hugepage_flags);
> >               ret = 1;
> >       } else if (!strcmp(str, "never")) {
> >               clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
> >                         &transparent_hugepage_flags);
> >               clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> >                         &transparent_hugepage_flags);
> > +             clear_bit(TRANSPARENT_HUGEPAGE_DEFER_PF_FLAG,
> > +                       &transparent_hugepage_flags);
> >               ret = 1;
> >       }
> >  out:
> > --
> > 2.49.0
> >
>
> Regards,
> Klara Modin
>



end of thread, other threads:[~2025-06-17 17:52 UTC | newest]

Thread overview: 20+ messages
2025-05-15  3:38 [PATCH v6 0/4] mm: introduce THP deferred setting Nico Pache
2025-05-15  3:38 ` [PATCH v6 1/4] mm: defer THP insertion to khugepaged Nico Pache
2025-05-20  7:43   ` Yafang Shao
2025-06-14 11:25   ` Klara Modin
2025-06-17 17:52     ` Nico Pache
2025-05-15  3:38 ` [PATCH v6 2/4] mm: document (m)THP defer usage Nico Pache
2025-05-15  3:38 ` [PATCH v6 3/4] khugepaged: add defer option to mTHP options Nico Pache
2025-05-15  3:38 ` [PATCH v6 4/4] selftests: mm: add defer to thp setting parser Nico Pache
2025-05-20  9:24 ` [PATCH v6 0/4] mm: introduce THP deferred setting Yafang Shao
2025-05-21 10:19   ` Nico Pache
2025-05-21 11:35     ` Yafang Shao
2025-05-20  9:42 ` Lorenzo Stoakes
2025-05-21 10:41   ` Nico Pache
2025-05-21 11:24     ` Lorenzo Stoakes
2025-05-21 11:46       ` David Hildenbrand
2025-05-21 12:00         ` Lorenzo Stoakes
2025-05-21 12:24           ` David Hildenbrand
2025-05-21 12:33             ` Lorenzo Stoakes
2025-05-21 12:40               ` David Hildenbrand
2025-05-29  4:26       ` Nico Pache
