linux-kernel.vger.kernel.org archive mirror
* [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
@ 2025-06-06 14:37 Usama Arif
  2025-06-06 15:01 ` Usama Arif
                   ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Usama Arif @ 2025-06-06 14:37 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: hannes, shakeel.butt, riel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hughd, linux-kernel,
	linux-doc, kernel-team, Usama Arif

On arm64 machines with 64K PAGE_SIZE, min_free_kbytes and hence the
watermarks evaluate to extremely high values. For example, on a server
with 480G of memory, with only the 2M mTHP hugepage size set to madvise
and the rest of the sizes set to never, the min, low and high watermarks
evaluate to 11.2G, 14G and 16.8G respectively.
In contrast, with 4K PAGE_SIZE on the same machine and only the 2M THP
hugepage size set to madvise, the min, low and high watermarks evaluate
to 86M, 566M and 1G respectively.
This is because set_recommended_min_free_kbytes is designed for PMD
hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
Such high watermark values can cause performance and latency issues in
memory-bound applications on arm servers that use 64K PAGE_SIZE, even
though most of them would never actually use a 512M PMD THP.

Instead of using HPAGE_PMD_ORDER for pageblock_order, use the highest
enabled large folio order in set_recommended_min_free_kbytes.
With this patch, when only the 2M THP hugepage size is set to madvise on
the same machine with 64K page size, with the rest of the sizes set to
never, the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
respectively. When the 512M THP hugepage size is set to madvise on the
same machine with 64K page size, the min, low and high watermarks evaluate
to 11.2G, 14G and 16.8G respectively, the same as without this patch.

An alternative solution would be to change PAGE_BLOCK_ORDER by changing
ARCH_FORCE_MAX_ORDER to a lower value for ARM64_64K_PAGES. However, this
is not dynamic with hugepage size, would need different kernel builds for
different hugepage sizes, and most users won't know that this needs to be
done, as it can be difficult to determine that the performance and latency
issues are coming from the high watermark values.

All watermark numbers are for zones of nodes that had the highest number
of pages, i.e. the value for min size for 4K is obtained using:
cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 4096 / 1024 / 1024}';
and for 64K using:
cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 65536 / 1024 / 1024}';

An arbitrary minimum of 128 pages is used when no hugepage sizes are
enabled.

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 include/linux/huge_mm.h | 25 +++++++++++++++++++++++++
 mm/khugepaged.c         | 32 ++++++++++++++++++++++++++++----
 mm/shmem.c              | 29 +++++------------------------
 3 files changed, 58 insertions(+), 28 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..fb4e51ef0acb 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -170,6 +170,25 @@ static inline void count_mthp_stat(int order, enum mthp_stat_item item)
 }
 #endif
 
+/*
+ * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
+ *
+ * SHMEM_HUGE_NEVER:
+ *	disables huge pages for the mount;
+ * SHMEM_HUGE_ALWAYS:
+ *	enables huge pages for the mount;
+ * SHMEM_HUGE_WITHIN_SIZE:
+ *	only allocate huge pages if the page will be fully within i_size,
+ *	also respect madvise() hints;
+ * SHMEM_HUGE_ADVISE:
+ *	only allocate huge pages if requested with madvise();
+ */
+
+ #define SHMEM_HUGE_NEVER	0
+ #define SHMEM_HUGE_ALWAYS	1
+ #define SHMEM_HUGE_WITHIN_SIZE	2
+ #define SHMEM_HUGE_ADVISE	3
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
 extern unsigned long transparent_hugepage_flags;
@@ -177,6 +196,12 @@ extern unsigned long huge_anon_orders_always;
 extern unsigned long huge_anon_orders_madvise;
 extern unsigned long huge_anon_orders_inherit;
 
+extern int shmem_huge __read_mostly;
+extern unsigned long huge_shmem_orders_always;
+extern unsigned long huge_shmem_orders_madvise;
+extern unsigned long huge_shmem_orders_inherit;
+extern unsigned long huge_shmem_orders_within_size;
+
 static inline bool hugepage_global_enabled(void)
 {
 	return transparent_hugepage_flags &
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 15203ea7d007..e64cba74eb2a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2607,6 +2607,26 @@ static int khugepaged(void *none)
 	return 0;
 }
 
+static int thp_highest_allowable_order(void)
+{
+	unsigned long orders = READ_ONCE(huge_anon_orders_always)
+			       | READ_ONCE(huge_anon_orders_madvise)
+			       | READ_ONCE(huge_shmem_orders_always)
+			       | READ_ONCE(huge_shmem_orders_madvise)
+			       | READ_ONCE(huge_shmem_orders_within_size);
+	if (hugepage_global_enabled())
+		orders |= READ_ONCE(huge_anon_orders_inherit);
+	if (shmem_huge != SHMEM_HUGE_NEVER)
+		orders |= READ_ONCE(huge_shmem_orders_inherit);
+
+	return orders == 0 ? 0 : fls(orders) - 1;
+}
+
+static unsigned long min_thp_pageblock_nr_pages(void)
+{
+	return (1UL << min(thp_highest_allowable_order(), PAGE_BLOCK_ORDER));
+}
+
 static void set_recommended_min_free_kbytes(void)
 {
 	struct zone *zone;
@@ -2638,12 +2658,16 @@ static void set_recommended_min_free_kbytes(void)
 	 * second to avoid subsequent fallbacks of other types There are 3
 	 * MIGRATE_TYPES we care about.
 	 */
-	recommended_min += pageblock_nr_pages * nr_zones *
+	recommended_min += min_thp_pageblock_nr_pages() * nr_zones *
 			   MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
 
-	/* don't ever allow to reserve more than 5% of the lowmem */
-	recommended_min = min(recommended_min,
-			      (unsigned long) nr_free_buffer_pages() / 20);
+	/*
+	 * Don't ever allow to reserve more than 5% of the lowmem.
+	 * Use a min of 128 pages when all THP orders are set to never.
+	 */
+	recommended_min = clamp(recommended_min, 128,
+				(unsigned long) nr_free_buffer_pages() / 20);
+
 	recommended_min <<= (PAGE_SHIFT-10);
 
 	if (recommended_min > min_free_kbytes) {
diff --git a/mm/shmem.c b/mm/shmem.c
index 0c5fb4ffa03a..8e92678d1175 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -136,10 +136,10 @@ struct shmem_options {
 };
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static unsigned long huge_shmem_orders_always __read_mostly;
-static unsigned long huge_shmem_orders_madvise __read_mostly;
-static unsigned long huge_shmem_orders_inherit __read_mostly;
-static unsigned long huge_shmem_orders_within_size __read_mostly;
+unsigned long huge_shmem_orders_always __read_mostly;
+unsigned long huge_shmem_orders_madvise __read_mostly;
+unsigned long huge_shmem_orders_inherit __read_mostly;
+unsigned long huge_shmem_orders_within_size __read_mostly;
 static bool shmem_orders_configured __initdata;
 #endif
 
@@ -516,25 +516,6 @@ static bool shmem_confirm_swap(struct address_space *mapping,
 	return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
 }
 
-/*
- * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
- *
- * SHMEM_HUGE_NEVER:
- *	disables huge pages for the mount;
- * SHMEM_HUGE_ALWAYS:
- *	enables huge pages for the mount;
- * SHMEM_HUGE_WITHIN_SIZE:
- *	only allocate huge pages if the page will be fully within i_size,
- *	also respect madvise() hints;
- * SHMEM_HUGE_ADVISE:
- *	only allocate huge pages if requested with madvise();
- */
-
-#define SHMEM_HUGE_NEVER	0
-#define SHMEM_HUGE_ALWAYS	1
-#define SHMEM_HUGE_WITHIN_SIZE	2
-#define SHMEM_HUGE_ADVISE	3
-
 /*
  * Special values.
  * Only can be set via /sys/kernel/mm/transparent_hugepage/shmem_enabled:
@@ -551,7 +532,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /* ifdef here to avoid bloating shmem.o when not necessary */
 
-static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
+int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
 static int tmpfs_huge __read_mostly = SHMEM_HUGE_NEVER;
 
 /**
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-06 14:37 [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes Usama Arif
@ 2025-06-06 15:01 ` Usama Arif
  2025-06-06 15:18 ` Zi Yan
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 32+ messages in thread
From: Usama Arif @ 2025-06-06 15:01 UTC (permalink / raw)
  To: Andrew Morton, david, linux-mm
  Cc: hannes, shakeel.butt, riel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hughd, linux-kernel,
	linux-doc, kernel-team, Breno Leitao



On 06/06/2025 15:37, Usama Arif wrote:
> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
> watermarks are evaluated to extremely high values, for e.g. a server with
> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
> of the sizes set to never, the min, low and high watermarks evaluate to
> 11.2G, 14G and 16.8G respectively.
> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
> and 1G respectively.
> This is because set_recommended_min_free_kbytes is designed for PMD
> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
> Such high watermark values can cause performance and latency issues in
> memory bound applications on arm servers that use 64K PAGE_SIZE, even though
> most of them would never actually use a 512M PMD THP.
> 
> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
> folio order enabled in set_recommended_min_free_kbytes.
> With this patch, when only 2M THP hugepage size is set to madvise for the
> same machine with 64K page size, with the rest of the sizes set to never,
> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G

I forgot to change the other pageblock_nr_pages instance; the patch
will need the fixlet below as well. The watermark numbers will then be
the same as when 4K PAGE_SIZE is used.

commit 0c6bb4e5b3aa078949d712ab9c35e7b2a33cd8a4 (HEAD)
Author: Usama Arif <usamaarif642@gmail.com>
Date:   Fri Jun 6 15:43:25 2025 +0100

    [fixlet] mm: khugepaged: replace all instances of pageblock_nr_pages
    
    This will change the min, low and high watermarks for 64K page size
    with 2M THP hugepage madvise to 87M, 575M and 1G.
    
    Signed-off-by: Usama Arif <usamaarif642@gmail.com>

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e64cba74eb2a..1c643f13135e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2650,7 +2650,7 @@ static void set_recommended_min_free_kbytes(void)
        }
 
        /* Ensure 2 pageblocks are free to assist fragmentation avoidance */
-       recommended_min = pageblock_nr_pages * nr_zones * 2;
+       recommended_min = min_thp_pageblock_nr_pages() * nr_zones * 2;
 
        /*
         * Make sure that on average at least two pageblocks are almost free


* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-06 14:37 [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes Usama Arif
  2025-06-06 15:01 ` Usama Arif
@ 2025-06-06 15:18 ` Zi Yan
  2025-06-06 15:38   ` Usama Arif
  2025-06-06 17:37 ` David Hildenbrand
  2025-06-07  8:18 ` Lorenzo Stoakes
  3 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2025-06-06 15:18 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hughd, linux-kernel, linux-doc, kernel-team,
	Juan Yescas

On 6 Jun 2025, at 10:37, Usama Arif wrote:

> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
> watermarks are evaluated to extremely high values, for e.g. a server with
> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
> of the sizes set to never, the min, low and high watermarks evaluate to
> 11.2G, 14G and 16.8G respectively.
> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
> and 1G respectively.
> This is because set_recommended_min_free_kbytes is designed for PMD
> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
> Such high watermark values can cause performance and latency issues in
> memory bound applications on arm servers that use 64K PAGE_SIZE, even though
> most of them would never actually use a 512M PMD THP.
>
> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
> folio order enabled in set_recommended_min_free_kbytes.
> With this patch, when only 2M THP hugepage size is set to madvise for the
> same machine with 64K page size, with the rest of the sizes set to never,
> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
> respectively. When 512M THP hugepage size is set to madvise for the same
> machine with 64K page size, the min, low and high watermarks evaluate to
> 11.2G, 14G and 16.8G respectively, the same as without this patch.

Getting pageblock_order involved here might be confusing. I think you just
want to adjust min, low and high watermarks to reasonable values.
Is it OK to rename min_thp_pageblock_nr_pages to min_nr_free_pages_per_zone
and move MIGRATE_PCPTYPES * MIGRATE_PCPTYPES inside? Otherwise, the changes
look reasonable to me.

Another concern on tying watermarks to the highest THP order is that if
the user enables PMD THP on such a system with 2MB mTHP enabled initially,
it could trigger unexpected memory reclaim and compaction, right?
That might surprise the user, since they just want to adjust the
availability of THP sizes, but the whole system suddenly begins to be busy.
Have you experimented with it?

>
> An alternative solution would be to change PAGE_BLOCK_ORDER by changing
> ARCH_FORCE_MAX_ORDER to a lower value for ARM64_64K_PAGES. However, this

PAGE_BLOCK_ORDER can be changed in Kconfig without changing
ARCH_FORCE_MAX_ORDER by Juan’s recent patch[1].

[1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything&id=e13e7922d03439e374c263049af5f740ceae6346

> is not dynamic with hugepage size, will need different kernel builds for
> different hugepage sizes and most users won't know that this needs to be
> done as it can be difficult to determine that the performance and latency
> issues are coming from the high watermark values.
>
> All watermark numbers are for zones of nodes that had the highest number
> of pages, i.e. the value for min size for 4K is obtained using:
> cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 4096 / 1024 / 1024}';
> and for 64K using:
> cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 65536 / 1024 / 1024}';
>
> An arbitrary minimum of 128 pages is used when no hugepage sizes are
> enabled.
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
>  include/linux/huge_mm.h | 25 +++++++++++++++++++++++++
>  mm/khugepaged.c         | 32 ++++++++++++++++++++++++++++----
>  mm/shmem.c              | 29 +++++------------------------
>  3 files changed, 58 insertions(+), 28 deletions(-)
>

Thanks.

Best Regards,
Yan, Zi


* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-06 15:18 ` Zi Yan
@ 2025-06-06 15:38   ` Usama Arif
  2025-06-06 16:10     ` Zi Yan
  0 siblings, 1 reply; 32+ messages in thread
From: Usama Arif @ 2025-06-06 15:38 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hughd, linux-kernel, linux-doc, kernel-team,
	Juan Yescas, Breno Leitao



On 06/06/2025 16:18, Zi Yan wrote:
> On 6 Jun 2025, at 10:37, Usama Arif wrote:
> 
>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
>> watermarks are evaluated to extremely high values, for e.g. a server with
>> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
>> of the sizes set to never, the min, low and high watermarks evaluate to
>> 11.2G, 14G and 16.8G respectively.
>> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
>> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
>> and 1G respectively.
>> This is because set_recommended_min_free_kbytes is designed for PMD
>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
>> Such high watermark values can cause performance and latency issues in
>> memory bound applications on arm servers that use 64K PAGE_SIZE, even though
>> most of them would never actually use a 512M PMD THP.
>>
>> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
>> folio order enabled in set_recommended_min_free_kbytes.
>> With this patch, when only 2M THP hugepage size is set to madvise for the
>> same machine with 64K page size, with the rest of the sizes set to never,
>> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>> respectively. When 512M THP hugepage size is set to madvise for the same
>> machine with 64K page size, the min, low and high watermarks evaluate to
>> 11.2G, 14G and 16.8G respectively, the same as without this patch.
> 
> Getting pageblock_order involved here might be confusing. I think you just
> want to adjust min, low and high watermarks to reasonable values.
> Is it OK to rename min_thp_pageblock_nr_pages to min_nr_free_pages_per_zone
> and move MIGRATE_PCPTYPES * MIGRATE_PCPTYPES inside? Otherwise, the changes
> look reasonable to me.

Hi Zi,

Thanks for the review!

I forgot to change it in another place, sorry about that! So I can't move
MIGRATE_PCPTYPES * MIGRATE_PCPTYPES into the combined function.
I have added the additional place where min_thp_pageblock_nr_pages() is
called as a fixlet here:
https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/

I think at least in this context the original name pageblock_nr_pages isn't
correct, as it is min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER).
The new name min_thp_pageblock_nr_pages is also not really good, so I am
happy to change it to something appropriate.
> 
> Another concern on tying watermarks to highest THP order is that if
> user enables PMD THP on such systems with 2MB mTHP enabled initially,
> it could trigger unexpected memory reclaim and compaction, right?
> That might surprise user, since they just want to adjust availability
> of THP sizes, but the whole system suddenly begins to be busy.
> Have you experimented with it?
> 

Yes, I would imagine it would trigger reclaim and compaction if the system
memory is too low, but that should hopefully be expected? If the user is
enabling 512M THP, they should expect changes by the kernel that allow it
to provide hugepages of that size.
Also, hopefully no one is enabling PMD THPs when the system is so low on
memory that it triggers reclaim! There would be an OOM after just a few
of those are faulted in.

Thanks!



* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-06 15:38   ` Usama Arif
@ 2025-06-06 16:10     ` Zi Yan
  2025-06-07  8:35       ` Lorenzo Stoakes
  2025-06-09 11:13       ` Usama Arif
  0 siblings, 2 replies; 32+ messages in thread
From: Zi Yan @ 2025-06-06 16:10 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hughd, linux-kernel, linux-doc, kernel-team,
	Juan Yescas, Breno Leitao

On 6 Jun 2025, at 11:38, Usama Arif wrote:

> On 06/06/2025 16:18, Zi Yan wrote:
>> On 6 Jun 2025, at 10:37, Usama Arif wrote:
>>
>>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
>>> watermarks are evaluated to extremely high values, for e.g. a server with
>>> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
>>> of the sizes set to never, the min, low and high watermarks evaluate to
>>> 11.2G, 14G and 16.8G respectively.
>>> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
>>> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
>>> and 1G respectively.
>>> This is because set_recommended_min_free_kbytes is designed for PMD
>>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
>>> Such high watermark values can cause performance and latency issues in
>>> memory bound applications on arm servers that use 64K PAGE_SIZE, even though
>>> most of them would never actually use a 512M PMD THP.
>>>
>>> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
>>> folio order enabled in set_recommended_min_free_kbytes.
>>> With this patch, when only 2M THP hugepage size is set to madvise for the
>>> same machine with 64K page size, with the rest of the sizes set to never,
>>> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>>> respectively. When 512M THP hugepage size is set to madvise for the same
>>> machine with 64K page size, the min, low and high watermarks evaluate to
>>> 11.2G, 14G and 16.8G respectively, the same as without this patch.
>>
>> Getting pageblock_order involved here might be confusing. I think you just
>> want to adjust min, low and high watermarks to reasonable values.
>> Is it OK to rename min_thp_pageblock_nr_pages to min_nr_free_pages_per_zone
>> and move MIGRATE_PCPTYPES * MIGRATE_PCPTYPES inside? Otherwise, the changes
>> look reasonable to me.
>
> Hi Zi,
>
> Thanks for the review!
>
> I forgot to change it in another place, sorry about that! So can't move
> MIGRATE_PCPTYPES * MIGRATE_PCPTYPES into the combined function.
> Have added the additional place where min_thp_pageblock_nr_pages() is called
> as a fixlet here:
> https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/
>
> I think atleast in this context the orginal name pageblock_nr_pages isn't
> correct as its min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER).
> The new name min_thp_pageblock_nr_pages is also not really good, so happy
> to change it to something appropriate.

Got it. The pageblock is the defragmentation granularity. If the user only
wants 2MB mTHP, maybe the pageblock order should be adjusted. Otherwise,
the kernel will defragment at 512MB granularity, which might not be
efficient. Maybe make pageblock_order a boot time parameter?

In addition, we are mixing two things together:
1. min, low, and high watermarks: they affect when memory reclaim and compaction
   will be triggered;
2. pageblock order: it is the granularity of defragmentation for creating
   mTHP/THP.

In your use case, you want to lower watermarks, right? Considering what you
said below, I wonder if we want a way of enforcing vm.min_free_kbytes,
like a new sysctl knob, vm.force_min_free_kbytes (yeah the suggestion
is lame, sorry).

I think for 2, we might want to decouple pageblock order from defragmentation
granularity.


>>
>> Another concern on tying watermarks to highest THP order is that if
>> user enables PMD THP on such systems with 2MB mTHP enabled initially,
>> it could trigger unexpected memory reclaim and compaction, right?
>> That might surprise user, since they just want to adjust availability
>> of THP sizes, but the whole system suddenly begins to be busy.
>> Have you experimented with it?
>>
>
> Yes I would imagine it would trigger reclaim and compaction if the system memory
> is too low, but that should hopefully be expected? If the user is enabling 512M
> THP, they should expect changes by kernel to allow them to give hugepage of
> that size.
> Also hopefully, no one is enabling PMD THPs when the system is so low on
> memory that it triggers reclaim! There would be an OOM after just a few
> of those are faulted in.



Best Regards,
Yan, Zi


* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-06 14:37 [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes Usama Arif
  2025-06-06 15:01 ` Usama Arif
  2025-06-06 15:18 ` Zi Yan
@ 2025-06-06 17:37 ` David Hildenbrand
  2025-06-09 11:34   ` Usama Arif
  2025-06-07  8:18 ` Lorenzo Stoakes
  3 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2025-06-06 17:37 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, linux-mm
  Cc: hannes, shakeel.butt, riel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hughd, linux-kernel,
	linux-doc, kernel-team

On 06.06.25 16:37, Usama Arif wrote:
> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
> watermarks are evaluated to extremely high values, for e.g. a server with
> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
> of the sizes set to never, the min, low and high watermarks evaluate to
> 11.2G, 14G and 16.8G respectively.
> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
> and 1G respectively.
> This is because set_recommended_min_free_kbytes is designed for PMD
> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
> Such high watermark values can cause performance and latency issues in
> memory bound applications on arm servers that use 64K PAGE_SIZE, even though
> most of them would never actually use a 512M PMD THP.
> 
> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
> folio order enabled in set_recommended_min_free_kbytes.
> With this patch, when only 2M THP hugepage size is set to madvise for the
> same machine with 64K page size, with the rest of the sizes set to never,
> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
> respectively. When 512M THP hugepage size is set to madvise for the same
> machine with 64K page size, the min, low and high watermarks evaluate to
> 11.2G, 14G and 16.8G respectively, the same as without this patch.
> 
> An alternative solution would be to change PAGE_BLOCK_ORDER by changing
> ARCH_FORCE_MAX_ORDER to a lower value for ARM64_64K_PAGES. However, this
> is not dynamic with hugepage size, will need different kernel builds for
> different hugepage sizes and most users won't know that this needs to be
> done as it can be difficult to determine that the performance and latency
> issues are coming from the high watermark values.
> 
> All watermark numbers are for zones of nodes that had the highest number
> of pages, i.e. the value for min size for 4K is obtained using:
> cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 4096 / 1024 / 1024}';
> and for 64K using:
> cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 65536 / 1024 / 1024}';
> 
> An arbitrary minimum of 128 pages is used when no hugepage sizes are
> enabled.
> 
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
>   include/linux/huge_mm.h | 25 +++++++++++++++++++++++++
>   mm/khugepaged.c         | 32 ++++++++++++++++++++++++++++----
>   mm/shmem.c              | 29 +++++------------------------
>   3 files changed, 58 insertions(+), 28 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..fb4e51ef0acb 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -170,6 +170,25 @@ static inline void count_mthp_stat(int order, enum mthp_stat_item item)
>   }
>   #endif
>   
> +/*
> + * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
> + *
> + * SHMEM_HUGE_NEVER:
> + *	disables huge pages for the mount;
> + * SHMEM_HUGE_ALWAYS:
> + *	enables huge pages for the mount;
> + * SHMEM_HUGE_WITHIN_SIZE:
> + *	only allocate huge pages if the page will be fully within i_size,
> + *	also respect madvise() hints;
> + * SHMEM_HUGE_ADVISE:
> + *	only allocate huge pages if requested with madvise();
> + */
> +
> + #define SHMEM_HUGE_NEVER	0
> + #define SHMEM_HUGE_ALWAYS	1
> + #define SHMEM_HUGE_WITHIN_SIZE	2
> + #define SHMEM_HUGE_ADVISE	3
> +
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>   
>   extern unsigned long transparent_hugepage_flags;
> @@ -177,6 +196,12 @@ extern unsigned long huge_anon_orders_always;
>   extern unsigned long huge_anon_orders_madvise;
>   extern unsigned long huge_anon_orders_inherit;
>   
> +extern int shmem_huge __read_mostly;
> +extern unsigned long huge_shmem_orders_always;
> +extern unsigned long huge_shmem_orders_madvise;
> +extern unsigned long huge_shmem_orders_inherit;
> +extern unsigned long huge_shmem_orders_within_size;

Do all of these really have to be exported?

> +
>   static inline bool hugepage_global_enabled(void)
>   {
>   	return transparent_hugepage_flags &
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 15203ea7d007..e64cba74eb2a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2607,6 +2607,26 @@ static int khugepaged(void *none)
>   	return 0;
>   }
>   
> +static int thp_highest_allowable_order(void)

Did you mean "largest" ?

> +{
> +	unsigned long orders = READ_ONCE(huge_anon_orders_always)
> +			       | READ_ONCE(huge_anon_orders_madvise)
> +			       | READ_ONCE(huge_shmem_orders_always)
> +			       | READ_ONCE(huge_shmem_orders_madvise)
> +			       | READ_ONCE(huge_shmem_orders_within_size);
> +	if (hugepage_global_enabled())
> +		orders |= READ_ONCE(huge_anon_orders_inherit);
> +	if (shmem_huge != SHMEM_HUGE_NEVER)
> +		orders |= READ_ONCE(huge_shmem_orders_inherit);
> +
> +	return orders == 0 ? 0 : fls(orders) - 1;
> +}

But how does this interact with large folios / THPs in the page cache?

> +
> +static unsigned long min_thp_pageblock_nr_pages(void)

Reading the function name, I have no idea what this function is supposed 
to do.


-- 
Cheers,

David / dhildenb



* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-06 14:37 [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes Usama Arif
                   ` (2 preceding siblings ...)
  2025-06-06 17:37 ` David Hildenbrand
@ 2025-06-07  8:18 ` Lorenzo Stoakes
  2025-06-07  8:44   ` Lorenzo Stoakes
  2025-06-09 12:07   ` Usama Arif
  3 siblings, 2 replies; 32+ messages in thread
From: Lorenzo Stoakes @ 2025-06-07  8:18 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, hughd,
	linux-kernel, linux-doc, kernel-team

It's important to base against mm-new for new mm stuff, PAGE_BLOCK_ORDER got
renamed to PAGE_BLOCK_MAX_ORDER in Zi's series at [0] and this doesn't compile.

Please always do a quick rebase + compile check before sending.

[0]:  https://lkml.kernel.org/r/20250604211427.1590859-1-ziy@nvidia.com

Overall this seems to me to be implemented at the wrong level of
abstraction - we implement set_recommended_min_free_kbytes() to interact
with the page block mechanism.

While the problem you describe is absolutely a problem and we need to
figure out a way to avoid reserving ridiculous amounts of memory for higher
page tables, we surely need to figure this out at a page block granularity
don't we?

On Fri, Jun 06, 2025 at 03:37:00PM +0100, Usama Arif wrote:
> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
> watermarks are evaluated to extremely high values, for e.g. a server with
> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
> of the sizes set to never, the min, low and high watermarks evaluate to
> 11.2G, 14G and 16.8G respectively.
> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
> and 1G respectively.
> This is because set_recommended_min_free_kbytes is designed for PMD
> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).

Right it is, but not this line really, the _pageblock order_ is set to be
the minimum of the huge page PMD order and PAGE_BLOCK_MAX_ORDER as it makes
sense to use page block heuristics to reduce the odds of fragmentation and
so we can grab a PMD huge page at a time.

Obviously if the user wants to set a _smaller_ page block order they can,
but if it's larger we want to heuristically avoid fragmentation of
physically contiguous huge page aligned ranges (the whole page block
mechanism).

I absolutely hate how set_recommended_min_free_kbytes() has basically
hacked in some THP considerations but otherwise invokes
calculate_min_free_kbytes()... ugh. But an existing issue.

> Such high watermark values can cause performance and latency issues in
> memory bound applications on arm servers that use 64K PAGE_SIZE, even though
> most of them would never actually use a 512M PMD THP.

512MB, yeah crazy. We've not thought this through, and this is a very real
issue.

Again, it strikes me that we should be changing the page block order for 64
KB arm64 rather than this calculation though.

Keep in mind pageblocks are a heuristic mechanism designed to reduce
fragmentation, the decision could be made to cap how far we're willing to
go with that...

>
> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
> folio order enabled in set_recommended_min_free_kbytes.
> With this patch, when only 2M THP hugepage size is set to madvise for the
> same machine with 64K page size, with the rest of the sizes set to never,
> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
> respectively. When 512M THP hugepage size is set to madvise for the same
> machine with 64K page size, the min, low and high watermarks evaluate to
> 11.2G, 14G and 16.8G respectively, the same as without this patch.

Hmm, but what happens if a user changes this live, does this get updated?

OK I see it does via:

sysfs stuff -> enabled_store() -> start_stop_khugepaged() -> set_recommended_min_free_kbytes()

But don't we want to change this in general? Does somebody happening to
have 512MB THP at madvise or always suggest we want insane watermark
numbers?

I'm not really convinced by this 'dynamic' aspect, you're changing global
watermark numbers and reserves _massively_ based on a 'maybe' use of
something that's meant to be transparent + best-effort...

>
> An alternative solution would be to change PAGE_BLOCK_ORDER by changing
> ARCH_FORCE_MAX_ORDER to a lower value for ARM64_64K_PAGES. However, this
> is not dynamic with hugepage size, will need different kernel builds for
> different hugepage sizes and most users won't know that this needs to be
> done as it can be difficult to determine that the performance and latency
> issues are coming from the high watermark values.

Or, we could adjust pageblock_order accordingly in this instance no?

>
> All watermark numbers are for zones of nodes that had the highest number
> of pages, i.e. the value for min size for 4K is obtained using:
> cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 4096 / 1024 / 1024}';
> and for 64K using:
> cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 65536 / 1024 / 1024}';
>
> An arbitrary min of 128 pages is used when no hugepage sizes are
> enabled.

I don't think it's really okay to out and out add an arbitrary value like this
without explanation. This is a basis for rejection of the patch already.

That seems a little low too no?

IMPORTANT: I'd really like to see some before/after numbers for 4k, 16k,
64k with THP enabled/disabled so you can prove your patch isn't
fundamentally changing these values unexpectedly for users that aren't
using crazy page sizes.

>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
>  include/linux/huge_mm.h | 25 +++++++++++++++++++++++++
>  mm/khugepaged.c         | 32 ++++++++++++++++++++++++++++----
>  mm/shmem.c              | 29 +++++------------------------
>  3 files changed, 58 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..fb4e51ef0acb 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -170,6 +170,25 @@ static inline void count_mthp_stat(int order, enum mthp_stat_item item)
>  }
>  #endif
>
> +/*
> + * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
> + *
> + * SHMEM_HUGE_NEVER:
> + *	disables huge pages for the mount;
> + * SHMEM_HUGE_ALWAYS:
> + *	enables huge pages for the mount;
> + * SHMEM_HUGE_WITHIN_SIZE:
> + *	only allocate huge pages if the page will be fully within i_size,
> + *	also respect madvise() hints;
> + * SHMEM_HUGE_ADVISE:
> + *	only allocate huge pages if requested with madvise();
> + */
> +
> + #define SHMEM_HUGE_NEVER	0
> + #define SHMEM_HUGE_ALWAYS	1
> + #define SHMEM_HUGE_WITHIN_SIZE	2
> + #define SHMEM_HUGE_ADVISE	3
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>
>  extern unsigned long transparent_hugepage_flags;
> @@ -177,6 +196,12 @@ extern unsigned long huge_anon_orders_always;
>  extern unsigned long huge_anon_orders_madvise;
>  extern unsigned long huge_anon_orders_inherit;
>
> +extern int shmem_huge __read_mostly;
> +extern unsigned long huge_shmem_orders_always;
> +extern unsigned long huge_shmem_orders_madvise;
> +extern unsigned long huge_shmem_orders_inherit;
> +extern unsigned long huge_shmem_orders_within_size;
> +

Rather than exposing all of this shmem state as globals, can we not just have
shmem provide a function that grabs this information?

>  static inline bool hugepage_global_enabled(void)
>  {
>  	return transparent_hugepage_flags &
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 15203ea7d007..e64cba74eb2a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2607,6 +2607,26 @@ static int khugepaged(void *none)
>  	return 0;
>  }
>

> +static int thp_highest_allowable_order(void)

This absolutely needs a comment.

> +{
> +	unsigned long orders = READ_ONCE(huge_anon_orders_always)
> +			       | READ_ONCE(huge_anon_orders_madvise)
> +			       | READ_ONCE(huge_shmem_orders_always)
> +			       | READ_ONCE(huge_shmem_orders_madvise)
> +			       | READ_ONCE(huge_shmem_orders_within_size);

Same comment as above, have shmem export this.

> +	if (hugepage_global_enabled())
> +		orders |= READ_ONCE(huge_anon_orders_inherit);
> +	if (shmem_huge != SHMEM_HUGE_NEVER)
> +		orders |= READ_ONCE(huge_shmem_orders_inherit);
> +
> +	return orders == 0 ? 0 : fls(orders) - 1;
> +}
> +
> +static unsigned long min_thp_pageblock_nr_pages(void)

I really really hate this name. This isn't number of pageblock pages any
more this is something else? You're not changing the page block size right?

> +{
> +	return (1UL << min(thp_highest_allowable_order(), PAGE_BLOCK_ORDER));
> +}
> +
>  static void set_recommended_min_free_kbytes(void)
>  {
>  	struct zone *zone;
> @@ -2638,12 +2658,16 @@ static void set_recommended_min_free_kbytes(void)

You provide a 'patchlet' in
https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/

That also does:

        /* Ensure 2 pageblocks are free to assist fragmentation avoidance */
-       recommended_min = pageblock_nr_pages * nr_zones * 2;
+       recommended_min = min_thp_pageblock_nr_pages() * nr_zones * 2;

So comment here - this comment is now incorrect, this isn't 2 page blocks,
it's 2 of 'sub-pageblock size as if page blocks were dynamically altered by
always/madvise THP size'.

Again, this whole thing strikes me as we're doing things at the wrong level
of abstraction.

And you're definitely now not helping avoid pageblock-sized
fragmentation. You're accepting that you need less so... why not reduce
pageblock size? :)

	/*
	 * Make sure that on average at least two pageblocks are almost free
	 * of another type, one for a migratetype to fall back to and a

^ remainder of comment

>  	 * second to avoid subsequent fallbacks of other types There are 3
>  	 * MIGRATE_TYPES we care about.
>  	 */
> -	recommended_min += pageblock_nr_pages * nr_zones *
> +	recommended_min += min_thp_pageblock_nr_pages() * nr_zones *
>  			   MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;

This just seems wrong now and contradicts the comment - you're setting
minimum pages based on migrate PCP types that operate at pageblock order
but without reference to the actual number of page block pages?

So the comment is just wrong now? 'make sure there are at least two
pageblocks', well this isn't what you're doing is it? So why there are we
making reference to PCP counts etc.?

This seems like we're essentially just tuning these numbers somewhat
arbitrarily to reduce them?

>
> -	/* don't ever allow to reserve more than 5% of the lowmem */
> -	recommended_min = min(recommended_min,
> -			      (unsigned long) nr_free_buffer_pages() / 20);
> +	/*
> +	 * Don't ever allow to reserve more than 5% of the lowmem.
> +	 * Use a min of 128 pages when all THP orders are set to never.

Why? Did you just choose this number out of the blue?

Previously, on x86-64 with thp -> never on everything a pageblock order-9
wouldn't this be a much higher value?

I mean just putting '128' here is not acceptable. It needs to be justified
(even if empirically with data to back it) and defined as a named thing.


> +	 */
> +	recommended_min = clamp(recommended_min, 128,
> +				(unsigned long) nr_free_buffer_pages() / 20);
> +
>  	recommended_min <<= (PAGE_SHIFT-10);
>
>  	if (recommended_min > min_free_kbytes) {
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 0c5fb4ffa03a..8e92678d1175 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -136,10 +136,10 @@ struct shmem_options {
>  };
>
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -static unsigned long huge_shmem_orders_always __read_mostly;
> -static unsigned long huge_shmem_orders_madvise __read_mostly;
> -static unsigned long huge_shmem_orders_inherit __read_mostly;
> -static unsigned long huge_shmem_orders_within_size __read_mostly;
> +unsigned long huge_shmem_orders_always __read_mostly;
> +unsigned long huge_shmem_orders_madvise __read_mostly;
> +unsigned long huge_shmem_orders_inherit __read_mostly;
> +unsigned long huge_shmem_orders_within_size __read_mostly;

Again, we really shouldn't need to do this.

>  static bool shmem_orders_configured __initdata;
>  #endif
>
> @@ -516,25 +516,6 @@ static bool shmem_confirm_swap(struct address_space *mapping,
>  	return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
>  }
>
> -/*
> - * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
> - *
> - * SHMEM_HUGE_NEVER:
> - *	disables huge pages for the mount;
> - * SHMEM_HUGE_ALWAYS:
> - *	enables huge pages for the mount;
> - * SHMEM_HUGE_WITHIN_SIZE:
> - *	only allocate huge pages if the page will be fully within i_size,
> - *	also respect madvise() hints;
> - * SHMEM_HUGE_ADVISE:
> - *	only allocate huge pages if requested with madvise();
> - */
> -
> -#define SHMEM_HUGE_NEVER	0
> -#define SHMEM_HUGE_ALWAYS	1
> -#define SHMEM_HUGE_WITHIN_SIZE	2
> -#define SHMEM_HUGE_ADVISE	3
> -

Again we really shouldn't need to do this, just provide some function from
shmem that gives you what you need.

>  /*
>   * Special values.
>   * Only can be set via /sys/kernel/mm/transparent_hugepage/shmem_enabled:
> @@ -551,7 +532,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  /* ifdef here to avoid bloating shmem.o when not necessary */
>
> -static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
> +int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;

Same comment.

>  static int tmpfs_huge __read_mostly = SHMEM_HUGE_NEVER;
>
>  /**
> --
> 2.47.1
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-06 16:10     ` Zi Yan
@ 2025-06-07  8:35       ` Lorenzo Stoakes
  2025-06-08  0:04         ` Zi Yan
  2025-06-09 11:13       ` Usama Arif
  1 sibling, 1 reply; 32+ messages in thread
From: Lorenzo Stoakes @ 2025-06-07  8:35 UTC (permalink / raw)
  To: Zi Yan
  Cc: Usama Arif, Andrew Morton, david, linux-mm, hannes, shakeel.butt,
	riel, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
	hughd, linux-kernel, linux-doc, kernel-team, Juan Yescas,
	Breno Leitao

On Fri, Jun 06, 2025 at 12:10:43PM -0400, Zi Yan wrote:
> On 6 Jun 2025, at 11:38, Usama Arif wrote:
>
> > On 06/06/2025 16:18, Zi Yan wrote:
> >> On 6 Jun 2025, at 10:37, Usama Arif wrote:
> >>
> >>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
> >>> watermarks are evaluated to extremely high values, for e.g. a server with
> >>> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
> >>> of the sizes set to never, the min, low and high watermarks evaluate to
> >>> 11.2G, 14G and 16.8G respectively.
> >>> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
> >>> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
> >>> and 1G respectively.
> >>> This is because set_recommended_min_free_kbytes is designed for PMD
> >>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
> >>> Such high watermark values can cause performance and latency issues in
> >>> memory bound applications on arm servers that use 64K PAGE_SIZE, even though
> >>> most of them would never actually use a 512M PMD THP.
> >>>
> >>> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
> >>> folio order enabled in set_recommended_min_free_kbytes.
> >>> With this patch, when only 2M THP hugepage size is set to madvise for the
> >>> same machine with 64K page size, with the rest of the sizes set to never,
> >>> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
> >>> respectively. When 512M THP hugepage size is set to madvise for the same
> >>> machine with 64K page size, the min, low and high watermarks evaluate to
> >>> 11.2G, 14G and 16.8G respectively, the same as without this patch.
> >>
> >> Getting pageblock_order involved here might be confusing. I think you just
> >> want to adjust min, low and high watermarks to reasonable values.
> >> Is it OK to rename min_thp_pageblock_nr_pages to min_nr_free_pages_per_zone
> >> and move MIGRATE_PCPTYPES * MIGRATE_PCPTYPES inside? Otherwise, the changes
> >> look reasonable to me.
> >
> > Hi Zi,
> >
> > Thanks for the review!
> >
> > I forgot to change it in another place, sorry about that! So can't move
> > MIGRATE_PCPTYPES * MIGRATE_PCPTYPES into the combined function.
> > Have added the additional place where min_thp_pageblock_nr_pages() is called
> > as a fixlet here:
> > https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/
> >
> > I think at least in this context the original name pageblock_nr_pages isn't
> > correct as its min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER).
> > The new name min_thp_pageblock_nr_pages is also not really good, so happy
> > to change it to something appropriate.
>
> Got it. pageblock is the defragmentation granularity. If user only wants
> 2MB mTHP, maybe pageblock order should be adjusted. Otherwise,
> kernel will defragment at 512MB granularity, which might not be efficient.
> Maybe make pageblock_order a boot time parameter?
>
> In addition, we are mixing two things together:
> 1. min, low, and high watermarks: they affect when memory reclaim and compaction
>    will be triggered;
> 2. pageblock order: it is the granularity of defragmentation for creating
>    mTHP/THP.
>
> In your use case, you want to lower watermarks, right? Considering what you
> said below, I wonder if we want a way of enforcing vm.min_free_kbytes,
> like a new sysctl knob, vm.force_min_free_kbytes (yeah the suggestion
> is lame, sorry).

Hmmm :>) I really think this is something we should do automatically.

I know it's becoming silly as Usama and others have clearly demonstrated the 'T'
in THP doesn't stand for transparent, but I think providing a new sysctl for an
apparently automated system is not the way to go, especially as we intend to
make it more automagic in future.

>
> I think for 2, we might want to decouple pageblock order from defragmentation
> granularity.

Well, isn't pageblock order explicitly a heuristic for defragmenting physical
memory for the purposes of higher order allocations?

I don't think we can decouple that.

But I think we can say, as the existence of PAGE_BLOCK_MAX_ORDER already sort of
implies, 'we are fine with increasing the chances of fragmentation of
<ridiculously huge page size> in order to improve reclaim behaviour'.

And again really strikes me that the parameter to adjust here is pageblock size,
maybe default max size for systems with very large page table size.

The THP mechanism is meant to be 'best effort' and opportunistic right? So it's
ok if we aren't quite perfect in providing crazy huge page sizes.

I think 'on arm64 64KB we give up on page block beyond sensible mTHP size' is
really a fine thing to do, and implementable by just... changing max pageblock
order :>)

Not having pageblocks at the crazy size doesn't mean those regions won't exist,
it just means they're more likely not to due to fragmentation.

512MB PMD's... man haha.

>
>
> >>
> >> Another concern on tying watermarks to highest THP order is that if
> >> user enables PMD THP on such systems with 2MB mTHP enabled initially,
> >> it could trigger unexpected memory reclaim and compaction, right?
> >> That might surprise user, since they just want to adjust availability
> >> of THP sizes, but the whole system suddenly begins to be busy.
> >> Have you experimented with it?
> >>
> >
> > Yes I would imagine it would trigger reclaim and compaction if the system memory
> > is too low, but that should hopefully be expected? If the user is enabling 512M
> > THP, they should expect changes by kernel to allow them to give hugepage of
> > that size.
> > Also hopefully, no one is enabling PMD THPs when the system is so low on
> > memory that it triggers reclaim! There would be an OOM after just a few
> > of those are faulted in.
>
>
>
> Best Regards,
> Yan, Zi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-07  8:18 ` Lorenzo Stoakes
@ 2025-06-07  8:44   ` Lorenzo Stoakes
  2025-06-09 12:07   ` Usama Arif
  1 sibling, 0 replies; 32+ messages in thread
From: Lorenzo Stoakes @ 2025-06-07  8:44 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, hughd,
	linux-kernel, linux-doc, kernel-team

OK a 'reviewlet' :P (have to say I like this terminology...)

I see we are already ostensibly 'dynamic' in that hugepage_pmd_enabled() changes
our algorithm.

So my comments about dynamic altering are probably not quite valid. Are we not
in a weird situation where, with mTHP-only enabled we just use
calculate_min_free_kbytes()??

That probably needs addressing (perhaps separately...)

Lord I hate how we do this (not your fault!)

Rest of review still applicable :>)

Thanks for raising these important issues, while I am raising technical
criticism of the patch, I do very much think you're identifying a real (and
really very significant for those using v. large page table sizes) problem we
need to address one way or another.

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-07  8:35       ` Lorenzo Stoakes
@ 2025-06-08  0:04         ` Zi Yan
  0 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2025-06-08  0:04 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Usama Arif, Andrew Morton, david, linux-mm, hannes, shakeel.butt,
	riel, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
	hughd, linux-kernel, linux-doc, kernel-team, Juan Yescas,
	Breno Leitao

On 7 Jun 2025, at 4:35, Lorenzo Stoakes wrote:

> On Fri, Jun 06, 2025 at 12:10:43PM -0400, Zi Yan wrote:
>> On 6 Jun 2025, at 11:38, Usama Arif wrote:
>>
>>> On 06/06/2025 16:18, Zi Yan wrote:
>>>> On 6 Jun 2025, at 10:37, Usama Arif wrote:
>>>>
>>>>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
>>>>> watermarks are evaluated to extremely high values, for e.g. a server with
>>>>> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
>>>>> of the sizes set to never, the min, low and high watermarks evaluate to
>>>>> 11.2G, 14G and 16.8G respectively.
>>>>> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
>>>>> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
>>>>> and 1G respectively.
>>>>> This is because set_recommended_min_free_kbytes is designed for PMD
>>>>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
>>>>> Such high watermark values can cause performance and latency issues in
> >>>>> memory bound applications on arm servers that use 64K PAGE_SIZE, even though
>>>>> most of them would never actually use a 512M PMD THP.
>>>>>
>>>>> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
>>>>> folio order enabled in set_recommended_min_free_kbytes.
>>>>> With this patch, when only 2M THP hugepage size is set to madvise for the
>>>>> same machine with 64K page size, with the rest of the sizes set to never,
>>>>> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>>>>> respectively. When 512M THP hugepage size is set to madvise for the same
>>>>> machine with 64K page size, the min, low and high watermarks evaluate to
>>>>> 11.2G, 14G and 16.8G respectively, the same as without this patch.
>>>>
>>>> Getting pageblock_order involved here might be confusing. I think you just
>>>> want to adjust min, low and high watermarks to reasonable values.
>>>> Is it OK to rename min_thp_pageblock_nr_pages to min_nr_free_pages_per_zone
>>>> and move MIGRATE_PCPTYPES * MIGRATE_PCPTYPES inside? Otherwise, the changes
>>>> look reasonable to me.
>>>
>>> Hi Zi,
>>>
>>> Thanks for the review!
>>>
>>> I forgot to change it in another place, sorry about that! So can't move
>>> MIGRATE_PCPTYPES * MIGRATE_PCPTYPES into the combined function.
>>> Have added the additional place where min_thp_pageblock_nr_pages() is called
>>> as a fixlet here:
>>> https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/
>>>
> >>> I think at least in this context the original name pageblock_nr_pages isn't
> >>> correct as its min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER).
>>> The new name min_thp_pageblock_nr_pages is also not really good, so happy
>>> to change it to something appropriate.
>>
>> Got it. pageblock is the defragmentation granularity. If user only wants
>> 2MB mTHP, maybe pageblock order should be adjusted. Otherwise,
>> kernel will defragment at 512MB granularity, which might not be efficient.
>> Maybe make pageblock_order a boot time parameter?
>>
>> In addition, we are mixing two things together:
>> 1. min, low, and high watermarks: they affect when memory reclaim and compaction
>>    will be triggered;
>> 2. pageblock order: it is the granularity of defragmentation for creating
>>    mTHP/THP.
>>
>> In your use case, you want to lower watermarks, right? Considering what you
>> said below, I wonder if we want a way of enforcing vm.min_free_kbytes,
>> like a new sysctl knob, vm.force_min_free_kbytes (yeah the suggestion
>> is lame, sorry).
>
> Hmmm :>) I really think this is something we should do automatically.
>
> I know it's becoming silly as Usama and others have clearly demonstrated the 'T'
> in THP doesn't stand for transparent, but I think providing a new sysctl for an
> apparently automated system is not the way to go, especially as we intend to
> make it more automagic in future.

Right. I think current setting, which boosts watermarks based on THP sizes,
seems too conservative, implying we are so afraid of not being able to provide
a THP when there is not enough memory. But preventing the user from using all
available memory is silly. Maybe just get rid of the watermark change code
in khugepaged. If user wants to use all available memory, they pay the penalty
of not easily getting THPs from the system. Kernel should not make the decision
for user.

>
>>
>> I think for 2, we might want to decouple pageblock order from defragmentation
>> granularity.
>
> Well, isn't pageblock order explicitly a heuristic for defragmenting physical
> memory for the purposes of higher order allocations?
>
> I don't think we can decouple that.

Yes, but pageblock is also used for memory hotadd and hotremove as the minimal
unit, so a bigger pageblock is not memory hotplug friendly. And the main use
is pageblock isolation to remove free pages from any possible user.

In terms of defragmentation, pageblock has two purposes: 1) pageblock size
matches THP size, so memory compaction can migrate in-use pages to create
an THP-size free page; 2) avoid mixing movable and unmovable pages to avoid
wasting memory compaction effort, since a single unmovable page makes
a whole pageblock not suitable for THP creation with the help of memory
compaction.

Now we have mTHP, whose sizes vary from order-1 (anon starts from order-2)
to PMD-order. But if user only wants a smaller size mTHP (like in this case
2MB mTHP in a system with 512MB THP), having a large pageblock might not be
efficient, since why defragment a 512MB range for a 2MB mTHP?
I do not have data to support my claim yet, since it is possible that
defragmenting at > THP size range can provide better THP creation success
rate. So some study is needed to understand the impact of defragmentation
granularity on THP creation.

A single granularity, i.e., one pageblock size, which determines defragmentation
granularity, cannot rule all mTHP sizes. That is why I am thinking about decoupling
pageblock size from defragmentation granularity.


>
> But I think we can say, as the existence of PAGE_BLOCK_MAX_ORDER already sort of
> implies, 'we are fine with increasing the chances of fragmentation of
> <ridiculously huge page size> in order to improve reclaim behaviour'.

Right, especially these huge page sizes are rarely used.

>
> And again really strikes me that the parameter to adjust here is pageblock size,
> maybe default max size for systems with very large page table size.

Short term, yes. Since watermarks are tied to pageblock size and the rationale
is that pageblock size is equal to THP size and we want to make some guarantee
on THP creation.
>
> The THP mechanism is meant to be 'best effort' and opportunistic right? So it's
> ok if we aren't quite perfect in providing crazy huge page sizes.

Yes. And changing pageblock size to lower the watermarks and give more available free
memory to the user might be better than having guarantees on creating a rarely
used THP size.

>
> I think 'on arm64 64KB we give up on page block beyond sensible mTHP size' is
> really a fine thing to do, and implementable by just... changing max pageblock
> order :>)
>
> Not having pageblocks at the crazy size doesn't mean those regions won't exist,
> it just means they're more likely not to due to fragmentation.
>
> 512MB PMD's... man haha.

Right. One caveat is that pageblock size currently can only be changed via
Kconfig, so if user wants a different mTHP size than 2MB, they will need
to build a different kernel. Yes, we can make pageblock a boot time parameter
(I proposed it in Juan's patch). That implies if user wants a different mTHP
size, they need to reboot the machine. It is slightly better than kernel
compilation. Making pageblock size changeable at runtime might be too
complicated and involve a lot of runtime cost for merging and splitting pageblocks.
That is why I want to decouple pageblock from defragmentation granularity.
Yeah, it is going to be a big project. :)

>
>>
>>
>>>>
>>>> Another concern on tying watermarks to highest THP order is that if
>>>> user enables PMD THP on such systems with 2MB mTHP enabled initially,
>>>> it could trigger unexpected memory reclaim and compaction, right?
>>>> That might surprise user, since they just want to adjust availability
>>>> of THP sizes, but the whole system suddenly begins to be busy.
>>>> Have you experimented with it?
>>>>
>>>
>>> Yes I would imagine it would trigger reclaim and compaction if the system memory
>>> is too low, but that should hopefully be expected? If the user is enabling 512M
>>> THP, they should expect changes by kernel to allow them to give hugepage of
>>> that size.
>>> Also hopefully, no one is enabling PMD THPs when the system is so low on
>>> memory that it triggers reclaim! There would be an OOM after just a few
>>> of those are faulted in.


--
Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-06 16:10     ` Zi Yan
  2025-06-07  8:35       ` Lorenzo Stoakes
@ 2025-06-09 11:13       ` Usama Arif
  2025-06-09 13:19         ` Zi Yan
  1 sibling, 1 reply; 32+ messages in thread
From: Usama Arif @ 2025-06-09 11:13 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hughd, linux-kernel, linux-doc, kernel-team,
	Juan Yescas, Breno Leitao



On 06/06/2025 17:10, Zi Yan wrote:
> On 6 Jun 2025, at 11:38, Usama Arif wrote:
> 
>> On 06/06/2025 16:18, Zi Yan wrote:
>>> On 6 Jun 2025, at 10:37, Usama Arif wrote:
>>>
>>>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
>>>> watermarks are evaluated to extremely high values, for e.g. a server with
>>>> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
>>>> of the sizes set to never, the min, low and high watermarks evaluate to
>>>> 11.2G, 14G and 16.8G respectively.
>>>> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
>>>> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
>>>> and 1G respectively.
>>>> This is because set_recommended_min_free_kbytes is designed for PMD
>>>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
>>>> Such high watermark values can cause performance and latency issues in
>>>> memory bound applications on arm servers that use 64K PAGE_SIZE, even though
>>>> most of them would never actually use a 512M PMD THP.
>>>>
>>>> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
>>>> folio order enabled in set_recommended_min_free_kbytes.
>>>> With this patch, when only 2M THP hugepage size is set to madvise for the
>>>> same machine with 64K page size, with the rest of the sizes set to never,
>>>> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>>>> respectively. When 512M THP hugepage size is set to madvise for the same
>>>> machine with 64K page size, the min, low and high watermarks evaluate to
>>>> 11.2G, 14G and 16.8G respectively, the same as without this patch.
>>>
>>> Getting pageblock_order involved here might be confusing. I think you just
>>> want to adjust min, low and high watermarks to reasonable values.
>>> Is it OK to rename min_thp_pageblock_nr_pages to min_nr_free_pages_per_zone
>>> and move MIGRATE_PCPTYPES * MIGRATE_PCPTYPES inside? Otherwise, the changes
>>> look reasonable to me.
>>
>> Hi Zi,
>>
>> Thanks for the review!
>>
>> I forgot to change it in another place, sorry about that! So I can't move
>> MIGRATE_PCPTYPES * MIGRATE_PCPTYPES into the combined function.
>> Have added the additional place where min_thp_pageblock_nr_pages() is called
>> as a fixlet here:
>> https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/
>>
>> I think, at least in this context, the original name pageblock_nr_pages isn't
>> correct, as it's min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER).
>> The new name min_thp_pageblock_nr_pages is also not great, so I am happy
>> to change it to something more appropriate.
> 
> Got it. pageblock is the defragmentation granularity. If user only wants
> 2MB mTHP, maybe pageblock order should be adjusted. Otherwise,
> kernel will defragment at 512MB granularity, which might not be efficient.
> Maybe make pageblock_order a boot time parameter?
> 
> In addition, we are mixing two things together:
> 1. min, low, and high watermarks: they affect when memory reclaim and compaction
>    will be triggered;
> 2. pageblock order: it is the granularity of defragmentation for creating
>    mTHP/THP.
> 
> In your use case, you want to lower watermarks, right? Considering what you
> said below, I wonder if we want a way of enforcing vm.min_free_kbytes,
> like a new sysctl knob, vm.force_min_free_kbytes (yeah the suggestion
> is lame, sorry).
> 
> I think for 2, we might want to decouple pageblock order from defragmentation
> granularity.
> 

This is a good point. I only did it for the watermarks in the RFC, but there
is no reason the defrag granularity should remain 512M chunks, and it is
probably very inefficient to keep it there.

Instead of replacing the pageblock_nr_pages for just set_recommended_min_free_kbytes,
maybe we just need to change the definition of pageblock_order in [1] to take into
account the highest large folio order enabled instead of HPAGE_PMD_ORDER?

[1] https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/pageblock-flags.h#L50

I really want to avoid a solution that requires changing a Kconfig option or
the kernel command line. That would mean a reboot whenever a different workload
runs on a server that works optimally with a different THP size, which would
make workload orchestration a nightmare.


> 
>>>
>>> Another concern on tying watermarks to highest THP order is that if
>>> user enables PMD THP on such systems with 2MB mTHP enabled initially,
>>> it could trigger unexpected memory reclaim and compaction, right?
>>> That might surprise user, since they just want to adjust availability
>>> of THP sizes, but the whole system suddenly begins to be busy.
>>> Have you experimented with it?
>>>
>>
>> Yes, I would imagine it would trigger reclaim and compaction if system memory
>> is too low, but that should hopefully be expected? If the user is enabling 512M
>> THP, they should expect the kernel to make changes that allow it to provide
>> hugepages of that size.
>> Also, hopefully no one is enabling PMD THPs when the system is so low on
>> memory that it triggers reclaim! There would be an OOM after just a few
>> of those are faulted in.
> 
> 
> 
> Best Regards,
> Yan, Zi


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-06 17:37 ` David Hildenbrand
@ 2025-06-09 11:34   ` Usama Arif
  2025-06-09 13:28     ` Zi Yan
  0 siblings, 1 reply; 32+ messages in thread
From: Usama Arif @ 2025-06-09 11:34 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, linux-mm
  Cc: hannes, shakeel.butt, riel, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hughd, linux-kernel,
	linux-doc, kernel-team, Matthew Wilcox



On 06/06/2025 18:37, David Hildenbrand wrote:
> On 06.06.25 16:37, Usama Arif wrote:
>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
>> watermarks are evaluated to extremely high values, for e.g. a server with
>> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
>> of the sizes set to never, the min, low and high watermarks evaluate to
>> 11.2G, 14G and 16.8G respectively.
>> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
>> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
>> and 1G respectively.
>> This is because set_recommended_min_free_kbytes is designed for PMD
>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
>> Such high watermark values can cause performance and latency issues in
>> memory bound applications on arm servers that use 64K PAGE_SIZE, even though
>> most of them would never actually use a 512M PMD THP.
>>
>> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
>> folio order enabled in set_recommended_min_free_kbytes.
>> With this patch, when only 2M THP hugepage size is set to madvise for the
>> same machine with 64K page size, with the rest of the sizes set to never,
>> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>> respectively. When 512M THP hugepage size is set to madvise for the same
>> machine with 64K page size, the min, low and high watermarks evaluate to
>> 11.2G, 14G and 16.8G respectively, the same as without this patch.
>>
>> An alternative solution would be to change PAGE_BLOCK_ORDER by changing
>> ARCH_FORCE_MAX_ORDER to a lower value for ARM64_64K_PAGES. However, this
>> is not dynamic with hugepage size, will need different kernel builds for
>> different hugepage sizes and most users won't know that this needs to be
>> done as it can be difficult to determine that the performance and latency
>> issues are coming from the high watermark values.
>>
>> All watermark numbers are for zones of nodes that had the highest number
>> of pages, i.e. the value for min size for 4K is obtained using:
>> cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 4096 / 1024 / 1024}';
>> and for 64K using:
>> cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 65536 / 1024 / 1024}';
>>
>> An arbitrary min of 128 pages is used when no hugepage sizes are
>> enabled.
>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> ---
>>   include/linux/huge_mm.h | 25 +++++++++++++++++++++++++
>>   mm/khugepaged.c         | 32 ++++++++++++++++++++++++++++----
>>   mm/shmem.c              | 29 +++++------------------------
>>   3 files changed, 58 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2f190c90192d..fb4e51ef0acb 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -170,6 +170,25 @@ static inline void count_mthp_stat(int order, enum mthp_stat_item item)
>>   }
>>   #endif
>>   +/*
>> + * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
>> + *
>> + * SHMEM_HUGE_NEVER:
>> + *    disables huge pages for the mount;
>> + * SHMEM_HUGE_ALWAYS:
>> + *    enables huge pages for the mount;
>> + * SHMEM_HUGE_WITHIN_SIZE:
>> + *    only allocate huge pages if the page will be fully within i_size,
>> + *    also respect madvise() hints;
>> + * SHMEM_HUGE_ADVISE:
>> + *    only allocate huge pages if requested with madvise();
>> + */
>> +
>> + #define SHMEM_HUGE_NEVER    0
>> + #define SHMEM_HUGE_ALWAYS    1
>> + #define SHMEM_HUGE_WITHIN_SIZE    2
>> + #define SHMEM_HUGE_ADVISE    3
>> +
>>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>     extern unsigned long transparent_hugepage_flags;
>> @@ -177,6 +196,12 @@ extern unsigned long huge_anon_orders_always;
>>   extern unsigned long huge_anon_orders_madvise;
>>   extern unsigned long huge_anon_orders_inherit;
>>   +extern int shmem_huge __read_mostly;
>> +extern unsigned long huge_shmem_orders_always;
>> +extern unsigned long huge_shmem_orders_madvise;
>> +extern unsigned long huge_shmem_orders_inherit;
>> +extern unsigned long huge_shmem_orders_within_size;
> 
> Do really all of these have to be exported?
> 

Hi David,

Thanks for the review!

For the RFC, I just did it similarly to the anon ones when I got the build error
trying to use these, but yes, a much better approach would be to have a
function in shmem that returns the largest allowable shmem THP order.

>> +
>>   static inline bool hugepage_global_enabled(void)
>>   {
>>       return transparent_hugepage_flags &
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 15203ea7d007..e64cba74eb2a 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -2607,6 +2607,26 @@ static int khugepaged(void *none)
>>       return 0;
>>   }
>>   +static int thp_highest_allowable_order(void)
> 
> Did you mean "largest" ?

Yes

> 
>> +{
>> +    unsigned long orders = READ_ONCE(huge_anon_orders_always)
>> +                   | READ_ONCE(huge_anon_orders_madvise)
>> +                   | READ_ONCE(huge_shmem_orders_always)
>> +                   | READ_ONCE(huge_shmem_orders_madvise)
>> +                   | READ_ONCE(huge_shmem_orders_within_size);
>> +    if (hugepage_global_enabled())
>> +        orders |= READ_ONCE(huge_anon_orders_inherit);
>> +    if (shmem_huge != SHMEM_HUGE_NEVER)
>> +        orders |= READ_ONCE(huge_shmem_orders_inherit);
>> +
>> +    return orders == 0 ? 0 : fls(orders) - 1;
>> +}
> 
> But how does this interact with large folios / THPs in the page cache?
> 

Yes this will be a problem.

From what I see, there doesn't seem to be a max order for pagecache, only
mapping_set_folio_min_order for the min.
Does this mean that pagecache can fault in 128M, 256M, 512M large folios?

I think this could increase the OOM rate significantly when ARM64 servers
are used with filesystems that support large folios.

Should there be an upper limit for pagecache? If so, it would either be a new
sysfs entry (which I don't like :( ) or an attempt to reuse the existing entries
with something like thp_highest_allowable_order?
 

>> +
>> +static unsigned long min_thp_pageblock_nr_pages(void)
> 
> Reading the function name, I have no idea what this function is supposed to do.
> 
> 
Yeah sorry about that. I knew even before sending the RFC that this was a bad name :(

I think an issue is that pageblock_nr_pages is not really 1 << PAGE_BLOCK_ORDER but is
1 << min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) when THP is enabled.

I wanted the name to highlight that it uses the minimum of the largest enabled
THP order and PAGE_BLOCK_ORDER when calculating the number of pages.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-07  8:18 ` Lorenzo Stoakes
  2025-06-07  8:44   ` Lorenzo Stoakes
@ 2025-06-09 12:07   ` Usama Arif
  2025-06-09 12:12     ` Usama Arif
  2025-06-09 14:57     ` Lorenzo Stoakes
  1 sibling, 2 replies; 32+ messages in thread
From: Usama Arif @ 2025-06-09 12:07 UTC (permalink / raw)
  To: Lorenzo Stoakes, ziy
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, hughd,
	linux-kernel, linux-doc, kernel-team



On 07/06/2025 09:18, Lorenzo Stoakes wrote:
> It's important to base against mm-new for new mm stuff, PAGE_BLOCK_ORDER got
> renamed to PAGE_BLOCK_MAX_ORDER in Zi's series at [0] and this doesn't compile.
> 
> Please always do a quick rebase + compile check before sending.
> 
> [0]:  https://lkml.kernel.org/r/20250604211427.1590859-1-ziy@nvidia.com
> 
> Overall this seems to me to be implemented at the wrong level of
> abstraction - we implement set_recommended_min_free_kbytes() to interact
> with the page block mechanism.
> 
> While the problem you describe is absolutely a problem and we need to
> figure out a way to avoid reserving ridiculous amounts of memory for higher
> page tables, we surely need to figure this out at a page block granularity
> don't we?
> 

Yes, agreed. Zi raised a good point in [1], and I think there is no reason to
limit this to lowering the watermarks; it should be done at pageblock order so
that, with the example in the commit message, defrag also happens at 2M
granularity rather than 512M.

[1] https://lore.kernel.org/all/c600a6c0-aa59-4896-9e0d-3649a32d1771@gmail.com/

> On Fri, Jun 06, 2025 at 03:37:00PM +0100, Usama Arif wrote:
>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
>> watermarks are evaluated to extremely high values, for e.g. a server with
>> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
>> of the sizes set to never, the min, low and high watermarks evaluate to
>> 11.2G, 14G and 16.8G respectively.
>> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
>> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
>> and 1G respectively.
>> This is because set_recommended_min_free_kbytes is designed for PMD
>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
> 
> Right it is, but not this line really, the _pageblock order_ is set to be
> the minimum of the huge page PMD order and PAGE_BLOCK_MAX_ORDER as it makes
> sense to use page block heuristics to reduce the odds of fragmentation and
> so we can grab a PMD huge page at a time.
> 
> Obviously if the user wants to set a _smaller_ page block order they can,
> but if it's larger we want to heuristically avoid fragmentation of
> physically contiguous huge page aligned ranges (the whole page block
> mechanism).
> 
> I absolutely hate how set_recommended_min_free_kbytes() has basically
> hacked in some THP considerations but otherwise invokes
> calculate_min_free_kbytes()... ugh. But an existing issue.
> 
>> Such high watermark values can cause performance and latency issues in
>> memory bound applications on arm servers that use 64K PAGE_SIZE, even though
>> most of them would never actually use a 512M PMD THP.
> 
> 512MB, yeah crazy. We've not thought this through, and this is a very real
> issue.
> 
> Again, it strikes me that we should be changing the page block order for 64
> KB arm64 rather than this calculation though.
> 

Yes, agreed. I think changing pageblock_order is the right approach.

> Keep in mind pageblocks are a heuristic mechanism designed to reduce
> fragmentation, the decision could be made to cap how far we're willing to
> go with that...
> 
>>
>> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
>> folio order enabled in set_recommended_min_free_kbytes.
>> With this patch, when only 2M THP hugepage size is set to madvise for the
>> same machine with 64K page size, with the rest of the sizes set to never,
>> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>> respectively. When 512M THP hugepage size is set to madvise for the same
>> machine with 64K page size, the min, low and high watermarks evaluate to
>> 11.2G, 14G and 16.8G respectively, the same as without this patch.
> 
> Hmm, but what happens if a user changes this live, does this get updated?
> 
> OK I see it does via:
> 
> sysfs stuff -> enabled_store() -> start_stop_khugepaged() -> set_recommended_min_free_kbytes()
> 
> But don't we want to change this in general? Does somebody happening to
> have 512MB THP at madvise or always suggest we want insane watermark
> numbers?

Unfortunately, the answer to this is that probably a lot of servers that use
64K page size do. You can see in [1] that if the hugepage sizes haven't been
configured via the kernel command line, and the global policy is set to
madvise or always, then 512M inherits madvise/always and those servers would
have a very high watermark set. I don't think this behaviour is what most
people expect.

I actually think [1] should be wrapped in ifndef CONFIG_PAGE_SIZE_64KB,
but it has always been the case that PMD is set to inherit, so it probably
shouldn't be wrapped.

[1] https://elixir.bootlin.com/linux/v6.15.1/source/mm/huge_memory.c#L782

> 
> I'm not really convinced by this 'dynamic' aspect, you're changing global
> watermark numbers and reserves _massively_ based on a 'maybe' use of
> something that's meant to be transparent + best-effort...
> 

If someone sets 512M to madvise/always, it brings back the watermarks to
the levels without this patch.

>>
>> An alternative solution would be to change PAGE_BLOCK_ORDER by changing
>> ARCH_FORCE_MAX_ORDER to a lower value for ARM64_64K_PAGES. However, this
>> is not dynamic with hugepage size, will need different kernel builds for
>> different hugepage sizes and most users won't know that this needs to be
>> done as it can be difficult to determine that the performance and latency
>> issues are coming from the high watermark values.
> 
> Or, we could adjust pageblock_order accordingly in this instance no?
> 
>>
>> All watermark numbers are for zones of nodes that had the highest number
>> of pages, i.e. the value for min size for 4K is obtained using:
>> cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 4096 / 1024 / 1024}';
>> and for 64K using:
>> cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 65536 / 1024 / 1024}';
>>
>> An arbitrary min of 128 pages is used when no hugepage sizes are
>> enabled.
> 
> I don't think it's really okay to out and out add an arbitrary value like this
> without explanation. This is basis for rejection of the patch already.
> 

I just took 128 from calculate_min_free_kbytes, although I realize now that
over there it is 128 kB, whereas here it will mean 128 pages = 128 * 64 kB.

I think maybe a better number is sqrt(lowmem_kbytes * 16) from calculate_min_free_kbytes.

I can't see in the git history how 128 and the sqrt formula were chosen in
calculate_min_free_kbytes.

> That seems a little low too no?
> 
> IMPORTANT: I'd really like to see some before/after numbers for 4k, 16k,
> 64k with THP enabled/disabled so you can prove your patch isn't
> fundamentally changing these values unexpectedly for users that aren't
> using crazy page sizes.
> 
>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> ---
>>  include/linux/huge_mm.h | 25 +++++++++++++++++++++++++
>>  mm/khugepaged.c         | 32 ++++++++++++++++++++++++++++----
>>  mm/shmem.c              | 29 +++++------------------------
>>  3 files changed, 58 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2f190c90192d..fb4e51ef0acb 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -170,6 +170,25 @@ static inline void count_mthp_stat(int order, enum mthp_stat_item item)
>>  }
>>  #endif
>>
>> +/*
>> + * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
>> + *
>> + * SHMEM_HUGE_NEVER:
>> + *	disables huge pages for the mount;
>> + * SHMEM_HUGE_ALWAYS:
>> + *	enables huge pages for the mount;
>> + * SHMEM_HUGE_WITHIN_SIZE:
>> + *	only allocate huge pages if the page will be fully within i_size,
>> + *	also respect madvise() hints;
>> + * SHMEM_HUGE_ADVISE:
>> + *	only allocate huge pages if requested with madvise();
>> + */
>> +
>> + #define SHMEM_HUGE_NEVER	0
>> + #define SHMEM_HUGE_ALWAYS	1
>> + #define SHMEM_HUGE_WITHIN_SIZE	2
>> + #define SHMEM_HUGE_ADVISE	3
>> +
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>
>>  extern unsigned long transparent_hugepage_flags;
>> @@ -177,6 +196,12 @@ extern unsigned long huge_anon_orders_always;
>>  extern unsigned long huge_anon_orders_madvise;
>>  extern unsigned long huge_anon_orders_inherit;
>>
>> +extern int shmem_huge __read_mostly;
>> +extern unsigned long huge_shmem_orders_always;
>> +extern unsigned long huge_shmem_orders_madvise;
>> +extern unsigned long huge_shmem_orders_inherit;
>> +extern unsigned long huge_shmem_orders_within_size;
>> +
> 
> Rather than exposing all of this shmem state as globals, can we not just have
> shmem provide a function that grabs this information?
> 
>>  static inline bool hugepage_global_enabled(void)
>>  {
>>  	return transparent_hugepage_flags &
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 15203ea7d007..e64cba74eb2a 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -2607,6 +2607,26 @@ static int khugepaged(void *none)
>>  	return 0;
>>  }
>>
> 
>> +static int thp_highest_allowable_order(void)
> 
> This absolutely needs a comment.
> 
>> +{
>> +	unsigned long orders = READ_ONCE(huge_anon_orders_always)
>> +			       | READ_ONCE(huge_anon_orders_madvise)
>> +			       | READ_ONCE(huge_shmem_orders_always)
>> +			       | READ_ONCE(huge_shmem_orders_madvise)
>> +			       | READ_ONCE(huge_shmem_orders_within_size);
> 
> Same comment as above, have shmem export this.
> 
>> +	if (hugepage_global_enabled())
>> +		orders |= READ_ONCE(huge_anon_orders_inherit);
>> +	if (shmem_huge != SHMEM_HUGE_NEVER)
>> +		orders |= READ_ONCE(huge_shmem_orders_inherit);
>> +
>> +	return orders == 0 ? 0 : fls(orders) - 1;
>> +}
>> +
>> +static unsigned long min_thp_pageblock_nr_pages(void)
> 
> I really really hate this name. This isn't number of pageblock pages any
> more this is something else? You're not changing the page block size right?
> 

I don't like it either :)

As I mentioned in my reply to David in [1], pageblock_nr_pages is not really
1 << PAGE_BLOCK_ORDER but 1 << min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) when
THP is enabled.

It needs a better name, but I think the right approach is just to change
pageblock_order as recommended in [2]

[1] https://lore.kernel.org/all/4adf1f8b-781d-4ab0-b82e-49795ad712cb@gmail.com/
[2] https://lore.kernel.org/all/c600a6c0-aa59-4896-9e0d-3649a32d1771@gmail.com/

>> +{
>> +	return (1UL << min(thp_highest_allowable_order(), PAGE_BLOCK_ORDER));
>> +}
>> +
>>  static void set_recommended_min_free_kbytes(void)
>>  {
>>  	struct zone *zone;
>> @@ -2638,12 +2658,16 @@ static void set_recommended_min_free_kbytes(void)
> 
> You provide a 'patchlet' in
> https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/
> 
> That also does:
> 
>         /* Ensure 2 pageblocks are free to assist fragmentation avoidance */
> -       recommended_min = pageblock_nr_pages * nr_zones * 2;
> +       recommended_min = min_thp_pageblock_nr_pages() * nr_zones * 2;
> 
> So comment here - this comment is now incorrect, this isn't 2 page blocks,
> it's 2 of 'sub-pageblock size as if page blocks were dynamically altered by
> always/madvise THP size'.
> 
> Again, this whole thing strikes me as we're doing things at the wrong level
> of abstraction.
> 
> And you're definitely now not helping avoid pageblock-sized
> fragmentation. You're accepting that you need less so... why not reduce
> pageblock size? :)
> 
> 	/*
> 	 * Make sure that on average at least two pageblocks are almost free
> 	 * of another type, one for a migratetype to fall back to and a
> 
> ^ remainder of comment
> 
>>  	 * second to avoid subsequent fallbacks of other types There are 3
>>  	 * MIGRATE_TYPES we care about.
>>  	 */
>> -	recommended_min += pageblock_nr_pages * nr_zones *
>> +	recommended_min += min_thp_pageblock_nr_pages() * nr_zones *
>>  			   MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
> 
> This just seems wrong now and contradicts the comment - you're setting
> minimum pages based on migrate PCP types that operate at pageblock order
> but without reference to the actual number of page block pages?
> 
> So the comment is just wrong now? 'make sure there are at least two
> pageblocks', well this isn't what you're doing is it? So why there are we
> making reference to PCP counts etc.?
> 
> This seems like we're essentially just tuning these numbers somewhat
> arbitrarily to reduce them?
> 
>>
>> -	/* don't ever allow to reserve more than 5% of the lowmem */
>> -	recommended_min = min(recommended_min,
>> -			      (unsigned long) nr_free_buffer_pages() / 20);
>> +	/*
>> +	 * Don't ever allow to reserve more than 5% of the lowmem.
>> +	 * Use a min of 128 pages when all THP orders are set to never.
> 
> Why? Did you just choose this number out of the blue?
> 
> Previously, on x86-64 with thp -> never on everything a pageblock order-9
> wouldn't this be a much higher value?
> 
> I mean just putting '128' here is not acceptable. It needs to be justified
> (even if empirically with data to back it) and defined as a named thing.
> 
> 
>> +	 */
>> +	recommended_min = clamp(recommended_min, 128,
>> +				(unsigned long) nr_free_buffer_pages() / 20);
>> +
>>  	recommended_min <<= (PAGE_SHIFT-10);
>>
>>  	if (recommended_min > min_free_kbytes) {
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 0c5fb4ffa03a..8e92678d1175 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -136,10 +136,10 @@ struct shmem_options {
>>  };
>>
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -static unsigned long huge_shmem_orders_always __read_mostly;
>> -static unsigned long huge_shmem_orders_madvise __read_mostly;
>> -static unsigned long huge_shmem_orders_inherit __read_mostly;
>> -static unsigned long huge_shmem_orders_within_size __read_mostly;
>> +unsigned long huge_shmem_orders_always __read_mostly;
>> +unsigned long huge_shmem_orders_madvise __read_mostly;
>> +unsigned long huge_shmem_orders_inherit __read_mostly;
>> +unsigned long huge_shmem_orders_within_size __read_mostly;
> 
> Again, we really shouldn't need to do this.
> 
>>  static bool shmem_orders_configured __initdata;
>>  #endif
>>
>> @@ -516,25 +516,6 @@ static bool shmem_confirm_swap(struct address_space *mapping,
>>  	return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
>>  }
>>
>> -/*
>> - * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
>> - *
>> - * SHMEM_HUGE_NEVER:
>> - *	disables huge pages for the mount;
>> - * SHMEM_HUGE_ALWAYS:
>> - *	enables huge pages for the mount;
>> - * SHMEM_HUGE_WITHIN_SIZE:
>> - *	only allocate huge pages if the page will be fully within i_size,
>> - *	also respect madvise() hints;
>> - * SHMEM_HUGE_ADVISE:
>> - *	only allocate huge pages if requested with madvise();
>> - */
>> -
>> -#define SHMEM_HUGE_NEVER	0
>> -#define SHMEM_HUGE_ALWAYS	1
>> -#define SHMEM_HUGE_WITHIN_SIZE	2
>> -#define SHMEM_HUGE_ADVISE	3
>> -
> 
> Again we really shouldn't need to do this, just provide some function from
> shmem that gives you what you need.
> 
>>  /*
>>   * Special values.
>>   * Only can be set via /sys/kernel/mm/transparent_hugepage/shmem_enabled:
>> @@ -551,7 +532,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>  /* ifdef here to avoid bloating shmem.o when not necessary */
>>
>> -static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
>> +int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
> 
> Same comment.
> 
>>  static int tmpfs_huge __read_mostly = SHMEM_HUGE_NEVER;
>>
>>  /**
>> --
>> 2.47.1
>>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-09 12:07   ` Usama Arif
@ 2025-06-09 12:12     ` Usama Arif
  2025-06-09 14:58       ` Lorenzo Stoakes
  2025-06-09 14:57     ` Lorenzo Stoakes
  1 sibling, 1 reply; 32+ messages in thread
From: Usama Arif @ 2025-06-09 12:12 UTC (permalink / raw)
  To: Lorenzo Stoakes, ziy
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, hughd,
	linux-kernel, linux-doc, kernel-team


> I don't like it either :)
> 

Pressed "Ctrl+enter" instead of "enter" by mistake, which sent the email
prematurely :) Adding replies to the rest of the comments in this email.

As I mentioned in my reply to David in [1], pageblock_nr_pages is not really
1 << PAGE_BLOCK_ORDER but 1 << min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) when
THP is enabled.

It needs a better name, but I think the right approach is just to change
pageblock_order as recommended in [2]
 
[1] https://lore.kernel.org/all/4adf1f8b-781d-4ab0-b82e-49795ad712cb@gmail.com/
[2] https://lore.kernel.org/all/c600a6c0-aa59-4896-9e0d-3649a32d1771@gmail.com/


> 
>>> +{
>>> +	return (1UL << min(thp_highest_allowable_order(), PAGE_BLOCK_ORDER));
>>> +}
>>> +
>>>  static void set_recommended_min_free_kbytes(void)
>>>  {
>>>  	struct zone *zone;
>>> @@ -2638,12 +2658,16 @@ static void set_recommended_min_free_kbytes(void)
>>
>> You provide a 'patchlet' in
>> https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/
>>
>> That also does:
>>
>>         /* Ensure 2 pageblocks are free to assist fragmentation avoidance */
>> -       recommended_min = pageblock_nr_pages * nr_zones * 2;
>> +       recommended_min = min_thp_pageblock_nr_pages() * nr_zones * 2;
>>
>> So comment here - this comment is now incorrect, this isn't 2 page blocks,
>> it's 2 of 'sub-pageblock size as if page blocks were dynamically altered by
>> always/madvise THP size'.
>>
>> Again, this whole thing strikes me as we're doing things at the wrong level
>> of abstraction.
>>
>> And you're definitely now not helping avoid pageblock-sized
>> fragmentation. You're accepting that you need less so... why not reduce
>> pageblock size? :)
>>

Yes agreed.

>> 	/*
>> 	 * Make sure that on average at least two pageblocks are almost free
>> 	 * of another type, one for a migratetype to fall back to and a
>>
>> ^ remainder of comment
>>
>>>  	 * second to avoid subsequent fallbacks of other types There are 3
>>>  	 * MIGRATE_TYPES we care about.
>>>  	 */
>>> -	recommended_min += pageblock_nr_pages * nr_zones *
>>> +	recommended_min += min_thp_pageblock_nr_pages() * nr_zones *
>>>  			   MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
>>
>> This just seems wrong now and contradicts the comment - you're setting
>> minimum pages based on migrate PCP types that operate at pageblock order
>> but without reference to the actual number of page block pages?
>>
>> So the comment is just wrong now? 'make sure there are at least two
>> pageblocks', well this isn't what you're doing is it? So why there are we
>> making reference to PCP counts etc.?
>>
>> This seems like we're essentially just tuning these numbers somewhat
>> arbitrarily to reduce them?
>>
>>>
>>> -	/* don't ever allow to reserve more than 5% of the lowmem */
>>> -	recommended_min = min(recommended_min,
>>> -			      (unsigned long) nr_free_buffer_pages() / 20);
>>> +	/*
>>> +	 * Don't ever allow to reserve more than 5% of the lowmem.
>>> +	 * Use a min of 128 pages when all THP orders are set to never.
>>
>> Why? Did you just choose this number out of the blue?


Mentioned this in the previous comment.
>>
>> Previously, on x86-64 with thp -> never on everything a pageblock order-9
>> wouldn't this be a much higher value?
>>
>> I mean just putting '128' here is not acceptable. It needs to be justified
>> (even if empirically with data to back it) and defined as a named thing.
>>
>>
>>> +	 */
>>> +	recommended_min = clamp(recommended_min, 128,
>>> +				(unsigned long) nr_free_buffer_pages() / 20);
>>> +
>>>  	recommended_min <<= (PAGE_SHIFT-10);
>>>
>>>  	if (recommended_min > min_free_kbytes) {
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 0c5fb4ffa03a..8e92678d1175 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -136,10 +136,10 @@ struct shmem_options {
>>>  };
>>>
>>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> -static unsigned long huge_shmem_orders_always __read_mostly;
>>> -static unsigned long huge_shmem_orders_madvise __read_mostly;
>>> -static unsigned long huge_shmem_orders_inherit __read_mostly;
>>> -static unsigned long huge_shmem_orders_within_size __read_mostly;
>>> +unsigned long huge_shmem_orders_always __read_mostly;
>>> +unsigned long huge_shmem_orders_madvise __read_mostly;
>>> +unsigned long huge_shmem_orders_inherit __read_mostly;
>>> +unsigned long huge_shmem_orders_within_size __read_mostly;
>>
>> Again, we really shouldn't need to do this.

Agreed. For the RFC, I just did it similarly to the anon ones when I got the
build error trying to use these, but yes, a much better approach would be to
have a function in shmem that returns the largest allowable shmem THP order.


>>
>>>  static bool shmem_orders_configured __initdata;
>>>  #endif
>>>
>>> @@ -516,25 +516,6 @@ static bool shmem_confirm_swap(struct address_space *mapping,
>>>  	return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
>>>  }
>>>
>>> -/*
>>> - * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
>>> - *
>>> - * SHMEM_HUGE_NEVER:
>>> - *	disables huge pages for the mount;
>>> - * SHMEM_HUGE_ALWAYS:
>>> - *	enables huge pages for the mount;
>>> - * SHMEM_HUGE_WITHIN_SIZE:
>>> - *	only allocate huge pages if the page will be fully within i_size,
>>> - *	also respect madvise() hints;
>>> - * SHMEM_HUGE_ADVISE:
>>> - *	only allocate huge pages if requested with madvise();
>>> - */
>>> -
>>> -#define SHMEM_HUGE_NEVER	0
>>> -#define SHMEM_HUGE_ALWAYS	1
>>> -#define SHMEM_HUGE_WITHIN_SIZE	2
>>> -#define SHMEM_HUGE_ADVISE	3
>>> -
>>
>> Again we really shouldn't need to do this, just provide some function from
>> shmem that gives you what you need.
>>
>>>  /*
>>>   * Special values.
>>>   * Only can be set via /sys/kernel/mm/transparent_hugepage/shmem_enabled:
>>> @@ -551,7 +532,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
>>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>  /* ifdef here to avoid bloating shmem.o when not necessary */
>>>
>>> -static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
>>> +int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
>>
>> Same comment.
>>
>>>  static int tmpfs_huge __read_mostly = SHMEM_HUGE_NEVER;
>>>
>>>  /**
>>> --
>>> 2.47.1
>>>
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-09 11:13       ` Usama Arif
@ 2025-06-09 13:19         ` Zi Yan
  2025-06-09 14:11           ` Usama Arif
  0 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2025-06-09 13:19 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel,
	baolin.wang, lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hughd, linux-kernel, linux-doc, kernel-team,
	Juan Yescas, Breno Leitao

On 9 Jun 2025, at 7:13, Usama Arif wrote:

> On 06/06/2025 17:10, Zi Yan wrote:
>> On 6 Jun 2025, at 11:38, Usama Arif wrote:
>>
>>> On 06/06/2025 16:18, Zi Yan wrote:
>>>> On 6 Jun 2025, at 10:37, Usama Arif wrote:
>>>>
>>>>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
>>>>> watermarks are evaluated to extremely high values, for e.g. a server with
>>>>> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
>>>>> of the sizes set to never, the min, low and high watermarks evaluate to
>>>>> 11.2G, 14G and 16.8G respectively.
>>>>> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
>>>>> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
>>>>> and 1G respectively.
>>>>> This is because set_recommended_min_free_kbytes is designed for PMD
>>>>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
>>>>> Such high watermark values can cause performance and latency issues in
>>>>> memory bound applications on arm servers that use 64K PAGE_SIZE, eventhough
>>>>> most of them would never actually use a 512M PMD THP.
>>>>>
>>>>> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
>>>>> folio order enabled in set_recommended_min_free_kbytes.
>>>>> With this patch, when only 2M THP hugepage size is set to madvise for the
>>>>> same machine with 64K page size, with the rest of the sizes set to never,
>>>>> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>>>>> respectively. When 512M THP hugepage size is set to madvise for the same
>>>>> machine with 64K page size, the min, low and high watermarks evaluate to
>>>>> 11.2G, 14G and 16.8G respectively, the same as without this patch.
>>>>
>>>> Getting pageblock_order involved here might be confusing. I think you just
>>>> want to adjust min, low and high watermarks to reasonable values.
>>>> Is it OK to rename min_thp_pageblock_nr_pages to min_nr_free_pages_per_zone
>>>> and move MIGRATE_PCPTYPES * MIGRATE_PCPTYPES inside? Otherwise, the changes
>>>> look reasonable to me.
>>>
>>> Hi Zi,
>>>
>>> Thanks for the review!
>>>
>>> I forgot to change it in another place, sorry about that! So can't move
>>> MIGRATE_PCPTYPES * MIGRATE_PCPTYPES into the combined function.
>>> Have added the additional place where min_thp_pageblock_nr_pages() is called
>>> as a fixlet here:
>>> https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/
>>>
>>> I think atleast in this context the orginal name pageblock_nr_pages isn't
>>> correct as its min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER).
>>> The new name min_thp_pageblock_nr_pages is also not really good, so happy
>>> to change it to something appropriate.
>>
>> Got it. pageblock is the defragmentation granularity. If user only wants
>> 2MB mTHP, maybe pageblock order should be adjusted. Otherwise,
>> kernel will defragment at 512MB granularity, which might not be efficient.
>> Maybe make pageblock_order a boot time parameter?
>>
>> In addition, we are mixing two things together:
>> 1. min, low, and high watermarks: they affect when memory reclaim and compaction
>>    will be triggered;
>> 2. pageblock order: it is the granularity of defragmentation for creating
>>    mTHP/THP.
>>
>> In your use case, you want to lower watermarks, right? Considering what you
>> said below, I wonder if we want a way of enforcing vm.min_free_kbytes,
>> like a new sysctl knob, vm.force_min_free_kbytes (yeah the suggestion
>> is lame, sorry).
>>
>> I think for 2, we might want to decouple pageblock order from defragmentation
>> granularity.
>>
>
> This is a good point. I only did it for the watermarks in the RFC, but there
> is no reason that the defrag granularity is done in 512M chunks and is probably
> very inefficient to do so?
>
> Instead of replacing the pageblock_nr_pages for just set_recommended_min_free_kbytes,
> maybe we just need to change the definition of pageblock_order in [1] to take into
> account the highest large folio order enabled instead of HPAGE_PMD_ORDER?

Ideally, yes. But pageblock migratetypes are stored in a fixed-size array
determined by pageblock_order at boot time (see usemap_size() in mm/mm_init.c).
Changing pageblock_order at runtime means we would need to resize the pageblock
migratetype array, which is a little unrealistic. In a system with GBs or TBs of
memory, reducing pageblock_order by 1 means doubling the migratetype array and
replicating each pageblock's migratetype into two entries; increasing pageblock
order by 1 means halving the array and merging two pageblocks into one.
The former, if memory suffices, might be easy, but the latter is a little
involved: for a merged pageblock containing both movable and unmovable pages,
you would need to check all the pages to decide the resulting migratetype, to
make sure the pageblock migratetype matches the pages inside that pageblock.


>
> [1] https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/pageblock-flags.h#L50
>
> I really want to avoid coming up with a solution that requires changing a Kconfig or needs
> kernel commandline to change. It would mean a reboot whenever a different workload
> runs on a server that works optimally with a different THP size, and that would make
> workload orchestration a nightmare.
>

As I said above, changing pageblock order at runtime might not be easy. But
changing the defragmentation granularity should be fine, since it just changes
the range of memory compaction. That is the reason for my proposal:
decoupling pageblock order from defragmentation granularity. We probably
need to do some experiments to see the impact of the decoupling; I
imagine defragmenting a range smaller than pageblock order is fine, but
defragmenting a range larger than pageblock order might cause issues
if there is any unmovable pageblock within that range, since unmovable pages
very likely reside in an unmovable pageblock and would lead to a
defragmentation failure.


--
Best Regards,
Yan, Zi


* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-09 11:34   ` Usama Arif
@ 2025-06-09 13:28     ` Zi Yan
  0 siblings, 0 replies; 32+ messages in thread
From: Zi Yan @ 2025-06-09 13:28 UTC (permalink / raw)
  To: Usama Arif
  Cc: David Hildenbrand, Andrew Morton, linux-mm, hannes, shakeel.butt,
	riel, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hughd, linux-kernel, linux-doc,
	kernel-team, Matthew Wilcox

On 9 Jun 2025, at 7:34, Usama Arif wrote:

> On 06/06/2025 18:37, David Hildenbrand wrote:
>> On 06.06.25 16:37, Usama Arif wrote:
>>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
>>> watermarks are evaluated to extremely high values, for e.g. a server with
>>> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
>>> of the sizes set to never, the min, low and high watermarks evaluate to
>>> 11.2G, 14G and 16.8G respectively.
>>> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
>>> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
>>> and 1G respectively.
>>> This is because set_recommended_min_free_kbytes is designed for PMD
>>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
>>> Such high watermark values can cause performance and latency issues in
>>> memory bound applications on arm servers that use 64K PAGE_SIZE, eventhough
>>> most of them would never actually use a 512M PMD THP.
>>>
>>> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
>>> folio order enabled in set_recommended_min_free_kbytes.
>>> With this patch, when only 2M THP hugepage size is set to madvise for the
>>> same machine with 64K page size, with the rest of the sizes set to never,
>>> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>>> respectively. When 512M THP hugepage size is set to madvise for the same
>>> machine with 64K page size, the min, low and high watermarks evaluate to
>>> 11.2G, 14G and 16.8G respectively, the same as without this patch.
>>>
>>> An alternative solution would be to change PAGE_BLOCK_ORDER by changing
>>> ARCH_FORCE_MAX_ORDER to a lower value for ARM64_64K_PAGES. However, this
>>> is not dynamic with hugepage size, will need different kernel builds for
>>> different hugepage sizes and most users won't know that this needs to be
>>> done as it can be difficult to detmermine that the performance and latency
>>> issues are coming from the high watermark values.
>>>
>>> All watermark numbers are for zones of nodes that had the highest number
>>> of pages, i.e. the value for min size for 4K is obtained using:
>>> cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 4096 / 1024 / 1024}';
>>> and for 64K using:
>>> cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 65536 / 1024 / 1024}';
>>>
>>> An arbirtary min of 128 pages is used for when no hugepage sizes are set
>>> enabled.
>>>
>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>> ---
>>>   include/linux/huge_mm.h | 25 +++++++++++++++++++++++++
>>>   mm/khugepaged.c         | 32 ++++++++++++++++++++++++++++----
>>>   mm/shmem.c              | 29 +++++------------------------
>>>   3 files changed, 58 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 2f190c90192d..fb4e51ef0acb 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -170,6 +170,25 @@ static inline void count_mthp_stat(int order, enum mthp_stat_item item)
>>>   }
>>>   #endif
>>>   +/*
>>> + * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
>>> + *
>>> + * SHMEM_HUGE_NEVER:
>>> + *    disables huge pages for the mount;
>>> + * SHMEM_HUGE_ALWAYS:
>>> + *    enables huge pages for the mount;
>>> + * SHMEM_HUGE_WITHIN_SIZE:
>>> + *    only allocate huge pages if the page will be fully within i_size,
>>> + *    also respect madvise() hints;
>>> + * SHMEM_HUGE_ADVISE:
>>> + *    only allocate huge pages if requested with madvise();
>>> + */
>>> +
>>> + #define SHMEM_HUGE_NEVER    0
>>> + #define SHMEM_HUGE_ALWAYS    1
>>> + #define SHMEM_HUGE_WITHIN_SIZE    2
>>> + #define SHMEM_HUGE_ADVISE    3
>>> +
>>>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>     extern unsigned long transparent_hugepage_flags;
>>> @@ -177,6 +196,12 @@ extern unsigned long huge_anon_orders_always;
>>>   extern unsigned long huge_anon_orders_madvise;
>>>   extern unsigned long huge_anon_orders_inherit;
>>>   +extern int shmem_huge __read_mostly;
>>> +extern unsigned long huge_shmem_orders_always;
>>> +extern unsigned long huge_shmem_orders_madvise;
>>> +extern unsigned long huge_shmem_orders_inherit;
>>> +extern unsigned long huge_shmem_orders_within_size;
>>
>> Do really all of these have to be exported?
>>
>
> Hi David,
>
> Thanks for the review!
>
> For the RFC, I just did it similar to the anon ones when I got the build error
> trying to use these, but yeah a much better approach would be to just have a
> function in shmem that would return the largest shmem thp allowable order.
>
>>> +
>>>   static inline bool hugepage_global_enabled(void)
>>>   {
>>>       return transparent_hugepage_flags &
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 15203ea7d007..e64cba74eb2a 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -2607,6 +2607,26 @@ static int khugepaged(void *none)
>>>       return 0;
>>>   }
>>>   +static int thp_highest_allowable_order(void)
>>
>> Did you mean "largest" ?
>
> Yes
>
>>
>>> +{
>>> +    unsigned long orders = READ_ONCE(huge_anon_orders_always)
>>> +                   | READ_ONCE(huge_anon_orders_madvise)
>>> +                   | READ_ONCE(huge_shmem_orders_always)
>>> +                   | READ_ONCE(huge_shmem_orders_madvise)
>>> +                   | READ_ONCE(huge_shmem_orders_within_size);
>>> +    if (hugepage_global_enabled())
>>> +        orders |= READ_ONCE(huge_anon_orders_inherit);
>>> +    if (shmem_huge != SHMEM_HUGE_NEVER)
>>> +        orders |= READ_ONCE(huge_shmem_orders_inherit);
>>> +
>>> +    return orders == 0 ? 0 : fls(orders) - 1;
>>> +}
>>
>> But how does this interact with large folios / THPs in the page cache?
>>
>
> Yes this will be a problem.
>
> From what I see, there doesn't seem to be a max order for pagecache, only
> mapping_set_folio_min_order for the min.

Actually, there is one[1]. But it is limited by xas_split_alloc() and
can be lifted once xas_split_alloc() is gone (implying READ_ONLY_THP_FOR_FS
needs to go).

[1] https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/pagemap.h#L377

> Does this mean that pagecache can fault in 128M, 256M, 512M large folios?
>
> I think this could increase the OOM rate significantly when ARM64 servers
> are used with filesystems that support large folios..
>
> Should there be an upper limit for pagecache? If so, it would either be a new
> sysfs entry (which I dont like :( ) or just try and reuse the existing entries
> with something like thp_highest_allowable_order?

MAX_PAGECACHE_ORDER limits the max folio size at the moment, in theory, and
IIRC the readahead code only reads in up to PMD-level folios.


--
Best Regards,
Yan, Zi


* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-09 13:19         ` Zi Yan
@ 2025-06-09 14:11           ` Usama Arif
  2025-06-09 14:16             ` Lorenzo Stoakes
  2025-06-09 15:32             ` Zi Yan
  0 siblings, 2 replies; 32+ messages in thread
From: Usama Arif @ 2025-06-09 14:11 UTC (permalink / raw)
  To: Zi Yan, lorenzo.stoakes, david
  Cc: Andrew Morton, linux-mm, hannes, shakeel.butt, riel, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hughd, linux-kernel,
	linux-doc, kernel-team, Juan Yescas, Breno Leitao



On 09/06/2025 14:19, Zi Yan wrote:
> On 9 Jun 2025, at 7:13, Usama Arif wrote:
> 
>> On 06/06/2025 17:10, Zi Yan wrote:
>>> On 6 Jun 2025, at 11:38, Usama Arif wrote:
>>>
>>>> On 06/06/2025 16:18, Zi Yan wrote:
>>>>> On 6 Jun 2025, at 10:37, Usama Arif wrote:
>>>>>
>>>>>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
>>>>>> watermarks are evaluated to extremely high values, for e.g. a server with
>>>>>> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
>>>>>> of the sizes set to never, the min, low and high watermarks evaluate to
>>>>>> 11.2G, 14G and 16.8G respectively.
>>>>>> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
>>>>>> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
>>>>>> and 1G respectively.
>>>>>> This is because set_recommended_min_free_kbytes is designed for PMD
>>>>>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
>>>>>> Such high watermark values can cause performance and latency issues in
>>>>>> memory bound applications on arm servers that use 64K PAGE_SIZE, eventhough
>>>>>> most of them would never actually use a 512M PMD THP.
>>>>>>
>>>>>> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
>>>>>> folio order enabled in set_recommended_min_free_kbytes.
>>>>>> With this patch, when only 2M THP hugepage size is set to madvise for the
>>>>>> same machine with 64K page size, with the rest of the sizes set to never,
>>>>>> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>>>>>> respectively. When 512M THP hugepage size is set to madvise for the same
>>>>>> machine with 64K page size, the min, low and high watermarks evaluate to
>>>>>> 11.2G, 14G and 16.8G respectively, the same as without this patch.
>>>>>
>>>>> Getting pageblock_order involved here might be confusing. I think you just
>>>>> want to adjust min, low and high watermarks to reasonable values.
>>>>> Is it OK to rename min_thp_pageblock_nr_pages to min_nr_free_pages_per_zone
>>>>> and move MIGRATE_PCPTYPES * MIGRATE_PCPTYPES inside? Otherwise, the changes
>>>>> look reasonable to me.
>>>>
>>>> Hi Zi,
>>>>
>>>> Thanks for the review!
>>>>
>>>> I forgot to change it in another place, sorry about that! So can't move
>>>> MIGRATE_PCPTYPES * MIGRATE_PCPTYPES into the combined function.
>>>> Have added the additional place where min_thp_pageblock_nr_pages() is called
>>>> as a fixlet here:
>>>> https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/
>>>>
>>>> I think atleast in this context the orginal name pageblock_nr_pages isn't
>>>> correct as its min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER).
>>>> The new name min_thp_pageblock_nr_pages is also not really good, so happy
>>>> to change it to something appropriate.
>>>
>>> Got it. pageblock is the defragmentation granularity. If user only wants
>>> 2MB mTHP, maybe pageblock order should be adjusted. Otherwise,
>>> kernel will defragment at 512MB granularity, which might not be efficient.
>>> Maybe make pageblock_order a boot time parameter?
>>>
>>> In addition, we are mixing two things together:
>>> 1. min, low, and high watermarks: they affect when memory reclaim and compaction
>>>    will be triggered;
>>> 2. pageblock order: it is the granularity of defragmentation for creating
>>>    mTHP/THP.
>>>
>>> In your use case, you want to lower watermarks, right? Considering what you
>>> said below, I wonder if we want a way of enforcing vm.min_free_kbytes,
>>> like a new sysctl knob, vm.force_min_free_kbytes (yeah the suggestion
>>> is lame, sorry).
>>>
>>> I think for 2, we might want to decouple pageblock order from defragmentation
>>> granularity.
>>>
>>
>> This is a good point. I only did it for the watermarks in the RFC, but there
>> is no reason that the defrag granularity is done in 512M chunks and is probably
>> very inefficient to do so?
>>
>> Instead of replacing the pageblock_nr_pages for just set_recommended_min_free_kbytes,
>> maybe we just need to change the definition of pageblock_order in [1] to take into
>> account the highest large folio order enabled instead of HPAGE_PMD_ORDER?
> 
> Ideally, yes. But pageblock migratetypes are stored in a fixed size array
> determined by pageblock_order at boot time (see usemap_size() in mm/mm_init.c).
> Changing pageblock_order at runtime means we will need to resize pageblock
> migratetypes array, which is a little unrealistic. In a system with GBs or TBs
> memory, reducing pageblock_order by 1 means doubling pageblock migratetypes
> array and replicating one pageblock migratetypes to two; increasing pageblock
> order by 1 means halving the array and splitting a pageblock into two.
> The former, if memory is enough, might be easy, but the latter is a little
> involved, since for a pageblock with both movable and unmovable pages,
> you will need to check all pages to decide the migratetypes of the after-split
> pageblocks to make sure pageblock migratetype matches the pages inside that
> pageblock.
> 

Thanks for explaining this so well, and for the code pointer!

Yeah, it doesn't seem reasonable to change the size of pageblock_flags at runtime.
> 
>>
>> [1] https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/pageblock-flags.h#L50
>>
>> I really want to avoid coming up with a solution that requires changing a Kconfig or needs
>> kernel commandline to change. It would mean a reboot whenever a different workload
>> runs on a server that works optimally with a different THP size, and that would make
>> workload orchestration a nightmare.
>>
> 
> As I said above, changing pageblock order at runtime might not be easy. But
> changing defragmentation granularity should be fine, since it just changes
> the range of memory compaction. That is the reason of my proposal,
> decoupling pageblock order from defragmentation granularity. We probably
> need to do some experiments to see the impact of the decoupling, as I
> imagine defragmenting a range smaller than pageblock order is fine, but
> defragmenting a range larger than pageblock order might cause issues
> if there is any unmovable pageblock within that range. Since it is very likely
> unmovable pages reside in an unmovable pageblock and lead to a defragmentation
> failure.
> 
>

I saw you mention a proposal to decouple pageblock order from defrag granularity
in one of the other replies as well; I just wanted to check whether there was
anything you had sent on lore, in terms of a proposal or RFC, that I could look at.

So I guess the question is what the next step should be? The following has been discussed:

- Changing pageblock_order at runtime: this seems unreasonable after Zi's explanation above
  and might have unintended consequences, so a no-go?
- Decouple only the watermark calculation and defrag granularity from pageblock order (also from Zi).
  The decoupling can be done separately: the watermark calculation can be decoupled using the
  approach taken in this RFC, although the max order used by the pagecache needs to be addressed.


> --
> Best Regards,
> Yan, Zi



* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-09 14:11           ` Usama Arif
@ 2025-06-09 14:16             ` Lorenzo Stoakes
  2025-06-09 14:37               ` Zi Yan
  2025-06-09 15:32             ` Zi Yan
  1 sibling, 1 reply; 32+ messages in thread
From: Lorenzo Stoakes @ 2025-06-09 14:16 UTC (permalink / raw)
  To: Usama Arif
  Cc: Zi Yan, david, Andrew Morton, linux-mm, hannes, shakeel.butt,
	riel, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
	hughd, linux-kernel, linux-doc, kernel-team, Juan Yescas,
	Breno Leitao

On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
>
>
> On 09/06/2025 14:19, Zi Yan wrote:
> > On 9 Jun 2025, at 7:13, Usama Arif wrote:
> >
> >> On 06/06/2025 17:10, Zi Yan wrote:
> >>> On 6 Jun 2025, at 11:38, Usama Arif wrote:
> >>>
> >>>> On 06/06/2025 16:18, Zi Yan wrote:
> >>>>> On 6 Jun 2025, at 10:37, Usama Arif wrote:
> >>>>>
> >>>>>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
> >>>>>> watermarks are evaluated to extremely high values, for e.g. a server with
> >>>>>> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
> >>>>>> of the sizes set to never, the min, low and high watermarks evaluate to
> >>>>>> 11.2G, 14G and 16.8G respectively.
> >>>>>> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
> >>>>>> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
> >>>>>> and 1G respectively.
> >>>>>> This is because set_recommended_min_free_kbytes is designed for PMD
> >>>>>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
> >>>>>> Such high watermark values can cause performance and latency issues in
> >>>>>> memory bound applications on arm servers that use 64K PAGE_SIZE, eventhough
> >>>>>> most of them would never actually use a 512M PMD THP.
> >>>>>>
> >>>>>> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
> >>>>>> folio order enabled in set_recommended_min_free_kbytes.
> >>>>>> With this patch, when only 2M THP hugepage size is set to madvise for the
> >>>>>> same machine with 64K page size, with the rest of the sizes set to never,
> >>>>>> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
> >>>>>> respectively. When 512M THP hugepage size is set to madvise for the same
> >>>>>> machine with 64K page size, the min, low and high watermarks evaluate to
> >>>>>> 11.2G, 14G and 16.8G respectively, the same as without this patch.
> >>>>>
> >>>>> Getting pageblock_order involved here might be confusing. I think you just
> >>>>> want to adjust min, low and high watermarks to reasonable values.
> >>>>> Is it OK to rename min_thp_pageblock_nr_pages to min_nr_free_pages_per_zone
> >>>>> and move MIGRATE_PCPTYPES * MIGRATE_PCPTYPES inside? Otherwise, the changes
> >>>>> look reasonable to me.
> >>>>
> >>>> Hi Zi,
> >>>>
> >>>> Thanks for the review!
> >>>>
> >>>> I forgot to change it in another place, sorry about that! So can't move
> >>>> MIGRATE_PCPTYPES * MIGRATE_PCPTYPES into the combined function.
> >>>> Have added the additional place where min_thp_pageblock_nr_pages() is called
> >>>> as a fixlet here:
> >>>> https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/
> >>>>
> >>>> I think atleast in this context the orginal name pageblock_nr_pages isn't
> >>>> correct as its min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER).
> >>>> The new name min_thp_pageblock_nr_pages is also not really good, so happy
> >>>> to change it to something appropriate.
> >>>
> >>> Got it. pageblock is the defragmentation granularity. If user only wants
> >>> 2MB mTHP, maybe pageblock order should be adjusted. Otherwise,
> >>> kernel will defragment at 512MB granularity, which might not be efficient.
> >>> Maybe make pageblock_order a boot time parameter?
> >>>
> >>> In addition, we are mixing two things together:
> >>> 1. min, low, and high watermarks: they affect when memory reclaim and compaction
> >>>    will be triggered;
> >>> 2. pageblock order: it is the granularity of defragmentation for creating
> >>>    mTHP/THP.
> >>>
> >>> In your use case, you want to lower watermarks, right? Considering what you
> >>> said below, I wonder if we want a way of enforcing vm.min_free_kbytes,
> >>> like a new sysctl knob, vm.force_min_free_kbytes (yeah the suggestion
> >>> is lame, sorry).
> >>>
> >>> I think for 2, we might want to decouple pageblock order from defragmentation
> >>> granularity.
> >>>
> >>
> >> This is a good point. I only did it for the watermarks in the RFC, but there
> >> is no reason that the defrag granularity is done in 512M chunks and is probably
> >> very inefficient to do so?
> >>
> >> Instead of replacing the pageblock_nr_pages for just set_recommended_min_free_kbytes,
> >> maybe we just need to change the definition of pageblock_order in [1] to take into
> >> account the highest large folio order enabled instead of HPAGE_PMD_ORDER?
> >
> > Ideally, yes. But pageblock migratetypes are stored in a fixed size array
> > determined by pageblock_order at boot time (see usemap_size() in mm/mm_init.c).
> > Changing pageblock_order at runtime means we will need to resize pageblock
> > migratetypes array, which is a little unrealistic. In a system with GBs or TBs
> > memory, reducing pageblock_order by 1 means doubling pageblock migratetypes
> > array and replicating one pageblock migratetypes to two; increasing pageblock
> > order by 1 means halving the array and splitting a pageblock into two.
> > The former, if memory is enough, might be easy, but the latter is a little
> > involved, since for a pageblock with both movable and unmovable pages,
> > you will need to check all pages to decide the migratetypes of the after-split
> > pageblocks to make sure pageblock migratetype matches the pages inside that
> > pageblock.
> >
>
> Thanks for explaining this so well and the code pointer!
>
> Yeah it doesnt seem reasonable to change the size of pageblock_flags at runtime.
> >
> >>
> >> [1] https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/pageblock-flags.h#L50
> >>
> >> I really want to avoid coming up with a solution that requires changing a Kconfig or needs
> >> kernel commandline to change. It would mean a reboot whenever a different workload
> >> runs on a server that works optimally with a different THP size, and that would make
> >> workload orchestration a nightmare.
> >>
> >
> > As I said above, changing pageblock order at runtime might not be easy. But
> > changing defragmentation granularity should be fine, since it just changes
> > the range of memory compaction. That is the reason of my proposal,
> > decoupling pageblock order from defragmentation granularity. We probably
> > need to do some experiments to see the impact of the decoupling, as I
> > imagine defragmenting a range smaller than pageblock order is fine, but
> > defragmenting a range larger than pageblock order might cause issues
> > if there is any unmovable pageblock within that range. Since it is very likely
> > unmovable pages reside in an unmovable pageblock and lead to a defragmentation
> > failure.
> >
> >
>
> I saw you mentioned of a proposal to decouple pageblock order from defrag granularity
> in one of the other replies as well, just wanted to check if there was anything you had
> sent in lore in terms of proposal or RFC that I could look at.
>
> So I guess the question is what should be the next step? The following has been discussed:
>
> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
>   and might have unintended consequences if done at runtime, so a no go?
> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
>   The decoupling can be done separately. Watermark calculation can be decoupled using the
>   approach taken in this RFC. Although max order used by pagecache needs to be addressed.
>

I need to catch up with the thread (workload crazy atm), but why isn't it
feasible to simply statically adjust the pageblock size?

The whole point of 'defragmentation' is to _heuristically_ make it less
likely there'll be fragmentation when requesting page blocks.

And the watermark code is explicitly about providing reserves at a
_pageblock granularity_.

Why would we want to 'defragment' to 512MB physically contiguous chunks
that we rarely use?

Since it's all heuristic, it seems reasonable to me to cap it at a sensible
level, no?

>
> > --
> > Best Regards,
> > Yan, Zi
>


* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-09 14:16             ` Lorenzo Stoakes
@ 2025-06-09 14:37               ` Zi Yan
  2025-06-09 14:50                 ` Lorenzo Stoakes
  0 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2025-06-09 14:37 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Usama Arif, david, Andrew Morton, linux-mm, hannes, shakeel.butt,
	riel, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
	hughd, linux-kernel, linux-doc, kernel-team, Juan Yescas,
	Breno Leitao

On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:

> On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
>>
>>
>> On 09/06/2025 14:19, Zi Yan wrote:
>>> On 9 Jun 2025, at 7:13, Usama Arif wrote:
>>>
>>>> On 06/06/2025 17:10, Zi Yan wrote:
>>>>> On 6 Jun 2025, at 11:38, Usama Arif wrote:
>>>>>
>>>>>> On 06/06/2025 16:18, Zi Yan wrote:
>>>>>>> On 6 Jun 2025, at 10:37, Usama Arif wrote:
>>>>>>>
>>>>>>>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
>>>>>>>> watermarks are evaluated to extremely high values, for e.g. a server with
>>>>>>>> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
>>>>>>>> of the sizes set to never, the min, low and high watermarks evaluate to
>>>>>>>> 11.2G, 14G and 16.8G respectively.
>>>>>>>> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
>>>>>>>> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
>>>>>>>> and 1G respectively.
>>>>>>>> This is because set_recommended_min_free_kbytes is designed for PMD
>>>>>>>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
>>>>>>>> Such high watermark values can cause performance and latency issues in
>>>>>>>> memory bound applications on arm servers that use 64K PAGE_SIZE, even though
>>>>>>>> most of them would never actually use a 512M PMD THP.
>>>>>>>>
>>>>>>>> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
>>>>>>>> folio order enabled in set_recommended_min_free_kbytes.
>>>>>>>> With this patch, when only 2M THP hugepage size is set to madvise for the
>>>>>>>> same machine with 64K page size, with the rest of the sizes set to never,
>>>>>>>> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>>>>>>>> respectively. When 512M THP hugepage size is set to madvise for the same
>>>>>>>> machine with 64K page size, the min, low and high watermarks evaluate to
>>>>>>>> 11.2G, 14G and 16.8G respectively, the same as without this patch.
>>>>>>>
>>>>>>> Getting pageblock_order involved here might be confusing. I think you just
>>>>>>> want to adjust min, low and high watermarks to reasonable values.
>>>>>>> Is it OK to rename min_thp_pageblock_nr_pages to min_nr_free_pages_per_zone
>>>>>>> and move MIGRATE_PCPTYPES * MIGRATE_PCPTYPES inside? Otherwise, the changes
>>>>>>> look reasonable to me.
>>>>>>
>>>>>> Hi Zi,
>>>>>>
>>>>>> Thanks for the review!
>>>>>>
>>>>>> I forgot to change it in another place, sorry about that! So I can't move
>>>>>> MIGRATE_PCPTYPES * MIGRATE_PCPTYPES into the combined function.
>>>>>> Have added the additional place where min_thp_pageblock_nr_pages() is called
>>>>>> as a fixlet here:
>>>>>> https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/
>>>>>>
>>>>>> I think at least in this context the original name pageblock_nr_pages isn't
>>>>>> correct, as it's min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER).
>>>>>> The new name min_thp_pageblock_nr_pages is also not really good, so happy
>>>>>> to change it to something appropriate.
>>>>>
>>>>> Got it. pageblock is the defragmentation granularity. If user only wants
>>>>> 2MB mTHP, maybe pageblock order should be adjusted. Otherwise,
>>>>> kernel will defragment at 512MB granularity, which might not be efficient.
>>>>> Maybe make pageblock_order a boot time parameter?
>>>>>
>>>>> In addition, we are mixing two things together:
>>>>> 1. min, low, and high watermarks: they affect when memory reclaim and compaction
>>>>>    will be triggered;
>>>>> 2. pageblock order: it is the granularity of defragmentation for creating
>>>>>    mTHP/THP.
>>>>>
>>>>> In your use case, you want to lower watermarks, right? Considering what you
>>>>> said below, I wonder if we want a way of enforcing vm.min_free_kbytes,
>>>>> like a new sysctl knob, vm.force_min_free_kbytes (yeah the suggestion
>>>>> is lame, sorry).
>>>>>
>>>>> I think for 2, we might want to decouple pageblock order from defragmentation
>>>>> granularity.
>>>>>
>>>>
>>>> This is a good point. I only did it for the watermarks in the RFC, but there
>>>> is no reason that the defrag granularity is done in 512M chunks and is probably
>>>> very inefficient to do so?
>>>>
>>>> Instead of replacing the pageblock_nr_pages for just set_recommended_min_free_kbytes,
>>>> maybe we just need to change the definition of pageblock_order in [1] to take into
>>>> account the highest large folio order enabled instead of HPAGE_PMD_ORDER?
>>>
>>> Ideally, yes. But pageblock migratetypes are stored in a fixed size array
>>> determined by pageblock_order at boot time (see usemap_size() in mm/mm_init.c).
>>> Changing pageblock_order at runtime means we will need to resize pageblock
>>> migratetypes array, which is a little unrealistic. In a system with GBs or TBs
>>> memory, reducing pageblock_order by 1 means doubling pageblock migratetypes
>>> array and replicating one pageblock migratetypes to two; increasing pageblock
>>> order by 1 means halving the array and splitting a pageblock into two.
>>> The former, if memory is enough, might be easy, but the latter is a little
>>> involved, since for a pageblock with both movable and unmovable pages,
>>> you will need to check all pages to decide the migratetypes of the after-split
>>> pageblocks to make sure pageblock migratetype matches the pages inside that
>>> pageblock.
>>>
>>
>> Thanks for explaining this so well and the code pointer!
>>
>> Yeah, it doesn't seem reasonable to change the size of pageblock_flags at runtime.
>>>
>>>>
>>>> [1] https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/pageblock-flags.h#L50
>>>>
>>>> I really want to avoid coming up with a solution that requires changing a Kconfig or needs
>>>> kernel commandline to change. It would mean a reboot whenever a different workload
>>>> runs on a server that works optimally with a different THP size, and that would make
>>>> workload orchestration a nightmare.
>>>>
>>>
>>> As I said above, changing pageblock order at runtime might not be easy. But
>>> changing defragmentation granularity should be fine, since it just changes
>>> the range of memory compaction. That is the reason for my proposal,
>>> decoupling pageblock order from defragmentation granularity. We probably
>>> need to do some experiments to see the impact of the decoupling, as I
>>> imagine defragmenting a range smaller than pageblock order is fine, but
>>> defragmenting a range larger than pageblock order might cause issues
>>> if there is any unmovable pageblock within that range, since it is very
>>> likely that unmovable pages reside in an unmovable pageblock and lead to a
>>> defragmentation failure.
>>>
>>>
>>
>> I saw you mentioned a proposal to decouple pageblock order from defrag granularity
>> in one of the other replies as well, just wanted to check if there was anything you had
>> sent in lore in terms of proposal or RFC that I could look at.
>>
>> So I guess the question is what should be the next step? The following has been discussed:
>>
>> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
>>   and might have unintended consequences if done at runtime, so a no go?
>> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
>>   The decoupling can be done separately. Watermark calculation can be decoupled using the
>>   approach taken in this RFC. Although max order used by pagecache needs to be addressed.
>>
>
> I need to catch up with the thread (workload crazy atm), but why isn't it
> feasible to simply statically adjust the pageblock size?
>
> The whole point of 'defragmentation' is to _heuristically_ make it less
> likely there'll be fragmentation when requesting page blocks.
>
> And the watermark code is explicitly about providing reserves at a
> _pageblock granularity_.
>
> Why would we want to 'defragment' to 512MB physically contiguous chunks
> that we rarely use?
>
> Since it's all heuristic, it seems reasonable to me to cap it at a sensible
> level no?

What is a sensible level? 2MB is a good starting point. If we cap pageblock
at 2MB, everyone should be happy at the moment. But if one user wants to
allocate 4MB mTHP, they will most likely fail miserably, because with a 2MB
pageblock the kernel is OK to have a 2MB MIGRATE_MOVABLE pageblock next to a
2MB MIGRATE_UNMOVABLE one, making defragmenting 4MB an impossible job.

Defragmentation has two components: 1) pageblock, which has migratetypes
to prevent mixing movable and unmovable pages, as a single unmovable page
blocks large free pages from being created; 2) memory compaction granularity,
which is the actual work to move pages around and form large free pages.
Currently, kernel assumes pageblock size = defragmentation granularity,
but in reality, as long as pageblock size >= defragmentation granularity,
memory compaction would still work, but not the other way around. So we
need to choose pageblock size carefully to not break memory compaction.

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-09 14:37               ` Zi Yan
@ 2025-06-09 14:50                 ` Lorenzo Stoakes
  2025-06-09 15:20                   ` Zi Yan
  0 siblings, 1 reply; 32+ messages in thread
From: Lorenzo Stoakes @ 2025-06-09 14:50 UTC (permalink / raw)
  To: Zi Yan
  Cc: Usama Arif, david, Andrew Morton, linux-mm, hannes, shakeel.butt,
	riel, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
	hughd, linux-kernel, linux-doc, kernel-team, Juan Yescas,
	Breno Leitao

On Mon, Jun 09, 2025 at 10:37:26AM -0400, Zi Yan wrote:
> On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:
>
> > On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:

[snip]

> >> So I guess the question is what should be the next step? The following has been discussed:
> >>
> >> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
> >>   and might have unintended consequences if done at runtime, so a no go?
> >> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
> >>   The decoupling can be done separately. Watermark calculation can be decoupled using the
> >>   approach taken in this RFC. Although max order used by pagecache needs to be addressed.
> >>
> >
> > I need to catch up with the thread (workload crazy atm), but why isn't it
> > feasible to simply statically adjust the pageblock size?
> >
> > The whole point of 'defragmentation' is to _heuristically_ make it less
> > likely there'll be fragmentation when requesting page blocks.
> >
> > And the watermark code is explicitly about providing reserves at a
> > _pageblock granularity_.
> >
> > Why would we want to 'defragment' to 512MB physically contiguous chunks
> > that we rarely use?
> >
> > Since it's all heuristic, it seems reasonable to me to cap it at a sensible
> > level no?
>
> What is a sensible level? 2MB is a good starting point. If we cap pageblock
> at 2MB, everyone should be happy at the moment. But if one user wants to
> allocate 4MB mTHP, they will most likely fail miserably, because pageblock
> is 2MB, kernel is OK to have a 2MB MIGRATE_MOVABLE pageblock next to a 2MB
> MIGRATE_UNMOVABLE one, making defragmenting 4MB an impossible job.
>
> Defragmentation has two components: 1) pageblock, which has migratetypes
> to prevent mixing movable and unmovable pages, as a single unmovable page
> blocks large free pages from being created; 2) memory compaction granularity,
> which is the actual work to move pages around and form large free pages.
> Currently, kernel assumes pageblock size = defragmentation granularity,
> but in reality, as long as pageblock size >= defragmentation granularity,
> memory compaction would still work, but not the other way around. So we
> need to choose pageblock size carefully to not break memory compaction.

OK I get it - the issue is that compaction itself operates at a pageblock
granularity, and once you get so fragmented that compaction is critical to
defragmentation, you are stuck if the pageblock is not big enough.

Thing is, 512MB pageblock size for compaction seems insanely inefficient in
itself, and if we're complaining about issues with unavailable reserved
memory due to crazy PMD size, surely one will encounter the compaction
process simply failing to succeed/taking forever/causing issues with
reclaim/higher order folio allocation.

I mean, I don't really know the compaction code _at all_ (ran out of time
to cover it in the book ;), but is it all-or-nothing? Does it grab a pageblock or
give up?

Because it strikes me that a crazy pageblock size would cause really
serious system issues on that basis alone if that's the case.

And again this leads me back to thinking it should just be the page block
size _as a whole_ that should be adjusted.

Keep in mind a user can literally reduce the page block size already via
CONFIG_PAGE_BLOCK_MAX_ORDER.

To me it seems that we should cap it at the highest _reasonable_ mTHP size
you can get on a 64KB (i.e. maximum right? RIGHT? :P) base page size
system.

That way, people _can still get_ super huge PMD sized huge folios up to the
point of fragmentation.

If we do reduce things this way we should give a config option to allow
users who truly want colossal PMD sizes with associated
watermarks/compaction to be able to still have it.

CONFIG_PAGE_BLOCK_HARD_LIMIT_MB or something?
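
Purely as a sketch of what I mean (the symbol name, range and defaults are
all made up, and it would need to be reconciled with the existing
CONFIG_PAGE_BLOCK_MAX_ORDER rather than duplicate it):

```
config PAGE_BLOCK_HARD_LIMIT_MB
	int "Hard cap on pageblock size (MB)"
	range 2 512
	default 2 if ARM64_64K_PAGES
	default 512
	help
	  Cap the pageblock size used for anti-fragmentation grouping,
	  compaction and watermark sizing even when HPAGE_PMD_SIZE is
	  larger. Raise this if you really do want colossal PMD-sized
	  THPs together with the associated watermarks.
```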

I also question this de-coupling in general (I may be missing something
however!) - the watermark code _very explicitly_ refers to providing
_pageblocks_ in order to ensure _defragmentation_ right?

We would need to absolutely justify why it's suddenly ok to not provide
page blocks here.

This is very very delicate code we have to be SO careful about.

This is why I am being cautious here :)

>
> Best Regards,
> Yan, Zi

Thanks!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-09 12:07   ` Usama Arif
  2025-06-09 12:12     ` Usama Arif
@ 2025-06-09 14:57     ` Lorenzo Stoakes
  1 sibling, 0 replies; 32+ messages in thread
From: Lorenzo Stoakes @ 2025-06-09 14:57 UTC (permalink / raw)
  To: Usama Arif
  Cc: ziy, Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, hughd,
	linux-kernel, linux-doc, kernel-team

On Mon, Jun 09, 2025 at 01:07:42PM +0100, Usama Arif wrote:
>
>
> On 07/06/2025 09:18, Lorenzo Stoakes wrote:
> > It's important to base against mm-new for new mm stuff, PAGE_BLOCK_ORDER got
> > renamed to PAGE_BLOCK_MAX_ORDER in Zi's series at [0] and this doesn't compile.
> >
> > Please always do a quick rebase + compile check before sending.
> >
> > [0]:  https://lkml.kernel.org/r/20250604211427.1590859-1-ziy@nvidia.com
> >
> > Overall this seems to me to be implemented at the wrong level of
> > abstraction - we implement set_recommended_min_free_kbytes() to interact
> > with the page block mechanism.
> >
> > While the problem you describe is absolutely a problem and we need to
> > figure out a way to avoid reserving ridiculous amounts of memory for higher
> > page tables, we surely need to figure this out at a page block granularity
> > don't we?
> >
>
> Yes agreed, Zi raised a good point in [1], and I think there is no reason to just
> do it for the watermarks; it should be done at pageblock order so that defrag
> also happens at 2M and not 512M with the example given in the commit message.
>
> [1] https://lore.kernel.org/all/c600a6c0-aa59-4896-9e0d-3649a32d1771@gmail.com/
>

Yeah right exactly.

These things are heuristic anyway, and so I think it's fine to add an
additional heuristic of 'well people are unlikely to use 512 MB PMD sized
huge pages in practice on 64 KB page size systems'.

And obviously you have rightly raised that - in practice - this is really
just causing an issue rather than serving the needs of users.


> > On Fri, Jun 06, 2025 at 03:37:00PM +0100, Usama Arif wrote:
> >> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
> >> watermarks are evaluated to extremely high values, for e.g. a server with
> >> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
> >> of the sizes set to never, the min, low and high watermarks evaluate to
> >> 11.2G, 14G and 16.8G respectively.
> >> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
> >> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
> >> and 1G respectively.
> >> This is because set_recommended_min_free_kbytes is designed for PMD
> >> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
> >
> > Right it is, but not this line really, the _pageblock order_ is set to be
> > the minimum of the huge page PMD order and PAGE_BLOCK_MAX_ORDER as it makes
> > sense to use page block heuristics to reduce the odds of fragmentation and
> > so we can grab a PMD huge page at a time.
> >
> > Obviously if the user wants to set a _smaller_ page block order they can,
> > but if it's larger we want to heuristically avoid fragmentation of
> > physically contiguous huge page aligned ranges (the whole page block
> > mechanism).
> >
> > I absolutely hate how set_recommended_min_free_kbytes() has basically
> > hacked in some THP considerations but otherwise invokes
> > calculate_min_free_kbytes()... ugh. But an existing issue.
> >
> >> Such high watermark values can cause performance and latency issues in
> >> memory bound applications on arm servers that use 64K PAGE_SIZE, even though
> >> most of them would never actually use a 512M PMD THP.
> >
> > 512MB, yeah crazy. We've not thought this through, and this is a very real
> > issue.
> >
> > Again, it strikes me that we should be changing the page block order for 64
> > KB arm64 rather than this calculation though.
> >
>
> yes agreed. I think changing pageblock_order is the right approach.

Thanks, yes!

>
> > Keep in mind pageblocks are a heuristic mechanism designed to reduce
> > fragmentation, the decision could be made to cap how far we're willing to
> > go with that...
> >
> >>
> >> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
> >> folio order enabled in set_recommended_min_free_kbytes.
> >> With this patch, when only 2M THP hugepage size is set to madvise for the
> >> same machine with 64K page size, with the rest of the sizes set to never,
> >> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
> >> respectively. When 512M THP hugepage size is set to madvise for the same
> >> machine with 64K page size, the min, low and high watermarks evaluate to
> >> 11.2G, 14G and 16.8G respectively, the same as without this patch.
> >
> > Hmm, but what happens if a user changes this live, does this get updated?
> >
> > OK I see it does via:
> >
> > sysfs stuff -> enabled_store() -> start_stop_khugepaged() -> set_recommended_min_free_kbytes()
> >
> > But don't we want to change this in general? Does somebody happening to
> > have 512MB THP at madvise or always suggest we want insane watermark
> > numbers?
>
> Unfortunately the answer to this is probably a lot of servers that use 64K
> page size do. You can see in [1] that if anyone hasn't actually configured
> the hugepage sizes via kernel commandline, and if the global policy is set
> to madvise or always, then 512M is inheriting madvise/always and they would
> have a very high watermark set. I don't think this behaviour is what most
> people are expecting.

Right, they will by default have this (they are silly to do so, and we are
silly to let them, but we've all said, I think without fail, that the THP
interface is flawed, so I won't belabour this :)

But again the 'T' in THP is transparent :) it's _trying_ to provide
PMD-sized folios, if it can.

But if it can't, then it can't, and that's ok.

>
> I actually think [1] should be wrapped in ifndef CONFIG_PAGE_SIZE_64KB,
> but it's always been the case that PMD is set to inherit, so probably
> shouldn't be wrapped.
>
> [1] https://elixir.bootlin.com/linux/v6.15.1/source/mm/huge_memory.c#L782

IDEALLY I'd rather not, we should be figuring out how to do this scalably
not relying on 'well if this page size or if that page size' it's kinda a
slippery slope.

Plus I think users will naturally assume the 'PMD sized' behaviour is
consistent regardless of base page size.

I mean I don't LOVE (understatement) this 'watermark calculation is
different if PMD-sized' stuff in general. That rather smacks of unscalable
hard-coding on its own.


>
> >
> > I'm not really convinced by this 'dynamic' aspect, you're changing global
> > watermark numbers and reserves _massively_ based on a 'maybe' use of
> > something that's meant to be transparent + best-effort...
> >
>
> If someone sets 512M to madvise/always, it brings back the watermarks to
> the levels without this patch.

Yeah sorry see my follow up on this, I was just mistaken, things are
already dynamic like this.

Yuck this interface.

>
> >>
> >> An alternative solution would be to change PAGE_BLOCK_ORDER by changing
> >> ARCH_FORCE_MAX_ORDER to a lower value for ARM64_64K_PAGES. However, this
> >> is not dynamic with hugepage size, will need different kernel builds for
> >> different hugepage sizes and most users won't know that this needs to be
> >> done as it can be difficult to determine that the performance and latency
> >> issues are coming from the high watermark values.
> >
> > Or, we could adjust pageblock_order accordingly in this instance no?
> >
> >>
> >> All watermark numbers are for zones of nodes that had the highest number
> >> of pages, i.e. the value for min size for 4K is obtained using:
> >> cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 4096 / 1024 / 1024}';
> >> and for 64K using:
> >> cat /proc/zoneinfo  | grep -i min | awk '{print $2}' | sort -n  | tail -n 1 | awk '{print $1 * 65536 / 1024 / 1024}';
> >>
> >> An arbitrary min of 128 pages is used for when no hugepage sizes are
> >> enabled.
> >
> > I don't think it's really okay to out and out add an arbitrary value like this
> > without explanation. This is basis for rejection of the patch already.
> >
>
> I just took 128 from calculate_min_free_kbytes, although I realize now that
> over there it's 128 kB, but here it will mean 128 pages = 128*64kB.
>
> I think maybe a better number is sqrt(lowmem_kbytes * 16) from calculate_min_free_kbytes.
>
> I can't see in the git history how 128 and the sqrt number were chosen in calculate_min_free_kbytes.

Yeah, the key thing is to provide a justification, hard-coded numbers like
this are scary, not least because people don't know why it is but are
scared to change it :P

>
> > That seems a little low too no?
> >
> > IMPORTANT: I'd really like to see some before/after numbers for 4k, 16k,
> > 64k with THP enabled/disabled so you can prove your patch isn't
> > fundamentally changing these values unexpectedly for users that aren't
> > using crazy page sizes.
> >
> >>
> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >> ---
> >>  include/linux/huge_mm.h | 25 +++++++++++++++++++++++++
> >>  mm/khugepaged.c         | 32 ++++++++++++++++++++++++++++----
> >>  mm/shmem.c              | 29 +++++------------------------
> >>  3 files changed, 58 insertions(+), 28 deletions(-)
> >>
> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> index 2f190c90192d..fb4e51ef0acb 100644
> >> --- a/include/linux/huge_mm.h
> >> +++ b/include/linux/huge_mm.h
> >> @@ -170,6 +170,25 @@ static inline void count_mthp_stat(int order, enum mthp_stat_item item)
> >>  }
> >>  #endif
> >>
> >> +/*
> >> + * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
> >> + *
> >> + * SHMEM_HUGE_NEVER:
> >> + *	disables huge pages for the mount;
> >> + * SHMEM_HUGE_ALWAYS:
> >> + *	enables huge pages for the mount;
> >> + * SHMEM_HUGE_WITHIN_SIZE:
> >> + *	only allocate huge pages if the page will be fully within i_size,
> >> + *	also respect madvise() hints;
> >> + * SHMEM_HUGE_ADVISE:
> >> + *	only allocate huge pages if requested with madvise();
> >> + */
> >> +
> >> + #define SHMEM_HUGE_NEVER	0
> >> + #define SHMEM_HUGE_ALWAYS	1
> >> + #define SHMEM_HUGE_WITHIN_SIZE	2
> >> + #define SHMEM_HUGE_ADVISE	3
> >> +
> >>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>
> >>  extern unsigned long transparent_hugepage_flags;
> >> @@ -177,6 +196,12 @@ extern unsigned long huge_anon_orders_always;
> >>  extern unsigned long huge_anon_orders_madvise;
> >>  extern unsigned long huge_anon_orders_inherit;
> >>
> >> +extern int shmem_huge __read_mostly;
> >> +extern unsigned long huge_shmem_orders_always;
> >> +extern unsigned long huge_shmem_orders_madvise;
> >> +extern unsigned long huge_shmem_orders_inherit;
> >> +extern unsigned long huge_shmem_orders_within_size;
> >> +
> >
> > Rather than exposing all of this shmem state as globals, can we not just have
> > shmem provide a function that grabs this information?
> >
> >>  static inline bool hugepage_global_enabled(void)
> >>  {
> >>  	return transparent_hugepage_flags &
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index 15203ea7d007..e64cba74eb2a 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -2607,6 +2607,26 @@ static int khugepaged(void *none)
> >>  	return 0;
> >>  }
> >>
> >
> >> +static int thp_highest_allowable_order(void)
> >
> > This absolutely needs a comment.
> >
> >> +{
> >> +	unsigned long orders = READ_ONCE(huge_anon_orders_always)
> >> +			       | READ_ONCE(huge_anon_orders_madvise)
> >> +			       | READ_ONCE(huge_shmem_orders_always)
> >> +			       | READ_ONCE(huge_shmem_orders_madvise)
> >> +			       | READ_ONCE(huge_shmem_orders_within_size);
> >
> > Same comment as above, have shmem export this.
> >
> >> +	if (hugepage_global_enabled())
> >> +		orders |= READ_ONCE(huge_anon_orders_inherit);
> >> +	if (shmem_huge != SHMEM_HUGE_NEVER)
> >> +		orders |= READ_ONCE(huge_shmem_orders_inherit);
> >> +
> >> +	return orders == 0 ? 0 : fls(orders) - 1;
> >> +}
> >> +
> >> +static unsigned long min_thp_pageblock_nr_pages(void)
> >
> > I really really hate this name. This isn't the number of pageblock pages
> > any more, this is something else? You're not changing the page block size
> > right?
> >
>
> I don't like it either :)
>
> As I mentioned in reply to David now in [1], pageblock_nr_pages is not really
> 1 << PAGE_BLOCK_ORDER but is 1 << min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) when
> THP is enabled.

Yuuuuck.

>
> It needs a better name, but I think the right approach is just to change
> pageblock_order as recommended in [2]
>
> [1] https://lore.kernel.org/all/4adf1f8b-781d-4ab0-b82e-49795ad712cb@gmail.com/

Right yeah (I assume you typo'd 2 and you mean the 1 above? :P).

>
> >> +{
> >> +	return (1UL << min(thp_highest_allowable_order(), PAGE_BLOCK_ORDER));
> >> +}
> >> +
> >>  static void set_recommended_min_free_kbytes(void)
> >>  {
> >>  	struct zone *zone;
> >> @@ -2638,12 +2658,16 @@ static void set_recommended_min_free_kbytes(void)
> >
> > You provide a 'patchlet' in
> > https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/
> >
> > That also does:
> >
> >         /* Ensure 2 pageblocks are free to assist fragmentation avoidance */
> > -       recommended_min = pageblock_nr_pages * nr_zones * 2;
> > +       recommended_min = min_thp_pageblock_nr_pages() * nr_zones * 2;
> >
> > So comment here - this comment is now incorrect, this isn't 2 page blocks,
> > it's 2 of 'sub-pageblock size as if page blocks were dynamically altered by
> > always/madvise THP size'.
> >
> > Again, this whole thing strikes me as we're doing things at the wrong level
> > of abstraction.
> >
> > And you're definitely now not helping avoid pageblock-sized
> > fragmentation. You're accepting that you need less so... why not reduce
> > pageblock size? :)
> >
> > 	/*
> > 	 * Make sure that on average at least two pageblocks are almost free
> > 	 * of another type, one for a migratetype to fall back to and a
> >
> > ^ remainder of comment
> >
> >>  	 * second to avoid subsequent fallbacks of other types There are 3
> >>  	 * MIGRATE_TYPES we care about.
> >>  	 */
> >> -	recommended_min += pageblock_nr_pages * nr_zones *
> >> +	recommended_min += min_thp_pageblock_nr_pages() * nr_zones *
> >>  			   MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
> >
> > This just seems wrong now and contradicts the comment - you're setting
> > minimum pages based on migrate PCP types that operate at pageblock order
> > but without reference to the actual number of page block pages?
> >
> > So the comment is just wrong now? 'make sure there are at least two
> > pageblocks', well this isn't what you're doing is it? So why then are we
> > making reference to PCP counts etc.?
> >
> > This seems like we're essentially just tuning these numbers somewhat
> > arbitrarily to reduce them?
> >
> >>
> >> -	/* don't ever allow to reserve more than 5% of the lowmem */
> >> -	recommended_min = min(recommended_min,
> >> -			      (unsigned long) nr_free_buffer_pages() / 20);
> >> +	/*
> >> +	 * Don't ever allow to reserve more than 5% of the lowmem.
> >> +	 * Use a min of 128 pages when all THP orders are set to never.
> >
> > Why? Did you just choose this number out of the blue?
> >
> > Previously, on x86-64 with thp -> never on everything a pageblock order-9
> > wouldn't this be a much higher value?
> >
> > I mean just putting '128' here is not acceptable. It needs to be justified
> > (even if empirically with data to back it) and defined as a named thing.
> >
> >
> >> +	 */
> >> +	recommended_min = clamp(recommended_min, 128,
> >> +				(unsigned long) nr_free_buffer_pages() / 20);
> >> +
> >>  	recommended_min <<= (PAGE_SHIFT-10);
> >>
> >>  	if (recommended_min > min_free_kbytes) {
> >> diff --git a/mm/shmem.c b/mm/shmem.c
> >> index 0c5fb4ffa03a..8e92678d1175 100644
> >> --- a/mm/shmem.c
> >> +++ b/mm/shmem.c
> >> @@ -136,10 +136,10 @@ struct shmem_options {
> >>  };
> >>
> >>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >> -static unsigned long huge_shmem_orders_always __read_mostly;
> >> -static unsigned long huge_shmem_orders_madvise __read_mostly;
> >> -static unsigned long huge_shmem_orders_inherit __read_mostly;
> >> -static unsigned long huge_shmem_orders_within_size __read_mostly;
> >> +unsigned long huge_shmem_orders_always __read_mostly;
> >> +unsigned long huge_shmem_orders_madvise __read_mostly;
> >> +unsigned long huge_shmem_orders_inherit __read_mostly;
> >> +unsigned long huge_shmem_orders_within_size __read_mostly;
> >
> > Again, we really shouldn't need to do this.
> >
> >>  static bool shmem_orders_configured __initdata;
> >>  #endif
> >>
> >> @@ -516,25 +516,6 @@ static bool shmem_confirm_swap(struct address_space *mapping,
> >>  	return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
> >>  }
> >>
> >> -/*
> >> - * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
> >> - *
> >> - * SHMEM_HUGE_NEVER:
> >> - *	disables huge pages for the mount;
> >> - * SHMEM_HUGE_ALWAYS:
> >> - *	enables huge pages for the mount;
> >> - * SHMEM_HUGE_WITHIN_SIZE:
> >> - *	only allocate huge pages if the page will be fully within i_size,
> >> - *	also respect madvise() hints;
> >> - * SHMEM_HUGE_ADVISE:
> >> - *	only allocate huge pages if requested with madvise();
> >> - */
> >> -
> >> -#define SHMEM_HUGE_NEVER	0
> >> -#define SHMEM_HUGE_ALWAYS	1
> >> -#define SHMEM_HUGE_WITHIN_SIZE	2
> >> -#define SHMEM_HUGE_ADVISE	3
> >> -
> >
> > Again we really shouldn't need to do this, just provide some function from
> > shmem that gives you what you need.
> >
> >>  /*
> >>   * Special values.
> >>   * Only can be set via /sys/kernel/mm/transparent_hugepage/shmem_enabled:
> >> @@ -551,7 +532,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
> >>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>  /* ifdef here to avoid bloating shmem.o when not necessary */
> >>
> >> -static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
> >> +int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
> >
> > Same comment.
> >
> >>  static int tmpfs_huge __read_mostly = SHMEM_HUGE_NEVER;
> >>
> >>  /**
> >> --
> >> 2.47.1
> >>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-09 12:12     ` Usama Arif
@ 2025-06-09 14:58       ` Lorenzo Stoakes
  0 siblings, 0 replies; 32+ messages in thread
From: Lorenzo Stoakes @ 2025-06-09 14:58 UTC (permalink / raw)
  To: Usama Arif
  Cc: ziy, Andrew Morton, david, linux-mm, hannes, shakeel.butt, riel,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, hughd,
	linux-kernel, linux-doc, kernel-team

On Mon, Jun 09, 2025 at 01:12:25PM +0100, Usama Arif wrote:
>
> > I don't like it either :)
> >
>
> Pressed "Ctrl+enter" instead of "enter" by mistake, which sent the email prematurely :)
> Adding replies to the rest of the comments in this email.

We've all been there :)

>
> As I mentioned in reply to David now in [1], pageblock_nr_pages is not really
> 1 << PAGE_BLOCK_ORDER but is 1 << min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER) when
> THP is enabled.
>
> It needs a better name, but I think the right approach is just to change
> pageblock_order as recommended in [2]
>
> [1] https://lore.kernel.org/all/4adf1f8b-781d-4ab0-b82e-49795ad712cb@gmail.com/
> [2] https://lore.kernel.org/all/c600a6c0-aa59-4896-9e0d-3649a32d1771@gmail.com/
>
Replied there.

>
> >
> >>> +{
> >>> +	return (1UL << min(thp_highest_allowable_order(), PAGE_BLOCK_ORDER));
> >>> +}
> >>> +
> >>>  static void set_recommended_min_free_kbytes(void)
> >>>  {
> >>>  	struct zone *zone;
> >>> @@ -2638,12 +2658,16 @@ static void set_recommended_min_free_kbytes(void)
> >>
> >> You provide a 'patchlet' in
> >> https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/
> >>
> >> That also does:
> >>
> >>         /* Ensure 2 pageblocks are free to assist fragmentation avoidance */
> >> -       recommended_min = pageblock_nr_pages * nr_zones * 2;
> >> +       recommended_min = min_thp_pageblock_nr_pages() * nr_zones * 2;
> >>
> >> So comment here - this comment is now incorrect, this isn't 2 page blocks,
> >> it's 2 of 'sub-pageblock size as if page blocks were dynamically altered by
> >> always/madvise THP size'.
> >>
> >> Again, this whole thing strikes me as we're doing things at the wrong level
> >> of abstraction.
> >>
> >> And you're definitely now not helping avoid pageblock-sized
> >> fragmentation. You're accepting that you need less so... why not reduce
> >> pageblock size? :)
> >>
>
> Yes agreed.

:)

>
> >> 	/*
> >> 	 * Make sure that on average at least two pageblocks are almost free
> >> 	 * of another type, one for a migratetype to fall back to and a
> >>
> >> ^ remainder of comment
> >>
> >>>  	 * second to avoid subsequent fallbacks of other types There are 3
> >>>  	 * MIGRATE_TYPES we care about.
> >>>  	 */
> >>> -	recommended_min += pageblock_nr_pages * nr_zones *
> >>> +	recommended_min += min_thp_pageblock_nr_pages() * nr_zones *
> >>>  			   MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
> >>
> >> This just seems wrong now and contradicts the comment - you're setting
> >> minimum pages based on migrate PCP types that operate at pageblock order
> >> but without reference to the actual number of page block pages?
> >>
> >> So the comment is just wrong now? 'make sure there are at least two
> >> pageblocks', well this isn't what you're doing is it? So why there are we
> >> making reference to PCP counts etc.?
> >>
> >> This seems like we're essentially just tuning these numbers somewhat
> >> arbitrarily to reduce them?
> >>
> >>>
> >>> -	/* don't ever allow to reserve more than 5% of the lowmem */
> >>> -	recommended_min = min(recommended_min,
> >>> -			      (unsigned long) nr_free_buffer_pages() / 20);
> >>> +	/*
> >>> +	 * Don't ever allow to reserve more than 5% of the lowmem.
> >>> +	 * Use a min of 128 pages when all THP orders are set to never.
> >>
> >> Why? Did you just choose this number out of the blue?
>
>
> Mentioned this in the previous comment.

Ack

> >>
> >> Previously, on x86-64 with thp -> never on everything and a pageblock order-9,
> >> wouldn't this be a much higher value?
> >>
> >> I mean just putting '128' here is not acceptable. It needs to be justified
> >> (even if empirically with data to back it) and defined as a named thing.
> >>
> >>
> >>> +	 */
> >>> +	recommended_min = clamp(recommended_min, 128,
> >>> +				(unsigned long) nr_free_buffer_pages() / 20);
> >>> +
> >>>  	recommended_min <<= (PAGE_SHIFT-10);
> >>>
> >>>  	if (recommended_min > min_free_kbytes) {
> >>> diff --git a/mm/shmem.c b/mm/shmem.c
> >>> index 0c5fb4ffa03a..8e92678d1175 100644
> >>> --- a/mm/shmem.c
> >>> +++ b/mm/shmem.c
> >>> @@ -136,10 +136,10 @@ struct shmem_options {
> >>>  };
> >>>
> >>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>> -static unsigned long huge_shmem_orders_always __read_mostly;
> >>> -static unsigned long huge_shmem_orders_madvise __read_mostly;
> >>> -static unsigned long huge_shmem_orders_inherit __read_mostly;
> >>> -static unsigned long huge_shmem_orders_within_size __read_mostly;
> >>> +unsigned long huge_shmem_orders_always __read_mostly;
> >>> +unsigned long huge_shmem_orders_madvise __read_mostly;
> >>> +unsigned long huge_shmem_orders_inherit __read_mostly;
> >>> +unsigned long huge_shmem_orders_within_size __read_mostly;
> >>
> >> Again, we really shouldn't need to do this.
>
> Agreed, for the RFC I just did it similarly to the anon ones when I got the build error
> trying to use these, but yeah a much better approach would be to just have a
> function in shmem that would return the largest allowable shmem THP order.

Ack, yeah it's fiddly but would be better this way.

>
>
> >>
> >>>  static bool shmem_orders_configured __initdata;
> >>>  #endif
> >>>
> >>> @@ -516,25 +516,6 @@ static bool shmem_confirm_swap(struct address_space *mapping,
> >>>  	return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
> >>>  }
> >>>
> >>> -/*
> >>> - * Definitions for "huge tmpfs": tmpfs mounted with the huge= option
> >>> - *
> >>> - * SHMEM_HUGE_NEVER:
> >>> - *	disables huge pages for the mount;
> >>> - * SHMEM_HUGE_ALWAYS:
> >>> - *	enables huge pages for the mount;
> >>> - * SHMEM_HUGE_WITHIN_SIZE:
> >>> - *	only allocate huge pages if the page will be fully within i_size,
> >>> - *	also respect madvise() hints;
> >>> - * SHMEM_HUGE_ADVISE:
> >>> - *	only allocate huge pages if requested with madvise();
> >>> - */
> >>> -
> >>> -#define SHMEM_HUGE_NEVER	0
> >>> -#define SHMEM_HUGE_ALWAYS	1
> >>> -#define SHMEM_HUGE_WITHIN_SIZE	2
> >>> -#define SHMEM_HUGE_ADVISE	3
> >>> -
> >>
> >> Again we really shouldn't need to do this, just provide some function from
> >> shmem that gives you what you need.
> >>
> >>>  /*
> >>>   * Special values.
> >>>   * Only can be set via /sys/kernel/mm/transparent_hugepage/shmem_enabled:
> >>> @@ -551,7 +532,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
> >>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>  /* ifdef here to avoid bloating shmem.o when not necessary */
> >>>
> >>> -static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
> >>> +int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
> >>
> >> Same comment.
> >>
> >>>  static int tmpfs_huge __read_mostly = SHMEM_HUGE_NEVER;
> >>>
> >>>  /**
> >>> --
> >>> 2.47.1
> >>>
> >
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-09 14:50                 ` Lorenzo Stoakes
@ 2025-06-09 15:20                   ` Zi Yan
  2025-06-09 19:40                     ` Lorenzo Stoakes
  0 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2025-06-09 15:20 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Usama Arif, david, Andrew Morton, linux-mm, hannes, shakeel.butt,
	riel, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
	hughd, linux-kernel, linux-doc, kernel-team, Juan Yescas,
	Breno Leitao

On 9 Jun 2025, at 10:50, Lorenzo Stoakes wrote:

> On Mon, Jun 09, 2025 at 10:37:26AM -0400, Zi Yan wrote:
>> On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:
>>
>>> On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
>
> [snip]
>
>>>> So I guess the question is what should be the next step? The following has been discussed:
>>>>
>>>> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
>>>>   and might have unintended consequences if done at runtime, so a no go?
>>>> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
>>>>   The decoupling can be done separately. Watermark calculation can be decoupled using the
>>>>   approach taken in this RFC. Although max order used by pagecache needs to be addressed.
>>>>
>>>
>>> I need to catch up with the thread (workload crazy atm), but why isn't it
>>> feasible to simply statically adjust the pageblock size?
>>>
>>> The whole point of 'defragmentation' is to _heuristically_ make it less
>>> likely there'll be fragmentation when requesting page blocks.
>>>
>>> And the watermark code is explicitly about providing reserves at a
>>> _pageblock granularity_.
>>>
>>> Why would we want to 'defragment' to 512MB physically contiguous chunks
>>> that we rarely use?
>>>
>>> Since it's all heuristic, it seems reasonable to me to cap it at a sensible
>>> level no?
>>
>> What is a sensible level? 2MB is a good starting point. If we cap pageblock
>> at 2MB, everyone should be happy at the moment. But if one user wants to
>> allocate 4MB mTHP, they will most likely fail miserably, because pageblock
>> is 2MB, the kernel is OK to have a 2MB MIGRATE_MOVABLE pageblock next to a 2MB
>> MIGRATE_UNMOVABLE one, making defragmenting 4MB an impossible job.
>>
>> Defragmentation has two components: 1) pageblock, which has migratetypes
>> to prevent mixing movable and unmovable pages, as a single unmovable page
>> blocks large free pages from being created; 2) memory compaction granularity,
>> which is the actual work to move pages around and form large free pages.
>> Currently, kernel assumes pageblock size = defragmentation granularity,
>> but in reality, as long as pageblock size >= defragmentation granularity,
>> memory compaction would still work, but not the other way around. So we
>> need to choose pageblock size carefully to not break memory compaction.
>
> OK I get it - the issue is that compaction itself operates at a pageblock
> granularity, and once you get so fragmented that compaction is critical to
> defragmentation, you are stuck if the pageblock is not big enough.

Right.

>
> Thing is, 512MB pageblock size for compaction seems insanely inefficient in
> itself, and if we're complaining about issues with unavailable reserved
> memory due to crazy PMD size, surely one will encounter the compaction
> process simply failing to succeed/taking forever/causing issues with
> reclaim/higher order folio allocation.

Yep. Initially, we probably never thought PMD THP would be as large as
512MB.

>
> I mean, I don't really know the compaction code _at all_ (ran out of time
> to cover in the book ;), but is it all-or-nothing? Does it grab a pageblock or
> give up?

Compaction works on one pageblock at a time, trying to migrate in-use pages
within the pageblock away to create a free page for THP allocation.
It assumes PMD THP size is equal to pageblock size. It will keep working
until a PMD THP size free page is created. This is a very high level
description, omitting a lot of details like how to avoid excessive compaction
work, how to reduce compaction latency.

>
> Because it strikes me that a crazy pageblock size would cause really
> serious system issues on that basis alone if that's the case.
>
> And again this leads me back to thinking it should just be the page block
> size _as a whole_ that should be adjusted.
>
> Keep in mind a user can literally reduce the page block size already via
> CONFIG_PAGE_BLOCK_MAX_ORDER.
>
> To me it seems that we should cap it at the highest _reasonable_ mTHP size
> you can get on a 64KB (i.e. maximum right? RIGHT? :P) base page size
> system.
>
> That way, people _can still get_ super huge PMD sized huge folios up to the
> point of fragmentation.
>
> If we do reduce things this way we should give a config option to allow
> users who truly want colossal PMD sizes with associated
> watermarks/compaction to be able to still have it.
>
> CONFIG_PAGE_BLOCK_HARD_LIMIT_MB or something?

I agree with capping pageblock size at a highest reasonable mTHP size.
In case there is some user relying on this huge PMD THP, making
pageblock_order a boot-time variable might be a little better, since
they do not need to recompile the kernel for their need, assuming
distros will pick something like 2MB as the default pageblock size.

>
> I also question this de-coupling in general (I may be missing something
> however!) - the watermark code _very explicitly_ refers to providing
> _pageblocks_ in order to ensure _defragmentation_ right?

Yes. Since without enough free memory (bigger than a PMD THP),
memory compaction will just do useless work.

>
> We would need to absolutely justify why it's suddenly ok to not provide
> page blocks here.
>
> This is very very delicate code we have to be SO careful about.
>
> This is why I am being cautious here :)

Understood. In theory, we can associate watermarks with THP allowed orders
the other way around too, meaning if user lowers vm.min_free_kbytes,
all THP/mTHP sizes bigger than the watermark threshold are disabled
automatically. This could fix the memory compaction issues, but
that might also drive user crazy as they cannot use the THP sizes
they want.

Often, users just ask for an impossible combination: they
want to use all free memory, because they paid for it, and they
want THPs, because they want max performance. When PMD THP is
small like 2MB, the “unusable” free memory is not that noticeable,
but when PMD THP is as large as 512MB, users just cannot unsee it. :)


Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-09 14:11           ` Usama Arif
  2025-06-09 14:16             ` Lorenzo Stoakes
@ 2025-06-09 15:32             ` Zi Yan
  1 sibling, 0 replies; 32+ messages in thread
From: Zi Yan @ 2025-06-09 15:32 UTC (permalink / raw)
  To: Usama Arif
  Cc: lorenzo.stoakes, david, Andrew Morton, linux-mm, hannes,
	shakeel.butt, riel, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hughd, linux-kernel, linux-doc,
	kernel-team, Juan Yescas, Breno Leitao

On 9 Jun 2025, at 10:11, Usama Arif wrote:

> On 09/06/2025 14:19, Zi Yan wrote:
>> On 9 Jun 2025, at 7:13, Usama Arif wrote:
>>
>>> On 06/06/2025 17:10, Zi Yan wrote:
>>>> On 6 Jun 2025, at 11:38, Usama Arif wrote:
>>>>
>>>>> On 06/06/2025 16:18, Zi Yan wrote:
>>>>>> On 6 Jun 2025, at 10:37, Usama Arif wrote:
>>>>>>
>>>>>>> On arm64 machines with 64K PAGE_SIZE, the min_free_kbytes and hence the
>>>>>>> watermarks are evaluated to extremely high values, for e.g. a server with
>>>>>>> 480G of memory, only 2M mTHP hugepage size set to madvise, with the rest
>>>>>>> of the sizes set to never, the min, low and high watermarks evaluate to
>>>>>>> 11.2G, 14G and 16.8G respectively.
>>>>>>> In contrast for 4K PAGE_SIZE of the same machine, with only 2M THP hugepage
>>>>>>> size set to madvise, the min, low and high watermarks evaluate to 86M, 566M
>>>>>>> and 1G respectively.
>>>>>>> This is because set_recommended_min_free_kbytes is designed for PMD
>>>>>>> hugepages (pageblock_order = min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)).
>>>>>>> Such high watermark values can cause performance and latency issues in
>>>>>>> memory bound applications on arm servers that use 64K PAGE_SIZE, even though
>>>>>>> most of them would never actually use a 512M PMD THP.
>>>>>>>
>>>>>>> Instead of using HPAGE_PMD_ORDER for pageblock_order use the highest large
>>>>>>> folio order enabled in set_recommended_min_free_kbytes.
>>>>>>> With this patch, when only 2M THP hugepage size is set to madvise for the
>>>>>>> same machine with 64K page size, with the rest of the sizes set to never,
>>>>>>> the min, low and high watermarks evaluate to 2.08G, 2.6G and 3.1G
>>>>>>> respectively. When 512M THP hugepage size is set to madvise for the same
>>>>>>> machine with 64K page size, the min, low and high watermarks evaluate to
>>>>>>> 11.2G, 14G and 16.8G respectively, the same as without this patch.
>>>>>>
>>>>>> Getting pageblock_order involved here might be confusing. I think you just
>>>>>> want to adjust min, low and high watermarks to reasonable values.
>>>>>> Is it OK to rename min_thp_pageblock_nr_pages to min_nr_free_pages_per_zone
>>>>>> and move MIGRATE_PCPTYPES * MIGRATE_PCPTYPES inside? Otherwise, the changes
>>>>>> look reasonable to me.
>>>>>
>>>>> Hi Zi,
>>>>>
>>>>> Thanks for the review!
>>>>>
>>>>> I forgot to change it in another place, sorry about that! So can't move
>>>>> MIGRATE_PCPTYPES * MIGRATE_PCPTYPES into the combined function.
>>>>> Have added the additional place where min_thp_pageblock_nr_pages() is called
>>>>> as a fixlet here:
>>>>> https://lore.kernel.org/all/a179fd65-dc3f-4769-9916-3033497188ba@gmail.com/
>>>>>
>>>>> I think at least in this context the original name pageblock_nr_pages isn't
>>>>> correct, as it's min(HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER).
>>>>> The new name min_thp_pageblock_nr_pages is also not really good, so happy
>>>>> to change it to something appropriate.
>>>>
>>>> Got it. pageblock is the defragmentation granularity. If user only wants
>>>> 2MB mTHP, maybe pageblock order should be adjusted. Otherwise,
>>>> kernel will defragment at 512MB granularity, which might not be efficient.
>>>> Maybe make pageblock_order a boot time parameter?
>>>>
>>>> In addition, we are mixing two things together:
>>>> 1. min, low, and high watermarks: they affect when memory reclaim and compaction
>>>>    will be triggered;
>>>> 2. pageblock order: it is the granularity of defragmentation for creating
>>>>    mTHP/THP.
>>>>
>>>> In your use case, you want to lower watermarks, right? Considering what you
>>>> said below, I wonder if we want a way of enforcing vm.min_free_kbytes,
>>>> like a new sysctl knob, vm.force_min_free_kbytes (yeah the suggestion
>>>> is lame, sorry).
>>>>
>>>> I think for 2, we might want to decouple pageblock order from defragmentation
>>>> granularity.
>>>>
>>>
>>> This is a good point. I only did it for the watermarks in the RFC, but there
>>> is no reason for the defrag granularity to be 512M chunks, and it is probably
>>> very inefficient to do so.
>>>
>>> Instead of replacing the pageblock_nr_pages for just set_recommended_min_free_kbytes,
>>> maybe we just need to change the definition of pageblock_order in [1] to take into
>>> account the highest large folio order enabled instead of HPAGE_PMD_ORDER?
>>
>> Ideally, yes. But pageblock migratetypes are stored in a fixed size array
>> determined by pageblock_order at boot time (see usemap_size() in mm/mm_init.c).
>> Changing pageblock_order at runtime means we will need to resize pageblock
>> migratetypes array, which is a little unrealistic. In a system with GBs or TBs
>> memory, reducing pageblock_order by 1 means doubling pageblock migratetypes
>> array and replicating one pageblock migratetypes to two; increasing pageblock
>> order by 1 means halving the array and splitting a pageblock into two.
>> The former, if memory is enough, might be easy, but the latter is a little
>> involved, since for a pageblock with both movable and unmovable pages,
>> you will need to check all pages to decide the migratetypes of the after-split
>> pageblocks to make sure pageblock migratetype matches the pages inside that
>> pageblock.
>>
>
> Thanks for explaining this so well and the code pointer!
>
> Yeah it doesn't seem reasonable to change the size of pageblock_flags at runtime.

Sure. :)

>>
>>>
>>> [1] https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/pageblock-flags.h#L50
>>>
>>> I really want to avoid coming up with a solution that requires changing a Kconfig or needs
>>> kernel commandline to change. It would mean a reboot whenever a different workload
>>> runs on a server that works optimally with a different THP size, and that would make
>>> workload orchestration a nightmare.
>>>
>>
>> As I said above, changing pageblock order at runtime might not be easy. But
>> changing defragmentation granularity should be fine, since it just changes
>> the range of memory compaction. That is the reason for my proposal,
>> decoupling pageblock order from defragmentation granularity. We probably
>> need to do some experiments to see the impact of the decoupling, as I
>> imagine defragmenting a range smaller than pageblock order is fine, but
>> defragmenting a range larger than pageblock order might cause issues
>> if there is any unmovable pageblock within that range. Since it is very likely
>> unmovable pages reside in an unmovable pageblock and lead to a defragmentation
>> failure.
>>
>>
>
> I saw you mentioned of a proposal to decouple pageblock order from defrag granularity
> in one of the other replies as well, just wanted to check if there was anything you had
> sent in lore in terms of proposal or RFC that I could look at.

Not at the moment. I only discussed this with David recently at LSFMM.

>
> So I guess the question is what should be the next step? The following has been discussed:
>
> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
>   and might have unintended consequences if done at runtime, so a no go?

We can try making pageblock_order a boot time variable. At least setting it
to 2MB could lower the watermarks and at least 2MB mTHP would work. If user
wants 512MB THP, they should be warned about the watermark issue and
change pageblock_order on their own.

> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
>   The decoupling can be done separately. Watermark calculation can be decoupled using the
>   approach taken in this RFC. Although max order used by pagecache needs to be addressed.

In terms of watermark calculation, I wonder if we can disable mTHP/THP orders
that are impossible to create due to low watermarks (as I mentioned to Lorenzo
in another email), so that the kernel will not waste time creating these
mTHP/THPs.

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-09 15:20                   ` Zi Yan
@ 2025-06-09 19:40                     ` Lorenzo Stoakes
  2025-06-09 19:49                       ` Zi Yan
  0 siblings, 1 reply; 32+ messages in thread
From: Lorenzo Stoakes @ 2025-06-09 19:40 UTC (permalink / raw)
  To: Zi Yan
  Cc: Usama Arif, david, Andrew Morton, linux-mm, hannes, shakeel.butt,
	riel, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
	hughd, linux-kernel, linux-doc, kernel-team, Juan Yescas,
	Breno Leitao

On Mon, Jun 09, 2025 at 11:20:04AM -0400, Zi Yan wrote:
> On 9 Jun 2025, at 10:50, Lorenzo Stoakes wrote:
>
> > On Mon, Jun 09, 2025 at 10:37:26AM -0400, Zi Yan wrote:
> >> On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:
> >>
> >>> On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
> >
> > [snip]
> >
> >>>> So I guess the question is what should be the next step? The following has been discussed:
> >>>>
> >>>> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
> >>>>   and might have unintended consequences if done at runtime, so a no go?
> >>>> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
> >>>>   The decoupling can be done separately. Watermark calculation can be decoupled using the
> >>>>   approach taken in this RFC. Although max order used by pagecache needs to be addressed.
> >>>>
> >>>
> >>> I need to catch up with the thread (workload crazy atm), but why isn't it
> >>> feasible to simply statically adjust the pageblock size?
> >>>
> >>> The whole point of 'defragmentation' is to _heuristically_ make it less
> >>> likely there'll be fragmentation when requesting page blocks.
> >>>
> >>> And the watermark code is explicitly about providing reserves at a
> >>> _pageblock granularity_.
> >>>
> >>> Why would we want to 'defragment' to 512MB physically contiguous chunks
> >>> that we rarely use?
> >>>
> >>> Since it's all heuristic, it seems reasonable to me to cap it at a sensible
> >>> level no?
> >>
> >> What is a sensible level? 2MB is a good starting point. If we cap pageblock
> >> at 2MB, everyone should be happy at the moment. But if one user wants to
> >> allocate 4MB mTHP, they will most likely fail miserably, because pageblock
> >> is 2MB, the kernel is OK to have a 2MB MIGRATE_MOVABLE pageblock next to a 2MB
> >> MIGRATE_UNMOVABLE one, making defragmenting 4MB an impossible job.
> >>
> >> Defragmentation has two components: 1) pageblock, which has migratetypes
> >> to prevent mixing movable and unmovable pages, as a single unmovable page
> >> blocks large free pages from being created; 2) memory compaction granularity,
> >> which is the actual work to move pages around and form large free pages.
> >> Currently, kernel assumes pageblock size = defragmentation granularity,
> >> but in reality, as long as pageblock size >= defragmentation granularity,
> >> memory compaction would still work, but not the other way around. So we
> >> need to choose pageblock size carefully to not break memory compaction.
> >
> > OK I get it - the issue is that compaction itself operates at a pageblock
> > granularity, and once you get so fragmented that compaction is critical to
> > defragmentation, you are stuck if the pageblock is not big enough.
>
> Right.
>
> >
> > Thing is, 512MB pageblock size for compaction seems insanely inefficient in
> > itself, and if we're complaining about issues with unavailable reserved
> > memory due to crazy PMD size, surely one will encounter the compaction
> > process simply failing to succeed/taking forever/causing issues with
> > reclaim/higher order folio allocation.
>
> Yep. Initially, we probably never thought PMD THP would be as large as
> 512MB.

Of course, such is the 'organic' nature of kernel development :)

>
> >
> > I mean, I don't really know the compaction code _at all_ (ran out of time
> > to cover in the book ;), but is it all-or-nothing? Does it grab a pageblock or
> > give up?
>
> Compaction works on one pageblock at a time, trying to migrate in-use pages
> within the pageblock away to create a free page for THP allocation.
> It assumes PMD THP size is equal to pageblock size. It will keep working
> until a PMD THP size free page is created. This is a very high level
> description, omitting a lot of details like how to avoid excessive compaction
> work, how to reduce compaction latency.

Yeah this matches my assumptions.

>
> >
> > Because it strikes me that a crazy pageblock size would cause really
> > serious system issues on that basis alone if that's the case.
> >
> > And again this leads me back to thinking it should just be the page block
> > size _as a whole_ that should be adjusted.
> >
> > Keep in mind a user can literally reduce the page block size already via
> > CONFIG_PAGE_BLOCK_MAX_ORDER.
> >
> > To me it seems that we should cap it at the highest _reasonable_ mTHP size
> > you can get on a 64KB (i.e. maximum right? RIGHT? :P) base page size
> > system.
> >
> > That way, people _can still get_ super huge PMD sized huge folios up to the
> > point of fragmentation.
> >
> > If we do reduce things this way we should give a config option to allow
> > users who truly want colossal PMD sizes with associated
> > watermarks/compaction to be able to still have it.
> >
> > CONFIG_PAGE_BLOCK_HARD_LIMIT_MB or something?
>
> I agree with capping pageblock size at a highest reasonable mTHP size.
> In case there is some user relying on this huge PMD THP, making
> pageblock_order a boot-time variable might be a little better, since
> they do not need to recompile the kernel for their need, assuming
> distros will pick something like 2MB as the default pageblock size.

Right, this seems sensible, as long as we set a _default_ that limits to
whatever it would be, 2MB or such.

I don't think it's unreasonable to make that change since this 512 MB thing
is so entirely unexpected and unusual.

I think Usama said it would be a pain working this way if it had to be
explicitly set as a boot-time variable without defaulting like this.

>
> >
> > I also question this de-coupling in general (I may be missing something
> > however!) - the watermark code _very explicitly_ refers to providing
> > _pageblocks_ in order to ensure _defragmentation_ right?
>
> Yes. Since without enough free memory (bigger than a PMD THP),
> memory compaction will just do useless work.

Yeah right, so this is a key thing and why we need to rework the current
state of the patch.

>
> >
> > We would need to absolutely justify why it's suddenly ok to not provide
> > page blocks here.
> >
> > This is very very delicate code we have to be SO careful about.
> >
> > This is why I am being cautious here :)
>
> Understood. In theory, we can associate watermarks with THP allowed orders
> the other way around too, meaning if user lowers vm.min_free_kbytes,
> all THP/mTHP sizes bigger than the watermark threshold are disabled
> automatically. This could fix the memory compaction issues, but
> that might also drive user crazy as they cannot use the THP sizes
> they want.

Yeah that's interesting but I think that's just far too subtle and people will
have no idea what's going on.

I really think a hard cap, expressed in KB/MB, on pageblock size is the way to
go (but overrideable for people crazy enough to truly want 512 MB pages - and
who cannot then complain about watermarks).

>
> Often, users just ask for an impossible combination: they
> want to use all free memory, because they paid for it, and they
> want THPs, because they want max performance. When PMD THP is
> small like 2MB, the “unusable” free memory is not that noticeable,
> but when PMD THP is as large as 512MB, users just cannot unsee it. :)

Well, users asking for crazy things then being surprised when they get them
is nothing new :P

>
>
> Best Regards,
> Yan, Zi

Thanks for your input!

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-09 19:40                     ` Lorenzo Stoakes
@ 2025-06-09 19:49                       ` Zi Yan
  2025-06-09 20:03                         ` Usama Arif
  2025-06-10 14:03                         ` Lorenzo Stoakes
  0 siblings, 2 replies; 32+ messages in thread
From: Zi Yan @ 2025-06-09 19:49 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Usama Arif, david, Andrew Morton, linux-mm, hannes, shakeel.butt,
	riel, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
	hughd, linux-kernel, linux-doc, kernel-team, Juan Yescas,
	Breno Leitao

On 9 Jun 2025, at 15:40, Lorenzo Stoakes wrote:

> On Mon, Jun 09, 2025 at 11:20:04AM -0400, Zi Yan wrote:
>> On 9 Jun 2025, at 10:50, Lorenzo Stoakes wrote:
>>
>>> On Mon, Jun 09, 2025 at 10:37:26AM -0400, Zi Yan wrote:
>>>> On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:
>>>>
>>>>> On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
>>>
>>> [snip]
>>>
>>>>>> So I guess the question is what should be the next step? The following has been discussed:
>>>>>>
>>>>>> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
>>>>>>   and might have unintended consequences if done at runtime, so a no go?
>>>>>> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
>>>>>>   The decoupling can be done separately. Watermark calculation can be decoupled using the
>>>>>>   approach taken in this RFC. Although max order used by pagecache needs to be addressed.
>>>>>>
>>>>>
>>>>> I need to catch up with the thread (workload crazy atm), but why isn't it
>>>>> feasible to simply statically adjust the pageblock size?
>>>>>
>>>>> The whole point of 'defragmentation' is to _heuristically_ make it less
>>>>> likely there'll be fragmentation when requesting page blocks.
>>>>>
>>>>> And the watermark code is explicitly about providing reserves at a
>>>>> _pageblock granularity_.
>>>>>
>>>>> Why would we want to 'defragment' to 512MB physically contiguous chunks
>>>>> that we rarely use?
>>>>>
>>>>> Since it's all heuristic, it seems reasonable to me to cap it at a sensible
>>>>> level no?
>>>>
>>>> What is a sensible level? 2MB is a good starting point. If we cap pageblock
>>>> at 2MB, everyone should be happy at the moment. But if one user wants to
>>>> allocate 4MB mTHP, they will most likely fail miserably, because pageblock
>>>> is 2MB, kernel is OK to have a 2MB MIGRATE_MOVABLE pageblock next to a 2MB
>>>> MGIRATE_UNMOVABLE one, making defragmenting 4MB an impossible job.
>>>>
>>>> Defragmentation has two components: 1) pageblock, which has migratetypes
>>>> to prevent mixing movable and unmovable pages, as a single unmovable page
>>>> blocks large free pages from being created; 2) memory compaction granularity,
>>>> which is the actual work to move pages around and form a large free pages.
>>>> Currently, kernel assumes pageblock size = defragmentation granularity,
>>>> but in reality, as long as pageblock size >= defragmentation granularity,
>>>> memory compaction would still work, but not the other way around. So we
>>>> need to choose pageblock size carefully to not break memory compaction.
>>>
>>> OK I get it - the issue is that compaction itself operations at a pageblock
>>> granularity, and once you get so fragmented that compaction is critical to
>>> defragmentation, you are stuck if the pageblock is not big enough.
>>
>> Right.
>>
>>>
>>> Thing is, 512MB pageblock size for compaction seems insanely inefficient in
>>> itself, and if we're complaining about issues with unavailable reserved
>>> memory due to crazy PMD size, surely one will encounter the compaction
>>> process simply failing to succeed/taking forever/causing issues with
>>> reclaim/higher order folio allocation.
>>
>> Yep. Initially, we probably never thought PMD THP would be as large as
>> 512MB.
>
> Of course, such is the 'organic' nature of kernel development :)
>
>>
>>>
>>> I mean, I don't really know the compaction code _at all_ (ran out of time
>>> to cover in book ;), but is it all or-nothing? Does it grab a pageblock or
>>> gives up?
>>
>> compaction works on one pageblock at a time, trying to migrate in-use pages
>> within the pageblock away to create a free page for THP allocation.
>> It assumes PMD THP size is equal to pageblock size. It will keep working
>> until a PMD THP size free page is created. This is a very high level
>> description, omitting a lot of details like how to avoid excessive compaction
>> work, how to reduce compaction latency.
>
> Yeah this matches my assumptions.
>
>>
>>>
>>> Because it strikes me that a crazy pageblock size would cause really
>>> serious system issues on that basis alone if that's the case.
>>>
>>> And again this leads me back to thinking it should just be the page block
>>> size _as a whole_ that should be adjusted.
>>>
>>> Keep in mind a user can literally reduce the page block size already via
>>> CONFIG_PAGE_BLOCK_MAX_ORDER.
>>>
>>> To me it seems that we should cap it at the highest _reasonable_ mTHP size
>>> you can get on a 64KB (i.e. maximum right? RIGHT? :P) base page size
>>> system.
>>>
>>> That way, people _can still get_ super huge PMD sized huge folios up to the
>>> point of fragmentation.
>>>
>>> If we do reduce things this way we should give a config option to allow
>>> users who truly want collosal PMD sizes with associated
>>> watermarks/compaction to be able to still have it.
>>>
>>> CONFIG_PAGE_BLOCK_HARD_LIMIT_MB or something?
>>
>> I agree with capping pageblock size at a highest reasonable mTHP size.
>> In case there is some user relying on this huge PMD THP, making
>> pageblock a boot time variable might be a little better, since
>> they do not need to recompile the kernel for their need, assuming
>> distros will pick something like 2MB as the default pageblock size.
>
> Right, this seems sensible, as long as we set a _default_ that limits to
> whatever it would be, 2MB or such.
>
> I don't think it's unreasonable to make that change since this 512 MB thing
> is so entirely unexpected and unusual.
>
> I think Usama said it would be a pain it working this way if it had to be
> explicitly set as a boot time variable without defaulting like this.
>
>>
>>>
>>> I also question this de-coupling in general (I may be missing somethig
>>> however!) - the watermark code _very explicitly_ refers to providing
>>> _pageblocks_ in order to ensure _defragmentation_ right?
>>
>> Yes. Since without enough free memory (bigger than a PMD THP),
>> memory compaction will just do useless work.
>
> Yeah right, so this is a key thing and why we need to rework the current
> state of the patch.
>
>>
>>>
>>> We would need to absolutely justify why it's suddenly ok to not provide
>>> page blocks here.
>>>
>>> This is very very delicate code we have to be SO careful about.
>>>
>>> This is why I am being cautious here :)
>>
>> Understood. In theory, we can associate watermarks with THP allowed orders
>> the other way around too, meaning if user lowers vm.min_free_kbytes,
>> all THP/mTHP sizes bigger than the watermark threshold are disabled
>> automatically. This could fix the memory compaction issues, but
>> that might also drive user crazy as they cannot use the THP sizes
>> they want.
>
> Yeah that's interesting but I think that's just far too subtle and people will
> have no idea what's going on.
>
> I really think a hard cap, expressed in KB/MB, on pageblock size is the way to
> go (but overrideable for people crazy enough to truly want 512 MB pages - and
> who cannot then complain about watermarks).

I agree. Basically, I am thinking:
1) use something like 2MB as the default pageblock size for all arches (the
value can be set differently if an arch wants a different pageblock size
for other reasons); this can be done by modifying PAGE_BLOCK_MAX_ORDER's
default value;

2) make pageblock_order a boot time parameter, so that users who want
512MB pages can still get them by changing the pageblock order at boot time.

WDYT?

>
>>
>> Often, user just ask for an impossible combination: they
>> want to use all free memory, because they paid for it, and they
>> want THPs, because they want max performance. When PMD THP is
>> small like 2MB, the “unusable” free memory is not that noticeable,
>> but when PMD THP is as large as 512MB, user just cannot unsee it. :)
>
> Well, users asking for crazy things then being surprised when they get them
> is nothing new :P
>
>>
>>
>> Best Regards,
>> Yan, Zi
>
> Thanks for your input!
>
> Cheers, Lorenzo


Best Regards,
Yan, Zi


* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-09 19:49                       ` Zi Yan
@ 2025-06-09 20:03                         ` Usama Arif
  2025-06-09 20:24                           ` Zi Yan
  2025-06-10 14:03                         ` Lorenzo Stoakes
  1 sibling, 1 reply; 32+ messages in thread
From: Usama Arif @ 2025-06-09 20:03 UTC (permalink / raw)
  To: Zi Yan, Lorenzo Stoakes
  Cc: david, Andrew Morton, linux-mm, hannes, shakeel.butt, riel,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, hughd,
	linux-kernel, linux-doc, kernel-team, Juan Yescas, Breno Leitao



On 09/06/2025 20:49, Zi Yan wrote:
> On 9 Jun 2025, at 15:40, Lorenzo Stoakes wrote:
> 
>> On Mon, Jun 09, 2025 at 11:20:04AM -0400, Zi Yan wrote:
>>> On 9 Jun 2025, at 10:50, Lorenzo Stoakes wrote:
>>>
>>>> On Mon, Jun 09, 2025 at 10:37:26AM -0400, Zi Yan wrote:
>>>>> On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:
>>>>>
>>>>>> On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
>>>>
>>>> [snip]
>>>>
>>>>>>> So I guess the question is what should be the next step? The following has been discussed:
>>>>>>>
>>>>>>> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
>>>>>>>   and might have unintended consequences if done at runtime, so a no go?
>>>>>>> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
>>>>>>>   The decoupling can be done separately. Watermark calculation can be decoupled using the
>>>>>>>   approach taken in this RFC. Although max order used by pagecache needs to be addressed.
>>>>>>>
>>>>>>
>>>>>> I need to catch up with the thread (workload crazy atm), but why isn't it
>>>>>> feasible to simply statically adjust the pageblock size?
>>>>>>
>>>>>> The whole point of 'defragmentation' is to _heuristically_ make it less
>>>>>> likely there'll be fragmentation when requesting page blocks.
>>>>>>
>>>>>> And the watermark code is explicitly about providing reserves at a
>>>>>> _pageblock granularity_.
>>>>>>
>>>>>> Why would we want to 'defragment' to 512MB physically contiguous chunks
>>>>>> that we rarely use?
>>>>>>
>>>>>> Since it's all heuristic, it seems reasonable to me to cap it at a sensible
>>>>>> level no?
>>>>>
>>>>> What is a sensible level? 2MB is a good starting point. If we cap pageblock
>>>>> at 2MB, everyone should be happy at the moment. But if one user wants to
>>>>> allocate 4MB mTHP, they will most likely fail miserably, because pageblock
>>>>> is 2MB, kernel is OK to have a 2MB MIGRATE_MOVABLE pageblock next to a 2MB
>>>>> MGIRATE_UNMOVABLE one, making defragmenting 4MB an impossible job.
>>>>>
>>>>> Defragmentation has two components: 1) pageblock, which has migratetypes
>>>>> to prevent mixing movable and unmovable pages, as a single unmovable page
>>>>> blocks large free pages from being created; 2) memory compaction granularity,
>>>>> which is the actual work to move pages around and form a large free pages.
>>>>> Currently, kernel assumes pageblock size = defragmentation granularity,
>>>>> but in reality, as long as pageblock size >= defragmentation granularity,
>>>>> memory compaction would still work, but not the other way around. So we
>>>>> need to choose pageblock size carefully to not break memory compaction.
>>>>
>>>> OK I get it - the issue is that compaction itself operations at a pageblock
>>>> granularity, and once you get so fragmented that compaction is critical to
>>>> defragmentation, you are stuck if the pageblock is not big enough.
>>>
>>> Right.
>>>
>>>>
>>>> Thing is, 512MB pageblock size for compaction seems insanely inefficient in
>>>> itself, and if we're complaining about issues with unavailable reserved
>>>> memory due to crazy PMD size, surely one will encounter the compaction
>>>> process simply failing to succeed/taking forever/causing issues with
>>>> reclaim/higher order folio allocation.
>>>
>>> Yep. Initially, we probably never thought PMD THP would be as large as
>>> 512MB.
>>
>> Of course, such is the 'organic' nature of kernel development :)
>>
>>>
>>>>
>>>> I mean, I don't really know the compaction code _at all_ (ran out of time
>>>> to cover in book ;), but is it all or-nothing? Does it grab a pageblock or
>>>> gives up?
>>>
>>> compaction works on one pageblock at a time, trying to migrate in-use pages
>>> within the pageblock away to create a free page for THP allocation.
>>> It assumes PMD THP size is equal to pageblock size. It will keep working
>>> until a PMD THP size free page is created. This is a very high level
>>> description, omitting a lot of details like how to avoid excessive compaction
>>> work, how to reduce compaction latency.
>>
>> Yeah this matches my assumptions.
>>
>>>
>>>>
>>>> Because it strikes me that a crazy pageblock size would cause really
>>>> serious system issues on that basis alone if that's the case.
>>>>
>>>> And again this leads me back to thinking it should just be the page block
>>>> size _as a whole_ that should be adjusted.
>>>>
>>>> Keep in mind a user can literally reduce the page block size already via
>>>> CONFIG_PAGE_BLOCK_MAX_ORDER.
>>>>
>>>> To me it seems that we should cap it at the highest _reasonable_ mTHP size
>>>> you can get on a 64KB (i.e. maximum right? RIGHT? :P) base page size
>>>> system.
>>>>
>>>> That way, people _can still get_ super huge PMD sized huge folios up to the
>>>> point of fragmentation.
>>>>
>>>> If we do reduce things this way we should give a config option to allow
>>>> users who truly want collosal PMD sizes with associated
>>>> watermarks/compaction to be able to still have it.
>>>>
>>>> CONFIG_PAGE_BLOCK_HARD_LIMIT_MB or something?
>>>
>>> I agree with capping pageblock size at a highest reasonable mTHP size.
>>> In case there is some user relying on this huge PMD THP, making
>>> pageblock a boot time variable might be a little better, since
>>> they do not need to recompile the kernel for their need, assuming
>>> distros will pick something like 2MB as the default pageblock size.
>>
>> Right, this seems sensible, as long as we set a _default_ that limits to
>> whatever it would be, 2MB or such.
>>
>> I don't think it's unreasonable to make that change since this 512 MB thing
>> is so entirely unexpected and unusual.
>>
>> I think Usama said it would be a pain it working this way if it had to be
>> explicitly set as a boot time variable without defaulting like this.
>>
>>>
>>>>
>>>> I also question this de-coupling in general (I may be missing somethig
>>>> however!) - the watermark code _very explicitly_ refers to providing
>>>> _pageblocks_ in order to ensure _defragmentation_ right?
>>>
>>> Yes. Since without enough free memory (bigger than a PMD THP),
>>> memory compaction will just do useless work.
>>
>> Yeah right, so this is a key thing and why we need to rework the current
>> state of the patch.
>>
>>>
>>>>
>>>> We would need to absolutely justify why it's suddenly ok to not provide
>>>> page blocks here.
>>>>
>>>> This is very very delicate code we have to be SO careful about.
>>>>
>>>> This is why I am being cautious here :)
>>>
>>> Understood. In theory, we can associate watermarks with THP allowed orders
>>> the other way around too, meaning if user lowers vm.min_free_kbytes,
>>> all THP/mTHP sizes bigger than the watermark threshold are disabled
>>> automatically. This could fix the memory compaction issues, but
>>> that might also drive user crazy as they cannot use the THP sizes
>>> they want.
>>
>> Yeah that's interesting but I think that's just far too subtle and people will
>> have no idea what's going on.
>>
>> I really think a hard cap, expressed in KB/MB, on pageblock size is the way to
>> go (but overrideable for people crazy enough to truly want 512 MB pages - and
>> who cannot then complain about watermarks).
> 
> I agree. Basically, I am thinking:
> 1) use something like 2MB as default pageblock size for all arch (the value can
> be set differently if some arch wants a different pageblock size due to other reasons), this can be done by modifying PAGE_BLOCK_MAX_ORDER’s default
> value;
> 
> 2) make pageblock_order a boot time parameter, so that user who wants
> 512MB pages can still get it by changing pageblock order at boot time.
> 
> WDYT?
> 

I was really hoping we would come up with a dynamic way of doing this,
especially one that doesn't require any more input from the user apart
from just setting the mTHP size via sysfs.

1) is, in a way, already done. We can set it to 2M by setting
ARCH_FORCE_MAX_ORDER to 5:

In arch/arm64/Kconfig we already have:

config ARCH_FORCE_MAX_ORDER
	int
	default "13" if ARM64_64K_PAGES
	default "11" if ARM64_16K_PAGES
	default "10"
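
To sanity-check the arithmetic (a quick illustration, not kernel code): with
64K base pages, order 13 covers 512MB, which is exactly the span of a PMD
entry (64K/8 = 8192 = 2^13 PTEs per table), while order 5 caps allocations
at 2MB:

```python
PAGE_64K = 64 * 1024

# Order 13 (the current ARM64_64K_PAGES default) allows 512MB allocations,
# the same span a PMD entry covers with 8-byte PTEs: (64K/8) * 64K.
assert PAGE_64K << 13 == 512 * 1024 * 1024
assert (PAGE_64K // 8) * PAGE_64K == 512 * 1024 * 1024

# Order 5 would cap buddy allocations at 2MB.
assert PAGE_64K << 5 == 2 * 1024 * 1024
```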

Doing 2) would require a reboot, and doing that just to change the mTHP
size will probably be a nightmare for workload orchestration.



* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-09 20:03                         ` Usama Arif
@ 2025-06-09 20:24                           ` Zi Yan
  2025-06-10 10:41                             ` Usama Arif
  0 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2025-06-09 20:24 UTC (permalink / raw)
  To: Usama Arif
  Cc: Lorenzo Stoakes, david, Andrew Morton, linux-mm, hannes,
	shakeel.butt, riel, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hughd, linux-kernel, linux-doc,
	kernel-team, Juan Yescas, Breno Leitao

On 9 Jun 2025, at 16:03, Usama Arif wrote:

> On 09/06/2025 20:49, Zi Yan wrote:
>> On 9 Jun 2025, at 15:40, Lorenzo Stoakes wrote:
>>
>>> On Mon, Jun 09, 2025 at 11:20:04AM -0400, Zi Yan wrote:
>>>> On 9 Jun 2025, at 10:50, Lorenzo Stoakes wrote:
>>>>
>>>>> On Mon, Jun 09, 2025 at 10:37:26AM -0400, Zi Yan wrote:
>>>>>> On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:
>>>>>>
>>>>>>> On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
>>>>>
>>>>> [snip]
>>>>>
>>>>>>>> So I guess the question is what should be the next step? The following has been discussed:
>>>>>>>>
>>>>>>>> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
>>>>>>>>   and might have unintended consequences if done at runtime, so a no go?
>>>>>>>> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
>>>>>>>>   The decoupling can be done separately. Watermark calculation can be decoupled using the
>>>>>>>>   approach taken in this RFC. Although max order used by pagecache needs to be addressed.
>>>>>>>>
>>>>>>>
>>>>>>> I need to catch up with the thread (workload crazy atm), but why isn't it
>>>>>>> feasible to simply statically adjust the pageblock size?
>>>>>>>
>>>>>>> The whole point of 'defragmentation' is to _heuristically_ make it less
>>>>>>> likely there'll be fragmentation when requesting page blocks.
>>>>>>>
>>>>>>> And the watermark code is explicitly about providing reserves at a
>>>>>>> _pageblock granularity_.
>>>>>>>
>>>>>>> Why would we want to 'defragment' to 512MB physically contiguous chunks
>>>>>>> that we rarely use?
>>>>>>>
>>>>>>> Since it's all heuristic, it seems reasonable to me to cap it at a sensible
>>>>>>> level no?
>>>>>>
>>>>>> What is a sensible level? 2MB is a good starting point. If we cap pageblock
>>>>>> at 2MB, everyone should be happy at the moment. But if one user wants to
>>>>>> allocate 4MB mTHP, they will most likely fail miserably, because pageblock
>>>>>> is 2MB, kernel is OK to have a 2MB MIGRATE_MOVABLE pageblock next to a 2MB
>>>>>> MGIRATE_UNMOVABLE one, making defragmenting 4MB an impossible job.
>>>>>>
>>>>>> Defragmentation has two components: 1) pageblock, which has migratetypes
>>>>>> to prevent mixing movable and unmovable pages, as a single unmovable page
>>>>>> blocks large free pages from being created; 2) memory compaction granularity,
>>>>>> which is the actual work to move pages around and form a large free pages.
>>>>>> Currently, kernel assumes pageblock size = defragmentation granularity,
>>>>>> but in reality, as long as pageblock size >= defragmentation granularity,
>>>>>> memory compaction would still work, but not the other way around. So we
>>>>>> need to choose pageblock size carefully to not break memory compaction.
>>>>>
>>>>> OK I get it - the issue is that compaction itself operations at a pageblock
>>>>> granularity, and once you get so fragmented that compaction is critical to
>>>>> defragmentation, you are stuck if the pageblock is not big enough.
>>>>
>>>> Right.
>>>>
>>>>>
>>>>> Thing is, 512MB pageblock size for compaction seems insanely inefficient in
>>>>> itself, and if we're complaining about issues with unavailable reserved
>>>>> memory due to crazy PMD size, surely one will encounter the compaction
>>>>> process simply failing to succeed/taking forever/causing issues with
>>>>> reclaim/higher order folio allocation.
>>>>
>>>> Yep. Initially, we probably never thought PMD THP would be as large as
>>>> 512MB.
>>>
>>> Of course, such is the 'organic' nature of kernel development :)
>>>
>>>>
>>>>>
>>>>> I mean, I don't really know the compaction code _at all_ (ran out of time
>>>>> to cover in book ;), but is it all or-nothing? Does it grab a pageblock or
>>>>> gives up?
>>>>
>>>> compaction works on one pageblock at a time, trying to migrate in-use pages
>>>> within the pageblock away to create a free page for THP allocation.
>>>> It assumes PMD THP size is equal to pageblock size. It will keep working
>>>> until a PMD THP size free page is created. This is a very high level
>>>> description, omitting a lot of details like how to avoid excessive compaction
>>>> work, how to reduce compaction latency.
>>>
>>> Yeah this matches my assumptions.
>>>
>>>>
>>>>>
>>>>> Because it strikes me that a crazy pageblock size would cause really
>>>>> serious system issues on that basis alone if that's the case.
>>>>>
>>>>> And again this leads me back to thinking it should just be the page block
>>>>> size _as a whole_ that should be adjusted.
>>>>>
>>>>> Keep in mind a user can literally reduce the page block size already via
>>>>> CONFIG_PAGE_BLOCK_MAX_ORDER.
>>>>>
>>>>> To me it seems that we should cap it at the highest _reasonable_ mTHP size
>>>>> you can get on a 64KB (i.e. maximum right? RIGHT? :P) base page size
>>>>> system.
>>>>>
>>>>> That way, people _can still get_ super huge PMD sized huge folios up to the
>>>>> point of fragmentation.
>>>>>
>>>>> If we do reduce things this way we should give a config option to allow
>>>>> users who truly want collosal PMD sizes with associated
>>>>> watermarks/compaction to be able to still have it.
>>>>>
>>>>> CONFIG_PAGE_BLOCK_HARD_LIMIT_MB or something?
>>>>
>>>> I agree with capping pageblock size at a highest reasonable mTHP size.
>>>> In case there is some user relying on this huge PMD THP, making
>>>> pageblock a boot time variable might be a little better, since
>>>> they do not need to recompile the kernel for their need, assuming
>>>> distros will pick something like 2MB as the default pageblock size.
>>>
>>> Right, this seems sensible, as long as we set a _default_ that limits to
>>> whatever it would be, 2MB or such.
>>>
>>> I don't think it's unreasonable to make that change since this 512 MB thing
>>> is so entirely unexpected and unusual.
>>>
>>> I think Usama said it would be a pain it working this way if it had to be
>>> explicitly set as a boot time variable without defaulting like this.
>>>
>>>>
>>>>>
>>>>> I also question this de-coupling in general (I may be missing somethig
>>>>> however!) - the watermark code _very explicitly_ refers to providing
>>>>> _pageblocks_ in order to ensure _defragmentation_ right?
>>>>
>>>> Yes. Since without enough free memory (bigger than a PMD THP),
>>>> memory compaction will just do useless work.
>>>
>>> Yeah right, so this is a key thing and why we need to rework the current
>>> state of the patch.
>>>
>>>>
>>>>>
>>>>> We would need to absolutely justify why it's suddenly ok to not provide
>>>>> page blocks here.
>>>>>
>>>>> This is very very delicate code we have to be SO careful about.
>>>>>
>>>>> This is why I am being cautious here :)
>>>>
>>>> Understood. In theory, we can associate watermarks with THP allowed orders
>>>> the other way around too, meaning if user lowers vm.min_free_kbytes,
>>>> all THP/mTHP sizes bigger than the watermark threshold are disabled
>>>> automatically. This could fix the memory compaction issues, but
>>>> that might also drive user crazy as they cannot use the THP sizes
>>>> they want.
>>>
>>> Yeah that's interesting but I think that's just far too subtle and people will
>>> have no idea what's going on.
>>>
>>> I really think a hard cap, expressed in KB/MB, on pageblock size is the way to
>>> go (but overrideable for people crazy enough to truly want 512 MB pages - and
>>> who cannot then complain about watermarks).
>>
>> I agree. Basically, I am thinking:
>> 1) use something like 2MB as default pageblock size for all arch (the value can
>> be set differently if some arch wants a different pageblock size due to other reasons), this can be done by modifying PAGE_BLOCK_MAX_ORDER’s default
>> value;
>>
>> 2) make pageblock_order a boot time parameter, so that user who wants
>> 512MB pages can still get it by changing pageblock order at boot time.
>>
>> WDYT?
>>
>
> I was really hoping we would come up with a dynamic way of doing this,
> especially one that doesn't require any more input from the user apart
> from just setting the mTHP size via sysfs..

Then we will need to get rid of pageblock size from both the watermark
calculation and memory compaction, and think about a new anti-fragmentation
mechanism to handle unmovable pages, as the current pageblock-based
mechanism would no longer fit the need.

What you are expecting is:
1) watermarks should change as the largest enabled THP/mTHP size changes;
2) memory compaction targets the largest enabled THP/mTHP size (a next step
would be improving memory compaction to optimize for all enabled sizes);
3) partitions of movable and unmovable pages can change dynamically
based on the largest enabled THP/mTHP size;
4) pageblock size becomes irrelevant.
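
Point 1) is roughly what this RFC prototypes; a loose sketch of the shape of
it (illustrative names and a simplified formula, not
set_recommended_min_free_kbytes() itself):

```python
def recommended_min_free_kb(nr_zones, defrag_order, page_size_kb):
    # Reserve a couple of defrag-granularity chunks per zone, plus extra
    # per pair of migratetypes being kept apart -- simplified from the
    # real kernel formula, which additionally clamps the result.
    chunk_kb = (1 << defrag_order) * page_size_kb
    migrate_pcptypes = 3  # unmovable, movable, reclaimable
    return nr_zones * chunk_kb * (2 + migrate_pcptypes * migrate_pcptypes)

# With 64K pages, dropping the granularity from the 512MB PMD order (13)
# to 2MB (order 5) shrinks this simplified recommendation by a factor of
# 256; the real code also clamps the value, so actual watermarks shrink
# less, as the numbers quoted in this thread show.
```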

>
> 1) in a way is already done. We can set it to 2M by setting
> ARCH_FORCE_MAX_ORDER to 5:
>
> In arch/arm64/Kconfig we already have:
>
> config ARCH_FORCE_MAX_ORDER
> 	int
> 	default "13" if ARM64_64K_PAGES
> 	default "11" if ARM64_16K_PAGES
> 	default "10"

Nah, that means users can no longer allocate pages larger than 2MB,
because the cap is in the buddy allocator.

>
> Doing 2) would require reboot and doing this just for changing mTHP size
> will probably be a nightmare for workload orchestration.

No, that is not what I mean. A pageblock_order set at boot time only limits
the largest mTHP size. By default, users can get up to 2MB THP/mTHP,
but if they want 512MB THP, they can reboot with a larger pageblock
order and still use 2MB mTHP as well. The downside is that with a larger
pageblock order, users cannot get the optimal THP/mTHP performance the
kernel is designed to achieve.

Best Regards,
Yan, Zi


* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-09 20:24                           ` Zi Yan
@ 2025-06-10 10:41                             ` Usama Arif
  0 siblings, 0 replies; 32+ messages in thread
From: Usama Arif @ 2025-06-10 10:41 UTC (permalink / raw)
  To: Zi Yan
  Cc: Lorenzo Stoakes, david, Andrew Morton, linux-mm, hannes,
	shakeel.butt, riel, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hughd, linux-kernel, linux-doc,
	kernel-team, Juan Yescas, Breno Leitao



On 09/06/2025 21:24, Zi Yan wrote:
> On 9 Jun 2025, at 16:03, Usama Arif wrote:
> 
>> On 09/06/2025 20:49, Zi Yan wrote:
>>> On 9 Jun 2025, at 15:40, Lorenzo Stoakes wrote:
>>>
>>>> On Mon, Jun 09, 2025 at 11:20:04AM -0400, Zi Yan wrote:
>>>>> On 9 Jun 2025, at 10:50, Lorenzo Stoakes wrote:
>>>>>
>>>>>> On Mon, Jun 09, 2025 at 10:37:26AM -0400, Zi Yan wrote:
>>>>>>> On 9 Jun 2025, at 10:16, Lorenzo Stoakes wrote:
>>>>>>>
>>>>>>>> On Mon, Jun 09, 2025 at 03:11:27PM +0100, Usama Arif wrote:
>>>>>>
>>>>>> [snip]
>>>>>>
>>>>>>>>> So I guess the question is what should be the next step? The following has been discussed:
>>>>>>>>>
>>>>>>>>> - Changing pageblock_order at runtime: This seems unreasonable after Zi's explanation above
>>>>>>>>>   and might have unintended consequences if done at runtime, so a no go?
>>>>>>>>> - Decouple only watermark calculation and defrag granularity from pageblock order (also from Zi).
>>>>>>>>>   The decoupling can be done separately. Watermark calculation can be decoupled using the
>>>>>>>>>   approach taken in this RFC. Although max order used by pagecache needs to be addressed.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I need to catch up with the thread (workload crazy atm), but why isn't it
>>>>>>>> feasible to simply statically adjust the pageblock size?
>>>>>>>>
>>>>>>>> The whole point of 'defragmentation' is to _heuristically_ make it less
>>>>>>>> likely there'll be fragmentation when requesting page blocks.
>>>>>>>>
>>>>>>>> And the watermark code is explicitly about providing reserves at a
>>>>>>>> _pageblock granularity_.
>>>>>>>>
>>>>>>>> Why would we want to 'defragment' to 512MB physically contiguous chunks
>>>>>>>> that we rarely use?
>>>>>>>>
>>>>>>>> Since it's all heuristic, it seems reasonable to me to cap it at a sensible
>>>>>>>> level no?
>>>>>>>
>>>>>>> What is a sensible level? 2MB is a good starting point. If we cap pageblock
>>>>>>> at 2MB, everyone should be happy at the moment. But if one user wants to
>>>>>>> allocate 4MB mTHP, they will most likely fail miserably, because pageblock
>>>>>>> is 2MB, kernel is OK to have a 2MB MIGRATE_MOVABLE pageblock next to a 2MB
>>>>>>> MGIRATE_UNMOVABLE one, making defragmenting 4MB an impossible job.
>>>>>>>
>>>>>>> Defragmentation has two components: 1) pageblock, which has migratetypes
>>>>>>> to prevent mixing movable and unmovable pages, as a single unmovable page
>>>>>>> blocks large free pages from being created; 2) memory compaction granularity,
>>>>>>> which is the actual work to move pages around and form a large free pages.
>>>>>>> Currently, kernel assumes pageblock size = defragmentation granularity,
>>>>>>> but in reality, as long as pageblock size >= defragmentation granularity,
>>>>>>> memory compaction would still work, but not the other way around. So we
>>>>>>> need to choose pageblock size carefully to not break memory compaction.
>>>>>>
>>>>>> OK I get it - the issue is that compaction itself operations at a pageblock
>>>>>> granularity, and once you get so fragmented that compaction is critical to
>>>>>> defragmentation, you are stuck if the pageblock is not big enough.
>>>>>
>>>>> Right.
>>>>>
>>>>>>
>>>>>> Thing is, 512MB pageblock size for compaction seems insanely inefficient in
>>>>>> itself, and if we're complaining about issues with unavailable reserved
>>>>>> memory due to crazy PMD size, surely one will encounter the compaction
>>>>>> process simply failing to succeed/taking forever/causing issues with
>>>>>> reclaim/higher order folio allocation.
>>>>>
>>>>> Yep. Initially, we probably never thought PMD THP would be as large as
>>>>> 512MB.
>>>>
>>>> Of course, such is the 'organic' nature of kernel development :)
>>>>
>>>>>
>>>>>>
>>>>>> I mean, I don't really know the compaction code _at all_ (ran out of time
>>>>>> to cover in book ;), but is it all or-nothing? Does it grab a pageblock or
>>>>>> gives up?
>>>>>
>>>>> compaction works on one pageblock at a time, trying to migrate in-use pages
>>>>> within the pageblock away to create a free page for THP allocation.
>>>>> It assumes the PMD THP size is equal to the pageblock size. It will keep working
>>>>> until a PMD-THP-sized free page is created. This is a very high level
>>>>> description, omitting a lot of details like how to avoid excessive compaction
>>>>> work and how to reduce compaction latency.
>>>>
>>>> Yeah this matches my assumptions.
>>>>
>>>>>
>>>>>>
>>>>>> Because it strikes me that a crazy pageblock size would cause really
>>>>>> serious system issues on that basis alone if that's the case.
>>>>>>
>>>>>> And again this leads me back to thinking it should just be the page block
>>>>>> size _as a whole_ that should be adjusted.
>>>>>>
>>>>>> Keep in mind a user can literally reduce the page block size already via
>>>>>> CONFIG_PAGE_BLOCK_MAX_ORDER.
>>>>>>
>>>>>> To me it seems that we should cap it at the highest _reasonable_ mTHP size
>>>>>> you can get on a 64KB (i.e. maximum right? RIGHT? :P) base page size
>>>>>> system.
>>>>>>
>>>>>> That way, people _can still get_ super huge PMD sized huge folios up to the
>>>>>> point of fragmentation.
>>>>>>
>>>>>> If we do reduce things this way we should give a config option to allow
>>>>>> users who truly want colossal PMD sizes with associated
>>>>>> watermarks/compaction to be able to still have it.
>>>>>>
>>>>>> CONFIG_PAGE_BLOCK_HARD_LIMIT_MB or something?
>>>>>
>>>>> I agree with capping pageblock size at the highest reasonable mTHP size.
>>>>> In case there is some user relying on this huge PMD THP, making
>>>>> the pageblock size a boot time variable might be a little better, since
>>>>> they do not need to recompile the kernel for their need, assuming
>>>>> distros will pick something like 2MB as the default pageblock size.
>>>>
>>>> Right, this seems sensible, as long as we set a _default_ that limits to
>>>> whatever it would be, 2MB or such.
>>>>
>>>> I don't think it's unreasonable to make that change since this 512 MB thing
>>>> is so entirely unexpected and unusual.
>>>>
>>>> I think Usama said it would be a pain to work this way if it had to be
>>>> explicitly set as a boot time variable without defaulting like this.
>>>>
>>>>>
>>>>>>
>>>>>> I also question this de-coupling in general (I may be missing something
>>>>>> however!) - the watermark code _very explicitly_ refers to providing
>>>>>> _pageblocks_ in order to ensure _defragmentation_ right?
>>>>>
>>>>> Yes. Since without enough free memory (bigger than a PMD THP),
>>>>> memory compaction will just do useless work.
>>>>
>>>> Yeah right, so this is a key thing and why we need to rework the current
>>>> state of the patch.
>>>>
>>>>>
>>>>>>
>>>>>> We would need to absolutely justify why it's suddenly ok to not provide
>>>>>> page blocks here.
>>>>>>
>>>>>> This is very very delicate code we have to be SO careful about.
>>>>>>
>>>>>> This is why I am being cautious here :)
>>>>>
>>>>> Understood. In theory, we can associate watermarks with THP allowed orders
>>>>> the other way around too, meaning if the user lowers vm.min_free_kbytes,
>>>>> all THP/mTHP sizes bigger than the watermark threshold are disabled
>>>>> automatically. This could fix the memory compaction issues, but
>>>>> that might also drive users crazy as they cannot use the THP sizes
>>>>> they want.
>>>>
>>>> Yeah that's interesting but I think that's just far too subtle and people will
>>>> have no idea what's going on.
>>>>
>>>> I really think a hard cap, expressed in KB/MB, on pageblock size is the way to
>>>> go (but overrideable for people crazy enough to truly want 512 MB pages - and
>>>> who cannot then complain about watermarks).
>>>
>>> I agree. Basically, I am thinking:
>>> 1) use something like 2MB as the default pageblock size for all arches (the
>>> value can be set differently if some arch wants a different pageblock size
>>> due to other reasons); this can be done by modifying PAGE_BLOCK_MAX_ORDER's
>>> default value;
>>>
>>> 2) make pageblock_order a boot time parameter, so that users who want
>>> 512MB pages can still get it by changing the pageblock order at boot time.
>>>
>>> WDYT?
>>>
>>
>> I was really hoping we would come up with a dynamic way of doing this,
>> especially one that doesn't require any more input from the user apart
>> from just setting the mTHP size via sysfs.
> 
> Then we will need to get rid of pageblock size from both the watermark
> calculation and memory compaction, and think about a new anti-fragmentation
> mechanism to handle unmovable pages, as the current pageblock-based mechanism
> no longer fits the need.
> 
> What you are expecting is:
> 1) watermarks should change as the largest enabled THP/mTHP size changes;
> 2) memory compaction targets the largest enabled THP/mTHP size (a next step
> would be to improve memory compaction to optimize for all enabled sizes);
> 3) partitions of movable and unmovable pages can change dynamically
> based on the largest enabled THP/mTHP size;
> 4) pageblock size becomes irrelevant.
> 

I think both 1 and 2 can be achieved in a similar way? i.e. changing
pageblock_order to be min(largest_enabled_thp_order(), PAGE_BLOCK_MAX_ORDER).
But there are a lot of instances of pageblock_order and pageblock_nr_pages, and
all of them would need to be audited very carefully.

For 3, we need to do the dynamic array resizing that you mentioned for
pageblock_flags?

Yeah overall it sounds like quite a big change and would need a lot
of testing to make sure nothing breaks.

>>
>> 1) in a way is already done. We can set it to 2M by setting
>> ARCH_FORCE_MAX_ORDER to 5:
>>
>> In arch/arm64/Kconfig we already have:
>>
>> config ARCH_FORCE_MAX_ORDER
>> 	int
>> 	default "13" if ARM64_64K_PAGES
>> 	default "11" if ARM64_16K_PAGES
>> 	default "10"
> 
> Nah, that means users can no longer allocate pages larger than 2MB,
> because the cap is in the buddy allocator.
> 
>>
>> Doing 2) would require reboot and doing this just for changing mTHP size
>> will probably be a nightmare for workload orchestration.
> 
> No. That is not what I mean. pageblock_order set at boot time only limits
> the largest mTHP size. By default, users can get up to 2MB THP/mTHP,
> but if they want to get 512MB THP, they can reboot with a larger pageblock
> order and still use 2MB mTHP. The downside is that with a
> larger pageblock order, users cannot get the optimal THP/mTHP performance
> the kernel is designed to achieve.
> 

Yes, I mean this as well. If the largest mTHP size enabled goes from 2M to
512M then we need a reboot to actually obtain 512M THPs. If we switch from 512M
to 2M, we again need a reboot to get the best performance out of the server.

> Best Regards,
> Yan, Zi


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-09 19:49                       ` Zi Yan
  2025-06-09 20:03                         ` Usama Arif
@ 2025-06-10 14:03                         ` Lorenzo Stoakes
  2025-06-10 14:20                           ` Zi Yan
  1 sibling, 1 reply; 32+ messages in thread
From: Lorenzo Stoakes @ 2025-06-10 14:03 UTC (permalink / raw)
  To: Zi Yan
  Cc: Usama Arif, david, Andrew Morton, linux-mm, hannes, shakeel.butt,
	riel, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
	hughd, linux-kernel, linux-doc, kernel-team, Juan Yescas,
	Breno Leitao

On Mon, Jun 09, 2025 at 03:49:52PM -0400, Zi Yan wrote:
[snip]
> > I really think a hard cap, expressed in KB/MB, on pageblock size is the way to
> > go (but overrideable for people crazy enough to truly want 512 MB pages - and
> > who cannot then complain about watermarks).
>
> I agree. Basically, I am thinking:
> 1) use something like 2MB as default pageblock size for all arch (the value can
> be set differently if some arch wants a different pageblock size due to other reasons), this can be done by modifying PAGE_BLOCK_MAX_ORDER’s default
> value;

I don't think we can set this using CONFIG_PAGE_BLOCK_MAX_ORDER.

Because the 'order' will be a different size depending on page size obviously.

So I'm not sure how this would achieve what we want?

It seems to me we should have CONFIG_PAGE_BLOCK_MAX_SIZE_MB or something like
this, and we take min(page_size << CONFIG_PAGE_BLOCK_MAX_ORDER,
CONFIG_PAGE_BLOCK_MAX_SIZE_MB << 20) as the size.

>
> 2) make pageblock_order a boot time parameter, so that user who wants
> 512MB pages can still get it by changing pageblock order at boot time.
>

Again, I don't think order is the right choice here, though having it boot time
configurable (perhaps overriding the default config there) seems sensible.

> WDYT?

>
> >
> >>
> >> Often, users just ask for an impossible combination: they
> >> want to use all free memory, because they paid for it, and they
> >> want THPs, because they want max performance. When PMD THP is
> >> small like 2MB, the “unusable” free memory is not that noticeable,
> >> but when PMD THP is as large as 512MB, users just cannot unsee it. :)
> >
> > Well, users asking for crazy things then being surprised when they get them
> > is nothing new :P
> >
> >>
> >>
> >> Best Regards,
> >> Yan, Zi
> >
> > Thanks for your input!
> >
> > Cheers, Lorenzo
>
>
> Best Regards,
> Yan, Zi


* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-10 14:03                         ` Lorenzo Stoakes
@ 2025-06-10 14:20                           ` Zi Yan
  2025-06-10 15:16                             ` Usama Arif
  0 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2025-06-10 14:20 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Usama Arif, david, Andrew Morton, linux-mm, hannes, shakeel.butt,
	riel, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
	hughd, linux-kernel, linux-doc, kernel-team, Juan Yescas,
	Breno Leitao

On 10 Jun 2025, at 10:03, Lorenzo Stoakes wrote:

> On Mon, Jun 09, 2025 at 03:49:52PM -0400, Zi Yan wrote:
> [snip]
>>> I really think a hard cap, expressed in KB/MB, on pageblock size is the way to
>>> go (but overrideable for people crazy enough to truly want 512 MB pages - and
>>> who cannot then complain about watermarks).
>>
>> I agree. Basically, I am thinking:
>> 1) use something like 2MB as default pageblock size for all arch (the value can
>> be set differently if some arch wants a different pageblock size due to other reasons), this can be done by modifying PAGE_BLOCK_MAX_ORDER’s default
>> value;
>
> I don't think we can set this using CONFIG_PAGE_BLOCK_MAX_ORDER.
>
> Because the 'order' will be a different size depending on page size obviously.
>
> So I'm not sure how this would achieve what we want?
>
> It seems to me we should have CONFIG_PAGE_BLOCK_MAX_SIZE_MB or something like
> this, and we take min(page_size << CONFIG_PAGE_BLOCK_MAX_ORDER,
> CONFIG_PAGE_BLOCK_MAX_SIZE_MB << 20) as the size.

OK. Now I get what you mean. Yeah, using MB is clearer as the user does not
need to know the page size to set the right pageblock size.

>
>>
>> 2) make pageblock_order a boot time parameter, so that user who wants
>> 512MB pages can still get it by changing pageblock order at boot time.
>>
>
> Again, I don't think order is the right choice here, though having it boot time
> configurable (perhaps overriding the default config there) seems sensible.

Understood. The new pageblock size should be set using MB.

>
>> WDYT?
>
>>
>>>
>>>>
>>>> Often, users just ask for an impossible combination: they
>>>> want to use all free memory, because they paid for it, and they
>>>> want THPs, because they want max performance. When PMD THP is
>>>> small like 2MB, the “unusable” free memory is not that noticeable,
>>>> but when PMD THP is as large as 512MB, users just cannot unsee it. :)
>>>
>>> Well, users asking for crazy things then being surprised when they get them
>>> is nothing new :P
>>>
>>>>
>>>>
>>>> Best Regards,
>>>> Yan, Zi
>>>
>>> Thanks for your input!
>>>
>>> Cheers, Lorenzo
>>
>>
>> Best Regards,
>> Yan, Zi


Best Regards,
Yan, Zi


* Re: [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes
  2025-06-10 14:20                           ` Zi Yan
@ 2025-06-10 15:16                             ` Usama Arif
  0 siblings, 0 replies; 32+ messages in thread
From: Usama Arif @ 2025-06-10 15:16 UTC (permalink / raw)
  To: Zi Yan, Lorenzo Stoakes
  Cc: david, Andrew Morton, linux-mm, hannes, shakeel.butt, riel,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, hughd,
	linux-kernel, linux-doc, kernel-team, Juan Yescas, Breno Leitao



On 10/06/2025 15:20, Zi Yan wrote:
> On 10 Jun 2025, at 10:03, Lorenzo Stoakes wrote:
> 
>> On Mon, Jun 09, 2025 at 03:49:52PM -0400, Zi Yan wrote:
>> [snip]
>>>> I really think a hard cap, expressed in KB/MB, on pageblock size is the way to
>>>> go (but overrideable for people crazy enough to truly want 512 MB pages - and
>>>> who cannot then complain about watermarks).
>>>
>>> I agree. Basically, I am thinking:
>>> 1) use something like 2MB as default pageblock size for all arch (the value can
>>> be set differently if some arch wants a different pageblock size due to other reasons), this can be done by modifying PAGE_BLOCK_MAX_ORDER’s default
>>> value;
>>
>> I don't think we can set this using CONFIG_PAGE_BLOCK_MAX_ORDER.
>>
>> Because the 'order' will be a different size depending on page size obviously.
>>
>> So I'm not sure how this would achieve what we want?
>>
>> It seems to me we should have CONFIG_PAGE_BLOCK_MAX_SIZE_MB or something like
>> this, and we take min(page_size << CONFIG_PAGE_BLOCK_MAX_ORDER,
>> CONFIG_PAGE_BLOCK_MAX_SIZE_MB << 20) as the size.
> 
> OK. Now I get what you mean. Yeah, using MB is clearer as the user does not
> need to know the page size to set the right pageblock size.
> 

Just adding it here for completeness, but we could probably do something like
the below, or use PAGE_SIZE_64KB instead of ARM64_64K_PAGES.
It will be messy though, as you would then need to do it for every arch and
every page size of that arch.


diff --git a/mm/Kconfig b/mm/Kconfig
index 99910bc649f6..ae83e31ea412 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1023,6 +1023,7 @@ config PAGE_BLOCK_MAX_ORDER
        default 10 if ARCH_FORCE_MAX_ORDER = 0
        range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
        default ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
+       default 5 if ARM64_64K_PAGES
        help
          The page block order refers to the power of two number of pages that
          are physically contiguous and can have a migrate type associated to
>>
>>>
>>> 2) make pageblock_order a boot time parameter, so that user who wants
>>> 512MB pages can still get it by changing pageblock order at boot time.
>>>
>>
>> Again, I don't think order is the right choice here, though having it boot time
>> configurable (perhaps overriding the default config there) seems sensible.
> 
> Understood. The new pageblock size should be set using MB.
> 
>>
>>> WDYT?
>>
>>>
>>>>
>>>>>
>>>>> Often, users just ask for an impossible combination: they
>>>>> want to use all free memory, because they paid for it, and they
>>>>> want THPs, because they want max performance. When PMD THP is
>>>>> small like 2MB, the “unusable” free memory is not that noticeable,
>>>>> but when PMD THP is as large as 512MB, users just cannot unsee it. :)
>>>>
>>>> Well, users asking for crazy things then being surprised when they get them
>>>> is nothing new :P
>>>>
>>>>>
>>>>>
>>>>> Best Regards,
>>>>> Yan, Zi
>>>>
>>>> Thanks for your input!
>>>>
>>>> Cheers, Lorenzo
>>>
>>>
>>> Best Regards,
>>> Yan, Zi
> 
> 
> Best Regards,
> Yan, Zi



end of thread, other threads:[~2025-06-10 15:16 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-06 14:37 [RFC] mm: khugepaged: use largest enabled hugepage order for min_free_kbytes Usama Arif
2025-06-06 15:01 ` Usama Arif
2025-06-06 15:18 ` Zi Yan
2025-06-06 15:38   ` Usama Arif
2025-06-06 16:10     ` Zi Yan
2025-06-07  8:35       ` Lorenzo Stoakes
2025-06-08  0:04         ` Zi Yan
2025-06-09 11:13       ` Usama Arif
2025-06-09 13:19         ` Zi Yan
2025-06-09 14:11           ` Usama Arif
2025-06-09 14:16             ` Lorenzo Stoakes
2025-06-09 14:37               ` Zi Yan
2025-06-09 14:50                 ` Lorenzo Stoakes
2025-06-09 15:20                   ` Zi Yan
2025-06-09 19:40                     ` Lorenzo Stoakes
2025-06-09 19:49                       ` Zi Yan
2025-06-09 20:03                         ` Usama Arif
2025-06-09 20:24                           ` Zi Yan
2025-06-10 10:41                             ` Usama Arif
2025-06-10 14:03                         ` Lorenzo Stoakes
2025-06-10 14:20                           ` Zi Yan
2025-06-10 15:16                             ` Usama Arif
2025-06-09 15:32             ` Zi Yan
2025-06-06 17:37 ` David Hildenbrand
2025-06-09 11:34   ` Usama Arif
2025-06-09 13:28     ` Zi Yan
2025-06-07  8:18 ` Lorenzo Stoakes
2025-06-07  8:44   ` Lorenzo Stoakes
2025-06-09 12:07   ` Usama Arif
2025-06-09 12:12     ` Usama Arif
2025-06-09 14:58       ` Lorenzo Stoakes
2025-06-09 14:57     ` Lorenzo Stoakes
