linux-mm.kvack.org archive mirror
* [PATCH v6] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
@ 2025-05-20 22:59 Juan Yescas
  2025-05-21  6:47 ` David Hildenbrand
  0 siblings, 1 reply; 4+ messages in thread
From: Juan Yescas @ 2025-05-20 22:59 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Juan Yescas, Zi Yan, linux-mm,
	linux-kernel
  Cc: tjmercier, isaacmanjarres, kaleshsingh, masahiroy, Minchan Kim

Problem: On large page size configurations (16KiB, 64KiB), the CMA
alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably,
and this causes the CMA reservations to be larger than necessary.
This means that the system will have fewer MIGRATE_UNMOVABLE and
MIGRATE_RECLAIMABLE page blocks available, since MIGRATE_CMA can't fall
back to them.

The CMA_MIN_ALIGNMENT_BYTES increases because it depends on
MAX_PAGE_ORDER, which in turn depends on ARCH_FORCE_MAX_ORDER. The
value of ARCH_FORCE_MAX_ORDER increases on 16KiB and 64KiB kernels.
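
For reference, the dependency chain looks roughly like this (sketch
based on include/linux/cma.h and include/linux/pageblock-flags.h; the
exact definitions may vary by kernel version):

/* include/linux/pageblock-flags.h (simplified) */
#define pageblock_nr_pages	(1UL << pageblock_order)

/* include/linux/cma.h (simplified) */
#define CMA_MIN_ALIGNMENT_PAGES	pageblock_nr_pages
#define CMA_MIN_ALIGNMENT_BYTES	(PAGE_SIZE * CMA_MIN_ALIGNMENT_PAGES)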

For example, in ARM, the CMA alignment requirement when:

- CONFIG_ARCH_FORCE_MAX_ORDER default value is used
- CONFIG_TRANSPARENT_HUGEPAGE is set:

PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
-----------------------------------------------------------------------
   4KiB   |      10        |      10         |  4KiB * (2 ^ 10) =   4MiB
  16KiB   |      11        |      11         | 16KiB * (2 ^ 11) =  32MiB
  64KiB   |      13        |      13         | 64KiB * (2 ^ 13) = 512MiB

There are some extreme cases for the CMA alignment requirement when:

- CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set
- CONFIG_TRANSPARENT_HUGEPAGE is NOT set:
- CONFIG_HUGETLB_PAGE is NOT set

PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order |  CMA_MIN_ALIGNMENT_BYTES
------------------------------------------------------------------------
   4KiB   |      15        |      15         |  4KiB * (2 ^ 15) = 128MiB
  16KiB   |      13        |      13         | 16KiB * (2 ^ 13) = 128MiB
  64KiB   |      13        |      13         | 64KiB * (2 ^ 13) = 512MiB

This affects the CMA reservations for the drivers. If a driver needs
4MiB of CMA memory on a 4KiB kernel, the minimal reservation on a
16KiB kernel has to be 32MiB due to the alignment requirements:

4KiB kernel:

reserved-memory {
    ...
    cma_test_reserve: cma_test_reserve {
        compatible = "shared-dma-pool";
        size = <0x0 0x400000>; /* 4 MiB */
        ...
    };
};

16KiB kernel:

reserved-memory {
    ...
    cma_test_reserve: cma_test_reserve {
        compatible = "shared-dma-pool";
        size = <0x0 0x2000000>; /* 32 MiB */
        ...
    };
};

Solution: Add a new config option, CONFIG_PAGE_BLOCK_ORDER, that
allows setting the page block order on all architectures. The
maximum page block order is bounded by ARCH_FORCE_MAX_ORDER.

By default, CONFIG_PAGE_BLOCK_ORDER has the same value as
ARCH_FORCE_MAX_ORDER. This makes sure that current kernel
configurations won't be affected by this change. It is an
opt-in change.

This patch allows large page size kernels (16KiB, 64KiB) to have
the same CMA alignment requirements as 4KiB kernels by setting a
lower pageblock_order.
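
For example, on a 16KiB kernel, a config along these lines (the values
are illustrative; ARCH_FORCE_MAX_ORDER comes from the architecture):

CONFIG_ARCH_FORCE_MAX_ORDER=11
CONFIG_PAGE_BLOCK_ORDER=7

keeps MAX_PAGE_ORDER at 11 for the buddy allocator while reducing
CMA_MIN_ALIGNMENT_BYTES to 16KiB * (2 ^ 7) = 2MiB, the same as on
4KiB kernels.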

Tests:

- Verified that HugeTLB pages work when pageblock_order is 1, 7, 10
on 4KiB and 16KiB kernels.

- Verified that Transparent Huge Pages work when pageblock_order
is 1, 7, 10 on 4KiB and 16KiB kernels.

- Verified that dma-buf heap allocations work when pageblock_order
is 1, 7, 10 on 4KiB and 16KiB kernels.

Benchmarks:

The benchmarks compare 16KiB kernels with pageblock_order 10 and 7.
pageblock_order 7 was chosen because it makes the minimum CMA
alignment requirement the same as on 4KiB kernels (2MiB).

- Perform 100K dma-buf heap (/dev/dma_heap/system) allocations of
SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4. Use simpleperf
(https://developer.android.com/ndk/guides/simpleperf) to measure
the number of instructions and page faults on 16KiB kernels.
The benchmark was executed 10 times. The averages are below:

           # instructions          |     # page-faults
    order 10     |    order 7      | order 10 | order 7
---------------------------------------------------------
 13,891,765,770  | 11,425,777,314  |    220   |   217
 14,456,293,487  | 12,660,819,302  |    224   |   219
 13,924,261,018  | 13,243,970,736  |    217   |   221
 13,910,886,504  | 13,845,519,630  |    217   |   221
 14,388,071,190  | 13,498,583,098  |    223   |   224
 13,656,442,167  | 12,915,831,681  |    216   |   218
 13,300,268,343  | 12,930,484,776  |    222   |   218
 13,625,470,223  | 14,234,092,777  |    219   |   218
 13,508,964,965  | 13,432,689,094  |    225   |   219
 13,368,950,667  | 13,683,587,37   |    219   |   225
---------------------------------------------------------
 13,803,137,433  | 13,131,974,268  |    220   |   220    Averages

There were 4.86% fewer instructions when the order was 7, in
comparison with order 10:

     13,131,974,268 - 13,803,137,433 = -671,163,165 (-4.86%)

The average number of page faults was the same for order 7 and
order 10 (220).

These results didn't show any significant regression when
pageblock_order is set to 7 on 16KiB kernels.

- Run Speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times
on the 16KiB kernels with pageblock_order 7 and 10.

order 10 | order 7  | order 7 - order 10 | (order 7 - order 10) %
-------------------------------------------------------------------
  15.8   |  16.4    |         0.6        |     3.80%
  16.4   |  16.2    |        -0.2        |    -1.22%
  16.6   |  16.3    |        -0.3        |    -1.81%
  16.8   |  16.3    |        -0.5        |    -2.98%
  16.6   |  16.8    |         0.2        |     1.20%
-------------------------------------------------------------------
  16.44  |  16.40   |        -0.04       |    -0.24%   Averages

The results didn't show any significant regression when
pageblock_order is set to 7 on 16KiB kernels.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Juan Yescas <jyescas@google.com>
Acked-by: Zi Yan <ziy@nvidia.com>
---

Changes in v6:
  - Applied the change provided by Zi Yan to fix
    the Kconfig. The change consists of making the if
    expression for range evaluate to true or false:
    range 1 <symbol> if <expression that evaluates to true/false>.

Changes in v5:
  - Remove the ranges for CONFIG_PAGE_BLOCK_ORDER. Ranges
    that reference config symbols (for example, range 1
    MY_CONFIG) don't work in Kconfig.
  - Add a PAGE_BLOCK_ORDER_MANUAL config for the
    page block order number. The default value was not
    defined.
  - Fix typos reported by Andrew.
  - Test default configs on powerpc.

Changes in v4:
  - Set PAGE_BLOCK_ORDER in include/linux/mmzone.h to
    validate that MAX_PAGE_ORDER >= PAGE_BLOCK_ORDER at
    compile time.
  - This change fixes the warning in:
    https://lore.kernel.org/oe-kbuild-all/202505091548.FuKO4b4v-lkp@intel.com/

Changes in v3:
  - Rename ARCH_FORCE_PAGE_BLOCK_ORDER to PAGE_BLOCK_ORDER
    as per Matthew's suggestion.
  - Update comments in pageblock-flags.h for pageblock_order
    value when THP or HugeTLB are not used.

Changes in v2:
  - Add Zi's Acked-by tag.
  - Move the ARCH_FORCE_PAGE_BLOCK_ORDER config to mm/Kconfig as
    per Zi's and Matthew's suggestion, so it is available to
    all the architectures.
  - Set ARCH_FORCE_PAGE_BLOCK_ORDER to 10 by default when
    ARCH_FORCE_MAX_ORDER is not available.

 include/linux/mmzone.h          | 16 ++++++++++++++++
 include/linux/pageblock-flags.h |  8 ++++----
 mm/Kconfig                      | 34 +++++++++++++++++++++++++++++++++
 3 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6ccec1bf2896..05610337bbb6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -37,6 +37,22 @@
 
 #define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1)
 
+/* Defines the order for the number of pages that have a migrate type. */
+#ifndef CONFIG_PAGE_BLOCK_ORDER
+#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER
+#else
+#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER
+#endif /* CONFIG_PAGE_BLOCK_ORDER */
+
+/*
+ * The MAX_PAGE_ORDER, which defines the max order of pages to be allocated
+ * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER,
+ * which defines the order for the number of pages that can have a migrate type
+ */
+#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER)
+#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER
+#endif
+
 /*
  * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
  * costly to service.  That is between allocation orders which should
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index fc6b9c87cb0a..e73a4292ef02 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -41,18 +41,18 @@ extern unsigned int pageblock_order;
  * Huge pages are a constant size, but don't exceed the maximum allocation
  * granularity.
  */
-#define pageblock_order		MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)
+#define pageblock_order		MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER)
 
 #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
 
 #elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
 
-#define pageblock_order		MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER)
+#define pageblock_order		MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
 
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
-#define pageblock_order		MAX_PAGE_ORDER
+/* If huge pages are not used, group by PAGE_BLOCK_ORDER */
+#define pageblock_order		PAGE_BLOCK_ORDER
 
 #endif /* CONFIG_HUGETLB_PAGE */
 
diff --git a/mm/Kconfig b/mm/Kconfig
index e113f713b493..13a5c4f6e6b6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -989,6 +989,40 @@ config CMA_AREAS
 
 	  If unsure, leave the default value "8" in UMA and "20" in NUMA.
 
+#
+# Select this config option from the architecture Kconfig, if available, to set
+# the max page order for physically contiguous allocations.
+#
+config ARCH_FORCE_MAX_ORDER
+	int
+
+#
+# When ARCH_FORCE_MAX_ORDER is not defined,
+# the default page block order is MAX_PAGE_ORDER (10) as per
+# include/linux/mmzone.h.
+#
+config PAGE_BLOCK_ORDER
+	int "Page Block Order"
+	range 1 10 if ARCH_FORCE_MAX_ORDER = 0
+	default 10 if ARCH_FORCE_MAX_ORDER = 0
+	range 1 ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
+	default ARCH_FORCE_MAX_ORDER if ARCH_FORCE_MAX_ORDER != 0
+	help
+	  The page block order refers to the power-of-two number of pages
+	  that are physically contiguous and can have a migrate type
+	  associated with them. The maximum page block order is limited by
+	  ARCH_FORCE_MAX_ORDER.
+
+	  This config allows overriding the default page block order when the
+	  page block order is required to be smaller than ARCH_FORCE_MAX_ORDER
+	  or MAX_PAGE_ORDER.
+
+	  Reducing the pageblock order can negatively impact the THP
+	  allocation success rate. If your workloads use THP heavily,
+	  please use this option with caution.
+
+	  Don't change if unsure.
+
 config MEM_SOFT_DIRTY
 	bool "Track memory changes"
 	depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY && PROC_FS
-- 
2.49.0.1112.g889b7c5bd8-goog




* Re: [PATCH v6] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
  2025-05-20 22:59 [PATCH v6] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order Juan Yescas
@ 2025-05-21  6:47 ` David Hildenbrand
  2025-05-21 16:51   ` Juan Yescas
  0 siblings, 1 reply; 4+ messages in thread
From: David Hildenbrand @ 2025-05-21  6:47 UTC (permalink / raw)
  To: Juan Yescas, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, linux-mm, linux-kernel
  Cc: tjmercier, isaacmanjarres, kaleshsingh, masahiroy, Minchan Kim

On 21.05.25 00:59, Juan Yescas wrote:
> [...]
>
> For example, in ARM, the CMA alignment requirement when:
> 
> - CONFIG_ARCH_FORCE_MAX_ORDER default value is used
> - CONFIG_TRANSPARENT_HUGEPAGE is set:
> 
> PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
> -----------------------------------------------------------------------
>     4KiB   |      10        |      10         |  4KiB * (2 ^ 10)  =  4MiB

Why is pageblock_order 10 in that case?

	#define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER)

So it should be 2 MiB (order-9)?

>    16KiB   |      11        |      11         | 16KiB * (2 ^ 11) =  32MiB
>    64KiB   |      13        |      13         | 64KiB * (2 ^ 13) = 512MiB
> 
> There are some extreme cases for the CMA alignment requirement when:
> 
> - CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set
> - CONFIG_TRANSPARENT_HUGEPAGE is NOT set:
> - CONFIG_HUGETLB_PAGE is NOT set

I think we should just always group at HPAGE_PMD_ORDER also in this case. But that's
a different thing to sort out :)

> [...]

Sorry for the late reply. I think using a boot-time option might have
saved us some of the headache. :)

[...]

> +/* Defines the order for the number of pages that have a migrate type. */
> +#ifndef CONFIG_PAGE_BLOCK_ORDER
> +#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER
> +#else
> +#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER
> +#endif /* CONFIG_PAGE_BLOCK_ORDER */
> +
> +/*
> + * The MAX_PAGE_ORDER, which defines the max order of pages to be allocated
> + * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER,
> + * which defines the order for the number of pages that can have a migrate type
> + */
> +#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER)
> +#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER
> +#endif
> +
>   /*
>    * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
>    * costly to service.  That is between allocation orders which should
> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
> index fc6b9c87cb0a..e73a4292ef02 100644
> --- a/include/linux/pageblock-flags.h
> +++ b/include/linux/pageblock-flags.h
> @@ -41,18 +41,18 @@ extern unsigned int pageblock_order;
>    * Huge pages are a constant size, but don't exceed the maximum allocation
>    * granularity.
>    */

How is CONFIG_HUGETLB_PAGE_SIZE_VARIABLE handled?

> -#define pageblock_order		MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)
> +#define pageblock_order		MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER)
>   
>   #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>   
>   #elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
>   
> -#define pageblock_order		MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER)
> +#define pageblock_order		MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)

Wait, why are we using the MIN_T in that case? If someone requests 4 MiB, why would we reduce
it to 2 MiB even though MAX_PAGE_ORDER allows for it?


Maybe we really have to clean all that up first :/

-- 
Cheers,

David / dhildenb




* Re: [PATCH v6] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
  2025-05-21  6:47 ` David Hildenbrand
@ 2025-05-21 16:51   ` Juan Yescas
  2025-05-28  7:31     ` Vlastimil Babka
  0 siblings, 1 reply; 4+ messages in thread
From: Juan Yescas @ 2025-05-21 16:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan, linux-mm,
	linux-kernel, tjmercier, isaacmanjarres, kaleshsingh, masahiroy,
	Minchan Kim

On Tue, May 20, 2025 at 11:47 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 21.05.25 00:59, Juan Yescas wrote:
> > [...]
> >
> > For example, in ARM, the CMA alignment requirement when:
> >
> > - CONFIG_ARCH_FORCE_MAX_ORDER default value is used
> > - CONFIG_TRANSPARENT_HUGEPAGE is set:
> >
> > PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
> > -----------------------------------------------------------------------
> >     4KiB   |      10        |      10         |  4KiB * (2 ^ 10)  =  4MiB
>
> Why is pageblock_order 10 in that case?
>
>         #define pageblock_order MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER)
>
> So it should be 2 MiB (order-9)?
>

That is right. I will update the description to set it to 2 MiB.

> [...]
>
> Sorry for the late reply. I think using a boot-time option might have
> saved us some of the headache. :)

No worries.

The boot-time option sounds good; however, there are these tradeoffs:

- The bootloader needs to be updated to find out the kernel page size
and calculate the pageblock_order to pass to the kernel.
- If the pageblock_order changes, it is likely that some CMA
reservations will need to be updated, so the DTS needs to be
recompiled.
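
For illustration, a minimal sketch of what such a boot-time override
could look like (hypothetical; no such parameter exists upstream, and
it assumes pageblock_order is a runtime variable, which today is only
the case under CONFIG_HUGETLB_PAGE_SIZE_VARIABLE):

/* Hypothetical sketch only -- not an existing kernel parameter. */
static int __init early_pageblock_order(char *buf)
{
	unsigned int order;

	if (kstrtouint(buf, 10, &order) || order > MAX_PAGE_ORDER)
		return -EINVAL;

	/* set_pageblock_order() skips its setup if this is already set. */
	pageblock_order = order;
	return 0;
}
early_param("pageblock_order", early_pageblock_order);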

> [...]
>
> > +/* Defines the order for the number of pages that have a migrate type. */
> > +#ifndef CONFIG_PAGE_BLOCK_ORDER
> > +#define PAGE_BLOCK_ORDER MAX_PAGE_ORDER
> > +#else
> > +#define PAGE_BLOCK_ORDER CONFIG_PAGE_BLOCK_ORDER
> > +#endif /* CONFIG_PAGE_BLOCK_ORDER */
> > +
> > +/*
> > + * The MAX_PAGE_ORDER, which defines the max order of pages to be allocated
> > + * by the buddy allocator, has to be larger or equal to the PAGE_BLOCK_ORDER,
> > + * which defines the order for the number of pages that can have a migrate type
> > + */
> > +#if (PAGE_BLOCK_ORDER > MAX_PAGE_ORDER)
> > +#error MAX_PAGE_ORDER must be >= PAGE_BLOCK_ORDER
> > +#endif
> > +
> >   /*
> >    * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
> >    * costly to service.  That is between allocation orders which should
> > diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
> > index fc6b9c87cb0a..e73a4292ef02 100644
> > --- a/include/linux/pageblock-flags.h
> > +++ b/include/linux/pageblock-flags.h
> > @@ -41,18 +41,18 @@ extern unsigned int pageblock_order;
> >    * Huge pages are a constant size, but don't exceed the maximum allocation
> >    * granularity.
> >    */
>
> How is CONFIG_HUGETLB_PAGE_SIZE_VARIABLE handled?

That is a powerpc configuration, and the pageblock_order variable is
initialized in:

mm/mm_init.c:

#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
/* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
void __init set_pageblock_order(void)
{
	unsigned int order = MAX_PAGE_ORDER;

	/* Check that pageblock_nr_pages has not already been setup */
	if (pageblock_order)
		return;

	/* Don't let pageblocks exceed the maximum allocation granularity. */
	if (HPAGE_SHIFT > PAGE_SHIFT && HUGETLB_PAGE_ORDER < order)
		order = HUGETLB_PAGE_ORDER;

	/*
	 * Assume the largest contiguous order of interest is a huge page.
	 * This value may be variable depending on boot parameters on powerpc.
	 */
	pageblock_order = order;
}

Should this line be updated?
https://elixir.bootlin.com/linux/v6.15-rc7/source/mm/mm_init.c#L1513
unsigned int order = MAX_PAGE_ORDER;
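
(i.e., presumably something like this untested one-liner:

-	unsigned int order = MAX_PAGE_ORDER;
+	unsigned int order = PAGE_BLOCK_ORDER;

so that the CONFIG_HUGETLB_PAGE_SIZE_VARIABLE path is capped by
PAGE_BLOCK_ORDER as well.)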

> > -#define pageblock_order              MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)
> > +#define pageblock_order              MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER)
> >
> >   #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
> >
> >   #elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
> >
> > -#define pageblock_order              MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER)
> > +#define pageblock_order              MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
>
> Wait, why are we using the MIN_T in that case? If someone requests 4 MiB, why would we reduce
> it to 2 MiB even though MAX_PAGE_ORDER allows for it?
>
I don't have the context for that change. I think Vlastimil might know
why it is needed.

That change was introduced in this patch:
https://lore.kernel.org/all/20240426040258.AD47FC113CD@smtp.kernel.org/

Thanks
Juan




* Re: [PATCH v6] mm: Add CONFIG_PAGE_BLOCK_ORDER to select page block order
  2025-05-21 16:51   ` Juan Yescas
@ 2025-05-28  7:31     ` Vlastimil Babka
  0 siblings, 0 replies; 4+ messages in thread
From: Vlastimil Babka @ 2025-05-28  7:31 UTC (permalink / raw)
  To: Juan Yescas, David Hildenbrand
  Cc: Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, linux-mm, linux-kernel,
	tjmercier, isaacmanjarres, kaleshsingh, masahiroy, Minchan Kim

On 5/21/25 18:51, Juan Yescas wrote:
> On Tue, May 20, 2025 at 11:47 PM David Hildenbrand <david@redhat.com> wrote:
>> > -#define pageblock_order              MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)
>> > +#define pageblock_order              MIN_T(unsigned int, HUGETLB_PAGE_ORDER, PAGE_BLOCK_ORDER)
>> >
>> >   #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>> >
>> >   #elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
>> >
>> > -#define pageblock_order              MIN_T(unsigned int, HPAGE_PMD_ORDER, MAX_PAGE_ORDER)
>> > +#define pageblock_order              MIN_T(unsigned int, HPAGE_PMD_ORDER, PAGE_BLOCK_ORDER)
>>
>> Wait, why are we using the MIN_T in that case? If someone requests 4 MiB, why would we reduce
>> it to 2 MiB even though MAX_PAGE_ORDER allows for it?
>>
> I don't have the context for that change. I think Vlastimil might know
> why it is needed:
> 
> That change was introduced in this patch:
> https://lore.kernel.org/all/20240426040258.AD47FC113CD@smtp.kernel.org/

Well, the intention was always that the pageblock order should be
lowered to PMD order when THPs are enabled, as compaction and
anti-fragmentation can then better help them be allocated
successfully. And when it turned out this was not true without also
having CONFIG_HUGETLB_PAGE enabled, I considered it a bug.

At the time there was no proposal to make the pageblock order fully
configurable, so it was just about having the best possible heuristic.
Now we could let the new config override that, but since the main
intention here is to make the pageblock order smaller, not larger, it
doesn't seem that urgent.

But if we go that way, we should make sure the defaults (user doesn't
override MAX_PAGE_ORDER) still result in pageblock_order matching
PMD_ORDER with hugepages/THPs enabled, and not becoming accidentally
larger.




