* [PATCH v4 0/3] mm, swap: Enable THP SWAP for PowerPC Book3S64
@ 2026-06-19 4:40 Ritesh Harjani (IBM)
2026-06-19 4:40 ` [PATCH v4 1/3] mm, swap: make SWAPFILE_CLUSTER runtime Ritesh Harjani (IBM)
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Ritesh Harjani (IBM) @ 2026-06-19 4:40 UTC (permalink / raw)
To: linux-mm
Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
Ritesh Harjani (IBM)
On PowerPC Book3S64, the MMU is selected at runtime, so macros like PMD_SHIFT
are effectively runtime variables in the Book3S64 code. THP swap code uses
these macros, e.g. to size some of its array data structures based on
PMD_ORDER. This patch series makes that usage depend on the runtime
value and provides an upper-bound architecture override for cases (e.g.
SWAP_NR_ORDERS) where the runtime conversion is not considered beneficial.
This series increases swapout throughput with a zram backend by
around 40-50% with Radix and 100-130% with Hash (tested by Sayali).
Note that this patch series is based on linux-next (next-20260608).
v3->v4:
======
1. Revert SWAPFILE_CLUSTER definition - since we already adjusted all the users
of SWAPFILE_CLUSTER and made those users use this value at runtime (Kairui Song)
v2 -> v3:
=========
1. Fixed sparse warning for swap_table_use_page reported by lkp in patch-1
RFC -> v2:
==========
1. Send the unused leftovers change in swap.h separately [1]
2. Changed RFC Patch-3 design from runtime SWAP_NR_ORDERS to arch override
(ARCH_MAX_PMD_ORDER) - suggested by Youngjun
3. Dropped RFC tag
[1]: https://lore.kernel.org/linux-mm/68591daf0d679e5a0072d63751f187d14613e2b0.1781146877.git.ritesh.list@gmail.com/
[RFC]: https://lore.kernel.org/linux-mm/cover.1781000840.git.ritesh.list@gmail.com/
[v2]: https://lore.kernel.org/linuxppc-dev/cover.1781170904.git.ritesh.list@gmail.com/
Ritesh Harjani (IBM) (3):
mm, swap: make SWAPFILE_CLUSTER runtime
mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER
powerpc: Kconfig: Enable THP_SWAP on Book3S64
arch/powerpc/include/asm/book3s/64/pgtable.h | 7 +++++++
arch/powerpc/platforms/Kconfig.cputype | 1 +
include/linux/swap.h | 12 +++++++++++-
mm/swap_table.h | 6 ++----
mm/swapfile.c | 17 ++++++++++++-----
5 files changed, 33 insertions(+), 10 deletions(-)
--
2.39.5
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v4 1/3] mm, swap: make SWAPFILE_CLUSTER runtime
2026-06-19 4:40 [PATCH v4 0/3] mm, swap: Enable THP SWAP for PowerPC Book3S64 Ritesh Harjani (IBM)
@ 2026-06-19 4:40 ` Ritesh Harjani (IBM)
2026-06-22 1:39 ` Barry Song
2026-06-23 8:44 ` Barry Song
2026-06-19 4:40 ` [PATCH v4 2/3] mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER Ritesh Harjani (IBM)
2026-06-19 4:40 ` [PATCH v4 3/3] powerpc: Kconfig: Enable THP_SWAP on Book3S64 Ritesh Harjani (IBM)
2 siblings, 2 replies; 15+ messages in thread
From: Ritesh Harjani (IBM) @ 2026-06-19 4:40 UTC (permalink / raw)
To: linux-mm
Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
Ritesh Harjani (IBM)
On PowerPC Book3S64, the MMU is selected at runtime, so macros like
PMD_SHIFT are effectively runtime variables in the Book3S64 code. THP
swap code uses these macros to size some of its array data structures
based on PMD_ORDER; e.g. the SWAPFILE_CLUSTER macro is used for this very
purpose.
Hence this patch makes the users of SWAPFILE_CLUSTER evaluate the macro at
runtime and also converts swap_table and swap_memcg_table, which earlier
used this macro to define their number of table entries, to flexible arrays.
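As a rough userspace illustration of the conversion described above (not the
kernel code itself): a table whose entry count is only known at runtime can use
a C99 flexible array member, and the old sizeof()-based compile-time check
becomes a value computed once at init. All demo_* names are hypothetical. The
extra nr member is an assumption made for this sketch, because standard C does
not permit a struct consisting only of a flexible array member; the kernel's
DECLARE_FLEX_ARRAY() helper covers that case. calloc() stands in for the
kernel allocators.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-cluster table sized at runtime. */
struct demo_table {
	size_t nr;	/* number of entries, known only at runtime */
	long entries[];	/* C99 flexible array member */
};

/* Bytes needed for the header plus nr entries (no overflow checking here). */
static size_t demo_table_bytes(size_t nr)
{
	return sizeof(struct demo_table) + nr * sizeof(long);
}

/* Allocate a zeroed table with nr entries; returns NULL on failure. */
static struct demo_table *demo_table_alloc(size_t nr)
{
	struct demo_table *t = calloc(1, demo_table_bytes(nr));

	if (t)
		t->nr = nr;
	return t;
}

/*
 * Runtime replacement for a compile-time "sizeof(table) == PAGE_SIZE" test:
 * true only when the entries fill exactly one page.
 */
static bool demo_table_fills_page(size_t nr, size_t page_size)
{
	return nr * sizeof(long) == page_size;
}
```

A caller would compute demo_table_fills_page() once during init, cache the
result in a boolean, and branch on that boolean in the alloc/free paths, which
mirrors the role swap_table_use_page plays in the diff below.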
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
mm/swap_table.h | 6 ++----
mm/swapfile.c | 17 ++++++++++++-----
2 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/mm/swap_table.h b/mm/swap_table.h
index e6613e62f8d0..90e2a7852300 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -8,16 +8,14 @@
/* A typical flat array in each cluster as swap table */
struct swap_table {
- atomic_long_t entries[SWAPFILE_CLUSTER];
+ DECLARE_FLEX_ARRAY(atomic_long_t, entries);
};
/* For storing memcg private id */
struct swap_memcg_table {
- unsigned short id[SWAPFILE_CLUSTER];
+ DECLARE_FLEX_ARRAY(unsigned short, id);
};
-#define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
-
/*
* A swap table entry represents the status of a swap slot on a swap
* (physical or virtual) device. The swap table in each cluster is a
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 78b49b0658ad..4bf11c5b87eb 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -129,6 +129,8 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
.lock = INIT_LOCAL_LOCK(),
};
+static bool swap_table_use_page __ro_after_init;
+
/* May return NULL on invalid type, caller must check for NULL return */
static struct swap_info_struct *swap_type_to_info(int type)
{
@@ -437,7 +439,7 @@ static void swap_cluster_free_table(struct swap_cluster_info *ci)
return;
rcu_assign_pointer(ci->table, NULL);
- if (!SWP_TABLE_USE_PAGE) {
+ if (!swap_table_use_page) {
kmem_cache_free(swap_table_cachep, table);
return;
}
@@ -456,7 +458,7 @@ static int swap_cluster_alloc_table(struct swap_cluster_info *ci, gfp_t gfp)
if (rcu_access_pointer(ci->table))
return 0;
- if (SWP_TABLE_USE_PAGE) {
+ if (swap_table_use_page) {
folio = folio_alloc(gfp | __GFP_ZERO, 0);
if (folio)
table = folio_address(folio);
@@ -471,7 +473,8 @@ static int swap_cluster_alloc_table(struct swap_cluster_info *ci, gfp_t gfp)
#ifdef CONFIG_MEMCG
if (!mem_cgroup_disabled()) {
VM_WARN_ON_ONCE(ci->memcg_table);
- ci->memcg_table = kzalloc_obj(*ci->memcg_table, gfp);
+ ci->memcg_table = kzalloc_flex(*ci->memcg_table, id,
+ SWAPFILE_CLUSTER, gfp);
if (!ci->memcg_table) {
swap_cluster_free_table(ci);
return -ENOMEM;
@@ -3912,14 +3915,18 @@ static int __init swapfile_init(void)
{
swapfile_maximum_size = arch_max_swapfile_size();
+ swap_table_use_page =
+ (SWAPFILE_CLUSTER * sizeof(atomic_long_t) == PAGE_SIZE);
+
/*
* Once a cluster is freed, it's swap table content is read
* only, and all swap cache readers (swap_cache_*) verifies
* the content before use. So it's safe to use RCU slab here.
*/
- if (!SWP_TABLE_USE_PAGE)
+ if (!swap_table_use_page)
swap_table_cachep = kmem_cache_create("swap_table",
- sizeof(struct swap_table),
+ struct_size_t(struct swap_table, entries,
+ SWAPFILE_CLUSTER),
0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
#ifdef CONFIG_MIGRATION
--
2.39.5
* [PATCH v4 2/3] mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER
2026-06-19 4:40 [PATCH v4 0/3] mm, swap: Enable THP SWAP for PowerPC Book3S64 Ritesh Harjani (IBM)
2026-06-19 4:40 ` [PATCH v4 1/3] mm, swap: make SWAPFILE_CLUSTER runtime Ritesh Harjani (IBM)
@ 2026-06-19 4:40 ` Ritesh Harjani (IBM)
2026-06-23 5:11 ` Barry Song
2026-06-19 4:40 ` [PATCH v4 3/3] powerpc: Kconfig: Enable THP_SWAP on Book3S64 Ritesh Harjani (IBM)
2 siblings, 1 reply; 15+ messages in thread
From: Ritesh Harjani (IBM) @ 2026-06-19 4:40 UTC (permalink / raw)
To: linux-mm
Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
Ritesh Harjani (IBM)
SWAP_NR_ORDERS sizes a few small bounded arrays inside the THP swap
allocator code (nonfull/frag cluster lists, percpu_swap_cluster's
si/offset arrays, next array for rotational devices). This currently
expands to PMD_ORDER+1, which only works when PMD_ORDER is a
compile-time constant.
However, on architectures like PowerPC Book3S64, PMD_ORDER is a runtime
variable that depends on which MMU is selected (Radix / Hash), so in
that case PMD_ORDER cannot be used to size the static arrays.
This patch provides an optional ARCH_MAX_PMD_ORDER (upper-bound)
override for such architectures. The memory overhead of enabling this
override is negligible. Even if we made SWAP_NR_ORDERS a runtime
allocation, default slab padding could waste some memory. We would also
lose the per-cpu cacheline benefits (for percpu_swap_cluster), because it
might cost an extra cacheline indirection in swap_alloc_fast() for
fetching si[order]/offset[order]. Note that a fully runtime
SWAP_NR_ORDERS was considered in a previous version but was dropped for
these reasons [1].
[1]: https://lore.kernel.org/linuxppc-dev/pl1zdksc.ritesh.list@gmail.com/
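The upper-bound idea can be sketched in plain C as follows. This is only an
illustration: the DEMO_*/demo_* names are hypothetical, and the index sizes 5
and 8 are assumptions derived from the 2M (Radix) and 16M (Hash) PMD sizes with
64K base pages mentioned in this thread. The array is sized by the
compile-time maximum, while only the first (runtime order + 1) slots are used.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-MMU PTE index sizes for 64K base pages (illustrative only). */
#define DEMO_RADIX_PTE_INDEX_SIZE 5	/* 64K << 5 = 2M PMD */
#define DEMO_HASH_PTE_INDEX_SIZE  8	/* 64K << 8 = 16M PMD */

/* Compile-time upper bound on the PMD order across both MMUs. */
#define DEMO_MAX_PMD_ORDER \
	((DEMO_HASH_PTE_INDEX_SIZE > DEMO_RADIX_PTE_INDEX_SIZE) ? \
	 DEMO_HASH_PTE_INDEX_SIZE : DEMO_RADIX_PTE_INDEX_SIZE)

#define DEMO_NR_ORDERS (DEMO_MAX_PMD_ORDER + 1)

/* Statically sized by the upper bound, so no runtime allocation is needed. */
struct demo_percpu_cluster {
	unsigned long offset[DEMO_NR_ORDERS];
};

/* Runtime PMD order, set once when the MMU is chosen. */
static unsigned int demo_pmd_order;

static void demo_select_mmu(int radix)
{
	demo_pmd_order = radix ? DEMO_RADIX_PTE_INDEX_SIZE
			       : DEMO_HASH_PTE_INDEX_SIZE;
}

/* Number of array slots actually used with the selected MMU. */
static size_t demo_used_orders(void)
{
	return (size_t)demo_pmd_order + 1;
}

/* Bytes of the offset[] array left unused with the selected MMU. */
static size_t demo_wasted_bytes(void)
{
	assert(demo_used_orders() <= DEMO_NR_ORDERS);
	return (DEMO_NR_ORDERS - demo_used_orders()) * sizeof(unsigned long);
}
```

Under these assumptions, selecting Radix leaves three unused slots per array
and selecting Hash leaves none, which is the sense in which the overhead of the
override is small.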
Suggested-by: YoungJun Park <youngjun.park@lge.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
arch/powerpc/include/asm/book3s/64/pgtable.h | 7 +++++++
include/linux/swap.h | 12 +++++++++++-
2 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index e67e64ac6e8c..7f22d5d5fbdf 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -204,6 +204,13 @@ extern unsigned long __pmd_frag_size_shift;
#define MAX_PTRS_PER_PGD (1 << (H_PGD_INDEX_SIZE > RADIX_PGD_INDEX_SIZE ? \
H_PGD_INDEX_SIZE : RADIX_PGD_INDEX_SIZE))
+/*
+ * Compile-time upper bound on PMD_ORDER across hash and radix MMUs.
+ * Used by THP SWAP code. Check include/linux/swap.h
+ */
+#define ARCH_MAX_PMD_ORDER ((H_PTE_INDEX_SIZE > RADIX_PTE_INDEX_SIZE) ? \
+ H_PTE_INDEX_SIZE : RADIX_PTE_INDEX_SIZE)
+
/* PMD_SHIFT determines what a second-level page table entry can map */
#define PMD_SHIFT (PAGE_SHIFT + PTE_INDEX_SIZE)
#define PMD_SIZE (1UL << PMD_SHIFT)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8f0f68e245ba..317168aa2db5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -229,11 +229,21 @@ enum {
*/
#define SWAP_ENTRY_INVALID 0
+/*
+ * ARCH_MAX_PMD_ORDER is an optional arch hook: a compile-time upper bound for
+ * PMD_ORDER across all possible MMU configurations of that arch. It is used to
+ * size SWAP_NR_ORDERS on architectures (e.g. powerpc book3s64) where PMD_ORDER
+ * is selected at boot rather than at compile time.
+ */
#ifdef CONFIG_THP_SWAP
+#ifdef ARCH_MAX_PMD_ORDER
+#define SWAP_NR_ORDERS (ARCH_MAX_PMD_ORDER + 1)
+#else
#define SWAP_NR_ORDERS (PMD_ORDER + 1)
+#endif /* ARCH_MAX_PMD_ORDER */
#else
#define SWAP_NR_ORDERS 1
-#endif
+#endif /* CONFIG_THP_SWAP */
/*
* We keep using same cluster for rotational device so IO will be sequential.
--
2.39.5
* [PATCH v4 3/3] powerpc: Kconfig: Enable THP_SWAP on Book3S64
2026-06-19 4:40 [PATCH v4 0/3] mm, swap: Enable THP SWAP for PowerPC Book3S64 Ritesh Harjani (IBM)
2026-06-19 4:40 ` [PATCH v4 1/3] mm, swap: make SWAPFILE_CLUSTER runtime Ritesh Harjani (IBM)
2026-06-19 4:40 ` [PATCH v4 2/3] mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER Ritesh Harjani (IBM)
@ 2026-06-19 4:40 ` Ritesh Harjani (IBM)
2026-06-23 5:21 ` Barry Song
2 siblings, 1 reply; 15+ messages in thread
From: Ritesh Harjani (IBM) @ 2026-06-19 4:40 UTC (permalink / raw)
To: linux-mm
Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
Ritesh Harjani (IBM)
THP_SWAP avoids splitting a transparent huge folio into 32 smaller
64K folios (Radix-64K pagesize / 2M PMD) or into 256 smaller 64K folios
(Hash-64K pagesize / 16M PMD) during swapout. This improves
swapping performance since all the bookkeeping and I/O submission happen
once per large folio. More details at [1].
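The folio counts above follow directly from PMD size divided by base page
size. A small sketch of that arithmetic, with hypothetical helper names
(demo_pages_per_pmd, demo_order) that are not kernel functions:

```c
#include <assert.h>

/* Number of base pages covered by one PMD-sized folio. */
static unsigned long demo_pages_per_pmd(unsigned long pmd_size,
					unsigned long page_size)
{
	assert(page_size != 0 && pmd_size % page_size == 0);
	return pmd_size / page_size;
}

/* Order of a folio with nr_pages base pages (nr_pages a power of two). */
static unsigned int demo_order(unsigned long nr_pages)
{
	unsigned int order = 0;

	while (nr_pages > 1) {
		nr_pages >>= 1;
		order++;
	}
	return order;
}
```

With 64K base pages, a 2M PMD gives 32 pages (order 5) and a 16M PMD gives 256
pages (order 8), matching the Radix and Hash cases in the text.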
PowerPC Book3S64 could not enable this before because PMD_ORDER is
selected at runtime depending upon the chosen MMU. The earlier patches
in this series turn SWAPFILE_CLUSTER into a runtime value and introduce
an ARCH_MAX_PMD_ORDER upper-bound override for SWAP_NR_ORDERS. With those
changes, we can now enable THP SWAP for Book3S64.
This increases swapout throughput with a zram backend by
40-50% with Radix and 100-130% with Hash (tested by Sayali).
[1]: https://lore.kernel.org/all/20170515112522.32457-2-ying.huang@intel.com/
Tested-by: Sayali Patil <sayalip@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
arch/powerpc/platforms/Kconfig.cputype | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index bac02c83bb3e..48f74bd22343 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -113,6 +113,7 @@ config PPC_THP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
+ select ARCH_WANTS_THP_SWAP if TRANSPARENT_HUGEPAGE
choice
prompt "CPU selection"
--
2.39.5
* Re: [PATCH v4 1/3] mm, swap: make SWAPFILE_CLUSTER runtime
2026-06-19 4:40 ` [PATCH v4 1/3] mm, swap: make SWAPFILE_CLUSTER runtime Ritesh Harjani (IBM)
@ 2026-06-22 1:39 ` Barry Song
2026-06-23 4:11 ` Ritesh Harjani
2026-06-23 8:44 ` Barry Song
1 sibling, 1 reply; 15+ messages in thread
From: Barry Song @ 2026-06-22 1:39 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-mm, Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil
On Fri, Jun 19, 2026 at 12:41 PM Ritesh Harjani (IBM)
<ritesh.list@gmail.com> wrote:
>
> On PowerPC Book3S64, MMU is selected at runtime, so macros like
> PMD_SHIFT are effectively runtime variables in the Book3S64 code. THP
Not an expert on Book3S64—could you explain the runtime variables in
more detail? Does enabling THP_SWAP on PowerPC cause any build issues?
> swap code uses these macros to size some of its array data structures
> based on PMD_ORDER e.g. SWAPFILE_CLUSTER macro is used for this very
> purpose.
> Hence this patch makes the users of SWAPFILE_CLUSTER to use this macro value at
> runtime and also modifies swap_table and swap_memcg_table which were earlier
> using this macro for defining the number of table entries.
>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
> mm/swap_table.h | 6 ++----
> mm/swapfile.c | 17 ++++++++++++-----
> 2 files changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/mm/swap_table.h b/mm/swap_table.h
> index e6613e62f8d0..90e2a7852300 100644
> --- a/mm/swap_table.h
> +++ b/mm/swap_table.h
> @@ -8,16 +8,14 @@
>
> /* A typical flat array in each cluster as swap table */
> struct swap_table {
> - atomic_long_t entries[SWAPFILE_CLUSTER];
> + DECLARE_FLEX_ARRAY(atomic_long_t, entries);
> };
>
> /* For storing memcg private id */
> struct swap_memcg_table {
> - unsigned short id[SWAPFILE_CLUSTER];
> + DECLARE_FLEX_ARRAY(unsigned short, id);
> };
>
> -#define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
> -
> /*
> * A swap table entry represents the status of a swap slot on a swap
> * (physical or virtual) device. The swap table in each cluster is a
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 78b49b0658ad..4bf11c5b87eb 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -129,6 +129,8 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
> .lock = INIT_LOCAL_LOCK(),
> };
>
> +static bool swap_table_use_page __ro_after_init;
Does a static key help here?
Best Regards
Barry
* Re: [PATCH v4 1/3] mm, swap: make SWAPFILE_CLUSTER runtime
2026-06-22 1:39 ` Barry Song
@ 2026-06-23 4:11 ` Ritesh Harjani
0 siblings, 0 replies; 15+ messages in thread
From: Ritesh Harjani @ 2026-06-23 4:11 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil
Barry Song <baohua@kernel.org> writes:
> On Fri, Jun 19, 2026 at 12:41 PM Ritesh Harjani (IBM)
> <ritesh.list@gmail.com> wrote:
>>
>> On PowerPC Book3S64, MMU is selected at runtime, so macros like
>> PMD_SHIFT are effectively runtime variables in the Book3S64 code. THP
>
> Not an expert on Book3S64—could you explain the runtime variables in
> more detail? Does enabling THP_SWAP on PowerPC cause any build issues?
>
Yes, it causes build issues: array sizes cannot be declared using runtime
variables. That's what this patch series fixes.
>> swap code uses these macros to size some of its array data structures
>> based on PMD_ORDER e.g. SWAPFILE_CLUSTER macro is used for this very
>> purpose.
>> Hence this patch makes the users of SWAPFILE_CLUSTER to use this macro value at
>> runtime and also modifies swap_table and swap_memcg_table which were earlier
>> using this macro for defining the number of table entries.
>>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>> mm/swap_table.h | 6 ++----
>> mm/swapfile.c | 17 ++++++++++++-----
>> 2 files changed, 14 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/swap_table.h b/mm/swap_table.h
>> index e6613e62f8d0..90e2a7852300 100644
>> --- a/mm/swap_table.h
>> +++ b/mm/swap_table.h
>> @@ -8,16 +8,14 @@
>>
>> /* A typical flat array in each cluster as swap table */
>> struct swap_table {
>> - atomic_long_t entries[SWAPFILE_CLUSTER];
>> + DECLARE_FLEX_ARRAY(atomic_long_t, entries);
>> };
>>
>> /* For storing memcg private id */
>> struct swap_memcg_table {
>> - unsigned short id[SWAPFILE_CLUSTER];
>> + DECLARE_FLEX_ARRAY(unsigned short, id);
>> };
>>
>> -#define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
>> -
>> /*
>> * A swap table entry represents the status of a swap slot on a swap
>> * (physical or virtual) device. The swap table in each cluster is a
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 78b49b0658ad..4bf11c5b87eb 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -129,6 +129,8 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
>> .lock = INIT_LOCAL_LOCK(),
>> };
>>
>> +static bool swap_table_use_page __ro_after_init;
>
> Does a static key help here?
That IMO won't give much benefit, given that the allocation, whether from
the kmem cache or the page allocator, dominates the cost anyway. Also, I
believe this is exactly the use case where the branch predictor helps
significantly and reliably, given that the variable is ro_after_init.
>
> Best Regards
> Barry
Thanks Barry for looking into this.
-ritesh
* Re: [PATCH v4 2/3] mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER
2026-06-19 4:40 ` [PATCH v4 2/3] mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER Ritesh Harjani (IBM)
@ 2026-06-23 5:11 ` Barry Song
2026-06-23 6:37 ` Ritesh Harjani
0 siblings, 1 reply; 15+ messages in thread
From: Barry Song @ 2026-06-23 5:11 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-mm, Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil
On Fri, Jun 19, 2026 at 12:41 PM Ritesh Harjani (IBM)
<ritesh.list@gmail.com> wrote:
>
> SWAP_NR_ORDERS sizes a few small bounded arrays inside THP swap
> allocator code (nofull/frag cluster lists, percpu_swap_cluster's
> si/offset arrays, next array for rotational device). This currently
> expands to PMD_ORDER+1, which only works when PMD_ORDER is a compile
> time constant.
>
> However on architecture like PowerPC Book3S64, PMD_ORDER is a runtime
> variable which depends upon which MMU is selected (Radix / Hash), so in
> that case, PMD_ORDER cannot be used to size the static arrays.
>
> This patch provides an optional ARCH_MAX_PMD_ORDER (upper-bound)
> override for such architectures. The memory overhead on enabling this
> override is negligible. Even if we make SWAP_NR_ORDERS runtime alloc,
> default slab padding could cause some memory waste. Also we lose the
> per-cpu cacheline benefits (for percpu_swap_cluster) because it might
> cost an extra cacheline indirection overhead in swap_alloc_fast() for
> fetching si[order]/offset[order]. Note that a fully runtime
> SWAP_NR_ORDERS was considered in previous version but was dropped for
> this reason [1]
Do we know the maximum PMD size? On arm64 with a 64 KB base page,
a PMD can be as large as 512 MB:
https://docs.kernel.org/arch/arm64/hugetlbpage.html
One concern we have is that performing I/O on such a large folio could
incur significant latency before reclaiming any memory. For this
reason, on arm64 we initially enabled THP_SWAPOUT only for 4 KB base
pages:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0637c505f
>
> [1]: https://lore.kernel.org/linuxppc-dev/pl1zdksc.ritesh.list@gmail.com/
>
Best Regards
Barry
* Re: [PATCH v4 3/3] powerpc: Kconfig: Enable THP_SWAP on Book3S64
2026-06-19 4:40 ` [PATCH v4 3/3] powerpc: Kconfig: Enable THP_SWAP on Book3S64 Ritesh Harjani (IBM)
@ 2026-06-23 5:21 ` Barry Song
2026-06-23 7:06 ` Ritesh Harjani
0 siblings, 1 reply; 15+ messages in thread
From: Barry Song @ 2026-06-23 5:21 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-mm, Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil
On Fri, Jun 19, 2026 at 12:41 PM Ritesh Harjani (IBM)
<ritesh.list@gmail.com> wrote:
>
> THP_SWAP avoids splitting of a transparent huge folio into 32 smaller
> 64K folios (Radix-64K pagesize / 2M PMD) or into 256 smaller 64K folios
> (Hash-64K pagesize / 16M PMD), during swapout. This improves the
> swapping performance since all the bookkeeping & I/O submission happens
> once per large folio. More details at [1].
>
> PowerPC Book3S64 could not enable this before because PMD_ORDER is
> selected at runtime depending upon the chosen MMU. The earlier patches
> in this series turn SWAPFILE_CLUSTER into a runtime value and introduce
> an ARCH_MAX_PMD_ORDER upperbound override for SWAP_NR_ORDERS. With those
> changes, we can now enable THP SWAP for Book3S64.
>
> This increases bandwidth throughput with zram backend for swapout by
> 40-50% with Radix and 100-130% with Hash (Tested by Sayali)
Thanks!
I am curious about the contents of the anonymous memory being tested
and the compression algorithm used by zram.
>
> [1]: https://lore.kernel.org/all/20170515112522.32457-2-ying.huang@intel.com/
Best Regards
Barry
* Re: [PATCH v4 2/3] mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER
2026-06-23 5:11 ` Barry Song
@ 2026-06-23 6:37 ` Ritesh Harjani
2026-06-23 8:42 ` Barry Song
0 siblings, 1 reply; 15+ messages in thread
From: Ritesh Harjani @ 2026-06-23 6:37 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil
Barry Song <baohua@kernel.org> writes:
> On Fri, Jun 19, 2026 at 12:41 PM Ritesh Harjani (IBM)
> <ritesh.list@gmail.com> wrote:
>>
>> SWAP_NR_ORDERS sizes a few small bounded arrays inside THP swap
>> allocator code (nofull/frag cluster lists, percpu_swap_cluster's
>> si/offset arrays, next array for rotational device). This currently
>> expands to PMD_ORDER+1, which only works when PMD_ORDER is a compile
>> time constant.
>>
>> However on architecture like PowerPC Book3S64, PMD_ORDER is a runtime
>> variable which depends upon which MMU is selected (Radix / Hash), so in
>> that case, PMD_ORDER cannot be used to size the static arrays.
>>
>> This patch provides an optional ARCH_MAX_PMD_ORDER (upper-bound)
>> override for such architectures. The memory overhead on enabling this
>> override is negligible. Even if we make SWAP_NR_ORDERS runtime alloc,
>> default slab padding could cause some memory waste. Also we lose the
>> per-cpu cacheline benefits (for percpu_swap_cluster) because it might
>> cost an extra cacheline indirection overhead in swap_alloc_fast() for
>> fetching si[order]/offset[order]. Note that a fully runtime
>> SWAP_NR_ORDERS was considered in previous version but was dropped for
>> this reason [1]
>
> Do we know the maximum PMD size?
ARCH_MAX_PMD_ORDER will be 8 on PowerPC book3s64 with 64K pagesize.
PowerPC Hash MMU with 64K default pagesize supports PMD size of 16MB.
> On arm64 with a 64 KB base page,
> a PMD can be as large as 512 MB:
> https://docs.kernel.org/arch/arm64/hugetlbpage.html
>
> One concern we have is that performing I/O on such a large folio could
> incur significant latency before reclaiming any memory. For this
> reason, on arm64 we initially enabled THP_SWAPOUT only for 4 KB base
> pages:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0637c505f
>
That's not the case on PowerPC. Max PMD size for Hash will be 16MB.
Also, we still need this patch since the Hash or Radix MMU can be chosen at
runtime. So the main problem this patch is trying to solve on PowerPC
Book3s64 is enabling this feature without impacting any other architecture.
Without this patch series we can't enable it, since it gives build errors.
>>
>> [1]: https://lore.kernel.org/linuxppc-dev/pl1zdksc.ritesh.list@gmail.com/
>>
>
> Best Regards
> Barry
Thanks for the review!
-ritesh
* Re: [PATCH v4 3/3] powerpc: Kconfig: Enable THP_SWAP on Book3S64
2026-06-23 5:21 ` Barry Song
@ 2026-06-23 7:06 ` Ritesh Harjani
2026-06-23 8:39 ` Barry Song
0 siblings, 1 reply; 15+ messages in thread
From: Ritesh Harjani @ 2026-06-23 7:06 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil
Barry Song <baohua@kernel.org> writes:
> On Fri, Jun 19, 2026 at 12:41 PM Ritesh Harjani (IBM)
> <ritesh.list@gmail.com> wrote:
>>
>> THP_SWAP avoids splitting of a transparent huge folio into 32 smaller
>> 64K folios (Radix-64K pagesize / 2M PMD) or into 256 smaller 64K folios
>> (Hash-64K pagesize / 16M PMD), during swapout. This improves the
>> swapping performance since all the bookkeeping & I/O submission happens
>> once per large folio. More details at [1].
>>
>> PowerPC Book3S64 could not enable this before because PMD_ORDER is
>> selected at runtime depending upon the chosen MMU. The earlier patches
>> in this series turn SWAPFILE_CLUSTER into a runtime value and introduce
>> an ARCH_MAX_PMD_ORDER upperbound override for SWAP_NR_ORDERS. With those
>> changes, we can now enable THP SWAP for Book3S64.
>>
>> This increases bandwidth throughput with zram backend for swapout by
>> 40-50% with Radix and 100-130% with Hash (Tested by Sayali)
>
> Thanks!
>
> I am curious about the contents of the anonymous memory being tested
> and the compression algorithm used by zram.
>
I am sure it was derived from the microbenchmark which you had
shared here (so a repetitive pattern), with the default zram compression
algorithm. Thanks for that :)
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0637c505f
I think I got your point - I can mention that it was a microbenchmark
similar to yours and not a real world workload test. Is this what you
meant here?
-ritesh
>>
>> [1]: https://lore.kernel.org/all/20170515112522.32457-2-ying.huang@intel.com/
>
> Best Regards
> Barry
* Re: [PATCH v4 3/3] powerpc: Kconfig: Enable THP_SWAP on Book3S64
2026-06-23 7:06 ` Ritesh Harjani
@ 2026-06-23 8:39 ` Barry Song
2026-06-23 9:03 ` Ritesh Harjani
0 siblings, 1 reply; 15+ messages in thread
From: Barry Song @ 2026-06-23 8:39 UTC (permalink / raw)
To: Ritesh Harjani
Cc: linux-mm, Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil
On Tue, Jun 23, 2026 at 3:12 PM Ritesh Harjani <ritesh.list@gmail.com> wrote:
>
> Barry Song <baohua@kernel.org> writes:
>
> > On Fri, Jun 19, 2026 at 12:41 PM Ritesh Harjani (IBM)
> > <ritesh.list@gmail.com> wrote:
> >>
> >> THP_SWAP avoids splitting of a transparent huge folio into 32 smaller
> >> 64K folios (Radix-64K pagesize / 2M PMD) or into 256 smaller 64K folios
> >> (Hash-64K pagesize / 16M PMD), during swapout. This improves the
> >> swapping performance since all the bookkeeping & I/O submission happens
> >> once per large folio. More details at [1].
> >>
> >> PowerPC Book3S64 could not enable this before because PMD_ORDER is
> >> selected at runtime depending upon the chosen MMU. The earlier patches
> >> in this series turn SWAPFILE_CLUSTER into a runtime value and introduce
> >> an ARCH_MAX_PMD_ORDER upperbound override for SWAP_NR_ORDERS. With those
> >> changes, we can now enable THP SWAP for Book3S64.
> >>
> >> This increases bandwidth throughput with zram backend for swapout by
> >> 40-50% with Radix and 100-130% with Hash (Tested by Sayali)
> >
> > Thanks!
> >
> > I am curious about the contents of the anonymous memory being tested
> > and the compression algorithm used by zram.
> >
>
> I am sure it was derived from your microbenchmark itself which you had
> shared here (so repetitive pattern) with default zram compression
> algorithm. Thanks for that :)
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0637c505f
>
> I think I got your point - I can mention that it was a microbenchmark
> similar to yours and not a real world workload test. Is this what you
> meant here?
Yep. Please make it clear in the changelog what kind of workload was
used, as different data can result in completely different compression
ratios and compression/decompression costs. Consequently, the reported
swap-out and swap-in performance improvements can vary significantly as
well.
w/ that, please feel free to add:
Reviewed-by: Barry Song <baohua@kernel.org>
* Re: [PATCH v4 2/3] mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER
2026-06-23 6:37 ` Ritesh Harjani
@ 2026-06-23 8:42 ` Barry Song
2026-06-23 9:32 ` Ritesh Harjani
0 siblings, 1 reply; 15+ messages in thread
From: Barry Song @ 2026-06-23 8:42 UTC (permalink / raw)
To: Ritesh Harjani
Cc: linux-mm, Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil
On Tue, Jun 23, 2026 at 3:05 PM Ritesh Harjani <ritesh.list@gmail.com> wrote:
>
> Barry Song <baohua@kernel.org> writes:
>
> > On Fri, Jun 19, 2026 at 12:41 PM Ritesh Harjani (IBM)
> > <ritesh.list@gmail.com> wrote:
> >>
> >> SWAP_NR_ORDERS sizes a few small bounded arrays inside THP swap
> >> allocator code (nofull/frag cluster lists, percpu_swap_cluster's
> >> si/offset arrays, next array for rotational device). This currently
> >> expands to PMD_ORDER+1, which only works when PMD_ORDER is a compile
> >> time constant.
> >>
> >> However on architectures like PowerPC Book3S64, PMD_ORDER is a runtime
> >> variable which depends upon which MMU is selected (Radix / Hash), so in
> >> that case, PMD_ORDER cannot be used to size the static arrays.
> >>
> >> This patch provides an optional ARCH_MAX_PMD_ORDER (upper-bound)
> >> override for such architectures. The memory overhead on enabling this
> >> override is negligible. Even if we make SWAP_NR_ORDERS runtime alloc,
> >> default slab padding could cause some memory waste. Also we lose the
> >> per-cpu cacheline benefits (for percpu_swap_cluster) because it might
> >> cost an extra cacheline indirection overhead in swap_alloc_fast() for
> >> fetching si[order]/offset[order]. Note that a fully runtime
> >> SWAP_NR_ORDERS was considered in previous version but was dropped for
> >> this reason [1]
> >
> > Do we know the maximum PMD size?
>
> ARCH_MAX_PMD_ORDER will be 8 on PowerPC book3s64 with 64K pagesize.
> PowerPC Hash MMU with 64K default pagesize supports PMD size of 16MB.
>
> > On arm64 with a 64 KB base page,
> > a PMD can be as large as 512 MB:
> > https://docs.kernel.org/arch/arm64/hugetlbpage.html
> >
> > One concern we have is that performing I/O on such a large folio could
> > incur significant latency before reclaiming any memory. For this
> > reason, on arm64 we initially enabled THP_SWAPOUT only for 4 KB base
> > pages:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0637c505f
> >
>
> That's not the case on PowerPC. Max PMD size for Hash will be 16MB.
Yep. A 16 MB folio might be fine, although I'm not sure whether
splitting a 16 MB folio into eight 2 MB folios would help much.
For 512 MB PMD-sized pages on arm64, one possible approach might be to
split them into 256 × 2 MB folios rather than all the way down to 4 KB
pages. That could provide a better balance between I/O latency and swap
performance.
> Also we still need this patch since we can at runtime choose Hash or
> Radix MMU. So, the main problem this patch is trying to solve on PowerPC
> Book3s64 is enabling this feature w/o impacting any other architecture.
> W/O this patch series, we can't enable it, since it gives build errors.
I see. If possible, please mention in the changelog that the maximum
PMD size on your platform is 16 MB. In that case, the I/O latency
concerns I raised may not really apply.
w/ that, please feel free to add:
Reviewed-by: Barry Song <baohua@kernel.org>
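
The sizes quoted in this message can be cross-checked with a little arithmetic. The following is only a sketch: pmd_order(), nr_chunks() and the KiB()/MiB() macros are hypothetical helpers written for illustration, not kernel functions.

```c
#include <assert.h>

#define KiB(x) ((unsigned long long)(x) << 10)
#define MiB(x) ((unsigned long long)(x) << 20)

/*
 * Hypothetical helper: smallest order such that
 * (base_page_size << order) >= pmd_size.
 */
unsigned int pmd_order(unsigned long long base_page_size,
                       unsigned long long pmd_size)
{
	unsigned int order = 0;

	assert(base_page_size != 0 && pmd_size >= base_page_size);
	while ((base_page_size << order) < pmd_size)
		order++;
	return order;
}

/*
 * Hypothetical helper: number of chunk_size folios that one
 * PMD-sized folio would be split into.
 */
unsigned long long nr_chunks(unsigned long long pmd_size,
                             unsigned long long chunk_size)
{
	assert(chunk_size != 0 && pmd_size % chunk_size == 0);
	return pmd_size / chunk_size;
}
```

With these helpers, pmd_order(KiB(64), MiB(16)) gives 8 (Hash, matching the ARCH_MAX_PMD_ORDER value quoted above), pmd_order(KiB(64), MiB(2)) gives 5 (Radix), and pmd_order(KiB(64), MiB(512)) gives 13 (the arm64 64 KB case). nr_chunks(MiB(16), MiB(2)) is 8 and nr_chunks(MiB(512), MiB(2)) is 256, which are the split counts discussed above.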
* Re: [PATCH v4 1/3] mm, swap: make SWAPFILE_CLUSTER runtime
2026-06-19 4:40 ` [PATCH v4 1/3] mm, swap: make SWAPFILE_CLUSTER runtime Ritesh Harjani (IBM)
2026-06-22 1:39 ` Barry Song
@ 2026-06-23 8:44 ` Barry Song
1 sibling, 0 replies; 15+ messages in thread
From: Barry Song @ 2026-06-23 8:44 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-mm, Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil
On Fri, Jun 19, 2026 at 12:41 PM Ritesh Harjani (IBM)
<ritesh.list@gmail.com> wrote:
>
> On PowerPC Book3S64, MMU is selected at runtime, so macros like
> PMD_SHIFT are effectively runtime variables in the Book3S64 code. THP
> swap code uses these macros to size some of its array data structures
> based on PMD_ORDER e.g. SWAPFILE_CLUSTER macro is used for this very
> purpose.
> Hence this patch makes the users of SWAPFILE_CLUSTER use this macro value at
> runtime and also modifies swap_table and swap_memcg_table, which earlier
> used this macro to define the number of table entries.
>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
Reviewed-by: Barry Song <baohua@kernel.org>
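
For readers following along, the idea in the quoted changelog can be sketched in userspace C. This is not the patch itself: it assumes the cluster size is 1 << PMD order, and every name below (pmd_order_rt, set_pmd_order(), swapfile_cluster(), struct swap_table_sketch) is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical: set once at boot, after the MMU (Radix/Hash) is chosen. */
static unsigned int pmd_order_rt = 5;	/* 2M PMD / 64K base page */

void set_pmd_order(unsigned int order)
{
	assert(order < 8 * sizeof(unsigned long));
	pmd_order_rt = order;
}

/* Runtime replacement for a compile-time SWAPFILE_CLUSTER-like constant. */
unsigned long swapfile_cluster(void)
{
	return 1UL << pmd_order_rt;
}

/* The table can no longer be a fixed-size array, so it records its length. */
struct swap_table_sketch {
	unsigned long nr_entries;
	unsigned long *entries;
};

struct swap_table_sketch *swap_table_sketch_alloc(void)
{
	struct swap_table_sketch *t = malloc(sizeof(*t));

	if (!t)
		return NULL;
	t->nr_entries = swapfile_cluster();
	t->entries = calloc(t->nr_entries, sizeof(*t->entries));
	if (!t->entries) {
		free(t);
		return NULL;
	}
	return t;
}

void swap_table_sketch_free(struct swap_table_sketch *t)
{
	if (t) {
		free(t->entries);
		free(t);
	}
}
```

The point of the sketch is only that the entry count is read when the table is allocated rather than baked in at build time, so the same binary yields 32 entries with a 2M PMD and 256 entries with a 16M PMD on 64K pages.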
* Re: [PATCH v4 3/3] powerpc: Kconfig: Enable THP_SWAP on Book3S64
2026-06-23 8:39 ` Barry Song
@ 2026-06-23 9:03 ` Ritesh Harjani
0 siblings, 0 replies; 15+ messages in thread
From: Ritesh Harjani @ 2026-06-23 9:03 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil
Barry Song <baohua@kernel.org> writes:
> On Tue, Jun 23, 2026 at 3:12 PM Ritesh Harjani <ritesh.list@gmail.com> wrote:
>>
>> Barry Song <baohua@kernel.org> writes:
>>
>> > On Fri, Jun 19, 2026 at 12:41 PM Ritesh Harjani (IBM)
>> > <ritesh.list@gmail.com> wrote:
>> >>
>> >> THP_SWAP avoids splitting of a transparent huge folio into 32 smaller
>> >> 64K folios (Radix-64K pagesize / 2M PMD) or into 256 smaller 64K folios
>> >> (Hash-64K pagesize / 16M PMD), during swapout. This improves the
>> >> swapping performance since all the bookkeeping & I/O submission happens
>> >> once per large folio. More details at [1].
>> >>
>> >> PowerPC Book3S64 could not enable this before because PMD_ORDER is
>> >> selected at runtime depending upon the chosen MMU. The earlier patches
>> >> in this series turn SWAPFILE_CLUSTER into a runtime value and introduce
>> >> an ARCH_MAX_PMD_ORDER upperbound override for SWAP_NR_ORDERS. With those
>> >> changes, we can now enable THP SWAP for Book3S64.
>> >>
>> >> This increases bandwidth throughput with zram backend for swapout by
>> >> 40-50% with Radix and 100-130% with Hash (Tested by Sayali)
>> >
>> > Thanks!
>> >
>> > I am curious about the contents of the anonymous memory being tested
>> > and the compression algorithm used by zram.
>> >
>>
>> I am sure it was derived from your microbenchmark itself which you had
>> shared here (so repetitive pattern) with default zram compression
>> algorithm. Thanks for that :)
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0637c505f
>>
>> I think I got your point - I can mention that it was a microbenchmark
>> similar to yours and not a real world workload test. Is this what you
>> meant here?
>
> Yep. Please make it clear in the changelog what kind of workload was
> used, as different data can result in completely different compression
> ratios and compression/decompression costs. Consequently, the reported
> swap-out and swap-in performance improvements can vary significantly as
> well.
>
Sure, I will update the changelog in the next version with more details
on the benchmark that was run. I am mainly planning to link to your commit
for the microbenchmark code we used, along with a few other details.
> w/ that, please feel free to add:
>
> Reviewed-by: Barry Song <baohua@kernel.org>
Thanks Barry for the review.
-ritesh
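
To illustrate why the data pattern matters for the reported numbers, here is a toy sketch. It uses a simple run-length size estimate, which is NOT what zram does and is not the benchmark used in this thread; rle_size(), fill_repetitive() and fill_varying() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy estimate: every run of up to 255 identical bytes costs 2 output
 * bytes (count + value). Illustration only, not a real compressor.
 */
size_t rle_size(const unsigned char *buf, size_t len)
{
	size_t out = 0, i = 0;

	assert(buf != NULL || len == 0);
	while (i < len) {
		size_t run = 1;

		while (i + run < len && run < 255 && buf[i + run] == buf[i])
			run++;
		out += 2;
		i += run;
	}
	return out;
}

/* Repetitive pattern: every byte is the same. */
void fill_repetitive(unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		buf[i] = 0x11;
}

/* Varying pattern: adjacent bytes always differ (step of 37 mod 256). */
void fill_varying(unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		buf[i] = (unsigned char)(i * 37u);
}
```

On a 4096-byte buffer the repetitive fill shrinks to 34 bytes under this estimate, while the varying fill grows to 8192 bytes. Real compressors behave differently in detail, but the direction is the same, which is why the changelog should state what data was swapped out.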
* Re: [PATCH v4 2/3] mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER
2026-06-23 8:42 ` Barry Song
@ 2026-06-23 9:32 ` Ritesh Harjani
0 siblings, 0 replies; 15+ messages in thread
From: Ritesh Harjani @ 2026-06-23 9:32 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil
Barry Song <baohua@kernel.org> writes:
> On Tue, Jun 23, 2026 at 3:05 PM Ritesh Harjani <ritesh.list@gmail.com> wrote:
>>
>> Barry Song <baohua@kernel.org> writes:
>>
>> > On Fri, Jun 19, 2026 at 12:41 PM Ritesh Harjani (IBM)
>> > <ritesh.list@gmail.com> wrote:
>> >>
>> >> SWAP_NR_ORDERS sizes a few small bounded arrays inside THP swap
>> >> allocator code (nofull/frag cluster lists, percpu_swap_cluster's
>> >> si/offset arrays, next array for rotational device). This currently
>> >> expands to PMD_ORDER+1, which only works when PMD_ORDER is a compile
>> >> time constant.
>> >>
>> >> However on architectures like PowerPC Book3S64, PMD_ORDER is a runtime
>> >> variable which depends upon which MMU is selected (Radix / Hash), so in
>> >> that case, PMD_ORDER cannot be used to size the static arrays.
>> >>
>> >> This patch provides an optional ARCH_MAX_PMD_ORDER (upper-bound)
>> >> override for such architectures. The memory overhead on enabling this
>> >> override is negligible. Even if we make SWAP_NR_ORDERS runtime alloc,
>> >> default slab padding could cause some memory waste. Also we lose the
>> >> per-cpu cacheline benefits (for percpu_swap_cluster) because it might
>> >> cost an extra cacheline indirection overhead in swap_alloc_fast() for
>> >> fetching si[order]/offset[order]. Note that a fully runtime
>> >> SWAP_NR_ORDERS was considered in previous version but was dropped for
>> >> this reason [1]
>> >
>> > Do we know the maximum PMD size?
>>
>> ARCH_MAX_PMD_ORDER will be 8 on PowerPC book3s64 with 64K pagesize.
>> PowerPC Hash MMU with 64K default pagesize supports PMD size of 16MB.
>>
>> > On arm64 with a 64 KB base page,
>> > a PMD can be as large as 512 MB:
>> > https://docs.kernel.org/arch/arm64/hugetlbpage.html
>> >
>> > One concern we have is that performing I/O on such a large folio could
>> > incur significant latency before reclaiming any memory. For this
>> > reason, on arm64 we initially enabled THP_SWAPOUT only for 4 KB base
>> > pages:
>> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0637c505f
>> >
>>
>> That's not the case on PowerPC. Max PMD size for Hash will be 16MB.
>
> Yep. A 16 MB folio might be fine, although I'm not sure whether
> splitting a 16 MB folio into eight 2 MB folios would help much.
>
> For 512 MB PMD-sized pages on arm64, one possible approach might be to
> split them into 256 × 2 MB folios rather than all the way down to 4 KB
> pages. That could provide a better balance between I/O latency and swap
> performance.
>
Fair enough. I guess this can be looked into, but it is outside the scope
of this work. For now Radix with 2MB PMD is the default on the latest
PowerPC, so this will be slightly lower priority for me right now.
>> Also we still need this patch since we can at runtime choose Hash or
>> Radix MMU. So, the main problem this patch is trying to solve on PowerPC
>> Book3s64 is enabling this feature w/o impacting any other architecture.
>> W/O this patch series, we can't enable it, since it gives build errors.
>
> I see. If possible, please mention in the changelog that the maximum
> PMD size on your platform is 16 MB.
Sure. I can do that.
> In that case, the I/O latency concerns I raised may not really apply.
>
> w/ that, please feel free to add:
>
> Reviewed-by: Barry Song <baohua@kernel.org>
Thanks again for reviewing this patch series.
I will re-spin the updated version (with additional details in the
commit msg as you requested) in a couple of days.
-ritesh
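
As a recap of the approach discussed in this sub-thread, here is a minimal userspace sketch of a compile-time upper bound sizing an array that is indexed by a runtime order. It is not the actual patch: runtime_pmd_order, struct percpu_cluster_sketch and slot_for_order() are hypothetical stand-ins, and the value 8 is the one quoted above for Book3S64 with 64K pages.

```c
#include <assert.h>

/* Hypothetical stand-in for an arch where PMD_ORDER is only known at boot. */
static unsigned int runtime_pmd_order = 5;	/* e.g. Radix: 2M PMD / 64K page */
#define PMD_ORDER runtime_pmd_order

/* Arch-provided compile-time upper bound (Hash: 16M PMD / 64K page). */
#define ARCH_MAX_PMD_ORDER 8

/* Prefer the arch bound when present, else fall back to PMD_ORDER. */
#ifdef ARCH_MAX_PMD_ORDER
#define SWAP_NR_ORDERS (ARCH_MAX_PMD_ORDER + 1)
#else
#define SWAP_NR_ORDERS (PMD_ORDER + 1)
#endif

/* Simplified per-CPU state; the field is illustrative only. */
struct percpu_cluster_sketch {
	unsigned long offset[SWAP_NR_ORDERS];	/* inline, no extra pointer chase */
};

unsigned long *slot_for_order(struct percpu_cluster_sketch *pc,
			      unsigned int order)
{
	/* The runtime order must never exceed the compile-time bound. */
	assert(PMD_ORDER < SWAP_NR_ORDERS);
	assert(order <= PMD_ORDER);
	return &pc->offset[order];
}
```

With the bound at 8 the array has 9 slots regardless of which MMU is selected. On Radix only orders 0..5 are used, so a few slots stay idle, which is the small fixed overhead traded for keeping the array inline in the per-CPU structure.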
end of thread, other threads:[~2026-06-23 11:37 UTC | newest]
Thread overview: 15+ messages
2026-06-19 4:40 [PATCH v4 0/3] mm, swap: Enable THP SWAP for PowerPC Book3S64 Ritesh Harjani (IBM)
2026-06-19 4:40 ` [PATCH v4 1/3] mm, swap: make SWAPFILE_CLUSTER runtime Ritesh Harjani (IBM)
2026-06-22 1:39 ` Barry Song
2026-06-23 4:11 ` Ritesh Harjani
2026-06-23 8:44 ` Barry Song
2026-06-19 4:40 ` [PATCH v4 2/3] mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER Ritesh Harjani (IBM)
2026-06-23 5:11 ` Barry Song
2026-06-23 6:37 ` Ritesh Harjani
2026-06-23 8:42 ` Barry Song
2026-06-23 9:32 ` Ritesh Harjani
2026-06-19 4:40 ` [PATCH v4 3/3] powerpc: Kconfig: Enable THP_SWAP on Book3S64 Ritesh Harjani (IBM)
2026-06-23 5:21 ` Barry Song
2026-06-23 7:06 ` Ritesh Harjani
2026-06-23 8:39 ` Barry Song
2026-06-23 9:03 ` Ritesh Harjani