* [PATCH v4 0/3] mm, swap: Enable THP SWAP for PowerPC Book3S64
@ 2026-06-19 4:40 Ritesh Harjani (IBM)
2026-06-19 4:40 ` [PATCH v4 1/3] mm, swap: make SWAPFILE_CLUSTER runtime Ritesh Harjani (IBM)
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Ritesh Harjani (IBM) @ 2026-06-19 4:40 UTC (permalink / raw)
To: linux-mm
Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
Ritesh Harjani (IBM)
On PowerPC Book3S64, the MMU is selected at runtime, so macros like PMD_SHIFT
are effectively runtime variables in the Book3S64 code. THP swap code uses
these macros, e.g. to size some of its array data structures based on
PMD_ORDER. This patch series makes that usage depend on the runtime
value and provides an upper-bound architecture override for cases (e.g.
SWAP_NR_ORDERS) where the runtime conversion is not considered beneficial.
This series increases swapout throughput with a zram backend by
around 40-50% with Radix and 100-130% with Hash (tested by Sayali).
Note that this patch series is based on linux-next (next-20260608).
v3->v4:
======
1. Revert SWAPFILE_CLUSTER definition - since we already adjusted all the users
of SWAPFILE_CLUSTER and made those users use this value at runtime (Kairui Song)
v2 -> v3:
=========
1. Fixed sparse warning for swap_table_use_page reported by lkp in patch-1
RFC -> v2:
==========
1. Send the unused leftovers change in swap.h separately [1]
2. Changed RFC Patch-3 design from runtime SWAP_NR_ORDERS to arch override
(ARCH_MAX_PMD_ORDER) - suggested by Youngjun
3. Dropped RFC tag
[1]: https://lore.kernel.org/linux-mm/68591daf0d679e5a0072d63751f187d14613e2b0.1781146877.git.ritesh.list@gmail.com/
[RFC]: https://lore.kernel.org/linux-mm/cover.1781000840.git.ritesh.list@gmail.com/
[v2]: https://lore.kernel.org/linuxppc-dev/cover.1781170904.git.ritesh.list@gmail.com/
Ritesh Harjani (IBM) (3):
mm, swap: make SWAPFILE_CLUSTER runtime
mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER
powerpc: Kconfig: Enable THP_SWAP on Book3S64
arch/powerpc/include/asm/book3s/64/pgtable.h | 7 +++++++
arch/powerpc/platforms/Kconfig.cputype | 1 +
include/linux/swap.h | 12 +++++++++++-
mm/swap_table.h | 6 ++----
mm/swapfile.c | 17 ++++++++++++-----
5 files changed, 33 insertions(+), 10 deletions(-)
--
2.39.5
* [PATCH v4 1/3] mm, swap: make SWAPFILE_CLUSTER runtime
2026-06-19 4:40 [PATCH v4 0/3] mm, swap: Enable THP SWAP for PowerPC Book3S64 Ritesh Harjani (IBM)
@ 2026-06-19 4:40 ` Ritesh Harjani (IBM)
2026-06-22 1:39 ` Barry Song
2026-06-19 4:40 ` [PATCH v4 2/3] mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER Ritesh Harjani (IBM)
2026-06-19 4:40 ` [PATCH v4 3/3] powerpc: Kconfig: Enable THP_SWAP on Book3S64 Ritesh Harjani (IBM)
2 siblings, 1 reply; 8+ messages in thread
From: Ritesh Harjani (IBM) @ 2026-06-19 4:40 UTC (permalink / raw)
To: linux-mm
Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
Ritesh Harjani (IBM)
On PowerPC Book3S64, the MMU is selected at runtime, so macros like
PMD_SHIFT are effectively runtime variables in the Book3S64 code. THP
swap code uses these macros to size some of its array data structures
based on PMD_ORDER; e.g. the SWAPFILE_CLUSTER macro is used for this very
purpose.
Hence this patch makes the users of SWAPFILE_CLUSTER evaluate the macro at
runtime, and converts swap_table and swap_memcg_table, which previously used
this macro to define the number of table entries, to flexible arrays.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
mm/swap_table.h | 6 ++----
mm/swapfile.c | 17 ++++++++++++-----
2 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/mm/swap_table.h b/mm/swap_table.h
index e6613e62f8d0..90e2a7852300 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -8,16 +8,14 @@
/* A typical flat array in each cluster as swap table */
struct swap_table {
- atomic_long_t entries[SWAPFILE_CLUSTER];
+ DECLARE_FLEX_ARRAY(atomic_long_t, entries);
};
/* For storing memcg private id */
struct swap_memcg_table {
- unsigned short id[SWAPFILE_CLUSTER];
+ DECLARE_FLEX_ARRAY(unsigned short, id);
};
-#define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
-
/*
* A swap table entry represents the status of a swap slot on a swap
* (physical or virtual) device. The swap table in each cluster is a
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 78b49b0658ad..4bf11c5b87eb 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -129,6 +129,8 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
.lock = INIT_LOCAL_LOCK(),
};
+static bool swap_table_use_page __ro_after_init;
+
/* May return NULL on invalid type, caller must check for NULL return */
static struct swap_info_struct *swap_type_to_info(int type)
{
@@ -437,7 +439,7 @@ static void swap_cluster_free_table(struct swap_cluster_info *ci)
return;
rcu_assign_pointer(ci->table, NULL);
- if (!SWP_TABLE_USE_PAGE) {
+ if (!swap_table_use_page) {
kmem_cache_free(swap_table_cachep, table);
return;
}
@@ -456,7 +458,7 @@ static int swap_cluster_alloc_table(struct swap_cluster_info *ci, gfp_t gfp)
if (rcu_access_pointer(ci->table))
return 0;
- if (SWP_TABLE_USE_PAGE) {
+ if (swap_table_use_page) {
folio = folio_alloc(gfp | __GFP_ZERO, 0);
if (folio)
table = folio_address(folio);
@@ -471,7 +473,8 @@ static int swap_cluster_alloc_table(struct swap_cluster_info *ci, gfp_t gfp)
#ifdef CONFIG_MEMCG
if (!mem_cgroup_disabled()) {
VM_WARN_ON_ONCE(ci->memcg_table);
- ci->memcg_table = kzalloc_obj(*ci->memcg_table, gfp);
+ ci->memcg_table = kzalloc_flex(*ci->memcg_table, id,
+ SWAPFILE_CLUSTER, gfp);
if (!ci->memcg_table) {
swap_cluster_free_table(ci);
return -ENOMEM;
@@ -3912,14 +3915,18 @@ static int __init swapfile_init(void)
{
swapfile_maximum_size = arch_max_swapfile_size();
+ swap_table_use_page =
+ (SWAPFILE_CLUSTER * sizeof(atomic_long_t) == PAGE_SIZE);
+
/*
* Once a cluster is freed, it's swap table content is read
* only, and all swap cache readers (swap_cache_*) verifies
* the content before use. So it's safe to use RCU slab here.
*/
- if (!SWP_TABLE_USE_PAGE)
+ if (!swap_table_use_page)
swap_table_cachep = kmem_cache_create("swap_table",
- sizeof(struct swap_table),
+ struct_size_t(struct swap_table, entries,
+ SWAPFILE_CLUSTER),
0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
#ifdef CONFIG_MIGRATION
--
2.39.5
* [PATCH v4 2/3] mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER
2026-06-19 4:40 [PATCH v4 0/3] mm, swap: Enable THP SWAP for PowerPC Book3S64 Ritesh Harjani (IBM)
2026-06-19 4:40 ` [PATCH v4 1/3] mm, swap: make SWAPFILE_CLUSTER runtime Ritesh Harjani (IBM)
@ 2026-06-19 4:40 ` Ritesh Harjani (IBM)
2026-06-23 5:11 ` Barry Song
2026-06-19 4:40 ` [PATCH v4 3/3] powerpc: Kconfig: Enable THP_SWAP on Book3S64 Ritesh Harjani (IBM)
2 siblings, 1 reply; 8+ messages in thread
From: Ritesh Harjani (IBM) @ 2026-06-19 4:40 UTC (permalink / raw)
To: linux-mm
Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
Ritesh Harjani (IBM)
SWAP_NR_ORDERS sizes a few small bounded arrays inside THP swap
allocator code (nofull/frag cluster lists, percpu_swap_cluster's
si/offset arrays, next array for rotational device). This currently
expands to PMD_ORDER+1, which only works when PMD_ORDER is a compile
time constant.
However, on architectures like PowerPC Book3S64, PMD_ORDER is a runtime
variable which depends on which MMU is selected (Radix / Hash), so in
that case PMD_ORDER cannot be used to size the static arrays.
This patch provides an optional ARCH_MAX_PMD_ORDER (upper-bound)
override for such architectures. The memory overhead of enabling this
override is negligible. Even if we made SWAP_NR_ORDERS a runtime
allocation, default slab padding could cause some memory waste. We would
also lose the per-cpu cacheline benefits (for percpu_swap_cluster),
because it might cost an extra cacheline indirection in
swap_alloc_fast() for fetching si[order]/offset[order]. Note that a
fully runtime SWAP_NR_ORDERS was considered in a previous version but
was dropped for this reason [1].
[1]: https://lore.kernel.org/linuxppc-dev/pl1zdksc.ritesh.list@gmail.com/
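The override pattern described above can be sketched in plain userspace C.
This is only an illustration under stated assumptions, not the kernel code:
all DEMO_* names and the demo_* helpers are hypothetical, and the per-MMU
orders 8 (Hash) and 5 (Radix) are assumed from the 64K-page figures quoted in
patch 3 of this series. It shows the array being sized by the compile-time
upper bound while the runtime order decides how many slots are actually used.

```c
#include <stddef.h>

/* Hypothetical per-MMU PMD orders (assumed values for 64K base pages). */
#define DEMO_HASH_PMD_ORDER   8
#define DEMO_RADIX_PMD_ORDER  5

/* Compile-time upper bound across both MMUs, in the spirit of ARCH_MAX_PMD_ORDER. */
#define DEMO_ARCH_MAX_PMD_ORDER \
	((DEMO_HASH_PMD_ORDER > DEMO_RADIX_PMD_ORDER) ? \
	 DEMO_HASH_PMD_ORDER : DEMO_RADIX_PMD_ORDER)

/* Use the arch upper bound if provided, else fall back to a constant PMD order. */
#ifdef DEMO_ARCH_MAX_PMD_ORDER
#define DEMO_SWAP_NR_ORDERS (DEMO_ARCH_MAX_PMD_ORDER + 1)
#else
#define DEMO_SWAP_NR_ORDERS (DEMO_PMD_ORDER + 1)
#endif

/* Statically sized per-order arrays, loosely modelled on percpu_swap_cluster. */
struct demo_percpu_cluster {
	void *si[DEMO_SWAP_NR_ORDERS];
	unsigned long offset[DEMO_SWAP_NR_ORDERS];
};

/* PMD order chosen at boot, depending on the selected MMU. */
static int demo_runtime_pmd_order(int use_hash)
{
	return use_hash ? DEMO_HASH_PMD_ORDER : DEMO_RADIX_PMD_ORDER;
}

/* Array slots that stay unused because the bound exceeds the runtime order. */
static size_t demo_unused_slots(int use_hash)
{
	return DEMO_SWAP_NR_ORDERS - (size_t)(demo_runtime_pmd_order(use_hash) + 1);
}
```

Under these assumed values the arrays have 9 slots; a Hash boot uses all of
them and a Radix boot leaves 3 unused, which is the small fixed overhead the
commit message refers to.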
Suggested-by: YoungJun Park <youngjun.park@lge.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
arch/powerpc/include/asm/book3s/64/pgtable.h | 7 +++++++
include/linux/swap.h | 12 +++++++++++-
2 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index e67e64ac6e8c..7f22d5d5fbdf 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -204,6 +204,13 @@ extern unsigned long __pmd_frag_size_shift;
#define MAX_PTRS_PER_PGD (1 << (H_PGD_INDEX_SIZE > RADIX_PGD_INDEX_SIZE ? \
H_PGD_INDEX_SIZE : RADIX_PGD_INDEX_SIZE))
+/*
+ * Compile-time upper bound on PMD_ORDER across hash and radix MMUs.
+ * Used by THP SWAP code. Check include/linux/swap.h
+ */
+#define ARCH_MAX_PMD_ORDER ((H_PTE_INDEX_SIZE > RADIX_PTE_INDEX_SIZE) ? \
+ H_PTE_INDEX_SIZE : RADIX_PTE_INDEX_SIZE)
+
/* PMD_SHIFT determines what a second-level page table entry can map */
#define PMD_SHIFT (PAGE_SHIFT + PTE_INDEX_SIZE)
#define PMD_SIZE (1UL << PMD_SHIFT)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8f0f68e245ba..317168aa2db5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -229,11 +229,21 @@ enum {
*/
#define SWAP_ENTRY_INVALID 0
+/*
+ * ARCH_MAX_PMD_ORDER is an optional arch hook: a compile-time upper bound for
+ * PMD_ORDER across all possible MMU configurations of that arch. It is used to
+ * size SWAP_NR_ORDERS on architectures (e.g. powerpc book3s64) where PMD_ORDER
+ * is selected at boot rather than at compile time.
+ */
#ifdef CONFIG_THP_SWAP
+#ifdef ARCH_MAX_PMD_ORDER
+#define SWAP_NR_ORDERS (ARCH_MAX_PMD_ORDER + 1)
+#else
#define SWAP_NR_ORDERS (PMD_ORDER + 1)
+#endif /* ARCH_MAX_PMD_ORDER */
#else
#define SWAP_NR_ORDERS 1
-#endif
+#endif /* CONFIG_THP_SWAP */
/*
* We keep using same cluster for rotational device so IO will be sequential.
--
2.39.5
* [PATCH v4 3/3] powerpc: Kconfig: Enable THP_SWAP on Book3S64
2026-06-19 4:40 [PATCH v4 0/3] mm, swap: Enable THP SWAP for PowerPC Book3S64 Ritesh Harjani (IBM)
2026-06-19 4:40 ` [PATCH v4 1/3] mm, swap: make SWAPFILE_CLUSTER runtime Ritesh Harjani (IBM)
2026-06-19 4:40 ` [PATCH v4 2/3] mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER Ritesh Harjani (IBM)
@ 2026-06-19 4:40 ` Ritesh Harjani (IBM)
2026-06-23 5:21 ` Barry Song
2 siblings, 1 reply; 8+ messages in thread
From: Ritesh Harjani (IBM) @ 2026-06-19 4:40 UTC (permalink / raw)
To: linux-mm
Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
Ritesh Harjani (IBM)
THP_SWAP avoids splitting of a transparent huge folio into 32 smaller
64K folios (Radix-64K pagesize / 2M PMD) or into 256 smaller 64K folios
(Hash-64K pagesize / 16M PMD), during swapout. This improves the
swapping performance since all the bookkeeping & I/O submission happens
once per large folio. More details at [1].
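The folio counts quoted above follow directly from the PMD and base page
sizes. A minimal sketch of that arithmetic, with hypothetical helper names
(demo_pages_per_pmd, demo_order) that are not kernel functions:

```c
#include <stddef.h>

/* Number of base pages covered by one PMD-sized folio. */
static unsigned long demo_pages_per_pmd(unsigned long pmd_size,
					unsigned long page_size)
{
	return pmd_size / page_size;
}

/* Folio order, i.e. log2 of a power-of-two page count. */
static unsigned int demo_order(unsigned long nr_pages)
{
	unsigned int order = 0;

	while (nr_pages > 1) {
		nr_pages >>= 1;
		order++;
	}
	return order;
}
```

With 64K base pages, demo_pages_per_pmd(2UL << 20, 64UL << 10) gives 32
(order 5) for Radix and demo_pages_per_pmd(16UL << 20, 64UL << 10) gives 256
(order 8) for Hash.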
PowerPC Book3S64 could not enable this before because PMD_ORDER is
selected at runtime depending upon the chosen MMU. The earlier patches
in this series turn SWAPFILE_CLUSTER into a runtime value and introduce
an ARCH_MAX_PMD_ORDER upperbound override for SWAP_NR_ORDERS. With those
changes, we can now enable THP SWAP for Book3S64.
This increases bandwidth throughput with zram backend for swapout by
40-50% with Radix and 100-130% with Hash (Tested by Sayali)
[1]: https://lore.kernel.org/all/20170515112522.32457-2-ying.huang@intel.com/
Tested-by: Sayali Patil <sayalip@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
arch/powerpc/platforms/Kconfig.cputype | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index bac02c83bb3e..48f74bd22343 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -113,6 +113,7 @@ config PPC_THP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
+ select ARCH_WANTS_THP_SWAP if TRANSPARENT_HUGEPAGE
choice
prompt "CPU selection"
--
2.39.5
* Re: [PATCH v4 1/3] mm, swap: make SWAPFILE_CLUSTER runtime
2026-06-19 4:40 ` [PATCH v4 1/3] mm, swap: make SWAPFILE_CLUSTER runtime Ritesh Harjani (IBM)
@ 2026-06-22 1:39 ` Barry Song
2026-06-23 4:11 ` Ritesh Harjani
0 siblings, 1 reply; 8+ messages in thread
From: Barry Song @ 2026-06-22 1:39 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-mm, Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil
On Fri, Jun 19, 2026 at 12:41 PM Ritesh Harjani (IBM)
<ritesh.list@gmail.com> wrote:
>
> On PowerPC Book3S64, MMU is selected at runtime, so macros like
> PMD_SHIFT are effectively runtime variables in the Book3S64 code. THP
Not an expert on Book3S64—could you explain the runtime variables in
more detail? Does enabling THP_SWAP on PowerPC cause any build issues?
> swap code uses these macros to size some of its array data structures
> based on PMD_ORDER e.g. SWAPFILE_CLUSTER macro is used for this very
> purpose.
> Hence this patch makes the users of SWAPFILE_CLUSTER to use this macro value at
> runtime and also modifies swap_table and swap_memcg_table which were earlier
> using this macro for defining the number of table entries.
>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
> mm/swap_table.h | 6 ++----
> mm/swapfile.c | 17 ++++++++++++-----
> 2 files changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/mm/swap_table.h b/mm/swap_table.h
> index e6613e62f8d0..90e2a7852300 100644
> --- a/mm/swap_table.h
> +++ b/mm/swap_table.h
> @@ -8,16 +8,14 @@
>
> /* A typical flat array in each cluster as swap table */
> struct swap_table {
> - atomic_long_t entries[SWAPFILE_CLUSTER];
> + DECLARE_FLEX_ARRAY(atomic_long_t, entries);
> };
>
> /* For storing memcg private id */
> struct swap_memcg_table {
> - unsigned short id[SWAPFILE_CLUSTER];
> + DECLARE_FLEX_ARRAY(unsigned short, id);
> };
>
> -#define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
> -
> /*
> * A swap table entry represents the status of a swap slot on a swap
> * (physical or virtual) device. The swap table in each cluster is a
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 78b49b0658ad..4bf11c5b87eb 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -129,6 +129,8 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
> .lock = INIT_LOCAL_LOCK(),
> };
>
> +static bool swap_table_use_page __ro_after_init;
Does a static key help here?
Best Regards
Barry
* Re: [PATCH v4 1/3] mm, swap: make SWAPFILE_CLUSTER runtime
2026-06-22 1:39 ` Barry Song
@ 2026-06-23 4:11 ` Ritesh Harjani
0 siblings, 0 replies; 8+ messages in thread
From: Ritesh Harjani @ 2026-06-23 4:11 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil
Barry Song <baohua@kernel.org> writes:
> On Fri, Jun 19, 2026 at 12:41 PM Ritesh Harjani (IBM)
> <ritesh.list@gmail.com> wrote:
>>
>> On PowerPC Book3S64, MMU is selected at runtime, so macros like
>> PMD_SHIFT are effectively runtime variables in the Book3S64 code. THP
>
> Not an expert on Book3S64—could you explain the runtime variables in
> more detail? Does enabling THP_SWAP on PowerPC cause any build issues?
>
Yes, it causes build issues. We cannot declare array sizes using runtime
variables; that is what this patch series fixes.
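As a userspace illustration of this point (a minimal sketch, not the kernel
code): C needs a compile-time constant for the size of a struct array member,
so a table whose length is only known at boot has to use a flexible array
member and size its allocation at runtime. The names demo_table,
demo_table_alloc and demo_table_use_page are hypothetical; the patch itself
uses DECLARE_FLEX_ARRAY(), kzalloc_flex() and struct_size_t(). The extra nr
member is only there because standard C requires a named member before a
flexible array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a table whose length is not a compile-time constant. */
struct demo_table {
	size_t nr;       /* number of entries, chosen at runtime */
	long entries[];  /* C99 flexible array member */
};

/* Allocate a zeroed table with nr entries; NULL on failure or size overflow. */
static struct demo_table *demo_table_alloc(size_t nr)
{
	struct demo_table *t;

	if (nr > (SIZE_MAX - sizeof(*t)) / sizeof(t->entries[0]))
		return NULL;
	t = calloc(1, sizeof(*t) + nr * sizeof(t->entries[0]));
	if (t)
		t->nr = nr;
	return t;
}

/* Runtime analogue of the swap_table_use_page check: does the table fill one page? */
static bool demo_table_use_page(size_t nr, size_t entry_size, size_t page_size)
{
	return nr * entry_size == page_size;
}
```

For example, demo_table_use_page(512, 8, 4096) is true, while a 32-entry
table of 8-byte entries on a 64K page is not page sized and would take the
slab-style allocation path instead.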
>> swap code uses these macros to size some of its array data structures
>> based on PMD_ORDER e.g. SWAPFILE_CLUSTER macro is used for this very
>> purpose.
>> Hence this patch makes the users of SWAPFILE_CLUSTER to use this macro value at
>> runtime and also modifies swap_table and swap_memcg_table which were earlier
>> using this macro for defining the number of table entries.
>>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>> mm/swap_table.h | 6 ++----
>> mm/swapfile.c | 17 ++++++++++++-----
>> 2 files changed, 14 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/swap_table.h b/mm/swap_table.h
>> index e6613e62f8d0..90e2a7852300 100644
>> --- a/mm/swap_table.h
>> +++ b/mm/swap_table.h
>> @@ -8,16 +8,14 @@
>>
>> /* A typical flat array in each cluster as swap table */
>> struct swap_table {
>> - atomic_long_t entries[SWAPFILE_CLUSTER];
>> + DECLARE_FLEX_ARRAY(atomic_long_t, entries);
>> };
>>
>> /* For storing memcg private id */
>> struct swap_memcg_table {
>> - unsigned short id[SWAPFILE_CLUSTER];
>> + DECLARE_FLEX_ARRAY(unsigned short, id);
>> };
>>
>> -#define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
>> -
>> /*
>> * A swap table entry represents the status of a swap slot on a swap
>> * (physical or virtual) device. The swap table in each cluster is a
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 78b49b0658ad..4bf11c5b87eb 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -129,6 +129,8 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
>> .lock = INIT_LOCAL_LOCK(),
>> };
>>
>> +static bool swap_table_use_page __ro_after_init;
>
> Does a static key help here?
That IMO won't give much benefit, given that the allocation, whether from
the kmem cache or the page allocator, dominates the cost anyway. Also, I
believe this is exactly the use case where the branch predictor helps
significantly and reliably, given the variable is ro_after_init.
>
> Best Regards
> Barry
Thanks Barry for looking into this.
-ritesh
* Re: [PATCH v4 2/3] mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER
2026-06-19 4:40 ` [PATCH v4 2/3] mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER Ritesh Harjani (IBM)
@ 2026-06-23 5:11 ` Barry Song
0 siblings, 0 replies; 8+ messages in thread
From: Barry Song @ 2026-06-23 5:11 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-mm, Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil
On Fri, Jun 19, 2026 at 12:41 PM Ritesh Harjani (IBM)
<ritesh.list@gmail.com> wrote:
>
> SWAP_NR_ORDERS sizes a few small bounded arrays inside THP swap
> allocator code (nofull/frag cluster lists, percpu_swap_cluster's
> si/offset arrays, next array for rotational device). This currently
> expands to PMD_ORDER+1, which only works when PMD_ORDER is a compile
> time constant.
>
> However on architecture like PowerPC Book3S64, PMD_ORDER is a runtime
> variable which depends upon which MMU is selected (Radix / Hash), so in
> that case, PMD_ORDER cannot be used to size the static arrays.
>
> This patch provides an optional ARCH_MAX_PMD_ORDER (upper-bound)
> override for such architectures. The memory overhead on enabling this
> override is negligible. Even if we make SWAP_NR_ORDERS runtime alloc,
> default slab padding could cause some memory waste. Also we lose the
> per-cpu cacheline benefits (for percpu_swap_cluster) because it might
> cost an extra cacheline indirection overhead in swap_alloc_fast() for
> fetching si[order]/offset[order]. Note that a fully runtime
> SWAP_NR_ORDERS was considered in previous version but was dropped for
> this reason [1]
Do we know the maximum PMD size? On arm64 with a 64 KB base page,
a PMD can be as large as 512 MB:
https://docs.kernel.org/arch/arm64/hugetlbpage.html
One concern we have is that performing I/O on such a large folio could
incur significant latency before reclaiming any memory. For this
reason, on arm64 we initially enabled THP_SWAPOUT only for 4 KB base
pages:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0637c505f
>
> [1]: https://lore.kernel.org/linuxppc-dev/pl1zdksc.ritesh.list@gmail.com/
>
Best Regards
Barry
* Re: [PATCH v4 3/3] powerpc: Kconfig: Enable THP_SWAP on Book3S64
2026-06-19 4:40 ` [PATCH v4 3/3] powerpc: Kconfig: Enable THP_SWAP on Book3S64 Ritesh Harjani (IBM)
@ 2026-06-23 5:21 ` Barry Song
0 siblings, 0 replies; 8+ messages in thread
From: Barry Song @ 2026-06-23 5:21 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-mm, Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil
On Fri, Jun 19, 2026 at 12:41 PM Ritesh Harjani (IBM)
<ritesh.list@gmail.com> wrote:
>
> THP_SWAP avoids splitting of a transparent huge folio into 32 smaller
> 64K folios (Radix-64K pagesize / 2M PMD) or into 256 smaller 64K folios
> (Hash-64K pagesize / 16M PMD), during swapout. This improves the
> swapping performance since all the bookkeeping & I/O submission happens
> once per large folio. More details at [1].
>
> PowerPC Book3S64 could not enable this before because PMD_ORDER is
> selected at runtime depending upon the chosen MMU. The earlier patches
> in this series turn SWAPFILE_CLUSTER into a runtime value and introduce
> an ARCH_MAX_PMD_ORDER upperbound override for SWAP_NR_ORDERS. With those
> changes, we can now enable THP SWAP for Book3S64.
>
> This increases bandwidth throughput with zram backend for swapout by
> 40-50% with Radix and 100-130% with Hash (Tested by Sayali)
Thanks!
I am curious about the contents of the anonymous memory being tested
and the compression algorithm used by zram.
>
> [1]: https://lore.kernel.org/all/20170515112522.32457-2-ying.huang@intel.com/
Best Regards
Barry