* [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64
@ 2026-06-09 13:19 Ritesh Harjani (IBM)
2026-06-09 13:19 ` [RFC 1/4] include/linux/swap.h: Remove unused leftovers Ritesh Harjani (IBM)
` (4 more replies)
0 siblings, 5 replies; 6+ messages in thread
From: Ritesh Harjani (IBM) @ 2026-06-09 13:19 UTC (permalink / raw)
To: linux-mm
Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
Ritesh Harjani (IBM)
On PowerPC Book3S64, MMU is selected at runtime, so macros like PMD_SHIFT are
effectively runtime variables in the Book3S64 code. THP swap code uses these
macros for e.g. to size some of its array data structures based on PMD_ORDER.
This patch series makes that usage dependent on the runtime variable.
Sayali did some performance runs of this on Book3S64 with Radix and it gives
40-50% performance improvement. We also plan to run it with Hash, will soon
update the results.
Note that this patch series is based out of linux-next (next-20260608).
Ritesh Harjani (IBM) (4):
include/linux/swap.h: Remove unused leftovers
mm, swap: make SWAPFILE_CLUSTER runtime
mm, swap: make SWAP_NR_ORDERS runtime
powerpc: Kconfig: Enable THP_SWAP on Book3S64
arch/powerpc/platforms/Kconfig.cputype | 1 +
include/linux/swap.h | 17 +---
mm/swap.h | 5 +-
mm/swap_table.h | 6 +-
mm/swapfile.c | 132 ++++++++++++++++++-------
5 files changed, 106 insertions(+), 55 deletions(-)
--
2.39.5
^ permalink raw reply [flat|nested] 6+ messages in thread
* [RFC 1/4] include/linux/swap.h: Remove unused leftovers
2026-06-09 13:19 [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64 Ritesh Harjani (IBM)
@ 2026-06-09 13:19 ` Ritesh Harjani (IBM)
2026-06-09 13:19 ` [RFC 2/4] mm, swap: make SWAPFILE_CLUSTER runtime Ritesh Harjani (IBM)
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Ritesh Harjani (IBM) @ 2026-06-09 13:19 UTC (permalink / raw)
To: linux-mm
Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
Ritesh Harjani (IBM)
This removed unused leftovers most of them are forward structure
declarations. Also removes SWAP_BATCH macro which isn't used any
where in the code.
Found these during manual code review.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
include/linux/swap.h | 7 -------
1 file changed, 7 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8f0f68e245ba..46c25523d7b8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -16,10 +16,6 @@
#include <uapi/linux/mempolicy.h>
#include <asm/page.h>
-struct notifier_block;
-
-struct bio;
-
#define SWAP_FLAG_PREFER 0x8000 /* set if swap priority specified */
#define SWAP_FLAG_PRIO_MASK 0x7fff
#define SWAP_FLAG_DISCARD 0x10000 /* enable discard for swap */
@@ -29,7 +25,6 @@ struct bio;
#define SWAP_FLAGS_VALID (SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \
SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \
SWAP_FLAG_DISCARD_PAGES)
-#define SWAP_BATCH 64
static inline int current_is_kswapd(void)
{
@@ -175,7 +170,6 @@ static inline void mm_account_reclaimed_pages(unsigned long pages)
struct address_space;
struct sysinfo;
-struct writeback_control;
struct zone;
/*
@@ -442,7 +436,6 @@ extern sector_t swapdev_block(int, pgoff_t);
extern int __swap_count(swp_entry_t entry);
extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
extern int swp_swapcount(swp_entry_t entry);
-struct backing_dev_info;
extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
sector_t swap_folio_sector(struct folio *folio);
--
2.39.5
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [RFC 2/4] mm, swap: make SWAPFILE_CLUSTER runtime
2026-06-09 13:19 [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64 Ritesh Harjani (IBM)
2026-06-09 13:19 ` [RFC 1/4] include/linux/swap.h: Remove unused leftovers Ritesh Harjani (IBM)
@ 2026-06-09 13:19 ` Ritesh Harjani (IBM)
2026-06-09 13:19 ` [RFC 3/4] mm, swap: make SWAP_NR_ORDERS runtime Ritesh Harjani (IBM)
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Ritesh Harjani (IBM) @ 2026-06-09 13:19 UTC (permalink / raw)
To: linux-mm
Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
Ritesh Harjani (IBM)
This makes SWAPFILE_CLUSTER as a runtime value. Architectures like powerpc
book3s64 has HPAGE_PMD_NR, which is derived at runtime depending upon which
chosen mmu.
Hence this patch initializes SWAPFILE_CLUSTER at runtime and also
modifies swap_table and swap_memcg_table which were earlier using this
macro for defining the number of table entries.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
mm/swap.h | 5 +++--
mm/swap_table.h | 6 ++----
mm/swapfile.c | 27 ++++++++++++++++++++++-----
3 files changed, 27 insertions(+), 11 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index 77d2d14eda42..956879a69ddd 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -26,11 +26,12 @@ extern int page_cluster;
#define SWAP_TABLE_HAS_ZEROFLAG ((BITS_PER_LONG - SWAP_CACHE_PFN_MARK_BITS - \
SWAP_CACHE_PFN_BITS) > SWAP_COUNT_MIN_BITS)
+extern unsigned int swap_slots_in_cluster __read_mostly;
+#define SWAPFILE_CLUSTER swap_slots_in_cluster
+
#ifdef CONFIG_THP_SWAP
-#define SWAPFILE_CLUSTER HPAGE_PMD_NR
#define swap_entry_order(order) (order)
#else
-#define SWAPFILE_CLUSTER 256
#define swap_entry_order(order) 0
#endif
diff --git a/mm/swap_table.h b/mm/swap_table.h
index e6613e62f8d0..90e2a7852300 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -8,16 +8,14 @@
/* A typical flat array in each cluster as swap table */
struct swap_table {
- atomic_long_t entries[SWAPFILE_CLUSTER];
+ DECLARE_FLEX_ARRAY(atomic_long_t, entries);
};
/* For storing memcg private id */
struct swap_memcg_table {
- unsigned short id[SWAPFILE_CLUSTER];
+ DECLARE_FLEX_ARRAY(unsigned short, id);
};
-#define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
-
/*
* A swap table entry represents the status of a swap slot on a swap
* (physical or virtual) device. The swap table in each cluster is a
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 78b49b0658ad..016a5aa0cb93 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -129,6 +129,17 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
.lock = INIT_LOCAL_LOCK(),
};
+unsigned int swap_slots_in_cluster __read_mostly;
+bool swap_table_use_page __read_mostly;
+
+static unsigned int generic_swap_slots_in_clusters(void)
+{
+ if (IS_ENABLED(CONFIG_THP_SWAP))
+ return HPAGE_PMD_NR;
+ else
+ return 256;
+}
+
/* May return NULL on invalid type, caller must check for NULL return */
static struct swap_info_struct *swap_type_to_info(int type)
{
@@ -437,7 +448,7 @@ static void swap_cluster_free_table(struct swap_cluster_info *ci)
return;
rcu_assign_pointer(ci->table, NULL);
- if (!SWP_TABLE_USE_PAGE) {
+ if (!swap_table_use_page) {
kmem_cache_free(swap_table_cachep, table);
return;
}
@@ -456,7 +467,7 @@ static int swap_cluster_alloc_table(struct swap_cluster_info *ci, gfp_t gfp)
if (rcu_access_pointer(ci->table))
return 0;
- if (SWP_TABLE_USE_PAGE) {
+ if (swap_table_use_page) {
folio = folio_alloc(gfp | __GFP_ZERO, 0);
if (folio)
table = folio_address(folio);
@@ -471,7 +482,8 @@ static int swap_cluster_alloc_table(struct swap_cluster_info *ci, gfp_t gfp)
#ifdef CONFIG_MEMCG
if (!mem_cgroup_disabled()) {
VM_WARN_ON_ONCE(ci->memcg_table);
- ci->memcg_table = kzalloc_obj(*ci->memcg_table, gfp);
+ ci->memcg_table = kzalloc_flex(*ci->memcg_table, id,
+ SWAPFILE_CLUSTER, gfp);
if (!ci->memcg_table) {
swap_cluster_free_table(ci);
return -ENOMEM;
@@ -3912,14 +3924,19 @@ static int __init swapfile_init(void)
{
swapfile_maximum_size = arch_max_swapfile_size();
+ swap_slots_in_cluster = generic_swap_slots_in_clusters();
+ swap_table_use_page =
+ (swap_slots_in_cluster * sizeof(atomic_long_t) == PAGE_SIZE);
+
/*
* Once a cluster is freed, it's swap table content is read
* only, and all swap cache readers (swap_cache_*) verifies
* the content before use. So it's safe to use RCU slab here.
*/
- if (!SWP_TABLE_USE_PAGE)
+ if (!swap_table_use_page)
swap_table_cachep = kmem_cache_create("swap_table",
- sizeof(struct swap_table),
+ struct_size_t(struct swap_table, entries,
+ SWAPFILE_CLUSTER),
0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
#ifdef CONFIG_MIGRATION
--
2.39.5
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [RFC 3/4] mm, swap: make SWAP_NR_ORDERS runtime
2026-06-09 13:19 [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64 Ritesh Harjani (IBM)
2026-06-09 13:19 ` [RFC 1/4] include/linux/swap.h: Remove unused leftovers Ritesh Harjani (IBM)
2026-06-09 13:19 ` [RFC 2/4] mm, swap: make SWAPFILE_CLUSTER runtime Ritesh Harjani (IBM)
@ 2026-06-09 13:19 ` Ritesh Harjani (IBM)
2026-06-09 13:19 ` [RFC 4/4] powerpc: Kconfig: Enable THP_SWAP on Book3S64 Ritesh Harjani (IBM)
2026-06-09 15:54 ` [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64 YoungJun Park
4 siblings, 0 replies; 6+ messages in thread
From: Ritesh Harjani (IBM) @ 2026-06-09 13:19 UTC (permalink / raw)
To: linux-mm
Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
Ritesh Harjani (IBM)
SWAP_NR_ORDERS is currently a compile-time constant defined as PMD_ORDER
+ 1 when CONFIG_THP_SWAP=y, else 1.
This patch converts SWAP_NR_ORDERS and all the relevant code paths to
make it runtime dependent. This is needed for architectures like powerpc
book3s64, where PMD_ORDER is decided at runtime depending upon which MMU
is chosen (Radix / Hash).
One thing to note here is, if any of the allocations required in
swapfile_init() call (which is a subsys_initcall) fails, then we have no
option but to panic. This is inline with how memory allocation failures
in other subsys_initcall() are handled.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
include/linux/swap.h | 10 ++---
mm/swapfile.c | 105 ++++++++++++++++++++++++++++++-------------
2 files changed, 78 insertions(+), 37 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46c25523d7b8..063ab7c4d4a5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -224,9 +224,9 @@ enum {
#define SWAP_ENTRY_INVALID 0
#ifdef CONFIG_THP_SWAP
-#define SWAP_NR_ORDERS (PMD_ORDER + 1)
+#define swap_nr_orders() ((unsigned int)(PMD_ORDER + 1))
#else
-#define SWAP_NR_ORDERS 1
+#define swap_nr_orders() (1U)
#endif
/*
@@ -234,7 +234,7 @@ enum {
* The purpose is to optimize SWAP throughput on these device.
*/
struct swap_sequential_cluster {
- unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
+ DECLARE_FLEX_ARRAY(unsigned int, next); /* Likely next allocation offset */
};
/*
@@ -250,9 +250,9 @@ struct swap_info_struct {
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
struct list_head free_clusters; /* free clusters list */
struct list_head full_clusters; /* full clusters list */
- struct list_head nonfull_clusters[SWAP_NR_ORDERS];
+ struct list_head *nonfull_clusters;
/* list of cluster that contains at least one free slot */
- struct list_head frag_clusters[SWAP_NR_ORDERS];
+ struct list_head *frag_clusters;
/* list of cluster that are fragmented or contented */
unsigned int pages; /* total of usable pages of swap */
atomic_long_t inuse_pages; /* number of those currently in use */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 016a5aa0cb93..0a78802528cf 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -118,16 +118,12 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
atomic_t nr_rotate_swap = ATOMIC_INIT(0);
struct percpu_swap_cluster {
- struct swap_info_struct *si[SWAP_NR_ORDERS];
- unsigned long offset[SWAP_NR_ORDERS];
+ struct swap_info_struct **si;
+ unsigned long *offset;
local_lock_t lock;
};
-static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
- .si = { NULL },
- .offset = { SWAP_ENTRY_INVALID },
- .lock = INIT_LOCAL_LOCK(),
-};
+static struct percpu_swap_cluster __percpu *percpu_swap_cluster;
unsigned int swap_slots_in_cluster __read_mostly;
bool swap_table_use_page __read_mostly;
@@ -545,7 +541,7 @@ swap_cluster_populate(struct swap_info_struct *si,
* Only cluster isolation from the allocator does table allocation.
* Swap allocator uses percpu clusters and holds the local lock.
*/
- lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock);
+ lockdep_assert_held(&this_cpu_ptr(percpu_swap_cluster)->lock);
if (!(si->flags & SWP_SOLIDSTATE))
lockdep_assert_held(&si->global_cluster_lock);
lockdep_assert_held(&ci->lock);
@@ -562,7 +558,7 @@ swap_cluster_populate(struct swap_info_struct *si,
spin_unlock(&ci->lock);
if (!(si->flags & SWP_SOLIDSTATE))
spin_unlock(&si->global_cluster_lock);
- local_unlock(&percpu_swap_cluster.lock);
+ local_unlock(&percpu_swap_cluster->lock);
ret = swap_cluster_alloc_table(ci, __GFP_HIGH | __GFP_NOMEMALLOC |
GFP_KERNEL);
@@ -575,7 +571,7 @@ swap_cluster_populate(struct swap_info_struct *si,
* could happen with ignoring the percpu cluster is fragmentation,
* which is acceptable since this fallback and race is rare.
*/
- local_lock(&percpu_swap_cluster.lock);
+ local_lock(&percpu_swap_cluster->lock);
if (!(si->flags & SWP_SOLIDSTATE))
spin_lock(&si->global_cluster_lock);
spin_lock(&ci->lock);
@@ -1016,8 +1012,10 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
relocate_cluster(si, ci);
swap_cluster_unlock(ci);
if (si->flags & SWP_SOLIDSTATE) {
- this_cpu_write(percpu_swap_cluster.offset[order], next);
- this_cpu_write(percpu_swap_cluster.si[order], si);
+ struct percpu_swap_cluster *pcp_sc = this_cpu_ptr(percpu_swap_cluster);
+
+ pcp_sc->offset[order] = next;
+ pcp_sc->si[order] = si;
} else {
si->global_cluster->next[order] = next;
}
@@ -1178,7 +1176,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
goto done;
/* Order 0 stealing from higher order */
- for (int o = 1; o < SWAP_NR_ORDERS; o++) {
+ for (int o = 1; o < swap_nr_orders(); o++) {
/*
* Clusters here have at least one usable slots and can't fail order 0
* allocation, but reclaim may drop si->lock and race with another user.
@@ -1376,13 +1374,14 @@ static bool swap_alloc_fast(struct folio *folio)
struct swap_cluster_info *ci;
struct swap_info_struct *si;
unsigned int offset;
+ struct percpu_swap_cluster *pcp_sc = this_cpu_ptr(percpu_swap_cluster);
/*
* Once allocated, swap_info_struct will never be completely freed,
* so checking it's liveness by get_swap_device_info is enough.
*/
- si = this_cpu_read(percpu_swap_cluster.si[order]);
- offset = this_cpu_read(percpu_swap_cluster.offset[order]);
+ si = pcp_sc->si[order];
+ offset = pcp_sc->offset[order];
if (!si || !offset || !get_swap_device_info(si))
return false;
@@ -1770,10 +1769,10 @@ int folio_alloc_swap(struct folio *folio)
}
again:
- local_lock(&percpu_swap_cluster.lock);
+ local_lock(&percpu_swap_cluster->lock);
if (!swap_alloc_fast(folio))
swap_alloc_slow(folio);
- local_unlock(&percpu_swap_cluster.lock);
+ local_unlock(&percpu_swap_cluster->lock);
if (!order && unlikely(!folio_test_swapcache(folio))) {
if (swap_sync_discard())
@@ -2166,6 +2165,7 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
unsigned long pcp_offset, offset = SWAP_ENTRY_INVALID;
struct swap_cluster_info *ci;
swp_entry_t entry = {0};
+ struct percpu_swap_cluster *pcp_sc;
if (!si)
goto fail;
@@ -2174,9 +2174,10 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
* Try the local cluster first if it matches the device. If
* not, try grab a new cluster and override local cluster.
*/
- local_lock(&percpu_swap_cluster.lock);
- pcp_si = this_cpu_read(percpu_swap_cluster.si[0]);
- pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]);
+ local_lock(&percpu_swap_cluster->lock);
+ pcp_sc = this_cpu_ptr(percpu_swap_cluster);
+ pcp_si = pcp_sc->si[0];
+ pcp_offset = pcp_sc->offset[0];
if (pcp_si == si && pcp_offset) {
ci = swap_cluster_lock(si, pcp_offset);
if (cluster_is_usable(ci, 0))
@@ -2186,7 +2187,7 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
}
if (!offset)
offset = cluster_alloc_swap_entry(si, NULL);
- local_unlock(&percpu_swap_cluster.lock);
+ local_unlock(&percpu_swap_cluster->lock);
if (offset)
entry = swp_entry(si->type, offset);
@@ -3029,6 +3030,16 @@ static void wait_for_allocation(struct swap_info_struct *si)
}
}
+static void free_swap_info_arrays(struct swap_info_struct *si)
+{
+ kfree(si->global_cluster);
+ si->global_cluster = NULL;
+ kfree(si->nonfull_clusters);
+ si->nonfull_clusters = NULL;
+ kfree(si->frag_clusters);
+ si->frag_clusters = NULL;
+}
+
static void free_swap_cluster_info(struct swap_cluster_info *cluster_info,
unsigned long maxpages)
{
@@ -3057,17 +3068,17 @@ static void free_swap_cluster_info(struct swap_cluster_info *cluster_info,
static void flush_percpu_swap_cluster(struct swap_info_struct *si)
{
int cpu, i;
- struct swap_info_struct **pcp_si;
+ struct percpu_swap_cluster *pcp_sc;
for_each_possible_cpu(cpu) {
- pcp_si = per_cpu_ptr(percpu_swap_cluster.si, cpu);
+ pcp_sc = per_cpu_ptr(percpu_swap_cluster, cpu);
/*
* Invalidate the percpu swap cluster cache, si->users
* is dead, so no new user will point to it, just flush
* any existing user.
*/
- for (i = 0; i < SWAP_NR_ORDERS; i++)
- cmpxchg(&pcp_si[i], si, NULL);
+ for (i = 0; i < swap_nr_orders(); i++)
+ cmpxchg(&pcp_sc->si[i], si, NULL);
}
}
@@ -3179,8 +3190,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
arch_swap_invalidate_area(p->type);
zswap_swapoff(p->type);
mutex_unlock(&swapon_mutex);
- kfree(p->global_cluster);
- p->global_cluster = NULL;
+ free_swap_info_arrays(p);
free_swap_cluster_info(cluster_info, maxpages);
inode = mapping->host;
@@ -3531,6 +3541,7 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
struct swap_cluster_info *cluster_info;
int err = -ENOMEM;
unsigned long i;
+ unsigned int nr_orders = swap_nr_orders();
cluster_info = kvzalloc_objs(*cluster_info, nr_clusters);
if (!cluster_info)
@@ -3539,11 +3550,19 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
for (i = 0; i < nr_clusters; i++)
spin_lock_init(&cluster_info[i].lock);
+ si->nonfull_clusters = kmalloc_objs(*si->nonfull_clusters, nr_orders);
+ if (!si->nonfull_clusters)
+ goto err;
+
+ si->frag_clusters = kmalloc_objs(*si->frag_clusters, nr_orders);
+ if (!si->frag_clusters)
+ goto err;
+
if (!(si->flags & SWP_SOLIDSTATE)) {
- si->global_cluster = kmalloc_obj(*si->global_cluster);
+ si->global_cluster = kmalloc_flex(*si->global_cluster, next, nr_orders);
if (!si->global_cluster)
goto err;
- for (i = 0; i < SWAP_NR_ORDERS; i++)
+ for (i = 0; i < nr_orders; i++)
si->global_cluster->next[i] = SWAP_ENTRY_INVALID;
spin_lock_init(&si->global_cluster_lock);
}
@@ -3579,7 +3598,7 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
INIT_LIST_HEAD(&si->full_clusters);
INIT_LIST_HEAD(&si->discard_clusters);
- for (i = 0; i < SWAP_NR_ORDERS; i++) {
+ for (i = 0; i < nr_orders; i++) {
INIT_LIST_HEAD(&si->nonfull_clusters[i]);
INIT_LIST_HEAD(&si->frag_clusters[i]);
}
@@ -3599,6 +3618,7 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
si->cluster_info = cluster_info;
return 0;
err:
+ free_swap_info_arrays(si);
free_swap_cluster_info(cluster_info, maxpages);
return err;
}
@@ -3807,8 +3827,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
bad_swap_unlock_inode:
inode_unlock(inode);
bad_swap:
- kfree(si->global_cluster);
- si->global_cluster = NULL;
+ free_swap_info_arrays(si);
inode = NULL;
destroy_swap_extents(si, swap_file);
free_swap_cluster_info(si->cluster_info, si->max);
@@ -3922,6 +3941,10 @@ void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
static int __init swapfile_init(void)
{
+ unsigned int nr_orders = swap_nr_orders();
+ struct percpu_swap_cluster *pcp_sc;
+ int cpu;
+
swapfile_maximum_size = arch_max_swapfile_size();
swap_slots_in_cluster = generic_swap_slots_in_clusters();
@@ -3939,6 +3962,24 @@ static int __init swapfile_init(void)
SWAPFILE_CLUSTER),
0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
+ percpu_swap_cluster = alloc_percpu(struct percpu_swap_cluster);
+ if (!percpu_swap_cluster)
+ panic("%s: alloc_percpu failed for percpu_swap_cluster\n", __func__);
+
+ for_each_possible_cpu(cpu) {
+ int node = cpu_to_mem(cpu);
+
+ pcp_sc = per_cpu_ptr(percpu_swap_cluster, cpu);
+ local_lock_init(&pcp_sc->lock);
+ pcp_sc->si = kcalloc_node(nr_orders, sizeof(*pcp_sc->si),
+ GFP_KERNEL, node);
+ pcp_sc->offset = kcalloc_node(nr_orders, sizeof(*pcp_sc->offset),
+ GFP_KERNEL, node);
+ if (!pcp_sc->si || !pcp_sc->offset)
+ panic("%s: per-CPU kcalloc failed for cpu:%d, node:%d\n",
+ __func__, cpu, node);
+ }
+
#ifdef CONFIG_MIGRATION
if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
swap_migration_ad_supported = true;
--
2.39.5
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [RFC 4/4] powerpc: Kconfig: Enable THP_SWAP on Book3S64
2026-06-09 13:19 [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64 Ritesh Harjani (IBM)
` (2 preceding siblings ...)
2026-06-09 13:19 ` [RFC 3/4] mm, swap: make SWAP_NR_ORDERS runtime Ritesh Harjani (IBM)
@ 2026-06-09 13:19 ` Ritesh Harjani (IBM)
2026-06-09 15:54 ` [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64 YoungJun Park
4 siblings, 0 replies; 6+ messages in thread
From: Ritesh Harjani (IBM) @ 2026-06-09 13:19 UTC (permalink / raw)
To: linux-mm
Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
Ritesh Harjani (IBM)
This enables THP_SWAP support for Book3S64.
The performance testing of this patch series on Book3S64 with zram has shown
around 40-50% improvement in case of Radix. We will be doing some performance
testing on Hash too and will soon update the results.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
arch/powerpc/platforms/Kconfig.cputype | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index bac02c83bb3e..48f74bd22343 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -113,6 +113,7 @@ config PPC_THP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
+ select ARCH_WANTS_THP_SWAP if TRANSPARENT_HUGEPAGE
choice
prompt "CPU selection"
--
2.39.5
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64
2026-06-09 13:19 [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64 Ritesh Harjani (IBM)
` (3 preceding siblings ...)
2026-06-09 13:19 ` [RFC 4/4] powerpc: Kconfig: Enable THP_SWAP on Book3S64 Ritesh Harjani (IBM)
@ 2026-06-09 15:54 ` YoungJun Park
4 siblings, 0 replies; 6+ messages in thread
From: YoungJun Park @ 2026-06-09 15:54 UTC (permalink / raw)
To: Ritesh Harjani (IBM)
Cc: linux-mm, Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, David Hildenbrand,
linuxppc-dev, linux-kernel, Sayali Patil
On Tue, Jun 09, 2026 at 06:49:30PM +0530, Ritesh Harjani (IBM) wrote:
> On PowerPC Book3S64, MMU is selected at runtime, so macros like PMD_SHIFT are
> effectively runtime variables in the Book3S64 code. THP swap code uses these
> macros for e.g. to size some of its array data structures based on PMD_ORDER.
> This patch series makes that usage dependent on the runtime variable.
>
> Sayali did some performance runs of this on Book3S64 with Radix and it gives
> 40-50% performance improvement. We also plan to run it with Hash, will soon
> update the results.
>
> Note that this patch series is based out of linux-next (next-20260608).
>
> Ritesh Harjani (IBM) (4):
> include/linux/swap.h: Remove unused leftovers
> mm, swap: make SWAPFILE_CLUSTER runtime
> mm, swap: make SWAP_NR_ORDERS runtime
> powerpc: Kconfig: Enable THP_SWAP on Book3S64
>
> arch/powerpc/platforms/Kconfig.cputype | 1 +
> include/linux/swap.h | 17 +---
> mm/swap.h | 5 +-
> mm/swap_table.h | 6 +-
> mm/swapfile.c | 132 ++++++++++++++++++-------
> 5 files changed, 106 insertions(+), 55 deletions(-)
>
> --
> 2.39.5
>
Hello!
Instead of making SWAP_NR_ORDERS fully runtime, could we set it to the max
PMD_ORDER possible on PowerPC Book3S64 as a compile-time constant in the
swap.h ifdef block? (My assumtion is PMD_ORDER max not too big.)
I think the general runtime version adds cost. It impacts all other archs.
percpu_swap_cluster needs a runtime alloc,
the si/offset and nonfull/frag arrays become separate pointers, and some
accesses get one more indirection. And for nr_orders=1, the allocation
itself is just waste.
With a compile-time possible max constant, the only downside is some acceptable amount of
wasted bytes per CPU / per device on Book3S64 (the unused entries in the swap
offset cache and the nonfull/frag lists), with no perf impact. the perf
improvement comes from THP swap itself, right? Other arches see no
impact at all.
patch 2 looks fine as is. SWAPFILE_CLUSTER backs much bigger per-cluster
arrays, so runtime sizing makes sense there, and it looks like no impact to
other arches or the current code.
Thanks!
Youngjun Park
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2026-06-09 15:54 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-09 13:19 [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64 Ritesh Harjani (IBM)
2026-06-09 13:19 ` [RFC 1/4] include/linux/swap.h: Remove unused leftovers Ritesh Harjani (IBM)
2026-06-09 13:19 ` [RFC 2/4] mm, swap: make SWAPFILE_CLUSTER runtime Ritesh Harjani (IBM)
2026-06-09 13:19 ` [RFC 3/4] mm, swap: make SWAP_NR_ORDERS runtime Ritesh Harjani (IBM)
2026-06-09 13:19 ` [RFC 4/4] powerpc: Kconfig: Enable THP_SWAP on Book3S64 Ritesh Harjani (IBM)
2026-06-09 15:54 ` [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64 YoungJun Park
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.