The Linux Kernel Mailing List
* [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64
@ 2026-06-09 13:19 Ritesh Harjani (IBM)
  2026-06-09 13:19 ` [RFC 1/4] include/linux/swap.h: Remove unused leftovers Ritesh Harjani (IBM)
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Ritesh Harjani (IBM) @ 2026-06-09 13:19 UTC
  To: linux-mm
  Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
	David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
	Ritesh Harjani (IBM)

On PowerPC Book3S64 the MMU is selected at runtime, so macros like PMD_SHIFT
are effectively runtime variables in the Book3S64 code. The THP swap code uses
these macros, for example, to size some of its array data structures based on
PMD_ORDER. This patch series makes those sizes depend on the runtime value.

Sayali did some performance runs of this series on Book3S64 with Radix, which
showed a 40-50% performance improvement. We also plan to run it with Hash and
will update the results soon.

Note that this patch series is based on linux-next (next-20260608).

Ritesh Harjani (IBM) (4):
  include/linux/swap.h: Remove unused leftovers
  mm, swap: make SWAPFILE_CLUSTER runtime
  mm, swap: make SWAP_NR_ORDERS runtime
  powerpc: Kconfig: Enable THP_SWAP on Book3S64

 arch/powerpc/platforms/Kconfig.cputype |   1 +
 include/linux/swap.h                   |  17 +---
 mm/swap.h                              |   5 +-
 mm/swap_table.h                        |   6 +-
 mm/swapfile.c                          | 132 ++++++++++++++++++-------
 5 files changed, 106 insertions(+), 55 deletions(-)

--
2.39.5



* [RFC 1/4] include/linux/swap.h: Remove unused leftovers
  2026-06-09 13:19 [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64 Ritesh Harjani (IBM)
@ 2026-06-09 13:19 ` Ritesh Harjani (IBM)
  2026-06-09 13:19 ` [RFC 2/4] mm, swap: make SWAPFILE_CLUSTER runtime Ritesh Harjani (IBM)
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Ritesh Harjani (IBM) @ 2026-06-09 13:19 UTC
  To: linux-mm
  Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
	David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
	Ritesh Harjani (IBM)

This removes unused leftovers, most of which are forward structure
declarations. It also removes the SWAP_BATCH macro, which isn't used
anywhere in the code.

Found these during manual code review.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 include/linux/swap.h | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8f0f68e245ba..46c25523d7b8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -16,10 +16,6 @@
 #include <uapi/linux/mempolicy.h>
 #include <asm/page.h>
 
-struct notifier_block;
-
-struct bio;
-
 #define SWAP_FLAG_PREFER	0x8000	/* set if swap priority specified */
 #define SWAP_FLAG_PRIO_MASK	0x7fff
 #define SWAP_FLAG_DISCARD	0x10000 /* enable discard for swap */
@@ -29,7 +25,6 @@ struct bio;
 #define SWAP_FLAGS_VALID	(SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \
 				 SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \
 				 SWAP_FLAG_DISCARD_PAGES)
-#define SWAP_BATCH 64
 
 static inline int current_is_kswapd(void)
 {
@@ -175,7 +170,6 @@ static inline void mm_account_reclaimed_pages(unsigned long pages)
 
 struct address_space;
 struct sysinfo;
-struct writeback_control;
 struct zone;
 
 /*
@@ -442,7 +436,6 @@ extern sector_t swapdev_block(int, pgoff_t);
 extern int __swap_count(swp_entry_t entry);
 extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
-struct backing_dev_info;
 extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
 sector_t swap_folio_sector(struct folio *folio);
 
-- 
2.39.5



* [RFC 2/4] mm, swap: make SWAPFILE_CLUSTER runtime
  2026-06-09 13:19 [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64 Ritesh Harjani (IBM)
  2026-06-09 13:19 ` [RFC 1/4] include/linux/swap.h: Remove unused leftovers Ritesh Harjani (IBM)
@ 2026-06-09 13:19 ` Ritesh Harjani (IBM)
  2026-06-09 13:19 ` [RFC 3/4] mm, swap: make SWAP_NR_ORDERS runtime Ritesh Harjani (IBM)
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Ritesh Harjani (IBM) @ 2026-06-09 13:19 UTC
  To: linux-mm
  Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
	David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
	Ritesh Harjani (IBM)

This makes SWAPFILE_CLUSTER a runtime value. On architectures like powerpc
book3s64, HPAGE_PMD_NR is derived at runtime depending on which MMU is
chosen.
Hence this patch initializes SWAPFILE_CLUSTER at runtime and also
modifies swap_table and swap_memcg_table, which earlier used this
macro to define the number of table entries.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 mm/swap.h       |  5 +++--
 mm/swap_table.h |  6 ++----
 mm/swapfile.c   | 27 ++++++++++++++++++++++-----
 3 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index 77d2d14eda42..956879a69ddd 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -26,11 +26,12 @@ extern int page_cluster;
 #define SWAP_TABLE_HAS_ZEROFLAG		((BITS_PER_LONG - SWAP_CACHE_PFN_MARK_BITS - \
 					  SWAP_CACHE_PFN_BITS) > SWAP_COUNT_MIN_BITS)

+extern unsigned int swap_slots_in_cluster __read_mostly;
+#define SWAPFILE_CLUSTER	swap_slots_in_cluster
+
 #ifdef CONFIG_THP_SWAP
-#define SWAPFILE_CLUSTER	HPAGE_PMD_NR
 #define swap_entry_order(order)	(order)
 #else
-#define SWAPFILE_CLUSTER	256
 #define swap_entry_order(order)	0
 #endif

diff --git a/mm/swap_table.h b/mm/swap_table.h
index e6613e62f8d0..90e2a7852300 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -8,16 +8,14 @@

 /* A typical flat array in each cluster as swap table */
 struct swap_table {
-	atomic_long_t entries[SWAPFILE_CLUSTER];
+	DECLARE_FLEX_ARRAY(atomic_long_t, entries);
 };

 /* For storing memcg private id */
 struct swap_memcg_table {
-	unsigned short id[SWAPFILE_CLUSTER];
+	DECLARE_FLEX_ARRAY(unsigned short, id);
 };

-#define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
-
 /*
  * A swap table entry represents the status of a swap slot on a swap
  * (physical or virtual) device. The swap table in each cluster is a
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 78b49b0658ad..016a5aa0cb93 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -129,6 +129,17 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
 	.lock = INIT_LOCAL_LOCK(),
 };

+unsigned int swap_slots_in_cluster __read_mostly;
+bool swap_table_use_page __read_mostly;
+
+static unsigned int generic_swap_slots_in_clusters(void)
+{
+	if (IS_ENABLED(CONFIG_THP_SWAP))
+		return HPAGE_PMD_NR;
+	else
+		return 256;
+}
+
 /* May return NULL on invalid type, caller must check for NULL return */
 static struct swap_info_struct *swap_type_to_info(int type)
 {
@@ -437,7 +448,7 @@ static void swap_cluster_free_table(struct swap_cluster_info *ci)
 		return;

 	rcu_assign_pointer(ci->table, NULL);
-	if (!SWP_TABLE_USE_PAGE) {
+	if (!swap_table_use_page) {
 		kmem_cache_free(swap_table_cachep, table);
 		return;
 	}
@@ -456,7 +467,7 @@ static int swap_cluster_alloc_table(struct swap_cluster_info *ci, gfp_t gfp)
 	if (rcu_access_pointer(ci->table))
 		return 0;

-	if (SWP_TABLE_USE_PAGE) {
+	if (swap_table_use_page) {
 		folio = folio_alloc(gfp | __GFP_ZERO, 0);
 		if (folio)
 			table = folio_address(folio);
@@ -471,7 +482,8 @@ static int swap_cluster_alloc_table(struct swap_cluster_info *ci, gfp_t gfp)
 #ifdef CONFIG_MEMCG
 	if (!mem_cgroup_disabled()) {
 		VM_WARN_ON_ONCE(ci->memcg_table);
-		ci->memcg_table = kzalloc_obj(*ci->memcg_table, gfp);
+		ci->memcg_table = kzalloc_flex(*ci->memcg_table, id,
+					       SWAPFILE_CLUSTER, gfp);
 		if (!ci->memcg_table) {
 			swap_cluster_free_table(ci);
 			return -ENOMEM;
@@ -3912,14 +3924,19 @@ static int __init swapfile_init(void)
 {
 	swapfile_maximum_size = arch_max_swapfile_size();

+	swap_slots_in_cluster = generic_swap_slots_in_clusters();
+	swap_table_use_page =
+		(swap_slots_in_cluster * sizeof(atomic_long_t) == PAGE_SIZE);
+
 	/*
 	 * Once a cluster is freed, it's swap table content is read
 	 * only, and all swap cache readers (swap_cache_*) verifies
 	 * the content before use. So it's safe to use RCU slab here.
 	 */
-	if (!SWP_TABLE_USE_PAGE)
+	if (!swap_table_use_page)
 		swap_table_cachep = kmem_cache_create("swap_table",
-				    sizeof(struct swap_table),
+				    struct_size_t(struct swap_table, entries,
+					    SWAPFILE_CLUSTER),
 				    0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);

 #ifdef CONFIG_MIGRATION
--
2.39.5



* [RFC 3/4] mm, swap: make SWAP_NR_ORDERS runtime
  2026-06-09 13:19 [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64 Ritesh Harjani (IBM)
  2026-06-09 13:19 ` [RFC 1/4] include/linux/swap.h: Remove unused leftovers Ritesh Harjani (IBM)
  2026-06-09 13:19 ` [RFC 2/4] mm, swap: make SWAPFILE_CLUSTER runtime Ritesh Harjani (IBM)
@ 2026-06-09 13:19 ` Ritesh Harjani (IBM)
  2026-06-09 13:19 ` [RFC 4/4] powerpc: Kconfig: Enable THP_SWAP on Book3S64 Ritesh Harjani (IBM)
  2026-06-09 15:54 ` [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64 YoungJun Park
  4 siblings, 0 replies; 6+ messages in thread
From: Ritesh Harjani (IBM) @ 2026-06-09 13:19 UTC
  To: linux-mm
  Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
	David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
	Ritesh Harjani (IBM)

SWAP_NR_ORDERS is currently a compile-time constant defined as PMD_ORDER
+ 1 when CONFIG_THP_SWAP=y, else 1.
This patch converts SWAP_NR_ORDERS and all the relevant code paths to
make it runtime dependent. This is needed for architectures like powerpc
book3s64, where PMD_ORDER is decided at runtime depending upon which MMU
is chosen (Radix / Hash).

One thing to note here is that if any of the allocations required in
swapfile_init() (which is a subsys_initcall) fails, we have no option
but to panic. This is in line with how memory allocation failures in
other subsys_initcall() functions are handled.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 include/linux/swap.h |  10 ++---
 mm/swapfile.c        | 105 ++++++++++++++++++++++++++++++-------------
 2 files changed, 78 insertions(+), 37 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46c25523d7b8..063ab7c4d4a5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -224,9 +224,9 @@ enum {
 #define SWAP_ENTRY_INVALID	0

 #ifdef CONFIG_THP_SWAP
-#define SWAP_NR_ORDERS		(PMD_ORDER + 1)
+#define swap_nr_orders()	((unsigned int)(PMD_ORDER + 1))
 #else
-#define SWAP_NR_ORDERS		1
+#define swap_nr_orders()	(1U)
 #endif

 /*
@@ -234,7 +234,7 @@ enum {
  * The purpose is to optimize SWAP throughput on these device.
  */
 struct swap_sequential_cluster {
-	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
+	DECLARE_FLEX_ARRAY(unsigned int, next); /* Likely next allocation offset */
 };

 /*
@@ -250,9 +250,9 @@ struct swap_info_struct {
 	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
 	struct list_head free_clusters; /* free clusters list */
 	struct list_head full_clusters; /* full clusters list */
-	struct list_head nonfull_clusters[SWAP_NR_ORDERS];
+	struct list_head *nonfull_clusters;
 					/* list of cluster that contains at least one free slot */
-	struct list_head frag_clusters[SWAP_NR_ORDERS];
+	struct list_head *frag_clusters;
 					/* list of cluster that are fragmented or contented */
 	unsigned int pages;		/* total of usable pages of swap */
 	atomic_long_t inuse_pages;	/* number of those currently in use */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 016a5aa0cb93..0a78802528cf 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -118,16 +118,12 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
 atomic_t nr_rotate_swap = ATOMIC_INIT(0);

 struct percpu_swap_cluster {
-	struct swap_info_struct *si[SWAP_NR_ORDERS];
-	unsigned long offset[SWAP_NR_ORDERS];
+	struct swap_info_struct **si;
+	unsigned long *offset;
 	local_lock_t lock;
 };

-static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
-	.si = { NULL },
-	.offset = { SWAP_ENTRY_INVALID },
-	.lock = INIT_LOCAL_LOCK(),
-};
+static struct percpu_swap_cluster __percpu *percpu_swap_cluster;

 unsigned int swap_slots_in_cluster __read_mostly;
 bool swap_table_use_page __read_mostly;
@@ -545,7 +541,7 @@ swap_cluster_populate(struct swap_info_struct *si,
 	 * Only cluster isolation from the allocator does table allocation.
 	 * Swap allocator uses percpu clusters and holds the local lock.
 	 */
-	lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock);
+	lockdep_assert_held(&this_cpu_ptr(percpu_swap_cluster)->lock);
 	if (!(si->flags & SWP_SOLIDSTATE))
 		lockdep_assert_held(&si->global_cluster_lock);
 	lockdep_assert_held(&ci->lock);
@@ -562,7 +558,7 @@ swap_cluster_populate(struct swap_info_struct *si,
 	spin_unlock(&ci->lock);
 	if (!(si->flags & SWP_SOLIDSTATE))
 		spin_unlock(&si->global_cluster_lock);
-	local_unlock(&percpu_swap_cluster.lock);
+	local_unlock(&percpu_swap_cluster->lock);

 	ret = swap_cluster_alloc_table(ci, __GFP_HIGH | __GFP_NOMEMALLOC |
 					   GFP_KERNEL);
@@ -575,7 +571,7 @@ swap_cluster_populate(struct swap_info_struct *si,
 	 * could happen with ignoring the percpu cluster is fragmentation,
 	 * which is acceptable since this fallback and race is rare.
 	 */
-	local_lock(&percpu_swap_cluster.lock);
+	local_lock(&percpu_swap_cluster->lock);
 	if (!(si->flags & SWP_SOLIDSTATE))
 		spin_lock(&si->global_cluster_lock);
 	spin_lock(&ci->lock);
@@ -1016,8 +1012,10 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 	relocate_cluster(si, ci);
 	swap_cluster_unlock(ci);
 	if (si->flags & SWP_SOLIDSTATE) {
-		this_cpu_write(percpu_swap_cluster.offset[order], next);
-		this_cpu_write(percpu_swap_cluster.si[order], si);
+		struct percpu_swap_cluster *pcp_sc = this_cpu_ptr(percpu_swap_cluster);
+
+		pcp_sc->offset[order] = next;
+		pcp_sc->si[order] = si;
 	} else {
 		si->global_cluster->next[order] = next;
 	}
@@ -1178,7 +1176,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
 		goto done;

 	/* Order 0 stealing from higher order */
-	for (int o = 1; o < SWAP_NR_ORDERS; o++) {
+	for (int o = 1; o < swap_nr_orders(); o++) {
 		/*
 		 * Clusters here have at least one usable slots and can't fail order 0
 		 * allocation, but reclaim may drop si->lock and race with another user.
@@ -1376,13 +1374,14 @@ static bool swap_alloc_fast(struct folio *folio)
 	struct swap_cluster_info *ci;
 	struct swap_info_struct *si;
 	unsigned int offset;
+	struct percpu_swap_cluster *pcp_sc = this_cpu_ptr(percpu_swap_cluster);

 	/*
 	 * Once allocated, swap_info_struct will never be completely freed,
 	 * so checking it's liveness by get_swap_device_info is enough.
 	 */
-	si = this_cpu_read(percpu_swap_cluster.si[order]);
-	offset = this_cpu_read(percpu_swap_cluster.offset[order]);
+	si = pcp_sc->si[order];
+	offset = pcp_sc->offset[order];
 	if (!si || !offset || !get_swap_device_info(si))
 		return false;

@@ -1770,10 +1769,10 @@ int folio_alloc_swap(struct folio *folio)
 	}

 again:
-	local_lock(&percpu_swap_cluster.lock);
+	local_lock(&percpu_swap_cluster->lock);
 	if (!swap_alloc_fast(folio))
 		swap_alloc_slow(folio);
-	local_unlock(&percpu_swap_cluster.lock);
+	local_unlock(&percpu_swap_cluster->lock);

 	if (!order && unlikely(!folio_test_swapcache(folio))) {
 		if (swap_sync_discard())
@@ -2166,6 +2165,7 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
 	unsigned long pcp_offset, offset = SWAP_ENTRY_INVALID;
 	struct swap_cluster_info *ci;
 	swp_entry_t entry = {0};
+	struct percpu_swap_cluster *pcp_sc;

 	if (!si)
 		goto fail;
@@ -2174,9 +2174,10 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
 	 * Try the local cluster first if it matches the device. If
 	 * not, try grab a new cluster and override local cluster.
 	 */
-	local_lock(&percpu_swap_cluster.lock);
-	pcp_si = this_cpu_read(percpu_swap_cluster.si[0]);
-	pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]);
+	local_lock(&percpu_swap_cluster->lock);
+	pcp_sc = this_cpu_ptr(percpu_swap_cluster);
+	pcp_si = pcp_sc->si[0];
+	pcp_offset = pcp_sc->offset[0];
 	if (pcp_si == si && pcp_offset) {
 		ci = swap_cluster_lock(si, pcp_offset);
 		if (cluster_is_usable(ci, 0))
@@ -2186,7 +2187,7 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
 	}
 	if (!offset)
 		offset = cluster_alloc_swap_entry(si, NULL);
-	local_unlock(&percpu_swap_cluster.lock);
+	local_unlock(&percpu_swap_cluster->lock);
 	if (offset)
 		entry = swp_entry(si->type, offset);

@@ -3029,6 +3030,16 @@ static void wait_for_allocation(struct swap_info_struct *si)
 	}
 }

+static void free_swap_info_arrays(struct swap_info_struct *si)
+{
+	kfree(si->global_cluster);
+	si->global_cluster = NULL;
+	kfree(si->nonfull_clusters);
+	si->nonfull_clusters = NULL;
+	kfree(si->frag_clusters);
+	si->frag_clusters = NULL;
+}
+
 static void free_swap_cluster_info(struct swap_cluster_info *cluster_info,
 				   unsigned long maxpages)
 {
@@ -3057,17 +3068,17 @@ static void free_swap_cluster_info(struct swap_cluster_info *cluster_info,
 static void flush_percpu_swap_cluster(struct swap_info_struct *si)
 {
 	int cpu, i;
-	struct swap_info_struct **pcp_si;
+	struct percpu_swap_cluster *pcp_sc;

 	for_each_possible_cpu(cpu) {
-		pcp_si = per_cpu_ptr(percpu_swap_cluster.si, cpu);
+		pcp_sc = per_cpu_ptr(percpu_swap_cluster, cpu);
 		/*
 		 * Invalidate the percpu swap cluster cache, si->users
 		 * is dead, so no new user will point to it, just flush
 		 * any existing user.
 		 */
-		for (i = 0; i < SWAP_NR_ORDERS; i++)
-			cmpxchg(&pcp_si[i], si, NULL);
+		for (i = 0; i < swap_nr_orders(); i++)
+			cmpxchg(&pcp_sc->si[i], si, NULL);
 	}
 }

@@ -3179,8 +3190,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	arch_swap_invalidate_area(p->type);
 	zswap_swapoff(p->type);
 	mutex_unlock(&swapon_mutex);
-	kfree(p->global_cluster);
-	p->global_cluster = NULL;
+	free_swap_info_arrays(p);
 	free_swap_cluster_info(cluster_info, maxpages);

 	inode = mapping->host;
@@ -3531,6 +3541,7 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
 	struct swap_cluster_info *cluster_info;
 	int err = -ENOMEM;
 	unsigned long i;
+	unsigned int nr_orders = swap_nr_orders();

 	cluster_info = kvzalloc_objs(*cluster_info, nr_clusters);
 	if (!cluster_info)
@@ -3539,11 +3550,19 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
 	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);

+	si->nonfull_clusters = kmalloc_objs(*si->nonfull_clusters, nr_orders);
+	if (!si->nonfull_clusters)
+		goto err;
+
+	si->frag_clusters = kmalloc_objs(*si->frag_clusters, nr_orders);
+	if (!si->frag_clusters)
+		goto err;
+
 	if (!(si->flags & SWP_SOLIDSTATE)) {
-		si->global_cluster = kmalloc_obj(*si->global_cluster);
+		si->global_cluster = kmalloc_flex(*si->global_cluster, next, nr_orders);
 		if (!si->global_cluster)
 			goto err;
-		for (i = 0; i < SWAP_NR_ORDERS; i++)
+		for (i = 0; i < nr_orders; i++)
 			si->global_cluster->next[i] = SWAP_ENTRY_INVALID;
 		spin_lock_init(&si->global_cluster_lock);
 	}
@@ -3579,7 +3598,7 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
 	INIT_LIST_HEAD(&si->full_clusters);
 	INIT_LIST_HEAD(&si->discard_clusters);

-	for (i = 0; i < SWAP_NR_ORDERS; i++) {
+	for (i = 0; i < nr_orders; i++) {
 		INIT_LIST_HEAD(&si->nonfull_clusters[i]);
 		INIT_LIST_HEAD(&si->frag_clusters[i]);
 	}
@@ -3599,6 +3618,7 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
 	si->cluster_info = cluster_info;
 	return 0;
 err:
+	free_swap_info_arrays(si);
 	free_swap_cluster_info(cluster_info, maxpages);
 	return err;
 }
@@ -3807,8 +3827,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap_unlock_inode:
 	inode_unlock(inode);
 bad_swap:
-	kfree(si->global_cluster);
-	si->global_cluster = NULL;
+	free_swap_info_arrays(si);
 	inode = NULL;
 	destroy_swap_extents(si, swap_file);
 	free_swap_cluster_info(si->cluster_info, si->max);
@@ -3922,6 +3941,10 @@ void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp)

 static int __init swapfile_init(void)
 {
+	unsigned int nr_orders = swap_nr_orders();
+	struct percpu_swap_cluster *pcp_sc;
+	int cpu;
+
 	swapfile_maximum_size = arch_max_swapfile_size();

 	swap_slots_in_cluster = generic_swap_slots_in_clusters();
@@ -3939,6 +3962,24 @@ static int __init swapfile_init(void)
 					    SWAPFILE_CLUSTER),
 				    0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);

+	percpu_swap_cluster = alloc_percpu(struct percpu_swap_cluster);
+	if (!percpu_swap_cluster)
+		panic("%s: alloc_percpu failed for percpu_swap_cluster\n", __func__);
+
+	for_each_possible_cpu(cpu) {
+		int node = cpu_to_mem(cpu);
+
+		pcp_sc = per_cpu_ptr(percpu_swap_cluster, cpu);
+		local_lock_init(&pcp_sc->lock);
+		pcp_sc->si = kcalloc_node(nr_orders, sizeof(*pcp_sc->si),
+					GFP_KERNEL, node);
+		pcp_sc->offset = kcalloc_node(nr_orders, sizeof(*pcp_sc->offset),
+					    GFP_KERNEL, node);
+		if (!pcp_sc->si || !pcp_sc->offset)
+			panic("%s: per-CPU kcalloc failed for cpu:%d, node:%d\n",
+					__func__, cpu, node);
+	}
+
 #ifdef CONFIG_MIGRATION
 	if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
 		swap_migration_ad_supported = true;
--
2.39.5



* [RFC 4/4] powerpc: Kconfig: Enable THP_SWAP on Book3S64
  2026-06-09 13:19 [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64 Ritesh Harjani (IBM)
                   ` (2 preceding siblings ...)
  2026-06-09 13:19 ` [RFC 3/4] mm, swap: make SWAP_NR_ORDERS runtime Ritesh Harjani (IBM)
@ 2026-06-09 13:19 ` Ritesh Harjani (IBM)
  2026-06-09 15:54 ` [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64 YoungJun Park
  4 siblings, 0 replies; 6+ messages in thread
From: Ritesh Harjani (IBM) @ 2026-06-09 13:19 UTC
  To: linux-mm
  Cc: Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Youngjun Park,
	David Hildenbrand, linuxppc-dev, linux-kernel, Sayali Patil,
	Ritesh Harjani (IBM)

This enables THP_SWAP support for Book3S64.

Performance testing of this patch series on Book3S64 with zram has shown
around a 40-50% improvement with Radix. We will do performance testing on
Hash too and will update the results soon.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 arch/powerpc/platforms/Kconfig.cputype | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index bac02c83bb3e..48f74bd22343 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -113,6 +113,7 @@ config PPC_THP
        select HAVE_ARCH_TRANSPARENT_HUGEPAGE
        select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
        select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
+       select ARCH_WANTS_THP_SWAP	if TRANSPARENT_HUGEPAGE

 choice
 	prompt "CPU selection"
--
2.39.5



* Re: [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64
  2026-06-09 13:19 [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64 Ritesh Harjani (IBM)
                   ` (3 preceding siblings ...)
  2026-06-09 13:19 ` [RFC 4/4] powerpc: Kconfig: Enable THP_SWAP on Book3S64 Ritesh Harjani (IBM)
@ 2026-06-09 15:54 ` YoungJun Park
  4 siblings, 0 replies; 6+ messages in thread
From: YoungJun Park @ 2026-06-09 15:54 UTC
  To: Ritesh Harjani (IBM)
  Cc: linux-mm, Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, David Hildenbrand,
	linuxppc-dev, linux-kernel, Sayali Patil

On Tue, Jun 09, 2026 at 06:49:30PM +0530, Ritesh Harjani (IBM) wrote:
> On PowerPC Book3S64, MMU is selected at runtime, so macros like PMD_SHIFT are
> effectively runtime variables in the Book3S64 code. THP swap code uses these
> macros for e.g. to size some of its array data structures based on PMD_ORDER.
> This patch series makes that usage dependent on the runtime variable.
> 
> Sayali did some performance runs of this on Book3S64 with Radix and it gives
> 40-50% performance improvement. We also plan to run it with Hash, will soon
> update the results.
> 
> Note that this patch series is based out of linux-next (next-20260608).
> 
> Ritesh Harjani (IBM) (4):
>   include/linux/swap.h: Remove unused leftovers
>   mm, swap: make SWAPFILE_CLUSTER runtime
>   mm, swap: make SWAP_NR_ORDERS runtime
>   powerpc: Kconfig: Enable THP_SWAP on Book3S64
> 
>  arch/powerpc/platforms/Kconfig.cputype |   1 +
>  include/linux/swap.h                   |  17 +---
>  mm/swap.h                              |   5 +-
>  mm/swap_table.h                        |   6 +-
>  mm/swapfile.c                          | 132 ++++++++++++++++++-------
>  5 files changed, 106 insertions(+), 55 deletions(-)
> 
> --
> 2.39.5
>
Hello!

Instead of making SWAP_NR_ORDERS fully runtime, could we set it to the maximum
PMD_ORDER possible on PowerPC Book3S64 as a compile-time constant in the
swap.h ifdef block? (My assumption is that the maximum PMD_ORDER is not too
big.)

I think the general runtime version adds cost, and it impacts all other archs.
percpu_swap_cluster needs a runtime allocation, the si/offset and nonfull/frag
arrays become separate pointers, and some accesses get one more indirection.
And for nr_orders=1, the allocation itself is just waste.

With a compile-time maximum constant, the only downside is an acceptable
amount of wasted bytes per CPU / per device on Book3S64 (the unused entries in
the swap offset cache and the nonfull/frag lists), with no perf impact. The
perf improvement comes from THP swap itself, right? Other arches see no
impact at all.

Patch 2 looks fine as is. SWAPFILE_CLUSTER backs much bigger per-cluster
arrays, so runtime sizing makes sense there, and it looks like it has no
impact on other arches or the current code.

Thanks!
Youngjun Park

