From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: YoungJun Park <youngjun.park@lge.com>
Cc: linux-mm@kvack.org, Madhavan Srinivasan <maddy@linux.ibm.com>,
Michael Ellerman <mpe@ellerman.id.au>,
Nicholas Piggin <npiggin@gmail.com>,
Christophe Leroy <chleroy@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>,
Kemeng Shi <shikemeng@huaweicloud.com>,
Nhat Pham <nphamcs@gmail.com>, Baoquan He <baoquan.he@linux.dev>,
Barry Song <baohua@kernel.org>,
David Hildenbrand <david@kernel.org>,
linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
Sayali Patil <sayalip@linux.ibm.com>
Subject: Re: [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64
Date: Wed, 10 Jun 2026 11:00:43 +0530
Message-ID: <pl1zdksc.ritesh.list@gmail.com>
In-Reply-To: <aig3PFm9jst8cHxn@yjaykim-PowerEdge-T330>
YoungJun Park <youngjun.park@lge.com> writes:
> On Tue, Jun 09, 2026 at 06:49:30PM +0530, Ritesh Harjani (IBM) wrote:
>> On PowerPC Book3S64, MMU is selected at runtime, so macros like PMD_SHIFT are
>> effectively runtime variables in the Book3S64 code. THP swap code uses these
>> macros for e.g. to size some of its array data structures based on PMD_ORDER.
>> This patch series makes that usage dependent on the runtime variable.
>>
>> Sayali did some performance runs of this on Book3S64 with Radix and it gives
>> 40-50% performance improvement. We also plan to run it with Hash, will soon
>> update the results.
>>
>> Note that this patch series is based out of linux-next (next-20260608).
>>
>> Ritesh Harjani (IBM) (4):
>> include/linux/swap.h: Remove unused leftovers
>> mm, swap: make SWAPFILE_CLUSTER runtime
>> mm, swap: make SWAP_NR_ORDERS runtime
>> powerpc: Kconfig: Enable THP_SWAP on Book3S64
>>
>> arch/powerpc/platforms/Kconfig.cputype | 1 +
>> include/linux/swap.h | 17 +---
>> mm/swap.h | 5 +-
>> mm/swap_table.h | 6 +-
>> mm/swapfile.c | 132 ++++++++++++++++++-------
>> 5 files changed, 106 insertions(+), 55 deletions(-)
>>
>> --
>> 2.39.5
>>
> Hello!
>
Thanks for taking a look at this.
> Instead of making SWAP_NR_ORDERS fully runtime, could we set it to the max
> PMD_ORDER possible on PowerPC Book3S64 as a compile-time constant in the
> swap.h ifdef block? (My assumption is that the max PMD_ORDER is not too big.)
>
> I think the general runtime version adds cost. It impacts all other archs.
> percpu_swap_cluster needs a runtime alloc,
> the si/offset and nonfull/frag arrays become separate pointers, and some
> accesses get one more indirection. And for nr_orders=1, the allocation
> itself is just waste.
>
> With a compile-time max constant, the only downside is an acceptable number
> of wasted bytes per CPU / per device on Book3S64 (the unused entries in the
> swap offset cache and the nonfull/frag lists), with no perf impact. The perf
> improvement comes from THP swap itself, right? Other arches see no
> impact at all.
>
I compared the memory waste of static versus runtime allocation. For the
per-CPU data structures (with the Radix MMU), the extra waste from static
sizing will be 0: the runtime version uses kcalloc_node(), which is served
from the kmalloc-64 slab, so slab padding adds some memory waste anyway.
Static arrays sized for some max PMD_ORDER are therefore just as good for
percpu_swap_cluster.
For the other lists you mentioned, static sizing adds only a one-time,
negligible cost, which does not justify making SWAP_NR_ORDERS runtime.
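To make the slab-padding argument concrete, here is a rough userspace
sketch (not kernel code, and not part of the patch). size_class() and
padding_for() are hypothetical helpers; the model treats allocation size
classes as plain powers of two, which is a simplification, and the entry
count and entry size used below are assumed values for illustration only.

```c
#include <stddef.h>

/*
 * Simplified model of a slab size class: round a request up to the next
 * power of two, with a minimum of 8 bytes. This is an assumption made for
 * illustration; a real allocator may offer other size classes as well.
 */
size_t size_class(size_t bytes)
{
	size_t cls = 8;

	while (cls < bytes)
		cls <<= 1;
	return cls;
}

/* Bytes of padding when an array of nr_orders entries lands in its class. */
size_t padding_for(size_t nr_orders, size_t entry_size)
{
	size_t want = nr_orders * entry_size;

	return size_class(want) - want;
}
```

With the assumed numbers, a 6-entry array of 8-byte values asks for 48
bytes and is served from a 64-byte class, so 16 bytes are padding and a
7th or 8th entry would cost nothing more.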
> patch 2 looks fine as is. SWAPFILE_CLUSTER backs much bigger per-cluster
> arrays, so runtime sizing makes sense there, and it looks like no impact to
> other arches or the current code.
>
Yup, that makes sense.
So, unless someone else raises an objection, I will give this a try
instead of patch 3 in this series and will get back with v2.
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index e67e64ac6e8c..57abd8b2c9a1 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -204,6 +204,9 @@ extern unsigned long __pmd_frag_size_shift;
#define MAX_PTRS_PER_PGD (1 << (H_PGD_INDEX_SIZE > RADIX_PGD_INDEX_SIZE ? \
H_PGD_INDEX_SIZE : RADIX_PGD_INDEX_SIZE))
+#define ARCH_MAX_PMD_ORDER ((H_PTE_INDEX_SIZE > RADIX_PTE_INDEX_SIZE) ? \
+ H_PTE_INDEX_SIZE : RADIX_PTE_INDEX_SIZE)
+
/* PMD_SHIFT determines what a second-level page table entry can map */
#define PMD_SHIFT (PAGE_SHIFT + PTE_INDEX_SIZE)
#define PMD_SIZE (1UL << PMD_SHIFT)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46c25523d7b8..5f1451f8f266 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -224,10 +224,14 @@ enum {
#define SWAP_ENTRY_INVALID 0
#ifdef CONFIG_THP_SWAP
+#ifdef ARCH_MAX_PMD_ORDER
+#define SWAP_NR_ORDERS (ARCH_MAX_PMD_ORDER + 1)
+#else
#define SWAP_NR_ORDERS (PMD_ORDER + 1)
+#endif /* ARCH_MAX_PMD_ORDER */
#else
#define SWAP_NR_ORDERS 1
-#endif
+#endif /* CONFIG_THP_SWAP */
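The pattern in the diff above can be tried out in plain userspace C. This
is only a sketch under stated assumptions: every DEMO_* name, the two index
sizes, struct demo_order_cache and demo_runtime_pmd_order are hypothetical
stand-ins, not the kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical per-MMU index sizes; the real ones live in powerpc headers. */
#define DEMO_HASH_PTE_INDEX_SIZE   8
#define DEMO_RADIX_PTE_INDEX_SIZE  5

/* Compile-time upper bound, same shape as ARCH_MAX_PMD_ORDER in the diff. */
#define DEMO_MAX_PMD_ORDER \
	((DEMO_HASH_PTE_INDEX_SIZE > DEMO_RADIX_PTE_INDEX_SIZE) ? \
	 DEMO_HASH_PTE_INDEX_SIZE : DEMO_RADIX_PTE_INDEX_SIZE)
#define DEMO_NR_ORDERS (DEMO_MAX_PMD_ORDER + 1)

/* Sized at compile time for the worst case, so no runtime allocation. */
struct demo_order_cache {
	unsigned long offset[DEMO_NR_ORDERS];
};

/* Stand-in for the order fixed at boot once the MMU has been selected. */
static unsigned int demo_runtime_pmd_order = DEMO_RADIX_PTE_INDEX_SIZE;

/* Number of array slots actually used for the MMU selected at runtime. */
size_t demo_used_slots(void)
{
	return (size_t)demo_runtime_pmd_order + 1;
}

/* Slots that exist only because the array is sized for the maximum. */
size_t demo_unused_slots(void)
{
	return DEMO_NR_ORDERS - demo_used_slots();
}
```

The array keeps a fixed, embedded layout with no extra pointer to chase,
and the only cost is the slots that the runtime-selected order never uses.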
-ritesh