From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Kairui Song <ryncsn@gmail.com>
Cc: linux-mm@kvack.org, Madhavan Srinivasan <maddy@linux.ibm.com>,
Michael Ellerman <mpe@ellerman.id.au>,
Nicholas Piggin <npiggin@gmail.com>,
Christophe Leroy <chleroy@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Chris Li <chrisl@kernel.org>,
Kemeng Shi <shikemeng@huaweicloud.com>,
Nhat Pham <nphamcs@gmail.com>, Baoquan He <baoquan.he@linux.dev>,
Barry Song <baohua@kernel.org>,
Youngjun Park <youngjun.park@lge.com>,
David Hildenbrand <david@kernel.org>,
linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
Sayali Patil <sayalip@linux.ibm.com>
Subject: Re: [PATCH v4 2/3] mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER
Date: Wed, 24 Jun 2026 16:45:21 +0530 [thread overview]
Message-ID: <pl1gw5o6.ritesh.list@gmail.com> (raw)
In-Reply-To: <CAMgjq7BNMYCBKDYH_O-mHsBdAeSq4Z_ve5oDB6rQTGioHo26GQ@mail.gmail.com>
Kairui Song <ryncsn@gmail.com> writes:
> On Fri, Jun 19, 2026 at 12:42 PM Ritesh Harjani (IBM)
> <ritesh.list@gmail.com> wrote:
>>
>> SWAP_NR_ORDERS sizes a few small bounded arrays inside THP swap
>> allocator code (nofull/frag cluster lists, percpu_swap_cluster's
>> si/offset arrays, next array for rotational device). This currently
>> expands to PMD_ORDER+1, which only works when PMD_ORDER is a compile
>> time constant.
>>
>> However on architecture like PowerPC Book3S64, PMD_ORDER is a runtime
>> variable which depends upon which MMU is selected (Radix / Hash), so in
>> that case, PMD_ORDER cannot be used to size the static arrays.
>>
>> This patch provides an optional ARCH_MAX_PMD_ORDER (upper-bound)
>> override for such architectures. The memory overhead on enabling this
>> override is negligible. Even if we make SWAP_NR_ORDERS runtime alloc,
>> default slab padding could cause some memory waste. Also we lose the
>> per-cpu cacheline benefits (for percpu_swap_cluster) because it might
>> cost an extra cacheline indirection overhead in swap_alloc_fast() for
>> fetching si[order]/offset[order]. Note that a fully runtime
>> SWAP_NR_ORDERS was considered in previous version but was dropped for
>> this reason [1]
>>
>> [1]: https://lore.kernel.org/linuxppc-dev/pl1zdksc.ritesh.list@gmail.com/
>>
>> Suggested-by: YoungJun Park <youngjun.park@lge.com>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>> arch/powerpc/include/asm/book3s/64/pgtable.h | 7 +++++++
>> include/linux/swap.h | 12 +++++++++++-
>> 2 files changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> index e67e64ac6e8c..7f22d5d5fbdf 100644
>> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> @@ -204,6 +204,13 @@ extern unsigned long __pmd_frag_size_shift;
>> #define MAX_PTRS_PER_PGD (1 << (H_PGD_INDEX_SIZE > RADIX_PGD_INDEX_SIZE ? \
>> H_PGD_INDEX_SIZE : RADIX_PGD_INDEX_SIZE))
>>
>> +/*
>> + * Compile-time upper bound on PMD_ORDER across hash and radix MMUs.
>> + * Used by THP SWAP code. Check include/linux/swap.h
>> + */
>> +#define ARCH_MAX_PMD_ORDER ((H_PTE_INDEX_SIZE > RADIX_PTE_INDEX_SIZE) ? \
>> + H_PTE_INDEX_SIZE : RADIX_PTE_INDEX_SIZE)
>
> Hi Ritesh
>
> So swap is the only user of this macro? Will there by any other users?
>
No other users so far other than swap.
> I see that due to the percpu cluster design, it's hard to use a
> flexible array here. We will probabaly get rid of the fixed percpu
> cluster design in the future. By then should we be able to get rid of
> this macro?
>
Earlier in RFC version [1] it was runtime though, but as stated in the
commit msg, it adds unncessary complexity and yes, the per-cpu usage
there, made me re-think this whole thing (as Youngjun also suggested).
Since the allocation of si/offset of percpu_swap_cluster in fastpath
means, we also loose on the cacheline benefits that it otherwise had.
[1]: https://lore.kernel.org/linux-mm/19688ab5ab8017467749e003cf630c76a4b2b198.1781000840.git.ritesh.list@gmail.com/
Sure - I am not well aware of the plans on how to avoid the fixed
per-cpu cluster design here. Maybe if you can share some details, that
will be helpful.
But essentially yes, per-cpu swap cluster was the major reason why we
looked at adding ARCH_MAX_PMD_ORDER for PowerPC. Also note that this
does not cost any additional memory overhead compared to the runtime
solution, since kmalloc allocations of these structures were anyway
adding some bit of padding.
> I'm OK with this approach though. This current design has no negative
> effect on other archs so no reason to block it,
Sure. Thanks!
> just wondering if this can be made simpler in the future :)
Well it's relative. I felt this is a simpler design compared to the RFC
version we had earlier [1]. But still - can you share some additional
details of your concerns please.
Having said that - sure if in future we get rid of the fixed percpu
design, then I am happy to revisit this to see if this macro can be
killed - by maybe adopting to runtime allocations.
Thanks for looking into this!
-ritesh
next prev parent reply other threads:[~2026-06-24 11:50 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-19 4:40 [PATCH v4 0/3] mm, swap: Enable THP SWAP for PowerPC Book3S64 Ritesh Harjani (IBM)
2026-06-19 4:40 ` [PATCH v4 1/3] mm, swap: make SWAPFILE_CLUSTER runtime Ritesh Harjani (IBM)
2026-06-22 1:39 ` Barry Song
2026-06-23 4:11 ` Ritesh Harjani
2026-06-23 8:44 ` Barry Song
2026-06-19 4:40 ` [PATCH v4 2/3] mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER Ritesh Harjani (IBM)
2026-06-23 5:11 ` Barry Song
2026-06-23 6:37 ` Ritesh Harjani
2026-06-23 8:42 ` Barry Song
2026-06-23 9:32 ` Ritesh Harjani
2026-06-24 10:24 ` Kairui Song
2026-06-24 11:15 ` Ritesh Harjani [this message]
2026-06-19 4:40 ` [PATCH v4 3/3] powerpc: Kconfig: Enable THP_SWAP on Book3S64 Ritesh Harjani (IBM)
2026-06-23 5:21 ` Barry Song
2026-06-23 7:06 ` Ritesh Harjani
2026-06-23 8:39 ` Barry Song
2026-06-23 9:03 ` Ritesh Harjani
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=pl1gw5o6.ritesh.list@gmail.com \
--to=ritesh.list@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=baohua@kernel.org \
--cc=baoquan.he@linux.dev \
--cc=chleroy@kernel.org \
--cc=chrisl@kernel.org \
--cc=david@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=maddy@linux.ibm.com \
--cc=mpe@ellerman.id.au \
--cc=nphamcs@gmail.com \
--cc=npiggin@gmail.com \
--cc=ryncsn@gmail.com \
--cc=sayalip@linux.ibm.com \
--cc=shikemeng@huaweicloud.com \
--cc=youngjun.park@lge.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox