From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mail-pf1-f169.google.com (mail-pf1-f169.google.com [209.85.210.169])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 92FDB1DB356
	for <linux-kernel@vger.kernel.org>; Wed, 24 Jun 2026 11:50:59 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.169
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1782301860; cv=none; b=qmRXMmr8kdjwS/THaQSs9brvghHMlLkLRdXBDZMRffLYR2Au2MGI1ymsIwEarPDidGnBbtu9VrCGg1JlZkAQyalRIIMI34jzcO0JnPmrShha0zcDfPbpdvQBOPuwkXleKEk53CTsN0SurZB9K3y0OQdCS+45j/pRzz6jY/jIQZY=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1782301860; c=relaxed/simple;
	bh=/xz05pz6deue6P0QNHqraBcCc0F90hoRfEeIZvDIfv4=;
	h=From:To:Cc:Subject:In-Reply-To:Date:Message-ID:References:
	 MIME-version:Content-type; b=VWbLAAmcICIm1XgYxZfZqyL2uA1fYbhzMTl2fy6ZLfl1GCH72c2fQNQztSbtnhC587qfahIsn0V3haRtX/BoGqIGJV0toCg2yx4IKnbVWhpL9dtBPf9q05OezW9qAYy1PJsA1XHWMyhs9TwyKhlA9C5fhD1AQ0DGRrmr6d/8OEc=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=EvFkUAdp; arc=none smtp.client-ip=209.85.210.169
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="EvFkUAdp"
Received: by mail-pf1-f169.google.com with SMTP id d2e1a72fcca58-84538597e1fso655374b3a.1
        for <linux-kernel@vger.kernel.org>; Wed, 24 Jun 2026 04:50:59 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20251104; t=1782301859; x=1782906659; darn=vger.kernel.org;
        h=content-transfer-encoding:mime-version:references:message-id:date
         :in-reply-to:subject:cc:to:from:from:to:cc:subject:date:message-id
         :reply-to;
        bh=XLYGI+jGGD5SSrEhLKeuITqC8EDy62Dm21Ns6LCR8Mk=;
        b=EvFkUAdp6BwfrKEOYlZ2PnYjLyT4h73WdZ3BluISOna5RigAvFbJvqPLEJX7SSc2Gi
         kkovII/oWo12I8pNDgblbNaPFRMjqv8QgJYAnVTjwTneggAFwURV+QwFtZkUbbrT6Crg
         MpqnScjUqz61oDIRIIO8YiOIY45kUXDseicRnHojSpjfdNKKgeXNVO9mioiCbUggqFxI
         B/M4VYhVVDwRhQDS32Tdm0y4Oqm+uEDYOvxk7ac+EEMPN5A+vSQMqGcBjzVdUrQgcHad
         QomvXd8FdA0Nk/v2wILszri/pwGV/JRSpg/aeqVKg+rfLtITQJSrEhz5zm6gu3ptjbLS
         ld2w==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20251104; t=1782301859; x=1782906659;
        h=content-transfer-encoding:mime-version:references:message-id:date
         :in-reply-to:subject:cc:to:from:x-gm-gg:x-gm-message-state:from:to
         :cc:subject:date:message-id:reply-to;
        bh=XLYGI+jGGD5SSrEhLKeuITqC8EDy62Dm21Ns6LCR8Mk=;
        b=sBdZ7SAomUCt0DjBePCgBR19EcNL8UK80Z3nWEdKDkTzPRQsRaWGGsTxErAqZCpXoX
         d548U8JI9aYrb6GmFYq6PP065fP5Jp9wJhLQ8VJqsAMVWJ6IuNGCcD/dqxB2W45NXBOw
         TKdnymiLevb5DKy1Yhgj+Kk1v1zbo7U3j9GJl645Gx4LlwIzHkTuU3x1vhxfAfnNTv7C
         UO5v/UOCSz8ClboGMrAtSoxcxMpBVGl3YElpPRGRXvgge79jTxjEcTAjtOYJFIpYWfub
         n/xLqSd3xQT/34ZrTCgelo4+LCoz8zDdmNU70xuJJgt/2haKMpsf/TApFO4A7SnyaXjw
         Nrgw==
X-Forwarded-Encrypted: i=1; AFNElJ9FaZkfU8dLIkcEiIU33ntIs+ppyBtZlrvnqYredbvV/G2eJIyT+2KmGLsecTb5CH5gULmaqdUziUR0xis=@vger.kernel.org
X-Gm-Message-State: AOJu0YzN1W46doWmh+ch2LizQW+4FiHjcpkIqOpiYkFHJZgfLQYeNiyg
	d/eHibBREe+QIgU+ERLLq6gRysE/ygU6BBLOdzCe+Ci9Z3ncfFJkePnv
X-Gm-Gg: AfdE7cnRGaukM0wEa7Pino6SMIWAVLhnuJ5zVhbbkc9EK5ZXJ5JpJ9xUFGUEnKP0y5a
	zb4aX6nokHJGyvduozHwkcKWgEwavCDK+a0xg0oQ/2GtKa7m4LG/GYdM/o4NdSI2Db32S79i+iS
	rbN5JXuq1CsSHT2qotVyRme2FQRUimRSKHZ7lv9WWycak9uOjmzM3/x2ckTMvzc7i7pNDt5FETu
	0mceYfAo8fO2fFD/jvs+zb2PagEY3mzsW6a6rTgsWe2H9T77rat1r80XOGNbxlQLe1l4OpxNicu
	iVqMCaoHxSD/pzmEPneefvqzXAKoeAFFP618+Jes2sfJLewPtZvqGNuCUVME/cl7Cwhq/whlQPR
	s1+7DIjHdBfkJziMPvHNFpQVj4+INFohebEoevSjpnqmlzbwE92rpdgG4ZWPXVgO83DmXuGVbI6
	iNZPSb10osJmFoAj2wQW2VfKnlaA==
X-Received: by 2002:a05:6a00:2305:b0:842:7f81:8079 with SMTP id d2e1a72fcca58-84595326988mr8140571b3a.37.1782301858737;
        Wed, 24 Jun 2026 04:50:58 -0700 (PDT)
Received: from pve-server ([49.205.216.49])
        by smtp.gmail.com with ESMTPSA id d2e1a72fcca58-845a413d2aasm2435598b3a.59.2026.06.24.04.50.51
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Wed, 24 Jun 2026 04:50:56 -0700 (PDT)
From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Kairui Song <ryncsn@gmail.com>
Cc: linux-mm@kvack.org, Madhavan Srinivasan <maddy@linux.ibm.com>, Michael Ellerman <mpe@ellerman.id.au>, Nicholas Piggin <npiggin@gmail.com>, Christophe Leroy <chleroy@kernel.org>, Andrew Morton <akpm@linux-foundation.org>, Chris Li <chrisl@kernel.org>, Kemeng Shi <shikemeng@huaweicloud.com>, Nhat Pham <nphamcs@gmail.com>, Baoquan He <baoquan.he@linux.dev>, Barry Song <baohua@kernel.org>, Youngjun Park <youngjun.park@lge.com>, David Hildenbrand <david@kernel.org>, linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org, Sayali Patil <sayalip@linux.ibm.com>
Subject: Re: [PATCH v4 2/3] mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER
In-Reply-To: <CAMgjq7BNMYCBKDYH_O-mHsBdAeSq4Z_ve5oDB6rQTGioHo26GQ@mail.gmail.com>
Date: Wed, 24 Jun 2026 16:45:21 +0530
Message-ID: <pl1gw5o6.ritesh.list@gmail.com>
References: <cover.1781843449.git.ritesh.list@gmail.com> <eda4e51ee9f1270582fbb2823ec5873e769de089.1781843449.git.ritesh.list@gmail.com> <CAMgjq7BNMYCBKDYH_O-mHsBdAeSq4Z_ve5oDB6rQTGioHo26GQ@mail.gmail.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-version: 1.0
Content-type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Kairui Song <ryncsn@gmail.com> writes:

> On Fri, Jun 19, 2026 at 12:42 PM Ritesh Harjani (IBM)
> <ritesh.list@gmail.com> wrote:
>>
>> SWAP_NR_ORDERS sizes a few small bounded arrays inside THP swap
>> allocator code (nofull/frag cluster lists, percpu_swap_cluster's
>> si/offset arrays, next array for rotational device). This currently
>> expands to PMD_ORDER+1, which only works when PMD_ORDER is a compile
>> time constant.
>>
>> However on architecture like PowerPC Book3S64, PMD_ORDER is a runtime
>> variable which depends upon which MMU is selected (Radix / Hash), so in
>> that case, PMD_ORDER cannot be used to size the static arrays.
>>
>> This patch provides an optional ARCH_MAX_PMD_ORDER (upper-bound)
>> override for such architectures. The memory overhead on enabling this
>> override is negligible. Even if we make SWAP_NR_ORDERS runtime alloc,
>> default slab padding could cause some memory waste. Also we lose the
>> per-cpu cacheline benefits (for percpu_swap_cluster) because it might
>> cost an extra cacheline indirection overhead in swap_alloc_fast() for
>> fetching si[order]/offset[order]. Note that a fully runtime
>> SWAP_NR_ORDERS was considered in previous version but was dropped for
>> this reason [1]
>>
>> [1]: https://lore.kernel.org/linuxppc-dev/pl1zdksc.ritesh.list@gmail.com/
>>
>> Suggested-by: YoungJun Park <youngjun.park@lge.com>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>  arch/powerpc/include/asm/book3s/64/pgtable.h |  7 +++++++
>>  include/linux/swap.h                         | 12 +++++++++++-
>>  2 files changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> index e67e64ac6e8c..7f22d5d5fbdf 100644
>> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> @@ -204,6 +204,13 @@ extern unsigned long __pmd_frag_size_shift;
>>  #define MAX_PTRS_PER_PGD       (1 << (H_PGD_INDEX_SIZE > RADIX_PGD_INDEX_SIZE ? \
>>                                        H_PGD_INDEX_SIZE : RADIX_PGD_INDEX_SIZE))
>>
>> +/*
>> + * Compile-time upper bound on PMD_ORDER across hash and radix MMUs.
>> + * Used by THP SWAP code. Check include/linux/swap.h
>> + */
>> +#define ARCH_MAX_PMD_ORDER ((H_PTE_INDEX_SIZE > RADIX_PTE_INDEX_SIZE) ? \
>> +                               H_PTE_INDEX_SIZE : RADIX_PTE_INDEX_SIZE)
>
> Hi Ritesh
>
> So swap is the only user of this macro? Will there be any other users?
>

No other users so far other than swap.

> I see that due to the percpu cluster design, it's hard to use a
> flexible array here. We will probably get rid of the fixed percpu
> cluster design in the future. By then should we be able to get rid of
> this macro?
>

In the earlier RFC version [1] this was sized at runtime, but as stated in
the commit msg, that adds unnecessary complexity. And yes, the per-cpu
usage there made me rethink the whole thing (as Youngjun also suggested):
allocating the si/offset arrays of percpu_swap_cluster at runtime means an
extra indirection in the fastpath, so we also lose the cacheline benefits
it otherwise had.

[1]: https://lore.kernel.org/linux-mm/19688ab5ab8017467749e003cf630c76a4b2b198.1781000840.git.ritesh.list@gmail.com/

I am not aware of the plans for moving away from the fixed per-cpu
cluster design here. If you can share some details, that would be
helpful.

But essentially yes, the per-cpu swap cluster was the major reason why we
looked at adding ARCH_MAX_PMD_ORDER for PowerPC. Also note that this
does not cost any additional memory compared to the runtime solution,
since the kmalloc allocations of these structures were adding some
padding anyway.
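
To make the trade-off concrete, here is a rough userspace sketch of the
idea. It is not the kernel code: every DEMO_* name and value below is made
up for illustration, and demo_pcp_cluster only loosely mirrors
percpu_swap_cluster. The arrays are sized by a compile-time upper bound,
while the order actually in use is only known at runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-MMU PMD orders, for illustration only. */
#define DEMO_HASH_PMD_ORDER	8
#define DEMO_RADIX_PMD_ORDER	9

/* Compile-time upper bound, same shape as ARCH_MAX_PMD_ORDER in the patch. */
#define DEMO_MAX_PMD_ORDER ((DEMO_HASH_PMD_ORDER > DEMO_RADIX_PMD_ORDER) ? \
			    DEMO_HASH_PMD_ORDER : DEMO_RADIX_PMD_ORDER)
#define DEMO_NR_ORDERS (DEMO_MAX_PMD_ORDER + 1)

_Static_assert(DEMO_MAX_PMD_ORDER >= DEMO_HASH_PMD_ORDER &&
	       DEMO_MAX_PMD_ORDER >= DEMO_RADIX_PMD_ORDER,
	       "upper bound must cover both MMU modes");

/*
 * Arrays are embedded in the struct, so reading offset[order] needs no
 * extra pointer dereference (unlike separately allocated arrays).
 */
struct demo_pcp_cluster {
	void *si[DEMO_NR_ORDERS];
	unsigned long offset[DEMO_NR_ORDERS];
};

/* Stand-in for the PMD order that is only known once the MMU is chosen. */
static inline int demo_runtime_pmd_order(int use_radix)
{
	return use_radix ? DEMO_RADIX_PMD_ORDER : DEMO_HASH_PMD_ORDER;
}

/*
 * Return the cached offset for @order, or 0 if @order is outside the
 * range valid for the MMU mode selected at runtime.
 */
static inline unsigned long
demo_cached_offset(const struct demo_pcp_cluster *pcp, int order,
		   int runtime_pmd_order)
{
	/* The runtime order must never exceed the compile-time bound. */
	assert(runtime_pmd_order <= DEMO_MAX_PMD_ORDER);
	if (order < 0 || order > runtime_pmd_order)
		return 0;
	return pcp->offset[order];
}
```

With these made-up values, the mode with the smaller order leaves one slot
per array unused, which is the whole cost of the upper-bound approach in
this sketch.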


> I'm OK with this approach though. This current design has no negative
> effect on other archs so no reason to block it,

Sure. Thanks!

> just wondering if this can be made simpler in the future :)

Well, it's relative. I felt this is a simpler design compared to the RFC
version we had earlier [1]. But still, can you please share some more
details of your concerns?

Having said that, if we get rid of the fixed percpu design in the
future, I am happy to revisit this and see whether the macro can be
killed, maybe by switching to runtime allocations.

Thanks for looking into this!

-ritesh