From: Yang Shi <yang@os.amperecomputing.com>
To: Ryan Roberts <ryan.roberts@arm.com>,
will@kernel.org, catalin.marinas@arm.com,
akpm@linux-foundation.org, Miko.Lenczewski@arm.com,
dev.jain@arm.com, scott@os.amperecomputing.com, cl@gentwo.org
Cc: linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v6 3/4] arm64: mm: support large block mapping when rodata=full
Date: Tue, 5 Aug 2025 10:59:16 -0700 [thread overview]
Message-ID: <06e63282-771c-48ec-bcfd-b174e94d52bf@os.amperecomputing.com> (raw)
In-Reply-To: <20250805081350.3854670-4-ryan.roberts@arm.com>
On 8/5/25 1:13 AM, Ryan Roberts wrote:
> From: Yang Shi <yang@os.amperecomputing.com>
>
> When rodata=full is specified, the kernel linear mapping has to be
> mapped at PTE level, since large block mappings can't be split later
> without violating the break-before-make rule on ARM64.
>
> This results in a few problems:
> - performance degradation
> - more TLB pressure
> - memory wasted on kernel page tables
>
> With FEAT_BBM level 2 support, splitting a large block mapping into
> smaller ones no longer requires making the page table entry invalid.
> This allows the kernel to split large block mappings on the fly.
>
> Add kernel page table split support and use large block mappings by
> default for rodata=full when FEAT_BBM level 2 is supported. When
> permissions are changed on the kernel linear mapping, the page tables
> will be split down to the required granularity.
>
> Machines without FEAT_BBM level 2 fall back to a PTE-mapped kernel
> linear mapping when rodata=full.
>
> With this we saw a significant performance boost in some benchmarks
> and much lower memory consumption on my AmpereOne machine (192 cores,
> 1P) with 256GB memory.
>
> * Memory use after boot
> Before:
> MemTotal: 258988984 kB
> MemFree: 254821700 kB
>
> After:
> MemTotal: 259505132 kB
> MemFree: 255410264 kB
>
> Around 500MB more memory is free to use, since far fewer pages are
> needed for kernel page tables. The larger the machine, the more memory
> is saved.
>
> * Memcached
> We saw a performance degradation when running the Memcached benchmark
> with rodata=full vs rodata=on. Our profiling pointed to kernel TLB
> pressure. With this patchset, ops/sec increased by around 3.5% and P99
> latency dropped by around 9.6%.
> The gain mainly comes from reduced kernel TLB misses: kernel TLB MPKI
> is reduced by 28.5%.
>
> The benchmark data is now on par with rodata=on too.
>
> * Disk encryption (dm-crypt) benchmark
> Ran fio benchmark with the below command on a 128G ramdisk (ext4) with
> disk encryption (by dm-crypt).
> fio --directory=/data --random_generator=lfsr --norandommap \
> --randrepeat 1 --status-interval=999 --rw=write --bs=4k --loops=1 \
> --ioengine=sync --iodepth=1 --numjobs=1 --fsync_on_close=1 \
> --group_reporting --thread --name=iops-test-job --eta-newline=1 \
> --size 100G
>
> IOPS increased by 90% - 150% (the variance is high, but the worst
> result with the patch is still around 90% better than the best result
> without it). Bandwidth increased and average completion latency
> dropped proportionally.
>
> * Sequential file read
> Read a 100G file sequentially on XFS (xfs_io read with the page cache
> populated). Bandwidth increased by 150%.
>
> Co-developed-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
> ---
> arch/arm64/include/asm/cpufeature.h | 2 +
> arch/arm64/include/asm/mmu.h | 1 +
> arch/arm64/include/asm/pgtable.h | 5 +
> arch/arm64/kernel/cpufeature.c | 7 +-
> arch/arm64/mm/mmu.c | 237 +++++++++++++++++++++++++++-
> arch/arm64/mm/pageattr.c | 6 +
> 6 files changed, 252 insertions(+), 6 deletions(-)
>
[...]
> +
> +static DEFINE_MUTEX(pgtable_split_lock);
> +
> +int split_kernel_leaf_mapping(unsigned long addr)
> +{
> + pgd_t *pgdp, pgd;
> + p4d_t *p4dp, p4d;
> + pud_t *pudp, pud;
> + pmd_t *pmdp, pmd;
> + pte_t *ptep, pte;
> + int ret = 0;
> +
> + /*
> + * !BBML2_NOABORT systems should not be trying to change permissions on
> + * anything that is not pte-mapped in the first place. Just return early
> + * and let the permission change code raise a warning if not already
> + * pte-mapped.
> + */
> + if (!system_supports_bbml2_noabort())
> + return 0;
> +
> + /*
> + * Ensure addr is at least page-aligned since this is the finest
> + * granularity we can split to.
> + */
> + if (addr != PAGE_ALIGN(addr))
> + return -EINVAL;
> +
> + mutex_lock(&pgtable_split_lock);
> + arch_enter_lazy_mmu_mode();
> +
> + /*
> + * PGD: If addr is PGD aligned then addr already describes a leaf
> + * boundary. If not present then there is nothing to split.
> + */
> + if (ALIGN_DOWN(addr, PGDIR_SIZE) == addr)
> + goto out;
> + pgdp = pgd_offset_k(addr);
> + pgd = pgdp_get(pgdp);
> + if (!pgd_present(pgd))
> + goto out;
> +
> + /*
> + * P4D: If addr is P4D aligned then addr already describes a leaf
> + * boundary. If not present then there is nothing to split.
> + */
> + if (ALIGN_DOWN(addr, P4D_SIZE) == addr)
> + goto out;
> + p4dp = p4d_offset(pgdp, addr);
> + p4d = p4dp_get(p4dp);
> + if (!p4d_present(p4d))
> + goto out;
> +
> + /*
> + * PUD: If addr is PUD aligned then addr already describes a leaf
> + * boundary. If not present then there is nothing to split. Otherwise,
> + * if we have a pud leaf, split to contpmd.
> + */
> + if (ALIGN_DOWN(addr, PUD_SIZE) == addr)
> + goto out;
> + pudp = pud_offset(p4dp, addr);
> + pud = pudp_get(pudp);
> + if (!pud_present(pud))
> + goto out;
> + if (pud_leaf(pud)) {
> + ret = split_pud(pudp, pud);
> + if (ret)
> + goto out;
> + }
> +
> + /*
> + * CONTPMD: If addr is CONTPMD aligned then addr already describes a
> + * leaf boundary. If not present then there is nothing to split.
> + * Otherwise, if we have a contpmd leaf, split to pmd.
> + */
> + if (ALIGN_DOWN(addr, CONT_PMD_SIZE) == addr)
> + goto out;
> + pmdp = pmd_offset(pudp, addr);
> + pmd = pmdp_get(pmdp);
> + if (!pmd_present(pmd))
> + goto out;
> + if (pmd_leaf(pmd)) {
> + if (pmd_cont(pmd))
> + split_contpmd(pmdp);
> + /*
> + * PMD: If addr is PMD aligned then addr already describes a
> + * leaf boundary. Otherwise, split to contpte.
> + */
> + if (ALIGN_DOWN(addr, PMD_SIZE) == addr)
> + goto out;
> + ret = split_pmd(pmdp, pmd);
> + if (ret)
> + goto out;
> + }
> +
> + /*
> + * CONTPTE: If addr is CONTPTE aligned then addr already describes a
> + * leaf boundary. If not present then there is nothing to split.
> + * Otherwise, if we have a contpte leaf, split to pte.
> + */
> + if (ALIGN_DOWN(addr, CONT_PTE_SIZE) == addr)
> + goto out;
> + ptep = pte_offset_kernel(pmdp, addr);
> + pte = __ptep_get(ptep);
> + if (!pte_present(pte))
> + goto out;
> + if (pte_cont(pte))
> + split_contpte(ptep);
> +
> +out:
> + arch_leave_lazy_mmu_mode();
> + mutex_unlock(&pgtable_split_lock);
> + return ret;
> }
>
> /*
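
For readers without the full patch handy: the split_pud()/split_contpmd()/
split_pmd()/split_contpte() helpers called above presumably sit in the
elided '[...]' region. As a rough sketch only (not the patch's actual
implementation, assuming the usual arm64 contpte helpers), splitting a
contpte block amounts to rewriting its constituent PTEs without the
contiguous hint, which BBML2_NOABORT lets us do in place without a
break-before-make cycle:

static void split_contpte(pte_t *ptep)
{
	pte_t *first;
	int i;

	/* First entry of the contiguous block containing ptep. */
	first = PTR_ALIGN_DOWN(ptep, CONT_PTES * sizeof(*ptep));

	/* Rewrite each PTE without the contiguous hint; no BBM required. */
	for (i = 0; i < CONT_PTES; i++, first++)
		__set_pte(first, pte_mknoncont(__ptep_get(first)));
}

split_contpmd() would be the analogous loop over CONT_PMDS entries, while
split_pud()/split_pmd() additionally need to allocate a child table and
populate it before installing the new table entry in place of the block.
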
> @@ -640,6 +857,16 @@ static inline void arm64_kfence_map_pool(phys_addr_t kfence_pool, pgd_t *pgdp) {
>
> #endif /* CONFIG_KFENCE */
>
> +static inline bool force_pte_mapping(void)
> +{
> + bool bbml2 = system_capabilities_finalized() ?
> + system_supports_bbml2_noabort() : bbml2_noabort_available();
> +
> + return (!bbml2 && (rodata_full || arm64_kfence_can_set_direct_map() ||
> + is_realm_world())) ||
> + debug_pagealloc_enabled();
> +}
> +
> static void __init map_mem(pgd_t *pgdp)
> {
> static const u64 direct_map_end = _PAGE_END(VA_BITS_MIN);
> @@ -665,7 +892,7 @@ static void __init map_mem(pgd_t *pgdp)
>
> early_kfence_pool = arm64_kfence_alloc_pool();
>
> - if (can_set_direct_map())
> + if (force_pte_mapping())
> flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>
> /*
> @@ -1367,7 +1594,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
>
> VM_BUG_ON(!mhp_range_allowed(start, size, true));
>
> - if (can_set_direct_map())
> + if (force_pte_mapping())
> flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>
> __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> index c6a85000fa0e..6a8eefc16dbc 100644
> --- a/arch/arm64/mm/pageattr.c
> +++ b/arch/arm64/mm/pageattr.c
> @@ -140,6 +140,12 @@ static int update_range_prot(unsigned long start, unsigned long size,
> data.set_mask = set_mask;
> data.clear_mask = clear_mask;
>
> + ret = split_kernel_leaf_mapping(start);
> + if (!ret)
> + ret = split_kernel_leaf_mapping(start + size);
> + if (WARN_ON_ONCE(ret))
> + return ret;
This means we take the mutex twice and enter/leave lazy MMU mode twice
too. So how about:
mutex_lock()
enter lazy mmu
split_mapping(start)
split_mapping(end)
leave lazy mmu
mutex_unlock()
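
Concretely, a minimal sketch of that shape (the _locked() helper name is
only illustrative, standing for the existing walk with the locking and
lazy MMU enter/leave hoisted out; this is not code from the patch):

int split_kernel_leaf_mappings(unsigned long start, unsigned long end)
{
	int ret;

	/* Same early-out conditions as the current implementation. */
	if (!system_supports_bbml2_noabort())
		return 0;
	if (start != PAGE_ALIGN(start) || end != PAGE_ALIGN(end))
		return -EINVAL;

	mutex_lock(&pgtable_split_lock);
	arch_enter_lazy_mmu_mode();

	ret = split_kernel_leaf_mapping_locked(start);
	if (!ret)
		ret = split_kernel_leaf_mapping_locked(end);

	arch_leave_lazy_mmu_mode();
	mutex_unlock(&pgtable_split_lock);

	return ret;
}

update_range_prot() would then make a single call, e.g.
ret = split_kernel_leaf_mappings(start, start + size);
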
Thanks,
Yang
> +
> arch_enter_lazy_mmu_mode();
>
> /*