Re: [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory

From: Leo Yan <leo.yan@arm.com>
To: Wen Jiang <jiangwenxiaomi@gmail.com>
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org,
	catalin.marinas@arm.com, will@kernel.org,
	akpm@linux-foundation.org, urezki@gmail.com, baohua@kernel.org,
	Xueyuan.chen21@gmail.com, dev.jain@arm.com, rppt@kernel.org,
	david@kernel.org, ryan.roberts@arm.com,
	anshuman.khandual@arm.com, ajd@linux.ibm.com,
	linux-kernel@vger.kernel.org, jiangwen6@xiaomi.com,
	shanghaoqiang@xiaomi.com,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	Mike Leach <mike.leach@arm.com>,
	James Clark <james.clark@linaro.org>,
	Tamas.Petz@arm.com, Michiel.VanTol@arm.com
Subject: Re: [PATCH v4 0/6] mm/vmalloc: Speed up ioremap, vmalloc and vmap with contiguous memory
Date: Fri, 26 Jun 2026 16:12:14 +0100
Message-ID: <20260626151214.GA1794676@e132581.arm.com>
In-Reply-To: <20260618084726.1070022-1-jiangwen6@xiaomi.com>

On Thu, Jun 18, 2026 at 04:47:20PM +0800, Wen Jiang wrote:

> Besides accelerating the mapping path, this also enables large
> mappings (PMD and cont-PTE) for vmap, which are currently not
> supported.

I verified this series with large vmap() mappings for Arm trace buffer
units (TRBE and SPE), and the results are positive.

Arm trace buffer units use the CPU's page tables for address translation
when writing trace data to DRAM. The larger vmap() mapping granules
reduce TLB pressure, resulting in significantly fewer L2D TLB refills
and reduced L1D TLB refills. The decrease in dtlb_walk indicates that
fewer page table walks are required and that address translations are
more often satisfied by cached TLB entries.
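
As a rough illustration of why larger granules reduce TLB pressure, the
sketch below counts how many distinct translations are needed to cover a
1 GiB trace buffer. The sizes are assumptions not stated in this thread:
4 KiB base pages, for which arm64 uses 64 KiB cont-PTE ranges and 2 MiB
PMD blocks. Whether hardware caches a cont-PTE range as a single TLB
entry is implementation-specific, so treat the middle row as a best case.

```python
# Back-of-the-envelope count of translations needed to cover an AUX buffer.
# Assumed (not from the thread): 4 KiB base pages, 64 KiB cont-PTE ranges,
# 2 MiB PMD blocks.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

BUFFER_SIZE = 1 * GIB  # matches "-m ,1G" in the TRBE test command

GRANULES = {
    "4K PTE": 4 * KIB,
    "64K cont-PTE": 64 * KIB,
    "2M PMD": 2 * MIB,
}


def translations_needed(buffer_size: int, granule: int) -> int:
    """Return how many granule-sized mappings cover buffer_size bytes."""
    return -(-buffer_size // granule)  # ceiling division


for name, granule in GRANULES.items():
    count = translations_needed(BUFFER_SIZE, granule)
    print(f"{name:>13}: {count:>7} translations")
```

Under these assumptions, the same buffer needs 262144 translations with
4 KiB pages but only 512 with 2 MiB blocks.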

The detailed results are included below for reference.

Thanks for working on this, and here is my test tag:

Tested-by: Leo Yan <leo.yan@arm.com>

P.S. I applied a local change to set PERF_PMU_CAP_AUX_PREFER_LARGE in
the CoreSight and SPE drivers to allocate large memory chunks. This
change will be sent out once the MM changes are agreed on.


## Results with TRBE

Test command:

  taskset -c 2 perf stat -C 10 -e cycles:u,instructions:u,dtlb_walk:u,l1d_tlb:u,l1d_tlb_refill:u,l2d_tlb_refill:u \
    -- taskset -c 2 perf record -C 10 -m ,1G -e cs_etm// \
    -- taskset -c 10 ./sparse_branch_delay.elf

The benchmark was run five times. CPU10 was isolated and dedicated to
running the workload while collecting the TLB statistics.

Before this series:

 +----------------+--------+--------+--------+--------+--------+----------+
 |TLB Metrics     |   Run1 |   Run2 |   Run3 |   Run4 |   Run5 |     Avg. |
 +----------------+--------+--------+--------+--------+--------+----------+
 | dtlb_walk      |     63 |     75 |     62 |     73 |     69 |     68.4 |
 +----------------+--------+--------+--------+--------+--------+----------+
 | l1d_tlb        |   2093 |   2189 |   2237 |   2036 |   2086 |   2128.2 |
 +----------------+--------+--------+--------+--------+--------+----------+
 | l1d_tlb_refill |    154 |    153 |    150 |    165 |    157 |    155.8 |
 +----------------+--------+--------+--------+--------+--------+----------+
 | l2d_tlb_refill | 161325 | 161403 | 161432 | 161580 | 161439 | 161435.8 |
 +----------------+--------+--------+--------+--------+--------+----------+

After this series:

 +----------------+--------+--------+--------+--------+--------+----------+----------+
 |TLB Metrics     |   Run1 |   Run2 |   Run3 |   Run4 |   Run5 |     Avg. |    Diff. |
 +----------------+--------+--------+--------+--------+--------+----------+----------+
 | dtlb_walk      |     67 |     59 |     60 |     58 |     53 |     59.4 |  -13.16% |
 +----------------+--------+--------+--------+--------+--------+----------+----------+
 | l1d_tlb        |   6710 |   7120 |   6662 |   6626 |   6542 |   6732.0 | +216.32% |
 +----------------+--------+--------+--------+--------+--------+----------+----------+
 | l1d_tlb_refill |    126 |    117 |    119 |    117 |    119 |    119.6 |  -23.23% |
 +----------------+--------+--------+--------+--------+--------+----------+----------+
 | l2d_tlb_refill |    506 |    489 |    485 |    506 |    489 |    495.0 |  -99.69% |
 +----------------+--------+--------+--------+--------+--------+----------+----------+
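
For anyone who wants to re-derive the Avg. and Diff. columns, here is a
minimal sketch. It assumes Diff. is the relative change of the five-run
mean, (after - before) / before; the per-run counts are copied from the
TRBE tables above, and the function names are only illustrative.

```python
from statistics import mean

# Per-run counts copied from the TRBE tables (Run1..Run5).
before = {
    "dtlb_walk": [63, 75, 62, 73, 69],
    "l1d_tlb": [2093, 2189, 2237, 2036, 2086],
    "l1d_tlb_refill": [154, 153, 150, 165, 157],
    "l2d_tlb_refill": [161325, 161403, 161432, 161580, 161439],
}
after = {
    "dtlb_walk": [67, 59, 60, 58, 53],
    "l1d_tlb": [6710, 7120, 6662, 6626, 6542],
    "l1d_tlb_refill": [126, 117, 119, 117, 119],
    "l2d_tlb_refill": [506, 489, 485, 506, 489],
}


def percent_diff(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def summarize(before_runs: dict, after_runs: dict) -> dict:
    """Map each metric to (mean before, mean after, percent change)."""
    rows = {}
    for metric, runs in before_runs.items():
        avg_before = mean(runs)
        avg_after = mean(after_runs[metric])
        rows[metric] = (avg_before, avg_after, percent_diff(avg_before, avg_after))
    return rows


for metric, (avg_b, avg_a, diff) in summarize(before, after).items():
    print(f"{metric:<15} {avg_b:>10.1f} {avg_a:>10.1f} {diff:>+9.2f}%")
```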

## Results with SPE

Test command:

  taskset -c 2 perf stat -C 10 -e cycles:u,instructions:u,dtlb_walk:u,l1d_tlb:u,l1d_tlb_refill:u,l2d_tlb_refill:u \
    -- taskset -c 2 perf record -C 10 -m ,512M -e arm_spe_0/ts_enable=1,pa_enable=1,period=64,min_latency=0/ \
    -- taskset -c 10 dd if=/dev/zero of=/dev/shm/dd_mem_test bs=1M count=1024 status=progress

The benchmark was run five times. CPU10 was isolated and dedicated to
running the workload while collecting the TLB statistics.

Before this series:

 +----------------+--------+--------+--------+--------+--------+----------+
 |TLB Metrics     |   Run1 |   Run2 |   Run3 |   Run4 |   Run5 |     Avg. |
 +----------------+--------+--------+--------+--------+--------+----------+
 | dtlb_walk      |   2090 |   1709 |   1679 |   1519 |   1555 |   1710.4 |
 +----------------+--------+--------+--------+--------+--------+----------+
 | l1d_tlb        | 254450 | 257227 | 252517 | 252535 | 254752 | 254296.2 |
 +----------------+--------+--------+--------+--------+--------+----------+
 | l1d_tlb_refill |  16023 |  16088 |  15944 |  15989 |  15956 |  16000.0 |
 +----------------+--------+--------+--------+--------+--------+----------+
 | l2d_tlb_refill |   5887 |   4204 |   3713 |   4556 |   5620 |   4796.0 |
 +----------------+--------+--------+--------+--------+--------+----------+

After this series:

 +----------------+--------+--------+--------+--------+--------+----------+----------+
 |TLB Metrics     |   Run1 |   Run2 |   Run3 |   Run4 |   Run5 |     Avg. |    Diff. |
 +----------------+--------+--------+--------+--------+--------+----------+----------+
 | dtlb_walk      |   1111 |   1301 |   1229 |   1166 |   1771 |   1315.6 |  -23.08% |
 +----------------+--------+--------+--------+--------+--------+----------+----------+
 | l1d_tlb        | 257462 | 257420 | 257241 | 259968 | 261324 | 258683.0 |   +1.73% |
 +----------------+--------+--------+--------+--------+--------+----------+----------+
 | l1d_tlb_refill |  15954 |  15919 |  15948 |  15962 |  15968 |  15950.2 |   -0.31% |
 +----------------+--------+--------+--------+--------+--------+----------+----------+
 | l2d_tlb_refill |   2672 |   2558 |   2801 |   2478 |   4147 |   2931.2 |  -38.88% |
 +----------------+--------+--------+--------+--------+--------+----------+----------+

