From: Shivank Garg <shivankg@amd.com>
To: Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>, <linux-mm@kvack.org>,
<linux-kernel@vger.kernel.org>, <x86@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>, Thomas Gleixner <tglx@kernel.org>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
"H . Peter Anvin" <hpa@zytor.com>,
Ankur Arora <ankur.a.arora@oracle.com>,
Bharata B Rao <bharata@amd.com>,
"Hrushikesh Salunke" <hsalunke@amd.com>,
David Rientjes <rientjes@google.com>,
"Shivank Garg" <shivankg@amd.com>
Subject: [RFC PATCH 0/1] batch page copies in folio_copy() and folio_mc_copy()
Date: Mon, 27 Apr 2026 14:20:36 +0000
Message-ID: <20260427142036.111940-2-shivankg@amd.com>

This RFC batches folio_copy() and folio_mc_copy() for the
!HIGHMEM && !__HAVE_ARCH_COPY_HIGHPAGE path. A naive bulk-memcpy()
implementation gives a ~2x speedup on AMD Zen 5 but regresses AMD
Zen 3, because memcpy() resolves to different primitives on the two
uarchs. I think the right end state is a copy_pages() x86 helper that
mirrors what Ankur Arora did for clear_pages(), but I'd like opinions
before doing that arch-side work.
Both helpers loop over each 4 KiB constituent page:

    folio_copy(dst, src):
        for each page i in src:
            copy_highpage(dst[i], src[i])
            cond_resched()

    folio_mc_copy(dst, src):
        for each page i in src:
            if copy_mc_highpage(dst[i], src[i]):
                return -EHWPOISON
            cond_resched()
On !HIGHMEM and !__HAVE_ARCH_COPY_HIGHPAGE, the per-iteration calls
boil down to:

    copy_highpage() -> copy_page()
        = rep movsq                                 [REP_GOOD]
        = unrolled movq loop                        [otherwise]

    copy_mc_highpage() -> copy_mc_to_kernel()
        = copy_mc_fragile                           [if enabled]
        = copy_mc_enhanced_fast_string (rep movsb)  [ERMS]
        = memcpy()                                  [otherwise]
            = rep movsb                             [FSRM]
            = memcpy_orig                           [otherwise]
So the two helpers are not symmetric. copy_highpage() is unambiguously
a microcoded rep movsq per page on every CPU we care about.
copy_mc_highpage() is a runtime dispatch that already uses rep movsb
on ERMS/FSRM CPUs and otherwise falls back to memcpy_orig.
For this RFC, I added a naive batched fast path that replaces the
per-page call with a single bulk call:

    copy_highpages()    -> memcpy(N * PAGE_SIZE)
    copy_mc_highpages() -> copy_mc_to_kernel(N * PAGE_SIZE)
copy_mc_highpages() is fine: copy_mc_to_kernel() lands on the
same family of primitives at any length, so batching it just amortises
per-call setup.
                          | copy_mc_to_kernel() at any length
    ----------------------|------------------------------------------
    Zen 3 (REP_GOOD only) | memcpy() -> memcpy_orig (movq)
    Zen 4/5               | copy_mc_enhanced_fast_string (rep movsb)
copy_highpages() is where things break. It swaps the per-page
primitive from copy_page() (rep movsq on REP_GOOD) to memcpy(),
which has its own ALTERNATIVE that picks rep movsb on FSRM and
memcpy_orig on everything else. The primitive choice now depends
on X86_FEATURE_FSRM, not on what the caller wanted:
                          | copy_page() (per page) | memcpy() (bulk)
    ----------------------|------------------------|--------------------
    Zen 3 (REP_GOOD only) | rep movsq              | memcpy_orig (movq)
    Zen 4/5 (FSRM)        | rep movsq              | rep movsb
Test setup on Zen 5
===================
- Dual-socket AMD EPYC 9655 (Zen 5), three NUMA nodes
(DRAM node 0/1, CXL.mem node 2)
- Kernel based on akpm/mm-new:c656c6a02
- performance governor
- Bench thread pinned to CPU 4
Microbenchmark: folio_mc_copy() / folio_copy() in isolation
===========================================================
Wrote a simple kernel module that allocates a single src/dst folio
pair of the requested order and times folio_*_copy() between them via
ktime_get_ns(). Each iteration optionally streams an eviction buffer
(128MB) through L3 to evict cache lines, so we can measure both
cache-cold and cache-hot regimes.
Cache-cold (2 MB total per run, source evicted before each copy):
fn=folio_mc_copy
direction folio baseline GB/s optimized GB/s speedup
DRAM0->DRAM1 256K 15.63 ± 0.55 30.81 ± 2.10 1.97x
DRAM0->DRAM1 512K 16.14 ± 0.66 33.74 ± 2.74 2.09x
DRAM0->DRAM1 1M 17.09 ± 0.63 36.04 ± 1.85 2.11x
DRAM0->DRAM1 2M 18.65 ± 1.37 38.03 ± 3.21 2.04x
DRAM0->CXL 256K 22.55 ± 0.37 36.34 ± 1.32 1.61x
DRAM0->CXL 512K 22.24 ± 0.86 37.42 ± 1.89 1.68x
DRAM0->CXL 1M 23.28 ± 0.94 39.23 ± 0.97 1.68x
DRAM0->CXL 2M 25.46 ± 2.89 39.29 ± 1.17 1.54x
CXL->DRAM0 1M 17.88 ± 3.88 33.53 ± 0.40 1.88x
CXL->DRAM0 2M 20.61 ± 3.95 35.07 ± 0.62 1.70x
fn=folio_copy
direction folio baseline GB/s optimized GB/s speedup
DRAM0->DRAM1 256K 14.93 ± 0.57 29.66 ± 2.44 1.99x
DRAM0->DRAM1 512K 15.60 ± 0.36 34.21 ± 1.23 2.19x
DRAM0->DRAM1 1M 17.47 ± 0.41 36.20 ± 1.40 2.07x
DRAM0->DRAM1 2M 19.36 ± 1.97 38.92 ± 1.58 2.01x
DRAM0->CXL 256K 21.49 ± 0.39 34.92 ± 2.95 1.63x
DRAM0->CXL 512K 21.01 ± 0.69 37.09 ± 2.13 1.76x
DRAM0->CXL 1M 24.37 ± 1.99 38.94 ± 0.90 1.60x
DRAM0->CXL 2M 26.59 ± 2.78 38.40 ± 2.59 1.44x
CXL->DRAM0 1M 19.05 ± 3.87 33.93 ± 0.42 1.78x
CXL->DRAM0 2M 20.50 ± 4.53 35.78 ± 0.93 1.75x
Cache-hot scenario (1G total, no eviction):
Even when both source and destination already fit in L2/L3, the
batched helper still wins. For a 2 MB cache-hot folio the old code
runs the kmap_local_page() / kunmap_local() / cond_resched()
sequence 512 times; the new code runs it once.
fn=folio_copy
direction folio baseline GB/s optimized GB/s speedup
DRAM0->DRAM0 16K 83.61 ± 0.41 96.70 ± 0.58 1.16x
DRAM0->DRAM0 64K 65.95 ± 1.14 78.77 ± 0.20 1.19x
DRAM0->DRAM0 256K 68.59 ± 0.88 82.55 ± 0.10 1.20x
DRAM0->DRAM0 512K 66.02 ± 0.50 82.66 ± 0.17 1.25x
DRAM0->DRAM0 1M 38.07 ± 0.06 41.53 ± 0.05 1.09x
DRAM0->DRAM0 2M 38.54 ± 0.02 41.60 ± 0.04 1.08x
End-to-end: move_pages(2) on anon mTHP
======================================
Measure move_pages(2) syscall wall time on userspace pages obtained
via aligned_alloc(). This includes the rmap walk, TLB shootdown,
destination folio allocation, PTE rewrite, and refcount work, on top
of the actual copy. The microbench wins do translate, even though the
fixed per-syscall overhead caps the speedup.
fn=move_pages(2), 1 GiB migrated per run
direction folio baseline GB/s optimized GB/s speedup
DRAM0->DRAM1 256K 4.77 ± 0.01 5.09 ± 0.01 1.07x
DRAM0->DRAM1 1M 4.83 ± 0.02 5.19 ± 0.02 1.08x
DRAM0->DRAM1 2M 7.20 ± 0.03 8.01 ± 0.02 1.11x
DRAM0->CXL 256K 6.07 ± 0.02 6.65 ± 0.01 1.10x
DRAM0->CXL 1M 6.29 ± 0.02 6.74 ± 0.02 1.07x
DRAM0->CXL 2M 11.12 ± 0.15 13.07 ± 0.03 1.18x
DRAM1->DRAM0 256K 4.72 ± 0.01 5.06 ± 0.01 1.07x
DRAM1->DRAM0 1M 4.83 ± 0.02 5.17 ± 0.02 1.07x
DRAM1->DRAM0 2M 7.21 ± 0.02 7.95 ± 0.02 1.10x
CXL->DRAM0 256K 5.08 ± 0.06 5.24 ± 0.05 1.03x
CXL->DRAM0 1M 5.30 ± 0.05 5.44 ± 0.05 1.03x
CXL->DRAM0 2M 9.10 ± 0.05 9.49 ± 0.01 1.04x
Regression on Zen 3
===================
Hardware: AMD EPYC 7713 (Zen 3 / Milan, no FSRM, no ERMS).
fn=folio_copy (with current patch using bulk memcpy())
2M cache-cold:
direction folio speedup
DRAM0->DRAM1 1M 0.90x
DRAM0->DRAM1 2M 0.89x
DRAM1->DRAM0 1M 0.85x
DRAM1->DRAM0 2M 0.86x
2M cache-hot:
direction folio speedup
DRAM0->DRAM1 1M 0.60x
DRAM0->DRAM1 2M 0.61x
DRAM1->DRAM0 1M 0.60x
DRAM1->DRAM0 2M 0.60x
1G cache-hot:
direction folio speedup
DRAM0->DRAM1 1M 0.59x
DRAM0->DRAM1 2M 0.59x
DRAM1->DRAM0 1M 0.59x
DRAM1->DRAM0 2M 0.61x
Questions
=========
Should we introduce copy_pages() infrastructure for folio copy
optimisation, or just patch folio_mc_copy() and leave folio_copy()
alone?
Should folio_copy() and folio_mc_copy() use symmetric primitive
selection? It is unclear (to me) whether the asymmetry was deliberate
or accidental.
Unchanged paths
===============
- CONFIG_HIGHMEM: each page still needs its own kmap_local_page(),
so the per-page loop is retained.
- Architectures that override __HAVE_ARCH_COPY_HIGHPAGE:
copy_highpages() falls back to the per-page loop.
Thanks for review and feedback.
Shivank Garg (1):
mm: batch page copies in folio_copy() and folio_mc_copy()
include/linux/highmem.h | 58 +++++++++++++++++++++++++++++++++++++++++
mm/util.c | 25 +++---------------
2 files changed, 62 insertions(+), 21 deletions(-)
base-commit: c656c6a0242712b537ee75208d431b210ab390c3
--
2.43.0