public inbox for linux-mm@kvack.org
* [RFC PATCH 0/1] batch page copies in folio_copy() and folio_mc_copy()
@ 2026-04-27 14:20 Shivank Garg
From: Shivank Garg @ 2026-04-27 14:20 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, linux-mm, linux-kernel, x86
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Ankur Arora,
	Bharata B Rao, Hrushikesh Salunke, David Rientjes, Shivank Garg

This RFC batches folio_copy() and folio_mc_copy() for the
!HIGHMEM && !__HAVE_ARCH_COPY_HIGHPAGE path. A naive bulk-memcpy()
implementation gives a ~2x speedup on AMD Zen 5 but regresses on AMD
Zen 3, because memcpy() resolves to different primitives on the two
uarchs. I think
the right end state is a copy_pages() x86 helper that mirrors what
Ankur Arora did for clear_pages(), but I'd like opinions before doing
that arch-side work.

Both helpers loop per 4 KiB constituent page:

folio_copy(dst, src):
    for each page i in src:
        copy_highpage(dst[i], src[i])
        cond_resched()

folio_mc_copy(dst, src):
    for each page i in src:
        if copy_mc_highpage(dst[i], src[i]):
            return -EHWPOISON
        cond_resched()

On !HIGHMEM and !__HAVE_ARCH_COPY_HIGHPAGE, the per-iteration calls
boil down to:

copy_highpage()    -> copy_page()         = rep movsq        [REP_GOOD]
                                          = unrolled movq    [otherwise]

copy_mc_highpage() -> copy_mc_to_kernel() = copy_mc_fragile  [if enabled]
                                          = copy_mc_enhanced_fast_string (rep movsb)  [ERMS]
                                          = memcpy()         [otherwise]
                                            = rep movsb        [FSRM]
                                            = memcpy_orig      [otherwise]

So the two helpers are not symmetric. copy_highpage() is unambiguously
a microcoded rep movsq per page on every CPU we care about.
copy_mc_highpage() is a runtime dispatch that already uses rep movsb
on ERMS/FSRM CPUs and otherwise falls back to memcpy_orig.


For this RFC, I added a naive batched fast path that replaces the
per-page call with a single bulk call:

copy_highpages()    -> memcpy(N * PAGE_SIZE)
copy_mc_highpages() -> copy_mc_to_kernel(N * PAGE_SIZE)

copy_mc_highpages() is fine: copy_mc_to_kernel() lands on the
same family of primitives at any length, so batching it just amortises
per-call setup.

                      | copy_mc_to_kernel() at any length
                      |--------------------------------------
Zen 3 (REP_GOOD only) | memcpy()           -->  memcpy_orig (movq)
Zen 4/5               | copy_mc_enhanced_fast_string (rep movsb)


copy_highpages() is where things break. It swaps the per-page
primitive from copy_page() (rep movsq on REP_GOOD) to memcpy(),
which has its own ALTERNATIVE that picks rep movsb on FSRM and
memcpy_orig on everything else. The primitive choice now depends
on X86_FEATURE_FSRM, not on what the caller wanted:

                      | copy_page() (per page)| memcpy(bulk)
                      |-----------------------|------------
Zen 3 (REP_GOOD only) | rep movsq             | memcpy_orig (movq)
Zen 4/5 (FSRM)        | rep movsq             | rep movsb (FSRM)


Test setup on Zen 5
===================

  - Dual-socket AMD EPYC 9655 (Zen 5), three NUMA nodes
    (DRAM node 0/1, CXL.mem node 2)
  - Kernel based on akpm/mm-new:c656c6a02
  - performance governor
  - Bench thread pinned to CPU 4

Microbenchmark: folio_mc_copy() / folio_copy() in isolation
===========================================================

Wrote a simple kernel module that allocates a single src/dst folio
pair of the requested order and times folio_*_copy() between them via
ktime_get_ns(). Each iteration optionally streams a 128 MB eviction
buffer through L3, so we can measure both cache-cold and cache-hot
regimes.

Cache-cold (2 MB total per run, source evicted before each copy):

  fn=folio_mc_copy
  direction      folio    baseline GB/s    optimized GB/s   speedup
  DRAM0->DRAM1   256K     15.63 ± 0.55     30.81 ± 2.10     1.97x
  DRAM0->DRAM1   512K     16.14 ± 0.66     33.74 ± 2.74     2.09x
  DRAM0->DRAM1     1M     17.09 ± 0.63     36.04 ± 1.85     2.11x
  DRAM0->DRAM1     2M     18.65 ± 1.37     38.03 ± 3.21     2.04x
  DRAM0->CXL     256K     22.55 ± 0.37     36.34 ± 1.32     1.61x
  DRAM0->CXL     512K     22.24 ± 0.86     37.42 ± 1.89     1.68x
  DRAM0->CXL       1M     23.28 ± 0.94     39.23 ± 0.97     1.68x
  DRAM0->CXL       2M     25.46 ± 2.89     39.29 ± 1.17     1.54x
  CXL->DRAM0       1M     17.88 ± 3.88     33.53 ± 0.40     1.88x
  CXL->DRAM0       2M     20.61 ± 3.95     35.07 ± 0.62     1.70x

  fn=folio_copy
  direction      folio    baseline GB/s    optimized GB/s   speedup
  DRAM0->DRAM1   256K     14.93 ± 0.57     29.66 ± 2.44     1.99x
  DRAM0->DRAM1   512K     15.60 ± 0.36     34.21 ± 1.23     2.19x
  DRAM0->DRAM1     1M     17.47 ± 0.41     36.20 ± 1.40     2.07x
  DRAM0->DRAM1     2M     19.36 ± 1.97     38.92 ± 1.58     2.01x
  DRAM0->CXL     256K     21.49 ± 0.39     34.92 ± 2.95     1.63x
  DRAM0->CXL     512K     21.01 ± 0.69     37.09 ± 2.13     1.76x
  DRAM0->CXL       1M     24.37 ± 1.99     38.94 ± 0.90     1.60x
  DRAM0->CXL       2M     26.59 ± 2.78     38.40 ± 2.59     1.44x
  CXL->DRAM0       1M     19.05 ± 3.87     33.93 ± 0.42     1.78x
  CXL->DRAM0       2M     20.50 ± 4.53     35.78 ± 0.93     1.75x

Cache-hot (1 GB total per run, no eviction):

Even when both source and destination already fit in L2/L3, the
batched helper still wins. For a 2 MB cache-hot folio the old code
runs the kmap_local_page() / kunmap_local() / cond_resched()
sequence 512 times; the new code runs it once.

  fn=folio_copy
  direction      folio    baseline GB/s    optimized GB/s   speedup
  DRAM0->DRAM0    16K     83.61 ± 0.41     96.70 ± 0.58     1.16x
  DRAM0->DRAM0    64K     65.95 ± 1.14     78.77 ± 0.20     1.19x
  DRAM0->DRAM0   256K     68.59 ± 0.88     82.55 ± 0.10     1.20x
  DRAM0->DRAM0   512K     66.02 ± 0.50     82.66 ± 0.17     1.25x
  DRAM0->DRAM0     1M     38.07 ± 0.06     41.53 ± 0.05     1.09x
  DRAM0->DRAM0     2M     38.54 ± 0.02     41.60 ± 0.04     1.08x


End-to-end: move_pages(2) on anon mTHP
======================================

Measure move_pages(2) syscall wall time on userspace pages obtained
via aligned_alloc(). This includes the rmap walk, TLB shootdown,
destination folio allocation, PTE rewrite, and refcount work, on top
of the actual copy. The microbench wins do translate, even though the
fixed per-syscall overhead caps the speedup.

  fn=move_pages(2), 1 GiB migrated per run

  direction      folio     baseline GB/s   optimized GB/s  speedup
  DRAM0->DRAM1   256K      4.77 ± 0.01     5.09 ± 0.01     1.07x
  DRAM0->DRAM1     1M      4.83 ± 0.02     5.19 ± 0.02     1.08x
  DRAM0->DRAM1     2M      7.20 ± 0.03     8.01 ± 0.02     1.11x

  DRAM0->CXL     256K      6.07 ± 0.02     6.65 ± 0.01     1.10x
  DRAM0->CXL       1M      6.29 ± 0.02     6.74 ± 0.02     1.07x
  DRAM0->CXL       2M     11.12 ± 0.15    13.07 ± 0.03     1.18x

  DRAM1->DRAM0   256K      4.72 ± 0.01     5.06 ± 0.01     1.07x
  DRAM1->DRAM0     1M      4.83 ± 0.02     5.17 ± 0.02     1.07x
  DRAM1->DRAM0     2M      7.21 ± 0.02     7.95 ± 0.02     1.10x

  CXL->DRAM0     256K      5.08 ± 0.06     5.24 ± 0.05     1.03x
  CXL->DRAM0       1M      5.30 ± 0.05     5.44 ± 0.05     1.03x
  CXL->DRAM0       2M      9.10 ± 0.05     9.49 ± 0.01     1.04x


Regression on Zen 3
===================

Hardware: AMD EPYC 7713 (Zen 3 / Milan, no FSRM, no ERMS).
fn=folio_copy  (with current patch using bulk memcpy())

2 MB total per run, cache-cold:
direction      folio   speedup
DRAM0->DRAM1     1M    0.90x
DRAM0->DRAM1     2M    0.89x
DRAM1->DRAM0     1M    0.85x
DRAM1->DRAM0     2M    0.86x

2 MB total per run, cache-hot:
direction      folio   speedup
DRAM0->DRAM1     1M    0.60x
DRAM0->DRAM1     2M    0.61x
DRAM1->DRAM0     1M    0.60x
DRAM1->DRAM0     2M    0.60x

1 GB total per run, cache-hot:
direction      folio   speedup
DRAM0->DRAM1     1M    0.59x
DRAM0->DRAM1     2M    0.59x
DRAM1->DRAM0     1M    0.59x
DRAM1->DRAM0     2M    0.61x


Questions
=========
Should we introduce copy_pages() infrastructure for folio copy
optimisation, or just patch folio_mc_copy() and leave folio_copy()
alone?

Should folio_copy() and folio_mc_copy() use symmetric primitive
selection? It is unclear (to me) whether the asymmetry was deliberate
or accidental.
 
Unchanged paths
===============
 - CONFIG_HIGHMEM: each page still needs its own kmap_local_page(),
   so the per-page loop is retained.
 - Architectures that override __HAVE_ARCH_COPY_HIGHPAGE:
   copy_highpages() falls back to the per-page loop.

Thanks for review and feedback.


Shivank Garg (1):
  mm: batch page copies in folio_copy() and folio_mc_copy()

 include/linux/highmem.h | 58 +++++++++++++++++++++++++++++++++++++++++
 mm/util.c               | 25 +++---------------
 2 files changed, 62 insertions(+), 21 deletions(-)


base-commit: c656c6a0242712b537ee75208d431b210ab390c3
-- 
2.43.0




* [RFC PATCH 1/1] mm: batch page copies in folio_copy() and folio_mc_copy()
From: Shivank Garg @ 2026-04-27 14:20 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, linux-mm, linux-kernel, x86
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Ankur Arora,
	Bharata B Rao, Hrushikesh Salunke, David Rientjes, Shivank Garg

Rewrite folio_copy() and folio_mc_copy() as thin wrappers around new
batched helpers copy_highpages() and copy_mc_highpages().

The current implementations iterate copy_highpage() (or its #MC-aware
variant) per 4 KiB page. For a single 2 MB folio that loop runs 512
times and pays, per page:

  - kmap_local_page() / kunmap_local()
  - cond_resched()
  - one invocation of the architecture copy_page()/memcpy() primitive

The new helpers issue a single copy_mc_to_kernel()/memcpy() over
the whole contiguous range when CONFIG_HIGHMEM is off and no
architecture overrides (__HAVE_ARCH_COPY_HIGHPAGE) copy_highpage().
HIGHMEM and arch overrides keep the existing per-page path.

Tested on dual-socket AMD EPYC 9655 (Zen 5) with a CXL.mem node.
In-kernel folio_mc_copy() microbenchmark on 2 MB folios, source
evicted from cache before each iteration and measured throughput:

  direction         baseline GB/s   optimized GB/s   speedup
  DRAM0 -> DRAM1     18.65 ± 1.37    38.03 ± 3.21     2.04x
  DRAM0 -> CXL       25.46 ± 2.89    39.29 ± 1.17     1.54x
  CXL   -> DRAM0     20.61 ± 3.95    35.07 ± 0.62     1.70x

End-to-end move_pages(2) throughput on anonymous 2 MB mTHP folios,
1 GB migrated per run:

  direction         baseline GB/s   optimized GB/s   speedup
  DRAM0 -> DRAM1      7.20 ± 0.03     8.01 ± 0.02     1.11x
  DRAM0 -> CXL       11.12 ± 0.15    13.07 ± 0.03     1.18x
  DRAM1 -> DRAM0      7.21 ± 0.02     7.95 ± 0.02     1.10x
  CXL   -> DRAM0      9.10 ± 0.05     9.49 ± 0.01     1.04x

On AMD EPYC 7713 (Zen 3 / Milan, REP_GOOD without FSRM/ERMS) the
folio_copy() bulk path regresses because memcpy() falls through to
memcpy_orig (an unrolled movq loop), which is slower than the
per-page copy_page() (microcoded rep movsq) it replaces. 
Same 2 MB folio_copy() microbench, source evicted before
each iteration:

  direction         baseline GB/s   optimized GB/s   speedup
  DRAM0 -> DRAM1     13.03 ± 0.76    11.59 ± 0.18     0.89x
  DRAM1 -> DRAM0     12.85 ± 0.25    11.02 ± 0.10     0.86x

The cover letter discusses introducing a copy_pages() helper to avoid
this regression.

Signed-off-by: Shivank Garg <shivankg@amd.com>
---
 include/linux/highmem.h | 58 +++++++++++++++++++++++++++++++++++++++++
 mm/util.c               | 25 +++---------------
 2 files changed, 62 insertions(+), 21 deletions(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 871d817426bc..daee3f1863d1 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -439,6 +439,23 @@ static inline void copy_highpage(struct page *to, struct page *from)
 
 #endif
 
+static inline void copy_highpages(struct page *to, struct page *from,
+		unsigned long nr_pages)
+{
+	unsigned long i;
+
+#ifndef __HAVE_ARCH_COPY_HIGHPAGE
+	if (!IS_ENABLED(CONFIG_HIGHMEM)) {
+		memcpy(page_address(to), page_address(from), nr_pages << PAGE_SHIFT);
+		for (i = 0; i < nr_pages; i++)
+			kmsan_copy_page_meta(to + i, from + i);
+		return;
+	}
+#endif
+	for (i = 0; i < nr_pages; i++)
+		copy_highpage(to + i, from + i);
+}
+
 #ifdef copy_mc_to_kernel
 /*
  * If architecture supports machine check exception handling, define the
@@ -484,6 +501,40 @@ static inline int copy_mc_highpage(struct page *to, struct page *from)
 
 	return ret;
 }
+
+static inline int copy_mc_highpages(struct page *to, struct page *from,
+		unsigned long nr_pages)
+{
+	unsigned long i;
+
+#ifndef __HAVE_ARCH_COPY_HIGHPAGE
+	if (!IS_ENABLED(CONFIG_HIGHMEM)) {
+		unsigned long len = nr_pages << PAGE_SHIFT;
+		unsigned long ret;
+
+		ret = copy_mc_to_kernel(page_address(to),
+					page_address(from), len);
+		if (!ret) {
+			for (i = 0; i < nr_pages; i++)
+				kmsan_copy_page_meta(to + i, from + i);
+			return 0;
+		}
+		/*
+		 * copy_mc_to_kernel() returns the number of bytes that were not copied,
+		 * counted from the end. The first failing page is therefore at
+		 * offset (len - ret) >> PAGE_SHIFT within the range.
+		 */
+		memory_failure_queue(page_to_pfn(from) +
+				     ((len - ret) >> PAGE_SHIFT), 0);
+		return -EHWPOISON;
+	}
+#endif
+
+	for (i = 0; i < nr_pages; i++)
+		if (copy_mc_highpage(to + i, from + i))
+			return -EHWPOISON;
+	return 0;
+}
 #else
 static inline int copy_mc_user_highpage(struct page *to, struct page *from,
 					unsigned long vaddr, struct vm_area_struct *vma)
@@ -497,6 +548,13 @@ static inline int copy_mc_highpage(struct page *to, struct page *from)
 	copy_highpage(to, from);
 	return 0;
 }
+
+static inline int copy_mc_highpages(struct page *to, struct page *from,
+		unsigned long nr_pages)
+{
+	copy_highpages(to, from, nr_pages);
+	return 0;
+}
 #endif
 
 static inline void memcpy_page(struct page *dst_page, size_t dst_off,
diff --git a/mm/util.c b/mm/util.c
index 3cc949a0b7ed..93f0d9daffce 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -749,32 +749,15 @@ EXPORT_SYMBOL(folio_mapping);
  */
 void folio_copy(struct folio *dst, struct folio *src)
 {
-	long i = 0;
-	long nr = folio_nr_pages(src);
-
-	for (;;) {
-		copy_highpage(folio_page(dst, i), folio_page(src, i));
-		if (++i == nr)
-			break;
-		cond_resched();
-	}
+	copy_highpages(folio_page(dst, 0), folio_page(src, 0),
+		       folio_nr_pages(src));
 }
 EXPORT_SYMBOL(folio_copy);
 
 int folio_mc_copy(struct folio *dst, struct folio *src)
 {
-	long nr = folio_nr_pages(src);
-	long i = 0;
-
-	for (;;) {
-		if (copy_mc_highpage(folio_page(dst, i), folio_page(src, i)))
-			return -EHWPOISON;
-		if (++i == nr)
-			break;
-		cond_resched();
-	}
-
-	return 0;
+	return copy_mc_highpages(folio_page(dst, 0), folio_page(src, 0),
+				 folio_nr_pages(src));
 }
 EXPORT_SYMBOL(folio_mc_copy);
 
-- 
2.43.0



