The Linux Kernel Mailing List
* Re: [RFC PATCH 1/1] mm: batch page copies in folio_copy() and folio_mc_copy()
       [not found] ` <20260427142036.111940-4-shivankg@amd.com>
@ 2026-05-12  9:31   ` David Hildenbrand (Arm)
  2026-05-14  5:17     ` Garg, Shivank
  0 siblings, 1 reply; 2+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-12  9:31 UTC (permalink / raw)
  To: Shivank Garg, Andrew Morton, linux-mm, linux-kernel, x86
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Ankur Arora,
	Bharata B Rao, Hrushikesh Salunke, David Rientjes

On 4/27/26 16:20, Shivank Garg wrote:
> Rewrite folio_copy() and folio_mc_copy() as thin wrappers around new
> batched helpers copy_highpages() and copy_mc_highpages().
> 
> The current implementations iterate copy_highpage() (or its #MC-aware
> variant) per 4 KB page. For a single 2 MB folio that loop runs 512
> times and pays, per page:
> 
>   - kmap_local_page() / kunmap_local()
>   - cond_resched()
>   - one invocation of the architecture copy_page()/memcpy() primitive
> 
> The new helpers issue a single copy_mc_to_kernel()/memcpy() over
> the whole contiguous range when CONFIG_HIGHMEM is off and no
> architecture overrides (__HAVE_ARCH_COPY_HIGHPAGE) copy_highpage().
> HIGHMEM and arch overrides keep the existing per-page path.
> 
> Tested on dual-socket AMD EPYC 9655 (Zen 5) with a CXL.mem node.
> In-kernel folio_mc_copy() microbenchmark on 2 MB folios, source
> evicted from cache before each iteration and measured throughput:
> 
>   direction         baseline GB/s   optimized GB/s   speedup
>   DRAM0 -> DRAM1     18.65 ± 1.37    38.03 ± 3.21     2.04x
>   DRAM0 -> CXL       25.46 ± 2.89    39.29 ± 1.17     1.54x
>   CXL   -> DRAM0     20.61 ± 3.95    35.07 ± 0.62     1.70x
> 
> End-to-end move_pages(2) throughput on anonymous 2 MB mTHP folios,
> 1 GB migrated per run:
> 
>   direction         baseline GB/s   optimized GB/s   speedup
>   DRAM0 -> DRAM1      7.20 ± 0.03     8.01 ± 0.02     1.11x
>   DRAM0 -> CXL       11.12 ± 0.15    13.07 ± 0.03     1.18x
>   DRAM1 -> DRAM0      7.21 ± 0.02     7.95 ± 0.02     1.10x
>   CXL   -> DRAM0      9.10 ± 0.05     9.49 ± 0.01     1.04x
> 
> On AMD EPYC 7713 (Zen 3 / Milan, REP_GOOD without FSRM/ERMS) the
> folio_copy() bulk path regresses because memcpy() falls through to
> memcpy_orig (an unrolled movq loop), which is slower than the
> per-page copy_page() (microcoded rep movsq) it replaces. 

Do you know what the reason for that fallback is? Could it be fixed (e.g., when
we detect page alignment or sth like that?)

-- 
Cheers,

David


* Re: [RFC PATCH 1/1] mm: batch page copies in folio_copy() and folio_mc_copy()
  2026-05-12  9:31   ` [RFC PATCH 1/1] mm: batch page copies in folio_copy() and folio_mc_copy() David Hildenbrand (Arm)
@ 2026-05-14  5:17     ` Garg, Shivank
  0 siblings, 0 replies; 2+ messages in thread
From: Garg, Shivank @ 2026-05-14  5:17 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Andrew Morton, linux-mm, linux-kernel,
	x86
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Ankur Arora,
	Bharata B Rao, Hrushikesh Salunke, David Rientjes, sandipan.das



On 5/12/2026 3:01 PM, David Hildenbrand (Arm) wrote:
> On 4/27/26 16:20, Shivank Garg wrote:
>> Rewrite folio_copy() and folio_mc_copy() as thin wrappers around new
>> batched helpers copy_highpages() and copy_mc_highpages().
>>
>> The current implementations iterate copy_highpage() (or its #MC-aware
>> variant) per 4 KB page. For a single 2 MB folio that loop runs 512
>> times and pays, per page:
>>
>>   - kmap_local_page() / kunmap_local()
>>   - cond_resched()
>>   - one invocation of the architecture copy_page()/memcpy() primitive
>>
>> The new helpers issue a single copy_mc_to_kernel()/memcpy() over
>> the whole contiguous range when CONFIG_HIGHMEM is off and no
>> architecture overrides (__HAVE_ARCH_COPY_HIGHPAGE) copy_highpage().
>> HIGHMEM and arch overrides keep the existing per-page path.
>>
>> Tested on dual-socket AMD EPYC 9655 (Zen 5) with a CXL.mem node.
>> In-kernel folio_mc_copy() microbenchmark on 2 MB folios, source
>> evicted from cache before each iteration and measured throughput:
>>
>>   direction         baseline GB/s   optimized GB/s   speedup
>>   DRAM0 -> DRAM1     18.65 ± 1.37    38.03 ± 3.21     2.04x
>>   DRAM0 -> CXL       25.46 ± 2.89    39.29 ± 1.17     1.54x
>>   CXL   -> DRAM0     20.61 ± 3.95    35.07 ± 0.62     1.70x
>>
>> End-to-end move_pages(2) throughput on anonymous 2 MB mTHP folios,
>> 1 GB migrated per run:
>>
>>   direction         baseline GB/s   optimized GB/s   speedup
>>   DRAM0 -> DRAM1      7.20 ± 0.03     8.01 ± 0.02     1.11x
>>   DRAM0 -> CXL       11.12 ± 0.15    13.07 ± 0.03     1.18x
>>   DRAM1 -> DRAM0      7.21 ± 0.02     7.95 ± 0.02     1.10x
>>   CXL   -> DRAM0      9.10 ± 0.05     9.49 ± 0.01     1.04x
>>
>> On AMD EPYC 7713 (Zen 3 / Milan, REP_GOOD without FSRM/ERMS) the
>> folio_copy() bulk path regresses because memcpy() falls through to
>> memcpy_orig (an unrolled movq loop), which is slower than the
>> per-page copy_page() (microcoded rep movsq) it replaces. 
> 
> Do you know what the reason for that fallback is? Could it be fixed (e.g., when
> we detect page alignment or sth like that?)
> 

The fallback is gated on X86_FEATURE_FSRM in arch/x86/lib/memcpy_64.S:

SYM_TYPED_FUNC_START(__memcpy)
	ALTERNATIVE "jmp memcpy_orig", "", X86_FEATURE_FSRM
	movq %rdi, %rax
	movq %rdx, %rcx
	rep movsb
	RET

AMD Zen 3 does not have FSRM, so it jumps to memcpy_orig (the unrolled movq loop).


On v7.1.0-rc3, I measured these primitives and the kernel's actual memcpy()
across three CPUs, using a kernel module that vmallocs 16 MB src/dst buffers
and times each primitive for comparison.
Numbers are mean throughput in GB/s ± SD% (SD as a percentage of the mean).
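
A minimal sketch of such a timing module (illustrative only, not the
exact module used for the numbers below; the copybench name and the
exact loop are made up here, and cache handling / CPU pinning are
omitted):

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/vmalloc.h>
#include <linux/ktime.h>
#include <linux/string.h>

#define BUF_SIZE	(16UL << 20)	/* 16 MB src/dst buffers, as above */
#define ITERS		1000000UL

static int __init copybench_init(void)
{
	static const size_t sizes[] = { 16, 64, 256, 1024, 4096 };
	void *src = vmalloc(BUF_SIZE);
	void *dst = vmalloc(BUF_SIZE);
	unsigned int s;
	unsigned long i;

	if (!src || !dst)
		goto out;

	memset(src, 0x5a, BUF_SIZE);		/* fault the buffers in */

	for (s = 0; s < ARRAY_SIZE(sizes); s++) {
		u64 t0 = ktime_get_ns(), t1;

		/* swap memcpy() for the rep_movsq / unrolled variants here */
		for (i = 0; i < ITERS; i++)
			memcpy(dst, src, sizes[s]);

		t1 = ktime_get_ns();
		pr_info("copybench: %4zu B: %llu ns for %lu copies\n",
			sizes[s], t1 - t0, ITERS);
	}
out:
	vfree(src);
	vfree(dst);
	return -EAGAIN;		/* fail init so nothing stays loaded */
}
module_init(copybench_init);
MODULE_LICENSE("GPL");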


1.) AMD EPYC 7713 (Zen 3), Flags: rep_good only, no ERMS/FSRM:

  size   unrolled_movq GB/s±SD%   rep_movsq GB/s±SD%  kernel_memcpy GB/s±SD%  
------------------------------------------------------------------------------
   16B      0.38± 8.73%            0.41± 0.43%            0.43± 0.31%
   32B      0.85± 0.19%            0.80± 8.37%            0.84± 0.07%
   64B      1.68± 0.35%            1.60± 0.03%            1.59± 9.37%
  128B      3.23± 0.22%            3.04± 0.62%            3.19± 0.03%
  256B      5.99± 5.78%            5.62± 4.15%            5.93± 0.42%
  512B     10.07± 1.36%           10.49± 2.60%           10.02± 0.21%
    1K     14.49± 0.09%           18.19± 0.37%           14.31± 3.48%
    2K     17.11± 1.01%           28.04± 2.37%           18.14± 0.56%
    4K     18.36± 0.22%           39.15± 0.50%           19.57± 1.14%

- kernel_memcpy is tracking unrolled_movq.
- rep_movsq is ~1.3x faster than the unrolled_movq fallback at 1 KiB,
  rising to ~2.1x at 4 KiB.

2.) On Intel(R) Xeon(R) Platinum 8362
Flags: rep_good, erms, fsrm

  size  unrolled_movq GB/s±SD%   rep_movsq GB/s±SD%   rep_movsb GB/s±SD%   kernel_memcpy GB/s±SD%
--------------------------------------------------------------------------------------------
   16B      0.89± 0.93%            0.64± 0.10%            0.69± 0.57%            0.66± 3.52%
   32B      2.08± 2.46%            1.28± 0.15%            1.38± 6.21%            1.33± 4.28%
   64B      3.97± 2.26%            2.55± 0.24%            2.83± 0.22%            2.65± 4.48%
  128B      7.45± 0.09%            5.00± 2.53%            5.48± 5.04%            5.30± 1.60%
  256B     13.24± 0.01%            9.79± 0.57%           10.12± 0.37%            9.81± 0.34%
  512B     21.67± 0.03%           17.87± 0.02%           18.43± 0.79%           17.81± 0.25%
    1K     27.84± 1.96%           34.54± 1.24%           35.67± 1.88%           34.56± 2.49%
    2K     32.67± 2.35%           59.58± 0.01%           65.67± 0.18%           59.35± 1.12%
    4K     34.85± 0.64%           95.35± 0.00%           96.64± 0.69%           95.35± 0.00%

- kernel_memcpy is using rep_movsb (FSRM in use). 
- At and below 512 B the unrolled movq loop is ~20-50% faster; at
  >= 1 KiB FSRM wins.

3.) On AMD EPYC 9655 96-Core Processor (Zen 5)
Flags: rep_good, erms, fsrm

size   unrolled_movq GB/s±SD%    rep_movsq GB/s±SD%       rep_movsb GB/s±SD%   kernel_memcpy GB/s±SD%
--------------------------------------------------------------------------------------------
   16B      0.53± 0.39%            0.53± 0.21%            0.55± 0.13%            0.53± 0.14%
   32B      1.13± 1.49%            1.06± 0.07%            1.09± 0.16%            1.06± 0.09%
   64B      2.21± 0.12%            2.13± 0.07%            2.18± 0.14%            2.13± 0.09%
  128B      4.25± 0.12%            4.26± 0.10%            4.37± 0.12%            4.31± 0.14%
  256B      8.01± 0.19%            8.61± 0.27%            8.61± 0.18%            8.51± 0.10%
  512B     14.14± 0.18%           16.80± 0.24%           16.80± 0.23%           16.81± 0.24%
    1K     22.93± 0.73%           31.70± 0.48%           32.37± 0.28%           32.02± 0.22%
    2K     30.36± 0.27%           53.24± 1.01%           56.58± 0.22%           56.04± 0.22%
    4K     35.05± 0.65%           80.25± 0.41%           83.90± 0.20%           76.23± 0.37%

- kernel_memcpy is using rep_movsb (FSRM in use). 
- For smaller sizes, the unrolled movq loop is close enough to rep movsb
  to be within noise.


Regarding the fix:
One option is to make memcpy() fall back to rep movsq instead of the
unrolled movq loop when FSRM is absent. The data above shows the benefit
on Zen 3, but on the Intel part the unrolled movq loop is still faster
at smaller sizes, so I'm not sure whether adding that complexity to the
generic memcpy() is welcome. Happy to work on this if it is helpful.

Another option is to leave memcpy() untouched for this series and add
a new copy_pages() helper that the folio copy path can use. It would
use ALTERNATIVE_2 to pick rep movsb on ERMS/FSRM, rep movsq on REP_GOOD,
and a per-page copy_page() loop as the final fallback.
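
A rough C sketch of that dispatch, using static_cpu_has() checks instead
of the ALTERNATIVE_2 patching for readability (illustrative only; the
copy_pages() body below is hypothetical and the real helper would likely
be an assembly routine as described above):

#include <linux/mm.h>
#include <asm/cpufeature.h>

static void copy_pages(void *dst, void *src, unsigned long nr_pages)
{
	unsigned long bytes = nr_pages << PAGE_SHIFT;

	if (static_cpu_has(X86_FEATURE_FSRM) ||
	    static_cpu_has(X86_FEATURE_ERMS)) {
		/* rep movsb: best for large, page-aligned copies here */
		asm volatile("rep movsb"
			     : "+D" (dst), "+S" (src), "+c" (bytes)
			     : : "memory");
	} else if (static_cpu_has(X86_FEATURE_REP_GOOD)) {
		/* rep movsq: avoids the unrolled-movq fallback on Zen 3 */
		unsigned long qwords = bytes / 8;

		asm volatile("rep movsq"
			     : "+D" (dst), "+S" (src), "+c" (qwords)
			     : : "memory");
	} else {
		unsigned long i;

		/* final fallback: per-page copy_page(), as today */
		for (i = 0; i < nr_pages; i++)
			copy_page(dst + i * PAGE_SIZE, src + i * PAGE_SIZE);
	}
}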

Thanks,
Shivank


end of thread, other threads:[~2026-05-14  5:18 UTC | newest]

Thread overview: 2+ messages
     [not found] <20260427142036.111940-2-shivankg@amd.com>
     [not found] ` <20260427142036.111940-4-shivankg@amd.com>
2026-05-12  9:31   ` [RFC PATCH 1/1] mm: batch page copies in folio_copy() and folio_mc_copy() David Hildenbrand (Arm)
2026-05-14  5:17     ` Garg, Shivank
