The Linux Kernel Mailing List
From: "Garg, Shivank" <shivankg@amd.com>
To: "David Hildenbrand (Arm)" <david@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org
Cc: Lorenzo Stoakes <ljs@kernel.org>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@kernel.org>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>, Thomas Gleixner <tglx@kernel.org>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"H . Peter Anvin" <hpa@zytor.com>,
	Ankur Arora <ankur.a.arora@oracle.com>,
	Bharata B Rao <bharata@amd.com>,
	Hrushikesh Salunke <hsalunke@amd.com>,
	David Rientjes <rientjes@google.com>,
	sandipan.das@amd.com
Subject: Re: [RFC PATCH 1/1] mm: batch page copies in folio_copy() and folio_mc_copy()
Date: Thu, 14 May 2026 10:47:32 +0530	[thread overview]
Message-ID: <6a5e794a-a608-4126-9abe-0d512a57dd67@amd.com> (raw)
In-Reply-To: <073e5e2c-7102-4141-b0d7-fa5635f811f5@kernel.org>



On 5/12/2026 3:01 PM, David Hildenbrand (Arm) wrote:
> On 4/27/26 16:20, Shivank Garg wrote:
>> Rewrite folio_copy() and folio_mc_copy() as thin wrappers around new
>> batched helpers copy_highpages() and copy_mc_highpages().
>>
>> The current implementations iterate copy_highpage() (or its #MC-aware
>> variant) per 4 KB page. For a single 2 MB folio that loop runs 512
>> times and pays, per page:
>>
>>   - kmap_local_page() / kunmap_local()
>>   - cond_resched()
>>   - one invocation of the architecture copy_page()/memcpy() primitive
>>
>> The new helpers issue a single copy_mc_to_kernel()/memcpy() over
>> the whole contiguous range when CONFIG_HIGHMEM is off and no
>> architecture overrides (__HAVE_ARCH_COPY_HIGHPAGE) copy_highpage().
>> HIGHMEM and arch overrides keep the existing per-page path.
>>
>> Tested on dual-socket AMD EPYC 9655 (Zen 5) with a CXL.mem node.
>> In-kernel folio_mc_copy() microbenchmark on 2 MB folios, source
>> evicted from cache before each iteration and measured throughput:
>>
>>   direction         baseline GB/s   optimized GB/s   speedup
>>   DRAM0 -> DRAM1     18.65 ± 1.37    38.03 ± 3.21     2.04x
>>   DRAM0 -> CXL       25.46 ± 2.89    39.29 ± 1.17     1.54x
>>   CXL   -> DRAM0     20.61 ± 3.95    35.07 ± 0.62     1.70x
>>
>> End-to-end move_pages(2) throughput on anonymous 2 MB mTHP folios,
>> 1 GB migrated per run:
>>
>>   direction         baseline GB/s   optimized GB/s   speedup
>>   DRAM0 -> DRAM1      7.20 ± 0.03     8.01 ± 0.02     1.11x
>>   DRAM0 -> CXL       11.12 ± 0.15    13.07 ± 0.03     1.18x
>>   DRAM1 -> DRAM0      7.21 ± 0.02     7.95 ± 0.02     1.10x
>>   CXL   -> DRAM0      9.10 ± 0.05     9.49 ± 0.01     1.04x
>>
>> On AMD EPYC 7713 (Zen 3 / Milan, REP_GOOD without FSRM/ERMS) the
>> folio_copy() bulk path regresses because memcpy() falls through to
>> memcpy_orig (an unrolled movq loop), which is slower than the
>> per-page copy_page() (microcoded rep movsq) it replaces. 
> 
> Do you know what the reason for that fallback is? Could it be fixed (e.g., when
> we detect page alignment or sth like that?)
> 

The fallback is gated on X86_FEATURE_FSRM in arch/x86/lib/memcpy_64.S:

SYM_TYPED_FUNC_START(__memcpy)
	ALTERNATIVE "jmp memcpy_orig", "", X86_FEATURE_FSRM
	movq %rdi, %rax
	movq %rdx, %rcx
	rep movsb
	RET

AMD Zen 3 does not have FSRM, so it jumps to memcpy_orig (the unrolled movq loop).


On v7.1.0-rc3, I measured these primitives and the kernel's actual memcpy()
on three CPUs, using a kernel module that vmallocs 16 MB src/dst buffers
and times each primitive for comparison.
Numbers are mean throughput in GB/s ± SD% (SD as a percentage of the mean).


1.) AMD EPYC 7713 (Zen 3), Flags: rep_good only, no ERMS/FSRM:

  size   unrolled_movq GB/s±SD%   rep_movsq GB/s±SD%  kernel_memcpy GB/s±SD%  
------------------------------------------------------------------------------
   16B      0.38± 8.73%            0.41± 0.43%            0.43± 0.31%
   32B      0.85± 0.19%            0.80± 8.37%            0.84± 0.07%
   64B      1.68± 0.35%            1.60± 0.03%            1.59± 9.37%
  128B      3.23± 0.22%            3.04± 0.62%            3.19± 0.03%
  256B      5.99± 5.78%            5.62± 4.15%            5.93± 0.42%
  512B     10.07± 1.36%           10.49± 2.60%           10.02± 0.21%
    1K     14.49± 0.09%           18.19± 0.37%           14.31± 3.48%
    2K     17.11± 1.01%           28.04± 2.37%           18.14± 0.56%
    4K     18.36± 0.22%           39.15± 0.50%           19.57± 1.14%

- kernel_memcpy tracks unrolled_movq, as expected without FSRM.
- rep_movsq is 1.4x-2x faster than the unrolled_movq fallback for sizes >= 1 KiB.

2.) On Intel(R) Xeon(R) Platinum 8362
Flags: rep_good, erms, fsrm

  size  unrolled_movq GB/s±SD%   rep_movsq GB/s±SD%   rep_movsb GB/s±SD%   kernel_memcpy GB/s±SD%
--------------------------------------------------------------------------------------------
   16B      0.89± 0.93%            0.64± 0.10%            0.69± 0.57%            0.66± 3.52%
   32B      2.08± 2.46%            1.28± 0.15%            1.38± 6.21%            1.33± 4.28%
   64B      3.97± 2.26%            2.55± 0.24%            2.83± 0.22%            2.65± 4.48%
  128B      7.45± 0.09%            5.00± 2.53%            5.48± 5.04%            5.30± 1.60%
  256B     13.24± 0.01%            9.79± 0.57%           10.12± 0.37%            9.81± 0.34%
  512B     21.67± 0.03%           17.87± 0.02%           18.43± 0.79%           17.81± 0.25%
    1K     27.84± 1.96%           34.54± 1.24%           35.67± 1.88%           34.56± 2.49%
    2K     32.67± 2.35%           59.58± 0.01%           65.67± 0.18%           59.35± 1.12%
    4K     34.85± 0.64%           95.35± 0.00%           96.64± 0.69%           95.35± 0.00%

- kernel_memcpy uses rep_movsb (FSRM in use).
- At 512 B and below, the unrolled movq loop is ~20-50% faster; from 1 KiB up, FSRM wins.

3.) On AMD EPYC 9655 96-Core Processor (Zen 5)
Flags: rep_good, erms, fsrm

  size   unrolled_movq GB/s±SD%   rep_movsq GB/s±SD%   rep_movsb GB/s±SD%   kernel_memcpy GB/s±SD%
--------------------------------------------------------------------------------------------
   16B      0.53± 0.39%            0.53± 0.21%            0.55± 0.13%            0.53± 0.14%
   32B      1.13± 1.49%            1.06± 0.07%            1.09± 0.16%            1.06± 0.09%
   64B      2.21± 0.12%            2.13± 0.07%            2.18± 0.14%            2.13± 0.09%
  128B      4.25± 0.12%            4.26± 0.10%            4.37± 0.12%            4.31± 0.14%
  256B      8.01± 0.19%            8.61± 0.27%            8.61± 0.18%            8.51± 0.10%
  512B     14.14± 0.18%           16.80± 0.24%           16.80± 0.23%           16.81± 0.24%
    1K     22.93± 0.73%           31.70± 0.48%           32.37± 0.28%           32.02± 0.22%
    2K     30.36± 0.27%           53.24± 1.01%           56.58± 0.22%           56.04± 0.22%
    4K     35.05± 0.65%           80.25± 0.41%           83.90± 0.20%           76.23± 0.37%

- kernel_memcpy uses rep_movsb (FSRM in use).
- At smaller sizes, unrolled movq is close enough to the rep variants to be within noise.


Regarding a fix:
One option is to make memcpy() fall back to rep movsq instead of the
unrolled movq loop when FSRM is absent. The data shows the benefit on
Zen 3; on the Intel part, the unrolled movq loop is faster at smaller
sizes. I'm not sure whether adding this complexity to memcpy() is
welcome, but I'm happy to work on it if it would be helpful.
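Concretely, the first option might look something like the following
(a hypothetical, untested sketch against arch/x86/lib/memcpy_64.S; the
memcpy_movsq name and the exact ALTERNATIVE_2 arguments are assumptions):

```
SYM_TYPED_FUNC_START(__memcpy)
	/* Patched by alternatives: FSRM falls through to rep movsb,
	 * REP_GOOD-only jumps to a movsq variant, else memcpy_orig. */
	ALTERNATIVE_2 "jmp memcpy_orig", \
		      "jmp memcpy_movsq", X86_FEATURE_REP_GOOD, \
		      "", X86_FEATURE_FSRM
	movq %rdi, %rax
	movq %rdx, %rcx
	rep movsb
	RET
SYM_FUNC_END(__memcpy)

SYM_FUNC_START_LOCAL(memcpy_movsq)
	movq %rdi, %rax		/* memcpy returns dst */
	movq %rdx, %rcx
	shrq $3, %rcx
	rep movsq		/* bulk qwords */
	movl %edx, %ecx
	andl $7, %ecx
	rep movsb		/* 0-7 tail bytes */
	RET
SYM_FUNC_END(memcpy_movsq)
```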

Another option is to leave memcpy() untouched for this series and add
a new copy_pages() helper for the folio copy path. It would use
ALTERNATIVE_2 to pick rep movsb on ERMS/FSRM, rep movsq on REP_GOOD,
and a per-page copy_page() loop as the final fallback.
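A pseudocode-level sketch of that second option (all names and the
dispatch order are assumptions, not a tested implementation; for
simplicity it keys the movsb path off FSRM only):

```
/* Hypothetical copy_pages(dst, src, len): len is a whole number of
 * pages, so the movsq path needs no byte tail. */
SYM_FUNC_START(copy_pages)
	ALTERNATIVE_2 "jmp copy_pages_loop", \
		      "jmp copy_pages_movsq", X86_FEATURE_REP_GOOD, \
		      "", X86_FEATURE_FSRM
	movq %rdx, %rcx
	rep movsb		/* FSRM: one movsb over the whole range */
	RET
SYM_FUNC_END(copy_pages)

SYM_FUNC_START_LOCAL(copy_pages_movsq)
	movq %rdx, %rcx
	shrq $3, %rcx		/* page-multiple length, no tail */
	rep movsq
	RET
SYM_FUNC_END(copy_pages_movsq)

/* copy_pages_loop: fall back to the existing per-page copy_page(). */
```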

Thanks,
Shivank

Thread overview: 2+ messages
     [not found] <20260427142036.111940-2-shivankg@amd.com>
     [not found] ` <20260427142036.111940-4-shivankg@amd.com>
2026-05-12  9:31   ` [RFC PATCH 1/1] mm: batch page copies in folio_copy() and folio_mc_copy() David Hildenbrand (Arm)
2026-05-14  5:17     ` Garg, Shivank [this message]
