From: "Garg, Shivank" <shivankg@amd.com>
To: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: akpm@linux-foundation.org, david@kernel.org, kinseyho@google.com,
weixugc@google.com, ljs@kernel.org, Liam.Howlett@oracle.com,
vbabka@kernel.org, willy@infradead.org, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, ziy@nvidia.com,
matthew.brost@intel.com, joshua.hahnjy@gmail.com,
rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net,
apopple@nvidia.com, dave@stgolabs.net,
Jonathan.Cameron@huawei.com, rkodsara@amd.com, vkoul@kernel.org,
bharata@amd.com, sj@kernel.org, rientjes@google.com,
xuezhengchu@huawei.com, yiannis@zptcorp.com,
dave.hansen@intel.com, hannes@cmpxchg.org, jhubbard@nvidia.com,
peterx@redhat.com, riel@surriel.com, shakeel.butt@linux.dev,
stalexan@redhat.com, tj@kernel.org, nifan.cxl@gmail.com,
jic23@kernel.org, aneesh.kumar@kernel.org, nathan.lynch@amd.com,
Frank.li@nxp.com, djbw@kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload
Date: Fri, 8 May 2026 16:34:22 +0530 [thread overview]
Message-ID: <152b9b5d-67c8-4a13-b8a8-be576a16eb8f@amd.com> (raw)
In-Reply-To: <87zf2kvnqy.fsf@DESKTOP-5N7EMDA>
On 4/30/2026 2:17 PM, Huang, Ying wrote:
> Shivank Garg <shivankg@amd.com> writes:
>> PERFORMANCE RESULTS:
>> --------------------
>>
>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>> change in V5 alters this picture; please refer to the V4 cover letter
>> for the throughput tables [1].
>
> IMHO, it's better to copy performance data here.
>
> In addition to the performance benefit, I want to know the downside as
> well. For example, the migration latency of the first folio may be
> longer. If so, by how much? Can you measure the batch number vs. total
> migration time (benefit) and first folio migration time (downside)?
> That can be used to determine the optimal batch number.
>
System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.
Benchmark: move_pages() syscall to move pages between two NUMA nodes.
1). Moving folios of different sizes such that the total transfer size is constant
(1 GB), with varying numbers of DMA channels. Throughput in GB/s.
a. Baseline (vanilla kernel, single-threaded, serial folio_copy):
================================================================================
4K | 16K | 64K | 256K | 1M | 2M |
================================================================================
3.31±0.18 | 5.61±0.07 | 6.66±0.03 | 7.01±0.03 | 7.13±0.08 | 11.02±0.17 |
b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):
============================================================================================
N channel| 4K | 16K | 64K | 256K | 1M | 2M |
============================================================================================
1 | 2.16±0.14 | 2.58±0.02 | 3.00±0.04 | 4.56±0.28 | 4.62±0.02 | 12.65±0.08 |
2 | 2.68±0.09 | 3.69±0.15 | 4.52±0.04 | 6.75±0.06 | 7.19±0.19 | 14.38±0.06 |
4 | 3.07±0.13 | 4.62±0.09 | 6.47±0.56 | 9.22±0.15 | 10.24±0.47 | 27.01±0.11 |
8 | 3.43±0.09 | 5.40±0.16 | 7.67±0.08 | 11.25±0.17 | 12.60±0.60 | 45.62±0.52 |
12 | 3.50±0.11 | 5.66±0.16 | 8.12±0.10 | 11.97±0.19 | 13.43±0.08 | 61.02±0.92 |
16 | 3.54±0.12 | 5.79±0.14 | 8.50±0.13 | 12.59±0.15 | 17.21±6.40 | 65.23±1.70 |
2). First-folio latency: instrumented with custom tracepoints to measure the latency of each migrate_pages_batch() call.
Results: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.
A). Vanilla Kernel:
Here, n is the workload size passed to move_pages(), i.e. the number of folios moved per call.
NR_MAX_BATCHED_MIGRATION is at its upstream default of 512.
--- Order 0 (4K folios) ---
n vanilla/cpu
(folios) GB/s | first(us)
--------------------------
1 0.04 | 24
4 0.16 | 25
8 0.29 | 31
16 0.54 | 27
64 1.15 | 68
256 1.86 | 162
512 2.21 | 264
2048 2.62 | 208
4096 2.74 | 182
16384 2.73 | 173
65536 3.28 | 166
262144 3.20 | 167
--- Order 9 (2M folios) ---
n vanilla/cpu
(folios) GB/s | first(us)
--------------------------
1 7.05 | 194
4 8.78 | 186
8 8.47 | 188
16 7.20 | 193
64 8.23 | 191
256 10.51 | 180
512 10.88 | 173
Takeaway:
In each migrate_pages_batch() call, folios are first unmapped, then try_to_unmap_flush()
runs, and only then do folios enter move_to_new_folio(). So first-folio latency is bounded
by the per-batch unmap+flush cost and plateaus once the workload is large enough.
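Schematically (simplified pseudocode, not the exact mm/migrate.c code):

```
migrate_pages_batch():
    for each folio:
        unmap folio                  /* phase 1: unmap */
    try_to_unmap_flush()             /* one deferred TLB flush per batch */
    for each folio:
        move_to_new_folio()          /* phase 2: copy + remap */
```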
B). Patched kernel:
Here, N = NR_MAX_BATCHED_MIGRATION (in pages). Total migrated data is fixed at 1 GB.
N is varied via a knob to measure the impact of different maximum batch sizes.
--- ORDER 0 (4K folios) ---
N offload/dma1 offload/dma4 offload/dma16
GB/s | first(us) GB/s | first(us) GB/s | first(us)
------------------------------------------------------------------------
512 2.13 | 639 3.23 | 290 3.27 | 253
1024 2.17 | 1261 3.44 | 582 3.58 | 536
2048 2.01 | 2769 3.09 | 1360 3.45 | 1083
4096 2.10 | 5059 3.13 | 2737 3.58 | 2115
8192 2.21 | 9320 3.17 | 5015 3.75 | 3617
16384 2.15 | 18689 3.31 | 9623 3.87 | 6937
32768 2.12 | 42692 3.38 | 18893 3.83 | 14255
65536 2.09 | 81956 3.38 | 38556 3.64 | 29003
131072 2.02 | 169563 3.22 | 81082 3.63 | 62236
262144 2.21 | 318424 3.12 | 170174 3.50 | 129413
--- ORDER 9 (2M folios) ---
N offload/dma1 offload/dma4 offload/dma16
GB/s | first(us) GB/s | first(us) GB/s | first(us)
-------------------------------------------------------------------------
512 11.66 | 160 11.68 | 160 11.65 | 160
1024 12.16 | 310 13.67 | 275 13.64 | 276
2048 12.30 | 613 25.47 | 290 25.48 | 291
4096 12.48 | 1215 26.19 | 566 42.59 | 335
8192 12.56 | 2424 26.57 | 1118 58.72 | 470 *
16384 12.61 | 4839 26.77 | 2218 61.94 | 896
32768 12.60 | 9667 26.98 | 4422 63.75 | 1748
65536 12.63 | 19318 26.99 | 8838 60.66 | 3543
131072 12.64 | 38935 27.02 | 17935 61.06 | 7178
262144 12.66 | 77694 26.85 | 35871 65.06 | 14129
In the batch-copy offload approach, the DMA copy phase is inserted between the
unmap/flush and move phases, so a larger N increases first-folio wall-clock latency.
Throughput improves, but with diminishing returns.
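With offload, the flow becomes, roughly (simplified pseudocode):

```
migrate_pages_batch():               /* patched: batch-copy offload */
    unmap all folios
    try_to_unmap_flush()
    submit all N copies to DMA       /* new phase between flush and move */
    wait for batch completion        /* first folio waits on the whole batch */
    remap all folios
```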
For the DCBM+PTDMA setup, the optimal batch for 2M folios sits around N=8192-16384,
because a larger batch lets the driver distribute more folios across the available
DMA channels. This is where we get most of the throughput while keeping first-folio
latency in check.
The optimal batch value is hardware-specific. Other engines (e.g. SDXI) and memory
tiers (e.g. CXL) will likely have different curves.
Does this approach and experiment look good to you?
Thanks,
Shivank