From: "Garg, Shivank" <shivankg@amd.com>
To: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: akpm@linux-foundation.org, david@kernel.org, kinseyho@google.com,
weixugc@google.com, ljs@kernel.org, Liam.Howlett@oracle.com,
vbabka@kernel.org, willy@infradead.org, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, ziy@nvidia.com,
matthew.brost@intel.com, joshua.hahnjy@gmail.com,
rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net,
apopple@nvidia.com, dave@stgolabs.net,
Jonathan.Cameron@huawei.com, rkodsara@amd.com, vkoul@kernel.org,
bharata@amd.com, sj@kernel.org, rientjes@google.com,
xuezhengchu@huawei.com, yiannis@zptcorp.com,
dave.hansen@intel.com, hannes@cmpxchg.org, jhubbard@nvidia.com,
peterx@redhat.com, riel@surriel.com, shakeel.butt@linux.dev,
stalexan@redhat.com, tj@kernel.org, nifan.cxl@gmail.com,
jic23@kernel.org, aneesh.kumar@kernel.org, nathan.lynch@amd.com,
Frank.li@nxp.com, djbw@kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload
Date: Sun, 10 May 2026 20:33:47 +0530 [thread overview]
Message-ID: <22157c87-1465-46de-8e1c-5d99a90152a6@amd.com> (raw)
In-Reply-To: <87cxz5rpi3.fsf@DESKTOP-5N7EMDA>
On 5/9/2026 1:19 PM, Huang, Ying wrote:
> "Garg, Shivank" <shivankg@amd.com> writes:
>
>> On 5/8/2026 4:58 PM, Huang, Ying wrote:
>>> Hi, Shivank,
>>>
>>> "Garg, Shivank" <shivankg@amd.com> writes:
>>>
>>>> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>>>>> Shivank Garg <shivankg@amd.com> writes:
>>>>
>>>>>> PERFORMANCE RESULTS:
>>>>>> --------------------
>>>>>>
>>>>>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>>>>>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>>>>>> change in V5 alters this picture; please refer to the V4 cover letter
>>>>>> for the throughput tables [1].
>>>>>
>>>>> IMHO, it's better to copy performance data here.
>>>>>
>>>>> In addition to the performance benefit, I want to know the downside as
>>>>> well. For example, the migration latency of the first folio may be
>>>>> longer. If so, by how much? Can you measure the batch number vs. total
>>>>> migration time (benefit) and first folio migration time (downside)?
>>>>> That can be used to determine the optimal batch number.
>>>>>
>>>>
>>>> System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
>>>> 1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.
>>>>
>>>> Benchmark: move_pages() syscall to move pages between two NUMA nodes.
>>>>
>>>> 1). Moving different-sized folios such that the total transfer size is
>>>> constant (1GB), with different numbers of DMA channels. Throughput in GB/s.
>>>>
>>>> a. Baseline (vanilla kernel, single-threaded, serial folio_copy):
>>>>
>>>> ================================================================================
>>>> 4K | 16K | 64K | 256K | 1M | 2M |
>>>> ================================================================================
>>>> 3.31±0.18 | 5.61±0.07 | 6.66±0.03 | 7.01±0.03 | 7.13±0.08 | 11.02±0.17 |
>>>>
>>>>
>>>> b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):
>>>>
>>>> ============================================================================================
>>>> N channel| 4K | 16K | 64K | 256K | 1M | 2M |
>>>> ============================================================================================
>>>> 1 | 2.16±0.14 | 2.58±0.02 | 3.00±0.04 | 4.56±0.28 | 4.62±0.02 | 12.65±0.08 |
>>>> 2 | 2.68±0.09 | 3.69±0.15 | 4.52±0.04 | 6.75±0.06 | 7.19±0.19 | 14.38±0.06 |
>>>> 4 | 3.07±0.13 | 4.62±0.09 | 6.47±0.56 | 9.22±0.15 | 10.24±0.47 | 27.01±0.11 |
>>>> 8 | 3.43±0.09 | 5.40±0.16 | 7.67±0.08 | 11.25±0.17 | 12.60±0.60 | 45.62±0.52 |
>>>> 12 | 3.50±0.11 | 5.66±0.16 | 8.12±0.10 | 11.97±0.19 | 13.43±0.08 | 61.02±0.92 |
>>>> 16 | 3.54±0.12 | 5.79±0.14 | 8.50±0.13 | 12.59±0.15 | 17.21±6.40 | 65.23±1.70 |
>>>>
>>>>
>>>> 2). First-folio latency: Instrumented with custom tracepoints to
>>>> measure latency per migrate_pages_batch() call.
>>>> Result: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.
>>>
>>> Thanks for the detailed data. Per my understanding, the run time of
>>> migrate_pages_batch() may not be good enough for measuring first-folio
>>> latency. IIUC, the migration procedure is something like,
>>>
>>> for each folio
>>> unmap
>>> flush
>>> for each folio
>>> copy
>>> remap ===> first folio migrated
>>>
>>> Some tracepoint should be better to measure it.
>>
>> Sorry, my earlier write-up was unclear.
>> For first-folio latency, I added two tracepoints: one at the start of
>> migrate_pages_batch() and one in migrate_folio_done().
>>
>> I agree that the user-visible point is right after remove_migration_ptes().
>> However, migrate_folio_done() runs only a few operations later and adds a
>> constant offset, so it's unlikely to change the shape of the trade-off
>> curve. I'll move the tracepoint right after remove_migration_ptes() for the
>> next posting.
>
> Thanks for the explanation. A tracepoint in migrate_folio_done() should be OK.
>
>>>
>>>> A). Vanilla Kernel:
>>>>
>>>> Here, n is the number of folios passed to move_pages() (the workload size).
>>>> NR_MAX_BATCHED_MIGRATION is the upstream default value of 512.
>>>>
>>>> --- Order 0 (4K folios) ---
>>>> n vanilla/cpu
>>>> (folios) GB/s | first(us)
>>>> --------------------------
>>>> 1 0.04 | 24
>>>> 4 0.16 | 25
>>>> 8 0.29 | 31
>>>> 16 0.54 | 27
>>>> 64 1.15 | 68
>>>> 256 1.86 | 162
>>>> 512 2.21 | 264
>>>> 2048 2.62 | 208
>>>> 4096 2.74 | 182
>>>> 16384 2.73 | 173
>>>> 65536 3.28 | 166
>>>> 262144 3.20 | 167
>>>>
>>>> --- Order 9 (2M folios) ---
>>>> n vanilla/cpu
>>>> (folios) GB/s | first(us)
>>>> --------------------------
>>>> 1 7.05 | 194
>>>> 4 8.78 | 186
>>>> 8 8.47 | 188
>>>> 16 7.20 | 193
>>>> 64 8.23 | 191
>>>> 256 10.51 | 180
>>>> 512 10.88 | 173
>>>>
>>>> Takeaway:
>>>> In each migrate_pages_batch() call, folios are first unmapped, then
>>>> try_to_unmap_flush() runs, and only then do folios enter
>>>> move_to_new_folio(). So first-folio latency is bounded by the per-batch
>>>> unmap+flush cost, and plateaus once the workload is large enough.
>>>>
>>>>
>>>> B). Patched kernel:
>>>>
>>>> Here, N = NR_MAX_BATCHED_MIGRATION (in pages). Total migrated data is fixed at 1 GB.
>>>
>>> Emm, so NR_MAX_BATCHED_MIGRATION could be very large? I think that it
>>> needs to be bounded. If it is too large, too many pages may be in an
>>> inaccessible state for a longer time. That will hurt the workload
>>> performance, although it is optimal for migration performance.
>>>
>>
>> Agreed, it must be bounded.
>
> Thanks! Could you retest with a bounded NR_MAX_BATCHED_MIGRATION? If the
> upstream default doesn't work well for you, we can find a better one that
> balances throughput and latency well.
>
Thanks. The tables below sweep NR_MAX_BATCHED_MIGRATION from 512 up to 262144.
On 2M folios with 16-channel PTDMA, the knee is at N=8192-16384 (= {16 to 32} * 512).
One thing worth flagging on the "bounded default": at the upstream cap of 512
pages, migrate_pages_batch() receives at most one 2M folio per call, so PTDMA
can use only one of its 16 channels per batch and the offload degenerates to
roughly vanilla performance (DCBM offloads one 2M folio to each channel).
The larger-N rows are what exercise the channel parallelism in the PTDMA case.

SDXI-like [1] memory-to-memory data movers should reach good throughput with
just one channel, and thus may not require increasing NR_MAX_BATCHED_MIGRATION
for good throughput.

I'm not tying this series to a specific performance default for now; the design
review (batch-copy path, migrator interface, registration, static_call
dispatch) is the part I'd like to converge on first, and the threshold can be
tuned afterwards. Does that ordering work?
[1] https://lore.kernel.org/all/20260410-sdxi-base-v1-0-1d184cb5c60a@amd.com
Best regards,
Shivank
>>>> N is changed via a knob to measure the impact of different maximum batch sizes.
>>>>
>>>> --- ORDER 0 (4K folios) ---
>>>> N offload/dma1 offload/dma4 offload/dma16
>>>> GB/s | first(us) GB/s | first(us) GB/s | first(us)
>>>> ------------------------------------------------------------------------
>>>> 512 2.13 | 639 3.23 | 290 3.27 | 253
>>>> 1024 2.17 | 1261 3.44 | 582 3.58 | 536
>>>> 2048 2.01 | 2769 3.09 | 1360 3.45 | 1083
>>>> 4096 2.10 | 5059 3.13 | 2737 3.58 | 2115
>>>> 8192 2.21 | 9320 3.17 | 5015 3.75 | 3617
>>>> 16384 2.15 | 18689 3.31 | 9623 3.87 | 6937
>>>> 32768 2.12 | 42692 3.38 | 18893 3.83 | 14255
>>>> 65536 2.09 | 81956 3.38 | 38556 3.64 | 29003
>>>> 131072 2.02 | 169563 3.22 | 81082 3.63 | 62236
>>>> 262144 2.21 | 318424 3.12 | 170174 3.50 | 129413
>>>>
>>>> --- ORDER 9 (2M folios) ---
>>>> N offload/dma1 offload/dma4 offload/dma16
>>>> GB/s | first(us) GB/s | first(us) GB/s | first(us)
>>>> -------------------------------------------------------------------------
>>>> 512 11.66 | 160 11.68 | 160 11.65 | 160
>>>> 1024 12.16 | 310 13.67 | 275 13.64 | 276
>>>> 2048 12.30 | 613 25.47 | 290 25.48 | 291
>>>> 4096 12.48 | 1215 26.19 | 566 42.59 | 335
>>>> 8192 12.56 | 2424 26.57 | 1118 58.72 | 470 *
>>>> 16384 12.61 | 4839 26.77 | 2218 61.94 | 896
>>>> 32768 12.60 | 9667 26.98 | 4422 63.75 | 1748
>>>> 65536 12.63 | 19318 26.99 | 8838 60.66 | 3543
>>>> 131072 12.64 | 38935 27.02 | 17935 61.06 | 7178
>>>> 262144 12.66 | 77694 26.85 | 35871 65.06 | 14129
>>>>
>>>> In the batch-copy offload approach, the DMA copy phase is inserted between
>>>> unmap/flush and move, so a larger N increases first-folio wall-clock
>>>> latency. Throughput improves, but with diminishing returns.
>>>>
>>>> For the DCBM+PTDMA setup, the optimal batch for 2M folios sits around
>>>> N=8192-16384, because a larger batch lets the driver distribute more
>>>> folios across the available DMA channels. This is where we get the most
>>>> throughput while keeping first-folio latency in check.
>>>>
>>>> This optimal batch value is hardware-specific. Other engines (e.g. SDXI)
>>>> and memory tiers (e.g. CXL) will likely have different curves.
>>>>
>>>> Does this approach and experiment look good to you?
>
> ---
> Best Regards,
> Huang, Ying
Thread overview: 20+ messages
2026-04-28 15:50 [PATCH 0/7] Accelerate page migration with batch copying and hardware offload Shivank Garg
2026-04-28 15:50 ` [PATCH 1/7] mm/migrate: rename PAGE_ migration flags to FOLIO_ Shivank Garg
2026-04-30 9:07 ` Huang, Ying
2026-04-28 15:50 ` [PATCH 2/7] mm/migrate: use migrate_info field instead of private Shivank Garg
2026-05-07 9:43 ` Huang, Ying
2026-04-28 15:50 ` [PATCH 3/7] mm/migrate: skip data copy for already-copied folios Shivank Garg
2026-04-28 15:50 ` [PATCH 4/7] mm/migrate: add batch-copy path in migrate_pages_batch Shivank Garg
2026-04-28 15:50 ` [PATCH 5/7] mm/migrate: add copy offload registration infrastructure Shivank Garg
2026-04-28 15:50 ` [PATCH 6/7] drivers/migrate_offload: add DMA batch copy driver (dcbm) Shivank Garg
2026-04-28 15:50 ` [PATCH 7/7] mm/migrate: adjust NR_MAX_BATCHED_MIGRATION for testing Shivank Garg
2026-04-28 17:11 ` [PATCH 0/7] Accelerate page migration with batch copying and hardware offload Garg, Shivank
2026-04-28 19:33 ` David Hildenbrand (Arm)
2026-04-29 5:51 ` Garg, Shivank
2026-04-30 8:47 ` Huang, Ying
2026-05-08 11:04 ` Garg, Shivank
2026-05-08 11:28 ` Huang, Ying
2026-05-08 12:34 ` Garg, Shivank
2026-05-09 7:49 ` Huang, Ying
2026-05-10 15:03 ` Garg, Shivank [this message]
2026-05-07 9:58 ` Huang, Ying