From: "Huang, Ying" <ying.huang@linux.alibaba.com>
To: "Garg, Shivank" <shivankg@amd.com>
Cc: akpm@linux-foundation.org, david@kernel.org,
kinseyho@google.com, weixugc@google.com, ljs@kernel.org,
Liam.Howlett@oracle.com, vbabka@kernel.org,
willy@infradead.org, rppt@kernel.org, surenb@google.com,
mhocko@suse.com, ziy@nvidia.com, matthew.brost@intel.com,
joshua.hahnjy@gmail.com, rakie.kim@sk.com, byungchul@sk.com,
gourry@gourry.net, apopple@nvidia.com, dave@stgolabs.net,
Jonathan.Cameron@huawei.com, rkodsara@amd.com,
vkoul@kernel.org, bharata@amd.com, sj@kernel.org,
rientjes@google.com, xuezhengchu@huawei.com,
yiannis@zptcorp.com, dave.hansen@intel.com, hannes@cmpxchg.org,
jhubbard@nvidia.com, peterx@redhat.com, riel@surriel.com,
shakeel.butt@linux.dev, stalexan@redhat.com, tj@kernel.org,
nifan.cxl@gmail.com, jic23@kernel.org, aneesh.kumar@kernel.org,
nathan.lynch@amd.com, Frank.li@nxp.com, djbw@kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload
Date: Sat, 09 May 2026 15:49:56 +0800 [thread overview]
Message-ID: <87cxz5rpi3.fsf@DESKTOP-5N7EMDA> (raw)
In-Reply-To: <c4d00222-5caf-47bf-801c-ae1dd439ad0f@amd.com> (Shivank Garg's message of "Fri, 8 May 2026 18:04:34 +0530")
"Garg, Shivank" <shivankg@amd.com> writes:
> On 5/8/2026 4:58 PM, Huang, Ying wrote:
>> Hi, Shivank,
>>
>> "Garg, Shivank" <shivankg@amd.com> writes:
>>
>>> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>>>> Shivank Garg <shivankg@amd.com> writes:
>>>
>>>>> PERFORMANCE RESULTS:
>>>>> --------------------
>>>>>
>>>>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>>>>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>>>>> change in V5 alters this picture; please refer to the V4 cover letter
>>>>> for the throughput tables [1].
>>>>
>>>> IMHO, it's better to copy performance data here.
>>>>
>>>> In addition to the performance benefit, I want to know the downside as
>>>> well. For example, the migration latency of the first folio may be
>>>> longer. If so, by how much? Can you measure the batch number vs. total
>>>> migration time (benefit) and first folio migration time (downside)?
>>>> That can be used to determine the optimal batch number.
>>>>
>>>
>>> System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
>>> 1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.
>>>
>>> Benchmark: move_pages() syscall to move pages between two NUMA nodes.
>>>
>>> 1). Moving different-sized folios such that the total transfer size is constant
>>> (1 GB), with varying numbers of DMA channels. Throughput in GB/s.
>>>
>>> a. Baseline (vanilla kernel, single-threaded, serial folio_copy):
>>>
>>> ================================================================================
>>> 4K | 16K | 64K | 256K | 1M | 2M |
>>> ================================================================================
>>> 3.31±0.18 | 5.61±0.07 | 6.66±0.03 | 7.01±0.03 | 7.13±0.08 | 11.02±0.17 |
>>>
>>>
>>> b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):
>>>
>>> ============================================================================================
>>> N channel| 4K | 16K | 64K | 256K | 1M | 2M |
>>> ============================================================================================
>>> 1 | 2.16±0.14 | 2.58±0.02 | 3.00±0.04 | 4.56±0.28 | 4.62±0.02 | 12.65±0.08 |
>>> 2 | 2.68±0.09 | 3.69±0.15 | 4.52±0.04 | 6.75±0.06 | 7.19±0.19 | 14.38±0.06 |
>>> 4 | 3.07±0.13 | 4.62±0.09 | 6.47±0.56 | 9.22±0.15 | 10.24±0.47 | 27.01±0.11 |
>>> 8 | 3.43±0.09 | 5.40±0.16 | 7.67±0.08 | 11.25±0.17 | 12.60±0.60 | 45.62±0.52 |
>>> 12 | 3.50±0.11 | 5.66±0.16 | 8.12±0.10 | 11.97±0.19 | 13.43±0.08 | 61.02±0.92 |
>>> 16 | 3.54±0.12 | 5.79±0.14 | 8.50±0.13 | 12.59±0.15 | 17.21±6.40 | 65.23±1.70 |
>>>
>>>
>>> 2). First-folio latency: Instrumented with custom tracepoints to
>>> measure latency per migrate_pages_batch() call.
>>> Result: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.
>>
>> Thanks for the detailed data. Per my understanding, the run time of
>> migrate_pages_batch() may not be good enough for measuring first-folio
>> latency. IIUC, the migration procedure is something like,
>>
>> for each folio
>> unmap
>> flush
>> for each folio
>> copy
>> remap ===> first folio migrated
>>
>> A dedicated tracepoint would be better for measuring it.
>
> Sorry, my earlier write-up was unclear.
> For first-folio latency, I added two tracepoints: one at the start of migrate_pages_batch()
> and one in migrate_folio_done().
>
> I agree that the user-visible measurement point should be right after remove_migration_ptes().
> Though, migrate_folio_done() runs only a few operations later and adds a roughly constant
> offset, so it is unlikely to change the shape of the trade-off curve.
> I'll move the tracepoint right after remove_migration_ptes() for the next posting.
Thanks for the explanation. A tracepoint in migrate_folio_done() should be OK.
>>
>>> A). Vanilla Kernel:
>>>
>>> Here, n = number of folios passed to move_pages() (the workload size).
>>> NR_MAX_BATCHED_MIGRATION is upstream default value 512.
>>>
>>> --- Order 0 (4K folios) ---
>>> n vanilla/cpu
>>> (folios) GB/s | first(us)
>>> --------------------------
>>> 1 0.04 | 24
>>> 4 0.16 | 25
>>> 8 0.29 | 31
>>> 16 0.54 | 27
>>> 64 1.15 | 68
>>> 256 1.86 | 162
>>> 512 2.21 | 264
>>> 2048 2.62 | 208
>>> 4096 2.74 | 182
>>> 16384 2.73 | 173
>>> 65536 3.28 | 166
>>> 262144 3.20 | 167
>>>
>>> --- Order 9 (2M folios) ---
>>> n vanilla/cpu
>>> (folios) GB/s | first(us)
>>> --------------------------
>>> 1 7.05 | 194
>>> 4 8.78 | 186
>>> 8 8.47 | 188
>>> 16 7.20 | 193
>>> 64 8.23 | 191
>>> 256 10.51 | 180
>>> 512 10.88 | 173
>>>
>>> Takeaway:
>>> In each migrate_pages_batch() call, folios are first unmapped, then try_to_unmap_flush()
>>> runs, and only then do folios enter move_to_new_folio(). So first-folio latency is bounded
>>> by the per-batch unmap+flush cost and plateaus once the workload is large enough.
>>>
>>>
>>> B). Patched kernel:
>>>
>>> Here, N = NR_MAX_BATCHED_MIGRATION (in pages). Total migrated data is fixed at 1 GB.
>>
>> Emm, so NR_MAX_BATCHED_MIGRATION could be very large? I think that it
>> needs to be bounded. If it is too large, too many pages may be in an
>> inaccessible state for a longer time. That will hurt the workload
>> performance, although it is optimal for migration performance.
>>
>
> Agreed, it must be bounded.
Thanks! Could you retest with a bounded NR_MAX_BATCHED_MIGRATION? If the
upstream default doesn't work well for you, we can find a better one
that balances throughput and latency well.
>>> N is changed via a knob to measure the impact of different maximum batch sizes.
>>>
>>> --- ORDER 0 (4K folios) ---
>>> N offload/dma1 offload/dma4 offload/dma16
>>> GB/s | first(us) GB/s | first(us) GB/s | first(us)
>>> ------------------------------------------------------------------------
>>> 512 2.13 | 639 3.23 | 290 3.27 | 253
>>> 1024 2.17 | 1261 3.44 | 582 3.58 | 536
>>> 2048 2.01 | 2769 3.09 | 1360 3.45 | 1083
>>> 4096 2.10 | 5059 3.13 | 2737 3.58 | 2115
>>> 8192 2.21 | 9320 3.17 | 5015 3.75 | 3617
>>> 16384 2.15 | 18689 3.31 | 9623 3.87 | 6937
>>> 32768 2.12 | 42692 3.38 | 18893 3.83 | 14255
>>> 65536 2.09 | 81956 3.38 | 38556 3.64 | 29003
>>> 131072 2.02 | 169563 3.22 | 81082 3.63 | 62236
>>> 262144 2.21 | 318424 3.12 | 170174 3.50 | 129413
>>>
>>> --- ORDER 9 (2M folios) ---
>>> N offload/dma1 offload/dma4 offload/dma16
>>> GB/s | first(us) GB/s | first(us) GB/s | first(us)
>>> -------------------------------------------------------------------------
>>> 512 11.66 | 160 11.68 | 160 11.65 | 160
>>> 1024 12.16 | 310 13.67 | 275 13.64 | 276
>>> 2048 12.30 | 613 25.47 | 290 25.48 | 291
>>> 4096 12.48 | 1215 26.19 | 566 42.59 | 335
>>> 8192 12.56 | 2424 26.57 | 1118 58.72 | 470 *
>>> 16384 12.61 | 4839 26.77 | 2218 61.94 | 896
>>> 32768 12.60 | 9667 26.98 | 4422 63.75 | 1748
>>> 65536 12.63 | 19318 26.99 | 8838 60.66 | 3543
>>> 131072 12.64 | 38935 27.02 | 17935 61.06 | 7178
>>> 262144 12.66 | 77694 26.85 | 35871 65.06 | 14129
>>>
>>> In the batch-copy offload approach, the DMA copy phase is inserted between unmap/flush and
>>> move, so a larger N increases first-folio wall-clock latency. Throughput improves, but with
>>> diminishing returns.
>>>
>>> For the DCBM+PTDMA setup, the optimal batch for 2M folios sits around N=8192-16384,
>>> because a larger batch allows the driver to distribute more folios across the available
>>> DMA channels. This is where we get the most throughput while keeping first-folio latency
>>> in check.
>>>
>>> This optimal batch value is hardware-specific. Other engines (e.g., SDXI) and memory tiers
>>> (e.g., CXL) will likely have different curves.
>>>
>>> Does this approach and experiment look good to you?
---
Best Regards,
Huang, Ying
Thread overview: 20+ messages
2026-04-28 15:50 [PATCH 0/7] Accelerate page migration with batch copying and hardware offload Shivank Garg
2026-04-28 15:50 ` [PATCH 1/7] mm/migrate: rename PAGE_ migration flags to FOLIO_ Shivank Garg
2026-04-30 9:07 ` Huang, Ying
2026-04-28 15:50 ` [PATCH 2/7] mm/migrate: use migrate_info field instead of private Shivank Garg
2026-05-07 9:43 ` Huang, Ying
2026-04-28 15:50 ` [PATCH 3/7] mm/migrate: skip data copy for already-copied folios Shivank Garg
2026-04-28 15:50 ` [PATCH 4/7] mm/migrate: add batch-copy path in migrate_pages_batch Shivank Garg
2026-04-28 15:50 ` [PATCH 5/7] mm/migrate: add copy offload registration infrastructure Shivank Garg
2026-04-28 15:50 ` [PATCH 6/7] drivers/migrate_offload: add DMA batch copy driver (dcbm) Shivank Garg
2026-04-28 15:50 ` [PATCH 7/7] mm/migrate: adjust NR_MAX_BATCHED_MIGRATION for testing Shivank Garg
2026-04-28 17:11 ` [PATCH 0/7] Accelerate page migration with batch copying and hardware offload Garg, Shivank
2026-04-28 19:33 ` David Hildenbrand (Arm)
2026-04-29 5:51 ` Garg, Shivank
2026-04-30 8:47 ` Huang, Ying
2026-05-08 11:04 ` Garg, Shivank
2026-05-08 11:28 ` Huang, Ying
2026-05-08 12:34 ` Garg, Shivank
2026-05-09 7:49 ` Huang, Ying [this message]
2026-05-10 15:03 ` Garg, Shivank
2026-05-07 9:58 ` Huang, Ying