Linux-mm Archive on lore.kernel.org
From: "Garg, Shivank" <shivankg@amd.com>
To: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: akpm@linux-foundation.org, david@kernel.org, kinseyho@google.com,
	weixugc@google.com, ljs@kernel.org, Liam.Howlett@oracle.com,
	vbabka@kernel.org, willy@infradead.org, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, ziy@nvidia.com,
	matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net,
	apopple@nvidia.com, dave@stgolabs.net,
	Jonathan.Cameron@huawei.com, rkodsara@amd.com, vkoul@kernel.org,
	bharata@amd.com, sj@kernel.org, rientjes@google.com,
	xuezhengchu@huawei.com, yiannis@zptcorp.com,
	dave.hansen@intel.com, hannes@cmpxchg.org, jhubbard@nvidia.com,
	peterx@redhat.com, riel@surriel.com, shakeel.butt@linux.dev,
	stalexan@redhat.com, tj@kernel.org, nifan.cxl@gmail.com,
	jic23@kernel.org, aneesh.kumar@kernel.org, nathan.lynch@amd.com,
	Frank.li@nxp.com, djbw@kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload
Date: Sun, 10 May 2026 20:33:47 +0530
Message-ID: <22157c87-1465-46de-8e1c-5d99a90152a6@amd.com>
In-Reply-To: <87cxz5rpi3.fsf@DESKTOP-5N7EMDA>



On 5/9/2026 1:19 PM, Huang, Ying wrote:
> "Garg, Shivank" <shivankg@amd.com> writes:
> 
>> On 5/8/2026 4:58 PM, Huang, Ying wrote:
>>> Hi, Shivank,
>>>
>>> "Garg, Shivank" <shivankg@amd.com> writes:
>>>
>>>> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>>>>> Shivank Garg <shivankg@amd.com> writes:
>>>>
>>>>>> PERFORMANCE RESULTS:
>>>>>> --------------------
>>>>>>
>>>>>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>>>>>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>>>>>> change in V5 alters this picture; please refer to the V4 cover letter
>>>>>> for the throughput tables [1].
>>>>>
>>>>> IMHO, it's better to copy performance data here.
>>>>>
>>>>> In addition to the performance benefit, I want to know the downside as
>>>>> well.  For example, the migration latency of the first folio may be
>>>>> longer.  If so, by how much?  Can you measure the batch number vs. total
>>>>> migration time (benefit) and first folio migration time (downside)?
>>>>> That can be used to determine the optimal batch number.
>>>>>
>>>>
>>>> System Info: AMD Zen 3 EPYC server (2 sockets, 32 cores, SMT enabled),
>>>> 1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.
>>>>
>>>> Benchmark: move_pages() syscall to move pages between two NUMA nodes.
>>>>
>>>> 1). Moving different-sized folios such that the total transfer size is constant
>>>> (1 GB), with different numbers of DMA channels. Throughput in GB/s.
>>>>
>>>> a. Baseline (vanilla kernel, single-threaded, serial folio_copy):
>>>>
>>>> ================================================================================
>>>> 4K          | 16K        | 64K        | 256K       | 1M          | 2M          |
>>>> ================================================================================
>>>> 3.31±0.18   | 5.61±0.07  | 6.66±0.03  | 7.01±0.03  | 7.13±0.08   | 11.02±0.17  |
>>>>
>>>>
>>>> b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):
>>>>
>>>> ============================================================================================
>>>> N channel| 4K        | 16K         | 64K         | 256K        | 1M          | 2M          |
>>>> ============================================================================================
>>>> 1      | 2.16±0.14   | 2.58±0.02   | 3.00±0.04   | 4.56±0.28   | 4.62±0.02   | 12.65±0.08  |
>>>> 2      | 2.68±0.09   | 3.69±0.15   | 4.52±0.04   | 6.75±0.06   | 7.19±0.19   | 14.38±0.06  |
>>>> 4      | 3.07±0.13   | 4.62±0.09   | 6.47±0.56   | 9.22±0.15   | 10.24±0.47  | 27.01±0.11  |
>>>> 8      | 3.43±0.09   | 5.40±0.16   | 7.67±0.08   | 11.25±0.17  | 12.60±0.60  | 45.62±0.52  |
>>>> 12     | 3.50±0.11   | 5.66±0.16   | 8.12±0.10   | 11.97±0.19  | 13.43±0.08  | 61.02±0.92  |
>>>> 16     | 3.54±0.12   | 5.79±0.14   | 8.50±0.13   | 12.59±0.15  | 17.21±6.40  | 65.23±1.70  |
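
Aside, for anyone who wants to reproduce the numbers above: the harness is essentially a
thin wrapper around the move_pages(2) syscall. Below is a rough, untested userspace sketch;
the destination node, the THP setup and the timing details are assumptions, not the exact
harness used for these tables. Build with -lnuma.

    /* move 1 GB from its current node to node 1 and report GB/s */
    #define _GNU_SOURCE
    #include <numaif.h>          /* move_pages(), MPOL_MF_MOVE */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
            size_t total = 1UL << 30;                /* 1 GB, as in the tables */
            long psz = sysconf(_SC_PAGESIZE);
            unsigned long count = total / psz, i;

            char *buf = mmap(NULL, total, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            madvise(buf, total, MADV_HUGEPAGE);      /* best-effort 2M folios */
            memset(buf, 1, total);                   /* fault pages in on the local node */

            void **pages = malloc(count * sizeof(*pages));
            int *nodes   = malloc(count * sizeof(*nodes));
            int *status  = malloc(count * sizeof(*status));
            for (i = 0; i < count; i++) {
                    pages[i] = buf + i * psz;
                    nodes[i] = 1;                    /* assumed destination node */
            }

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            if (move_pages(0, count, pages, nodes, status, MPOL_MF_MOVE) < 0)
                    perror("move_pages");
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            printf("%.2f GB/s\n", total / sec / 1e9);
            return 0;
    }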
>>>>
>>>>
>>>> 2).  First-folio latency: Instrumented with custom tracepoints to
>>>> measure latency per migrate_pages_batch() call.
>>>>     Result: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.
>>>
>>> Thanks for the detailed data.  Per my understanding, the run time of
>>> migrate_pages_batch() may not be good enough for measuring first-folio
>>> latency.  IIUC, the migration procedure is something like,
>>>
>>>   for each folio
>>>         unmap
>>>   flush
>>>   for each folio
>>>         copy
>>>         remap ===> first folio migrated
>>>
>>> A dedicated tracepoint would be better for measuring that.
>>
>> Sorry, my earlier write-up was unclear.
>> For first-folio latency, I added two tracepoints: one at the start of migrate_pages_batch()
>> and one in migrate_folio_done().
>>
>> I agree that the tracepoint marking the folio as user-accessible again should sit right
>> after remove_migration_ptes(). That said, migrate_folio_done() runs only a few operations
>> later and adds a roughly constant offset, so it's unlikely to change the shape of the
>> trade-off curve. I'll move the tracepoint right after remove_migration_ptes() for the
>> next posting.
> 
> Thanks for the explanation.  A tracepoint in migrate_folio_done() should be OK.
> 
>>>
>>>> A). Vanilla Kernel:
>>>>
>>>> Here, n = workload size passed to move_pages(), i.e. the number of folios moved per call.
>>>> NR_MAX_BATCHED_MIGRATION is upstream default value 512.
>>>>
>>>> --- Order 0 (4K folios) ---
>>>>      n      vanilla/cpu
>>>> (folios)    GB/s | first(us)
>>>> --------------------------
>>>>      1       0.04 |     24
>>>>      4       0.16 |     25
>>>>      8       0.29 |     31
>>>>     16       0.54 |     27
>>>>     64       1.15 |     68
>>>>    256       1.86 |    162
>>>>    512       2.21 |    264
>>>>   2048       2.62 |    208
>>>>   4096       2.74 |    182
>>>>  16384       2.73 |    173
>>>>  65536       3.28 |    166
>>>> 262144       3.20 |    167
>>>>
>>>> --- Order 9 (2M folios) ---
>>>>      n      vanilla/cpu
>>>> (folios)    GB/s | first(us)
>>>> --------------------------
>>>>      1       7.05 |    194
>>>>      4       8.78 |    186
>>>>      8       8.47 |    188
>>>>     16       7.20 |    193
>>>>     64       8.23 |    191
>>>>    256      10.51 |    180
>>>>    512      10.88 |    173
>>>>
>>>> Takeaway:
>>>> In each migrate_pages_batch() call, folios are first unmapped, then try_to_unmap_flush()
>>>> runs, and only then do folios enter move_to_new_folio(). So first-folio latency is bounded
>>>> by the per-batch unmap+flush cost and plateaus once the workload is large enough.
>>>>
>>>>
>>>> B). Patched kernel:
>>>>
>>>> Here, N = NR_MAX_BATCHED_MIGRATION (in pages). Total migrated data is fixed at 1 GB.
>>>
>>> Emm, so NR_MAX_BATCHED_MIGRATION could be very large?  I think that it
>>> needs to be bounded.  If it is too large, too many pages may be in an
>>> inaccessible state for a longer time.  That will hurt the workload
>>> performance, although it is optimal for migration performance.
>>>
>>
>> Agreed, it must be bounded.
> 
> Thanks!  Could you retest with a bounded NR_MAX_BATCHED_MIGRATION?  If the
> upstream default doesn't work well for you, we can find a better one that
> balances throughput and latency well.
> 

Thanks. The tables below sweep NR_MAX_BATCHED_MIGRATION from 512 up to 262144. For 2M folios
with 16-channel PTDMA, the knee is at N=8192-16384, i.e. 16-32 times the upstream default of
512: 8192 pages is 16 x 2M folios (one folio per channel), and 16384 is two folios per channel.

>>>>   8192         12.56 |   2424        26.57 |   1118        58.72 |    470 *

One thing worth flagging on the "bounded default": at the upstream cap of 512 pages,
migrate_pages_batch() receives at most one 2M folio per call, so PTDMA can use only
one of its 16 channels per batch and the offload performs no better than vanilla
(DCBM offloads one 2M folio to each channel). The larger-N rows are what actually
exercise the channel parallelism in the PTDMA case.
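
To make that concrete, here is a deliberately simplified sketch of the per-batch
distribution idea. This is not the dcbm code itself: chans[]/nr_chans, the src/dst
folio arrays and the 16-channel bound are placeholders, and DMA-mapping error
handling and unmapping are omitted.

    /* kernel context: needs <linux/dmaengine.h>, <linux/dma-mapping.h>, <linux/mm.h> */

    /* Copy src[i] -> dst[i] for nr already-unmapped folios, round-robin over
     * nr_chans dmaengine channels, then wait for every channel to finish
     * before migrate_pages_batch() continues with remapping. */
    static int copy_folios_dma(struct folio **dst, struct folio **src, int nr,
                               struct dma_chan **chans, int nr_chans)
    {
            dma_cookie_t last[16] = { 0 };  /* assumption: at most 16 channels, as in the PTDMA setup */
            int i;

            for (i = 0; i < nr; i++) {
                    struct dma_chan *chan = chans[i % nr_chans];
                    struct device *dev = chan->device->dev;
                    size_t len = folio_size(src[i]);
                    dma_addr_t s = dma_map_page(dev, &src[i]->page, 0, len, DMA_TO_DEVICE);
                    dma_addr_t d = dma_map_page(dev, &dst[i]->page, 0, len, DMA_FROM_DEVICE);
                    struct dma_async_tx_descriptor *tx =
                            dmaengine_prep_dma_memcpy(chan, d, s, len, DMA_CTRL_ACK);

                    last[i % nr_chans] = dmaengine_submit(tx);
                    dma_async_issue_pending(chan);
            }
            for (i = 0; i < nr_chans; i++)
                    if (last[i])
                            dma_sync_wait(chans[i], last[i]);
            return 0;
    }

With the 512-page cap, only one iteration of the outer loop ever runs for 2M folios,
which is the single-channel degradation described above.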

"SDXI"[1] like memory-to-memory data movers should reach good throughput with just 1 channel, 
and thus may not require increasing the NR_MAX_BATCHED_MIGRATION for good throughput.

I'm not tying this series to a specific performance default for now; the design review (batch-copy
path, migrator interface, registration, static_call dispatch) is the part I'd like to converge on
first, and the threshold can be tuned afterwards. Does that ordering work?
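
For reviewers who haven't opened the patches yet, the shape of that dispatch is roughly as
below. This is a deliberately simplified illustration of static_call-based dispatch with
placeholder identifiers, not the actual names or signatures from the series:

    struct migrator {
            const char *name;
            /* copy the whole batch: dst[i] receives the contents of src[i] */
            int (*copy_folios)(struct folio **dst, struct folio **src, int nr);
    };

    /* default copy path: plain CPU copy, one folio at a time */
    static int cpu_copy_folios(struct folio **dst, struct folio **src, int nr)
    {
            int i;

            for (i = 0; i < nr; i++)
                    folio_copy(dst[i], src[i]);
            return 0;
    }
    DEFINE_STATIC_CALL(migrator_copy, cpu_copy_folios);

    /* at offload-driver registration time (e.g. dcbm module init): */
    static_call_update(migrator_copy, dcbm_copy_folios);

    /* in migrate_pages_batch(), the whole batch is handed over in one call: */
    err = static_call(migrator_copy)(dst_folios, src_folios, nr);

The registration side is what patch 5 adds; the sketch only shows why static_call keeps
the default CPU path cheap when no offload driver is loaded.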

[1] https://lore.kernel.org/all/20260410-sdxi-base-v1-0-1d184cb5c60a@amd.com

Best regards,
Shivank

>>>> N is varied via a knob to measure the impact of different maximum batch sizes.
>>>>
>>>> --- ORDER 0 (4K folios) ---
>>>>      N         offload/dma1          offload/dma4          offload/dma16
>>>>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
>>>> ------------------------------------------------------------------------
>>>>    512         2.13 |    639         3.23 |    290         3.27 |    253
>>>>   1024         2.17 |   1261         3.44 |    582         3.58 |    536
>>>>   2048         2.01 |   2769         3.09 |   1360         3.45 |   1083
>>>>   4096         2.10 |   5059         3.13 |   2737         3.58 |   2115 
>>>>   8192         2.21 |   9320         3.17 |   5015         3.75 |   3617 
>>>>  16384         2.15 |  18689         3.31 |   9623         3.87 |   6937
>>>>  32768         2.12 |  42692         3.38 |  18893         3.83 |  14255
>>>>  65536         2.09 |  81956         3.38 |  38556         3.64 |  29003
>>>> 131072         2.02 | 169563         3.22 |  81082         3.63 |  62236
>>>> 262144         2.21 | 318424         3.12 | 170174         3.50 | 129413
>>>>
>>>> --- ORDER 9 (2M folios) ---
>>>>      N         offload/dma1          offload/dma4          offload/dma16
>>>>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
>>>> -------------------------------------------------------------------------
>>>>    512         11.66 |    160        11.68 |    160        11.65 |    160
>>>>   1024         12.16 |    310        13.67 |    275        13.64 |    276
>>>>   2048         12.30 |    613        25.47 |    290        25.48 |    291
>>>>   4096         12.48 |   1215        26.19 |    566        42.59 |    335
>>>>   8192         12.56 |   2424        26.57 |   1118        58.72 |    470 *
>>>>  16384         12.61 |   4839        26.77 |   2218        61.94 |    896
>>>>  32768         12.60 |   9667        26.98 |   4422        63.75 |   1748
>>>>  65536         12.63 |  19318        26.99 |   8838        60.66 |   3543
>>>> 131072         12.64 |  38935        27.02 |  17935        61.06 |   7178
>>>> 262144         12.66 |  77694        26.85 |  35871        65.06 |  14129
>>>>
>>>> In the batch-copy offload approach, the DMA copy phase is inserted between unmap/flush and
>>>> move, so a larger N increases first-folio wall-clock latency. Throughput improves, but with
>>>> diminishing returns.
>>>>
>>>> For the DCBM+PTDMA setup, the optimal batch for 2M folios sits around N=8192-16384,
>>>> because a larger batch lets the driver distribute more folios across the available DMA channels.
>>>> This is where we get most of the throughput gain while keeping first-folio latency in check.
>>>>
>>>> This optimal batch value is hardware-specific. Other engines (e.g. SDXI) and memory tiers
>>>> (e.g. CXL) will likely have different curves.
>>>>
>>>> Does this approach and experiment look good to you?
> 
> ---
> Best Regards,
> Huang, Ying




Thread overview: 20+ messages
2026-04-28 15:50 [PATCH 0/7] Accelerate page migration with batch copying and hardware offload Shivank Garg
2026-04-28 15:50 ` [PATCH 1/7] mm/migrate: rename PAGE_ migration flags to FOLIO_ Shivank Garg
2026-04-30  9:07   ` Huang, Ying
2026-04-28 15:50 ` [PATCH 2/7] mm/migrate: use migrate_info field instead of private Shivank Garg
2026-05-07  9:43   ` Huang, Ying
2026-04-28 15:50 ` [PATCH 3/7] mm/migrate: skip data copy for already-copied folios Shivank Garg
2026-04-28 15:50 ` [PATCH 4/7] mm/migrate: add batch-copy path in migrate_pages_batch Shivank Garg
2026-04-28 15:50 ` [PATCH 5/7] mm/migrate: add copy offload registration infrastructure Shivank Garg
2026-04-28 15:50 ` [PATCH 6/7] drivers/migrate_offload: add DMA batch copy driver (dcbm) Shivank Garg
2026-04-28 15:50 ` [PATCH 7/7] mm/migrate: adjust NR_MAX_BATCHED_MIGRATION for testing Shivank Garg
2026-04-28 17:11 ` [PATCH 0/7] Accelerate page migration with batch copying and hardware offload Garg, Shivank
2026-04-28 19:33   ` David Hildenbrand (Arm)
2026-04-29  5:51     ` Garg, Shivank
2026-04-30  8:47 ` Huang, Ying
2026-05-08 11:04   ` Garg, Shivank
2026-05-08 11:28     ` Huang, Ying
2026-05-08 12:34       ` Garg, Shivank
2026-05-09  7:49         ` Huang, Ying
2026-05-10 15:03           ` Garg, Shivank [this message]
2026-05-07  9:58 ` Huang, Ying
