Linux-mm Archive on lore.kernel.org
From: "Huang, Ying" <ying.huang@linux.alibaba.com>
To: "Garg, Shivank" <shivankg@amd.com>
Cc: akpm@linux-foundation.org,  david@kernel.org,
	 kinseyho@google.com, weixugc@google.com,  ljs@kernel.org,
	 Liam.Howlett@oracle.com, vbabka@kernel.org,
	 willy@infradead.org,  rppt@kernel.org, surenb@google.com,
	 mhocko@suse.com,  ziy@nvidia.com, matthew.brost@intel.com,
	 joshua.hahnjy@gmail.com,  rakie.kim@sk.com, byungchul@sk.com,
	 gourry@gourry.net,  apopple@nvidia.com, dave@stgolabs.net,
	 Jonathan.Cameron@huawei.com,  rkodsara@amd.com,
	vkoul@kernel.org,  bharata@amd.com,  sj@kernel.org,
	 rientjes@google.com, xuezhengchu@huawei.com,
	 yiannis@zptcorp.com,  dave.hansen@intel.com, hannes@cmpxchg.org,
	 jhubbard@nvidia.com,  peterx@redhat.com, riel@surriel.com,
	 shakeel.butt@linux.dev,  stalexan@redhat.com, tj@kernel.org,
	 nifan.cxl@gmail.com,  jic23@kernel.org, aneesh.kumar@kernel.org,
	 nathan.lynch@amd.com,  Frank.li@nxp.com, djbw@kernel.org,
	 linux-kernel@vger.kernel.org,  linux-mm@kvack.org
Subject: Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload
Date: Sat, 09 May 2026 15:49:56 +0800	[thread overview]
Message-ID: <87cxz5rpi3.fsf@DESKTOP-5N7EMDA> (raw)
In-Reply-To: <c4d00222-5caf-47bf-801c-ae1dd439ad0f@amd.com> (Shivank Garg's message of "Fri, 8 May 2026 18:04:34 +0530")

"Garg, Shivank" <shivankg@amd.com> writes:

> On 5/8/2026 4:58 PM, Huang, Ying wrote:
>> Hi, Shivank,
>> 
>> "Garg, Shivank" <shivankg@amd.com> writes:
>> 
>>> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>>>> Shivank Garg <shivankg@amd.com> writes:
>>>
>>>>> PERFORMANCE RESULTS:
>>>>> --------------------
>>>>>
>>>>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>>>>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>>>>> change in V5 alters this picture; please refer to the V4 cover letter
>>>>> for the throughput tables [1].
>>>>
>>>> IMHO, it's better to copy performance data here.
>>>>
>>>> In addition to the performance benefit, I want to know the downside as
>>>> well.  For example, the migration latency of the first folio may be
>>>> longer.  If so, by how much?  Can you measure the batch number vs. total
>>>> migration time (benefit) and first folio migration time (downside)?
>>>> That can be used to determine the optimal batch number.
>>>>
>>>
>>> System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
>>> 1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.
>>>
>>> Benchmark: move_pages() syscall to move pages between two NUMA nodes.
>>>
>>> 1). Moving different sized folios such that total transfer size is constant
>>> (1GB), with different number of DMA channels. Throughput in GB/s.
>>>
>>> a. Baseline (vanilla kernel, single-threaded, serial folio_copy):
>>>
>>> ================================================================================
>>> 4K          | 16K        | 64K        | 256K       | 1M          | 2M          |
>>> ================================================================================
>>> 3.31±0.18   | 5.61±0.07  | 6.66±0.03  | 7.01±0.03  | 7.13±0.08   | 11.02±0.17  |
>>>
>>>
>>> b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):
>>>
>>> ============================================================================================
>>> N channel| 4K        | 16K         | 64K         | 256K        | 1M          | 2M          |
>>> ============================================================================================
>>> 1      | 2.16±0.14   | 2.58±0.02   | 3.00±0.04   | 4.56±0.28   | 4.62±0.02   | 12.65±0.08  |
>>> 2      | 2.68±0.09   | 3.69±0.15   | 4.52±0.04   | 6.75±0.06   | 7.19±0.19   | 14.38±0.06  |
>>> 4      | 3.07±0.13   | 4.62±0.09   | 6.47±0.56   | 9.22±0.15   | 10.24±0.47  | 27.01±0.11  |
>>> 8      | 3.43±0.09   | 5.40±0.16   | 7.67±0.08   | 11.25±0.17  | 12.60±0.60  | 45.62±0.52  |
>>> 12     | 3.50±0.11   | 5.66±0.16   | 8.12±0.10   | 11.97±0.19  | 13.43±0.08  | 61.02±0.92  |
>>> 16     | 3.54±0.12   | 5.79±0.14   | 8.50±0.13   | 12.59±0.15  | 17.21±6.40  | 65.23±1.70  |
>>>
>>>
>>> 2).  First-folio latency: Instrumented with custom tracepoints to
>>> measure latency per migrate_pages_batch() call.
>>>     Result: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.
>> 
>> Thanks for the detailed data.  Per my understanding, the run time of
>> migrate_pages_batch() may not be good enough for measuring first-folio
>> latency.  IIUC, the migration procedure is something like,
>> 
>>   for each folio
>>         unmap
>>   flush
>>   for each folio
>>         copy
>>         remap ===> first folio migrated
>> 
>> Some tracepoint should be better to measure it.
>
> Sorry, my earlier write-up was unclear.
> For first-folio latency, I added two tracepoints: one at the start of
> migrate_pages_batch() and one in migrate_folio_done().
>
> I agree that the tracepoint for the user-accessible point should be right
> after remove_migration_ptes().  However, migrate_folio_done() runs only a
> few operations later and adds a constant offset, so it's unlikely to change
> the shape of the trade-off curve.
> I'll move the tracepoint right after remove_migration_ptes() in the next
> posting.

Thanks for the explanation.  A tracepoint in migrate_folio_done() should be OK.

>> 
>>> A). Vanilla Kernel:
>>>
>>> Here, n = workload size passed to move_pages(), in folios (i.e. move n
>>> folios per call).  NR_MAX_BATCHED_MIGRATION is the upstream default of 512.
>>>
>>> --- Order 0 (4K folios) ---
>>>      n      vanilla/cpu
>>> (folios)    GB/s | first(us)
>>> --------------------------
>>>      1       0.04 |     24
>>>      4       0.16 |     25
>>>      8       0.29 |     31
>>>     16       0.54 |     27
>>>     64       1.15 |     68
>>>    256       1.86 |    162
>>>    512       2.21 |    264
>>>   2048       2.62 |    208
>>>   4096       2.74 |    182
>>>  16384       2.73 |    173
>>>  65536       3.28 |    166
>>> 262144       3.20 |    167
>>>
>>> --- Order 9 (2M folios) ---
>>>      n      vanilla/cpu
>>> (folios)    GB/s | first(us)
>>> --------------------------
>>>      1       7.05 |    194
>>>      4       8.78 |    186
>>>      8       8.47 |    188
>>>     16       7.20 |    193
>>>     64       8.23 |    191
>>>    256      10.51 |    180
>>>    512      10.88 |    173
>>>
>>> Takeaway:
>>> In each migrate_pages_batch() call, folios are first unmapped, then
>>> try_to_unmap_flush() runs, and only then do folios enter move_to_new_folio().
>>> So first-folio latency is bounded by the per-batch unmap+flush cost, and
>>> plateaus once the workload is large enough.
>>>
>>>
>>> B). Patched kernel:
>>>
>>> Here, N = NR_MAX_BATCHED_MIGRATION (in page). Total migrated data is fixed at 1 GB.
>> 
>> Emm, so NR_MAX_BATCHED_MIGRATION could be very large?  I think that it
>> needs to be bounded.  If it is too large, too many pages may be in an
>> inaccessible state for a longer time.  That will hurt the workload
>> performance, although it is optimal for migration performance.
>> 
>
> Agreed, it must be bounded.

Thanks!  Could you retest with a bounded NR_MAX_BATCHED_MIGRATION?  If
the upstream default doesn't work well for you, we can find a better one
that balances throughput and latency well.

>>> Change N with a knob to measure impact of different max batched size.
>>>
>>> --- ORDER 0 (4K folios) ---
>>>      N         offload/dma1          offload/dma4          offload/dma16
>>>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
>>> ------------------------------------------------------------------------
>>>    512         2.13 |    639         3.23 |    290         3.27 |    253
>>>   1024         2.17 |   1261         3.44 |    582         3.58 |    536
>>>   2048         2.01 |   2769         3.09 |   1360         3.45 |   1083
>>>   4096         2.10 |   5059         3.13 |   2737         3.58 |   2115 
>>>   8192         2.21 |   9320         3.17 |   5015         3.75 |   3617 
>>>  16384         2.15 |  18689         3.31 |   9623         3.87 |   6937
>>>  32768         2.12 |  42692         3.38 |  18893         3.83 |  14255
>>>  65536         2.09 |  81956         3.38 |  38556         3.64 |  29003
>>> 131072         2.02 | 169563         3.22 |  81082         3.63 |  62236
>>> 262144         2.21 | 318424         3.12 | 170174         3.50 | 129413
>>>
>>> --- ORDER 9 (2M folios) ---
>>>      N         offload/dma1          offload/dma4          offload/dma16
>>>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
>>> -------------------------------------------------------------------------
>>>    512         11.66 |    160        11.68 |    160        11.65 |    160
>>>   1024         12.16 |    310        13.67 |    275        13.64 |    276
>>>   2048         12.30 |    613        25.47 |    290        25.48 |    291
>>>   4096         12.48 |   1215        26.19 |    566        42.59 |    335
>>>   8192         12.56 |   2424        26.57 |   1118        58.72 |    470 *
>>>  16384         12.61 |   4839        26.77 |   2218        61.94 |    896
>>>  32768         12.60 |   9667        26.98 |   4422        63.75 |   1748
>>>  65536         12.63 |  19318        26.99 |   8838        60.66 |   3543
>>> 131072         12.64 |  38935        27.02 |  17935        61.06 |   7178
>>> 262144         12.66 |  77694        26.85 |  35871        65.06 |  14129
>>>
>>> In the batch-copy offload approach, the DMA copy phase is inserted between
>>> unmap/flush and move, so larger N increases first-folio wall-clock latency.
>>> Throughput improves, but with diminishing returns.
>>>
>>> For the DCBM+PTDMA setup, the optimal batch for 2M folios sits around
>>> N=8192-16384, because a larger batch allows the driver to distribute more
>>> folios across the available DMA channels.  This is where we get most of the
>>> throughput while keeping the first-folio latency in check.
>>>
>>> This optimal batch value is hardware-specific.  Other engines (e.g. SDXI)
>>> and memory tiers (e.g. CXL) will likely have different curves.
>>>
>>> Does this approach and experiment look good to you?

---
Best Regards,
Huang, Ying


Thread overview: 20+ messages
2026-04-28 15:50 [PATCH 0/7] Accelerate page migration with batch copying and hardware offload Shivank Garg
2026-04-28 15:50 ` [PATCH 1/7] mm/migrate: rename PAGE_ migration flags to FOLIO_ Shivank Garg
2026-04-30  9:07   ` Huang, Ying
2026-04-28 15:50 ` [PATCH 2/7] mm/migrate: use migrate_info field instead of private Shivank Garg
2026-05-07  9:43   ` Huang, Ying
2026-04-28 15:50 ` [PATCH 3/7] mm/migrate: skip data copy for already-copied folios Shivank Garg
2026-04-28 15:50 ` [PATCH 4/7] mm/migrate: add batch-copy path in migrate_pages_batch Shivank Garg
2026-04-28 15:50 ` [PATCH 5/7] mm/migrate: add copy offload registration infrastructure Shivank Garg
2026-04-28 15:50 ` [PATCH 6/7] drivers/migrate_offload: add DMA batch copy driver (dcbm) Shivank Garg
2026-04-28 15:50 ` [PATCH 7/7] mm/migrate: adjust NR_MAX_BATCHED_MIGRATION for testing Shivank Garg
2026-04-28 17:11 ` [PATCH 0/7] Accelerate page migration with batch copying and hardware offload Garg, Shivank
2026-04-28 19:33   ` David Hildenbrand (Arm)
2026-04-29  5:51     ` Garg, Shivank
2026-04-30  8:47 ` Huang, Ying
2026-05-08 11:04   ` Garg, Shivank
2026-05-08 11:28     ` Huang, Ying
2026-05-08 12:34       ` Garg, Shivank
2026-05-09  7:49         ` Huang, Ying [this message]
2026-05-10 15:03           ` Garg, Shivank
2026-05-07  9:58 ` Huang, Ying
