From: "Garg, Shivank" <shivankg@amd.com>
To: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: akpm@linux-foundation.org, david@kernel.org, kinseyho@google.com,
	weixugc@google.com, ljs@kernel.org, Liam.Howlett@oracle.com,
	vbabka@kernel.org, willy@infradead.org, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, ziy@nvidia.com,
	matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net,
	apopple@nvidia.com, dave@stgolabs.net,
	Jonathan.Cameron@huawei.com, rkodsara@amd.com, vkoul@kernel.org,
	bharata@amd.com, sj@kernel.org, rientjes@google.com,
	xuezhengchu@huawei.com, yiannis@zptcorp.com,
	dave.hansen@intel.com, hannes@cmpxchg.org, jhubbard@nvidia.com,
	peterx@redhat.com, riel@surriel.com, shakeel.butt@linux.dev,
	stalexan@redhat.com, tj@kernel.org, nifan.cxl@gmail.com,
	jic23@kernel.org, aneesh.kumar@kernel.org, nathan.lynch@amd.com,
	Frank.li@nxp.com, djbw@kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload
Date: Fri, 8 May 2026 16:34:22 +0530
Message-ID: <152b9b5d-67c8-4a13-b8a8-be576a16eb8f@amd.com>
In-Reply-To: <87zf2kvnqy.fsf@DESKTOP-5N7EMDA>



On 4/30/2026 2:17 PM, Huang, Ying wrote:
> Shivank Garg <shivankg@amd.com> writes:

>> PERFORMANCE RESULTS:
>> --------------------
>>
>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>> change in V5 alters this picture; please refer to the V4 cover letter
>> for the throughput tables [1].
> 
> IMHO, it's better to copy performance data here.
> 
> In addition to the performance benefit, I want to know the downside as
> well.  For example, the migration latency of the first folio may be
> longer.  If so, by how much?  Can you measure the batch number vs. total
> migration time (benefit) and first folio migration time (downside)?
> That can be used to determine the optimal batch number.
> 

System info: AMD Zen 3 EPYC server (2 sockets, 32 cores, SMT enabled),
1 NUMA node per socket, v7.1-rc1, DVFS set to performance, PTDMA hardware.

Benchmark: move_pages() syscall to move pages between two NUMA nodes.
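
For context, each data point below is essentially a timed move_pages() call over a
pre-faulted buffer. A minimal single-call sketch (not the exact harness; the destination
node, sizes, and numactl pinning are assumptions, and the large-folio runs additionally
need THP-backed memory):

#include <numaif.h>             /* move_pages(2); link with -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NPAGES  262144UL                /* 1 GiB of 4 KiB pages */
#define PAGE_SZ 4096UL

int main(void)
{
	/* Run under "numactl --membind=0" so the buffer faults in on the
	 * source node; node 1 below is the assumed destination. */
	char *buf = aligned_alloc(PAGE_SZ, NPAGES * PAGE_SZ);
	void **pages = malloc(NPAGES * sizeof(*pages));
	int *nodes = malloc(NPAGES * sizeof(*nodes));
	int *status = malloc(NPAGES * sizeof(*status));
	struct timespec t0, t1;

	memset(buf, 1, NPAGES * PAGE_SZ);       /* fault every page */
	for (unsigned long i = 0; i < NPAGES; i++) {
		pages[i] = buf + i * PAGE_SZ;
		nodes[i] = 1;                   /* destination NUMA node */
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	long rc = move_pages(0, NPAGES, pages, nodes, status, MPOL_MF_MOVE);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	if (rc < 0)
		perror("move_pages");

	double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.2f GiB/s\n", 1.0 / sec);      /* exactly 1 GiB moved */
	return 0;
}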

1). Moving folios of different sizes such that the total transfer size is constant
(1 GB), with varying numbers of DMA channels. Throughput in GB/s.

a. Baseline (vanilla kernel, single-threaded, serial folio_copy):

================================================================================
4K          | 16K        | 64K        | 256K       | 1M          | 2M          |
================================================================================
3.31±0.18   | 5.61±0.07  | 6.66±0.03  | 7.01±0.03  | 7.13±0.08   | 11.02±0.17  |


b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):

============================================================================================
N chan.| 4K          | 16K         | 64K         | 256K        | 1M          | 2M          |
============================================================================================
1      | 2.16±0.14   | 2.58±0.02   | 3.00±0.04   | 4.56±0.28   | 4.62±0.02   | 12.65±0.08  |
2      | 2.68±0.09   | 3.69±0.15   | 4.52±0.04   | 6.75±0.06   | 7.19±0.19   | 14.38±0.06  |
4      | 3.07±0.13   | 4.62±0.09   | 6.47±0.56   | 9.22±0.15   | 10.24±0.47  | 27.01±0.11  |
8      | 3.43±0.09   | 5.40±0.16   | 7.67±0.08   | 11.25±0.17  | 12.60±0.60  | 45.62±0.52  |
12     | 3.50±0.11   | 5.66±0.16   | 8.12±0.10   | 11.97±0.19  | 13.43±0.08  | 61.02±0.92  |
16     | 3.54±0.12   | 5.79±0.14   | 8.50±0.13   | 12.59±0.15  | 17.21±6.40  | 65.23±1.70  |


2). First-folio latency: instrumented with custom tracepoints to measure the latency of
    each migrate_pages_batch() call.
    Results: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.

A). Vanilla kernel:

Here, n = workload size passed to move_pages(), in folios; each run moves n folios in a
single move_pages() call. NR_MAX_BATCHED_MIGRATION is at its upstream default of 512.
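(For scale: n = 262144 order-0 folios is 1 GiB moved in one call; n = 512 order-9 folios
is likewise 1 GiB.)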

--- Order 0 (4K folios) ---
     n      vanilla/cpu
(folios)    GB/s | first(us)
--------------------------
     1       0.04 |     24
     4       0.16 |     25
     8       0.29 |     31
    16       0.54 |     27
    64       1.15 |     68
   256       1.86 |    162
   512       2.21 |    264
  2048       2.62 |    208
  4096       2.74 |    182
 16384       2.73 |    173
 65536       3.28 |    166
262144       3.20 |    167

--- Order 9 (2M folios) ---
     n      vanilla/cpu
(folios)    GB/s | first(us)
--------------------------
     1       7.05 |    194
     4       8.78 |    186
     8       8.47 |    188
    16       7.20 |    193
    64       8.23 |    191
   256      10.51 |    180
   512      10.88 |    173

Takeaway:
In each migrate_pages_batch() call, folios are first unmapped, then try_to_unmap_flush()
runs, and only then do folios enter move_to_new_folio(). So first-folio latency is bounded
by the per-batch unmap+flush cost, and it plateaus once the workload is large enough.
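
A rough model consistent with these numbers (my reading of the data, not a separately
measured breakdown):

    first_folio_latency ~= T_unmap(batch) + T_flush,  batch = min(n, NR_MAX_BATCHED_MIGRATION)

Because the batch is capped at 512 pages, the first-folio cost stops growing once n fills
one batch, hence the ~165-265 us plateau in the order-0 table. For order 9, a single 2M
folio (512 pages) already fills a batch, which is why first(us) stays flat at ~170-195 us
for every n.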


B). Patched kernel:

Here, N = NR_MAX_BATCHED_MIGRATION (in pages). Total migrated data is fixed at 1 GB.
N is varied via a knob to measure the impact of different maximum batch sizes.
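
For reference, the knob used here is along these lines (a hypothetical sketch, not the
actual patch; the real name and location may differ):

/* mm/migrate.c (sketch): make the compile-time NR_MAX_BATCHED_MIGRATION
 * limit a runtime variable and expose it via debugfs as
 * /sys/kernel/debug/nr_max_batched_migration. */
#include <linux/debugfs.h>
#include <linux/init.h>

static u32 nr_max_batched_migration = 512;	/* was NR_MAX_BATCHED_MIGRATION */

static int __init migrate_batch_debugfs_init(void)
{
	debugfs_create_u32("nr_max_batched_migration", 0644, NULL,
			   &nr_max_batched_migration);
	return 0;
}
late_initcall(migrate_batch_debugfs_init);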

--- ORDER 0 (4K folios) ---
     N         offload/dma1          offload/dma4          offload/dma16
               GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
------------------------------------------------------------------------
   512         2.13 |    639         3.23 |    290         3.27 |    253
  1024         2.17 |   1261         3.44 |    582         3.58 |    536
  2048         2.01 |   2769         3.09 |   1360         3.45 |   1083
  4096         2.10 |   5059         3.13 |   2737         3.58 |   2115 
  8192         2.21 |   9320         3.17 |   5015         3.75 |   3617 
 16384         2.15 |  18689         3.31 |   9623         3.87 |   6937
 32768         2.12 |  42692         3.38 |  18893         3.83 |  14255
 65536         2.09 |  81956         3.38 |  38556         3.64 |  29003
131072         2.02 | 169563         3.22 |  81082         3.63 |  62236
262144         2.21 | 318424         3.12 | 170174         3.50 | 129413

--- ORDER 9 (2M folios) ---
     N         offload/dma1          offload/dma4          offload/dma16
               GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
-------------------------------------------------------------------------
   512         11.66 |    160        11.68 |    160        11.65 |    160
  1024         12.16 |    310        13.67 |    275        13.64 |    276
  2048         12.30 |    613        25.47 |    290        25.48 |    291
  4096         12.48 |   1215        26.19 |    566        42.59 |    335
  8192         12.56 |   2424        26.57 |   1118        58.72 |    470 *
 16384         12.61 |   4839        26.77 |   2218        61.94 |    896
 32768         12.60 |   9667        26.98 |   4422        63.75 |   1748
 65536         12.63 |  19318        26.99 |   8838        60.66 |   3543
131072         12.64 |  38935        27.02 |  17935        61.06 |   7178
262144         12.66 |  77694        26.85 |  35871        65.06 |  14129

In the batch-copy offload approach, the DMA copy phase is inserted between unmap/flush and
move, so a larger N increases first-folio wall-clock latency. Throughput improves, but with
diminishing returns.
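
Once the copy phase dominates, first-folio latency is roughly the time to move one full
batch (back-of-the-envelope from the table above, not a separate measurement):

    first(N) ~= N * 4 KiB / steady-state copy throughput

e.g. order 9, dma1, N=8192 (32 MiB): 32 MiB / 12.6 GB/s ~= 2.7 ms, close to the measured
2424 us.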

For the DCBM+PTDMA setup, the optimal batch for 2M folios sits around N=8192-16384,
because a larger batch lets the driver distribute more folios across the available DMA
channels. This is where we get most of the throughput while keeping first-folio latency
in check.

This optimal batch value is hardware-specific. Other engines (e.g. SDXI) and memory tiers
(e.g. CXL) will likely have different curves.

Do this approach and experiment look good to you?

Thanks,
Shivank



