From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Huang, Ying" <ying.huang@linux.alibaba.com>
To: "Garg, Shivank"
Cc: akpm@linux-foundation.org, david@kernel.org, kinseyho@google.com, weixugc@google.com, ljs@kernel.org, Liam.Howlett@oracle.com, vbabka@kernel.org, willy@infradead.org, rppt@kernel.org, surenb@google.com, mhocko@suse.com, ziy@nvidia.com, matthew.brost@intel.com, joshua.hahnjy@gmail.com, rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net, apopple@nvidia.com, dave@stgolabs.net, Jonathan.Cameron@huawei.com, rkodsara@amd.com, vkoul@kernel.org, bharata@amd.com, sj@kernel.org, rientjes@google.com, xuezhengchu@huawei.com, yiannis@zptcorp.com, dave.hansen@intel.com, hannes@cmpxchg.org, jhubbard@nvidia.com, peterx@redhat.com, riel@surriel.com, shakeel.butt@linux.dev, stalexan@redhat.com, tj@kernel.org, nifan.cxl@gmail.com, jic23@kernel.org, aneesh.kumar@kernel.org, nathan.lynch@amd.com, Frank.li@nxp.com, djbw@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload
In-Reply-To: <22157c87-1465-46de-8e1c-5d99a90152a6@amd.com> (Shivank Garg's message of "Sun, 10 May 2026 20:33:47 +0530")
References: <20260428155043.39251-2-shivankg@amd.com> <87zf2kvnqy.fsf@DESKTOP-5N7EMDA> <152b9b5d-67c8-4a13-b8a8-be576a16eb8f@amd.com> <87mryaqgwg.fsf@DESKTOP-5N7EMDA> <87cxz5rpi3.fsf@DESKTOP-5N7EMDA> <22157c87-1465-46de-8e1c-5d99a90152a6@amd.com>
Date: Tue, 12 May 2026 10:15:27 +0800
Message-ID: <87a4u5weyo.fsf@DESKTOP-5N7EMDA>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

"Garg, Shivank" writes:

> On 5/9/2026 1:19 PM, Huang, Ying wrote:
>> "Garg, Shivank" writes:
>>
>>> On 5/8/2026 4:58 PM, Huang, Ying wrote:
>>>> Hi, Shivank,
>>>>
>>>> "Garg, Shivank" writes:
>>>>
>>>>> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>>>>>> Shivank Garg writes:
>>>>>>
>>>>>>> PERFORMANCE RESULTS:
>>>>>>> --------------------
>>>>>>>
>>>>>>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>>>>>>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>>>>>>> change in V5 alters this picture; please refer to the V4 cover letter
>>>>>>> for the throughput tables [1].
>>>>>>
>>>>>> IMHO, it's better to copy the performance data here.
>>>>>>
>>>>>> In addition to the performance benefit, I want to know the downside as
>>>>>> well.  For example, the migration latency of the first folio may be
>>>>>> longer.  If so, by how much?  Can you measure the batch number vs. total
>>>>>> migration time (benefit) and first-folio migration time (downside)?
>>>>>> That can be used to determine the optimal batch number.
>>>>>
>>>>> System Info: AMD Zen 3 EPYC server (2 sockets, 32 cores, SMT enabled),
>>>>> 1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.
>>>>>
>>>>> Benchmark: move_pages() syscall to move pages between two NUMA nodes.
>>>>>
>>>>> 1). Moving different-sized folios such that the total transfer size is
>>>>> constant (1GB), with different numbers of DMA channels. Throughput in GB/s.
>>>>>
>>>>> a. Baseline (vanilla kernel, single-threaded, serial folio_copy):
>>>>>
>>>>> ==========================================================================
>>>>>     4K     |    16K    |    64K    |   256K    |    1M     |     2M     |
>>>>> ==========================================================================
>>>>>  3.31±0.18 | 5.61±0.07 | 6.66±0.03 | 7.01±0.03 | 7.13±0.08 | 11.02±0.17 |
>>>>>
>>>>>
>>>>> b. DMA offload (patched kernel, dcbm driver, N DMA channels):
>>>>>
>>>>> =====================================================================================
>>>>> N channels |    4K     |    16K    |    64K    |   256K     |     1M     |     2M     |
>>>>> =====================================================================================
>>>>>  1         | 2.16±0.14 | 2.58±0.02 | 3.00±0.04 |  4.56±0.28 |  4.62±0.02 | 12.65±0.08 |
>>>>>  2         | 2.68±0.09 | 3.69±0.15 | 4.52±0.04 |  6.75±0.06 |  7.19±0.19 | 14.38±0.06 |
>>>>>  4         | 3.07±0.13 | 4.62±0.09 | 6.47±0.56 |  9.22±0.15 | 10.24±0.47 | 27.01±0.11 |
>>>>>  8         | 3.43±0.09 | 5.40±0.16 | 7.67±0.08 | 11.25±0.17 | 12.60±0.60 | 45.62±0.52 |
>>>>> 12         | 3.50±0.11 | 5.66±0.16 | 8.12±0.10 | 11.97±0.19 | 13.43±0.08 | 61.02±0.92 |
>>>>> 16         | 3.54±0.12 | 5.79±0.14 | 8.50±0.13 | 12.59±0.15 | 17.21±6.40 | 65.23±1.70 |
>>>>>
>>>>>
>>>>> 2). First-folio latency: instrumented with custom tracepoints to
>>>>> measure latency per migrate_pages_batch() call.
>>>>> Result: throughput (GB/s) and first-folio latency (in microseconds),
>>>>> median of 10 runs.
>>>>
>>>> Thanks for the detailed data.  Per my understanding, the run time of
>>>> migrate_pages_batch() may not be good enough for measuring first-folio
>>>> latency.  IIUC, the migration procedure is something like,
>>>>
>>>>   for each folio
>>>>           unmap
>>>>   flush
>>>>   for each folio
>>>>           copy
>>>>           remap  ===> first folio migrated
>>>>
>>>> A dedicated tracepoint should be better for measuring it.
>>>
>>> Sorry, my earlier write-up was unclear.
>>> For first-folio latency, I added two tracepoints: one at the start of
>>> migrate_pages_batch() and one in migrate_folio_done().
>>>
>>> I agree that the user-accessible tracepoint should be right after
>>> remove_migration_ptes().  Though migrate_folio_done() runs only a few
>>> operations later and will have a constant offset, so it's unlikely to
>>> change the shape of the trade-off curve.
>>> I'll move the tracepoint right after remove_migration_ptes() for the
>>> next posting.
>>
>> Thanks for the explanation.  A tracepoint in migrate_folio_done() should
>> be OK.
>>
>>>>> A). Vanilla kernel:
>>>>>
>>>>> Here, n = workload size passed to move_pages(), in folios.  Move n
>>>>> folios with move_pages().
>>>>> NR_MAX_BATCHED_MIGRATION is the upstream default value, 512.
>>>>>
>>>>> --- Order 0 (4K folios) ---
>>>>> n          vanilla/cpu
>>>>> (folios)   GB/s | first(us)
>>>>> --------------------------
>>>>> 1          0.04 |   24
>>>>> 4          0.16 |   25
>>>>> 8          0.29 |   31
>>>>> 16         0.54 |   27
>>>>> 64         1.15 |   68
>>>>> 256        1.86 |  162
>>>>> 512        2.21 |  264
>>>>> 2048       2.62 |  208
>>>>> 4096       2.74 |  182
>>>>> 16384      2.73 |  173
>>>>> 65536      3.28 |  166
>>>>> 262144     3.20 |  167
>>>>>
>>>>> --- Order 9 (2M folios) ---
>>>>> n          vanilla/cpu
>>>>> (folios)   GB/s | first(us)
>>>>> --------------------------
>>>>> 1          7.05 |  194
>>>>> 4          8.78 |  186
>>>>> 8          8.47 |  188
>>>>> 16         7.20 |  193
>>>>> 64         8.23 |  191
>>>>> 256       10.51 |  180
>>>>> 512       10.88 |  173
>>>>>
>>>>> Takeaway:
>>>>> In each migrate_pages_batch() call, folios are first unmapped, then
>>>>> try_to_unmap_flush() runs, and only then do folios enter
>>>>> move_to_new_folio().  So first-folio latency is bounded by the
>>>>> per-batch unmap+flush cost, and plateaus once the workload is large
>>>>> enough.
>>>>>
>>>>>
>>>>> B). Patched kernel:
>>>>>
>>>>> Here, N = NR_MAX_BATCHED_MIGRATION (in pages).  Total migrated data is
>>>>> fixed at 1 GB.
>>>>
>>>> Emm, so NR_MAX_BATCHED_MIGRATION could be very large?  I think that it
>>>> needs to be bounded.  If it is too large, too many pages may be in an
>>>> inaccessible state for a longer time.  That will hurt the workload
>>>> performance, although it is optimal for migration performance.
>>>
>>> Agreed, it must be bounded.
>>
>> Thanks!  Could you retest with a bounded NR_MAX_BATCHED_MIGRATION?  If
>> the upstream default doesn't work well for you, we can find a better one
>> that balances throughput and latency well.
>
> Thanks.  The tables below sweep NR_MAX_BATCHED_MIGRATION from 512 up to
> 262144.  With 2M folios on 16-channel PTDMA, the knee is at
> N=8192-16384 (= {16 to 32} * 512).
>
>>>>> 8192      12.56 |   2424      26.57 |   1118      58.72 |    470  *
>>>>> 2048      12.30 |    613      25.47 |    290      25.48 |    291

IIUC, N=2048 already helps dma4.  And the latency looks OK too.

The good batch size is hardware-configuration dependent too?  If so, we
may need to add another migrator callback for that.

> One thing worth flagging on the "bounded default": at the upstream cap
> of 512 pages, migrate_pages_batch() receives at most one 2M folio per
> call, so PTDMA can only use one of its 16 channels per batch and the
> offload reduces to vanilla.  (DCBM offloads one 2M folio to each
> channel.)
> The larger-N rows are what exercise the channel parallelism in the
> PTDMA case.
>
> SDXI[1]-like memory-to-memory data movers should reach good throughput
> with just 1 channel, and thus may not require increasing
> NR_MAX_BATCHED_MIGRATION for good throughput.
>
> I'm not tying this series to a specific performance default for now;
> the design review (batch-copy path, migrator interface, registration,
> static_call dispatch) is the part I'd like to converge on first, and
> the threshold can be tuned after that.  Does that ordering work?

IMHO, we need some performance data to justify the added complexity.
So, threshold tuning isn't the goal in itself; whether we can get better
throughput with some bounded latency is.

> [1] https://lore.kernel.org/all/20260410-sdxi-base-v1-0-1d184cb5c60a@amd.com
>
> Best regards,
> Shivank
>
>>>>> Change N with a knob to measure the impact of different max batch sizes.
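As an aside, the throughput/first-folio-latency trade-off being swept here can be sketched with a toy cost model.  All per-phase costs below are invented illustrative numbers, not values taken from the tables; the model only captures the structure (per-folio unmap/remap, one flush per batch, copies running C channels wide with one folio per channel, as described for DCBM with 2M folios):

```python
import math

def batch_model(n_folios, channels,
                t_unmap_us=2.0, t_flush_us=50.0,
                t_copy_us=120.0, t_remap_us=2.0,
                folio_bytes=2 << 20):
    """Toy model of one migrate_pages_batch() call.

    Per-phase costs are made-up illustrative numbers, NOT measurements:
    unmap/remap are paid per folio, the TLB flush once per batch, and
    DMA copies run `channels` wide, one folio per channel per wave.
    """
    waves = math.ceil(n_folios / channels)
    copy_us = waves * t_copy_us
    total_us = (n_folios * (t_unmap_us + t_remap_us)
                + t_flush_us + copy_us)
    # The first folio becomes user-accessible only after all unmaps,
    # the flush, and the whole copy phase have completed, plus its
    # own remap -- so its latency grows with the batch size.
    first_us = (n_folios * t_unmap_us + t_flush_us
                + copy_us + t_remap_us)
    gb_per_s = n_folios * folio_bytes / 2**30 / (total_us / 1e6)
    return gb_per_s, first_us

# Sweep 2M-folio batch sizes on a hypothetical 16-channel engine.
for n in (1, 4, 16, 32, 128):
    gbs, first = batch_model(n, channels=16)
    print(f"{n:4d} folios/batch: ~{gbs:6.1f} GB/s, first folio ~{first:7.0f} us")
```

The model reproduces the qualitative shape of the data: throughput saturates toward folio_bytes / (t_unmap + t_remap + t_copy/C) while first-folio latency keeps growing roughly linearly in N, so a bounded N with a knee is expected.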
>>>>>
>>>>> --- ORDER 0 (4K folios) ---
>>>>> N          offload/dma1        offload/dma4        offload/dma16
>>>>>            GB/s | first(us)    GB/s | first(us)    GB/s | first(us)
>>>>> ------------------------------------------------------------------------
>>>>> 512        2.13 |    639       3.23 |    290       3.27 |    253
>>>>> 1024       2.17 |   1261       3.44 |    582       3.58 |    536
>>>>> 2048       2.01 |   2769       3.09 |   1360       3.45 |   1083
>>>>> 4096       2.10 |   5059       3.13 |   2737       3.58 |   2115
>>>>> 8192       2.21 |   9320       3.17 |   5015       3.75 |   3617
>>>>> 16384      2.15 |  18689       3.31 |   9623       3.87 |   6937
>>>>> 32768      2.12 |  42692       3.38 |  18893       3.83 |  14255
>>>>> 65536      2.09 |  81956       3.38 |  38556       3.64 |  29003
>>>>> 131072     2.02 | 169563       3.22 |  81082       3.63 |  62236
>>>>> 262144     2.21 | 318424       3.12 | 170174       3.50 | 129413
>>>>>
>>>>> --- ORDER 9 (2M folios) ---
>>>>> N          offload/dma1        offload/dma4        offload/dma16
>>>>>            GB/s | first(us)    GB/s | first(us)    GB/s | first(us)
>>>>> -------------------------------------------------------------------------
>>>>> 512       11.66 |    160      11.68 |    160      11.65 |    160
>>>>> 1024      12.16 |    310      13.67 |    275      13.64 |    276
>>>>> 2048      12.30 |    613      25.47 |    290      25.48 |    291
>>>>> 4096      12.48 |   1215      26.19 |    566      42.59 |    335
>>>>> 8192      12.56 |   2424      26.57 |   1118      58.72 |    470  *
>>>>> 16384     12.61 |   4839      26.77 |   2218      61.94 |    896
>>>>> 32768     12.60 |   9667      26.98 |   4422      63.75 |   1748
>>>>> 65536     12.63 |  19318      26.99 |   8838      60.66 |   3543
>>>>> 131072    12.64 |  38935      27.02 |  17935      61.06 |   7178
>>>>> 262144    12.66 |  77694      26.85 |  35871      65.06 |  14129
>>>>>
>>>>> In the batch-copy offload approach, the DMA copy phase is inserted
>>>>> between unmap/flush and move, so larger N increases first-folio
>>>>> wall-clock latency.  Throughput improves, but with diminishing
>>>>> returns.
>>>>>
>>>>> For the DCBM+PTDMA setup, the optimal batch for 2M folios sits around
>>>>> N=8192-16384, because a larger batch allows the driver to distribute
>>>>> more folios across the available DMA channels.
>>>>> This is where we get the most throughput while keeping the
>>>>> first-folio latency in check.
>>>>>
>>>>> This optimal batch value is hardware-specific.  Other engines
>>>>> (e.g. SDXI) and memory tiers (e.g. CXL) will likely have different
>>>>> curves.
>>>>>
>>>>> Does this approach and experiment look good to you?

---
Best Regards,
Huang, Ying