From: "Huang, Ying" <ying.huang@linux.alibaba.com>
To: "Garg, Shivank" <shivankg@amd.com>
Cc: akpm@linux-foundation.org, david@kernel.org, kinseyho@google.com, weixugc@google.com, ljs@kernel.org, Liam.Howlett@oracle.com, vbabka@kernel.org, willy@infradead.org, rppt@kernel.org, surenb@google.com, mhocko@suse.com, ziy@nvidia.com, matthew.brost@intel.com, joshua.hahnjy@gmail.com, rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net, apopple@nvidia.com, dave@stgolabs.net, Jonathan.Cameron@huawei.com, rkodsara@amd.com, vkoul@kernel.org, bharata@amd.com, sj@kernel.org, rientjes@google.com, xuezhengchu@huawei.com, yiannis@zptcorp.com, dave.hansen@intel.com, hannes@cmpxchg.org, jhubbard@nvidia.com, peterx@redhat.com, riel@surriel.com, shakeel.butt@linux.dev, stalexan@redhat.com, tj@kernel.org, nifan.cxl@gmail.com, jic23@kernel.org, aneesh.kumar@kernel.org, nathan.lynch@amd.com, Frank.li@nxp.com, djbw@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload
In-Reply-To: <152b9b5d-67c8-4a13-b8a8-be576a16eb8f@amd.com> (Shivank Garg's message of "Fri, 8 May 2026 16:34:22 +0530")
References: <20260428155043.39251-2-shivankg@amd.com> <87zf2kvnqy.fsf@DESKTOP-5N7EMDA> <152b9b5d-67c8-4a13-b8a8-be576a16eb8f@amd.com>
Date: Fri, 08 May 2026 19:28:47 +0800
Message-ID: <87mryaqgwg.fsf@DESKTOP-5N7EMDA>

Hi, Shivank,

"Garg, Shivank" <shivankg@amd.com> writes:

> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>> Shivank Garg writes:
>
>>> PERFORMANCE RESULTS:
>>> --------------------
>>>
>>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>>> change in V5 alters this picture; please refer to the V4 cover letter
>>> for the throughput tables [1].
>>
>> IMHO, it's better to copy the performance data here.
>>
>> In addition to the performance benefit, I want to know the downside as
>> well. For example, the migration latency of the first folio may be
>> longer. If so, by how much? Can you measure the batch number vs. total
>> migration time (benefit) and first-folio migration time (downside)?
>> That can be used to determine the optimal batch number.
>>
>
> System Info: AMD Zen 3 EPYC server (2 sockets, 32 cores, SMT enabled),
> 1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.
>
> Benchmark: move_pages() syscall to move pages between two NUMA nodes.
>
> 1). Moving different-sized folios such that the total transfer size is
> constant (1 GB), with different numbers of DMA channels. Throughput in GB/s.
>
> a. Baseline (vanilla kernel, single-threaded, serial folio_copy):
>
> ==========================================================================
>  4K        | 16K       | 64K       | 256K      | 1M        | 2M         |
> ==========================================================================
>  3.31±0.18 | 5.61±0.07 | 6.66±0.03 | 7.01±0.03 | 7.13±0.08 | 11.02±0.17 |
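For reference, a harness for this kind of measurement can be as small as
the sketch below. The 1 GB working set, base-page granularity, and
destination node 1 are illustrative assumptions, not necessarily the
exact benchmark behind these tables; build with -lnuma:

#include <numaif.h>	/* move_pages(), MPOL_MF_MOVE */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	unsigned long count = (1UL << 30) / psz;	/* 1 GB of base pages */
	void **pages = malloc(count * sizeof(*pages));
	int *nodes = malloc(count * sizeof(*nodes));
	int *status = malloc(count * sizeof(*status));
	char *buf;

	if (!pages || !nodes || !status ||
	    posix_memalign((void **)&buf, psz, count * psz))
		return 1;
	memset(buf, 1, count * psz);	/* fault pages in on the local node */

	for (unsigned long i = 0; i < count; i++) {
		pages[i] = buf + i * psz;
		nodes[i] = 1;		/* destination NUMA node (assumed) */
	}

	/* Time this call to derive GB/s; status[] has per-page results. */
	if (move_pages(0, count, pages, nodes, status, MPOL_MF_MOVE) < 0) {
		perror("move_pages");
		return 1;
	}
	printf("moved %lu pages\n", count);
	return 0;
}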
>
> b. DMA offload (patched kernel, dcbm driver, N DMA channels):
>
> =====================================================================================
>  N channels | 4K        | 16K       | 64K       | 256K       | 1M         | 2M         |
> =====================================================================================
>  1          | 2.16±0.14 | 2.58±0.02 | 3.00±0.04 | 4.56±0.28  | 4.62±0.02  | 12.65±0.08 |
>  2          | 2.68±0.09 | 3.69±0.15 | 4.52±0.04 | 6.75±0.06  | 7.19±0.19  | 14.38±0.06 |
>  4          | 3.07±0.13 | 4.62±0.09 | 6.47±0.56 | 9.22±0.15  | 10.24±0.47 | 27.01±0.11 |
>  8          | 3.43±0.09 | 5.40±0.16 | 7.67±0.08 | 11.25±0.17 | 12.60±0.60 | 45.62±0.52 |
>  12         | 3.50±0.11 | 5.66±0.16 | 8.12±0.10 | 11.97±0.19 | 13.43±0.08 | 61.02±0.92 |
>  16         | 3.54±0.12 | 5.79±0.14 | 8.50±0.13 | 12.59±0.15 | 17.21±6.40 | 65.23±1.70 |
>
> 2). First-folio latency: Instrumented with custom tracepoints to measure
> latency per migrate_pages_batch() call.
> Result: throughput (GB/s) and first-folio latency (in microseconds),
> median of 10 runs.

Thanks for the detailed data.

Per my understanding, the run time of migrate_pages_batch() may not be
good enough for measuring first-folio latency. IIUC, the migration
procedure is something like,

	for each folio
		unmap
	flush
	for each folio
		copy
		remap
	===> first folio migrated

A dedicated tracepoint would be better for measuring it.

> A). Vanilla kernel:
>
> Here, n = workload size passed to move_pages(), in folios; i.e., move n
> folios with one move_pages() call.
> NR_MAX_BATCHED_MIGRATION is at its upstream default value of 512.
>
> --- Order 0 (4K folios) ---
> n          vanilla/cpu
> (folios)   GB/s | first(us)
> ---------------------------
> 1          0.04 | 24
> 4          0.16 | 25
> 8          0.29 | 31
> 16         0.54 | 27
> 64         1.15 | 68
> 256        1.86 | 162
> 512        2.21 | 264
> 2048       2.62 | 208
> 4096       2.74 | 182
> 16384      2.73 | 173
> 65536      3.28 | 166
> 262144     3.20 | 167
>
> --- Order 9 (2M folios) ---
> n          vanilla/cpu
> (folios)   GB/s  | first(us)
> ---------------------------
> 1           7.05 | 194
> 4           8.78 | 186
> 8           8.47 | 188
> 16          7.20 | 193
> 64          8.23 | 191
> 256        10.51 | 180
> 512        10.88 | 173
>
> Takeaway:
> In each migrate_pages_batch() call, folios are first unmapped, then
> try_to_unmap_flush() runs, and only then do folios enter
> move_to_new_folio(). So first-folio latency is bounded by the per-batch
> unmap+flush cost, and it plateaus once the workload is large enough.
>
>
> B). Patched kernel:
>
> Here, N = NR_MAX_BATCHED_MIGRATION (in pages). Total migrated data is
> fixed at 1 GB.

Emm, so NR_MAX_BATCHED_MIGRATION could be very large? I think that it
needs to be bounded. If it is too large, too many pages may be in an
inaccessible state for a longer time. That will hurt workload
performance, even though it is optimal for migration performance.

> Change N with a knob to measure the impact of different maximum batch
> sizes.
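Upstream, NR_MAX_BATCHED_MIGRATION is a compile-time constant in
mm/migrate.c, so a runtime "knob" implies custom instrumentation. A
minimal sketch of what such a knob could look like as a debugfs file
follows; the knob name and wiring are assumptions for illustration, not
necessarily what was used for these runs:

/* Hypothetical debugfs knob standing in for the compile-time
 * NR_MAX_BATCHED_MIGRATION constant, for batch-size experiments only.
 */
#include <linux/debugfs.h>
#include <linux/init.h>
#include <linux/types.h>

static u32 nr_max_batched_migration = 512;	/* pages; HPAGE_PMD_NR on x86 */

static int __init migrate_batch_knob_init(void)
{
	/* Creates /sys/kernel/debug/nr_max_batched_migration. */
	debugfs_create_u32("nr_max_batched_migration", 0644, NULL,
			   &nr_max_batched_migration);
	return 0;
}
late_initcall(migrate_batch_knob_init);

/* migrate_pages_batch() callers would then read this variable instead
 * of the NR_MAX_BATCHED_MIGRATION macro when sizing each batch. */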
>
> --- ORDER 0 (4K folios) ---
> N        offload/dma1       offload/dma4       offload/dma16
>          GB/s | first(us)   GB/s | first(us)   GB/s | first(us)
> ------------------------------------------------------------------------
> 512      2.13 | 639         3.23 | 290         3.27 | 253
> 1024     2.17 | 1261        3.44 | 582         3.58 | 536
> 2048     2.01 | 2769        3.09 | 1360        3.45 | 1083
> 4096     2.10 | 5059        3.13 | 2737        3.58 | 2115
> 8192     2.21 | 9320        3.17 | 5015        3.75 | 3617
> 16384    2.15 | 18689       3.31 | 9623        3.87 | 6937
> 32768    2.12 | 42692       3.38 | 18893       3.83 | 14255
> 65536    2.09 | 81956       3.38 | 38556       3.64 | 29003
> 131072   2.02 | 169563      3.22 | 81082       3.63 | 62236
> 262144   2.21 | 318424      3.12 | 170174      3.50 | 129413
>
> --- ORDER 9 (2M folios) ---
> N        offload/dma1       offload/dma4       offload/dma16
>          GB/s | first(us)   GB/s | first(us)   GB/s | first(us)
> -------------------------------------------------------------------------
> 512      11.66 | 160        11.68 | 160        11.65 | 160
> 1024     12.16 | 310        13.67 | 275        13.64 | 276
> 2048     12.30 | 613        25.47 | 290        25.48 | 291
> 4096     12.48 | 1215       26.19 | 566        42.59 | 335
> 8192     12.56 | 2424       26.57 | 1118       58.72 | 470 *
> 16384    12.61 | 4839       26.77 | 2218       61.94 | 896
> 32768    12.60 | 9667       26.98 | 4422       63.75 | 1748
> 65536    12.63 | 19318      26.99 | 8838       60.66 | 3543
> 131072   12.64 | 38935      27.02 | 17935      61.06 | 7178
> 262144   12.66 | 77694      26.85 | 35871      65.06 | 14129
>
> In the batch-copy offload approach, the DMA copy phase is inserted
> between unmap/flush and move, so a larger N increases first-folio
> wall-clock latency. Throughput improves, but with diminishing returns.
>
> For the DCBM+PTDMA setup, the optimal batch for 2M folios sits around
> N=8192-16384, because a larger batch allows the driver to distribute
> more folios across the available DMA channels. This is where we get the
> most throughput while keeping first-folio latency in check.
>
> This optimal batch value is hardware-specific. Other engines (e.g.,
> SDXI) and memory tiers (e.g., CXL) will likely have different curves.
>
> Does this approach and experiment look good to you?

---
Best Regards,
Huang, Ying