From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Huang, Ying" <ying.huang@linux.alibaba.com>
To: "Garg, Shivank"
Cc: akpm@linux-foundation.org, david@kernel.org, kinseyho@google.com,
	weixugc@google.com, ljs@kernel.org, Liam.Howlett@oracle.com,
	vbabka@kernel.org, willy@infradead.org, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, ziy@nvidia.com,
	matthew.brost@intel.com, joshua.hahnjy@gmail.com, rakie.kim@sk.com,
	byungchul@sk.com, gourry@gourry.net, apopple@nvidia.com,
	dave@stgolabs.net, Jonathan.Cameron@huawei.com, rkodsara@amd.com,
	vkoul@kernel.org, bharata@amd.com, sj@kernel.org, rientjes@google.com,
	xuezhengchu@huawei.com, yiannis@zptcorp.com, dave.hansen@intel.com,
	hannes@cmpxchg.org, jhubbard@nvidia.com, peterx@redhat.com,
	riel@surriel.com, shakeel.butt@linux.dev, stalexan@redhat.com,
	tj@kernel.org, nifan.cxl@gmail.com, jic23@kernel.org,
	aneesh.kumar@kernel.org, nathan.lynch@amd.com, Frank.li@nxp.com,
	djbw@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 0/7] Accelerate page migration with batch copying and
	hardware offload
In-Reply-To: (Shivank Garg's message of "Fri, 8 May 2026 18:04:34 +0530")
References: <20260428155043.39251-2-shivankg@amd.com>
	<87zf2kvnqy.fsf@DESKTOP-5N7EMDA>
	<152b9b5d-67c8-4a13-b8a8-be576a16eb8f@amd.com>
	<87mryaqgwg.fsf@DESKTOP-5N7EMDA>
Date: Sat, 09 May 2026 15:49:56 +0800
Message-ID: <87cxz5rpi3.fsf@DESKTOP-5N7EMDA>
User-Agent: Gnus/5.13 (Gnus v5.13)
X-Mailing-List: linux-kernel@vger.kernel.org

"Garg, Shivank" writes:

> On 5/8/2026 4:58 PM, Huang, Ying wrote:
>> Hi, Shivank,
>>
>> "Garg, Shivank" writes:
>>
>>> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>>>> Shivank Garg writes:
>>>
>>>>> PERFORMANCE RESULTS:
>>>>> --------------------
>>>>>
>>>>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>>>>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>>>>> change in V5 alters this picture; please refer to the V4 cover letter
>>>>> for the throughput tables [1].
>>>>
>>>> IMHO, it's better to copy the performance data here.
>>>>
>>>> In addition to the performance benefit, I want to know the downsides as
>>>> well. For example, the migration latency of the first folio may be
>>>> longer. If so, by how much? Can you measure the batch number vs. total
>>>> migration time (benefit) and first-folio migration time (downside)?
>>>> That can be used to determine the optimal batch number.
>>>>
>>>
>>> System info: AMD Zen 3 EPYC server (2 sockets, 32 cores, SMT enabled),
>>> 1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.
>>>
>>> Benchmark: move_pages() syscall moving pages between two NUMA nodes.
>>>
>>> 1). Moving different-sized folios such that the total transfer size is
>>> constant (1 GB), with different numbers of DMA channels. Throughput in GB/s.
>>>
>>> a.
>>> Baseline (vanilla kernel, single-threaded, serial folio_copy):
>>>
>>> ==========================================================================
>>>  4K        | 16K       | 64K       | 256K      | 1M        | 2M         |
>>> ==========================================================================
>>>  3.31±0.18 | 5.61±0.07 | 6.66±0.03 | 7.01±0.03 | 7.13±0.08 | 11.02±0.17 |
>>>
>>>
>>> b. DMA offload (patched kernel, dcbm driver, N DMA channels):
>>>
>>> =======================================================================================
>>> N channel | 4K        | 16K       | 64K       | 256K       | 1M         | 2M          |
>>> =======================================================================================
>>> 1         | 2.16±0.14 | 2.58±0.02 | 3.00±0.04 | 4.56±0.28  | 4.62±0.02  | 12.65±0.08  |
>>> 2         | 2.68±0.09 | 3.69±0.15 | 4.52±0.04 | 6.75±0.06  | 7.19±0.19  | 14.38±0.06  |
>>> 4         | 3.07±0.13 | 4.62±0.09 | 6.47±0.56 | 9.22±0.15  | 10.24±0.47 | 27.01±0.11  |
>>> 8         | 3.43±0.09 | 5.40±0.16 | 7.67±0.08 | 11.25±0.17 | 12.60±0.60 | 45.62±0.52  |
>>> 12        | 3.50±0.11 | 5.66±0.16 | 8.12±0.10 | 11.97±0.19 | 13.43±0.08 | 61.02±0.92  |
>>> 16
          | 3.54±0.12 | 5.79±0.14 | 8.50±0.13 | 12.59±0.15 | 17.21±6.40 | 65.23±1.70  |
>>>
>>>
>>> 2). First-folio latency: instrumented with custom tracepoints to
>>> measure the latency per migrate_pages_batch() call.
>>> Result: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.
>>
>> Thanks for the detailed data. Per my understanding, the run time of
>> migrate_pages_batch() may not be good enough for measuring first-folio
>> latency. IIUC, the migration procedure is something like,
>>
>> for each folio
>>         unmap
>> flush
>> for each folio
>>         copy
>>         remap        ===> first folio migrated
>>
>> Some tracepoint should be better to measure it.
>
> Sorry, my earlier write-up was unclear.
> For first-folio latency, I added two tracepoints: one at the start of
> migrate_pages_batch() and one in migrate_folio_done().
>
> I agree that the user-visible tracepoint should sit right after
> remove_migration_ptes(). Though migrate_folio_done() runs only a few
> operations later and adds a constant offset, so it's unlikely to change
> the shape of the trade-off curve.
> I'll move the tracepoint right after remove_migration_ptes() for the next posting.

Thanks for the explanation. A tracepoint in migrate_folio_done() should be OK.

>>
>>> A). Vanilla kernel:
>>>
>>> Here, n = workload size passed to move_pages(), in folios: move n folios
>>> with move_pages(). NR_MAX_BATCHED_MIGRATION is the upstream default, 512.
>>>
>>> --- Order 0 (4K folios) ---
>>> n         vanilla/cpu
>>> (folios)  GB/s | first(us)
>>> --------------------------
>>> 1         0.04 | 24
>>> 4         0.16 | 25
>>> 8         0.29 | 31
>>> 16        0.54 | 27
>>> 64        1.15 | 68
>>> 256       1.86 | 162
>>> 512       2.21 | 264
>>> 2048      2.62 | 208
>>> 4096      2.74 | 182
>>> 16384     2.73 | 173
>>> 65536     3.28 | 166
>>> 262144    3.20 | 167
>>>
>>> --- Order 9 (2M folios) ---
>>> n         vanilla/cpu
>>> (folios)  GB/s | first(us)
>>> --------------------------
>>> 1         7.05  | 194
>>> 4         8.78  | 186
>>> 8         8.47  | 188
>>> 16        7.20  | 193
>>> 64        8.23  | 191
>>> 256       10.51 | 180
>>> 512       10.88 | 173
>>>
>>> Takeaway:
>>> In each migrate_pages_batch() call, folios are first unmapped, then
>>> try_to_unmap_flush() runs, and only then do folios enter move_to_new_folio().
>>> So first-folio latency is bounded by the per-batch unmap+flush cost, and it
>>> plateaus once the workload is large enough.
>>>
>>>
>>> B). Patched kernel:
>>>
>>> Here, N = NR_MAX_BATCHED_MIGRATION (in pages). Total migrated data is fixed at 1 GB.
>>
>> Hmm, so NR_MAX_BATCHED_MIGRATION could be very large? I think that it
>> needs to be bounded. If it is too large, too many pages may be in an
>> inaccessible state for a longer time. That will hurt workload
>> performance, although it is optimal for migration performance.
>>
>
> Agreed, it must be bounded.

Thanks! Could you retest with a bounded NR_MAX_BATCHED_MIGRATION? If the
upstream default doesn't work well for you, we can find a better one that
balances throughput and latency well.

>>> Change N with a knob to measure the impact of different maximum batch sizes.
>>>
>>> --- Order 0 (4K folios) ---
>>> N        offload/dma1      offload/dma4      offload/dma16
>>>          GB/s | first(us)  GB/s | first(us)  GB/s | first(us)
>>> ------------------------------------------------------------------------
>>> 512      2.13 | 639        3.23 | 290        3.27 | 253
>>> 1024     2.17 | 1261       3.44 | 582        3.58 | 536
>>> 2048     2.01 | 2769       3.09 | 1360       3.45 | 1083
>>> 4096     2.10 | 5059       3.13 | 2737       3.58 | 2115
>>> 8192     2.21 | 9320       3.17 | 5015       3.75 | 3617
>>> 16384    2.15 | 18689      3.31 | 9623       3.87 | 6937
>>> 32768    2.12 | 42692      3.38 | 18893      3.83 | 14255
>>> 65536    2.09 | 81956      3.38 | 38556      3.64 | 29003
>>> 131072   2.02 | 169563     3.22 | 81082      3.63 | 62236
>>> 262144   2.21 | 318424     3.12 | 170174     3.50 | 129413
>>>
>>> --- Order 9 (2M folios) ---
>>> N        offload/dma1      offload/dma4      offload/dma16
>>>          GB/s | first(us)  GB/s | first(us)  GB/s | first(us)
>>> ------------------------------------------------------------------------
>>> 512      11.66 | 160       11.68 | 160       11.65 | 160
>>> 1024     12.16 | 310       13.67 | 275       13.64 | 276
>>> 2048     12.30 | 613       25.47 | 290       25.48 | 291
>>> 4096     12.48 | 1215      26.19 | 566       42.59 | 335
>>> 8192     12.56 | 2424      26.57 | 1118      58.72 | 470 *
>>> 16384    12.61 | 4839      26.77 | 2218      61.94 | 896
>>> 32768    12.60 | 9667      26.98 | 4422      63.75 | 1748
>>> 65536    12.63 | 19318     26.99 | 8838      60.66 | 3543
>>> 131072   12.64 | 38935     27.02 | 17935     61.06 | 7178
>>> 262144   12.66 | 77694     26.85 | 35871     65.06 | 14129
>>>
>>> In the batch-copy offload approach, the DMA copy phase is inserted between
>>> unmap/flush and move, so a larger N increases first-folio wall-clock latency.
>>> Throughput improves, but with diminishing returns.
>>>
>>> For the DCBM+PTDMA setup, the optimal batch for 2M folios sits around
>>> N=8192-16384, because a larger batch lets the driver distribute more folios
>>> across the available DMA channels. This is where we get the most throughput
>>> while keeping first-folio latency in check.
>>>
>>> This optimal batch value is hardware-specific. Other engines (e.g.
SDXI) and memory tiers (e.g. CXL)
>>> will likely have different curves.
>>>
>>> Does this approach and experiment look good to you?

---

Best Regards,
Huang, Ying