From: "Huang, Ying" <ying.huang@linux.alibaba.com>
To: "Garg, Shivank"
Cc: akpm@linux-foundation.org, david@kernel.org, kinseyho@google.com,
	weixugc@google.com, ljs@kernel.org, Liam.Howlett@oracle.com,
	vbabka@kernel.org, willy@infradead.org, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, ziy@nvidia.com,
	matthew.brost@intel.com, joshua.hahnjy@gmail.com, rakie.kim@sk.com,
	byungchul@sk.com, gourry@gourry.net, apopple@nvidia.com,
	dave@stgolabs.net, Jonathan.Cameron@huawei.com, rkodsara@amd.com,
	vkoul@kernel.org, bharata@amd.com, sj@kernel.org, rientjes@google.com,
	xuezhengchu@huawei.com, yiannis@zptcorp.com, dave.hansen@intel.com,
	hannes@cmpxchg.org, jhubbard@nvidia.com, peterx@redhat.com,
	riel@surriel.com, shakeel.butt@linux.dev, stalexan@redhat.com,
	tj@kernel.org, nifan.cxl@gmail.com, jic23@kernel.org,
	aneesh.kumar@kernel.org, nathan.lynch@amd.com, Frank.li@nxp.com,
	djbw@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload
In-Reply-To: (Shivank Garg's message of "Fri, 8 May 2026 18:04:34 +0530")
References: <20260428155043.39251-2-shivankg@amd.com>
	<87zf2kvnqy.fsf@DESKTOP-5N7EMDA>
	<152b9b5d-67c8-4a13-b8a8-be576a16eb8f@amd.com>
	<87mryaqgwg.fsf@DESKTOP-5N7EMDA>
Date: Sat, 09 May 2026 15:49:56 +0800
Message-ID: <87cxz5rpi3.fsf@DESKTOP-5N7EMDA>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
"Garg, Shivank" writes:

> On 5/8/2026 4:58 PM, Huang, Ying wrote:
>> Hi, Shivank,
>>
>> "Garg, Shivank" writes:
>>
>>> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>>>> Shivank Garg writes:
>>>
>>>>> PERFORMANCE RESULTS:
>>>>> --------------------
>>>>>
>>>>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>>>>> speedups match V4 (~6x for 2MB folios at 16 DMA channels).
>>>>> No design
>>>>> change in V5 alters this picture; please refer to the V4 cover letter
>>>>> for the throughput tables [1].
>>>>
>>>> IMHO, it's better to copy the performance data here.
>>>>
>>>> In addition to the performance benefit, I want to know the downside as
>>>> well. For example, the migration latency of the first folio may be
>>>> longer. If so, by how much? Can you measure the batch number vs. total
>>>> migration time (benefit) and first folio migration time (downside)?
>>>> That can be used to determine the optimal batch number.
>>>>
>>>
>>> System Info: AMD Zen 3 EPYC server (2 sockets, 32 cores, SMT enabled),
>>> 1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.
>>>
>>> Benchmark: move_pages() syscall to move pages between two NUMA nodes.
>>>
>>> 1). Moving different-sized folios such that the total transfer size is
>>> constant (1GB), with different numbers of DMA channels. Throughput in GB/s.
>>>
>>> a. Baseline (vanilla kernel, single-threaded, serial folio_copy):
>>>
>>> ===========================================================================
>>>     4K     |    16K    |    64K    |   256K    |    1M     |     2M     |
>>> ===========================================================================
>>>  3.31±0.18 | 5.61±0.07 | 6.66±0.03 | 7.01±0.03 | 7.13±0.08 | 11.02±0.17 |
>>>
>>> b. DMA offload (patched kernel, dcbm driver, N DMA channels):
>>>
>>> =======================================================================================
>>> N channel |    4K     |    16K    |    64K    |   256K     |     1M     |     2M      |
>>> =======================================================================================
>>> 1         | 2.16±0.14 | 2.58±0.02 | 3.00±0.04 |  4.56±0.28 |  4.62±0.02 | 12.65±0.08 |
>>> 2         | 2.68±0.09 | 3.69±0.15 | 4.52±0.04 |  6.75±0.06 |  7.19±0.19 | 14.38±0.06 |
>>> 4         | 3.07±0.13 | 4.62±0.09 | 6.47±0.56 |  9.22±0.15 | 10.24±0.47 | 27.01±0.11 |
>>> 8         | 3.43±0.09 | 5.40±0.16 | 7.67±0.08 | 11.25±0.17 | 12.60±0.60 | 45.62±0.52 |
>>> 12        | 3.50±0.11 | 5.66±0.16 | 8.12±0.10 | 11.97±0.19 | 13.43±0.08 | 61.02±0.92 |
>>> 16        | 3.54±0.12 | 5.79±0.14 | 8.50±0.13 | 12.59±0.15 | 17.21±6.40 | 65.23±1.70 |
>>>
>>> 2). First-folio latency: Instrumented with custom tracepoints to
>>> measure latency per migrate_pages_batch() call.
>>> Result: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.
>>
>> Thanks for the detailed data. Per my understanding, the run time of
>> migrate_pages_batch() may not be good enough for measuring first folio
>> latency. IIUC, the migration procedure is something like,
>>
>> for each folio
>>         unmap
>> flush
>> for each folio
>>         copy
>>         remap ===> first folio migrated
>>
>> Some tracepoint should be better to measure it.
>
> Sorry, my earlier write-up was unclear.
> For first folio latency, I add two tracepoints: one at the start of
> migrate_pages_batch() and one in migrate_folio_done().
>
> I agree that the tracepoint for the user-accessible point should be right
> after remove_migration_ptes(). Though, migrate_folio_done() runs only a few
> operations later, with a constant offset, so it's unlikely to change the
> shape of the trade-off curve.
> I'll move the tracepoint right after remove_migration_ptes() for the next posting.

Thanks for the explanation. A tracepoint in migrate_folio_done() should be OK.

>>
>>> A). Vanilla Kernel:
>>>
>>> Here, n = workload size passed to move_pages() in folios. Move n folios
>>> with move_pages(). NR_MAX_BATCHED_MIGRATION is the upstream default value, 512.
>>>
>>> --- Order 0 (4K folios) ---
>>> n         vanilla/cpu
>>> (folios)  GB/s | first(us)
>>> --------------------------
>>> 1         0.04 |  24
>>> 4         0.16 |  25
>>> 8         0.29 |  31
>>> 16        0.54 |  27
>>> 64        1.15 |  68
>>> 256       1.86 | 162
>>> 512       2.21 | 264
>>> 2048      2.62 | 208
>>> 4096      2.74 | 182
>>> 16384     2.73 | 173
>>> 65536     3.28 | 166
>>> 262144    3.20 | 167
>>>
>>> --- Order 9 (2M folios) ---
>>> n         vanilla/cpu
>>> (folios)  GB/s | first(us)
>>> --------------------------
>>> 1         7.05 | 194
>>> 4         8.78 | 186
>>> 8         8.47 | 188
>>> 16        7.20 | 193
>>> 64        8.23 | 191
>>> 256      10.51 | 180
>>> 512      10.88 | 173
>>>
>>> Takeaway:
>>> In each migrate_pages_batch() call, folios are first unmapped, then
>>> try_to_unmap_flush(), and only then do folios enter move_to_new_folio().
>>> So first-folio latency is bounded by the per-batch unmap+flush cost, and
>>> then plateaus once the workload is large enough.
>>>
>>>
>>> B). Patched kernel:
>>>
>>> Here, N = NR_MAX_BATCHED_MIGRATION (in pages). Total migrated data is fixed at 1 GB.
>>
>> Emm, so NR_MAX_BATCHED_MIGRATION could be very large? I think that it
>> needs to be bounded. If it is too large, too many pages may be in an
>> inaccessible state for a longer time. That will hurt the workload
>> performance, although it is optimal for migration performance.
>>
>
> Agreed, it must be bounded.

Thanks! Could you retest with a bounded NR_MAX_BATCHED_MIGRATION? If the
upstream default doesn't work well for you, we can find a better one that
balances throughput and latency.

>>> Change N with a knob to measure the impact of different max batch sizes.
>>>
>>> --- ORDER 0 (4K folios) ---
>>> N        offload/dma1       offload/dma4       offload/dma16
>>>          GB/s | first(us)   GB/s | first(us)   GB/s | first(us)
>>> ------------------------------------------------------------------------
>>> 512      2.13 |    639      3.23 |    290      3.27 |    253
>>> 1024     2.17 |   1261      3.44 |    582      3.58 |    536
>>> 2048     2.01 |   2769      3.09 |   1360      3.45 |   1083
>>> 4096     2.10 |   5059      3.13 |   2737      3.58 |   2115
>>> 8192     2.21 |   9320      3.17 |   5015      3.75 |   3617
>>> 16384    2.15 |  18689      3.31 |   9623      3.87 |   6937
>>> 32768    2.12 |  42692      3.38 |  18893      3.83 |  14255
>>> 65536    2.09 |  81956      3.38 |  38556      3.64 |  29003
>>> 131072   2.02 | 169563      3.22 |  81082      3.63 |  62236
>>> 262144   2.21 | 318424      3.12 | 170174      3.50 | 129413
>>>
>>> --- ORDER 9 (2M folios) ---
>>> N        offload/dma1       offload/dma4       offload/dma16
>>>          GB/s | first(us)   GB/s | first(us)   GB/s | first(us)
>>> ------------------------------------------------------------------------
>>> 512     11.66 |    160     11.68 |    160     11.65 |    160
>>> 1024    12.16 |    310     13.67 |    275     13.64 |    276
>>> 2048    12.30 |    613     25.47 |    290     25.48 |    291
>>> 4096    12.48 |   1215     26.19 |    566     42.59 |    335
>>> 8192    12.56 |   2424     26.57 |   1118     58.72 |    470 *
>>> 16384   12.61 |   4839     26.77 |   2218     61.94 |    896
>>> 32768   12.60 |   9667     26.98 |   4422     63.75 |   1748
>>> 65536   12.63 |  19318     26.99 |   8838     60.66 |   3543
>>> 131072  12.64 |  38935     27.02 |  17935     61.06 |   7178
>>> 262144  12.66 |  77694     26.85 |  35871     65.06 |  14129
>>>
>>> In the batch-copy offload approach, the DMA copy phase is inserted between
>>> unmap/flush and move, so larger N increases first-folio wall-clock latency.
>>> Throughput improves, but with diminishing returns.
>>>
>>> For the DCBM+PTDMA setup, the optimal batch for 2M folios sits around
>>> N=8192-16384, because a larger batch allows the driver to distribute more
>>> folios across the available DMA channels. This is where we get the most
>>> throughput while keeping the first-folio latency in check.
>>>
>>> This optimal batch value is hardware-specific. Other engines (e.g., SDXI)
>>> and memory tiers (e.g., CXL) will likely have different curves.
>>>
>>> Does this approach and experiment look good to you?

---
Best Regards,
Huang, Ying