public inbox for linux-kernel@vger.kernel.org
From: "Huang, Ying" <ying.huang@linux.alibaba.com>
To: Shivank Garg <shivankg@amd.com>
Cc: <akpm@linux-foundation.org>,  <david@kernel.org>,
	<kinseyho@google.com>,  <weixugc@google.com>,  <ljs@kernel.org>,
	<Liam.Howlett@oracle.com>,  <vbabka@kernel.org>,
	 <willy@infradead.org>, <rppt@kernel.org>,  <surenb@google.com>,
	 <mhocko@suse.com>, <ziy@nvidia.com>,  <matthew.brost@intel.com>,
	 <joshua.hahnjy@gmail.com>, <rakie.kim@sk.com>,
	 <byungchul@sk.com>,  <gourry@gourry.net>, <apopple@nvidia.com>,
	 <dave@stgolabs.net>, <Jonathan.Cameron@huawei.com>,
	 <rkodsara@amd.com>,  <vkoul@kernel.org>, <bharata@amd.com>,
	 <sj@kernel.org>,  <rientjes@google.com>,
	<xuezhengchu@huawei.com>,  <yiannis@zptcorp.com>,
	<dave.hansen@intel.com>,  <hannes@cmpxchg.org>,
	 <jhubbard@nvidia.com>, <peterx@redhat.com>,  <riel@surriel.com>,
	 <shakeel.butt@linux.dev>, <stalexan@redhat.com>,
	 <tj@kernel.org>,  <nifan.cxl@gmail.com>, <jic23@kernel.org>,
	 <aneesh.kumar@kernel.org>,  <nathan.lynch@amd.com>,
	<Frank.li@nxp.com>,  <djbw@kernel.org>,
	 <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>
Subject: Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload
Date: Thu, 30 Apr 2026 16:47:01 +0800	[thread overview]
Message-ID: <87zf2kvnqy.fsf@DESKTOP-5N7EMDA> (raw)
In-Reply-To: <20260428155043.39251-2-shivankg@amd.com> (Shivank Garg's message of "Tue, 28 Apr 2026 15:50:37 +0000")

Shivank Garg <shivankg@amd.com> writes:

> This is the fifth RFC of the patchset to enhance page migration by
> batching folio-copy operations and enabling acceleration via DMA offload.
>
> Single-threaded, folio-by-folio copying bottlenecks page migration in
> modern systems with deep memory hierarchies, especially for large folios
> where copy overhead dominates, leaving significant hardware potential
> untapped.
>
> By batching the copy phase, we create an opportunity for hardware
> acceleration. This series builds the framework and provides a DMA
> offload driver (dcbm) as a reference implementation, targeting bulk
> migration workloads where offloading the copy improves throughput
> and latency while freeing CPU cycles.
>
> See the RFC V3 cover letter [2] for motivation.
>
> Changelog since V4:
> -------------------
>
> 1. Renamed PAGE_* migration state flags to FOLIO_*. (David)
> 2. Use the new folio->migrate_info field instead of folio->private
>    for migration state. (David)
> 3. Folded the folios_mc_copy patch into the batch-copy implementation
>    patch. (David)
> 4. Renamed migrate_offload_start()/stop() to register()/unregister().
>    (Huang, Ying)
> 5. Dropped the should_batch() callback from struct migrator. Reason-based
>    policy now lives in migrate_pages_batch(). Migrators can still skip
>    a batch they don't want (size-based policy). (Huang, Ying)
> 6. CONFIG_MIGRATION_COPY_OFFLOAD is now hidden and selected by the
>    migrator driver. CONFIG_DCBM_DMA is tristate. (Huang Ying, Gregory Price)
> 7. Wrapped the SRCU + static_call dispatch in a small helper. (Huang, Ying)
> 8. Require m->owner in migrate_offload_register(); the SRCU sync at
>    unregister relies on it. Counters are atomic_long_t to avoid a
>    lock-order issue.
> 9. Moved DCBM sysfs from /sys/kernel/dcbm to /sys/module/dcbm. (Huang, Ying)
> 10. Rebased on v7.1-rc1.
>
>
> DESIGN:
> -------
>
> New Migration Flow:
>
> [ migrate_pages_batch() ]
>     |
>     |--> do_batch = migrate_offload_do_batch(reason)  // core filters by migration reason
>     |
>     |--> for each folio:
>     |      migrate_folio_unmap()        // unmap the folio
>     |      |
>     |      +--> (success):
>     |           if do_batch && folio_supports_batch_copy():
>     |               -> unmap_batch / dst_batch  // batch list for copy offloading
>     |           else:
>     |               -> unmap_single / dst_single // single lists for per-folio CPU copy
>     |
>     |--> try_to_unmap_flush()                   // single batched TLB flush
>     |
>     |--> Batch copy (if unmap_batch not empty):
>     |    - Migrator is configurable at runtime via sysfs.
>     |
>     |      static_call(migrate_offload_copy)    // Pluggable Migrators
>     |              /          |            \
>     |             v           v             v
>     |     [ Default ]  [ DMA Offload ]  [ ... ]
>     |
>     |      On -EOPNOTSUPP or other error, batch falls back to per-folio CPU copy.
>     |
>     +--> migrate_folios_move()      // metadata, update PTEs, finalize
>          (batch list with already_copied=true, single list with false)
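[For readers skimming the flow chart: the fallback at the static_call step can
be modeled in plain userspace C. This is only an illustrative sketch — the
folio layout, buffer sizes, and function names below are stand-ins, not the
kernel's actual types, and the kernel uses static_call() rather than a bare
function pointer.]

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a folio's data pages; real folios are page-sized. */
struct folio { char src[64]; char dst[64]; };

/* Pluggable batch-copy hook (static_call target in the kernel);
 * returns 0 on success or a negative errno. NULL means no migrator. */
static int (*offload_copy)(struct folio *batch, size_t n);

/* Default per-folio CPU copy. */
static int cpu_copy_one(struct folio *f)
{
	memcpy(f->dst, f->src, sizeof(f->src));
	return 0;
}

/* A migrator that declines the batch, as DCBM might for unsupported folios. */
static int failing_offload(struct folio *batch, size_t n)
{
	(void)batch; (void)n;
	return -EOPNOTSUPP;
}

/* Mirrors the flow chart: try the pluggable migrator first; on
 * -EOPNOTSUPP or any other error, fall back to per-folio CPU copy. */
static void copy_batch(struct folio *batch, size_t n)
{
	if (offload_copy && offload_copy(batch, n) == 0)
		return;
	for (size_t i = 0; i < n; i++)
		cpu_copy_one(&batch[i]);
}
```

The key property the series relies on is that an offload failure is never
fatal: the batch always completes via the CPU path.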
>
> Offload Registration:
>
>     Driver fills struct migrator { .name, .offload_copy, .owner } and calls
>     migrate_offload_register().  This:
>       - Pins the module via try_module_get()
>       - Patches the migrate_offload_copy() static_call target
>       - Enables the migrate_offload_enabled static branch
>
>     migrate_offload_unregister() disables the static branch and reverts
>     the static_call, then synchronize_srcu() waits for in-flight migrations
>     before module_put().
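
[A userspace sketch of that registration contract may help. The struct layout
follows the cover letter's description only; the try_module_get()/static_call/
SRCU steps are noted in comments because they have no userspace equivalent,
and the names are illustrative rather than copied from mm/ sources.]

```c
#include <errno.h>
#include <stddef.h>

/* Illustrative model of struct migrator from the cover letter. */
struct migrator {
	const char *name;
	int (*offload_copy)(void *batch, size_t n);
	void *owner;	/* module owner; mandatory in V5 so unregister can sync */
};

/* Stand-in for the static_call target; NULL means default CPU copy. */
static struct migrator *cur_migrator;

static int migrate_offload_register(struct migrator *m)
{
	if (!m->owner)		/* V5 requires an owner */
		return -EINVAL;
	/* Kernel: try_module_get(m->owner), patch the static_call target,
	 * then enable the migrate_offload_enabled static branch. */
	cur_migrator = m;
	return 0;
}

static void migrate_offload_unregister(struct migrator *m)
{
	/* Kernel: disable the static branch, revert the static_call,
	 * synchronize_srcu() to drain in-flight migrations, module_put(). */
	if (cur_migrator == m)
		cur_migrator = NULL;
}
```

The ordering matters: the static branch is reverted before the SRCU sync, so
no new migration can observe the migrator while it is being torn down.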
>
> PERFORMANCE RESULTS:
> --------------------
>
> Re-ran the V4 workload on v7.1-rc1 with this series; relative
> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
> change in V5 alters this picture; please refer to the V4 cover letter
> for the throughput tables [1].

IMHO, it's better to copy the performance data here.

In addition to the performance benefit, I want to know the downside as
well.  For example, the migration latency of the first folio may be
longer.  If so, by how much?  Can you measure the batch number vs. total
migration time (benefit) and first folio migration time (downside)?
That can be used to determine the optimal batch number.

> PLAN:
> -----
>
> Patches 1-4 (the batching infrastructure) don't depend on the migrator
> interface. If it matches the maintainers' preference, I can split them
> off and post them ahead of the migrator and DCBM bits, which still have
> a few open questions to work through.
>
> OPEN QUESTIONS:
> ---------------
>
> 1. Should the batch path run without a registered migrator? Patches 1-4
>    are self-contained and use folios_mc_copy() (CPU). Options include
>    making the batch path always-on for eligible folios, giving the admin
>    a knob to flip the static branch, or keeping the current gate.
>    I'm leaning toward always-on.
>
> 2. Carrying already_copied via folio->migrate_info vs changing the
>    migrate_folio() callback signature (Huang, Ying). I went with the
>    field for now to avoid touching every fs callback before the design
>    settles. Happy to revisit.
>
> 3. Per-caller offload selection: Today eligibility is determined by
>    migrate_reason only. Some callers are latency-tolerant, others may
>    not be. Is reason the right granularity, or do we want a per-caller
>    hint?
>
> 4. Cgroup integration: How should per-cgroup accounting work for
>    different migrators (e.g., any accounting for DMA-busy time)?
>
> 5. Tuning migrate_pages() callers for offloading. For instance,
>    COMPACT_CLUSTER_MAX = 32 caps DMA's payoff for compaction
>    (V4 experiment).
>
> 6. Where do batch-size thresholds live, and how are they tuned? Per
>    Huang Ying's split, that policy lives in the migrator. DCBM has no
>    threshold today. Open whether it should later be a per-migrator
>    sysfs knob or hard-coded; probably clearer once a second migrator
>    (SDXI, mtcopy) shows the trade-off.
>
>
> FOLLOW-UPS:
> --------------
>
> 1. dmaengine_prep_dma_memcpy_sg() in DCBM (Vinod Koul). The SG-prep
>    variant cuts per-batch prep/submit cost (i.e., CPU savings), but
>    ptdma does not implement the SG hook yet [10]. The end-to-end
>    migration throughput delta is small because per-descriptor execute
>    time dominates. I'll post the ptdma SG hook + DCBM switch as a
>    follow-up.
>   
> 2. SDXI as a second migrator. The SDXI series [11] is in review. SDXI is
>    a generic memcpy engine without DMA_PRIVATE, so channel acquisition
>    goes through dma_find_channel() or async_tx rather than
>    dma_request_chan_by_mask(). I have a local DCBM variant working on top
>    of the SDXI driver. I'm planning to send it as a follow-up once the
>    SDXI series settles.
>  
> 3. IOMMU SG merging in DCBM (Gregory). dma_map_sgtable() may merge
>    contiguous PFNs unevenly, so src.nents != dst.nents, and DCBM falls
>    back to CPU for safety. I haven't seen this on Zen3 + PTDMA, but
>    I'll investigate and address it in a follow-up.
>  
> 4. Revisit a multi-threaded CPU copy migrator once the infra is settled.
>
> EARLIER POSTINGS:
> -----------------
> [1] RFC V4: https://lore.kernel.org/all/20260309120725.308854-3-shivankg@amd.com
> [2] RFC V3: https://lore.kernel.org/all/20250923174752.35701-1-shivankg@amd.com
> [3] RFC V2: https://lore.kernel.org/all/20250319192211.10092-1-shivankg@amd.com
> [4] RFC V1: https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
> [5] RFC from Zi Yan: https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
>
> RELATED DISCUSSIONS:
> --------------------
> [6] MM-alignment Session [Nov 12, 2025]:
>     https://lore.kernel.org/linux-mm/bd6a3c75-b9f0-cbcf-f7c4-1ef5dff06d24@google.com
> [7] Linux Memory Hotness and Promotion call [Nov 6, 2025]:
>     https://lore.kernel.org/linux-mm/8ff2fd10-c9ac-4912-cf56-7ecd4afd2770@google.com
> [8] LSFMM 2025:
>     https://lore.kernel.org/all/cf6fc05d-c0b0-4de3-985e-5403977aa3aa@amd.com
> [9] OSS India:
>     https://ossindia2025.sched.com/event/23Jk1
> [10] DMA_MEMCPY_SG comparison:
>      https://lore.kernel.org/linux-mm/3e73addb-ac01-4a05-bc75-c6c1c56072df@amd.com
> [11] SDXI V1:
>      https://lore.kernel.org/all/20260410-sdxi-base-v1-0-1d184cb5c60a@amd.com
>
> Thanks to everyone who reviewed, tested or participated in discussions
> around this series. Your feedback helped me throughout the development
> process.
>
> Best Regards,
> Shivank
>
>
> Shivank Garg (6):
>   mm/migrate: rename PAGE_ migration flags to FOLIO_
>   mm/migrate: use migrate_info field instead of private
>   mm/migrate: skip data copy for already-copied folios
>   mm/migrate: add batch-copy path in migrate_pages_batch
>   mm/migrate: add copy offload registration infrastructure
>   drivers/migrate_offload: add DMA batch copy driver (dcbm)
>
> Zi Yan (1):
>   mm/migrate: adjust NR_MAX_BATCHED_MIGRATION for testing
>
>  drivers/Kconfig                       |   2 +
>  drivers/Makefile                      |   2 +
>  drivers/migrate_offload/Kconfig       |   9 +
>  drivers/migrate_offload/Makefile      |   1 +
>  drivers/migrate_offload/dcbm/Makefile |   1 +
>  drivers/migrate_offload/dcbm/dcbm.c   | 440 ++++++++++++++++++++++++++
>  include/linux/migrate_copy_offload.h  |  44 +++
>  include/linux/mm.h                    |   2 +
>  include/linux/mm_types.h              |   1 +
>  mm/Kconfig                            |   6 +
>  mm/Makefile                           |   1 +
>  mm/migrate.c                          | 211 ++++++++----
>  mm/migrate_copy_offload.c             |  94 ++++++
>  mm/util.c                             |  30 ++
>  14 files changed, 784 insertions(+), 60 deletions(-)
>  create mode 100644 drivers/migrate_offload/Kconfig
>  create mode 100644 drivers/migrate_offload/Makefile
>  create mode 100644 drivers/migrate_offload/dcbm/Makefile
>  create mode 100644 drivers/migrate_offload/dcbm/dcbm.c
>  create mode 100644 include/linux/migrate_copy_offload.h
>  create mode 100644 mm/migrate_copy_offload.c
>
>
> base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731

---
Best Regards,
Huang, Ying


Thread overview: 13+ messages
2026-04-28 15:50 [PATCH 0/7] Accelerate page migration with batch copying and hardware offload Shivank Garg
2026-04-28 15:50 ` [PATCH 1/7] mm/migrate: rename PAGE_ migration flags to FOLIO_ Shivank Garg
2026-04-30  9:07   ` Huang, Ying
2026-04-28 15:50 ` [PATCH 2/7] mm/migrate: use migrate_info field instead of private Shivank Garg
2026-04-28 15:50 ` [PATCH 3/7] mm/migrate: skip data copy for already-copied folios Shivank Garg
2026-04-28 15:50 ` [PATCH 4/7] mm/migrate: add batch-copy path in migrate_pages_batch Shivank Garg
2026-04-28 15:50 ` [PATCH 5/7] mm/migrate: add copy offload registration infrastructure Shivank Garg
2026-04-28 15:50 ` [PATCH 6/7] drivers/migrate_offload: add DMA batch copy driver (dcbm) Shivank Garg
2026-04-28 15:50 ` [PATCH 7/7] mm/migrate: adjust NR_MAX_BATCHED_MIGRATION for testing Shivank Garg
2026-04-28 17:11 ` [PATCH 0/7] Accelerate page migration with batch copying and hardware offload Garg, Shivank
2026-04-28 19:33   ` David Hildenbrand (Arm)
2026-04-29  5:51     ` Garg, Shivank
2026-04-30  8:47 ` Huang, Ying [this message]
