From: Shivank Garg <shivankg@amd.com>
To: <akpm@linux-foundation.org>, <david@kernel.org>
Cc: <kinseyho@google.com>, <weixugc@google.com>, <ljs@kernel.org>,
<Liam.Howlett@oracle.com>, <vbabka@kernel.org>,
<willy@infradead.org>, <rppt@kernel.org>, <surenb@google.com>,
<mhocko@suse.com>, <ziy@nvidia.com>, <matthew.brost@intel.com>,
<joshua.hahnjy@gmail.com>, <rakie.kim@sk.com>, <byungchul@sk.com>,
<gourry@gourry.net>, <ying.huang@linux.alibaba.com>,
<apopple@nvidia.com>, <dave@stgolabs.net>,
<Jonathan.Cameron@huawei.com>, <rkodsara@amd.com>,
<vkoul@kernel.org>, <bharata@amd.com>, <sj@kernel.org>,
<rientjes@google.com>, <xuezhengchu@huawei.com>,
<yiannis@zptcorp.com>, <dave.hansen@intel.com>,
<hannes@cmpxchg.org>, <jhubbard@nvidia.com>, <peterx@redhat.com>,
<riel@surriel.com>, <shakeel.butt@linux.dev>,
<stalexan@redhat.com>, <tj@kernel.org>, <nifan.cxl@gmail.com>,
<jic23@kernel.org>, <aneesh.kumar@kernel.org>,
<nathan.lynch@amd.com>, <Frank.li@nxp.com>, <djbw@kernel.org>,
<linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>,
Shivank Garg <shivankg@amd.com>
Subject: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload
Date: Tue, 28 Apr 2026 15:50:37 +0000
Message-ID: <20260428155043.39251-2-shivankg@amd.com>
This is the fifth RFC of the patchset to enhance page migration by
batching folio-copy operations and enabling acceleration via DMA offload.
Single-threaded, folio-by-folio copying bottlenecks page migration in
modern systems with deep memory hierarchies, especially for large folios
where copy overhead dominates, leaving significant hardware potential
untapped.
By batching the copy phase, we create an opportunity for hardware
acceleration. This series builds the framework and provides a DMA
offload driver (dcbm) as a reference implementation, targeting bulk
migration workloads where offloading the copy improves throughput
and latency while freeing up CPU cycles.
See the RFC V3 cover letter [2] for motivation.
Changelog since V4:
-------------------
1. Renamed PAGE_* migration state flags to FOLIO_*. (David)
2. Use the new folio->migrate_info field instead of folio->private
for migration state. (David)
3. Folded the folios_mc_copy patch into the batch-copy implementation
patch. (David)
4. Renamed migrate_offload_start()/stop() to register()/unregister().
(Huang, Ying)
5. Dropped the should_batch() callback from struct migrator. Reason-based
policy now lives in migrate_pages_batch(). Migrators can still skip
a batch they don't want (size-based policy). (Huang, Ying)
6. CONFIG_MIGRATION_COPY_OFFLOAD is now hidden and selected by the
migrator driver. CONFIG_DCBM_DMA is tristate. (Huang, Ying; Gregory Price)
7. Wrapped the SRCU + static_call dispatch in a small helper. (Huang, Ying)
8. Require m->owner in migrate_offload_register(); the SRCU sync at
unregister relies on it. Counters are atomic_long_t to avoid a
lock-ordering issue.
9. Moved the DCBM sysfs interface from /sys/kernel/dcbm to
/sys/module/dcbm. (Huang, Ying)
10. Rebased on v7.1-rc1.
DESIGN:
-------
New Migration Flow:
[ migrate_pages_batch() ]
|
|--> do_batch = migrate_offload_do_batch(reason) // core filters by migration reason
|
|--> for each folio:
| migrate_folio_unmap() // unmap the folio
| |
| +--> (success):
| if do_batch && folio_supports_batch_copy():
| -> unmap_batch / dst_batch // batch list for copy offloading
| else:
| -> unmap_single / dst_single // single lists for per-folio CPU copy
|
|--> try_to_unmap_flush() // single batched TLB flush
|
|--> Batch copy (if unmap_batch not empty):
| - Migrator is configurable at runtime via sysfs.
|
| static_call(migrate_offload_copy) // Pluggable Migrators
| / | \
| v v v
| [ Default ] [ DMA Offload ] [ ... ]
|
| On -EOPNOTSUPP or other error, batch falls back to per-folio CPU copy.
|
+--> migrate_folios_move() // metadata, update PTEs, finalize
(batch list with already_copied=true, single list with false)
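The dispatch-with-fallback step above can be modeled with a small
user-space sketch. This is not the kernel code: struct folio_copy,
cpu_copy_one() and copy_batch() are illustrative stand-ins, and a plain
function pointer stands in for the static_call target.

```c
#include <stddef.h>
#include <errno.h>

/* Illustrative stand-in for one folio's copy work. */
struct folio_copy { const char *src; char *dst; size_t len; };

/* Stand-in for the migrate_offload_copy static_call target.
 * NULL means no migrator is registered. */
static int (*offload_copy)(struct folio_copy *batch, int n);

/* Per-folio CPU copy, the fallback path. */
static int cpu_copy_one(struct folio_copy *f)
{
	for (size_t i = 0; i < f->len; i++)
		f->dst[i] = f->src[i];
	return 0;
}

static int copy_batch(struct folio_copy *batch, int n)
{
	/* Try the registered migrator first, if any. */
	if (offload_copy) {
		if (offload_copy(batch, n) == 0)
			return 0;
		/* On -EOPNOTSUPP or any other error, fall through
		 * to per-folio CPU copy, as in the flow above. */
	}
	for (int i = 0; i < n; i++)
		cpu_copy_one(&batch[i]);
	return 0;
}
```

With no migrator registered (offload_copy == NULL) the batch is still
copied correctly, which mirrors the series' CPU fallback behavior.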
Offload Registration:
Driver fills struct migrator { .name, .offload_copy, .owner } and calls
migrate_offload_register(). This:
- Pins the module via try_module_get()
- Patches the migrate_offload_copy() static_call target
- Enables the migrate_offload_enabled static branch
migrate_offload_unregister() disables the static branch and reverts
the static_call, then synchronize_srcu() waits for in-flight migrations
before module_put().
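The registration contract can likewise be sketched in user space. This
is a simplified model, not the kernel API: an int ref count stands in
for try_module_get()/module_put(), a plain pointer stands in for the
static_call patch and static branch, and the SRCU synchronization point
is only marked by a comment.

```c
#include <stddef.h>
#include <errno.h>

/* Simplified model of struct migrator from the series. */
struct migrator {
	const char *name;
	int (*offload_copy)(void *batch, int n);
	int *owner_refs;	/* stands in for struct module *owner */
};

static struct migrator *active;	/* stands in for the static_call target */

static int migrate_offload_register(struct migrator *m)
{
	if (!m->owner_refs)	/* owner is required: unregister relies on it */
		return -EINVAL;
	(*m->owner_refs)++;	/* pin the module (try_module_get) */
	active = m;		/* patch static_call, enable static branch */
	return 0;
}

static void migrate_offload_unregister(struct migrator *m)
{
	active = NULL;		/* disable static branch, revert static_call */
	/* synchronize_srcu() would wait for in-flight migrations here */
	(*m->owner_refs)--;	/* module_put() */
}
```

The ordering matters: the dispatch target is reverted before the grace
period, so no new migration can enter the driver while the quiescence
wait drains the in-flight ones, and only then is the module reference
dropped.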
PERFORMANCE RESULTS:
--------------------
Re-ran the V4 workload on v7.1-rc1 with this series; relative
speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
change in V5 alters this picture; please refer to the V4 cover letter
for the throughput tables [1].
PLAN:
-----
Patches 1-4 (the batching infrastructure) don't depend on the migrator
interface, so I can split them off and post them ahead of the migrator
and DCBM bits, which still have a few open questions to work through.
Guidance on whether that split matches the maintainers' preference
would be appreciated.
OPEN QUESTIONS:
---------------
1. Should the batch path run without a registered migrator? Patches 1-4
are self-contained and use folios_mc_copy() (CPU). Options include
making the batch path always-on for eligible folios, giving the admin
a knob to flip the static branch, or keeping the gate as-is.
I'm leaning toward always-on.
2. Carrying already_copied via folio->migrate_info vs changing the
migrate_folio() callback signature (Huang, Ying). I went with the
field for now to avoid touching every fs callback before the design
settles. Happy to revisit.
3. Per-caller offload selection: today, eligibility is decided by
migrate_reason only. Some callers are latency-tolerant, others may not
be. Is reason the right granularity, or do we want a per-caller hint?
4. Cgroup integration: how should per-cgroup accounting work for
different migrators (e.g. should DMA-busy time be accounted)?
5. Tuning migrate_pages() callers for offloading. For instance, in
compaction, COMPACT_CLUSTER_MAX = 32 caps DMA's payoff
(V4 experiment).
6. Where do batch-size thresholds live, and how are they tuned? Per
Huang Ying's split, that policy lives in the migrator. DCBM has no
threshold today. Open whether it should later be a per-migrator
sysfs knob or hard-coded; probably clearer once a second migrator
(SDXI, mtcopy) shows the trade-off.
FOLLOW-UPS:
--------------
1. dmaengine_prep_dma_memcpy_sg() in DCBM (Vinod Koul). The SG-prep
variant cuts per-batch prep/submit cost (=CPU savings), but ptdma does
not implement the SG hook yet [10]. The end-to-end migration throughput
delta is small because per-descriptor execute time dominates.
I'll post the ptdma SG hook + DCBM switch as a follow-up.
2. SDXI as a second migrator. The SDXI series [11] is in review. SDXI is
a generic memcpy engine without DMA_PRIVATE, so channel acquisition
goes through dma_find_channel() or async_tx rather than
dma_request_chan_by_mask(). I have a local DCBM variant working on top
of the SDXI driver. I'm planning to send it as a follow-up once the
SDXI series settles.
3. IOMMU SG merging in DCBM (Gregory). dma_map_sgtable() may merge
contiguous PFNs unevenly, so src.nents != dst.nents. DCBM falls back
to CPU for safety. I haven't seen this on Zen3 + PTDMA, though; I'll
investigate and address it in a follow-up.
4. Revisit Multi-threaded CPU copy migrator once the infra is settled.
EARLIER POSTINGS:
-----------------
[1] RFC V4: https://lore.kernel.org/all/20260309120725.308854-3-shivankg@amd.com
[2] RFC V3: https://lore.kernel.org/all/20250923174752.35701-1-shivankg@amd.com
[3] RFC V2: https://lore.kernel.org/all/20250319192211.10092-1-shivankg@amd.com
[4] RFC V1: https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
[5] RFC from Zi Yan: https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
RELATED DISCUSSIONS:
--------------------
[6] MM-alignment Session [Nov 12, 2025]:
https://lore.kernel.org/linux-mm/bd6a3c75-b9f0-cbcf-f7c4-1ef5dff06d24@google.com
[7] Linux Memory Hotness and Promotion call [Nov 6, 2025]:
https://lore.kernel.org/linux-mm/8ff2fd10-c9ac-4912-cf56-7ecd4afd2770@google.com
[8] LSFMM 2025:
https://lore.kernel.org/all/cf6fc05d-c0b0-4de3-985e-5403977aa3aa@amd.com
[9] OSS India:
https://ossindia2025.sched.com/event/23Jk1
[10] DMA_MEMCPY_SG comparison:
https://lore.kernel.org/linux-mm/3e73addb-ac01-4a05-bc75-c6c1c56072df@amd.com
[11] SDXI V1:
https://lore.kernel.org/all/20260410-sdxi-base-v1-0-1d184cb5c60a@amd.com
Thanks to everyone who reviewed, tested or participated in discussions
around this series. Your feedback helped me throughout the development
process.
Best Regards,
Shivank
Shivank Garg (6):
mm/migrate: rename PAGE_ migration flags to FOLIO_
mm/migrate: use migrate_info field instead of private
mm/migrate: skip data copy for already-copied folios
mm/migrate: add batch-copy path in migrate_pages_batch
mm/migrate: add copy offload registration infrastructure
drivers/migrate_offload: add DMA batch copy driver (dcbm)
Zi Yan (1):
mm/migrate: adjust NR_MAX_BATCHED_MIGRATION for testing
drivers/Kconfig | 2 +
drivers/Makefile | 2 +
drivers/migrate_offload/Kconfig | 9 +
drivers/migrate_offload/Makefile | 1 +
drivers/migrate_offload/dcbm/Makefile | 1 +
drivers/migrate_offload/dcbm/dcbm.c | 440 ++++++++++++++++++++++++++
include/linux/migrate_copy_offload.h | 44 +++
include/linux/mm.h | 2 +
include/linux/mm_types.h | 1 +
mm/Kconfig | 6 +
mm/Makefile | 1 +
mm/migrate.c | 211 ++++++++----
mm/migrate_copy_offload.c | 94 ++++++
mm/util.c | 30 ++
14 files changed, 784 insertions(+), 60 deletions(-)
create mode 100644 drivers/migrate_offload/Kconfig
create mode 100644 drivers/migrate_offload/Makefile
create mode 100644 drivers/migrate_offload/dcbm/Makefile
create mode 100644 drivers/migrate_offload/dcbm/dcbm.c
create mode 100644 include/linux/migrate_copy_offload.h
create mode 100644 mm/migrate_copy_offload.c
base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
--
2.43.0