public inbox for linux-mm@kvack.org
From: Shivank Garg <shivankg@amd.com>
To: <akpm@linux-foundation.org>, <david@kernel.org>
Cc: <lorenzo.stoakes@oracle.com>, <Liam.Howlett@oracle.com>,
	<vbabka@kernel.org>, <willy@infradead.org>, <rppt@kernel.org>,
	<surenb@google.com>, <mhocko@suse.com>, <ziy@nvidia.com>,
	<matthew.brost@intel.com>, <joshua.hahnjy@gmail.com>,
	<rakie.kim@sk.com>, <byungchul@sk.com>, <gourry@gourry.net>,
	<ying.huang@linux.alibaba.com>, <apopple@nvidia.com>,
	<dave@stgolabs.net>, <Jonathan.Cameron@huawei.com>,
	<rkodsara@amd.com>, <vkoul@kernel.org>, <bharata@amd.com>,
	<sj@kernel.org>, <weixugc@google.com>, <dan.j.williams@intel.com>,
	<rientjes@google.com>, <xuezhengchu@huawei.com>,
	<yiannis@zptcorp.com>, <dave.hansen@intel.com>,
	<hannes@cmpxchg.org>, <jhubbard@nvidia.com>, <peterx@redhat.com>,
	<riel@surriel.com>, <shakeel.butt@linux.dev>,
	<stalexan@redhat.com>, <tj@kernel.org>, <nifan.cxl@gmail.com>,
	<linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>,
	Shivank Garg <shivankg@amd.com>
Subject: [RFC PATCH v4 0/6] Accelerate page migration with batch copying and hardware offload
Date: Mon, 9 Mar 2026 12:07:20 +0000	[thread overview]
Message-ID: <20260309120725.308854-3-shivankg@amd.com> (raw)

This is the fourth RFC of the patchset to enhance page migration by
batching folio-copy operations and enabling acceleration via DMA offload.

Single-threaded, folio-by-folio copying bottlenecks page migration in
modern systems with deep memory hierarchies, especially for large folios
where copy overhead dominates, leaving significant hardware potential
untapped.

By batching the copy phase, we create an opportunity for hardware
acceleration. This series builds the framework and provides a DMA
offload driver (dcbm) as a reference implementation, targeting bulk
migration workloads where offloading the copy improves throughput
and latency while freeing CPU cycles.

See the RFC V3 cover letter [1] for motivation.


Changelog since V3:
-------------------

1. Redesigned batch migration flow: pre-copy the batch before the move
   phase instead of interleaving copy with metadata updates. This is a
   simpler design and avoids redundancy with the existing
   migrate_folios_move() path.

2. Rewrote the offload registration infrastructure: simplified the
   migrate-copy offload design, fixed the srcu_read_lock() placement,
   and corrected other minor bugs.

3. Added should_batch() callback to struct migrator so offload drivers can
   filter which migration reasons are eligible for offload.

4. Renamed for clarity:
   - CONFIG_OFFC_MIGRATION     -> CONFIG_MIGRATION_COPY_OFFLOAD
   - migrate_offc.[ch]         -> migrate_copy_offload.[ch]
   - drivers/migoffcopy/       -> drivers/migrate_offload/
   - start_offloading/stop_offloading -> migrate_offload_start/stop

5. Dropped mtcopy driver to keep focus on core infrastructure and DMA
   offload (for testing and reference). Multi-threaded CPU copy can
   follow separately.

6. Rebased on v7.0-rc2.


DESIGN:
-------

New Migration Flow:

[ migrate_pages_batch() ]
    |
    |--> do_batch = should_batch(reason) // driver filters by migration reason
    |                                    // (e.g. allow NUMA balancing, skip
    |                                    // others); evaluated once per batch
    |
    |--> for each folio:
    |      migrate_folio_unmap()        // unmap the folio
    |      |
    |      +--> (success):
    |           if migrate_offload_enabled && do_batch && folio_supports_batch_copy():
    |               -> src_batch / dst_batch    // batch list for copy offloading
    |           else:
    |               -> src_std / dst_std        // standard lists for per-folio CPU copy
    |
    |--> try_to_unmap_flush()                   // single batched TLB flush
    |
    |--> Batch copy (if src_batch not empty):
    |    - Migrator is configurable at runtime via sysfs.
    |
    |      static_call(migrate_offload_copy)    // Pluggable Migrators
    |              /          |            \
    |             v           v             v
    |     [ Default ]  [ DMA Offload ]  [ ... ]
    |
    |      On failure, folios fall back to per-folio CPU copy.
    |
    +--> migrate_folios_move()      // metadata, update PTEs, finalize
         (batch list with already_copied=true, std list with false)

Offload Registration:

    Driver fills struct migrator { .name, .offload_copy, .should_batch, .owner }
    and calls migrate_offload_start().  This:
      - Pins the module via try_module_get()
      - Patches static_call targets for offload_copy and should_batch
      - Enables the migrate_offload_enabled static branch

    migrate_offload_stop() disables the static branch and reverts both
    static_calls, then synchronize_srcu() waits for in-flight
    migrations before module_put().


PERFORMANCE RESULTS:
--------------------

System Info: AMD Zen 3 EPYC server (2 sockets, 32 cores, SMT enabled),
1 NUMA node per socket, v7.0-rc2, DVFS set to Performance, PTDMA hardware.

Benchmark: move_pages() syscall to move pages between two NUMA nodes.

1. Moving different sized folios such that total transfer size is constant
(1GB), with different number of DMA channels. Throughput in GB/s.

a. Baseline (vanilla kernel: v7.0-rc2, single-threaded, serial folio_copy):

============================================================================================
           | 4K          | 16K         | 64K         | 256K        | 1M          | 2M          |
============================================================================================
Throughput | 3.55±0.19   | 5.66±0.30   | 6.16±0.09   | 7.12±0.83   | 6.93±0.09   | 10.88±0.19  |

b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):

============================================================================================
Channel Cnt| 4K      | 16K         | 64K         | 256K        | 1M          | 2M          |
============================================================================================
1      | 2.63±0.26   | 2.92±0.09   |  3.16±0.13  |  4.75±0.70  |  7.38±0.18  | 12.64±0.07  |
2      | 3.20±0.12   | 4.68±0.17   |  5.16±0.36  |  7.42±1.00  |  8.05±0.05  | 14.40±0.10  |
4      | 3.78±0.16   | 6.45±0.06   |  7.36±0.18  |  9.70±0.11  | 11.68±2.37  | 27.16±0.20  |
8      | 4.32±0.24   | 8.20±0.45   |  9.45±0.26  | 12.99±2.87  | 13.18±0.08  | 46.17±0.67  |
12     | 4.35±0.16   | 8.80±0.09   | 11.65±2.71  | 15.46±4.95  | 14.69±4.10  | 60.89±0.68  |
16     | 4.40±0.19   | 9.25±0.13   | 11.02±0.26  | 13.56±0.15  | 18.04±7.11  | 66.86±0.81  |

- DMA offload with 16 channels achieves ~6x speedup for 2MB folios.
- Larger folios benefit more; small folios are DMA-setup bound.

2. Varying total move size (folio count) for fixed 2MB folio size,
   single DMA channel. Throughput (GB/s):

2MB Folios | Baseline    | DMA
=================================
1          |  7.34       |  6.17
8          |  8.27       |  8.85
16         |  7.56       |  9.12
32         |  8.39       | 11.73
64         |  9.37       | 12.18
256        | 10.58       | 12.50
512        | 10.78       | 12.68
1024       | 10.77       | 12.76
2048       | 10.87       | 12.81
8192       | 10.84       | 12.82

- Throughput increases with batch size but plateaus after ~64 folios.
- Even a single DMA channel outperforms the baseline for batch sizes >= 8 folios.

EARLIER POSTINGS:
-----------------
[1] RFC V3: https://lore.kernel.org/all/20250923174752.35701-1-shivankg@amd.com
[2] RFC V2: https://lore.kernel.org/all/20250319192211.10092-1-shivankg@amd.com
[3] RFC V1: https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
[4] RFC from Zi Yan: https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com

RELATED DISCUSSIONS:
-------------------
[5] MM-alignment Session [Nov 12, 2025]:
    https://lore.kernel.org/linux-mm/bd6a3c75-b9f0-cbcf-f7c4-1ef5dff06d24@google.com/
[6] Linux Memory Hotness and Promotion call [Nov 6, 2025]:
    https://lore.kernel.org/linux-mm/8ff2fd10-c9ac-4912-cf56-7ecd4afd2770@google.com/
[7] LSFMM 2025:
    https://lore.kernel.org/all/cf6fc05d-c0b0-4de3-985e-5403977aa3aa@amd.com
[8] OSS India:
    https://ossindia2025.sched.com/event/23Jk1

Git Tree: https://github.com/shivankgarg98/linux/commits/shivank/V4_migrate_pages_optimization_precopy

Thanks to everyone who reviewed, tested or participated in discussions
around this series. Your feedback helped me throughout the development
process.

Best Regards,
Shivank


Shivank Garg (5):
  mm: introduce folios_mc_copy() for batch folio copying
  mm/migrate: skip data copy for already-copied folios
  mm/migrate: add batch-copy path in migrate_pages_batch
  mm/migrate: add copy offload registration infrastructure
  drivers/migrate_offload: add DMA batch copy driver (dcbm)

Zi Yan (1):
  mm/migrate: adjust NR_MAX_BATCHED_MIGRATION for testing

 drivers/Kconfig                       |   2 +
 drivers/Makefile                      |   2 +
 drivers/migrate_offload/Kconfig       |   8 +
 drivers/migrate_offload/Makefile      |   1 +
 drivers/migrate_offload/dcbm/Makefile |   1 +
 drivers/migrate_offload/dcbm/dcbm.c   | 457 ++++++++++++++++++++++++++
 include/linux/migrate_copy_offload.h  |  34 ++
 include/linux/mm.h                    |   2 +
 mm/Kconfig                            |   9 +
 mm/Makefile                           |   1 +
 mm/migrate.c                          | 133 ++++++--
 mm/migrate_copy_offload.c             |  99 ++++++
 mm/util.c                             |  31 ++
 13 files changed, 748 insertions(+), 32 deletions(-)
 create mode 100644 drivers/migrate_offload/Kconfig
 create mode 100644 drivers/migrate_offload/Makefile
 create mode 100644 drivers/migrate_offload/dcbm/Makefile
 create mode 100644 drivers/migrate_offload/dcbm/dcbm.c
 create mode 100644 include/linux/migrate_copy_offload.h
 create mode 100644 mm/migrate_copy_offload.c

-- 
2.43.0



Thread overview: 21+ messages:
2026-03-09 12:07 Shivank Garg [this message]
2026-03-09 12:07 ` [RFC PATCH v4 1/6] mm: introduce folios_mc_copy() for batch folio copying Shivank Garg
2026-03-12  9:41   ` David Hildenbrand (Arm)
2026-03-15 18:09     ` Garg, Shivank
2026-03-09 12:07 ` [RFC PATCH v4 2/6] mm/migrate: skip data copy for already-copied folios Shivank Garg
2026-03-12  9:44   ` David Hildenbrand (Arm)
2026-03-15 18:25     ` Garg, Shivank
2026-03-23 12:20       ` David Hildenbrand (Arm)
2026-03-24  8:22   ` Huang, Ying
2026-03-09 12:07 ` [RFC PATCH v4 3/6] mm/migrate: add batch-copy path in migrate_pages_batch Shivank Garg
2026-03-24  8:42   ` Huang, Ying
2026-03-09 12:07 ` [RFC PATCH v4 4/6] mm/migrate: add copy offload registration infrastructure Shivank Garg
2026-03-09 17:54   ` Gregory Price
2026-03-10 10:07     ` Garg, Shivank
2026-03-24 10:54   ` Huang, Ying
2026-03-09 12:07 ` [RFC PATCH v4 5/6] drivers/migrate_offload: add DMA batch copy driver (dcbm) Shivank Garg
2026-03-09 18:04   ` Gregory Price
2026-03-12  9:33     ` Garg, Shivank
2026-03-24  8:10   ` Huang, Ying
2026-03-09 12:07 ` [RFC PATCH v4 6/6] mm/migrate: adjust NR_MAX_BATCHED_MIGRATION for testing Shivank Garg
2026-03-18 14:29 ` [RFC PATCH v4 0/6] Accelerate page migration with batch copying and hardware offload Garg, Shivank
