Re: [PATCH 2/7] mm/migrate: use migrate_info field instead of private

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* Re: [PATCH 2/7] mm/migrate: use migrate_info field instead of private
       [not found] ` <20260428155043.39251-6-shivankg@amd.com>
@ 2026-05-07  9:43   ` Huang, Ying
  2026-05-11 15:22   ` David Hildenbrand (Arm)
  1 sibling, 0 replies; 14+ messages in thread
From: Huang, Ying @ 2026-05-07  9:43 UTC (permalink / raw)
  To: Shivank Garg
  Cc: akpm, david, kinseyho, weixugc, ljs, Liam.Howlett, vbabka, willy,
	rppt, surenb, mhocko, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, gourry, apopple, dave, Jonathan.Cameron,
	rkodsara, vkoul, bharata, sj, rientjes, xuezhengchu, yiannis,
	dave.hansen, hannes, jhubbard, peterx, riel, shakeel.butt,
	stalexan, tj, nifan.cxl, jic23, aneesh.kumar, nathan.lynch,
	Frank.li, djbw, linux-kernel, linux-mm

Shivank Garg <shivankg@amd.com> writes:

> Add an unsigned long migrate_info member to the struct folio union and
> use it to store migration state (anon_vma pointer and FOLIO_WAS_*
> flags) instead of using folio->private.
>
> No functional change.
>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Shivank Garg <shivankg@amd.com>
> ---
>  include/linux/mm_types.h |  1 +
>  mm/migrate.c             | 14 +++++++-------
>  2 files changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index a308e2c23b82..f52818dcf4d2 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -426,6 +426,7 @@ struct folio {
>  			union {
>  				void *private;
>  				swp_entry_t swap;
> +				unsigned long migrate_info;
>  			};
>  			atomic_t _mapcount;
>  			atomic_t _refcount;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 0c6a0ab6ecce..03c2a6f7e5e4 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1130,7 +1130,7 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
>  }
>  
>  /*
> - * To record some information during migration, we use unused private
> + * To record some information during migration, we use the migrate_info
>   * field of struct folio of the newly allocated destination folio.
>   * This is safe because nobody is using it except us.
>   */
> @@ -1143,17 +1143,17 @@ enum {
>  static void __migrate_folio_record(struct folio *dst,
>  		int old_folio_state, struct anon_vma *anon_vma)
>  {
> -	dst->private = (void *)anon_vma + old_folio_state;
> +	dst->migrate_info = (unsigned long)anon_vma | old_folio_state;
>  }
>  
>  static void __migrate_folio_extract(struct folio *dst,
>  		int *old_folio_state, struct anon_vma **anon_vmap)
>  {
> -	unsigned long private = (unsigned long)dst->private;
> +	unsigned long info = dst->migrate_info;
>  
> -	*anon_vmap = (struct anon_vma *)(private & ~FOLIO_OLD_STATES);
> -	*old_folio_state = private & FOLIO_OLD_STATES;
> -	dst->private = NULL;
> +	*anon_vmap = (struct anon_vma *)(info & ~FOLIO_OLD_STATES);
> +	*old_folio_state = info & FOLIO_OLD_STATES;
> +	dst->migrate_info = 0;
>  }
>  
>  /* Restore the source folio to the original state upon failure */
> @@ -1217,7 +1217,7 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>  		return -ENOMEM;
>  	*dstp = dst;
>  
> -	dst->private = NULL;
> +	dst->migrate_info = 0;
>  
>  	if (!folio_trylock(src)) {
>  		if (mode == MIGRATE_ASYNC)

LGTM!  Feel free to add my

Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com>

in the future versions.

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/7] mm/migrate: use migrate_info field instead of private
       [not found] ` <20260428155043.39251-6-shivankg@amd.com>
  2026-05-07  9:43   ` [PATCH 2/7] mm/migrate: use migrate_info field instead of private Huang, Ying
@ 2026-05-11 15:22   ` David Hildenbrand (Arm)
  1 sibling, 0 replies; 14+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-11 15:22 UTC (permalink / raw)
  To: Shivank Garg, akpm
  Cc: kinseyho, weixugc, ljs, Liam.Howlett, vbabka, willy, rppt, surenb,
	mhocko, ziy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, dave, Jonathan.Cameron, rkodsara,
	vkoul, bharata, sj, rientjes, xuezhengchu, yiannis, dave.hansen,
	hannes, jhubbard, peterx, riel, shakeel.butt, stalexan, tj,
	nifan.cxl, jic23, aneesh.kumar, nathan.lynch, Frank.li, djbw,
	linux-kernel, linux-mm

On 4/28/26 17:50, Shivank Garg wrote:
> Add an unsigned long migrate_info member to the struct folio union and
> use it to store migration state (anon_vma pointer and FOLIO_WAS_*
> flags) instead of using folio->private.
> 
> No functional change.
> 
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Shivank Garg <shivankg@amd.com>
> ---

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload
       [not found] <20260428155043.39251-2-shivankg@amd.com>
       [not found] ` <20260428155043.39251-6-shivankg@amd.com>
@ 2026-05-07  9:58 ` Huang, Ying
  2026-05-11 15:19   ` David Hildenbrand (Arm)
       [not found] ` <87zf2kvnqy.fsf@DESKTOP-5N7EMDA>
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Huang, Ying @ 2026-05-07  9:58 UTC (permalink / raw)
  To: Shivank Garg
  Cc: akpm, david, kinseyho, weixugc, ljs, Liam.Howlett, vbabka, willy,
	rppt, surenb, mhocko, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, gourry, apopple, dave, Jonathan.Cameron,
	rkodsara, vkoul, bharata, sj, rientjes, xuezhengchu, yiannis,
	dave.hansen, hannes, jhubbard, peterx, riel, shakeel.butt,
	stalexan, tj, nifan.cxl, jic23, aneesh.kumar, nathan.lynch,
	Frank.li, djbw, linux-kernel, linux-mm

Shivank Garg <shivankg@amd.com> writes:

> This is the fifth RFC of the patchset to enhance page migration by
> batching folio-copy operations and enabling acceleration via DMA offload.
>
> Single-threaded, folio-by-folio copying bottlenecks page migration in
> modern systems with deep memory hierarchies, especially for large folios
> where copy overhead dominates, leaving significant hardware potential
> untapped.
>
> By batching the copy phase, we create an opportunity for hardware
> acceleration. This series builds the framework and provides a DMA
> offload driver (dcbm) as a reference implementation, targeting bulk
> migration workloads where offloading the copy improves throughput
> and latency while freeing the CPU cycles.
>
> See the RFC V3 cover letter [2] for motivation.
>
> Changelog since V4:
> -------------------
>
> 1. Renamed PAGE_* migration state flags to FOLIO_*. (David)
> 2. Use the new folio->migrate_info field instead of folio->private
>    for migration state. (David)
> 3. Fold folios_mc_copy patch in batch-copy implementation patch. (David)
> 3. Renamed migrate_offload_start()/stop() to register()/unregister().
>    (Huang, Ying)
> 4. Dropped should_batch() callback from struct migrator. Reason-based
>    policy now lives in migrate_pages_batch(). Migrators can still skip
>    a batch they don't want (size based policy). (Huang, Ying)
> 5. CONFIG_MIGRATION_COPY_OFFLOAD is now hidden and selected by the
>    migrator driver. CONFIG_DCBM_DMA is tristate. (Huang Ying, Gregory Price). 
> 6. Wrapped the SRCU + static_call dispatch in a small helper. (Huang, Ying)
> 7. Requir m->owner in migrate_offload_register(), SRCU sync at
>    unregister relies on it. Counters are atomic_long_t to avoid lock-order
>    issue.
> 9. Moved DCBM sysfs from /sys/kernel/dcbm to /sys/module/dcbm (Huang, Ying) 
> 10. Rebased on v7.1-rc1.
>
>
> DESIGN:
> -------
>
> New Migration Flow:
>
> [ migrate_pages_batch() ]
>     |
>     |--> do_batch = migrate_offload_do_batch(reason)  // core filters by migration reason
>     |
>     |--> for each folio:
>     |      migrate_folio_unmap()        // unmap the folio
>     |      |
>     |      +--> (success):
>     |           if do_batch && folio_supports_batch_copy():
>     |               -> unmap_batch / dst_batch  // batch list for copy offloading
>     |           else:
>     |               -> unmap_single / dst_single // single lists for per-folio CPU copy
>     |
>     |--> try_to_unmap_flush()                   // single batched TLB flush
>     |
>     |--> Batch copy (if unmap_batch not empty):
>     |    - Migrator is configurable at runtime via sysfs.
>     |
>     |      static_call(migrate_offload_copy)    // Pluggable Migrators
>     |              /          |            \
>     |             v           v             v
>     |     [ Default ]  [ DMA Offload ]  [ ... ]
>     |
>     |      On -EOPNOTSUPP or other error, batch falls back to per-folio CPU copy.
>     |
>     +--> migrate_folios_move()      // metadata, update PTEs, finalize
>          (batch list with already_copied=true, single list with false)
>
> Offload Registration:
>
>     Driver fills struct migrator { .name, .offload_copy, .owner } and calls
>     migrate_offload_register().  This:
>       - Pins the module via try_module_get()
>       - Patches the migrate_offload_copy() static_call target
>       - Enables the migrate_offload_enabled static branch
>
>     migrate_offload_unregister() disables the static branch and reverts
>     the static_call, then synchronize_srcu() waits for in-flight migrations
>     before module_put().
>
> PERFORMANCE RESULTS:
> --------------------
>
> Re-ran the V4 workload on v7.1-rc1 with this series; relative
> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
> change in V5 alters this picture; please refer to the V4 cover letter
> for the throughput tables [1].
>
>
> PLAN:
> -----
>
> Patches 1-4 (the batching infrastructure) don't depend on the migrator
> interface, so if it helps I can split them off and post them ahead of
> the migrator and DCBM bits, which still have a few open questions to
> work through.
>
> I would appreciate guidance on splitting the infrastructure portion
> ahead of the migrator interface if that matches maintainers' preference.
>
> OPEN QUESTIONS:
> ---------------
>
> 1. Should the batch path run without a registered migrator? Patches 1-4
>    are self-contained and use folios_mc_copy() (CPU). I have several
>    options like making batch path always-on for eligible folios, or
>    giving admin an option to flip the static branch, or keep the gate.
>    I'm leaning toward always-on.
>
> 2. Carrying already_copied via folio->migrate_info vs changing the
>    migrate_folio() callback signature (Huang, Ying). I went with the
>    field for now to avoid touching every fs callback before the design
>    settles. Happy to revisit.

Personally, I still prefer to change migrate_folio() callbacks for
better readability.

> 3. Per-caller offload selection: Today eligibility is by migrate_reason
>    only. Some are latency-tolerant, others may be not. Is reason the
>    right granularity, or do we want a per-caller hint?
>
> 4. Cgroup integration: How should per-cgroup be accounted for different
>    migrators (e.g.: any accounting for DMA-busy time)?
>
> 5. Tuning migrate_pages callers for offloading. For instance, in
>    compaction COMPACT_CLUSTER_MAX = 32 caps DMA's payoff for compaction
>    (V4 experiment).
>
> 6. Where do batch-size thresholds live, and how are they tuned? Per
>    Huang Ying's split, that policy lives in the migrator. DCBM has no
>    threshold today. Open whether it should later be a per-migrator
>    sysfs knob or hard-coded; probably clearer once a second migrator
>    (SDXI, mtcopy) shows the trade-off.
>
>
> FOLLOW-UPS:
> --------------
>
> 1. dmaengine_prep_dma_memcpy_sg() in DCBM (Vinod Koul). The SG-prep
>    variant cuts per-batch prep/submit cost (=CPU savings), but ptdma does
>    not implement the SG hook yet [10]. The end-to-end migration throughput
>    delta is small because per-descriptor execute time dominates.
>    I'll post the ptdma SG hook + DCBM switch as a follow-up.
>   
> 2. SDXI as a second migrator. The SDXI series [11] is in  review. SDXI is
>    a generic memcpy engine without DMA_PRIVATE, so channel acquisition
>    goes through dma_find_channel() or async_tx rather than
>    dma_request_chan_by_mask(). I have a local DCBM variant working on top
>    of the SDXI driver. I'm planning to send it as a follow-up once the
>    SDXI series settles.
>  
> 3. IOMMU SG merging in DCBM (Gregory). dma_map_sgtable() may merge
>    contiguous PFNs unevenly, so src.nents != dst.nents. DCBM falls back
>    to CPU for safety. Though I haven't seen it on Zen3 + PTDMA. I'll
>    understand this and address it a follow-up.
>  
> 4. Revisit Multi-threaded CPU copy migrator once the infra is settled.
>
> EARLIER POSTINGS:
> -----------------
> [1] RFC V4: https://lore.kernel.org/all/20260309120725.308854-3-shivankg@amd.com
> [2] RFC V3: https://lore.kernel.org/all/20250923174752.35701-1-shivankg@amd.com
> [3] RFC V2: https://lore.kernel.org/all/20250319192211.10092-1-shivankg@amd.com
> [4] RFC V1: https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
> [5] RFC from Zi Yan: https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
>
> RELATED DISCUSSIONS:
> --------------------
> [6] MM-alignment Session [Nov 12, 2025]:
>     https://lore.kernel.org/linux-mm/bd6a3c75-b9f0-cbcf-f7c4-1ef5dff06d24@google.com
> [7] Linux Memory Hotness and Promotion call [Nov 6, 2025]:
>     https://lore.kernel.org/linux-mm/8ff2fd10-c9ac-4912-cf56-7ecd4afd2770@google.com
> [8] LSFMM 2025:
>     https://lore.kernel.org/all/cf6fc05d-c0b0-4de3-985e-5403977aa3aa@amd.com
> [9] OSS India:
>     https://ossindia2025.sched.com/event/23Jk1
> [10] DMA_MEMCPY_SG comparison:
>      https://lore.kernel.org/linux-mm/3e73addb-ac01-4a05-bc75-c6c1c56072df@amd.com
> [11] SDXI V1:
>      https://lore.kernel.org/all/20260410-sdxi-base-v1-0-1d184cb5c60a@amd.com
>
> Thanks to everyone who reviewed, tested or participated in discussions
> around this series. Your feedback helped me throughout the development
> process.
>
> Best Regards,
> Shivank
>
>
> Shivank Garg (6):
>   mm/migrate: rename PAGE_ migration flags to FOLIO_
>   mm/migrate: use migrate_info field instead of private
>   mm/migrate: skip data copy for already-copied folios
>   mm/migrate: add batch-copy path in migrate_pages_batch
>   mm/migrate: add copy offload registration infrastructure
>   drivers/migrate_offload: add DMA batch copy driver (dcbm)
>
> Zi Yan (1):
>   mm/migrate: adjust NR_MAX_BATCHED_MIGRATION for testing
>
>  drivers/Kconfig                       |   2 +
>  drivers/Makefile                      |   2 +
>  drivers/migrate_offload/Kconfig       |   9 +
>  drivers/migrate_offload/Makefile      |   1 +
>  drivers/migrate_offload/dcbm/Makefile |   1 +
>  drivers/migrate_offload/dcbm/dcbm.c   | 440 ++++++++++++++++++++++++++
>  include/linux/migrate_copy_offload.h  |  44 +++
>  include/linux/mm.h                    |   2 +
>  include/linux/mm_types.h              |   1 +
>  mm/Kconfig                            |   6 +
>  mm/Makefile                           |   1 +
>  mm/migrate.c                          | 211 ++++++++----
>  mm/migrate_copy_offload.c             |  94 ++++++
>  mm/util.c                             |  30 ++
>  14 files changed, 784 insertions(+), 60 deletions(-)
>  create mode 100644 drivers/migrate_offload/Kconfig
>  create mode 100644 drivers/migrate_offload/Makefile
>  create mode 100644 drivers/migrate_offload/dcbm/Makefile
>  create mode 100644 drivers/migrate_offload/dcbm/dcbm.c
>  create mode 100644 include/linux/migrate_copy_offload.h
>  create mode 100644 mm/migrate_copy_offload.c
>
>
> base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload
  2026-05-07  9:58 ` [PATCH 0/7] Accelerate page migration with batch copying and hardware offload Huang, Ying
@ 2026-05-11 15:19   ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 14+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-11 15:19 UTC (permalink / raw)
  To: Huang, Ying, Shivank Garg
  Cc: akpm, kinseyho, weixugc, ljs, Liam.Howlett, vbabka, willy, rppt,
	surenb, mhocko, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, apopple, dave, Jonathan.Cameron, rkodsara,
	vkoul, bharata, sj, rientjes, xuezhengchu, yiannis, dave.hansen,
	hannes, jhubbard, peterx, riel, shakeel.butt, stalexan, tj,
	nifan.cxl, jic23, aneesh.kumar, nathan.lynch, Frank.li, djbw,
	linux-kernel, linux-mm


>> ---------------
>>
>> 1. Should the batch path run without a registered migrator? Patches 1-4
>>    are self-contained and use folios_mc_copy() (CPU). I have several
>>    options like making batch path always-on for eligible folios, or
>>    giving admin an option to flip the static branch, or keep the gate.
>>    I'm leaning toward always-on.
>>
>> 2. Carrying already_copied via folio->migrate_info vs changing the
>>    migrate_folio() callback signature (Huang, Ying). I went with the
>>    field for now to avoid touching every fs callback before the design
>>    settles. Happy to revisit.
> 
> Personally, I still prefer to change migrate_folio() callbacks for
> better readability.

Can that be added as a cleanup on top?

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 14+ messages in thread

[parent not found: <87zf2kvnqy.fsf@DESKTOP-5N7EMDA>]

* Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload
       [not found] ` <87zf2kvnqy.fsf@DESKTOP-5N7EMDA>
@ 2026-05-08 11:04   ` Garg, Shivank
  2026-05-08 11:28     ` Huang, Ying
  0 siblings, 1 reply; 14+ messages in thread
From: Garg, Shivank @ 2026-05-08 11:04 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, david, kinseyho, weixugc, ljs, Liam.Howlett, vbabka, willy,
	rppt, surenb, mhocko, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, gourry, apopple, dave, Jonathan.Cameron,
	rkodsara, vkoul, bharata, sj, rientjes, xuezhengchu, yiannis,
	dave.hansen, hannes, jhubbard, peterx, riel, shakeel.butt,
	stalexan, tj, nifan.cxl, jic23, aneesh.kumar, nathan.lynch,
	Frank.li, djbw, linux-kernel, linux-mm



On 4/30/2026 2:17 PM, Huang, Ying wrote:
> Shivank Garg <shivankg@amd.com> writes:

>> PERFORMANCE RESULTS:
>> --------------------
>>
>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>> change in V5 alters this picture; please refer to the V4 cover letter
>> for the throughput tables [1].
> 
> IMHO, it's better to copy performance data here.
> 
> In addition to the performance benefit, I want to know the downside as
> well.  For example, the migration latency of the first folio may be
> longer.  If so, by how much?  Can you measure the batch number vs. total
> migration time (benefit) and first folio migration time (downside)?
> That can be used to determine the optimal batch number.
> 

System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.

Benchmark: move_pages() syscall to move pages between two NUMA nodes.

1). Moving different sized folios such that total transfer size is constant
(1GB), with different number of DMA channels. Throughput in GB/s.

a. Baseline (vanilla kernel, single-threaded, serial folio_copy):

================================================================================
4K          | 16K        | 64K        | 256K       | 1M          | 2M          |
================================================================================
3.31±0.18   | 5.61±0.07  | 6.66±0.03  | 7.01±0.03  | 7.13±0.08   | 11.02±0.17  |


b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):

============================================================================================
N channel| 4K        | 16K         | 64K         | 256K        | 1M          | 2M          |
============================================================================================
1      | 2.16±0.14   | 2.58±0.02   | 3.00±0.04   | 4.56±0.28   | 4.62±0.02   | 12.65±0.08  |
2      | 2.68±0.09   | 3.69±0.15   | 4.52±0.04   | 6.75±0.06   | 7.19±0.19   | 14.38±0.06  |
4      | 3.07±0.13   | 4.62±0.09   | 6.47±0.56   | 9.22±0.15   | 10.24±0.47  | 27.01±0.11  |
8      | 3.43±0.09   | 5.40±0.16   | 7.67±0.08   | 11.25±0.17  | 12.60±0.60  | 45.62±0.52  |
12     | 3.50±0.11   | 5.66±0.16   | 8.12±0.10   | 11.97±0.19  | 13.43±0.08  | 61.02±0.92  |
16     | 3.54±0.12   | 5.79±0.14   | 8.50±0.13   | 12.59±0.15  | 17.21±6.40  | 65.23±1.70  |


2).  First-folio latency: Instrumented with custom tracepoints to measure latency per migrate_pages_batch() call.
    Result: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.

A). Vanilla Kernel:

Here, n = workload size passed to move_pages() in folios. Move n number of folios with move_pages().
NR_MAX_BATCHED_MIGRATION is upstream default value 512.

--- Order 0 (4K folios) ---
     n      vanilla/cpu
(folios)    GB/s | first(us)
--------------------------
     1       0.04 |     24
     4       0.16 |     25
     8       0.29 |     31
    16       0.54 |     27
    64       1.15 |     68
   256       1.86 |    162
   512       2.21 |    264
  2048       2.62 |    208
  4096       2.74 |    182
 16384       2.73 |    173
 65536       3.28 |    166
262144       3.20 |    167

--- Order 9 (2M folios) ---
     n      vanilla/cpu
(folios)    GB/s | first(us)
--------------------------
     1       7.05 |    194
     4       8.78 |    186
     8       8.47 |    188
    16       7.20 |    193
    64       8.23 |    191
   256      10.51 |    180
   512      10.88 |    173

Takeaway:
In each migrate_pages_batch() call, folios are first unmapped, then try_to_unmap_flush(),
and only then folios enter move_to_new_folio(). So first-folio latency is bounded by the
per-batch unmap+flush cost, and then plateaus once workload is large enough.


B). Patched kernel:

Here, N = NR_MAX_BATCHED_MIGRATION (in page). Total migrated data is fixed at 1 GB.
Change N with a knob to measure impact of different max batched size.

--- ORDER 0 (4K folios) ---
     N         offload/dma1          offload/dma4          offload/dma16
               GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
------------------------------------------------------------------------
   512         2.13 |    639         3.23 |    290         3.27 |    253
  1024         2.17 |   1261         3.44 |    582         3.58 |    536
  2048         2.01 |   2769         3.09 |   1360         3.45 |   1083
  4096         2.10 |   5059         3.13 |   2737         3.58 |   2115 
  8192         2.21 |   9320         3.17 |   5015         3.75 |   3617 
 16384         2.15 |  18689         3.31 |   9623         3.87 |   6937
 32768         2.12 |  42692         3.38 |  18893         3.83 |  14255
 65536         2.09 |  81956         3.38 |  38556         3.64 |  29003
131072         2.02 | 169563         3.22 |  81082         3.63 |  62236
262144         2.21 | 318424         3.12 | 170174         3.50 | 129413

--- ORDER 9 (2M folios) ---
     N         offload/dma1          offload/dma4          offload/dma16
               GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
-------------------------------------------------------------------------
   512         11.66 |    160        11.68 |    160        11.65 |    160
  1024         12.16 |    310        13.67 |    275        13.64 |    276
  2048         12.30 |    613        25.47 |    290        25.48 |    291
  4096         12.48 |   1215        26.19 |    566        42.59 |    335
  8192         12.56 |   2424        26.57 |   1118        58.72 |    470 *
 16384         12.61 |   4839        26.77 |   2218        61.94 |    896
 32768         12.60 |   9667        26.98 |   4422        63.75 |   1748
 65536         12.63 |  19318        26.99 |   8838        60.66 |   3543
131072         12.64 |  38935        27.02 |  17935        61.06 |   7178
262144         12.66 |  77694        26.85 |  35871        65.06 |  14129

In the batch-copy offload approach, DMA copy phase is inserted between unmap/flush and move,
So larger N increases first-folio wall clock latency. Throughput improves but with diminishing
returns.

For DCBM+PTDMA setup, the optimal batch for 2M folios sits around N=8192-16384,
because a larger batch allows the driver to distribute more folios across available DMA channels.
This is where we get most throughput while keeping the first folio latency in check.

This optimal batch value is hardware-specific. Other engines (eg. SDXI) and memory tier (eg. CXL)
will likely have different curves.

Does this approach and experiment look good to you?

Thanks,
Shivank



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload
  2026-05-08 11:04   ` Garg, Shivank
@ 2026-05-08 11:28     ` Huang, Ying
  2026-05-08 12:34       ` Garg, Shivank
  0 siblings, 1 reply; 14+ messages in thread
From: Huang, Ying @ 2026-05-08 11:28 UTC (permalink / raw)
  To: Garg, Shivank
  Cc: akpm, david, kinseyho, weixugc, ljs, Liam.Howlett, vbabka, willy,
	rppt, surenb, mhocko, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, gourry, apopple, dave, Jonathan.Cameron,
	rkodsara, vkoul, bharata, sj, rientjes, xuezhengchu, yiannis,
	dave.hansen, hannes, jhubbard, peterx, riel, shakeel.butt,
	stalexan, tj, nifan.cxl, jic23, aneesh.kumar, nathan.lynch,
	Frank.li, djbw, linux-kernel, linux-mm

Hi, Shivank,

"Garg, Shivank" <shivankg@amd.com> writes:

> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>> Shivank Garg <shivankg@amd.com> writes:
>
>>> PERFORMANCE RESULTS:
>>> --------------------
>>>
>>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>>> change in V5 alters this picture; please refer to the V4 cover letter
>>> for the throughput tables [1].
>> 
>> IMHO, it's better to copy performance data here.
>> 
>> In addition to the performance benefit, I want to know the downside as
>> well.  For example, the migration latency of the first folio may be
>> longer.  If so, by how much?  Can you measure the batch number vs. total
>> migration time (benefit) and first folio migration time (downside)?
>> That can be used to determine the optimal batch number.
>> 
>
> System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
> 1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.
>
> Benchmark: move_pages() syscall to move pages between two NUMA nodes.
>
> 1). Moving different sized folios such that total transfer size is constant
> (1GB), with different number of DMA channels. Throughput in GB/s.
>
> a. Baseline (vanilla kernel, single-threaded, serial folio_copy):
>
> ================================================================================
> 4K          | 16K        | 64K        | 256K       | 1M          | 2M          |
> ================================================================================
> 3.31±0.18   | 5.61±0.07  | 6.66±0.03  | 7.01±0.03  | 7.13±0.08   | 11.02±0.17  |
>
>
> b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):
>
> ============================================================================================
> N channel| 4K        | 16K         | 64K         | 256K        | 1M          | 2M          |
> ============================================================================================
> 1      | 2.16±0.14   | 2.58±0.02   | 3.00±0.04   | 4.56±0.28   | 4.62±0.02   | 12.65±0.08  |
> 2      | 2.68±0.09   | 3.69±0.15   | 4.52±0.04   | 6.75±0.06   | 7.19±0.19   | 14.38±0.06  |
> 4      | 3.07±0.13   | 4.62±0.09   | 6.47±0.56   | 9.22±0.15   | 10.24±0.47  | 27.01±0.11  |
> 8      | 3.43±0.09   | 5.40±0.16   | 7.67±0.08   | 11.25±0.17  | 12.60±0.60  | 45.62±0.52  |
> 12     | 3.50±0.11   | 5.66±0.16   | 8.12±0.10   | 11.97±0.19  | 13.43±0.08  | 61.02±0.92  |
> 16     | 3.54±0.12   | 5.79±0.14   | 8.50±0.13   | 12.59±0.15  | 17.21±6.40  | 65.23±1.70  |
>
>
> 2).  First-folio latency: Instrumented with custom tracepoints to measure latency per migrate_pages_batch() call.
>     Result: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.

Thanks for detailed data.  Per my understanding, the run time of
migrate_pages_batch() may be not good enough for measuring first folio
latency.  IIUC, the migration procedure is something like,

  for each folio
        unmap
  flush
  for each folio
        copy
        remap ===> first folio migrated

Some tracepoint should be better to measure it.

> A). Vanilla Kernel:
>
> Here, n = workload size passed to move_pages() in folios. Move n number of folios with move_pages().
> NR_MAX_BATCHED_MIGRATION is upstream default value 512.
>
> --- Order 0 (4K folios) ---
>      n      vanilla/cpu
> (folios)    GB/s | first(us)
> --------------------------
>      1       0.04 |     24
>      4       0.16 |     25
>      8       0.29 |     31
>     16       0.54 |     27
>     64       1.15 |     68
>    256       1.86 |    162
>    512       2.21 |    264
>   2048       2.62 |    208
>   4096       2.74 |    182
>  16384       2.73 |    173
>  65536       3.28 |    166
> 262144       3.20 |    167
>
> --- Order 9 (2M folios) ---
>      n      vanilla/cpu
> (folios)    GB/s | first(us)
> --------------------------
>      1       7.05 |    194
>      4       8.78 |    186
>      8       8.47 |    188
>     16       7.20 |    193
>     64       8.23 |    191
>    256      10.51 |    180
>    512      10.88 |    173
>
> Takeaway:
> In each migrate_pages_batch() call, folios are first unmapped, then try_to_unmap_flush(),
> and only then folios enter move_to_new_folio(). So first-folio latency is bounded by the
> per-batch unmap+flush cost, and then plateaus once workload is large enough.
>
>
> B). Patched kernel:
>
> Here, N = NR_MAX_BATCHED_MIGRATION (in page). Total migrated data is fixed at 1 GB.

Emm, so NR_MAX_BATCHED_MIGRATION could be very large?  I think that it
needs to be bounded.  If it is too large, too many pages may be in an
inaccessible state for a longer time.  That will hurt the workload
performance, although it is optimal for migration performance.

> Change N with a knob to measure impact of different max batched size.
>
> --- ORDER 0 (4K folios) ---
>      N         offload/dma1          offload/dma4          offload/dma16
>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
> ------------------------------------------------------------------------
>    512         2.13 |    639         3.23 |    290         3.27 |    253
>   1024         2.17 |   1261         3.44 |    582         3.58 |    536
>   2048         2.01 |   2769         3.09 |   1360         3.45 |   1083
>   4096         2.10 |   5059         3.13 |   2737         3.58 |   2115 
>   8192         2.21 |   9320         3.17 |   5015         3.75 |   3617 
>  16384         2.15 |  18689         3.31 |   9623         3.87 |   6937
>  32768         2.12 |  42692         3.38 |  18893         3.83 |  14255
>  65536         2.09 |  81956         3.38 |  38556         3.64 |  29003
> 131072         2.02 | 169563         3.22 |  81082         3.63 |  62236
> 262144         2.21 | 318424         3.12 | 170174         3.50 | 129413
>
> --- ORDER 9 (2M folios) ---
>      N         offload/dma1          offload/dma4          offload/dma16
>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
> -------------------------------------------------------------------------
>    512         11.66 |    160        11.68 |    160        11.65 |    160
>   1024         12.16 |    310        13.67 |    275        13.64 |    276
>   2048         12.30 |    613        25.47 |    290        25.48 |    291
>   4096         12.48 |   1215        26.19 |    566        42.59 |    335
>   8192         12.56 |   2424        26.57 |   1118        58.72 |    470 *
>  16384         12.61 |   4839        26.77 |   2218        61.94 |    896
>  32768         12.60 |   9667        26.98 |   4422        63.75 |   1748
>  65536         12.63 |  19318        26.99 |   8838        60.66 |   3543
> 131072         12.64 |  38935        27.02 |  17935        61.06 |   7178
> 262144         12.66 |  77694        26.85 |  35871        65.06 |  14129
>
> In the batch-copy offload approach, DMA copy phase is inserted between unmap/flush and move,
> So larger N increases first-folio wall clock latency. Throughput improves but with diminishing
> returns.
>
> For DCBM+PTDMA setup, the optimal batch for 2M folios sits around N=8192-16384,
> because a larger batch allows the driver to distribute more folios across available DMA channels.
> This is where we get most throughput while keeping the first folio latency in check.
>
> This optimal batch value is hardware-specific. Other engines (eg. SDXI) and memory tier (eg. CXL)
> will likely have different curves.
>
> Does this approach and experiment look good to you?

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload
  2026-05-08 11:28     ` Huang, Ying
@ 2026-05-08 12:34       ` Garg, Shivank
  2026-05-09  7:49         ` Huang, Ying
  0 siblings, 1 reply; 14+ messages in thread
From: Garg, Shivank @ 2026-05-08 12:34 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, david, kinseyho, weixugc, ljs, Liam.Howlett, vbabka, willy,
	rppt, surenb, mhocko, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, gourry, apopple, dave, Jonathan.Cameron,
	rkodsara, vkoul, bharata, sj, rientjes, xuezhengchu, yiannis,
	dave.hansen, hannes, jhubbard, peterx, riel, shakeel.butt,
	stalexan, tj, nifan.cxl, jic23, aneesh.kumar, nathan.lynch,
	Frank.li, djbw, linux-kernel, linux-mm



On 5/8/2026 4:58 PM, Huang, Ying wrote:
> Hi, Shivank,
> 
> "Garg, Shivank" <shivankg@amd.com> writes:
> 
>> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>>> Shivank Garg <shivankg@amd.com> writes:
>>
>>>> PERFORMANCE RESULTS:
>>>> --------------------
>>>>
>>>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>>>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>>>> change in V5 alters this picture; please refer to the V4 cover letter
>>>> for the throughput tables [1].
>>>
>>> IMHO, it's better to copy performance data here.
>>>
>>> In addition to the performance benefit, I want to know the downside as
>>> well.  For example, the migration latency of the first folio may be
>>> longer.  If so, by how much?  Can you measure the batch number vs. total
>>> migration time (benefit) and first folio migration time (downside)?
>>> That can be used to determine the optimal batch number.
>>>
>>
>> System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
>> 1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.
>>
>> Benchmark: move_pages() syscall to move pages between two NUMA nodes.
>>
>> 1). Moving different sized folios such that total transfer size is constant
>> (1GB), with different number of DMA channels. Throughput in GB/s.
>>
>> a. Baseline (vanilla kernel, single-threaded, serial folio_copy):
>>
>> ================================================================================
>> 4K          | 16K        | 64K        | 256K       | 1M          | 2M          |
>> ================================================================================
>> 3.31±0.18   | 5.61±0.07  | 6.66±0.03  | 7.01±0.03  | 7.13±0.08   | 11.02±0.17  |
>>
>>
>> b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):
>>
>> ============================================================================================
>> N channel| 4K        | 16K         | 64K         | 256K        | 1M          | 2M          |
>> ============================================================================================
>> 1      | 2.16±0.14   | 2.58±0.02   | 3.00±0.04   | 4.56±0.28   | 4.62±0.02   | 12.65±0.08  |
>> 2      | 2.68±0.09   | 3.69±0.15   | 4.52±0.04   | 6.75±0.06   | 7.19±0.19   | 14.38±0.06  |
>> 4      | 3.07±0.13   | 4.62±0.09   | 6.47±0.56   | 9.22±0.15   | 10.24±0.47  | 27.01±0.11  |
>> 8      | 3.43±0.09   | 5.40±0.16   | 7.67±0.08   | 11.25±0.17  | 12.60±0.60  | 45.62±0.52  |
>> 12     | 3.50±0.11   | 5.66±0.16   | 8.12±0.10   | 11.97±0.19  | 13.43±0.08  | 61.02±0.92  |
>> 16     | 3.54±0.12   | 5.79±0.14   | 8.50±0.13   | 12.59±0.15  | 17.21±6.40  | 65.23±1.70  |
>>
>>
>> 2).  First-folio latency: Instrumented with custom tracepoints to measure latency per migrate_pages_batch() call.
>>     Result: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.
> 
> Thanks for detailed data.  Per my understanding, the run time of
> migrate_pages_batch() may be not good enough for measuring first folio
> latency.  IIUC, the migration procedure is something like,
> 
>   for each folio
>         unmap
>   flush
>   for each folio
>         copy
>         remap ===> first folio migrated
> 
> Some tracepoint should be better to measure it.

Sorry, my earlier write-up was unclear.
For first folio latency, I add two tracepoints: one at the start of migrate_pages_batch()
and one in migrate_folio_done(). 

I agree that the user-accessible point tracepoint should be right after remove_migration_ptes().
Though, migrate_folio_done() runs only a few operations later, and will have a constant
offset, so it's unlikely to change the shape of the trade-off curve.
I'll move the tracepoint right after remove_migration_ptes() for new posting.

> 
>> A). Vanilla Kernel:
>>
>> Here, n = workload size passed to move_pages() in folios. Move n number of folios with move_pages().
>> NR_MAX_BATCHED_MIGRATION is upstream default value 512.
>>
>> --- Order 0 (4K folios) ---
>>      n      vanilla/cpu
>> (folios)    GB/s | first(us)
>> --------------------------
>>      1       0.04 |     24
>>      4       0.16 |     25
>>      8       0.29 |     31
>>     16       0.54 |     27
>>     64       1.15 |     68
>>    256       1.86 |    162
>>    512       2.21 |    264
>>   2048       2.62 |    208
>>   4096       2.74 |    182
>>  16384       2.73 |    173
>>  65536       3.28 |    166
>> 262144       3.20 |    167
>>
>> --- Order 9 (2M folios) ---
>>      n      vanilla/cpu
>> (folios)    GB/s | first(us)
>> --------------------------
>>      1       7.05 |    194
>>      4       8.78 |    186
>>      8       8.47 |    188
>>     16       7.20 |    193
>>     64       8.23 |    191
>>    256      10.51 |    180
>>    512      10.88 |    173
>>
>> Takeaway:
>> In each migrate_pages_batch() call, folios are first unmapped, then try_to_unmap_flush(),
>> and only then folios enter move_to_new_folio(). So first-folio latency is bounded by the
>> per-batch unmap+flush cost, and then plateaus once workload is large enough.
>>
>>
>> B). Patched kernel:
>>
>> Here, N = NR_MAX_BATCHED_MIGRATION (in page). Total migrated data is fixed at 1 GB.
> 
> Emm, so NR_MAX_BATCHED_MIGRATION could be very large?  I think that it
> needs to be bounded.  If it is too large, too many pages may be in an
> inaccessible state for a longer time.  That will hurt the workload
> performance, although it is optimal for migration performance.
> 

Agreed, it must be bounded.

>> Change N with a knob to measure impact of different max batched size.
>>
>> --- ORDER 0 (4K folios) ---
>>      N         offload/dma1          offload/dma4          offload/dma16
>>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
>> ------------------------------------------------------------------------
>>    512         2.13 |    639         3.23 |    290         3.27 |    253
>>   1024         2.17 |   1261         3.44 |    582         3.58 |    536
>>   2048         2.01 |   2769         3.09 |   1360         3.45 |   1083
>>   4096         2.10 |   5059         3.13 |   2737         3.58 |   2115 
>>   8192         2.21 |   9320         3.17 |   5015         3.75 |   3617 
>>  16384         2.15 |  18689         3.31 |   9623         3.87 |   6937
>>  32768         2.12 |  42692         3.38 |  18893         3.83 |  14255
>>  65536         2.09 |  81956         3.38 |  38556         3.64 |  29003
>> 131072         2.02 | 169563         3.22 |  81082         3.63 |  62236
>> 262144         2.21 | 318424         3.12 | 170174         3.50 | 129413
>>
>> --- ORDER 9 (2M folios) ---
>>      N         offload/dma1          offload/dma4          offload/dma16
>>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
>> -------------------------------------------------------------------------
>>    512         11.66 |    160        11.68 |    160        11.65 |    160
>>   1024         12.16 |    310        13.67 |    275        13.64 |    276
>>   2048         12.30 |    613        25.47 |    290        25.48 |    291
>>   4096         12.48 |   1215        26.19 |    566        42.59 |    335
>>   8192         12.56 |   2424        26.57 |   1118        58.72 |    470 *
>>  16384         12.61 |   4839        26.77 |   2218        61.94 |    896
>>  32768         12.60 |   9667        26.98 |   4422        63.75 |   1748
>>  65536         12.63 |  19318        26.99 |   8838        60.66 |   3543
>> 131072         12.64 |  38935        27.02 |  17935        61.06 |   7178
>> 262144         12.66 |  77694        26.85 |  35871        65.06 |  14129
>>
>> In the batch-copy offload approach, DMA copy phase is inserted between unmap/flush and move,
>> So larger N increases first-folio wall clock latency. Throughput improves but with diminishing
>> returns.
>>
>> For DCBM+PTDMA setup, the optimal batch for 2M folios sits around N=8192-16384,
>> because a larger batch allows the driver to distribute more folios across available DMA channels.
>> This is where we get most throughput while keeping the first folio latency in check.
>>
>> This optimal batch value is hardware-specific. Other engines (eg. SDXI) and memory tier (eg. CXL)
>> will likely have different curves.
>>
>> Does this approach and experiment look good to you?
> 
> ---
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload
  2026-05-08 12:34       ` Garg, Shivank
@ 2026-05-09  7:49         ` Huang, Ying
  2026-05-10 15:03           ` Garg, Shivank
  0 siblings, 1 reply; 14+ messages in thread
From: Huang, Ying @ 2026-05-09  7:49 UTC (permalink / raw)
  To: Garg, Shivank
  Cc: akpm, david, kinseyho, weixugc, ljs, Liam.Howlett, vbabka, willy,
	rppt, surenb, mhocko, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, gourry, apopple, dave, Jonathan.Cameron,
	rkodsara, vkoul, bharata, sj, rientjes, xuezhengchu, yiannis,
	dave.hansen, hannes, jhubbard, peterx, riel, shakeel.butt,
	stalexan, tj, nifan.cxl, jic23, aneesh.kumar, nathan.lynch,
	Frank.li, djbw, linux-kernel, linux-mm

"Garg, Shivank" <shivankg@amd.com> writes:

> On 5/8/2026 4:58 PM, Huang, Ying wrote:
>> Hi, Shivank,
>> 
>> "Garg, Shivank" <shivankg@amd.com> writes:
>> 
>>> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>>>> Shivank Garg <shivankg@amd.com> writes:
>>>
>>>>> PERFORMANCE RESULTS:
>>>>> --------------------
>>>>>
>>>>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>>>>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>>>>> change in V5 alters this picture; please refer to the V4 cover letter
>>>>> for the throughput tables [1].
>>>>
>>>> IMHO, it's better to copy performance data here.
>>>>
>>>> In addition to the performance benefit, I want to know the downside as
>>>> well.  For example, the migration latency of the first folio may be
>>>> longer.  If so, by how much?  Can you measure the batch number vs. total
>>>> migration time (benefit) and first folio migration time (downside)?
>>>> That can be used to determine the optimal batch number.
>>>>
>>>
>>> System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
>>> 1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.
>>>
>>> Benchmark: move_pages() syscall to move pages between two NUMA nodes.
>>>
>>> 1). Moving different sized folios such that total transfer size is constant
>>> (1GB), with different number of DMA channels. Throughput in GB/s.
>>>
>>> a. Baseline (vanilla kernel, single-threaded, serial folio_copy):
>>>
>>> ================================================================================
>>> 4K          | 16K        | 64K        | 256K       | 1M          | 2M          |
>>> ================================================================================
>>> 3.31±0.18   | 5.61±0.07  | 6.66±0.03  | 7.01±0.03  | 7.13±0.08   | 11.02±0.17  |
>>>
>>>
>>> b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):
>>>
>>> ============================================================================================
>>> N channel| 4K        | 16K         | 64K         | 256K        | 1M          | 2M          |
>>> ============================================================================================
>>> 1      | 2.16±0.14   | 2.58±0.02   | 3.00±0.04   | 4.56±0.28   | 4.62±0.02   | 12.65±0.08  |
>>> 2      | 2.68±0.09   | 3.69±0.15   | 4.52±0.04   | 6.75±0.06   | 7.19±0.19   | 14.38±0.06  |
>>> 4      | 3.07±0.13   | 4.62±0.09   | 6.47±0.56   | 9.22±0.15   | 10.24±0.47  | 27.01±0.11  |
>>> 8      | 3.43±0.09   | 5.40±0.16   | 7.67±0.08   | 11.25±0.17  | 12.60±0.60  | 45.62±0.52  |
>>> 12     | 3.50±0.11   | 5.66±0.16   | 8.12±0.10   | 11.97±0.19  | 13.43±0.08  | 61.02±0.92  |
>>> 16     | 3.54±0.12   | 5.79±0.14   | 8.50±0.13   | 12.59±0.15  | 17.21±6.40  | 65.23±1.70  |
>>>
>>>
>>> 2).  First-folio latency: Instrumented with custom tracepoints to
>>> measure latency per migrate_pages_batch() call.
>>>     Result: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.
>> 
>> Thanks for detailed data.  Per my understanding, the run time of
>> migrate_pages_batch() may be not good enough for measuring first folio
>> latency.  IIUC, the migration procedure is something like,
>> 
>>   for each folio
>>         unmap
>>   flush
>>   for each folio
>>         copy
>>         remap ===> first folio migrated
>> 
>> Some tracepoint should be better to measure it.
>
> Sorry, my earlier write-up was unclear.
> For first folio latency, I add two tracepoints: one at the start of migrate_pages_batch()
> and one in migrate_folio_done(). 
>
> I agree that the user-accessible point tracepoint should be right after remove_migration_ptes().
> Though, migrate_folio_done() runs only a few operations later, and will have a constant
> offset, so it's unlikely to change the shape of the trade-off curve.
> I'll move the tracepoint right after remove_migration_ptes() for new posting.

Thanks for explanation.  Trace point in migrate_folio_done() should be OK.

>> 
>>> A). Vanilla Kernel:
>>>
>>> Here, n = workload size passed to move_pages() in folios. Move n number of folios with move_pages().
>>> NR_MAX_BATCHED_MIGRATION is upstream default value 512.
>>>
>>> --- Order 0 (4K folios) ---
>>>      n      vanilla/cpu
>>> (folios)    GB/s | first(us)
>>> --------------------------
>>>      1       0.04 |     24
>>>      4       0.16 |     25
>>>      8       0.29 |     31
>>>     16       0.54 |     27
>>>     64       1.15 |     68
>>>    256       1.86 |    162
>>>    512       2.21 |    264
>>>   2048       2.62 |    208
>>>   4096       2.74 |    182
>>>  16384       2.73 |    173
>>>  65536       3.28 |    166
>>> 262144       3.20 |    167
>>>
>>> --- Order 9 (2M folios) ---
>>>      n      vanilla/cpu
>>> (folios)    GB/s | first(us)
>>> --------------------------
>>>      1       7.05 |    194
>>>      4       8.78 |    186
>>>      8       8.47 |    188
>>>     16       7.20 |    193
>>>     64       8.23 |    191
>>>    256      10.51 |    180
>>>    512      10.88 |    173
>>>
>>> Takeaway:
>>> In each migrate_pages_batch() call, folios are first unmapped, then try_to_unmap_flush(),
>>> and only then folios enter move_to_new_folio(). So first-folio latency is bounded by the
>>> per-batch unmap+flush cost, and then plateaus once workload is large enough.
>>>
>>>
>>> B). Patched kernel:
>>>
>>> Here, N = NR_MAX_BATCHED_MIGRATION (in page). Total migrated data is fixed at 1 GB.
>> 
>> Emm, so NR_MAX_BATCHED_MIGRATION could be very large?  I think that it
>> needs to be bounded.  If it is too large, too many pages may be in an
>> inaccessible state for a longer time.  That will hurt the workload
>> performance, although it is optimal for migration performance.
>> 
>
> Agreed, it must be bounded.

Thanks!  Could you retest with bounded NR_MAX_BATCHED_MIGRATION.  If the
upstream default doesn't work well for you.  We can find a better one
that balances throughput and latency well.

>>> Change N with a knob to measure impact of different max batched size.
>>>
>>> --- ORDER 0 (4K folios) ---
>>>      N         offload/dma1          offload/dma4          offload/dma16
>>>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
>>> ------------------------------------------------------------------------
>>>    512         2.13 |    639         3.23 |    290         3.27 |    253
>>>   1024         2.17 |   1261         3.44 |    582         3.58 |    536
>>>   2048         2.01 |   2769         3.09 |   1360         3.45 |   1083
>>>   4096         2.10 |   5059         3.13 |   2737         3.58 |   2115 
>>>   8192         2.21 |   9320         3.17 |   5015         3.75 |   3617 
>>>  16384         2.15 |  18689         3.31 |   9623         3.87 |   6937
>>>  32768         2.12 |  42692         3.38 |  18893         3.83 |  14255
>>>  65536         2.09 |  81956         3.38 |  38556         3.64 |  29003
>>> 131072         2.02 | 169563         3.22 |  81082         3.63 |  62236
>>> 262144         2.21 | 318424         3.12 | 170174         3.50 | 129413
>>>
>>> --- ORDER 9 (2M folios) ---
>>>      N         offload/dma1          offload/dma4          offload/dma16
>>>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
>>> -------------------------------------------------------------------------
>>>    512         11.66 |    160        11.68 |    160        11.65 |    160
>>>   1024         12.16 |    310        13.67 |    275        13.64 |    276
>>>   2048         12.30 |    613        25.47 |    290        25.48 |    291
>>>   4096         12.48 |   1215        26.19 |    566        42.59 |    335
>>>   8192         12.56 |   2424        26.57 |   1118        58.72 |    470 *
>>>  16384         12.61 |   4839        26.77 |   2218        61.94 |    896
>>>  32768         12.60 |   9667        26.98 |   4422        63.75 |   1748
>>>  65536         12.63 |  19318        26.99 |   8838        60.66 |   3543
>>> 131072         12.64 |  38935        27.02 |  17935        61.06 |   7178
>>> 262144         12.66 |  77694        26.85 |  35871        65.06 |  14129
>>>
>>> In the batch-copy offload approach, DMA copy phase is inserted between unmap/flush and move,
>>> So larger N increases first-folio wall clock latency. Throughput improves but with diminishing
>>> returns.
>>>
>>> For DCBM+PTDMA setup, the optimal batch for 2M folios sits around N=8192-16384,
>>> because a larger batch allows the driver to distribute more folios across available DMA channels.
>>> This is where we get most throughput while keeping the first folio latency in check.
>>>
>>> This optimal batch value is hardware-specific. Other engines (eg. SDXI) and memory tier (eg. CXL)
>>> will likely have different curves.
>>>
>>> Does this approach and experiment look good to you?

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload
  2026-05-09  7:49         ` Huang, Ying
@ 2026-05-10 15:03           ` Garg, Shivank
  0 siblings, 0 replies; 14+ messages in thread
From: Garg, Shivank @ 2026-05-10 15:03 UTC (permalink / raw)
  To: Huang, Ying
  Cc: akpm, david, kinseyho, weixugc, ljs, Liam.Howlett, vbabka, willy,
	rppt, surenb, mhocko, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, gourry, apopple, dave, Jonathan.Cameron,
	rkodsara, vkoul, bharata, sj, rientjes, xuezhengchu, yiannis,
	dave.hansen, hannes, jhubbard, peterx, riel, shakeel.butt,
	stalexan, tj, nifan.cxl, jic23, aneesh.kumar, nathan.lynch,
	Frank.li, djbw, linux-kernel, linux-mm



On 5/9/2026 1:19 PM, Huang, Ying wrote:
> "Garg, Shivank" <shivankg@amd.com> writes:
> 
>> On 5/8/2026 4:58 PM, Huang, Ying wrote:
>>> Hi, Shivank,
>>>
>>> "Garg, Shivank" <shivankg@amd.com> writes:
>>>
>>>> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>>>>> Shivank Garg <shivankg@amd.com> writes:
>>>>
>>>>>> PERFORMANCE RESULTS:
>>>>>> --------------------
>>>>>>
>>>>>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>>>>>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>>>>>> change in V5 alters this picture; please refer to the V4 cover letter
>>>>>> for the throughput tables [1].
>>>>>
>>>>> IMHO, it's better to copy performance data here.
>>>>>
>>>>> In addition to the performance benefit, I want to know the downside as
>>>>> well.  For example, the migration latency of the first folio may be
>>>>> longer.  If so, by how much?  Can you measure the batch number vs. total
>>>>> migration time (benefit) and first folio migration time (downside)?
>>>>> That can be used to determine the optimal batch number.
>>>>>
>>>>
>>>> System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
>>>> 1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.
>>>>
>>>> Benchmark: move_pages() syscall to move pages between two NUMA nodes.
>>>>
>>>> 1). Moving different sized folios such that total transfer size is constant
>>>> (1GB), with different number of DMA channels. Throughput in GB/s.
>>>>
>>>> a. Baseline (vanilla kernel, single-threaded, serial folio_copy):
>>>>
>>>> ================================================================================
>>>> 4K          | 16K        | 64K        | 256K       | 1M          | 2M          |
>>>> ================================================================================
>>>> 3.31±0.18   | 5.61±0.07  | 6.66±0.03  | 7.01±0.03  | 7.13±0.08   | 11.02±0.17  |
>>>>
>>>>
>>>> b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):
>>>>
>>>> ============================================================================================
>>>> N channel| 4K        | 16K         | 64K         | 256K        | 1M          | 2M          |
>>>> ============================================================================================
>>>> 1      | 2.16±0.14   | 2.58±0.02   | 3.00±0.04   | 4.56±0.28   | 4.62±0.02   | 12.65±0.08  |
>>>> 2      | 2.68±0.09   | 3.69±0.15   | 4.52±0.04   | 6.75±0.06   | 7.19±0.19   | 14.38±0.06  |
>>>> 4      | 3.07±0.13   | 4.62±0.09   | 6.47±0.56   | 9.22±0.15   | 10.24±0.47  | 27.01±0.11  |
>>>> 8      | 3.43±0.09   | 5.40±0.16   | 7.67±0.08   | 11.25±0.17  | 12.60±0.60  | 45.62±0.52  |
>>>> 12     | 3.50±0.11   | 5.66±0.16   | 8.12±0.10   | 11.97±0.19  | 13.43±0.08  | 61.02±0.92  |
>>>> 16     | 3.54±0.12   | 5.79±0.14   | 8.50±0.13   | 12.59±0.15  | 17.21±6.40  | 65.23±1.70  |
>>>>
>>>>
>>>> 2).  First-folio latency: Instrumented with custom tracepoints to
>>>> measure latency per migrate_pages_batch() call.
>>>>     Result: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.
>>>
>>> Thanks for detailed data.  Per my understanding, the run time of
>>> migrate_pages_batch() may be not good enough for measuring first folio
>>> latency.  IIUC, the migration procedure is something like,
>>>
>>>   for each folio
>>>         unmap
>>>   flush
>>>   for each folio
>>>         copy
>>>         remap ===> first folio migrated
>>>
>>> Some tracepoint should be better to measure it.
>>
>> Sorry, my earlier write-up was unclear.
>> For first folio latency, I add two tracepoints: one at the start of migrate_pages_batch()
>> and one in migrate_folio_done(). 
>>
>> I agree that the user-accessible point tracepoint should be right after remove_migration_ptes().
>> Though, migrate_folio_done() runs only a few operations later, and will have a constant
>> offset, so it's unlikely to change the shape of the trade-off curve.
>> I'll move the tracepoint right after remove_migration_ptes() for new posting.
> 
> Thanks for explanation.  Trace point in migrate_folio_done() should be OK.
> 
>>>
>>>> A). Vanilla Kernel:
>>>>
>>>> Here, n = workload size passed to move_pages() in folios. Move n number of folios with move_pages().
>>>> NR_MAX_BATCHED_MIGRATION is upstream default value 512.
>>>>
>>>> --- Order 0 (4K folios) ---
>>>>      n      vanilla/cpu
>>>> (folios)    GB/s | first(us)
>>>> --------------------------
>>>>      1       0.04 |     24
>>>>      4       0.16 |     25
>>>>      8       0.29 |     31
>>>>     16       0.54 |     27
>>>>     64       1.15 |     68
>>>>    256       1.86 |    162
>>>>    512       2.21 |    264
>>>>   2048       2.62 |    208
>>>>   4096       2.74 |    182
>>>>  16384       2.73 |    173
>>>>  65536       3.28 |    166
>>>> 262144       3.20 |    167
>>>>
>>>> --- Order 9 (2M folios) ---
>>>>      n      vanilla/cpu
>>>> (folios)    GB/s | first(us)
>>>> --------------------------
>>>>      1       7.05 |    194
>>>>      4       8.78 |    186
>>>>      8       8.47 |    188
>>>>     16       7.20 |    193
>>>>     64       8.23 |    191
>>>>    256      10.51 |    180
>>>>    512      10.88 |    173
>>>>
>>>> Takeaway:
>>>> In each migrate_pages_batch() call, folios are first unmapped, then try_to_unmap_flush(),
>>>> and only then folios enter move_to_new_folio(). So first-folio latency is bounded by the
>>>> per-batch unmap+flush cost, and then plateaus once workload is large enough.
>>>>
>>>>
>>>> B). Patched kernel:
>>>>
>>>> Here, N = NR_MAX_BATCHED_MIGRATION (in page). Total migrated data is fixed at 1 GB.
>>>
>>> Emm, so NR_MAX_BATCHED_MIGRATION could be very large?  I think that it
>>> needs to be bounded.  If it is too large, too many pages may be in an
>>> inaccessible state for a longer time.  That will hurt the workload
>>> performance, although it is optimal for migration performance.
>>>
>>
>> Agreed, it must be bounded.
> 
> Thanks!  Could you retest with bounded NR_MAX_BATCHED_MIGRATION.  If the
> upstream default doesn't work well for you.  We can find a better one
> that balances throughput and latency well.
> 

Thanks. Below tables sweep NR_MAX_BATCHED_MIGRATION from 512 up to 262144. On 2M folios,
16-channel PTDMA, the knee is at N=8192-16384 (= {16 to 32} * 512 ).

>>>>   8192         12.56 |   2424        26.57 |   1118        58.72 |    470 *

One thing worth flagging on the "bounded default": at the upstream cap of 512 pages,
migrate_pages_batch() receives at most one 2M folio per call, so PTDMA can only use
one of its 16 channels per batch and the offload reduces to vanilla. (DCBM offloads
one 2M folio to each channel).
The larger-N rows are what exercise the channel parallelism for PTDMA case.

"SDXI"[1] like memory-to-memory data movers should reach good throughput with just 1 channel, 
and thus may not require increasing the NR_MAX_BATCHED_MIGRATION for good throughput.

I'm not tying series this to specific perf default for now, the design review (batch-copy
path, migrator interface, registration, static_call dispatch) is the part I'd like to converge
on first, then tune the threshold after it. Does that ordering work?

[1] https://lore.kernel.org/all/20260410-sdxi-base-v1-0-1d184cb5c60a@amd.com

Best regards,
Shivank

>>>> Change N with a knob to measure impact of different max batched size.
>>>>
>>>> --- ORDER 0 (4K folios) ---
>>>>      N         offload/dma1          offload/dma4          offload/dma16
>>>>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
>>>> ------------------------------------------------------------------------
>>>>    512         2.13 |    639         3.23 |    290         3.27 |    253
>>>>   1024         2.17 |   1261         3.44 |    582         3.58 |    536
>>>>   2048         2.01 |   2769         3.09 |   1360         3.45 |   1083
>>>>   4096         2.10 |   5059         3.13 |   2737         3.58 |   2115 
>>>>   8192         2.21 |   9320         3.17 |   5015         3.75 |   3617 
>>>>  16384         2.15 |  18689         3.31 |   9623         3.87 |   6937
>>>>  32768         2.12 |  42692         3.38 |  18893         3.83 |  14255
>>>>  65536         2.09 |  81956         3.38 |  38556         3.64 |  29003
>>>> 131072         2.02 | 169563         3.22 |  81082         3.63 |  62236
>>>> 262144         2.21 | 318424         3.12 | 170174         3.50 | 129413
>>>>
>>>> --- ORDER 9 (2M folios) ---
>>>>      N         offload/dma1          offload/dma4          offload/dma16
>>>>                GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
>>>> -------------------------------------------------------------------------
>>>>    512         11.66 |    160        11.68 |    160        11.65 |    160
>>>>   1024         12.16 |    310        13.67 |    275        13.64 |    276
>>>>   2048         12.30 |    613        25.47 |    290        25.48 |    291
>>>>   4096         12.48 |   1215        26.19 |    566        42.59 |    335
>>>>   8192         12.56 |   2424        26.57 |   1118        58.72 |    470 *
>>>>  16384         12.61 |   4839        26.77 |   2218        61.94 |    896
>>>>  32768         12.60 |   9667        26.98 |   4422        63.75 |   1748
>>>>  65536         12.63 |  19318        26.99 |   8838        60.66 |   3543
>>>> 131072         12.64 |  38935        27.02 |  17935        61.06 |   7178
>>>> 262144         12.66 |  77694        26.85 |  35871        65.06 |  14129
>>>>
>>>> In the batch-copy offload approach, DMA copy phase is inserted between unmap/flush and move,
>>>> So larger N increases first-folio wall clock latency. Throughput improves but with diminishing
>>>> returns.
>>>>
>>>> For DCBM+PTDMA setup, the optimal batch for 2M folios sits around N=8192-16384,
>>>> because a larger batch allows the driver to distribute more folios across available DMA channels.
>>>> This is where we get most throughput while keeping the first folio latency in check.
>>>>
>>>> This optimal batch value is hardware-specific. Other engines (eg. SDXI) and memory tier (eg. CXL)
>>>> will likely have different curves.
>>>>
>>>> Does this approach and experiment look good to you?
> 
> ---
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 14+ messages in thread

[parent not found: <20260428155043.39251-8-shivankg@amd.com>]

* Re: [PATCH 3/7] mm/migrate: skip data copy for already-copied folios
       [not found] ` <20260428155043.39251-8-shivankg@amd.com>
@ 2026-05-11 15:35   ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 14+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-11 15:35 UTC (permalink / raw)
  To: Shivank Garg, akpm
  Cc: kinseyho, weixugc, ljs, Liam.Howlett, vbabka, willy, rppt, surenb,
	mhocko, ziy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, dave, Jonathan.Cameron, rkodsara,
	vkoul, bharata, sj, rientjes, xuezhengchu, yiannis, dave.hansen,
	hannes, jhubbard, peterx, riel, shakeel.butt, stalexan, tj,
	nifan.cxl, jic23, aneesh.kumar, nathan.lynch, Frank.li, djbw,
	linux-kernel, linux-mm

On 4/28/26 17:50, Shivank Garg wrote:
> Add a FOLIO_ALREADY_COPIED flag to the dst->migrate_info migration
> state. When set, __migrate_folio() skips folio_mc_copy() and
> performs metadata-only migration. All callers currently pass
> already_copied=false. The batch-copy path enables it later in a
> subsequent patch.
> 
> Move the dst->migrate_info state enum earlier in the file so
> __migrate_folio() and move_to_new_folio() can see FOLIO_ALREADY_COPIED.
> 
> Signed-off-by: Shivank Garg <shivankg@amd.com>
> ---
>  mm/migrate.c | 53 +++++++++++++++++++++++++++++++---------------------
>  1 file changed, 32 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 03c2a6f7e5e4..c493e67e359d 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -850,6 +850,19 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
>  }
>  EXPORT_SYMBOL(folio_migrate_flags);
>  
> +/*
> + * To record some information during migration, we use the migrate_info
> + * field of struct folio of the newly allocated destination folio.
> + * This is safe because nobody is using it except us.
> + */
> +enum {
> +	FOLIO_WAS_MAPPED = BIT(0),
> +	FOLIO_WAS_MLOCKED = BIT(1),
> +	FOLIO_ALREADY_COPIED = BIT(2),

I wonder whether we want to talk about "folio content copied", to not confuse it
with folio flags copied etc.

FOLIO_CONTENT_COPIED.

Thoughts?

> +	FOLIO_OLD_STATES = FOLIO_WAS_MAPPED | FOLIO_WAS_MLOCKED |
> +			   FOLIO_ALREADY_COPIED,
> +};
> +
>  /************************************************************
>   *                    Migration functions
>   ***********************************************************/
> @@ -859,14 +872,20 @@ static int __migrate_folio(struct address_space *mapping, struct folio *dst,
>  			   enum migrate_mode mode)
>  {
>  	int rc, expected_count = folio_expected_ref_count(src) + 1;
> +	bool already_copied = (dst->migrate_info & FOLIO_ALREADY_COPIED);

const, and no need for ().

> +
> +	if (already_copied)
> +		dst->migrate_info = 0;

Hm, why is that required? Might deserve a comment.

Likely you want to clear the "already copied" marker?

	dst->migrate_info &= ~FOLIO_ALREADY_COPIED;

?

But I wonder if this really belongs exactly here.

>  
>  	/* Check whether src does not have extra refs before we do more work */
>  	if (folio_ref_count(src) != expected_count)
>  		return -EAGAIN;
>  
> -	rc = folio_mc_copy(dst, src);
> -	if (unlikely(rc))
> -		return rc;
> +	if (!already_copied) {
> +		rc = folio_mc_copy(dst, src);
> +		if (unlikely(rc))
> +			return rc;
> +	}
>  
>  	rc = __folio_migrate_mapping(mapping, dst, src, expected_count);
>  	if (rc)
> @@ -1090,7 +1109,7 @@ static int fallback_migrate_folio(struct address_space *mapping,
>   *     0 - success
>   */
>  static int move_to_new_folio(struct folio *dst, struct folio *src,
> -				enum migrate_mode mode)
> +		enum migrate_mode mode, bool already_copied)
>  {
>  	struct address_space *mapping = folio_mapping(src);
>  	int rc = -EAGAIN;
> @@ -1098,6 +1117,9 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
>  	VM_BUG_ON_FOLIO(!folio_test_locked(src), src);
>  	VM_BUG_ON_FOLIO(!folio_test_locked(dst), dst);
>  
> +	if (already_copied)
> +		dst->migrate_info = FOLIO_ALREADY_COPIED;

|= ?

> +
>  	if (!mapping)
>  		rc = migrate_folio(mapping, dst, src, mode);
>  	else if (mapping_inaccessible(mapping))
> @@ -1129,17 +1151,6 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
>  	return rc;
>  }
>  
> -/*
> - * To record some information during migration, we use the migrate_info
> - * field of struct folio of the newly allocated destination folio.
> - * This is safe because nobody is using it except us.
> - */
> -enum {
> -	FOLIO_WAS_MAPPED = BIT(0),
> -	FOLIO_WAS_MLOCKED = BIT(1),
> -	FOLIO_OLD_STATES = FOLIO_WAS_MAPPED | FOLIO_WAS_MLOCKED,
> -};
> -
>  static void __migrate_folio_record(struct folio *dst,
>  		int old_folio_state, struct anon_vma *anon_vma)
>  {
> @@ -1353,7 +1364,7 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>  static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
>  			      struct folio *src, struct folio *dst,
>  			      enum migrate_mode mode, enum migrate_reason reason,
> -			      struct list_head *ret)
> +			      struct list_head *ret, bool already_copied)
>  {
>  	int rc;
>  	int old_folio_state = 0;
> @@ -1379,7 +1390,7 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
>  		src_partially_mapped = folio_test_partially_mapped(src);
>  	}
>  
> -	rc = move_to_new_folio(dst, src, mode);
> +	rc = move_to_new_folio(dst, src, mode, already_copied);
>  	if (rc)
>  		goto out;
>  
> @@ -1536,7 +1547,7 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
>  	}
>  
>  	if (!folio_mapped(src))
> -		rc = move_to_new_folio(dst, src, mode);
> +		rc = move_to_new_folio(dst, src, mode, false);

... mode, /* already_copied = */ false

>  
>  	if (page_was_mapped)
>  		remove_migration_ptes(src, !rc ? dst : src, ttu);
> @@ -1720,7 +1731,7 @@ static void migrate_folios_move(struct list_head *src_folios,
>  		struct list_head *ret_folios,
>  		struct migrate_pages_stats *stats,
>  		int *retry, int *thp_retry, int *nr_failed,
> -		int *nr_retry_pages)
> +		int *nr_retry_pages, bool already_copied)
>  {
>  	struct folio *folio, *folio2, *dst, *dst2;
>  	bool is_thp;
> @@ -1737,7 +1748,7 @@ static void migrate_folios_move(struct list_head *src_folios,
>  
>  		rc = migrate_folio_move(put_new_folio, private,
>  				folio, dst, mode,
> -				reason, ret_folios);
> +				reason, ret_folios, already_copied);
>  		/*
>  		 * The rules are:
>  		 *	0: folio will be freed
> @@ -1994,7 +2005,7 @@ static int migrate_pages_batch(struct list_head *from,
>  		migrate_folios_move(&unmap_folios, &dst_folios,
>  				put_new_folio, private, mode, reason,
>  				ret_folios, stats, &retry, &thp_retry,
> -				&nr_failed, &nr_retry_pages);
> +				&nr_failed, &nr_retry_pages, false);
>  	}

dito.

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 14+ messages in thread

[parent not found: <20260428155043.39251-10-shivankg@amd.com>]

* Re: [PATCH 4/7] mm/migrate: add batch-copy path in migrate_pages_batch
       [not found] ` <20260428155043.39251-10-shivankg@amd.com>
@ 2026-05-11 15:40   ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 14+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-11 15:40 UTC (permalink / raw)
  To: Shivank Garg, akpm
  Cc: kinseyho, weixugc, ljs, Liam.Howlett, vbabka, willy, rppt, surenb,
	mhocko, ziy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, dave, Jonathan.Cameron, rkodsara,
	vkoul, bharata, sj, rientjes, xuezhengchu, yiannis, dave.hansen,
	hannes, jhubbard, peterx, riel, shakeel.butt, stalexan, tj,
	nifan.cxl, jic23, aneesh.kumar, nathan.lynch, Frank.li, djbw,
	linux-kernel, linux-mm

On 4/28/26 17:50, Shivank Garg wrote:
> Add folios_mc_copy() which walks list of src and dst folios in lockstep,
> and copies folio content via folio_mc_copy(). folios_cnt parameter is
> unused here, but is part of the offload_copy callback signature used by
> later patches in the series.
> 
> Split unmapped folios into batch-eligible (unmap_batch/dst_batch) and
> standard (unmap_single/dst_single) lists, gated by the
> migrate_offload_enabled which is off by default. So, when no offload
> driver is active, the branch is never taken and everything goes
> through the standard path.
> 
> After TLB flush, batch copy the eligible folios via folios_mc_copy()
> and pass already_copied=true into migrate_folios_move() so
> __migrate_folio() skips the per-folio copy.
> 
> On batch copy failure, already_copied flag stays false and each folio
> fall back to individual copy.
> 
> Signed-off-by: Shivank Garg <shivankg@amd.com>
> ---
>  include/linux/mm.h |  2 ++
>  mm/migrate.c       | 61 +++++++++++++++++++++++++++++++++++-----------
>  mm/util.c          | 30 +++++++++++++++++++++++
>  3 files changed, 79 insertions(+), 14 deletions(-)

[...]

>  
> +DEFINE_STATIC_KEY_FALSE(migrate_offload_enabled);
> +
>  static const struct movable_operations *offline_movable_ops;
>  static const struct movable_operations *zsmalloc_movable_ops;
>  
> @@ -1724,6 +1727,12 @@ static int migrate_hugetlbs(struct list_head *from, new_folio_t get_new_folio,
>  	return nr_failed;
>  }
>  
> +/* movable_ops folios have their own migrate path */
> +static bool folio_supports_batch_copy(struct folio *folio)
> +{
> +	return likely(!page_has_movable_ops(&folio->page));
> +}

As these things are not actually folios (and callers will have to be taught to
distinguish them way, way earlier), I guess you should make this

/* movable_ops pages have a separate migration path */
static bool page_supports_batch_copy(struct page *page)
...

> +
>  static void migrate_folios_move(struct list_head *src_folios,
>  		struct list_head *dst_folios,
>  		free_folio_t put_new_folio, unsigned long private,
> @@ -1752,7 +1761,7 @@ static void migrate_folios_move(struct list_head *src_folios,
>  		/*
>  		 * The rules are:
>  		 *	0: folio will be freed
> -		 *	-EAGAIN: stay on the unmap_folios list
> +		 *	-EAGAIN: stay on the src_folios list
>  		 *	Other errno: put on ret_folios list
>  		 */
>  		switch (rc) {


[...]

> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -778,6 +778,36 @@ int folio_mc_copy(struct folio *dst, struct folio *src)
>  }
>  EXPORT_SYMBOL(folio_mc_copy);
>  
> +/**
> + * folios_mc_copy - Copy the contents of list of folios.
> + * @dst_list: destination folio list.
> + * @src_list: source folio list.
> + * @folios_cnt: unused here, present for callback signature compatibility.
> + *
> + * Walks list of src and dst folios in lockstep and copies folio
> + * content via folio_mc_copy(). The caller must ensure both lists have
> + * the same number of entries. This may sleep.

This *function*

> + *
> + * Return: 0 on success, negative errno on failure.
> + */
> +int folios_mc_copy(struct list_head *dst_list, struct list_head *src_list,
> +		unsigned int __always_unused folios_cnt)
> +{
> +	struct folio *src, *dst;
> +	int ret;
> +
> +	dst = list_first_entry(dst_list, struct folio, lru);
> +	list_for_each_entry(src, src_list, lru) {
> +		ret = folio_mc_copy(dst, src);
> +		if (ret)
> +			return ret;
> +		dst = list_next_entry(dst, lru);
> +	}

Wouldn't it be cleaner to remember "already copied" immediately after we ...
performed the copy? (succeeded with folio_mc_copy)



-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 14+ messages in thread

[parent not found: <20260428155043.39251-12-shivankg@amd.com>]

* Re: [PATCH 5/7] mm/migrate: add copy offload registration infrastructure
       [not found] ` <20260428155043.39251-12-shivankg@amd.com>
@ 2026-05-11 15:46   ` David Hildenbrand (Arm)
  2026-05-11 15:50   ` David Hildenbrand (Arm)
  1 sibling, 0 replies; 14+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-11 15:46 UTC (permalink / raw)
  To: Shivank Garg, akpm
  Cc: kinseyho, weixugc, ljs, Liam.Howlett, vbabka, willy, rppt, surenb,
	mhocko, ziy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, dave, Jonathan.Cameron, rkodsara,
	vkoul, bharata, sj, rientjes, xuezhengchu, yiannis, dave.hansen,
	hannes, jhubbard, peterx, riel, shakeel.butt, stalexan, tj,
	nifan.cxl, jic23, aneesh.kumar, nathan.lynch, Frank.li, djbw,
	linux-kernel, linux-mm, Mike Day

On 4/28/26 17:50, Shivank Garg wrote:
> Add a registration interface that lets a single offload provider
> (DMA, multi-threaded CPU copy, etc) take over the batch folio copy
> performed by migrate_pages_batch().
> 
> The provider fills in a struct migrator with an offload_copy()
> callback and calls migrate_offload_register(). Registration patches
> the migrate_offload_copy() static_call and flips the
> migrate_offload_enabled static branch. The migrate_offload_unregister()
> reverts both.
> 
> Whether a migration reason is batch-copy eligible is decided by the
> core in migrate_offload_do_batch(). A migrator may decline a particular
> batch (e.g. when nr_batch is too small to amortize setup) by returning
> -EOPNOTSUPP, and the move phase falls back to per-folio CPU copy.
> 
> Only one migrator can be active at a time. A second registration
> returns -EBUSY, and only the active migrator can unregister itself.
> The static_call dispatch is protected by SRCU so that the
> synchronize_srcu() in unregister waits for all in-flight copy before
> the module reference is dropped.
> 
> Co-developed-by: Mike Day <michael.day@amd.com>
> Signed-off-by: Mike Day <michael.day@amd.com>
> Signed-off-by: Shivank Garg <shivankg@amd.com>

MAINTAINERS file deserves some love (likely put the new files under the
migration section)

> ---
>  include/linux/migrate_copy_offload.h | 44 +++++++++++++
>  mm/Kconfig                           |  6 ++
>  mm/Makefile                          |  1 +
>  mm/migrate.c                         | 57 +++++++++++++++--
>  mm/migrate_copy_offload.c            | 94 ++++++++++++++++++++++++++++
>  5 files changed, 198 insertions(+), 4 deletions(-)
>  create mode 100644 include/linux/migrate_copy_offload.h
>  create mode 100644 mm/migrate_copy_offload.c
> 
> diff --git a/include/linux/migrate_copy_offload.h b/include/linux/migrate_copy_offload.h
> new file mode 100644
> index 000000000000..d68b10a84743
> --- /dev/null
> +++ b/include/linux/migrate_copy_offload.h
> @@ -0,0 +1,44 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MIGRATE_COPY_OFFLOAD_H
> +#define _LINUX_MIGRATE_COPY_OFFLOAD_H
> +
> +#include <linux/errno.h>
> +#include <linux/jump_label.h>
> +#include <linux/srcu.h>
> +#include <linux/types.h>
> +
> +struct list_head;
> +struct module;
> +
> +#define MIGRATOR_NAME_LEN 32
> +
> +/**
> + * struct migrator - batch-copy provider for page migration.
> + * @name: name of the provider.
> + * @offload_copy: copy @folio_cnt folios from @src_list to @dst_list.
> + *
> + *	The migrator may inspect @folio_cnt to decide whether the batch
> + * 	is worth offloading, e.g. skip when the batch is too small to
> + * 	amortize setup cost. If returns error, the core falls back to CPU copy.
> + *
> + * @owner: module providing the migrator.
> + */
> +struct migrator {
> +	char name[MIGRATOR_NAME_LEN];
> +	int (*offload_copy)(struct list_head *dst_list,
> +			    struct list_head *src_list,
> +			    unsigned int folio_cnt);
> +	struct module *owner;
> +};
> +
> +#ifdef CONFIG_MIGRATION_COPY_OFFLOAD
> +extern struct static_key_false migrate_offload_enabled;
> +extern struct srcu_struct migrate_offload_srcu;
> +int migrate_offload_register(struct migrator *m);
> +int migrate_offload_unregister(struct migrator *m);
> +#else
> +static inline int migrate_offload_register(struct migrator *m) { return -EOPNOTSUPP; }
> +static inline int migrate_offload_unregister(struct migrator *m) { return -EOPNOTSUPP; }
> +#endif
> +
> +#endif /* _LINUX_MIGRATE_COPY_OFFLOAD_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e8bf1e9e6ad9..325d79619680 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -647,6 +647,12 @@ config MIGRATION
>  config DEVICE_MIGRATION
>  	def_bool MIGRATION && ZONE_DEVICE
>  
> +# Page-migration batch-copy offload infrastructure.
> +# Selected by migrator drivers (e.g. CONFIG_DCBM_DMA).
> +config MIGRATION_COPY_OFFLOAD
> +	bool
> +	depends on MIGRATION
> +
>  config ARCH_ENABLE_HUGEPAGE_MIGRATION
>  	bool
>  
> diff --git a/mm/Makefile b/mm/Makefile
> index 8ad2ab08244e..db1ac8097089 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -96,6 +96,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
>  obj-$(CONFIG_FAIL_PAGE_ALLOC) += fail_page_alloc.o
>  obj-$(CONFIG_MEMTEST)		+= memtest.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_MIGRATION_COPY_OFFLOAD) += migrate_copy_offload.o
>  obj-$(CONFIG_NUMA) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6c2f1cb66f96..9af070f9a1f2 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -44,6 +44,8 @@
>  #include <linux/memory-tiers.h>
>  #include <linux/pagewalk.h>
>  #include <linux/jump_label.h>
> +#include <linux/static_call.h>
> +#include <linux/migrate_copy_offload.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -54,6 +56,51 @@
>  
>  DEFINE_STATIC_KEY_FALSE(migrate_offload_enabled);
>  
> +#ifdef CONFIG_MIGRATION_COPY_OFFLOAD
> +DEFINE_SRCU(migrate_offload_srcu);
> +DEFINE_STATIC_CALL(migrate_offload_copy, folios_mc_copy);
> +
> +static bool migrate_offload_do_batch(int reason)
> +{
> +	if (!static_branch_unlikely(&migrate_offload_enabled))
> +		return false;
> +
> +	switch (reason) {
> +	case MR_COMPACTION:
> +	case MR_SYSCALL:
> +	case MR_DEMOTION:
> +	case MR_NUMA_MISPLACED:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
> +static int migrate_offload_batch_copy(struct list_head *dst_batch,
> +				      struct list_head *src_batch,
> +				      unsigned int nr_batch)
> +{
> +	int idx, rc;
> +
> +	idx = srcu_read_lock(&migrate_offload_srcu);
> +	rc = static_call(migrate_offload_copy)(dst_batch, src_batch, nr_batch);
> +	srcu_read_unlock(&migrate_offload_srcu, idx);
> +	return rc;
> +}
> +#else
> +static bool migrate_offload_do_batch(int reason)
> +{
> +	return false;
> +}
> +
> +static int migrate_offload_batch_copy(struct list_head *dst_batch,
> +				      struct list_head *src_batch,
> +				      unsigned int nr_batch)
> +{
> +	return -EOPNOTSUPP;
> +}
> +#endif

Can't all that go to migrate_copy_offload.(hc) Looks like a clean interface form
migrate.c, no?

> +
>  static const struct movable_operations *offline_movable_ops;
>  static const struct movable_operations *zsmalloc_movable_ops;
>  
> @@ -1833,7 +1880,7 @@ static int migrate_pages_batch(struct list_head *from,
>  	struct folio *folio, *folio2, *dst = NULL;
>  	int rc, rc_saved = 0, nr_pages;
>  	unsigned int nr_batch = 0;
> -	bool batch_copied = false;
> +	bool do_batch = false, batch_copied = false;
>  	LIST_HEAD(unmap_batch);
>  	LIST_HEAD(dst_batch);
>  	LIST_HEAD(unmap_single);
> @@ -1843,6 +1890,8 @@ static int migrate_pages_batch(struct list_head *from,
>  	VM_WARN_ON_ONCE(mode != MIGRATE_ASYNC &&
>  			!list_empty(from) && !list_is_singular(from));
>  
> +	do_batch = migrate_offload_do_batch(reason);
> +
>  	for (pass = 0; pass < nr_pass && retry; pass++) {
>  		retry = 0;
>  		thp_retry = 0;
> @@ -1984,8 +2033,7 @@ static int migrate_pages_batch(struct list_head *from,
>  				nr_retry_pages += nr_pages;
>  				break;
>  			case 0:
> -				if (static_branch_unlikely(&migrate_offload_enabled) &&
> -				    folio_supports_batch_copy(folio)) {
> +				if (do_batch && folio_supports_batch_copy(folio)) {

I was about to say, the migrate_offload_enabled usage in the previous patch
looks off.

Can you move that to this patch here, and use in the previous patch simply a
default migrate_offload_do_batch() that simply returns "false" ?

>  					list_move_tail(&folio->lru, &unmap_batch);
>  					list_add_tail(&dst->lru, &dst_batch);
>  					nr_batch++;
> @@ -2017,7 +2065,8 @@ static int migrate_pages_batch(struct list_head *from,
>  
>  	/* Batch-copy eligible folios before the move phase */
>  	if (!list_empty(&unmap_batch)) {
> -		rc = folios_mc_copy(&dst_batch, &unmap_batch, nr_batch);
> +		rc = migrate_offload_batch_copy(&dst_batch, &unmap_batch,
> +						nr_batch);
>  		batch_copied = (rc == 0);
>  	}
>  
> diff --git a/mm/migrate_copy_offload.c b/mm/migrate_copy_offload.c
-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 5/7] mm/migrate: add copy offload registration infrastructure
       [not found] ` <20260428155043.39251-12-shivankg@amd.com>
  2026-05-11 15:46   ` [PATCH 5/7] mm/migrate: add copy offload registration infrastructure David Hildenbrand (Arm)
@ 2026-05-11 15:50   ` David Hildenbrand (Arm)
  1 sibling, 0 replies; 14+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-11 15:50 UTC (permalink / raw)
  To: Shivank Garg, akpm
  Cc: kinseyho, weixugc, ljs, Liam.Howlett, vbabka, willy, rppt, surenb,
	mhocko, ziy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, dave, Jonathan.Cameron, rkodsara,
	vkoul, bharata, sj, rientjes, xuezhengchu, yiannis, dave.hansen,
	hannes, jhubbard, peterx, riel, shakeel.butt, stalexan, tj,
	nifan.cxl, jic23, aneesh.kumar, nathan.lynch, Frank.li, djbw,
	linux-kernel, linux-mm, Mike Day

On 4/28/26 17:50, Shivank Garg wrote:
> Add a registration interface that lets a single offload provider
> (DMA, multi-threaded CPU copy, etc) take over the batch folio copy
> performed by migrate_pages_batch().
> 
> The provider fills in a struct migrator with an offload_copy()
> callback and calls migrate_offload_register(). Registration patches
> the migrate_offload_copy() static_call and flips the
> migrate_offload_enabled static branch. The migrate_offload_unregister()
> reverts both.
> 
> Whether a migration reason is batch-copy eligible is decided by the
> core in migrate_offload_do_batch(). A migrator may decline a particular
> batch (e.g. when nr_batch is too small to amortize setup) by returning
> -EOPNOTSUPP, and the move phase falls back to per-folio CPU copy.
> 
> Only one migrator can be active at a time. A second registration
> returns -EBUSY, and only the active migrator can unregister itself.
> The static_call dispatch is protected by SRCU so that the
> synchronize_srcu() in unregister waits for all in-flight copy before
> the module reference is dropped.
> 
> Co-developed-by: Mike Day <michael.day@amd.com>
> Signed-off-by: Mike Day <michael.day@amd.com>
> Signed-off-by: Shivank Garg <shivankg@amd.com>
> ---
>  include/linux/migrate_copy_offload.h | 44 +++++++++++++
>  mm/Kconfig                           |  6 ++
>  mm/Makefile                          |  1 +
>  mm/migrate.c                         | 57 +++++++++++++++--
>  mm/migrate_copy_offload.c            | 94 ++++++++++++++++++++++++++++
>  5 files changed, 198 insertions(+), 4 deletions(-)
>  create mode 100644 include/linux/migrate_copy_offload.h
>  create mode 100644 mm/migrate_copy_offload.c
> 
> diff --git a/include/linux/migrate_copy_offload.h b/include/linux/migrate_copy_offload.h
> new file mode 100644
> index 000000000000..d68b10a84743
> --- /dev/null
> +++ b/include/linux/migrate_copy_offload.h
> @@ -0,0 +1,44 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MIGRATE_COPY_OFFLOAD_H
> +#define _LINUX_MIGRATE_COPY_OFFLOAD_H
> +
> +#include <linux/errno.h>
> +#include <linux/jump_label.h>
> +#include <linux/srcu.h>
> +#include <linux/types.h>
> +
> +struct list_head;
> +struct module;
> +
> +#define MIGRATOR_NAME_LEN 32
> +
> +/**
> + * struct migrator - batch-copy provider for page migration.
> + * @name: name of the provider.
> + * @offload_copy: copy @folio_cnt folios from @src_list to @dst_list.
> + *
> + *	The migrator may inspect @folio_cnt to decide whether the batch
> + * 	is worth offloading, e.g. skip when the batch is too small to
> + * 	amortize setup cost. If returns error, the core falls back to CPU copy.
> + *
> + * @owner: module providing the migrator.
> + */
> +struct migrator {
> +	char name[MIGRATOR_NAME_LEN];
> +	int (*offload_copy)(struct list_head *dst_list,
> +			    struct list_head *src_list,
> +			    unsigned int folio_cnt);
> +	struct module *owner;
> +};
> +
> +#ifdef CONFIG_MIGRATION_COPY_OFFLOAD
> +extern struct static_key_false migrate_offload_enabled;
> +extern struct srcu_struct migrate_offload_srcu;
> +int migrate_offload_register(struct migrator *m);
> +int migrate_offload_unregister(struct migrator *m);
> +#else
> +static inline int migrate_offload_register(struct migrator *m) { return -EOPNOTSUPP; }
> +static inline int migrate_offload_unregister(struct migrator *m) { return -EOPNOTSUPP; }
> +#endif
> +
> +#endif /* _LINUX_MIGRATE_COPY_OFFLOAD_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e8bf1e9e6ad9..325d79619680 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -647,6 +647,12 @@ config MIGRATION
>  config DEVICE_MIGRATION
>  	def_bool MIGRATION && ZONE_DEVICE
>  
> +# Page-migration batch-copy offload infrastructure.
> +# Selected by migrator drivers (e.g. CONFIG_DCBM_DMA).
> +config MIGRATION_COPY_OFFLOAD
> +	bool
> +	depends on MIGRATION
> +
>  config ARCH_ENABLE_HUGEPAGE_MIGRATION
>  	bool
>  
> diff --git a/mm/Makefile b/mm/Makefile
> index 8ad2ab08244e..db1ac8097089 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -96,6 +96,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
>  obj-$(CONFIG_FAIL_PAGE_ALLOC) += fail_page_alloc.o
>  obj-$(CONFIG_MEMTEST)		+= memtest.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_MIGRATION_COPY_OFFLOAD) += migrate_copy_offload.o
>  obj-$(CONFIG_NUMA) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6c2f1cb66f96..9af070f9a1f2 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -44,6 +44,8 @@
>  #include <linux/memory-tiers.h>
>  #include <linux/pagewalk.h>
>  #include <linux/jump_label.h>
> +#include <linux/static_call.h>
> +#include <linux/migrate_copy_offload.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -54,6 +56,51 @@
>  
>  DEFINE_STATIC_KEY_FALSE(migrate_offload_enabled);
>  
> +#ifdef CONFIG_MIGRATION_COPY_OFFLOAD
> +DEFINE_SRCU(migrate_offload_srcu);
> +DEFINE_STATIC_CALL(migrate_offload_copy, folios_mc_copy);
> +
> +static bool migrate_offload_do_batch(int reason)
> +{
> +	if (!static_branch_unlikely(&migrate_offload_enabled))
> +		return false;
> +
> +	switch (reason) {
> +	case MR_COMPACTION:
> +	case MR_SYSCALL:
> +	case MR_DEMOTION:
> +	case MR_NUMA_MISPLACED:
> +		return true;
> +	default:
> +		return false;


What's the exact reason we don't do this for hotunplug etc? IOW, why do we make
this depend on a reason?


-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload
       [not found] <20260428155043.39251-2-shivankg@amd.com>
                   ` (5 preceding siblings ...)
       [not found] ` <20260428155043.39251-12-shivankg@amd.com>
@ 2026-05-11 15:53 ` David Hildenbrand (Arm)
  6 siblings, 0 replies; 14+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-11 15:53 UTC (permalink / raw)
  To: Shivank Garg, akpm
  Cc: kinseyho, weixugc, ljs, Liam.Howlett, vbabka, willy, rppt, surenb,
	mhocko, ziy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, dave, Jonathan.Cameron, rkodsara,
	vkoul, bharata, sj, rientjes, xuezhengchu, yiannis, dave.hansen,
	hannes, jhubbard, peterx, riel, shakeel.butt, stalexan, tj,
	nifan.cxl, jic23, aneesh.kumar, nathan.lynch, Frank.li, djbw,
	linux-kernel, linux-mm

On 4/28/26 17:50, Shivank Garg wrote:
> This is the fifth RFC of the patchset to enhance page migration by

Ah, this is an RFC ...

... I suggest b4 for patch series management :P

That also explains why patch #7 is still in there.

> batching folio-copy operations and enabling acceleration via DMA offload.
> 
> Single-threaded, folio-by-folio copying bottlenecks page migration in
> modern systems with deep memory hierarchies, especially for large folios
> where copy overhead dominates, leaving significant hardware potential
> untapped.
> 
> By batching the copy phase, we create an opportunity for hardware
> acceleration. This series builds the framework and provides a DMA
> offload driver (dcbm) as a reference implementation, targeting bulk
> migration workloads where offloading the copy improves throughput
> and latency while freeing the CPU cycles.
> 
> See the RFC V3 cover letter [2] for motivation.
> 
> Changelog since V4:
> -------------------
> 
> 1. Renamed PAGE_* migration state flags to FOLIO_*. (David)
> 2. Use the new folio->migrate_info field instead of folio->private
>    for migration state. (David)
> 3. Fold folios_mc_copy patch in batch-copy implementation patch. (David)
> 3. Renamed migrate_offload_start()/stop() to register()/unregister().
>    (Huang, Ying)
> 4. Dropped should_batch() callback from struct migrator. Reason-based
>    policy now lives in migrate_pages_batch(). Migrators can still skip
>    a batch they don't want (size based policy). (Huang, Ying)
> 5. CONFIG_MIGRATION_COPY_OFFLOAD is now hidden and selected by the
>    migrator driver. CONFIG_DCBM_DMA is tristate. (Huang Ying, Gregory Price). 
> 6. Wrapped the SRCU + static_call dispatch in a small helper. (Huang, Ying)
> 7. Requir m->owner in migrate_offload_register(), SRCU sync at
>    unregister relies on it. Counters are atomic_long_t to avoid lock-order
>    issue.
> 9. Moved DCBM sysfs from /sys/kernel/dcbm to /sys/module/dcbm (Huang, Ying) 
> 10. Rebased on v7.1-rc1.
> 

[...]

> 
> OPEN QUESTIONS:
> ---------------
> 
> 1. Should the batch path run without a registered migrator? Patches 1-4
>    are self-contained and use folios_mc_copy() (CPU). I have several
>    options like making batch path always-on for eligible folios, or
>    giving admin an option to flip the static branch, or keep the gate.
>    I'm leaning toward always-on.

Hiding that detail from migrate.c sounds interesting.

> 
> 2. Carrying already_copied via folio->migrate_info vs changing the
>    migrate_folio() callback signature (Huang, Ying). I went with the
>    field for now to avoid touching every fs callback before the design
>    settles. Happy to revisit.
> 
> 3. Per-caller offload selection: Today eligibility is by migrate_reason
>    only. Some are latency-tolerant, others may be not. Is reason the
>    right granularity, or do we want a per-caller hint?

Isn't it sufficient to just do it based on the #folios or sth like that?

If someone migrates a handful of folios, latency is likely more important (and
batching less beneficial).

I'd assume when migrating many folios, batching could just always be done. Or
what's the concern?

> 
> 4. Cgroup integration: How should per-cgroup be accounted for different
>    migrators (e.g.: any accounting for DMA-busy time)?

Oh. Do we even have to mess with that?

> 
> 5. Tuning migrate_pages callers for offloading. For instance, in
>    compaction COMPACT_CLUSTER_MAX = 32 caps DMA's payoff for compaction
>    (V4 experiment).

Is that HW dependent?

> 
> 6. Where do batch-size thresholds live, and how are they tuned? Per
>    Huang Ying's split, that policy lives in the migrator. DCBM has no
>    threshold today. Open whether it should later be a per-migrator
>    sysfs knob or hard-coded; probably clearer once a second migrator
>    (SDXI, mtcopy) shows the trade-off.

Again, sounds like being HW dependent, no?


-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2026-05-11 15:53 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20260428155043.39251-2-shivankg@amd.com>
     [not found] ` <20260428155043.39251-6-shivankg@amd.com>
2026-05-07  9:43   ` [PATCH 2/7] mm/migrate: use migrate_info field instead of private Huang, Ying
2026-05-11 15:22   ` David Hildenbrand (Arm)
2026-05-07  9:58 ` [PATCH 0/7] Accelerate page migration with batch copying and hardware offload Huang, Ying
2026-05-11 15:19   ` David Hildenbrand (Arm)
     [not found] ` <87zf2kvnqy.fsf@DESKTOP-5N7EMDA>
2026-05-08 11:04   ` Garg, Shivank
2026-05-08 11:28     ` Huang, Ying
2026-05-08 12:34       ` Garg, Shivank
2026-05-09  7:49         ` Huang, Ying
2026-05-10 15:03           ` Garg, Shivank
     [not found] ` <20260428155043.39251-8-shivankg@amd.com>
2026-05-11 15:35   ` [PATCH 3/7] mm/migrate: skip data copy for already-copied folios David Hildenbrand (Arm)
     [not found] ` <20260428155043.39251-10-shivankg@amd.com>
2026-05-11 15:40   ` [PATCH 4/7] mm/migrate: add batch-copy path in migrate_pages_batch David Hildenbrand (Arm)
     [not found] ` <20260428155043.39251-12-shivankg@amd.com>
2026-05-11 15:46   ` [PATCH 5/7] mm/migrate: add copy offload registration infrastructure David Hildenbrand (Arm)
2026-05-11 15:50   ` David Hildenbrand (Arm)
2026-05-11 15:53 ` [PATCH 0/7] Accelerate page migration with batch copying and hardware offload David Hildenbrand (Arm)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox