From: "Garg, Shivank" <shivankg@amd.com>
To: Vinod Koul <vkoul@kernel.org>
Cc: lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@kernel.org, willy@infradead.org, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, ziy@nvidia.com,
	matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net,
	ying.huang@linux.alibaba.com, apopple@nvidia.com,
	dave@stgolabs.net, Jonathan.Cameron@huawei.com, rkodsara@amd.com,
	bharata@amd.com, sj@kernel.org, weixugc@google.com,
	dan.j.williams@intel.com, rientjes@google.com,
	xuezhengchu@huawei.com, yiannis@zptcorp.com,
	dave.hansen@intel.com, hannes@cmpxchg.org, jhubbard@nvidia.com,
	peterx@redhat.com, riel@surriel.com, shakeel.butt@linux.dev,
	stalexan@redhat.com, tj@kernel.org, nifan.cxl@gmail.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	akpm@linux-foundation.org, david@kernel.org
Subject: Re: [RFC PATCH v4 5/6] drivers/migrate_offload: add DMA batch copy driver (dcbm)
Date: Fri, 24 Apr 2026 16:56:51 +0530	[thread overview]
Message-ID: <3e73addb-ac01-4a05-bc75-c6c1c56072df@amd.com> (raw)
In-Reply-To: <aeopIn_sCBR2CFrI@vaman>



On 4/23/2026 7:43 PM, Vinod Koul wrote:
> On 23-04-26, 17:40, Garg, Shivank wrote:
>> Hi Vinod,
>>
>> Following your suggestion at the Kernel meetup in Bangalore (11 Apr 2026)
>> to check 0cae04373b ("dmaengine: remove DMA_MEMCPY_SG once again") and use
>> DMA_MEMCPY_SG / dmaengine_prep_dma_memcpy_sg(), I ran an A/B comparison
>> against the existing DCBM path that uses dmaengine_prep_dma_memcpy() in a
>> loop over mapped SGL segments. (For this experiment I added a
>> device_prep_dma_memcpy_sg hook in drivers/dma/amd/ptdma/ptdma-dmaengine.c;
>> not posted.)
>>
>> I'm using the move_pages() workload to move 1 GB of data per run. I do not
>> see a significant performance difference; the results are broadly within
>> each other's noise bands.
>>
>> Throughput (GB/s, mean ± SD), ITERATIONS=10:
>>
>> Page     nr_dma_chan=1              nr_dma_chan=4              nr_dma_chan=8              nr_dma_chan=16
>> order    dcbm          dcbm_sg      dcbm          dcbm_sg      dcbm          dcbm_sg      dcbm           dcbm_sg
>> ------   -----------   ----------   -----------   ----------   -----------   ----------   ------------   ----------
>> 0        2.33 ± 0.17   2.26 ± 0.19  3.24 ± 0.21   3.18 ± 0.23  3.29 ± 0.10   3.45 ± 0.10  3.29 ± 0.13    3.49 ± 0.22
>> 4        2.77 ± 0.21   2.99 ± 0.18  6.26 ± 0.99   6.75 ± 0.12  8.01 ± 0.58   7.70 ± 0.64  8.22 ± 0.89    8.72 ± 0.87
>> 8        4.57 ± 0.70   4.75 ± 0.83  10.64 ± 1.97  10.94 ± 3.52 10.30 ± 1.22  10.36 ± 1.24 11.27 ± 1.21   12.47 ± 1.66
>> 9        12.71 ± 0.09  12.68 ± 0.08 27.13 ± 0.15  26.89 ± 0.27 46.50 ± 0.73  45.17 ± 2.46 67.25 ± 1.42   62.78 ± 8.24
>>
>> Notes: order 0/4/8/9 = 4K / 64K / 1M / 2M folios
>> 	dcbm     = per-segment dmaengine_prep_dma_memcpy
>> 	dcbm_sg  = DMA_MEMCPY_SG / dmaengine_prep_dma_memcpy_sg
>>
>> <snip>
>>
>>> +
>>> +static int submit_dma_transfers(struct dma_work *work)
>>> +{
>>> +	struct scatterlist *sg_src, *sg_dst;
>>> +	struct dma_async_tx_descriptor *tx;
>>> +	unsigned long flags = DMA_CTRL_ACK;
>>> +	dma_cookie_t cookie;
>>> +	int i;
>>> +
>>> +	atomic_set(&work->pending, 1);
>>> +
>>> +	sg_src = work->src_sgt->sgl;
>>> +	sg_dst = work->dst_sgt->sgl;
>>> +	for_each_sgtable_dma_sg(work->src_sgt, sg_src, i) {
>>> +		if (i == work->src_sgt->nents - 1)
>>> +			flags |= DMA_PREP_INTERRUPT;
>>> +
>>> +		tx = dmaengine_prep_dma_memcpy(work->chan,
>>> +				sg_dma_address(sg_dst),
>>> +				sg_dma_address(sg_src),
>>> +				sg_dma_len(sg_src), flags);
>>> +		if (!tx) {
>>> +			atomic_set(&work->pending, 0);
>>> +			return -EIO;
>>> +		}
>>> +
>>> +		if (i == work->src_sgt->nents - 1) {
>>> +			tx->callback = dma_completion_callback;
>>> +			tx->callback_param = work;
>>> +		}
>>> +
>>> +		cookie = dmaengine_submit(tx);
>>> +		if (dma_submit_error(cookie)) {
>>> +			atomic_set(&work->pending, 0);
>>> +			return -EIO;
>>> +		}
>>> +		sg_dst = sg_next(sg_dst);
>>> +	}
>>> +	return 0;
>>> +}
>>
>> static int submit_dma_transfers(struct dma_work *work)
>> {
>>         struct dma_async_tx_descriptor *tx;
>>         unsigned long flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
>>         dma_cookie_t cookie;
>>
>>         tx = dmaengine_prep_dma_memcpy_sg(work->chan,
>>                         work->dst_sgt->sgl, work->dst_sgt->nents,
>>                         work->src_sgt->sgl, work->src_sgt->nents,
>>                         flags);
>>         if (!tx)
>>                 return -EIO;
>>
>>         atomic_set(&work->pending, 1);
>>         tx->callback = dma_completion_callback;
>>         tx->callback_param = work;
>>
>>         cookie = dmaengine_submit(tx);
>>         if (dma_submit_error(cookie)) {
>>                 atomic_set(&work->pending, 0);
>>                 return -EIO;
>>         }
>>         return 0;
>> }
>>
>> The memcpy_sg version does simplify submit_dma_transfers()
>> (one dmaengine_prep_dma_memcpy_sg + one dmaengine_submit vs a loop).
> 
> Right
> 
>>
>> My current DCBM path issues dmaengine_prep_dma_memcpy()+dmaengine_submit()
>> per mapped SG segment and sets DMA_PREP_INTERRUPT + callback only
>> on the last one, so the IRQ/callback cost is already one per batch.
>>
>> My understanding is switching to dmaengine_prep_dma_memcpy_sg() mainly
>> saves the per-segment prep/submit calls and hands the provider a single
>> multi-segment TX to program.
> 
> Right, but the analysis you showed indicated the dma setup cost was
> quite high, so moving away from N transfers to a single one should have
> saved a bit more...
> 
>>
>> Please correct me if the benefit you had in mind is something stronger.
>> Thanks for the suggestion and the guidance.
> 
> I still feel this looks like the better version...
> Can you compare your setup time between the two please

I wrote a small dmaengine bench module to isolate the setup/prep overheads from the full migration path.

prep_memcpy: loop of dmaengine_prep_dma_memcpy(), one descriptor per SG entry, single completion callback on the last tx (same pattern my driver currently uses).
prep_memcpy_sg: one dmaengine_prep_dma_memcpy_sg() per batch, so the provider walks the mapped src/dst SGLs (proposed).

Each phase (prep / submit / issue / wait) is instrumented with ktime_get().
Happy to share the module and the runner script if useful.
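
A minimal sketch of the per-batch timing pattern for the prep_memcpy case
(illustrative only, not the actual module; bench_memcpy_batch()/bench_done()
and the *_ns accumulators are placeholder names):

/* needs <linux/dmaengine.h>, <linux/completion.h>, <linux/ktime.h>, <linux/scatterlist.h> */
static void bench_done(void *arg)
{
	complete(arg);
}

static int bench_memcpy_batch(struct dma_chan *chan,
			      struct sg_table *src_sgt, struct sg_table *dst_sgt,
			      s64 *prep_ns, s64 *submit_ns,
			      s64 *issue_ns, s64 *wait_ns)
{
	DECLARE_COMPLETION_ONSTACK(done);
	struct dma_async_tx_descriptor *tx;
	struct scatterlist *src, *dst = dst_sgt->sgl;
	unsigned long flags = DMA_CTRL_ACK;
	dma_cookie_t cookie;
	ktime_t t;
	int i;

	for_each_sgtable_dma_sg(src_sgt, src, i) {
		/* interrupt + callback only on the last descriptor of the batch */
		if (i == src_sgt->nents - 1)
			flags |= DMA_PREP_INTERRUPT;

		t = ktime_get();
		tx = dmaengine_prep_dma_memcpy(chan, sg_dma_address(dst),
					       sg_dma_address(src),
					       sg_dma_len(src), flags);
		*prep_ns += ktime_to_ns(ktime_sub(ktime_get(), t));
		if (!tx)
			return -EIO;

		if (i == src_sgt->nents - 1) {
			tx->callback = bench_done;
			tx->callback_param = &done;
		}

		t = ktime_get();
		cookie = dmaengine_submit(tx);
		*submit_ns += ktime_to_ns(ktime_sub(ktime_get(), t));
		if (dma_submit_error(cookie))
			return -EIO;

		dst = sg_next(dst);
	}

	t = ktime_get();
	dma_async_issue_pending(chan);
	*issue_ns += ktime_to_ns(ktime_sub(ktime_get(), t));

	t = ktime_get();
	wait_for_completion(&done);
	*wait_ns += ktime_to_ns(ktime_sub(ktime_get(), t));

	return 0;
}

The prep_memcpy_sg variant times the same four phases but collapses the loop
into a single dmaengine_prep_dma_memcpy_sg() + dmaengine_submit() per batch.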

Workload: copy 512 MB per channel, 20 runs per cell, src_nid=0 dst_nid=1, folio sizes 4KB/2MB, batch = 512 SG entries.
*_ms columns are thread time summed across channels (for chan=16, divide by 16 for per-channel time).
run_ms is the wall time to copy the 512 MB.
prep_calls is the total number of dmaengine_prep_dma_memcpy{,_sg}() calls (512x fewer for memcpy_sg).

mode            chan  folio  sge  run_ms           prep_ms         submit_ms     issue_ms       wait_ms         prep_calls
prep_memcpy     1     4KB    512  632.86 ± 8.18    18.00 ± 6.38    4.44 ± 0.09   0.09 ± 0.04    603.54 ± 5.03     131072 (= 512MB/4KB)
prep_memcpy_sg  1     4KB    512  611.34 ± 13.52    0.74 ± 0.33    0.01 ± 0.00   0.08 ± 0.00    610.48 ± 13.68    256    (= prep_memcpy calls / 512)

prep_memcpy     16    4KB    512  675.70 ± 14.13  416.19 ± 27.49  79.19 ± 2.27   1.53 ± 0.12   9590.11 ± 206.81   2097152
prep_memcpy_sg  16    4KB    512  615.43 ± 11.55   19.61 ± 3.38    0.17 ± 0.03   1.55 ± 0.16   9202.33 ± 138.41   4096

prep_memcpy     1     2MB    512   77.19 ± 0.15     0.04 ± 0.02    0.02 ± 0.00   0.00 ± 0.00     77.10 ± 0.15     512
prep_memcpy_sg  1     2MB    512   77.21 ± 0.11     0.00 ± 0.00    0.00 ± 0.00   0.00 ± 0.00     77.21 ± 0.11     1

prep_memcpy     16    2MB    512  186.01 ± 0.40     2.31 ± 0.17    0.32 ± 0.03   0.00 ± 0.00   2712.56 ± 4.24     8192
prep_memcpy_sg  16    2MB    512  185.63 ± 0.37     0.09 ± 0.02    0.00 ± 0.00   0.00 ± 0.00   2711.20 ± 3.75     16

On setup time, dmaengine_prep_dma_memcpy_sg() is a clear win (fewer preps, fewer submits, no per-tx callback
bookkeeping). However, the end-to-end throughput gain in the earlier comparison was modest because the
migration-path cost and the per-descriptor execution time (wait_ms) dominate.
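
As a rough worked example from the 16-channel / 4KB rows above: prep_ms + submit_ms drops from
~495 ms to ~20 ms of thread time summed over the 16 channels (roughly 31 ms -> 1 ms per channel),
while run_ms only improves from 675.70 ms to 615.43 ms, i.e. ~60 ms of wall time, since the copy
itself (wait_ms) remains the dominant term.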

Thanks,
Shivank



Thread overview: 32+ messages
2026-03-09 12:07 [RFC PATCH v4 0/6] Accelerate page migration with batch copying and hardware offload Shivank Garg
2026-03-09 12:07 ` [RFC PATCH v4 1/6] mm: introduce folios_mc_copy() for batch folio copying Shivank Garg
2026-03-12  9:41   ` David Hildenbrand (Arm)
2026-03-15 18:09     ` Garg, Shivank
2026-03-09 12:07 ` [RFC PATCH v4 2/6] mm/migrate: skip data copy for already-copied folios Shivank Garg
2026-03-12  9:44   ` David Hildenbrand (Arm)
2026-03-15 18:25     ` Garg, Shivank
2026-03-23 12:20       ` David Hildenbrand (Arm)
2026-03-24  8:22   ` Huang, Ying
2026-04-03 11:08     ` Garg, Shivank
2026-04-07  6:52       ` Huang, Ying
2026-04-23 12:20         ` Garg, Shivank
2026-03-09 12:07 ` [RFC PATCH v4 3/6] mm/migrate: add batch-copy path in migrate_pages_batch Shivank Garg
2026-03-24  8:42   ` Huang, Ying
2026-04-03 11:09     ` Garg, Shivank
2026-03-09 12:07 ` [RFC PATCH v4 4/6] mm/migrate: add copy offload registration infrastructure Shivank Garg
2026-03-09 17:54   ` Gregory Price
2026-03-10 10:07     ` Garg, Shivank
2026-03-24 10:54   ` Huang, Ying
2026-04-03 11:11     ` Garg, Shivank
2026-04-07  7:40       ` Huang, Ying
2026-04-28 12:10       ` Garg, Shivank
2026-03-09 12:07 ` [RFC PATCH v4 5/6] drivers/migrate_offload: add DMA batch copy driver (dcbm) Shivank Garg
2026-03-09 18:04   ` Gregory Price
2026-03-12  9:33     ` Garg, Shivank
2026-03-24  8:10   ` Huang, Ying
2026-04-03 11:06     ` Garg, Shivank
2026-04-23 12:10   ` Garg, Shivank
2026-04-23 14:13     ` Vinod Koul
2026-04-24 11:26       ` Garg, Shivank [this message]
2026-03-09 12:07 ` [RFC PATCH v4 6/6] mm/migrate: adjust NR_MAX_BATCHED_MIGRATION for testing Shivank Garg
2026-03-18 14:29 ` [RFC PATCH v4 0/6] Accelerate page migration with batch copying and hardware offload Garg, Shivank
