From: "Garg, Shivank" <shivankg@amd.com>
To: Vinod Koul <vkoul@kernel.org>
Cc: lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@kernel.org, willy@infradead.org, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, ziy@nvidia.com,
	matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net,
	ying.huang@linux.alibaba.com, apopple@nvidia.com,
	dave@stgolabs.net, Jonathan.Cameron@huawei.com, rkodsara@amd.com,
	bharata@amd.com, sj@kernel.org, weixugc@google.com,
	dan.j.williams@intel.com, rientjes@google.com,
	xuezhengchu@huawei.com, yiannis@zptcorp.com,
	dave.hansen@intel.com, hannes@cmpxchg.org, jhubbard@nvidia.com,
	peterx@redhat.com, riel@surriel.com, shakeel.butt@linux.dev,
	stalexan@redhat.com, tj@kernel.org, nifan.cxl@gmail.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	akpm@linux-foundation.org, david@kernel.org
Subject: Re: [RFC PATCH v4 5/6] drivers/migrate_offload: add DMA batch copy driver (dcbm)
Date: Fri, 24 Apr 2026 16:56:51 +0530
Message-ID: <3e73addb-ac01-4a05-bc75-c6c1c56072df@amd.com>
In-Reply-To: <aeopIn_sCBR2CFrI@vaman>



On 4/23/2026 7:43 PM, Vinod Koul wrote:
> On 23-04-26, 17:40, Garg, Shivank wrote:
>> Hi Vinod,
>>
>> Following your suggestion at the Kernel meetup in Bangalore (11 Apr 2026)
>> to check 0cae04373b ("dmaengine: remove DMA_MEMCPY_SG once again") and use
>> DMA_MEMCPY_SG / dmaengine_prep_dma_memcpy_sg(), I ran an A/B comparison
>> against the existing DCBM path, which uses dmaengine_prep_dma_memcpy() in
>> a loop over mapped SGL segments. (For this experiment I added a
>> device_prep_dma_memcpy_sg hook in drivers/dma/amd/ptdma/ptdma-dmaengine.c;
>> not posted.)
>>
>> I'm using the move_pages() workload to move 1 GB of data per run. I do not
>> see a significant performance difference; the results are broadly within
>> each other's noise bands.
>>
>> Throughput (GB/s, mean ± SD), ITERATIONS=10:
>>
>> Page     nr_dma_chan=1              nr_dma_chan=4              nr_dma_chan=8              nr_dma_chan=16
>> order    dcbm          dcbm_sg      dcbm          dcbm_sg      dcbm          dcbm_sg      dcbm           dcbm_sg
>> ------   -----------   ----------   -----------   ----------   -----------   ----------   ------------   ----------
>> 0        2.33 ± 0.17   2.26 ± 0.19  3.24 ± 0.21   3.18 ± 0.23  3.29 ± 0.10   3.45 ± 0.10  3.29 ± 0.13    3.49 ± 0.22
>> 4        2.77 ± 0.21   2.99 ± 0.18  6.26 ± 0.99   6.75 ± 0.12  8.01 ± 0.58   7.70 ± 0.64  8.22 ± 0.89    8.72 ± 0.87
>> 8        4.57 ± 0.70   4.75 ± 0.83  10.64 ± 1.97  10.94 ± 3.52 10.30 ± 1.22  10.36 ± 1.24 11.27 ± 1.21   12.47 ± 1.66
>> 9        12.71 ± 0.09  12.68 ± 0.08 27.13 ± 0.15  26.89 ± 0.27 46.50 ± 0.73  45.17 ± 2.46 67.25 ± 1.42   62.78 ± 8.24
>>
>> Notes: order 0/4/8/9 = 4K / 64K / 1M / 2M folios
>> 	dcbm     = per-segment dmaengine_prep_dma_memcpy
>> 	dcbm_sg  = DMA_MEMCPY_SG / dmaengine_prep_dma_memcpy_sg
>>
>> <snip>
>>
>>> +
>>> +static int submit_dma_transfers(struct dma_work *work)
>>> +{
>>> +	struct scatterlist *sg_src, *sg_dst;
>>> +	struct dma_async_tx_descriptor *tx;
>>> +	unsigned long flags = DMA_CTRL_ACK;
>>> +	dma_cookie_t cookie;
>>> +	int i;
>>> +
>>> +	atomic_set(&work->pending, 1);
>>> +
>>> +	sg_src = work->src_sgt->sgl;
>>> +	sg_dst = work->dst_sgt->sgl;
>>> +	for_each_sgtable_dma_sg(work->src_sgt, sg_src, i) {
>>> +		if (i == work->src_sgt->nents - 1)
>>> +			flags |= DMA_PREP_INTERRUPT;
>>> +
>>> +		tx = dmaengine_prep_dma_memcpy(work->chan,
>>> +				sg_dma_address(sg_dst),
>>> +				sg_dma_address(sg_src),
>>> +				sg_dma_len(sg_src), flags);
>>> +		if (!tx) {
>>> +			atomic_set(&work->pending, 0);
>>> +			return -EIO;
>>> +		}
>>> +
>>> +		if (i == work->src_sgt->nents - 1) {
>>> +			tx->callback = dma_completion_callback;
>>> +			tx->callback_param = work;
>>> +		}
>>> +
>>> +		cookie = dmaengine_submit(tx);
>>> +		if (dma_submit_error(cookie)) {
>>> +			atomic_set(&work->pending, 0);
>>> +			return -EIO;
>>> +		}
>>> +		sg_dst = sg_next(sg_dst);
>>> +	}
>>> +	return 0;
>>> +}
>>
>> static int submit_dma_transfers(struct dma_work *work)
>> {
>>         struct dma_async_tx_descriptor *tx;
>>         unsigned long flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
>>         dma_cookie_t cookie;
>>
>>         tx = dmaengine_prep_dma_memcpy_sg(work->chan,
>>                         work->dst_sgt->sgl, work->dst_sgt->nents,
>>                         work->src_sgt->sgl, work->src_sgt->nents,
>>                         flags);
>>         if (!tx)
>>                 return -EIO;
>>
>>         atomic_set(&work->pending, 1);
>>         tx->callback = dma_completion_callback;
>>         tx->callback_param = work;
>>
>>         cookie = dmaengine_submit(tx);
>>         if (dma_submit_error(cookie)) {
>>                 atomic_set(&work->pending, 0);
>>                 return -EIO;
>>         }
>>         return 0;
>> }
>>
>> The memcpy_sg version does simplify submit_dma_transfers()
>> (one dmaengine_prep_dma_memcpy_sg + one dmaengine_submit vs a loop).
> 
> Right
> 
>>
>> My current DCBM path issues dmaengine_prep_dma_memcpy()+dmaengine_submit()
>> per mapped SG segment and sets DMA_PREP_INTERRUPT + callback only
>> on the last one, so the IRQ/callback cost is already one per batch.
>>
>> My understanding is switching to dmaengine_prep_dma_memcpy_sg() mainly
>> saves the per-segment prep/submit calls and hands the provider a single
>> multi-segment TX to program.
> 
> Right, but the analysis you showed indicated the DMA setup cost was
> quite significant, so moving from N transfers to a single one should
> have saved a bit more...
> 
>>
>> Please correct me if the benefit you had in mind is something stronger.
>> Thanks for the suggestion and your guidance.
> 
> I still feel this looks like the better version...
> Can you compare the setup time between the two, please?

I wrote a small dmaengine bench module to isolate the prep/setup overhead from the full migration path.

prep_memcpy: a loop of dmaengine_prep_dma_memcpy(), one descriptor per SG entry, with a single completion callback on the last tx (the same pattern my driver currently uses).
prep_memcpy_sg: one dmaengine_prep_dma_memcpy_sg() per batch, so the provider walks the mapped src/dst SGLs (the proposed path).

Each phase - prep / submit / issue / wait - is instrumented with ktime_get(); a rough sketch of the timed path is below.
Happy to share the module and the runner script if useful.
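
To make the measurement concrete, here is a minimal sketch of what the
timed path looks like. This is illustrative only, not the actual module:
bench_run_batch(), bench_dma_callback() and the ns[] phase layout are
made-up names. It assumes src_sgt/dst_sgt are already DMA-mapped for
chan's device, and that the experimental device_prep_dma_memcpy_sg hook
mentioned above is present for the sg mode.

/* Completion fired from the last (or only) descriptor of a batch. */
static void bench_dma_callback(void *param)
{
	complete(param);	/* param is the batch's struct completion */
}

/* Illustrative sketch: run one batch in either mode and record
 * per-phase times in ns[0..3] = prep / submit / issue / wait (ns). */
static int bench_run_batch(struct dma_chan *chan, struct sg_table *src_sgt,
			   struct sg_table *dst_sgt, bool use_sg, s64 ns[4])
{
	DECLARE_COMPLETION_ONSTACK(done);
	struct dma_async_tx_descriptor *tx;
	struct scatterlist *src, *dst;
	dma_cookie_t cookie;
	ktime_t t;
	int i;

	memset(ns, 0, 4 * sizeof(*ns));

	if (use_sg) {
		/* prep_memcpy_sg: one descriptor covers the whole batch */
		t = ktime_get();
		tx = dmaengine_prep_dma_memcpy_sg(chan,
				dst_sgt->sgl, dst_sgt->nents,
				src_sgt->sgl, src_sgt->nents,
				DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
		ns[0] = ktime_to_ns(ktime_sub(ktime_get(), t));
		if (!tx)
			return -EIO;
		tx->callback = bench_dma_callback;
		tx->callback_param = &done;

		t = ktime_get();
		cookie = dmaengine_submit(tx);
		ns[1] = ktime_to_ns(ktime_sub(ktime_get(), t));
		if (dma_submit_error(cookie))
			return -EIO;
	} else {
		/* prep_memcpy: one descriptor per SG entry; only the last
		 * tx gets DMA_PREP_INTERRUPT and the completion callback */
		unsigned long flags = DMA_CTRL_ACK;

		dst = dst_sgt->sgl;
		for_each_sgtable_dma_sg(src_sgt, src, i) {
			if (i == src_sgt->nents - 1)
				flags |= DMA_PREP_INTERRUPT;

			t = ktime_get();
			tx = dmaengine_prep_dma_memcpy(chan,
					sg_dma_address(dst),
					sg_dma_address(src),
					sg_dma_len(src), flags);
			ns[0] += ktime_to_ns(ktime_sub(ktime_get(), t));
			if (!tx)
				return -EIO;
			if (i == src_sgt->nents - 1) {
				tx->callback = bench_dma_callback;
				tx->callback_param = &done;
			}

			t = ktime_get();
			cookie = dmaengine_submit(tx);
			ns[1] += ktime_to_ns(ktime_sub(ktime_get(), t));
			if (dma_submit_error(cookie))
				return -EIO;
			dst = sg_next(dst);
		}
	}

	t = ktime_get();
	dma_async_issue_pending(chan);
	ns[2] = ktime_to_ns(ktime_sub(ktime_get(), t));

	t = ktime_get();
	wait_for_completion(&done);
	ns[3] = ktime_to_ns(ktime_sub(ktime_get(), t));
	return 0;
}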

Workload: copy 512 MB per channel, 20 runs per cell, src_nid=0 dst_nid=1, folio sizes 4KB/2MB, batch = 512 SG entries.
*_ms columns are thread time summed across channels (for chan=16, divide by 16 for per-channel time).
run_ms is the wall-clock time to copy the 512 MB.
prep_calls is the total number of dmaengine_prep_dma_memcpy{,_sg}() calls (512x fewer for memcpy_sg).

mode            chan  folio  sge  run_ms           prep_ms         submit_ms     issue_ms       wait_ms         prep_calls
prep_memcpy     1     4KB    512  632.86 ± 8.18    18.00 ± 6.38    4.44 ± 0.09   0.09 ± 0.04    603.54 ± 5.03     131072 (= 512MB/4KB)
prep_memcpy_sg  1     4KB    512  611.34 ± 13.52    0.74 ± 0.33    0.01 ± 0.00   0.08 ± 0.00    610.48 ± 13.68    256    (= prep_memcpy calls / 512)

prep_memcpy     16    4KB    512  675.70 ± 14.13  416.19 ± 27.49  79.19 ± 2.27   1.53 ± 0.12   9590.11 ± 206.81   2097152
prep_memcpy_sg  16    4KB    512  615.43 ± 11.55   19.61 ± 3.38    0.17 ± 0.03   1.55 ± 0.16   9202.33 ± 138.41   4096

prep_memcpy     1     2MB    512   77.19 ± 0.15     0.04 ± 0.02    0.02 ± 0.00   0.00 ± 0.00     77.10 ± 0.15     512
prep_memcpy_sg  1     2MB    512   77.21 ± 0.11     0.00 ± 0.00    0.00 ± 0.00   0.00 ± 0.00     77.21 ± 0.11     1

prep_memcpy     16    2MB    512  186.01 ± 0.40     2.31 ± 0.17    0.32 ± 0.03   0.00 ± 0.00   2712.56 ± 4.24     8192
prep_memcpy_sg  16    2MB    512  185.63 ± 0.37     0.09 ± 0.02    0.00 ± 0.00   0.00 ± 0.00   2711.20 ± 3.75     16

dmaengine_prep_dma_memcpy_sg() is a clear win on setup cost (fewer preps, fewer submits, no per-tx callback
bookkeeping). However, the end-to-end throughput gain was modest earlier because the migration-path cost and
per-descriptor execution time (wait_ms) dominate.

Thanks,
Shivank

