From: Vinod Koul <vkoul@kernel.org>
To: "Garg, Shivank" <shivankg@amd.com>
Cc: lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
vbabka@kernel.org, willy@infradead.org, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, ziy@nvidia.com,
matthew.brost@intel.com, joshua.hahnjy@gmail.com,
rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net,
ying.huang@linux.alibaba.com, apopple@nvidia.com,
dave@stgolabs.net, Jonathan.Cameron@huawei.com, rkodsara@amd.com,
bharata@amd.com, sj@kernel.org, weixugc@google.com,
dan.j.williams@intel.com, rientjes@google.com,
xuezhengchu@huawei.com, yiannis@zptcorp.com,
dave.hansen@intel.com, hannes@cmpxchg.org, jhubbard@nvidia.com,
peterx@redhat.com, riel@surriel.com, shakeel.butt@linux.dev,
stalexan@redhat.com, tj@kernel.org, nifan.cxl@gmail.com,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
akpm@linux-foundation.org, david@kernel.org
Subject: Re: [RFC PATCH v4 5/6] drivers/migrate_offload: add DMA batch copy driver (dcbm)
Date: Thu, 23 Apr 2026 19:43:54 +0530 [thread overview]
Message-ID: <aeopIn_sCBR2CFrI@vaman> (raw)
In-Reply-To: <396b4be1-376b-4aac-bd1e-2854c88b3757@amd.com>
On 23-04-26, 17:40, Garg, Shivank wrote:
> Hi Vinod,
>
> Following your suggestion at the Kernel meetup in Bangalore (11 Apr 2026)
> to check 0cae04373b ("dmaengine: remove DMA_MEMCPY_SG once again") and use
> DMA_MEMCPY_SG / dmaengine_prep_dma_memcpy_sg(), I added a
> device_prep_dma_memcpy_sg hook in drivers/dma/amd/ptdma/ptdma-dmaengine.c
> for this experiment (not posted) and ran an A/B comparison against the
> existing DCBM path, which uses dmaengine_prep_dma_memcpy() in a loop over
> mapped SGL segments.
>
> I'm using the move_pages() workload to move 1 GB of data per run. I do not
> see a significant performance difference; the results are broadly within
> each other's noise bands.
>
> Throughput (GB/s, mean ± SD), ITERATIONS=10:
>
> Page nr_dma_chan=1 nr_dma_chan=4 nr_dma_chan=8 nr_dma_chan=16
> order dcbm dcbm_sg dcbm dcbm_sg dcbm dcbm_sg dcbm dcbm_sg
> ------ ----------- ---------- ----------- ---------- ----------- ---------- ------------ ----------
> 0 2.33 ± 0.17 2.26 ± 0.19 3.24 ± 0.21 3.18 ± 0.23 3.29 ± 0.10 3.45 ± 0.10 3.29 ± 0.13 3.49 ± 0.22
> 4 2.77 ± 0.21 2.99 ± 0.18 6.26 ± 0.99 6.75 ± 0.12 8.01 ± 0.58 7.70 ± 0.64 8.22 ± 0.89 8.72 ± 0.87
> 8 4.57 ± 0.70 4.75 ± 0.83 10.64 ± 1.97 10.94 ± 3.52 10.30 ± 1.22 10.36 ± 1.24 11.27 ± 1.21 12.47 ± 1.66
> 9 12.71 ± 0.09 12.68 ± 0.08 27.13 ± 0.15 26.89 ± 0.27 46.50 ± 0.73 45.17 ± 2.46 67.25 ± 1.42 62.78 ± 8.24
>
> Notes: order 0/4/8/9 = 4K / 64K / 1M / 2M folios
> dcbm = per-segment dmaengine_prep_dma_memcpy
> dcbm_sg = DMA_MEMCPY_SG / dmaengine_prep_dma_memcpy_sg
>
> <snip>
>
> > +
> > +static int submit_dma_transfers(struct dma_work *work)
> > +{
> > + struct scatterlist *sg_src, *sg_dst;
> > + struct dma_async_tx_descriptor *tx;
> > + unsigned long flags = DMA_CTRL_ACK;
> > + dma_cookie_t cookie;
> > + int i;
> > +
> > + atomic_set(&work->pending, 1);
> > +
> > + sg_src = work->src_sgt->sgl;
> > + sg_dst = work->dst_sgt->sgl;
> > + for_each_sgtable_dma_sg(work->src_sgt, sg_src, i) {
> > + if (i == work->src_sgt->nents - 1)
> > + flags |= DMA_PREP_INTERRUPT;
> > +
> > + tx = dmaengine_prep_dma_memcpy(work->chan,
> > + sg_dma_address(sg_dst),
> > + sg_dma_address(sg_src),
> > + sg_dma_len(sg_src), flags);
> > + if (!tx) {
> > + atomic_set(&work->pending, 0);
> > + return -EIO;
> > + }
> > +
> > + if (i == work->src_sgt->nents - 1) {
> > + tx->callback = dma_completion_callback;
> > + tx->callback_param = work;
> > + }
> > +
> > + cookie = dmaengine_submit(tx);
> > + if (dma_submit_error(cookie)) {
> > + atomic_set(&work->pending, 0);
> > + return -EIO;
> > + }
> > + sg_dst = sg_next(sg_dst);
> > + }
> > + return 0;
> > +}
>
> static int submit_dma_transfers(struct dma_work *work)
> {
> struct dma_async_tx_descriptor *tx;
> unsigned long flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
> dma_cookie_t cookie;
>
> tx = dmaengine_prep_dma_memcpy_sg(work->chan,
> work->dst_sgt->sgl, work->dst_sgt->nents,
> work->src_sgt->sgl, work->src_sgt->nents,
> flags);
> if (!tx)
> return -EIO;
>
> atomic_set(&work->pending, 1);
> tx->callback = dma_completion_callback;
> tx->callback_param = work;
>
> cookie = dmaengine_submit(tx);
> if (dma_submit_error(cookie)) {
> atomic_set(&work->pending, 0);
> return -EIO;
> }
> return 0;
> }
>
> The memcpy_sg version does simplify submit_dma_transfers()
> (one dmaengine_prep_dma_memcpy_sg + one dmaengine_submit vs a loop).
Right
>
> My current DCBM path issues dmaengine_prep_dma_memcpy()+dmaengine_submit()
> per mapped SG segment and sets DMA_PREP_INTERRUPT + callback only
> on the last one, so the IRQ/callback cost is already one per batch.
>
> My understanding is switching to dmaengine_prep_dma_memcpy_sg() mainly
> saves the per-segment prep/submit calls and hands the provider a single
> multi-segment TX to program.
Right, but the analysis you showed indicated the dma setup cost was
quite a bit, so moving away from N transfers to a single one should have
saved a bit more...
>
> Please correct me if the benefit you had in mind is something stronger.
> Thanks for the suggestion and the guidance.
I still feel this looks like the better version...
Can you compare your setup time between the two, please?
Thanks
--
~Vinod
Thread overview: 33+ messages
2026-03-09 12:07 [RFC PATCH v4 0/6] Accelerate page migration with batch copying and hardware offload Shivank Garg
2026-03-09 12:07 ` [RFC PATCH v4 1/6] mm: introduce folios_mc_copy() for batch folio copying Shivank Garg
2026-03-12 9:41 ` David Hildenbrand (Arm)
2026-03-15 18:09 ` Garg, Shivank
2026-03-09 12:07 ` [RFC PATCH v4 2/6] mm/migrate: skip data copy for already-copied folios Shivank Garg
2026-03-12 9:44 ` David Hildenbrand (Arm)
2026-03-15 18:25 ` Garg, Shivank
2026-03-23 12:20 ` David Hildenbrand (Arm)
2026-03-24 8:22 ` Huang, Ying
2026-04-03 11:08 ` Garg, Shivank
2026-04-07 6:52 ` Huang, Ying
2026-04-23 12:20 ` Garg, Shivank
2026-03-09 12:07 ` [RFC PATCH v4 3/6] mm/migrate: add batch-copy path in migrate_pages_batch Shivank Garg
2026-03-24 8:42 ` Huang, Ying
2026-04-03 11:09 ` Garg, Shivank
2026-03-09 12:07 ` [RFC PATCH v4 4/6] mm/migrate: add copy offload registration infrastructure Shivank Garg
2026-03-09 17:54 ` Gregory Price
2026-03-10 10:07 ` Garg, Shivank
2026-03-24 10:54 ` Huang, Ying
2026-04-03 11:11 ` Garg, Shivank
2026-04-07 7:40 ` Huang, Ying
2026-04-28 12:10 ` Garg, Shivank
2026-04-30 1:23 ` Huang, Ying
2026-03-09 12:07 ` [RFC PATCH v4 5/6] drivers/migrate_offload: add DMA batch copy driver (dcbm) Shivank Garg
2026-03-09 18:04 ` Gregory Price
2026-03-12 9:33 ` Garg, Shivank
2026-03-24 8:10 ` Huang, Ying
2026-04-03 11:06 ` Garg, Shivank
2026-04-23 12:10 ` Garg, Shivank
2026-04-23 14:13 ` Vinod Koul [this message]
2026-04-24 11:26 ` Garg, Shivank
2026-03-09 12:07 ` [RFC PATCH v4 6/6] mm/migrate: adjust NR_MAX_BATCHED_MIGRATION for testing Shivank Garg
2026-03-18 14:29 ` [RFC PATCH v4 0/6] Accelerate page migration with batch copying and hardware offload Garg, Shivank