dma_sync_sg_for_device/cpu look very inefficient

All of lore.kernel.org
 help / color / mirror / Atom feed

* dma_sync_sg_for_device/cpu look very inefficient
@ 2023-12-18 12:32 Paul Cercueil
  2023-12-18 14:30 ` Christoph Hellwig
  2023-12-18 15:16 ` Robin Murphy
  0 siblings, 2 replies; 6+ messages in thread
From: Paul Cercueil @ 2023-12-18 12:32 UTC (permalink / raw)
  To: Christoph Hellwig, Marek Szyprowski, Robin Murphy
  Cc: Nuno Sa, Michael Hennerich, iommu

Hi,

I am manipulating DMABUFs created with the udmabuf driver, that are 64
MiB each, on a ZedBoard (which should have ~512 KiB of L2 cache IIRC).

When trying to access them with the CPU, dma_sync_sg_for_cpu() is
called. What ends up happening, is a cache sync for each one of the (up
to) 16 thousand pages that back up the DMABUF, which obviously takes a
huge amount of time, tanking performance.

My guess (probably very naïve) is that if the total length of the SG is
equal or bigger than the size of the wider non-coherent cache, it would
be much better to just flush the whole data cache.

Thoughts?

Cheers,
-Paul

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: dma_sync_sg_for_device/cpu look very inefficient
  2023-12-18 12:32 dma_sync_sg_for_device/cpu look very inefficient Paul Cercueil
@ 2023-12-18 14:30 ` Christoph Hellwig
  2023-12-18 14:40   ` Lucas Stach
  2023-12-18 15:01   ` Paul Cercueil
  2023-12-18 15:16 ` Robin Murphy
  1 sibling, 2 replies; 6+ messages in thread
From: Christoph Hellwig @ 2023-12-18 14:30 UTC (permalink / raw)
  To: Paul Cercueil
  Cc: Christoph Hellwig, Marek Szyprowski, Robin Murphy, Nuno Sa,
	Michael Hennerich, iommu

On Mon, Dec 18, 2023 at 01:32:59PM +0100, Paul Cercueil wrote:
> I am manipulating DMABUFs created with the udmabuf driver, that are 64
> MiB each, on a ZedBoard (which should have ~512 KiB of L2 cache IIRC).
> 
> When trying to access them with the CPU, dma_sync_sg_for_cpu() is
> called. What ends up happening, is a cache sync for each one of the (up
> to) 16 thousand pages that back up the DMABUF, which obviously takes a
> huge amount of time, tanking performance.
> 
> My guess (probably very naïve) is that if the total length of the SG is
> equal or bigger than the size of the wider non-coherent cache, it would
> be much better to just flush the whole data cache.

IFF we regularly ѕync huge amount of data a complete flush is probably
going to be more efficient.  But why are we doing that to start with?
Ownership needs to transfer when the data is accessed, and when you
regularly access 16.000 pages you're probably got other performance
problems to start with.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: dma_sync_sg_for_device/cpu look very inefficient
  2023-12-18 14:30 ` Christoph Hellwig
@ 2023-12-18 14:40   ` Lucas Stach
  2023-12-18 15:02     ` Paul Cercueil
  2023-12-18 15:01   ` Paul Cercueil
  1 sibling, 1 reply; 6+ messages in thread
From: Lucas Stach @ 2023-12-18 14:40 UTC (permalink / raw)
  To: Christoph Hellwig, Paul Cercueil
  Cc: Marek Szyprowski, Robin Murphy, Nuno Sa, Michael Hennerich, iommu

Am Montag, dem 18.12.2023 um 15:30 +0100 schrieb Christoph Hellwig:
> On Mon, Dec 18, 2023 at 01:32:59PM +0100, Paul Cercueil wrote:
> > I am manipulating DMABUFs created with the udmabuf driver, that are 64
> > MiB each, on a ZedBoard (which should have ~512 KiB of L2 cache IIRC).
> > 
> > When trying to access them with the CPU, dma_sync_sg_for_cpu() is
> > called. What ends up happening, is a cache sync for each one of the (up
> > to) 16 thousand pages that back up the DMABUF, which obviously takes a
> > huge amount of time, tanking performance.
> > 
> > My guess (probably very naïve) is that if the total length of the SG is
> > equal or bigger than the size of the wider non-coherent cache, it would
> > be much better to just flush the whole data cache.
> 
> IFF we regularly ѕync huge amount of data a complete flush is probably
> going to be more efficient.  But why are we doing that to start with?
> Ownership needs to transfer when the data is accessed, and when you
> regularly access 16.000 pages you're probably got other performance
> problems to start with.
> 
In addition to that flushing the cache only works for data moving to
the device. When you have to deal with data written by the device you
would need to invalidate the cache, which is obviously not something
you can do to the whole cache without data corruption.

Regards,
Lucas

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: dma_sync_sg_for_device/cpu look very inefficient
  2023-12-18 14:30 ` Christoph Hellwig
  2023-12-18 14:40   ` Lucas Stach
@ 2023-12-18 15:01   ` Paul Cercueil
  1 sibling, 0 replies; 6+ messages in thread
From: Paul Cercueil @ 2023-12-18 15:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Marek Szyprowski, Robin Murphy, Nuno Sa, Michael Hennerich, iommu

Hi Christoph,

Le lundi 18 décembre 2023 à 15:30 +0100, Christoph Hellwig a écrit :
> On Mon, Dec 18, 2023 at 01:32:59PM +0100, Paul Cercueil wrote:
> > I am manipulating DMABUFs created with the udmabuf driver, that are
> > 64
> > MiB each, on a ZedBoard (which should have ~512 KiB of L2 cache
> > IIRC).
> > 
> > When trying to access them with the CPU, dma_sync_sg_for_cpu() is
> > called. What ends up happening, is a cache sync for each one of the
> > (up
> > to) 16 thousand pages that back up the DMABUF, which obviously
> > takes a
> > huge amount of time, tanking performance.
> > 
> > My guess (probably very naïve) is that if the total length of the
> > SG is
> > equal or bigger than the size of the wider non-coherent cache, it
> > would
> > be much better to just flush the whole data cache.
> 
> IFF we regularly ѕync huge amount of data a complete flush is
> probably
> going to be more efficient.  But why are we doing that to start with?
> Ownership needs to transfer when the data is accessed, and when you
> regularly access 16.000 pages you're probably got other performance
> problems to start with.

I mmap the DMABUFs from user-space, so that I can pass huge amounts of
data back and forth between user-space and kernel/DMA space without
going through copy_to_user().

This works fine and is faster than copy_to_user() with coherent memory,
but it's much slower with non-coherent memory because of the time spent
in dma_sync_sg_for_cpu().

(Note that we also have regular use cases where we share the DMABUFs
between device drivers to transfer data in a zero-copy fashion - we're
not just abusing them as a kernel/userspace interface.)

You are right that the problem is more that I have an insane number of
pages :) Thanks for pointing that out. I will try to use huge pages
instead.

Cheers,
-Paul

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: dma_sync_sg_for_device/cpu look very inefficient
  2023-12-18 14:40   ` Lucas Stach
@ 2023-12-18 15:02     ` Paul Cercueil
  0 siblings, 0 replies; 6+ messages in thread
From: Paul Cercueil @ 2023-12-18 15:02 UTC (permalink / raw)
  To: Lucas Stach, Christoph Hellwig
  Cc: Marek Szyprowski, Robin Murphy, Nuno Sa, Michael Hennerich, iommu

Hi Lucas,

Le lundi 18 décembre 2023 à 15:40 +0100, Lucas Stach a écrit :
> Am Montag, dem 18.12.2023 um 15:30 +0100 schrieb Christoph Hellwig:
> > On Mon, Dec 18, 2023 at 01:32:59PM +0100, Paul Cercueil wrote:
> > > I am manipulating DMABUFs created with the udmabuf driver, that
> > > are 64
> > > MiB each, on a ZedBoard (which should have ~512 KiB of L2 cache
> > > IIRC).
> > > 
> > > When trying to access them with the CPU, dma_sync_sg_for_cpu() is
> > > called. What ends up happening, is a cache sync for each one of
> > > the (up
> > > to) 16 thousand pages that back up the DMABUF, which obviously
> > > takes a
> > > huge amount of time, tanking performance.
> > > 
> > > My guess (probably very naïve) is that if the total length of the
> > > SG is
> > > equal or bigger than the size of the wider non-coherent cache, it
> > > would
> > > be much better to just flush the whole data cache.
> > 
> > IFF we regularly ѕync huge amount of data a complete flush is
> > probably
> > going to be more efficient.  But why are we doing that to start
> > with?
> > Ownership needs to transfer when the data is accessed, and when you
> > regularly access 16.000 pages you're probably got other performance
> > problems to start with.
> > 
> In addition to that flushing the cache only works for data moving to
> the device. When you have to deal with data written by the device you
> would need to invalidate the cache, which is obviously not something
> you can do to the whole cache without data corruption.

Ah, that's a very good point.

Cheers,
-Paul

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: dma_sync_sg_for_device/cpu look very inefficient
  2023-12-18 12:32 dma_sync_sg_for_device/cpu look very inefficient Paul Cercueil
  2023-12-18 14:30 ` Christoph Hellwig
@ 2023-12-18 15:16 ` Robin Murphy
  1 sibling, 0 replies; 6+ messages in thread
From: Robin Murphy @ 2023-12-18 15:16 UTC (permalink / raw)
  To: Paul Cercueil, Christoph Hellwig, Marek Szyprowski
  Cc: Nuno Sa, Michael Hennerich, iommu

On 2023-12-18 12:32 pm, Paul Cercueil wrote:
> Hi,
> 
> I am manipulating DMABUFs created with the udmabuf driver, that are 64
> MiB each, on a ZedBoard (which should have ~512 KiB of L2 cache IIRC).
> 
> When trying to access them with the CPU, dma_sync_sg_for_cpu() is
> called. What ends up happening, is a cache sync for each one of the (up
> to) 16 thousand pages that back up the DMABUF, which obviously takes a
> huge amount of time, tanking performance.
> 
> My guess (probably very naïve) is that if the total length of the SG is
> equal or bigger than the size of the wider non-coherent cache, it would
> be much better to just flush the whole data cache.

In addition to the points already raised, not all architectures have a 
notion of "the whole data cache", so while this is something that 
individual arch implementations could potentially consider, it's not 
necessarily possible (or at least reasonable) to do at the generic DMA 
API level, nor in common implementations like dma-direct or iommu-dma.

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-12-18 15:16 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-12-18 12:32 dma_sync_sg_for_device/cpu look very inefficient Paul Cercueil
2023-12-18 14:30 ` Christoph Hellwig
2023-12-18 14:40   ` Lucas Stach
2023-12-18 15:02     ` Paul Cercueil
2023-12-18 15:01   ` Paul Cercueil
2023-12-18 15:16 ` Robin Murphy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.