Re: Uncached buffers from CMA DMA heap on some Arm devices?

From: Milan Zamazal <mzamazal@redhat.com>
To: Lucas Stach <dev@lynxeye.de>
Cc: Christoph Hellwig <hch@lst.de>,
	 iommu@lists.linux.dev,  Will Deacon <will@kernel.org>,
	 catalin.marinas@arm.com,
	 Bryan O'Donoghue <bryan.odonoghue@linaro.org>,
	 Andrey Konovalov <andrey.konovalov.ynk@gmail.com>,
	 Pavel Machek <pavel@ucw.cz>,  Maxime Ripard <mripard@redhat.com>,
	 Laurent Pinchart <laurent.pinchart@ideasonboard.com>,
	 kieran.bingham@ideasonboard.com,
	Hans de Goede <hdegoede@redhat.com>
Subject: Re: Uncached buffers from CMA DMA heap on some Arm devices?
Date: Fri, 26 Jan 2024 12:22:30 +0100
Message-ID: <874jf05og9.fsf@redhat.com>
In-Reply-To: <d2ff8df896d8a167e9abf447ae184ce2f5823852.camel@lynxeye.de> (Lucas Stach's message of "Thu, 25 Jan 2024 12:41:01 +0100")

Lucas Stach <dev@lynxeye.de> writes:

> Hi Milan,
>
> On Wednesday, 24.01.2024 at 19:27 +0100, Milan Zamazal wrote:
>> Hello,
>> 
>> in the libcamera project, we experience a major performance problem related to
>> DMA buffers while working on camera image processing using the CPU.  This
>> happens only with some Arm boards; we have observed it on Debix Model A (NXP
>> i.MX 8M Plus) and PinePhone.  We use the /dev/dma_heap/linux,cma (or reserved)
>> DMA buffer heap on Arm.
>> 
>> Reading V4L2 camera data from buffers is very slow.  When we memcpy the data
>> from the buffer to malloc'ed memory before working with it (reading each byte
>> multiple times, without any big non-sequential jumps across the data), we get
>> a speedup of more than 10 times.  It looks like the input buffer is uncached.
>> 
> That's right and a reality you have to deal with on those small ARM
> systems. The ARM architecture allows for systems that don't enforce
> hardware coherency across the whole SoC and many of the small/cheap SoC
> variants make use of this architectural feature.

Hi Lucas,

thank you for the explanation.  It mostly confirms that the limitations we
suspected are unavoidable, but in any case it is good to know for sure whether
there is any hope or not. :-)

> What this means is that the CPU caches aren't coherent when it comes to
> DMA from other masters like the video capture units. There are two ways
> to enforce DMA coherency on such systems:
> 1. map the DMA buffers uncached on the CPU
> 2. require explicit cache maintenance when touching DMA buffers with
> the CPU
>
> Option 1 is what you see is happening in your setup, as it is simple,
> straight-forward and doesn't require any synchronization points.
>
> Option 2 could be implemented by allocating cached DMA buffers in the
> V4L2 device and then executing the necessary cache synchronization in
> qbuf/dqbuf when ownership of the DMA buffer changes between CPU and DMA
> master. 

Do I understand it right that "could be implemented" applies to kernel code and
there are currently no facilities there that would allow experimenting with such
an approach from user space?

> However, this isn't guaranteed to be any faster, as the cache synchronization
> itself is a pretty heavy-weight operation when you are dealing with buffers
> that are potentially multiple megabytes in size.

Yes, it would be best to measure it if the mechanism was available.

>> We also experience a slowdown when writing to output buffers.  It doesn't seem
>> to matter whether we write to the output byte-by-byte or memcpy larger chunks.
>> 
> For DMA coherency it's sufficient to map the DMA buffers as write-
> combined, which should at least give you okay-ish write performance,
> depending on the specific micro-architecture of your system.

OK.

>> We are having trouble understanding what the problem is with the buffers on
>> some hardware and what we can realistically do about it.  Could you please help
>> us clarify this?  Is it possible to force the DMA buffer CMA heap to be cached?
>> Or is there anything else we can do or try?
>
> See above. You can work with cached buffers, but that is moving the
> cost elsewhere and is not guaranteed to yield better performance. There
> is no panacea on systems that don't enforce coherency at the hardware
> level.
>
> When working on uncached buffers directly, your best option is to try
> to access the buffers in as large chunks as possible, using vector
> loads or similar facilities. You certainly don't want to access a
> single memory location multiple times. If that is what your algorithm
> requires then copying the content into a cached buffer might be your
> best option, as it might have similar performance to explicit cache
> maintenance on cached DMA buffers and doesn't require another
> maintenance operation when transitioning the buffer back to DMA master
> ownership.

What works best for us is copying and processing the camera data roughly one
line at a time: a line is a chunk large enough to copy efficiently while still
being small enough to fit into the CPU caches.
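
A minimal sketch of that line-by-line approach, for illustration only.  It is
not the libcamera code; process_by_line(), line_fn and sum_line() are
hypothetical names, and the sketch assumes the source pointer is a CPU mapping
of the (possibly uncached) DMA buffer with a known stride in bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-line callback; it only ever sees cached memory. */
typedef void (*line_fn)(const uint8_t *line, size_t width, void *ctx);

/*
 * Copy each line of the frame from src into a small malloc'ed (cached)
 * bounce buffer with one sequential memcpy, then run fn on the copy.
 * Every byte of src is read exactly once; any repeated reads done by fn
 * hit the bounce buffer, which is small enough to stay in the CPU caches.
 */
static int process_by_line(const uint8_t *src, size_t stride, size_t width,
                           size_t height, line_fn fn, void *ctx)
{
    assert(stride >= width);

    uint8_t *bounce = malloc(width);
    if (!bounce)
        return -1;

    for (size_t y = 0; y < height; y++) {
        memcpy(bounce, src + y * stride, width);
        fn(bounce, width, ctx);
    }

    free(bounce);
    return 0;
}

/* Example callback (assumption): accumulate the sum of all bytes. */
static void sum_line(const uint8_t *line, size_t width, void *ctx)
{
    uint64_t *total = ctx;

    for (size_t x = 0; x < width; x++)
        *total += line[x];
}
```

A real processing step would replace sum_line(); whether one line is the best
chunk size depends on the cache sizes of the particular SoC and would need to
be measured.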

Regards,
Milan
