* Re: Uncached buffers from CMA DMA heap on some Arm devices?
From: Pavel Machek @ 2024-01-29 10:23 UTC
To: Lucas Stach, kernel list
Cc: Milan Zamazal, Christoph Hellwig, iommu, Will Deacon, catalin.marinas, Bryan O'Donoghue, Andrey Konovalov, Maxime Ripard, Laurent Pinchart, kieran.bingham, Hans de Goede

Hi!

> That's right and a reality you have to deal with on those small ARM
> systems. The ARM architecture allows for systems that don't enforce
> hardware coherency across the whole SoC and many of the small/cheap SoC
> variants make use of this architectural feature.
>
> What this means is that the CPU caches aren't coherent when it comes to
> DMA from other masters like the video capture units. There are two ways
> to enforce DMA coherency on such systems:
> 1. map the DMA buffers uncached on the CPU
> 2. require explicit cache maintenance when touching DMA buffers with
>    the CPU
>
> Option 1 is what you see is happening in your setup, as it is simple,
> straight-forward and doesn't require any synchronization points.

Yeah, and it also does not work :-).

Userspace gets the buffers, and it is not really equipped to work with
them. For example, on pinephone, memcpy() crashes on uncached
memory. I'm pretty sure a user could have some kind of kernel-crashing
fun if he passed the uncached memory to futex or something similar.

> Option 2 could be implemented by allocating cached DMA buffers in the
> V4L2 device and then executing the necessary cache synchronization in
> qbuf/dqbuf when ownership of the DMA buffer changes between CPU and DMA
> master. However this isn't guaranteed to be any faster, as the cache
> synchronization itself is a pretty heavy-weight operation when you are
> dealing with buffers that are potentially multi-megabytes in size.

Yes, cache synchronization can be slow, but IIRC it was on the order of
a millisecond in the worst case... and copying megabyte images is still
slower than that.

Note that it is faster to do read/write syscalls than to deal with
uncached memory. And userspace can't simply flush the caches and remap
memory as cached.

v4l2 moved away from read/write "because it is slow" and switched to
an interface that is even slower than that. And libcamera exposes
uncached memory to the user :-(.

Best regards,
								Pavel
--
People of Russia, stop Putin before his war on Ukraine escalates.
* Re: Re: Uncached buffers from CMA DMA heap on some Arm devices?
From: Maxime Ripard @ 2024-01-29 10:32 UTC
To: Pavel Machek
Cc: Lucas Stach, kernel list, Milan Zamazal, Christoph Hellwig, iommu, Will Deacon, catalin.marinas, Bryan O'Donoghue, Andrey Konovalov, Laurent Pinchart, kieran.bingham, Hans de Goede

On Mon, Jan 29, 2024 at 11:23:16AM +0100, Pavel Machek wrote:
> Hi!
>
> > That's right and a reality you have to deal with on those small ARM
> > systems. The ARM architecture allows for systems that don't enforce
> > hardware coherency across the whole SoC and many of the small/cheap SoC
> > variants make use of this architectural feature.
> >
> > What this means is that the CPU caches aren't coherent when it comes to
> > DMA from other masters like the video capture units. There are two ways
> > to enforce DMA coherency on such systems:
> > 1. map the DMA buffers uncached on the CPU
> > 2. require explicit cache maintenance when touching DMA buffers with
> >    the CPU
> >
> > Option 1 is what you see is happening in your setup, as it is simple,
> > straight-forward and doesn't require any synchronization points.
>
> Yeah, and it also does not work :-).
>
> Userspace gets the buffers, and it is not really equipped to work with
> them. For example, on pinephone, memcpy() crashes on uncached
> memory. I'm pretty sure a user could have some kind of kernel-crashing
> fun if he passed the uncached memory to futex or something similar.

Uncached buffers are ubiquitous on arm/arm64 so there must be something
else going on. And there's nothing to equip for, it's just a memory
array you can access in any way you want (but very slowly).

How does it not work?

> > Option 2 could be implemented by allocating cached DMA buffers in the
> > V4L2 device and then executing the necessary cache synchronization in
> > qbuf/dqbuf when ownership of the DMA buffer changes between CPU and DMA
> > master. However this isn't guaranteed to be any faster, as the cache
> > synchronization itself is a pretty heavy-weight operation when you are
> > dealing with buffers that are potentially multi-megabytes in size.
>
> Yes, cache synchronization can be slow, but IIRC it was on the order of
> a millisecond in the worst case... and copying megabyte images is still
> slower than that.
>
> Note that it is faster to do read/write syscalls than to deal with
> uncached memory. And userspace can't simply flush the caches and remap
> memory as cached.

You can't change the memory mapping, but you can flush the caches with
dma-buf. It's even required by the dma-buf documentation.

> v4l2 moved away from read/write "because it is slow" and switched to
> an interface that is even slower than that. And libcamera exposes
> uncached memory to the user :-(.

There's also the number of copies to consider. If you were to use
read/write to display a frame on a framebuffer, you would use 4 copies
vs 2 with dma-buf.

Maxime
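Editorial sketch of the dma-buf cache flush Maxime mentions: userspace cannot remap the buffer, but it can bracket CPU access to an mmap()ed dma-buf with DMA_BUF_IOCTL_SYNC from <linux/dma-buf.h>. The helper names below (dmabuf_sync, dmabuf_begin_cpu_read, dmabuf_end_cpu_read) are made up for illustration, and where the fd comes from (a DMA heap, a V4L2 export, libcamera) is left as an assumption. This is a minimal sketch, not code from the thread.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/* Issue one DMA_BUF_IOCTL_SYNC call, retrying if it was interrupted. */
static int dmabuf_sync(int dmabuf_fd, uint64_t flags)
{
	struct dma_buf_sync sync = { .flags = flags };
	int ret;

	do {
		ret = ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
	} while (ret == -1 && (errno == EINTR || errno == EAGAIN));

	return ret;
}

/* Hypothetical helper: call before the CPU reads the mapped buffer. */
static int dmabuf_begin_cpu_read(int dmabuf_fd)
{
	return dmabuf_sync(dmabuf_fd, DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ);
}

/* Hypothetical helper: call once the CPU is done reading. */
static int dmabuf_end_cpu_read(int dmabuf_fd)
{
	return dmabuf_sync(dmabuf_fd, DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ);
}
```

Typical use would be dmabuf_begin_cpu_read(fd), then reading the frame through the mapping, then dmabuf_end_cpu_read(fd) before handing the buffer back to the device. What the exporter actually does for each call (real cache maintenance or nothing at all for an uncached mapping) depends on the driver.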
* Re: Re: Uncached buffers from CMA DMA heap on some Arm devices?
From: Laurent Pinchart @ 2024-01-29 12:07 UTC
To: Maxime Ripard
Cc: Pavel Machek, Lucas Stach, kernel list, Milan Zamazal, Christoph Hellwig, iommu, Will Deacon, catalin.marinas, Bryan O'Donoghue, Andrey Konovalov, kieran.bingham, Hans de Goede

On Mon, Jan 29, 2024 at 11:32:08AM +0100, Maxime Ripard wrote:
> On Mon, Jan 29, 2024 at 11:23:16AM +0100, Pavel Machek wrote:
> > Hi!
> >
> > > That's right and a reality you have to deal with on those small ARM
> > > systems. The ARM architecture allows for systems that don't enforce
> > > hardware coherency across the whole SoC and many of the small/cheap SoC
> > > variants make use of this architectural feature.
> > >
> > > What this means is that the CPU caches aren't coherent when it comes to
> > > DMA from other masters like the video capture units. There are two ways
> > > to enforce DMA coherency on such systems:
> > > 1. map the DMA buffers uncached on the CPU
> > > 2. require explicit cache maintenance when touching DMA buffers with
> > >    the CPU
> > >
> > > Option 1 is what you see is happening in your setup, as it is simple,
> > > straight-forward and doesn't require any synchronization points.
> >
> > Yeah, and it also does not work :-).
> >
> > Userspace gets the buffers, and it is not really equipped to work with
> > them. For example, on pinephone, memcpy() crashes on uncached
> > memory. I'm pretty sure a user could have some kind of kernel-crashing
> > fun if he passed the uncached memory to futex or something similar.
>
> Uncached buffers are ubiquitous on arm/arm64 so there must be something
> else going on. And there's nothing to equip for, it's just a memory
> array you can access in any way you want (but very slowly).
>
> How does it not work?

I agree, this should just work (albeit possibly slowly). A crash is a
sign something needs to be fixed.

> > > Option 2 could be implemented by allocating cached DMA buffers in the
> > > V4L2 device and then executing the necessary cache synchronization in
> > > qbuf/dqbuf when ownership of the DMA buffer changes between CPU and DMA
> > > master. However this isn't guaranteed to be any faster, as the cache
> > > synchronization itself is a pretty heavy-weight operation when you are
> > > dealing with buffers that are potentially multi-megabytes in size.
> >
> > Yes, cache synchronization can be slow, but IIRC it was on the order of
> > a millisecond in the worst case... and copying megabyte images is still
> > slower than that.

Those numbers are platform-specific, you can't assume this to be true
everywhere.

> > Note that it is faster to do read/write syscalls than to deal with
> > uncached memory. And userspace can't simply flush the caches and remap
> > memory as cached.
>
> You can't change the memory mapping, but you can flush the caches with
> dma-buf. It's even required by the dma-buf documentation.
>
> > v4l2 moved away from read/write "because it is slow" and switched to
> > an interface that is even slower than that. And libcamera exposes
> > uncached memory to the user :-(.
>
> There's also the number of copies to consider. If you were to use
> read/write to display a frame on a framebuffer, you would use 4 copies
> vs 2 with dma-buf.

--
Regards,

Laurent Pinchart
* Re: Re: Uncached buffers from CMA DMA heap on some Arm devices?
From: Lucas Stach @ 2024-01-29 13:12 UTC
To: Laurent Pinchart, Maxime Ripard
Cc: Pavel Machek, kernel list, Milan Zamazal, Christoph Hellwig, iommu, Will Deacon, catalin.marinas, Bryan O'Donoghue, Andrey Konovalov, kieran.bingham, Hans de Goede

On Monday, 29.01.2024 at 14:07 +0200, Laurent Pinchart wrote:
> On Mon, Jan 29, 2024 at 11:32:08AM +0100, Maxime Ripard wrote:
> > On Mon, Jan 29, 2024 at 11:23:16AM +0100, Pavel Machek wrote:
> > > Hi!
> > >
> > > > That's right and a reality you have to deal with on those small ARM
> > > > systems. The ARM architecture allows for systems that don't enforce
> > > > hardware coherency across the whole SoC and many of the small/cheap SoC
> > > > variants make use of this architectural feature.
> > > >
> > > > What this means is that the CPU caches aren't coherent when it comes to
> > > > DMA from other masters like the video capture units. There are two ways
> > > > to enforce DMA coherency on such systems:
> > > > 1. map the DMA buffers uncached on the CPU
> > > > 2. require explicit cache maintenance when touching DMA buffers with
> > > >    the CPU
> > > >
> > > > Option 1 is what you see is happening in your setup, as it is simple,
> > > > straight-forward and doesn't require any synchronization points.
> > >
> > > Yeah, and it also does not work :-).
> > >
> > > Userspace gets the buffers, and it is not really equipped to work with
> > > them. For example, on pinephone, memcpy() crashes on uncached
> > > memory. I'm pretty sure a user could have some kind of kernel-crashing
> > > fun if he passed the uncached memory to futex or something similar.
> >
> > Uncached buffers are ubiquitous on arm/arm64 so there must be something
> > else going on. And there's nothing to equip for, it's just a memory
> > array you can access in any way you want (but very slowly).
> >
> > How does it not work?
>
> I agree, this should just work (albeit possibly slowly). A crash is a
> sign something needs to be fixed.
>
Optimized memcpy implementations might use unaligned accesses at the
edges of the copy regions, which will in fact not work with uncached
memory: hardware unaligned access support on ARM(64) requires the
bufferable memory attribute, so you might see aborts in this case.
Write-combined mappings are bufferable and thus don't exhibit this
issue.

> > > > Option 2 could be implemented by allocating cached DMA buffers in the
> > > > V4L2 device and then executing the necessary cache synchronization in
> > > > qbuf/dqbuf when ownership of the DMA buffer changes between CPU and DMA
> > > > master. However this isn't guaranteed to be any faster, as the cache
> > > > synchronization itself is a pretty heavy-weight operation when you are
> > > > dealing with buffers that are potentially multi-megabytes in size.
> > >
> > > Yes, cache synchronization can be slow, but IIRC it was on the order of
> > > a millisecond in the worst case... and copying megabyte images is still
> > > slower than that.
>
> Those numbers are platform-specific, you can't assume this to be true
> everywhere.
>
Last time I looked at this was on a pretty old platform (Cortex-A9).
There the TLB walks caused by the cache maintenance by virtual address
were causing severe slowdowns, to the point where actually copying the
data performed similarly to the cache maintenance, within noise
margins, with the significant difference that copying actually causes
the data to be cache hot for the following operations.

Regards,
Lucas
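Editorial sketch of the point Lucas makes about unaligned accesses: if an optimized memcpy() faults on an uncached mapping because of unaligned loads at the edges, a copy routine that only ever issues naturally aligned loads on the source side should avoid that particular abort. The function name copy_from_uncached is hypothetical, the code assumes the source is the uncached mapping and the destination is ordinary cached memory, and it relies on the GCC/Clang may_alias attribute. It says nothing about speed, which will still be poor on uncached memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* GCC/Clang extension: allows 64-bit loads from a byte-typed buffer. */
typedef uint64_t __attribute__((may_alias)) aliased_u64;

/*
 * Hypothetical helper: copy n bytes out of an uncached mapping using only
 * naturally aligned loads on the source side. "volatile" keeps the compiler
 * from merging or widening the source accesses.
 */
static void copy_from_uncached(void *dst, const volatile void *src, size_t n)
{
	unsigned char *d = dst;
	const volatile unsigned char *s = src;

	/* Head: single bytes until the source pointer is 8-byte aligned. */
	while (n > 0 && ((uintptr_t)s & 7) != 0) {
		*d++ = *s++;
		n--;
	}

	/* Body: one aligned 64-bit load per iteration. */
	while (n >= 8) {
		uint64_t word = *(const volatile aliased_u64 *)s;

		/* dst is assumed to be normal memory, so any alignment is fine. */
		memcpy(d, &word, sizeof(word));
		d += 8;
		s += 8;
		n -= 8;
	}

	/* Tail: remaining bytes, again one at a time. */
	while (n > 0) {
		*d++ = *s++;
		n--;
	}
}
```

Copying the frame into cached memory this way also matches Lucas's observation that a copy leaves the data cache hot for whatever processing follows.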
* Re: Re: Uncached buffers from CMA DMA heap on some Arm devices?
From: Pavel Machek @ 2024-01-29 18:30 UTC
To: Maxime Ripard
Cc: Lucas Stach, kernel list, Milan Zamazal, Christoph Hellwig, iommu, Will Deacon, catalin.marinas, Bryan O'Donoghue, Andrey Konovalov, Laurent Pinchart, kieran.bingham, Hans de Goede

Hi!

> > Yeah, and it also does not work :-).
> >
> > Userspace gets the buffers, and it is not really equipped to work with
> > them. For example, on pinephone, memcpy() crashes on uncached
> > memory. I'm pretty sure a user could have some kind of kernel-crashing
> > fun if he passed the uncached memory to futex or something similar.
>
> Uncached buffers are ubiquitous on arm/arm64 so there must be something
> else going on. And there's nothing to equip for, it's just a memory
> array you can access in any way you want (but very slowly).

Not really. Not on anything modern. ll/sc will not work, for example;
that's on ARM.

https://en.wikipedia.org/wiki/Load-link/store-conditional

Transactional memory will not work; that was on x86. PowerPC has a
cacheline-clearing instruction. And that's by design; I'm pretty sure
there are also numerous CPU errata.

Best regards,
								Pavel
--
People of Russia, stop Putin before his war on Ukraine escalates.
Thread overview: 5+ messages
[not found] <87bk9ahex7.fsf@redhat.com>
[not found] ` <d2ff8df896d8a167e9abf447ae184ce2f5823852.camel@lynxeye.de>
2024-01-29 10:23 ` Uncached buffers from CMA DMA heap on some Arm devices? Pavel Machek
2024-01-29 10:32 ` Maxime Ripard
2024-01-29 12:07 ` Laurent Pinchart
2024-01-29 13:12 ` Lucas Stach
2024-01-29 18:30 ` Pavel Machek