* How is the Linux kernel API dma_alloc_coherent() typically implemented for the ARM Architecture?
@ 2018-10-03 16:55 Casey Leedom
  2018-10-03 17:44 ` Robin Murphy
0 siblings, 1 reply; 6+ messages in thread
From: Casey Leedom @ 2018-10-03 16:55 UTC (permalink / raw)
To: linux-arm-kernel

I have a question about ARM CPU versus PCIe DMA I/O Coherence that I'm trying to understand. In general, I thought that ARM is I/O Incoherent and that setting up Device DMA READs from Coherent Memory and Device DMA WRITEs to Coherent Memory requires that the Device Driver/OS coordinate to FLUSH/INVALIDATE Caches, etc. In Linux this is all handled automatically via the dma_map*()/dma_unmap*() APIs. But what does the Linux kernel API dma_alloc_coherent() do on an architecture like ARM? Return an UNCACHED mapping? I've tried ferreting my way down through the layers and layers of abstraction and implementation differences for various ARM platforms, but it's pretty opaque ...

We use the Linux dma_alloc_coherent() API in order to allocate our TX and RX "Rings". All TX and RX "Buffers" are managed with the dma_map*() (*READ* and *WRITE*) APIs in order to Flush Caches to Memory / Invalidate Caches, etc.

But these "Rings" serve as "message" rings between the Host and the Device, and we don't do Cache Flushes/Invalidates on them. Messages sent from the Host to the Device include Work Requests and lists of Free List Buffer Pointers. Messages sent from the Device to the Host include Ingress Packet Delivery Notifications, Link Status, etc. For the Ingress Queues which the Device uses to send messages to the Host, we use a Generation Bit scheme where the Generation Bit flips back and forth between 0 and 1 every time the Device's Write Index in the Ingress Queue wraps back around to the start of the Ingress Queue. The Host software uses the Generation Bit value to determine when there are new Device Messages available in the Ingress Queue.

So, as I was grinding my way down through the layers of implementation of the Linux dma_alloc_coherent() I was trying to see how the above dma_alloc_coherent() semantic was being implemented on the ARM architecture, which [I thought] doesn't generally support I/O Coherency. Setting up a completely UNCACHED mapping would of course work, but at a significant cost in terms of access. It's conceivable that the TX Rings could be mapped with a WRITE-COMBINING UNCACHED mapping, I suppose (though the Linux API doesn't include any information on the DIRECTION of a dma_map_coherent() call). So I'm curious about how that all fits together.

Casey

^ permalink raw reply [flat|nested] 6+ messages in thread
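[Editor's note: for readers unfamiliar with the scheme Casey describes, the Generation Bit handshake can be sketched as a minimal userspace C simulation. All names, field layouts, and the queue length here are invented for illustration — this is not the actual Chelsio driver code, and a real ring lives in DMA-coherent memory with the device as producer.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QUEUE_LEN 4          /* illustrative; real rings are much larger */

struct iq_entry {
    uint32_t payload;
    uint8_t  gen;            /* generation bit written by the "device" */
};

struct ingress_queue {
    struct iq_entry ring[QUEUE_LEN];
    unsigned int dev_widx;   /* device write index */
    uint8_t      dev_gen;    /* device's current generation */
    unsigned int host_ridx;  /* host read index */
    uint8_t      host_gen;   /* generation the host expects next */
};

static void iq_init(struct ingress_queue *q)
{
    memset(q, 0, sizeof(*q));
    q->dev_gen = 1;          /* ring starts zeroed, so the first pass writes gen=1 */
    q->host_gen = 1;
}

/* "Device" posts one message: write the entry, then advance, flipping gen on wrap. */
static void device_post(struct ingress_queue *q, uint32_t payload)
{
    q->ring[q->dev_widx].payload = payload;
    q->ring[q->dev_widx].gen = q->dev_gen;
    if (++q->dev_widx == QUEUE_LEN) {
        q->dev_widx = 0;
        q->dev_gen ^= 1;     /* generation flips every time the write index wraps */
    }
}

/* Host polls: an entry is new iff its gen bit matches the expected generation. */
static int host_poll(struct ingress_queue *q, uint32_t *payload)
{
    struct iq_entry *e = &q->ring[q->host_ridx];

    if (e->gen != q->host_gen)
        return 0;            /* stale entry from the previous lap: no new message */
    *payload = e->payload;
    if (++q->host_ridx == QUEUE_LEN) {
        q->host_ridx = 0;
        q->host_gen ^= 1;
    }
    return 1;
}
```

Because a stale entry's gen bit never matches the host's expected generation, the host needs no separate write-index register read to detect new messages — which is exactly why the ring memory itself must be coherent (or uncached) between CPU and device.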
* How is the Linux kernel API dma_alloc_coherent() typically implemented for the ARM Architecture?
From: Robin Murphy @ 2018-10-03 17:44 UTC (permalink / raw)
To: linux-arm-kernel

Hi Casey,

On 03/10/18 17:55, Casey Leedom wrote:
> I have a question about ARM CPU versus PCIe DMA I/O Coherence that I'm trying to understand. In general, I thought that ARM is I/O Incoherent and that setting up Device DMA READs from Coherent Memory and Device DMA WRITEs to Coherent Memory require that the Device Driver/OS coordinate to FLUSH/INVALIDATE Caches, etc. In Linux this is all handled automatically via the dma_map*()/dma_unmap*() APIs. But what does the Linux kernel API dma_alloc_coherent() do on an architecture like ARM? Return an UNCACHED mapping? I've tried ferreting my way down through the layers and layers of abstraction and implementation differences for various ARM platforms but it's pretty opaque ...

The really old stuff is a bit murky, but certainly from Armv7 onwards (i.e. the kind of systems you'd be plugging a PCIe card into today), it's relatively straightforward. In short, if the PCIe root complex is non-coherent, the allocation (usually from CMA) gets remapped with the Normal Non-Cacheable memory type, and that's the address given back to the caller (side note: we don't unmap the Cacheable linear map alias, so have to be very careful to *not* touch that while both mappings exist, because there be dragons). If on the other hand it is coherent, then the caller just gets the regular Normal Cacheable linear map address back and nothing special happens (the IOMMU stuff is a little more involved, but the same principle applies).

Note that those Non-Cacheable mappings are still of the Normal memory type, which is not as strict as the Device memory type used for MMIO, and are also write-bufferable (again, on modern CPUs at least). Given the Architectural memory model, that's about as good as it can get. FWIW coherent I/O is actually becoming increasingly common, at least on larger systems, since once you have a coherent interconnect to allow two or more clusters of CPUs to work together properly, it's not *that* much work to also make other DMA masters spit out enough of the right signals to snoop caches when targeting memory addresses.

> [...] It's conceivable that the TX Rings could be mapped with a WRITE-COMBINING UNCACHED mapping I suppose (though the Linux API doesn't include any information on the DIRECTION of a dma_map_coherent() call). So I'm curious about how that all fits together.

Arm has no specific write-combining memory type, so in actual fact DMA_ATTR_WRITE_COMBINE will just give you the same Normal Non-Cacheable mapping as for the non-coherent case (and thus you really wouldn't want it for coherent devices!)

Robin.
* How is the Linux kernel API dma_alloc_coherent() typically implemented for the ARM Architecture?
From: Russell King - ARM Linux @ 2018-10-03 18:08 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, Oct 03, 2018 at 06:44:56PM +0100, Robin Murphy wrote:
> On 03/10/18 17:55, Casey Leedom wrote:
> > [...] But what does the Linux kernel API dma_alloc_coherent() do on an architecture like ARM? Return an UNCACHED mapping? [...]
>
> The really old stuff is a bit murky, but certainly from Armv7 onwards (i.e. the kind of systems you'd be plugging a PCIe card into today), it's relatively straightforward.

The old stuff is _not_ murky - it operates the same way as modern systems as far as Linux is concerned. That's one of the reasons for having the DMA API - to provide a consistent cross-platform, cross-architecture way to deal with the DMA coherency issues.

There are two major parts of the DMA API - the coherent API and the streaming API.

The coherent API consists of dma_alloc_coherent() and friends. dma_alloc_coherent() returns an allocation of memory to the driver that is guaranteed to be coherent with the device that was passed into the API. How that happens depends on the implementation, but the requirement is that the memory can be read and written by the CPU _while_ the DMA device is also reading and writing that memory. No cache flushing is required (if it were, then simultaneous access would not be possible.) The main purpose of this memory is for things like descriptor rings, where (eg) the CPU places addresses for the device to process and links the ring entry into the ring, or sets an ownership or go bit in the descriptor. Meanwhile the hardware polls the go bit and processes the descriptor as soon as it notices that the go or ownership bit has been set. Barriers are required for weakly ordered architectures to ensure the correct visibility semantics for CPU reads and writes.

Essentially, in *all* ARMs where the memory is noncoherent with the device, allocating coherent memory means that the memory is remapped with the caches disabled.

The streaming API covers memory which is not inherently coherent, and requires some form of cache maintenance to ensure that the data transferred by DMA or the CPU is visible to the other. These are the dma_map_*(), dma_unmap_*() and dma_sync_*() functions.

> > We use the Linux dma_alloc_coherent() API in order to allocate our TX and RX "Rings". [...] It's conceivable that the TX Rings could be mapped with a WRITE-COMBINING UNCACHED mapping I suppose (though the Linux API doesn't include any information on the DIRECTION of a dma_map_coherent() call). So I'm curious about how that all fits together.

To Casey: If you are writing at the level where you need to know exactly how coherent memory is implemented, then you're doing it wrong - and your driver will never be reliable. You will be writing it based upon implementation semantics, rather than the API definition.

If you're thinking about avoiding "uncached" mappings by using a streaming mapping and doing cache maintenance, please don't. Writes in that situation are written back using a full (or sometimes half, depending on the implementation) cache line, which can be 16, 32, 64 or 128 bytes. What this means is that if your descriptors are 16 bytes long, flushing just one descriptor is impossible without also writing neighbouring entries - and if the hardware has updated those entries while the cache line is in the CPU cache, the hardware update will be lost upon writeback.

Also note that there is no dma_map_coherent() API, and "direction" is not relevant for memory which is inherently coherent. Direction is necessary for the streaming API so it knows how to most efficiently perform the cache maintenance if the memory is not coherent with the device.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
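[Editor's note: Russell's cache-line point can be made concrete. With 16-byte descriptors and (say) a 64-byte cache line, four descriptors share one line, so writing back any one of them necessarily writes its neighbours too. A common mitigation for structures that genuinely must live in streaming-mapped memory (not for coherent rings, which don't need it) is to pad each descriptor out to a full cache line. A minimal sketch, assuming a 64-byte line and invented field names:]

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size; real hardware may use 16-128 bytes */

/* A 16-byte descriptor: four of these share one 64-byte cache line. */
struct desc16 {
    uint64_t addr;
    uint32_t len;
    uint32_t flags;
};

/* The same descriptor padded and aligned to a whole cache line, so that
 * per-descriptor cache maintenance can never clobber a neighbouring entry. */
struct desc_padded {
    struct desc16 d;
    char pad[CACHE_LINE - sizeof(struct desc16)];
} __attribute__((aligned(CACHE_LINE)));

/* Do two addresses fall within the same cache line? */
static int same_cache_line(const void *a, const void *b)
{
    return ((uintptr_t)a / CACHE_LINE) == ((uintptr_t)b / CACHE_LINE);
}
```

The trade-off is 4x the memory per ring, which is why the coherent API (where the mapping itself guarantees visibility and no maintenance is done) is the natural home for densely packed descriptor rings.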
* How is the Linux kernel API dma_alloc_coherent() typically implemented for the ARM Architecture?
From: Robin Murphy @ 2018-10-04 11:05 UTC (permalink / raw)
To: linux-arm-kernel

On 03/10/18 19:08, Russell King - ARM Linux wrote:
> On Wed, Oct 03, 2018 at 06:44:56PM +0100, Robin Murphy wrote:
> > The really old stuff is a bit murky, but certainly from Armv7 onwards (i.e. the kind of systems you'd be plugging a PCIe card into today), it's relatively straightforward.
>
> The old stuff is _not_ murky - it operates the same way as modern systems as far as Linux is concerned.

Apologies if that came across wrong - I meant it from my viewpoint, in terms of the level of detail I wanted to go into (my point of reference is primarily VMSAv8, plus what it changed from VMSAv7). I can't say that an ARMv5 kernel is going to give a Normal-NC mapping, since as I understand things it may be Strongly-Ordered instead depending on configuration. TBH I can't even fully remember off-hand what exactly the pre-TEX-remap attributes are and how they would map to AXI/ACE transactions, and it didn't seem worth spending half an hour looking it all up since the prospect of something like an ARM926 being the main application processor driving the kind of kit that Chelsio do seems rather unlikely.

> That's one of the reasons for having the DMA API - to provide a consistent cross-platform, cross-architecture way to deal with the DMA coherency issues.

Fully agreed that driver authors shouldn't care about this and can trust the kernel to provide the best thing it can, but Casey's initial question implied that a bit more architectural background might be useful (especially now given the response).

Robin.
* How is the Linux kernel API dma_alloc_coherent() typically implemented for the ARM Architecture?
From: Casey Leedom @ 2018-10-04 19:13 UTC (permalink / raw)
To: linux-arm-kernel

| From: Robin Murphy <robin.murphy@arm.com>
| Sent: Thursday, October 4, 2018 4:05 AM
|
| On 03/10/18 19:08, Russell King - ARM Linux wrote:
| >
| > That's one of the reasons
| > for having the DMA API - to provide a consistent cross-platform
| > cross-architecture way to deal with the DMA coherency issues.
|
| Fully agreed that driver authors shouldn't care about this and can trust the
| kernel to provide the best thing it can, but Casey's initial question
| implied that a bit more architectural background might be useful (especially
| now given the response).

Yes, I'm basically trying to advise a Hardware Team on what work they need to do in order to successfully integrate a 4-Core ARM Complex into an existing SoC. Their initial plan was to make this completely I/O Coherent, which would have caused _*ALL*_ I/O to flow through a bandwidth-limited Coherency Switch. But that's definitely not necessary (or desirable, because of the bandwidth issue) for the actual Data being transferred.

But the TX/RX Rings, on the other hand ... those could be an issue. As noted, these are mapped with the Linux dma_alloc_coherent() API, and I was trying to understand what mapping this actually ended up with on various ARM platform implementations. It ~looked like~ this resulted in an UNCACHED Mapping, but I couldn't tell for sure. The levels of API Abstraction made it very difficult to see what's happening.

Casey
* How is the Linux kernel API dma_alloc_coherent() typically implemented for the ARM Architecture?
From: Casey Leedom @ 2018-10-03 18:36 UTC (permalink / raw)
To: linux-arm-kernel

[[ Sorry if there's a duplicate: the mailing list complained about HTML email. ]]

Thanks Robin and Russell!

| From: Russell King - ARM Linux <linux@armlinux.org.uk>
| Sent: Wednesday, October 3, 2018 11:08 AM
|
| If you are writing at the level where you need to know exactly how coherent
| memory is implemented, then you're doing it wrong - and your driver will
| never be reliable. You will be writing it based upon implementation
| semantics, rather than the API definition.

Ah, sorry, no, that's not what's happening. Our current software uses all of the appropriate Linux kernel APIs for all of this. I am asking because I'm trying to offer advice to a Hardware Team which has been tasked with adding an embedded ARM Cluster to an existing chip. The Hardware Team wanted to know whether they needed to extend the ARM CPU's Coherency Domain to include DMA I/O from a different portion of the chip (it's sort of emulated PCIe DMA). My initial response was "no" because I had thought that ARM systems were, in general, not I/O coherent. (Though I've seen cases where tightly coupled co-processors like Graphics Engines are made to be coherent in the CPU Coherency Domain.)

That was all fine until someone dragged my attention over to the Linux kernel API dma_alloc_coherent(), which in fact we do use for our TX/RX "Descriptor Rings". Before I signed off on my recommendation to the Hardware Team, I wanted to make sure I wasn't blowing smoke out my ... er, I wasn't directing them incorrectly ... :-)

| Also note that there is no dma_map_coherent() API, and "direction" is not
| relevant for memory which is inherently coherent.

The reason that I thought that there might be a need for a "direction" in the dma_map_coherent() API is because RX and TX Descriptor Rings might benefit under some Architectural Implementations from different treatments/mappings. For instance, for RX Rings, UNCACHED is the right answer. But for TX Rings where a Write Combining Buffer is implemented, an UNCACHED/WRITE-COMBINING mapping might be best.

Casey