From mboxrd@z Thu Jan 1 00:00:00 1970
From: linux@armlinux.org.uk (Russell King - ARM Linux)
Date: Wed, 3 Oct 2018 19:08:43 +0100
Subject: How is the Linux kernel API dma_alloc_coherent() typically implemented for the ARM Architecture?
In-Reply-To: <866a698a-aa61-6c48-0258-f2dd97973e87@arm.com>
References: <866a698a-aa61-6c48-0258-f2dd97973e87@arm.com>
Message-ID: <20181003180843.GY30658@n2100.armlinux.org.uk>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On Wed, Oct 03, 2018 at 06:44:56PM +0100, Robin Murphy wrote:
> Hi Casey,
>
> On 03/10/18 17:55, Casey Leedom wrote:
> >   I have a question about ARM CPU versus PCIe DMA I/O Coherence that
> > I'm trying to understand.  In general, I thought that ARM is I/O
> > Incoherent and that setting up Device DMA READs from Coherent Memory
> > and Device DMA WRITEs to Coherent Memory requires that the Device
> > Driver/OS coordinate to FLUSH/INVALIDATE Caches, etc.  In Linux this is
> > all handled automatically via the dma_map*()/dma_unmap*() APIs.  But
> > what does the Linux kernel API dma_alloc_coherent() do on an
> > architecture like ARM?  Return an UNCACHED mapping?  I've tried
> > ferreting my way down through the layers and layers of abstraction and
> > implementation differences for various ARM platforms but it's pretty
> > opaque ...
>
> The really old stuff is a bit murky, but certainly from Armv7 onwards
> (i.e. the kind of systems you'd be plugging a PCIe card into today),
> it's relatively straightforward.

The old stuff is _not_ murky - it operates the same way as modern systems
as far as Linux is concerned.  That's one of the reasons for having the
DMA API - to provide a consistent cross-platform, cross-architecture way
to deal with DMA coherency issues.

There are two major parts to the DMA API - the coherent API and the
streaming API.  The coherent API consists of dma_alloc_coherent() and
friends.
dma_alloc_coherent() returns an allocation of memory to the driver that is
guaranteed to be coherent with the device that was passed into the API.
How that happens depends on the implementation, but the requirement is
that the memory can be read and written by the CPU _while_ the DMA device
is also reading and writing that memory.  No cache flushing is required
(if it were, then simultaneous access would not be possible.)

The main purpose of this memory is for things like descriptor rings,
where (eg) the CPU places addresses for the device to process and links
the ring entry into the ring, or sets an ownership or "go" bit in the
descriptor.  Meanwhile, the hardware polls the go bit and processes the
descriptor as soon as it notices that the go or ownership bit has been
set.  Barriers are required on weakly ordered architectures to ensure the
correct visibility semantics for CPU reads and writes.

Essentially, on *all* ARMs where the memory is non-coherent with the
device, allocating coherent memory means that the memory is remapped with
the caches disabled.

The streaming API covers memory which is not inherently coherent, and
which requires some form of cache maintenance to ensure that the data
transferred by DMA or the CPU is visible to the other.  These are the
dma_map_*(), dma_unmap_*() and dma_sync_*() functions.

> >   We use the Linux dma_alloc_coherent() API in order to allocate our
> > TX and RX "Rings".  All TX and RX "Buffers" are managed with the
> > dma_map*() (*READ* and *WRITE*) APIs in order to Flush Caches to
> > Memory / Invalidate Caches, etc.
> >
> >   But these "Rings" serve as "message" rings between the Host and the
> > Device and we don't do Cache Flushes/Invalidates on them.  Messages
> > sent from the Host to the Device include Work Requests and lists of
> > Free List Buffer Pointers.  Messages sent from the Device to the Host
> > include Ingress Packet Delivery Notifications, Link Status, etc.
> >   For the Ingress Queues which the Device uses to send messages to the
> > Host, we use a Generation Bit scheme where the Generation Bit flips
> > back and forth between 0 and 1 every time the Device's Write Index in
> > the Ingress Queue wraps back around to the start of the Ingress Queue.
> > The Host software uses the Generation Bit value to determine when
> > there are new Device Messages available in the Ingress Queue.
> >
> >   So, as I was grinding my way down through the layers of
> > implementation of the Linux dma_alloc_coherent() I was trying to see
> > how the above dma_alloc_coherent() semantic was being implemented on
> > the ARM architecture which [I thought] doesn't generally support I/O
> > Coherency.  Setting up a completely UNCACHED mapping would of course
> > work but at a significant cost in terms of access.  It's conceivable
> > that the TX Rings could be mapped with a WRITE-COMBINING UNCACHED
> > mapping I suppose (though the Linux API doesn't include any
> > information on the DIRECTION of a dma_map_coherent() call).  So I'm
> > curious about how that all fits together.

To Casey: if you are writing at a level where you need to know exactly
how coherent memory is implemented, then you're doing it wrong - and your
driver will never be reliable.  You will be writing it based upon
implementation semantics, rather than the API definition.

If you're thinking about avoiding "uncached" mappings by using a
streaming mapping and doing the cache maintenance yourself, please don't.
Writes in that situation are written back using a full (or sometimes
half, depending on the implementation) cache line, which can be 16, 32,
64 or 128 bytes.  What this means is that if your descriptors are 16
bytes long, flushing just one descriptor is impossible without also
writing back its neighbouring entries - and if the hardware has updated
those entries while the cache line was in the CPU cache, the hardware's
updates will be lost upon writeback.
Also note that there is no dma_map_coherent() API, and "direction" is not
relevant for memory which is inherently coherent.  Direction is necessary
for the streaming API so that it knows how to most efficiently perform
the cache maintenance when the memory is not coherent with the device.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up