* How is the Linux kernel API dma_alloc_coherent() typically implemented for the ARM Architecture?
@ 2018-10-03 16:55 Casey Leedom
  2018-10-03 17:44 ` Robin Murphy
0 siblings, 1 reply; 6+ messages in thread
From: Casey Leedom @ 2018-10-03 16:55 UTC (permalink / raw)
To: linux-arm-kernel

I have a question about ARM CPU versus PCIe DMA I/O Coherence that I'm trying to understand. In general, I thought that ARM is I/O Incoherent and that setting up Device DMA READs from Coherent Memory and Device DMA WRITEs to Coherent Memory requires that the Device Driver/OS coordinate to FLUSH/INVALIDATE Caches, etc. In Linux this is all handled automatically via the dma_map*()/dma_unmap*() APIs. But what does the Linux kernel API dma_alloc_coherent() do on an architecture like ARM? Return an UNCACHED mapping? I've tried ferreting my way down through the layers and layers of abstraction and implementation differences for various ARM platforms, but it's pretty opaque ...

We use the Linux dma_alloc_coherent() API in order to allocate our TX and RX "Rings". All TX and RX "Buffers" are managed with the dma_map*() (*READ* and *WRITE*) APIs in order to Flush Caches to Memory / Invalidate Caches, etc.

But these "Rings" serve as "message" rings between the Host and the Device, and we don't do Cache Flushes/Invalidates on them. Messages sent from the Host to the Device include Work Requests and lists of Free List Buffer Pointers. Messages sent from the Device to the Host include Ingress Packet Delivery Notifications, Link Status, etc. For the Ingress Queues which the Device uses to send messages to the Host, we use a Generation Bit scheme where the Generation Bit flips back and forth between 0 and 1 every time the Device's Write Index in the Ingress Queue wraps back around to the start of the Ingress Queue. The Host software uses the Generation Bit value to determine when there are new Device Messages available in the Ingress Queue.

So, as I was grinding my way down through the layers of implementation of the Linux dma_alloc_coherent() I was trying to see how the above dma_alloc_coherent() semantic was being implemented on the ARM architecture, which [I thought] doesn't generally support I/O Coherency. Setting up a completely UNCACHED mapping would of course work, but at a significant cost in terms of access. It's conceivable that the TX Rings could be mapped with a WRITE-COMBINING UNCACHED mapping, I suppose (though the Linux API doesn't include any information on the DIRECTION of a dma_map_coherent() call). So I'm curious about how that all fits together.

Casey

^ permalink raw reply [flat|nested] 6+ messages in thread
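[Editor's note: for readers unfamiliar with the scheme Casey describes, the Generation Bit handshake can be sketched as a minimal userspace C simulation. All names, field layouts, and the queue length here are invented for illustration — this is not the actual Chelsio driver code, and a real ring lives in DMA-coherent memory with the device as producer.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QUEUE_LEN 4          /* illustrative; real rings are much larger */

struct iq_entry {
    uint32_t payload;
    uint8_t  gen;            /* generation bit written by the "device" */
};

struct ingress_queue {
    struct iq_entry ring[QUEUE_LEN];
    unsigned int dev_widx;   /* device write index */
    uint8_t      dev_gen;    /* device's current generation */
    unsigned int host_ridx;  /* host read index */
    uint8_t      host_gen;   /* generation the host expects next */
};

static void iq_init(struct ingress_queue *q)
{
    memset(q, 0, sizeof(*q));
    q->dev_gen = 1;          /* ring starts zeroed, so the first pass writes gen=1 */
    q->host_gen = 1;
}

/* "Device" posts one message: write the entry, then advance, flipping gen on wrap. */
static void device_post(struct ingress_queue *q, uint32_t payload)
{
    q->ring[q->dev_widx].payload = payload;
    q->ring[q->dev_widx].gen = q->dev_gen;
    if (++q->dev_widx == QUEUE_LEN) {
        q->dev_widx = 0;
        q->dev_gen ^= 1;     /* generation flips every time the write index wraps */
    }
}

/* Host polls: an entry is new iff its gen bit matches the expected generation. */
static int host_poll(struct ingress_queue *q, uint32_t *payload)
{
    struct iq_entry *e = &q->ring[q->host_ridx];

    if (e->gen != q->host_gen)
        return 0;            /* stale entry from the previous lap: no new message */
    *payload = e->payload;
    if (++q->host_ridx == QUEUE_LEN) {
        q->host_ridx = 0;
        q->host_gen ^= 1;
    }
    return 1;
}
```

Because a stale entry's gen bit never matches the host's expected generation, the host needs no separate write-index register read to detect new messages — which is exactly why the ring memory itself must be coherent (or uncached) between CPU and device.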
* How is the Linux kernel API dma_alloc_coherent() typically implemented for the ARM Architecture?
From: Robin Murphy @ 2018-10-03 17:44 UTC (permalink / raw)
To: linux-arm-kernel

Hi Casey,

On 03/10/18 17:55, Casey Leedom wrote:
> I have a question about ARM CPU versus PCIe DMA I/O Coherence that I'm trying to understand. In general, I thought that ARM is I/O Incoherent and that setting up Device DMA READs from Coherent Memory and Device DMA WRITEs to Coherent Memory require that the Device Driver/OS coordinate to FLUSH/INVALIDATE Caches, etc. In Linux this is all handled automatically via the dma_map*()/dma_unmap*() APIs. But what does the Linux kernel API dma_alloc_coherent() do on an architecture like ARM? Return an UNCACHED mapping? I've tried ferreting my way down through the layers and layers of abstraction and implementation differences for various ARM platforms but it's pretty opaque ...

The really old stuff is a bit murky, but certainly from Armv7 onwards (i.e. the kind of systems you'd be plugging a PCIe card into today), it's relatively straightforward. In short, if the PCIe root complex is non-coherent, the allocation (usually from CMA) gets remapped with the Normal Non-Cacheable memory type, and that's the address given back to the caller (side note: we don't unmap the Cacheable linear map alias, so have to be very careful to *not* touch that while both mappings exist, because there be dragons). If on the other hand it is coherent, then the caller just gets the regular Normal Cacheable linear map address back and nothing special happens (the IOMMU stuff is a little more involved, but the same principle applies).

Note that those Non-Cacheable mappings are still of the Normal memory type, which is not as strict as the Device memory type used for MMIO, and are also write-bufferable (again, on modern CPUs at least). Given the Architectural memory model, that's about as good as it can get. FWIW coherent I/O is actually becoming increasingly common, at least on larger systems, since once you have a coherent interconnect to allow two or more clusters of CPUs to work together properly, it's not *that* much work to also make other DMA masters spit out enough of the right signals to snoop caches when targeting memory addresses.

> [...] It's conceivable that the TX Rings could be mapped with a WRITE-COMBINING UNCACHED mapping I suppose (though the Linux API doesn't include any information on the DIRECTION of a dma_map_coherent() call). So I'm curious about how that all fits together.

Arm has no specific write-combining memory type, so in actual fact DMA_ATTR_WRITE_COMBINE will just give you the same Normal Non-Cacheable mapping as for the non-coherent case (and thus you really wouldn't want it for coherent devices!)

Robin.
* How is the Linux kernel API dma_alloc_coherent() typically implemented for the ARM Architecture?
From: Russell King - ARM Linux @ 2018-10-03 18:08 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, Oct 03, 2018 at 06:44:56PM +0100, Robin Murphy wrote:
> On 03/10/18 17:55, Casey Leedom wrote:
> > [...] But what does the Linux kernel API dma_alloc_coherent() do on an architecture like ARM? Return an UNCACHED mapping? [...]
>
> The really old stuff is a bit murky, but certainly from Armv7 onwards (i.e. the kind of systems you'd be plugging a PCIe card into today), it's relatively straightforward.

The old stuff is _not_ murky - it operates the same way as modern systems as far as Linux is concerned. That's one of the reasons for having the DMA API - to provide a consistent cross-platform, cross-architecture way to deal with the DMA coherency issues.

There are two major parts of the DMA API - the coherent API and the streaming API.

The coherent API consists of dma_alloc_coherent() and friends. dma_alloc_coherent() returns an allocation of memory to the driver that is guaranteed to be coherent with the device that was passed into the API. How that happens depends on the implementation, but the requirement is that the memory can be read and written by the CPU _while_ the DMA device is also reading and writing that memory. No cache flushing is required (if it were, then simultaneous access would not be possible.) The main purpose of this memory is for things like descriptor rings, where (eg) the CPU places addresses for the device to process and links the ring entry into the ring, or sets an ownership or go bit in the descriptor. Meanwhile the hardware polls the go bit and processes the descriptor as soon as it notices that the go or ownership bit has been set. Barriers are required for weakly ordered architectures to ensure the correct visibility semantics for CPU reads and writes.

Essentially, in *all* ARMs where the memory is noncoherent with the device, allocating coherent memory means that the memory is remapped with the caches disabled.

The streaming API covers memory which is not inherently coherent, and requires some form of cache maintenance to ensure that the data transferred by DMA or the CPU is visible to the other. These are the dma_map_*(), dma_unmap_*() and dma_sync_*() functions.

> > We use the Linux dma_alloc_coherent() API in order to allocate our TX and RX "Rings". [...] It's conceivable that the TX Rings could be mapped with a WRITE-COMBINING UNCACHED mapping I suppose (though the Linux API doesn't include any information on the DIRECTION of a dma_map_coherent() call). So I'm curious about how that all fits together.

To Casey: If you are writing at the level where you need to know exactly how coherent memory is implemented, then you're doing it wrong - and your driver will never be reliable. You will be writing it based upon implementation semantics, rather than the API definition.

If you're thinking about avoiding "uncached" mappings by using a streaming mapping and doing cache maintenance, please don't. Writes in that situation are written back using a full (or sometimes half, depending on the implementation) cache line, which can be 16, 32, 64 or 128 bytes. What this means is that if your descriptors are 16 bytes long, flushing just one descriptor is impossible without also writing neighbouring entries - and if the hardware has updated those entries while the cache line is in the CPU cache, the hardware update will be lost upon writeback.

Also note that there is no dma_map_coherent() API, and "direction" is not relevant for memory which is inherently coherent. Direction is necessary for the streaming API so it knows how to most efficiently perform the cache maintenance if the memory is not coherent with the device.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
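[Editor's note: Russell's cache-line point can be made concrete. With 16-byte descriptors and (say) a 64-byte cache line, four descriptors share one line, so writing back any one of them necessarily writes its neighbours too. A common mitigation for structures that genuinely must live in streaming-mapped memory (not for coherent rings, which don't need it) is to pad each descriptor out to a full cache line. A minimal sketch, assuming a 64-byte line and invented field names:]

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size; real hardware may use 16-128 bytes */

/* A 16-byte descriptor: four of these share one 64-byte cache line. */
struct desc16 {
    uint64_t addr;
    uint32_t len;
    uint32_t flags;
};

/* The same descriptor padded and aligned to a whole cache line, so that
 * per-descriptor cache maintenance can never clobber a neighbouring entry. */
struct desc_padded {
    struct desc16 d;
    char pad[CACHE_LINE - sizeof(struct desc16)];
} __attribute__((aligned(CACHE_LINE)));

/* Do two addresses fall within the same cache line? */
static int same_cache_line(const void *a, const void *b)
{
    return ((uintptr_t)a / CACHE_LINE) == ((uintptr_t)b / CACHE_LINE);
}
```

The trade-off is 4x the memory per ring, which is why the coherent API (where the mapping itself guarantees visibility and no maintenance is done) is the natural home for densely packed descriptor rings.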
* How is the Linux kernel API dma_alloc_coherent() typically implemented for the ARM Architecture?
From: Robin Murphy @ 2018-10-04 11:05 UTC (permalink / raw)
To: linux-arm-kernel

On 03/10/18 19:08, Russell King - ARM Linux wrote:
> On Wed, Oct 03, 2018 at 06:44:56PM +0100, Robin Murphy wrote:
> > The really old stuff is a bit murky, but certainly from Armv7 onwards (i.e. the kind of systems you'd be plugging a PCIe card into today), it's relatively straightforward.
>
> The old stuff is _not_ murky - it operates the same way as modern systems as far as Linux is concerned.

Apologies if that came across wrong - I meant it from my viewpoint, in terms of the level of detail I wanted to go into (my point of reference is primarily VMSAv8, plus what it changed from VMSAv7). I can't say that an ARMv5 kernel is going to give a Normal-NC mapping, since as I understand things it may be Strongly-Ordered instead depending on configuration. TBH I can't even fully remember off-hand what exactly the pre-TEX-remap attributes are and how they would map to AXI/ACE transactions, and it didn't seem worth spending half an hour looking it all up since the prospect of something like an ARM926 being the main application processor driving the kind of kit that Chelsio do seems rather unlikely.

> That's one of the reasons for having the DMA API - to provide a consistent cross-platform, cross-architecture way to deal with the DMA coherency issues.

Fully agreed that driver authors shouldn't care about this and can trust the kernel to provide the best thing it can, but Casey's initial question implied that a bit more architectural background might be useful (especially now given the response).

Robin.
* How is the Linux kernel API dma_alloc_coherent() typically implemented for the ARM Architecture?
From: Casey Leedom @ 2018-10-04 19:13 UTC (permalink / raw)
To: linux-arm-kernel

| From: Robin Murphy <robin.murphy@arm.com>
| Sent: Thursday, October 4, 2018 4:05 AM
|
| On 03/10/18 19:08, Russell King - ARM Linux wrote:
| >
| > That's one of the reasons
| > for having the DMA API - to provide a consistent cross-platform
| > cross-architecture way to deal with the DMA coherency issues.
|
| Fully agreed that driver authors shouldn't care about this and can trust the
| kernel to provide the best thing it can, but Casey's initial question
| implied that a bit more architectural background might be useful (especially
| now given the response).

Yes, I'm basically trying to advise a Hardware Team on what work they need to do in order to successfully integrate a 4-Core ARM Complex into an existing SoC. Their initial plan was to make this completely I/O Coherent, which would have caused _*ALL*_ I/O to flow through a bandwidth-limited Coherency Switch. But that's definitely not necessary (or desirable, because of the bandwidth issue) for the actual Data being transferred.

But the TX/RX Rings, on the other hand ... those could be an issue. As noted, these are mapped with the Linux dma_alloc_coherent() API, and I was trying to understand what mapping this actually ended up with on various ARM platform implementations. It ~looked like~ this resulted in an UNCACHED Mapping, but I couldn't tell for sure. The levels of API Abstraction made it very difficult to see what's happening.

Casey
* How is the Linux kernel API dma_alloc_coherent() typically implemented for the ARM Architecture?
From: Casey Leedom @ 2018-10-03 18:36 UTC (permalink / raw)
To: linux-arm-kernel

[[ Sorry if there's a duplicate: the mailing list complained about HTML email. ]]

Thanks Robin and Russell!

| From: Russell King - ARM Linux <linux@armlinux.org.uk>
| Sent: Wednesday, October 3, 2018 11:08 AM
|
| If you are writing at the level where you need to know exactly how coherent
| memory is implemented, then you're doing it wrong - and your driver will
| never be reliable. You will be writing it based upon implementation
| semantics, rather than the API definition.

Ah, sorry, no, that's not what's happening. Our current software uses all of the appropriate Linux kernel APIs for all of this. I am asking because I'm trying to offer advice to a Hardware Team which has been tasked with adding an embedded ARM Cluster to an existing chip. The Hardware Team wanted to know whether they needed to extend the ARM CPU's Coherency Domain to include DMA I/O from a different portion of the chip (it's sort of emulated PCIe DMA). My initial response was "no" because I had thought that ARM systems were, in general, not I/O coherent. (Though I've seen cases where tightly coupled co-processors like Graphics Engines are made to be coherent in the CPU Coherency Domain.)

That was all fine until someone dragged my attention over to the Linux kernel API dma_alloc_coherent(), which in fact we do use for our TX/RX "Descriptor Rings". Before I signed off on my recommendation to the Hardware Team, I wanted to make sure I wasn't blowing smoke out my ... er, I wasn't directing them incorrectly ... :-)

| Also note that there is no dma_map_coherent() API, and "direction" is not
| relevant for memory which is inherently coherent.

The reason that I thought that there might be a need for a "direction" in the dma_map_coherent() API is because RX and TX Descriptor Rings might benefit under some Architectural Implementations from different treatments/mappings. For instance, for RX Rings, UNCACHED is the right answer. But for TX Rings where a Write Combining Buffer is implemented, an UNCACHED/WRITE-COMBINING mapping might be best.

Casey