From mboxrd@z Thu Jan  1 00:00:00 1970
From: jgunthorpe@obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 23 Apr 2014 13:34:54 -0600
Subject: [PATCH] ARM: mm: dma: Update coherent streaming apis with missing memory barrier
In-Reply-To: <6414220.SShvCHLvZQ@wuerfel>
References: <1398103390-31968-1-git-send-email-santosh.shilimkar@ti.com>
 <20140423171727.GK5649@arm.com>
 <20140423183742.GK24070@n2100.arm.linux.org.uk>
 <6414220.SShvCHLvZQ@wuerfel>
Message-ID: <20140423193454.GA10076@obsidianresearch.com>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On Wed, Apr 23, 2014 at 08:58:05PM +0200, Arnd Bergmann wrote:

> PCI guarantees this, but I have seen systems in the past (on
> PowerPC) that would violate them on the internal interconnect: You
> could sometimes see the completion DMA data in the descriptor ring
> before the actual user data is there. We only ever observed it in
> combination with an IOMMU, when the descriptor address had a valid
> IOTLB but the data address did not.

Ordering in PCI-E gets a bit fuzzy when you talk about internal
ordering within the host bridge, but AFAIK, re-ordering non-relaxed
PCI-E writes is certainly a big no-no. It breaks the entire
producer/consumer driver model.

> Another problem is MSI processing. MSI was specifically invented to avoid
> having to check an MMIO register for a DMA completion that as a side-effect
> flushes pending DMAs from the same device. This breaks down if the MSI
> packet gets turned into a level interrupt before it reaches the CPU's
> coherency domain, which is likely the case on the dw-pcie controller that
> comes with its own MSI block.

I recently implemented PCI-E to AXI bridge HW that does MSI in an AXI
environment, and it requires waiting for all AXI operations associated
with prior PCI-E packets to complete and be acked back to the bridge
before sending an MSI edge over to the GIC.
Unlike PCI, AXI provides a write-completion ack back to the initiator,
which the completer may only send once the transaction is visible to
all other initiators. A bridge must similarly serialize other TLPs:
e.g. a series of posted writes with the relaxed-ordering bit set can
be pipelined into AXI, but once a non-relaxed TLP is hit, the bridge
must wait for all the prior writes to be acked before forwarding the
non-relaxed one.

Not doing this serialization would be the root cause of problems like
the one you described above on PPC, where the IOMMU path takes longer
than the non-IOMMU path, so the non-relaxed completion write arrives
too soon.

IMHO, if someone builds a PCI-E bridge that doesn't do this, then its
MSI support is completely broken and should not be used. Delivering an
MSI interrupt before data visibility completely violates the PCI-E
transaction-ordering requirements.

It is also important to note that even level interrupts require bridge
serialization. When a non-relaxed read-response TLP is returned, the
bridge must wait for all AXI writes to be acked before forwarding the
read response. Otherwise writes could be buffered within the
interconnect and still not be visible to the CPU while the read
response 'passes' them.

Jason