* DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
[not found] ` <6a32f30b-433a-9857-164f-1717405a6082@oracle.com>
@ 2017-06-24 7:35 ` Christoph Hellwig
2017-06-26 9:15 ` Arnd Bergmann
2017-06-27 20:46 ` chris hyser
0 siblings, 2 replies; 5+ messages in thread
From: Christoph Hellwig @ 2017-06-24 7:35 UTC (permalink / raw)
To: Alan Adamson
Cc: Christoph Hellwig, keith.busch, axboe, linux-nvme, sagi,
linux-pci, Mark Nelson, Arnd Bergmann, Benjamin Herrenschmidt,
iommu
I always assumed that our streaming mappings are relaxed order for
TLP anyway. And at very least Documentation/DMA-attributes.txt seems
to imply something different:
DMA_ATTR_WEAK_ORDERING
----------------------
DMA_ATTR_WEAK_ORDERING specifies that reads and writes to the mapping
may be weakly ordered, that is that reads and writes may pass each other.
Which to me suggest host reads, which also makes me wonder why these
would even apply to our streaming mappings.
Adding the powerpc folks that added the DMA_ATTR_WEAK_ORDERING flag
originally back in 2008, but not actual users as far as I can tell -
those are all new and from the sparc gang, except for the noveau
driver, which is a bit older but only uses for dma_alloc_attrs, where
the original description makes sense to me.
On Wed, Jun 21, 2017 at 01:30:40PM -0700, Alan Adamson wrote:
> drivers may rely on strong ordering for certain writes. For example, a
> device may use a specific location in host memory to serve as a "done"
> flag. The device writes data to a buffer, and those writes can be relaxed
> because the order in which the data bytes enter the buffer is not visible
> to software. The device then writes to the "done" flag location, to
> indicate to the consumer that it has completed the buffer write. The write
> to "done" cannot pass the prior writes to the data buffer; otherwise
> software polling on the "done" location could potentially read stale data
> from the data buffer. Device driver developers need to understand the
> hardware behavior and only set the DMA_ATTR_WEAK_ORDERING attribute when
> safe.
>
> Alan Adamson
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
2017-06-24 7:35 ` DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers Christoph Hellwig
@ 2017-06-26 9:15 ` Arnd Bergmann
2017-06-27 20:46 ` chris hyser
1 sibling, 0 replies; 5+ messages in thread
From: Arnd Bergmann @ 2017-06-26 9:15 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Alan Adamson, Keith Busch, Jens Axboe, linux-nvme, Sagi Grimberg,
linux-pci, Mark Nelson, Benjamin Herrenschmidt, iommu
On Sat, Jun 24, 2017 at 9:35 AM, Christoph Hellwig <hch@lst.de> wrote:
> I always assumed that our streaming mappings are relaxed order for
> TLP anyway. And at very least Documentation/DMA-attributes.txt seems
> to imply something different:
>
>
> DMA_ATTR_WEAK_ORDERING
> ----------------------
>
> DMA_ATTR_WEAK_ORDERING specifies that reads and writes to the mapping
> may be weakly ordered, that is that reads and writes may pass each other.
>
> Which to me suggest host reads, which also makes me wonder why these
> would even apply to our streaming mappings.
>
> Adding the powerpc folks that added the DMA_ATTR_WEAK_ORDERING flag
> originally back in 2008, but not actual users as far as I can tell -
> those are all new and from the sparc gang, except for the noveau
> driver, which is a bit older but only uses for dma_alloc_attrs, where
> the original description makes sense to me.
I suspect people have applied the same name for different things over time.
When we added it in 2008, there was one specific use case, improving
throughput on infiniband, iirc using the mthca device driver. In the Cell
IOMMU, this attribute causes concurrent DMA transfers on PCIe
devices to disregard the stricter PCI ordering rules: multiple transfers
could be initiated together and complete in an arbitrary order depending
on e.g. I/O page faults in the IOMMU. Using the dma attributes, this
could be controlled to apply only to certain transfers, so a completion
queue DMA would still be ordered with regard to other transfers, as it would
not come with that attribute.
There are two things that I don't remember unfortunately.
a) why this patch seems to have never made it into the mainline kernel:
http://lists.openfabrics.org/pipermail/general/2008-October/055006.html
We probably had it added to both OFED and the supported distros at
the time (RHEL, Fedora, SLES, all in versions I don't remember any
more)
b) How this relates to the PCIe relaxed ordering flag on DMA transfers.
It's possible that the hardware later learned to correctly set the
flag on the actual transfer to have the same effect, or it's possible
that we had to add the dma_alloc_attr() interface specifically because
the hardware could not do this for some but not other transfers.
Arnd
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
2017-06-24 7:35 ` DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers Christoph Hellwig
2017-06-26 9:15 ` Arnd Bergmann
@ 2017-06-27 20:46 ` chris hyser
2017-07-05 19:07 ` Christoph Hellwig
1 sibling, 1 reply; 5+ messages in thread
From: chris hyser @ 2017-06-27 20:46 UTC (permalink / raw)
To: Christoph Hellwig, Alan Adamson
Cc: keith.busch, axboe, linux-nvme, sagi, linux-pci, Mark Nelson,
Arnd Bergmann, Benjamin Herrenschmidt, iommu
On 6/24/2017 3:35 AM, Christoph Hellwig wrote:
> I always assumed that our streaming mappings are relaxed order for
> TLP anyway. And at very least Documentation/DMA-attributes.txt seems
> to imply something different:
>
>
> DMA_ATTR_WEAK_ORDERING
> ----------------------
>
> DMA_ATTR_WEAK_ORDERING specifies that reads and writes to the mapping
> may be weakly ordered, that is that reads and writes may pass each other.
>
> Which to me suggest host reads, which also makes me wonder why these
> would even apply to our streaming mappings.
>
> Adding the powerpc folks that added the DMA_ATTR_WEAK_ORDERING flag
> originally back in 2008, but not actual users as far as I can tell -
> those are all new and from the sparc gang, except for the noveau
> driver, which is a bit older but only uses for dma_alloc_attrs, where
> the original description makes sense to me.
I put this in for SPARC. In our case the host bridge/RC itself follows very strict ordering unless the relaxed order bit
is set in the TLP. This works great for devices that actually allow the driver to enable it. We however also have to
support an infiniband card that does not support enabling this in the HW and thus in the TLP but is actually fine with
relaxed order for the data buffers (ie the streaming I/O vs the coherent control buffers). In fact w/o relaxed order the
performance is absolutely atrocious ... w/ exceeds x86. This flag enables the driver to signal to us when we map the
buffer in the IOMMU to enable the relaxed order attribute for our HW.
-chrish
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
2017-06-27 20:46 ` chris hyser
@ 2017-07-05 19:07 ` Christoph Hellwig
2017-07-10 16:17 ` chris hyser
0 siblings, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2017-07-05 19:07 UTC (permalink / raw)
To: chris hyser
Cc: Christoph Hellwig, Alan Adamson, keith.busch, axboe, linux-nvme,
sagi, linux-pci, Mark Nelson, Arnd Bergmann,
Benjamin Herrenschmidt, iommu
On Tue, Jun 27, 2017 at 04:46:31PM -0400, chris hyser wrote:
> I put this in for SPARC. In our case the host bridge/RC itself follows very
> strict ordering unless the relaxed order bit is set in the TLP. This works
> great for devices that actually allow the driver to enable it. We however
> also have to support an infiniband card that does not support enabling this
> in the HW and thus in the TLP but is actually fine with relaxed order for
> the data buffers (ie the streaming I/O vs the coherent control buffers). In
> fact w/o relaxed order the performance is absolutely atrocious ... w/
> exceeds x86. This flag enables the driver to signal to us when we map the
> buffer in the IOMMU to enable the relaxed order attribute for our HW.
We'll really need to start writing down our semantics. As I said
given how our streaming dma mappings (dma_map_single/page/sg) are
defined I can't think of any way how relaxed vs strict ordering would
matter for them, so just enabling it by default seems like the right
thing in this case, instead of having to patch every PCIe driver people
might ever use on sparc to work around your bridge.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
2017-07-05 19:07 ` Christoph Hellwig
@ 2017-07-10 16:17 ` chris hyser
0 siblings, 0 replies; 5+ messages in thread
From: chris hyser @ 2017-07-10 16:17 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Alan Adamson, keith.busch, axboe, linux-nvme, sagi, linux-pci,
Mark Nelson, Arnd Bergmann, Benjamin Herrenschmidt, iommu
On 7/5/2017 3:07 PM, Christoph Hellwig wrote:
> On Tue, Jun 27, 2017 at 04:46:31PM -0400, chris hyser wrote:
>> I put this in for SPARC. In our case the host bridge/RC itself follows very
>> strict ordering unless the relaxed order bit is set in the TLP. This works
>> great for devices that actually allow the driver to enable it. We however
>> also have to support an infiniband card that does not support enabling this
>> in the HW and thus in the TLP but is actually fine with relaxed order for
>> the data buffers (ie the streaming I/O vs the coherent control buffers). In
>> fact w/o relaxed order the performance is absolutely atrocious ... w/
>> exceeds x86. This flag enables the driver to signal to us when we map the
>> buffer in the IOMMU to enable the relaxed order attribute for our HW.
>
> We'll really need to start writing down our semantics. As I said
Fair enough. This will get documented.
> given how our streaming dma mappings (dma_map_single/page/sg) are
> defined I can't think of any way how relaxed vs strict ordering would
> matter for them, so just enabling it by default seems like the right
> thing in this case, instead of having to patch every PCIe driver people
People get real squeamish about just turning on relaxed order for driver/devices we actually know something about and
test let alone everything. I do agree with your interpretation of the streaming DMA semantics. That does not mean every
driver writer did or does. :-)
> might ever use on sparc to work around your bridge.
To be clear, this isn't really a host bridge issue. Technically no driver that runs on SPARC needs this enabled -- even
performance critical ... if they do the HW correctly. Again, we have a necessary performance critical device which does
not correctly signal its okayness with relaxed order for non-control DMA. This feature of the host bridge is to enable a
workaround for just these cases (deemed needed by past history). Said differently, lots of devices are made for the
commodity market where shortcuts like this apparently do not affect performance but are critical to performance on large
systems.
-chrish
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2017-07-10 16:19 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <1497645290-31041-1-git-send-email-alan.adamson@oracle.com>
[not found] ` <20170617115703.GA13002@lst.de>
[not found] ` <6a32f30b-433a-9857-164f-1717405a6082@oracle.com>
2017-06-24 7:35 ` DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers Christoph Hellwig
2017-06-26 9:15 ` Arnd Bergmann
2017-06-27 20:46 ` chris hyser
2017-07-05 19:07 ` Christoph Hellwig
2017-07-10 16:17 ` chris hyser
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).