All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
@ 2017-06-16 20:34 Alan Adamson
  2017-06-17 11:57 ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Alan Adamson @ 2017-06-16 20:34 UTC (permalink / raw)


SPARC based platforms suffer significant read performance penalties for
nvme reads when Relaxed Ordering is not enabled. This change sets the
DMA_ATTR_WEAK_ORDERING (Relaxed Ordering) attribute for nvme block dma
buffers.

Both the AIC and SSD versions of the Samsung 2.9TB device were tested
with fio using iodepth=32, numjobs=8, and bs=4k.

For reads, a 450% increase in IOPS was observed.
For writes, no change in IOPS was observed.

Signed-off-by: Alan Adamson <alan.adamson at oracle.com>
Tested-by: Alan Adamson <alan.adamson at oracle.com>
Reviewed-by: Kyle Fortin <kyle.fortin at oracle.com>
Reviewed-by: Govinda Tatti <govinda.tatti at oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson at oracle.com>

---
 drivers/nvme/host/pci.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 951042a375d6..7a6cf6fa95c2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -632,7 +632,7 @@ static int nvme_map_data(struct nvme_dev *dev, struct request *req,
 
 	ret = BLK_MQ_RQ_QUEUE_BUSY;
 	if (!dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, dma_dir,
-				DMA_ATTR_NO_WARN))
+				DMA_ATTR_NO_WARN | DMA_ATTR_WEAK_ORDERING))
 		goto out;
 
 	if (!nvme_setup_prps(dev, req))
@@ -650,7 +650,8 @@ static int nvme_map_data(struct nvme_dev *dev, struct request *req,
 		if (rq_data_dir(req))
 			nvme_dif_remap(req, nvme_dif_prep);
 
-		if (!dma_map_sg(dev->dev, &iod->meta_sg, 1, dma_dir))
+		if (!dma_map_sg_attrs(dev->dev, &iod->meta_sg, 1, dma_dir,
+				DMA_ATTR_WEAK_ORDERING))
 			goto out_unmap;
 	}
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
  2017-06-16 20:34 [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers Alan Adamson
@ 2017-06-17 11:57 ` Christoph Hellwig
  2017-06-21 20:30   ` Alan Adamson
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2017-06-17 11:57 UTC (permalink / raw)


On Fri, Jun 16, 2017@01:34:50PM -0700, Alan Adamson wrote:
> SPARC based platforms suffer significant read performance penalties for
> nvme reads when Relaxed Ordering is not enabled. This change sets the
> DMA_ATTR_WEAK_ORDERING (Relaxed Ordering) attribute for nvme block dma
> buffers.

Please explain what it does exactly, and why you think it's safe for NVMe,
as that belongs into the changelog.

Bonus points for explaining why DMA_ATTR_WEAK_ORDERING shouldn't
be the default behavior for the dma_map_ routines instead of sprinkling
it over just about every driver..

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
  2017-06-17 11:57 ` Christoph Hellwig
@ 2017-06-21 20:30   ` Alan Adamson
       [not found]     ` <6a32f30b-433a-9857-164f-1717405a6082-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Alan Adamson @ 2017-06-21 20:30 UTC (permalink / raw)


Christoph,


Hopefully, this answers your questions.

The SPARC platforms are designed in accordance with PCIe specification 
and expects the the device driver to set DMA_ATTR_WEAK_ORDERING to avoid 
restrict ordering rules enforced by PCIe root-complex and ultimately 
experience higher latency for DMA Writes.

The Root Complex needs to break PCIe Memory Write request into one or 
more cacheline size requests to host memory. If RO is not set, those 
cacheline requests are serialized and the next/subsequent write cannot 
issue until the coherency acknowledgment for the first/earlier write is 
received. This serialization of write transactions can significantly 
impact sustained DMA write performance for the given device.

Relaxed ordering should be used as much as possible to avoid the 
serialization penalty described above. The risk of using weak ordering 
indiscriminately on all writes, however, is that certain devices and 
drivers may rely on strong ordering for certain writes. For example, a 
device may use a specific location in host memory to serve as a "done" 
flag. The device writes data to a buffer, and those writes can be 
relaxed because the order in which the data bytes enter the buffer is 
not visible to software. The device then writes to the "done" flag 
location, to indicate to the consumer that it has completed the buffer 
write. The write to "done" cannot pass the prior writes to the data 
buffer; otherwise software polling on the "done" location could 
potentially read stale data from the data buffer. Device driver 
developers need to understand the hardware behavior and only set the 
DMA_ATTR_WEAK_ORDERING attribute when safe.

Alan Adamson

On 06/17/2017 04:57 AM, Christoph Hellwig wrote:
> On Fri, Jun 16, 2017@01:34:50PM -0700, Alan Adamson wrote:
>> SPARC based platforms suffer significant read performance penalties for
>> nvme reads when Relaxed Ordering is not enabled. This change sets the
>> DMA_ATTR_WEAK_ORDERING (Relaxed Ordering) attribute for nvme block dma
>> buffers.
> Please explain what it does exactly, and why you think it's safe for NVMe,
> as that belongs into the changelog.
>
> Bonus points for explaining why DMA_ATTR_WEAK_ORDERING shouldn't
> be the default behavior for the dma_map_ routines instead of sprinkling
> it over just about every driver..
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 15+ messages in thread

* DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
  2017-06-21 20:30   ` Alan Adamson
       [not found]     ` <6a32f30b-433a-9857-164f-1717405a6082-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2017-06-24  7:35         ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2017-06-24  7:35 UTC (permalink / raw)
  To: Alan Adamson
  Cc: Mark Nelson, axboe-b10kYP2dOMg, sagi-NQWnxTmZq1alnMjI0IkVqw,
	Arnd Bergmann, linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	keith.busch-ral2JQCrhuEAvxtiuMwx3w,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Benjamin Herrenschmidt, Christoph Hellwig

I always assumed that our streaming mappings are relaxed order for
TLP anyway.  And at very least Documentation/DMA-attributes.txt seems
to imply something different:


  DMA_ATTR_WEAK_ORDERING
  ----------------------

  DMA_ATTR_WEAK_ORDERING specifies that reads and writes to the mapping
  may be weakly ordered, that is that reads and writes may pass each other.

Which to me suggest host reads, which also makes me wonder why these
would even apply to our streaming mappings.

Adding the powerpc folks that added the DMA_ATTR_WEAK_ORDERING flag
originally back in 2008, but not actual users as far as I can tell -
those are all new and from the sparc gang, except for the noveau
driver, which is a bit older but only uses for dma_alloc_attrs, where
the original description makes sense to me.

On Wed, Jun 21, 2017 at 01:30:40PM -0700, Alan Adamson wrote:
> drivers may rely on strong ordering for certain writes. For example, a 
> device may use a specific location in host memory to serve as a "done" 
> flag. The device writes data to a buffer, and those writes can be relaxed 
> because the order in which the data bytes enter the buffer is not visible 
> to software. The device then writes to the "done" flag location, to 
> indicate to the consumer that it has completed the buffer write. The write 
> to "done" cannot pass the prior writes to the data buffer; otherwise 
> software polling on the "done" location could potentially read stale data 
> from the data buffer. Device driver developers need to understand the 
> hardware behavior and only set the DMA_ATTR_WEAK_ORDERING attribute when 
> safe.
>
> Alan Adamson

^ permalink raw reply	[flat|nested] 15+ messages in thread

* DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
@ 2017-06-24  7:35         ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2017-06-24  7:35 UTC (permalink / raw)


I always assumed that our streaming mappings are relaxed order for
TLP anyway.  And at very least Documentation/DMA-attributes.txt seems
to imply something different:


  DMA_ATTR_WEAK_ORDERING
  ----------------------

  DMA_ATTR_WEAK_ORDERING specifies that reads and writes to the mapping
  may be weakly ordered, that is that reads and writes may pass each other.

Which to me suggest host reads, which also makes me wonder why these
would even apply to our streaming mappings.

Adding the powerpc folks that added the DMA_ATTR_WEAK_ORDERING flag
originally back in 2008, but not actual users as far as I can tell -
those are all new and from the sparc gang, except for the noveau
driver, which is a bit older but only uses for dma_alloc_attrs, where
the original description makes sense to me.

On Wed, Jun 21, 2017@01:30:40PM -0700, Alan Adamson wrote:
> drivers may rely on strong ordering for certain writes. For example, a 
> device may use a specific location in host memory to serve as a "done" 
> flag. The device writes data to a buffer, and those writes can be relaxed 
> because the order in which the data bytes enter the buffer is not visible 
> to software. The device then writes to the "done" flag location, to 
> indicate to the consumer that it has completed the buffer write. The write 
> to "done" cannot pass the prior writes to the data buffer; otherwise 
> software polling on the "done" location could potentially read stale data 
> from the data buffer. Device driver developers need to understand the 
> hardware behavior and only set the DMA_ATTR_WEAK_ORDERING attribute when 
> safe.
>
> Alan Adamson

^ permalink raw reply	[flat|nested] 15+ messages in thread

* DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
@ 2017-06-24  7:35         ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2017-06-24  7:35 UTC (permalink / raw)
  To: Alan Adamson
  Cc: Christoph Hellwig, keith.busch, axboe, linux-nvme, sagi,
	linux-pci, Mark Nelson, Arnd Bergmann, Benjamin Herrenschmidt,
	iommu

I always assumed that our streaming mappings are relaxed order for
TLP anyway.  And at very least Documentation/DMA-attributes.txt seems
to imply something different:


  DMA_ATTR_WEAK_ORDERING
  ----------------------

  DMA_ATTR_WEAK_ORDERING specifies that reads and writes to the mapping
  may be weakly ordered, that is that reads and writes may pass each other.

Which to me suggest host reads, which also makes me wonder why these
would even apply to our streaming mappings.

Adding the powerpc folks that added the DMA_ATTR_WEAK_ORDERING flag
originally back in 2008, but not actual users as far as I can tell -
those are all new and from the sparc gang, except for the noveau
driver, which is a bit older but only uses for dma_alloc_attrs, where
the original description makes sense to me.

On Wed, Jun 21, 2017 at 01:30:40PM -0700, Alan Adamson wrote:
> drivers may rely on strong ordering for certain writes. For example, a 
> device may use a specific location in host memory to serve as a "done" 
> flag. The device writes data to a buffer, and those writes can be relaxed 
> because the order in which the data bytes enter the buffer is not visible 
> to software. The device then writes to the "done" flag location, to 
> indicate to the consumer that it has completed the buffer write. The write 
> to "done" cannot pass the prior writes to the data buffer; otherwise 
> software polling on the "done" location could potentially read stale data 
> from the data buffer. Device driver developers need to understand the 
> hardware behavior and only set the DMA_ATTR_WEAK_ORDERING attribute when 
> safe.
>
> Alan Adamson

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
  2017-06-24  7:35         ` Christoph Hellwig
@ 2017-06-26  9:15           ` Arnd Bergmann
  -1 siblings, 0 replies; 15+ messages in thread
From: Arnd Bergmann @ 2017-06-26  9:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alan Adamson, Keith Busch, Jens Axboe, linux-nvme, Sagi Grimberg,
	linux-pci, Mark Nelson, Benjamin Herrenschmidt, iommu

On Sat, Jun 24, 2017 at 9:35 AM, Christoph Hellwig <hch@lst.de> wrote:
> I always assumed that our streaming mappings are relaxed order for
> TLP anyway.  And at very least Documentation/DMA-attributes.txt seems
> to imply something different:
>
>
>   DMA_ATTR_WEAK_ORDERING
>   ----------------------
>
>   DMA_ATTR_WEAK_ORDERING specifies that reads and writes to the mapping
>   may be weakly ordered, that is that reads and writes may pass each other.
>
> Which to me suggest host reads, which also makes me wonder why these
> would even apply to our streaming mappings.
>
> Adding the powerpc folks that added the DMA_ATTR_WEAK_ORDERING flag
> originally back in 2008, but not actual users as far as I can tell -
> those are all new and from the sparc gang, except for the noveau
> driver, which is a bit older but only uses for dma_alloc_attrs, where
> the original description makes sense to me.

I suspect people have applied the same name for different things over time.

When we added it in 2008, there was one specific use case, improving
throughput on infiniband, iirc using the mthca device driver. In the Cell
IOMMU, this attribute causes concurrent DMA transfers on PCIe
devices to disregard the stricter PCI ordering rules: multiple transfers
could be initiated together and complete in an arbitrary order depending
on e.g. I/O page faults in the IOMMU. Using the dma attributes, this
could be controlled to apply only to certain transfers, so a completion
queue DMA would still be ordered with regard to other transfers, as it would
not come with that attribute.

There are two things that I don't remember unfortunately.

a) why this patch seems to have never made it into the mainline kernel:
http://lists.openfabrics.org/pipermail/general/2008-October/055006.html
We probably had it added to both OFED and the supported distros at
the time (RHEL, Fedora, SLES, all in versions I don't remember any
more)

b) How this relates to the PCIe relaxed ordering flag on DMA transfers.
It's possible that the hardware later learned to correctly set the
flag on the actual transfer to have the same effect, or it's possible
that we had to add the dma_alloc_attr() interface specifically because
the hardware could not do this for some but not other transfers.

      Arnd

^ permalink raw reply	[flat|nested] 15+ messages in thread

* DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
@ 2017-06-26  9:15           ` Arnd Bergmann
  0 siblings, 0 replies; 15+ messages in thread
From: Arnd Bergmann @ 2017-06-26  9:15 UTC (permalink / raw)


On Sat, Jun 24, 2017@9:35 AM, Christoph Hellwig <hch@lst.de> wrote:
> I always assumed that our streaming mappings are relaxed order for
> TLP anyway.  And at very least Documentation/DMA-attributes.txt seems
> to imply something different:
>
>
>   DMA_ATTR_WEAK_ORDERING
>   ----------------------
>
>   DMA_ATTR_WEAK_ORDERING specifies that reads and writes to the mapping
>   may be weakly ordered, that is that reads and writes may pass each other.
>
> Which to me suggest host reads, which also makes me wonder why these
> would even apply to our streaming mappings.
>
> Adding the powerpc folks that added the DMA_ATTR_WEAK_ORDERING flag
> originally back in 2008, but not actual users as far as I can tell -
> those are all new and from the sparc gang, except for the noveau
> driver, which is a bit older but only uses for dma_alloc_attrs, where
> the original description makes sense to me.

I suspect people have applied the same name for different things over time.

When we added it in 2008, there was one specific use case, improving
throughput on infiniband, iirc using the mthca device driver. In the Cell
IOMMU, this attribute causes concurrent DMA transfers on PCIe
devices to disregard the stricter PCI ordering rules: multiple transfers
could be initiated together and complete in an arbitrary order depending
on e.g. I/O page faults in the IOMMU. Using the dma attributes, this
could be controlled to apply only to certain transfers, so a completion
queue DMA would still be ordered with regard to other transfers, as it would
not come with that attribute.

There are two things that I don't remember unfortunately.

a) why this patch seems to have never made it into the mainline kernel:
http://lists.openfabrics.org/pipermail/general/2008-October/055006.html
We probably had it added to both OFED and the supported distros at
the time (RHEL, Fedora, SLES, all in versions I don't remember any
more)

b) How this relates to the PCIe relaxed ordering flag on DMA transfers.
It's possible that the hardware later learned to correctly set the
flag on the actual transfer to have the same effect, or it's possible
that we had to add the dma_alloc_attr() interface specifically because
the hardware could not do this for some but not other transfers.

      Arnd

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
  2017-06-24  7:35         ` Christoph Hellwig
@ 2017-06-27 20:46           ` chris hyser
  -1 siblings, 0 replies; 15+ messages in thread
From: chris hyser @ 2017-06-27 20:46 UTC (permalink / raw)
  To: Christoph Hellwig, Alan Adamson
  Cc: keith.busch, axboe, linux-nvme, sagi, linux-pci, Mark Nelson,
	Arnd Bergmann, Benjamin Herrenschmidt, iommu

On 6/24/2017 3:35 AM, Christoph Hellwig wrote:
> I always assumed that our streaming mappings are relaxed order for
> TLP anyway.  And at very least Documentation/DMA-attributes.txt seems
> to imply something different:
>
>
>   DMA_ATTR_WEAK_ORDERING
>   ----------------------
>
>   DMA_ATTR_WEAK_ORDERING specifies that reads and writes to the mapping
>   may be weakly ordered, that is that reads and writes may pass each other.
>
> Which to me suggest host reads, which also makes me wonder why these
> would even apply to our streaming mappings.
>
> Adding the powerpc folks that added the DMA_ATTR_WEAK_ORDERING flag
> originally back in 2008, but not actual users as far as I can tell -
> those are all new and from the sparc gang, except for the noveau
> driver, which is a bit older but only uses for dma_alloc_attrs, where
> the original description makes sense to me.

I put this in for SPARC. In our case the host bridge/RC itself follows very strict ordering unless the relaxed order bit 
is set in the TLP. This works great for devices that actually allow the driver to enable it. We however also have to 
support an infiniband card that does not support enabling this in the HW and thus in the TLP but is actually fine with 
relaxed order for the data buffers (ie the streaming I/O vs the coherent control buffers). In fact w/o relaxed order the 
performance is absolutely atrocious ... w/ exceeds x86. This flag enables the driver to signal to us when we map the 
buffer in the IOMMU to enable the relaxed order attribute for our HW.

-chrish

^ permalink raw reply	[flat|nested] 15+ messages in thread

* DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
@ 2017-06-27 20:46           ` chris hyser
  0 siblings, 0 replies; 15+ messages in thread
From: chris hyser @ 2017-06-27 20:46 UTC (permalink / raw)


On 6/24/2017 3:35 AM, Christoph Hellwig wrote:
> I always assumed that our streaming mappings are relaxed order for
> TLP anyway.  And at very least Documentation/DMA-attributes.txt seems
> to imply something different:
>
>
>   DMA_ATTR_WEAK_ORDERING
>   ----------------------
>
>   DMA_ATTR_WEAK_ORDERING specifies that reads and writes to the mapping
>   may be weakly ordered, that is that reads and writes may pass each other.
>
> Which to me suggest host reads, which also makes me wonder why these
> would even apply to our streaming mappings.
>
> Adding the powerpc folks that added the DMA_ATTR_WEAK_ORDERING flag
> originally back in 2008, but not actual users as far as I can tell -
> those are all new and from the sparc gang, except for the noveau
> driver, which is a bit older but only uses for dma_alloc_attrs, where
> the original description makes sense to me.

I put this in for SPARC. In our case the host bridge/RC itself follows very strict ordering unless the relaxed order bit 
is set in the TLP. This works great for devices that actually allow the driver to enable it. We however also have to 
support an infiniband card that does not support enabling this in the HW and thus in the TLP but is actually fine with 
relaxed order for the data buffers (ie the streaming I/O vs the coherent control buffers). In fact w/o relaxed order the 
performance is absolutely atrocious ... w/ exceeds x86. This flag enables the driver to signal to us when we map the 
buffer in the IOMMU to enable the relaxed order attribute for our HW.

-chrish

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
  2017-06-27 20:46           ` chris hyser
@ 2017-07-05 19:07             ` Christoph Hellwig
  -1 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2017-07-05 19:07 UTC (permalink / raw)
  To: chris hyser
  Cc: Christoph Hellwig, Alan Adamson, keith.busch, axboe, linux-nvme,
	sagi, linux-pci, Mark Nelson, Arnd Bergmann,
	Benjamin Herrenschmidt, iommu

On Tue, Jun 27, 2017 at 04:46:31PM -0400, chris hyser wrote:
> I put this in for SPARC. In our case the host bridge/RC itself follows very 
> strict ordering unless the relaxed order bit is set in the TLP. This works 
> great for devices that actually allow the driver to enable it. We however 
> also have to support an infiniband card that does not support enabling this 
> in the HW and thus in the TLP but is actually fine with relaxed order for 
> the data buffers (ie the streaming I/O vs the coherent control buffers). In 
> fact w/o relaxed order the performance is absolutely atrocious ... w/ 
> exceeds x86. This flag enables the driver to signal to us when we map the 
> buffer in the IOMMU to enable the relaxed order attribute for our HW.

We'll really need to start writing down our semantics.  As I said
given how our streaming dma mappings (dma_map_single/page/sg) are
defined I can't think of any way how relaxed vs strict ordering would
matter for them, so just enabling it by default seems like the right
thing in this case, instead of having to patch every PCIe driver people
might ever use on sparc to work around your bridge.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
@ 2017-07-05 19:07             ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2017-07-05 19:07 UTC (permalink / raw)


On Tue, Jun 27, 2017@04:46:31PM -0400, chris hyser wrote:
> I put this in for SPARC. In our case the host bridge/RC itself follows very 
> strict ordering unless the relaxed order bit is set in the TLP. This works 
> great for devices that actually allow the driver to enable it. We however 
> also have to support an infiniband card that does not support enabling this 
> in the HW and thus in the TLP but is actually fine with relaxed order for 
> the data buffers (ie the streaming I/O vs the coherent control buffers). In 
> fact w/o relaxed order the performance is absolutely atrocious ... w/ 
> exceeds x86. This flag enables the driver to signal to us when we map the 
> buffer in the IOMMU to enable the relaxed order attribute for our HW.

We'll really need to start writing down our semantics.  As I said
given how our streaming dma mappings (dma_map_single/page/sg) are
defined I can't think of any way how relaxed vs strict ordering would
matter for them, so just enabling it by default seems like the right
thing in this case, instead of having to patch every PCIe driver people
might ever use on sparc to work around your bridge.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
  2017-07-05 19:07             ` Christoph Hellwig
  (?)
@ 2017-07-10 16:17                 ` chris hyser
  -1 siblings, 0 replies; 15+ messages in thread
From: chris hyser @ 2017-07-10 16:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mark Nelson, axboe-b10kYP2dOMg, sagi-NQWnxTmZq1alnMjI0IkVqw,
	Arnd Bergmann, linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	keith.busch-ral2JQCrhuEAvxtiuMwx3w,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Alan Adamson,
	Benjamin Herrenschmidt

On 7/5/2017 3:07 PM, Christoph Hellwig wrote:
> On Tue, Jun 27, 2017 at 04:46:31PM -0400, chris hyser wrote:
>> I put this in for SPARC. In our case the host bridge/RC itself follows very
>> strict ordering unless the relaxed order bit is set in the TLP. This works
>> great for devices that actually allow the driver to enable it. We however
>> also have to support an infiniband card that does not support enabling this
>> in the HW and thus in the TLP but is actually fine with relaxed order for
>> the data buffers (ie the streaming I/O vs the coherent control buffers). In
>> fact w/o relaxed order the performance is absolutely atrocious ... w/
>> exceeds x86. This flag enables the driver to signal to us when we map the
>> buffer in the IOMMU to enable the relaxed order attribute for our HW.
>
> We'll really need to start writing down our semantics.  As I said

Fair enough. This will get documented.

> given how our streaming dma mappings (dma_map_single/page/sg) are
> defined I can't think of any way how relaxed vs strict ordering would
> matter for them, so just enabling it by default seems like the right
> thing in this case, instead of having to patch every PCIe driver people

People get real squeamish about just turning on relaxed order for driver/devices we actually know something about and 
test let alone everything. I do agree with your interpretation of the streaming DMA semantics. That does not mean every 
driver writer did or does. :-)

> might ever use on sparc to work around your bridge.

To be clear, this isn't really a host bridge issue. Technically no driver that runs on SPARC needs this enabled -- even 
performance critical ... if they do the HW correctly. Again, we have a necessary performance critical device which does 
not correctly signal its okayness with relaxed order for non-control DMA. This feature of the host bridge is to enable a 
workaround for just these cases (deemed needed by past history). Said differently, lots of devices are made for the 
commodity market where shortcuts like this apparently do not affect performance but are critical to performance on large 
systems.

-chrish

^ permalink raw reply	[flat|nested] 15+ messages in thread

* DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
@ 2017-07-10 16:17                 ` chris hyser
  0 siblings, 0 replies; 15+ messages in thread
From: chris hyser @ 2017-07-10 16:17 UTC (permalink / raw)


On 7/5/2017 3:07 PM, Christoph Hellwig wrote:
> On Tue, Jun 27, 2017@04:46:31PM -0400, chris hyser wrote:
>> I put this in for SPARC. In our case the host bridge/RC itself follows very
>> strict ordering unless the relaxed order bit is set in the TLP. This works
>> great for devices that actually allow the driver to enable it. We however
>> also have to support an infiniband card that does not support enabling this
>> in the HW and thus in the TLP but is actually fine with relaxed order for
>> the data buffers (ie the streaming I/O vs the coherent control buffers). In
>> fact w/o relaxed order the performance is absolutely atrocious ... w/
>> exceeds x86. This flag enables the driver to signal to us when we map the
>> buffer in the IOMMU to enable the relaxed order attribute for our HW.
>
> We'll really need to start writing down our semantics.  As I said

Fair enough. This will get documented.

> given how our streaming dma mappings (dma_map_single/page/sg) are
> defined I can't think of any way how relaxed vs strict ordering would
> matter for them, so just enabling it by default seems like the right
> thing in this case, instead of having to patch every PCIe driver people

People get real squeamish about just turning on relaxed order for driver/devices we actually know something about and 
test let alone everything. I do agree with your interpretation of the streaming DMA semantics. That does not mean every 
driver writer did or does. :-)

> might ever use on sparc to work around your bridge.

To be clear, this isn't really a host bridge issue. Technically no driver that runs on SPARC needs this enabled -- even 
performance critical ... if they do the HW correctly. Again, we have a necessary performance critical device which does 
not correctly signal its okayness with relaxed order for non-control DMA. This feature of the host bridge is to enable a 
workaround for just these cases (deemed needed by past history). Said differently, lots of devices are made for the 
commodity market where shortcuts like this apparently do not affect performance but are critical to performance on large 
systems.

-chrish

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: DMA_ATTR_WEAK_ORDERING defintion, was Re: [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers
@ 2017-07-10 16:17                 ` chris hyser
  0 siblings, 0 replies; 15+ messages in thread
From: chris hyser @ 2017-07-10 16:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alan Adamson, keith.busch, axboe, linux-nvme, sagi, linux-pci,
	Mark Nelson, Arnd Bergmann, Benjamin Herrenschmidt, iommu

On 7/5/2017 3:07 PM, Christoph Hellwig wrote:
> On Tue, Jun 27, 2017 at 04:46:31PM -0400, chris hyser wrote:
>> I put this in for SPARC. In our case the host bridge/RC itself follows very
>> strict ordering unless the relaxed order bit is set in the TLP. This works
>> great for devices that actually allow the driver to enable it. We however
>> also have to support an infiniband card that does not support enabling this
>> in the HW and thus in the TLP but is actually fine with relaxed order for
>> the data buffers (ie the streaming I/O vs the coherent control buffers). In
>> fact w/o relaxed order the performance is absolutely atrocious ... w/
>> exceeds x86. This flag enables the driver to signal to us when we map the
>> buffer in the IOMMU to enable the relaxed order attribute for our HW.
>
> We'll really need to start writing down our semantics.  As I said

Fair enough. This will get documented.

> given how our streaming dma mappings (dma_map_single/page/sg) are
> defined I can't think of any way how relaxed vs strict ordering would
> matter for them, so just enabling it by default seems like the right
> thing in this case, instead of having to patch every PCIe driver people

People get real squeamish about just turning on relaxed order for driver/devices we actually know something about and 
test let alone everything. I do agree with your interpretation of the streaming DMA semantics. That does not mean every 
driver writer did or does. :-)

> might ever use on sparc to work around your bridge.

To be clear, this isn't really a host bridge issue. Technically no driver that runs on SPARC needs this enabled -- even 
performance critical ... if they do the HW correctly. Again, we have a necessary performance critical device which does 
not correctly signal its okayness with relaxed order for non-control DMA. This feature of the host bridge is to enable a 
workaround for just these cases (deemed needed by past history). Said differently, lots of devices are made for the 
commodity market where shortcuts like this apparently do not affect performance but are critical to performance on large 
systems.

-chrish

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-07-10 16:19 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-06-16 20:34 [PATCH] nvme: set DMA_ATTR_WEAK_ORDERING attribute on dma buffers Alan Adamson
2017-06-17 11:57 ` Christoph Hellwig
2017-06-21 20:30   ` Alan Adamson
     [not found]     ` <6a32f30b-433a-9857-164f-1717405a6082-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-06-24  7:35       ` DMA_ATTR_WEAK_ORDERING defintion, was " Christoph Hellwig
2017-06-24  7:35         ` Christoph Hellwig
2017-06-24  7:35         ` Christoph Hellwig
2017-06-26  9:15         ` Arnd Bergmann
2017-06-26  9:15           ` Arnd Bergmann
2017-06-27 20:46         ` chris hyser
2017-06-27 20:46           ` chris hyser
2017-07-05 19:07           ` Christoph Hellwig
2017-07-05 19:07             ` Christoph Hellwig
     [not found]             ` <20170705190714.GA7107-jcswGhMUV9g@public.gmane.org>
2017-07-10 16:17               ` chris hyser
2017-07-10 16:17                 ` chris hyser
2017-07-10 16:17                 ` chris hyser

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.