* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 [not found] <20070818002746.GU1813@sgi.com> @ 2007-08-20 8:24 ` Jes Sorensen 2007-08-20 16:07 ` akepner 2007-08-21 19:35 ` akepner 0 siblings, 2 replies; 24+ messages in thread From: Jes Sorensen @ 2007-08-20 8:24 UTC (permalink / raw) To: akepner; +Cc: linux-kernel, rdreier, linux-ia64 akepner@sgi.com wrote: > Define ARCH_DOES_POSTED_DMA, and override the dma_flags_set_dmaflush > function. Also define dma_flags_get_direction, dma_flags_get_dmaflush > to retrieve the direction and "dmaflush attribute" from flags > passed to the sn_dma_map_* functions. > Hi Arthur, I'm a little concerned about changing the API for the dma_ foo functions, which are defined cross platform. If you want to change that, I think it will require updating the documentation explaining it. Regarding ARCH_DOES_POSTED_DMA, is that sufficient as a #define or does it have to be run-time tested? (does it do anything at this stage). I ask since not all ia64 platforms do posted DMA. Cheers, Jes ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-20 8:24 ` [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 Jes Sorensen @ 2007-08-20 16:07 ` akepner 2007-08-21 19:35 ` akepner 1 sibling, 0 replies; 24+ messages in thread From: akepner @ 2007-08-20 16:07 UTC (permalink / raw) To: Jes Sorensen; +Cc: linux-kernel, rdreier, linux-ia64 On Mon, Aug 20, 2007 at 10:24:53AM +0200, Jes Sorensen wrote: > I'm a little concerned about changing the API for the dma_ foo > functions, which are defined cross platform. If you want to change > that, I think it will require updating the documentation explaining > it. Good idea. I'll post a documentation patchette. > > Regarding ARCH_DOES_POSTED_DMA, is that sufficient as a #define or > does it have to be run-time tested? (does it do anything at this > stage). I ask since not all ia64 platforms do posted DMA. > I think a #define is exactly what we want here. The newly #define-ed function (for now, and maybe forever) does nothing except on IA64_SGI_SN2, which does posted DMA. -- Arthur ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-20 8:24 ` [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 Jes Sorensen 2007-08-20 16:07 ` akepner @ 2007-08-21 19:35 ` akepner 2007-08-21 20:05 ` Randy Dunlap 2007-08-21 20:16 ` Matthew Wilcox 1 sibling, 2 replies; 24+ messages in thread From: akepner @ 2007-08-21 19:35 UTC (permalink / raw) To: Jes Sorensen; +Cc: linux-kernel, rdreier, linux-ia64 > I'm a little concerned about changing the API for the dma_ foo > functions, which are defined cross platform. If you want to change > that, I think it will require updating the documentation explaining > it..... What do you think of the following? (And is there anyone else I should be cc-ing for review?) Document semantics of dma_flags_set_dmaflush() Signed-off-by: Arthur Kepner <akepner@sgi.com> -- DMA-API.txt | 22 ++++++++++++++++++++++ 1 files changed, 22 insertions(+) diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index cc7a8c3..e117b72 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -392,6 +392,28 @@ Notes: You must do this: See also dma_map_single(). +int +dma_flags_set_dmaflush(int dir) + +Amend dir (one of the enum dma_data_direction values), with a platform- +specific "dmaflush" attribute. Unless the platform supports "posted DMA" +this is a no-op. + +On platforms that support posted DMA, dma_flags_set_dmaflush() causes +device writes to the associated memory region to flush in-flight DMA. +This can be important, for example, when (DMA) writes to the memory +region indicate that DMA of data is complete. If DMA of data and DMA of +the completion indication race, as they can do when the platform supports +posted DMA, then the completion indication may arrive in host memory +ahead of some data. + +To prevent this, you might map the memory region used for completion +indications as follows: + + int count, flags = dma_flags_set_dmaflush(DMA_BIDIRECTIONAL); + ..... + count = dma_map_sg(dev, sglist, nents, flags); + Part II - Advanced dma_ usage ----------------------------- -- Arthur ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-21 19:35 ` akepner @ 2007-08-21 20:05 ` Randy Dunlap 2007-08-21 20:55 ` James Bottomley 2007-08-21 20:16 ` Matthew Wilcox 1 sibling, 1 reply; 24+ messages in thread From: Randy Dunlap @ 2007-08-21 20:05 UTC (permalink / raw) To: akepner; +Cc: Jes Sorensen, linux-kernel, rdreier, linux-ia64, James.Bottomley On Tue, 21 Aug 2007 12:35:22 -0700 akepner@sgi.com wrote: > > > I'm a little concerned about changing the API for the dma_ foo > > functions, which are defined cross platform. If you want to change > > that, I think it will require updating the documentation explaining > > it..... > > What do you think of the following? (And is there anyone else > I should be cc-ing for review?) probably the document's author (cc added) > Document semantics of dma_flags_set_dmaflush() > > Signed-off-by: Arthur Kepner <akepner@sgi.com> > -- > DMA-API.txt | 22 ++++++++++++++++++++++ > 1 files changed, 22 insertions(+) > > diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt > index cc7a8c3..e117b72 100644 > --- a/Documentation/DMA-API.txt > +++ b/Documentation/DMA-API.txt > @@ -392,6 +392,28 @@ Notes: You must do this: > > See also dma_map_single(). > > +int > +dma_flags_set_dmaflush(int dir) > + > +Amend dir (one of the enum dma_data_direction values), with a platform- no comma. > +specific "dmaflush" attribute. Unless the platform supports "posted DMA" add comma after "posted DMA" and drop lots of trailing spaces. > +this is a no-op. > + > +On platforms that support posted DMA, dma_flags_set_dmaflush() causes > +device writes to the associated memory region to flush in-flight DMA. > +This can be important, for example, when (DMA) writes to the memory > +region indicate that DMA of data is complete. If DMA of data and DMA of > +the completion indication race, as they can do when the platform supports > +posted DMA, then the completion indication may arrive in host memory > +ahead of some data. > + > +To prevent this, you might map the memory region used for completion > +indications as follows: > + > + int count, flags = dma_flags_set_dmaflush(DMA_BIDIRECTIONAL); > + ..... > + count = dma_map_sg(dev, sglist, nents, flags); > + > > Part II - Advanced dma_ usage > ----------------------------- > -- > Arthur --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-21 20:05 ` Randy Dunlap @ 2007-08-21 20:55 ` James Bottomley 2007-08-22 0:34 ` akepner 0 siblings, 1 reply; 24+ messages in thread From: James Bottomley @ 2007-08-21 20:55 UTC (permalink / raw) To: Randy Dunlap; +Cc: akepner, Jes Sorensen, linux-kernel, rdreier, linux-ia64 On Tue, 2007-08-21 at 13:05 -0700, Randy Dunlap wrote: > > diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt > > index cc7a8c3..e117b72 100644 > > --- a/Documentation/DMA-API.txt > > +++ b/Documentation/DMA-API.txt > > @@ -392,6 +392,28 @@ Notes: You must do this: > > > > See also dma_map_single(). > > > > +int > > +dma_flags_set_dmaflush(int dir) > > + > > +Amend dir (one of the enum dma_data_direction values), with a platform- > > no comma. > > > +specific "dmaflush" attribute. Unless the platform supports "posted DMA" > > add comma after "posted DMA" and drop lots of trailing spaces. > > > +this is a no-op Almost every platform supports posted DMA ... its a property of most PCI bridge chips. > > +On platforms that support posted DMA, dma_flags_set_dmaflush() causes > > +device writes to the associated memory region to flush in-flight DMA. > > +This can be important, for example, when (DMA) writes to the memory > > +region indicate that DMA of data is complete. If DMA of data and DMA of > > +the completion indication race, as they can do when the platform supports > > +posted DMA, then the completion indication may arrive in host memory > > +ahead of some data. This isn't possible on most platforms. PCI write posting can only be flushed by a read transaction on the device (or sometimes any device on the bridge). Either this interface is misnamed and misdescribed, or it can't work for most systems. > > +To prevent this, you might map the memory region used for completion > > +indications as follows: > > + > > + int count, flags = dma_flags_set_dmaflush(DMA_BIDIRECTIONAL); > > + ..... > > + count = dma_map_sg(dev, sglist, nents, flags); James ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-21 20:55 ` James Bottomley @ 2007-08-22 0:34 ` akepner 2007-08-22 1:14 ` James Bottomley 0 siblings, 1 reply; 24+ messages in thread From: akepner @ 2007-08-22 0:34 UTC (permalink / raw) To: James Bottomley Cc: Randy Dunlap, Jes Sorensen, linux-kernel, rdreier, linux-ia64 On Tue, Aug 21, 2007 at 03:55:29PM -0500, James Bottomley wrote: > ..... > Almost every platform supports posted DMA ... its a property of most PCI > bridge chips. > The term "posted DMA" is used to describe this behavior in the Altix Device Driver Writer's Guide, but it may be confusing things here. Maybe a better term will suggest itself if I can clarify.... On Altix, DMA from a device isn't guaranteed to arrive in host memory in the order it was sent from the device. This reordering can happen in the NUMA interconnect (it's specifically not a PCI reordering.) > ...... > This isn't possible on most platforms. PCI write posting can only be > flushed by a read transaction on the device (or sometimes any device on > the bridge). Either this interface is misnamed and misdescribed, or it > can't work for most systems. > Clearly it wasn't described adequately... A read transaction on the device will flush pending writes to the device. But I'm worried about DMA from the device to host memory. On Altix, there are two mechanisms that flush all in-flight DMA to host memory: 1) an interrupt, and 2) a write to a memory region which has a "barrier" attribute set. Obviously option 1 isn't viable for performance reasons. This new interface is about making "option 2" generally available. (As it is now, the only way to get memory with the "barrier" attribute is to allocate it with dma_alloc_coherent().) -- Arthur ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-22 0:34 ` akepner @ 2007-08-22 1:14 ` James Bottomley 2007-08-22 7:39 ` Jes Sorensen 2007-08-22 15:54 ` akepner 0 siblings, 2 replies; 24+ messages in thread From: James Bottomley @ 2007-08-22 1:14 UTC (permalink / raw) To: akepner; +Cc: Randy Dunlap, Jes Sorensen, linux-kernel, rdreier, linux-ia64 On Tue, 2007-08-21 at 17:34 -0700, akepner@sgi.com wrote: > On Tue, Aug 21, 2007 at 03:55:29PM -0500, James Bottomley wrote: > > > ..... > > Almost every platform supports posted DMA ... its a property of most PCI > > bridge chips. > > > > The term "posted DMA" is used to describe this behavior in the Altix > Device Driver Writer's Guide, but it may be confusing things here. > Maybe a better term will suggest itself if I can clarify.... OK, but posted DMA has a pretty specific meaning in terms of PCI, hence the confusion. > On Altix, DMA from a device isn't guaranteed to arrive in host memory > in the order it was sent from the device. This reordering can happen > in the NUMA interconnect (it's specifically not a PCI reordering.) This is mmiowb and read_relaxed() again, isn't it? > > ...... > > This isn't possible on most platforms. PCI write posting can only be > > flushed by a read transaction on the device (or sometimes any device on > > the bridge). Either this interface is misnamed and misdescribed, or it > > can't work for most systems. > > > > Clearly it wasn't described adequately... > > A read transaction on the device will flush pending writes to the > device. But I'm worried about DMA from the device to host memory. > On Altix, there are two mechanisms that flush all in-flight DMA > to host memory: 1) an interrupt, and 2) a write to a memory region > which has a "barrier" attribute set. Obviously option 1 isn't > viable for performance reasons. This new interface is about making > "option 2" generally available. (As it is now, the only way to get > memory with the "barrier" attribute is to allocate it with > dma_alloc_coherent().) Which sounds exactly what mmiowb does ... is there a need for a new API; can't you just use mmiowb()? James ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-22 1:14 ` James Bottomley @ 2007-08-22 7:39 ` Jes Sorensen 2007-08-22 14:02 ` James Bottomley 2007-08-23 5:58 ` Jeremy Higdon 2007-08-22 15:54 ` akepner 1 sibling, 2 replies; 24+ messages in thread From: Jes Sorensen @ 2007-08-22 7:39 UTC (permalink / raw) To: James Bottomley; +Cc: akepner, Randy Dunlap, linux-kernel, rdreier, linux-ia64 James Bottomley wrote: > On Tue, 2007-08-21 at 17:34 -0700, akepner@sgi.com wrote: >> The term "posted DMA" is used to describe this behavior in the Altix >> Device Driver Writer's Guide, but it may be confusing things here. >> Maybe a better term will suggest itself if I can clarify.... > > OK, but posted DMA has a pretty specific meaning in terms of PCI, hence > the confusion. Maybe it would be more better to refer to this as 'out of order DMA'? >> On Altix, DMA from a device isn't guaranteed to arrive in host memory >> in the order it was sent from the device. This reordering can happen >> in the NUMA interconnect (it's specifically not a PCI reordering.) > > This is mmiowb and read_relaxed() again, isn't it? I believe it's the same problem, except this time it's when exposing structures to userland. Cheers, Jes ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-22 7:39 ` Jes Sorensen @ 2007-08-22 14:02 ` James Bottomley 2007-08-22 16:03 ` Jesse Barnes 2007-08-23 5:58 ` Jeremy Higdon 1 sibling, 1 reply; 24+ messages in thread From: James Bottomley @ 2007-08-22 14:02 UTC (permalink / raw) To: Jes Sorensen; +Cc: akepner, Randy Dunlap, linux-kernel, rdreier, linux-ia64 On Wed, 2007-08-22 at 09:39 +0200, Jes Sorensen wrote: > James Bottomley wrote: > > On Tue, 2007-08-21 at 17:34 -0700, akepner@sgi.com wrote: > >> The term "posted DMA" is used to describe this behavior in the Altix > >> Device Driver Writer's Guide, but it may be confusing things here. > >> Maybe a better term will suggest itself if I can clarify.... > > > > OK, but posted DMA has a pretty specific meaning in terms of PCI, hence > > the confusion. > > Maybe it would be more better to refer to this as 'out of order DMA'? Or Relaxed ordering DMA ... that's why the readX_relaxed()? > >> On Altix, DMA from a device isn't guaranteed to arrive in host memory > >> in the order it was sent from the device. This reordering can happen > >> in the NUMA interconnect (it's specifically not a PCI reordering.) > > > > This is mmiowb and read_relaxed() again, isn't it? > > I believe it's the same problem, except this time it's when exposing > structures to userland. Hmm, so how does another kernel API exposing mmiowb in a different way help with this? Surely, if the driver is exporting something to user space, it's simply up to the driver to call mmiowb when the user says it's done? James ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-22 14:02 ` James Bottomley @ 2007-08-22 16:03 ` Jesse Barnes 2007-08-22 16:44 ` James Bottomley 0 siblings, 1 reply; 24+ messages in thread From: Jesse Barnes @ 2007-08-22 16:03 UTC (permalink / raw) To: James Bottomley Cc: Jes Sorensen, akepner, Randy Dunlap, linux-kernel, rdreier, linux-ia64 On Wednesday, August 22, 2007 7:02:38 am James Bottomley wrote: > On Wed, 2007-08-22 at 09:39 +0200, Jes Sorensen wrote: > > James Bottomley wrote: > > > On Tue, 2007-08-21 at 17:34 -0700, akepner@sgi.com wrote: > > >> The term "posted DMA" is used to describe this behavior in the Altix > > >> Device Driver Writer's Guide, but it may be confusing things here. > > >> Maybe a better term will suggest itself if I can clarify.... > > > > > > OK, but posted DMA has a pretty specific meaning in terms of PCI, hence > > > the confusion. > > > > Maybe it would be more better to refer to this as 'out of order DMA'? > > Or Relaxed ordering DMA ... that's why the readX_relaxed()? > > > >> On Altix, DMA from a device isn't guaranteed to arrive in host memory > > >> in the order it was sent from the device. This reordering can happen > > >> in the NUMA interconnect (it's specifically not a PCI reordering.) > > > > > > This is mmiowb and read_relaxed() again, isn't it? > > > > I believe it's the same problem, except this time it's when exposing > > structures to userland. > > Hmm, so how does another kernel API exposing mmiowb in a different way > help with this? Surely, if the driver is exporting something to user > space, it's simply up to the driver to call mmiowb when the user says > it's done? mmiowb() is for PIO->device. This interface is for DMA->memory (see akepner's other mail). The problem is a DMA write (say to a completion queue) from a device may imply something about another DMA write from the same device (say the actual data). If the completion queue write arrives first (which can happen on sn2), the driver must ensure that the rest of the outstanding DMA is complete prior to looking at the completion queue status. It can either use a regular PIO read to do this (i.e. a non-relaxed one) or set a flag on the completion queue DMA address that makes it act as a barrier wrt other DMA, which is what akepner's patch does (which should be much more efficient that using a PIO read to guarantee DMA writes have completed). Jesse ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-22 16:03 ` Jesse Barnes @ 2007-08-22 16:44 ` James Bottomley 2007-08-22 16:51 ` Jesse Barnes 0 siblings, 1 reply; 24+ messages in thread From: James Bottomley @ 2007-08-22 16:44 UTC (permalink / raw) To: Jesse Barnes Cc: Jes Sorensen, akepner, Randy Dunlap, linux-kernel, rdreier, linux-ia64 On Wed, 2007-08-22 at 09:03 -0700, Jesse Barnes wrote: > On Wednesday, August 22, 2007 7:02:38 am James Bottomley wrote: > > On Wed, 2007-08-22 at 09:39 +0200, Jes Sorensen wrote: > > > James Bottomley wrote: > > > > On Tue, 2007-08-21 at 17:34 -0700, akepner@sgi.com wrote: > > > >> The term "posted DMA" is used to describe this behavior in the Altix > > > >> Device Driver Writer's Guide, but it may be confusing things here. > > > >> Maybe a better term will suggest itself if I can clarify.... > > > > > > > > OK, but posted DMA has a pretty specific meaning in terms of PCI, hence > > > > the confusion. > > > > > > Maybe it would be more better to refer to this as 'out of order DMA'? > > > > Or Relaxed ordering DMA ... that's why the readX_relaxed()? > > > > > >> On Altix, DMA from a device isn't guaranteed to arrive in host memory > > > >> in the order it was sent from the device. This reordering can happen > > > >> in the NUMA interconnect (it's specifically not a PCI reordering.) > > > > > > > > This is mmiowb and read_relaxed() again, isn't it? > > > > > > I believe it's the same problem, except this time it's when exposing > > > structures to userland. > > > > Hmm, so how does another kernel API exposing mmiowb in a different way > > help with this? Surely, if the driver is exporting something to user > > space, it's simply up to the driver to call mmiowb when the user says > > it's done? > > mmiowb() is for PIO->device. This interface is for DMA->memory (see akepner's > other mail). > > The problem is a DMA write (say to a completion queue) from a device may imply > something about another DMA write from the same device (say the actual data). > If the completion queue write arrives first (which can happen on sn2), the > driver must ensure that the rest of the outstanding DMA is complete prior to > looking at the completion queue status. It can either use a regular PIO read > to do this (i.e. a non-relaxed one) or set a flag on the completion queue DMA > address that makes it act as a barrier wrt other DMA, which is what akepner's > patch does (which should be much more efficient that using a PIO read to > guarantee DMA writes have completed). This is a violation of the PCI spec, isn't it, like Matthew pointed out? The only time a device->host DMA transaction shouldn't follow strict ordering is when the device sets the relaxed hint in its PCI registers. James ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-22 16:44 ` James Bottomley @ 2007-08-22 16:51 ` Jesse Barnes 2007-08-22 17:04 ` James Bottomley 0 siblings, 1 reply; 24+ messages in thread From: Jesse Barnes @ 2007-08-22 16:51 UTC (permalink / raw) To: James Bottomley Cc: Jes Sorensen, akepner, Randy Dunlap, linux-kernel, rdreier, linux-ia64 On Wednesday, August 22, 2007 9:44:55 am James Bottomley wrote: > > The problem is a DMA write (say to a completion queue) from a device may > > imply something about another DMA write from the same device (say the > > actual data). If the completion queue write arrives first (which can > > happen on sn2), the driver must ensure that the rest of the outstanding > > DMA is complete prior to looking at the completion queue status. It can > > either use a regular PIO read to do this (i.e. a non-relaxed one) or set > > a flag on the completion queue DMA address that makes it act as a barrier > > wrt other DMA, which is what akepner's patch does (which should be much > > more efficient that using a PIO read to guarantee DMA writes have > > completed). > > This is a violation of the PCI spec, isn't it, like Matthew pointed out? > The only time a device->host DMA transaction shouldn't follow strict > ordering is when the device sets the relaxed hint in its PCI registers. Yeah, it is. Whether its allowed in PCIe depends on how you read the spec (but either way it would need to be explicitly enabled). For better or for worse, Altix hardware always behaves this way (well mostly for the better, since most device protocols don't care as they involve PIO, and out of order completion is *much* faster on Altix than strict ordering). Arthur's patch is pretty straightfoward though, so unless someone can think of a better way of hiding this architectural detail in lower level code it's probably a good thing to add (especially given that future revs of PCIe will probably allow this behavior, and hopefully less ambiguously than the current spec). Jesse ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-22 16:51 ` Jesse Barnes @ 2007-08-22 17:04 ` James Bottomley 2007-08-22 17:03 ` Jes Sorensen 2007-08-22 17:17 ` Jesse Barnes 0 siblings, 2 replies; 24+ messages in thread From: James Bottomley @ 2007-08-22 17:04 UTC (permalink / raw) To: Jesse Barnes Cc: Jes Sorensen, akepner, Randy Dunlap, linux-kernel, rdreier, linux-ia64 On Wed, 2007-08-22 at 09:51 -0700, Jesse Barnes wrote: > On Wednesday, August 22, 2007 9:44:55 am James Bottomley wrote: > > > The problem is a DMA write (say to a completion queue) from a device may > > > imply something about another DMA write from the same device (say the > > > actual data). If the completion queue write arrives first (which can > > > happen on sn2), the driver must ensure that the rest of the outstanding > > > DMA is complete prior to looking at the completion queue status. It can > > > either use a regular PIO read to do this (i.e. a non-relaxed one) or set > > > a flag on the completion queue DMA address that makes it act as a barrier > > > wrt other DMA, which is what akepner's patch does (which should be much > > > more efficient that using a PIO read to guarantee DMA writes have > > > completed). > > > > This is a violation of the PCI spec, isn't it, like Matthew pointed out? > > The only time a device->host DMA transaction shouldn't follow strict > > ordering is when the device sets the relaxed hint in its PCI registers. > > Yeah, it is. Whether its allowed in PCIe depends on how you read the spec > (but either way it would need to be explicitly enabled). > > For better or for worse, Altix hardware always behaves this way (well mostly > for the better, since most device protocols don't care as they involve PIO, > and out of order completion is *much* faster on Altix than strict ordering). > > Arthur's patch is pretty straightfoward though, so unless someone can think of > a better way of hiding this architectural detail in lower level code it's > probably a good thing to add (especially given that future revs of PCIe will > probably allow this behavior, and hopefully less ambiguously than the current > spec). The spec isn't ambiguous ... it says if the device and bridge agree on relaxed ordering, then it *may* be observed in the transaction. If there's a disagreement or neither wishes to support relaxed ordering then the transaction *must* be strict. I really don't think a work around for a PCI spec violation belongs in the generic DMA code, do you? The correct fix for this should be to set the device hints to strict ordering, which presumably altix respects? In which case, it sounds like what needs exposing are access to the PCI device hints. I believe both PCI-X and PCIe have these hints as optional specifiers in the bridges, so it should be in a current Rev of the PCI spec. Or are you proposing adding an additional PCI API that allows transaction flushes to be inserted into the stream for devices and bridges that have already negotiated relaxed ordering? ... in which case we need to take this to the PCI list. James ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-22 17:04 ` James Bottomley @ 2007-08-22 17:03 ` Jes Sorensen 2007-08-22 18:10 ` James Bottomley 2007-08-22 17:17 ` Jesse Barnes 1 sibling, 1 reply; 24+ messages in thread From: Jes Sorensen @ 2007-08-22 17:03 UTC (permalink / raw) To: James Bottomley Cc: Jesse Barnes, akepner, Randy Dunlap, linux-kernel, rdreier, linux-ia64 James Bottomley wrote: > I really don't think a work around for a PCI spec violation belongs in > the generic DMA code, do you? The correct fix for this should be to set > the device hints to strict ordering, which presumably altix respects? > In which case, it sounds like what needs exposing are access to the PCI > device hints. I believe both PCI-X and PCIe have these hints as > optional specifiers in the bridges, so it should be in a current Rev of > the PCI spec. Or are you proposing adding an additional PCI API that > allows transaction flushes to be inserted into the stream for devices > and bridges that have already negotiated relaxed ordering? ... in which > case we need to take this to the PCI list. James, I don't believe it respects those hints - I agree, it's a pita, but thats the state of the situation. Even if it did, it would make performance suck as Jesse also pointed out. As I pointed out in my email to Willy is that the NUMA fabric is routed, there's not one path through the system, which is what makes this happen. Jes ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-22 17:03 ` Jes Sorensen @ 2007-08-22 18:10 ` James Bottomley 2007-08-23 8:45 ` Jes Sorensen 0 siblings, 1 reply; 24+ messages in thread From: James Bottomley @ 2007-08-22 18:10 UTC (permalink / raw) To: Jes Sorensen Cc: Jesse Barnes, akepner, Randy Dunlap, linux-kernel, rdreier, linux-ia64 On Wed, 2007-08-22 at 19:03 +0200, Jes Sorensen wrote: > James Bottomley wrote: > > I really don't think a work around for a PCI spec violation belongs in > > the generic DMA code, do you? The correct fix for this should be to set > > the device hints to strict ordering, which presumably altix respects? > > In which case, it sounds like what needs exposing are access to the PCI > > device hints. I believe both PCI-X and PCIe have these hints as > > optional specifiers in the bridges, so it should be in a current Rev of > > the PCI spec. Or are you proposing adding an additional PCI API that > > allows transaction flushes to be inserted into the stream for devices > > and bridges that have already negotiated relaxed ordering? ... in which > > case we need to take this to the PCI list. > > James, > > I don't believe it respects those hints - I agree, it's a pita, but > thats the state of the situation. Even if it did, it would make > performance suck as Jesse also pointed out. > > As I pointed out in my email to Willy is that the NUMA fabric is routed, > there's not one path through the system, which is what makes this > happen. Hmm, didn't see the email ... but I'm probably not cc'd on all the thread. However ... it isn't that you couldn't do it ... it's that you don't want to do it because it's faster to violate the spec ... like all those nice ATA devices that lie about having a cache and then let you power down with uncommitted data still in it ... they work much faster for HDIO tests ... and who ever switches their box off? James ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-22 18:10 ` James Bottomley @ 2007-08-23 8:45 ` Jes Sorensen 0 siblings, 0 replies; 24+ messages in thread From: Jes Sorensen @ 2007-08-23 8:45 UTC (permalink / raw) To: James Bottomley Cc: Jesse Barnes, akepner, Randy Dunlap, linux-kernel, rdreier, linux-ia64 James Bottomley wrote: > Hmm, didn't see the email ... but I'm probably not cc'd on all the > thread. However ... it isn't that you couldn't do it ... it's that you > don't want to do it because it's faster to violate the spec ... like all > those nice ATA devices that lie about having a cache and then let you > power down with uncommitted data still in it ... they work much faster > for HDIO tests ... and who ever switches their box off? James, I didn't do it, I don't know who did it, but sure we can try and track them down and line them up outside ..... Point is that this is how the chips were done and they are out there in numbers. It makes things a *lot* faster to violate the spec in such a system, yes it sucks, but thats how things are. It wouldn't be the first time someone violated the PCI spec in their implementation and I am pretty sure it won't be the last. Cheers, Jes ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-22 17:04 ` James Bottomley 2007-08-22 17:03 ` Jes Sorensen @ 2007-08-22 17:17 ` Jesse Barnes 2007-08-22 18:13 ` James Bottomley 1 sibling, 1 reply; 24+ messages in thread From: Jesse Barnes @ 2007-08-22 17:17 UTC (permalink / raw) To: James Bottomley Cc: Jes Sorensen, akepner, Randy Dunlap, linux-kernel, rdreier, linux-ia64 On Wednesday, August 22, 2007 10:04:32 am James Bottomley wrote: > The spec isn't ambiguous ... it says if the device and bridge agree on > relaxed ordering, then it *may* be observed in the transaction. If > there's a disagreement or neither wishes to support relaxed ordering > then the transaction *must* be strict. Arg, don't make me dig out my spec, I don't have it handy... > I really don't think a work around for a PCI spec violation belongs in > the generic DMA code, do you? The correct fix for this should be to set > the device hints to strict ordering, which presumably altix respects? Well, the Altix hw by itself won't honor device hints (I'm not even sure if there are devices that honor order hints like you outline above). However, Altix could track per-device ordering as long as arch code was called from such a hook. > In which case, it sounds like what needs exposing are access to the PCI > device hints. I believe both PCI-X and PCIe have these hints as > optional specifiers in the bridges, so it should be in a current Rev of > the PCI spec. Or are you proposing adding an additional PCI API that > allows transaction flushes to be inserted into the stream for devices > and bridges that have already negotiated relaxed ordering? ... in which > case we need to take this to the PCI list. I think it would have to be the latter, since otherwise it would be hard to setup just completion queue requests to be ordered. Jesse ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-22 17:17 ` Jesse Barnes @ 2007-08-22 18:13 ` James Bottomley 2007-08-22 18:44 ` akepner 0 siblings, 1 reply; 24+ messages in thread From: James Bottomley @ 2007-08-22 18:13 UTC (permalink / raw) To: Jesse Barnes Cc: Jes Sorensen, akepner, Randy Dunlap, linux-kernel, rdreier, linux-ia64 On Wed, 2007-08-22 at 10:17 -0700, Jesse Barnes wrote: > On Wednesday, August 22, 2007 10:04:32 am James Bottomley wrote: > > The spec isn't ambiguous ... it says if the device and bridge agree on > > relaxed ordering, then it *may* be observed in the transaction. If > > there's a disagreement or neither wishes to support relaxed ordering > > then the transaction *must* be strict. > > Arg, don't make me dig out my spec, I don't have it handy... Well ... I don't have one either. However, Grant Grundler did a presentation to OLS about relaxed ordering, and I went over it pretty thoroughly with him a while ago. The bottom line is that the default is always strict unless both the bridge and the device agree otherwise. > > I really don't think a work around for a PCI spec violation belongs in > > the generic DMA code, do you? The correct fix for this should be to set > > the device hints to strict ordering, which presumably altix respects? > > Well, the Altix hw by itself won't honor device hints (I'm not even sure if > there are devices that honor order hints like you outline above). However, > Altix could track per-device ordering as long as arch code was called from > such a hook. > > > In which case, it sounds like what needs exposing are access to the PCI > > device hints. I believe both PCI-X and PCIe have these hints as > > optional specifiers in the bridges, so it should be in a current Rev of > > the PCI spec. Or are you proposing adding an additional PCI API that > > allows transaction flushes to be inserted into the stream for devices > > and bridges that have already negotiated relaxed ordering? ... in which > > case we need to take this to the PCI list. > > I think it would have to be the latter, since otherwise it would be hard to > setup just completion queue requests to be ordered. OK ... I think this is definitely a PCI specific API ... and probably a generic one for requesting order flushes in devices that have negotiated relaxed ordering. Do you want to start a new thread on linux-pci and cc me? James ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-22 18:13 ` James Bottomley @ 2007-08-22 18:44 ` akepner 0 siblings, 0 replies; 24+ messages in thread From: akepner @ 2007-08-22 18:44 UTC (permalink / raw) To: James Bottomley Cc: Jesse Barnes, Jes Sorensen, Randy Dunlap, linux-kernel, rdreier, linux-ia64 On Wed, Aug 22, 2007 at 01:13:04PM -0500, James Bottomley wrote: > ...... > OK ... I think this is definitely a PCI specific API ... and probably a > generic one for requesting order flushes in devices that have negotiated > relaxed ordering. Do you want to start a new thread on linux-pci and cc > me? > I'll do this. -- Arthur ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-22 7:39 ` Jes Sorensen 2007-08-22 14:02 ` James Bottomley @ 2007-08-23 5:58 ` Jeremy Higdon 1 sibling, 0 replies; 24+ messages in thread From: Jeremy Higdon @ 2007-08-23 5:58 UTC (permalink / raw) To: Jes Sorensen Cc: James Bottomley, akepner, Randy Dunlap, linux-kernel, rdreier, linux-ia64 On Wed, Aug 22, 2007 at 09:39:36AM +0200, Jes Sorensen wrote: > James Bottomley wrote: > >On Tue, 2007-08-21 at 17:34 -0700, akepner@sgi.com wrote: > >>The term "posted DMA" is used to describe this behavior in the Altix > >>Device Driver Writer's Guide, but it may be confusing things here. > >>Maybe a better term will suggest itself if I can clarify.... > > > >OK, but posted DMA has a pretty specific meaning in terms of PCI, hence > >the confusion. > > Maybe it would be more better to refer to this as 'out of order DMA'? > > >>On Altix, DMA from a device isn't guaranteed to arrive in host memory > >>in the order it was sent from the device. This reordering can happen > >>in the NUMA interconnect (it's specifically not a PCI reordering.) > > > >This is mmiowb and read_relaxed() again, isn't it? > > I believe it's the same problem, except this time it's when exposing > structures to userland. Actually, it's a different problem, but with a similar cause. mmiowb() is for the case PIO (or MMIO) write order from two different CPUs can invert somewhere in the NL fabric. read_relaxed() is a performance optimization to avoid the flush that's necessary to avoid inversion in order between a PIO (or MMIO) read and a DMA write. What Arthur's trying to do here is avoid inversion in the order of two DMA writes. jeremy ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-22 1:14 ` James Bottomley 2007-08-22 7:39 ` Jes Sorensen @ 2007-08-22 15:54 ` akepner 1 sibling, 0 replies; 24+ messages in thread From: akepner @ 2007-08-22 15:54 UTC (permalink / raw) To: James Bottomley Cc: Randy Dunlap, Jes Sorensen, linux-kernel, rdreier, linux-ia64 On Tue, Aug 21, 2007 at 08:14:09PM -0500, James Bottomley wrote: > On Tue, 2007-08-21 at 17:34 -0700, akepner@sgi.com wrote: > ..... > > On Altix, DMA from a device isn't guaranteed to arrive in host memory > > in the order it was sent from the device. This reordering can happen > > in the NUMA interconnect (it's specifically not a PCI reordering.) > > This is mmiowb and read_relaxed() again, isn't it? > ..... No, this is different. This problem here has do with ordering writes from the device to host memory. Specifically this problem can be manifested with Infiniband - when a Completion Queue Entry is written by the IB device, it indicates that data is available in host memory. But the write to the Completion Queue can race with DMA of data. (Completion Queues can be allocated by the kernel or in userspace. The race described above can only happen when they are allocated in userspace - kernel allocations of CQs use dma_alloc_coherent() and so get the "barrier" attribute that's needed to prevent the race. The proposed new interface would allow CQs, or anything else, allocated with plain old malloc(), to set the barrier attribute.) -- Arthur ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-21 19:35 ` akepner 2007-08-21 20:05 ` Randy Dunlap @ 2007-08-21 20:16 ` Matthew Wilcox 2007-08-21 21:37 ` akepner 2007-08-22 7:44 ` Jes Sorensen 1 sibling, 2 replies; 24+ messages in thread From: Matthew Wilcox @ 2007-08-21 20:16 UTC (permalink / raw) To: akepner; +Cc: Jes Sorensen, linux-kernel, rdreier, linux-ia64 On Tue, Aug 21, 2007 at 12:35:22PM -0700, akepner@sgi.com wrote: > +int > +dma_flags_set_dmaflush(int dir) > + > +Amend dir (one of the enum dma_data_direction values), with a platform- > +specific "dmaflush" attribute. Unless the platform supports "posted DMA" > +this is a no-op. > + > +On platforms that support posted DMA, dma_flags_set_dmaflush() causes > +device writes to the associated memory region to flush in-flight DMA. > +This can be important, for example, when (DMA) writes to the memory > +region indicate that DMA of data is complete. If DMA of data and DMA of > +the completion indication race, as they can do when the platform supports > +posted DMA, then the completion indication may arrive in host memory > +ahead of some data. So, let me try to understand ... your hardware allows writes from the device to pass other writes from the device? Doesn't that violate the PCI spec? I'm thinking about this (page 43 of PCI 2.3): Posted memory writes moving in the same direction through a bridge will complete on the destination bus in the same order they complete on the originating bus. Even if a single burst on the originating bus is terminated with Disconnect on the destination bus so that it is broken into multiple transactions, those transactions must not allow the data phases to complete on the destination bus in any order other than their order on the originating bus. > +To prevent this, you might map the memory region used for completion > +indications as follows: > + > + int count, flags = dma_flags_set_dmaflush(DMA_BIDIRECTIONAL); > + ..... > + count = dma_map_sg(dev, sglist, nents, flags); So any device driver used on your hardware has to add a call to this new function, or it'll see data corruption? Not acceptable, IMO. If this is a performance optimisation, then by all means add a function drivers can call to say "it's OK, I know about this brokenness, and I don't depend on it", but safety first. -- "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-21 20:16 ` Matthew Wilcox @ 2007-08-21 21:37 ` akepner 2007-08-22 7:44 ` Jes Sorensen 1 sibling, 0 replies; 24+ messages in thread From: akepner @ 2007-08-21 21:37 UTC (permalink / raw) To: Matthew Wilcox; +Cc: Jes Sorensen, linux-kernel, rdreier, linux-ia64 On Tue, Aug 21, 2007 at 02:16:32PM -0600, Matthew Wilcox wrote: > > So, let me try to understand ... your hardware allows writes from the > device to pass other writes from the device? Doesn't that violate the > PCI spec? I'm thinking about this (page 43 of PCI 2.3): > .... I should have stated this more carefully, but it's not a PCI reordering that's being addressed here, it's a reordering that can occur within the NUMA-interconnect. -- Arthur ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 2007-08-21 20:16 ` Matthew Wilcox 2007-08-21 21:37 ` akepner @ 2007-08-22 7:44 ` Jes Sorensen 1 sibling, 0 replies; 24+ messages in thread From: Jes Sorensen @ 2007-08-22 7:44 UTC (permalink / raw) To: Matthew Wilcox; +Cc: akepner, linux-kernel, rdreier, linux-ia64 Matthew Wilcox wrote: > So, let me try to understand ... your hardware allows writes from the > device to pass other writes from the device? Doesn't that violate the > PCI spec? I'm thinking about this (page 43 of PCI 2.3): Hi Matthew, Yes I believe this behavior on sn2 is 'bending' the PCI spec a bit. The problem is that the NUMA fabric is a routed network in itself so there can be multiple paths between the device and the physical memory it DMAs to/from. Cheers, Jes ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2007-08-23 8:45 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20070818002746.GU1813@sgi.com>
2007-08-20 8:24 ` [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64 Jes Sorensen
2007-08-20 16:07 ` akepner
2007-08-21 19:35 ` akepner
2007-08-21 20:05 ` Randy Dunlap
2007-08-21 20:55 ` James Bottomley
2007-08-22 0:34 ` akepner
2007-08-22 1:14 ` James Bottomley
2007-08-22 7:39 ` Jes Sorensen
2007-08-22 14:02 ` James Bottomley
2007-08-22 16:03 ` Jesse Barnes
2007-08-22 16:44 ` James Bottomley
2007-08-22 16:51 ` Jesse Barnes
2007-08-22 17:04 ` James Bottomley
2007-08-22 17:03 ` Jes Sorensen
2007-08-22 18:10 ` James Bottomley
2007-08-23 8:45 ` Jes Sorensen
2007-08-22 17:17 ` Jesse Barnes
2007-08-22 18:13 ` James Bottomley
2007-08-22 18:44 ` akepner
2007-08-23 5:58 ` Jeremy Higdon
2007-08-22 15:54 ` akepner
2007-08-21 20:16 ` Matthew Wilcox
2007-08-21 21:37 ` akepner
2007-08-22 7:44 ` Jes Sorensen
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox