From: Will Davis <wdavis@nvidia.com>
To: Bjorn Helgaas <bhelgaas@google.com>
Cc: Dave Jiang <dave.jiang@intel.com>,
John Hubbard <jhubbard@nvidia.com>,
Jonathan Corbet <corbet@lwn.net>,
<iommu@lists.linux-foundation.org>,
Jerome Glisse <jglisse@redhat.com>, <linux-pci@vger.kernel.org>,
Terence Ripperda <tripperda@nvidia.com>,
"David S. Miller" <davem@davemloft.net>,
Mark Hounschell <markh@compro.net>,
"joro@8bytes.org" <joro@8bytes.org>,
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
Alex Williamson <alex.williamson@redhat.com>
Subject: Re: [PATCH v3 7/7] x86: add pci-nommu implementation of map_resource
Date: Tue, 7 Jul 2015 13:59:40 -0500
Message-ID: <1436295580-3034-1-git-send-email-wdavis@nvidia.com>
In-Reply-To: <20150707153428.GA14784@google.com>
> [+cc Mark, Joerg, Konrad, Alex]
>
> Hi Will,
>
> On Wed, Jul 01, 2015 at 01:14:30PM -0500, Will Davis wrote:
> > > From: Bjorn Helgaas <bhelgaas@google.com>
> > > On Fri, May 29, 2015 at 12:14:46PM -0500, wdavis@nvidia.com wrote:
> > > > From: Will Davis <wdavis@nvidia.com>
> > > >
> > > > Lookup the bus address of the resource by finding the parent host bridge,
> > > > which may be different than the parent host bridge of the target device.
> > > >
> > > > Signed-off-by: Will Davis <wdavis@nvidia.com>
> > > > ---
> > > > arch/x86/kernel/pci-nommu.c | 32 ++++++++++++++++++++++++++++++++
> > > > 1 file changed, 32 insertions(+)
> > > >
> > > > diff --git a/arch/x86/kernel/pci-nommu.c b/arch/x86/kernel/pci-nommu.c
> > > > index da15918..6384482 100644
> > > > --- a/arch/x86/kernel/pci-nommu.c
> > > > +++ b/arch/x86/kernel/pci-nommu.c
> > > > @@ -38,6 +38,37 @@ static dma_addr_t nommu_map_page(struct device *dev, struct page *page,
> > > > return bus;
> > > > }
> > > >
> > > > +static dma_addr_t nommu_map_resource(struct device *dev, struct resource *res,
> > > > + unsigned long offset, size_t size,
> > > > + enum dma_data_direction dir,
> > > > + struct dma_attrs *attrs)
> > > > +{
> > > > + struct pci_bus *bus;
> > > > + struct pci_host_bridge *bridge;
> > > > + struct resource_entry *window;
> > > > + resource_size_t bus_offset = 0;
> > > > + dma_addr_t dma_address;
> > > > +
> > > > + /* Find the parent host bridge of the resource, and determine the
> > > > + * relative offset.
> > > > + */
> > > > + list_for_each_entry(bus, &pci_root_buses, node) {
> > > > + bridge = to_pci_host_bridge(bus->bridge);
> > > > + resource_list_for_each_entry(window, &bridge->windows) {
> > > > + if (resource_contains(window->res, res))
> > > > + bus_offset = window->offset;
> > > > + }
> > > > + }
> > >
> > > I don't think this is safe. Assume we have the following topology, and
> > > we want to set it up so 0000:00:00.0 can perform peer-to-peer DMA to
> > > 0001:00:01.0:
> > >
> > > pci_bus 0000:00: root bus resource [mem 0x80000000-0xffffffff] (bus address [0x80000000-0xffffffff])
> > > pci 0000:00:00.0: ...
> > > pci_bus 0001:00: root bus resource [mem 0x180000000-0x1ffffffff] (bus address [0x80000000-0xffffffff])
> > > pci 0001:00:01.0: reg 0x10: [mem 0x180000000-0x1803fffff 64bit]
> > >
> > > I assume the way this works is that the driver for 0000:00:00.0 would call
> > > this function with 0001:00:01.0 and [mem 0x180000000-0x1803fffff 64bit].
> > >
> >
> > The intention is that pci_map_resource() would be called with the device to
> > map the region to, and the resource to map. So in this example, we would
> > call pci_map_resource(0000:00:00.0, [mem 0x180000000-0x1803fffff 64bit]).
> > The driver for 0000:00:00.0 needs to pass some information to
> > pci_map_resource() indicating that the mapping is for device 0000:00:00.0.
>
> Oh, of course; that's sort of analogous to the way the other DMA
> mapping interfaces work.
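For reference, the intended call pattern from the driver side would look
roughly like this (a sketch only: I'm assuming the pci_map_resource()
wrapper from patch 3/7 mirrors the nommu_map_resource() signature above,
and error handling is abbreviated):

	/* In the driver for 0000:00:00.0; "peer" is 0001:00:01.0. */
	struct resource *peer_bar = &peer->resource[0]; /* peer's BAR0 */
	dma_addr_t bus_addr;

	bus_addr = pci_map_resource(pdev, peer_bar, 0,
				    resource_size(peer_bar),
				    DMA_BIDIRECTIONAL, NULL);
	if (dma_mapping_error(&pdev->dev, bus_addr))
		return -ENOMEM;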
>
> > > We'll figure out that the resource belongs to 0001:00, so we return a
> > > dma_addr of 0x80000000, which is the bus address as seen by 0001:00:01.0.
> > > But if 0000:00:00.0 uses that address, it refers to something in the
> > > 0000:00 hierarchy, not the 0001:00 hierarchy.
> >
> > If the bus addresses are organized as described, is peer-to-peer DMA even
> > possible with this nommu topology? Is there any way in which device
> > 0000:00:00.0 can address resources under the 0001:00: root bus, since the
> > bus address range is identical?
>
> It doesn't seem possible on conventional PCI, because the host bridge
> to 0000:00 believes the transaction is intended for a device under it,
> not for a device under 0001:00.
>
> On PCIe, I think it would depend on ACS configuration and the IOMMU
> and whether there's anything that can route transactions between host
> bridges.
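For what it's worth, we could at least consult ACS on the path up to the
root using the existing pci_acs_path_enabled() helper; whether that alone
is enough to decide the inter-bridge case is an open question. A sketch:

	/*
	 * If every bridge from pdev up to the root complex has ACS
	 * request/completion redirection and upstream forwarding
	 * enabled, peer requests are forced up to the root complex
	 * (and through the IOMMU) rather than routed directly
	 * between peers.
	 */
	static bool peer_requests_redirected(struct pci_dev *pdev)
	{
		return pci_acs_path_enabled(pdev, NULL,
					    PCI_ACS_RR | PCI_ACS_CR |
					    PCI_ACS_UF);
	}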
>
> Is it important to support peer-to-peer between host bridges? If it's
> not important, you could probably simplify things by disallowing that
> case.
>
I've mostly been focused on peer-to-peer under a single root complex. At
least for our hardware (which is all I've been able to test with), we only
support peer-to-peer when both devices are under the same root complex,
due to performance and/or functional issues, which I'm not entirely clear
on, that show up when peer-to-peer traffic to a device under another root
complex has to cross Intel QPI [1]:
"The ability to use the peer-to-peer
protocol among GPUs, and its performance, is constrained
by the PCIe topology; performance is excellent when two
GPUs share the same PCIe root-complex, e.g. they are directly
connected to a PCIe switch or to the same hub. Otherwise,
when GPUs are linked to different bus branches, performance
may suffers or malfunctionings can arise. This can be an issue
on multi-socket Sandy Bridge Xeon platforms, where PCIe
slots might be connected to different processors, therefore
requiring GPU peer-to-peer traffic to cross the inter-socket
QPI channel(s)."
Apparently the QPI protocol is not quite compatible with PCIe
peer-to-peer, so perhaps that is the restriction being referred to
here [2]:
  "The IOH does not support non-contiguous byte enables from PCI
  Express for remote peer-to-peer MMIO transactions. This is an
  additional restriction over the PCI Express standard requirements
  to prevent incompatibility with Intel QuickPath Interconnect."
[1] http://arxiv.org/pdf/1307.8276.pdf
[2] http://www.intel.com/content/www/us/en/chipsets/5520-5500-chipset-ioh-datasheet.html
I'm trying to find more details on the issues behind this, but they
appear to be Intel-specific. That said, if we can't find an easy way
to tell whether inter-root-complex peer-to-peer is supported, per the
other sub-thread, then it's probably best to just disallow that case
for now.
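If we do disallow it, a minimal sketch of the check (assuming "same
root bus" is a good enough proxy for "same root complex", which may
not hold on every platform):

	/* Walk up to the root bus the device hangs off of. */
	static struct pci_bus *pci_root_bus_of(struct pci_dev *dev)
	{
		struct pci_bus *bus = dev->bus;

		while (bus->parent)
			bus = bus->parent;
		return bus;
	}

	/* Allow peer-to-peer mappings only below a shared root. */
	static bool pci_p2p_allowed(struct pci_dev *a, struct pci_dev *b)
	{
		return pci_root_bus_of(a) == pci_root_bus_of(b);
	}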
> The pci_map_resource(struct pci_dev *, struct resource *, offset, ...)
> interface is analogous to dma_map_single() and similar interfaces.
> But we're essentially using the resource as a proxy to identify the
> other device: we use the resource, i.e., the CPU physical address of
> one of the BARs, to search for the host bridge.
>
> What would you think about explicitly passing both devices, e.g.,
> replacing the "struct resource *" with a "struct pci_dev *, int bar"
> pair? It seems like then we'd be better prepared to figure out
> whether it's even possible to do peer-to-peer between the two devices.
>
Yes, that does sound better to me. I'll work this into the next version.
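Something along these lines, perhaps (illustrative only, not a final
signature):

	dma_addr_t pci_map_resource(struct pci_dev *dev,
				    struct pci_dev *peer, int bar,
				    unsigned long offset, size_t size,
				    enum dma_data_direction dir,
				    struct dma_attrs *attrs);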
Thanks,
Will
> I don't know how to discover that today, but I assume that's just
> because I'm ignorant or there's a hole in the system description that
> might be filled eventually.
>
> Bjorn
>