From: Dan Williams
Subject: Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
Date: Tue, 18 Apr 2017 10:27:47 -0700
To: Jason Gunthorpe
Cc: Jens Axboe, "James E.J. Bottomley", "Martin K. Petersen", linux-rdma, Benjamin Herrenschmidt, Steve Wise, linux-kernel, linux-nvme, Keith Busch, Jerome Glisse, Bjorn Helgaas, linux-pci, linux-nvdimm, Max Gurtovoy, linux-scsi, Christoph Hellwig
List-Id: linux-rdma@vger.kernel.org

On Tue, Apr 18, 2017 at 9:45 AM, Jason Gunthorpe wrote:
> On Mon, Apr 17, 2017 at 08:23:16AM +1000, Benjamin Herrenschmidt wrote:
>
>> Thanks :-) There's a reason why I'm insisting on this. We have constant
>> requests for this today. We have hacks in the GPU drivers to do it for
>> GPUs behind a switch, but those are just that, ad-hoc hacks in the
>> drivers. We have similar grossness around the corner with some CAPI
>> NICs trying to DMA to GPUs. I have people trying to use PLX DMA engines
>> to whack NVMe devices.
>
> A lot of people feel this way in the RDMA community too. We have had
> vendors shipping out-of-tree code to enable P2P for RDMA with GPUs for
> years and years now. :(
>
> Attempts to get things into mainline have always run into the same sort
> of roadblocks you've identified in this thread.
>
> FWIW, I read this discussion and it sounds closer to an agreement than
> I've ever seen in the past.
>
> From Ben's comments, I would think that the 'first class' support that
> is needed here is simply a function to return the 'struct device'
> backing a CPU address range.
>
> This is the minimal information required for the arch or IOMMU code
> under the dma ops to figure out the fabric source/dest, compute the
> traffic path, determine if P2P is even possible, what translation
> hardware is crossed, and what DMA address should be used.
>
> If there is going to be more core support for this stuff, I think it
> will be under the topic of more robustly describing the fabric to the
> core, plus core helpers to extract data from that description: e.g.
> compute the path, check if the path crosses translation, etc.
>
> But that isn't really related to P2P, and is probably better left to
> the arch authors to figure out where they need to enhance the existing
> topology data.
>
> I think the key agreement to get out of Logan's series is that P2P DMA
> means:
>  - The BAR will be backed by struct pages
>  - Passing the CPU __iomem address of the BAR to the DMA API is
>    valid and, long term, dma ops providers are expected to fail
>    or return the right DMA address
>  - Mapping BAR memory into userspace and back to the kernel via
>    get_user_pages works transparently, and with the DMA API above
>  - The dma ops provider must be able to tell if source memory is BAR
>    mapped and recover the PCI device backing the mapping.
>
> At least this is what we'd like in RDMA :)
>
> FWIW, RDMA probably wouldn't want to use a p2mem device either; we
> already have APIs that map BAR memory to user space, and would like to
> keep using them. An 'enable P2P for BAR' helper function sounds better
> to me.

...and I think it's not a helper function so much as asking the bus
provider "can these two devices DMA to each other?". The "helper" is the
DMA API redirecting through a software IOMMU that handles bus address
translation differently than it would handle host memory DMA mapping.
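
To make the "ask the bus provider" idea concrete, here is a minimal
sketch of what such a query could look like for PCI, assuming the
conservative policy discussed earlier in the thread: only allow P2P when
both endpoints sit behind the same switch, so traffic never has to cross
the root complex. The function name and the policy are illustrative
assumptions for this sketch, not an existing in-tree interface.

#include <linux/pci.h>

/*
 * Illustrative only: report whether two PCI endpoints share an upstream
 * switch port, i.e. whether peer-to-peer TLPs between them can be routed
 * without crossing the root complex.  The name and the "same switch"
 * policy are assumptions for this sketch, not an in-tree API.
 */
static bool pci_devs_behind_same_switch(struct pci_dev *a, struct pci_dev *b)
{
	struct pci_dev *up_a, *up_b;

	/* Walk a's upstream bridges from nearest to farthest... */
	for (up_a = pci_upstream_bridge(a); up_a;
	     up_a = pci_upstream_bridge(up_a)) {
		/* ...and look for the first bridge that b also passes through. */
		for (up_b = pci_upstream_bridge(b); up_b;
		     up_b = pci_upstream_bridge(up_b)) {
			if (up_a != up_b)
				continue;
			/*
			 * Lowest common ancestor found.  Permit P2P only if
			 * it is a switch port rather than the root port, so
			 * the path stays entirely inside the switch.
			 */
			return pci_is_pcie(up_a) &&
			       pci_pcie_type(up_a) != PCI_EXP_TYPE_ROOT_PORT;
		}
	}

	/* No common upstream bridge: different root ports or domains. */
	return false;
}

If a check along those lines says yes, the software-IOMMU step would
then be responsible for mapping the struct page backing the peer's BAR
to the peer's PCI bus address (or an IOVA when translation hardware is
in the path) rather than a host physical address, which is the "right
DMA address" case from the list above.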