From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <1492381396.25766.43.camel@kernel.crashing.org>
Subject: Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
From: Benjamin Herrenschmidt
To: Dan Williams, Logan Gunthorpe
Cc: Bjorn Helgaas, Jason Gunthorpe, Christoph Hellwig, Sagi Grimberg,
 "James E.J. Bottomley", "Martin K. Petersen", Jens Axboe, Steve Wise,
 Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci@vger.kernel.org,
 linux-scsi, linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org,
 linux-nvdimm, "linux-kernel@vger.kernel.org", Jerome Glisse
Date: Mon, 17 Apr 2017 08:23:16 +1000
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
List-Id: linux-nvdimm@lists.01.org

On Sun, 2017-04-16 at 08:44 -0700, Dan Williams wrote:
> The difference is that there was nothing fundamental in the core
> design of pmem + DAX that prevented other archs from growing pmem
> support.

Indeed. In fact we have work in progress support for pmem on power
using experimental HW.

> THP and memory hotplug existed on other architectures and
> they just need to plug in their arch-specific enabling. p2p support
> needs the same starting point of something more than one architecture
> can plug into, and handling the bus address offset case needs to be
> incorporated into the design.
>
> pmem + dax did not change the meaning of what a dma_addr_t is, p2p does.

The more I think about it, the more I tend toward something along the
lines of having the arch DMA ops be able to quickly differentiate
between "normal" memory (which includes non-PCI pmem in some cases,
it's an architecture choice I suppose) and "special device" memory
(page flag? pfn bit? ... there are options).

From there, we keep our existing fast path for the normal case.

For the special case, we need to provide a fast lookup mechanism
(assuming we can't stash enough stuff in struct page or the pfn)
to get back to a struct of some sort that provides the necessary
information to resolve the translation.

This *could* be something like a struct p2mem device that carries
a special set of DMA ops, though we probably shouldn't make the generic
structure PCI-specific.

This is a slightly slower path, but that "stub" structure allows the
special DMA ops to provide the necessary bus-specific knowledge, which
for PCI, for example, can check whether the devices are on the same
segment, whether the switches are configured to allow p2p, etc.

What form should that fast lookup take? It's not completely clear to
me at this point. We could start with a simple linear lookup I suppose
and improve in a second stage.

Of course this pipes into the old discussion about disconnecting
the DMA ops from struct page. If we keep struct page, any device that
wants to be a potential DMA target will need to do something "special"
to create those struct pages etc., though we could make that a simple
PCI helper that pops the necessary bits and pieces for a given BAR &
range.

If we don't need struct page, then it might be possible to hide it all
in the PCI infrastructure.

> > Virtualization specifically would be a _lot_ more difficult than simply
> > supporting offsets. The actual topology of the bus will probably be lost
> > on the guest OS and it would therefore have a difficult time figuring out
> > when it's acceptable to use p2pmem. I also have a difficult time seeing
> > a use case for it and thus I have a hard time with the argument that we
> > can't support use cases that do want it because use cases that don't
> > want it (perhaps yet) won't work.
> >
> > > This is an interesting experiment to look at I suppose, but if you
> > > ever want this upstream I would like at least for you to develop a
> > > strategy to support the wider case, if not an actual implementation.
> >
> > I think there are plenty of avenues forward to support offsets, etc.
> > It's just work. Nothing we'd be proposing would be incompatible with it.
> > We just don't want to have to do it all upfront especially when no one
> > really knows how well various architectures' hardware supports this or
> > if anyone even wants to run it on systems such as those. (Keep in mind
> > this is a pretty specific optimization that mostly helps systems
> > designed in specific ways -- not a general "everybody gets faster" type
> > situation.) Get the cases working we know will work, can easily support
> > and people actually want. Then expand it to support others as people
> > come around with hardware to test and use cases for it.
>
> I think you need to give other archs a chance to support this with a
> design that considers the offset case as a first class citizen rather
> than an afterthought.

Thanks :-) There's a reason why I'm insisting on this. We have constant
requests for this today. We have hacks in the GPU drivers to do it for
GPUs behind a switch, but those are just that, ad-hoc hacks in the
drivers. We have similar grossness around the corner with some CAPI
NICs trying to DMA to GPUs. I have people trying to use PLX DMA engines
to whack NVMe devices.

I'm very interested in a more generic solution to deal with the problem
of P2P between devices. I'm happy to contribute code to handle the
powerpc bits but we need to agree on the design first :)

Cheers,
Ben.
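
Illustrative sketch (not part of the original message): the fast-path /
slow-path split described in the mail can be modelled in a few lines of
user-space Python. This is a toy under stated assumptions, not kernel
code. Every name here (P2P_PFN_FLAG, P2PRegion, DmaOps, bus_offset) is
hypothetical, the "pfn bit" option is picked arbitrarily over a page
flag, normal memory is assumed to map 1:1, and the lookup is the simple
linear one suggested as a starting point.

```python
from dataclasses import dataclass
from typing import List, Optional

PAGE_SHIFT = 12            # assumption: 4 KiB pages
P2P_PFN_FLAG = 1 << 40     # hypothetical "special device" marker bit in the pfn


@dataclass
class Device:
    name: str
    segment: int           # stand-in for a PCI segment/domain number


@dataclass
class P2PRegion:
    """Hypothetical 'stub' structure for one range of device memory."""
    owner: Device
    start_pfn: int
    nr_pages: int
    bus_offset: int        # bus address minus CPU physical address
    p2p_allowed: bool = True   # e.g. switches configured to allow p2p

    def contains(self, pfn: int) -> bool:
        return self.start_pfn <= pfn < self.start_pfn + self.nr_pages

    def map_pfn(self, initiator: Device, pfn: int) -> Optional[int]:
        # Bus-specific knowledge lives here: same segment, p2p permitted.
        if not self.p2p_allowed or initiator.segment != self.owner.segment:
            return None
        return (pfn << PAGE_SHIFT) + self.bus_offset


class DmaOps:
    """Toy model of arch DMA ops with a fast path and a lookup path."""

    def __init__(self) -> None:
        self.regions: List[P2PRegion] = []

    def register(self, region: P2PRegion) -> None:
        self.regions.append(region)

    def map_pfn(self, initiator: Device, pfn: int) -> Optional[int]:
        if not pfn & P2P_PFN_FLAG:
            # Fast path: normal memory, assumed identity-mapped here.
            return pfn << PAGE_SHIFT
        raw_pfn = pfn & ~P2P_PFN_FLAG
        for region in self.regions:      # simple linear lookup
            if region.contains(raw_pfn):
                return region.map_pfn(initiator, raw_pfn)
        return None                      # flagged pfn with no registered region
```

In this model a None result stands for "fall back or fail the mapping";
what a real implementation should do in that case is exactly the kind of
design question the thread leaves open.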