* Re: [RFC 0/7] Peer-direct memory
       [not found] ` <20160211191838.GA23675@obsidianresearch.com>
@ 2016-02-14 14:27   ` Haggai Eran
  2016-02-16 18:22     ` Jason Gunthorpe
       [not found]   ` <20160212201328.GA14122@infradead.org>
  1 sibling, 1 reply; 13+ messages in thread
From: Haggai Eran @ 2016-02-14 14:27 UTC
  To: Jason Gunthorpe, Kovalyov Artemy
  Cc: dledford@redhat.com, linux-rdma@vger.kernel.org, linux-mm@kvack.org, leon@leon.ro, Sagi Grimberg

[apologies: sending again because linux-mm address was wrong]

On 11/02/2016 21:18, Jason Gunthorpe wrote:
> Resubmit those parts under the mm subsystem, or another more
> appropriate place.

We want the feedback from linux-mm, and they are now Cced.

> If you want to make some incremental progress then implement the
> existing ZONE_DEVICE API for the IB core and add the invalidate stuff
> later, once you've negotiated a common API for that with linux-mm.

So there are a couple of issues we currently have with ZONE_DEVICE. Perhaps they can be solved and then we could use it directly.

First, I'm not sure it is intended to be used for our purpose. memremap() has this comment [1]:
> memremap() is "ioremap" for cases where it is known that the resource
> being mapped does not have i/o side effects and the __iomem
> annotation is not applicable.

Does this apply also to devm_memremap_pages()? Because the HCA BAR clearly doesn't fall under this definition.

Second, there's a requirement that ZONE_DEVICE ranges are aligned to section-boundary, right? We have devices that have 8MB or 32MB BARs, so they won't work with 128MB sections on x86_64.

Third, I understand there was a desire to place ZONE_DEVICE page structs in the device itself. This can work for pmem, but obviously won't work for an I/O device BAR like an HCA.

Regards,
Haggai

[1] http://lxr.free-electrons.com/source/kernel/memremap.c?v=4.4#L38

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
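The section-alignment concern in the message above comes down to simple arithmetic. What follows is only a minimal sketch, assuming the 128MB x86_64 section size quoted in the message and assuming that both ends of a range must sit on a section boundary; the function name and the BAR base address are hypothetical and chosen purely for illustration.

```python
MIB = 1 << 20
SECTION_SIZE = 128 * MIB  # x86_64 section size quoted in the message above


def is_section_aligned(start, size, section_size=SECTION_SIZE):
    """Return True when both ends of [start, start + size) sit on section boundaries."""
    return start % section_size == 0 and (start + size) % section_size == 0


# Hypothetical BAR base address (made up for illustration); it is a multiple of 128 MiB.
bar_base = 0xF000_0000

for bar_size in (8 * MIB, 32 * MIB, 128 * MIB):
    aligned = is_section_aligned(bar_base, bar_size)
    print(f"{bar_size // MIB:>4} MiB BAR section-aligned: {aligned}")
```

Even with a well-placed base address, the 8MB and 32MB BARs end in the middle of a section, which is the situation the message describes.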
* Re: [RFC 0/7] Peer-direct memory
  2016-02-14 14:27   ` [RFC 0/7] Peer-direct memory Haggai Eran
@ 2016-02-16 18:22     ` Jason Gunthorpe
  2016-02-17  4:03       ` davide rossetti
  0 siblings, 1 reply; 13+ messages in thread
From: Jason Gunthorpe @ 2016-02-16 18:22 UTC
  To: Haggai Eran
  Cc: Kovalyov Artemy, dledford@redhat.com, linux-rdma@vger.kernel.org, linux-mm@kvack.org, leon@leon.ro, Sagi Grimberg

On Sun, Feb 14, 2016 at 04:27:20PM +0200, Haggai Eran wrote:
> [apologies: sending again because linux-mm address was wrong]
>
> On 11/02/2016 21:18, Jason Gunthorpe wrote:
> > Resubmit those parts under the mm subsystem, or another more
> > appropriate place.
>
> We want the feedback from linux-mm, and they are now Cced.

Resubmit to mm means put this stuff someplace outside drivers/infiniband in the tree and don't try and inappropriately send memory management stuff through Doug's tree.

Jason
* Re: [RFC 0/7] Peer-direct memory
  2016-02-16 18:22     ` Jason Gunthorpe
@ 2016-02-17  4:03       ` davide rossetti
  2016-02-17  4:13         ` davide rossetti
  0 siblings, 1 reply; 13+ messages in thread
From: davide rossetti @ 2016-02-17 4:03 UTC
  To: Jason Gunthorpe
  Cc: Haggai Eran, Kovalyov Artemy, dledford@redhat.com, linux-rdma@vger.kernel.org, linux-mm@kvack.org, leon@leon.ro, Sagi Grimberg

[-- Attachment #1: Type: text/plain, Size: 1707 bytes --]

On Tue, Feb 16, 2016 at 10:22 AM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:

> On Sun, Feb 14, 2016 at 04:27:20PM +0200, Haggai Eran wrote:
> > [apologies: sending again because linux-mm address was wrong]
> >
> > On 11/02/2016 21:18, Jason Gunthorpe wrote:
> > > Resubmit those parts under the mm subsystem, or another more
> > > appropriate place.
> >
> > We want the feedback from linux-mm, and they are now Cced.
>
> Resubmit to mm means put this stuff someplace outside
> drivers/infiniband in the tree and don't try and inappropriately send
> memory management stuff through Doug's tree.
>

Jason, I beg to differ.

1) I see mm as appropriate for real memory, i.e. something that user-space apps can pass around. This is not totally true for BAR memory, for instance as long as CPU initiated atomic ops are not supported on BAR space of PCIe devices. OTOH, CPU reading from BAR is awful (BW being abysmal, ~10MB/s), while high BW writing requires use of vector instructions (at least on x86_64).

2) Instead, I see appropriate that two sophisticated devices, like an IB NIC and a storage/accelerator device, can freely target each other for I/O, i.e. exchanging peer-to-peer PCIe transactions. And as long as the existing sophisticated initiators are confined to the RDMA subsystem, that is where this support belongs to.

On a different note, this reminds me that the current patch set may be missing a way to disable the use of platform PCIe atomics when the target is the BAR of a peer device.

--
sincerely,
d.

email: davide DOT rossetti AT gmail DOT com
work: drossetti AT nvidia DOT com
facebook: http://www.facebook.com/dado.rossetti
twitter: @dado_rossetti
skype: d.rossetti

[-- Attachment #2: Type: text/html, Size: 2617 bytes --]
* Re: [RFC 0/7] Peer-direct memory
  2016-02-17  4:03       ` davide rossetti
@ 2016-02-17  4:13         ` davide rossetti
  2016-02-17  4:44           ` Jason Gunthorpe
  2016-02-17  8:44           ` Christoph Hellwig
  0 siblings, 2 replies; 13+ messages in thread
From: davide rossetti @ 2016-02-17 4:13 UTC
  To: Jason Gunthorpe
  Cc: Haggai Eran, Kovalyov Artemy, dledford@redhat.com, linux-rdma@vger.kernel.org, linux-mm@kvack.org, leon@leon.ro, Sagi Grimberg

resending, sorry

On Tue, Feb 16, 2016 at 10:22 AM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
>
> On Sun, Feb 14, 2016 at 04:27:20PM +0200, Haggai Eran wrote:
> > [apologies: sending again because linux-mm address was wrong]
> >
> > On 11/02/2016 21:18, Jason Gunthorpe wrote:
> > > Resubmit those parts under the mm subsystem, or another more
> > > appropriate place.
> >
> > We want the feedback from linux-mm, and they are now Cced.
>
> Resubmit to mm means put this stuff someplace outside
> drivers/infiniband in the tree and don't try and inappropriately send
> memory management stuff through Doug's tree.
>

Jason, I beg to differ.

1) I see mm as appropriate for real memory, i.e. something that user-space apps can pass around. This is not totally true for BAR memory, for instance:
a) as long as CPU initiated atomic ops are not supported on BAR space of PCIe devices.
b) OTOH, CPU reading from BAR is awful (BW being abysmal, ~10MB/s), while high BW writing requires use of vector instructions (at least on x86_64).
Bottom line is, BAR mappings are not like plain memory.

2) Instead, I see appropriate that two sophisticated devices, like an IB NIC and a storage/accelerator device, can freely target each other for I/O, i.e. exchanging peer-to-peer PCIe transactions. And as long as the existing sophisticated initiators are confined to the RDMA subsystem, that is where this support belongs to.

On a different note, this reminds me that the current patch set may be missing a way to disable the use of platform PCIe atomics when the target is the BAR of a peer device.
* Re: [RFC 0/7] Peer-direct memory
  2016-02-17  4:13         ` davide rossetti
@ 2016-02-17  4:44           ` Jason Gunthorpe
  2016-02-17  8:49             ` Christoph Hellwig
  2016-02-17  8:44           ` Christoph Hellwig
  1 sibling, 1 reply; 13+ messages in thread
From: Jason Gunthorpe @ 2016-02-17 4:44 UTC
  To: davide rossetti
  Cc: Haggai Eran, Kovalyov Artemy, dledford@redhat.com, linux-rdma@vger.kernel.org, linux-mm@kvack.org, leon@leon.ro, Sagi Grimberg

On Tue, Feb 16, 2016 at 08:13:58PM -0800, davide rossetti wrote:

> Bottom line is, BAR mappings are not like plain memory.

As I understand it the actual use of this is in fact when user space manages to map BAR memory into its address space and attempts to do DMA from it. So, I'm not sure I agree at all with this assessment.

ie I gather with NVMe the desire is this could happen through the filesystem with the right open/mmap flags.

So, saying this has nothing to do with core kernel code, or with mm, is a really big leap.

> 2) Instead, I see appropriate that two sophisticated devices, like an
> IB NIC and a storage/accelerator device, can freely target each
> other

There is nothing special about IB, and no 'sophistication' of the DMA'ing device is required.

All other DMA devices should be able to target BAR memory. eg TCP TSO, or storage-to-storage copies from BAR to SCSI immediately come to mind.

> for I/O, i.e. exchanging peer-to-peer PCIe transactions. And as long
> as the existing sophisticated initiators are confined to the RDMA
> subsystem, that is where this support belongs to.

I would not object to this stuff living in the PCI subsystem, but living in rdma and having this narrow focus that it should only work with IB is not good.

> On a different note, this reminds me that the current patch set may be
> missing a way to disable the use of platform PCIe atomics when the
> target is the BAR of a peer device.

There is a general open question with all PCI peer to peer transactions on how to negotiate all the relevant PCI parameters. Supported vendor extensions and supported standardized features seem like just one piece of a larger problem.

Again well outside the scope of IB.

Jason
* Re: [RFC 0/7] Peer-direct memory
  2016-02-17  4:44           ` Jason Gunthorpe
@ 2016-02-17  8:49             ` Christoph Hellwig
  2016-02-18 17:12               ` Jason Gunthorpe
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2016-02-17 8:49 UTC
  To: Jason Gunthorpe
  Cc: davide rossetti, Haggai Eran, Kovalyov Artemy, dledford@redhat.com, linux-rdma@vger.kernel.org, linux-mm@kvack.org, leon@leon.ro, Sagi Grimberg

On Tue, Feb 16, 2016 at 09:44:17PM -0700, Jason Gunthorpe wrote:
> On Tue, Feb 16, 2016 at 08:13:58PM -0800, davide rossetti wrote:
>
> > Bottom line is, BAR mappings are not like plain memory.
>
> As I understand it the actual use of this is in fact when user space
> manages to map BAR memory into its address space and attempts to do DMA
> from it. So, I'm not sure I agree at all with this assessment.
>
> ie I gather with NVMe the desire is this could happen through the
> filesystem with the right open/mmap flags.

Lots of confusion here. NVMe is a block device interface - there is no real point in mapping anything in there to userspace unless you use an entirely userspace driver through the normal userspace PCI driver interface. For pmem (which some people confusingly call NVM) mapping the byte addressable persistent memory to userspace using DAX makes a lot of sense, and a lot of work around that is going on currently.

For NVMe 1.2 there is a new feature called the controller memory buffer, which basically is a giant BAR that can be used instead of host memory for the submission and completion queues of the device, as well as for actual data sent to and received from the device. Some people are talking about using this as the target of RDMA operations, but I don't think this patch series would be anywhere near useful for this mode of operation.
* Re: [RFC 0/7] Peer-direct memory
  2016-02-17  8:49             ` Christoph Hellwig
@ 2016-02-18 17:12               ` Jason Gunthorpe
  0 siblings, 0 replies; 13+ messages in thread
From: Jason Gunthorpe @ 2016-02-18 17:12 UTC
  To: Christoph Hellwig
  Cc: davide rossetti, Haggai Eran, Kovalyov Artemy, dledford@redhat.com, linux-rdma@vger.kernel.org, linux-mm@kvack.org, leon@leon.ro, Sagi Grimberg

On Wed, Feb 17, 2016 at 12:49:59AM -0800, Christoph Hellwig wrote:

> PCI driver interface. For pmem (which some people confusingly call
> NVM) mapping the byte addressable persistent memory to userspace using
> DAX makes a lot of sense, and a lot of work around that is going
> on currently.

Right, this is what I was referring to, 'pmem' like capability done with NVMe hardware on PCIe.

Jason
* Re: [RFC 0/7] Peer-direct memory
  2016-02-17  4:13         ` davide rossetti
  2016-02-17  4:44           ` Jason Gunthorpe
@ 2016-02-17  8:44           ` Christoph Hellwig
  2016-02-17 15:25             ` Haggai Eran
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2016-02-17 8:44 UTC
  To: davide rossetti
  Cc: Jason Gunthorpe, Haggai Eran, Kovalyov Artemy, dledford@redhat.com, linux-rdma@vger.kernel.org, linux-mm@kvack.org, leon@leon.ro, Sagi Grimberg

[disclaimer: I've been involved with ZONE_DEVICE support and the pmem driver and wrote parts of the code and discussed a lot of the tradeoffs on how we handle I/O to memory in BARs]

On Tue, Feb 16, 2016 at 08:13:58PM -0800, davide rossetti wrote:
> 1) I see mm as appropriate for real memory, i.e. something that
> user-space apps can pass around.

mm is memory management, and this clearly falls under the umbrella, so it absolutely needs to be under mm/ and reviewed by the linux-mm crowd.

> This is not totally true for BAR
> memory, for instance:
> a) as long as CPU initiated atomic ops are not supported on BAR space
> of PCIe devices.
> b) OTOH, CPU reading from BAR is awful (BW being abysmal, ~10MB/s),
> while high BW writing requires use of vector instructions (at least on
> x86_64).
> Bottom line is, BAR mappings are not like plain memory.

That doesn't change how they are managed. We've always supported mapping BARs to userspace in various drivers, and the only real news with things like the pmem driver with DAX or some of the things people want to do with the NVMe controller memory buffer is that there are much bigger quantities of it, and:

 a) people want to be able to have cachable mappings of various kinds instead of the old uncachable default.
 b) we want to be able to DMA (including RDMA) to the regions in the BARs.

a) is something that needs smaller amounts in all kinds of areas to be done properly, but in principle GPU drivers have been doing this forever using all kinds of hacks.

b) is the real issue. The Linux DMA support code doesn't really operate on just physical addresses, but on page structures, and we don't allocate for BARs. We investigated two ways to address this: 1) allow DMA operations without struct page and 2) create struct page structures for BARs that we want to be able to use DMA operations on. For various reasons version 2) was favored and this is how we ended up with ZONE_DEVICE. Read the linux-mm and linux-nvdimm lists for the lengthy discussions how we ended up here.

Additional issues like which instructions to use for access build on top of these basic building blocks.

> 2) Instead, I see appropriate that two sophisticated devices, like an
> IB NIC and a storage/accelerator device, can freely target each other
> for I/O, i.e. exchanging peer-to-peer PCIe transactions. And as long
> as the existing sophisticated initiators are confined to the RDMA
> subsystem, that is where this support belongs to.

It doesn't. There is absolutely nothing RDMA specific here - please work with the overall community to do the right thing here.
* Re: [RFC 0/7] Peer-direct memory
  2016-02-17  8:44           ` Christoph Hellwig
@ 2016-02-17 15:25             ` Haggai Eran
  2016-02-19 18:54               ` Dan Williams
  0 siblings, 1 reply; 13+ messages in thread
From: Haggai Eran @ 2016-02-17 15:25 UTC
  To: Christoph Hellwig, davide rossetti
  Cc: Jason Gunthorpe, Kovalyov Artemy, dledford@redhat.com, linux-rdma@vger.kernel.org, linux-mm@kvack.org, Leon Romanovsky, Sagi Grimberg

On 17/02/2016 10:44, Christoph Hellwig wrote:
> That doesn't change how they are managed. We've always supported mapping
> BARs to userspace in various drivers, and the only real news with things
> like the pmem driver with DAX or some of the things people want to do
> with the NVMe controller memory buffer is that there are much bigger
> quantities of it, and:
>
> a) people want to be able to have cachable mappings of various kinds
> instead of the old uncachable default.

What if we do want an uncachable mapping for our device's BAR? Can we still expose it under ZONE_DEVICE?

> b) we want to be able to DMA (including RDMA) to the regions in the
> BARs.
>
> a) is something that needs smaller amounts in all kinds of areas to be
> done properly, but in principle GPU drivers have been doing this forever
> using all kinds of hacks.
>
> b) is the real issue. The Linux DMA support code doesn't really operate
> on just physical addresses, but on page structures, and we don't
> allocate for BARs. We investigated two ways to address this: 1) allow
> DMA operations without struct page and 2) create struct page structures
> for BARs that we want to be able to use DMA operations on. For various
> reasons version 2) was favored and this is how we ended up with
> ZONE_DEVICE. Read the linux-mm and linux-nvdimm lists for the lengthy
> discussions how we ended up here.

I was wondering what are your thoughts regarding the other questions we raised about ZONE_DEVICE.

How can we overcome the section-alignment requirement in the current code? Our HCA's BARs are usually smaller than 128MB.

Sagi also asked how should a peer device who got a ZONE_DEVICE page know it should stop using it (the CMB example).

Regards,
Haggai
* Re: [RFC 0/7] Peer-direct memory
  2016-02-17 15:25             ` Haggai Eran
@ 2016-02-19 18:54               ` Dan Williams
  0 siblings, 0 replies; 13+ messages in thread
From: Dan Williams @ 2016-02-19 18:54 UTC
  To: Haggai Eran
  Cc: Christoph Hellwig, davide rossetti, Jason Gunthorpe, Kovalyov Artemy, dledford@redhat.com, linux-rdma@vger.kernel.org, linux-mm@kvack.org, Leon Romanovsky, Sagi Grimberg

On Wed, Feb 17, 2016 at 7:25 AM, Haggai Eran <haggaie@mellanox.com> wrote:
> On 17/02/2016 10:44, Christoph Hellwig wrote:
>> That doesn't change how they are managed. We've always supported mapping
>> BARs to userspace in various drivers, and the only real news with things
>> like the pmem driver with DAX or some of the things people want to do
>> with the NVMe controller memory buffer is that there are much bigger
>> quantities of it, and:
>>
>> a) people want to be able to have cachable mappings of various kinds
>> instead of the old uncachable default.
> What if we do want an uncachable mapping for our device's BAR? Can we still
> expose it under ZONE_DEVICE?
>
>> b) we want to be able to DMA (including RDMA) to the regions in the
>> BARs.
>>
>> a) is something that needs smaller amounts in all kinds of areas to be
>> done properly, but in principle GPU drivers have been doing this forever
>> using all kinds of hacks.
>>
>> b) is the real issue. The Linux DMA support code doesn't really operate
>> on just physical addresses, but on page structures, and we don't
>> allocate for BARs. We investigated two ways to address this: 1) allow
>> DMA operations without struct page and 2) create struct page structures
>> for BARs that we want to be able to use DMA operations on. For various
>> reasons version 2) was favored and this is how we ended up with
>> ZONE_DEVICE. Read the linux-mm and linux-nvdimm lists for the lengthy
>> discussions how we ended up here.
>
> I was wondering what are your thoughts regarding the other questions we raised
> about ZONE_DEVICE.
>
> How can we overcome the section-alignment requirement in the current code? Our
> HCA's BARs are usually smaller than 128MB.

This may not help, but note that the section-alignment only bites when trying to have 2 mappings with different lifetimes in a single section. It's otherwise fine to map a full section for a smaller single range, you'll just end up with pages that won't be used. However, this assumes that you are fine with everything in that section being mapped cacheable, you couldn't mix uncacheable mappings in that same range.

> Sagi also asked how should a peer device who got a ZONE_DEVICE page know it
> should stop using it (the CMB example).

ZONE_DEVICE pages come with a per-cpu reference counter via page->pgmap. See get_dev_pagemap(), get_zone_device_page(), and put_zone_device_page().

However this gets confusing quickly when a 'pfn' and a 'page' start referencing mmio space instead of host memory. It seems like we need new data types because a dma_addr_t does not necessarily reflect the peer-to-peer address as seen by the device.
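On the section-alignment point in the reply above (mapping a full section for a smaller single range leaves pages that won't be used), the cost can be estimated with a little arithmetic. This is only a rough sketch: the 128MB section size comes from the thread, while the 4 KiB page size, the function names, and the BAR base address are assumptions made for illustration.

```python
PAGE_SIZE = 4096           # assumption: 4 KiB base pages
MIB = 1 << 20
SECTION_SIZE = 128 * MIB   # x86_64 section size mentioned in the thread


def section_span(start, size, section_size=SECTION_SIZE):
    """Round [start, start + size) outward to whole sections."""
    span_start = (start // section_size) * section_size
    span_end = -(-(start + size) // section_size) * section_size  # ceiling division
    return span_start, span_end


def unused_pages(start, size, section_size=SECTION_SIZE, page_size=PAGE_SIZE):
    """Count pages covered by the padded mapping that the BAR itself does not back."""
    span_start, span_end = section_span(start, size, section_size)
    return (span_end - span_start - size) // page_size


# Hypothetical 32 MiB BAR at a made-up, section-aligned base address.
start, size = 0xF000_0000, 32 * MIB
span_start, span_end = section_span(start, size)
print(f"padded span: {span_start:#x}-{span_end:#x}, unused pages: {unused_pages(start, size)}")
```

Under these assumptions a 32MB BAR padded out to one full section leaves 96MB worth of pages that are mapped but never used.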
[parent not found: <20160212201328.GA14122@infradead.org>]
[parent not found: <20160212203649.GA10540@obsidianresearch.com>]
[parent not found: <56C09C7E.4060808@dev.mellanox.co.il>]
[parent not found: <36F6EBABA23FEF4391AF72944D228901EB70C102@BBYEXM01.pmc-sierra.internal>]
* Re: [RFC 0/7] Peer-direct memory
       [not found]         ` <36F6EBABA23FEF4391AF72944D228901EB70C102@BBYEXM01.pmc-sierra.internal>
@ 2016-02-21  9:06           ` Haggai Eran
  2016-02-24 23:45             ` Stephen Bates
  0 siblings, 1 reply; 13+ messages in thread
From: Haggai Eran @ 2016-02-21 9:06 UTC
  To: Stephen Bates, Sagi Grimberg, Jason Gunthorpe, Christoph Hellwig, 'Logan Gunthorpe' (logang@deltatee.com)
  Cc: Artemy Kovalyov, dledford@redhat.com, linux-rdma@vger.kernel.org, linux-mm@kvack.org, Leon Romanovsky, sagig@mellanox.com

On 18/02/2016 16:44, Stephen Bates wrote:
> Sagi
>
>> CC'ing sbates who played with this stuff at some point...
>
> Thanks for inviting me to this party Sagi ;-). Here are some comments and responses based on our experiences. Apologies in advance for the list format:
>
> 1. As it stands in 4.5-rc4 devm_memremap_pages will not work with iomem. Myself and (mostly) Logan (cc'ed here) developed the ability to do that in an out of tree patch for memremap.c. We also developed a simple example driver for a PCIe device that exposes DRAM on the card via a BAR. We used this code to provide some feedback to Dan (e.g. [1]-[3]). At this time we are preparing an RFC to extend devm_memremap_pages for IO memory and we hope to have that ready soon but there is no guarantee our approach is acceptable to the community. My hope is that it will be a good starting point for moving forward...

I'd be happy to see your RFC when you are ready. I see in the thread of [3] that you are using write-combining. Do you think your patchset will also be suitable for uncachable memory?

> 2. The two good things about Peer-Direct are that it works and it is here today. That said, I do think an approach based on ZONE_DEVICE is more general and a preferred way to allow IO devices to communicate with each other. The question is can we find such an approach that is acceptable to the community? As noted in point 1 I hope the coming RFC will initiate a discussion. I have also requested attendance at LSF/MM to discuss this topic (among others).
>
> 3. As of now the section alignment requirement is somewhat relaxed. I quote from [4].
>
> "I could loosen the restriction a bit to allow one unaligned mapping
> per section. However, if another mapping request came along that
> tried to map a free part of the section it would fail because the code
> depends on a "1 dev_pagemap per section" relationship. Seems an ok
> compromise to me..."
>
> This is implemented in 4.5-rc4 (see memremap.c line 315).

I don't think that's enough for our purposes. We have devices with rather small BARs (32MB) and multiple PFs that all need to expose their BAR to peer to peer access. One can expect these PFs will be assigned adjacent addresses and they will break the "one dev_pagemap per section" rule.

> 4. The out of tree patch we did allows one to register the device memory as IO memory. However, we were only concerned with DRAM exposed on the BAR and so were not affected by the "i/o side effects" issues. Someone would need to think about how this applies to IOMEM that does have side-effects when accessed.

With this RFC, we map parts of the HCA BAR that were mmapped to a process (both uncacheable and write-combining) and map them to a peer device (another HCA). As long as the kernel doesn't do anything else with these pages, and leaves them to be controlled by the user-space application and/or the peer device, I don't see a problem with mapping IO memory with side effects. However, I'm not an expert here, and I'd be happy to hear what others think about this.

> 5. I concur with Sagi's comment below that one approach we can use to inform 3rd party device drivers about vanishing memory regions is via mmu_notifiers. However this needs to be fleshed out and tied into the relevant driver(s).
>
> 6. In full disclosure, my main interest in this ties in to NVM Express devices which can act as DMA masters and expose regions of IOMEM at the same time (via CMBs). I want to be able to tie these devices together with other IO devices (like RDMA NICs, FPGA and GPGPU based offload engines, other NVMe devices and storage adaptors) in a peer-2-peer fashion and may not always have a RDMA device in the mix...

I understand.

Regards,
Haggai
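The objection above, that several PFs with small BARs at adjacent addresses would break the "one dev_pagemap per section" rule, can be illustrated by checking which sections each BAR touches. This is only an illustrative sketch: the PF names, base addresses, and helper functions are hypothetical, and the 128MB section size is the x86_64 figure mentioned earlier in the thread.

```python
from collections import defaultdict

MIB = 1 << 20
SECTION_SIZE = 128 * MIB  # x86_64 section size mentioned earlier in the thread


def sections_touched(start, size, section_size=SECTION_SIZE):
    """Indices of every section overlapped by [start, start + size)."""
    first = start // section_size
    last = (start + size - 1) // section_size
    return range(first, last + 1)


def shared_sections(bars, section_size=SECTION_SIZE):
    """Map section index -> BAR names, keeping only sections claimed by more than one BAR."""
    owners = defaultdict(list)
    for name, (start, size) in bars.items():
        for index in sections_touched(start, size, section_size):
            owners[index].append(name)
    return {index: names for index, names in owners.items() if len(names) > 1}


# Hypothetical layout: two PFs with 32 MiB BARs placed back to back.
bars = {
    "pf0": (0xF000_0000, 32 * MIB),
    "pf1": (0xF200_0000, 32 * MIB),
}
print(shared_sections(bars))
```

With this made-up layout both BARs fall inside the same 128MB section, so a second mapping request for that section would be the case the quoted text says fails.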
* RE: [RFC 0/7] Peer-direct memory
  2016-02-21  9:06           ` Haggai Eran
@ 2016-02-24 23:45             ` Stephen Bates
  2016-02-25 11:27               ` Haggai Eran
  0 siblings, 1 reply; 13+ messages in thread
From: Stephen Bates @ 2016-02-24 23:45 UTC
  To: Haggai Eran, Sagi Grimberg, Jason Gunthorpe, Christoph Hellwig, 'Logan Gunthorpe' (logang@deltatee.com)
  Cc: Artemy Kovalyov, dledford@redhat.com, linux-rdma@vger.kernel.org, linux-mm@kvack.org, Leon Romanovsky, sagig@mellanox.com

Haggai

> I'd be happy to see your RFC when you are ready. I see in the thread of [3]
> that you are using write-combining. Do you think your patchset will also be
> suitable for uncachable memory?

Great, we hope to have the RFC soon. It will be able to accept different flags for the devm_memremap() call with regards to caching. Though one question I have is when does the caching flag affect Peer-2-Peer memory accesses? I can see caching causing issues when performing accesses from the CPU but P2P accesses should bypass any caches in the system?

> I don't think that's enough for our purposes. We have devices with rather
> small BARs (32MB) and multiple PFs that all need to expose their BAR to peer
> to peer access. One can expect these PFs will be assigned adjacent addresses
> and they will break the "one dev_pagemap per section" rule.

On the cards and systems I have checked even small BARs tend to be separated by more than one section's worth of memory. As I understand it the allocation of BAR addresses is very ARCH and BIOS specific. Let's discuss this once the RFC comes out and see what options exist to address your concerns.

> > 4. The out of tree patch we did allows one to register the device memory as
> > IO memory. However, we were only concerned with DRAM exposed on the
> > BAR and so were not affected by the "i/o side effects" issues. Someone
> > would need to think about how this applies to IOMEM that does have
> > side-effects when accessed.
>
> With this RFC, we map parts of the HCA BAR that were mmapped to a
> process (both uncacheable and write-combining) and map them to a peer
> device (another HCA). As long as the kernel doesn't do anything else with
> these pages, and leaves them to be controlled by the user-space application
> and/or the peer device, I don't see a problem with mapping IO memory with
> side effects. However, I'm not an expert here, and I'd be happy to hear what
> others think about this.

See above. I think the upcoming RFC should provide support for both cached and uncached mappings. I concur that even if the mappings are flagged as cachable there should be no issues as long as all accesses are from the peer-direct device.
* Re: [RFC 0/7] Peer-direct memory
  2016-02-24 23:45             ` Stephen Bates
@ 2016-02-25 11:27               ` Haggai Eran
  0 siblings, 0 replies; 13+ messages in thread
From: Haggai Eran @ 2016-02-25 11:27 UTC
  To: Stephen Bates, Sagi Grimberg, Jason Gunthorpe, Christoph Hellwig, 'Logan Gunthorpe' (logang@deltatee.com)
  Cc: Artemy Kovalyov, dledford@redhat.com, linux-rdma@vger.kernel.org, linux-mm@kvack.org, Leon Romanovsky, sagig@mellanox.com

On 25/02/2016 01:45, Stephen Bates wrote:
> Great, we hope to have the RFC soon. It will be able to accept different flags for devm_memremap() call with regards to caching. Though one question I have is when does the caching flag affect Peer-2-Peer memory accesses? I can see caching causing issues when performing accesses from the CPU but P2P accesses should bypass any caches in the system?

I don't think the caching flag will affect peer to peer directly, but we need to keep the BAR mapped to the host the same way it is today. If we change the driver to map page structs returned from devm_memremap_pages() instead of using io_remap_pfn_range() it needs to continue working with host uses and not only with peers.

Regards,
Haggai
end of thread, other threads:[~2016-02-25 11:27 UTC | newest]
Thread overview: 13+ messages
     [not found] <1455207177-11949-1-git-send-email-artemyko@mellanox.com>
     [not found] ` <20160211191838.GA23675@obsidianresearch.com>
2016-02-14 14:27   ` [RFC 0/7] Peer-direct memory Haggai Eran
2016-02-16 18:22     ` Jason Gunthorpe
2016-02-17  4:03       ` davide rossetti
2016-02-17  4:13         ` davide rossetti
2016-02-17  4:44           ` Jason Gunthorpe
2016-02-17  8:49             ` Christoph Hellwig
2016-02-18 17:12               ` Jason Gunthorpe
2016-02-17  8:44           ` Christoph Hellwig
2016-02-17 15:25             ` Haggai Eran
2016-02-19 18:54               ` Dan Williams
     [not found]   ` <20160212201328.GA14122@infradead.org>
     [not found]     ` <20160212203649.GA10540@obsidianresearch.com>
     [not found]       ` <56C09C7E.4060808@dev.mellanox.co.il>
     [not found]         ` <36F6EBABA23FEF4391AF72944D228901EB70C102@BBYEXM01.pmc-sierra.internal>
2016-02-21  9:06           ` Haggai Eran
2016-02-24 23:45             ` Stephen Bates
2016-02-25 11:27               ` Haggai Eran