linux-mm.kvack.org archive mirror
* Re: [RFC 0/7] Peer-direct memory
       [not found] ` <20160211191838.GA23675@obsidianresearch.com>
@ 2016-02-14 14:27   ` Haggai Eran
  2016-02-16 18:22     ` Jason Gunthorpe
       [not found]   ` <20160212201328.GA14122@infradead.org>
  1 sibling, 1 reply; 13+ messages in thread
From: Haggai Eran @ 2016-02-14 14:27 UTC (permalink / raw)
  To: Jason Gunthorpe, Kovalyov Artemy
  Cc: dledford@redhat.com, linux-rdma@vger.kernel.org,
	linux-mm@kvack.org, leon@leon.ro, Sagi Grimberg

[apologies: sending again because linux-mm address was wrong]

On 11/02/2016 21:18, Jason Gunthorpe wrote:
> Resubmit those parts under the mm subsystem, or another more
> appropriate place.

We want the feedback from linux-mm, and they are now Cced.

> If you want to make some incremental progress then implement the
> existing ZONE_DEVICE API for the IB core and add the invalidate stuff
> later, once you've negotiated a common API for that with linux-mm.

So there are a couple of issues we currently have with ZONE_DEVICE.
Perhaps they can be solved, and then we could use it directly.

First, I'm not sure it is intended to be used for our purpose. 
memremap() has this comment [1]:
> memremap() is "ioremap" for cases where it is known that the resource
> being mapped does not have i/o side effects and the __iomem
> annotation is not applicable. 

Does this also apply to devm_memremap_pages()? The HCA BAR clearly
doesn't fall under this definition.
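
For concreteness, this is roughly the call we would want to make (a sketch
only: the BAR index, the lack of error handling, and the 4.5-era
four-argument signature are assumptions on my side):

#include <linux/pci.h>
#include <linux/memremap.h>

/*
 * Sketch only: give part of an HCA BAR struct pages via ZONE_DEVICE.
 * Whether this is a legitimate use of devm_memremap_pages() for MMIO
 * with side effects is exactly the question above.
 */
static void *hca_remap_bar_pages(struct pci_dev *pdev, struct percpu_ref *ref)
{
	/* BAR 2 is chosen arbitrarily for the example. */
	struct resource *res = &pdev->resource[2];

	return devm_memremap_pages(&pdev->dev, res, ref, NULL);
}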

Second, there's a requirement that ZONE_DEVICE ranges are aligned to a
section boundary, right? We have devices with 8MB or 32MB BARs, so they
won't work with 128MB sections on x86_64.

Third, I understand there was a desire to place ZONE_DEVICE page structs 
in the device itself. This can work for pmem, but obviously won't work 
for an I/O device BAR like an HCA.

Regards,
Haggai

[1] http://lxr.free-electrons.com/source/kernel/memremap.c?v=4.4#L38


* Re: [RFC 0/7] Peer-direct memory
  2016-02-14 14:27   ` [RFC 0/7] Peer-direct memory Haggai Eran
@ 2016-02-16 18:22     ` Jason Gunthorpe
  2016-02-17  4:03       ` davide rossetti
  0 siblings, 1 reply; 13+ messages in thread
From: Jason Gunthorpe @ 2016-02-16 18:22 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Kovalyov Artemy, dledford@redhat.com, linux-rdma@vger.kernel.org,
	linux-mm@kvack.org, leon@leon.ro, Sagi Grimberg

On Sun, Feb 14, 2016 at 04:27:20PM +0200, Haggai Eran wrote:
> [apologies: sending again because linux-mm address was wrong]
> 
> On 11/02/2016 21:18, Jason Gunthorpe wrote:
> > Resubmit those parts under the mm subsystem, or another more
> > appropriate place.
> 
> We want the feedback from linux-mm, and they are now Cced.

Resubmit to mm means put this stuff someplace outside
drivers/infiniband in the tree and don't try and inappropriately send
memory management stuff through Doug's tree.

Jason


* Re: [RFC 0/7] Peer-direct memory
  2016-02-16 18:22     ` Jason Gunthorpe
@ 2016-02-17  4:03       ` davide rossetti
  2016-02-17  4:13         ` davide rossetti
  0 siblings, 1 reply; 13+ messages in thread
From: davide rossetti @ 2016-02-17  4:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Haggai Eran, Kovalyov Artemy, dledford@redhat.com,
	linux-rdma@vger.kernel.org, linux-mm@kvack.org, leon@leon.ro,
	Sagi Grimberg

On Tue, Feb 16, 2016 at 10:22 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:

> On Sun, Feb 14, 2016 at 04:27:20PM +0200, Haggai Eran wrote:
> > [apologies: sending again because linux-mm address was wrong]
> >
> > On 11/02/2016 21:18, Jason Gunthorpe wrote:
> > > Resubmit those parts under the mm subsystem, or another more
> > > appropriate place.
> >
> > We want the feedback from linux-mm, and they are now Cced.
>
> Resubmit to mm means put this stuff someplace outside
> drivers/infiniband in the tree and don't try and inappropriately send
> memory management stuff through Doug's tree.
>
>
Jason,
I beg to differ.

1) I see mm as appropriate for real memory, i.e. something that user-space
apps can pass around.
This is not totally true for BAR memory, for instance as long as
CPU-initiated atomic ops are not supported on BAR space of PCIe devices.
OTOH, CPU reads from a BAR are awful (BW is abysmal, ~10 MB/s), while
high-BW writes require the use of vector instructions (at least on x86_64).

2) Instead, I see it as appropriate that two sophisticated devices, like an
IB NIC and a storage/accelerator device, can freely target each other for
I/O, i.e. exchange peer-to-peer PCIe transactions. And as long as the
existing sophisticated initiators are confined to the RDMA subsystem, that
is where this support belongs.

On a different note, this reminds me that the current patch set may be
missing a way to disable the use of platform PCIe atomics when the target
is the BAR of a peer device.

-- 
sincerely,
d.

email: davide DOT rossetti AT gmail DOT com
work: drossetti AT nvidia DOT com
facebook: http://www.facebook.com/dado.rossetti
twitter: @dado_rossetti
skype: d.rossetti


* Re: [RFC 0/7] Peer-direct memory
  2016-02-17  4:03       ` davide rossetti
@ 2016-02-17  4:13         ` davide rossetti
  2016-02-17  4:44           ` Jason Gunthorpe
  2016-02-17  8:44           ` Christoph Hellwig
  0 siblings, 2 replies; 13+ messages in thread
From: davide rossetti @ 2016-02-17  4:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Haggai Eran, Kovalyov Artemy, dledford@redhat.com,
	linux-rdma@vger.kernel.org, linux-mm@kvack.org, leon@leon.ro,
	Sagi Grimberg

resending, sorry

On Tue, Feb 16, 2016 at 10:22 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
>
> On Sun, Feb 14, 2016 at 04:27:20PM +0200, Haggai Eran wrote:
> > [apologies: sending again because linux-mm address was wrong]
> >
> > On 11/02/2016 21:18, Jason Gunthorpe wrote:
> > > Resubmit those parts under the mm subsystem, or another more
> > > appropriate place.
> >
> > We want the feedback from linux-mm, and they are now Cced.
>
> Resubmit to mm means put this stuff someplace outside
> drivers/infiniband in the tree and don't try and inappropriately send
> memory management stuff through Doug's tree.
>

Jason,
I beg to differ.

1) I see mm as appropriate for real memory, i.e. something that
user-space apps can pass around. This is not totally true for BAR
memory, for instance:
 a) as long as CPU-initiated atomic ops are not supported on BAR space
of PCIe devices.
 b) OTOH, CPU reads from a BAR are awful (BW is abysmal, ~10 MB/s),
while high-BW writes require the use of vector instructions (at least
on x86_64).
Bottom line is, BAR mappings are not like plain memory.

2) Instead, I see it as appropriate that two sophisticated devices, like
an IB NIC and a storage/accelerator device, can freely target each other
for I/O, i.e. exchange peer-to-peer PCIe transactions. And as long as
the existing sophisticated initiators are confined to the RDMA
subsystem, that is where this support belongs.

On a different note, this reminds me that the current patch set may be
missing a way to disable the use of platform PCIe atomics when the
target is the BAR of a peer device.


* Re: [RFC 0/7] Peer-direct memory
  2016-02-17  4:13         ` davide rossetti
@ 2016-02-17  4:44           ` Jason Gunthorpe
  2016-02-17  8:49             ` Christoph Hellwig
  2016-02-17  8:44           ` Christoph Hellwig
  1 sibling, 1 reply; 13+ messages in thread
From: Jason Gunthorpe @ 2016-02-17  4:44 UTC (permalink / raw)
  To: davide rossetti
  Cc: Haggai Eran, Kovalyov Artemy, dledford@redhat.com,
	linux-rdma@vger.kernel.org, linux-mm@kvack.org, leon@leon.ro,
	Sagi Grimberg

On Tue, Feb 16, 2016 at 08:13:58PM -0800, davide rossetti wrote:

> Bottom line is, BAR mappings are not like plain memory.

As I understand it, the actual use of this is in fact when user space
manages to map BAR memory into its address space and attempts to do DMA
from it. So I'm not sure I agree at all with this assessment.

i.e. I gather that with NVMe the desire is that this could happen through
the filesystem with the right open/mmap flags.

So, saying this has nothing to do with core kernel code, or with mm,
is a really big leap.

> 2) Instead, I see it as appropriate that two sophisticated devices, like
> an IB NIC and a storage/accelerator device, can freely target each
> other

There is nothing special about IB, and no 'sophistication' of the
DMA'ing device is required.

All other DMA devices should be able to target BAR memory; e.g. TCP TSO
or storage-to-storage copies from BAR to SCSI immediately come to
mind.

> for I/O, i.e. exchange peer-to-peer PCIe transactions. And as long
> as the existing sophisticated initiators are confined to the RDMA
> subsystem, that is where this support belongs.

I would not object to this stuff living in the PCI subsystem, but
living in rdma and having this narrow focus that it should only
work with IB is not good.

> On a different note, this reminds me that the current patch set may be
> missing a way to disable the use of platform PCIe atomics when the
> target is the BAR of a peer device.

There is a general open question with all PCI peer-to-peer
transactions on how to negotiate all the relevant PCI
parameters. Supported vendor extensions and supported standardized
features seem like just one piece of a larger problem. Again, well
outside the scope of IB.

Jason


* Re: [RFC 0/7] Peer-direct memory
  2016-02-17  4:13         ` davide rossetti
  2016-02-17  4:44           ` Jason Gunthorpe
@ 2016-02-17  8:44           ` Christoph Hellwig
  2016-02-17 15:25             ` Haggai Eran
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2016-02-17  8:44 UTC (permalink / raw)
  To: davide rossetti
  Cc: Jason Gunthorpe, Haggai Eran, Kovalyov Artemy,
	dledford@redhat.com, linux-rdma@vger.kernel.org,
	linux-mm@kvack.org, leon@leon.ro, Sagi Grimberg

[disclaimer: I've been involved with ZONE_DEVICE support and the pmem
 driver and wrote parts of the code and discussed a lot of the tradeoffs
 on how we handle I/O to memory in BARs]

On Tue, Feb 16, 2016 at 08:13:58PM -0800, davide rossetti wrote:
> 1) I see mm as appropriate for real memory, i.e. something that
> user-space apps can pass around.

mm is memory management, and this clearly falls under the umbrella,
so it absolutely needs to be under mm/ and reviewed by the linux-mm
crowd.

> This is not totally true for BAR
> memory, for instance:
>  a) as long as CPU-initiated atomic ops are not supported on BAR space
> of PCIe devices.
>  b) OTOH, CPU reads from a BAR are awful (BW is abysmal, ~10 MB/s),
> while high-BW writes require the use of vector instructions (at least
> on x86_64).
> Bottom line is, BAR mappings are not like plain memory.

That doesn't change how they are managed.  We've always supported mapping
BARs to userspace in various drivers, and the only real news with things
like the pmem driver with DAX, or some of the things people want to do
with the NVMe controller memory buffer, is that there are much bigger
quantities of it, and:

 a) people want to be able to have cacheable mappings of various kinds
    instead of the old uncacheable default.
 b) we want to be able to DMA (including RDMA) to the regions in the
    BARs.

a) is something that needs smaller amounts of work in all kinds of areas
to be done properly, but in principle GPU drivers have been doing this
forever using all kinds of hacks.

b) is the real issue.  The Linux DMA support code doesn't really operate
on just physical addresses, but on page structures, and we don't
allocate them for BARs.  We investigated two ways to address this:  1) allow
DMA operations without struct page and 2) create struct page structures
for BARs that we want to be able to use DMA operations on.  For various
reasons version 2) was favored, and this is how we ended up with
ZONE_DEVICE.  Read the linux-mm and linux-nvdimm lists for the lengthy
discussions on how we ended up here.

Additional issues like which instructions to use for access build on top
of these basic building blocks.
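
To make b) concrete, here is a minimal sketch (the function name and the
single-entry scatterlist are made up for illustration) of why having a
struct page lets the existing DMA-mapping helpers be reused; whether the
resulting bus address is actually routable peer-to-peer is a separate
question:

#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

/* Sketch: map one ZONE_DEVICE page backing a BAR for DMA by a peer. */
static int map_one_bar_page(struct device *dma_dev, struct page *bar_page,
			    unsigned int len, dma_addr_t *dma)
{
	struct scatterlist sg;

	sg_init_table(&sg, 1);
	sg_set_page(&sg, bar_page, len, 0);	/* only possible because a struct page exists */

	if (dma_map_sg(dma_dev, &sg, 1, DMA_BIDIRECTIONAL) != 1)
		return -EIO;

	*dma = sg_dma_address(&sg);
	return 0;
}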

> 2) Instead, I see it as appropriate that two sophisticated devices, like an
> IB NIC and a storage/accelerator device, can freely target each other
> for I/O, i.e. exchange peer-to-peer PCIe transactions. And as long
> as the existing sophisticated initiators are confined to the RDMA
> subsystem, that is where this support belongs.

It doesn't.  There is absolutely nothing RDMA specific here - please
work with the overall community to do the right thing here.


* Re: [RFC 0/7] Peer-direct memory
  2016-02-17  4:44           ` Jason Gunthorpe
@ 2016-02-17  8:49             ` Christoph Hellwig
  2016-02-18 17:12               ` Jason Gunthorpe
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2016-02-17  8:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: davide rossetti, Haggai Eran, Kovalyov Artemy,
	dledford@redhat.com, linux-rdma@vger.kernel.org,
	linux-mm@kvack.org, leon@leon.ro, Sagi Grimberg

On Tue, Feb 16, 2016 at 09:44:17PM -0700, Jason Gunthorpe wrote:
> On Tue, Feb 16, 2016 at 08:13:58PM -0800, davide rossetti wrote:
> 
> > Bottom line is, BAR mappings are not like plain memory.
> 
> As I understand it, the actual use of this is in fact when user space
> manages to map BAR memory into its address space and attempts to do DMA
> from it. So I'm not sure I agree at all with this assessment.
> 
> i.e. I gather that with NVMe the desire is that this could happen through
> the filesystem with the right open/mmap flags.

Lots of confusion here.  NVMe is a block device interface - there
is no real point in mapping anything in there to userspace unless
you use an entirely userspace driver through the normal userspace
PCI driver interface.  For pmem (which some people confusingly call
NVM), mapping the byte-addressable persistent memory to userspace using
DAX makes a lot of sense, and a lot of work around that is going
on currently.

For NVMe 1.2 there is a new feature called the controller memory
buffer, which basically is a giant BAR that can be used instead
of host memory for the submission and completion queues of the
device, as well as for actual data sent to and received from the device.

Some people are talking about using this as the target of RDMA
operations, but I don't think this patch series would be anywhere
near useful for this mode of operation.


* Re: [RFC 0/7] Peer-direct memory
  2016-02-17  8:44           ` Christoph Hellwig
@ 2016-02-17 15:25             ` Haggai Eran
  2016-02-19 18:54               ` Dan Williams
  0 siblings, 1 reply; 13+ messages in thread
From: Haggai Eran @ 2016-02-17 15:25 UTC (permalink / raw)
  To: Christoph Hellwig, davide rossetti
  Cc: Jason Gunthorpe, Kovalyov Artemy, dledford@redhat.com,
	linux-rdma@vger.kernel.org, linux-mm@kvack.org, Leon Romanovsky,
	Sagi Grimberg

On 17/02/2016 10:44, Christoph Hellwig wrote:
> That doesn't change how they are managed.  We've always supported mapping
> BARs to userspace in various drivers, and the only real news with things
> like the pmem driver with DAX, or some of the things people want to do
> with the NVMe controller memory buffer, is that there are much bigger
> quantities of it, and:
> 
>  a) people want to be able to have cacheable mappings of various kinds
>     instead of the old uncacheable default.
What if we do want an uncacheable mapping for our device's BAR? Can we still
expose it under ZONE_DEVICE?

>  b) we want to be able to DMA (including RDMA) to the regions in the
>     BARs.
> 
> a) is something that needs smaller amounts of work in all kinds of areas
> to be done properly, but in principle GPU drivers have been doing this
> forever using all kinds of hacks.
> 
> b) is the real issue.  The Linux DMA support code doesn't really operate
> on just physical addresses, but on page structures, and we don't
> allocate them for BARs.  We investigated two ways to address this:  1) allow
> DMA operations without struct page and 2) create struct page structures
> for BARs that we want to be able to use DMA operations on.  For various
> reasons version 2) was favored, and this is how we ended up with
> ZONE_DEVICE.  Read the linux-mm and linux-nvdimm lists for the lengthy
> discussions on how we ended up here.

I was wondering what your thoughts are regarding the other questions we
raised about ZONE_DEVICE.

How can we overcome the section-alignment requirement in the current code?
Our HCA's BARs are usually smaller than 128MB.

Sagi also asked how a peer device that got a ZONE_DEVICE page should know
that it needs to stop using it (the CMB example).

Regards,
Haggai




* Re: [RFC 0/7] Peer-direct memory
  2016-02-17  8:49             ` Christoph Hellwig
@ 2016-02-18 17:12               ` Jason Gunthorpe
  0 siblings, 0 replies; 13+ messages in thread
From: Jason Gunthorpe @ 2016-02-18 17:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: davide rossetti, Haggai Eran, Kovalyov Artemy,
	dledford@redhat.com, linux-rdma@vger.kernel.org,
	linux-mm@kvack.org, leon@leon.ro, Sagi Grimberg

On Wed, Feb 17, 2016 at 12:49:59AM -0800, Christoph Hellwig wrote:

> PCI driver interface.  For pmem (which some people confusingly call
> NVM), mapping the byte-addressable persistent memory to userspace using
> DAX makes a lot of sense, and a lot of work around that is going
> on currently.

Right, this is what I was referring to: 'pmem'-like capability done
with NVMe hardware on PCIe.

Jason


* Re: [RFC 0/7] Peer-direct memory
  2016-02-17 15:25             ` Haggai Eran
@ 2016-02-19 18:54               ` Dan Williams
  0 siblings, 0 replies; 13+ messages in thread
From: Dan Williams @ 2016-02-19 18:54 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Christoph Hellwig, davide rossetti, Jason Gunthorpe,
	Kovalyov Artemy, dledford@redhat.com, linux-rdma@vger.kernel.org,
	linux-mm@kvack.org, Leon Romanovsky, Sagi Grimberg

On Wed, Feb 17, 2016 at 7:25 AM, Haggai Eran <haggaie@mellanox.com> wrote:
> On 17/02/2016 10:44, Christoph Hellwig wrote:
>> That doesn't change how they are managed.  We've always supported mapping
>> BARs to userspace in various drivers, and the only real news with things
>> like the pmem driver with DAX, or some of the things people want to do
>> with the NVMe controller memory buffer, is that there are much bigger
>> quantities of it, and:
>>
>>  a) people want to be able to have cacheable mappings of various kinds
>>     instead of the old uncacheable default.
> What if we do want an uncacheable mapping for our device's BAR? Can we still
> expose it under ZONE_DEVICE?
>
>>  b) we want to be able to DMA (including RDMA) to the regions in the
>>     BARs.
>>
>> a) is something that needs smaller amounts of work in all kinds of areas
>> to be done properly, but in principle GPU drivers have been doing this
>> forever using all kinds of hacks.
>>
>> b) is the real issue.  The Linux DMA support code doesn't really operate
>> on just physical addresses, but on page structures, and we don't
>> allocate them for BARs.  We investigated two ways to address this:  1) allow
>> DMA operations without struct page and 2) create struct page structures
>> for BARs that we want to be able to use DMA operations on.  For various
>> reasons version 2) was favored, and this is how we ended up with
>> ZONE_DEVICE.  Read the linux-mm and linux-nvdimm lists for the lengthy
>> discussions on how we ended up here.
>
> I was wondering what your thoughts are regarding the other questions we
> raised about ZONE_DEVICE.
>
> How can we overcome the section-alignment requirement in the current code?
> Our HCA's BARs are usually smaller than 128MB.

This may not help, but note that the section alignment only bites when
trying to have two mappings with different lifetimes in a single
section.  It's otherwise fine to map a full section for a smaller
single range; you'll just end up with pages that won't be used.
However, this assumes that you are fine with everything in that
section being mapped cacheable; you couldn't mix uncacheable mappings
in that same range.

> Sagi also asked how a peer device that got a ZONE_DEVICE page should know
> that it needs to stop using it (the CMB example).

ZONE_DEVICE pages come with a per-cpu reference counter via
page->pgmap.  See get_dev_pagemap(), get_zone_device_page(), and
put_zone_device_page().
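
A minimal sketch of how a peer driver might use those helpers (4.5-era
names assumed; the wrapper function names here are made up):

#include <linux/memremap.h>
#include <linux/mm.h>

/* Sketch: pin a ZONE_DEVICE page so its dev_pagemap stays live while a
 * peer device is using it, and release it when the peer is done. */
static struct page *peer_hold_page(unsigned long pfn)
{
	struct dev_pagemap *pgmap;
	struct page *page;

	pgmap = get_dev_pagemap(pfn, NULL);	/* fails if the pagemap is being torn down */
	if (!pgmap)
		return NULL;

	page = pfn_to_page(pfn);
	get_zone_device_page(page);		/* per-page hold via page->pgmap */
	put_dev_pagemap(pgmap);
	return page;
}

static void peer_release_page(struct page *page)
{
	put_zone_device_page(page);
}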

However, this gets confusing quickly when a 'pfn' and a 'page' start
referencing MMIO space instead of host memory.  It seems like we need
new data types, because a dma_addr_t does not necessarily reflect the
peer-to-peer address as seen by the device.


* Re: [RFC 0/7] Peer-direct memory
       [not found]         ` <36F6EBABA23FEF4391AF72944D228901EB70C102@BBYEXM01.pmc-sierra.internal>
@ 2016-02-21  9:06           ` Haggai Eran
  2016-02-24 23:45             ` Stephen Bates
  0 siblings, 1 reply; 13+ messages in thread
From: Haggai Eran @ 2016-02-21  9:06 UTC (permalink / raw)
  To: Stephen Bates, Sagi Grimberg, Jason Gunthorpe, Christoph Hellwig,
	'Logan Gunthorpe' (logang@deltatee.com)
  Cc: Artemy Kovalyov, dledford@redhat.com, linux-rdma@vger.kernel.org,
	linux-mm@kvack.org, Leon Romanovsky, sagig@mellanox.com

On 18/02/2016 16:44, Stephen Bates wrote:
> Sagi
> 
>> CC'ing sbates who played with this stuff at some point...
> 
> Thanks for inviting me to this party Sagi ;-). Here are some comments and responses based on our experiences. Apologies in advance for the list format:
> 
> 1. As it stands in 4.5-rc4, devm_memremap_pages() will not work with iomem. Logan (cc'ed here) and I (mostly Logan) developed the ability to do that in an out-of-tree patch for memremap.c. We also developed a simple example driver for a PCIe device that exposes DRAM on the card via a BAR. We used this code to provide some feedback to Dan (e.g. [1]-[3]). At this time we are preparing an RFC to extend devm_memremap_pages() for IO memory and we hope to have that ready soon, but there is no guarantee our approach is acceptable to the community. My hope is that it will be a good starting point for moving forward...
I'd be happy to see your RFC when you are ready. I see in the thread 
of [3] that you are using write-combining. Do you think your patchset 
will also be suitable for uncacheable memory?

> 2. The two good things about Peer-Direct are that it works and it is here today. That said, I do think an approach based on ZONE_DEVICE is more general and a preferred way to allow IO devices to communicate with each other. The question is: can we find such an approach that is acceptable to the community? As noted in point 1, I hope the coming RFC will initiate a discussion. I have also requested attendance at LSF/MM to discuss this topic (among others).
> 
> 3. As of now the section alignment requirement is somewhat relaxed. I quote from [4]. 
> 
> "I could loosen the restriction a bit to allow one unaligned mapping
> per section.  However, if another mapping request came along that
> tried to map a free part of the section it would fail because the code
> depends on a  "1 dev_pagemap per section" relationship.  Seems an ok
> compromise to me..."
> 
> This is implemented in 4.5-rc4 (see memremap.c line 315).

I don't think that's enough for our purposes. We have devices with 
rather small BARs (32MB) and multiple PFs that all need to expose their 
BARs to peer-to-peer access. One can expect that these PFs will be 
assigned adjacent addresses, and they will break the "one dev_pagemap 
per section" rule.

> 4. The out of tree patch we did allows one to register the device memory as IO memory. However, we were only concerned with DRAM exposed on the BAR and so were not affected by the "i/o side effects" issues. Someone would need to think about how this applies to IOMEM that does have side-effects when accessed.
With this RFC, we map parts of the HCA BAR that were mmapped to a process 
(both uncacheable and write-combining) and map them to a peer device 
(another HCA). As long as the kernel doesn't do anything else with 
these pages, and leaves them to be controlled by the user-space 
application and/or the peer device, I don't see a problem with mapping
IO memory with side effects. However, I'm not an expert here, and I'd
be happy to hear what others think about this.

> 5. I concur with Sagi's comment below that one approach we can use to inform 3rd-party device drivers about vanishing memory regions is via mmu_notifiers. However, this needs to be fleshed out and tied into the relevant driver(s).
> 
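
For illustration, a minimal sketch of what such a notifier registration
could look like (4.5-era callback signature; the names and the callback
body are assumptions, not part of this patch set):

#include <linux/mmu_notifier.h>

/* Sketch: let a peer-mapping driver learn when a host mapping it has
 * handed to another device is about to go away. */
static void peer_invalidate_range_start(struct mmu_notifier *mn,
					struct mm_struct *mm,
					unsigned long start, unsigned long end)
{
	/* Quiesce or tear down any peer-to-peer mappings overlapping
	 * [start, end) before the host mapping disappears. */
}

static const struct mmu_notifier_ops peer_mn_ops = {
	.invalidate_range_start	= peer_invalidate_range_start,
};

static struct mmu_notifier peer_mn = {
	.ops = &peer_mn_ops,
};

static int peer_track_mm(struct mm_struct *mm)
{
	return mmu_notifier_register(&peer_mn, mm);
}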
> 6. In full disclosure, my main interest in this ties in to NVM Express devices which can act as DMA masters and expose regions of IOMEM at the same time (via CMBs). I want to be able to tie these devices together with other IO devices (like RDMA NICs, FPGA and GPGPU based offload engines, other NVMe devices and storage adaptors) in a peer-2-peer fashion and may not always have a RDMA device in the mix...
I understand.

Regards,
Haggai


* RE: [RFC 0/7] Peer-direct memory
  2016-02-21  9:06           ` Haggai Eran
@ 2016-02-24 23:45             ` Stephen Bates
  2016-02-25 11:27               ` Haggai Eran
  0 siblings, 1 reply; 13+ messages in thread
From: Stephen Bates @ 2016-02-24 23:45 UTC (permalink / raw)
  To: Haggai Eran, Sagi Grimberg, Jason Gunthorpe, Christoph Hellwig,
	'Logan Gunthorpe' (logang@deltatee.com)
  Cc: Artemy Kovalyov, dledford@redhat.com, linux-rdma@vger.kernel.org,
	linux-mm@kvack.org, Leon Romanovsky, sagig@mellanox.com

Haggai,

> I'd be happy to see your RFC when you are ready. I see in the thread of [3]
> that you are using write-combining. Do you think your patchset will also be
> suitable for uncacheable memory?

Great, we hope to have the RFC soon. It will be able to accept different flags for the devm_memremap() call with regard to caching. One question I have, though, is: when does the caching flag affect peer-to-peer memory accesses? I can see caching causing issues when performing accesses from the CPU, but P2P accesses should bypass any caches in the system, shouldn't they?

> I don't think that's enough for our purposes. We have devices with rather
> small BARs (32MB) and multiple PFs that all need to expose their BARs to
> peer-to-peer access. One can expect that these PFs will be assigned adjacent
> addresses, and they will break the "one dev_pagemap per section" rule.

On the cards and systems I have checked, even small BARs tend to be separated by more than a section's worth of memory.  As I understand it, the allocation of BAR addresses is very arch- and BIOS-specific. Let's discuss this once the RFC comes out and see what options exist to address your concerns.

> 
> > 4. The out of tree patch we did allows one to register the device memory as
> IO memory. However, we were only concerned with DRAM exposed on the
> BAR and so were not affected by the "i/o side effects" issues. Someone
> would need to think about how this applies to IOMEM that does have side-
> effects when accessed.
> With this RFC, we map parts of the HCA BAR that were mmapped to a
> process (both uncacheable and write-combining) and map them to a peer
> device (another HCA). As long as the kernel doesn't do anything else with
> these pages, and leaves them to be controlled by the user-space application
> and/or the peer device, I don't see a problem with mapping IO memory with
> side effects. However, I'm not an expert here, and I'd be happy to hear what
> others think about this.

See above. I think the upcoming RFC should provide support for both cached and uncached mappings. I concur that even if the mappings are flagged as cacheable, there should be no issues as long as all accesses are from the peer-direct device.



* Re: [RFC 0/7] Peer-direct memory
  2016-02-24 23:45             ` Stephen Bates
@ 2016-02-25 11:27               ` Haggai Eran
  0 siblings, 0 replies; 13+ messages in thread
From: Haggai Eran @ 2016-02-25 11:27 UTC (permalink / raw)
  To: Stephen Bates, Sagi Grimberg, Jason Gunthorpe, Christoph Hellwig,
	'Logan Gunthorpe' (logang@deltatee.com)
  Cc: Artemy Kovalyov, dledford@redhat.com, linux-rdma@vger.kernel.org,
	linux-mm@kvack.org, Leon Romanovsky, sagig@mellanox.com

On 25/02/2016 01:45, Stephen Bates wrote:
> Great, we hope to have the RFC soon. It will be able to accept different flags for the devm_memremap() call with regard to caching. One question I have, though, is: when does the caching flag affect peer-to-peer memory accesses? I can see caching causing issues when performing accesses from the CPU, but P2P accesses should bypass any caches in the system, shouldn't they?
I don't think the caching flag will affect peer-to-peer accesses directly, but
we need to keep the BAR mapped to the host the same way it is today. If we
change the driver to map page structs returned from devm_memremap_pages()
instead of using io_remap_pfn_range(), it needs to continue working for host
uses and not only for peers.
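
To illustrate the difference (a sketch only; the function names are made up
and pgprot/caching handling is elided):

#include <linux/mm.h>

/* Today: raw PFN mapping of the BAR, no struct pages involved. */
static int hca_mmap_bar_pfn(struct vm_area_struct *vma, phys_addr_t bar_phys)
{
	return io_remap_pfn_range(vma, vma->vm_start, bar_phys >> PAGE_SHIFT,
				  vma->vm_end - vma->vm_start,
				  vma->vm_page_prot);
}

/* Hypothetical alternative: insert the ZONE_DEVICE pages created by
 * devm_memremap_pages(), so the same pages can also back peer-to-peer
 * DMA.  Whether this preserves today's UC/WC host mappings is exactly
 * the open question. */
static int hca_mmap_bar_pages(struct vm_area_struct *vma, struct page *first)
{
	unsigned long addr;
	unsigned long i = 0;
	int ret;

	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE, i++) {
		ret = vm_insert_page(vma, addr, first + i);
		if (ret)
			return ret;
	}
	return 0;
}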

Regards,
Haggai


Thread overview: 13+ messages
     [not found] <1455207177-11949-1-git-send-email-artemyko@mellanox.com>
     [not found] ` <20160211191838.GA23675@obsidianresearch.com>
2016-02-14 14:27   ` [RFC 0/7] Peer-direct memory Haggai Eran
2016-02-16 18:22     ` Jason Gunthorpe
2016-02-17  4:03       ` davide rossetti
2016-02-17  4:13         ` davide rossetti
2016-02-17  4:44           ` Jason Gunthorpe
2016-02-17  8:49             ` Christoph Hellwig
2016-02-18 17:12               ` Jason Gunthorpe
2016-02-17  8:44           ` Christoph Hellwig
2016-02-17 15:25             ` Haggai Eran
2016-02-19 18:54               ` Dan Williams
     [not found]   ` <20160212201328.GA14122@infradead.org>
     [not found]     ` <20160212203649.GA10540@obsidianresearch.com>
     [not found]       ` <56C09C7E.4060808@dev.mellanox.co.il>
     [not found]         ` <36F6EBABA23FEF4391AF72944D228901EB70C102@BBYEXM01.pmc-sierra.internal>
2016-02-21  9:06           ` Haggai Eran
2016-02-24 23:45             ` Stephen Bates
2016-02-25 11:27               ` Haggai Eran
