From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx157.postini.com [74.125.245.157]) by kanga.kvack.org (Postfix) with SMTP id CA11D6B0005 for ; Fri, 8 Feb 2013 06:18:39 -0500 (EST)
Message-ID: <5114DF05.7070702@mellanox.com>
Date: Fri, 8 Feb 2013 13:18:29 +0200
From: Shachar Raindel
MIME-Version: 1.0
Subject: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: lsf-pc@lists.linux-foundation.org
Cc: linux-mm@kvack.org, Andrea Arcangeli, Roland Dreier, Haggai Eran, Or Gerlitz, Sagi Grimberg, Liran Liss

Hi,

We would like to present a reference implementation for safely sharing
memory pages from user space with the hardware, without pinning.

We will be happy to hear the community's feedback on our prototype
implementation, and suggestions for future improvements.

We would also like to discuss adding features to the core MM subsystem to
assist hardware access to user memory without pinning.

Following is a longer motivation and explanation of the technology
presented:

Many application developers would like to be able to communicate
directly with the hardware from user space.

Use cases for that include high-performance networking APIs such as
InfiniBand, RoCE and iWarp, and interfacing with GPUs.

Currently, if a user space application wants to share system memory with
a hardware device, the kernel component must pin the memory pages in RAM,
using get_user_pages.

This is a hurdle, as it usually makes large portions of the application
memory unmovable. This pinning also makes the user space development model
very complicated - one needs to register memory before using it for
communication with the hardware.

We use the mmu-notifiers [1] mechanism to inform the hardware when the
mapping of a page is changed. If the hardware tries to access a page which
is not yet mapped for the hardware, it requests a resolution for the page
address from the kernel.

This mechanism allows the hardware to access the entire address space of
the user application, without pinning even a single page.

We would like to use the LSF/MM forum opportunity to discuss open issues we
have for further development, such as:

-Allowing the hardware to perform a page table walk, similar to
get_user_pages_fast, to resolve user pages that are already in RAM.

-Batching page eviction by various kernel subsystems (swapper, page-cache)
to reduce the amount of communication needed with the hardware in such
events.

-Hinting from the hardware to the MM regarding page fetches which are
speculative, similar to the prefetching done by the page-cache.

-Page-in notifications from the kernel to the driver, such that we can keep
our secondary TLB in sync with the kernel page table without incurring page
faults.

-Allowed and banned actions while in an MMU notifier callback. We have
already done some work on making the MMU notifiers sleepable [2], but there
might be additional limitations, which we would like to discuss.

-Hinting from the MMU notifiers as to the reason for the notification - for
example, we would like to react differently if a page was moved by NUMA
migration vs. a page being swapped out.

[1] http://lwn.net/Articles/266320/
[2] http://comments.gmane.org/gmane.linux.kernel.mm/85002

Thanks,
--Shachar

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx136.postini.com [74.125.245.136]) by kanga.kvack.org (Postfix) with SMTP id A5A256B0005 for ; Fri, 8 Feb 2013 10:21:41 -0500 (EST)
Received: by mail-qe0-f48.google.com with SMTP id 3so1731093qea.35 for ; Fri, 08 Feb 2013 07:21:40 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <5114DF05.7070702@mellanox.com>
References: <5114DF05.7070702@mellanox.com>
Date: Fri, 8 Feb 2013 10:21:40 -0500
Message-ID:
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes
From: Jerome Glisse
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
List-ID:
To: Shachar Raindel
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli, Roland Dreier, Haggai Eran, Or Gerlitz, Sagi Grimberg, Liran Liss

On Fri, Feb 8, 2013 at 6:18 AM, Shachar Raindel wrote:
> Hi,
>
> We would like to present a reference implementation for safely sharing
> memory pages from user space with the hardware, without pinning.
>
> We will be happy to hear the community feedback on our prototype
> implementation, and suggestions for future improvements.
>
> We would also like to discuss adding features to the core MM subsystem to
> assist hardware access to user memory without pinning.
>
> Following is a longer motivation and explanation on the technology
> presented:
>
> Many application developers would like to be able to be able to communicate
> directly with the hardware from the userspace.
>
> Use cases for that includes high performance networking API such as
> InfiniBand, RoCE and iWarp and interfacing with GPUs.
>
> Currently, if the user space application wants to share system memory with
> the hardware device, the kernel component must pin the memory pages in RAM,
> using get_user_pages.
>
> This is a hurdle, as it usually makes large portions the application memory
> unmovable. This pinning also makes the user space development model very
> complicated - one needs to register memory before using it for communication
> with the hardware.
>
> We use the mmu-notifiers [1] mechanism to inform the hardware when the
> mapping of a page is changed. If the hardware tries to access a page which
> is not yet mapped for the hardware, it requests a resolution for the page
> address from the kernel.
>
> This mechanism allows the hardware to access the entire address space of the
> user application, without pinning even a single page.
>
> We would like to use the LSF/MM forum opportunity to discuss open issues we
> have for further development, such as:
>
> -Allowing the hardware to perform page table walk, similar to
> get_user_pages_fast to resolve user pages that are already in RAM.
>
> -Batching page eviction by various kernel subsystems (swapper, page-cache)
> to reduce the amount of communication needed with the hardware in such
> events
>
> -Hinting from the hardware to the MM regarding page fetches which are
> speculative, similarly to prefetching done by the page-cache
>
> -Page-in notifications from the kernel to the driver, such that we can keep
> our secondary TLB in sync with the kernel page table without incurring page
> faults.
>
> -Allowed and banned actions while in an MMU notifier callback. We have
> already done some work on making the MMU notifiers sleepable [2], but there
> might be additional limitations, which we would like to discuss.
>
> -Hinting from the MMU notifiers as for the reason for the notification - for
> example we would like to react differently if a page was moved by NUMA
> migration vs. page being swapped out.
> > [1] http://lwn.net/Articles/266320/ > > [2] http://comments.gmane.org/gmane.linux.kernel.mm/85002 > > Thanks, > > --Shachar As a GPU driver developer i can say that this is something we want to do in a very near future. Also i think we would like another capabilities : - hint to mm on memory range that are best not to evict (easier for driver to know what is hot and gonna see activities) Dunno how big the change to the page eviction path would need to be. Cheers, Jerome -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx126.postini.com [74.125.245.126]) by kanga.kvack.org (Postfix) with SMTP id 7A75A6B0002 for ; Sat, 9 Feb 2013 01:05:15 -0500 (EST) Received: by mail-vc0-f178.google.com with SMTP id m8so2799648vcd.23 for ; Fri, 08 Feb 2013 22:05:14 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <5114DF05.7070702@mellanox.com> References: <5114DF05.7070702@mellanox.com> Date: Fri, 8 Feb 2013 22:05:14 -0800 Message-ID: Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes From: Michel Lespinasse Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Shachar Raindel Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli , Roland Dreier , Haggai Eran , Or Gerlitz , Sagi Grimberg , Liran Liss On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel wrote: > Hi, > > We would like to present a reference implementation for safely sharing > memory pages from user space with the hardware, without pinning. > > We will be happy to hear the community feedback on our prototype > implementation, and suggestions for future improvements. 
> > We would also like to discuss adding features to the core MM subsystem to > assist hardware access to user memory without pinning. This sounds kinda scary TBH; however I do understand the need for such technology. I think one issue is that many MM developers are insufficiently aware of such developments; having a technology presentation would probably help there; but traditionally LSF/MM sessions are more interactive between developers who are already quite familiar with the technology. I think it would help if you could send in advance a detailed presentation of the problem and the proposed solutions (and then what they require of the MM layer) so people can be better prepared. And first I'd like to ask, aren't IOMMUs supposed to already largely solve this problem ? (probably a dumb question, but that just tells you how much you need to explain :) -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx170.postini.com [74.125.245.170]) by kanga.kvack.org (Postfix) with SMTP id 4EF936B0002 for ; Sat, 9 Feb 2013 11:29:07 -0500 (EST) Received: by mail-qa0-f50.google.com with SMTP id dx4so707992qab.9 for ; Sat, 09 Feb 2013 08:29:06 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: <5114DF05.7070702@mellanox.com> Date: Sat, 9 Feb 2013 11:29:05 -0500 Message-ID: Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes From: Jerome Glisse Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Michel Lespinasse Cc: Shachar Raindel , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli , Roland Dreier , Haggai Eran , Or Gerlitz , Sagi Grimberg , Liran Liss On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse wrote: > On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel wrote: >> Hi, >> >> We would like to present a reference implementation for safely sharing >> memory pages from user space with the hardware, without pinning. >> >> We will be happy to hear the community feedback on our prototype >> implementation, and suggestions for future improvements. >> >> We would also like to discuss adding features to the core MM subsystem to >> assist hardware access to user memory without pinning. > > This sounds kinda scary TBH; however I do understand the need for such > technology. > > I think one issue is that many MM developers are insufficiently aware > of such developments; having a technology presentation would probably > help there; but traditionally LSF/MM sessions are more interactive > between developers who are already quite familiar with the technology. 
> I think it would help if you could send in advance a detailed
> presentation of the problem and the proposed solutions (and then what
> they require of the MM layer) so people can be better prepared.
>
> And first I'd like to ask, aren't IOMMUs supposed to already largely
> solve this problem ? (probably a dumb question, but that just tells
> you how much you need to explain :)

For GPUs the motivation is threefold. With the advance of GPU compute,
and also with newer graphics programs, we see a massive increase in GPU
memory consumption. We can easily reach buffers that are bigger than
1 GB. So the first motivation is to use the memory the user allocated
through malloc directly on the GPU; this avoids copying 1 GB of data
with the CPU into a GPU buffer. The second, and the one that matters
most for GPU compute, is using the GPU seamlessly with the CPU. To
achieve this you want the programmer to have a single address space on
the CPU and GPU, so that the same address points to the same object on
the GPU as on the CPU. This would also be a tremendously cleaner design
for memory management from the driver's point of view.

And last, the most important: with such big buffers (>1 GB) memory
pinning is becoming way too expensive and also drastically reduces the
freedom of the mm to free pages for other processes. Most of the time
only a small window of the object will be in use by the hardware
(everything is relative; the window can be > 100 MB, not so small :)).
Hardware page fault support would avoid the need to pin memory and thus
offer greater flexibility. At the same time the driver wants to avoid
page faults as much as possible, which is why I would like to be able to
give hints to the mm about address ranges whose pages it should avoid
freeing (swapping out).

The IOMMU was designed with other goals. The first was to isolate
devices from one another and restrict device access to allowed memory.
The second was to allow remapping of addresses that are above the
device's address space limit. Lots of devices can only address 24 or
32 bits of memory, and on a computer with several GB of memory a lot of
the pages suddenly become unreachable to the hardware. The IOMMU works
around this by remapping those high pages to addresses that the hardware
can reach.

Hardware page fault support is a new IOMMU feature designed to help the
OS and driver reduce memory pinning and also share the address space.
Though I am sure there are other motivations that I am not even aware
of or would not think of.

Btw I won't be at LSF/MM, so a free good beer (or other beverage) on me
for whoever takes notes on this subject, at the next conf where we run
into each other.

Cheers,
Jerome

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx161.postini.com [74.125.245.161]) by kanga.kvack.org (Postfix) with SMTP id CDC896B0002 for ; Sun, 10 Feb 2013 02:55:55 -0500 (EST)
Message-ID: <51175251.3040209@mellanox.com>
Date: Sun, 10 Feb 2013 09:54:57 +0200
From: Shachar Raindel
MIME-Version: 1.0
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes
References: <5114DF05.7070702@mellanox.com>
In-Reply-To:
Content-Type: multipart/alternative; boundary="------------020209080104080609050500"
Sender: owner-linux-mm@kvack.org
List-ID:
To: Michel Lespinasse
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli, Roland Dreier, Haggai Eran, Or Gerlitz, Sagi Grimberg, Liran Liss

--------------020209080104080609050500
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit

On 2/9/2013 8:05 AM, Michel Lespinasse wrote:
> On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel wrote:
>> Hi,
>>
>> We would like to present a reference implementation for
safely sharing >> memory pages from user space with the hardware, without pinning. >> >> We will be happy to hear the community feedback on our prototype >> implementation, and suggestions for future improvements. >> >> We would also like to discuss adding features to the core MM subsystem to >> assist hardware access to user memory without pinning. > This sounds kinda scary TBH; however I do understand the need for such > technology. The technological challenges here are actually rather similar to the ones experienced by hypervisors that want to allow swapping of virtual machines. As a result, we benefit greatly from the mmu notifiers implemented for KVM. Reading the page table directly will be another level of challenge. > I think one issue is that many MM developers are insufficiently aware > of such developments; having a technology presentation would probably > help there; but traditionally LSF/MM sessions are more interactive > between developers who are already quite familiar with the technology. > I think it would help if you could send in advance a detailed > presentation of the problem and the proposed solutions (and then what > they require of the MM layer) so people can be better prepared. We hope to send out an RFC patch-set of the feature implementation for our hardware soon, which might help to demonstrate a use case for the technology. The current programming model for InfiniBand (and related network protocols - RoCE, iWarp) relies on the user space program registering memory regions for use with the hardware. Upon registration, the driver performs pinning (get_user_pages) of the memory area, updates a mapping table in the hardware and provides the user application with a handle for the mapping. The user space application then use this handle to request the hardware to access this area for network IO. 
While achieving unbeatable IO performance (round-trip latency, for user space programs, of less than 2 microseconds, bandwidth of 56 Gbit/second), this model is relatively hard to use: - The need for explicit memory registration for each area makes the API rather complex to use. Ideal API would have a handle per process, that allows it to communicate with the hardware using the process virtual addresses. - After a part of the address space has been registered, the application must be careful not to move the pages around. For example, doing a fork results in all of the memory registrations pointing to the wrong pages (which is very hard to debug). This was partially addressed at [1], but the cure is nearly as bad as the disease - when MADVISE_DONTFORK is used on the heap, a simple call to malloc in the child process might crash the process. - Memory which was registered is not swappable. As a result, one cannot write applications that overcommit for physical memory while using this API. Similarly to what Jerome described about GPU applications, for network access the application might want to use ~10% of its allocated memory space, but it is required to either pin all of the memory, use heuristics to predict what memory will be used or perform expensive copying/pinning for every network transaction. All of these are non-optimal. > And first I'd like to ask, aren't IOMMUs supposed to already largely > solve this problem ? (probably a dumb question, but that just tells > you how much you need to explain :) > IOMMU v1 doesn't solve this problem, as it gives you only one mapping table per PCI function. If you want ~64 processes on your machine to be able to access the network, this is not nearly enough. It is helping in implementing PCI pass-thru for virtualized guests (with the hardware devices exposing several virtual PCI functions for the guests), but that is still not enough for user space applications. 
To some extant, IOMMU v1 might even be an obstacle to implementing such feature, as it prevents PCI devices from accessing parts of the memory, requiring driver intervention for every page fault, even if the page is in memory. IOMMU v2 [2] is a step at the same direction that we are moving towards, offering PASID - a unique identifier for each transaction that the device performs, allowing to associate the transaction with a specific process. However, the challenges there are similar to these we encounter when using an address translation table on the PCI device itself (NIC/GPU). References: 1. MADVISE_DONTFORK - http://lwn.net/Articles/171956/ 2. AMD IOMMU v2 - http://www.linux-kvm.org/wiki/images/b/b1/2011-forum-amd-iommuv2-kvm.pdf --------------020209080104080609050500 Content-Type: text/html; charset="ISO-8859-1" Content-Transfer-Encoding: 7bit On 2/9/2013 8:05 AM, Michel Lespinasse wrote:
> On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel <raindel@mellanox.com> wrote:
>> Hi,
>>
>> We would like to present a reference implementation for safely sharing
>> memory pages from user space with the hardware, without pinning.
>>
>> We will be happy to hear the community feedback on our prototype
>> implementation, and suggestions for future improvements.
>>
>> We would also like to discuss adding features to the core MM subsystem to
>> assist hardware access to user memory without pinning.
> This sounds kinda scary TBH; however I do understand the need for such
> technology.

The technological challenges here are actually rather similar to the ones experienced
by hypervisors that want to allow swapping of virtual machines. As a result, we benefit
greatly from the mmu notifiers implemented for KVM. Reading the page table directly
will be another level of challenge.

> I think one issue is that many MM developers are insufficiently aware
> of such developments; having a technology presentation would probably
> help there; but traditionally LSF/MM sessions are more interactive
> between developers who are already quite familiar with the technology.
> I think it would help if you could send in advance a detailed
> presentation of the problem and the proposed solutions (and then what
> they require of the MM layer) so people can be better prepared.

We hope to send out an RFC patch-set of the feature implementation for our hardware
soon, which might help to demonstrate a use case for the technology.

The current programming model for InfiniBand (and related network protocols - RoCE,
iWarp) relies on the user space program registering memory regions for use with the
hardware. Upon registration, the driver performs pinning (get_user_pages) of the
memory area, updates a mapping table in the hardware and provides the user
application with a handle for the mapping. The user space application then uses this
handle to ask the hardware to access this area for network IO.
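
The register-then-use flow described above can be modeled in a few lines of C. This is only a toy sketch of the bookkeeping, not the InfiniBand verbs API: `toy_reg_mr`, `toy_mr_covers` and the table layout are hypothetical names invented for illustration, and no real pinning or hardware programming happens here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a per-device registration table. In a real
 * driver, registering a region would also pin its pages and program
 * the device's mapping table; here we only record the range. */
#define TOY_MAX_REGIONS 8

struct toy_region {
    uintptr_t start;   /* first byte of the registered range */
    size_t    len;     /* length in bytes, always > 0 when in use */
    int       in_use;
};

static struct toy_region toy_table[TOY_MAX_REGIONS];

/* Register [addr, addr + len) and return a handle (table index),
 * or -1 if len is zero or the table is full. */
static int toy_reg_mr(const void *addr, size_t len)
{
    if (len == 0)
        return -1;
    for (int h = 0; h < TOY_MAX_REGIONS; h++) {
        if (!toy_table[h].in_use) {
            toy_table[h].start = (uintptr_t)addr;
            toy_table[h].len = len;
            toy_table[h].in_use = 1;
            return h;
        }
    }
    return -1;
}

/* The modeled "hardware" may only touch memory that lies entirely
 * inside the range behind a valid handle. Returns 1 if allowed. */
static int toy_mr_covers(int handle, const void *addr, size_t len)
{
    if (handle < 0 || handle >= TOY_MAX_REGIONS || !toy_table[handle].in_use)
        return 0;

    const struct toy_region *r = &toy_table[handle];
    uintptr_t a = (uintptr_t)addr;

    assert(r->len > 0); /* guaranteed by toy_reg_mr */
    /* Written to avoid overflow in start + len style arithmetic. */
    return a >= r->start && len <= r->len && a - r->start <= r->len - len;
}
```

Every buffer the application wants the device to reach needs its own `toy_reg_mr` call first, which is the per-area bookkeeping burden this message goes on to describe.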

While achieving unbeatable IO performance (round-trip latency, for user space programs,
of less than 2 microseconds, bandwidth of 56 Gbit/second), this model is relatively
hard to use:

- The need for explicit memory registration for each area makes the API rather
  complex to use. An ideal API would have one handle per process, allowing it to
  communicate with the hardware using the process's virtual addresses.

- After a part of the address space has been registered, the application must be
  careful not to move the pages around. For example, doing a fork results in all of
  the memory registrations pointing to the wrong pages (which is very hard to debug).
  This was partially addressed in [1], but the cure is nearly as bad as the disease - when
  MADV_DONTFORK is used on the heap, a simple call to malloc in the child process
  might crash the process.

- Memory which was registered is not swappable. As a result, one cannot write
  applications that overcommit for physical memory while using this API. Similarly to
  what Jerome described about GPU applications, for network access the application
  might want to use ~10% of its allocated memory space, but it is required to either
  pin all of the memory, use heuristics to predict what memory will be used or
  perform expensive copying/pinning for every network transaction. All of these are
  non-optimal.
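
For reference, the fork workaround referenced in [1] boils down to a single madvise call on the registered range (the flag is spelled MADV_DONTFORK in the kernel headers). A minimal Linux-only sketch, assuming an anonymous mapping stands in for the registered buffer; `alloc_dontfork_buffer` and `free_dontfork_buffer` are hypothetical helper names:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: allocate a page-aligned anonymous buffer and
 * mark it MADV_DONTFORK, so that a child created by fork() does not
 * inherit the range at all. Returns NULL on failure. */
static void *alloc_dontfork_buffer(size_t len)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    /* madvise() requires a page-aligned start address; the address
     * returned by mmap() satisfies that. */
    if (madvise(buf, len, MADV_DONTFORK) != 0) {
        munmap(buf, len);
        return NULL;
    }
    return buf;
}

/* Release a buffer obtained from alloc_dontfork_buffer(). */
static void free_dontfork_buffer(void *buf, size_t len)
{
    int rc = munmap(buf, len);
    assert(rc == 0); /* fails only for invalid arguments */
    (void)rc;
}
```

Because the child never sees the range, this is safe for a dedicated buffer like the one above. Applying the same flag to heap pages is what makes the workaround fragile: the child's allocator may touch memory that no longer exists in its address space.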

> And first I'd like to ask, aren't IOMMUs supposed to already largely
> solve this problem ? (probably a dumb question, but that just tells
> you how much you need to explain :)


IOMMU v1 doesn't solve this problem, as it gives you only one mapping table per
PCI function. If you want ~64 processes on your machine to be able to access the
network, this is not nearly enough. It helps in implementing PCI pass-through for
virtualized guests (with the hardware devices exposing several virtual PCI functions
for the guests), but that is still not enough for user space applications.

To some extent, IOMMU v1 might even be an obstacle to implementing such a
feature, as it prevents PCI devices from accessing parts of the memory, requiring
driver intervention for every page fault, even if the page is in memory.

IOMMU v2 [2] is a step in the same direction that we are moving in, offering
PASID - a unique identifier for each transaction that the device performs, which allows
the transaction to be associated with a specific process. However, the challenges there
are similar to those we encounter when using an address translation table on the
PCI device itself (NIC/GPU).

References:

1. MADV_DONTFORK - http://lwn.net/Articles/171956/
2. AMD IOMMU v2 - http://www.linux-kvm.org/wiki/images/b/b1/2011-forum-amd-iommuv2-kvm.pdf

--------------020209080104080609050500-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx146.postini.com [74.125.245.146]) by kanga.kvack.org (Postfix) with SMTP id 200E26B0005 for ; Tue, 9 Apr 2013 04:18:03 -0400 (EDT) Received: by mail-ob0-f175.google.com with SMTP id va7so6580563obc.20 for ; Tue, 09 Apr 2013 01:18:02 -0700 (PDT) Message-ID: <5163CEB3.80707@gmail.com> Date: Tue, 09 Apr 2013 16:17:55 +0800 From: Simon Jeons MIME-Version: 1.0 Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes References: <5114DF05.7070702@mellanox.com> In-Reply-To: <5114DF05.7070702@mellanox.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: Shachar Raindel Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli , Roland Dreier , Haggai Eran , Or Gerlitz , Sagi Grimberg , Liran Liss Hi Simon, On 02/08/2013 07:18 PM, Shachar Raindel wrote: > Hi, > > We would like to present a reference implementation for safely sharing > memory pages from user space with the hardware, without pinning. > > We will be happy to hear the community feedback on our prototype > implementation, and suggestions for future improvements. > > We would also like to discuss adding features to the core MM subsystem > to assist hardware access to user memory without pinning. > > Following is a longer motivation and explanation on the technology > presented: > > Many application developers would like to be able to be able to > communicate directly with the hardware from the userspace. 
>
> Use cases for that includes high performance networking API such as
> InfiniBand, RoCE and iWarp and interfacing with GPUs.
>
> Currently, if the user space application wants to share system memory
> with the hardware device, the kernel component must pin the memory
> pages in RAM, using get_user_pages.
>
> This is a hurdle, as it usually makes large portions the application
> memory unmovable. This pinning also makes the user space development
> model very complicated - one needs to register memory before using it
> for communication with the hardware.
>
> We use the mmu-notifiers [1] mechanism to inform the hardware when the
> mapping of a page is changed. If the hardware tries to access a page
> which is not yet mapped for the hardware, it requests a resolution for
> the page address from the kernel.

mmu_notifiers are used by the host to notify a guest that a page has
changed, aren't they? Why do you say that they are used to inform the
hardware when the mapping of a page is changed?

>
> This mechanism allows the hardware to access the entire address space
> of the user application, without pinning even a single page.
>
> We would like to use the LSF/MM forum opportunity to discuss open
> issues we have for further development, such as:
>
> -Allowing the hardware to perform page table walk, similar to
> get_user_pages_fast to resolve user pages that are already in RAM.
>
> -Batching page eviction by various kernel subsystems (swapper,
> page-cache) to reduce the amount of communication needed with the
> hardware in such events
>
> -Hinting from the hardware to the MM regarding page fetches which are
> speculative, similarly to prefetching done by the page-cache
>
> -Page-in notifications from the kernel to the driver, such that we can
> keep our secondary TLB in sync with the kernel page table without
> incurring page faults.
>
> -Allowed and banned actions while in an MMU notifier callback.
We have > already done some work on making the MMU notifiers sleepable [2], but > there might be additional limitations, which we would like to discuss. > > -Hinting from the MMU notifiers as for the reason for the notification > - for example we would like to react differently if a page was moved > by NUMA migration vs. page being swapped out. > > [1] http://lwn.net/Articles/266320/ > > [2] http://comments.gmane.org/gmane.linux.kernel.mm/85002 > > Thanks, > > --Shachar > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx147.postini.com [74.125.245.147]) by kanga.kvack.org (Postfix) with SMTP id 9463A6B0005 for ; Tue, 9 Apr 2013 04:28:20 -0400 (EDT) Received: by mail-pa0-f45.google.com with SMTP id kl13so3714213pab.18 for ; Tue, 09 Apr 2013 01:28:19 -0700 (PDT) Message-ID: <5163D119.80603@gmail.com> Date: Tue, 09 Apr 2013 16:28:09 +0800 From: Simon Jeons MIME-Version: 1.0 Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes References: <5114DF05.7070702@mellanox.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Jerome Glisse Cc: Michel Lespinasse , Shachar Raindel , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli , Roland Dreier , Haggai Eran , Or Gerlitz , Sagi Grimberg , Liran Liss Hi Jerome, On 02/10/2013 12:29 AM, Jerome Glisse wrote: > On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse wrote: >> On Fri, Feb 8, 2013 at 3:18 AM, Shachar 
Raindel wrote: >>> Hi, >>> >>> We would like to present a reference implementation for safely sharing >>> memory pages from user space with the hardware, without pinning. >>> >>> We will be happy to hear the community feedback on our prototype >>> implementation, and suggestions for future improvements. >>> >>> We would also like to discuss adding features to the core MM subsystem to >>> assist hardware access to user memory without pinning. >> This sounds kinda scary TBH; however I do understand the need for such >> technology. >> >> I think one issue is that many MM developers are insufficiently aware >> of such developments; having a technology presentation would probably >> help there; but traditionally LSF/MM sessions are more interactive >> between developers who are already quite familiar with the technology. >> I think it would help if you could send in advance a detailed >> presentation of the problem and the proposed solutions (and then what >> they require of the MM layer) so people can be better prepared. >> >> And first I'd like to ask, aren't IOMMUs supposed to already largely >> solve this problem ? (probably a dumb question, but that just tells >> you how much you need to explain :) > For GPU the motivation is three fold. With the advance of GPU compute > and also with newer graphic program we see a massive increase in GPU > memory consumption. We easily can reach buffer that are bigger than > 1gbytes. So the first motivation is to directly use the memory the > user allocated through malloc in the GPU this avoid copying 1gbytes of > data with the cpu to the gpu buffer. The second and mostly important > to GPU compute is the use of GPU seamlessly with the CPU, in order to > achieve this you want the programmer to have a single address space on > the CPU and GPU. So that the same address point to the same object on > GPU as on the CPU. This would also be a tremendous cleaner design from > driver point of view toward memory management. 
> > And last, the most important, with such big buffer (>1gbytes) the > memory pinning is becoming way to expensive and also drastically > reduce the freedom of the mm to free page for other process. Most of > the time a small window (every thing is relative the window can be > > 100mbytes not so small :)) of the object will be in use by the > hardware. The hardware pagefault support would avoid the necessity to What's the meaning of hardware pagefault? > pin memory and thus offer greater flexibility. At the same time the > driver wants to avoid page fault as much as possible this is why i > would like to be able to give hint to the mm about range of address it > should avoid freeing page (swapping them out). > > The iommu was designed with other goals, which were first isolate > device from one another and restrict device access to allowed memory. > Second allow to remap address that are above device address space When need remap address? > limit. Lot of device can only address 24bit or 32bit of memory and > with computer with several gbytes of memory suddenly lot of the page > become unreachable to the hardware. The iommu allow to work around > this by remapping those high page into address that the hardware can > reach. > > The hardware page fault support is a new feature of iommu designed to > help the os and driver to reduce memory pinning and also share address > space. Thought i am sure there are other motivations that i am not > even aware off or would think off. > > Btw i won't be at LSF/MM so a free good beer (or other beverage) on me > to whoever takes note on this subject in next conf we run into each > others. > > Cheers, > Jerome > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx175.postini.com [74.125.245.175]) by kanga.kvack.org (Postfix) with SMTP id 545AF6B0027 for ; Tue, 9 Apr 2013 10:25:28 -0400 (EDT) Received: by mail-qc0-f175.google.com with SMTP id j3so1154893qcs.6 for ; Tue, 09 Apr 2013 07:25:27 -0700 (PDT) Date: Tue, 9 Apr 2013 10:21:57 -0400 From: Jerome Glisse Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes Message-ID: <20130409142156.GA1909@gmail.com> References: <5114DF05.7070702@mellanox.com> <5163D119.80603@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5163D119.80603@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Simon Jeons Cc: Michel Lespinasse , Shachar Raindel , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli , Roland Dreier , Haggai Eran , Or Gerlitz , Sagi Grimberg , Liran Liss On Tue, Apr 09, 2013 at 04:28:09PM +0800, Simon Jeons wrote: > Hi Jerome, > On 02/10/2013 12:29 AM, Jerome Glisse wrote: > >On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse wrote: > >>On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel wrote: > >>>Hi, > >>> > >>>We would like to present a reference implementation for safely sharing > >>>memory pages from user space with the hardware, without pinning. > >>> > >>>We will be happy to hear the community feedback on our prototype > >>>implementation, and suggestions for future improvements. > >>> > >>>We would also like to discuss adding features to the core MM subsystem to > >>>assist hardware access to user memory without pinning. > >>This sounds kinda scary TBH; however I do understand the need for such > >>technology. 
> >> > >>I think one issue is that many MM developers are insufficiently aware > >>of such developments; having a technology presentation would probably > >>help there; but traditionally LSF/MM sessions are more interactive > >>between developers who are already quite familiar with the technology. > >>I think it would help if you could send in advance a detailed > >>presentation of the problem and the proposed solutions (and then what > >>they require of the MM layer) so people can be better prepared. > >> > >>And first I'd like to ask, aren't IOMMUs supposed to already largely > >>solve this problem ? (probably a dumb question, but that just tells > >>you how much you need to explain :) > >For GPU the motivation is three fold. With the advance of GPU compute > >and also with newer graphic program we see a massive increase in GPU > >memory consumption. We easily can reach buffer that are bigger than > >1gbytes. So the first motivation is to directly use the memory the > >user allocated through malloc in the GPU this avoid copying 1gbytes of > >data with the cpu to the gpu buffer. The second and mostly important > >to GPU compute is the use of GPU seamlessly with the CPU, in order to > >achieve this you want the programmer to have a single address space on > >the CPU and GPU. So that the same address point to the same object on > >GPU as on the CPU. This would also be a tremendous cleaner design from > >driver point of view toward memory management. > > > >And last, the most important, with such big buffer (>1gbytes) the > >memory pinning is becoming way to expensive and also drastically > >reduce the freedom of the mm to free page for other process. Most of > >the time a small window (every thing is relative the window can be > > >100mbytes not so small :)) of the object will be in use by the > >hardware. The hardware pagefault support would avoid the necessity to > > What's the meaning of hardware pagefault? 
It's a PCIe extension (actually a combination of extensions that together
allow it, see http://www.pcisig.com/specifications/iov/ats/). The idea is that
the iommu can trigger a regular page fault inside a process address space on
behalf of the hardware. The only iommu supporting that right now is the AMD
iommu v2 found on recent AMD platforms.

>
> >pin memory and thus offer greater flexibility. At the same time the
> >driver wants to avoid page fault as much as possible this is why i
> >would like to be able to give hint to the mm about range of address it
> >should avoid freeing page (swapping them out).
> >
> >The iommu was designed with other goals, which were first isolate
> >device from one another and restrict device access to allowed memory.
> >Second allow to remap address that are above device address space
>
> When need remap address?

Some hardware has a 24-bit or 32-bit addressing limitation; the iommu allows
remapping memory that lies above this range into the range the device can
address. Just as I said below. Or is your question different?

Cheers,
Jerome

> >limit. Lot of device can only address 24bit or 32bit of memory and
> >with computer with several gbytes of memory suddenly lot of the page
> >become unreachable to the hardware. The iommu allow to work around
> >this by remapping those high page into address that the hardware can
> >reach.
> >
> >The hardware page fault support is a new feature of iommu designed to
> >help the os and driver to reduce memory pinning and also share address
> >space. Thought i am sure there are other motivations that i am not
> >even aware off or would think off.
> >
> >Btw i won't be at LSF/MM so a free good beer (or other beverage) on me
> >to whoever takes note on this subject in next conf we run into each
> >others.
> >

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx141.postini.com [74.125.245.141]) by kanga.kvack.org (Postfix) with SMTP id BDEE76B0005 for ; Tue, 9 Apr 2013 21:42:05 -0400 (EDT) Received: by mail-qe0-f49.google.com with SMTP id 6so2573027qeb.22 for ; Tue, 09 Apr 2013 18:42:04 -0700 (PDT) Message-ID: <5164C365.70302@gmail.com> Date: Wed, 10 Apr 2013 09:41:57 +0800 From: Simon Jeons MIME-Version: 1.0 Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes References: <5114DF05.7070702@mellanox.com> <5163D119.80603@gmail.com> <20130409142156.GA1909@gmail.com> In-Reply-To: <20130409142156.GA1909@gmail.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Jerome Glisse Cc: Michel Lespinasse , Shachar Raindel , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli , Roland Dreier , Haggai Eran , Or Gerlitz , Sagi Grimberg , Liran Liss Hi Jerome, On 04/09/2013 10:21 PM, Jerome Glisse wrote: > On Tue, Apr 09, 2013 at 04:28:09PM +0800, Simon Jeons wrote: >> Hi Jerome, >> On 02/10/2013 12:29 AM, Jerome Glisse wrote: >>> On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse wrote: >>>> On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel wrote: >>>>> Hi, >>>>> >>>>> We would like to present a reference implementation for safely sharing >>>>> memory pages from user space with the hardware, without pinning. >>>>> >>>>> We will be happy to hear the community feedback on our prototype >>>>> implementation, and suggestions for future improvements. >>>>> >>>>> We would also like to discuss adding features to the core MM subsystem to >>>>> assist hardware access to user memory without pinning. >>>> This sounds kinda scary TBH; however I do understand the need for such >>>> technology. 
>>>> >>>> I think one issue is that many MM developers are insufficiently aware >>>> of such developments; having a technology presentation would probably >>>> help there; but traditionally LSF/MM sessions are more interactive >>>> between developers who are already quite familiar with the technology. >>>> I think it would help if you could send in advance a detailed >>>> presentation of the problem and the proposed solutions (and then what >>>> they require of the MM layer) so people can be better prepared. >>>> >>>> And first I'd like to ask, aren't IOMMUs supposed to already largely >>>> solve this problem ? (probably a dumb question, but that just tells >>>> you how much you need to explain :) >>> For GPU the motivation is three fold. With the advance of GPU compute >>> and also with newer graphic program we see a massive increase in GPU >>> memory consumption. We easily can reach buffer that are bigger than >>> 1gbytes. So the first motivation is to directly use the memory the >>> user allocated through malloc in the GPU this avoid copying 1gbytes of >>> data with the cpu to the gpu buffer. The second and mostly important >>> to GPU compute is the use of GPU seamlessly with the CPU, in order to >>> achieve this you want the programmer to have a single address space on >>> the CPU and GPU. So that the same address point to the same object on >>> GPU as on the CPU. This would also be a tremendous cleaner design from >>> driver point of view toward memory management. >>> >>> And last, the most important, with such big buffer (>1gbytes) the >>> memory pinning is becoming way to expensive and also drastically >>> reduce the freedom of the mm to free page for other process. Most of >>> the time a small window (every thing is relative the window can be > >>> 100mbytes not so small :)) of the object will be in use by the >>> hardware. The hardware pagefault support would avoid the necessity to >> What's the meaning of hardware pagefault? 
> It's a PCIE extension (well it's a combination of extension that allow
> that see http://www.pcisig.com/specifications/iov/ats/). Idea is that the
> iommu can trigger a regular pagefault inside a process address space on
> behalf of the hardware. The only iommu supporting that right now is the
> AMD iommu v2 that you find on recent AMD platform.

Why is a hardware page fault needed? A regular page fault is triggered by the
CPU MMU, correct?

>>> pin memory and thus offer greater flexibility. At the same time the
>>> driver wants to avoid page fault as much as possible this is why i
>>> would like to be able to give hint to the mm about range of address it
>>> should avoid freeing page (swapping them out).
>>>
>>> The iommu was designed with other goals, which were first isolate
>>> device from one another and restrict device access to allowed memory.
>>> Second allow to remap address that are above device address space
>> When need remap address?
> Some hardware have 24bits or 32bits address limitation, iommu allow to
> remap memory that are above this range into the working range of the
> device. Just as i said below. Or are your question different ?

Oh, so this method can replace a bounce buffer, correct?

>
> Cheers,
> Jerome
>
>>> limit. Lot of device can only address 24bit or 32bit of memory and
>>> with computer with several gbytes of memory suddenly lot of the page
>>> become unreachable to the hardware. The iommu allow to work around
>>> this by remapping those high page into address that the hardware can
>>> reach.
>>>
>>> The hardware page fault support is a new feature of iommu designed to
>>> help the os and driver to reduce memory pinning and also share address
>>> space. Thought i am sure there are other motivations that i am not
>>> even aware off or would think off.
>>>
>>> Btw i won't be at LSF/MM so a free good beer (or other beverage) on me
>>> to whoever takes note on this subject in next conf we run into each
>>> others.
>>> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx142.postini.com [74.125.245.142]) by kanga.kvack.org (Postfix) with SMTP id 340456B0005 for ; Tue, 9 Apr 2013 21:48:42 -0400 (EDT) Received: by mail-pa0-f51.google.com with SMTP id jh10so25413pab.38 for ; Tue, 09 Apr 2013 18:48:41 -0700 (PDT) Message-ID: <5164C4F2.7090108@gmail.com> Date: Wed, 10 Apr 2013 09:48:34 +0800 From: Simon Jeons MIME-Version: 1.0 Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes References: <5114DF05.7070702@mellanox.com> <5163CEB3.80707@gmail.com> In-Reply-To: <5163CEB3.80707@gmail.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: Shachar Raindel Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli , Roland Dreier , Haggai Eran , Or Gerlitz , Sagi Grimberg , Liran Liss , Jerome Glisse Ping Jerome, On 04/09/2013 04:17 PM, Simon Jeons wrote: > Hi Simon, > On 02/08/2013 07:18 PM, Shachar Raindel wrote: >> Hi, >> >> We would like to present a reference implementation for safely >> sharing memory pages from user space with the hardware, without pinning. >> >> We will be happy to hear the community feedback on our prototype >> implementation, and suggestions for future improvements. >> >> We would also like to discuss adding features to the core MM >> subsystem to assist hardware access to user memory without pinning. >> >> Following is a longer motivation and explanation on the technology >> presented: >> >> Many application developers would like to be able to be able to >> communicate directly with the hardware from the userspace. 
>>
>> Use cases for that includes high performance networking API such as
>> InfiniBand, RoCE and iWarp and interfacing with GPUs.
>>
>> Currently, if the user space application wants to share system memory
>> with the hardware device, the kernel component must pin the memory
>> pages in RAM, using get_user_pages.
>>
>> This is a hurdle, as it usually makes large portions the application
>> memory unmovable. This pinning also makes the user space development
>> model very complicated - one needs to register memory before using it
>> for communication with the hardware.
>>
>> We use the mmu-notifiers [1] mechanism to inform the hardware when
>> the mapping of a page is changed. If the hardware tries to access a
>> page which is not yet mapped for the hardware, it requests a
>> resolution for the page address from the kernel.
>
> mmu_notifiers are used by the host to notify a guest that a page has
> changed, aren't they? Why do you say they are used to inform the
> hardware when the mapping of a page is changed?
>
>>
>> This mechanism allows the hardware to access the entire address space
>> of the user application, without pinning even a single page.
>>
>> We would like to use the LSF/MM forum opportunity to discuss open
>> issues we have for further development, such as:
>>
>> -Allowing the hardware to perform page table walk, similar to
>> get_user_pages_fast to resolve user pages that are already in RAM.
>>
>> -Batching page eviction by various kernel subsystems (swapper,
>> page-cache) to reduce the amount of communication needed with the
>> hardware in such events
>>
>> -Hinting from the hardware to the MM regarding page fetches which are
>> speculative, similarly to prefetching done by the page-cache
>>
>> -Page-in notifications from the kernel to the driver, such that we
>> can keep our secondary TLB in sync with the kernel page table without
>> incurring page faults.
>>
>> -Allowed and banned actions while in an MMU notifier callback.
We >> have already done some work on making the MMU notifiers sleepable >> [2], but there might be additional limitations, which we would like >> to discuss. >> >> -Hinting from the MMU notifiers as for the reason for the >> notification - for example we would like to react differently if a >> page was moved by NUMA migration vs. page being swapped out. >> >> [1] http://lwn.net/Articles/266320/ >> >> [2] http://comments.gmane.org/gmane.linux.kernel.mm/85002 >> >> Thanks, >> >> --Shachar >> >> -- >> To unsubscribe, send a message with 'unsubscribe linux-mm' in >> the body to majordomo@kvack.org. For more info on Linux MM, >> see: http://www.linux-mm.org/ . >> Don't email: email@kvack.org > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx158.postini.com [74.125.245.158]) by kanga.kvack.org (Postfix) with SMTP id 395A06B0006 for ; Tue, 9 Apr 2013 21:57:11 -0400 (EDT) Received: by mail-qe0-f54.google.com with SMTP id s14so4038919qeb.13 for ; Tue, 09 Apr 2013 18:57:10 -0700 (PDT) Message-ID: <5164C6EE.7020502@gmail.com> Date: Wed, 10 Apr 2013 09:57:02 +0800 From: Simon Jeons MIME-Version: 1.0 Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes References: <5114DF05.7070702@mellanox.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Jerome Glisse Cc: Michel Lespinasse , Shachar Raindel , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli , Roland Dreier , Haggai Eran , Or Gerlitz , Sagi Grimberg , Liran Liss Hi Jerome, On 02/10/2013 12:29 AM, Jerome Glisse wrote: > On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse wrote: >> On Fri, Feb 8, 2013 
at 3:18 AM, Shachar Raindel wrote: >>> Hi, >>> >>> We would like to present a reference implementation for safely sharing >>> memory pages from user space with the hardware, without pinning. >>> >>> We will be happy to hear the community feedback on our prototype >>> implementation, and suggestions for future improvements. >>> >>> We would also like to discuss adding features to the core MM subsystem to >>> assist hardware access to user memory without pinning. >> This sounds kinda scary TBH; however I do understand the need for such >> technology. >> >> I think one issue is that many MM developers are insufficiently aware >> of such developments; having a technology presentation would probably >> help there; but traditionally LSF/MM sessions are more interactive >> between developers who are already quite familiar with the technology. >> I think it would help if you could send in advance a detailed >> presentation of the problem and the proposed solutions (and then what >> they require of the MM layer) so people can be better prepared. >> >> And first I'd like to ask, aren't IOMMUs supposed to already largely >> solve this problem ? (probably a dumb question, but that just tells >> you how much you need to explain :) > For GPU the motivation is three fold. With the advance of GPU compute > and also with newer graphic program we see a massive increase in GPU > memory consumption. We easily can reach buffer that are bigger than > 1gbytes. So the first motivation is to directly use the memory the > user allocated through malloc in the GPU this avoid copying 1gbytes of > data with the cpu to the gpu buffer. The second and mostly important > to GPU compute is the use of GPU seamlessly with the CPU, in order to > achieve this you want the programmer to have a single address space on > the CPU and GPU. So that the same address point to the same object on > GPU as on the CPU. 
This would also be a tremendous cleaner design from
> driver point of view toward memory management.

When does the GPU consume memory?

A userspace process such as mplayer holds video data, and the GPU plays that
data using mplayer's memory, since the video data is loaded into the mplayer
process's address space? So the GPU code will call gup to take a reference on
that memory? Please correct me if my understanding is wrong. ;-)

> And last, the most important, with such big buffer (>1gbytes) the
> memory pinning is becoming way to expensive and also drastically
> reduce the freedom of the mm to free page for other process. Most of
> the time a small window (every thing is relative the window can be >
> 100mbytes not so small :)) of the object will be in use by the
> hardware. The hardware pagefault support would avoid the necessity to
> pin memory and thus offer greater flexibility. At the same time the
> driver wants to avoid page fault as much as possible this is why i
> would like to be able to give hint to the mm about range of address it
> should avoid freeing page (swapping them out).
>
> The iommu was designed with other goals, which were first isolate
> device from one another and restrict device access to allowed memory.
> Second allow to remap address that are above device address space
> limit. Lot of device can only address 24bit or 32bit of memory and
> with computer with several gbytes of memory suddenly lot of the page
> become unreachable to the hardware. The iommu allow to work around
> this by remapping those high page into address that the hardware can
> reach.
>
> The hardware page fault support is a new feature of iommu designed to
> help the os and driver to reduce memory pinning and also share address
> space. Thought i am sure there are other motivations that i am not
> even aware off or would think off.
>
> Btw i won't be at LSF/MM so a free good beer (or other beverage) on me
> to whoever takes note on this subject in next conf we run into each
> others.
> > Cheers, > Jerome > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx199.postini.com [74.125.245.199]) by kanga.kvack.org (Postfix) with SMTP id 039B46B0005 for ; Wed, 10 Apr 2013 16:49:24 -0400 (EDT) Received: by mail-ea0-f170.google.com with SMTP id a15so442706eae.29 for ; Wed, 10 Apr 2013 13:49:23 -0700 (PDT) Date: Wed, 10 Apr 2013 16:45:08 -0400 From: Jerome Glisse Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes Message-ID: <20130410204507.GA3958@gmail.com> References: <5114DF05.7070702@mellanox.com> <5163D119.80603@gmail.com> <20130409142156.GA1909@gmail.com> <5164C365.70302@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5164C365.70302@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Simon Jeons Cc: Michel Lespinasse , Shachar Raindel , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli , Roland Dreier , Haggai Eran , Or Gerlitz , Sagi Grimberg , Liran Liss On Wed, Apr 10, 2013 at 09:41:57AM +0800, Simon Jeons wrote: > Hi Jerome, > On 04/09/2013 10:21 PM, Jerome Glisse wrote: > >On Tue, Apr 09, 2013 at 04:28:09PM +0800, Simon Jeons wrote: > >>Hi Jerome, > >>On 02/10/2013 12:29 AM, Jerome Glisse wrote: > >>>On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse wrote: > >>>>On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel wrote: > >>>>>Hi, > >>>>> > >>>>>We would like to present a reference implementation for safely sharing > >>>>>memory pages from user space with the 
hardware, without pinning. > >>>>> > >>>>>We will be happy to hear the community feedback on our prototype > >>>>>implementation, and suggestions for future improvements. > >>>>> > >>>>>We would also like to discuss adding features to the core MM subsystem to > >>>>>assist hardware access to user memory without pinning. > >>>>This sounds kinda scary TBH; however I do understand the need for such > >>>>technology. > >>>> > >>>>I think one issue is that many MM developers are insufficiently aware > >>>>of such developments; having a technology presentation would probably > >>>>help there; but traditionally LSF/MM sessions are more interactive > >>>>between developers who are already quite familiar with the technology. > >>>>I think it would help if you could send in advance a detailed > >>>>presentation of the problem and the proposed solutions (and then what > >>>>they require of the MM layer) so people can be better prepared. > >>>> > >>>>And first I'd like to ask, aren't IOMMUs supposed to already largely > >>>>solve this problem ? (probably a dumb question, but that just tells > >>>>you how much you need to explain :) > >>>For GPU the motivation is three fold. With the advance of GPU compute > >>>and also with newer graphic program we see a massive increase in GPU > >>>memory consumption. We easily can reach buffer that are bigger than > >>>1gbytes. So the first motivation is to directly use the memory the > >>>user allocated through malloc in the GPU this avoid copying 1gbytes of > >>>data with the cpu to the gpu buffer. The second and mostly important > >>>to GPU compute is the use of GPU seamlessly with the CPU, in order to > >>>achieve this you want the programmer to have a single address space on > >>>the CPU and GPU. So that the same address point to the same object on > >>>GPU as on the CPU. This would also be a tremendous cleaner design from > >>>driver point of view toward memory management. 
> >>>
> >>>And last, the most important, with such big buffer (>1gbytes) the
> >>>memory pinning is becoming way to expensive and also drastically
> >>>reduce the freedom of the mm to free page for other process. Most of
> >>>the time a small window (every thing is relative the window can be >
> >>>100mbytes not so small :)) of the object will be in use by the
> >>>hardware. The hardware pagefault support would avoid the necessity to
> >>What's the meaning of hardware pagefault?
> >It's a PCIE extension (well it's a combination of extension that allow
> >that see http://www.pcisig.com/specifications/iov/ats/). Idea is that the
> >iommu can trigger a regular pagefault inside a process address space on
> >behalf of the hardware. The only iommu supporting that right now is the
> >AMD iommu v2 that you find on recent AMD platform.
>
> Why need hardware page fault? regular page fault is trigger by cpu
> mmu, correct?

Well, here I am abusing the term "regular page fault". The idea is that with
hardware page faults you don't need to pin memory or take a reference on a
page for the hardware to use it, so the kernel can free, as usual, pages that
would otherwise have been pinned. If the GPU is really using them, it will
trigger a fault through the iommu driver, which calls get_user_pages (which
can end up calling handle_mm_fault, like a regular page fault that happened on
the CPU).

One use case is a GPU working on a BIG dataset (think of a > GB buffer that
can be on disk and is just paged in when a chunk is needed). This is one
example, but GPUs usually work on very large datasets because that's what they
are good at.

>
> >>>pin memory and thus offer greater flexibility. At the same time the
> >>>driver wants to avoid page fault as much as possible this is why i
> >>>would like to be able to give hint to the mm about range of address it
> >>>should avoid freeing page (swapping them out).
> >>>
> >>>The iommu was designed with other goals, which were first isolate
> >>>device from one another and restrict device access to allowed memory.
> >>>Second allow to remap address that are above device address space
> >>When need remap address?
> >Some hardware have 24bits or 32bits address limitation, iommu allow to
> >remap memory that are above this range into the working range of the
> >device. Just as i said below. Or are your question different ?
>
> Oh, this method can replace bounce buffer, correct?

Yes, no bounce buffer. Bounce buffers are frowned upon in the GPU world
because you really really really don't want to use the dma sync buffer API.

Cheers,
Jerome

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx169.postini.com [74.125.245.169]) by kanga.kvack.org (Postfix) with SMTP id 9FE566B0006 for ; Wed, 10 Apr 2013 16:59:21 -0400 (EDT)
Received: by mail-qe0-f46.google.com with SMTP id nd7so531660qeb.33 for ; Wed, 10 Apr 2013 13:59:20 -0700 (PDT)
Date: Wed, 10 Apr 2013 16:55:59 -0400
From: Jerome Glisse
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes
Message-ID: <20130410205557.GB3958@gmail.com>
References: <5114DF05.7070702@mellanox.com> <5164C6EE.7020502@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <5164C6EE.7020502@gmail.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Simon Jeons
Cc: Michel Lespinasse , Shachar Raindel , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli , Roland Dreier , Haggai Eran , Or Gerlitz , Sagi Grimberg , Liran Liss

On Wed, Apr 10, 2013 at 09:57:02AM +0800, Simon Jeons wrote:
> Hi Jerome,
> On 02/10/2013 12:29 AM, Jerome Glisse
wrote: > >On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse wrote: > >>On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel wrote: > >>>Hi, > >>> > >>>We would like to present a reference implementation for safely sharing > >>>memory pages from user space with the hardware, without pinning. > >>> > >>>We will be happy to hear the community feedback on our prototype > >>>implementation, and suggestions for future improvements. > >>> > >>>We would also like to discuss adding features to the core MM subsystem to > >>>assist hardware access to user memory without pinning. > >>This sounds kinda scary TBH; however I do understand the need for such > >>technology. > >> > >>I think one issue is that many MM developers are insufficiently aware > >>of such developments; having a technology presentation would probably > >>help there; but traditionally LSF/MM sessions are more interactive > >>between developers who are already quite familiar with the technology. > >>I think it would help if you could send in advance a detailed > >>presentation of the problem and the proposed solutions (and then what > >>they require of the MM layer) so people can be better prepared. > >> > >>And first I'd like to ask, aren't IOMMUs supposed to already largely > >>solve this problem ? (probably a dumb question, but that just tells > >>you how much you need to explain :) > >For GPU the motivation is three fold. With the advance of GPU compute > >and also with newer graphic program we see a massive increase in GPU > >memory consumption. We easily can reach buffer that are bigger than > >1gbytes. So the first motivation is to directly use the memory the > >user allocated through malloc in the GPU this avoid copying 1gbytes of > >data with the cpu to the gpu buffer. The second and mostly important > >to GPU compute is the use of GPU seamlessly with the CPU, in order to > >achieve this you want the programmer to have a single address space on > >the CPU and GPU. 
So that the same address point to the same object on > >GPU as on the CPU. This would also be a tremendous cleaner design from > >driver point of view toward memory management. > > When GPU will comsume memory? > > The userspace process like mplayer will have video datas and GPU > will play this datas and use memory of mplayer since these video > datas load in mplayer process's address space? So GPU codes will > call gup to take a reference of memory? Please correct me if my > understanding is wrong. ;-)

The first target is not things such as video decompression; however, they too could benefit from it given an updated driver kernel API. When using iommu hardware page faults we don't call get_user_pages (gup), thus we don't take a reference on the page. That's the whole point of the hardware page fault: not taking a reference on the page.

Cheers,
Jerome Glisse

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
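[Editor's sketch] The contrast drawn in the reply above (gup takes a reference, a hardware page fault does not) can be illustrated with a small toy model. This is not kernel code: ToyMM, FaultingDevice and all method names are hypothetical, and the model assumes that reclaim may free only pages with no extra reference held on them.

```python
class ToyMM:
    """Toy model of a kernel tracking resident pages and extra references."""

    def __init__(self):
        # virtual page number -> number of extra references (pins) held
        self.resident = {}

    def fault_in(self, vpn):
        # Make the page resident WITHOUT taking an extra reference.
        self.resident.setdefault(vpn, 0)

    def pin(self, vpn):
        # gup-style access: make the page resident AND take a reference.
        self.fault_in(vpn)
        self.resident[vpn] += 1

    def unpin(self, vpn):
        # Drop a reference previously taken by pin().
        self.resident[vpn] -= 1

    def reclaim(self):
        # Only pages with no extra reference can be freed.
        freed = [vpn for vpn, refs in self.resident.items() if refs == 0]
        for vpn in freed:
            del self.resident[vpn]
        return sorted(freed)


class FaultingDevice:
    """Toy device that faults pages in on demand instead of pinning them."""

    def __init__(self, mm):
        self.mm = mm
        self.faults = 0

    def access(self, vpn):
        if vpn not in self.mm.resident:
            # Hardware page fault: ask the kernel to bring the page back.
            self.faults += 1
            self.mm.fault_in(vpn)
        return vpn
```

In this model a pinned page survives every reclaim() until unpin() is called, while a page touched by FaultingDevice can be reclaimed at any time and simply costs one more fault on the next access.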
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx203.postini.com [74.125.245.203]) by kanga.kvack.org (Postfix) with SMTP id C53866B0005 for ; Wed, 10 Apr 2013 23:37:43 -0400 (EDT) Received: by mail-qe0-f41.google.com with SMTP id b10so675933qen.0 for ; Wed, 10 Apr 2013 20:37:42 -0700 (PDT) Message-ID: <51662FFF.10103@gmail.com> Date: Thu, 11 Apr 2013 11:37:35 +0800 From: Simon Jeons MIME-Version: 1.0 Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes References: <5114DF05.7070702@mellanox.com> <5164C6EE.7020502@gmail.com> <20130410205557.GB3958@gmail.com> In-Reply-To: <20130410205557.GB3958@gmail.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Jerome Glisse Cc: Michel Lespinasse , Shachar Raindel , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli , Roland Dreier , Haggai Eran , Or Gerlitz , Sagi Grimberg , Liran Liss Hi Jerome, On 04/11/2013 04:55 AM, Jerome Glisse wrote: > On Wed, Apr 10, 2013 at 09:57:02AM +0800, Simon Jeons wrote: >> Hi Jerome, >> On 02/10/2013 12:29 AM, Jerome Glisse wrote: >>> On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse wrote: >>>> On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel wrote: >>>>> Hi, >>>>> >>>>> We would like to present a reference implementation for safely sharing >>>>> memory pages from user space with the hardware, without pinning. >>>>> >>>>> We will be happy to hear the community feedback on our prototype >>>>> implementation, and suggestions for future improvements. >>>>> >>>>> We would also like to discuss adding features to the core MM subsystem to >>>>> assist hardware access to user memory without pinning. >>>> This sounds kinda scary TBH; however I do understand the need for such >>>> technology. 
>>>> >>>> I think one issue is that many MM developers are insufficiently aware >>>> of such developments; having a technology presentation would probably >>>> help there; but traditionally LSF/MM sessions are more interactive >>>> between developers who are already quite familiar with the technology. >>>> I think it would help if you could send in advance a detailed >>>> presentation of the problem and the proposed solutions (and then what >>>> they require of the MM layer) so people can be better prepared. >>>> >>>> And first I'd like to ask, aren't IOMMUs supposed to already largely >>>> solve this problem ? (probably a dumb question, but that just tells >>>> you how much you need to explain :) >>> For GPU the motivation is three fold. With the advance of GPU compute >>> and also with newer graphic program we see a massive increase in GPU >>> memory consumption. We easily can reach buffer that are bigger than >>> 1gbytes. So the first motivation is to directly use the memory the >>> user allocated through malloc in the GPU this avoid copying 1gbytes of >>> data with the cpu to the gpu buffer. The second and mostly important >>> to GPU compute is the use of GPU seamlessly with the CPU, in order to >>> achieve this you want the programmer to have a single address space on >>> the CPU and GPU. So that the same address point to the same object on >>> GPU as on the CPU. This would also be a tremendous cleaner design from >>> driver point of view toward memory management. >> When GPU will comsume memory? >> >> The userspace process like mplayer will have video datas and GPU >> will play this datas and use memory of mplayer since these video >> datas load in mplayer process's address space? So GPU codes will >> call gup to take a reference of memory? Please correct me if my >> understanding is wrong. ;-) > First target is not thing such as video decompression, however they could > too benefit from it given updated driver kernel API. 
In case of using > iommu hardware page fault we don't call get_user_pages (gup) those we > don't take a reference on the page. That's the whole point of the hardware > pagefault, not taking reference on the page. mplayer process is running on normal CPU or GPU? chipset_integrated graphics will use normal memory and discrete graphics will use its own memory, correct? So the memory used by discrete graphics won't need gup, correct? > > Cheers, > Jerome Glisse -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx143.postini.com [74.125.245.143]) by kanga.kvack.org (Postfix) with SMTP id DA1BF6B0036 for ; Wed, 10 Apr 2013 23:42:13 -0400 (EDT) Received: by mail-ie0-f173.google.com with SMTP id 10so1407993ied.32 for ; Wed, 10 Apr 2013 20:42:13 -0700 (PDT) Message-ID: <5166310D.4020100@gmail.com> Date: Thu, 11 Apr 2013 11:42:05 +0800 From: Simon Jeons MIME-Version: 1.0 Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes References: <5114DF05.7070702@mellanox.com> <5163D119.80603@gmail.com> <20130409142156.GA1909@gmail.com> <5164C365.70302@gmail.com> <20130410204507.GA3958@gmail.com> In-Reply-To: <20130410204507.GA3958@gmail.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Jerome Glisse Cc: Michel Lespinasse , Shachar Raindel , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli , Roland Dreier , Haggai Eran , Or Gerlitz , Sagi Grimberg , Liran Liss Hi Jerome, On 04/11/2013 04:45 AM, Jerome Glisse wrote: > On Wed, Apr 10, 2013 at 09:41:57AM +0800, Simon Jeons wrote: >> Hi Jerome, >> On 04/09/2013 10:21 PM, Jerome Glisse wrote: >>> On Tue, Apr 09, 2013 at 
04:28:09PM +0800, Simon Jeons wrote: >>>> Hi Jerome, >>>> On 02/10/2013 12:29 AM, Jerome Glisse wrote: >>>>> On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse wrote: >>>>>> On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel wrote: >>>>>>> Hi, >>>>>>> >>>>>>> We would like to present a reference implementation for safely sharing >>>>>>> memory pages from user space with the hardware, without pinning. >>>>>>> >>>>>>> We will be happy to hear the community feedback on our prototype >>>>>>> implementation, and suggestions for future improvements. >>>>>>> >>>>>>> We would also like to discuss adding features to the core MM subsystem to >>>>>>> assist hardware access to user memory without pinning. >>>>>> This sounds kinda scary TBH; however I do understand the need for such >>>>>> technology. >>>>>> >>>>>> I think one issue is that many MM developers are insufficiently aware >>>>>> of such developments; having a technology presentation would probably >>>>>> help there; but traditionally LSF/MM sessions are more interactive >>>>>> between developers who are already quite familiar with the technology. >>>>>> I think it would help if you could send in advance a detailed >>>>>> presentation of the problem and the proposed solutions (and then what >>>>>> they require of the MM layer) so people can be better prepared. >>>>>> >>>>>> And first I'd like to ask, aren't IOMMUs supposed to already largely >>>>>> solve this problem ? (probably a dumb question, but that just tells >>>>>> you how much you need to explain :) >>>>> For GPU the motivation is three fold. With the advance of GPU compute >>>>> and also with newer graphic program we see a massive increase in GPU >>>>> memory consumption. We easily can reach buffer that are bigger than >>>>> 1gbytes. So the first motivation is to directly use the memory the >>>>> user allocated through malloc in the GPU this avoid copying 1gbytes of >>>>> data with the cpu to the gpu buffer. 
The second and mostly important >>>>> to GPU compute is the use of GPU seamlessly with the CPU, in order to >>>>> achieve this you want the programmer to have a single address space on >>>>> the CPU and GPU. So that the same address point to the same object on >>>>> GPU as on the CPU. This would also be a tremendous cleaner design from >>>>> driver point of view toward memory management. >>>>> >>>>> And last, the most important, with such big buffer (>1gbytes) the >>>>> memory pinning is becoming way to expensive and also drastically >>>>> reduce the freedom of the mm to free page for other process. Most of >>>>> the time a small window (every thing is relative the window can be > >>>>> 100mbytes not so small :)) of the object will be in use by the >>>>> hardware. The hardware pagefault support would avoid the necessity to >>>> What's the meaning of hardware pagefault? >>> It's a PCIE extension (well it's a combination of extension that allow >>> that see http://www.pcisig.com/specifications/iov/ats/). Idea is that the >>> iommu can trigger a regular pagefault inside a process address space on >>> behalf of the hardware. The only iommu supporting that right now is the >>> AMD iommu v2 that you find on recent AMD platform. >> Why need hardware page fault? regular page fault is trigger by cpu >> mmu, correct? > Well here i abuse regular page fault term. Idea is that with hardware page > fault you don't need to pin memory or take reference on page for hardware to > use it. So that kernel can free as usual page that would otherwise have been For the case when GPU need to pin memory, why GPU need grap the memory of normal process instead of allocating for itself? > pinned. If GPU is really using them it will trigger a fault through the iommu > driver that call get_user_pages (which can end up calling handle_mm_fault like > a regular page fault that happened on the CPU). This time normal process can't use this page, correct? 
So GPU and normal process both have their own pages? > One use case is GPU working on BIG dataset (think > GB buffer that can be on disk > and just paged in when a chunk is needed). This is one example, but usualy GPU > works on very large dataset because that's what they are good at. >>>>> pin memory and thus offer greater flexibility. At the same time the >>>>> driver wants to avoid page fault as much as possible this is why i >>>>> would like to be able to give hint to the mm about range of address it >>>>> should avoid freeing page (swapping them out). >>>>> >>>>> The iommu was designed with other goals, which were first isolate >>>>> device from one another and restrict device access to allowed memory. >>>>> Second allow to remap address that are above device address space >>>> When need remap address? >>> Some hardware have 24bits or 32bits address limitation, iommu allow to >>> remap memory that are above this range into the working range of the >>> device. Just as i said below. Or are your question different ? >> Oh, this method can replace bounce buffer, correct? > Yes, no bounce buffer, bounce buffer is frowned upon in GPU world because you > really really really don't want to use the dma sync buffer API. > > Cheers, > Jerome -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
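[Editor's sketch] The remapping use of the iommu quoted above (a device limited to 24 or 32 address bits reaching memory above that range, with no bounce buffer) can be modelled roughly as follows. ToyIommu, its methods and the 4 KiB page size are assumptions made for illustration; a real iommu driver works quite differently.

```python
PAGE_SIZE = 4096  # assumed page size for this toy model


class ToyIommu:
    """Toy remapping table: device-visible pages -> physical pages."""

    def __init__(self, device_addr_bits):
        self.limit = 1 << device_addr_bits   # first address the device cannot emit
        self.table = {}                      # device page number -> physical page number
        self.next_free = 0                   # next unused device page number

    def map(self, phys_addr):
        """Return a device address, inside the device's range, for phys_addr."""
        if (self.next_free + 1) * PAGE_SIZE > self.limit:
            raise MemoryError("device address space exhausted")
        dev_pfn = self.next_free
        self.next_free += 1
        self.table[dev_pfn] = phys_addr // PAGE_SIZE
        return dev_pfn * PAGE_SIZE + phys_addr % PAGE_SIZE

    def translate(self, dev_addr):
        """Translate a device address to a physical address.

        A KeyError means the device touched a page that was never mapped
        for it, which is the isolation role of the iommu.
        """
        if dev_addr >= self.limit:
            raise ValueError("address outside the device's range")
        phys_pfn = self.table[dev_addr // PAGE_SIZE]
        return phys_pfn * PAGE_SIZE + dev_addr % PAGE_SIZE
```

With ToyIommu(32), a buffer located at 6 GiB is handed to the device as a low address, and the data never has to be copied through a bounce buffer below 4 GiB.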
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx125.postini.com [74.125.245.125]) by kanga.kvack.org (Postfix) with SMTP id 9457D6B0005 for ; Thu, 11 Apr 2013 14:42:09 -0400 (EDT) Received: by mail-ee0-f43.google.com with SMTP id e50so898982eek.30 for ; Thu, 11 Apr 2013 11:42:08 -0700 (PDT) Date: Thu, 11 Apr 2013 14:38:29 -0400 From: Jerome Glisse Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes Message-ID: <20130411183828.GA6696@gmail.com> References: <5114DF05.7070702@mellanox.com> <5163D119.80603@gmail.com> <20130409142156.GA1909@gmail.com> <5164C365.70302@gmail.com> <20130410204507.GA3958@gmail.com> <5166310D.4020100@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5166310D.4020100@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Simon Jeons Cc: Michel Lespinasse , Shachar Raindel , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli , Roland Dreier , Haggai Eran , Or Gerlitz , Sagi Grimberg , Liran Liss On Thu, Apr 11, 2013 at 11:42:05AM +0800, Simon Jeons wrote: > Hi Jerome, > On 04/11/2013 04:45 AM, Jerome Glisse wrote: > >On Wed, Apr 10, 2013 at 09:41:57AM +0800, Simon Jeons wrote: > >>Hi Jerome, > >>On 04/09/2013 10:21 PM, Jerome Glisse wrote: > >>>On Tue, Apr 09, 2013 at 04:28:09PM +0800, Simon Jeons wrote: > >>>>Hi Jerome, > >>>>On 02/10/2013 12:29 AM, Jerome Glisse wrote: > >>>>>On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse wrote: > >>>>>>On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel wrote: > >>>>>>>Hi, > >>>>>>> > >>>>>>>We would like to present a reference implementation for safely sharing > >>>>>>>memory pages from user space with the hardware, without pinning. 
> >>>>>>> > >>>>>>>We will be happy to hear the community feedback on our prototype > >>>>>>>implementation, and suggestions for future improvements. > >>>>>>> > >>>>>>>We would also like to discuss adding features to the core MM subsystem to > >>>>>>>assist hardware access to user memory without pinning. > >>>>>>This sounds kinda scary TBH; however I do understand the need for such > >>>>>>technology. > >>>>>> > >>>>>>I think one issue is that many MM developers are insufficiently aware > >>>>>>of such developments; having a technology presentation would probably > >>>>>>help there; but traditionally LSF/MM sessions are more interactive > >>>>>>between developers who are already quite familiar with the technology. > >>>>>>I think it would help if you could send in advance a detailed > >>>>>>presentation of the problem and the proposed solutions (and then what > >>>>>>they require of the MM layer) so people can be better prepared. > >>>>>> > >>>>>>And first I'd like to ask, aren't IOMMUs supposed to already largely > >>>>>>solve this problem ? (probably a dumb question, but that just tells > >>>>>>you how much you need to explain :) > >>>>>For GPU the motivation is three fold. With the advance of GPU compute > >>>>>and also with newer graphic program we see a massive increase in GPU > >>>>>memory consumption. We easily can reach buffer that are bigger than > >>>>>1gbytes. So the first motivation is to directly use the memory the > >>>>>user allocated through malloc in the GPU this avoid copying 1gbytes of > >>>>>data with the cpu to the gpu buffer. The second and mostly important > >>>>>to GPU compute is the use of GPU seamlessly with the CPU, in order to > >>>>>achieve this you want the programmer to have a single address space on > >>>>>the CPU and GPU. So that the same address point to the same object on > >>>>>GPU as on the CPU. This would also be a tremendous cleaner design from > >>>>>driver point of view toward memory management. 
> >>>>> > >>>>>And last, the most important, with such big buffer (>1gbytes) the > >>>>>memory pinning is becoming way to expensive and also drastically > >>>>>reduce the freedom of the mm to free page for other process. Most of > >>>>>the time a small window (every thing is relative the window can be > > >>>>>100mbytes not so small :)) of the object will be in use by the > >>>>>hardware. The hardware pagefault support would avoid the necessity to > >>>>What's the meaning of hardware pagefault? > >>>It's a PCIE extension (well it's a combination of extension that allow > >>>that see http://www.pcisig.com/specifications/iov/ats/). Idea is that the > >>>iommu can trigger a regular pagefault inside a process address space on > >>>behalf of the hardware. The only iommu supporting that right now is the > >>>AMD iommu v2 that you find on recent AMD platform. > >>Why need hardware page fault? regular page fault is trigger by cpu > >>mmu, correct? > >Well here i abuse regular page fault term. Idea is that with hardware page > >fault you don't need to pin memory or take reference on page for hardware to > >use it. So that kernel can free as usual page that would otherwise have been > > For the case when GPU need to pin memory, why GPU need grap the > memory of normal process instead of allocating for itself? Pin memory is today world where gpu allocate its own memory (GB of memory) that disappear from kernel control ie kernel can no longer reclaim this memory it's lost memory (i had complain about that already from user than saw GB of memory vanish and couldn't understand why the GPU was using so much). Tomorrow world we want gpu to be able to access memory that the application allocated through a simple malloc and we want the kernel to be able to recycly any page at any time because of memory pressure or because kernel decide to do so. That's just what we want to do. To achieve so we are getting hw that can do pagefault. 
No change to kernel core mm code (some improvement might be made). > > >pinned. If GPU is really using them it will trigger a fault through the iommu > >driver that call get_user_pages (which can end up calling handle_mm_fault like > >a regular page fault that happened on the CPU). > > This time normal process can't use this page, correct? So GPU and > normal process both have their own pages? No, tomorrow world, gpu and cpu both using same page in same address space at the same time. Just like two cpu core each running a different thread of the same process would. Just consider the gpu as a new cpu core working in same address space using the same memory all at the same time as cpu. Cheers, Jerome -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx162.postini.com [74.125.245.162]) by kanga.kvack.org (Postfix) with SMTP id 565B66B0006 for ; Thu, 11 Apr 2013 14:51:30 -0400 (EDT) Received: by mail-ee0-f47.google.com with SMTP id t10so905321eei.34 for ; Thu, 11 Apr 2013 11:51:28 -0700 (PDT) Date: Thu, 11 Apr 2013 14:48:06 -0400 From: Jerome Glisse Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes Message-ID: <20130411184806.GB6696@gmail.com> References: <5114DF05.7070702@mellanox.com> <5164C6EE.7020502@gmail.com> <20130410205557.GB3958@gmail.com> <51662FFF.10103@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51662FFF.10103@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Simon Jeons Cc: Michel Lespinasse , Shachar Raindel , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli , Roland Dreier , Haggai Eran , Or Gerlitz , Sagi Grimberg , Liran Liss On Thu, Apr 11, 2013 at 
11:37:35AM +0800, Simon Jeons wrote: > Hi Jerome, > On 04/11/2013 04:55 AM, Jerome Glisse wrote: > >On Wed, Apr 10, 2013 at 09:57:02AM +0800, Simon Jeons wrote: > >>Hi Jerome, > >>On 02/10/2013 12:29 AM, Jerome Glisse wrote: > >>>On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse wrote: > >>>>On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel wrote: > >>>>>Hi, > >>>>> > >>>>>We would like to present a reference implementation for safely sharing > >>>>>memory pages from user space with the hardware, without pinning. > >>>>> > >>>>>We will be happy to hear the community feedback on our prototype > >>>>>implementation, and suggestions for future improvements. > >>>>> > >>>>>We would also like to discuss adding features to the core MM subsystem to > >>>>>assist hardware access to user memory without pinning. > >>>>This sounds kinda scary TBH; however I do understand the need for such > >>>>technology. > >>>> > >>>>I think one issue is that many MM developers are insufficiently aware > >>>>of such developments; having a technology presentation would probably > >>>>help there; but traditionally LSF/MM sessions are more interactive > >>>>between developers who are already quite familiar with the technology. > >>>>I think it would help if you could send in advance a detailed > >>>>presentation of the problem and the proposed solutions (and then what > >>>>they require of the MM layer) so people can be better prepared. > >>>> > >>>>And first I'd like to ask, aren't IOMMUs supposed to already largely > >>>>solve this problem ? (probably a dumb question, but that just tells > >>>>you how much you need to explain :) > >>>For GPU the motivation is three fold. With the advance of GPU compute > >>>and also with newer graphic program we see a massive increase in GPU > >>>memory consumption. We easily can reach buffer that are bigger than > >>>1gbytes. 
So the first motivation is to directly use the memory the > >>>user allocated through malloc in the GPU this avoid copying 1gbytes of > >>>data with the cpu to the gpu buffer. The second and mostly important > >>>to GPU compute is the use of GPU seamlessly with the CPU, in order to > >>>achieve this you want the programmer to have a single address space on > >>>the CPU and GPU. So that the same address point to the same object on > >>>GPU as on the CPU. This would also be a tremendous cleaner design from > >>>driver point of view toward memory management. > >>When GPU will comsume memory? > >> > >>The userspace process like mplayer will have video datas and GPU > >>will play this datas and use memory of mplayer since these video > >>datas load in mplayer process's address space? So GPU codes will > >>call gup to take a reference of memory? Please correct me if my > >>understanding is wrong. ;-) > >First target is not thing such as video decompression, however they could > >too benefit from it given updated driver kernel API. In case of using > >iommu hardware page fault we don't call get_user_pages (gup) those we > >don't take a reference on the page. That's the whole point of the hardware > >pagefault, not taking reference on the page. > > mplayer process is running on normal CPU or GPU? > chipset_integrated graphics will use normal memory and discrete > graphics will use its own memory, correct? So the memory used by > discrete graphics won't need gup, correct?

mplayer can decode video in software and use only the CPU. It can also use one of the acceleration APIs, such as VDPAU. In either case mplayer still opens the video file, allocates some memory with malloc, reads from the file into this memory, possibly does some preprocessing on it, and then memcpy's from this memory into memory allocated by the GPU driver. Now imagine a world where you don't have to memcpy for the GPU to access it.
Even if it is doable today, it is really not something you want to do, i.e. calling gup on a page and not releasing that page for minutes.

There are two kinds of integrated GPU. On x86, an integrated GPU should be considered a discrete GPU, because the BIOS steals a chunk of system RAM and turns it into fake VRAM. This stolen chunk is never under the control of the Linux kernel (from the mm point of view, the GPU kernel driver is in charge of it).

In any case, both discrete and integrated GPUs have their own page table or memory controller, and they map either system memory or video memory into it, sometimes interleaved (the 64k at address 0x100000 is in VRAM, but at address 0x100000+64k it is system memory pointing to some pages). So right now, any time we map a normal system RAM page we take a reference on it so that it does not go away. We decided not to use gup because it would break several kernel assumptions about anonymous memory in the GPU case. But we could use gup for short-lived memory transactions such as a memcpy from system RAM to VRAM (no matter whether it is fake VRAM or real VRAM).

Cheers,
Jerome

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
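[Editor's sketch] The GPU page table described in the message above, with VRAM and system RAM interleaved at 64k granularity and a reference taken on every mapped system page, might be modelled like this. ToyGpuPageTable, its methods and the string page identifiers are hypothetical; no real driver exposes this interface.

```python
GPU_PAGE = 64 * 1024  # 64k mapping granularity, as in the example in the message


class ToyGpuPageTable:
    """Toy GPU page table whose entries point at VRAM or at system RAM."""

    def __init__(self):
        self.entries = {}    # GPU page index -> ("vram" | "sysram", backing id)
        self.sys_refs = {}   # system page id -> references held by the driver

    def map_vram(self, gpu_addr, vram_offset):
        # VRAM is owned by the GPU driver, so no mm reference is needed.
        # Assumes the slot at gpu_addr is currently unmapped.
        self.entries[gpu_addr // GPU_PAGE] = ("vram", vram_offset)

    def map_sysram(self, gpu_addr, sys_page):
        # Mapping a system RAM page takes a reference so it cannot go away.
        # Assumes the slot at gpu_addr is currently unmapped.
        self.entries[gpu_addr // GPU_PAGE] = ("sysram", sys_page)
        self.sys_refs[sys_page] = self.sys_refs.get(sys_page, 0) + 1

    def unmap(self, gpu_addr):
        kind, backing = self.entries.pop(gpu_addr // GPU_PAGE)
        if kind == "sysram":
            # Drop the reference taken at map time.
            self.sys_refs[backing] -= 1
            if self.sys_refs[backing] == 0:
                del self.sys_refs[backing]

    def lookup(self, gpu_addr):
        return self.entries[gpu_addr // GPU_PAGE]
```

Mapping 0x100000 to VRAM and 0x100000+64k to a system page reproduces the interleaving from the message; only the system page shows up in sys_refs, which is the memory the mm can no longer reclaim while the mapping exists.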
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx114.postini.com [74.125.245.114]) by kanga.kvack.org (Postfix) with SMTP id 718076B0070 for ; Thu, 11 Apr 2013 21:54:22 -0400 (EDT) Received: by mail-pa0-f42.google.com with SMTP id kq13so1195148pab.15 for ; Thu, 11 Apr 2013 18:54:21 -0700 (PDT) Message-ID: <51676941.3050802@gmail.com> Date: Fri, 12 Apr 2013 09:54:09 +0800 From: Simon Jeons MIME-Version: 1.0 Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes References: <5114DF05.7070702@mellanox.com> <5163D119.80603@gmail.com> <20130409142156.GA1909@gmail.com> <5164C365.70302@gmail.com> <20130410204507.GA3958@gmail.com> <5166310D.4020100@gmail.com> <20130411183828.GA6696@gmail.com> In-Reply-To: <20130411183828.GA6696@gmail.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Jerome Glisse Cc: Michel Lespinasse , Shachar Raindel , lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli , Roland Dreier , Haggai Eran , Or Gerlitz , Sagi Grimberg , Liran Liss Hi Jerome, On 04/12/2013 02:38 AM, Jerome Glisse wrote: > On Thu, Apr 11, 2013 at 11:42:05AM +0800, Simon Jeons wrote: >> Hi Jerome, >> On 04/11/2013 04:45 AM, Jerome Glisse wrote: >>> On Wed, Apr 10, 2013 at 09:41:57AM +0800, Simon Jeons wrote: >>>> Hi Jerome, >>>> On 04/09/2013 10:21 PM, Jerome Glisse wrote: >>>>> On Tue, Apr 09, 2013 at 04:28:09PM +0800, Simon Jeons wrote: >>>>>> Hi Jerome, >>>>>> On 02/10/2013 12:29 AM, Jerome Glisse wrote: >>>>>>> On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse wrote: >>>>>>>> On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel wrote: >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> We would like to present a reference implementation for safely sharing >>>>>>>>> memory pages from user space with the hardware, without pinning. 
>>>>>>>>> >>>>>>>>> We will be happy to hear the community feedback on our prototype >>>>>>>>> implementation, and suggestions for future improvements. >>>>>>>>> >>>>>>>>> We would also like to discuss adding features to the core MM subsystem to >>>>>>>>> assist hardware access to user memory without pinning. >>>>>>>> This sounds kinda scary TBH; however I do understand the need for such >>>>>>>> technology. >>>>>>>> >>>>>>>> I think one issue is that many MM developers are insufficiently aware >>>>>>>> of such developments; having a technology presentation would probably >>>>>>>> help there; but traditionally LSF/MM sessions are more interactive >>>>>>>> between developers who are already quite familiar with the technology. >>>>>>>> I think it would help if you could send in advance a detailed >>>>>>>> presentation of the problem and the proposed solutions (and then what >>>>>>>> they require of the MM layer) so people can be better prepared. >>>>>>>> >>>>>>>> And first I'd like to ask, aren't IOMMUs supposed to already largely >>>>>>>> solve this problem ? (probably a dumb question, but that just tells >>>>>>>> you how much you need to explain :) >>>>>>> For GPU the motivation is three fold. With the advance of GPU compute >>>>>>> and also with newer graphic program we see a massive increase in GPU >>>>>>> memory consumption. We easily can reach buffer that are bigger than >>>>>>> 1gbytes. So the first motivation is to directly use the memory the >>>>>>> user allocated through malloc in the GPU this avoid copying 1gbytes of >>>>>>> data with the cpu to the gpu buffer. The second and mostly important >>>>>>> to GPU compute is the use of GPU seamlessly with the CPU, in order to >>>>>>> achieve this you want the programmer to have a single address space on >>>>>>> the CPU and GPU. So that the same address point to the same object on >>>>>>> GPU as on the CPU. This would also be a tremendous cleaner design from >>>>>>> driver point of view toward memory management. 
>>>>>>> >>>>>>> And last, the most important, with such big buffer (>1gbytes) the >>>>>>> memory pinning is becoming way to expensive and also drastically >>>>>>> reduce the freedom of the mm to free page for other process. Most of >>>>>>> the time a small window (every thing is relative the window can be > >>>>>>> 100mbytes not so small :)) of the object will be in use by the >>>>>>> hardware. The hardware pagefault support would avoid the necessity to >>>>>> What's the meaning of hardware pagefault? >>>>> It's a PCIE extension (well it's a combination of extension that allow >>>>> that see http://www.pcisig.com/specifications/iov/ats/). Idea is that the >>>>> iommu can trigger a regular pagefault inside a process address space on >>>>> behalf of the hardware. The only iommu supporting that right now is the >>>>> AMD iommu v2 that you find on recent AMD platform. >>>> Why need hardware page fault? regular page fault is trigger by cpu >>>> mmu, correct? >>> Well here i abuse regular page fault term. Idea is that with hardware page >>> fault you don't need to pin memory or take reference on page for hardware to >>> use it. So that kernel can free as usual page that would otherwise have been >> For the case when GPU need to pin memory, why GPU need grap the >> memory of normal process instead of allocating for itself? > Pin memory is today world where gpu allocate its own memory (GB of memory) > that disappear from kernel control ie kernel can no longer reclaim this > memory it's lost memory (i had complain about that already from user than > saw GB of memory vanish and couldn't understand why the GPU was using so > much). > > Tomorrow world we want gpu to be able to access memory that the application > allocated through a simple malloc and we want the kernel to be able to > recycly any page at any time because of memory pressure or because kernel > decide to do so. > > That's just what we want to do. To achieve so we are getting hw that can do > pagefault. 
No change to kernel core mm code (some improvement might be made). The memory disappear since you have a reference(gup) against it, correct? Tomorrow world you want the page fault trigger through iommu driver that call get_user_pages, it also will take a reference(since gup is called), isn't it? Anyway, assume tomorrow world doesn't take a reference, we don't need care page which used by GPU is reclaimed? > >>> pinned. If GPU is really using them it will trigger a fault through the iommu >>> driver that call get_user_pages (which can end up calling handle_mm_fault like >>> a regular page fault that happened on the CPU). >> This time normal process can't use this page, correct? So GPU and >> normal process both have their own pages? > No, tomorrow world, gpu and cpu both using same page in same address space at > the same time. Just like two cpu core each running a different thread of > the same process would. Just consider the gpu as a new cpu core working in same > address space using the same memory all at the same time as cpu. > > Cheers, > Jerome -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
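[Editor's sketch] On the question of what happens when a page in use by the device is reclaimed without a reference being held: the original proposal in this thread relies on mmu notifiers to tell the hardware when a mapping changes, after which the device faults again on its next access. The toy model below illustrates only that idea. ToyKernelMM, ToyDeviceTlb and the callback scheme are assumptions and do not reflect the real mmu notifier API.

```python
class ToyKernelMM:
    """Toy CPU-side page table with notifier-style invalidation callbacks."""

    def __init__(self):
        self.page_table = {}   # virtual page number -> physical frame
        self.listeners = []    # callbacks, in the spirit of mmu notifiers
        self.next_frame = 0

    def register_notifier(self, callback):
        self.listeners.append(callback)

    def handle_fault(self, vpn):
        # Allocate a frame on first touch, then return the mapping.
        if vpn not in self.page_table:
            self.page_table[vpn] = self.next_frame
            self.next_frame += 1
        return self.page_table[vpn]

    def reclaim(self, vpn):
        # Every secondary MMU must drop its translation before the page is freed.
        for callback in self.listeners:
            callback(vpn)
        self.page_table.pop(vpn, None)


class ToyDeviceTlb:
    """Toy secondary TLB kept in sync through invalidation callbacks."""

    def __init__(self, mm):
        self.mm = mm
        self.tlb = {}
        self.faults = 0
        mm.register_notifier(self.invalidate)

    def invalidate(self, vpn):
        self.tlb.pop(vpn, None)

    def access(self, vpn):
        if vpn not in self.tlb:
            # Device page fault: ask the kernel to resolve the address.
            self.faults += 1
            self.tlb[vpn] = self.mm.handle_fault(vpn)
        return self.tlb[vpn]
```

The device never holds a reference in this model. Its cached translation is removed before the page goes away, so a stale frame is never used and the cost of reclaim is one extra device fault.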
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx204.postini.com [74.125.245.204]) by kanga.kvack.org (Postfix) with SMTP id C87B16B0027 for ; Thu, 11 Apr 2013 22:12:02 -0400 (EDT) Message-ID: <51676D6B.1020202@redhat.com> Date: Thu, 11 Apr 2013 22:11:55 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes References: <5114DF05.7070702@mellanox.com> <5163D119.80603@gmail.com> <20130409142156.GA1909@gmail.com> <5164C365.70302@gmail.com> <20130410204507.GA3958@gmail.com> <5166310D.4020100@gmail.com> <20130411183828.GA6696@gmail.com> <51676941.3050802@gmail.com> In-Reply-To: <51676941.3050802@gmail.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Simon Jeons Cc: Jerome Glisse , Andrea Arcangeli , Haggai Eran , lsf-pc@lists.linux-foundation.org, Liran Liss , Shachar Raindel , Sagi Grimberg , Roland Dreier , linux-mm@kvack.org, Or Gerlitz , Michel Lespinasse On 04/11/2013 09:54 PM, Simon Jeons wrote: > Hi Jerome, > On 04/12/2013 02:38 AM, Jerome Glisse wrote: >> On Thu, Apr 11, 2013 at 11:42:05AM +0800, Simon Jeons wrote: >> Tomorrow world we want gpu to be able to access memory that the >> application >> allocated through a simple malloc and we want the kernel to be able to >> recycly any page at any time because of memory pressure or because kernel >> decide to do so. >> >> That's just what we want to do. To achieve so we are getting hw that >> can do >> pagefault. No change to kernel core mm code (some improvement might be >> made). > > The memory disappear since you have a reference(gup) against it, > correct? Tomorrow world you want the page fault trigger through iommu > driver that call get_user_pages, it also will take a reference(since gup > is called), isn't it? 
> Anyway, assume tomorrow world doesn't take a
> reference, we don't need care page which used by GPU is reclaimed?

The GPU and CPU may each have a different page table format.
The kernel will need to keep both in sync. That is one of the
things this discussion is about.

For performance reasons, it may also make sense to locate some
of the application's data in the GPU's own memory, so it does
not have to cross the PCIE bus every time it needs to load the
data. That requires memory coherency code in the kernel.

--
All rights reversed

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx141.postini.com [74.125.245.141])
 by kanga.kvack.org (Postfix) with SMTP id 4C08A6B0006
 for ; Thu, 11 Apr 2013 22:57:34 -0400 (EDT)
Received: by mail-qe0-f53.google.com with SMTP id q19so1312856qeb.12
 for ; Thu, 11 Apr 2013 19:57:33 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <51676941.3050802@gmail.com>
References: <5114DF05.7070702@mellanox.com> <5163D119.80603@gmail.com>
 <20130409142156.GA1909@gmail.com> <5164C365.70302@gmail.com>
 <20130410204507.GA3958@gmail.com> <5166310D.4020100@gmail.com>
 <20130411183828.GA6696@gmail.com> <51676941.3050802@gmail.com>
Date: Thu, 11 Apr 2013 22:57:33 -0400
Message-ID:
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages,
 hardware access to the CPU page tables of user processes
From: Jerome Glisse
Content-Type: multipart/alternative; boundary=047d7bd7693a8c64b604da21103b
Sender: owner-linux-mm@kvack.org
List-ID:
To: Simon Jeons
Cc: Michel Lespinasse, Shachar Raindel, lsf-pc@lists.linux-foundation.org,
 linux-mm@kvack.org, Andrea Arcangeli, Roland Dreier, Haggai Eran,
 Or Gerlitz, Sagi Grimberg, Liran Liss

--047d7bd7693a8c64b604da21103b
Content-Type: text/plain; charset=ISO-8859-1

On Thu, Apr 11, 2013 at 9:54 PM, Simon Jeons wrote:

> Hi Jerome,
>
> On 04/12/2013 02:38 AM, Jerome Glisse wrote:
>
>> On Thu, Apr 11, 2013 at 11:42:05AM +0800, Simon Jeons wrote:
>>
>>> Hi Jerome,
>>> On 04/11/2013 04:45 AM, Jerome Glisse wrote:
>>>
>>>> On Wed, Apr 10, 2013 at 09:41:57AM +0800, Simon Jeons wrote:
>>>>
>>>>> Hi Jerome,
>>>>> On 04/09/2013 10:21 PM, Jerome Glisse wrote:
>>>>>
>>>>>> On Tue, Apr 09, 2013 at 04:28:09PM +0800, Simon Jeons wrote:
>>>>>>
>>>>>>> Hi Jerome,
>>>>>>> On 02/10/2013 12:29 AM, Jerome Glisse wrote:
>>>>>>>
>>>>>>>> On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse <walken@google.com> wrote:
>>>>>>>>
>>>>>>>>> On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel <raindel@mellanox.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> We would like to present a reference implementation for safely sharing
>>>>>>>>>> memory pages from user space with the hardware, without pinning.
>>>>>>>>>>
>>>>>>>>>> We will be happy to hear the community feedback on our prototype
>>>>>>>>>> implementation, and suggestions for future improvements.
>>>>>>>>>>
>>>>>>>>>> We would also like to discuss adding features to the core MM subsystem to
>>>>>>>>>> assist hardware access to user memory without pinning.
>>>>>>>>>>
>>>>>>>>> This sounds kinda scary TBH; however I do understand the need for such
>>>>>>>>> technology.
>>>>>>>>>
>>>>>>>>> I think one issue is that many MM developers are insufficiently aware
>>>>>>>>> of such developments; having a technology presentation would probably
>>>>>>>>> help there; but traditionally LSF/MM sessions are more interactive
>>>>>>>>> between developers who are already quite familiar with the technology.
>>>>>>>>> I think it would help if you could send in advance a detailed
>>>>>>>>> presentation of the problem and the proposed solutions (and then what
>>>>>>>>> they require of the MM layer) so people can be better prepared.
>>>>>>>>>
>>>>>>>>> And first I'd like to ask, aren't IOMMUs supposed to already largely
>>>>>>>>> solve this problem ? (probably a dumb question, but that just tells
>>>>>>>>> you how much you need to explain :)
>>>>>>>>>
>>>>>>>> For GPU the motivation is three fold. With the advance of GPU compute
>>>>>>>> and also with newer graphic program we see a massive increase in GPU
>>>>>>>> memory consumption. We easily can reach buffer that are bigger than
>>>>>>>> 1gbytes. So the first motivation is to directly use the memory the
>>>>>>>> user allocated through malloc in the GPU this avoid copying 1gbytes of
>>>>>>>> data with the cpu to the gpu buffer. The second and mostly important
>>>>>>>> to GPU compute is the use of GPU seamlessly with the CPU, in order to
>>>>>>>> achieve this you want the programmer to have a single address space on
>>>>>>>> the CPU and GPU. So that the same address point to the same object on
>>>>>>>> GPU as on the CPU. This would also be a tremendous cleaner design from
>>>>>>>> driver point of view toward memory management.
>>>>>>>>
>>>>>>>> And last, the most important, with such big buffer (>1gbytes) the
>>>>>>>> memory pinning is becoming way to expensive and also drastically
>>>>>>>> reduce the freedom of the mm to free page for other process. Most of
>>>>>>>> the time a small window (every thing is relative the window can be >
>>>>>>>> 100mbytes not so small :)) of the object will be in use by the
>>>>>>>> hardware. The hardware pagefault support would avoid the necessity to
>>>>>>>>
>>>>>>> What's the meaning of hardware pagefault?
>>>>>>>
>>>>>> It's a PCIE extension (well it's a combination of extension that allow
>>>>>> that see http://www.pcisig.com/specifications/iov/ats/). Idea is that the
>>>>>> iommu can trigger a regular pagefault inside a process address space on
>>>>>> behalf of the hardware.
>>>>>> The only iommu supporting that right now is the
>>>>>> AMD iommu v2 that you find on recent AMD platform.
>>>>>>
>>>>> Why need hardware page fault? regular page fault is trigger by cpu
>>>>> mmu, correct?
>>>>>
>>>> Well here i abuse regular page fault term. Idea is that with hardware page
>>>> fault you don't need to pin memory or take reference on page for hardware to
>>>> use it. So that kernel can free as usual page that would otherwise have been
>>>>
>>> For the case when GPU need to pin memory, why GPU need grap the
>>> memory of normal process instead of allocating for itself?
>>>
>> Pin memory is today world where gpu allocate its own memory (GB of memory)
>> that disappear from kernel control ie kernel can no longer reclaim this
>> memory it's lost memory (i had complain about that already from user than
>> saw GB of memory vanish and couldn't understand why the GPU was using so
>> much).
>>
>> Tomorrow world we want gpu to be able to access memory that the application
>> allocated through a simple malloc and we want the kernel to be able to
>> recycly any page at any time because of memory pressure or because kernel
>> decide to do so.
>>
>> That's just what we want to do. To achieve so we are getting hw that can do
>> pagefault. No change to kernel core mm code (some improvement might be made).
>>
>
> The memory disappear since you have a reference(gup) against it, correct?
> Tomorrow world you want the page fault trigger through iommu driver that
> call get_user_pages, it also will take a reference(since gup is called),
> isn't it? Anyway, assume tomorrow world doesn't take a reference, we don't
> need care page which used by GPU is reclaimed?
>
>
Right now the code uses gup because it's convenient, but it drops the
reference right after the fault. So the reference is held only for a
short period of time.

No, you don't need to care about reclaim, thanks to the mmu notifier:
the iommu driver registers a notifier, and before a page is removed the
notifier is called, so the driver gets the invalidate event,
invalidates the device TLB, and things go on. If the GPU accesses the
page again, a new page fault happens and a new page is allocated.

All this code is upstream in the Linux kernel; just read it. There is
just no device that uses it yet.

That being said, we will want improvements so that pages that are hot
in the device are not reclaimed. But it can work without such
improvements.

Cheers,
Jerome
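
The sequence described in this reply (short-lived fault handling, an
invalidate callback before reclaim, then a fresh device fault) can be
sketched as a small user-space toy model in Python. This is only an
illustration under stated assumptions, not kernel code: ToyMM,
ToyDevice and all of their methods are hypothetical names invented
here, "reclaim" stands in for whatever the kernel does when it frees a
page, and the device holds no reference on any page.

```python
class ToyMM:
    """Toy stand-in for the CPU side: a page table plus notifier callbacks."""

    def __init__(self):
        self.page_table = {}   # virtual page number -> frame number
        self.notifiers = []    # callbacks run before a mapping is removed
        self.next_frame = 0

    def register_notifier(self, callback):
        self.notifiers.append(callback)

    def fault_in(self, vpn):
        """Handle a fault: allocate a frame if the page is not resident."""
        if vpn not in self.page_table:
            self.page_table[vpn] = self.next_frame
            self.next_frame += 1
        return self.page_table[vpn]

    def reclaim(self, vpn):
        """Evict a page, telling every notifier before the mapping goes away."""
        if vpn in self.page_table:
            for callback in self.notifiers:
                callback(vpn)
            del self.page_table[vpn]


class ToyDevice:
    """Toy device with its own TLB; it never pins or references a page."""

    def __init__(self, mm):
        self.mm = mm
        self.tlb = {}          # cached translations, vpn -> frame number
        self.faults = 0
        mm.register_notifier(self.invalidate)

    def invalidate(self, vpn):
        # Invalidate event: drop the cached translation, if there is one.
        self.tlb.pop(vpn, None)

    def access(self, vpn):
        # A TLB miss is the "hardware page fault": ask the mm to resolve it.
        if vpn not in self.tlb:
            self.faults += 1
            self.tlb[vpn] = self.mm.fault_in(vpn)
        return self.tlb[vpn]


mm = ToyMM()
gpu = ToyDevice(mm)
first = gpu.access(7)    # device fault, page allocated
gpu.access(7)            # served from the device TLB, no fault
mm.reclaim(7)            # notifier fires, device TLB entry is dropped
second = gpu.access(7)   # device faults again and gets a new page
print(first, second, gpu.faults)
```

In this model the kernel side stays free to reclaim at any time because
the only state the device keeps is a cache that the notifier can drop.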
--047d7bd7693a8c64b604da21103b--

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx118.postini.com [74.125.245.118])
 by kanga.kvack.org (Postfix) with SMTP id ABFF36B0006
 for ; Thu, 11 Apr 2013 23:13:24 -0400 (EDT)
Received: by mail-da0-f45.google.com with SMTP id v40so952275dad.4
 for ; Thu, 11 Apr 2013 20:13:23 -0700 (PDT)
Message-ID: <51677BCA.2050002@gmail.com>
Date: Fri, 12 Apr 2013 11:13:14 +0800
From: Simon Jeons
MIME-Version: 1.0
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages,
 hardware access to the CPU page tables of user processes
References: <5114DF05.7070702@mellanox.com> <5164C6EE.7020502@gmail.com>
 <20130410205557.GB3958@gmail.com> <51662FFF.10103@gmail.com>
 <20130411184806.GB6696@gmail.com>
In-Reply-To: <20130411184806.GB6696@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jerome Glisse
Cc: Michel Lespinasse, Shachar Raindel, lsf-pc@lists.linux-foundation.org,
 linux-mm@kvack.org, Andrea Arcangeli, Roland Dreier, Haggai Eran,
 Or Gerlitz, Sagi Grimberg, Liran Liss

Hi Jerome,
On 04/12/2013 02:48 AM, Jerome Glisse wrote:
> On Thu, Apr 11, 2013 at 11:37:35AM +0800, Simon Jeons wrote:
>> Hi Jerome,
>> On 04/11/2013 04:55 AM, Jerome Glisse wrote:
>>> On Wed, Apr 10, 2013 at 09:57:02AM +0800, Simon Jeons wrote:
>>>> Hi Jerome,
>>>> On 02/10/2013 12:29 AM, Jerome Glisse wrote:
>>>>> On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse wrote:
>>>>>> On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> We would like to present a reference implementation for safely sharing
>>>>>>> memory pages from user space with the hardware, without pinning.
>>>>>>>
>>>>>>> We will be happy to hear the community feedback on our prototype
>>>>>>> implementation, and suggestions for future improvements.
>>>>>>>
>>>>>>> We would also like to discuss adding features to the core MM subsystem to
>>>>>>> assist hardware access to user memory without pinning.
>>>>>> This sounds kinda scary TBH; however I do understand the need for such
>>>>>> technology.
>>>>>>
>>>>>> I think one issue is that many MM developers are insufficiently aware
>>>>>> of such developments; having a technology presentation would probably
>>>>>> help there; but traditionally LSF/MM sessions are more interactive
>>>>>> between developers who are already quite familiar with the technology.
>>>>>> I think it would help if you could send in advance a detailed
>>>>>> presentation of the problem and the proposed solutions (and then what
>>>>>> they require of the MM layer) so people can be better prepared.
>>>>>>
>>>>>> And first I'd like to ask, aren't IOMMUs supposed to already largely
>>>>>> solve this problem ? (probably a dumb question, but that just tells
>>>>>> you how much you need to explain :)
>>>>> For GPU the motivation is three fold. With the advance of GPU compute
>>>>> and also with newer graphic program we see a massive increase in GPU
>>>>> memory consumption. We easily can reach buffer that are bigger than
>>>>> 1gbytes. So the first motivation is to directly use the memory the
>>>>> user allocated through malloc in the GPU this avoid copying 1gbytes of
>>>>> data with the cpu to the gpu buffer. The second and mostly important
>>>>> to GPU compute is the use of GPU seamlessly with the CPU, in order to
>>>>> achieve this you want the programmer to have a single address space on
>>>>> the CPU and GPU. So that the same address point to the same object on
>>>>> GPU as on the CPU. This would also be a tremendous cleaner design from
>>>>> driver point of view toward memory management.
>>>> When GPU will comsume memory?
>>>>
>>>> The userspace process like mplayer will have video datas and GPU
>>>> will play this datas and use memory of mplayer since these video
>>>> datas load in mplayer process's address space? So GPU codes will
>>>> call gup to take a reference of memory? Please correct me if my
>>>> understanding is wrong. ;-)
>>> First target is not thing such as video decompression, however they could
>>> too benefit from it given updated driver kernel API. In case of using
>>> iommu hardware page fault we don't call get_user_pages (gup) those we
>>> don't take a reference on the page. That's the whole point of the hardware
>>> pagefault, not taking reference on the page.
>> mplayer process is running on normal CPU or GPU?
>> chipset_integrated graphics will use normal memory and discrete
>> graphics will use its own memory, correct? So the memory used by
>> discrete graphics won't need gup, correct?
> mplayer can decode video in software an only use the cpu. It can also use
> one of the accleration API such as VDPAU. In any case mplayer is still opening
> the video file allocating some memory with malloc, reading from file into
> this memory eventually do some preprocessing on that memory and then
> memcpy from this memory to memory allocated by the gpu driver.
>
> No imagine a world where you don't have to memcpy so that the gpu can access
> it. Even if it's doable today it's really not something you want todo, ie
> gup on page and not releasing page for minutes.
>
> There is two kind of integrated GPU, on x86 integrated GPU should be considered
> as discrete GPU because BIOS steal a chunk of system ram and transform it in
> fake vram. This stolen chunk is never ever under the control of the linux kernel
> (from mm pov the gpu kernel driver is in charge of it).

I configure the integrated GPU in the BIOS during system boot; it seems
that we can preallocate memory for the integrated GPU there. Is this the
memory you mentioned?

>
> In any case both discrete GPU and integrated GPU have their own page table or

A discrete GPU will not use normal memory even if its own memory is
exhausted, correct?

> memory controller and they map system memory in it or video memory, sometime
> interleaving (at address 0x100000 64k is in vram but at address 0x10000+64k it's
> system memory pointing to some pages).
>
> So right now any time we map a normal system ram page we take a reference on it
> so it does not goes away. We decided to not use gup because it will break several
> kernel assumption on anonymous memory in case of GPU. But we could use gup for
> short lived memory transaction like memcpy from system ram to vram (no matter if
> it's fake vram or real vram).
>
> Cheers,
> Jerome

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx165.postini.com [74.125.245.165])
 by kanga.kvack.org (Postfix) with SMTP id C86226B0006
 for ; Thu, 11 Apr 2013 23:21:18 -0400 (EDT)
Received: by mail-qe0-f45.google.com with SMTP id 1so1315622qee.32
 for ; Thu, 11 Apr 2013 20:21:17 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <51677BCA.2050002@gmail.com>
References: <5114DF05.7070702@mellanox.com> <5164C6EE.7020502@gmail.com>
 <20130410205557.GB3958@gmail.com> <51662FFF.10103@gmail.com>
 <20130411184806.GB6696@gmail.com> <51677BCA.2050002@gmail.com>
Date: Thu, 11 Apr 2013 23:21:17 -0400
Message-ID:
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages,
 hardware access to the CPU page tables of user processes
From: Jerome Glisse
Content-Type: multipart/alternative; boundary=047d7b5db86c7453b104da216565
Sender: owner-linux-mm@kvack.org
List-ID:
To: Simon Jeons
Cc: Michel Lespinasse, Shachar Raindel, lsf-pc@lists.linux-foundation.org,
 linux-mm@kvack.org, Andrea Arcangeli, Roland Dreier, Haggai Eran,
 Or Gerlitz, Sagi Grimberg, Liran Liss

--047d7b5db86c7453b104da216565
Content-Type: text/plain; charset=ISO-8859-1

On Thu, Apr 11, 2013 at 11:13 PM, Simon Jeons wrote:

> Hi Jerome,
>
> On 04/12/2013 02:48 AM, Jerome Glisse wrote:
>
>> On Thu, Apr 11, 2013 at 11:37:35AM +0800, Simon Jeons wrote:
>>
>>> Hi Jerome,
>>> On 04/11/2013 04:55 AM, Jerome Glisse wrote:
>>>
>>>> On Wed, Apr 10, 2013 at 09:57:02AM +0800, Simon Jeons wrote:
>>>>
>>>>> Hi Jerome,
>>>>> On 02/10/2013 12:29 AM, Jerome Glisse wrote:
>>>>>
>>>>>> On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse wrote:
>>>>>>
>>>>>>> On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel <raindel@mellanox.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We would like to present a reference implementation for safely sharing
>>>>>>>> memory pages from user space with the hardware, without pinning.
>>>>>>>>
>>>>>>>> We will be happy to hear the community feedback on our prototype
>>>>>>>> implementation, and suggestions for future improvements.
>>>>>>>>
>>>>>>>> We would also like to discuss adding features to the core MM subsystem to
>>>>>>>> assist hardware access to user memory without pinning.
>>>>>>>>
>>>>>>> This sounds kinda scary TBH; however I do understand the need for such
>>>>>>> technology.
>>>>>>>
>>>>>>> I think one issue is that many MM developers are insufficiently aware
>>>>>>> of such developments; having a technology presentation would probably
>>>>>>> help there; but traditionally LSF/MM sessions are more interactive
>>>>>>> between developers who are already quite familiar with the technology.
>>>>>>> I think it would help if you could send in advance a detailed
>>>>>>> presentation of the problem and the proposed solutions (and then what
>>>>>>> they require of the MM layer) so people can be better prepared.
>>>>>>>
>>>>>>> And first I'd like to ask, aren't IOMMUs supposed to already largely
>>>>>>> solve this problem ?
>>>>>>> (probably a dumb question, but that just tells
>>>>>>> you how much you need to explain :)
>>>>>>>
>>>>>> For GPU the motivation is three fold. With the advance of GPU compute
>>>>>> and also with newer graphic program we see a massive increase in GPU
>>>>>> memory consumption. We easily can reach buffer that are bigger than
>>>>>> 1gbytes. So the first motivation is to directly use the memory the
>>>>>> user allocated through malloc in the GPU this avoid copying 1gbytes of
>>>>>> data with the cpu to the gpu buffer. The second and mostly important
>>>>>> to GPU compute is the use of GPU seamlessly with the CPU, in order to
>>>>>> achieve this you want the programmer to have a single address space on
>>>>>> the CPU and GPU. So that the same address point to the same object on
>>>>>> GPU as on the CPU. This would also be a tremendous cleaner design from
>>>>>> driver point of view toward memory management.
>>>>>>
>>>>> When GPU will comsume memory?
>>>>>
>>>>> The userspace process like mplayer will have video datas and GPU
>>>>> will play this datas and use memory of mplayer since these video
>>>>> datas load in mplayer process's address space? So GPU codes will
>>>>> call gup to take a reference of memory? Please correct me if my
>>>>> understanding is wrong. ;-)
>>>>>
>>>> First target is not thing such as video decompression, however they could
>>>> too benefit from it given updated driver kernel API. In case of using
>>>> iommu hardware page fault we don't call get_user_pages (gup) those we
>>>> don't take a reference on the page. That's the whole point of the hardware
>>>> pagefault, not taking reference on the page.
>>>>
>>> mplayer process is running on normal CPU or GPU?
>>> chipset_integrated graphics will use normal memory and discrete
>>> graphics will use its own memory, correct? So the memory used by
>>> discrete graphics won't need gup, correct?
>>>
>> mplayer can decode video in software an only use the cpu.
It can also use >> one of the accleration API such as VDPAU. In any case mplayer is still >> opening >> the video file allocating some memory with malloc, reading from file into >> this memory eventually do some preprocessing on that memory and then >> memcpy from this memory to memory allocated by the gpu driver. >> >> No imagine a world where you don't have to memcpy so that the gpu can >> access >> it. Even if it's doable today it's really not something you want todo, ie >> gup on page and not releasing page for minutes. >> >> There is two kind of integrated GPU, on x86 integrated GPU should be >> considered >> as discrete GPU because BIOS steal a chunk of system ram and transform it >> in >> fake vram. This stolen chunk is never ever under the control of the linux >> kernel >> (from mm pov the gpu kernel driver is in charge of it). >> > > I configure integrated GPU in BIOS during system boot, it's seems that we > can preallocate memory for integrated GPU, is this the memory you mentioned > ? Most likely it's > In any case both discrete GPU and integrated GPU have their own page table >> or >> > > Discrete GPU will not use normal memory even if their own memory is > exhaused, correct? > > They will consume normal memory, right now you can see that on heavy load hugue chunk of your system memory disappear, it's the gpu driver that is using it, it get mapped into gpu address space and from gpu unit pov it's just like any other memory (ie vram or sram looks the same to the gpu acceleration core, sram is just slower). Cheers Jerome --047d7b5db86c7453b104da216565 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable On Thu, Apr 11, 2013 at 11:13 PM, Simon Jeons <simon.jeons@gmail.com> wrote:
Hi Jerome,

On 04/12/2013 02:48 AM, Jerome Glisse wrote:
On Thu, Apr 11, 2013 at 11:37:35AM +0800, Simon Jeons wrote:
Hi Jerome,
On 04/11/2013 04:55 AM, Jerome Glisse wrote:
On Wed, Apr 10, 2013 at 09:57:02AM +0800, Simon Jeons wrote:
Hi Jerome,
On 02/10/2013 12:29 AM, Jerome Glisse wrote:
On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse <walken@google.com> wrote:
On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel <raindel@mellanox.com> wrote:
Hi,

We would like to present a reference implementation for safely sharing
memory pages from user space with the hardware, without pinning.

We will be happy to hear the community feedback on our prototype
implementation, and suggestions for future improvements.

We would also like to discuss adding features to the core MM subsystem to assist hardware access to user memory without pinning.
This sounds kinda scary TBH; however I do understand the need for such
technology.

I think one issue is that many MM developers are insufficiently aware
of such developments; having a technology presentation would probably
help there; but traditionally LSF/MM sessions are more interactive
between developers who are already quite familiar with the technology.
I think it would help if you could send in advance a detailed
presentation of the problem and the proposed solutions (and then what
they require of the MM layer) so people can be better prepared.

And first I'd like to ask, aren't IOMMUs supposed to already largel= y
solve this problem ? (probably a dumb question, but that just tells
you how much you need to explain :)
For GPU the motivation is three fold. With the advance of GPU compute
and also with newer graphic program we see a massive increase in GPU
memory consumption. We easily can reach buffer that are bigger than
1gbytes. So the first motivation is to directly use the memory the
user allocated through malloc in the GPU this avoid copying 1gbytes of
data with the cpu to the gpu buffer. The second and mostly important
to GPU compute is the use of GPU seamlessly with the CPU, in order to
achieve this you want the programmer to have a single address space on
the CPU and GPU. So that the same address point to the same object on
GPU as on the CPU. This would also be a tremendous cleaner design from
driver point of view toward memory management.
When will the GPU consume memory?

A userspace process like mplayer will have video data, and the GPU
will play this data and use mplayer's memory, since the video
data is loaded in the mplayer process's address space? So the GPU code will
call gup to take a reference on the memory? Please correct me if my
understanding is wrong. ;-)
The first target is not things such as video decompression; however, they could benefit from it too, given an updated driver kernel API. When using
IOMMU hardware page faults we don't call get_user_pages (gup), so we don't take a reference on the page. That's the whole point of the hardware
page fault: not taking a reference on the page.
Is the mplayer process running on the normal CPU or on the GPU?
Integrated chipset graphics will use normal memory and discrete
graphics will use its own memory, correct? So the memory used by
discrete graphics won't need gup, correct?
mplayer can decode video in software and use only the CPU. It can also use one of the acceleration APIs such as VDPAU. In any case mplayer is still opening
the video file, allocating some memory with malloc, reading from the file into this memory, possibly doing some preprocessing on that memory, and then
doing a memcpy from this memory to memory allocated by the GPU driver.

Now imagine a world where you don't have to memcpy so that the GPU can access
it. Even if it's doable today, it's really not something you want to do, i.e.
gup on a page and not releasing the page for minutes.

There are two kinds of integrated GPU. On x86, an integrated GPU should be considered
a discrete GPU, because the BIOS steals a chunk of system RAM and turns it into
fake VRAM. This stolen chunk is never under the control of the Linux kernel
(from the mm point of view, the GPU kernel driver is in charge of it).

I configure the integrated GPU in the BIOS during system boot; it seems that we can preallocate memory for the integrated GPU. Is this the memory you mentioned?

Most likely it is.

In any case both discrete GPU and integrated GPU have their own page table or

A discrete GPU will not use normal memory even if its own memory is exhausted, correct?


They will consume normal memory. Right now you can see that under heavy load a huge chunk of your system memory disappears; it's the GPU driver that is using it. It gets mapped into the GPU address space, and from the GPU's point of view it's just like any other memory (i.e. VRAM and system RAM look the same to the GPU acceleration core; system RAM is just slower).

Cheers
Jerome

From: Simon Jeons
Date: Fri, 12 Apr 2013 13:44:38 +0800
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes
To: Jerome Glisse

Hi Jerome,
On 04/12/2013 10:57 AM, Jerome Glisse wrote:
On Thu, Apr 11, 2013 at 9:54 PM, Simon Jeons <simon.jeons@gmail.com> wrote:
Hi Jerome,

On 04/12/2013 02:38 AM, Jerome Glisse wrote:
On Thu, Apr 11, 2013 at 11:42:05AM +0800, Simon Jeons wrote:
Hi Jerome,
On 04/11/2013 04:45 AM, Jerome Glisse wrote:
On Wed, Apr 10, 2013 at 09:41:57AM +0800, Simon Jeons wrote:
Hi Jerome,
On 04/09/2013 10:21 PM, Jerome Glisse wrote:
On Tue, Apr 09, 2013 at 04:28:09PM +0800, Simon Jeons wrote:
Hi Jerome,
On 02/10/2013 12:29 AM, Jerome Glisse wrote:
On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse <walken@google.com> wrote:
On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel <raindel@mellanox.com> wrote:
Hi,

We would like to present a reference implementation for safely sharing
memory pages from user space with the hardware, without pinning.

We will be happy to hear the community feedback on our prototype
implementation, and suggestions for future improvements.

We would also like to discuss adding features to the core MM subsystem to
assist hardware access to user memory without pinning.
This sounds kinda scary TBH; however I do understand the need for such
technology.

I think one issue is that many MM developers are insufficiently aware
of such developments; having a technology presentation would probably
help there; but traditionally LSF/MM sessions are more interactive
between developers who are already quite familiar with the technology.
I think it would help if you could send in advance a detailed
presentation of the problem and the proposed solutions (and then what
they require of the MM layer) so people can be better prepared.

And first I'd like to ask, aren't IOMMUs supposed to already largely
solve this problem ? (probably a dumb question, but that just tells
you how much you need to explain :)
For GPUs the motivation is threefold. With the advance of GPU compute
and with newer graphics programs, we see a massive increase in GPU
memory consumption. We can easily reach buffers that are bigger than
1 GB. So the first motivation is to directly use, on the GPU, the memory
the user allocated through malloc; this avoids copying 1 GB of
data with the CPU to the GPU buffer. The second, important mostly
for GPU compute, is to use the GPU seamlessly with the CPU. To
achieve this you want the programmer to have a single address space on
the CPU and GPU, so that the same address points to the same object on
the GPU as on the CPU. This would also be a tremendously cleaner design
for memory management from the driver's point of view.

And last, and most important: with such big buffers (>1 GB), memory
pinning is becoming way too expensive and also drastically
reduces the freedom of the mm to free pages for other processes. Most of
the time only a small window (everything is relative; the window can be >
100 MB, not so small :)) of the object will be in use by the
hardware. The hardware pagefault support would avoid the necessity to
What's the meaning of hardware pagefault?
It's a PCIe extension (well, it's a combination of extensions that allow
it; see http://www.pcisig.com/specifications/iov/ats/). The idea is that the
IOMMU can trigger a regular page fault inside a process address space on
behalf of the hardware. The only IOMMU supporting that right now is the
AMD IOMMU v2 that you find on recent AMD platforms.
Why is a hardware page fault needed? A regular page fault is triggered by the CPU
MMU, correct?
Well, here I abuse the term "regular page fault". The idea is that with hardware page
faults you don't need to pin memory or take a reference on the page for the hardware to
use it, so that the kernel can free, as usual, pages that would otherwise have been
For the case when the GPU needs to pin memory, why does the GPU need to grab the
memory of a normal process instead of allocating for itself?
Pinned memory is today's world, where the GPU allocates its own memory (GBs of memory)
that disappears from kernel control, i.e. the kernel can no longer reclaim this
memory; it's lost memory (I have already had complaints about that from users who
saw GBs of memory vanish and couldn't understand why the GPU was using so
much).

In tomorrow's world we want the GPU to be able to access memory that the application
allocated through a simple malloc, and we want the kernel to be able to
recycle any page at any time, because of memory pressure or because the kernel
decides to do so.

That's just what we want to do. To achieve that we are getting hardware that can do
page faults. No change to kernel core mm code is needed (some improvements might be made).

The memory disappears because you hold a reference (gup) against it, correct? In tomorrow's world you want the page fault triggered through the IOMMU driver, which calls get_user_pages; it will also take a reference (since gup is called), won't it? Anyway, assuming tomorrow's world doesn't take a reference, don't we need to care whether a page used by the GPU is reclaimed?


Right now the code uses gup because it's convenient, but it drops the reference right after the fault. So the reference is held only for a short period of time.

Are you sure gup will drop the reference right after the fault? I dug through the code again and failed to verify it. Could you point it out to me?


No, you don't need to care about reclaim, thanks to the mmu notifier: before a page is removed the mmu notifier is called, and the IOMMU driver has registered a notifier, so it gets the invalidate event and invalidates the device TLB, and things go on. If the GPU accesses the page again, a new page fault happens and a new page is allocated.

Good idea! ;-)


All this code is upstream in the Linux kernel; just read it. There is just no device that uses it yet.

That being said, we will want improvements so that pages that are hot in the device are not reclaimed. But it can work without such improvements.

Cheers,
Jerome
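
The invalidate-then-refault cycle described in this message can be illustrated with a small toy model. This is only a sketch in plain Python, not kernel code: `ToyMM`, `ToyDevice`, and every method name are hypothetical stand-ins invented for the illustration, and the real mmu notifier and IOMMU interfaces look nothing like this. It only shows the ordering: the notifier runs before the page goes away, the device TLB entry is dropped, and the next device access faults and gets a new page.

```python
class ToyMM:
    """Toy stand-in for a process address space (hypothetical, not kernel code)."""

    def __init__(self):
        self.page_table = {}   # virtual page number -> frame id
        self.notifiers = []    # callbacks run before a page is removed
        self._next_frame = 0

    def register_notifier(self, callback):
        self.notifiers.append(callback)

    def fault_in(self, vpn):
        """Resolve a fault, allocating a fresh frame if the page is not resident."""
        if vpn not in self.page_table:
            self.page_table[vpn] = self._next_frame
            self._next_frame += 1
        return self.page_table[vpn]

    def reclaim(self, vpn):
        """Remove a page; secondary MMUs are notified before the mapping goes away."""
        if vpn in self.page_table:
            for callback in self.notifiers:
                callback(vpn)
            del self.page_table[vpn]


class ToyDevice:
    """Toy device with its own TLB that faults through the toy MM."""

    def __init__(self, mm):
        self.mm = mm
        self.tlb = {}          # device-side cache of translations
        self.faults = 0
        mm.register_notifier(self.invalidate)

    def invalidate(self, vpn):
        # Notifier callback: drop the stale translation, hold no reference.
        self.tlb.pop(vpn, None)

    def access(self, vpn):
        if vpn not in self.tlb:
            self.faults += 1   # device-initiated page fault
            self.tlb[vpn] = self.mm.fault_in(vpn)
        return self.tlb[vpn]
```

In this model a device that touches a page, sees it reclaimed, and touches it again takes two faults and ends up on a different frame, while a repeated access with no reclaim in between is a TLB hit and costs no fault.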


From: Jerome Glisse
Date: Fri, 12 Apr 2013 09:32:46 -0400
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes
To: Simon Jeons

On Fri, Apr 12, 2013 at 1:44 AM, Simon Jeons <simon.jeons@gmail.com> wrote:
Right now the code uses gup because it's convenient, but it drops the reference right after the fault. So the reference is held only for a short period of time.

Are you sure gup will drop the reference right after the fault? I dug through the code again and failed to verify it. Could you point it out to me?

In amd_iommu_v2.c:do_fault, get_user_pages is followed by put_page.

Cheers,
Jerome
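
The difference between holding a gup reference and dropping it right after the fault can be shown with a toy reference count. This is a sketch only: `ToyPage`, `toy_get_user_page`, and `toy_put_page` are hypothetical names invented here and merely imitate the get-then-put pattern mentioned above. They are not the kernel functions and do not reproduce the code in amd_iommu_v2.c.

```python
class ToyPage:
    """Toy page with a reference count (hypothetical, not struct page)."""

    def __init__(self):
        self.refcount = 1  # the reference held by the page-table mapping


def toy_get_user_page(page):
    """Imitates gup: take an extra reference on the page."""
    page.refcount += 1
    return page


def toy_put_page(page):
    """Imitates put_page: drop one reference."""
    page.refcount -= 1


def pinned_access(page):
    """Pinning model: the reference is taken and kept while the device uses the page."""
    return toy_get_user_page(page)


def fault_style_access(page):
    """Fault model: the reference lives only for the duration of fault handling."""
    held = toy_get_user_page(page)
    peak = held.refcount   # extra reference is visible only inside the handler
    toy_put_page(held)
    return peak
```

After `pinned_access` the count stays elevated until somebody drops it, which is what keeps pinned memory away from reclaim. After `fault_style_access` the count is back to its original value, so in this model nothing stops the page from being reclaimed later, and the notifier path is what keeps the device consistent.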
--f46d0447a18d46aeec04da29f009--

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx121.postini.com [74.125.245.121]) by kanga.kvack.org (Postfix) with SMTP id D80116B0002 for ; Mon, 15 Apr 2013 04:39:25 -0400 (EDT)
Received: by mail-qc0-f181.google.com with SMTP id a22so550899qcs.12 for ; Mon, 15 Apr 2013 01:39:24 -0700 (PDT)
Message-ID: <516BBCB5.7050303@gmail.com>
Date: Mon, 15 Apr 2013 16:39:17 +0800
From: Simon Jeons
MIME-Version: 1.0
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes
References: <5114DF05.7070702@mellanox.com>
In-Reply-To:
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jerome Glisse
Cc: Michel Lespinasse, Shachar Raindel, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli, Roland Dreier, Haggai Eran, Or Gerlitz, Sagi Grimberg, Liran Liss

Hi Jerome,
On 02/10/2013 12:29 AM, Jerome Glisse wrote:
> On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse wrote:
>> On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel wrote:
>>> Hi,
>>>
>>> We would like to present a reference implementation for safely sharing
>>> memory pages from user space with the hardware, without pinning.
>>>
>>> We will be happy to hear the community feedback on our prototype
>>> implementation, and suggestions for future improvements.
>>>
>>> We would also like to discuss adding features to the core MM subsystem to
>>> assist hardware access to user memory without pinning.
>> This sounds kinda scary TBH; however I do understand the need for such
>> technology.
>>
>> I think one issue is that many MM developers are insufficiently aware
>> of such developments; having a technology presentation would probably
>> help there; but traditionally LSF/MM sessions are more interactive
>> between developers who are already quite familiar with the technology.
>> I think it would help if you could send in advance a detailed
>> presentation of the problem and the proposed solutions (and then what
>> they require of the MM layer) so people can be better prepared.
>>
>> And first I'd like to ask, aren't IOMMUs supposed to already largely
>> solve this problem ? (probably a dumb question, but that just tells
>> you how much you need to explain :)
> For GPU the motivation is three fold. With the advance of GPU compute
> and also with newer graphic program we see a massive increase in GPU
> memory consumption. We easily can reach buffer that are bigger than
> 1gbytes. So the first motivation is to directly use the memory the
> user allocated through malloc in the GPU this avoid copying 1gbytes of
> data with the cpu to the gpu buffer. The second and mostly important

Is the pinned memory you mentioned the memory the user allocated or the
memory of the GPU buffer?

> to GPU compute is the use of GPU seamlessly with the CPU, in order to
> achieve this you want the programmer to have a single address space on
> the CPU and GPU. So that the same address point to the same object on
> GPU as on the CPU. This would also be a tremendous cleaner design from
> driver point of view toward memory management.
>
> And last, the most important, with such big buffer (>1gbytes) the
> memory pinning is becoming way to expensive and also drastically
> reduce the freedom of the mm to free page for other process. Most of
> the time a small window (every thing is relative the window can be
> > 100mbytes not so small :)) of the object will be in use by the
> hardware. The hardware pagefault support would avoid the necessity to
> pin memory and thus offer greater flexibility. At the same time the
> driver wants to avoid page fault as much as possible this is why i
> would like to be able to give hint to the mm about range of address it
> should avoid freeing page (swapping them out).
>
> The iommu was designed with other goals, which were first isolate
> device from one another and restrict device access to allowed memory.
> Second allow to remap address that are above device address space
> limit. Lot of device can only address 24bit or 32bit of memory and
> with computer with several gbytes of memory suddenly lot of the page
> become unreachable to the hardware. The iommu allow to work around
> this by remapping those high page into address that the hardware can
> reach.
>
> The hardware page fault support is a new feature of iommu designed to
> help the os and driver to reduce memory pinning and also share address
> space. Thought i am sure there are other motivations that i am not
> even aware off or would think off.
>
> Btw i won't be at LSF/MM so a free good beer (or other beverage) on me
> to whoever takes note on this subject in next conf we run into each
> others.
>
> Cheers,
> Jerome

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx107.postini.com [74.125.245.107]) by kanga.kvack.org (Postfix) with SMTP id A434E6B0002 for ; Mon, 15 Apr 2013 11:38:13 -0400 (EDT)
Received: by mail-qa0-f41.google.com with SMTP id hg5so909666qab.0 for ; Mon, 15 Apr 2013 08:38:12 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <516BBCB5.7050303@gmail.com>
References: <5114DF05.7070702@mellanox.com> <516BBCB5.7050303@gmail.com>
Date: Mon, 15 Apr 2013 11:38:12 -0400
Message-ID:
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes
From: Jerome Glisse
Content-Type: multipart/alternative; boundary=047d7bdc853a62ef3a04da680a27
Sender: owner-linux-mm@kvack.org
List-ID:
To: Simon Jeons
Cc: Michel Lespinasse, Shachar Raindel, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli, Roland Dreier, Haggai Eran, Or Gerlitz, Sagi Grimberg, Liran Liss

--047d7bdc853a62ef3a04da680a27
Content-Type: text/plain; charset=ISO-8859-1

On Mon, Apr 15, 2013 at 4:39 AM, Simon Jeons wrote:

> Hi Jerome,
> On 02/10/2013 12:29 AM, Jerome Glisse wrote:
>
>> On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse
>> wrote:
>>
>>> On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> We would like to present a reference implementation for safely sharing
>>>> memory pages from user space with the hardware, without pinning.
>>>>
>>>> We will be happy to hear the community feedback on our prototype
>>>> implementation, and suggestions for future improvements.
>>>>
>>>> We would also like to discuss adding features to the core MM subsystem
>>>> to
>>>> assist hardware access to user memory without pinning.
>>>>
>>> This sounds kinda scary TBH; however I do understand the need for such
>>> technology.
>>>
>>> I think one issue is that many MM developers are insufficiently aware
>>> of such developments; having a technology presentation would probably
>>> help there; but traditionally LSF/MM sessions are more interactive
>>> between developers who are already quite familiar with the technology.
>>> I think it would help if you could send in advance a detailed
>>> presentation of the problem and the proposed solutions (and then what
>>> they require of the MM layer) so people can be better prepared.
>>>
>>> And first I'd like to ask, aren't IOMMUs supposed to already largely
>>> solve this problem ? (probably a dumb question, but that just tells
>>> you how much you need to explain :)
>>>
>> For GPU the motivation is three fold. With the advance of GPU compute
>> and also with newer graphic program we see a massive increase in GPU
>> memory consumption. We easily can reach buffer that are bigger than
>> 1gbytes. So the first motivation is to directly use the memory the
>> user allocated through malloc in the GPU this avoid copying 1gbytes of
>> data with the cpu to the gpu buffer. The second and mostly important
>>
>
> Is the pinned memory you mentioned the memory the user allocated or the
> memory of the GPU buffer?
>

The memory the user allocated; we don't want to pin this memory.

Cheers,
Jerome

--047d7bdc853a62ef3a04da680a27
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
--047d7bdc853a62ef3a04da680a27--

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx193.postini.com [74.125.245.193]) by kanga.kvack.org (Postfix) with SMTP id 73B7A6B0002 for ; Tue, 16 Apr 2013 00:20:52 -0400 (EDT)
Received: by mail-ie0-f171.google.com with SMTP id e11so77567iej.30 for ; Mon, 15 Apr 2013 21:20:51 -0700 (PDT)
Message-ID: <516CD19C.6080508@gmail.com>
Date: Tue, 16 Apr 2013 12:20:44 +0800
From: Simon Jeons
MIME-Version: 1.0
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes
References: <5114DF05.7070702@mellanox.com> <516BBCB5.7050303@gmail.com>
In-Reply-To:
Content-Type: multipart/alternative; boundary="------------000405040503050909090509"
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jerome Glisse
Cc: Michel Lespinasse, Shachar Raindel, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli, Roland Dreier, Haggai Eran, Or Gerlitz, Sagi Grimberg, Liran Liss

This is a multi-part message in MIME format.
--------------000405040503050909090509
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Hi Jerome,
On 04/15/2013 11:38 PM, Jerome Glisse wrote:
> On Mon, Apr 15, 2013 at 4:39 AM, Simon Jeons wrote:
>
>     Hi Jerome,
>     On 02/10/2013 12:29 AM, Jerome Glisse wrote:
>
>         On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse wrote:
>
>             On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel wrote:
>
>                 Hi,
>
>                 We would like to present a reference implementation
>                 for safely sharing memory pages from user space with
>                 the hardware, without pinning.
>
>                 We will be happy to hear the community feedback on our
>                 prototype implementation, and suggestions for future
>                 improvements.
>
>                 We would also like to discuss adding features to the
>                 core MM subsystem to assist hardware access to user
>                 memory without pinning.
>
>             This sounds kinda scary TBH; however I do understand the
>             need for such technology.
>
>             I think one issue is that many MM developers are
>             insufficiently aware of such developments; having a
>             technology presentation would probably help there; but
>             traditionally LSF/MM sessions are more interactive
>             between developers who are already quite familiar with the
>             technology.
>             I think it would help if you could send in advance a detailed
>             presentation of the problem and the proposed solutions
>             (and then what they require of the MM layer) so people can
>             be better prepared.
>
>             And first I'd like to ask, aren't IOMMUs supposed to
>             already largely solve this problem ? (probably a dumb
>             question, but that just tells you how much you need to
>             explain :)
>
>         For GPU the motivation is three fold. With the advance of GPU
>         compute and also with newer graphic program we see a massive
>         increase in GPU memory consumption. We easily can reach buffer
>         that are bigger than 1gbytes. So the first motivation is to
>         directly use the memory the user allocated through malloc in
>         the GPU this avoid copying 1gbytes of data with the cpu to the
>         gpu buffer. The second and mostly important
>
>
>     Is the pinned memory you mentioned the memory the user allocated
>     or the memory of the GPU buffer?
>
>
> The memory the user allocated; we don't want to pin this memory.

After this idea is merged, we won't need to allocate memory for the
integrated GPU buffer, and a discrete GPU won't need to have its own
memory, correct?

>
> Cheers,
> Jerome

--------------000405040503050909090509
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
--------------000405040503050909090509--

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx123.postini.com [74.125.245.123]) by kanga.kvack.org (Postfix) with SMTP id BDA786B0002 for ; Tue, 16 Apr 2013 03:03:31 -0400 (EDT)
Received: by mail-qc0-f170.google.com with SMTP id d42so78975qca.15 for ; Tue, 16 Apr 2013 00:03:30 -0700 (PDT)
Message-ID: <516CF7BB.3050301@gmail.com>
Date: Tue, 16 Apr 2013 15:03:23 +0800
From: Simon Jeons
MIME-Version: 1.0
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes
References: <5114DF05.7070702@mellanox.com>
In-Reply-To:
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jerome Glisse
Cc: Shachar Raindel, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli, Roland Dreier, Haggai Eran, Or Gerlitz, Sagi Grimberg, Liran Liss

Hi Jerome,
On 02/08/2013 11:21 PM, Jerome Glisse wrote:
> On Fri, Feb 8, 2013 at 6:18 AM, Shachar Raindel wrote:
>> Hi,
>>
>> We would like to present a reference implementation for safely sharing
>> memory pages from user space with the hardware, without pinning.
>>
>> We will be happy to hear the community feedback on our prototype
>> implementation, and suggestions for future improvements.
>>
>> We would also like to discuss adding features to the core MM subsystem to
>> assist hardware access to user memory without pinning.
>>
>> Following is a longer motivation and explanation on the technology
>> presented:
>>
>> Many application developers would like to be able to be able to communicate
>> directly with the hardware from the userspace.
>>
>> Use cases for that includes high performance networking API such as
>> InfiniBand, RoCE and iWarp and interfacing with GPUs.
>>
>> Currently, if the user space application wants to share system memory with
>> the hardware device, the kernel component must pin the memory pages in RAM,
>> using get_user_pages.
>>
>> This is a hurdle, as it usually makes large portions the application memory
>> unmovable. This pinning also makes the user space development model very
>> complicated – one needs to register memory before using it for communication
>> with the hardware.
>>
>> We use the mmu-notifiers [1] mechanism to inform the hardware when the
>> mapping of a page is changed. If the hardware tries to access a page which
>> is not yet mapped for the hardware, it requests a resolution for the page
>> address from the kernel.
>>
>> This mechanism allows the hardware to access the entire address space of the
>> user application, without pinning even a single page.
>>
>> We would like to use the LSF/MM forum opportunity to discuss open issues we
>> have for further development, such as:
>>
>> -Allowing the hardware to perform page table walk, similar to
>> get_user_pages_fast to resolve user pages that are already in RAM.

get_user_pages_fast just takes a page reference count instead of
populating the PTE in the page table, correct? Then how can the GPU
driver use the IOMMU to access the page?

>>
>> -Batching page eviction by various kernel subsystems (swapper, page-cache)
>> to reduce the amount of communication needed with the hardware in such
>> events
>>
>> -Hinting from the hardware to the MM regarding page fetches which are
>> speculative, similarly to prefetching done by the page-cache
>>
>> -Page-in notifications from the kernel to the driver, such that we can keep
>> our secondary TLB in sync with the kernel page table without incurring page
>> faults.
>>
>> -Allowed and banned actions while in an MMU notifier callback. We have
>> already done some work on making the MMU notifiers sleepable [2], but there
>> might be additional limitations, which we would like to discuss.
>>
>> -Hinting from the MMU notifiers as for the reason for the notification - for
>> example we would like to react differently if a page was moved by NUMA
>> migration vs. page being swapped out.
>>
>> [1] http://lwn.net/Articles/266320/
>>
>> [2] http://comments.gmane.org/gmane.linux.kernel.mm/85002
>>
>> Thanks,
>>
>> --Shachar
> As a GPU driver developer i can say that this is something we want to
> do in a very near future. Also i think we would like another
> capabilities :
>
> - hint to mm on memory range that are best not to evict (easier for
> driver to know what is hot and gonna see activities)
>
> Dunno how big the change to the page eviction path would need to be.
>
> Cheers,
> Jerome

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx127.postini.com [74.125.245.127]) by kanga.kvack.org (Postfix) with SMTP id D07EE6B0002 for ; Tue, 16 Apr 2013 12:19:35 -0400 (EDT)
Received: by mail-qa0-f48.google.com with SMTP id bn16so381969qab.14 for ; Tue, 16 Apr 2013 09:19:34 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <516CD19C.6080508@gmail.com>
References: <5114DF05.7070702@mellanox.com> <516BBCB5.7050303@gmail.com> <516CD19C.6080508@gmail.com>
Date: Tue, 16 Apr 2013 12:19:34 -0400
Message-ID:
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes
From: Jerome Glisse
Content-Type: multipart/alternative; boundary=047d7bdc853a2e08ef04da7cbcf0
Sender: owner-linux-mm@kvack.org
List-ID:
To: Simon Jeons
Cc: Michel Lespinasse, Shachar Raindel, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli, Roland Dreier, Haggai Eran, Or Gerlitz, Sagi Grimberg, Liran Liss

--047d7bdc853a2e08ef04da7cbcf0
Content-Type: text/plain; charset=ISO-8859-1

On Tue, Apr 16, 2013 at 12:20 AM, Simon Jeons wrote:

> Hi Jerome,
>
> On 04/15/2013 11:38 PM, Jerome Glisse wrote:
>
> On Mon, Apr 15, 2013 at 4:39 AM, Simon Jeons wrote:
>
>> Hi Jerome,
>> On 02/10/2013 12:29 AM, Jerome Glisse wrote:
>>
>>> On Sat, Feb 9, 2013 at 1:05 AM, Michel Lespinasse
>>> wrote:
>>>
>>>> On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We would like to present a reference implementation for safely sharing
>>>>> memory pages from user space with the hardware, without pinning.
>>>>>
>>>>> We will be happy to hear the community feedback on our prototype
>>>>> implementation, and suggestions for future improvements.
>>>>>
>>>>> We would also like to discuss adding features to the core MM subsystem
>>>>> to
>>>>> assist hardware access to user memory without pinning.
>>>>>
>>>> This sounds kinda scary TBH; however I do understand the need for such
>>>> technology.
>>>>
>>>> I think one issue is that many MM developers are insufficiently aware
>>>> of such developments; having a technology presentation would probably
>>>> help there; but traditionally LSF/MM sessions are more interactive
>>>> between developers who are already quite familiar with the technology.
>>>> I think it would help if you could send in advance a detailed
>>>> presentation of the problem and the proposed solutions (and then what
>>>> they require of the MM layer) so people can be better prepared.
>>>>
>>>> And first I'd like to ask, aren't IOMMUs supposed to already largely
>>>> solve this problem ? (probably a dumb question, but that just tells
>>>> you how much you need to explain :)
>>>>
>>> For GPU the motivation is three fold. With the advance of GPU compute
>>> and also with newer graphic program we see a massive increase in GPU
>>> memory consumption. We easily can reach buffer that are bigger than
>>> 1gbytes. So the first motivation is to directly use the memory the
>>> user allocated through malloc in the GPU this avoid copying 1gbytes of
>>> data with the cpu to the gpu buffer. The second and mostly important
>>>
>>
>> Is the pinned memory you mentioned the memory the user allocated or the
>> memory of the GPU buffer?
>>
>
> The memory the user allocated; we don't want to pin this memory.
>
>
> After this idea is merged, we won't need to allocate memory for the
> integrated GPU buffer, and a discrete GPU won't need to have its own
> memory, correct?
>

You need to stop considering discrete and integrated GPUs as different;
they are not, from the driver's point of view. An integrated GPU will keep
stealing a chunk of system memory at boot because it's a BIOS thing, and
BIOSes don't change like that.

Both (integrated and discrete) will keep allocating system memory in the
kernel for their own buffers because APIs such as OpenGL or OpenCL need
them to. The transparent use of the same address space on the GPU as on
the CPU will only happen with newer APIs such as OpenCL 2.0 or other APIs
that are scheduled down the road.

And discrete GPUs will keep having their own memory; the whole point is
that GDDR5 can be more than 10 times faster than the fastest DDR3. You
cannot beat that. GPUs are all about bandwidth and that's not gonna change.

Cheers,
Jerome

--047d7bdc853a2e08ef04da7cbcf0
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
--047d7bdc853a2e08ef04da7cbcf0--

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx105.postini.com [74.125.245.105]) by kanga.kvack.org (Postfix) with SMTP id 2E5076B0036 for ; Tue, 16 Apr 2013 12:27:22 -0400 (EDT)
Received: by mail-qe0-f48.google.com with SMTP id 2so368259qea.35 for ; Tue, 16 Apr 2013 09:27:21 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <516CF7BB.3050301@gmail.com>
References: <5114DF05.7070702@mellanox.com> <516CF7BB.3050301@gmail.com>
Date: Tue, 16 Apr 2013 12:27:21 -0400
Message-ID:
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes
From: Jerome Glisse
Content-Type: multipart/alternative; boundary=047d7b5d617cfa98ab04da7cd72e
Sender: owner-linux-mm@kvack.org
List-ID:
To: Simon Jeons
Cc: Shachar Raindel, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli, Roland Dreier, Haggai Eran, Or Gerlitz, Sagi Grimberg, Liran Liss

--047d7b5d617cfa98ab04da7cd72e
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

On Tue, Apr 16, 2013 at 3:03 AM, Simon Jeons wrote:

> Hi Jerome,
>
> On 02/08/2013 11:21 PM, Jerome Glisse wrote:
>
>> On Fri, Feb 8, 2013 at 6:18 AM, Shachar Raindel
>> wrote:
>>
>>> Hi,
>>>
>>> We would like to present a reference implementation for safely sharing
>>> memory pages from user space with the hardware, without pinning.
>>>
>>> We will be happy to hear the community feedback on our prototype
>>> implementation, and suggestions for future improvements.
>>>
>>> We would also like to discuss adding features to the core MM subsystem to
>>> assist hardware access to user memory without pinning.
>>>
>>> Following is a longer motivation and explanation on the technology
>>> presented:
>>>
>>> Many application developers would like to be able to be able to
>>> communicate
>>> directly with the hardware from the userspace.
>>>
>>> Use cases for that includes high performance networking API such as
>>> InfiniBand, RoCE and iWarp and interfacing with GPUs.
>>>
>>> Currently, if the user space application wants to share system memory
>>> with
>>> the hardware device, the kernel component must pin the memory pages in
>>> RAM,
>>> using get_user_pages.
>>>
>>> This is a hurdle, as it usually makes large portions the application
>>> memory
>>> unmovable. This pinning also makes the user space development model very
>>> complicated – one needs to register memory before using it for
>>> communication
>>> with the hardware.
>>>
>>> We use the mmu-notifiers [1] mechanism to inform the hardware when the
>>> mapping of a page is changed. If the hardware tries to access a page
>>> which
>>> is not yet mapped for the hardware, it requests a resolution for the page
>>> address from the kernel.
>>>
>>> This mechanism allows the hardware to access the entire address space of
>>> the
>>> user application, without pinning even a single page.
>>>
>>> We would like to use the LSF/MM forum opportunity to discuss open issues
>>> we
>>> have for further development, such as:
>>>
>>> -Allowing the hardware to perform page table walk, similar to
>>> get_user_pages_fast to resolve user pages that are already in RAM.
>>>
>>
> get_user_pages_fast just takes a page reference count instead of
> populating the PTE in the page table, correct? Then how can the GPU
> driver use the IOMMU to access the page?
>

As I said, this is for pre-filling already present entries, i.e. PTEs that
are present with a valid page (no special bit set). This is an optimization
so that the GPU can pre-fill its TLB without having to take mmap_sem. The
hope is that in the most common case this will be enough, but in some cases
you will have to go through the lengthy non-fast gup.

Cheers,
Jerome

--047d7b5d617cfa98ab04da7cd72e
Content-Type: text/html; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Simon Jeons
Date: Wed, 17 Apr 2013 07:50:41 +0800
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes
Message-ID: <516DE3D1.7030800@gmail.com>
References: <5114DF05.7070702@mellanox.com> <516CF7BB.3050301@gmail.com>
To: Jerome Glisse
Cc: Shachar Raindel, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli, Roland Dreier, Haggai Eran, Or Gerlitz, Sagi Grimberg, Liran Liss

On 04/17/2013 12:27 AM, Jerome Glisse wrote:

[snip]

> As I said, this is for pre-filling already present entries, i.e. PTEs
> that are present with a valid page (no special bit set). This is an
> optimization so that the GPU can pre-fill its TLB without having to
> take mmap_sem. The hope is that in the most common case this will be
> enough, but in some cases you will have to go through the lengthy
> non-fast gup.

I know this. My concern is that the PTE you mentioned is the normal CPU
one, correct? How can you pre-fill the PTE and TLB of the GPU?

> Cheers,
> Jerome

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jerome Glisse
Date: Wed, 17 Apr 2013 10:01:47 -0400
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes
In-Reply-To: <516DE3D1.7030800@gmail.com>
References: <5114DF05.7070702@mellanox.com> <516CF7BB.3050301@gmail.com> <516DE3D1.7030800@gmail.com>
To: Simon Jeons
Cc: Shachar Raindel, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli, Roland Dreier, Haggai Eran, Or Gerlitz, Sagi Grimberg, Liran Liss

On Tue, Apr 16, 2013 at 7:50 PM, Simon Jeons <simon.jeons@gmail.com> wrote:
> On 04/17/2013 12:27 AM, Jerome Glisse wrote:
>
> [snip]
>
>> As I said, this is for pre-filling already present entries, i.e. PTEs
>> that are present with a valid page (no special bit set). This is an
>> optimization so that the GPU can pre-fill its TLB without having to
>> take mmap_sem. The hope is that in the most common case this will be
>> enough, but in some cases you will have to go through the lengthy
>> non-fast gup.
>
> I know this. My concern is that the PTE you mentioned is the normal CPU
> one, correct? How can you pre-fill the PTE and TLB of the GPU?

You are getting confused. The idea is to look at the CPU PTE and pre-fill
the GPU PTE. I do not pre-fill the CPU PTE; if a CPU PTE is valid, then I
use the page it points to to pre-fill the GPU PTE.

So I don't pre-fill the CPU PTE and the GPU TLB; I pre-fill the GPU PTE
from the CPU PTE if the CPU PTE is valid. Other GPU PTEs are marked as
invalid and will trigger a fault that is handled using gup, which fills
the CPU PTE (if the fault happened at a valid address), at which point the
GPU PTE is updated; an error is reported if the fault happened at an
invalid address.

Cheers,
Jerome
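
The pre-fill scheme described above could be modelled roughly as follows. This is a user-space Python toy under stated assumptions, not driver code: page tables are plain dicts, `slow_gup` is a hypothetical stand-in for the non-fast gup path, and `prefill`, `gpu_access` and `valid_vpns` are names invented for illustration only.

```python
# Toy user-space model of the pre-fill scheme; not kernel or driver code.
# Every name below is hypothetical.

INVALID = None


def slow_gup(cpu_pt, valid_vpns, vpn, next_frame):
    """Stand-in for the non-fast gup path: fault the page in on the CPU side."""
    if vpn not in valid_vpns:
        raise MemoryError(f"device fault at invalid address {vpn:#x}")
    if cpu_pt.get(vpn) is INVALID:
        cpu_pt[vpn] = next_frame()      # fill the CPU PTE
    return cpu_pt[vpn]


def prefill(cpu_pt, vpns):
    """Copy only CPU PTEs that are already valid; never fault anything in."""
    return {vpn: cpu_pt.get(vpn, INVALID) for vpn in vpns}


def gpu_access(gpu_pt, cpu_pt, valid_vpns, vpn, next_frame):
    if gpu_pt.get(vpn) is INVALID:
        # GPU fault: fall back to the slow path, then update the GPU PTE.
        gpu_pt[vpn] = slow_gup(cpu_pt, valid_vpns, vpn, next_frame)
    return gpu_pt[vpn]


frames = iter(range(100, 200))


def next_frame():
    """Hand out fresh physical frame numbers for pages faulted in."""
    return next(frames)


valid_vpns = {1, 2, 3}          # pages inside a mapped region
cpu_pt = {1: 11, 2: INVALID}    # page 1 is resident, pages 2 and 3 are not

gpu_pt = prefill(cpu_pt, valid_vpns)
prefilled = dict(gpu_pt)        # snapshot: only page 1 is usable so far

hit = gpu_access(gpu_pt, cpu_pt, valid_vpns, 1, next_frame)     # no fault
filled = gpu_access(gpu_pt, cpu_pt, valid_vpns, 2, next_frame)  # fault, slow path

try:
    gpu_access(gpu_pt, cpu_pt, valid_vpns, 9, next_frame)
    bad_address_rejected = False
except MemoryError:
    bad_address_rejected = True
```

In this model the pre-fill step only reads the CPU table, which mirrors the point that the fast path does not populate CPU PTEs; only the fault path changes them.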

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Simon Jeons
Date: Thu, 18 Apr 2013 07:48:26 +0800
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes
Message-ID: <516F34CA.8050902@gmail.com>
References: <5114DF05.7070702@mellanox.com> <516CF7BB.3050301@gmail.com> <516DE3D1.7030800@gmail.com>
To: Jerome Glisse
Cc: Shachar Raindel, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli, Roland Dreier, Haggai Eran, Or Gerlitz, Sagi Grimberg, Liran Liss

Hi Jerome,

On 04/17/2013 10:01 PM, Jerome Glisse wrote:
> On Tue, Apr 16, 2013 at 7:50 PM, Simon Jeons <simon.jeons@gmail.com> wrote:
>> On 04/17/2013 12:27 AM, Jerome Glisse wrote:
>>
>> [snip]
>>
>>> As I said, this is for pre-filling already present entries, i.e. PTEs
>>> that are present with a valid page (no special bit set). This is an
>>> optimization so that the GPU can pre-fill its TLB without having to
>>> take mmap_sem. The hope is that in the most common case this will be
>>> enough, but in some cases you will have to go through the lengthy
>>> non-fast gup.
>>
>> I know this. My concern is that the PTE you mentioned is the normal CPU
>> one, correct? How can you pre-fill the PTE and TLB of the GPU?
>
> You are getting confused. The idea is to look at the CPU PTE and pre-fill
> the GPU PTE. I do not pre-fill the CPU PTE; if a CPU PTE is valid, then I
> use the page it points to to pre-fill the GPU PTE.

Yes, confused!

> So I don't pre-fill the CPU PTE and the GPU TLB; I pre-fill the GPU PTE
> from the CPU PTE if the CPU PTE is valid. Other GPU PTEs are marked as
> invalid and will trigger a fault that is handled using gup, which fills
> the CPU PTE (if the fault happened at a valid address), at which point the
> GPU PTE is updated; an error is reported if the fault happened at an
> invalid address.

gup is used to fill the CPU PTE. Could you point out which code will
re-fill the GPU PTE? gup fast?
Is the GPU page table different from the CPU one?

> Cheers,
> Jerome

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jerome Glisse
Date: Wed, 17 Apr 2013 21:02:36 -0400
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes
In-Reply-To: <516F34CA.8050902@gmail.com>
References: <5114DF05.7070702@mellanox.com> <516CF7BB.3050301@gmail.com> <516DE3D1.7030800@gmail.com> <516F34CA.8050902@gmail.com>
To: Simon Jeons
Cc: Shachar Raindel, lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Andrea Arcangeli, Roland Dreier, Haggai Eran, Or Gerlitz, Sagi Grimberg, Liran Liss

On Wed, Apr 17, 2013 at 7:48 PM, Simon Jeons <simon.jeons@gmail.com> wrote:
> Hi Jerome,
>
> On 04/17/2013 10:01 PM, Jerome Glisse wrote:
>> On Tue, Apr 16, 2013 at 7:50 PM, Simon Jeons wrote:
>>> On 04/17/2013 12:27 AM, Jerome Glisse wrote:
>>>
>>> [snip]
>>>
>>>> As I said, this is for pre-filling already present entries, i.e. PTEs
>>>> that are present with a valid page (no special bit set). This is an
>>>> optimization so that the GPU can pre-fill its TLB without having to
>>>> take mmap_sem. The hope is that in the most common case this will be
>>>> enough, but in some cases you will have to go through the lengthy
>>>> non-fast gup.
>>>
>>> I know this. My concern is that the PTE you mentioned is the normal CPU
>>> one, correct? How can you pre-fill the PTE and TLB of the GPU?
>>
>> You are getting confused. The idea is to look at the CPU PTE and pre-fill
>> the GPU PTE. I do not pre-fill the CPU PTE; if a CPU PTE is valid, then I
>> use the page it points to to pre-fill the GPU PTE.
>
> Yes, confused!
>
>> So I don't pre-fill the CPU PTE and the GPU TLB; I pre-fill the GPU PTE
>> from the CPU PTE if the CPU PTE is valid. Other GPU PTEs are marked as
>> invalid and will trigger a fault that is handled using gup, which fills
>> the CPU PTE (if the fault happened at a valid address), at which point the
>> GPU PTE is updated; an error is reported if the fault happened at an
>> invalid address.
>
> gup is used to fill the CPU PTE. Could you point out which code will
> re-fill the GPU PTE? gup fast?
> Is the GPU page table different from the CPU one?

The GPU interrupt handler will schedule a work thread that will call gup
and then update the GPU page table.

Cheers,
Jerome
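
The last step, where the fault is handled outside the interrupt handler, might look like the toy model below. It is a user-space Python sketch, not driver code: `fake_gup`, `irq_handler` and `fault_worker` are hypothetical names, and a `queue.Queue` plus a Python thread stand in for a kernel work item. The split is presumably there because gup can sleep, which is not allowed in interrupt context.

```python
import queue
import threading

# Toy user-space model; not driver code. All names are hypothetical.

cpu_pt = {}                 # filled by the stand-in for gup
gpu_pt = {}                 # updated only by the worker
fault_queue = queue.Queue()


def fake_gup(vpn):
    """Stand-in for gup: make the page resident and return its frame."""
    return cpu_pt.setdefault(vpn, 0x1000 + vpn)


def irq_handler(vpn):
    """Runs in 'interrupt context': must not block, so it only queues work."""
    fault_queue.put(vpn)


def fault_worker():
    """Runs where sleeping is fine: resolve faults, update the GPU table."""
    while True:
        vpn = fault_queue.get()
        if vpn is None:             # sentinel: shut the worker down
            fault_queue.task_done()
            break
        gpu_pt[vpn] = fake_gup(vpn)
        fault_queue.task_done()


worker = threading.Thread(target=fault_worker)
worker.start()

for vpn in (3, 5, 3):               # three device faults, one repeated
    irq_handler(vpn)

fault_queue.join()                  # wait until every queued fault is handled
fault_queue.put(None)
worker.join()
```

Keeping the handler down to a single enqueue means the expensive part (the gup call and the GPU page table update) runs in a context that is allowed to block.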