From mboxrd@z Thu Jan  1 00:00:00 1970
From: j.glisse@gmail.com
Subject: HMM (Heterogeneous Memory Management) v8
Date: Mon, 5 Jan 2015 17:44:43 -0500
Message-ID: <1420497889-10088-1-git-send-email-j.glisse@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

This is a resend with corrections based on Haggai's comments. This patchset
is just the foundation onto which we want to build our feature set, the
main feature being migration of memory to device memory. The very first
version of this patchset already showcased a proof of concept of most of
these features. Below is the previous patchset's cover letter, pretty much
unchanged, as the background and motivation for it have not changed.

What is it?

In a nutshell, HMM is a subsystem that provides an easy-to-use API to
mirror a process address space on a device, with minimal hardware
requirements (mainly device page faults and read-only page mapping). It
does not rely on the ATS and PASID PCIe extensions; it intends to supersede
them by allowing system memory to be moved to device memory in a fashion
that is transparent to core kernel mm code (i.e. a CPU page fault on a page
residing in device memory will trigger migration back to system memory).

Why do this?

We want to be able to mirror a process address space so that compute APIs
such as OpenCL can use the exact same address space on the GPU as on the
CPU. This will greatly simplify the use of those APIs.
Moreover, we believe we will see more and more specialized functional units
that will want to mirror a process address space using their own MMU.

The migration side exists simply because GPU memory bandwidth is far beyond
system memory bandwidth, and there is no sign of that gap closing (quite
the opposite).

Current status and future features:

None of this code changes core kernel mm code in any major way. This is
simple groundwork with no impact on existing code paths. Features that will
be implemented on top of it are:

  1 - Transparently handle page mapping on behalf of device drivers (DMA).
  2 - Improve the DMA API to better match the new usage pattern of HMM.
  3 - Migration of anonymous memory to device memory.
  4 - Locking memory to remote memory (CPU access triggers SIGBUS).
  5 - Access exclusion between CPU and device for atomic operations.
  6 - Migration of file-backed memory to device memory.

How future features will be implemented:

  1 - Simply use the existing DMA API to map pages on behalf of a device.
  2 - Introduce a new DMA API to match the new semantics of HMM. It is no
      longer pages we map but address ranges, and updating which page
      effectively backs an address should be easy. I gave a presentation
      about that during this LPC.
  3 - Requires changes to the CPU page fault code path to handle migration
      back to system memory on CPU access. An implementation of this was
      already sent as part of v1. It will be low impact, only adding
      handling of a new special swap type to the existing fault code.
  4 - Requires a new syscall, as I cannot see which current syscall would
      be appropriate for this. My first thought was to use mbind(), as it
      has the right semantics (binding a range of addresses to a device),
      but mbind() is too NUMA-centric.
      The second was madvise(), but its semantics do not match: the kernel
      is allowed to ignore madvise() hints, while we want to block CPU
      access for as long as the range is bound to a device.
      So I do not think any existing syscall can be extended with new
      flags, but maybe I am wrong.
  5 - Allow mapping a page as read-only on the CPU while a device performs
      some atomic operation on it (this is mainly to work around system
      buses that do not support atomic memory access; sadly there is a
      large base of hardware without that feature).
      The easiest implementation would use a page flag, but there are none
      left. So it must be a flag in the vma that tells us whether HMM needs
      to be queried for write protection.
  6 - This is the trickiest one to implement, and while I showed a proof of
      concept in v1, I still have a lot of conflicting feelings about how
      to achieve it.

As usual, comments are more than welcome. Thanks in advance to anyone who
takes a look at this code.

Previous patchset postings:
  v1 http://lwn.net/Articles/597289/
  v2 https://lkml.org/lkml/2014/6/12/559 (cover letter did not make it to ml)
  v3 https://lkml.org/lkml/2014/6/13/633
  v4 https://lkml.org/lkml/2014/8/29/423
  v5 https://lkml.org/lkml/2014/11/3/759
  v6 http://lwn.net/Articles/619737/

Cheers,
Jérôme

To: "Andrew Morton"
Cc: linux-mm
Cc: "Linus Torvalds"
Cc: "Mel Gorman"
Cc: "H. Peter Anvin"
Cc: "Peter Zijlstra"
Cc: "Linda Wang"
Cc: "Kevin E Martin"
Cc: "Jerome Glisse"
Cc: "Andrea Arcangeli"
Cc: "Johannes Weiner"
Cc: "Larry Woodman"
Cc: "Rik van Riel"
Cc: "Dave Airlie"
Cc: "Jeff Law"
Cc: "Brendan Conoboy"
Cc: "Joe Donohue"
Cc: "Duncan Poole"
Cc: "Sherry Cheung"
Cc: "Subhash Gutti"
Cc: "John Hubbard"
Cc: "Mark Hairgrove"
Cc: "Lucien Dunning"
Cc: "Cameron Buschardt"
Cc: "Arvind Gopalakrishnan"
Cc: "Haggai Eran"
Cc: "Or Gerlitz"
Cc: "Sagi Grimberg"
Cc: "Shachar Raindel"
Cc: "Liran Liss"
Cc: "Roland Dreier"
Cc: "Sander, Ben"
Cc: "Stoner, Greg"
Cc: "Bridgman, John"
Cc: "Mantor, Michael"
Cc: "Blinzer, Paul"
Cc: "Morichetti, Laurent"
Cc: "Deucher, Alexander"
Cc: "Gabbay, Oded"