From: Jerome Glisse
Subject: Re: HMM (heterogeneous memory management) v6
Date: Wed, 12 Nov 2014 23:28:21 -0500
Message-ID: <20141113042819.GB7720@gmail.com>
References: <1415644096-3513-1-git-send-email-j.glisse@gmail.com> <20141112200911.GA7720@gmail.com>
To: Christoph Lameter
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Linus Torvalds, joro@8bytes.org, Mel Gorman, "H. Peter Anvin", Peter Zijlstra, Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel, Liran Liss, R

On Wed, Nov 12, 2014 at 05:08:47PM -0600, Christoph Lameter wrote:
> On Wed, 12 Nov 2014, Jerome Glisse wrote:
>
> > > Could we define a new NUMA node that maps memory from the GPU and
> > > then simply use the existing NUMA features to move a process over there.
> >
> > So GPU processes will never run on the CPU, nor will they have a kernel task struct
> > associated with them. From the core kernel point of view they do not exist. I
> > hope that at some point down the line the hw will allow for better integration
> > with the kernel core, but it's not there yet.
>
> Right. So all of this is not relevant because the GPU manages it. You only
> need access from the regular processors from Linux which has and uses page
> tables.
>
> > So the NUMA idea was considered early on but was discarded as it's not really
> > appropriate. You can have several CPU threads working with several GPU threads
> > at the same time, and they can either access disjoint memory or some shared
> > memory. The usual case will be a few kbytes of shared memory for synchronization
> > between CPU and GPU threads.
>
> It is possible to have several threads accessing the memory in Linux. The
> GPU threads run on the GPU and therefore are not a Linux issue. Where did
> you see the problem?

When they both use system memory there is no issue, but if you want to leverage
the GPU to its full potential you need to migrate memory from system memory to GPU
memory for the duration of the GPU computation (which might be several minutes/hours
or more). But at the same time you do not want CPU access to be forbidden; if a CPU
access does happen you want to catch the CPU fault, schedule a migration of the
GPU memory back to system memory, and resume the CPU thread that faulted.

So from the CPU point of view this GPU memory is like swap: the memory is swapped
out to the GPU memory, and this is exactly how I implemented it, using a special
swap type. Refer to the v1 of my patchset, where I showcase the implementation of
most of the features.
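To make the swap analogy a bit more concrete, here is a rough, self-contained
userspace C model of that flow. None of the names below come from the patchset
or the real mm code; they are illustrative stand-ins only. A page either maps
system memory or carries a device-swap style entry, and the CPU fault handler
migrates the data back before the faulting access is retried.

/*
 * Toy model of "GPU memory looks like swap to the CPU".
 * Not the patchset code; simplified names and structures.
 */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

enum pte_state { PTE_PRESENT, PTE_SWAPPED_TO_DEVICE };

struct pte {
	enum pte_state state;
	void *sysmem;            /* valid only when PTE_PRESENT */
	unsigned long dev_off;   /* offset in device memory when swapped out */
};

/* Stand-in for the GPU's local memory and its copy engine. */
static char gpu_memory[1 << 16];

static void migrate_to_device(struct pte *pte, unsigned long dev_off, size_t sz)
{
	memcpy(gpu_memory + dev_off, pte->sysmem, sz);
	free(pte->sysmem);
	pte->sysmem = NULL;
	pte->dev_off = dev_off;
	pte->state = PTE_SWAPPED_TO_DEVICE;   /* CPU access now faults */
}

/* What the CPU fault handler would do on the special entry. */
static void handle_cpu_fault(struct pte *pte, size_t sz)
{
	if (pte->state != PTE_SWAPPED_TO_DEVICE)
		return;                        /* ordinary fault path */
	pte->sysmem = malloc(sz);
	memcpy(pte->sysmem, gpu_memory + pte->dev_off, sz);
	pte->state = PTE_PRESENT;              /* faulting thread resumes */
}

int main(void)
{
	size_t sz = 4096;
	struct pte pte = { .state = PTE_PRESENT, .sysmem = malloc(sz) };

	strcpy(pte.sysmem, "hello from system memory");
	migrate_to_device(&pte, 0, sz);        /* GPU job starts */
	handle_cpu_fault(&pte, sz);            /* CPU later touches the range */
	printf("%s\n", (char *)pte.sysmem);
	free(pte.sysmem);
	return 0;
}

In the real thing the special entry lives in the process page tables and the
copies are done by the device driver, but the control flow is the same idea:
migrate out for the GPU job, fault and migrate back on CPU access.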
> > But when a GPU job is launched we want most of the memory it will use to be
> > migrated to device memory. The issue is that the device memory is not accessible
> > from the CPU (PCIe BARs are too small). So there is no way to keep the memory
> > mapped for the CPU. We do need to mark the memory as inaccessible to the CPU
> > and then migrate it to the GPU memory.
>
> Ok so this is a transfer issue? Isn't this like block I/O? Write to a device?

It can be as slow as block I/O, but it's unlike a block device. It's closer to
NUMA in theory, because it's just about having memory close to the compute unit
(ie GPU memory in this case), but nothing else beside that matches NUMA.

> > Now when there is a CPU page fault on some migrated memory we need to migrate
> > memory back to system memory. Hence why I need to tie HMM with some core MM
> > code so that on this kind of fault the core kernel knows it needs to call into
> > HMM, which will perform housekeeping and start migration back to system
> > memory.
>
> Sounds like a read operation and like a major fault if you would use
> device semantics. You write the pages to the device and then evict them
> from memory (madvise can do that for you). An access then causes a page
> fault which leads to a read operation from the device.

Yes, it's a major fault case, but we do not want to use this with any special
syscall. Think of an existing application that links against a library. Now you
port the library to use the GPU, but the application is ignorant of this, and thus
any CPU access it does will be through the usual mmaped range that did not go
through any special syscall.

> > So technically there is no task migration, only memory migration.
> >
> > Is there something I am missing inside NUMA, or some NUMA work in progress that
> > changes NUMA sufficiently that it might somehow address the use case I am
> > describing above?
>
> I think you need to be looking at treating GPU memory as a block device
> then you have the semantics you need.

This was explored too, but a block device does not match what we want. A block
device is nice for file backed memory, and we could have special files that would
be backed by GPU memory; processes would open those special files and write to them.
But this is not how we want to use this. We really do want to mirror the process
address space, ie any kind of existing CPU mapping can be used by the GPU (except
mmaped IO), and we want to be able to migrate any of those existing CPU mappings
to GPU memory while still being able to service CPU page faults on ranges migrated
to GPU memory.

So unless there is something I am completely oblivious to in the block device
model in the Linux kernel, I fail to see how it could apply to what we want to
achieve.

Cheers,
Jérôme