From: Jerome Glisse
Subject: Re: HMM (heterogeneous memory management) v6
Date: Wed, 12 Nov 2014 23:28:21 -0500
Message-ID: <20141113042819.GB7720@gmail.com>
References: <1415644096-3513-1-git-send-email-j.glisse@gmail.com> <20141112200911.GA7720@gmail.com>
To: Christoph Lameter
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Linus Torvalds, joro@8bytes.org, Mel Gorman, "H. Peter Anvin", Peter Zijlstra, Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel, Liran Liss, R

On Wed, Nov 12, 2014 at 05:08:47PM -0600, Christoph Lameter wrote:
> On Wed, 12 Nov 2014, Jerome Glisse wrote:
>
> > > Could we define a new NUMA node that maps memory from the GPU and
> > > then simply use the existing NUMA features to move a process over there.
> >
> > So GPU processes will never run on the CPU, nor will they have a kernel task struct
> > associated with them. From the core kernel point of view they do not exist. I
> > hope that at some point down the line the hw will allow for better integration
> > with the kernel core, but it's not there yet.
>
> Right. So all of this is not relevant because the GPU manages it. You only
> need access from the regular processors from Linux which has and uses page
> tables.
>
> > So the NUMA idea was considered early on but was discarded as it's not really
> > appropriate. You can have several CPU threads working with several GPU threads
> > at the same time, and they can either access disjoint memory or some shared
> > memory. The usual case will be a few kbytes of shared memory for synchronization
> > between CPU and GPU threads.
>
> It is possible to have several threads accessing the memory in Linux. The
> GPU threads run on the GPU and therefore are not a Linux issue. Where did
> you see the problem?

When they both use system memory there is no issue, but if you want to leverage
the GPU to its full potential you need to migrate memory from system memory to GPU
memory for the duration of the GPU computation (which might be several minutes/hours
or more). But at the same time you do not want CPU access to be forbidden; if a CPU
access does happen you want to catch the CPU fault, schedule a migration of the
GPU memory back to system memory, and resume the CPU thread that faulted.

So from the CPU point of view this GPU memory is like swap: the memory is swapped
out to the GPU memory, and this is exactly how I implemented it, using a special
swap type. Refer to the v1 of my patchset, where I showcase the implementation of
most of the features.
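To make the swap analogy a bit more concrete, here is a rough, self-contained
userspace C model of that flow. None of the names below come from the patchset
or the real mm code; they are illustrative stand-ins only. A page either maps
system memory or carries a device-swap style entry, and the CPU fault handler
migrates the data back before the faulting access is retried.

/*
 * Toy model of "GPU memory looks like swap to the CPU".
 * Not the patchset code; simplified names and structures.
 */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

enum pte_state { PTE_PRESENT, PTE_SWAPPED_TO_DEVICE };

struct pte {
	enum pte_state state;
	void *sysmem;            /* valid only when PTE_PRESENT */
	unsigned long dev_off;   /* offset in device memory when swapped out */
};

/* Stand-in for the GPU's local memory and its copy engine. */
static char gpu_memory[1 << 16];

static void migrate_to_device(struct pte *pte, unsigned long dev_off, size_t sz)
{
	memcpy(gpu_memory + dev_off, pte->sysmem, sz);
	free(pte->sysmem);
	pte->sysmem = NULL;
	pte->dev_off = dev_off;
	pte->state = PTE_SWAPPED_TO_DEVICE;   /* CPU access now faults */
}

/* What the CPU fault handler would do on the special entry. */
static void handle_cpu_fault(struct pte *pte, size_t sz)
{
	if (pte->state != PTE_SWAPPED_TO_DEVICE)
		return;                        /* ordinary fault path */
	pte->sysmem = malloc(sz);
	memcpy(pte->sysmem, gpu_memory + pte->dev_off, sz);
	pte->state = PTE_PRESENT;              /* faulting thread resumes */
}

int main(void)
{
	size_t sz = 4096;
	struct pte pte = { .state = PTE_PRESENT, .sysmem = malloc(sz) };

	strcpy(pte.sysmem, "hello from system memory");
	migrate_to_device(&pte, 0, sz);        /* GPU job starts */
	handle_cpu_fault(&pte, sz);            /* CPU later touches the range */
	printf("%s\n", (char *)pte.sysmem);
	free(pte.sysmem);
	return 0;
}

In the real thing the special entry lives in the process page tables and the
copies are done by the device driver, but the control flow is the same idea:
migrate out for the GPU job, fault and migrate back on CPU access.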
> > But when a GPU job is launched we want most of the memory it will use to be
> > migrated to device memory. The issue is that the device memory is not accessible
> > from the CPU (PCIe BARs are too small). So there is no way to keep the memory
> > mapped for the CPU. We do need to mark the memory as inaccessible to the CPU
> > and then migrate it to the GPU memory.
>
> Ok so this is a transfer issue? Isn't this like block I/O? Write to a device?

It can be as slow as block I/O, but it's unlike a block device. It's closer to
NUMA in theory, because it's just about having memory close to the compute unit
(ie GPU memory in this case), but nothing else beside that matches NUMA.

> > Now when there is a CPU page fault on some migrated memory we need to migrate
> > memory back to system memory. Hence why I need to tie HMM with some core MM
> > code so that on this kind of fault the core kernel knows it needs to call into
> > HMM, which will perform housekeeping and start migration back to system
> > memory.
>
> Sounds like a read operation and like a major fault if you would use
> device semantics. You write the pages to the device and then evict them
> from memory (madvise can do that for you). An access then causes a page
> fault which leads to a read operation from the device.

Yes, it's a major fault case, but we do not want to use this with any special
syscall. Think of an existing application that links against a library. Now you
port the library to use the GPU, but the application is ignorant of this, and thus
any CPU access it does will be through the usual mmaped range that did not go
through any special syscall.

> > So technically there is no task migration, only memory migration.
> >
> > Is there something I am missing inside NUMA, or some NUMA work in progress that
> > changes NUMA sufficiently that it might somehow address the use case I am
> > describing above?
>
> I think you need to be looking at treating GPU memory as a block device
> then you have the semantics you need.

This was explored too, but a block device does not match what we want. A block
device is nice for file backed memory, and we could have special files that would
be backed by GPU memory; processes would open those special files and write to them.
But this is not how we want to use this. We really do want to mirror the process
address space, ie any kind of existing CPU mapping can be used by the GPU (except
mmaped IO), and we want to be able to migrate any of those existing CPU mappings
to GPU memory while still being able to service CPU page faults on ranges migrated
to GPU memory.

So unless there is something I am completely oblivious to in the block device
model in the Linux kernel, I fail to see how it could apply to what we want to
achieve.

Cheers,
Jérôme