* HMM (heterogeneous memory management) v6
@ 2014-11-10 18:28 j.glisse
2014-11-11 19:00 ` Christoph Lameter
0 siblings, 1 reply; 5+ messages in thread
From: j.glisse @ 2014-11-10 18:28 UTC (permalink / raw)
To: akpm
Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
Arvind Gopalakrishnan, Shachar Raindel, Liran Liss,
Roland Dreier
Andrew, so resending with reviews and acks from Rik and a couple of minor fixes
along the way. Is there anything blocking this from getting into the next kernel?
Again, hardware is coming and there is still a long list of features waiting
on this core set of patches getting in. I include part of my previous
email below.
What is it?
In a nutshell, HMM is a subsystem that provides an easy-to-use API to mirror a
process address space on a device, with minimal hardware requirements (mainly
device page faults and read-only page mapping). It does not rely on the ATS and
PASID PCIe extensions. It intends to supersede those extensions by allowing
system memory to be moved to device memory in a fashion that is transparent to
core kernel mm code (i.e. a CPU page fault on a page residing in device memory
will trigger migration back to system memory).
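To give a rough idea of the driver-facing side, here is a minimal sketch; the
names (hmm_mirror, hmm_mirror_ops, hmm_mirror_register, hmm_mirror_fault) are
illustrative only and do not necessarily match the actual interface in the
patches:

  /* Sketch only: a driver-facing "mirror this address space" API. */
  struct hmm_mirror;

  struct hmm_mirror_ops {
          /* CPU page tables changed for [start, end): the device must
           * invalidate its own page table entries for that range. */
          void (*update)(struct hmm_mirror *mirror,
                         unsigned long start, unsigned long end);
  };

  struct hmm_mirror {
          const struct hmm_mirror_ops *ops;
          struct mm_struct *mm;            /* mirrored address space */
  };

  /* Register a mirror of a process address space on behalf of a device. */
  int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);

  /* On a device page fault, ask HMM to fault in the range and hand back
   * entries the driver can plug into the device page table. */
  int hmm_mirror_fault(struct hmm_mirror *mirror, unsigned long addr,
                       bool write);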
Why do this?
We want to be able to mirror a process address space so that compute APIs such
as OpenCL or other similar APIs can start using the exact same address space on
the GPU as on the CPU. This will greatly simplify the use of those APIs. Moreover,
we believe we will see more and more specialized functional units that will
want to mirror process addresses using their own MMU.
The migration side exists simply because GPU memory bandwidth is far beyond
system memory bandwidth, and there is no sign that this gap is closing (quite
the opposite).
Current status and future features:
None of this changes core kernel mm code in any major way. This is simple
groundwork with no impact on existing code paths. Features that will be
implemented on top of this are:
1 - Transparently handle page mapping on behalf of device drivers (DMA).
2 - Improve the DMA API to better match the new usage pattern of HMM.
3 - Migration of anonymous memory to device memory.
4 - Locking memory to remote memory (CPU access triggers SIGBUS).
5 - Access exclusion between CPU and device for atomic operations.
6 - Migration of file-backed memory to device memory.
How future features will be implemented:
1 - Simply use the existing DMA API to map pages on behalf of a device.
2 - Introduce a new DMA API to match the new semantics of HMM. It is no longer
a page we map but an address range, and managing which page is effectively
backing an address should be easy to update. I gave a presentation about
this at this year's LPC.
3 - Requires changes to the CPU page fault code path to handle migration back
to system memory on CPU access. An implementation of this was already sent
as part of v1. This will be low impact and only adds handling of a new
special swap type to the existing fault code (see the sketch after this
list).
4 - Requires a new syscall, as I cannot see which current syscall would be
appropriate for this. My first thought was to use mbind, as it has the
right semantics (binding a range of addresses to a device), but mbind is
too NUMA centric.
The second was madvise, but the semantics do not match: madvise allows the
kernel to ignore the hints, while we want to block CPU access for as long
as the range is bound to a device.
So I do not think any existing syscall can be extended with new flags,
but maybe I am wrong.
5 - Allowing a page to be mapped read-only on the CPU while a device performs
some atomic operation on it (this is mainly to work around system buses
that do not support atomic memory access, and sadly there is a large
base of hardware without that feature).
The easiest implementation would use a page flag, but there are none left,
so it must be a flag in the vma to indicate whether there is a need to
query HMM for write protection.
6 - This is the trickiest one to implement, and while I showed a proof of
concept with v1, I still have a lot of conflicting feelings about how
to achieve this.
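For item 3, here is a minimal sketch of the kind of fault-path hook meant
above, modeled on how the existing fault path already dispatches on special
(non-swap) swap entries; is_hmm_entry() and hmm_mm_fault() are hypothetical
names, not code from the actual patches, and mm/vma/address/flags/orig_pte
are the usual fault-path context:

  /* Sketch only: dispatch a new special swap type in the CPU fault path. */
  swp_entry_t entry = pte_to_swp_entry(orig_pte);

  if (non_swap_entry(entry) && is_hmm_entry(entry)) {
          /* The page currently resides in device memory: have HMM
           * migrate it back to system memory, then retry the fault. */
          return hmm_mm_fault(mm, vma, address, entry, flags);
  }
  /* otherwise: regular swap and other special entries as today */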
As usual, comments are more than welcome. Thanks in advance to anyone who
takes a look at this code.
Previous patchset posting :
v1 http://lwn.net/Articles/597289/
v2 https://lkml.org/lkml/2014/6/12/559 (cover letter did not make it to ml)
v3 https://lkml.org/lkml/2014/6/13/633
v4 https://lkml.org/lkml/2014/8/29/423
v5 https://lkml.org/lkml/2014/11/3/759
Cheers,
Jérôme
To: "Andrew Morton" <akpm@linux-foundation.org>,
Cc: <linux-kernel@vger.kernel.org>,
Cc: linux-mm <linux-mm@kvack.org>,
Cc: <linux-fsdevel@vger.kernel.org>,
Cc: "Linus Torvalds" <torvalds@linux-foundation.org>,
Cc: "Mel Gorman" <mgorman@suse.de>,
Cc: "H. Peter Anvin" <hpa@zytor.com>,
Cc: "Peter Zijlstra" <peterz@infradead.org>,
Cc: "Linda Wang" <lwang@redhat.com>,
Cc: "Kevin E Martin" <kem@redhat.com>,
Cc: "Jerome Glisse" <jglisse@redhat.com>,
Cc: "Andrea Arcangeli" <aarcange@redhat.com>,
Cc: "Johannes Weiner" <jweiner@redhat.com>,
Cc: "Larry Woodman" <lwoodman@redhat.com>,
Cc: "Rik van Riel" <riel@redhat.com>,
Cc: "Dave Airlie" <airlied@redhat.com>,
Cc: "Jeff Law" <law@redhat.com>,
Cc: "Brendan Conoboy" <blc@redhat.com>,
Cc: "Joe Donohue" <jdonohue@redhat.com>,
Cc: "Duncan Poole" <dpoole@nvidia.com>,
Cc: "Sherry Cheung" <SCheung@nvidia.com>,
Cc: "Subhash Gutti" <sgutti@nvidia.com>,
Cc: "John Hubbard" <jhubbard@nvidia.com>,
Cc: "Mark Hairgrove" <mhairgrove@nvidia.com>,
Cc: "Lucien Dunning" <ldunning@nvidia.com>,
Cc: "Cameron Buschardt" <cabuschardt@nvidia.com>,
Cc: "Arvind Gopalakrishnan" <arvindg@nvidia.com>,
Cc: "Haggai Eran" <haggaie@mellanox.com>,
Cc: "Or Gerlitz" <ogerlitz@mellanox.com>,
Cc: "Sagi Grimberg" <sagig@mellanox.com>
Cc: "Shachar Raindel" <raindel@mellanox.com>,
Cc: "Liran Liss" <liranl@mellanox.com>,
Cc: "Roland Dreier" <roland@purestorage.com>,
Cc: "Sander, Ben" <ben.sander@amd.com>,
Cc: "Stoner, Greg" <Greg.Stoner@amd.com>,
Cc: "Bridgman, John" <John.Bridgman@amd.com>,
Cc: "Mantor, Michael" <Michael.Mantor@amd.com>,
Cc: "Blinzer, Paul" <Paul.Blinzer@amd.com>,
Cc: "Morichetti, Laurent" <Laurent.Morichetti@amd.com>,
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
Cc: "Gabbay, Oded" <Oded.Gabbay@amd.com>,
* Re: HMM (heterogeneous memory management) v6
2014-11-10 18:28 HMM (heterogeneous memory management) v6 j.glisse
@ 2014-11-11 19:00 ` Christoph Lameter
2014-11-12 20:09 ` Jerome Glisse
0 siblings, 1 reply; 5+ messages in thread
From: Christoph Lameter @ 2014-11-11 19:00 UTC (permalink / raw)
To: j.glisse
Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
Arvind Gopalakrishnan, Shachar Raindel, Liran Liss
On Mon, 10 Nov 2014, j.glisse@gmail.com wrote:
> In a nutshell, HMM is a subsystem that provides an easy-to-use API to mirror a
> process address space on a device, with minimal hardware requirements (mainly
> device page faults and read-only page mapping). It does not rely on the ATS and
> PASID PCIe extensions. It intends to supersede those extensions by allowing
> system memory to be moved to device memory in a fashion that is transparent to
> core kernel mm code (i.e. a CPU page fault on a page residing in device memory
> will trigger migration back to system memory).
Could we define a new NUMA node that maps memory from the GPU and
then simply use the existing NUMA features to move a process over there?
* Re: HMM (heterogeneous memory management) v6
2014-11-11 19:00 ` Christoph Lameter
@ 2014-11-12 20:09 ` Jerome Glisse
2014-11-12 23:08 ` Christoph Lameter
0 siblings, 1 reply; 5+ messages in thread
From: Jerome Glisse @ 2014-11-12 20:09 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
Arvind Gopalakrishnan, Shachar Raindel, Liran Liss
On Tue, Nov 11, 2014 at 01:00:56PM -0600, Christoph Lameter wrote:
> On Mon, 10 Nov 2014, j.glisse@gmail.com wrote:
>
> > In a nutshell, HMM is a subsystem that provides an easy-to-use API to mirror a
> > process address space on a device, with minimal hardware requirements (mainly
> > device page faults and read-only page mapping). It does not rely on the ATS and
> > PASID PCIe extensions. It intends to supersede those extensions by allowing
> > system memory to be moved to device memory in a fashion that is transparent to
> > core kernel mm code (i.e. a CPU page fault on a page residing in device memory
> > will trigger migration back to system memory).
>
> Could we define a new NUMA node that maps memory from the GPU and
> then simply use the existing NUMA features to move a process over there?
Sorry for the late reply, I am traveling and working on an updated patchset to
change the device page table design to something simpler and easier to grasp.
So GPU processes will never run on the CPU, nor will they have a kernel task
struct associated with them. From the core kernel's point of view they do not
exist. I hope that at some point down the line the hardware will allow better
integration with the kernel core, but it's not there yet.
So the NUMA idea was considered early on but was discarded as it's not really
appropriate. You can have several CPU threads working with several GPU threads
at the same time, and they can access either disjoint memory or some shared
memory. The usual case will be a few kbytes of shared memory for synchronization
between CPU and GPU threads.
But when a GPU job is launched we want most of the memory it will use to be
migrated to device memory. The issue is that the device memory is not accessible
from the CPU (the PCIe BAR is too small). So there is no way to keep the memory
mapped for the CPU. We need to mark the memory as inaccessible to the CPU and
then migrate it to GPU memory.
Now when there is a CPU page fault on some migrated memory we need to migrate
that memory back to system memory. This is why I need to tie HMM into some core
MM code, so that on this kind of fault the core kernel knows it needs to call
into HMM, which will perform housekeeping and start migration back to system
memory.
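A very condensed sketch of that flow, with every name being hypothetical
shorthand rather than the real HMM code:

  /* Sketch only: migrate an address range to device memory. */
  static int hmm_migrate_to_device(struct hmm_device *dev,
                                   struct mm_struct *mm,
                                   unsigned long start, unsigned long end)
  {
          /* 1. Unmap the CPU page table entries for [start, end) and
           *    replace them with a special swap entry per page, so any
           *    later CPU touch faults into HMM. */
          /* 2. DMA-copy the pages into device memory (which the CPU
           *    can not map, the PCIe BAR being too small). */
          /* 3. Free the system pages; the device page table now holds
           *    the only valid mapping. */
          return 0;
  }

  /* On a CPU fault over such an entry HMM does the reverse: allocate
   * system pages, copy them back from device memory, restore the CPU
   * page table entries and invalidate the device page table range. */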
So technically there is no task migration, only memory migration.
Is there something I am missing inside NUMA, or some NUMA work in progress
that changes NUMA sufficiently that it might somehow address the use case I
am describing above?
Cheers,
Jérôme
* Re: HMM (heterogeneous memory management) v6
2014-11-12 20:09 ` Jerome Glisse
@ 2014-11-12 23:08 ` Christoph Lameter
2014-11-13 4:28 ` Jerome Glisse
0 siblings, 1 reply; 5+ messages in thread
From: Christoph Lameter @ 2014-11-12 23:08 UTC (permalink / raw)
To: Jerome Glisse
Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
Arvind Gopalakrishnan, Shachar Raindel
On Wed, 12 Nov 2014, Jerome Glisse wrote:
> > Could we define a new NUMA node that maps memory from the GPU and
> > then simply use the existing NUMA features to move a process over there?
>
> So GPU processes will never run on the CPU, nor will they have a kernel task
> struct associated with them. From the core kernel's point of view they do not
> exist. I hope that at some point down the line the hardware will allow better
> integration with the kernel core, but it's not there yet.
Right. So all of this is not relevant because the GPU manages it. You only
need access from the regular processors under Linux, which has and uses page
tables.
> So the NUMA idea was considered early on but was discarded as it's not really
> appropriate. You can have several CPU threads working with several GPU threads
> at the same time, and they can access either disjoint memory or some shared
> memory. The usual case will be a few kbytes of shared memory for synchronization
> between CPU and GPU threads.
It is possible to have several threads accessing the memory in Linux. The
GPU threads run on the GPU and therefore are not a Linux issue. Where do
you see the problem?
> But when a GPU job is launched we want most of the memory it will use to be
> migrated to device memory. The issue is that the device memory is not accessible
> from the CPU (the PCIe BAR is too small). So there is no way to keep the memory
> mapped for the CPU. We need to mark the memory as inaccessible to the CPU and
> then migrate it to GPU memory.
OK, so this is a transfer issue? Isn't this like block I/O? A write to a device?
> Now when there is a CPU page fault on some migrated memory we need to migrate
> that memory back to system memory. This is why I need to tie HMM into some core
> MM code, so that on this kind of fault the core kernel knows it needs to call
> into HMM, which will perform housekeeping and start migration back to system
> memory.
Sounds like a read operation, and like a major fault if you were to use
device semantics. You write the pages to the device and then evict them
from memory (madvise can do that for you). An access then causes a page
fault which leads to a read operation from the device.
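Roughly, with only existing syscalls against a (hypothetical) mmaped GPU
block device, gpu_fd being the opened device file:

  /* Sketch of the sequence above: write back, evict, refault on access. */
  void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, gpu_fd, 0);
  /* ... CPU works on the mapping ... */
  msync(addr, len, MS_SYNC);           /* write dirty pages to the device */
  madvise(addr, len, MADV_DONTNEED);   /* evict them from system memory */
  /* touching addr now takes a major fault that reads from the device */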
> So technically there is no task migration, only memory migration.
>
>
> Is there something I am missing inside NUMA, or some NUMA work in progress
> that changes NUMA sufficiently that it might somehow address the use case I
> am describing above?
I think you need to look at treating GPU memory as a block device; then
you have the semantics you need.
* Re: HMM (heterogeneous memory management) v6
2014-11-12 23:08 ` Christoph Lameter
@ 2014-11-13 4:28 ` Jerome Glisse
0 siblings, 0 replies; 5+ messages in thread
From: Jerome Glisse @ 2014-11-13 4:28 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
Arvind Gopalakrishnan, Shachar Raindel, Liran Liss
On Wed, Nov 12, 2014 at 05:08:47PM -0600, Christoph Lameter wrote:
> On Wed, 12 Nov 2014, Jerome Glisse wrote:
>
> > > Could we define a new NUMA node that maps memory from the GPU and
> > > then simply use the existing NUMA features to move a process over there?
> >
> > So GPU processes will never run on the CPU, nor will they have a kernel task
> > struct associated with them. From the core kernel's point of view they do not
> > exist. I hope that at some point down the line the hardware will allow better
> > integration with the kernel core, but it's not there yet.
>
> Right. So all of this is not relevant because the GPU manages it. You only
> need access from the regular processors under Linux, which has and uses page
> tables.
>
> > So the NUMA idea was considered early on but was discarded as it's not really
> > appropriate. You can have several CPU threads working with several GPU threads
> > at the same time, and they can access either disjoint memory or some shared
> > memory. The usual case will be a few kbytes of shared memory for synchronization
> > between CPU and GPU threads.
>
> It is possible to have several threads accessing the memory in Linux. The
> GPU threads run on the GPU and therefore are not a Linux issue. Where do
> you see the problem?
When they both use system memory there is no issue, but if you want to leverage
the GPU to its full potential you need to migrate memory from system memory to
GPU memory for the duration of the GPU computation (which might be several
minutes/hours or more). But at the same time you do not want CPU access to be
forbidden; thus if a CPU access does happen you want to catch the CPU fault,
schedule a migration of the GPU memory back to system memory, and resume the
CPU thread that faulted.
So from the CPU's point of view this GPU memory is like swap: the memory is
swapped out to GPU memory, and this is exactly how I implemented it, using a
special swap type. Refer to v1 of my patchset, which showcases an implementation
of most of the features.
>
> > But when a GPU job is launched we want most of the memory it will use to be
> > migrated to device memory. The issue is that the device memory is not accessible
> > from the CPU (the PCIe BAR is too small). So there is no way to keep the memory
> > mapped for the CPU. We need to mark the memory as inaccessible to the CPU and
> > then migrate it to GPU memory.
>
> OK, so this is a transfer issue? Isn't this like block I/O? A write to a device?
>
It can be as slow as block I/O, but it's unlike a block device; it's closer to
NUMA in theory because it's just about having memory close to the compute unit
(i.e. GPU memory in this case), but nothing else about it matches NUMA.
>
> > Now when there is a CPU page fault on some migrated memory we need to migrate
> > that memory back to system memory. This is why I need to tie HMM into some core
> > MM code, so that on this kind of fault the core kernel knows it needs to call
> > into HMM, which will perform housekeeping and start migration back to system
> > memory.
>
>
> Sounds like a read operation, and like a major fault if you were to use
> device semantics. You write the pages to the device and then evict them
> from memory (madvise can do that for you). An access then causes a page
> fault which leads to a read operation from the device.
Yes, it's a major fault case, but we do not want to require any special syscall
for this. Think of an existing application that links against a library: you
port the library to use the GPU, but the application is ignorant of this, and
thus any CPU access it does goes through the usual mmaped ranges that did not
go through any special syscall.
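Concretely, think of something like this, where fill_input() and
library_compute() stand in for hypothetical application and GPU-backed
library code, and the application itself is completely unmodified:

  /* The application allocates and touches memory the usual way; only
   * the library knows a GPU is involved. */
  float *data = malloc(n * sizeof(*data));
  fill_input(data, n);             /* plain CPU access */

  /* The library internally mirrors the address space, migrates the
   * range to GPU memory and runs the computation there. */
  library_compute(data, n);        /* hypothetical GPU-backed call */

  /* The application reads the result through the same pointer; a CPU
   * page fault here triggers migration back to system memory. */
  printf("%f\n", data[0]);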
>
> > So technically there is no task migration, only memory migration.
> >
> >
> > Is there something I am missing inside NUMA, or some NUMA work in progress
> > that changes NUMA sufficiently that it might somehow address the use case I
> > am describing above?
>
> I think you need to look at treating GPU memory as a block device; then
> you have the semantics you need.
This was explored too, but a block device does not match what we want. A block
device is nice for file-backed memory, and we could have special files backed
by GPU memory that processes open and write to. But this is not how we want to
use this; we really want to mirror the process address space, i.e. any kind of
existing CPU mapping can be used by the GPU (except mmaped I/O), and we want to
be able to migrate any of those existing CPU mappings to GPU memory while still
being able to service CPU page faults on ranges migrated to GPU memory.
So unless there is something I am completely oblivious to in the block device
model in the Linux kernel, I fail to see how it could apply to what we want to
achieve.
Cheers,
Jérôme