diff for duplicates of <20170316234950.GA5725@redhat.com>

diff --git a/a/1.txt b/N1/1.txt
index e991211..58ba2f8 100644
--- a/a/1.txt
+++ b/N1/1.txt
@@ -112,4 +112,4 @@ there thing i should describe more thouroughly or aspect you feel are
 missing ?
 
 Cheers,
-Jerome
+Jérôme
diff --git a/a/2.txt b/N1/2.txt
index 8b13789..cb8b2cf 100644
--- a/a/2.txt
+++ b/N1/2.txt
@@ -1 +1,151 @@
+>From 4a2cb2211af22b6b149ba9afebc27f8d5763bac2 Mon Sep 17 00:00:00 2001
+From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>
+Date: Thu, 16 Mar 2017 20:27:43 -0400
+Subject: [PATCH] hmm: heterogeneous memory management documentation
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
 
+This adds documentation for HMM (Heterogeneous Memory Management). It
+presents the motivation behind it, the features necessary for it to
+be useful, and gives an overview of how it is implemented.
+
+Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
+---
+ Documentation/hmm.txt | 125 ++++++++++++++++++++++++++++++++++++++++++++++++++
+ 1 file changed, 125 insertions(+)
+ create mode 100644 Documentation/hmm.txt
+
+diff --git a/Documentation/hmm.txt b/Documentation/hmm.txt
+new file mode 100644
+index 0000000..83dd0ff
+--- /dev/null
++++ b/Documentation/hmm.txt
+@@ -0,0 +1,125 @@
++Heterogeneous Memory Management (HMM)
++
++Transparently allow any component of a program to use any memory region of said
++program with a device, without using a device-specific memory allocator. This
++is becoming a requirement to simplify the use of advanced heterogeneous
++computing, where GPUs, DSPs or FPGAs are used to perform various computations.
++
++This document is divided as follows: in the first section I expose the problems
++related to the use of a device-specific allocator. In the second section I
++expose the hardware limitations that are inherent to many platforms. The third
++section gives an overview of the HMM design.
++
++
++-------------------------------------------------------------------------------
++
++1) Problems of using a device-specific memory allocator:
++
++Devices with a large amount of on-board memory (several gigabytes), like GPUs,
++have historically managed their memory through dedicated driver-specific APIs.
++This creates a disconnect between memory allocated and managed by the device
++driver and regular application memory (private anonymous, shared memory or
++regular file-backed memory). From here on I will refer to this aspect as split
++address space. I use shared address space to refer to the opposite situation,
++i.e. one in which any memory region can be used by a device transparently.
++
++Split address space happens because a device can only access memory allocated
++through the device-specific API. This implies that not all memory objects in a
++program are equal from the device's point of view, which complicates large
++programs that rely on a wide set of libraries.
++
++Concretely, this means that code that wants to leverage devices like GPUs needs
++to copy objects between generically allocated memory (malloc, mmap
++private/shared) and memory allocated through the device driver API (this still
++ends up with an mmap, but of the device file).
++
++For flat data sets (array, grid, image, ...) this isn't too hard to achieve,
++but complex data sets (list, tree, ...) are hard to get right. Duplicating a
++complex data set requires re-mapping all the pointer relations between its
++elements. This is error prone, and the program gets harder to debug because of
++the duplicate data set.
++
++Split address space also means that libraries cannot transparently use data
++they get from the core program or another library, so each library might have
++to duplicate its input data set using the device-specific memory allocator.
++Large projects suffer from this and waste resources on the various copies.
++
++Duplicating each library API to accept as input or output memory allocated by
++each device-specific allocator is not a viable option. It would lead to a
++combinatorial explosion in the library entry points.
++
++Finally, with the advance of high-level language constructs (in C++ but in
++other languages too), it is now possible for the compiler to leverage GPUs or
++other devices without the programmer's knowledge. Some compiler-identified
++patterns are only doable with a shared address space. It is also more
++reasonable to use a shared address space for all the other patterns.
++
++
++-------------------------------------------------------------------------------
++
++2) System bus, device memory characteristics
++
++System buses cripple shared address space due to a few limitations. Most system
++buses only allow basic memory access from the device to main memory; even cache
++coherency is often optional. Access to device memory from the CPU is even more
++limited; more often than not, it is not cache coherent.
++
++If we only consider the PCIE bus, then a device can access main memory (often
++through an IOMMU) and be cache coherent with the CPUs. However, it only allows
++a limited set of atomic operations from a device on main memory. This is worse
++in the other direction: the CPUs can only access a limited range of the device
++memory and cannot perform atomic operations on it. Thus device memory cannot
++be considered the same as regular memory from the kernel's point of view.
++
++Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
++and 16 lanes), about 32 times less than the fastest GPU memory (1 TBytes/s).
++The final limitation is latency: access to main memory from the device has an
++order of magnitude higher latency than when the device accesses its own memory.
++
++Some platforms are developing new system buses or additions/modifications to
++PCIE to address some of those limitations (OpenCAPI, CCIX). They mainly allow
++two-way cache coherency between CPU and device and allow all atomic operations
++the architecture supports. Sadly, not all platforms are following this trend,
++and some major architectures lack hardware solutions to those problems.
++
++So for shared address space to make sense, not only must we allow a device to
++access any memory, but we must also permit any memory to be migrated to device
++memory while the device is using it (blocking CPU access while it happens).
++
++
++-------------------------------------------------------------------------------
++
++3) Shared address space and migration
++
++HMM intends to provide two main features. The first one is to share the address
++space by duplicating the CPU page table into the device page table, so the same
++address points to the same memory, for any valid main memory address in the
++process address space.
++
++To achieve this, HMM offers a set of helpers to populate the device page table
++while keeping track of CPU page table updates. Device page table updates are
++not as easy as CPU page table updates. To update the device page table you must
++allocate a buffer (or use a pool of pre-allocated buffers) and write
++GPU-specific commands in it to perform the update (unmap, cache invalidations
++and flush, ...). This cannot be done through common code for all devices.
++Hence HMM provides helpers to factor out everything that can be, while leaving
++the gory details to the device driver.
++
++The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
++allows allocating a struct page for each page of the device memory. Those pages
++are special because the CPU cannot map them. However, they allow migrating
++main memory to device memory using the existing migration mechanism, and
++everything looks as if the page was swapped out to disk from the CPU's point of
++view. Using a struct page gives the easiest and cleanest integration with
++existing mm mechanisms. Again, HMM only provides helpers here: first to hotplug
++new ZONE_DEVICE memory for the device memory, and second to perform migration.
++Policy decisions of what and when to migrate are left to the device driver.
++
++Note that any CPU access to a device page triggers a page fault that initiates
++a migration back to system memory so that the CPU can access it.
++
++
++With these two features, HMM not only allows a device to mirror a process
++address space and keeps both CPU and device page tables synchronized, but also
++allows leveraging device memory by migrating the part of the data set that is
++actively used by a device.
+-- 
+2.4.11
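
Illustration (not part of the patch): section 1 of the document above says that
duplicating a complex data set means re-mapping every pointer between its
elements. A minimal user-space sketch of that cost follows. Everything in it is
hypothetical: struct node, struct arena and the helper functions are made up for
this example, and the arena is only a stand-in for a device-specific allocator.
None of it is an HMM or driver API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node living in regular (malloc/stack) memory. */
struct node {
	int value;
	struct node *next;
};

/* Stand-in for a device-specific allocator: a fixed pool of nodes. */
struct arena {
	struct node slots[16];
	size_t used;
};

static struct node *arena_alloc(struct arena *a)
{
	if (a->used == sizeof(a->slots) / sizeof(a->slots[0]))
		return NULL;
	return &a->slots[a->used++];
}

/*
 * Duplicate a list into the arena. Copying the payload is not enough:
 * each next pointer must be rewritten to point at the copy, never at
 * the original node.
 */
static struct node *dup_list(struct arena *a, const struct node *src)
{
	struct node *head = NULL, **link = &head;

	for (; src; src = src->next) {
		struct node *n = arena_alloc(a);

		if (!n)
			return NULL;
		n->value = src->value;
		n->next = NULL;
		*link = n;		/* re-map the previous copy's pointer */
		link = &n->next;
	}
	return head;
}

static int list_sum(const struct node *n)
{
	int sum = 0;

	for (; n; n = n->next)
		sum += n->value;
	return sum;
}

/* Build a 3-node list, duplicate it, and return the sum seen via the copy. */
static int demo_copy_sum(void)
{
	struct node c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
	struct arena dev = { .used = 0 };
	struct node *copy = dup_list(&dev, &a);

	assert(copy != NULL && copy != &a);
	return list_sum(copy);
}
```

Calling demo_copy_sum() walks only the duplicated nodes. With a shared address
space, as the document argues, the device could follow the original pointers
directly and neither the copy nor the pointer fix-up would be needed.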
