diff for duplicates of <20170316234950.GA5725@redhat.com>

diff --git a/a/1.txt b/N1/1.txt
index e991211..58ba2f8 100644
--- a/a/1.txt
+++ b/N1/1.txt
@@ -112,4 +112,4 @@ there thing i should describe more thouroughly or aspect you feel are
 missing ?

 Cheers,
-Jerome
+Jérôme
diff --git a/a/2.txt b/N1/2.txt
index 8b13789..cb8b2cf 100644
--- a/a/2.txt
+++ b/N1/2.txt
@@ -1 +1,151 @@
+>From 4a2cb2211af22b6b149ba9afebc27f8d5763bac2 Mon Sep 17 00:00:00 2001
+From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>
+Date: Thu, 16 Mar 2017 20:27:43 -0400
+Subject: [PATCH] hmm: heterogeneous memory management documentation
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit

+This add documentation for HMM (Heterogeneous Memory Management). It
+presents the motivation behind it, the features necessary for it to
+be usefull and and gives an overview of how this is implemented.
+
+Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
+---
+ Documentation/hmm.txt | 125 ++++++++++++++++++++++++++++++++++++++++++++++++++
+ 1 file changed, 125 insertions(+)
+ create mode 100644 Documentation/hmm.txt
+
+diff --git a/Documentation/hmm.txt b/Documentation/hmm.txt
+new file mode 100644
+index 0000000..83dd0ff
+--- /dev/null
++++ b/Documentation/hmm.txt
+@@ -0,0 +1,125 @@
++Heterogeneous Memory Management (HMM)
++
++Transparently allow any component of a program to use any memory region of said
++program with a device without using device specific memory allocator. This is
++becoming a requirement to simplify the use of advance heterogeneous computing
++where GPU, DSP or FPGA are use to perform various computations.
++
++This document is divided as follow, in the first section i expose the problems
++related to the use of a device specific allocator. The second section i expose
++the hardware limitations that are inherent to many platforms. The third section
++gives an overview of HMM designs.
++
++
++-------------------------------------------------------------------------------
++
++1) Problems of using device specific memory allocator:
++
++Device with large amount of on board memory (several giga bytes) like GPU have
++historicaly manage their memory through dedicated driver specific API. This
++creates a disconnect between memory allocated and managed by device driver and
++regular application memory (private anonynous, share memory or regular file
++back memory). From here on i will refer to this aspect as split address space.
++I use share address space to refer to the opposite situation ie one in which
++any memory region can be use by device transparently.
++
++Split address space because device can only access memory allocated through the
++device specific API. This imply that all memory object in a program are not
++equal from device point of view which complicate large program that rely on a
++wide set of libraries.
++
++Concretly this means that code that wants to leverage device like GPU need to
++copy object between genericly allocated memory (malloc, mmap private/share/)
++and memory allocated through the device driver API (this still end up with an
++mmap but of the device file).
++
++For flat dataset (array, grid, image, ...) this isn't too hard to achieve but
++complex data-set (list, tree, ...) are hard to get right. Duplicating a complex
++data-set need to re-map all the pointer relations between each of its elements.
++This is error prone and program gets harder to debug because of the duplicate
++data-set.
++
++Split address space also means that library can not transparently use data they
++are getting from core program or other library and thus each library might have
++to duplicate its input data-set using specific memory allocator. Large project
++suffer from this and waste resources because of the various memory copy.
++
++Duplicating each library API to accept as input or output memory allocted by
++each device specific allocator is not a viable option. It would lead to a
++combinatorial explosions in the library entry points.
++
++Finaly with the advance of high level langage constructs (in C++ but in other
++langage too) it is now possible for compiler to leverage GPU or other devices
++without even the programmer knowledge. Some of compiler identified patterns are
++only do-able with a share address. It is as well more reasonable to use a share
++address space for all the other patterns.
++
++
++-------------------------------------------------------------------------------
++
++2) System bus, device memory characteristics
++
++System bus cripple share address due to few limitations. Most system bus only
++allow basic memory access from device to main memory, even cache coherency is
++often optional. Access to device memory from CPU is even more limited, most
++often than not it is not cache coherent.
++
++If we only consider the PCIE bus than device can access main memory (often
++through an IOMMU) and be cache coherent with the CPUs. However it only allows
++a limited set of atomic operation from device on main memory. This is worse
++in the other direction the CPUs can only access a limited range of the device
++memory and can not perform atomic operations on it. Thus device memory can not
++be consider like regular memory from kernel point of view.
++
++Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
++and 16 lanes). This is 33 times less that fastest GPU memory (1 TBytes/s).
++The final limitation is latency, access to main memory from the device has an
++order of magnitude higher latency than when the device access its own memory.
++
++Some platform are developing new system bus or additions/modifications to PCIE
++to address some of those limitations (OpenCAPI, CCIX). They mainly allow two
++way cache coherency between CPU and device and allow all atomic operations the
++architecture supports. Saddly not all platform are following this trends and
++some major architecture are left without hardware solutions to those problems.
++
++So for share address space to make sense not only we must allow device to
++access any memory memory but we must also permit any memory to be migrated to
++device memory while device is using it (blocking CPU access while it happens).
++
++
++-------------------------------------------------------------------------------
++
++3) Share address space and migration
++
++HMM intends to provide two main features. First one is to share the address
++space by duplication the CPU page table into the device page table so same
++address point to same memory and this for any valid main memory address in
++the process address space.
++
++To achieve this, HMM offer a set of helpers to populate the device page table
++while keeping track of CPU page table updates. Device page table updates are
++not as easy as CPU page table updates. To update the device page table you must
++allow a buffer (or use a pool of pre-allocated buffer) and write GPU specifics
++commands in it to perform the update (unmap, cache invalidations and flush,
++...). This can not be done through common code for all device. Hence why HMM
++provides helpers to factor out everything that can be while leaving the gory
++details to the device driver.
++
++The second mechanism HMM provide is a new kind of ZONE_DEVICE memory that does
++allow to allocate a struct page for each page of the device memory. Those page
++are special because the CPU can not map them. They however allow to migrate
++main memory to device memory using exhisting migration mechanism and everything
++looks like if page was swap out to disk from CPU point of view. Using a struct
++page gives the easiest and cleanest integration with existing mm mechanisms.
++Again here HMM only provide helpers, first to hotplug new ZONE_DEVICE memory
++for the device memory and second to perform migration. Policy decision of what
++and when to migrate things is left to the device driver.
++
++Note that any CPU acess to a device page trigger a page fault which initiate a
++migration back to system memory so that CPU can access it.
++
++
++With this two features, HMM not only allow a device to mirror a process address
++space and keeps both CPU and device page table synchronize, but also allow to
++leverage device memory by migrating part of data-set that is actively use by a
++device.
+-- 
+2.4.11
diff --git a/a/content_digest b/N1/content_digest
index 9891be9..0f47238 100644
--- a/a/content_digest
+++ b/N1/content_digest
@@ -125,9 +125,160 @@
  "missing ?\n"
  "\n"
  "Cheers,\n"
- Jerome
+ "J\303\251r\303\264me"
  "\01:2\0"
  "fn\00001-hmm-heterogeneous-memory-management-documentation.patch\0"
  "b\0"
+ ">From 4a2cb2211af22b6b149ba9afebc27f8d5763bac2 Mon Sep 17 00:00:00 2001\n"
+ "From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>\n"
+ "Date: Thu, 16 Mar 2017 20:27:43 -0400\n"
+ "Subject: [PATCH] hmm: heterogeneous memory management documentation\n"
+ "MIME-Version: 1.0\n"
+ "Content-Type: text/plain; charset=UTF-8\n"
+ "Content-Transfer-Encoding: 8bit\n"
+ "\n"
+ "This add documentation for HMM (Heterogeneous Memory Management). It\n"
+ "presents the motivation behind it, the features necessary for it to\n"
+ "be usefull and and gives an overview of how this is implemented.\n"
+ "\n"
+ "Signed-off-by: J\303\251r\303\264me Glisse <jglisse@redhat.com>\n"
+ "---\n"
+ " Documentation/hmm.txt | 125 ++++++++++++++++++++++++++++++++++++++++++++++++++\n"
+ " 1 file changed, 125 insertions(+)\n"
+ " create mode 100644 Documentation/hmm.txt\n"
+ "\n"
+ "diff --git a/Documentation/hmm.txt b/Documentation/hmm.txt\n"
+ "new file mode 100644\n"
+ "index 0000000..83dd0ff\n"
+ "--- /dev/null\n"
+ "+++ b/Documentation/hmm.txt\n"
+ "@@ -0,0 +1,125 @@\n"
+ "+Heterogeneous Memory Management (HMM)\n"
+ "+\n"
+ "+Transparently allow any component of a program to use any memory region of said\n"
+ "+program with a device without using device specific memory allocator. This is\n"
+ "+becoming a requirement to simplify the use of advance heterogeneous computing\n"
+ "+where GPU, DSP or FPGA are use to perform various computations.\n"
+ "+\n"
+ "+This document is divided as follow, in the first section i expose the problems\n"
+ "+related to the use of a device specific allocator. The second section i expose\n"
+ "+the hardware limitations that are inherent to many platforms. The third section\n"
+ "+gives an overview of HMM designs.\n"
+ "+\n"
+ "+\n"
+ "+-------------------------------------------------------------------------------\n"
+ "+\n"
+ "+1) Problems of using device specific memory allocator:\n"
+ "+\n"
+ "+Device with large amount of on board memory (several giga bytes) like GPU have\n"
+ "+historicaly manage their memory through dedicated driver specific API. This\n"
+ "+creates a disconnect between memory allocated and managed by device driver and\n"
+ "+regular application memory (private anonynous, share memory or regular file\n"
+ "+back memory). From here on i will refer to this aspect as split address space.\n"
+ "+I use share address space to refer to the opposite situation ie one in which\n"
+ "+any memory region can be use by device transparently.\n"
+ "+\n"
+ "+Split address space because device can only access memory allocated through the\n"
+ "+device specific API. This imply that all memory object in a program are not\n"
+ "+equal from device point of view which complicate large program that rely on a\n"
+ "+wide set of libraries.\n"
+ "+\n"
+ "+Concretly this means that code that wants to leverage device like GPU need to\n"
+ "+copy object between genericly allocated memory (malloc, mmap private/share/)\n"
+ "+and memory allocated through the device driver API (this still end up with an\n"
+ "+mmap but of the device file).\n"
+ "+\n"
+ "+For flat dataset (array, grid, image, ...) this isn't too hard to achieve but\n"
+ "+complex data-set (list, tree, ...) are hard to get right. Duplicating a complex\n"
+ "+data-set need to re-map all the pointer relations between each of its elements.\n"
+ "+This is error prone and program gets harder to debug because of the duplicate\n"
+ "+data-set.\n"
+ "+\n"
+ "+Split address space also means that library can not transparently use data they\n"
+ "+are getting from core program or other library and thus each library might have\n"
+ "+to duplicate its input data-set using specific memory allocator. Large project\n"
+ "+suffer from this and waste resources because of the various memory copy.\n"
+ "+\n"
+ "+Duplicating each library API to accept as input or output memory allocted by\n"
+ "+each device specific allocator is not a viable option. It would lead to a\n"
+ "+combinatorial explosions in the library entry points.\n"
+ "+\n"
+ "+Finaly with the advance of high level langage constructs (in C++ but in other\n"
+ "+langage too) it is now possible for compiler to leverage GPU or other devices\n"
+ "+without even the programmer knowledge. Some of compiler identified patterns are\n"
+ "+only do-able with a share address. It is as well more reasonable to use a share\n"
+ "+address space for all the other patterns.\n"
+ "+\n"
+ "+\n"
+ "+-------------------------------------------------------------------------------\n"
+ "+\n"
+ "+2) System bus, device memory characteristics\n"
+ "+\n"
+ "+System bus cripple share address due to few limitations. Most system bus only\n"
+ "+allow basic memory access from device to main memory, even cache coherency is\n"
+ "+often optional. Access to device memory from CPU is even more limited, most\n"
+ "+often than not it is not cache coherent.\n"
+ "+\n"
+ "+If we only consider the PCIE bus than device can access main memory (often\n"
+ "+through an IOMMU) and be cache coherent with the CPUs. However it only allows\n"
+ "+a limited set of atomic operation from device on main memory. This is worse\n"
+ "+in the other direction the CPUs can only access a limited range of the device\n"
+ "+memory and can not perform atomic operations on it. Thus device memory can not\n"
+ "+be consider like regular memory from kernel point of view.\n"
+ "+\n"
+ "+Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0\n"
+ "+and 16 lanes). This is 33 times less that fastest GPU memory (1 TBytes/s).\n"
+ "+The final limitation is latency, access to main memory from the device has an\n"
+ "+order of magnitude higher latency than when the device access its own memory.\n"
+ "+\n"
+ "+Some platform are developing new system bus or additions/modifications to PCIE\n"
+ "+to address some of those limitations (OpenCAPI, CCIX). They mainly allow two\n"
+ "+way cache coherency between CPU and device and allow all atomic operations the\n"
+ "+architecture supports. Saddly not all platform are following this trends and\n"
+ "+some major architecture are left without hardware solutions to those problems.\n"
+ "+\n"
+ "+So for share address space to make sense not only we must allow device to\n"
+ "+access any memory memory but we must also permit any memory to be migrated to\n"
+ "+device memory while device is using it (blocking CPU access while it happens).\n"
+ "+\n"
+ "+\n"
+ "+-------------------------------------------------------------------------------\n"
+ "+\n"
+ "+3) Share address space and migration\n"
+ "+\n"
+ "+HMM intends to provide two main features. First one is to share the address\n"
+ "+space by duplication the CPU page table into the device page table so same\n"
+ "+address point to same memory and this for any valid main memory address in\n"
+ "+the process address space.\n"
+ "+\n"
+ "+To achieve this, HMM offer a set of helpers to populate the device page table\n"
+ "+while keeping track of CPU page table updates. Device page table updates are\n"
+ "+not as easy as CPU page table updates. To update the device page table you must\n"
+ "+allow a buffer (or use a pool of pre-allocated buffer) and write GPU specifics\n"
+ "+commands in it to perform the update (unmap, cache invalidations and flush,\n"
+ "+...). This can not be done through common code for all device. Hence why HMM\n"
+ "+provides helpers to factor out everything that can be while leaving the gory\n"
+ "+details to the device driver.\n"
+ "+\n"
+ "+The second mechanism HMM provide is a new kind of ZONE_DEVICE memory that does\n"
+ "+allow to allocate a struct page for each page of the device memory. Those page\n"
+ "+are special because the CPU can not map them. They however allow to migrate\n"
+ "+main memory to device memory using exhisting migration mechanism and everything\n"
+ "+looks like if page was swap out to disk from CPU point of view. Using a struct\n"
+ "+page gives the easiest and cleanest integration with existing mm mechanisms.\n"
+ "+Again here HMM only provide helpers, first to hotplug new ZONE_DEVICE memory\n"
+ "+for the device memory and second to perform migration. Policy decision of what\n"
+ "+and when to migrate things is left to the device driver.\n"
+ "+\n"
+ "+Note that any CPU acess to a device page trigger a page fault which initiate a\n"
+ "+migration back to system memory so that CPU can access it.\n"
+ "+\n"
+ "+\n"
+ "+With this two features, HMM not only allow a device to mirror a process address\n"
+ "+space and keeps both CPU and device page table synchronize, but also allow to\n"
+ "+leverage device memory by migrating part of data-set that is actively use by a\n"
+ "+device.\n"
+ "-- \n"
+ 2.4.11

-1398a8a336175847b3b8b7c035cda1f069ee3ffb4eb6b76f358a639e1c8c19b2
+6d998231c2fa36310b2d969fe85b9ec232d703d563f316fe325b25c54cea401a