HMM (Heterogeneous Memory Management) v10

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* HMM (Heterogeneous Memory Management) v10
@ 2015-08-13 19:15 Jérôme Glisse
  2015-08-13 19:15 ` [PATCH 01/15] mmu_notifier: add event information to address invalidation v8 Jérôme Glisse
                   ` (15 more replies)
  0 siblings, 16 replies; 20+ messages in thread
From: Jérôme Glisse @ 2015-08-13 19:15 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Linda Wang, Kevin E Martin, Jeff Law,
	Or Gerlitz, Sagi Grimberg

Minor fixes since last post (1), apply on top of 4.2-rc6 done
that because conflict in infiniband are harder to solve then
conflict with mm tree.

Tree with the patchset:
git://people.freedesktop.org/~glisse/linux hmm-v10 branch

Previous cover letter :

HMM (Heterogeneous Memory Management) is an helper layer for device
that want to mirror a process address space into their own mmu. Main
target is GPU but other hardware, like network device can take also
use HMM.

There is two side to HMM, first one is mirroring of process address
space on behalf of a device. HMM will manage a secondary page table
for the device and keep it synchronize with the CPU page table. HMM
also do DMA mapping on behalf of the device (which would allow new
kind of optimization further down the road (2)).

Second side is allowing to migrate process memory to device memory
where device memory is unmappable by the CPU. Any CPU access will
trigger special fault that will migrate memory back. This patchset
does not deal with remote memory migration.

Why doing this ?

Mirroring a process address space is mandatory with OpenCL 2.0 and
with other GPU compute API. OpenCL 2.0 allow different level of
implementation and currently only the lowest 2 are supported on
Linux. To implement the highest level, where CPU and GPU access
can happen concurently and are cache coherent, HMM is needed, or
something providing same functionality, for instance through
platform hardware.

Hardware solution such as PCIE ATS/PASID is limited to mirroring
system memory and does not provide way to migrate memory to device
memory (which offer significantly more bandwidth up to 10 times
faster than regular system memory with discret GPU, also have
lower latency than PCIE transaction).

Current CPU with GPU on same die (AMD or Intel) use the ATS/PASID
and for Intel a special level of cache (backed by a large pool of
fast memory).

For foreseeable futur, discrete GPU will remain releveant as they
can have a large quantity of faster memory than integrated GPU.

Thus we believe HMM will allow to leverage discret GPU memory in
a transparent fashion to the application, with minimum disruption
to the linux kernel mm code. Also HMM can work along hardware
solution such as PCIE ATS/PASID (leaving regular case to ATS/PASID
while HMM handles the migrated memory case).

Design :

The patch 1, 2, 3 and 4 augment the mmu notifier API with new
informations to more efficiently mirror CPU page table updates.

The first side of HMM, process address space mirroring, is
implemented in patch 5 through 14. This use a secondary page
table, in which HMM mirror memory actively use by the device.
HMM does not take a reference on any of the page, it use the
mmu notifier API to track changes to the CPU page table and to
update the mirror page table. All this while providing a simple
API to device driver.

To implement this we use a "generic" page table and not a radix
tree because we need to store more flags than radix allows and
we need to store dma address (sizeof(dma_addr_t) > sizeof(long)
on some platform). All this is

(1) Previous patchset posting :
    v1 http://lwn.net/Articles/597289/
    v2 https://lkml.org/lkml/2014/6/12/559
    v3 https://lkml.org/lkml/2014/6/13/633
    v4 https://lkml.org/lkml/2014/8/29/423
    v5 https://lkml.org/lkml/2014/11/3/759
    v6 http://lwn.net/Articles/619737/
    v7 http://lwn.net/Articles/627316/
    v8 https://lwn.net/Articles/645515/
    v9 https://lwn.net/Articles/651553/

(2) Because HMM keeps a secondary page table which keeps track of
    DMA mapping, there is room for new optimization. We want to
    add a new DMA API to allow to manage DMA page table mapping
    at directory level. This would allow to minimize memory
    consumption of mirror page table and also over head of doing
    DMA mapping page per page. This is a future feature we want
    to work on and hope the idea will proove usefull not only to
    HMM users.

Cheers,
JA(C)rA'me

To: "Andrew Morton" <akpm@linux-foundation.org>,
To: <linux-kernel@vger.kernel.org>,
To: linux-mm <linux-mm@kvack.org>,
Cc: "Linus Torvalds" <torvalds@linux-foundation.org>,
Cc: "Mel Gorman" <mgorman@suse.de>,
Cc: "H. Peter Anvin" <hpa@zytor.com>,
Cc: "Peter Zijlstra" <peterz@infradead.org>,
Cc: "Linda Wang" <lwang@redhat.com>,
Cc: "Kevin E Martin" <kem@redhat.com>,
Cc: "Andrea Arcangeli" <aarcange@redhat.com>,
Cc: "Johannes Weiner" <jweiner@redhat.com>,
Cc: "Larry Woodman" <lwoodman@redhat.com>,
Cc: "Rik van Riel" <riel@redhat.com>,
Cc: "Dave Airlie" <airlied@redhat.com>,
Cc: "Jeff Law" <law@redhat.com>,
Cc: "Brendan Conoboy" <blc@redhat.com>,
Cc: "Joe Donohue" <jdonohue@redhat.com>,
Cc: "Christophe Harle" <charle@nvidia.com>,
Cc: "Duncan Poole" <dpoole@nvidia.com>,
Cc: "Sherry Cheung" <SCheung@nvidia.com>,
Cc: "Subhash Gutti" <sgutti@nvidia.com>,
Cc: "John Hubbard" <jhubbard@nvidia.com>,
Cc: "Mark Hairgrove" <mhairgrove@nvidia.com>,
Cc: "Lucien Dunning" <ldunning@nvidia.com>,
Cc: "Cameron Buschardt" <cabuschardt@nvidia.com>,
Cc: "Arvind Gopalakrishnan" <arvindg@nvidia.com>,
Cc: "Haggai Eran" <haggaie@mellanox.com>,
Cc: "Or Gerlitz" <ogerlitz@mellanox.com>,
Cc: "Sagi Grimberg" <sagig@mellanox.com>
Cc: "Shachar Raindel" <raindel@mellanox.com>,
Cc: "Liran Liss" <liranl@mellanox.com>,
Cc: "Roland Dreier" <roland@purestorage.com>,
Cc: "Sander, Ben" <ben.sander@amd.com>,
Cc: "Stoner, Greg" <Greg.Stoner@amd.com>,
Cc: "Bridgman, John" <John.Bridgman@amd.com>,
Cc: "Mantor, Michael" <Michael.Mantor@amd.com>,
Cc: "Blinzer, Paul" <Paul.Blinzer@amd.com>,
Cc: "Morichetti, Laurent" <Laurent.Morichetti@amd.com>,
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
Cc: "Leonid Shamis" <Leonid.Shamis@amd.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 01/15] mmu_notifier: add event information to address invalidation v8
  2015-08-13 19:15 HMM (Heterogeneous Memory Management) v10 Jérôme Glisse
@ 2015-08-13 19:15 ` Jérôme Glisse
  2015-08-13 19:15 ` [PATCH 02/15] mmu_notifier: keep track of active invalidation ranges v4 Jérôme Glisse
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Jérôme Glisse @ 2015-08-13 19:15 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

The event information will be useful for new user of mmu_notifier API.
The event argument differentiate between a vma disappearing, a page
being write protected or simply a page being unmaped. This allow new
user to take different path for different event for instance on unmap
the resource used to track a vma are still valid and should stay around.
While if the event is saying that a vma is being destroy it means that any
resources used to track this vma can be free.

Changed since v1:
  - renamed action into event (updated commit message too).
  - simplified the event names and clarified their usage
    also documenting what exceptation the listener can have in
    respect to each event.

Changed since v2:
  - Avoid crazy name.
  - Do not move code that do not need to move.

Changed since v3:
  - Separate huge page split from mlock/munlock and softdirty.

Changed since v4:
  - Rebase (no other changes).

Changed since v5:
  - Typo fix.
  - Changed zap_page_range from MMU_MUNMAP to MMU_MIGRATE to reflect the
    fact that the address range is still valid just the page backing it
    are no longer.

Changed since v6:
  - try_to_unmap_one() only invalidate when doing migration.
  - Differentiate fork from other case.

Changed since v7:
  - Renamed MMU_HUGE_PAGE_SPLIT to MMU_HUGE_PAGE_SPLIT.
  - Renamed MMU_ISDIRTY to MMU_CLEAR_SOFT_DIRTY.
  - Renamed MMU_WRITE_PROTECT to MMU_KSM_WRITE_PROTECT.
  - English syntax fixes.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  |   3 +-
 drivers/gpu/drm/i915/i915_gem_userptr.c |   3 +-
 drivers/gpu/drm/radeon/radeon_mn.c      |   3 +-
 drivers/infiniband/core/umem_odp.c      |   9 ++-
 drivers/iommu/amd_iommu_v2.c            |   3 +-
 drivers/misc/sgi-gru/grutlbpurge.c      |   9 ++-
 drivers/xen/gntdev.c                    |   9 ++-
 fs/proc/task_mmu.c                      |   6 +-
 include/linux/mmu_notifier.h            | 132 ++++++++++++++++++++++++++------
 kernel/events/uprobes.c                 |  10 ++-
 mm/huge_memory.c                        |  39 ++++++----
 mm/hugetlb.c                            |  23 +++---
 mm/ksm.c                                |  18 +++--
 mm/memory.c                             |  27 ++++---
 mm/migrate.c                            |   9 ++-
 mm/mmu_notifier.c                       |  28 ++++---
 mm/mprotect.c                           |   6 +-
 mm/mremap.c                             |   6 +-
 mm/rmap.c                               |   4 +-
 virt/kvm/kvm_main.c                     |  12 ++-
 20 files changed, 258 insertions(+), 101 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index b1969f2..7ca805c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -121,7 +121,8 @@ static void amdgpu_mn_release(struct mmu_notifier *mn,
 static void amdgpu_mn_invalidate_range_start(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long start,
-					     unsigned long end)
+					     unsigned long end,
+					     enum mmu_event event)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 8fd431b..adc5480 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -132,7 +132,8 @@ restart:
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
 						       unsigned long start,
-						       unsigned long end)
+						       unsigned long end,
+						       enum mmu_event event)
 {
 	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
 	struct interval_tree_node *it = NULL;
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index eef006c..3a9615b 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -121,7 +121,8 @@ static void radeon_mn_release(struct mmu_notifier *mn,
 static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long start,
-					     unsigned long end)
+					     unsigned long end,
+					     enum mmu_event event)
 {
 	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
 	struct interval_tree_node *it;
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 40becdb..6ed69fa 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -165,7 +165,8 @@ static int invalidate_page_trampoline(struct ib_umem *item, u64 start,
 
 static void ib_umem_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long address)
+					     unsigned long address,
+					     enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -192,7 +193,8 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -217,7 +219,8 @@ static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
 static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
 						  unsigned long start,
-						  unsigned long end)
+						  unsigned long end,
+						  enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index f7b875b..f206804 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -392,7 +392,8 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
 
 static void mn_invalidate_page(struct mmu_notifier *mn,
 			       struct mm_struct *mm,
-			       unsigned long address)
+			       unsigned long address,
+			       enum mmu_event event)
 {
 	__mn_flush_page(mn, address);
 }
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 2129274..e67fed1 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,7 +221,8 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end)
+				       unsigned long start, unsigned long end,
+				       enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -235,7 +236,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
 				     struct mm_struct *mm, unsigned long start,
-				     unsigned long end)
+				     unsigned long end,
+				     enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -248,7 +250,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 }
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
-				unsigned long address)
+				unsigned long address,
+				enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 0dbb222..f07301e 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -467,7 +467,9 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+				unsigned long start,
+				unsigned long end,
+				enum mmu_event event)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
@@ -484,9 +486,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 
 static void mn_invl_page(struct mmu_notifier *mn,
 			 struct mm_struct *mm,
-			 unsigned long address)
+			 unsigned long address,
+			 enum mmu_event event)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
+	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ca1e091..4c450fa 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -934,11 +934,13 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				downgrade_write(&mm->mmap_sem);
 				break;
 			}
-			mmu_notifier_invalidate_range_start(mm, 0, -1);
+			mmu_notifier_invalidate_range_start(mm, 0, -1,
+							MMU_CLEAR_SOFT_DIRTY);
 		}
 		walk_page_range(0, ~0UL, &clear_refs_walk);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1);
+			mmu_notifier_invalidate_range_end(mm, 0, -1,
+							MMU_CLEAR_SOFT_DIRTY);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 out_mm:
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 61cd67f..f9b1e10 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -9,6 +9,67 @@
 struct mmu_notifier;
 struct mmu_notifier_ops;
 
+/* MMU Events report fine-grained information to the callback routine, allowing
+ * the event listener to make a more informed decision as to what action to
+ * take. The event types are:
+ *
+ *   - MMU_FORK a process is forking. This will lead to vmas getting
+ *     write-protected, in order to set up COW
+ *
+ *   - MMU_HUGE_PAGE_SPLIT the pages don't move, nor does their content change,
+ *     but the page table structure is updated (levels added or removed).
+ *
+ *   - MMU_CLEAR_SOFT_DIRTY need to write protect so write properly update the
+ *     soft dirty bit of page table entry.
+ *
+ *   - MMU_MIGRATE: memory is migrating from one page to another, thus all write
+ *     access must stop after invalidate_range_start callback returns.
+ *     Furthermore, no read access should be allowed either, as a new page can
+ *     be remapped with write access before the invalidate_range_end callback
+ *     happens and thus any read access to old page might read stale data. There
+ *     are several sources for this event, including:
+ *
+ *         - A page moving to swap (various reasons, including page reclaim),
+ *         - An mremap syscall,
+ *         - migration for NUMA reasons,
+ *         - balancing the memory pool,
+ *         - write fault on COW page,
+ *         - and more that are not listed here.
+ *
+ *   - MMU_MPROT: memory access protection is changing. Refer to the vma to get
+ *     the new access protection. All memory access are still valid until the
+ *     invalidate_range_end callback.
+ *
+ *   - MMU_MUNLOCK: unlock memory. Content of page table stays the same but
+ *     page are unlocked.
+ *
+ *   - MMU_MUNMAP: the range is being unmapped (outcome of a munmap syscall or
+ *     process destruction). However, access is still allowed, up until the
+ *     invalidate_range_free_pages callback. This also implies that secondary
+ *     page table can be trimmed, because the address range is no longer valid.
+ *
+ *   - MMU_WRITE_BACK: memory is being written back to disk, all write accesses
+ *     must stop after invalidate_range_start callback returns. Read access are
+ *     still allowed.
+ *
+ *   - MMU_KSM_WRITE_PROTECT: memory is being write protected for KSM.
+ *
+ * If in doubt when adding a new notifier caller, please use MMU_MIGRATE,
+ * because it will always lead to reasonable behavior, but will not allow the
+ * listener a chance to optimize its events.
+ */
+enum mmu_event {
+	MMU_FORK = 0,
+	MMU_HUGE_PAGE_SPLIT,
+	MMU_CLEAR_SOFT_DIRTY,
+	MMU_MIGRATE,
+	MMU_MPROT,
+	MMU_MUNLOCK,
+	MMU_MUNMAP,
+	MMU_WRITE_BACK,
+	MMU_KSM_WRITE_PROTECT,
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 /*
@@ -82,7 +143,8 @@ struct mmu_notifier_ops {
 	void (*change_pte)(struct mmu_notifier *mn,
 			   struct mm_struct *mm,
 			   unsigned long address,
-			   pte_t pte);
+			   pte_t pte,
+			   enum mmu_event event);
 
 	/*
 	 * Before this is invoked any secondary MMU is still ok to
@@ -93,7 +155,8 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long address);
+				unsigned long address,
+				enum mmu_event event);
 
 	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
@@ -140,10 +203,14 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end);
+				       unsigned long start,
+				       unsigned long end,
+				       enum mmu_event event);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
-				     unsigned long start, unsigned long end);
+				     unsigned long start,
+				     unsigned long end,
+				     enum mmu_event event);
 
 	/*
 	 * invalidate_range() is either called between
@@ -206,13 +273,20 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
-				      unsigned long address, pte_t pte);
+				      unsigned long address,
+				      pte_t pte,
+				      enum mmu_event event);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address);
+					  unsigned long address,
+					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						  unsigned long start,
+						  unsigned long end,
+						  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						unsigned long start,
+						unsigned long end,
+						enum mmu_event event);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
 
@@ -240,31 +314,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_change_pte(mm, address, pte);
+		__mmu_notifier_change_pte(mm, address, pte, event);
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address);
+		__mmu_notifier_invalidate_page(mm, address, event);
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end);
+		__mmu_notifier_invalidate_range_start(mm, start, end, event);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end);
+		__mmu_notifier_invalidate_range_end(mm, start, end, event);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -359,13 +440,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
  * old page would remain mapped readonly in the secondary MMUs after the new
  * page is already writable by some CPU through the primary MMU.
  */
-#define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
+#define set_pte_at_notify(__mm, __address, __ptep, __pte, __event)	\
 ({									\
 	struct mm_struct *___mm = __mm;					\
 	unsigned long ___address = __address;				\
 	pte_t ___pte = __pte;						\
 									\
-	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
+	mmu_notifier_change_pte(___mm, ___address, ___pte, __event);	\
 	set_pte_at(___mm, ___address, __ptep, ___pte);			\
 })
 
@@ -393,22 +474,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_event event)
 {
 }
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index cb346f2..802828a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -176,7 +176,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -194,7 +195,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush_notify(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep,
+			  mk_pte(kpage, vma->vm_page_prot),
+			  MMU_MIGRATE);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -208,7 +211,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	err = 0;
  unlock:
 	mem_cgroup_cancel_charge(kpage, memcg);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 097c7a4..1b3ce77 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1024,7 +1024,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1058,7 +1059,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1068,7 +1070,8 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
@@ -1160,7 +1163,8 @@ alloc:
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 
 	spin_lock(ptl);
 	if (page)
@@ -1192,7 +1196,8 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 out:
 	return ret;
 out_unlock:
@@ -1611,7 +1616,8 @@ static int __split_huge_page_splitting(struct page *page,
 	const unsigned long mmun_start = address;
 	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_HUGE_PAGE_SPLIT);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1627,7 +1633,8 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_HUGE_PAGE_SPLIT);
 
 	return ret;
 }
@@ -2486,7 +2493,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	mmun_start = address;
 	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2496,7 +2504,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_collapse_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2893,24 +2902,28 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 		return;
 	}
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 		return;
 	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	get_page(page);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	split_huge_page(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8c3087..2b513e2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2749,7 +2749,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_start(src, mmun_start,
+						    mmun_end, MMU_MIGRATE);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -2803,7 +2804,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 
 	return ret;
 }
@@ -2829,7 +2831,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	BUG_ON(end & ~huge_page_mask(h));
 
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	address = start;
 again:
 	for (; address < end; address += sz) {
@@ -2903,7 +2906,8 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -3082,8 +3086,8 @@ retry_avoidcopy:
 
 	mmun_start = address & huge_page_mask(h);
 	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -3104,7 +3108,8 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+					  MMU_MIGRATE);
 out_release_all:
 	page_cache_release(new_page);
 out_release_old:
@@ -3572,7 +3577,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3622,7 +3627,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_tlb_range(vma, start, end);
 	mmu_notifier_invalidate_range(mm, start, end);
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index 7ee101e..eb1b2b5 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -872,7 +872,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_KSM_WRITE_PROTECT);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -904,7 +905,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 		if (pte_dirty(entry))
 			set_page_dirty(page);
 		entry = pte_mkclean(pte_wrprotect(entry));
-		set_pte_at_notify(mm, addr, ptep, entry);
+		set_pte_at_notify(mm, addr, ptep, entry, MMU_KSM_WRITE_PROTECT);
 	}
 	*orig_pte = *ptep;
 	err = 0;
@@ -912,7 +913,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+					  MMU_KSM_WRITE_PROTECT);
 out:
 	return err;
 }
@@ -948,7 +950,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -961,7 +964,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush_notify(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep,
+			  mk_pte(kpage, vma->vm_page_prot),
+			  MMU_MIGRATE);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -971,7 +976,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+					  MMU_MIGRATE);
 out:
 	return err;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 388dcf9..1be64ce 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1048,7 +1048,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	mmun_end   = end;
 	if (is_cow)
 		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
-						    mmun_end);
+						    mmun_end, MMU_FORK);
 
 	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
@@ -1065,7 +1065,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src_mm, mmun_start,
+						  mmun_end, MMU_FORK);
 	return ret;
 }
 
@@ -1335,10 +1336,12 @@ void unmap_vmas(struct mmu_gather *tlb,
 {
 	struct mm_struct *mm = vma->vm_mm;
 
-	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_start(mm, start_addr,
+					    end_addr, MMU_MUNMAP);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_end(mm, start_addr,
+					  end_addr, MMU_MUNMAP);
 }
 
 /**
@@ -1360,10 +1363,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MIGRATE);
 	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
 		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MIGRATE);
 	tlb_finish_mmu(&tlb, start, end);
 }
 
@@ -1386,9 +1389,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, address, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end);
+	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
 	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end);
+	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, address, end);
 }
 
@@ -2087,7 +2090,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	__SetPageUptodate(new_page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2120,7 +2124,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
-		set_pte_at_notify(mm, address, page_table, entry);
+		set_pte_at_notify(mm, address, page_table, entry, MMU_MIGRATE);
 		update_mmu_cache(vma, address, page_table);
 		if (old_page) {
 			/*
@@ -2159,7 +2163,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_cache_release(new_page);
 
 	pte_unmap_unlock(page_table, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index eb42671..0a0ed0c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1763,12 +1763,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1822,7 +1824,8 @@ fail_putback:
 	page_remove_rmap(page);
 
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 3b9b3d0..e51ea02 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -142,8 +142,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
 	return young;
 }
 
-void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
-			       pte_t pte)
+void __mmu_notifier_change_pte(struct mm_struct *mm,
+			       unsigned long address,
+			       pte_t pte,
+			       enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -151,13 +153,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->change_pte)
-			mn->ops->change_pte(mn, mm, address, pte);
+			mn->ops->change_pte(mn, mm, address, pte, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+				    unsigned long address,
+				    enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -165,13 +168,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address);
+			mn->ops->invalidate_page(mn, mm, address, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					   unsigned long start,
+					   unsigned long end,
+					   enum mmu_event event)
+
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -179,14 +185,17 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end);
+			mn->ops->invalidate_range_start(mn, mm, start,
+							end, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					 unsigned long start,
+					 unsigned long end,
+					 enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -204,7 +213,8 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 		if (mn->ops->invalidate_range)
 			mn->ops->invalidate_range(mn, mm, start, end);
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start, end);
+			mn->ops->invalidate_range_end(mn, mm, start,
+						      end, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e7d6f11..a57e8af 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -155,7 +155,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!mni_start) {
 			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start, end);
+			mmu_notifier_invalidate_range_start(mm, mni_start,
+							    end, MMU_MPROT);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -183,7 +184,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	} while (pmd++, addr = next, addr != end);
 
 	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end);
+		mmu_notifier_invalidate_range_end(mm, mni_start, end,
+						  MMU_MPROT);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index a7c93ec..72051cf 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -176,7 +176,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	mmun_start = old_addr;
 	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -228,7 +229,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 171b687..b1e6eae 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -891,7 +891,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 
 	if (ret) {
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, MMU_WRITE_BACK);
 		(*cleaned)++;
 	}
 out:
@@ -1298,7 +1298,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
 out:
 	return ret;
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8b8a444..4dfa91c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -259,7 +259,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 
 static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long address)
+					     unsigned long address,
+					     enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush, idx;
@@ -301,7 +302,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
-					pte_t pte)
+					pte_t pte,
+					enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int idx;
@@ -317,7 +319,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
@@ -343,7 +346,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
 						  unsigned long start,
-						  unsigned long end)
+						  unsigned long end,
+						  enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 02/15] mmu_notifier: keep track of active invalidation ranges v4
  2015-08-13 19:15 HMM (Heterogeneous Memory Management) v10 Jérôme Glisse
  2015-08-13 19:15 ` [PATCH 01/15] mmu_notifier: add event information to address invalidation v8 Jérôme Glisse
@ 2015-08-13 19:15 ` Jérôme Glisse
  2015-09-01  3:27   ` Mark Hairgrove
  2015-08-13 19:15 ` [PATCH 03/15] mmu_notifier: pass page pointer to mmu_notifier_invalidate_page() v2 Jérôme Glisse
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 20+ messages in thread
From: Jérôme Glisse @ 2015-08-13 19:15 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

The invalidate_range_start() and invalidate_range_end() can be
considered as forming an "atomic" section for the cpu page table
update point of view. Between this two function the cpu page
table content is unreliable for the address range being
invalidated.

This patch use a structure define at all place doing range
invalidation. This structure is added to a list for the duration
of the update ie added with invalid_range_start() and removed
with invalidate_range_end().

Helpers allow querying if a range is valid and wait for it if
necessary.

For proper synchronization, user must block any new range
invalidation from inside there invalidate_range_start() callback.
Otherwise there is no garanty that a new range invalidation will
not be added after the call to the helper function to query for
existing range.

Changed since v1:
  - Fix a possible deadlock in mmu_notifier_range_wait_valid()

Changed since v2:
  - Add the range to invalid range list before calling ->range_start().
  - Del the range from invalid range list after calling ->range_end().
  - Remove useless list initialization.

Changed since v3:
  - Improved commit message.
  - Added comment to explain how helpers function are suppose to be use.
  - English syntax fixes.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  |  13 ++--
 drivers/gpu/drm/i915/i915_gem_userptr.c |  10 +--
 drivers/gpu/drm/radeon/radeon_mn.c      |  16 ++--
 drivers/infiniband/core/umem_odp.c      |  20 ++---
 drivers/misc/sgi-gru/grutlbpurge.c      |  15 ++--
 drivers/xen/gntdev.c                    |  15 ++--
 fs/proc/task_mmu.c                      |  11 ++-
 include/linux/mmu_notifier.h            |  55 +++++++-------
 kernel/events/uprobes.c                 |  13 ++--
 mm/huge_memory.c                        |  78 +++++++++-----------
 mm/hugetlb.c                            |  55 +++++++-------
 mm/ksm.c                                |  28 +++----
 mm/memory.c                             |  72 ++++++++++--------
 mm/migrate.c                            |  36 ++++-----
 mm/mmu_notifier.c                       | 126 +++++++++++++++++++++++++++++---
 mm/mprotect.c                           |  18 +++--
 mm/mremap.c                             |  14 ++--
 virt/kvm/kvm_main.c                     |  14 ++--
 18 files changed, 350 insertions(+), 259 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 7ca805c..7c9eb1b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -119,27 +119,24 @@ static void amdgpu_mn_release(struct mmu_notifier *mn,
  * unmap them by move them into system domain again.
  */
 static void amdgpu_mn_invalidate_range_start(struct mmu_notifier *mn,
-					     struct mm_struct *mm,
-					     unsigned long start,
-					     unsigned long end,
-					     enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
-
 	/* notification is exclusive, but interval is inclusive */
-	end -= 1;
+	unsigned long end = range->end - 1;
 
 	mutex_lock(&rmn->lock);
 
-	it = interval_tree_iter_first(&rmn->objects, start, end);
+	it = interval_tree_iter_first(&rmn->objects, range->start, end);
 	while (it) {
 		struct amdgpu_mn_node *node;
 		struct amdgpu_bo *bo;
 		long r;
 
 		node = container_of(it, struct amdgpu_mn_node, it);
-		it = interval_tree_iter_next(it, start, end);
+		it = interval_tree_iter_next(it, range->start, end);
 
 		list_for_each_entry(bo, &node->bos, mn_list) {
 
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index adc5480..40ae9c1 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -130,17 +130,17 @@ restart:
 }
 
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
-						       struct mm_struct *mm,
-						       unsigned long start,
-						       unsigned long end,
-						       enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
 	struct interval_tree_node *it = NULL;
+	unsigned long start = range->start;
 	unsigned long next = start;
+	/* interval ranges are inclusive, but invalidate range is exclusive */
+	unsigned long end = range->end - 1;
 	unsigned long serial = 0;
 
-	end--; /* interval ranges are inclusive, but invalidate range is exclusive */
 	while (next < end) {
 		struct drm_i915_gem_object *obj = NULL;
 
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index 3a9615b..5276f01 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -112,34 +112,30 @@ static void radeon_mn_release(struct mmu_notifier *mn,
  *
  * @mn: our notifier
  * @mn: the mm this callback is about
- * @start: start of updated range
- * @end: end of updated range
+ * @range: Address range information.
  *
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
 static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
-					     struct mm_struct *mm,
-					     unsigned long start,
-					     unsigned long end,
-					     enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
 	struct interval_tree_node *it;
-
 	/* notification is exclusive, but interval is inclusive */
-	end -= 1;
+	unsigned long end = range->end - 1;
 
 	mutex_lock(&rmn->lock);
 
-	it = interval_tree_iter_first(&rmn->objects, start, end);
+	it = interval_tree_iter_first(&rmn->objects, range->start, end);
 	while (it) {
 		struct radeon_mn_node *node;
 		struct radeon_bo *bo;
 		long r;
 
 		node = container_of(it, struct radeon_mn_node, it);
-		it = interval_tree_iter_next(it, start, end);
+		it = interval_tree_iter_next(it, range->start, end);
 
 		list_for_each_entry(bo, &node->bos, mn_list) {
 
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 6ed69fa..58d9a00 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -191,10 +191,8 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 }
 
 static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
-						    struct mm_struct *mm,
-						    unsigned long start,
-						    unsigned long end,
-						    enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -203,8 +201,8 @@ static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 	ib_ucontext_notifier_start_account(context);
 	down_read(&context->umem_rwsem);
-	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
-				      end,
+	rbt_ib_umem_for_each_in_range(&context->umem_tree, range->start,
+				      range->end,
 				      invalidate_range_start_trampoline, NULL);
 	up_read(&context->umem_rwsem);
 }
@@ -217,10 +215,8 @@ static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
 }
 
 static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
-						  struct mm_struct *mm,
-						  unsigned long start,
-						  unsigned long end,
-						  enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -228,8 +224,8 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 		return;
 
 	down_read(&context->umem_rwsem);
-	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
-				      end,
+	rbt_ib_umem_for_each_in_range(&context->umem_tree, range->start,
+				      range->end,
 				      invalidate_range_end_trampoline, NULL);
 	up_read(&context->umem_rwsem);
 	ib_ucontext_notifier_end_account(context);
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index e67fed1..44b41b7 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,8 +221,7 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end,
-				       enum mmu_event event)
+				       const struct mmu_notifier_range *range)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -230,14 +229,13 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 	STAT(mmu_invalidate_range);
 	atomic_inc(&gms->ms_range_active);
 	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
-		start, end, atomic_read(&gms->ms_range_active));
-	gru_flush_tlb_range(gms, start, end - start);
+		range->start, range->end, atomic_read(&gms->ms_range_active));
+	gru_flush_tlb_range(gms, range->start, range->end - range->start);
 }
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
-				     struct mm_struct *mm, unsigned long start,
-				     unsigned long end,
-				     enum mmu_event event)
+				     struct mm_struct *mm,
+				     const struct mmu_notifier_range *range)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -246,7 +244,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 	(void)atomic_dec_and_test(&gms->ms_range_active);
 
 	wake_up_all(&gms->ms_wait_queue);
-	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms, start, end);
+	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms,
+		range->start, range->end);
 }
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index f07301e..166d16e 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -467,19 +467,17 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start,
-				unsigned long end,
-				enum mmu_event event)
+				const struct mmu_notifier_range *range)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
 
 	mutex_lock(&priv->lock);
 	list_for_each_entry(map, &priv->maps, next) {
-		unmap_if_in_range(map, start, end);
+		unmap_if_in_range(map, range->start, range->end);
 	}
 	list_for_each_entry(map, &priv->freeable_maps, next) {
-		unmap_if_in_range(map, start, end);
+		unmap_if_in_range(map, range->start, range->end);
 	}
 	mutex_unlock(&priv->lock);
 }
@@ -489,7 +487,12 @@ static void mn_invl_page(struct mmu_notifier *mn,
 			 unsigned long address,
 			 enum mmu_event event)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
+	struct mmu_notifier_range range;
+
+	range.start = address;
+	range.end = address + PAGE_SIZE;
+	range.event = event;
+	mn_invl_range_start(mn, mm, &range);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4c450fa..f7333cb 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -908,6 +908,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			.mm = mm,
 			.private = &cp,
 		};
+		struct mmu_notifier_range range = {
+			.start = 0,
+			.end = ~0UL,
+			.event = MMU_CLEAR_SOFT_DIRTY,
+		};
 
 		if (type == CLEAR_REFS_MM_HIWATER_RSS) {
 			/*
@@ -934,13 +939,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				downgrade_write(&mm->mmap_sem);
 				break;
 			}
-			mmu_notifier_invalidate_range_start(mm, 0, -1,
-							MMU_CLEAR_SOFT_DIRTY);
+			mmu_notifier_invalidate_range_start(mm, &range);
 		}
 		walk_page_range(0, ~0UL, &clear_refs_walk);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1,
-							MMU_CLEAR_SOFT_DIRTY);
+			mmu_notifier_invalidate_range_end(mm, &range);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 out_mm:
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index f9b1e10..13b4b51 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -70,6 +70,13 @@ enum mmu_event {
 	MMU_KSM_WRITE_PROTECT,
 };
 
+struct mmu_notifier_range {
+	struct list_head list;
+	unsigned long start;
+	unsigned long end;
+	enum mmu_event event;
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 /*
@@ -83,6 +90,12 @@ struct mmu_notifier_mm {
 	struct hlist_head list;
 	/* to serialize the list modifications and hlist_unhashed */
 	spinlock_t lock;
+	/* List of all active range invalidations. */
+	struct list_head ranges;
+	/* Number of active range invalidations. */
+	int nranges;
+	/* For threads waiting on range invalidations. */
+	wait_queue_head_t wait_queue;
 };
 
 struct mmu_notifier_ops {
@@ -203,14 +216,10 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start,
-				       unsigned long end,
-				       enum mmu_event event);
+				       const struct mmu_notifier_range *range);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
-				     unsigned long start,
-				     unsigned long end,
-				     enum mmu_event event);
+				     const struct mmu_notifier_range *range);
 
 	/*
 	 * invalidate_range() is either called between
@@ -280,15 +289,17 @@ extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  unsigned long address,
 					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-						  unsigned long start,
-						  unsigned long end,
-						  enum mmu_event event);
+					struct mmu_notifier_range *range);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-						unsigned long start,
-						unsigned long end,
-						enum mmu_event event);
+					struct mmu_notifier_range *range);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
+extern bool mmu_notifier_range_is_valid(struct mm_struct *mm,
+					unsigned long start,
+					unsigned long end);
+extern void mmu_notifier_range_wait_valid(struct mm_struct *mm,
+					  unsigned long start,
+					  unsigned long end);
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
 {
@@ -331,21 +342,17 @@ static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-						       unsigned long start,
-						       unsigned long end,
-						       enum mmu_event event)
+					struct mmu_notifier_range *range)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end, event);
+		__mmu_notifier_invalidate_range_start(mm, range);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-						     unsigned long start,
-						     unsigned long end,
-						     enum mmu_event event)
+					struct mmu_notifier_range *range)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end, event);
+		__mmu_notifier_invalidate_range_end(mm, range);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -487,16 +494,12 @@ static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-						       unsigned long start,
-						       unsigned long end,
-						       enum mmu_event event)
+					struct mmu_notifier_range *range)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-						     unsigned long start,
-						     unsigned long end,
-						     enum mmu_event event)
+					struct mmu_notifier_range *range)
 {
 }
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 802828a..b7f7f6b 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -164,9 +164,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	spinlock_t *ptl;
 	pte_t *ptep;
 	int err;
-	/* For mmu_notifiers */
-	const unsigned long mmun_start = addr;
-	const unsigned long mmun_end   = addr + PAGE_SIZE;
+	struct mmu_notifier_range range;
 	struct mem_cgroup *memcg;
 
 	err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg);
@@ -176,8 +174,10 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = addr;
+	range.end = addr + PAGE_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -211,8 +211,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	err = 0;
  unlock:
 	mem_cgroup_cancel_charge(kpage, memcg);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1b3ce77..a57822d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -983,8 +983,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	pmd_t _pmd;
 	int ret = 0, i;
 	struct page **pages;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
 			GFP_KERNEL);
@@ -1022,10 +1021,10 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		cond_resched();
 	}
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = haddr;
+	range.end = haddr + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1059,8 +1058,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1070,8 +1068,7 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
@@ -1090,9 +1087,8 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page = NULL, *new_page;
 	struct mem_cgroup *memcg;
 	unsigned long haddr;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
 	gfp_t huge_gfp;			/* for allocation and charge */
+	struct mmu_notifier_range range;
 
 	ptl = pmd_lockptr(mm, pmd);
 	VM_BUG_ON_VMA(!vma->anon_vma, vma);
@@ -1161,10 +1157,10 @@ alloc:
 		copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
 	__SetPageUptodate(new_page);
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = haddr;
+	range.end = haddr + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	spin_lock(ptl);
 	if (page)
@@ -1196,8 +1192,7 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out:
 	return ret;
 out_unlock:
@@ -1612,12 +1607,12 @@ static int __split_huge_page_splitting(struct page *page,
 	spinlock_t *ptl;
 	pmd_t *pmd;
 	int ret = 0;
-	/* For mmu_notifiers */
-	const unsigned long mmun_start = address;
-	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
+	struct mmu_notifier_range range;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_HUGE_PAGE_SPLIT);
+	range.start = address;
+	range.end = address + HPAGE_PMD_SIZE;
+	range.event = MMU_HUGE_PAGE_SPLIT;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1633,8 +1628,7 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_HUGE_PAGE_SPLIT);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	return ret;
 }
@@ -2445,8 +2439,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	int isolated;
 	unsigned long hstart, hend;
 	struct mem_cgroup *memcg;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	gfp_t gfp;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
@@ -2491,10 +2484,10 @@ static void collapse_huge_page(struct mm_struct *mm,
 	pte = pte_offset_map(pmd, address);
 	pte_ptl = pte_lockptr(mm, pmd);
 
-	mmun_start = address;
-	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = address;
+	range.end = address + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2504,8 +2497,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_collapse_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2894,36 +2886,32 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long haddr = address & HPAGE_PMD_MASK;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
+	range.start = haddr;
+	range.end = haddr + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(mm, &range);
 		return;
 	}
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(mm, &range);
 		return;
 	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	get_page(page);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	split_huge_page(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2b513e2..631de15 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2740,17 +2740,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	int cow;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	int ret = 0;
 
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
-	mmun_start = vma->vm_start;
-	mmun_end = vma->vm_end;
+	range.start = vma->vm_start;
+	range.end = vma->vm_end;
+	range.event = MMU_MIGRATE;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start,
-						    mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_start(src, &range);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -2790,8 +2789,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		} else {
 			if (cow) {
 				huge_ptep_set_wrprotect(src, addr, src_pte);
-				mmu_notifier_invalidate_range(src, mmun_start,
-								   mmun_end);
+				mmu_notifier_invalidate_range(src, range.start,
+								   range.end);
 			}
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
@@ -2804,8 +2803,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(src, &range);
 
 	return ret;
 }
@@ -2823,16 +2821,17 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
-	const unsigned long mmun_start = start;	/* For mmu_notifiers */
-	const unsigned long mmun_end   = end;	/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON(start & ~huge_page_mask(h));
 	BUG_ON(end & ~huge_page_mask(h));
 
+	range.start = start;
+	range.end = end;
+	range.event = MMU_MIGRATE;
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	address = start;
 again:
 	for (; address < end; address += sz) {
@@ -2906,8 +2905,7 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -3004,8 +3002,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct hstate *h = hstate_vma(vma);
 	struct page *old_page, *new_page;
 	int ret = 0, outside_reserve = 0;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	old_page = pte_page(pte);
 
@@ -3084,10 +3081,11 @@ retry_avoidcopy:
 	__SetPageUptodate(new_page);
 	set_page_huge_active(new_page);
 
-	mmun_start = address & huge_page_mask(h);
-	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = address & huge_page_mask(h);
+	range.end = range.start + huge_page_size(h);
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
+
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -3099,7 +3097,7 @@ retry_avoidcopy:
 
 		/* Break COW */
 		huge_ptep_clear_flush(vma, address, ptep);
-		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range(mm, range.start, range.end);
 		set_huge_pte_at(mm, address, ptep,
 				make_huge_pte(vma, new_page, 1));
 		page_remove_rmap(old_page);
@@ -3108,8 +3106,7 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
-					  MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out_release_all:
 	page_cache_release(new_page);
 out_release_old:
@@ -3573,11 +3570,15 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long pages = 0;
+	struct mmu_notifier_range range;
 
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
+	range.start = start;
+	range.end = end;
+	range.event = MMU_MPROT;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3627,7 +3628,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_tlb_range(vma, start, end);
 	mmu_notifier_invalidate_range(mm, start, end);
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index eb1b2b5..e384a97 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -855,14 +855,13 @@ static inline int pages_identical(struct page *page1, struct page *page2)
 static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 			      pte_t *orig_pte)
 {
+	struct mmu_notifier_range range;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long addr;
 	pte_t *ptep;
 	spinlock_t *ptl;
 	int swapped;
 	int err = -EFAULT;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	addr = page_address_in_vma(page, vma);
 	if (addr == -EFAULT)
@@ -870,10 +869,10 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	BUG_ON(PageTransCompound(page));
 
-	mmun_start = addr;
-	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_KSM_WRITE_PROTECT);
+	range.start = addr;
+	range.end = addr + PAGE_SIZE;
+	range.event = MMU_KSM_WRITE_PROTECT;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -913,8 +912,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
-					  MMU_KSM_WRITE_PROTECT);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out:
 	return err;
 }
@@ -937,8 +935,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	spinlock_t *ptl;
 	unsigned long addr;
 	int err = -EFAULT;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	addr = page_address_in_vma(page, vma);
 	if (addr == -EFAULT)
@@ -948,10 +945,10 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	if (!pmd)
 		goto out;
 
-	mmun_start = addr;
-	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = addr;
+	range.end = addr + PAGE_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -976,8 +973,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
-					  MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out:
 	return err;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 1be64ce..d784e35 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1009,8 +1009,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	unsigned long next;
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	bool is_cow;
 	int ret;
 
@@ -1044,11 +1043,11 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * is_cow_mapping() returns true.
 	 */
 	is_cow = is_cow_mapping(vma->vm_flags);
-	mmun_start = addr;
-	mmun_end   = end;
+	range.start = addr;
+	range.end = end;
+	range.event = MMU_FORK;
 	if (is_cow)
-		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
-						    mmun_end, MMU_FORK);
+		mmu_notifier_invalidate_range_start(src_mm, &range);
 
 	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
@@ -1065,8 +1064,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start,
-						  mmun_end, MMU_FORK);
+		mmu_notifier_invalidate_range_end(src_mm, &range);
 	return ret;
 }
 
@@ -1335,13 +1333,16 @@ void unmap_vmas(struct mmu_gather *tlb,
 		unsigned long end_addr)
 {
 	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_notifier_range range = {
+		.start = start_addr,
+		.end = end_addr,
+		.event = MMU_MUNMAP,
+	};
 
-	mmu_notifier_invalidate_range_start(mm, start_addr,
-					    end_addr, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr,
-					  end_addr, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_end(mm, &range);
 }
 
 /**
@@ -1358,16 +1359,20 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
-	unsigned long end = start + size;
+	struct mmu_notifier_range range = {
+		.start = start,
+		.end = start + size,
+		.event = MMU_MIGRATE,
+	};
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm, start, end);
+	tlb_gather_mmu(&tlb, mm, start, range.end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MIGRATE);
-	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
-		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MIGRATE);
-	tlb_finish_mmu(&tlb, start, end);
+	mmu_notifier_invalidate_range_start(mm, &range);
+	for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next)
+		unmap_single_vma(&tlb, vma, start, range.end, details);
+	mmu_notifier_invalidate_range_end(mm, &range);
+	tlb_finish_mmu(&tlb, start, range.end);
 }
 
 /**
@@ -1384,15 +1389,19 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
-	unsigned long end = address + size;
+	struct mmu_notifier_range range = {
+		.start = address,
+		.end = address + size,
+		.event = MMU_MUNMAP,
+	};
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm, address, end);
+	tlb_gather_mmu(&tlb, mm, address, range.end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
-	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
-	tlb_finish_mmu(&tlb, address, end);
+	mmu_notifier_invalidate_range_start(mm, &range);
+	unmap_single_vma(&tlb, vma, address, range.end, details);
+	mmu_notifier_invalidate_range_end(mm, &range);
+	tlb_finish_mmu(&tlb, address, range.end);
 }
 
 /**
@@ -2000,6 +2009,7 @@ static inline int wp_page_reuse(struct mm_struct *mm,
 	__releases(ptl)
 {
 	pte_t entry;
+
 	/*
 	 * Clear the pages cpupid information as the existing
 	 * information potentially belongs to a now completely
@@ -2067,9 +2077,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl = NULL;
 	pte_t entry;
 	int page_copied = 0;
-	const unsigned long mmun_start = address & PAGE_MASK;	/* For mmu_notifiers */
-	const unsigned long mmun_end = mmun_start + PAGE_SIZE;	/* For mmu_notifiers */
 	struct mem_cgroup *memcg;
+	struct mmu_notifier_range range;
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
@@ -2090,8 +2099,10 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	__SetPageUptodate(new_page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = address & PAGE_MASK;
+	range.end = range.start + PAGE_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2163,8 +2174,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_cache_release(new_page);
 
 	pte_unmap_unlock(page_table, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index 0a0ed0c..c612e2c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1725,10 +1725,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	int isolated = 0;
 	struct page *new_page = NULL;
 	int page_lru = page_is_file_cache(page);
-	unsigned long mmun_start = address & HPAGE_PMD_MASK;
-	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
+	struct mmu_notifier_range range;
 	pmd_t orig_entry;
 
+	range.start = address & HPAGE_PMD_MASK;
+	range.end = range.start + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
@@ -1750,7 +1753,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	}
 
 	if (mm_tlb_flush_pending(mm))
-		flush_tlb_range(vma, mmun_start, mmun_end);
+		flush_tlb_range(vma, range.start, range.end);
 
 	/* Prepare a page as a migration target */
 	__set_page_locked(new_page);
@@ -1763,14 +1766,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(mm, &range);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1803,17 +1804,17 @@ fail_putback:
 	 * The SetPageUptodate on the new page and page_add_new_anon_rmap
 	 * guarantee the copy is visible before the pagetable update.
 	 */
-	flush_cache_range(vma, mmun_start, mmun_end);
-	page_add_anon_rmap(new_page, vma, mmun_start);
-	pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
-	set_pmd_at(mm, mmun_start, pmd, entry);
-	flush_tlb_range(vma, mmun_start, mmun_end);
+	flush_cache_range(vma, range.start, range.end);
+	page_add_anon_rmap(new_page, vma, range.start);
+	pmdp_huge_clear_flush_notify(vma, range.start, pmd);
+	set_pmd_at(mm, range.start, pmd, entry);
+	flush_tlb_range(vma, range.start, range.end);
 	update_mmu_cache_pmd(vma, address, &entry);
 
 	if (page_count(page) != 2) {
-		set_pmd_at(mm, mmun_start, pmd, orig_entry);
-		flush_tlb_range(vma, mmun_start, mmun_end);
-		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
+		set_pmd_at(mm, range.start, pmd, orig_entry);
+		flush_tlb_range(vma, range.start, range.end);
+		mmu_notifier_invalidate_range(mm, range.start, range.end);
 		update_mmu_cache_pmd(vma, address, &entry);
 		page_remove_rmap(new_page);
 		goto fail_putback;
@@ -1824,8 +1825,7 @@ fail_putback:
 	page_remove_rmap(page);
 
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
@@ -1850,7 +1850,7 @@ out_dropref:
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
 		entry = pmd_modify(entry, vma->vm_page_prot);
-		set_pmd_at(mm, mmun_start, pmd, entry);
+		set_pmd_at(mm, range.start, pmd, entry);
 		update_mmu_cache_pmd(vma, address, &entry);
 	}
 	spin_unlock(ptl);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index e51ea02..99fccbd 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -174,28 +174,28 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-					   unsigned long start,
-					   unsigned long end,
-					   enum mmu_event event)
+					   struct mmu_notifier_range *range)
 
 {
 	struct mmu_notifier *mn;
 	int id;
 
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	list_add_tail(&range->list, &mm->mmu_notifier_mm->ranges);
+	mm->mmu_notifier_mm->nranges++;
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start,
-							end, event);
+			mn->ops->invalidate_range_start(mn, mm, range);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-					 unsigned long start,
-					 unsigned long end,
-					 enum mmu_event event)
+					 struct mmu_notifier_range *range)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -211,12 +211,23 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 		 * (besides the pointer check).
 		 */
 		if (mn->ops->invalidate_range)
-			mn->ops->invalidate_range(mn, mm, start, end);
+			mn->ops->invalidate_range(mn, mm,
+						  range->start, range->end);
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start,
-						      end, event);
+			mn->ops->invalidate_range_end(mn, mm, range);
 	}
 	srcu_read_unlock(&srcu, id);
+
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	list_del_init(&range->list);
+	mm->mmu_notifier_mm->nranges--;
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+
+	/*
+	 * Wakeup after callback so they can do their job before any of the
+	 * waiters resume.
+	 */
+	wake_up(&mm->mmu_notifier_mm->wait_queue);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);
 
@@ -235,6 +246,96 @@ void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range);
 
+/* mmu_notifier_range_is_valid_locked() - test if range overlap with active
+ * invalidation.
+ *
+ * @mm: The mm struct.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ * Returns: false if overlap with an active invalidation, true otherwise.
+ *
+ * This function test whether any active invalidated range conflict with a
+ * given range ([start, end[), active invalidation are added to a list inside
+ * __mmu_notifier_invalidate_range_start() and removed from that list inside
+ * __mmu_notifier_invalidate_range_end().
+ */
+static bool mmu_notifier_range_is_valid_locked(struct mm_struct *mm,
+					       unsigned long start,
+					       unsigned long end)
+{
+	struct mmu_notifier_range *range;
+
+	list_for_each_entry(range, &mm->mmu_notifier_mm->ranges, list) {
+		if (range->end > start && range->start < end)
+			return false;
+	}
+	return true;
+}
+
+/* mmu_notifier_range_is_valid() - test if range overlap with active
+ * invalidation.
+ *
+ * @mm: The mm struct.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ *
+ * This function wait for any active range invalidation that conflict with the
+ * given range, to end. See mmu_notifier_range_wait_valid() on how to use this
+ * function properly.
+ */
+bool mmu_notifier_range_is_valid(struct mm_struct *mm,
+				 unsigned long start,
+				 unsigned long end)
+{
+	bool valid;
+
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	valid = mmu_notifier_range_is_valid_locked(mm, start, end);
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+	return valid;
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_range_is_valid);
+
+/* mmu_notifier_range_wait_valid() - wait for a range to have no conflict with
+ * active invalidation.
+ *
+ * @mm: The mm struct.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ *
+ * This function wait for any active range invalidation that conflict with the
+ * given range, to end.
+ *
+ * Note by the time this function return a new range invalidation that conflict
+ * might have started. So you need to atomically block new range and query
+ * again if range is still valid with mmu_notifier_range_is_valid(). So call
+ * sequence should be :
+ *
+ * again:
+ * mmu_notifier_range_wait_valid()
+ * // block new invalidation using that lock inside your range_start callback
+ * lock_block_new_invalidation()
+ * if (!mmu_notifier_range_is_valid())
+ *     goto again;
+ * unlock()
+ */
+void mmu_notifier_range_wait_valid(struct mm_struct *mm,
+				   unsigned long start,
+				   unsigned long end)
+{
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	while (!mmu_notifier_range_is_valid_locked(mm, start, end)) {
+		int nranges = mm->mmu_notifier_mm->nranges;
+
+		spin_unlock(&mm->mmu_notifier_mm->lock);
+		wait_event(mm->mmu_notifier_mm->wait_queue,
+			   nranges != mm->mmu_notifier_mm->nranges);
+		spin_lock(&mm->mmu_notifier_mm->lock);
+	}
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_range_wait_valid);
+
 static int do_mmu_notifier_register(struct mmu_notifier *mn,
 				    struct mm_struct *mm,
 				    int take_mmap_sem)
@@ -264,6 +365,9 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,
 	if (!mm_has_notifiers(mm)) {
 		INIT_HLIST_HEAD(&mmu_notifier_mm->list);
 		spin_lock_init(&mmu_notifier_mm->lock);
+		INIT_LIST_HEAD(&mmu_notifier_mm->ranges);
+		mmu_notifier_mm->nranges = 0;
+		init_waitqueue_head(&mmu_notifier_mm->wait_queue);
 
 		mm->mmu_notifier_mm = mmu_notifier_mm;
 		mmu_notifier_mm = NULL;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a57e8af..0c394db 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -142,7 +142,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	unsigned long next;
 	unsigned long pages = 0;
 	unsigned long nr_huge_updates = 0;
-	unsigned long mni_start = 0;
+	struct mmu_notifier_range range = {
+		.start = 0,
+	};
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -153,10 +155,11 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			continue;
 
 		/* invoke the mmu notifier if the pmd is populated */
-		if (!mni_start) {
-			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start,
-							    end, MMU_MPROT);
+		if (!range.start) {
+			range.start = addr;
+			range.end = end;
+			range.event = MMU_MPROT;
+			mmu_notifier_invalidate_range_start(mm, &range);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -183,9 +186,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pages += this_pages;
 	} while (pmd++, addr = next, addr != end);
 
-	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end,
-						  MMU_MPROT);
+	if (range.start)
+		mmu_notifier_invalidate_range_end(mm, &range);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index 72051cf..03fb4e5 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -166,18 +166,17 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		bool need_rmap_locks)
 {
 	unsigned long extent, next, old_end;
+	struct mmu_notifier_range range;
 	pmd_t *old_pmd, *new_pmd;
 	bool need_flush = false;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
 
-	mmun_start = old_addr;
-	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = old_addr;
+	range.end = old_end;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(vma->vm_mm, &range);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -229,8 +228,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, &range);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4dfa91c..7e79aa8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -317,10 +317,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 }
 
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
-						    struct mm_struct *mm,
-						    unsigned long start,
-						    unsigned long end,
-						    enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
@@ -333,7 +331,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 * count is also read inside the mmu_lock critical section.
 	 */
 	kvm->mmu_notifier_count++;
-	need_tlb_flush = kvm_unmap_hva_range(kvm, start, end);
+	need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end);
 	need_tlb_flush |= kvm->tlbs_dirty;
 	/* we've to flush the tlb before the pages can be freed */
 	if (need_tlb_flush)
@@ -344,10 +342,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
-						  struct mm_struct *mm,
-						  unsigned long start,
-						  unsigned long end,
-						  enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 02/15] mmu_notifier: keep track of active invalidation ranges v4
  2015-08-13 19:15 ` [PATCH 02/15] mmu_notifier: keep track of active invalidation ranges v4 Jérôme Glisse
@ 2015-09-01  3:27   ` Mark Hairgrove
  2015-09-01 14:58     ` Jerome Glisse
  0 siblings, 1 reply; 20+ messages in thread
From: Mark Hairgrove @ 2015-09-01  3:27 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Christophe Harle, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher

[-- Attachment #1: Type: text/plain, Size: 6415 bytes --]



On Thu, 13 Aug 2015, JA(C)rA'me Glisse wrote:

> The invalidate_range_start() and invalidate_range_end() can be
> considered as forming an "atomic" section for the cpu page table
> update point of view. Between this two function the cpu page
> table content is unreliable for the address range being
> invalidated.
> 
> This patch use a structure define at all place doing range
> invalidation. This structure is added to a list for the duration
> of the update ie added with invalid_range_start() and removed
> with invalidate_range_end().
> 
> Helpers allow querying if a range is valid and wait for it if
> necessary.
> 
> For proper synchronization, user must block any new range
> invalidation from inside there invalidate_range_start() callback.

s/there/their/

> Otherwise there is no garanty that a new range invalidation will

s/garanty/guarantee/

> not be added after the call to the helper function to query for
> existing range.
> 
> [...]
>
> +/* mmu_notifier_range_is_valid_locked() - test if range overlap with active

s/overlap/overlaps/

> + * invalidation.
> + *
> + * @mm: The mm struct.
> + * @start: Start address of the range (inclusive).
> + * @end: End address of the range (exclusive).
> + * Returns: false if overlap with an active invalidation, true otherwise.
> + *
> + * This function test whether any active invalidated range conflict with a

s/test/tests/
s/invalidated/invalidation/
s/conflict/conflicts/

> + * given range ([start, end[), active invalidation are added to a list inside

end[ -> end]
s/invalidation/invalidations/

> + * __mmu_notifier_invalidate_range_start() and removed from that list inside
> + * __mmu_notifier_invalidate_range_end().
> + */
> +static bool mmu_notifier_range_is_valid_locked(struct mm_struct *mm,
> +					       unsigned long start,
> +					       unsigned long end)
> +{
> +	struct mmu_notifier_range *range;
> +
> +	list_for_each_entry(range, &mm->mmu_notifier_mm->ranges, list) {
> +		if (range->end > start && range->start < end)
> +			return false;
> +	}
> +	return true;
> +}
> +
> +/* mmu_notifier_range_is_valid() - test if range overlap with active

s/overlap/overlaps/

> + * invalidation.
> + *
> + * @mm: The mm struct.
> + * @start: Start address of the range (inclusive).
> + * @end: End address of the range (exclusive).
> + *
> + * This function wait for any active range invalidation that conflict with the
> + * given range, to end. See mmu_notifier_range_wait_valid() on how to use this
> + * function properly.

Bad copy/paste from range_wait_valid? mmu_notifier_range_is_valid just 
queries the state, it doesn't wait.

> + */
> +bool mmu_notifier_range_is_valid(struct mm_struct *mm,
> +				 unsigned long start,
> +				 unsigned long end)
> +{
> +	bool valid;
> +
> +	spin_lock(&mm->mmu_notifier_mm->lock);
> +	valid = mmu_notifier_range_is_valid_locked(mm, start, end);
> +	spin_unlock(&mm->mmu_notifier_mm->lock);
> +	return valid;
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_range_is_valid);
> +
> +/* mmu_notifier_range_wait_valid() - wait for a range to have no conflict with
> + * active invalidation.
> + *
> + * @mm: The mm struct.
> + * @start: Start address of the range (inclusive).
> + * @end: End address of the range (exclusive).
> + *
> + * This function wait for any active range invalidation that conflict with the
> + * given range, to end.
> + *
> + * Note by the time this function return a new range invalidation that conflict
> + * might have started. So you need to atomically block new range and query
> + * again if range is still valid with mmu_notifier_range_is_valid(). So call
> + * sequence should be :
> + *
> + * again:
> + * mmu_notifier_range_wait_valid()
> + * // block new invalidation using that lock inside your range_start callback
> + * lock_block_new_invalidation()
> + * if (!mmu_notifier_range_is_valid())
> + *     goto again;
> + * unlock()

I think this example sequence can deadlock so I wouldn't want to encourage 
its use. New invalidation regions are added to the list before the 
range_start callback is invoked.

Thread A                           Thread B
-----------------                  -----------------
mmu_notifier_range_wait_valid
// returns
                                   __mmu_notifier_invalidate_range_start
                                     list_add_tail
lock_block_new_invalidation
                                     ->invalidate_range_start
                                       // invalidation blocked in callback
mmu_notifier_range_is_valid // fails
goto again
mmu_notifier_range_wait_valid // deadlock

mmu_notifier_range_wait_valid can't finish until thread B's callback 
returns, but thread B's callback can't return because it's blocked.

I see that HMM in later patches takes the approach of not holding the lock 
when mmu_notifier_range_is_valid returns false. Instead of stalling new 
invalidations it returns -EAGAIN to the caller. While that resolves the 
deadlock, it won't prevent the faulting thread from being starved in the 
pathological case.

Is it out of the question to build a lock into the mmu notifier API 
directly? It's a little worrisome to me that the complexity for this 
locking is pushed into the callbacks rather than handled in the core. 
Something like this:

    mmu_notifier_range_lock(start, end)
    mmu_notifier_range_unlock(start, end)

If that's not feasible and we have to stick with the current approach, 
then I suggest changing the "valid" name. "valid" doesn't have a clear 
meaning at first glance because the reader doesn't know what would make a 
range "valid." How about "active" instead? Then the names would look 
something like this, assuming the polarity matches their current versions:

    mmu_notifier_range_inactive_locked
    mmu_notifier_range_inactive
    mmu_notifier_range_wait_active


> + */
> +void mmu_notifier_range_wait_valid(struct mm_struct *mm,
> +				   unsigned long start,
> +				   unsigned long end)
> +{
> +	spin_lock(&mm->mmu_notifier_mm->lock);
> +	while (!mmu_notifier_range_is_valid_locked(mm, start, end)) {
> +		int nranges = mm->mmu_notifier_mm->nranges;
> +
> +		spin_unlock(&mm->mmu_notifier_mm->lock);
> +		wait_event(mm->mmu_notifier_mm->wait_queue,
> +			   nranges != mm->mmu_notifier_mm->nranges);
> +		spin_lock(&mm->mmu_notifier_mm->lock);
> +	}
> +	spin_unlock(&mm->mmu_notifier_mm->lock);
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_range_wait_valid);
> +

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 02/15] mmu_notifier: keep track of active invalidation ranges v4
  2015-09-01  3:27   ` Mark Hairgrove
@ 2015-09-01 14:58     ` Jerome Glisse
  0 siblings, 0 replies; 20+ messages in thread
From: Jerome Glisse @ 2015-09-01 14:58 UTC (permalink / raw)
  To: Mark Hairgrove
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Christophe Harle, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher

On Mon, Aug 31, 2015 at 08:27:17PM -0700, Mark Hairgrove wrote:
> On Thu, 13 Aug 2015, Jerome Glisse wrote:

[...]

Will fix syntax.

[...]

> > +/* mmu_notifier_range_wait_valid() - wait for a range to have no conflict with
> > + * active invalidation.
> > + *
> > + * @mm: The mm struct.
> > + * @start: Start address of the range (inclusive).
> > + * @end: End address of the range (exclusive).
> > + *
> > + * This function wait for any active range invalidation that conflict with the
> > + * given range, to end.
> > + *
> > + * Note by the time this function return a new range invalidation that conflict
> > + * might have started. So you need to atomically block new range and query
> > + * again if range is still valid with mmu_notifier_range_is_valid(). So call
> > + * sequence should be :
> > + *
> > + * again:
> > + * mmu_notifier_range_wait_valid()
> > + * // block new invalidation using that lock inside your range_start callback
> > + * lock_block_new_invalidation()
> > + * if (!mmu_notifier_range_is_valid())
> > + *     goto again;
> > + * unlock()
> 
> I think this example sequence can deadlock so I wouldn't want to encourage 
> its use. New invalidation regions are added to the list before the 
> range_start callback is invoked.
> 
> Thread A                           Thread B
> -----------------                  -----------------
> mmu_notifier_range_wait_valid
> // returns
>                                    __mmu_notifier_invalidate_range_start
>                                      list_add_tail
> lock_block_new_invalidation
>                                      ->invalidate_range_start
>                                        // invalidation blocked in callback
> mmu_notifier_range_is_valid // fails
> goto again
> mmu_notifier_range_wait_valid // deadlock
> 
> mmu_notifier_range_wait_valid can't finish until thread B's callback 
> returns, but thread B's callback can't return because it's blocked.
> 
> I see that HMM in later patches takes the approach of not holding the lock 
> when mmu_notifier_range_is_valid returns false. Instead of stalling new 
> invalidations it returns -EAGAIN to the caller. While that resolves the 
> deadlock, it won't prevent the faulting thread from being starved in the 
> pathological case.

The comment here is not clear, what HMM does is what is intended. If
mmu_notifier_range_is_valid() return false then you drop lock and try
again. I am not sure we should care about the starve case as it can
not happen, it would mean that something keeps invalidating over and
over the same range of address space of a process. I do not see how
such thing would happen.

> 
> Is it out of the question to build a lock into the mmu notifier API 
> directly? It's a little worrisome to me that the complexity for this 
> locking is pushed into the callbacks rather than handled in the core. 
> Something like this:
> 
>     mmu_notifier_range_lock(start, end)
>     mmu_notifier_range_unlock(start, end)

If a range is about to be invalidated it is better to avoid faulting
in memory in the device as that same memory is about to be invalidated.
This is why i always have invalidation take precedence over device fault.


> If that's not feasible and we have to stick with the current approach, 
> then I suggest changing the "valid" name. "valid" doesn't have a clear 
> meaning at first glance because the reader doesn't know what would make a 
> range "valid." How about "active" instead? Then the names would look 
> something like this, assuming the polarity matches their current versions:
> 
>     mmu_notifier_range_inactive_locked
>     mmu_notifier_range_inactive
>     mmu_notifier_range_wait_active

Those names are better i will update the patch accordingly.

Thanks for the review,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 03/15] mmu_notifier: pass page pointer to mmu_notifier_invalidate_page() v2
  2015-08-13 19:15 HMM (Heterogeneous Memory Management) v10 Jérôme Glisse
  2015-08-13 19:15 ` [PATCH 01/15] mmu_notifier: add event information to address invalidation v8 Jérôme Glisse
  2015-08-13 19:15 ` [PATCH 02/15] mmu_notifier: keep track of active invalidation ranges v4 Jérôme Glisse
@ 2015-08-13 19:15 ` Jérôme Glisse
  2015-08-13 19:15 ` [PATCH 04/15] mmu_notifier: allow range invalidation to exclude a specific mmu_notifier Jérôme Glisse
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Jérôme Glisse @ 2015-08-13 19:15 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

Listener of mm event might not have easy way to get the struct page
behind an address invalidated with mmu_notifier_invalidate_page()
function as this happens after the cpu page table have been clear/
updated. This happens for instance if the listener is storing a dma
mapping inside its secondary page table. To avoid complex reverse
dma mapping lookup just pass along a pointer to the page being
invalidated.

Changed since v1:
  - English syntax fixes.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 drivers/infiniband/core/umem_odp.c | 1 +
 drivers/iommu/amd_iommu_v2.c       | 1 +
 drivers/misc/sgi-gru/grutlbpurge.c | 1 +
 drivers/xen/gntdev.c               | 1 +
 include/linux/mmu_notifier.h       | 6 +++++-
 mm/mmu_notifier.c                  | 3 ++-
 mm/rmap.c                          | 4 ++--
 virt/kvm/kvm_main.c                | 1 +
 8 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 58d9a00..0541761 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -166,6 +166,7 @@ static int invalidate_page_trampoline(struct ib_umem *item, u64 start,
 static void ib_umem_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long address,
+					     struct page *page,
 					     enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index f206804..3335eb5 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -393,6 +393,7 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
 static void mn_invalidate_page(struct mmu_notifier *mn,
 			       struct mm_struct *mm,
 			       unsigned long address,
+			       struct page *page,
 			       enum mmu_event event)
 {
 	__mn_flush_page(mn, address);
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 44b41b7..c7659b76 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -250,6 +250,7 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
 				unsigned long address,
+				struct page *page,
 				enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 166d16e..6ac1918 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -485,6 +485,7 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 static void mn_invl_page(struct mmu_notifier *mn,
 			 struct mm_struct *mm,
 			 unsigned long address,
+			 struct page *page,
 			 enum mmu_event event)
 {
 	struct mmu_notifier_range range;
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 13b4b51..1a20145c 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -169,6 +169,7 @@ struct mmu_notifier_ops {
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
 				unsigned long address,
+				struct page *page,
 				enum mmu_event event);
 
 	/*
@@ -287,6 +288,7 @@ extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      enum mmu_event event);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  unsigned long address,
+					  struct page *page,
 					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 					struct mmu_notifier_range *range);
@@ -335,10 +337,11 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 						unsigned long address,
+						struct page *page,
 						enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address, event);
+		__mmu_notifier_invalidate_page(mm, address, page, event);
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
@@ -489,6 +492,7 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 						unsigned long address,
+						struct page *page,
 						enum mmu_event event)
 {
 }
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 99fccbd..2ed6d0d 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -160,6 +160,7 @@ void __mmu_notifier_change_pte(struct mm_struct *mm,
 
 void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 				    unsigned long address,
+				    struct page *page,
 				    enum mmu_event event)
 {
 	struct mmu_notifier *mn;
@@ -168,7 +169,7 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address, event);
+			mn->ops->invalidate_page(mn, mm, address, page, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index b1e6eae..65aee96 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -891,7 +891,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 
 	if (ret) {
-		mmu_notifier_invalidate_page(mm, address, MMU_WRITE_BACK);
+		mmu_notifier_invalidate_page(mm, address, page, MMU_WRITE_BACK);
 		(*cleaned)++;
 	}
 out:
@@ -1298,7 +1298,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
-		mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
+		mmu_notifier_invalidate_page(mm, address, page, MMU_MIGRATE);
 out:
 	return ret;
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e79aa8..5f35340 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -260,6 +260,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long address,
+					     struct page *page,
 					     enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 04/15] mmu_notifier: allow range invalidation to exclude a specific mmu_notifier
  2015-08-13 19:15 HMM (Heterogeneous Memory Management) v10 Jérôme Glisse
                   ` (2 preceding siblings ...)
  2015-08-13 19:15 ` [PATCH 03/15] mmu_notifier: pass page pointer to mmu_notifier_invalidate_page() v2 Jérôme Glisse
@ 2015-08-13 19:15 ` Jérôme Glisse
  2015-08-13 19:15 ` [PATCH 05/15] HMM: introduce heterogeneous memory management v5 Jérôme Glisse
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Jérôme Glisse @ 2015-08-13 19:15 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

This patch allow to invalidate a range while excluding call to a specific
mmu_notifier which allow for a subsystem to invalidate a range for everyone
but itself.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 include/linux/mmu_notifier.h | 66 ++++++++++++++++++++++++++++++++++++++++----
 mm/mmu_notifier.c            | 16 +++++++++--
 2 files changed, 73 insertions(+), 9 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 1a20145c..794d626 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -291,11 +291,15 @@ extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  struct page *page,
 					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-					struct mmu_notifier_range *range);
+					struct mmu_notifier_range *range,
+					const struct mmu_notifier *exclude);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-					struct mmu_notifier_range *range);
+					struct mmu_notifier_range *range,
+					const struct mmu_notifier *exclude);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+					    unsigned long start,
+					    unsigned long end,
+					    const struct mmu_notifier *exclude);
 extern bool mmu_notifier_range_is_valid(struct mm_struct *mm,
 					unsigned long start,
 					unsigned long end);
@@ -348,21 +352,49 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 					struct mmu_notifier_range *range)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, range);
+		__mmu_notifier_invalidate_range_start(mm, range, NULL);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 					struct mmu_notifier_range *range)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, range);
+		__mmu_notifier_invalidate_range_end(mm, range, NULL);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range(mm, start, end);
+		__mmu_notifier_invalidate_range(mm, start, end, NULL);
+}
+
+static inline void mmu_notifier_invalidate_range_start_excluding(
+					struct mm_struct *mm,
+					struct mmu_notifier_range *range,
+					const struct mmu_notifier *exclude)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range_start(mm, range, exclude);
+}
+
+static inline void mmu_notifier_invalidate_range_end_excluding(
+					struct mm_struct *mm,
+					struct mmu_notifier_range *range,
+					const struct mmu_notifier *exclude)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range_end(mm, range, exclude);
+}
+
+static inline void mmu_notifier_invalidate_range_excluding(
+					struct mm_struct *mm,
+					unsigned long start,
+					unsigned long end,
+					const struct mmu_notifier *exclude)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range(mm, start, end, exclude);
 }
 
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
@@ -512,6 +544,28 @@ static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
 {
 }
 
+static inline void mmu_notifier_invalidate_range_start_excluding(
+					struct mm_struct *mm,
+					struct mmu_notifier_range *range,
+					const struct mmu_notifier *exclude)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_end_excluding(
+					struct mm_struct *mm,
+					struct mmu_notifier_range *range,
+					const struct mmu_notifier *exclude)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_excluding(
+					struct mm_struct *mm,
+					unsigned long start,
+					unsigned long end,
+					const struct mmu_notifier *exclude)
+{
+}
+
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
 {
 }
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 2ed6d0d..c673ba9 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -175,7 +175,8 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-					   struct mmu_notifier_range *range)
+					   struct mmu_notifier_range *range,
+					   const struct mmu_notifier *exclude)
 
 {
 	struct mmu_notifier *mn;
@@ -188,6 +189,8 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn == exclude)
+			continue;
 		if (mn->ops->invalidate_range_start)
 			mn->ops->invalidate_range_start(mn, mm, range);
 	}
@@ -196,13 +199,16 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-					 struct mmu_notifier_range *range)
+					 struct mmu_notifier_range *range,
+					 const struct mmu_notifier *exclude)
 {
 	struct mmu_notifier *mn;
 	int id;
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn == exclude)
+			continue;
 		/*
 		 * Call invalidate_range here too to avoid the need for the
 		 * subsystem of having to register an invalidate_range_end
@@ -233,13 +239,17 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);
 
 void __mmu_notifier_invalidate_range(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+				     unsigned long start,
+				     unsigned long end,
+				     const struct mmu_notifier *exclude)
 {
 	struct mmu_notifier *mn;
 	int id;
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn == exclude)
+			continue;
 		if (mn->ops->invalidate_range)
 			mn->ops->invalidate_range(mn, mm, start, end);
 	}
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 05/15] HMM: introduce heterogeneous memory management v5.
  2015-08-13 19:15 HMM (Heterogeneous Memory Management) v10 Jérôme Glisse
                   ` (3 preceding siblings ...)
  2015-08-13 19:15 ` [PATCH 04/15] mmu_notifier: allow range invalidation to exclude a specific mmu_notifier Jérôme Glisse
@ 2015-08-13 19:15 ` Jérôme Glisse
  2015-08-13 19:15 ` [PATCH 06/15] HMM: add HMM page table v4 Jérôme Glisse
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Jérôme Glisse @ 2015-08-13 19:15 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse, Jatin Kumar

This patch only introduce core HMM functions for registering a new
mirror and stopping a mirror as well as HMM device registering and
unregistering.

The lifecycle of HMM object is handled differently then the one of
mmu_notifier because unlike mmu_notifier there can be concurrent
call from both mm code to HMM code and/or from device driver code
to HMM code. Moreover lifetime of HMM can be uncorrelated from the
lifetime of the process that is being mirror (GPU might take longer
time to cleanup).

Changed since v1:
  - Updated comment of hmm_device_register().

Changed since v2:
  - Expose struct hmm for easy access to mm struct.
  - Simplify hmm_mirror_register() arguments.
  - Removed the device name.
  - Refcount the mirror struct internaly to HMM allowing to get
    rid of the srcu and making the device driver callback error
    handling simpler.
  - Safe to call several time hmm_mirror_unregister().
  - Rework the mmu_notifier unregistration and release callback.

Changed since v3:
  - Rework hmm_mirror lifetime rules.
  - Synchronize with mmu_notifier srcu before droping mirror last
    reference in hmm_mirror_unregister()
  - Use spinlock for device's mirror list.
  - Export mirror ref/unref functions.
  - English syntax fixes.

Changed since v4:
  - Properly reference existing hmm struct if any.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 MAINTAINERS              |   7 +
 include/linux/hmm.h      | 173 +++++++++++++++++++++
 include/linux/mm.h       |  11 ++
 include/linux/mm_types.h |  14 ++
 kernel/fork.c            |   2 +
 mm/Kconfig               |  14 ++
 mm/Makefile              |   1 +
 mm/hmm.c                 | 381 +++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 603 insertions(+)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

diff --git a/MAINTAINERS b/MAINTAINERS
index a9ae6c1..410b826 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4871,6 +4871,13 @@ F:	include/uapi/linux/if_hippi.h
 F:	net/802/hippi.c
 F:	drivers/net/hippi/
 
+HMM - Heterogeneous Memory Management
+M:	JA(C)rA'me Glisse <jglisse@redhat.com>
+L:	linux-mm@kvack.org
+S:	Maintained
+F:	mm/hmm.c
+F:	include/linux/hmm.h
+
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
 L:	hostap@shmoo.com (subscribers-only)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..b559c0b
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,173 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: JA(C)rA'me Glisse <jglisse@redhat.com>
+ */
+/* This is a heterogeneous memory management (hmm). In a nutshell this provide
+ * an API to mirror a process address on a device which has its own mmu using
+ * its own page table for the process. It supports everything except special
+ * vma.
+ *
+ * Mandatory hardware features :
+ *   - An mmu with pagetable.
+ *   - Read only flag per cpu page.
+ *   - Page fault ie hardware must stop and wait for kernel to service fault.
+ *
+ * Optional hardware features :
+ *   - Dirty bit per cpu page.
+ *   - Access bit per cpu page.
+ *
+ * The hmm code handle all the interfacing with the core kernel mm code and
+ * provide a simple API. It does support migrating system memory to device
+ * memory and handle migration back to system memory on cpu page fault.
+ *
+ * Migrated memory is considered as swaped from cpu and core mm code point of
+ * view.
+ */
+#ifndef _HMM_H
+#define _HMM_H
+
+#ifdef CONFIG_HMM
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
+#include <linux/workqueue.h>
+#include <linux/mman.h>
+
+
+struct hmm_device;
+struct hmm_mirror;
+struct hmm;
+
+
+/* hmm_device - Each device must register one and only one hmm_device.
+ *
+ * The hmm_device is the link btw HMM and each device driver.
+ */
+
+/* struct hmm_device_operations - HMM device operation callback
+ */
+struct hmm_device_ops {
+	/* release() - mirror must stop using the address space.
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 *
+	 * When this is called, device driver must kill all device thread using
+	 * this mirror. It is call either from :
+	 *   - mm dying (all process using this mm exiting).
+	 *   - hmm_mirror_unregister() (if no other thread holds a reference)
+	 *   - outcome of some device error reported by any of the device
+	 *     callback against that mirror.
+	 */
+	void (*release)(struct hmm_mirror *mirror);
+
+	/* free() - mirror can be freed.
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 *
+	 * When this is called, device driver can free the underlying memory
+	 * associated with that mirror. Note this is call from atomic context
+	 * so device driver callback can not sleep.
+	 */
+	void (*free)(struct hmm_mirror *mirror);
+};
+
+
+/* struct hmm - per mm_struct HMM states.
+ *
+ * @mm: The mm struct this hmm is associated with.
+ * @mirrors: List of all mirror for this mm (one per device).
+ * @vm_end: Last valid address for this mm (exclusive).
+ * @kref: Reference counter.
+ * @rwsem: Serialize the mirror list modifications.
+ * @mmu_notifier: The mmu_notifier of this mm.
+ * @rcu: For delayed cleanup call from mmu_notifier.release() callback.
+ *
+ * For each process address space (mm_struct) there is one and only one hmm
+ * struct. hmm functions will redispatch to each devices the change made to
+ * the process address space.
+ *
+ * Device driver must not access this structure other than for getting the
+ * mm pointer.
+ */
+struct hmm {
+	struct mm_struct	*mm;
+	struct hlist_head	mirrors;
+	unsigned long		vm_end;
+	struct kref		kref;
+	struct rw_semaphore	rwsem;
+	struct mmu_notifier	mmu_notifier;
+	struct rcu_head		rcu;
+};
+
+
+/* struct hmm_device - per device HMM structure
+ *
+ * @dev: Linux device structure pointer.
+ * @ops: The hmm operations callback.
+ * @mirrors: List of all active mirrors for the device.
+ * @lock: Lock protecting mirrors list.
+ *
+ * Each device that want to mirror an address space must register one of this
+ * struct (only once per linux device).
+ */
+struct hmm_device {
+	struct device			*dev;
+	const struct hmm_device_ops	*ops;
+	struct list_head		mirrors;
+	spinlock_t			lock;
+};
+
+int hmm_device_register(struct hmm_device *device);
+int hmm_device_unregister(struct hmm_device *device);
+
+
+/* hmm_mirror - device specific mirroring functions.
+ *
+ * Each device that mirror a process has a uniq hmm_mirror struct associating
+ * the process address space with the device. Same process can be mirrored by
+ * several different devices at the same time.
+ */
+
+/* struct hmm_mirror - per device and per mm HMM structure
+ *
+ * @device: The hmm_device struct this hmm_mirror is associated to.
+ * @hmm: The hmm struct this hmm_mirror is associated to.
+ * @kref: Reference counter (private to HMM do not use).
+ * @dlist: List of all hmm_mirror for same device.
+ * @mlist: List of all hmm_mirror for same process.
+ *
+ * Each device that want to mirror an address space must register one of this
+ * struct for each of the address space it wants to mirror. Same device can
+ * mirror several different address space. As well same address space can be
+ * mirror by different devices.
+ */
+struct hmm_mirror {
+	struct hmm_device	*device;
+	struct hmm		*hmm;
+	struct kref		kref;
+	struct list_head	dlist;
+	struct hlist_node	mlist;
+};
+
+int hmm_mirror_register(struct hmm_mirror *mirror);
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror);
+void hmm_mirror_unref(struct hmm_mirror **mirror);
+
+
+#endif /* CONFIG_HMM */
+#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e872f9..b5bf210 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2243,5 +2243,16 @@ void __init setup_nr_node_ids(void);
 static inline void setup_nr_node_ids(void) {}
 #endif
 
+#ifdef CONFIG_HMM
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+	mm->hmm = NULL;
+}
+#else /* !CONFIG_HMM */
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+}
+#endif /* !CONFIG_HMM */
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0038ac7..fa05917 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -15,6 +15,10 @@
 #include <asm/page.h>
 #include <asm/mmu.h>
 
+#ifdef CONFIG_HMM
+struct hmm;
+#endif
+
 #ifndef AT_VECTOR_SIZE_ARCH
 #define AT_VECTOR_SIZE_ARCH 0
 #endif
@@ -451,6 +455,16 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_HMM
+	/*
+	 * hmm always register an mmu_notifier we rely on mmu notifier to keep
+	 * refcount on mm struct as well as forbiding registering hmm on a
+	 * dying mm
+	 *
+	 * This field is set with mmap_sem held in write mode.
+	 */
+	struct hmm *hmm;
+#endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index dbd9b8d..bf2dcb6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -27,6 +27,7 @@
 #include <linux/binfmts.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
@@ -602,6 +603,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
 	mmu_notifier_mm_init(mm);
+	hmm_mm_init(mm);
 	clear_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
diff --git a/mm/Kconfig b/mm/Kconfig
index e79de2b..e1e0a82 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -654,3 +654,17 @@ config DEFERRED_STRUCT_PAGE_INIT
 	  when kswapd starts. This has a potential performance impact on
 	  processes running early in the lifetime of the systemm until kswapd
 	  finishes the initialisation.
+
+if STAGING
+config HMM
+	bool "Enable heterogeneous memory management (HMM)"
+	depends on MMU
+	select MMU_NOTIFIER
+	default n
+	help
+	  Heterogeneous memory management provide infrastructure for a device
+	  to mirror a process address space into an hardware mmu or into any
+	  things supporting pagefault like event.
+
+	  If unsure, say N to disable hmm.
+endif # STAGING
diff --git a/mm/Makefile b/mm/Makefile
index 98c4eae..90ca9c4 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -78,3 +78,4 @@ obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
+obj-$(CONFIG_HMM) += hmm.o
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..8d861c4
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,381 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: JA(C)rA'me Glisse <jglisse@redhat.com>
+ */
+/* This is the core code for heterogeneous memory management (HMM). HMM intend
+ * to provide helper for mirroring a process address space on a device as well
+ * as allowing migration of data between system memory and device memory refer
+ * as remote memory from here on out.
+ *
+ * Refer to include/linux/hmm.h for further information on general design.
+ */
+#include <linux/export.h>
+#include <linux/bitmap.h>
+#include <linux/list.h>
+#include <linux/rculist.h>
+#include <linux/slab.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/ksm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/mmu_context.h>
+#include <linux/memcontrol.h>
+#include <linux/hmm.h>
+#include <linux/wait.h>
+#include <linux/mman.h>
+#include <linux/delay.h>
+#include <linux/workqueue.h>
+
+#include "internal.h"
+
+static struct mmu_notifier_ops hmm_notifier_ops;
+
+
+/* hmm - core HMM functions.
+ *
+ * Core HMM functions that deal with all the process mm activities.
+ */
+
+static int hmm_init(struct hmm *hmm)
+{
+	hmm->mm = current->mm;
+	hmm->vm_end = TASK_SIZE;
+	kref_init(&hmm->kref);
+	INIT_HLIST_HEAD(&hmm->mirrors);
+	init_rwsem(&hmm->rwsem);
+
+	/* register notifier */
+	hmm->mmu_notifier.ops = &hmm_notifier_ops;
+	return __mmu_notifier_register(&hmm->mmu_notifier, current->mm);
+}
+
+static int hmm_add_mirror(struct hmm *hmm, struct hmm_mirror *mirror)
+{
+	struct hmm_mirror *tmp;
+
+	down_write(&hmm->rwsem);
+	hlist_for_each_entry(tmp, &hmm->mirrors, mlist)
+		if (tmp->device == mirror->device) {
+			/* Same device can mirror only once. */
+			up_write(&hmm->rwsem);
+			return -EINVAL;
+		}
+	hlist_add_head(&mirror->mlist, &hmm->mirrors);
+	hmm_mirror_ref(mirror);
+	up_write(&hmm->rwsem);
+
+	return 0;
+}
+
+static inline struct hmm *hmm_ref(struct hmm *hmm)
+{
+	if (!hmm || !kref_get_unless_zero(&hmm->kref))
+		return NULL;
+	return hmm;
+}
+
+static void hmm_destroy_delayed(struct rcu_head *rcu)
+{
+	struct hmm *hmm;
+
+	hmm = container_of(rcu, struct hmm, rcu);
+	kfree(hmm);
+}
+
+static void hmm_destroy(struct kref *kref)
+{
+	struct hmm *hmm;
+
+	hmm = container_of(kref, struct hmm, kref);
+	BUG_ON(!hlist_empty(&hmm->mirrors));
+
+	down_write(&hmm->mm->mmap_sem);
+	/* A new hmm might have been register before reaching that point. */
+	if (hmm->mm->hmm == hmm)
+		hmm->mm->hmm = NULL;
+	up_write(&hmm->mm->mmap_sem);
+
+	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, hmm->mm);
+
+	mmu_notifier_call_srcu(&hmm->rcu, &hmm_destroy_delayed);
+}
+
+static inline struct hmm *hmm_unref(struct hmm *hmm)
+{
+	if (hmm)
+		kref_put(&hmm->kref, hmm_destroy);
+	return NULL;
+}
+
+
+/* hmm_notifier - HMM callback for mmu_notifier tracking change to process mm.
+ *
+ * HMM use use mmu notifier to track change made to process address space.
+ */
+static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct hmm *hmm;
+
+	hmm = hmm_ref(container_of(mn, struct hmm, mmu_notifier));
+	if (!hmm)
+		return;
+
+	down_write(&hmm->rwsem);
+	while (hmm->mirrors.first) {
+		struct hmm_mirror *mirror;
+
+		/*
+		 * Here we are holding the mirror reference from the mirror
+		 * list. As list removal is synchronized through rwsem, no
+		 * other thread can assume it holds that reference.
+		 */
+		mirror = hlist_entry(hmm->mirrors.first,
+				     struct hmm_mirror,
+				     mlist);
+		hlist_del_init(&mirror->mlist);
+		up_write(&hmm->rwsem);
+
+		mirror->device->ops->release(mirror);
+		hmm_mirror_unref(&mirror);
+
+		down_write(&hmm->rwsem);
+	}
+	up_write(&hmm->rwsem);
+
+	hmm_unref(hmm);
+}
+
+static struct mmu_notifier_ops hmm_notifier_ops = {
+	.release		= hmm_notifier_release,
+};
+
+
+/* hmm_mirror - per device mirroring functions.
+ *
+ * Each device that mirror a process has a uniq hmm_mirror struct. A process
+ * can be mirror by several devices at the same time.
+ *
+ * Below are all the functions and their helpers use by device driver to mirror
+ * the process address space. Those functions either deals with updating the
+ * device page table (through hmm callback). Or provide helper functions use by
+ * the device driver to fault in range of memory in the device page table.
+ */
+struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror)
+{
+	if (!mirror || !kref_get_unless_zero(&mirror->kref))
+		return NULL;
+	return mirror;
+}
+EXPORT_SYMBOL(hmm_mirror_ref);
+
+static void hmm_mirror_destroy(struct kref *kref)
+{
+	struct hmm_device *device;
+	struct hmm_mirror *mirror;
+
+	mirror = container_of(kref, struct hmm_mirror, kref);
+	device = mirror->device;
+
+	hmm_unref(mirror->hmm);
+
+	spin_lock(&device->lock);
+	list_del_init(&mirror->dlist);
+	device->ops->free(mirror);
+	spin_unlock(&device->lock);
+}
+
+void hmm_mirror_unref(struct hmm_mirror **mirror)
+{
+	struct hmm_mirror *tmp = mirror ? *mirror : NULL;
+
+	if (tmp) {
+		*mirror = NULL;
+		kref_put(&tmp->kref, hmm_mirror_destroy);
+	}
+}
+EXPORT_SYMBOL(hmm_mirror_unref);
+
+/* hmm_mirror_register() - register mirror against current process for a device.
+ *
+ * @mirror: The mirror struct being registered.
+ * Returns: 0 on success or -ENOMEM, -EINVAL on error.
+ *
+ * Call when device driver want to start mirroring a process address space. The
+ * HMM shim will register mmu_notifier and start monitoring process address
+ * space changes. Hence callback to device driver might happen even before this
+ * function return.
+ *
+ * The task device driver want to mirror must be current !
+ *
+ * Only one mirror per mm and hmm_device can be created, it will return NULL if
+ * the hmm_device already has an hmm_mirror for the the mm.
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror)
+{
+	struct mm_struct *mm = current->mm;
+	struct hmm *hmm = NULL;
+	int ret = 0;
+
+	/* Sanity checks. */
+	BUG_ON(!mirror);
+	BUG_ON(!mirror->device);
+	BUG_ON(!mm);
+
+	/*
+	 * Initialize the mirror struct fields, the mlist init and del dance is
+	 * necessary to make the error path easier for driver and for hmm.
+	 */
+	kref_init(&mirror->kref);
+	INIT_HLIST_NODE(&mirror->mlist);
+	INIT_LIST_HEAD(&mirror->dlist);
+	spin_lock(&mirror->device->lock);
+	list_add(&mirror->dlist, &mirror->device->mirrors);
+	spin_unlock(&mirror->device->lock);
+
+	down_write(&mm->mmap_sem);
+
+	hmm = hmm_ref(mm->hmm);
+	if (hmm == NULL) {
+		/* no hmm registered yet so register one */
+		hmm = kzalloc(sizeof(*mm->hmm), GFP_KERNEL);
+		if (hmm == NULL) {
+			up_write(&mm->mmap_sem);
+			ret = -ENOMEM;
+			goto error;
+		}
+
+		ret = hmm_init(hmm);
+		if (ret) {
+			up_write(&mm->mmap_sem);
+			kfree(hmm);
+			goto error;
+		}
+
+		mm->hmm = hmm;
+	}
+
+	mirror->hmm = hmm;
+	ret = hmm_add_mirror(hmm, mirror);
+	up_write(&mm->mmap_sem);
+	if (ret) {
+		mirror->hmm = NULL;
+		hmm_unref(hmm);
+		goto error;
+	}
+	return 0;
+
+error:
+	spin_lock(&mirror->device->lock);
+	list_del_init(&mirror->dlist);
+	spin_unlock(&mirror->device->lock);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+static void hmm_mirror_kill(struct hmm_mirror *mirror)
+{
+	struct hmm_device *device = mirror->device;
+	struct hmm *hmm = hmm_ref(mirror->hmm);
+
+	if (!hmm)
+		return;
+
+	down_write(&hmm->rwsem);
+	if (!hlist_unhashed(&mirror->mlist)) {
+		hlist_del_init(&mirror->mlist);
+		up_write(&hmm->rwsem);
+		device->ops->release(mirror);
+		hmm_mirror_unref(&mirror);
+	} else
+		up_write(&hmm->rwsem);
+
+	hmm_unref(hmm);
+}
+
+/* hmm_mirror_unregister() - unregister a mirror.
+ *
+ * @mirror: The mirror that link process address space with the device.
+ *
+ * Driver can call this function when it wants to stop mirroring a process.
+ * This will trigger a call to the ->release() callback if it did not aleady
+ * happen.
+ *
+ * Note that caller must hold a reference on the mirror.
+ *
+ * THIS CAN NOT BE CALL FROM device->release() CALLBACK OR IT WILL DEADLOCK.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+	if (mirror == NULL)
+		return;
+
+	hmm_mirror_kill(mirror);
+	mmu_notifier_synchronize();
+	hmm_mirror_unref(&mirror);
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+
+
+/* hmm_device - Each device driver must register one and only one hmm_device
+ *
+ * The hmm_device is the link btw HMM and each device driver.
+ */
+
+/* hmm_device_register() - register a device with HMM.
+ *
+ * @device: The hmm_device struct.
+ * Returns: 0 on success or -EINVAL otherwise.
+ *
+ *
+ * Call when device driver want to register itself with HMM. Device driver must
+ * only register once.
+ */
+int hmm_device_register(struct hmm_device *device)
+{
+	/* sanity check */
+	BUG_ON(!device);
+	BUG_ON(!device->ops);
+	BUG_ON(!device->ops->release);
+
+	spin_lock_init(&device->lock);
+	INIT_LIST_HEAD(&device->mirrors);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_device_register);
+
+/* hmm_device_unregister() - unregister a device with HMM.
+ *
+ * @device: The hmm_device struct.
+ * Returns: 0 on success or -EBUSY otherwise.
+ *
+ * Call when device driver want to unregister itself with HMM. This will check
+ * that there is no any active mirror and returns -EBUSY if so.
+ */
+int hmm_device_unregister(struct hmm_device *device)
+{
+	spin_lock(&device->lock);
+	if (!list_empty(&device->mirrors)) {
+		spin_unlock(&device->lock);
+		return -EBUSY;
+	}
+	spin_unlock(&device->lock);
+	return 0;
+}
+EXPORT_SYMBOL(hmm_device_unregister);
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 06/15] HMM: add HMM page table v4.
  2015-08-13 19:15 HMM (Heterogeneous Memory Management) v10 Jérôme Glisse
                   ` (4 preceding siblings ...)
  2015-08-13 19:15 ` [PATCH 05/15] HMM: introduce heterogeneous memory management v5 Jérôme Glisse
@ 2015-08-13 19:15 ` Jérôme Glisse
  2015-08-13 19:15 ` [PATCH 07/15] HMM: add per mirror " Jérôme Glisse
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Jérôme Glisse @ 2015-08-13 19:15 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse, Jatin Kumar

Heterogeneous memory management main purpose is to mirror a process
address. To do so it must maintain a secondary page table that is
use by the device driver to program the device or build a device
specific page table.

Radix tree can't be use to create this secondary page table because
HMM needs more flags than RADIX_TREE_MAX_TAGS (while this can be
increase we believe HMM will require so much flags that cost will
becomes prohibitive to others users of radix tree).

Moreover radix tree is built around long but for HMM we need to
store dma address and on some platform sizeof(dma_addr_t) is bigger
than sizeof(long). Thus radix tree is unsuitable to fulfill HMM
requirement hence why we introduce this code which allows to create
page table that can grow and shrink dynamicly.

The design is very close to CPU page table as it reuse some of the
feature such as spinlock embedded in struct page.

Changed since v1:
  - Use PAGE_SHIFT as shift value to reserve low bit for private
    device specific flags. This is to allow device driver to use
    and some of the lower bits for their own device specific purpose.
  - Add set of helper for atomically clear, setting and testing bit
    on dma_addr_t pointer. Atomicity being useful only for dirty bit.
  - Differentiate btw DMA mapped entry and non mapped entry (pfn).
  - Split page directory entry and page table entry helpers.

Changed since v2:
  - Rename hmm_pt_iter_update() -> hmm_pt_iter_lookup().
  - Rename hmm_pt_iter_fault() -> hmm_pt_iter_populate().
  - Add hmm_pt_iter_walk()
  - Remove hmm_pt_iter_next() (useless now).
  - Code simplification and improved comments.
  - Fix hmm_pt_fini_directory().

Changed since v3:
  - Fix hmm_pt_iter_directory_unref_safe().

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 MAINTAINERS            |   2 +
 include/linux/hmm_pt.h | 342 ++++++++++++++++++++++++++++
 mm/Makefile            |   2 +-
 mm/hmm_pt.c            | 603 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 948 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/hmm_pt.h
 create mode 100644 mm/hmm_pt.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 410b826..526f71c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4877,6 +4877,8 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/hmm.c
 F:	include/linux/hmm.h
+F:	mm/hmm_pt.c
+F:	include/linux/hmm_pt.h
 
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
new file mode 100644
index 0000000..4a8beb1
--- /dev/null
+++ b/include/linux/hmm_pt.h
@@ -0,0 +1,342 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: JA(C)rA'me Glisse <jglisse@redhat.com>
+ */
+/*
+ * This provide a set of helpers for HMM page table. See include/linux/hmm.h
+ * for a description of what HMM is.
+ *
+ * HMM page table rely on a locking mecanism similar to CPU page table for page
+ * table update. It use the spinlock embedded inside the struct page to protect
+ * change to page table directory which should minimize lock contention for
+ * concurrent update.
+ *
+ * It does also provide a directory tree protection mechanism. Unlike CPU page
+ * table there is no mmap semaphore to protect directory tree from removal and
+ * this is done intentionaly so that concurrent removal/insertion of directory
+ * inside the tree can happen.
+ *
+ * So anyone walking down the page table must protect directory it traverses so
+ * they are not free by some other thread. This is done by using a reference
+ * counter for each directory. Before traversing a directory a reference is
+ * taken and once traversal is done the reference is drop.
+ *
+ * A directory entry dereference and refcount increment of sub-directory page
+ * must happen in a critical rcu section so that directory page removal can
+ * gracefully wait for all possible other threads that might have dereferenced
+ * the directory.
+ */
+#ifndef _HMM_PT_H
+#define _HMM_PT_H
+
+/*
+ * The HMM page table entry does not reflect any specific hardware. It is just
+ * a common entry format use by HMM internal and expose to HMM user so they can
+ * extract information out of HMM page table.
+ *
+ * Device driver should only rely on the helpers and should not traverse the
+ * page table themself.
+ */
+#define HMM_PT_MAX_LEVEL	6
+
+#define HMM_PDE_VALID_BIT	0
+#define HMM_PDE_VALID		(1 << HMM_PDE_VALID_BIT)
+#define HMM_PDE_PFN_MASK	(~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
+
+static inline dma_addr_t hmm_pde_from_pfn(dma_addr_t pfn)
+{
+	return (pfn << PAGE_SHIFT) | HMM_PDE_VALID;
+}
+
+static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
+{
+	return (pde & HMM_PDE_VALID) ? pde >> PAGE_SHIFT : 0;
+}
+
+
+/*
+ * The HMM_PTE_VALID_DMA_BIT is set for valid DMA mapped entry, while for pfn
+ * entry the HMM_PTE_VALID_PFN_BIT is set. If the hmm_device is associated with
+ * a valid struct device than device driver will be supplied with DMA mapped
+ * entry otherwise it will be supplied with pfn entry.
+ *
+ * In the first case the device driver must ignore any pfn entry as they might
+ * show as transient state while HMM is mapping the page.
+ */
+#define HMM_PTE_VALID_DMA_BIT	0
+#define HMM_PTE_VALID_PFN_BIT	1
+#define HMM_PTE_WRITE_BIT	2
+#define HMM_PTE_DIRTY_BIT	3
+/*
+ * Reserve some bits for device driver private flags. Note that thus can only
+ * be manipulated using the hmm_pte_*_bit() sets of helpers.
+ *
+ * WARNING ONLY SET/CLEAR THOSE FLAG ON PTE ENTRY THAT HAVE THE VALID BIT SET
+ * AS OTHERWISE ANY BIT SET BY THE DRIVER WILL BE OVERWRITTEN BY HMM.
+ */
+#define HMM_PTE_HW_SHIFT	4
+
+#define HMM_PTE_PFN_MASK	(~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
+#define HMM_PTE_DMA_MASK	(~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
+
+
+#ifdef __BIG_ENDIAN
+/*
+ * The dma_addr_t casting we do on little endian do not work on big endian. It
+ * would require some macro trickery to adjust the bit value depending on the
+ * number of bit unsigned long have in comparison to dma_addr_t. This is just
+ * low on the todo list for now.
+ */
+#error "HMM not supported on BIG_ENDIAN architecture.\n"
+#else /* __BIG_ENDIAN */
+static inline void hmm_pte_clear_bit(dma_addr_t *ptep, unsigned char bit)
+{
+	clear_bit(bit, (unsigned long *)ptep);
+}
+
+static inline void hmm_pte_set_bit(dma_addr_t *ptep, unsigned char bit)
+{
+	set_bit(bit, (unsigned long *)ptep);
+}
+
+static inline bool hmm_pte_test_bit(dma_addr_t *ptep, unsigned char bit)
+{
+	return !!test_bit(bit, (unsigned long *)ptep);
+}
+
+static inline bool hmm_pte_test_and_clear_bit(dma_addr_t *ptep,
+					      unsigned char bit)
+{
+	return !!test_and_clear_bit(bit, (unsigned long *)ptep);
+}
+
+static inline bool hmm_pte_test_and_set_bit(dma_addr_t *ptep,
+					    unsigned char bit)
+{
+	return !!test_and_set_bit(bit, (unsigned long *)ptep);
+}
+#endif /* __BIG_ENDIAN */
+
+
+#define HMM_PTE_CLEAR_BIT(name, bit)\
+	static inline void hmm_pte_clear_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_clear_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_SET_BIT(name, bit)\
+	static inline void hmm_pte_set_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_set_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_TEST_BIT(name, bit)\
+	static inline bool hmm_pte_test_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_test_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_TEST_AND_CLEAR_BIT(name, bit)\
+	static inline bool hmm_pte_test_and_clear_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_test_and_clear_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_TEST_AND_SET_BIT(name, bit)\
+	static inline bool hmm_pte_test_and_set_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_test_and_set_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_BIT_HELPER(name, bit)\
+	HMM_PTE_CLEAR_BIT(name, bit)\
+	HMM_PTE_SET_BIT(name, bit)\
+	HMM_PTE_TEST_BIT(name, bit)\
+	HMM_PTE_TEST_AND_CLEAR_BIT(name, bit)\
+	HMM_PTE_TEST_AND_SET_BIT(name, bit)
+
+HMM_PTE_BIT_HELPER(valid_dma, HMM_PTE_VALID_DMA_BIT)
+HMM_PTE_BIT_HELPER(valid_pfn, HMM_PTE_VALID_PFN_BIT)
+HMM_PTE_BIT_HELPER(dirty, HMM_PTE_DIRTY_BIT)
+HMM_PTE_BIT_HELPER(write, HMM_PTE_WRITE_BIT)
+
+static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
+{
+	return (pfn << PAGE_SHIFT) | (1 << HMM_PTE_VALID_PFN_BIT);
+}
+
+static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
+{
+	return hmm_pte_test_valid_pfn(&pte) ? pte >> PAGE_SHIFT : 0;
+}
+
+
+/* struct hmm_pt - HMM page table structure.
+ *
+ * @mask: Array of address mask value of each level.
+ * @directory_mask: Mask for directory index (see below).
+ * @last: Last valid address (inclusive).
+ * @pgd: page global directory (top first level of the directory tree).
+ * @lock: Share lock if spinlock_t does not fit in struct page.
+ * @shift: Array of address shift value of each level.
+ * @llevel: Last level.
+ *
+ * The index into each directory for a given address and level is :
+ *   (address >> shift[level]) & directory_mask
+ *
+ * Only hmm_pt.last field needs to be set before calling hmm_pt_init().
+ */
+struct hmm_pt {
+	unsigned long		mask[HMM_PT_MAX_LEVEL];
+	unsigned long		directory_mask;
+	unsigned long		last;
+	dma_addr_t		*pgd;
+	spinlock_t		lock;
+	unsigned char		shift[HMM_PT_MAX_LEVEL];
+	unsigned char		llevel;
+};
+
+int hmm_pt_init(struct hmm_pt *pt);
+void hmm_pt_fini(struct hmm_pt *pt);
+
+static inline unsigned hmm_pt_index(struct hmm_pt *pt,
+				    unsigned long addr,
+				    unsigned level)
+{
+	return (addr >> pt->shift[level]) & pt->directory_mask;
+}
+
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
+					 struct page *ptd,
+					 unsigned level)
+{
+	if (level)
+		spin_lock(&ptd->ptl);
+	else
+		spin_lock(&pt->lock);
+}
+
+static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
+					   struct page *ptd,
+					   unsigned level)
+{
+	if (level)
+		spin_unlock(&ptd->ptl);
+	else
+		spin_unlock(&pt->lock);
+}
+#else /* USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS */
+static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
+					 struct page *ptd,
+					 unsigned level)
+{
+	spin_lock(&pt->lock);
+}
+
+static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
+					   struct page *ptd,
+					   unsigned level)
+{
+	spin_unlock(&pt->lock);
+}
+#endif
+
+static inline void hmm_pt_directory_ref(struct hmm_pt *pt,
+					struct page *ptd)
+{
+	if (!atomic_inc_not_zero(&ptd->_mapcount))
+		/* Illegal this should not happen. */
+		BUG();
+}
+
+static inline void hmm_pt_directory_unref(struct hmm_pt *pt,
+					  struct page *ptd)
+{
+	if (atomic_dec_and_test(&ptd->_mapcount))
+		/* Illegal this should not happen. */
+		BUG();
+
+}
+
+
+/* struct hmm_pt_iter - page table iterator states.
+ *
+ * @ptd: Array of directory struct page pointer for each levels.
+ * @ptdp: Array of pointer to mapped directory levels.
+ * @dead_directories: List of directories that died while walking page table.
+ * @cur: Current address.
+ */
+struct hmm_pt_iter {
+	struct page		*ptd[HMM_PT_MAX_LEVEL - 1];
+	dma_addr_t		*ptdp[HMM_PT_MAX_LEVEL - 1];
+	struct hmm_pt		*pt;
+	struct list_head	dead_directories;
+	unsigned long		cur;
+};
+
+void hmm_pt_iter_init(struct hmm_pt_iter *iter, struct hmm_pt *pt);
+void hmm_pt_iter_fini(struct hmm_pt_iter *iter);
+dma_addr_t *hmm_pt_iter_walk(struct hmm_pt_iter *iter,
+			     unsigned long *addr,
+			     unsigned long *next);
+dma_addr_t *hmm_pt_iter_lookup(struct hmm_pt_iter *iter,
+			       unsigned long addr,
+			       unsigned long *next);
+dma_addr_t *hmm_pt_iter_populate(struct hmm_pt_iter *iter,
+				 unsigned long addr,
+				 unsigned long *next);
+
+/* hmm_pt_protect_directory_ref() - reference current entry directory.
+ *
+ * @iter: Iterator states that currently protect the entry directory.
+ *
+ * This function will reference the current entry directory. Call this when
+ * you add a new valid entry to the entry directory.
+ */
+static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter)
+{
+	BUG_ON(!iter->ptd[iter->pt->llevel - 1]);
+	hmm_pt_directory_ref(iter->pt, iter->ptd[iter->pt->llevel - 1]);
+}
+
+/* hmm_pt_protect_directory_unref() - unreference current entry directory.
+ *
+ * @iter: Iterator states that currently protect the entry directory.
+ *
+ * This function will unreference the current entry directory. Call this when
+ * you remove a valid entry from the entry directory.
+ */
+static inline void hmm_pt_iter_directory_unref(struct hmm_pt_iter *iter)
+{
+	BUG_ON(!iter->ptd[iter->pt->llevel - 1]);
+	hmm_pt_directory_unref(iter->pt, iter->ptd[iter->pt->llevel - 1]);
+}
+
+static inline void hmm_pt_iter_directory_lock(struct hmm_pt_iter *iter)
+{
+	struct hmm_pt *pt = iter->pt;
+
+	hmm_pt_directory_lock(pt, iter->ptd[pt->llevel - 1], pt->llevel);
+}
+
+static inline void hmm_pt_iter_directory_unlock(struct hmm_pt_iter *iter)
+{
+	struct hmm_pt *pt = iter->pt;
+
+	hmm_pt_directory_unlock(pt, iter->ptd[pt->llevel - 1], pt->llevel);
+}
+
+
+#endif /* _HMM_PT_H */
diff --git a/mm/Makefile b/mm/Makefile
index 90ca9c4..04d7d45 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -78,4 +78,4 @@ obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
-obj-$(CONFIG_HMM) += hmm.o
+obj-$(CONFIG_HMM) += hmm.o hmm_pt.o
diff --git a/mm/hmm_pt.c b/mm/hmm_pt.c
new file mode 100644
index 0000000..ed766a0
--- /dev/null
+++ b/mm/hmm_pt.c
@@ -0,0 +1,603 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: JA(C)rA'me Glisse <jglisse@redhat.com>
+ */
+/*
+ * This provide a set of helpers for HMM page table. See include/linux/hmm.h
+ * for a description of what HMM is and include/linux/hmm_pt.h.
+ */
+#include <linux/highmem.h>
+#include <linux/slab.h>
+#include <linux/hmm_pt.h>
+
+/* hmm_pt_init() - initialize HMM page table.
+ *
+ * @pt: HMM page table to initialize.
+ *
+ * This function will initialize HMM page table and allocate memory for global
+ * directory. Only the hmm_pt.last fields need to be set prior to calling this
+ * function.
+ */
+int hmm_pt_init(struct hmm_pt *pt)
+{
+	unsigned directory_shift, i = 0, npgd;
+
+	/* Align end address with end of page for current arch. */
+	pt->last |= (PAGE_SIZE - 1);
+	spin_lock_init(&pt->lock);
+	/*
+	 * Directory shift is the number of bits that a single directory level
+	 * represent. For instance if PAGE_SIZE is 4096 and each entry takes 8
+	 * bytes (sizeof(dma_addr_t) == 8) then directory_shift = 9.
+	 */
+	directory_shift = PAGE_SHIFT - ilog2(sizeof(dma_addr_t));
+	/*
+	 * Level 0 is the root level of the page table. It might use less
+	 * bits than directory_shift but all sub-directory level will use all
+	 * directory_shift bits.
+	 *
+	 * For instance if hmm_pt.last == (1 << 48) - 1, PAGE_SHIFT == 12 and
+	 * sizeof(dma_addr_t) == 8 then :
+	 *   directory_shift = 9
+	 *   shift[0] = 39
+	 *   shift[1] = 30
+	 *   shift[2] = 21
+	 *   shift[3] = 12
+	 *   llevel = 3
+	 *
+	 * Note that shift[llevel] == PAGE_SHIFT because the last level
+	 * correspond to the page table entry level (ignoring the case of huge
+	 * page).
+	 */
+	pt->shift[0] = ((__fls(pt->last >> PAGE_SHIFT) / directory_shift) *
+			directory_shift) + PAGE_SHIFT;
+	while (pt->shift[i++] > PAGE_SHIFT)
+		pt->shift[i] = pt->shift[i - 1] - directory_shift;
+	pt->llevel = i - 1;
+	pt->directory_mask = (1 << directory_shift) - 1;
+
+	for (i = 0; i <= pt->llevel; ++i)
+		pt->mask[i] = ~((1UL << pt->shift[i]) - 1);
+
+	npgd = (pt->last >> pt->shift[0]) + 1;
+	pt->pgd = kcalloc(npgd, sizeof(dma_addr_t), GFP_KERNEL);
+	if (!pt->pgd)
+		return -ENOMEM;
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_pt_init);
+
+static void hmm_pt_fini_directory(struct hmm_pt *pt,
+				  struct page *ptd,
+				  unsigned level)
+{
+	dma_addr_t *ptdp;
+	unsigned i;
+
+	if (level == pt->llevel)
+		return;
+
+	ptdp = kmap(ptd);
+	for (i = 0; i <= pt->directory_mask; ++i) {
+		struct page *lptd;
+
+		if (!(ptdp[i] & HMM_PDE_VALID))
+			continue;
+		lptd = pfn_to_page(hmm_pde_pfn(ptdp[i]));
+		ptdp[i] = 0;
+		hmm_pt_fini_directory(pt, lptd, level + 1);
+		atomic_set(&lptd->_mapcount, -1);
+		__free_page(lptd);
+	}
+	kunmap(ptd);
+}
+
+/* hmm_pt_fini() - finalize HMM page table.
+ *
+ * @pt: HMM page table to finalize.
+ *
+ * This function will free all resources of a directory page table.
+ */
+void hmm_pt_fini(struct hmm_pt *pt)
+{
+	unsigned i;
+
+	/* Free all directory. */
+	for (i = 0; i <= (pt->last >> pt->shift[0]); ++i) {
+		struct page *ptd;
+
+		if (!(pt->pgd[i] & HMM_PDE_VALID))
+			continue;
+		ptd = pfn_to_page(hmm_pde_pfn(pt->pgd[i]));
+		pt->pgd[i] = 0;
+		hmm_pt_fini_directory(pt, ptd, 1);
+		atomic_set(&ptd->_mapcount, -1);
+		__free_page(ptd);
+	}
+
+	kfree(pt->pgd);
+	pt->pgd = NULL;
+}
+EXPORT_SYMBOL(hmm_pt_fini);
+
+/* hmm_pt_level_start() - Start (inclusive) address of directory at given level
+ *
+ * @pt: HMM page table.
+ * @addr: Address for which to get the directory start address.
+ * @level: Directory level.
+ *
+ * This return the start address of directory at given level for a given
+ * address. So using usual x86-64 example with :
+ *   (hmm_pt.last == (1 << 48) - 1, PAGE_SHIFT == 12, sizeof(dma_addr_t) == 8)
+ * We have :
+ *   llevel = 3 (which is the page table entry level)
+ *   shift[0] = 39  mask[0] = ~((1 << 39) - 1)
+ *   shift[1] = 30  mask[1] = ~((1 << 30) - 1)
+ *   shift[2] = 21  mask[2] = ~((1 << 21) - 1)
+ *   shift[3] = 12  mask[3] = ~((1 << 12) - 1)
+ * Which gives :
+ *   start = hmm_pt_level_start(pt, addr, 3)
+ *         = addr & pt->mask[3 - 1]
+ *         = addr & ~((1 << 21) - 1)
+ */
+static inline unsigned long hmm_pt_level_start(struct hmm_pt *pt,
+					       unsigned long addr,
+					       unsigned level)
+{
+	return level ? addr & pt->mask[level - 1] : 0;
+}
+
+/* hmm_pt_level_end() - End address (inclusive) of directory at given level.
+ *
+ * @pt: HMM page table.
+ * @addr: Address for which to get the directory end address.
+ * @level: Directory level.
+ *
+ * This return the start address of directory at given level for a given
+ * address. So using usual x86-64 example with :
+ *   (hmm_pt.last == (1 << 48) - 1, PAGE_SHIFT == 12, sizeof(dma_addr_t) == 8)
+ * We have :
+ *   llevel = 3 (which is the page table entry level)
+ *   shift[0] = 39  mask[0] = ~((1 << 39) - 1)
+ *   shift[1] = 30  mask[1] = ~((1 << 30) - 1)
+ *   shift[2] = 21  mask[2] = ~((1 << 21) - 1)
+ *   shift[3] = 12  mask[3] = ~((1 << 12) - 1)
+ * Which gives :
+ *   start = hmm_pt_level_end(pt, addr, 3)
+ *         = addr | ~pt->mask[3 - 1]
+ *         = addr | ((1 << 21) - 1)
+ */
+static inline unsigned long hmm_pt_level_end(struct hmm_pt *pt,
+					     unsigned long addr,
+					     unsigned level)
+{
+	return level ? (addr | (~pt->mask[level - 1])) : pt->last;
+}
+
+static inline dma_addr_t *hmm_pt_iter_ptdp(struct hmm_pt_iter *iter,
+					   unsigned long addr)
+{
+	struct hmm_pt *pt = iter->pt;
+
+	BUG_ON(!iter->ptd[pt->llevel - 1] ||
+	       addr < hmm_pt_level_start(pt, iter->cur, pt->llevel) ||
+	       addr > hmm_pt_level_end(pt, iter->cur, pt->llevel));
+	return &iter->ptdp[pt->llevel - 1][hmm_pt_index(pt, addr, pt->llevel)];
+}
+
+/* hmm_pt_iter_init() - initialize iterator states.
+ *
+ * @iter: Iterator states.
+ *
+ * This function will initialize iterator states. It must always be pair with a
+ * call to hmm_pt_iter_fini().
+ */
+void hmm_pt_iter_init(struct hmm_pt_iter *iter, struct hmm_pt *pt)
+{
+	iter->pt = pt;
+	memset(iter->ptd, 0, sizeof(iter->ptd));
+	memset(iter->ptdp, 0, sizeof(iter->ptdp));
+	INIT_LIST_HEAD(&iter->dead_directories);
+}
+EXPORT_SYMBOL(hmm_pt_iter_init);
+
+/* hmm_pt_iter_directory_unref_safe() - unref a directory that is safe to free.
+ *
+ * @iter: Iterator states.
+ * @pt: HMM page table.
+ * @level: Level of the directory to unref.
+ *
+ * This function will unreference a directory and add it to dead list if
+ * directory no longer have any reference. It will also clear the entry to
+ * that directory into the upper level directory as well as dropping ref
+ * on the upper directory.
+ */
+static void hmm_pt_iter_directory_unref_safe(struct hmm_pt_iter *iter,
+					     unsigned level)
+{
+	struct page *upper_ptd;
+	dma_addr_t *upper_ptdp;
+
+	/* Nothing to do for root level. */
+	if (!level)
+		return;
+
+	if (!atomic_dec_and_test(&iter->ptd[level - 1]->_mapcount))
+		return;
+
+	upper_ptd = level > 1 ? iter->ptd[level - 2] : NULL;
+	upper_ptdp = level > 1 ? iter->ptdp[level - 2] : iter->pt->pgd;
+	upper_ptdp = &upper_ptdp[hmm_pt_index(iter->pt, iter->cur, level - 1)];
+	hmm_pt_directory_lock(iter->pt, upper_ptd, level - 1);
+	/*
+	 * There might be race btw decrementing reference count on a directory
+	 * and another thread trying to fault in a new directory. To avoid
+	 * erasing the new directory entry we need to check that the entry
+	 * still correspond to the directory we are removing.
+	 */
+	if (hmm_pde_pfn(*upper_ptdp) == page_to_pfn(iter->ptd[level - 1]))
+		*upper_ptdp = 0;
+	hmm_pt_directory_unlock(iter->pt, upper_ptd, level - 1);
+
+	/* Add it to delayed free list. */
+	list_add_tail(&iter->ptd[level - 1]->lru, &iter->dead_directories);
+
+	/*
+	 * The upper directory is now safe to unref as we have an extra ref and
+	 * thus refcount should not reach 0.
+	 */
+	if (upper_ptd)
+		hmm_pt_directory_unref(iter->pt, upper_ptd);
+}
+
+static void hmm_pt_iter_unprotect_directory(struct hmm_pt_iter *iter,
+					    unsigned level)
+{
+	if (!iter->ptd[level - 1])
+		return;
+	kunmap(iter->ptd[level - 1]);
+	hmm_pt_iter_directory_unref_safe(iter, level);
+	iter->ptd[level - 1] = NULL;
+}
+
+/* hmm_pt_iter_protect_directory() - protect a directory.
+ *
+ * @iter: Iterator states.
+ * @ptd: directory struct page to protect.
+ * @addr: Address of the directory.
+ * @level: Level of this directory (> 0).
+ * Returns -EINVAL on error, 1 if protection succeeded, 0 otherwise.
+ *
+ * This function will proctect a directory by taking a reference. It will also
+ * map the directory to allow cpu access.
+ *
+ * Call to this function must be made from inside the rcu read critical section
+ * that convert the table entry to the directory struct page. Doing so allow to
+ * support concurrent removal of directory because this function will take the
+ * reference inside the rcu critical section and thus rcu synchronization will
+ * garanty that we can safely free directory.
+ */
+static int hmm_pt_iter_protect_directory(struct hmm_pt_iter *iter,
+					 struct page *ptd,
+					 unsigned long addr,
+					 unsigned level)
+{
+	/* This must be call inside rcu read section. */
+	BUG_ON(!rcu_read_lock_held());
+
+	if (!level || iter->ptd[level - 1]) {
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+
+	if (!atomic_inc_not_zero(&ptd->_mapcount)) {
+		rcu_read_unlock();
+		return 0;
+	}
+
+	rcu_read_unlock();
+
+	iter->ptd[level - 1] = ptd;
+	iter->ptdp[level - 1] = kmap(ptd);
+	iter->cur = addr;
+
+	return 1;
+}
+
+/* hmm_pt_iter_walk() - Walk page table for a valid entry directory.
+ *
+ * @iter: Iterator states.
+ * @addr: Start address of the range, return address of the entry directory.
+ * @next: End address of the range, return address of next directory.
+ * Returns Entry directory pointer and associated address if a valid entry
+ * directory exist in the range, or NULL and empty (*addr=*next) range
+ * otherwise.
+ *
+ * This function will return the first valid entry directory over a range of
+ * address. It update the addr parameter with the entry address and the next
+ * parameter with the address of the end of that directory. So device driver
+ * can do :
+ *
+ * for (addr = start; addr < end;) {
+ *   unsigned long next = end;
+ *
+ *   for (ptep=hmm_pt_iter_walk(iter, &addr, &next); ptep; addr + PAGE_SIZE) {
+ *     // Use ptep
+ *     ptep++;
+ *   }
+ * }
+ */
+dma_addr_t *hmm_pt_iter_walk(struct hmm_pt_iter *iter,
+			     unsigned long *addr,
+			     unsigned long *next)
+{
+	struct hmm_pt *pt = iter->pt;
+	int i;
+
+	*addr &= PAGE_MASK;
+
+	if (iter->ptd[pt->llevel - 1] &&
+	    *addr >= hmm_pt_level_start(pt, iter->cur, pt->llevel) &&
+	    *addr <= hmm_pt_level_end(pt, iter->cur, pt->llevel)) {
+		*next = min(*next, hmm_pt_level_end(pt, *addr, pt->llevel)+1);
+		return hmm_pt_iter_ptdp(iter, *addr);
+	}
+
+again:
+	/* First unprotect any directory that do not cover the address. */
+	for (i = pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		if (*addr >= hmm_pt_level_start(pt, iter->cur, i) &&
+		    *addr <= hmm_pt_level_end(pt, iter->cur, i))
+			break;
+		hmm_pt_iter_unprotect_directory(iter, i);
+	}
+
+	/* Walk down to last level of the directory tree. */
+	for (; i < pt->llevel; ++i) {
+		struct page *ptd;
+		dma_addr_t pte, *ptdp;
+
+		rcu_read_lock();
+		ptdp = i ? iter->ptdp[i - 1] : pt->pgd;
+		pte = ACCESS_ONCE(ptdp[hmm_pt_index(pt, *addr, i)]);
+		if (!(pte & HMM_PDE_VALID)) {
+			rcu_read_unlock();
+			*addr = hmm_pt_level_end(pt, iter->cur, i) + 1;
+			if (*addr > *next) {
+				*addr = *next;
+				return NULL;
+			}
+			goto again;
+		}
+		ptd = pfn_to_page(hmm_pde_pfn(pte));
+		/* RCU read unlock inside hmm_pt_iter_protect_directory(). */
+		if (hmm_pt_iter_protect_directory(iter, ptd,
+						  *addr, i + 1) != 1) {
+			if (*addr > *next) {
+				*addr = *next;
+				return NULL;
+			}
+			goto again;
+		}
+	}
+
+	*next = min(*next, hmm_pt_level_end(pt, *addr, pt->llevel) + 1);
+	return hmm_pt_iter_ptdp(iter, *addr);
+}
+EXPORT_SYMBOL(hmm_pt_iter_walk);
+
+/* hmm_pt_iter_lookup() - Lookup entry directory for an address.
+ *
+ * @iter: Iterator states.
+ * @addr: Address of the entry directory to lookup.
+ * @next: End address up to which the entry directory is valid.
+ * Returns Entry directory pointer and its end address.
+ *
+ * This function will return the entry directory pointer for a given address as
+ * well as the end address of that directory (address of the next directory).
+ * Use patern is :
+ *
+ * for (addr = start; addr < end;) {
+ *   unsigned long next;
+ *
+ *   for (ptep=hmm_pt_iter_lookup(iter, addr, &next); ptep; addr+=PAGE_SIZE) {
+ *     // Use ptep
+ *     ptep++;
+ *   }
+ * }
+ */
+dma_addr_t *hmm_pt_iter_lookup(struct hmm_pt_iter *iter,
+			       unsigned long addr,
+			       unsigned long *next)
+{
+	struct hmm_pt *pt = iter->pt;
+	int i;
+
+	addr &= PAGE_MASK;
+
+	if (iter->ptd[pt->llevel - 1] &&
+	    addr >= hmm_pt_level_start(pt, iter->cur, pt->llevel) &&
+	    addr <= hmm_pt_level_end(pt, iter->cur, pt->llevel)) {
+		*next = min(*next, hmm_pt_level_end(pt, addr, pt->llevel) + 1);
+		return hmm_pt_iter_ptdp(iter, addr);
+	}
+
+	/* First unprotect any directory that do not cover the address. */
+	for (i = pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		if (addr >= hmm_pt_level_start(pt, iter->cur, i) &&
+		    addr <= hmm_pt_level_end(pt, iter->cur, i))
+			break;
+		hmm_pt_iter_unprotect_directory(iter, i);
+	}
+
+	/* Walk down to last level of the directory tree. */
+	for (; i < pt->llevel; ++i) {
+		struct page *ptd;
+		dma_addr_t pte, *ptdp;
+
+		rcu_read_lock();
+		ptdp = i ? iter->ptdp[i - 1] : pt->pgd;
+		pte = ACCESS_ONCE(ptdp[hmm_pt_index(pt, addr, i)]);
+		if (!(pte & HMM_PDE_VALID)) {
+			rcu_read_unlock();
+			*next = min(*next,
+				    hmm_pt_level_end(pt, iter->cur, i) + 1);
+			return NULL;
+		}
+		ptd = pfn_to_page(hmm_pde_pfn(pte));
+		/* RCU read unlock inside hmm_pt_iter_protect_directory(). */
+		if (hmm_pt_iter_protect_directory(iter, ptd, addr, i + 1) != 1) {
+			*next = min(*next,
+				    hmm_pt_level_end(pt, iter->cur, i) + 1);
+			return NULL;
+		}
+	}
+
+	*next = min(*next, hmm_pt_level_end(pt, addr, pt->llevel) + 1);
+	return hmm_pt_iter_ptdp(iter, addr);
+}
+EXPORT_SYMBOL(hmm_pt_iter_lookup);
+
+/* hmm_pt_iter_populate() - Allocate entry directory for an address.
+ *
+ * @iter: Iterator states.
+ * @addr: Address of the entry directory to lookup.
+ * @next: End address up to which the entry directory is valid.
+ * Returns Entry directory pointer and its end address.
+ *
+ * This function will return the entry directory pointer (and allocate a new
+ * one if none exist) for a given address as well as the end address of that
+ * directory (address of the next directory). Use patern is :
+ *
+ * for (addr = start; addr < end;) {
+ *   unsigned long next;
+ *
+ *   ptep = hmm_pt_iter_populate(iter,addr,&next);
+ *   if (!ptep) {
+ *     // error handling.
+ *   }
+ *   for (; addr < next; addr += PAGE_SIZE, ptep++) {
+ *     // Use ptep
+ *   }
+ * }
+ */
+dma_addr_t *hmm_pt_iter_populate(struct hmm_pt_iter *iter,
+				 unsigned long addr,
+				 unsigned long *next)
+{
+	dma_addr_t *ptdp = hmm_pt_iter_lookup(iter, addr, next);
+	struct hmm_pt *pt = iter->pt;
+	struct page *new = NULL;
+	int i;
+
+	if (ptdp)
+		return ptdp;
+
+	/* Populate directory tree structures. */
+	for (i = 1, iter->cur = addr; i <= pt->llevel; ++i) {
+		struct page *upper_ptd;
+		dma_addr_t *upper_ptdp;
+
+		if (iter->ptd[i - 1])
+			continue;
+
+		new = new ? new : alloc_page(GFP_HIGHUSER | __GFP_ZERO);
+		if (!new)
+			return NULL;
+
+		upper_ptd = i > 1 ? iter->ptd[i - 2] : NULL;
+		upper_ptdp = i > 1 ? iter->ptdp[i - 2] : pt->pgd;
+		upper_ptdp = &upper_ptdp[hmm_pt_index(pt, addr, i - 1)];
+		hmm_pt_directory_lock(pt, upper_ptd, i - 1);
+		if (((*upper_ptdp) & HMM_PDE_VALID)) {
+			struct page *ptd;
+
+			ptd = pfn_to_page(hmm_pde_pfn(*upper_ptdp));
+			if (atomic_inc_not_zero(&ptd->_mapcount)) {
+				/* Already allocated by another thread. */
+				iter->ptd[i - 1] = ptd;
+				hmm_pt_directory_unlock(pt, upper_ptd, i - 1);
+				iter->ptdp[i - 1] = kmap(ptd);
+				continue;
+			}
+			/*
+			 * Means we raced with removal of dead directory it is
+			 * safe to overwritte *upper_ptdp entry with new entry.
+			 */
+		}
+		/* Initialize struct page field for the directory. */
+		atomic_set(&new->_mapcount, 1);
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+		spin_lock_init(&new->ptl);
+#endif
+		*upper_ptdp = hmm_pde_from_pfn(page_to_pfn(new));
+		/* The pgd level is not refcounted. */
+		if (i > 1)
+			hmm_pt_directory_ref(pt, iter->ptd[i - 2]);
+		/* Unlock upper directory and map the new directory. */
+		hmm_pt_directory_unlock(pt, upper_ptd, i - 1);
+		iter->ptd[i - 1] = new;
+		iter->ptdp[i - 1] = kmap(new);
+		new = NULL;
+	}
+	if (new)
+		__free_page(new);
+	*next = min(*next, hmm_pt_level_end(pt, addr, pt->llevel) + 1);
+	return hmm_pt_iter_ptdp(iter, addr);
+}
+EXPORT_SYMBOL(hmm_pt_iter_populate);
+
+/* hmm_pt_iter_fini() - finalize iterator.
+ *
+ * @iter: Iterator states.
+ * @pt: HMM page table.
+ *
+ * This function will cleanup iterator by unmapping and unreferencing any
+ * directory still mapped and referenced. It will also free any dead directory.
+ */
+void hmm_pt_iter_fini(struct hmm_pt_iter *iter)
+{
+	struct page *ptd, *tmp;
+	unsigned i;
+
+	for (i = iter->pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		hmm_pt_iter_unprotect_directory(iter, i);
+	}
+
+	/* Avoid useless synchronize_rcu() if there is no directory to free. */
+	if (list_empty(&iter->dead_directories))
+		return;
+
+	/*
+	 * Some iterator may have dereferenced a dead directory entry and looked
+	 * up the struct page but haven't check yet the reference count. As all
+	 * the above happen in rcu read critical section we know that we need
+	 * to wait for grace period before being able to free any of the dead
+	 * directory page.
+	 */
+	synchronize_rcu();
+	list_for_each_entry_safe(ptd, tmp, &iter->dead_directories, lru) {
+		list_del(&ptd->lru);
+		atomic_set(&ptd->_mapcount, -1);
+		__free_page(ptd);
+	}
+}
+EXPORT_SYMBOL(hmm_pt_iter_fini);
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 07/15] HMM: add per mirror page table v4.
  2015-08-13 19:15 HMM (Heterogeneous Memory Management) v10 Jérôme Glisse
                   ` (5 preceding siblings ...)
  2015-08-13 19:15 ` [PATCH 06/15] HMM: add HMM page table v4 Jérôme Glisse
@ 2015-08-13 19:15 ` Jérôme Glisse
  2015-08-13 19:15 ` [PATCH 08/15] HMM: add device page fault support v4 Jérôme Glisse
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Jérôme Glisse @ 2015-08-13 19:15 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse, Jatin Kumar

This patch add the per mirror page table. It also propagate CPU page
table update to this per mirror page table using mmu_notifier callback.
All update are contextualized with an HMM event structure that convey
all information needed by device driver to take proper actions (update
its own mmu to reflect changes and schedule proper flushing).

Core HMM is responsible for updating the per mirror page table once
the device driver is done with its update. Most importantly HMM will
properly propagate HMM page table dirty bit to underlying page.

Changed since v1:
  - Removed unused fence code to defer it to latter patches.

Changed since v2:
  - Use new bit flag helper for mirror page table manipulation.
  - Differentiate fork event with HMM_FORK from other events.

Changed since v3:
  - Get rid of HMM_ISDIRTY and rely on write protect instead.
  - Adapt to HMM page table changes

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h |  83 ++++++++++++++++++++
 mm/hmm.c            | 218 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 301 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index b559c0b..5488fa9 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -46,6 +46,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/workqueue.h>
 #include <linux/mman.h>
+#include <linux/hmm_pt.h>
 
 
 struct hmm_device;
@@ -53,6 +54,38 @@ struct hmm_mirror;
 struct hmm;
 
 
+/*
+ * hmm_event - each event is described by a type associated with a struct.
+ */
+enum hmm_etype {
+	HMM_NONE = 0,
+	HMM_FORK,
+	HMM_MIGRATE,
+	HMM_MUNMAP,
+	HMM_DEVICE_RFAULT,
+	HMM_DEVICE_WFAULT,
+	HMM_WRITE_PROTECT,
+};
+
+/* struct hmm_event - memory event information.
+ *
+ * @list: So HMM can keep track of all active events.
+ * @start: First address (inclusive).
+ * @end: Last address (exclusive).
+ * @pte_mask: HMM pte update mask (bit(s) that are still valid).
+ * @etype: Event type (munmap, migrate, truncate, ...).
+ * @backoff: Only meaningful for device page fault.
+ */
+struct hmm_event {
+	struct list_head	list;
+	unsigned long		start;
+	unsigned long		end;
+	dma_addr_t		pte_mask;
+	enum hmm_etype		etype;
+	bool			backoff;
+};
+
+
 /* hmm_device - Each device must register one and only one hmm_device.
  *
  * The hmm_device is the link btw HMM and each device driver.
@@ -83,6 +116,54 @@ struct hmm_device_ops {
 	 * so device driver callback can not sleep.
 	 */
 	void (*free)(struct hmm_mirror *mirror);
+
+	/* update() - update device mmu following an event.
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 * @event: The event that triggered the update.
+	 * Returns: 0 on success or error code {-EIO, -ENOMEM}.
+	 *
+	 * Called to update device page table for a range of address.
+	 * The event type provide the nature of the update :
+	 *   - Range is no longer valid (munmap).
+	 *   - Range protection changes (mprotect, COW, ...).
+	 *   - Range is unmapped (swap, reclaim, page migration, ...).
+	 *   - Device page fault.
+	 *   - ...
+	 *
+	 * Thought most device driver only need to use pte_mask as it reflects
+	 * change that will happen to the HMM page table ie :
+	 *   new_pte = old_pte & event->pte_mask;
+	 *
+	 * Device driver must not update the HMM mirror page table (except the
+	 * dirty bit see below). Core HMM will update HMM page table after the
+	 * update is done.
+	 *
+	 * Note that device must be cache coherent with system memory (snooping
+	 * in case of PCIE devices) so there should be no need for device to
+	 * flush anything.
+	 *
+	 * When write protection is turned on device driver must make sure the
+	 * hardware will no longer be able to write to the page otherwise file
+	 * system corruption may occur.
+	 *
+	 * Device must properly set the dirty bit using hmm_pte_set_bit() on
+	 * each page entry for memory that was written by the device. If device
+	 * can not properly account for write access then the dirty bit must be
+	 * set unconditionally so that proper write back of file backed page
+	 * can happen.
+	 *
+	 * Device driver must not fail lightly, any failure result in device
+	 * process being kill.
+	 *
+	 * Return 0 on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return value trigger warning and are transformed to -EIO.
+	 */
+	int (*update)(struct hmm_mirror *mirror,
+		      struct hmm_event *event);
 };
 
 
@@ -149,6 +230,7 @@ int hmm_device_unregister(struct hmm_device *device);
  * @kref: Reference counter (private to HMM do not use).
  * @dlist: List of all hmm_mirror for same device.
  * @mlist: List of all hmm_mirror for same process.
+ * @pt: Mirror page table.
  *
  * Each device that want to mirror an address space must register one of this
  * struct for each of the address space it wants to mirror. Same device can
@@ -161,6 +243,7 @@ struct hmm_mirror {
 	struct kref		kref;
 	struct list_head	dlist;
 	struct hlist_node	mlist;
+	struct hmm_pt		pt;
 };
 
 int hmm_mirror_register(struct hmm_mirror *mirror);
diff --git a/mm/hmm.c b/mm/hmm.c
index 8d861c4..ef94e2a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -45,6 +45,50 @@
 #include "internal.h"
 
 static struct mmu_notifier_ops hmm_notifier_ops;
+static void hmm_mirror_kill(struct hmm_mirror *mirror);
+static inline int hmm_mirror_update(struct hmm_mirror *mirror,
+				    struct hmm_event *event);
+static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
+				 struct hmm_event *event);
+
+
+/* hmm_event - use to track information relating to an event.
+ *
+ * Each change to cpu page table or fault from a device is considered as an
+ * event by hmm. For each event there is a common set of things that need to
+ * be tracked. The hmm_event struct centralize those and the helper functions
+ * help dealing with all this.
+ */
+
+static inline int hmm_event_init(struct hmm_event *event,
+				 struct hmm *hmm,
+				 unsigned long start,
+				 unsigned long end,
+				 enum hmm_etype etype)
+{
+	event->start = start & PAGE_MASK;
+	event->end = min(end, hmm->vm_end);
+	if (event->start >= event->end)
+		return -EINVAL;
+	event->etype = etype;
+	event->pte_mask = (dma_addr_t)-1ULL;
+	switch (etype) {
+	case HMM_DEVICE_RFAULT:
+	case HMM_DEVICE_WFAULT:
+		break;
+	case HMM_FORK:
+	case HMM_WRITE_PROTECT:
+		event->pte_mask ^= (1 << HMM_PTE_WRITE_BIT);
+		break;
+	case HMM_MIGRATE:
+	case HMM_MUNMAP:
+		event->pte_mask = 0;
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
 
 
 /* hmm - core HMM functions.
@@ -123,6 +167,27 @@ static inline struct hmm *hmm_unref(struct hmm *hmm)
 	return NULL;
 }
 
+static void hmm_update(struct hmm *hmm, struct hmm_event *event)
+{
+	struct hmm_mirror *mirror;
+
+	/* Is this hmm already fully stop ? */
+	if (hmm->mm->hmm != hmm)
+		return;
+
+again:
+	down_read(&hmm->rwsem);
+	hlist_for_each_entry(mirror, &hmm->mirrors, mlist)
+		if (hmm_mirror_update(mirror, event)) {
+			mirror = hmm_mirror_ref(mirror);
+			up_read(&hmm->rwsem);
+			hmm_mirror_kill(mirror);
+			hmm_mirror_unref(&mirror);
+			goto again;
+		}
+	up_read(&hmm->rwsem);
+}
+
 
 /* hmm_notifier - HMM callback for mmu_notifier tracking change to process mm.
  *
@@ -139,6 +204,7 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	down_write(&hmm->rwsem);
 	while (hmm->mirrors.first) {
 		struct hmm_mirror *mirror;
+		struct hmm_event event;
 
 		/*
 		 * Here we are holding the mirror reference from the mirror
@@ -151,6 +217,10 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
 		hlist_del_init(&mirror->mlist);
 		up_write(&hmm->rwsem);
 
+		/* Make sure everything is unmapped. */
+		hmm_event_init(&event, mirror->hmm, 0, -1UL, HMM_MUNMAP);
+		hmm_mirror_update(mirror, &event);
+
 		mirror->device->ops->release(mirror);
 		hmm_mirror_unref(&mirror);
 
@@ -161,8 +231,89 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	hmm_unref(hmm);
 }
 
+static void hmm_mmu_mprot_to_etype(struct mm_struct *mm,
+				   unsigned long addr,
+				   enum mmu_event mmu_event,
+				   enum hmm_etype *etype)
+{
+	struct vm_area_struct *vma;
+
+	vma = find_vma(mm, addr);
+	if (!vma || vma->vm_start > addr || !(vma->vm_flags & VM_READ)) {
+		*etype = HMM_MUNMAP;
+		return;
+	}
+
+	if (!(vma->vm_flags & VM_WRITE)) {
+		*etype = HMM_WRITE_PROTECT;
+		return;
+	}
+
+	*etype = HMM_NONE;
+}
+
+static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
+{
+	struct hmm_event event;
+	unsigned long start = range->start, end = range->end;
+	struct hmm *hmm;
+
+	hmm = container_of(mn, struct hmm, mmu_notifier);
+	if (start >= hmm->vm_end)
+		return;
+
+	switch (range->event) {
+	case MMU_FORK:
+		event.etype = HMM_FORK;
+		break;
+	case MMU_MUNLOCK:
+		/* Still same physical ram backing same address. */
+		return;
+	case MMU_MPROT:
+		hmm_mmu_mprot_to_etype(mm, start, range->event, &event.etype);
+		if (event.etype == HMM_NONE)
+			return;
+		break;
+	case MMU_CLEAR_SOFT_DIRTY:
+	case MMU_WRITE_BACK:
+	case MMU_KSM_WRITE_PROTECT:
+		event.etype = HMM_WRITE_PROTECT;
+		break;
+	case MMU_HUGE_PAGE_SPLIT:
+	case MMU_MUNMAP:
+		event.etype = HMM_MUNMAP;
+		break;
+	case MMU_MIGRATE:
+	default:
+		event.etype = HMM_MIGRATE;
+		break;
+	}
+
+	hmm_event_init(&event, hmm, start, end, event.etype);
+
+	hmm_update(hmm, &event);
+}
+
+static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
+					 struct mm_struct *mm,
+					 unsigned long addr,
+					 struct page *page,
+					 enum mmu_event mmu_event)
+{
+	struct mmu_notifier_range range;
+
+	range.start = addr & PAGE_MASK;
+	range.end = range.start + PAGE_SIZE;
+	range.event = mmu_event;
+	hmm_notifier_invalidate_range_start(mn, mm, &range);
+}
+
 static struct mmu_notifier_ops hmm_notifier_ops = {
 	.release		= hmm_notifier_release,
+	.invalidate_page	= hmm_notifier_invalidate_page,
+	.invalidate_range_start	= hmm_notifier_invalidate_range_start,
 };
 
 
@@ -192,6 +343,7 @@ static void hmm_mirror_destroy(struct kref *kref)
 	mirror = container_of(kref, struct hmm_mirror, kref);
 	device = mirror->device;
 
+	hmm_pt_fini(&mirror->pt);
 	hmm_unref(mirror->hmm);
 
 	spin_lock(&device->lock);
@@ -211,6 +363,59 @@ void hmm_mirror_unref(struct hmm_mirror **mirror)
 }
 EXPORT_SYMBOL(hmm_mirror_unref);
 
+static inline int hmm_mirror_update(struct hmm_mirror *mirror,
+				    struct hmm_event *event)
+{
+	struct hmm_device *device = mirror->device;
+	int ret = 0;
+
+	ret = device->ops->update(mirror, event);
+	hmm_mirror_update_pt(mirror, event);
+	return ret;
+}
+
+static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
+				 struct hmm_event *event)
+{
+	unsigned long addr;
+	struct hmm_pt_iter iter;
+
+	hmm_pt_iter_init(&iter, &mirror->pt);
+	for (addr = event->start; addr != event->end;) {
+		unsigned long next = event->end;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_lookup(&iter, addr, &next);
+		if (!hmm_pte) {
+			addr = next;
+			continue;
+		}
+		/*
+		 * The directory lock protect against concurrent clearing of
+		 * page table bit flags. Exceptions being the dirty bit and
+		 * the device driver private flags.
+		 */
+		hmm_pt_iter_directory_lock(&iter);
+		do {
+			if (!hmm_pte_test_valid_pfn(hmm_pte))
+				continue;
+			if (hmm_pte_test_and_clear_dirty(hmm_pte) &&
+			    hmm_pte_test_write(hmm_pte)) {
+				struct page *page;
+
+				page = pfn_to_page(hmm_pte_pfn(*hmm_pte));
+				set_page_dirty(page);
+			}
+			*hmm_pte &= event->pte_mask;
+			if (hmm_pte_test_valid_pfn(hmm_pte))
+				continue;
+			hmm_pt_iter_directory_unref(&iter);
+		} while (addr += PAGE_SIZE, hmm_pte++, addr != next);
+		hmm_pt_iter_directory_unlock(&iter);
+	}
+	hmm_pt_iter_fini(&iter);
+}
+
 /* hmm_mirror_register() - register mirror against current process for a device.
  *
  * @mirror: The mirror struct being registered.
@@ -242,6 +447,11 @@ int hmm_mirror_register(struct hmm_mirror *mirror)
 	 * necessary to make the error path easier for driver and for hmm.
 	 */
 	kref_init(&mirror->kref);
+	mirror->pt.last = TASK_SIZE - 1;
+	if (hmm_pt_init(&mirror->pt)) {
+		kfree(mirror);
+		return -ENOMEM;
+	}
 	INIT_HLIST_NODE(&mirror->mlist);
 	INIT_LIST_HEAD(&mirror->dlist);
 	spin_lock(&mirror->device->lock);
@@ -278,6 +488,7 @@ int hmm_mirror_register(struct hmm_mirror *mirror)
 		hmm_unref(hmm);
 		goto error;
 	}
+	BUG_ON(mirror->pt.last >= hmm->vm_end);
 	return 0;
 
 error:
@@ -298,8 +509,15 @@ static void hmm_mirror_kill(struct hmm_mirror *mirror)
 
 	down_write(&hmm->rwsem);
 	if (!hlist_unhashed(&mirror->mlist)) {
+		struct hmm_event event;
+
 		hlist_del_init(&mirror->mlist);
 		up_write(&hmm->rwsem);
+
+		/* Make sure everything is unmapped. */
+		hmm_event_init(&event, mirror->hmm, 0, -1UL, HMM_MUNMAP);
+		hmm_mirror_update(mirror, &event);
+
 		device->ops->release(mirror);
 		hmm_mirror_unref(&mirror);
 	} else
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 08/15] HMM: add device page fault support v4.
  2015-08-13 19:15 HMM (Heterogeneous Memory Management) v10 Jérôme Glisse
                   ` (6 preceding siblings ...)
  2015-08-13 19:15 ` [PATCH 07/15] HMM: add per mirror " Jérôme Glisse
@ 2015-08-13 19:15 ` Jérôme Glisse
  2015-08-13 19:15 ` [PATCH 09/15] HMM: add mm page table iterator helpers Jérôme Glisse
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Jérôme Glisse @ 2015-08-13 19:15 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse, Jatin Kumar

This patch add helper for device page fault. Device page fault helper will
fill the mirror page table using the CPU page table all this synchronized
with any update to CPU page table.

Changed since v1:
  - Add comment about directory lock.

Changed since v2:
  - Check for mirror->hmm in hmm_mirror_fault()

Changed since v3:
  - Adapt to HMM page table changes.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h |  15 +++
 mm/hmm.c            | 370 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 384 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 5488fa9..d819ec9 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -85,6 +85,12 @@ struct hmm_event {
 	bool			backoff;
 };
 
+static inline bool hmm_event_overlap(const struct hmm_event *a,
+				     const struct hmm_event *b)
+{
+	return !((a->end <= b->start) || (a->start >= b->end));
+}
+
 
 /* hmm_device - Each device must register one and only one hmm_device.
  *
@@ -176,6 +182,10 @@ struct hmm_device_ops {
  * @rwsem: Serialize the mirror list modifications.
  * @mmu_notifier: The mmu_notifier of this mm.
  * @rcu: For delayed cleanup call from mmu_notifier.release() callback.
+ * @device_faults: List of all active device page faults.
+ * @ndevice_faults: Number of active device page faults.
+ * @wait_queue: Wait queue for event synchronization.
+ * @lock: Serialize device_faults list modification.
  *
  * For each process address space (mm_struct) there is one and only one hmm
  * struct. hmm functions will redispatch to each devices the change made to
@@ -192,6 +202,10 @@ struct hmm {
 	struct rw_semaphore	rwsem;
 	struct mmu_notifier	mmu_notifier;
 	struct rcu_head		rcu;
+	struct list_head	device_faults;
+	unsigned		ndevice_faults;
+	wait_queue_head_t	wait_queue;
+	spinlock_t		lock;
 };
 
 
@@ -250,6 +264,7 @@ int hmm_mirror_register(struct hmm_mirror *mirror);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
 struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror);
 void hmm_mirror_unref(struct hmm_mirror **mirror);
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event);
 
 
 #endif /* CONFIG_HMM */
diff --git a/mm/hmm.c b/mm/hmm.c
index ef94e2a..88f7801 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -67,7 +67,7 @@ static inline int hmm_event_init(struct hmm_event *event,
 				 enum hmm_etype etype)
 {
 	event->start = start & PAGE_MASK;
-	event->end = min(end, hmm->vm_end);
+	event->end = PAGE_ALIGN(min(end, hmm->vm_end));
 	if (event->start >= event->end)
 		return -EINVAL;
 	event->etype = etype;
@@ -103,6 +103,10 @@ static int hmm_init(struct hmm *hmm)
 	kref_init(&hmm->kref);
 	INIT_HLIST_HEAD(&hmm->mirrors);
 	init_rwsem(&hmm->rwsem);
+	INIT_LIST_HEAD(&hmm->device_faults);
+	hmm->ndevice_faults = 0;
+	init_waitqueue_head(&hmm->wait_queue);
+	spin_lock_init(&hmm->lock);
 
 	/* register notifier */
 	hmm->mmu_notifier.ops = &hmm_notifier_ops;
@@ -167,6 +171,58 @@ static inline struct hmm *hmm_unref(struct hmm *hmm)
 	return NULL;
 }
 
+static int hmm_device_fault_start(struct hmm *hmm, struct hmm_event *event)
+{
+	int ret = 0;
+
+	mmu_notifier_range_wait_valid(hmm->mm, event->start, event->end);
+
+	spin_lock(&hmm->lock);
+	if (mmu_notifier_range_is_valid(hmm->mm, event->start, event->end)) {
+		list_add_tail(&event->list, &hmm->device_faults);
+		hmm->ndevice_faults++;
+		event->backoff = false;
+	} else
+		ret = -EAGAIN;
+	spin_unlock(&hmm->lock);
+
+	wake_up(&hmm->wait_queue);
+
+	return ret;
+}
+
+static void hmm_device_fault_end(struct hmm *hmm, struct hmm_event *event)
+{
+	spin_lock(&hmm->lock);
+	list_del_init(&event->list);
+	hmm->ndevice_faults--;
+	spin_unlock(&hmm->lock);
+
+	wake_up(&hmm->wait_queue);
+}
+
+static void hmm_wait_device_fault(struct hmm *hmm, struct hmm_event *ievent)
+{
+	struct hmm_event *fevent;
+	unsigned long wait_for = 0;
+
+again:
+	spin_lock(&hmm->lock);
+	list_for_each_entry(fevent, &hmm->device_faults, list) {
+		if (!hmm_event_overlap(fevent, ievent))
+			continue;
+		fevent->backoff = true;
+		wait_for = hmm->ndevice_faults;
+	}
+	spin_unlock(&hmm->lock);
+
+	if (wait_for > 0) {
+		wait_event(hmm->wait_queue, wait_for != hmm->ndevice_faults);
+		wait_for = 0;
+		goto again;
+	}
+}
+
 static void hmm_update(struct hmm *hmm, struct hmm_event *event)
 {
 	struct hmm_mirror *mirror;
@@ -175,6 +231,8 @@ static void hmm_update(struct hmm *hmm, struct hmm_event *event)
 	if (hmm->mm->hmm != hmm)
 		return;
 
+	hmm_wait_device_fault(hmm, event);
+
 again:
 	down_read(&hmm->rwsem);
 	hlist_for_each_entry(mirror, &hmm->mirrors, mlist)
@@ -186,6 +244,33 @@ again:
 			goto again;
 		}
 	up_read(&hmm->rwsem);
+
+	wake_up(&hmm->wait_queue);
+}
+
+static int hmm_mm_fault(struct hmm *hmm,
+			struct hmm_event *event,
+			struct vm_area_struct *vma,
+			unsigned long addr)
+{
+	unsigned flags = FAULT_FLAG_ALLOW_RETRY;
+	struct mm_struct *mm = vma->vm_mm;
+	int r;
+
+	flags |= (event->etype == HMM_DEVICE_WFAULT) ? FAULT_FLAG_WRITE : 0;
+	for (addr &= PAGE_MASK; addr < event->end; addr += PAGE_SIZE) {
+
+		r = handle_mm_fault(mm, vma, addr, flags);
+		if (r & VM_FAULT_RETRY)
+			return -EBUSY;
+		if (r & VM_FAULT_ERROR) {
+			if (r & VM_FAULT_OOM)
+				return -ENOMEM;
+			/* Same error code for all other cases. */
+			return -EFAULT;
+		}
+	}
+	return 0;
 }
 
 
@@ -228,6 +313,7 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	}
 	up_write(&hmm->rwsem);
 
+	wake_up(&hmm->wait_queue);
 	hmm_unref(hmm);
 }
 
@@ -416,6 +502,288 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 	hmm_pt_iter_fini(&iter);
 }
 
+static inline bool hmm_mirror_is_dead(struct hmm_mirror *mirror)
+{
+	if (hlist_unhashed(&mirror->mlist) || list_empty(&mirror->dlist))
+		return true;
+	return false;
+}
+
+struct hmm_mirror_fault {
+	struct hmm_mirror	*mirror;
+	struct hmm_event	*event;
+	struct vm_area_struct	*vma;
+	unsigned long		addr;
+	struct hmm_pt_iter	*iter;
+};
+
+static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
+				 struct hmm_event *event,
+				 struct vm_area_struct *vma,
+				 struct hmm_pt_iter *iter,
+				 pmd_t *pmdp,
+				 struct hmm_mirror_fault *mirror_fault,
+				 unsigned long start,
+				 unsigned long end)
+{
+	struct page *page;
+	unsigned long addr, pfn;
+	unsigned flags = FOLL_TOUCH;
+	spinlock_t *ptl;
+	int ret;
+
+	ptl = pmd_lock(mirror->hmm->mm, pmdp);
+	if (unlikely(!pmd_trans_huge(*pmdp))) {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
+	if (unlikely(pmd_trans_splitting(*pmdp))) {
+		spin_unlock(ptl);
+		wait_split_huge_page(vma->anon_vma, pmdp);
+		return -EAGAIN;
+	}
+	flags |= event->etype == HMM_DEVICE_WFAULT ? FOLL_WRITE : 0;
+	page = follow_trans_huge_pmd(vma, start, pmdp, flags);
+	pfn = page_to_pfn(page);
+	spin_unlock(ptl);
+
+	/* Just fault in the whole PMD. */
+	start &= PMD_MASK;
+	end = start + PMD_SIZE - 1;
+
+	if (!pmd_write(*pmdp) && event->etype == HMM_DEVICE_WFAULT)
+			return -ENOENT;
+
+	for (ret = 0, addr = start; !ret && addr < end;) {
+		unsigned long i = 0, next = end;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+		if (!hmm_pte)
+			return -ENOMEM;
+
+		/*
+		 * The directory lock protect against concurrent clearing of
+		 * page table bit flags. Exceptions being the dirty bit and
+		 * the device driver private flags.
+		 */
+		hmm_pt_iter_directory_lock(iter);
+		do {
+			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
+				hmm_pte[i] = hmm_pte_from_pfn(pfn);
+				hmm_pt_iter_directory_ref(iter);
+			}
+			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pfn);
+			if (pmd_write(*pmdp))
+				hmm_pte_set_write(&hmm_pte[i]);
+		} while (addr += PAGE_SIZE, pfn++, i++, addr != next);
+		hmm_pt_iter_directory_unlock(iter);
+		mirror_fault->addr = addr;
+	}
+
+	return 0;
+}
+
+static int hmm_mirror_fault_pmd(pmd_t *pmdp,
+				unsigned long start,
+				unsigned long end,
+				struct mm_walk *walk)
+{
+	struct hmm_mirror_fault *mirror_fault = walk->private;
+	struct hmm_mirror *mirror = mirror_fault->mirror;
+	struct hmm_event *event = mirror_fault->event;
+	struct hmm_pt_iter *iter = mirror_fault->iter;
+	bool write = (event->etype == HMM_DEVICE_WFAULT);
+	unsigned long addr;
+	int ret = 0;
+
+	/* Make sure there was no gap. */
+	if (start != mirror_fault->addr)
+		return -ENOENT;
+
+	if (event->backoff)
+		return -EAGAIN;
+
+	if (pmd_none(*pmdp))
+		return -ENOENT;
+
+	if (pmd_trans_huge(*pmdp))
+		return hmm_mirror_fault_hpmd(mirror, event, mirror_fault->vma,
+					     iter, pmdp, mirror_fault, start,
+					     end);
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp))
+		return -EFAULT;
+
+	for (ret = 0, addr = start; !ret && addr < end;) {
+		unsigned long i = 0, next = end;
+		dma_addr_t *hmm_pte;
+		pte_t *ptep;
+
+		hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+		if (!hmm_pte)
+			return -ENOMEM;
+
+		ptep = pte_offset_map(pmdp, start);
+		hmm_pt_iter_directory_lock(iter);
+		do {
+			if (!pte_present(*ptep) ||
+			    (write && !pte_write(*ptep))) {
+				ret = -ENOENT;
+				ptep++;
+				break;
+			}
+
+			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
+				hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(*ptep));
+				hmm_pt_iter_directory_ref(iter);
+			}
+			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pte_pfn(*ptep));
+			if (pte_write(*ptep))
+				hmm_pte_set_write(&hmm_pte[i]);
+		} while (addr += PAGE_SIZE, ptep++, i++, addr != next);
+		hmm_pt_iter_directory_unlock(iter);
+		pte_unmap(ptep - 1);
+		mirror_fault->addr = addr;
+	}
+
+	return ret;
+}
+
+static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
+				   struct hmm_event *event,
+				   struct vm_area_struct *vma,
+				   struct hmm_pt_iter *iter)
+{
+	struct hmm_mirror_fault mirror_fault;
+	unsigned long addr = event->start;
+	struct mm_walk walk = {0};
+	int ret = 0;
+
+	if ((event->etype == HMM_DEVICE_WFAULT) && !(vma->vm_flags & VM_WRITE))
+		return -EACCES;
+
+	ret = hmm_device_fault_start(mirror->hmm, event);
+	if (ret)
+		return ret;
+
+again:
+	if (event->backoff) {
+		ret = -EAGAIN;
+		goto out;
+	}
+	if (addr >= event->end)
+		goto out;
+
+	mirror_fault.event = event;
+	mirror_fault.mirror = mirror;
+	mirror_fault.vma = vma;
+	mirror_fault.addr = addr;
+	mirror_fault.iter = iter;
+	walk.mm = mirror->hmm->mm;
+	walk.private = &mirror_fault;
+	walk.pmd_entry = hmm_mirror_fault_pmd;
+	ret = walk_page_range(addr, event->end, &walk);
+	if (!ret) {
+		ret = mirror->device->ops->update(mirror, event);
+		if (!ret) {
+			addr = mirror_fault.addr;
+			goto again;
+		}
+	}
+
+out:
+	hmm_device_fault_end(mirror->hmm, event);
+	if (ret == -ENOENT) {
+		ret = hmm_mm_fault(mirror->hmm, event, vma, addr);
+		ret = ret ? ret : -EAGAIN;
+	}
+	return ret;
+}
+
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event)
+{
+	struct vm_area_struct *vma;
+	struct hmm_pt_iter iter;
+	int ret = 0;
+
+	mirror = hmm_mirror_ref(mirror);
+	if (!mirror)
+		return -ENODEV;
+	if (event->start >= mirror->hmm->vm_end) {
+		hmm_mirror_unref(&mirror);
+		return -EINVAL;
+	}
+	if (hmm_event_init(event, mirror->hmm, event->start,
+			   event->end, event->etype)) {
+		hmm_mirror_unref(&mirror);
+		return -EINVAL;
+	}
+	hmm_pt_iter_init(&iter, &mirror->pt);
+
+retry:
+	if (hmm_mirror_is_dead(mirror)) {
+		hmm_mirror_unref(&mirror);
+		return -ENODEV;
+	}
+
+	/*
+	 * So synchronization with the cpu page table is the most important
+	 * and tedious aspect of device page fault. There must be a strong
+	 * ordering btw call to device->update() for device page fault and
+	 * device->update() for cpu page table invalidation/update.
+	 *
+	 * Page that are exposed to device driver must stay valid while the
+	 * callback is in progress ie any cpu page table invalidation that
+	 * render those pages obsolete must call device->update() after the
+	 * device->update() call that faulted those pages.
+	 *
+	 * To achieve this we rely on few things. First the mmap_sem insure
+	 * us that any munmap() syscall will serialize with us. So issue are
+	 * with unmap_mapping_range() and with migrate or merge page. For this
+	 * hmm keep track of affected range of address and block device page
+	 * fault that hit overlapping range.
+	 */
+	down_read(&mirror->hmm->mm->mmap_sem);
+	vma = find_vma_intersection(mirror->hmm->mm, event->start, event->end);
+	if (!vma) {
+		ret = -EFAULT;
+		goto out;
+	}
+	if (vma->vm_start > event->start) {
+		event->end = vma->vm_start;
+		ret = -EFAULT;
+		goto out;
+	}
+	event->end = min(event->end, vma->vm_end) & PAGE_MASK;
+	if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	switch (event->etype) {
+	case HMM_DEVICE_RFAULT:
+	case HMM_DEVICE_WFAULT:
+		ret = hmm_mirror_handle_fault(mirror, event, vma, &iter);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+out:
+	/* Drop the mmap_sem so anyone waiting on it have a chance. */
+	if (ret != -EBUSY)
+		up_read(&mirror->hmm->mm->mmap_sem);
+	wake_up(&mirror->hmm->wait_queue);
+	if (ret == -EAGAIN)
+		goto retry;
+	hmm_pt_iter_fini(&iter);
+	hmm_mirror_unref(&mirror);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_fault);
+
 /* hmm_mirror_register() - register mirror against current process for a device.
  *
  * @mirror: The mirror struct being registered.
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 09/15] HMM: add mm page table iterator helpers.
  2015-08-13 19:15 HMM (Heterogeneous Memory Management) v10 Jérôme Glisse
                   ` (7 preceding siblings ...)
  2015-08-13 19:15 ` [PATCH 08/15] HMM: add device page fault support v4 Jérôme Glisse
@ 2015-08-13 19:15 ` Jérôme Glisse
  2015-08-13 19:15 ` [PATCH 10/15] HMM: use CPU page table during invalidation Jérôme Glisse
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Jérôme Glisse @ 2015-08-13 19:15 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

Because inside the mmu_notifier callback we do not have access to the
vma nor do we know which lock we are holding (the mmap semaphore or
the i_mmap_lock) we can not rely on the regular page table walk (nor
do we want as we have to be carefull to not split huge page).

So this patch introduce an helper to iterate of the cpu page table
content in an efficient way for the situation we are in. Which is we
know that none of the page table entry might vanish from below us
and thus it is safe to walk the page table.

The only added value of the iterator is that it keeps the page table
entry level map accross call which fit well with the HMM mirror page
table update code.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 mm/hmm.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)

diff --git a/mm/hmm.c b/mm/hmm.c
index 88f7801..f0c3c81 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -403,6 +403,101 @@ static struct mmu_notifier_ops hmm_notifier_ops = {
 };
 
 
+struct mm_pt_iter {
+	struct mm_struct	*mm;
+	pte_t			*ptep;
+	unsigned long		addr;
+};
+
+static void mm_pt_iter_init(struct mm_pt_iter *pt_iter, struct mm_struct *mm)
+{
+	pt_iter->mm = mm;
+	pt_iter->ptep = NULL;
+	pt_iter->addr = -1UL;
+}
+
+static void mm_pt_iter_fini(struct mm_pt_iter *pt_iter)
+{
+	pte_unmap(pt_iter->ptep);
+	pt_iter->ptep = NULL;
+	pt_iter->addr = -1UL;
+	pt_iter->mm = NULL;
+}
+
+static inline bool mm_pt_iter_in_range(struct mm_pt_iter *pt_iter,
+				       unsigned long addr)
+{
+	return (addr >= pt_iter->addr && addr < (pt_iter->addr + PMD_SIZE));
+}
+
+static struct page *mm_pt_iter_page(struct mm_pt_iter *pt_iter,
+				    unsigned long addr)
+{
+	pgd_t *pgdp;
+	pud_t *pudp;
+	pmd_t *pmdp;
+
+again:
+	/*
+	 * What we are doing here is only valid if we old either the mmap
+	 * semaphore or the i_mmap_lock of vma->address_space the address
+	 * belongs to. Sadly because we can not easily get the vma struct
+	 * we can not sanity test that either of those lock is taken.
+	 *
+	 * We have to rely on people using this code knowing what they do.
+	 */
+	if (mm_pt_iter_in_range(pt_iter, addr) && likely(pt_iter->ptep)) {
+		pte_t pte = *(pt_iter->ptep + pte_index(addr));
+		unsigned long pfn;
+
+		if (pte_none(pte) || !pte_present(pte))
+			return NULL;
+		if (unlikely(pte_special(pte)))
+			return NULL;
+
+		pfn = pte_pfn(pte);
+		if (is_zero_pfn(pfn))
+			return NULL;
+		return pfn_to_page(pfn);
+	}
+
+	if (pt_iter->ptep) {
+		pte_unmap(pt_iter->ptep);
+		pt_iter->ptep = NULL;
+		pt_iter->addr = -1UL;
+	}
+
+	pgdp = pgd_offset(pt_iter->mm, addr);
+	if (pgd_none_or_clear_bad(pgdp))
+		return NULL;
+	pudp = pud_offset(pgdp, addr);
+	if (pud_none_or_clear_bad(pudp))
+		return NULL;
+	pmdp = pmd_offset(pudp, addr);
+	/*
+	 * Because we either have the mmap semaphore or the i_mmap_lock we know
+	 * that pmd can not vanish from under us, thus if pmd exist then it is
+	 * either a huge page or a valid pmd. It might also be in the splitting
+	 * transitory state.
+	 */
+	if (pmd_none(*pmdp) || unlikely(pmd_bad(*pmdp)))
+		return NULL;
+	if (pmd_trans_splitting(*pmdp))
+		/*
+		 * FIXME idealy we would wait but we have no easy mean to get a
+		 * hold of the vma. So for now busy loop until the splitting is
+		 * done.
+		 */
+		goto again;
+	if (pmd_huge(*pmdp))
+		return pmd_page(*pmdp) + pte_index(addr);
+	/* Regular pmd and it can not morph. */
+	pt_iter->ptep = pte_offset_map(pmdp, addr & PMD_MASK);
+	pt_iter->addr = addr & PMD_MASK;
+	goto again;
+}
+
+
 /* hmm_mirror - per device mirroring functions.
  *
  * Each device that mirror a process has a uniq hmm_mirror struct. A process
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 10/15] HMM: use CPU page table during invalidation.
  2015-08-13 19:15 HMM (Heterogeneous Memory Management) v10 Jérôme Glisse
                   ` (8 preceding siblings ...)
  2015-08-13 19:15 ` [PATCH 09/15] HMM: add mm page table iterator helpers Jérôme Glisse
@ 2015-08-13 19:15 ` Jérôme Glisse
  2015-08-13 19:15 ` [PATCH 11/15] HMM: add discard range helper (to clear and free resources for a range) Jérôme Glisse
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Jérôme Glisse @ 2015-08-13 19:15 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jerome Glisse

From: Jerome Glisse <jglisse@redhat.com>

Once we store the dma mapping inside the secondary page table we can
no longer easily find back the page backing an address. Instead use
the cpu page table which still has the proper information, except for
the invalidate_page() case which is handled by using the page passed
by the mmu_notifier layer.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 mm/hmm.c | 53 +++++++++++++++++++++++++++++++++++------------------
 1 file changed, 35 insertions(+), 18 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index f0c3c81..d652233 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -47,9 +47,11 @@
 static struct mmu_notifier_ops hmm_notifier_ops;
 static void hmm_mirror_kill(struct hmm_mirror *mirror);
 static inline int hmm_mirror_update(struct hmm_mirror *mirror,
-				    struct hmm_event *event);
+				    struct hmm_event *event,
+				    struct page *page);
 static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
-				 struct hmm_event *event);
+				 struct hmm_event *event,
+				 struct page *page);
 
 
 /* hmm_event - use to track information relating to an event.
@@ -223,7 +225,9 @@ again:
 	}
 }
 
-static void hmm_update(struct hmm *hmm, struct hmm_event *event)
+static void hmm_update(struct hmm *hmm,
+		       struct hmm_event *event,
+		       struct page *page)
 {
 	struct hmm_mirror *mirror;
 
@@ -236,7 +240,7 @@ static void hmm_update(struct hmm *hmm, struct hmm_event *event)
 again:
 	down_read(&hmm->rwsem);
 	hlist_for_each_entry(mirror, &hmm->mirrors, mlist)
-		if (hmm_mirror_update(mirror, event)) {
+		if (hmm_mirror_update(mirror, event, page)) {
 			mirror = hmm_mirror_ref(mirror);
 			up_read(&hmm->rwsem);
 			hmm_mirror_kill(mirror);
@@ -304,7 +308,7 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
 
 		/* Make sure everything is unmapped. */
 		hmm_event_init(&event, mirror->hmm, 0, -1UL, HMM_MUNMAP);
-		hmm_mirror_update(mirror, &event);
+		hmm_mirror_update(mirror, &event, NULL);
 
 		mirror->device->ops->release(mirror);
 		hmm_mirror_unref(&mirror);
@@ -338,9 +342,10 @@ static void hmm_mmu_mprot_to_etype(struct mm_struct *mm,
 	*etype = HMM_NONE;
 }
 
-static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
-					struct mm_struct *mm,
-					const struct mmu_notifier_range *range)
+static void hmm_notifier_invalidate(struct mmu_notifier *mn,
+				    struct mm_struct *mm,
+				    struct page *page,
+				    const struct mmu_notifier_range *range)
 {
 	struct hmm_event event;
 	unsigned long start = range->start, end = range->end;
@@ -379,7 +384,14 @@ static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 	hmm_event_init(&event, hmm, start, end, event.etype);
 
-	hmm_update(hmm, &event);
+	hmm_update(hmm, &event, page);
+}
+
+static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
+{
+	hmm_notifier_invalidate(mn, mm, NULL, range);
 }
 
 static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
@@ -393,7 +405,7 @@ static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
 	range.start = addr & PAGE_MASK;
 	range.end = range.start + PAGE_SIZE;
 	range.event = mmu_event;
-	hmm_notifier_invalidate_range_start(mn, mm, &range);
+	hmm_notifier_invalidate(mn, mm, page, &range);
 }
 
 static struct mmu_notifier_ops hmm_notifier_ops = {
@@ -545,23 +557,27 @@ void hmm_mirror_unref(struct hmm_mirror **mirror)
 EXPORT_SYMBOL(hmm_mirror_unref);
 
 static inline int hmm_mirror_update(struct hmm_mirror *mirror,
-				    struct hmm_event *event)
+				    struct hmm_event *event,
+				    struct page *page)
 {
 	struct hmm_device *device = mirror->device;
 	int ret = 0;
 
 	ret = device->ops->update(mirror, event);
-	hmm_mirror_update_pt(mirror, event);
+	hmm_mirror_update_pt(mirror, event, page);
 	return ret;
 }
 
 static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
-				 struct hmm_event *event)
+				 struct hmm_event *event,
+				 struct page *page)
 {
 	unsigned long addr;
 	struct hmm_pt_iter iter;
+	struct mm_pt_iter mm_iter;
 
 	hmm_pt_iter_init(&iter, &mirror->pt);
+	mm_pt_iter_init(&mm_iter, mirror->hmm->mm);
 	for (addr = event->start; addr != event->end;) {
 		unsigned long next = event->end;
 		dma_addr_t *hmm_pte;
@@ -582,10 +598,10 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 				continue;
 			if (hmm_pte_test_and_clear_dirty(hmm_pte) &&
 			    hmm_pte_test_write(hmm_pte)) {
-				struct page *page;
-
-				page = pfn_to_page(hmm_pte_pfn(*hmm_pte));
-				set_page_dirty(page);
+				page = page ? : mm_pt_iter_page(&mm_iter, addr);
+				if (page)
+					set_page_dirty(page);
+				page = NULL;
 			}
 			*hmm_pte &= event->pte_mask;
 			if (hmm_pte_test_valid_pfn(hmm_pte))
@@ -595,6 +611,7 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 		hmm_pt_iter_directory_unlock(&iter);
 	}
 	hmm_pt_iter_fini(&iter);
+	mm_pt_iter_fini(&mm_iter);
 }
 
 static inline bool hmm_mirror_is_dead(struct hmm_mirror *mirror)
@@ -979,7 +996,7 @@ static void hmm_mirror_kill(struct hmm_mirror *mirror)
 
 		/* Make sure everything is unmapped. */
 		hmm_event_init(&event, mirror->hmm, 0, -1UL, HMM_MUNMAP);
-		hmm_mirror_update(mirror, &event);
+		hmm_mirror_update(mirror, &event, NULL);
 
 		device->ops->release(mirror);
 		hmm_mirror_unref(&mirror);
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 11/15] HMM: add discard range helper (to clear and free resources for a range).
  2015-08-13 19:15 HMM (Heterogeneous Memory Management) v10 Jérôme Glisse
                   ` (9 preceding siblings ...)
  2015-08-13 19:15 ` [PATCH 10/15] HMM: use CPU page table during invalidation Jérôme Glisse
@ 2015-08-13 19:15 ` Jérôme Glisse
  2015-08-13 19:15 ` [PATCH 12/15] HMM: add dirty range helper (toggle dirty bit inside mirror page table) v2 Jérôme Glisse
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Jérôme Glisse @ 2015-08-13 19:15 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

A common use case is for device driver to stop caring for a range of
address long before said range is munmapped by userspace program. To
avoid keeping track of such range provide an helper function that will
free HMM resources for a range of address.

NOTE THAT DEVICE DRIVER MUST MAKE SURE THE HARDWARE WILL NO LONGER
ACCESS THE RANGE BECAUSE CALLING THIS HELPER !

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 include/linux/hmm.h |  3 +++
 mm/hmm.c            | 24 ++++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index d819ec9..10e1558 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -265,6 +265,9 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
 struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror);
 void hmm_mirror_unref(struct hmm_mirror **mirror);
 int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event);
+void hmm_mirror_range_discard(struct hmm_mirror *mirror,
+			      unsigned long start,
+			      unsigned long end);
 
 
 #endif /* CONFIG_HMM */
diff --git a/mm/hmm.c b/mm/hmm.c
index d652233..f271441 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -896,6 +896,30 @@ out:
 }
 EXPORT_SYMBOL(hmm_mirror_fault);
 
+/* hmm_mirror_range_discard() - discard a range of address.
+ *
+ * @mirror: The mirror struct.
+ * @start: Start address of the range to discard (inclusive).
+ * @end: End address of the range to discard (exclusive).
+ *
+ * Call when device driver want to stop mirroring a range of address and free
+ * any HMM resources associated with that range (including dma mapping if any).
+ *
+ * THIS FUNCTION ASSUME THAT DRIVER ALREADY STOPPED USING THE RANGE OF ADDRESS
+ * AND THUS DO NOT PERFORM ANY SYNCHRONIZATION OR UPDATE WITH THE DRIVER TO
+ * INVALIDATE SAID RANGE.
+ */
+void hmm_mirror_range_discard(struct hmm_mirror *mirror,
+			      unsigned long start,
+			      unsigned long end)
+{
+	struct hmm_event event;
+
+	hmm_event_init(&event, mirror->hmm, start, end, HMM_MUNMAP);
+	hmm_mirror_update_pt(mirror, &event, NULL);
+}
+EXPORT_SYMBOL(hmm_mirror_range_discard);
+
 /* hmm_mirror_register() - register mirror against current process for a device.
  *
  * @mirror: The mirror struct being registered.
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 12/15] HMM: add dirty range helper (toggle dirty bit inside mirror page table) v2.
  2015-08-13 19:15 HMM (Heterogeneous Memory Management) v10 Jérôme Glisse
                   ` (10 preceding siblings ...)
  2015-08-13 19:15 ` [PATCH 11/15] HMM: add discard range helper (to clear and free resources for a range) Jérôme Glisse
@ 2015-08-13 19:15 ` Jérôme Glisse
  2015-08-13 19:15 ` [PATCH 13/15] HMM: DMA map memory on behalf of device driver v2 Jérôme Glisse
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Jérôme Glisse @ 2015-08-13 19:15 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

Device driver must properly toggle the dirty inside the mirror page table
so dirtyness is properly accounted when core mm code needs to know. Provide
a simple helper to toggle that bit for a range of address.

Changed since v1:
  - Adapt to HMM page table changes.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 include/linux/hmm.h |  3 +++
 mm/hmm.c            | 38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 10e1558..4bc132a 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -268,6 +268,9 @@ int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event);
 void hmm_mirror_range_discard(struct hmm_mirror *mirror,
 			      unsigned long start,
 			      unsigned long end);
+void hmm_mirror_range_dirty(struct hmm_mirror *mirror,
+			    unsigned long start,
+			    unsigned long end);
 
 
 #endif /* CONFIG_HMM */
diff --git a/mm/hmm.c b/mm/hmm.c
index f271441..2c6530e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -920,6 +920,44 @@ void hmm_mirror_range_discard(struct hmm_mirror *mirror,
 }
 EXPORT_SYMBOL(hmm_mirror_range_discard);
 
+/* hmm_mirror_range_dirty() - toggle dirty bit for a range of address.
+ *
+ * @mirror: The mirror struct.
+ * @start: Start address of the range to discard (inclusive).
+ * @end: End address of the range to discard (exclusive).
+ *
+ * Call when device driver want to toggle the dirty bit for a range of address.
+ * Useful when the device driver just want to toggle the bit for whole range
+ * without walking the mirror page table itself.
+ *
+ * Note this function does not directly dirty the page behind an address, but
+ * this will happen once address is invalidated or discard by device driver or
+ * core mm code.
+ */
+void hmm_mirror_range_dirty(struct hmm_mirror *mirror,
+			    unsigned long start,
+			    unsigned long end)
+{
+	struct hmm_pt_iter iter;
+	unsigned long addr;
+
+	hmm_pt_iter_init(&iter, &mirror->pt);
+	for (addr = start; addr != end;) {
+		unsigned long next = end;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);
+		for (; hmm_pte && addr != next; hmm_pte++, addr += PAGE_SIZE) {
+			if (!hmm_pte_test_valid_pfn(hmm_pte) ||
+			    !hmm_pte_test_write(hmm_pte))
+				continue;
+			hmm_pte_set_dirty(hmm_pte);
+		}
+	}
+	hmm_pt_iter_fini(&iter);
+}
+EXPORT_SYMBOL(hmm_mirror_range_dirty);
+
 /* hmm_mirror_register() - register mirror against current process for a device.
  *
  * @mirror: The mirror struct being registered.
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 13/15] HMM: DMA map memory on behalf of device driver v2.
  2015-08-13 19:15 HMM (Heterogeneous Memory Management) v10 Jérôme Glisse
                   ` (11 preceding siblings ...)
  2015-08-13 19:15 ` [PATCH 12/15] HMM: add dirty range helper (toggle dirty bit inside mirror page table) v2 Jérôme Glisse
@ 2015-08-13 19:15 ` Jérôme Glisse
  2015-08-13 19:15 ` [PATCH 14/15] HMM: add documentation explaining HMM internals and how to use it Jérôme Glisse
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Jérôme Glisse @ 2015-08-13 19:15 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

Do the DMA mapping on behalf of the device as HMM is a good place
to perform this common task. Moreover in the future we hope to
add new infrastructure that would make DMA mapping more efficient
(lower overhead per page) by leveraging HMM data structure.

Changed since v1:
  - Adapt to HMM page table changes.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 include/linux/hmm_pt.h |  11 +++
 mm/hmm.c               | 200 +++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 173 insertions(+), 38 deletions(-)

diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
index 4a8beb1..8a59a75 100644
--- a/include/linux/hmm_pt.h
+++ b/include/linux/hmm_pt.h
@@ -176,6 +176,17 @@ static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
 	return (pfn << PAGE_SHIFT) | (1 << HMM_PTE_VALID_PFN_BIT);
 }
 
+static inline dma_addr_t hmm_pte_from_dma_addr(dma_addr_t dma_addr)
+{
+	return (dma_addr & HMM_PTE_DMA_MASK) | (1 << HMM_PTE_VALID_DMA_BIT);
+}
+
+static inline dma_addr_t hmm_pte_dma_addr(dma_addr_t pte)
+{
+	/* FIXME Use max dma addr instead of 0 ? */
+	return hmm_pte_test_valid_dma(&pte) ? (pte & HMM_PTE_DMA_MASK) : 0;
+}
+
 static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
 {
 	return hmm_pte_test_valid_pfn(&pte) ? pte >> PAGE_SHIFT : 0;
diff --git a/mm/hmm.c b/mm/hmm.c
index 2c6530e..e5d5f29 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -41,6 +41,7 @@
 #include <linux/mman.h>
 #include <linux/delay.h>
 #include <linux/workqueue.h>
+#include <linux/dma-mapping.h>
 
 #include "internal.h"
 
@@ -568,6 +569,46 @@ static inline int hmm_mirror_update(struct hmm_mirror *mirror,
 	return ret;
 }
 
+static void hmm_mirror_update_pte(struct hmm_mirror *mirror,
+				  struct hmm_event *event,
+				  struct hmm_pt_iter *iter,
+				  struct mm_pt_iter *mm_iter,
+				  struct page *page,
+				  dma_addr_t *hmm_pte,
+				  unsigned long addr)
+{
+	bool dirty = hmm_pte_test_and_clear_dirty(hmm_pte);
+
+	if (hmm_pte_test_valid_pfn(hmm_pte)) {
+		*hmm_pte &= event->pte_mask;
+		if (!hmm_pte_test_valid_pfn(hmm_pte))
+			hmm_pt_iter_directory_unref(iter);
+		goto out;
+	}
+
+	if (!hmm_pte_test_valid_dma(hmm_pte))
+		return;
+
+	if (!hmm_pte_test_valid_dma(&event->pte_mask)) {
+		struct device *dev = mirror->device->dev;
+		dma_addr_t dma_addr;
+
+		dma_addr = hmm_pte_dma_addr(*hmm_pte);
+		dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+	}
+
+	*hmm_pte &= event->pte_mask;
+	if (!hmm_pte_test_valid_dma(hmm_pte))
+		hmm_pt_iter_directory_unref(iter);
+
+out:
+	if (dirty) {
+		page = page ? : mm_pt_iter_page(mm_iter, addr);
+		if (page)
+			set_page_dirty(page);
+	}
+}
+
 static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 				 struct hmm_event *event,
 				 struct page *page)
@@ -594,19 +635,9 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 		 */
 		hmm_pt_iter_directory_lock(&iter);
 		do {
-			if (!hmm_pte_test_valid_pfn(hmm_pte))
-				continue;
-			if (hmm_pte_test_and_clear_dirty(hmm_pte) &&
-			    hmm_pte_test_write(hmm_pte)) {
-				page = page ? : mm_pt_iter_page(&mm_iter, addr);
-				if (page)
-					set_page_dirty(page);
-				page = NULL;
-			}
-			*hmm_pte &= event->pte_mask;
-			if (hmm_pte_test_valid_pfn(hmm_pte))
-				continue;
-			hmm_pt_iter_directory_unref(&iter);
+			hmm_mirror_update_pte(mirror, event, &iter, &mm_iter,
+					      page, hmm_pte, addr);
+			page = NULL;
 		} while (addr += PAGE_SIZE, hmm_pte++, addr != next);
 		hmm_pt_iter_directory_unlock(&iter);
 	}
@@ -681,6 +712,9 @@ static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
 		 */
 		hmm_pt_iter_directory_lock(iter);
 		do {
+			if (hmm_pte_test_valid_dma(&hmm_pte[i]))
+				continue;
+
 			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
 				hmm_pte[i] = hmm_pte_from_pfn(pfn);
 				hmm_pt_iter_directory_ref(iter);
@@ -746,6 +780,9 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
 				break;
 			}
 
+			if (hmm_pte_test_valid_dma(&hmm_pte[i]))
+				continue;
+
 			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
 				hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(*ptep));
 				hmm_pt_iter_directory_ref(iter);
@@ -762,6 +799,80 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
 	return ret;
 }
 
+static int hmm_mirror_dma_map(struct hmm_mirror *mirror,
+			      struct hmm_pt_iter *iter,
+			      unsigned long start,
+			      unsigned long end)
+{
+	struct device *dev = mirror->device->dev;
+	unsigned long addr;
+	int ret;
+
+	for (ret = 0, addr = start; !ret && addr < end;) {
+		unsigned long i = 0, next = end;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+		if (!hmm_pte)
+			return -ENOENT;
+
+		do {
+			dma_addr_t dma_addr, pte;
+			struct page *page;
+
+again:
+			pte = ACCESS_ONCE(hmm_pte[i]);
+			if (!hmm_pte_test_valid_pfn(&pte)) {
+				if (!hmm_pte_test_valid_dma(&pte)) {
+					ret = -ENOENT;
+					break;
+				}
+				continue;
+			}
+
+			page = pfn_to_page(hmm_pte_pfn(pte));
+			VM_BUG_ON(!page);
+			dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE,
+						DMA_BIDIRECTIONAL);
+			if (dma_mapping_error(dev, dma_addr)) {
+				ret = -ENOMEM;
+				break;
+			}
+
+			hmm_pt_iter_directory_lock(iter);
+			/*
+			 * Make sure we transfer the dirty bit. Note that there
+			 * might still be a window for another thread to set
+			 * the dirty bit before we check for pte equality. This
+			 * will just lead to a useless retry so it is not the
+			 * end of the world here.
+			 */
+			if (hmm_pte_test_dirty(&hmm_pte[i]))
+				hmm_pte_set_dirty(&pte);
+			if (ACCESS_ONCE(hmm_pte[i]) != pte) {
+				hmm_pt_iter_directory_unlock(iter);
+				dma_unmap_page(dev, dma_addr, PAGE_SIZE,
+					       DMA_BIDIRECTIONAL);
+				if (hmm_pte_test_valid_pfn(&pte))
+					goto again;
+				if (!hmm_pte_test_valid_dma(&pte)) {
+					ret = -ENOENT;
+					break;
+				}
+			} else {
+				hmm_pte[i] = hmm_pte_from_dma_addr(dma_addr);
+				if (hmm_pte_test_write(&pte))
+					hmm_pte_set_write(&hmm_pte[i]);
+				if (hmm_pte_test_dirty(&pte))
+					hmm_pte_set_dirty(&hmm_pte[i]);
+				hmm_pt_iter_directory_unlock(iter);
+			}
+		} while (addr += PAGE_SIZE, i++, addr != next && !ret);
+	}
+
+	return ret;
+}
+
 static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
 				   struct hmm_event *event,
 				   struct vm_area_struct *vma,
@@ -770,7 +881,7 @@ static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
 	struct hmm_mirror_fault mirror_fault;
 	unsigned long addr = event->start;
 	struct mm_walk walk = {0};
-	int ret = 0;
+	int ret;
 
 	if ((event->etype == HMM_DEVICE_WFAULT) && !(vma->vm_flags & VM_WRITE))
 		return -EACCES;
@@ -779,32 +890,44 @@ static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
 	if (ret)
 		return ret;
 
-again:
-	if (event->backoff) {
-		ret = -EAGAIN;
-		goto out;
-	}
-	if (addr >= event->end)
-		goto out;
+	do {
+		if (event->backoff) {
+			ret = -EAGAIN;
+			break;
+		}
+		if (addr >= event->end)
+			break;
+
+		mirror_fault.event = event;
+		mirror_fault.mirror = mirror;
+		mirror_fault.vma = vma;
+		mirror_fault.addr = addr;
+		mirror_fault.iter = iter;
+		walk.mm = mirror->hmm->mm;
+		walk.private = &mirror_fault;
+		walk.pmd_entry = hmm_mirror_fault_pmd;
+		ret = walk_page_range(addr, event->end, &walk);
+		if (ret)
+			break;
+
+		if (event->backoff) {
+			ret = -EAGAIN;
+			break;
+		}
 
-	mirror_fault.event = event;
-	mirror_fault.mirror = mirror;
-	mirror_fault.vma = vma;
-	mirror_fault.addr = addr;
-	mirror_fault.iter = iter;
-	walk.mm = mirror->hmm->mm;
-	walk.private = &mirror_fault;
-	walk.pmd_entry = hmm_mirror_fault_pmd;
-	ret = walk_page_range(addr, event->end, &walk);
-	if (!ret) {
-		ret = mirror->device->ops->update(mirror, event);
-		if (!ret) {
-			addr = mirror_fault.addr;
-			goto again;
+		if (mirror->device->dev) {
+			ret = hmm_mirror_dma_map(mirror, iter,
+						 addr, event->end);
+			if (ret)
+				break;
 		}
-	}
 
-out:
+		ret = mirror->device->ops->update(mirror, event);
+		if (ret)
+			break;
+		addr = mirror_fault.addr;
+	} while (1);
+
 	hmm_device_fault_end(mirror->hmm, event);
 	if (ret == -ENOENT) {
 		ret = hmm_mm_fault(mirror->hmm, event, vma, addr);
@@ -948,7 +1071,8 @@ void hmm_mirror_range_dirty(struct hmm_mirror *mirror,
 
 		hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);
 		for (; hmm_pte && addr != next; hmm_pte++, addr += PAGE_SIZE) {
-			if (!hmm_pte_test_valid_pfn(hmm_pte) ||
+			if ((!hmm_pte_test_valid_pfn(hmm_pte) &&
+			     !hmm_pte_test_valid_dma(hmm_pte)) ||
 			    !hmm_pte_test_write(hmm_pte))
 				continue;
 			hmm_pte_set_dirty(hmm_pte);
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 14/15] HMM: add documentation explaining HMM internals and how to use it.
  2015-08-13 19:15 HMM (Heterogeneous Memory Management) v10 Jérôme Glisse
                   ` (12 preceding siblings ...)
  2015-08-13 19:15 ` [PATCH 13/15] HMM: DMA map memory on behalf of device driver v2 Jérôme Glisse
@ 2015-08-13 19:15 ` Jérôme Glisse
  2015-08-13 19:15 ` [PATCH 15/15] hmm/dummy: dummy driver for testing and showcasing the HMM API Jérôme Glisse
  2015-09-23 10:21 ` HMM (Heterogeneous Memory Management) v10 Pavel Machek
  15 siblings, 0 replies; 20+ messages in thread
From: Jérôme Glisse @ 2015-08-13 19:15 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

This add documentation with a high level overview of how HMM works
and a more in depth view of how it should be use by device driver
writers.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 Documentation/vm/hmm.txt | 219 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 219 insertions(+)
 create mode 100644 Documentation/vm/hmm.txt

diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
new file mode 100644
index 0000000..febed50
--- /dev/null
+++ b/Documentation/vm/hmm.txt
@@ -0,0 +1,219 @@
+Heterogeneous Memory Management (HMM)
+-------------------------------------
+
+The raison d'etre of HMM is to provide a common API for device driver that
+wants to mirror a process address space on there device and/or migrate system
+memory to device memory. Device driver can decide to only use one aspect of
+HMM (mirroring or memory migration), for instance some device can directly
+access process address space through hardware (for instance PCIe ATS/PASID),
+but still want to benefit from memory migration capabilities that HMM offer.
+
+While HMM rely on existing kernel infrastructure (namely mmu_notifier) some
+of its features (memory migration, atomic access) require integration with
+core mm kernel code. Having HMM as the common intermediary is more appealing
+than having each device driver hooking itself inside the common mm code.
+
+Moreover HMM as a layer allows integration with DMA API or page reclaimation.
+
+
+Mirroring address space on the device:
+--------------------------------------
+
+Device that can't directly access transparently the process address space, need
+to mirror the CPU page table into there own page table. HMM helps to keep the
+device page table synchronize with the CPU page table. It is not expected that
+the device will fully mirror the CPU page table but only mirror region that are
+actively accessed by the device. For that reasons HMM only helps populating and
+synchronizing device page table for range that the device driver explicitly ask
+for.
+
+Mirroring address space inside the device page table is easy with HMM :
+
+  /* Create a mirror for the current process for your device. */
+  your_hmm_mirror->hmm_mirror.device = your_hmm_device;
+  hmm_mirror_register(&your_hmm_mirror->hmm_mirror);
+
+  ...
+
+  /* Mirror memory (in read mode) between addressA and addressB */
+  your_hmm_event->hmm_event.start = addressA;
+  your_hmm_event->hmm_event.end = addressB;
+  your_hmm_event->hmm_event.etype = HMM_DEVICE_RFAULT;
+  hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
+    /* HMM callback into your driver with the >update() callback. During the
+     * callback use the HMM page table to populate the device page table. You
+     * can only use the HMM page table to populate the device page table for
+     * the specified range during the >update() callback, at any other point in
+     * time the HMM page table content should be assume to be undefined.
+     */
+    your_hmm_device->update(mirror, event);
+
+  ...
+
+  /* Process is quiting or device done stop the mirroring and cleanup. */
+  hmm_mirror_unregister(&your_hmm_mirror->hmm_mirror);
+  /* Device driver can free your_hmm_mirror */
+
+
+HMM mirror page table:
+----------------------
+
+Each hmm_mirror object is associated with a mirror page table that HMM keeps
+synchronize with the CPU page table by using the mmu_notifier API. HMM is using
+its own generic page table format because it needs to store DMA address, which
+are bigger than long on some architecture, and have more flags per entry than
+radix tree allows.
+
+The HMM page table mostly mirror x86 page table layout. A page holds a global
+directory and each entry points to a lower level directory. Unlike regular CPU
+page table, directory level are more aggressively freed and remove from the HMM
+mirror page table. This means device driver needs to use the HMM helpers and to
+follow directive on when and how to access the mirror page table. HMM use the
+per page spinlock of directory page to synchronize update of directory ie update
+can happen on different directory concurently.
+
+As a rules the mirror page table can only be accessed by device driver from one
+of the HMM device callback. Any access from outside a callback is illegal and
+gives undertimed result.
+
+Accessing the mirror page table from a device callback needs to use the HMM
+page table helpers. A loop to access entry for a range of address looks like :
+
+  /* Initialize a HMM page table iterator. */
+  struct hmm_pt_iter iter;
+  hmm_pt_iter_init(&iter, &mirror->pt)
+
+  /* Get pointer to HMM page table entry for a given address. */
+  dma_addr_t *hmm_pte;
+  hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);
+
+If there is no valid entry directory for given range address then hmm_pte is
+NULL. If there is a valid entry directory then you can access the hmm_pte and
+the pointer will stay valid as long as you do not call hmm_pt_iter_walk() with
+the same iter struct for a different address or call hmm_pt_iter_fini().
+
+While the HMM page table entry pointer stays valid you can only modify the
+value it is pointing to by using one of HMM helpers (hmm_pte_*()) as other
+threads might be updating the same entry concurrently. The device driver only
+need to update an HMM page table entry to set the dirty bit, so driver should
+only be using hmm_pte_set_dirty().
+
+Similarly to extract information the device driver should use one of the helper
+like hmm_pte_dma_addr() or hmm_pte_pfn() (if HMM is not doing DMA mapping which
+is a device driver at initialization parameter).
+
+
+Migrating system memory to device memory:
+-----------------------------------------
+
+Device like discret GPU often have there own local memory which offer bigger
+bandwidth and smaller latency than access to system memory for the GPU. This
+local memory is not necessarily accessible by the CPU. Device local memory will
+remain revealent for the foreseeable future as bandwidth of GPU memory keep
+increasing faster than bandwidth of system memory and as latency of PCIe does
+not decrease.
+
+Thus to maximize use of device like GPU, program need to use the device memory.
+Userspace API wants to make this as transparent as it can be, so that there is
+no need for complex modification of applications.
+
+Transparent use of device memory for range of address of a process require core
+mm code modifications. Adding a new memory zone for devices memory did not make
+sense given that such memory is often only accessible by the device only. This
+is why we decided to use a special kind of swap, migrated memory is mark as a
+special swap entry inside the CPU page table.
+
+While HMM handles the migration process, it does not decide what range or when
+to migrate memory. The decision to perform such migration is under the control
+of the device driver. Migration back to system memory happens either because
+the CPU try to access the memory or because device driver decided to migrate
+the memory back.
+
+
+  /* Migrate system memory between addressA and addressB to device memory. */
+  your_hmm_event->hmm_event.start = addressA;
+  your_hmm_event->hmm_event.end = addressB;
+  your_hmm_event->hmm_event.etype = HMM_COPY_TO_DEVICE;
+  hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
+    /* HMM callback into your driver with the >copy_to_device() callback.
+     * Device driver must allocate device memory, DMA system memory to device
+     * memory, update the device page table to point to device memory and
+     * return. See hmm.h for details instructions and how failure are handled.
+     */
+    your_hmm_device->copy_to_device(mirror, event, dst, addressA, addressB);
+
+
+Right now HMM only support migrating anonymous private memory. Migration of
+share memory and more generaly file mapped memory is on the road map.
+
+
+Locking consideration and overall design:
+-----------------------------------------
+
+As a rule HMM will handle proper locking on the behalf of the device driver,
+as such device driver does not need to take any mm lock before calling into
+the HMM code.
+
+HMM is also responsible of the hmm_device and hmm_mirror object lifetime. The
+device driver can only free those after calling hmm_device_unregister() or
+hmm_mirror_unregister() respectively.
+
+All the lock inside any of the HMM structure should never be use by the device
+driver. They are intended to be use only and only by HMM code. Below is short
+description of the 3 main locks that exist for HMM internal use. Educational
+purpose only.
+
+Each process mm has one and only one struct hmm associated with it. Each hmm
+struct can be use by several different mirror. There is one and only one mirror
+per mm and device pair. So in essence the hmm struct is the core that dispatch
+everything to every single mirror, each of them corresponding to a specific
+device. The list of mirror for an hmm struct is protected by a semaphore as it
+sees mostly read access.
+
+Each time a device fault a range of address it calls hmm_mirror_fault(), HMM
+keeps track, inside the hmm struct, of each range currently being faulted. It
+does that so it can synchronize with any CPU page table update. If there is a
+CPU page table update then a callback through mmu_notifier will happen and HMM
+will try to interrupt the device page fault that conflict (ie address range
+overlap with the range being updated) and wait for them to back off. This
+insure that at no point in time the device driver see transient page table
+information. The list of active fault is protected by a spinlock, query on
+that list should be short and quick (we haven't gather enough statistic on
+that side yet to have a good idea of the average access pattern).
+
+Each device driver wanting to use HMM must register one and only one hmm_device
+struct per physical device with HMM. The hmm_device struct have pointer to the
+device driver call back and keeps track of active mirrors for a given device.
+The active mirrors list is protected by a spinlock.
+
+
+Future work:
+------------
+
+Improved atomic access by the device to system memory. Some platform bus (PCIe)
+offer limited number of atomic memory operations, some platform do not even
+have any kind of atomic memory operations by a device. In order to allow such
+atomic operation we want to map page read only the CPU while the device perform
+its operation. For this we need a new case inside the CPU write fault code path
+to synchronize with the device.
+
+We want to allow program to lock a range of memory inside device memory and
+forbid CPU access while the memory is lock inside the device. Any CPU access
+to locked range would result in SIGBUS. We think that madvise() would be the
+right syscall into which we could plug that feature.
+
+In order to minimize kernel memory consumption and overhead of DMA mapping, we
+want to introduce new DMA API that allows to manage mapping on IOMMU directory
+page basis. This would allow to map/unmap/update DMA mapping in bulk and
+minimize IOMMU update and flushing overhead. Moreover this would allow to
+improve IOMMU bad access reporting for DMA address inside those directory.
+
+Because update to the device page table might require "heavy" synchronization
+with the device, the mmu_notifier callback might have to sleep while HMM is
+waiting for the device driver to report device page table update completion.
+This is especialy bad if this happens during page reclaimation, this might
+bring the system to pause. We want to mitigate this, either by maintaining a
+new intermediate lru level in which we put pages actively mirrored by a device
+or by some other mecanism. For time being we advice that device driver that
+use HMM explicitly explain this corner case so that user are aware that this
+can happens if there is memory pressure.
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 15/15] hmm/dummy: dummy driver for testing and showcasing the HMM API
  2015-08-13 19:15 HMM (Heterogeneous Memory Management) v10 Jérôme Glisse
                   ` (13 preceding siblings ...)
  2015-08-13 19:15 ` [PATCH 14/15] HMM: add documentation explaining HMM internals and how to use it Jérôme Glisse
@ 2015-08-13 19:15 ` Jérôme Glisse
  2015-09-23 10:21 ` HMM (Heterogeneous Memory Management) v10 Pavel Machek
  15 siblings, 0 replies; 20+ messages in thread
From: Jérôme Glisse @ 2015-08-13 19:15 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

This is a dummy driver which full fill two purposes :
  - showcase the HMM API and gives references on how to use it.
  - provide an extensive user space API to stress test HMM.

This is a particularly dangerous module as it allow to access a
mirror of a process address space through its device file. Hence
it should not be enabled by default and only people actively
developing for hmm should use it.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 drivers/char/Kconfig           |   9 +
 drivers/char/Makefile          |   1 +
 drivers/char/hmm_dummy.c       | 923 +++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/hmm_dummy.h |  51 +++
 4 files changed, 984 insertions(+)
 create mode 100644 drivers/char/hmm_dummy.c
 create mode 100644 include/uapi/linux/hmm_dummy.h

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index a043107..b19c2ac 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -601,6 +601,15 @@ config TILE_SROM
 	  device appear much like a simple EEPROM, and knows
 	  how to partition a single ROM for multiple purposes.
 
+config HMM_DUMMY
+	tristate "hmm dummy driver to test hmm."
+	depends on HMM
+	default n
+	help
+	  Say Y here if you want to build the hmm dummy driver that allow you
+	  to test the hmm infrastructure by mapping a process address space
+	  in hmm dummy driver device file. When in doubt, say "N".
+
 source "drivers/char/xillybus/Kconfig"
 
 endmenu
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index d8a7579..3531f92 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -60,3 +60,4 @@ js-rtc-y = rtc.o
 
 obj-$(CONFIG_TILE_SROM)		+= tile-srom.o
 obj-$(CONFIG_XILLYBUS)		+= xillybus/
+obj-$(CONFIG_HMM_DUMMY)		+= hmm_dummy.o
diff --git a/drivers/char/hmm_dummy.c b/drivers/char/hmm_dummy.c
new file mode 100644
index 0000000..52843cb
--- /dev/null
+++ b/drivers/char/hmm_dummy.c
@@ -0,0 +1,923 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: JA(C)rA'me Glisse <jglisse@redhat.com>
+ */
+/*
+ * This is a dummy driver to exercice the HMM (heterogeneous memory management)
+ * API of the kernel. It allow an userspace program to map its whole address
+ * space through the hmm dummy driver file.
+ *
+ * In some way it can also serve as an example driver for people wanting to use
+ * HMM inside there device driver.
+ */
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/cdev.h>
+#include <linux/device.h>
+#include <linux/mutex.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/highmem.h>
+#include <linux/delay.h>
+#include <linux/hmm.h>
+
+#include <uapi/linux/hmm_dummy.h>
+
+#define HMM_DUMMY_DEVICE_NAME "hmm_dummy_device"
+#define HMM_DUMMY_MAX_DEVICES 4
+#define HMM_DUMMY_MAX_MIRRORS 4
+
+struct dummy_device;
+
+struct dummy_mirror {
+	struct file		*filp;
+	unsigned		minor;
+	pid_t			pid;
+	struct dummy_device	*ddevice;
+	struct hmm_mirror	mirror;
+	struct hmm_pt		pt;
+	struct list_head	events;
+	spinlock_t		lock;
+	wait_queue_head_t	wait_queue;
+	unsigned		naccess;
+	atomic_t		nworkers;
+	bool			dead;
+};
+
+struct dummy_device {
+	struct cdev		cdevice;
+	struct hmm_device	hdevice;
+	dev_t			dev;
+	int			major;
+	struct mutex		mutex;
+	char			name[32];
+	/* device file mapping tracking (keep track of all vma) */
+	struct dummy_mirror	*dmirrors[HMM_DUMMY_MAX_MIRRORS];
+	struct address_space	*fmapping[HMM_DUMMY_MAX_MIRRORS];
+};
+
+struct dummy_event {
+	struct hmm_event	hevent;
+	struct list_head	list;
+	uint64_t		nsys_pages;
+	uint64_t		nfaulted_sys_pages;
+	bool			backoff;
+};
+
+static struct dummy_device ddevices[HMM_DUMMY_MAX_DEVICES];
+
+
+static void dummy_mirror_release(struct hmm_mirror *mirror)
+{
+	struct dummy_mirror *dmirror;
+	struct dummy_device *ddevice;
+
+	dmirror = container_of(mirror, struct dummy_mirror, mirror);
+	ddevice = dmirror->ddevice;
+	dmirror->dead = true;
+}
+
+static void dummy_mirror_free(struct hmm_mirror *mirror)
+{
+	struct dummy_mirror *dmirror;
+
+	dmirror = container_of(mirror, struct dummy_mirror, mirror);
+	kfree(dmirror);
+}
+
+static void dummy_mirror_access_wait(struct dummy_mirror *dmirror,
+				     const struct hmm_event *event)
+{
+	struct dummy_event *devent;
+
+again:
+	spin_lock(&dmirror->lock);
+	list_for_each_entry(devent, &dmirror->events, list) {
+		if (hmm_event_overlap(event, &devent->hevent)) {
+			unsigned tmp = dmirror->naccess;
+
+			devent->backoff = true;
+			spin_unlock(&dmirror->lock);
+			wait_event(dmirror->wait_queue,
+				   dmirror->naccess != tmp);
+			goto again;
+		}
+	}
+	spin_unlock(&dmirror->lock);
+}
+
+static void dummy_mirror_access_start(struct dummy_mirror *dmirror,
+				      struct dummy_event *devent)
+{
+	spin_lock(&dmirror->lock);
+	list_add_tail(&devent->list, &dmirror->events);
+	dmirror->naccess++;
+	spin_unlock(&dmirror->lock);
+}
+
+static void dummy_mirror_access_stop(struct dummy_mirror *dmirror,
+				     struct dummy_event *devent)
+{
+	spin_lock(&dmirror->lock);
+	list_del_init(&devent->list);
+	dmirror->naccess--;
+	spin_unlock(&dmirror->lock);
+	wake_up(&dmirror->wait_queue);
+}
+
+
+/*
+ * The various HMM callback are the core of HMM API, the device driver gets all
+ * its information through thus callbacks. For the dummy driver we simply use a
+ * page table to store the page frame number backing address the dummy mirror
+ * user wants to access.
+ *
+ * A real device driver would schedule update to the mirror's device page table
+ * and would synchronize with the device to wait for the update to go through.
+ */
+static int dummy_mirror_pt_populate(struct hmm_mirror *mirror,
+				    struct hmm_event *event)
+{
+	unsigned long addr = event->start;
+	struct hmm_pt_iter miter, diter;
+	struct dummy_mirror *dmirror;
+	struct dummy_event *devent;
+	int ret = 0;
+
+	dmirror = container_of(mirror, struct dummy_mirror, mirror);
+	devent = container_of(event, struct dummy_event, hevent);
+
+	hmm_pt_iter_init(&diter, &dmirror->pt);
+	hmm_pt_iter_init(&miter, &mirror->pt);
+
+	do {
+		unsigned long next = event->end;
+		dma_addr_t *mpte, *dpte;
+
+		dpte = hmm_pt_iter_populate(&diter, addr, &next);
+		if (!dpte) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		mpte = hmm_pt_iter_lookup(&miter, addr, &next);
+		/*
+		 * Sanity check, this is only important for debugging HMM, a
+		 * device driver can ignore those test and assume mpte is not
+		 * NULL as NULL would be a serious HMM bug.
+		 */
+		if (!mpte || !hmm_pte_test_valid_pfn(mpte)) {
+			pr_debug("(%s:%4d) (HMM FATAL) empty pt at 0x%lX\n",
+				 __FILE__, __LINE__, addr);
+			ret = -ENOENT;
+			break;
+		}
+		/*
+		 * Sanity check, this is only important for debugging HMM, a
+		 * device driver can ignore this write test permission.
+		 */
+		if (event->etype == HMM_DEVICE_WFAULT &&
+		    !hmm_pte_test_write(mpte)) {
+			pr_debug("(%s:%4d) (HMM FATAL) RO instead of RW (%pad) at 0x%lX\n",
+				 __FILE__, __LINE__, mpte, addr);
+			ret = -EACCES;
+			break;
+		}
+
+		/*
+		 * This is bit inefficient to lock directoy per entry instead
+		 * of locking directory and going over all its entry. But this
+		 * is a dummy driver and we do not care about efficiency here.
+		 */
+		hmm_pt_iter_directory_lock(&diter);
+		/*
+		 * Simply copy entry, this is a dmmy device, real device would
+		 * reformat the page table entry for the device format and most
+		 * likely write it to some command buffer that would be send to
+		 * device once fill with the update.
+		 */
+		*dpte = *mpte;
+		/* Also increment ref count of dummy page table directory. */
+		hmm_pt_iter_directory_ref(&diter);
+		hmm_pt_iter_directory_unlock(&diter);
+
+		devent->nfaulted_sys_pages++;
+
+		addr += PAGE_SIZE;
+	} while (addr < event->end);
+	hmm_pt_iter_fini(&diter);
+	hmm_pt_iter_fini(&miter);
+
+	return ret;
+}
+
+static int dummy_mirror_pt_invalidate(struct hmm_mirror *mirror,
+				      struct hmm_event *event)
+{
+	unsigned long addr = event->start;
+	struct hmm_pt_iter miter, diter;
+	struct dummy_mirror *dmirror;
+	int ret = 0;
+
+	dmirror = container_of(mirror, struct dummy_mirror, mirror);
+
+	hmm_pt_iter_init(&diter, &dmirror->pt);
+	hmm_pt_iter_init(&miter, &mirror->pt);
+
+	do {
+		dma_addr_t *mpte, *dpte;
+		unsigned long next = event->end;
+
+		dpte = hmm_pt_iter_lookup(&diter, addr, &next);
+		if (!dpte) {
+			addr = next;
+			continue;
+		}
+
+		mpte = hmm_pt_iter_lookup(&miter, addr, &next);
+
+		/*
+		 * This is bit inefficient to lock directoy per entry instead
+		 * of locking directory and going over all its entry. But this
+		 * is a dummy driver and we do not care about efficiency here.
+		 */
+		hmm_pt_iter_directory_lock(&diter);
+
+		/*
+		 * Just skip this entry if it is not valid inside the dummy
+		 * mirror page table.
+		 */
+		if (!hmm_pte_test_valid_pfn(dpte)) {
+			addr += PAGE_SIZE;
+			hmm_pt_iter_directory_unlock(&diter);
+			continue;
+		}
+
+		/*
+		 * Sanity check, this is only important for debugging HMM, a
+		 * device driver can ignore those test and assume mpte is not
+		 * NULL as NULL would be a serious HMM bug.
+		 */
+		if (!mpte || !hmm_pte_test_valid_pfn(mpte)) {
+			hmm_pt_iter_directory_unlock(&diter);
+			pr_debug("(%s:%4d) (HMM FATAL) empty pt at 0x%lX\n",
+				 __FILE__, __LINE__, addr);
+			ret = -ENOENT;
+			break;
+		}
+
+		/*
+		 * Transfer dirty bit. Real device would schedule update to the
+		 * device page table first and then gather the dirtyness from
+		 * device page table before setting the mirror page table entry
+		 * dirty accordingly.
+		 */
+		if (hmm_pte_test_and_clear_dirty(dpte))
+			hmm_pte_set_dirty(mpte);
+
+		/*
+		 * Clear the dummy mirror page table using event mask as dummy
+		 * page table format is same as mirror page table format.
+		 *
+		 * Reall device driver would schedule device page table update
+		 * inside a command buffer, execute the command buffer and wait
+		 * for completion to make sure device and HMM are in sync.
+		 */
+		*dpte &= event->pte_mask;
+
+		/*
+		 * Also decrement ref count of dummy page table directory if
+		 * necessary. We know here for sure that no one could have race
+		 * us to clear the valid entry bit as dummy mirror directory
+		 * is lock.
+		 */
+		if (!hmm_pte_test_valid_pfn(dpte))
+			hmm_pt_iter_directory_unref(&diter);
+
+		hmm_pt_iter_directory_unlock(&diter);
+
+		addr += PAGE_SIZE;
+	} while (addr < event->end);
+	hmm_pt_iter_fini(&diter);
+	hmm_pt_iter_fini(&miter);
+
+	dummy_mirror_access_wait(dmirror, event);
+
+	return ret;
+}
+
+static int dummy_mirror_update(struct hmm_mirror *mirror,
+			       struct hmm_event *event)
+{
+	switch (event->etype) {
+	case HMM_MIGRATE:
+	case HMM_MUNMAP:
+	case HMM_FORK:
+	case HMM_WRITE_PROTECT:
+		return dummy_mirror_pt_invalidate(mirror, event);
+	case HMM_DEVICE_RFAULT:
+	case HMM_DEVICE_WFAULT:
+		return dummy_mirror_pt_populate(mirror, event);
+	default:
+		pr_debug("(%s:%4d) (DUMMY FATAL) unknown event %d\n",
+			 __FILE__, __LINE__, event->etype);
+		return -EIO;
+	}
+}
+
+static const struct hmm_device_ops hmm_dummy_ops = {
+	.release		= &dummy_mirror_release,
+	.free			= &dummy_mirror_free,
+	.update			= &dummy_mirror_update,
+};
+
+
+/* dummy_mirror_alloc() - allocate and initialize dummy mirror struct.
+ *
+ * @ddevice: The dummy device this mirror is associated with.
+ * @filp: The active device file descriptor this mirror is associated with.
+ * @minor: Minor device number or index into dummy device mirror array.
+ */
+static struct dummy_mirror *dummy_mirror_alloc(struct dummy_device *ddevice,
+					       struct file *filp,
+					       unsigned minor)
+{
+	struct dummy_mirror *dmirror;
+
+	/* Mirror this process address space */
+	dmirror = kzalloc(sizeof(*dmirror), GFP_KERNEL);
+	if (dmirror == NULL)
+		return NULL;
+	dmirror->pt.last = TASK_SIZE - 1;
+	if (hmm_pt_init(&dmirror->pt)) {
+		kfree(dmirror);
+		return NULL;
+	}
+	dmirror->ddevice = ddevice;
+	dmirror->mirror.device = &ddevice->hdevice;
+	dmirror->pid = task_pid_nr(current);
+	dmirror->dead = false;
+	dmirror->minor = minor;
+	dmirror->filp = filp;
+	INIT_LIST_HEAD(&dmirror->events);
+	spin_lock_init(&dmirror->lock);
+	init_waitqueue_head(&dmirror->wait_queue);
+	dmirror->naccess = 0;
+	atomic_set(&dmirror->nworkers, 0);
+	return dmirror;
+}
+
+/* dummy_mirror_fault() - fault an address.
+ *
+ * @dmirror: The dummy mirror against which we want to fault.
+ * @event: The dummy event structure describing range to fault.
+ * @write: Is this a write fault.
+ */
+static int dummy_mirror_fault(struct dummy_mirror *dmirror,
+			      struct dummy_event *event,
+			      bool write)
+{
+	struct hmm_mirror *mirror = &dmirror->mirror;
+	int ret;
+
+	event->hevent.etype = write ? HMM_DEVICE_WFAULT : HMM_DEVICE_RFAULT;
+
+	do {
+		cond_resched();
+
+		ret = hmm_mirror_fault(mirror, &event->hevent);
+	} while (ret == -EBUSY);
+
+	return ret;
+}
+
+/* dummy_mirror_worker_thread_sart() - account for a worker thread.
+ *
+ * @dmirror: The dummy mirror.
+ *
+ * Each time we perform an operation on the dummy mirror (fread, fwrite, ioctl,
+ * ...) we pretend a worker thread start. The worker thread count is use to
+ * keep track of active thread that might access the dummy mirror page table.
+ */
+static void dummy_mirror_worker_thread_start(struct dummy_mirror *dmirror)
+{
+	if (dmirror)
+		atomic_inc(&dmirror->nworkers);
+}
+
+/* dummy_mirror_worker_thread_stop() - cleanup after worker thread.
+ *
+ * @dmirror: The dummy mirror.
+ *
+ * Each time we perform an operation on the dummy mirror (fread, fwrite, ioctl,
+ * ...) we pretend a worker thread start and each time we are done we cleanup
+ * after the thread and this also involve freeing the dummy mirror page table
+ * if the mirror is dead.
+ */
+static void dummy_mirror_worker_thread_stop(struct dummy_mirror *dmirror)
+{
+	if (atomic_dec_and_test(&dmirror->nworkers) && dmirror->dead) {
+		/* Free the page table. */
+		hmm_pt_fini(&dmirror->pt);
+	}
+}
+
+static int dummy_read(struct dummy_mirror *dmirror,
+		      struct dummy_event *devent,
+		      char __user *buf,
+		      size_t size)
+{
+	struct hmm_event *event = &devent->hevent;
+	long r = 0;
+
+	while (!r && size) {
+		struct hmm_pt_iter diter;
+		unsigned long offset;
+
+		offset = event->start - (event->start & PAGE_MASK);
+
+		hmm_pt_iter_init(&diter, &dmirror->pt);
+		for (r = 0; !r && size; offset = 0) {
+			unsigned long count = min(PAGE_SIZE - offset, size);
+			unsigned long next = event->end;
+			dma_addr_t *dptep, dpte;
+			struct page *page;
+			char *ptr;
+
+			cond_resched();
+
+			dptep = hmm_pt_iter_lookup(&diter, event->start, &next);
+			if (!dptep)
+				break;
+
+			/*
+			 * This is inefficient but we do not care. Access is a
+			 * barrier for page table invalidation. All information
+			 * extracted from the page table btw start and stop is
+			 * valid.
+			 *
+			 * Real device driver do not need this. It should be
+			 * part of there device page table update.
+			 */
+			dummy_mirror_access_start(dmirror, devent);
+
+			/*
+			 * Because we allow concurrent invalidation of dummy
+			 * mirror page table we need to make sure we use one
+			 * coherent value for each page table entry.
+			 */
+			dpte = ACCESS_ONCE(*dptep);
+			if (!hmm_pte_test_valid_pfn(&dpte)) {
+				dummy_mirror_access_stop(dmirror, devent);
+				break;
+			}
+
+			devent->nsys_pages++;
+
+			page = pfn_to_page(hmm_pte_pfn(dpte));
+			ptr = kmap(page);
+			r = copy_to_user(buf, ptr + offset, count);
+
+			dummy_mirror_access_stop(dmirror, devent);
+
+			event->start += count;
+			size -= count;
+			buf += count;
+			kunmap(page);
+		}
+		hmm_pt_iter_fini(&diter);
+
+		if (!r && size)
+			r = dummy_mirror_fault(dmirror, devent, false);
+	}
+
+	return r;
+}
+
+static int dummy_write(struct dummy_mirror *dmirror,
+		       struct dummy_event *devent,
+		       char __user *buf,
+		       size_t size)
+{
+	struct hmm_event *event = &devent->hevent;
+	long r = 0;
+
+	while (!r && size) {
+		struct hmm_pt_iter diter;
+		unsigned long offset;
+
+		offset = event->start - (event->start & PAGE_MASK);
+
+		hmm_pt_iter_init(&diter, &dmirror->pt);
+		for (r = 0; !r && size; offset = 0) {
+			unsigned long count = min(PAGE_SIZE - offset, size);
+			unsigned long next = event->end;
+			dma_addr_t *dptep, dpte;
+			struct page *page;
+			char *ptr;
+
+			cond_resched();
+
+			dptep = hmm_pt_iter_lookup(&diter, event->start, &next);
+			if (!dptep)
+				break;
+
+			/*
+			 * This is inefficient but we do not care. Access is a
+			 * barrier for page table invalidation. All information
+			 * extracted from the page table btw start and stop is
+			 * valid.
+			 *
+			 * Real device driver do not need this. It should be
+			 * part of there device page table update.
+			 */
+			dummy_mirror_access_start(dmirror, devent);
+
+			/*
+			 * Because we allow concurrent invalidation of dummy
+			 * mirror page table we need to make sure we use one
+			 * coherent value for each page table entry.
+			 */
+			dpte = ACCESS_ONCE(*dptep);
+			if (!hmm_pte_test_valid_pfn(&dpte) ||
+			    !hmm_pte_test_write(&dpte)) {
+				dummy_mirror_access_stop(dmirror, devent);
+				break;
+			}
+
+			devent->nsys_pages++;
+
+			page = pfn_to_page(hmm_pte_pfn(dpte));
+			ptr = kmap(page);
+			r = copy_from_user(ptr + offset, buf, count);
+
+			dummy_mirror_access_stop(dmirror, devent);
+
+			event->start += count;
+			size -= count;
+			buf += count;
+			kunmap(page);
+		}
+		hmm_pt_iter_fini(&diter);
+
+		if (!r && size)
+			r = dummy_mirror_fault(dmirror, devent, true);
+	}
+
+	return r;
+}
+
+
+/*
+ * Below are the vm operation for the dummy device file. Sadly we can not allow
+ * to use the device file through mmap as there is no way to make a page from
+ * the mirror process without having the core mm assume it is a regular page
+ * and thus perform regular operation on it. Allowing this to happen would not
+ * allow to perform proper sanity check and debugging check on HMM and one of
+ * the purpose of the dummy driver is to provide a device driver through which
+ * HMM can be tested and debugged.
+ */
+static int dummy_mmap_fault(struct vm_area_struct *vma,
+				struct vm_fault *vmf)
+{
+	/* Forbid mmap of the dummy device file, see above for the reasons. */
+	return VM_FAULT_SIGBUS;
+}
+
+static void dummy_mmap_open(struct vm_area_struct *vma)
+{
+	/* nop */
+}
+
+static void dummy_mmap_close(struct vm_area_struct *vma)
+{
+	/* nop */
+}
+
+static const struct vm_operations_struct mmap_mem_ops = {
+	.fault			= dummy_mmap_fault,
+	.open			= dummy_mmap_open,
+	.close			= dummy_mmap_close,
+};
+
+
+/*
+ * Below are the file operation for the dummy device file. Only ioctl matter.
+ *
+ * Note this is highly specific to the dummy device driver and should not be
+ * construed as an example on how to design the API a real device driver would
+ * expose to userspace.
+ *
+ * The dummy_mirror.nworkers field is use to mimic the count of device thread
+ * actively using a mirror.
+ */
+static ssize_t dummy_fops_read(struct file *filp,
+			       char __user *buf,
+			       size_t count,
+			       loff_t *ppos)
+{
+	return -EINVAL;
+}
+
+static ssize_t dummy_fops_write(struct file *filp,
+				const char __user *buf,
+				size_t count,
+				loff_t *ppos)
+{
+	return -EINVAL;
+}
+
+static int dummy_fops_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	/*
+	 * Forbid mmap of the dummy device file, see comment preceding the vm
+	 * operation functions.
+	 */
+	return -EINVAL;
+}
+
+static int dummy_fops_open(struct inode *inode, struct file *filp)
+{
+	struct cdev *cdev = inode->i_cdev;
+	const int minor = iminor(inode);
+	struct dummy_device *ddevice;
+
+	/* No exclusive opens. */
+	if (filp->f_flags & O_EXCL)
+		return -EINVAL;
+
+	ddevice = container_of(cdev, struct dummy_device, cdevice);
+	filp->private_data = ddevice;
+	ddevice->fmapping[minor] = &inode->i_data;
+
+	return 0;
+}
+
+static int dummy_fops_release(struct inode *inode, struct file *filp)
+{
+	struct cdev *cdev = inode->i_cdev;
+	const int minor = iminor(inode);
+	struct dummy_device *ddevice;
+	struct dummy_mirror *dmirror;
+
+	ddevice = container_of(cdev, struct dummy_device, cdevice);
+	mutex_lock(&ddevice->mutex);
+	dmirror = ddevice->dmirrors[minor];
+	ddevice->dmirrors[minor] = NULL;
+	mutex_unlock(&ddevice->mutex);
+
+	/* Nothing to do if no active mirror. */
+	if (!dmirror)
+		return 0;
+
+	/*
+	 * Unregister the mirror this will also drop the reference and lead to
+	 * dummy mirror struct being free through the HMM free() callback once
+	 * all thread holding a reference on the mirror drop it.
+	 */
+	hmm_mirror_unregister(&dmirror->mirror);
+	return 0;
+}
+
+static long dummy_fops_unlocked_ioctl(struct file *filp,
+				      unsigned int command,
+				      unsigned long arg)
+{
+	void __user *uarg = (void __user *)arg;
+	struct dummy_device *ddevice;
+	struct dummy_mirror *dmirror;
+	struct hmm_dummy_write dwrite;
+	struct hmm_dummy_read dread;
+	struct dummy_event devent;
+	unsigned minor;
+	int ret;
+
+	minor = iminor(file_inode(filp));
+	ddevice = filp->private_data;
+
+	mutex_lock(&ddevice->mutex);
+	dmirror = ddevice->dmirrors[minor];
+	if (dmirror)
+		dummy_mirror_worker_thread_start(dmirror);
+	mutex_unlock(&ddevice->mutex);
+
+	switch (command) {
+	case HMM_DUMMY_EXPOSE_MM:
+		if (dmirror) {
+			dummy_mirror_worker_thread_stop(dmirror);
+			return -EBUSY;
+		}
+
+		/* Allocate a new dummy mirror. */
+		dmirror = dummy_mirror_alloc(ddevice, filp, minor);
+		if (!dmirror)
+			return -ENOMEM;
+		dummy_mirror_worker_thread_start(dmirror);
+
+		/* Register the current process mm as being mirrored. */
+		ret = hmm_mirror_register(&dmirror->mirror);
+		if (ret) {
+			dmirror->dead = true;
+			dummy_mirror_worker_thread_stop(dmirror);
+			dummy_mirror_free(&dmirror->mirror);
+			return ret;
+		}
+
+		/*
+		 * Now we can expose the dummy mirror so other file operation
+		 * on the device can start using it.
+		 */
+		mutex_lock(&ddevice->mutex);
+		if (ddevice->dmirrors[minor]) {
+			/* This really should not happen. */
+			mutex_unlock(&ddevice->mutex);
+			dmirror->dead = true;
+			dummy_mirror_worker_thread_stop(dmirror);
+			hmm_mirror_unregister(&dmirror->mirror);
+			return -EBUSY;
+		}
+		ddevice->dmirrors[minor] = dmirror;
+		mutex_unlock(&ddevice->mutex);
+
+		/* Success. */
+		pr_info("mirroring address space of %d\n", dmirror->pid);
+		dummy_mirror_worker_thread_stop(dmirror);
+		return 0;
+	case HMM_DUMMY_READ:
+		if (copy_from_user(&dread, uarg, sizeof(dread))) {
+			dummy_mirror_worker_thread_stop(dmirror);
+			return -EFAULT;
+		}
+
+		memset(&devent, 0, sizeof(devent));
+		devent.hevent.start = dread.address;
+		devent.hevent.end = dread.address + dread.size;
+		ret = dummy_read(dmirror, &devent,
+				 (void __user *)dread.ptr,
+				 dread.size);
+
+		dread.nsys_pages = devent.nsys_pages;
+		dread.nfaulted_sys_pages = devent.nfaulted_sys_pages;
+		if (copy_to_user(uarg, &dread, sizeof(dread))) {
+			dummy_mirror_worker_thread_stop(dmirror);
+			return -EFAULT;
+		}
+
+		dummy_mirror_worker_thread_stop(dmirror);
+		return ret;
+	case HMM_DUMMY_WRITE:
+		if (copy_from_user(&dwrite, uarg, sizeof(dwrite))) {
+			dummy_mirror_worker_thread_stop(dmirror);
+			return -EFAULT;
+		}
+
+		memset(&devent, 0, sizeof(devent));
+		devent.hevent.start = dwrite.address;
+		devent.hevent.end = dwrite.address + dwrite.size;
+		ret = dummy_write(dmirror, &devent,
+				  (void __user *)dwrite.ptr,
+				  dwrite.size);
+
+		dwrite.nsys_pages = devent.nsys_pages;
+		dwrite.nfaulted_sys_pages = devent.nfaulted_sys_pages;
+		if (copy_to_user(uarg, &dwrite, sizeof(dwrite))) {
+			dummy_mirror_worker_thread_stop(dmirror);
+			return -EFAULT;
+		}
+
+		dummy_mirror_worker_thread_stop(dmirror);
+		return ret;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static const struct file_operations hmm_dummy_fops = {
+	.read		= dummy_fops_read,
+	.write		= dummy_fops_write,
+	.mmap		= dummy_fops_mmap,
+	.open		= dummy_fops_open,
+	.release	= dummy_fops_release,
+	.unlocked_ioctl = dummy_fops_unlocked_ioctl,
+	.llseek		= default_llseek,
+	.owner		= THIS_MODULE,
+};
+
+
+/*
+ * The usual char device driver boiler plate, nothing fancy here.
+ */
+static int dummy_device_init(struct dummy_device *ddevice)
+{
+	int ret, i;
+
+	ret = alloc_chrdev_region(&ddevice->dev, 0,
+				  HMM_DUMMY_MAX_DEVICES,
+				  ddevice->name);
+	if (ret < 0)
+		return ret;
+	ddevice->major = MAJOR(ddevice->dev);
+
+	cdev_init(&ddevice->cdevice, &hmm_dummy_fops);
+	ret = cdev_add(&ddevice->cdevice, ddevice->dev, HMM_DUMMY_MAX_MIRRORS);
+	if (ret) {
+		unregister_chrdev_region(ddevice->dev, HMM_DUMMY_MAX_MIRRORS);
+		return ret;
+	}
+
+	/* Register the hmm device. */
+	for (i = 0; i < HMM_DUMMY_MAX_MIRRORS; i++)
+		ddevice->dmirrors[i] = NULL;
+	mutex_init(&ddevice->mutex);
+	ddevice->hdevice.ops = &hmm_dummy_ops;
+	ddevice->hdevice.dev = NULL;
+
+	ret = hmm_device_register(&ddevice->hdevice);
+	if (ret) {
+		cdev_del(&ddevice->cdevice);
+		unregister_chrdev_region(ddevice->dev, HMM_DUMMY_MAX_MIRRORS);
+	}
+	return ret;
+}
+
+static void dummy_device_fini(struct dummy_device *ddevice)
+{
+	struct dummy_mirror *dmirror;
+	unsigned i;
+
+	/* First unregister all mirror. */
+	do {
+		mutex_lock(&ddevice->mutex);
+		for (i = 0; i < HMM_DUMMY_MAX_MIRRORS; i++) {
+			dmirror = ddevices->dmirrors[i];
+			ddevices->dmirrors[i] = NULL;
+			if (dmirror)
+				break;
+		}
+		mutex_unlock(&ddevice->mutex);
+		if (dmirror)
+			hmm_mirror_unregister(&dmirror->mirror);
+	} while (dmirror);
+
+	hmm_device_unregister(&ddevice->hdevice);
+
+	cdev_del(&ddevice->cdevice);
+	unregister_chrdev_region(ddevice->dev, HMM_DUMMY_MAX_MIRRORS);
+}
+
+static int __init hmm_dummy_init(void)
+{
+	int i, ret;
+
+	for (i = 0; i < HMM_DUMMY_MAX_DEVICES; ++i) {
+		snprintf(ddevices[i].name, sizeof(ddevices[i].name),
+			 "%s%d", HMM_DUMMY_DEVICE_NAME, i);
+		ret = dummy_device_init(&ddevices[i]);
+		if (ret) {
+			/* Empty name means device is not valid. */
+			ddevices[i].name[0] = 0;
+			/*
+			 * Report failure only if we fail to create at least
+			 * one device.
+			 */
+			if (!i)
+				return ret;
+		}
+	}
+
+	pr_info("hmm_dummy loaded THIS IS A DANGEROUS MODULE !!!\n");
+	return 0;
+}
+
+static void __exit hmm_dummy_exit(void)
+{
+	int i;
+
+	for (i = 0; i < HMM_DUMMY_MAX_DEVICES; ++i) {
+		/* Empty name means device is not valid. */
+		if (!ddevices[i].name[0])
+			continue;
+		dummy_device_fini(&ddevices[i]);
+	}
+}
+
+module_init(hmm_dummy_init);
+module_exit(hmm_dummy_exit);
+MODULE_LICENSE("GPL");
diff --git a/include/uapi/linux/hmm_dummy.h b/include/uapi/linux/hmm_dummy.h
new file mode 100644
index 0000000..3af71d4
--- /dev/null
+++ b/include/uapi/linux/hmm_dummy.h
@@ -0,0 +1,51 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: JA(C)rA'me Glisse <jglisse@redhat.com>
+ */
+/*
+ * This is a dummy driver to exercice the HMM (heterogeneous memory management)
+ * API of the kernel. It allow an userspace program to expose its whole address
+ * space through the hmm dummy driver file.
+ */
+#ifndef _UAPI_LINUX_HMM_DUMMY_H
+#define _UAPI_LINUX_HMM_DUMMY_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+#include <linux/irqnr.h>
+
+struct hmm_dummy_read {
+	uint64_t		address;
+	uint64_t		size;
+	uint64_t		ptr;
+	uint64_t		nsys_pages;
+	uint64_t		nfaulted_sys_pages;
+	uint64_t		reserved[11];
+};
+
+struct hmm_dummy_write {
+	uint64_t		address;
+	uint64_t		size;
+	uint64_t		ptr;
+	uint64_t		nsys_pages;
+	uint64_t		nfaulted_sys_pages;
+	uint64_t		reserved[11];
+};
+
+/* Expose the address space of the calling process through hmm dummy dev file */
+#define HMM_DUMMY_EXPOSE_MM	_IO('H', 0x00)
+#define HMM_DUMMY_READ		_IOWR('H', 0x01, struct hmm_dummy_read)
+#define HMM_DUMMY_WRITE		_IOWR('H', 0x02, struct hmm_dummy_write)
+
+#endif /* _UAPI_LINUX_HMM_DUMMY_H */
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: HMM (Heterogeneous Memory Management) v10
  2015-08-13 19:15 HMM (Heterogeneous Memory Management) v10 Jérôme Glisse
                   ` (14 preceding siblings ...)
  2015-08-13 19:15 ` [PATCH 15/15] hmm/dummy: dummy driver for testing and showcasing the HMM API Jérôme Glisse
@ 2015-09-23 10:21 ` Pavel Machek
  15 siblings, 0 replies; 20+ messages in thread
From: Pavel Machek @ 2015-09-23 10:21 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
	Joe Donohue, Christophe Harle, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Leonid Shamis, Laurent Morichetti, Alexander Deucher, Linda Wang,
	Kevin E Martin, Jeff Law, Or Gerlitz, Sagi Grimberg

Hi!

> Minor fixes since last post (1), apply on top of 4.2-rc6 done
> that because conflict in infiniband are harder to solve then

is.

> conflict with mm tree.
> 
> Tree with the patchset:
> git://people.freedesktop.org/~glisse/linux hmm-v10 branch
> 
> Previous cover letter :
> 
> 
> 
> HMM (Heterogeneous Memory Management) is an helper layer for device
> that want to mirror a process address space into their own mmu. Main
> target is GPU but other hardware, like network device can take also

GPU, ... devices

> use HMM.
> 
> There is two side to HMM, first one is mirroring of process address

There are

> space on behalf of a device. HMM will manage a secondary page table
> for the device and keep it synchronize with the CPU page table. HMM

synchronized

> also do DMA mapping on behalf of the device (which would allow new

also does

> kind of optimization further down the road (2)).

kinds of optimalizations

> Second side is allowing to migrate process memory to device memory

,

> where device memory is unmappable by the CPU. Any CPU access will
> trigger special fault that will migrate memory back. This patchset
> does not deal with remote memory migration.
> 
> 
> Why doing this ?
> 
> Mirroring a process address space is mandatory with OpenCL 2.0 and
> with other GPU compute API. OpenCL 2.0 allow different level of

APIs... allows

> implementation and currently only the lowest 2 are supported on
> Linux. To implement the highest level, where CPU and GPU access
> can happen concurently and are cache coherent, HMM is needed, or
> something providing same functionality, for instance through
> platform hardware.
> 
> Hardware solution such as PCIE ATS/PASID is limited to mirroring
> system memory and does not provide way to migrate memory to device
> memory (which offer significantly more bandwidth up to 10 times

, up to

> faster than regular system memory with discret GPU, also have

discrete? 

> lower latency than PCIE transaction).

> Current CPU with GPU on same die (AMD or Intel) use the ATS/PASID
> and for Intel a special level of cache (backed by a large pool of
> fast memory).
> 
> For foreeseeable futur, discrete GPU will remain releveant as they

future .. GPUs

> can have a large quantity of faster memory than integrated GPU.
> 
> Thus we believe HMM will allow to leverage discret GPU memory in

allow us... discrete

> a transparent fashion to the application, with minimum disruption
> to the linux kernel mm code. Also HMM can work along hardware
> solution such as PCIE ATS/PASID (leaving regular case to ATS/PASID
> while HMM handles the migrated memory case).

Best regards,
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 00/15] HMM (Heterogeneous Memory Management) v9
@ 2015-07-17 18:52 Jérôme Glisse
  2015-07-17 18:52 ` [PATCH 08/15] HMM: add device page fault support v4 Jérôme Glisse
  0 siblings, 1 reply; 20+ messages in thread
From: Jérôme Glisse @ 2015-07-17 18:52 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Linda Wang, Kevin E Martin, Jeff Law,
	Or Gerlitz, Sagi Grimberg

Not much changed since last post (1), i did incorporate comments i got
so far, fixed couple bugs here and there and simplified the HMM page
table code. I am splitting the patchset into 3 parts. The first part
has the core of HMM and is enough for device that do not care about
memory migration. The second part is a convertion of Mellanox device
driver to use HMM. The last part is about remote memory migration
and i will post it at latter date. I think first and second part can
be merge and i would really like to know where i stand on this.
For the remote memory, i am reworking some bits related to memcg due
to recent changes upstream on that front. Below is the previous
cover letter as explanation and motivation did not change.

Tree with the patchset:
git://people.freedesktop.org/~glisse/linux hmm-v9 branch

HMM (Heterogeneous Memory Management) is an helper layer for device
that want to mirror a process address space into their own mmu. Main
target is GPU but other hardware, like network device can take also
use HMM.

There is two side to HMM, first one is mirroring of process address
space on behalf of a device. HMM will manage a secondary page table
for the device and keep it synchronize with the CPU page table. HMM
also do DMA mapping on behalf of the device (which would allow new
kind of optimization further down the road (2)).

Second side is allowing to migrate process memory to device memory
where device memory is unmappable by the CPU. Any CPU access will
trigger special fault that will migrate memory back. This patchset
does not deal with remote memory migration.

Why doing this ?

Mirroring a process address space is mandatory with OpenCL 2.0 and
with other GPU compute API. OpenCL 2.0 allow different level of
implementation and currently only the lowest 2 are supported on
Linux. To implement the highest level, where CPU and GPU access
can happen concurently and are cache coherent, HMM is needed, or
something providing same functionality, for instance through
platform hardware.

Hardware solution such as PCIE ATS/PASID is limited to mirroring
system memory and does not provide way to migrate memory to device
memory (which offer significantly more bandwidth up to 10 times
faster than regular system memory with discret GPU, also have
lower latency than PCIE transaction).

Current CPU with GPU on same die (AMD or Intel) use the ATS/PASID
and for Intel a special level of cache (backed by a large pool of
fast memory).

For foreseeable futur, discrete GPU will remain releveant as they
can have a large quantity of faster memory than integrated GPU.

Thus we believe HMM will allow to leverage discret GPU memory in
a transparent fashion to the application, with minimum disruption
to the linux kernel mm code. Also HMM can work along hardware
solution such as PCIE ATS/PASID (leaving regular case to ATS/PASID
while HMM handles the migrated memory case).

Design :

The patch 1, 2, 3 and 4 augment the mmu notifier API with new
informations to more efficiently mirror CPU page table updates.

The first side of HMM, process address space mirroring, is
implemented in patch 5 through 14. This use a secondary page
table, in which HMM mirror memory actively use by the device.
HMM does not take a reference on any of the page, it use the
mmu notifier API to track changes to the CPU page table and to
update the mirror page table. All this while providing a simple
API to device driver.

To implement this we use a "generic" page table and not a radix
tree because we need to store more flags than radix allows and
we need to store dma address (sizeof(dma_addr_t) > sizeof(long)
on some platform). All this is

(1) Previous patchset posting :
    v1 http://lwn.net/Articles/597289/
    v2 https://lkml.org/lkml/2014/6/12/559
    v3 https://lkml.org/lkml/2014/6/13/633
    v4 https://lkml.org/lkml/2014/8/29/423
    v5 https://lkml.org/lkml/2014/11/3/759
    v6 http://lwn.net/Articles/619737/
    v7 http://lwn.net/Articles/627316/
    v8 https://lwn.net/Articles/645515/

(2) Because HMM keeps a secondary page table which keeps track of
    DMA mapping, there is room for new optimization. We want to
    add a new DMA API to allow to manage DMA page table mapping
    at directory level. This would allow to minimize memory
    consumption of mirror page table and also over head of doing
    DMA mapping page per page. This is a future feature we want
    to work on and hope the idea will proove usefull not only to
    HMM users.

Cheers,
JA(C)rA'me

To: <linux-kernel@vger.kernel.org>,
To: linux-mm <linux-mm@kvack.org>,
To: "Andrew Morton" <akpm@linux-foundation.org>,
Cc: "Linus Torvalds" <torvalds@linux-foundation.org>,
Cc: "Mel Gorman" <mgorman@suse.de>,
Cc: "H. Peter Anvin" <hpa@zytor.com>,
Cc: "Peter Zijlstra" <peterz@infradead.org>,
Cc: "Linda Wang" <lwang@redhat.com>,
Cc: "Kevin E Martin" <kem@redhat.com>,
Cc: "Andrea Arcangeli" <aarcange@redhat.com>,
Cc: "Johannes Weiner" <jweiner@redhat.com>,
Cc: "Larry Woodman" <lwoodman@redhat.com>,
Cc: "Rik van Riel" <riel@redhat.com>,
Cc: "Dave Airlie" <airlied@redhat.com>,
Cc: "Jeff Law" <law@redhat.com>,
Cc: "Brendan Conoboy" <blc@redhat.com>,
Cc: "Joe Donohue" <jdonohue@redhat.com>,
Cc: "Christophe Harle" <charle@nvidia.com>,
Cc: "Duncan Poole" <dpoole@nvidia.com>,
Cc: "Sherry Cheung" <SCheung@nvidia.com>,
Cc: "Subhash Gutti" <sgutti@nvidia.com>,
Cc: "John Hubbard" <jhubbard@nvidia.com>,
Cc: "Mark Hairgrove" <mhairgrove@nvidia.com>,
Cc: "Lucien Dunning" <ldunning@nvidia.com>,
Cc: "Cameron Buschardt" <cabuschardt@nvidia.com>,
Cc: "Arvind Gopalakrishnan" <arvindg@nvidia.com>,
Cc: "Haggai Eran" <haggaie@mellanox.com>,
Cc: "Or Gerlitz" <ogerlitz@mellanox.com>,
Cc: "Sagi Grimberg" <sagig@mellanox.com>
Cc: "Shachar Raindel" <raindel@mellanox.com>,
Cc: "Liran Liss" <liranl@mellanox.com>,
Cc: "Roland Dreier" <roland@purestorage.com>,
Cc: "Sander, Ben" <ben.sander@amd.com>,
Cc: "Stoner, Greg" <Greg.Stoner@amd.com>,
Cc: "Bridgman, John" <John.Bridgman@amd.com>,
Cc: "Mantor, Michael" <Michael.Mantor@amd.com>,
Cc: "Blinzer, Paul" <Paul.Blinzer@amd.com>,
Cc: "Morichetti, Laurent" <Laurent.Morichetti@amd.com>,
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
Cc: "Leonid Shamis" <Leonid.Shamis@amd.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 08/15] HMM: add device page fault support v4.
  2015-07-17 18:52 [PATCH 00/15] HMM (Heterogeneous Memory Management) v9 Jérôme Glisse
@ 2015-07-17 18:52 ` Jérôme Glisse
  0 siblings, 0 replies; 20+ messages in thread
From: Jérôme Glisse @ 2015-07-17 18:52 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse, Jatin Kumar

This patch add helper for device page fault. Device page fault helper will
fill the mirror page table using the CPU page table all this synchronized
with any update to CPU page table.

Changed since v1:
  - Add comment about directory lock.

Changed since v2:
  - Check for mirror->hmm in hmm_mirror_fault()

Changed since v3:
  - Adapt to HMM page table changes.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h |  15 +++
 mm/hmm.c            | 370 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 384 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 5488fa9..d819ec9 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -85,6 +85,12 @@ struct hmm_event {
 	bool			backoff;
 };
 
+static inline bool hmm_event_overlap(const struct hmm_event *a,
+				     const struct hmm_event *b)
+{
+	return !((a->end <= b->start) || (a->start >= b->end));
+}
+
 
 /* hmm_device - Each device must register one and only one hmm_device.
  *
@@ -176,6 +182,10 @@ struct hmm_device_ops {
  * @rwsem: Serialize the mirror list modifications.
  * @mmu_notifier: The mmu_notifier of this mm.
  * @rcu: For delayed cleanup call from mmu_notifier.release() callback.
+ * @device_faults: List of all active device page faults.
+ * @ndevice_faults: Number of active device page faults.
+ * @wait_queue: Wait queue for event synchronization.
+ * @lock: Serialize device_faults list modification.
  *
  * For each process address space (mm_struct) there is one and only one hmm
  * struct. hmm functions will redispatch to each devices the change made to
@@ -192,6 +202,10 @@ struct hmm {
 	struct rw_semaphore	rwsem;
 	struct mmu_notifier	mmu_notifier;
 	struct rcu_head		rcu;
+	struct list_head	device_faults;
+	unsigned		ndevice_faults;
+	wait_queue_head_t	wait_queue;
+	spinlock_t		lock;
 };
 
 
@@ -250,6 +264,7 @@ int hmm_mirror_register(struct hmm_mirror *mirror);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
 struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror);
 void hmm_mirror_unref(struct hmm_mirror **mirror);
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event);
 
 
 #endif /* CONFIG_HMM */
diff --git a/mm/hmm.c b/mm/hmm.c
index 08e9501..a9e3dc5 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -67,7 +67,7 @@ static inline int hmm_event_init(struct hmm_event *event,
 				 enum hmm_etype etype)
 {
 	event->start = start & PAGE_MASK;
-	event->end = min(end, hmm->vm_end);
+	event->end = PAGE_ALIGN(min(end, hmm->vm_end));
 	if (event->start >= event->end)
 		return -EINVAL;
 	event->etype = etype;
@@ -103,6 +103,10 @@ static int hmm_init(struct hmm *hmm)
 	kref_init(&hmm->kref);
 	INIT_HLIST_HEAD(&hmm->mirrors);
 	init_rwsem(&hmm->rwsem);
+	INIT_LIST_HEAD(&hmm->device_faults);
+	hmm->ndevice_faults = 0;
+	init_waitqueue_head(&hmm->wait_queue);
+	spin_lock_init(&hmm->lock);
 
 	/* register notifier */
 	hmm->mmu_notifier.ops = &hmm_notifier_ops;
@@ -167,6 +171,58 @@ static inline struct hmm *hmm_unref(struct hmm *hmm)
 	return NULL;
 }
 
+static int hmm_device_fault_start(struct hmm *hmm, struct hmm_event *event)
+{
+	int ret = 0;
+
+	mmu_notifier_range_wait_valid(hmm->mm, event->start, event->end);
+
+	spin_lock(&hmm->lock);
+	if (mmu_notifier_range_is_valid(hmm->mm, event->start, event->end)) {
+		list_add_tail(&event->list, &hmm->device_faults);
+		hmm->ndevice_faults++;
+		event->backoff = false;
+	} else
+		ret = -EAGAIN;
+	spin_unlock(&hmm->lock);
+
+	wake_up(&hmm->wait_queue);
+
+	return ret;
+}
+
+static void hmm_device_fault_end(struct hmm *hmm, struct hmm_event *event)
+{
+	spin_lock(&hmm->lock);
+	list_del_init(&event->list);
+	hmm->ndevice_faults--;
+	spin_unlock(&hmm->lock);
+
+	wake_up(&hmm->wait_queue);
+}
+
+static void hmm_wait_device_fault(struct hmm *hmm, struct hmm_event *ievent)
+{
+	struct hmm_event *fevent;
+	unsigned long wait_for = 0;
+
+again:
+	spin_lock(&hmm->lock);
+	list_for_each_entry(fevent, &hmm->device_faults, list) {
+		if (!hmm_event_overlap(fevent, ievent))
+			continue;
+		fevent->backoff = true;
+		wait_for = hmm->ndevice_faults;
+	}
+	spin_unlock(&hmm->lock);
+
+	if (wait_for > 0) {
+		wait_event(hmm->wait_queue, wait_for != hmm->ndevice_faults);
+		wait_for = 0;
+		goto again;
+	}
+}
+
 static void hmm_update(struct hmm *hmm, struct hmm_event *event)
 {
 	struct hmm_mirror *mirror;
@@ -175,6 +231,8 @@ static void hmm_update(struct hmm *hmm, struct hmm_event *event)
 	if (hmm->mm->hmm != hmm)
 		return;
 
+	hmm_wait_device_fault(hmm, event);
+
 again:
 	down_read(&hmm->rwsem);
 	hlist_for_each_entry(mirror, &hmm->mirrors, mlist)
@@ -186,6 +244,33 @@ again:
 			goto again;
 		}
 	up_read(&hmm->rwsem);
+
+	wake_up(&hmm->wait_queue);
+}
+
+static int hmm_mm_fault(struct hmm *hmm,
+			struct hmm_event *event,
+			struct vm_area_struct *vma,
+			unsigned long addr)
+{
+	unsigned flags = FAULT_FLAG_ALLOW_RETRY;
+	struct mm_struct *mm = vma->vm_mm;
+	int r;
+
+	flags |= (event->etype == HMM_DEVICE_WFAULT) ? FAULT_FLAG_WRITE : 0;
+	for (addr &= PAGE_MASK; addr < event->end; addr += PAGE_SIZE) {
+
+		r = handle_mm_fault(mm, vma, addr, flags);
+		if (r & VM_FAULT_RETRY)
+			return -EBUSY;
+		if (r & VM_FAULT_ERROR) {
+			if (r & VM_FAULT_OOM)
+				return -ENOMEM;
+			/* Same error code for all other cases. */
+			return -EFAULT;
+		}
+	}
+	return 0;
 }
 
 
@@ -228,6 +313,7 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	}
 	up_write(&hmm->rwsem);
 
+	wake_up(&hmm->wait_queue);
 	hmm_unref(hmm);
 }
 
@@ -416,6 +502,288 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 	hmm_pt_iter_fini(&iter);
 }
 
+static inline bool hmm_mirror_is_dead(struct hmm_mirror *mirror)
+{
+	if (hlist_unhashed(&mirror->mlist) || list_empty(&mirror->dlist))
+		return true;
+	return false;
+}
+
+struct hmm_mirror_fault {
+	struct hmm_mirror	*mirror;
+	struct hmm_event	*event;
+	struct vm_area_struct	*vma;
+	unsigned long		addr;
+	struct hmm_pt_iter	*iter;
+};
+
+static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
+				 struct hmm_event *event,
+				 struct vm_area_struct *vma,
+				 struct hmm_pt_iter *iter,
+				 pmd_t *pmdp,
+				 struct hmm_mirror_fault *mirror_fault,
+				 unsigned long start,
+				 unsigned long end)
+{
+	struct page *page;
+	unsigned long addr, pfn;
+	unsigned flags = FOLL_TOUCH;
+	spinlock_t *ptl;
+	int ret;
+
+	ptl = pmd_lock(mirror->hmm->mm, pmdp);
+	if (unlikely(!pmd_trans_huge(*pmdp))) {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
+	if (unlikely(pmd_trans_splitting(*pmdp))) {
+		spin_unlock(ptl);
+		wait_split_huge_page(vma->anon_vma, pmdp);
+		return -EAGAIN;
+	}
+	flags |= event->etype == HMM_DEVICE_WFAULT ? FOLL_WRITE : 0;
+	page = follow_trans_huge_pmd(vma, start, pmdp, flags);
+	pfn = page_to_pfn(page);
+	spin_unlock(ptl);
+
+	/* Just fault in the whole PMD. */
+	start &= PMD_MASK;
+	end = start + PMD_SIZE - 1;
+
+	if (!pmd_write(*pmdp) && event->etype == HMM_DEVICE_WFAULT)
+			return -ENOENT;
+
+	for (ret = 0, addr = start; !ret && addr < end;) {
+		unsigned long i = 0, next = end;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+		if (!hmm_pte)
+			return -ENOMEM;
+
+		/*
+		 * The directory lock protect against concurrent clearing of
+		 * page table bit flags. Exceptions being the dirty bit and
+		 * the device driver private flags.
+		 */
+		hmm_pt_iter_directory_lock(iter);
+		do {
+			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
+				hmm_pte[i] = hmm_pte_from_pfn(pfn);
+				hmm_pt_iter_directory_ref(iter);
+			}
+			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pfn);
+			if (pmd_write(*pmdp))
+				hmm_pte_set_write(&hmm_pte[i]);
+		} while (addr += PAGE_SIZE, pfn++, i++, addr != next);
+		hmm_pt_iter_directory_unlock(iter);
+		mirror_fault->addr = addr;
+	}
+
+	return 0;
+}
+
+static int hmm_mirror_fault_pmd(pmd_t *pmdp,
+				unsigned long start,
+				unsigned long end,
+				struct mm_walk *walk)
+{
+	struct hmm_mirror_fault *mirror_fault = walk->private;
+	struct hmm_mirror *mirror = mirror_fault->mirror;
+	struct hmm_event *event = mirror_fault->event;
+	struct hmm_pt_iter *iter = mirror_fault->iter;
+	bool write = (event->etype == HMM_DEVICE_WFAULT);
+	unsigned long addr;
+	int ret = 0;
+
+	/* Make sure there was no gap. */
+	if (start != mirror_fault->addr)
+		return -ENOENT;
+
+	if (event->backoff)
+		return -EAGAIN;
+
+	if (pmd_none(*pmdp))
+		return -ENOENT;
+
+	if (pmd_trans_huge(*pmdp))
+		return hmm_mirror_fault_hpmd(mirror, event, mirror_fault->vma,
+					     iter, pmdp, mirror_fault, start,
+					     end);
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp))
+		return -EFAULT;
+
+	for (ret = 0, addr = start; !ret && addr < end;) {
+		unsigned long i = 0, next = end;
+		dma_addr_t *hmm_pte;
+		pte_t *ptep;
+
+		hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+		if (!hmm_pte)
+			return -ENOMEM;
+
+		ptep = pte_offset_map(pmdp, start);
+		hmm_pt_iter_directory_lock(iter);
+		do {
+			if (!pte_present(*ptep) ||
+			    (write && !pte_write(*ptep))) {
+				ret = -ENOENT;
+				ptep++;
+				break;
+			}
+
+			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
+				hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(*ptep));
+				hmm_pt_iter_directory_ref(iter);
+			}
+			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pte_pfn(*ptep));
+			if (pte_write(*ptep))
+				hmm_pte_set_write(&hmm_pte[i]);
+		} while (addr += PAGE_SIZE, ptep++, i++, addr != next);
+		hmm_pt_iter_directory_unlock(iter);
+		pte_unmap(ptep - 1);
+		mirror_fault->addr = addr;
+	}
+
+	return ret;
+}
+
+static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
+				   struct hmm_event *event,
+				   struct vm_area_struct *vma,
+				   struct hmm_pt_iter *iter)
+{
+	struct hmm_mirror_fault mirror_fault;
+	unsigned long addr = event->start;
+	struct mm_walk walk = {0};
+	int ret = 0;
+
+	if ((event->etype == HMM_DEVICE_WFAULT) && !(vma->vm_flags & VM_WRITE))
+		return -EACCES;
+
+	ret = hmm_device_fault_start(mirror->hmm, event);
+	if (ret)
+		return ret;
+
+again:
+	if (event->backoff) {
+		ret = -EAGAIN;
+		goto out;
+	}
+	if (addr >= event->end)
+		goto out;
+
+	mirror_fault.event = event;
+	mirror_fault.mirror = mirror;
+	mirror_fault.vma = vma;
+	mirror_fault.addr = addr;
+	mirror_fault.iter = iter;
+	walk.mm = mirror->hmm->mm;
+	walk.private = &mirror_fault;
+	walk.pmd_entry = hmm_mirror_fault_pmd;
+	ret = walk_page_range(addr, event->end, &walk);
+	if (!ret) {
+		ret = mirror->device->ops->update(mirror, event);
+		if (!ret) {
+			addr = mirror_fault.addr;
+			goto again;
+		}
+	}
+
+out:
+	hmm_device_fault_end(mirror->hmm, event);
+	if (ret == -ENOENT) {
+		ret = hmm_mm_fault(mirror->hmm, event, vma, addr);
+		ret = ret ? ret : -EAGAIN;
+	}
+	return ret;
+}
+
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event)
+{
+	struct vm_area_struct *vma;
+	struct hmm_pt_iter iter;
+	int ret = 0;
+
+	mirror = hmm_mirror_ref(mirror);
+	if (!mirror)
+		return -ENODEV;
+	if (event->start >= mirror->hmm->vm_end) {
+		hmm_mirror_unref(&mirror);
+		return -EINVAL;
+	}
+	if (hmm_event_init(event, mirror->hmm, event->start,
+			   event->end, event->etype)) {
+		hmm_mirror_unref(&mirror);
+		return -EINVAL;
+	}
+	hmm_pt_iter_init(&iter, &mirror->pt);
+
+retry:
+	if (hmm_mirror_is_dead(mirror)) {
+		hmm_mirror_unref(&mirror);
+		return -ENODEV;
+	}
+
+	/*
+	 * So synchronization with the cpu page table is the most important
+	 * and tedious aspect of device page fault. There must be a strong
+	 * ordering btw call to device->update() for device page fault and
+	 * device->update() for cpu page table invalidation/update.
+	 *
+	 * Page that are exposed to device driver must stay valid while the
+	 * callback is in progress ie any cpu page table invalidation that
+	 * render those pages obsolete must call device->update() after the
+	 * device->update() call that faulted those pages.
+	 *
+	 * To achieve this we rely on few things. First the mmap_sem insure
+	 * us that any munmap() syscall will serialize with us. So issue are
+	 * with unmap_mapping_range() and with migrate or merge page. For this
+	 * hmm keep track of affected range of address and block device page
+	 * fault that hit overlapping range.
+	 */
+	down_read(&mirror->hmm->mm->mmap_sem);
+	vma = find_vma_intersection(mirror->hmm->mm, event->start, event->end);
+	if (!vma) {
+		ret = -EFAULT;
+		goto out;
+	}
+	if (vma->vm_start > event->start) {
+		event->end = vma->vm_start;
+		ret = -EFAULT;
+		goto out;
+	}
+	event->end = min(event->end, vma->vm_end) & PAGE_MASK;
+	if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	switch (event->etype) {
+	case HMM_DEVICE_RFAULT:
+	case HMM_DEVICE_WFAULT:
+		ret = hmm_mirror_handle_fault(mirror, event, vma, &iter);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+out:
+	/* Drop the mmap_sem so anyone waiting on it have a chance. */
+	if (ret != -EBUSY)
+		up_read(&mirror->hmm->mm->mmap_sem);
+	wake_up(&mirror->hmm->wait_queue);
+	if (ret == -EAGAIN)
+		goto retry;
+	hmm_pt_iter_fini(&iter);
+	hmm_mirror_unref(&mirror);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_fault);
+
 /* hmm_mirror_register() - register mirror against current process for a device.
  *
  * @mirror: The mirror struct being registered.
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2015-09-23 10:21 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-08-13 19:15 HMM (Heterogeneous Memory Management) v10 Jérôme Glisse
2015-08-13 19:15 ` [PATCH 01/15] mmu_notifier: add event information to address invalidation v8 Jérôme Glisse
2015-08-13 19:15 ` [PATCH 02/15] mmu_notifier: keep track of active invalidation ranges v4 Jérôme Glisse
2015-09-01  3:27   ` Mark Hairgrove
2015-09-01 14:58     ` Jerome Glisse
2015-08-13 19:15 ` [PATCH 03/15] mmu_notifier: pass page pointer to mmu_notifier_invalidate_page() v2 Jérôme Glisse
2015-08-13 19:15 ` [PATCH 04/15] mmu_notifier: allow range invalidation to exclude a specific mmu_notifier Jérôme Glisse
2015-08-13 19:15 ` [PATCH 05/15] HMM: introduce heterogeneous memory management v5 Jérôme Glisse
2015-08-13 19:15 ` [PATCH 06/15] HMM: add HMM page table v4 Jérôme Glisse
2015-08-13 19:15 ` [PATCH 07/15] HMM: add per mirror " Jérôme Glisse
2015-08-13 19:15 ` [PATCH 08/15] HMM: add device page fault support v4 Jérôme Glisse
2015-08-13 19:15 ` [PATCH 09/15] HMM: add mm page table iterator helpers Jérôme Glisse
2015-08-13 19:15 ` [PATCH 10/15] HMM: use CPU page table during invalidation Jérôme Glisse
2015-08-13 19:15 ` [PATCH 11/15] HMM: add discard range helper (to clear and free resources for a range) Jérôme Glisse
2015-08-13 19:15 ` [PATCH 12/15] HMM: add dirty range helper (toggle dirty bit inside mirror page table) v2 Jérôme Glisse
2015-08-13 19:15 ` [PATCH 13/15] HMM: DMA map memory on behalf of device driver v2 Jérôme Glisse
2015-08-13 19:15 ` [PATCH 14/15] HMM: add documentation explaining HMM internals and how to use it Jérôme Glisse
2015-08-13 19:15 ` [PATCH 15/15] hmm/dummy: dummy driver for testing and showcasing the HMM API Jérôme Glisse
2015-09-23 10:21 ` HMM (Heterogeneous Memory Management) v10 Pavel Machek
  -- strict thread matches above, loose matches on Subject: below --
2015-07-17 18:52 [PATCH 00/15] HMM (Heterogeneous Memory Management) v9 Jérôme Glisse
2015-07-17 18:52 ` [PATCH 08/15] HMM: add device page fault support v4 Jérôme Glisse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).