* [RFC PATCH 1/6] mmu_notifier: add event information to address invalidation v4
2014-08-29 19:10 [RFC PATCH 0/6] HMM (heterogeneous memory management) v4 j.glisse
@ 2014-08-29 19:10 ` j.glisse
2014-09-11 10:00 ` Haggai Eran
2014-08-29 19:10 ` [RFC PATCH 2/6] lib: lockless generic and arch independent page table (gpt) j.glisse
` (4 subsequent siblings)
5 siblings, 1 reply; 9+ messages in thread
From: j.glisse @ 2014-08-29 19:10 UTC (permalink / raw)
To: linux-kernel, linux-mm, akpm, Haggai Eran
Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
Dave Airlie, Brendan Conoboy, Joe Donohue, Duncan Poole,
Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan,
Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
Laurent Morichetti, Alexander Deucher, Oded Gabbay,
Jérôme Glisse
From: Jérôme Glisse <jglisse@redhat.com>
The event information will be useful for new users of the mmu_notifier API.
The event argument differentiates between a vma disappearing, a page being
write protected or simply a page being unmapped. This allows new users to
take a different path for each event: for instance, on unmap the resources
used to track a vma are still valid and should stay around, while if the
event says that a vma is being destroyed it means that any resources used
to track this vma can be freed.
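As an illustration only (not part of this patch), here is a minimal sketch of
a listener taking advantage of the event argument; the example_mirror type and
the example_mirror_*() helpers are hypothetical driver code, only the callback
signature and the enum mmu_event values come from this patch:
struct example_mirror {
	struct mmu_notifier mn;
	/* ... hypothetical driver private range tracking ... */
};
static void example_invalidate_range_start(struct mmu_notifier *mn,
					   struct mm_struct *mm,
					   unsigned long start,
					   unsigned long end,
					   enum mmu_event event)
{
	struct example_mirror *mirror;
	mirror = container_of(mn, struct example_mirror, mn);
	switch (event) {
	case MMU_WRITE_PROTECT:
	case MMU_WRITE_BACK:
		/* Only write access must stop, read only mappings can stay. */
		example_mirror_wprotect(mirror, start, end);
		break;
	case MMU_MUNMAP:
		/* Range is going away, drop mappings and free the tracking. */
		example_mirror_invalidate(mirror, start, end);
		example_mirror_free_range(mirror, start, end);
		break;
	default:
		/*
		 * MMU_MIGRATE and any other event: conservatively drop all
		 * secondary mappings, tracking structures stay around.
		 */
		example_mirror_invalidate(mirror, start, end);
		break;
	}
}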
Changed since v1:
- renamed action into event (updated commit message too).
- simplified the event names and clarified their intended usage,
also documenting what expectations the listener can have with
respect to each event.
Changed since v2:
- Avoid crazy name.
- Do not move code that does not need to move.
Changed since v3:
- Separate huge page split from mlock/munlock and softdirty.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
drivers/gpu/drm/i915/i915_gem_userptr.c | 3 +-
drivers/iommu/amd_iommu_v2.c | 11 ++-
drivers/misc/sgi-gru/grutlbpurge.c | 9 ++-
drivers/xen/gntdev.c | 9 ++-
fs/proc/task_mmu.c | 6 +-
include/linux/mmu_notifier.h | 131 ++++++++++++++++++++++++++------
kernel/events/uprobes.c | 10 ++-
mm/filemap_xip.c | 2 +-
mm/fremap.c | 4 +-
mm/huge_memory.c | 39 ++++++----
mm/hugetlb.c | 23 +++---
mm/ksm.c | 18 +++--
mm/memory.c | 27 ++++---
mm/migrate.c | 9 ++-
mm/mmu_notifier.c | 28 ++++---
mm/mprotect.c | 5 +-
mm/mremap.c | 6 +-
mm/rmap.c | 24 ++++--
virt/kvm/kvm_main.c | 12 ++-
19 files changed, 271 insertions(+), 105 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index fe69fc8..a13307d 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -124,7 +124,8 @@ restart:
static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
struct interval_tree_node *it = NULL;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 5f578e8..9a6b837 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -411,14 +411,17 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
static void mn_invalidate_page(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
__mn_flush_page(mn, address);
}
static void mn_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
struct pasid_state *pasid_state;
struct device_state *dev_state;
@@ -439,7 +442,9 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,
static void mn_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
struct pasid_state *pasid_state;
struct device_state *dev_state;
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 2129274..e67fed1 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,7 +221,8 @@ void gru_flush_all_tlb(struct gru_state *gru)
*/
static void gru_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end,
+ enum mmu_event event)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
@@ -235,7 +236,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
static void gru_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm, unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
@@ -248,7 +250,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
}
static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 073b4a1..fe9da94 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -428,7 +428,9 @@ static void unmap_if_in_range(struct grant_map *map,
static void mn_invl_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
struct grant_map *map;
@@ -445,9 +447,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
static void mn_invl_page(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
- mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
+ mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
}
static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index dfc791c..0ddb975 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -830,7 +830,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
};
down_read(&mm->mmap_sem);
if (type == CLEAR_REFS_SOFT_DIRTY)
- mmu_notifier_invalidate_range_start(mm, 0, -1);
+ mmu_notifier_invalidate_range_start(mm, 0,
+ -1, MMU_ISDIRTY);
for (vma = mm->mmap; vma; vma = vma->vm_next) {
cp.vma = vma;
if (is_vm_hugetlb_page(vma))
@@ -858,7 +859,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
&clear_refs_walk);
}
if (type == CLEAR_REFS_SOFT_DIRTY)
- mmu_notifier_invalidate_range_end(mm, 0, -1);
+ mmu_notifier_invalidate_range_end(mm, 0,
+ -1, MMU_ISDIRTY);
flush_tlb_mm(mm);
up_read(&mm->mmap_sem);
mmput(mm);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 2728869..94f6890 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -9,6 +9,66 @@
struct mmu_notifier;
struct mmu_notifier_ops;
+/* MMU Events report fine-grained information to the callback routine, allowing
+ * the event listener to make a more informed decision as to what action to
+ * take. The event types are:
+ *
+ * - MMU_HSPLIT: huge page split, the memory is the same, only the page table
+ * structure is updated (level added or removed).
+ *
+ * - MMU_ISDIRTY: the dirty bit of the page table needs to be updated so that
+ * proper dirty accounting can happen.
+ *
+ * - MMU_MIGRATE: memory is migrating from one page to another, thus all write
+ * access must stop after invalidate_range_start callback returns.
+ * Furthermore, no read access should be allowed either, as a new page can
+ * be remapped with write access before the invalidate_range_end callback
+ * happens and thus any read access to old page might read stale data. There
+ * are several sources for this event, including:
+ *
+ * - A page moving to swap (various reasons, including page reclaim),
+ * - An mremap syscall,
+ * - migration for NUMA reasons,
+ * - balancing the memory pool,
+ * - write fault on COW page,
+ * - and more that are not listed here.
+ *
+ * - MMU_MPROT: memory access protection is changing. Refer to the vma to get
+ * the new access protection. All memory accesses are still valid until the
+ * invalidate_range_end callback.
+ *
+ * - MMU_MUNLOCK: unlock memory. Content of page table stays the same but
+ * pages are unlocked.
+ *
+ * - MMU_MUNMAP: the range is being unmapped (outcome of a munmap syscall or
+ * process destruction). However, access is still allowed, up until the
+ * invalidate_range_free_pages callback. This also implies that secondary
+ * page table can be trimmed, because the address range is no longer valid.
+ *
+ * - MMU_WRITE_BACK: memory is being written back to disk, all write accesses
+ * must stop after invalidate_range_start callback returns. Read accesses are
+ * still allowed.
+ *
+ * - MMU_WRITE_PROTECT: memory is being write protected (ie should be mapped
+ * read only no matter what the vma memory protection allows). All write
+ * accesses must stop after invalidate_range_start callback returns. Read
+ * accesses are still allowed.
+ *
+ * If in doubt when adding a new notifier caller, please use MMU_MIGRATE,
+ * because it will always lead to reasonable behavior, but will not allow the
+ * listener a chance to optimize its events.
+ */
+enum mmu_event {
+ MMU_HSPLIT = 0,
+ MMU_ISDIRTY,
+ MMU_MIGRATE,
+ MMU_MPROT,
+ MMU_MUNLOCK,
+ MMU_MUNMAP,
+ MMU_WRITE_BACK,
+ MMU_WRITE_PROTECT,
+};
+
#ifdef CONFIG_MMU_NOTIFIER
/*
@@ -79,7 +139,8 @@ struct mmu_notifier_ops {
void (*change_pte)(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
- pte_t pte);
+ pte_t pte,
+ enum mmu_event event);
/*
* Before this is invoked any secondary MMU is still ok to
@@ -90,7 +151,8 @@ struct mmu_notifier_ops {
*/
void (*invalidate_page)(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long address);
+ unsigned long address,
+ enum mmu_event event);
/*
* invalidate_range_start() and invalidate_range_end() must be
@@ -137,10 +199,14 @@ struct mmu_notifier_ops {
*/
void (*invalidate_range_start)(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end);
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event);
void (*invalidate_range_end)(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end);
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event);
};
/*
@@ -179,13 +245,20 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
extern int __mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address);
extern void __mmu_notifier_change_pte(struct mm_struct *mm,
- unsigned long address, pte_t pte);
+ unsigned long address,
+ pte_t pte,
+ enum mmu_event event);
extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address);
+ unsigned long address,
+ enum mmu_event event);
extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end);
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event);
extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start, unsigned long end);
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event);
static inline void mmu_notifier_release(struct mm_struct *mm)
{
@@ -210,31 +283,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
}
static inline void mmu_notifier_change_pte(struct mm_struct *mm,
- unsigned long address, pte_t pte)
+ unsigned long address,
+ pte_t pte,
+ enum mmu_event event)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_change_pte(mm, address, pte);
+ __mmu_notifier_change_pte(mm, address, pte, event);
}
static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_page(mm, address);
+ __mmu_notifier_invalidate_page(mm, address, event);
}
static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_start(mm, start, end);
+ __mmu_notifier_invalidate_range_start(mm, start, end, event);
}
static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_end(mm, start, end);
+ __mmu_notifier_invalidate_range_end(mm, start, end, event);
}
static inline void mmu_notifier_mm_init(struct mm_struct *mm)
@@ -280,13 +360,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
* old page would remain mapped readonly in the secondary MMUs after the new
* page is already writable by some CPU through the primary MMU.
*/
-#define set_pte_at_notify(__mm, __address, __ptep, __pte) \
+#define set_pte_at_notify(__mm, __address, __ptep, __pte, __event) \
({ \
struct mm_struct *___mm = __mm; \
unsigned long ___address = __address; \
pte_t ___pte = __pte; \
\
- mmu_notifier_change_pte(___mm, ___address, ___pte); \
+ mmu_notifier_change_pte(___mm, ___address, ___pte, __event); \
set_pte_at(___mm, ___address, __ptep, ___pte); \
})
@@ -313,22 +393,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
}
static inline void mmu_notifier_change_pte(struct mm_struct *mm,
- unsigned long address, pte_t pte)
+ unsigned long address,
+ pte_t pte,
+ enum mmu_event event)
{
}
static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
}
static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
}
static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
}
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 1d0af8a..62d07e9 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -176,7 +176,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
/* For try_to_free_swap() and munlock_vma_page() below */
lock_page(page);
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
err = -EAGAIN;
ptep = page_check_address(page, mm, addr, &ptl, 0);
if (!ptep)
@@ -194,7 +195,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
flush_cache_page(vma, addr, pte_pfn(*ptep));
ptep_clear_flush(vma, addr, ptep);
- set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+ set_pte_at_notify(mm, addr, ptep,
+ mk_pte(kpage, vma->vm_page_prot),
+ MMU_MIGRATE);
page_remove_rmap(page);
if (!page_mapped(page))
@@ -208,7 +211,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
err = 0;
unlock:
mem_cgroup_cancel_charge(kpage, memcg);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
unlock_page(page);
return err;
}
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..a2b3f09 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -198,7 +198,7 @@ retry:
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
/* must invalidate_page _before_ freeing the page */
- mmu_notifier_invalidate_page(mm, address);
+ mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
page_cache_release(page);
}
}
diff --git a/mm/fremap.c b/mm/fremap.c
index 72b8fa3..37b2904 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -258,9 +258,9 @@ get_write_lock:
vma->vm_flags = vm_flags;
}
- mmu_notifier_invalidate_range_start(mm, start, start + size);
+ mmu_notifier_invalidate_range_start(mm, start, start + size, MMU_MUNMAP);
err = vma->vm_ops->remap_pages(vma, start, size, pgoff);
- mmu_notifier_invalidate_range_end(mm, start, start + size);
+ mmu_notifier_invalidate_range_end(mm, start, start + size, MMU_MUNMAP);
/*
* We can't clear VM_NONLINEAR because we'd have to do
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d9a21d06..e3efba5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1029,7 +1029,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);
ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1063,7 +1064,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
page_remove_rmap(page);
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
ret |= VM_FAULT_WRITE;
put_page(page);
@@ -1073,7 +1075,8 @@ out:
out_free_pages:
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
for (i = 0; i < HPAGE_PMD_NR; i++) {
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
@@ -1165,7 +1168,8 @@ alloc:
mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);
spin_lock(ptl);
if (page)
@@ -1197,7 +1201,8 @@ alloc:
}
spin_unlock(ptl);
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
out:
return ret;
out_unlock:
@@ -1632,7 +1637,8 @@ static int __split_huge_page_splitting(struct page *page,
const unsigned long mmun_start = address;
const unsigned long mmun_end = address + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_HSPLIT);
pmd = page_check_address_pmd(page, mm, address,
PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
if (pmd) {
@@ -1647,7 +1653,8 @@ static int __split_huge_page_splitting(struct page *page,
ret = 1;
spin_unlock(ptl);
}
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_HSPLIT);
return ret;
}
@@ -2470,7 +2477,8 @@ static void collapse_huge_page(struct mm_struct *mm,
mmun_start = address;
mmun_end = address + HPAGE_PMD_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
/*
* After this gup_fast can't run anymore. This also removes
@@ -2480,7 +2488,8 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
_pmd = pmdp_clear_flush(vma, address, pmd);
spin_unlock(pmd_ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
spin_lock(pte_ptl);
isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2871,24 +2880,28 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
again:
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_trans_huge(*pmd))) {
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
return;
}
if (is_huge_zero_pmd(*pmd)) {
__split_huge_zero_page_pmd(vma, haddr, pmd);
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
return;
}
page = pmd_page(*pmd);
VM_BUG_ON_PAGE(!page_count(page), page);
get_page(page);
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
split_huge_page(page);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index eeceeeb..ae98b53 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2560,7 +2560,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
mmun_start = vma->vm_start;
mmun_end = vma->vm_end;
if (cow)
- mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(src, mmun_start,
+ mmun_end, MMU_MIGRATE);
for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
@@ -2611,7 +2612,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
}
if (cow)
- mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(src, mmun_start,
+ mmun_end, MMU_MIGRATE);
return ret;
}
@@ -2637,7 +2639,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
BUG_ON(end & ~huge_page_mask(h));
tlb_start_vma(tlb, vma);
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
again:
for (address = start; address < end; address += sz) {
ptep = huge_pte_offset(mm, address);
@@ -2708,7 +2711,8 @@ unlock:
if (address < end && !ref_page)
goto again;
}
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
tlb_end_vma(tlb, vma);
}
@@ -2886,8 +2890,8 @@ retry_avoidcopy:
mmun_start = address & huge_page_mask(h);
mmun_end = mmun_start + huge_page_size(h);
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);
/*
* Retake the page table lock to check for racing updates
* before the page tables are altered
@@ -2907,7 +2911,8 @@ retry_avoidcopy:
new_page = old_page;
}
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);
out_release_all:
page_cache_release(new_page);
out_release_old:
@@ -3345,7 +3350,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
BUG_ON(address >= end);
flush_cache_range(vma, address, end);
- mmu_notifier_invalidate_range_start(mm, start, end);
+ mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
for (; address < end; address += huge_page_size(h)) {
spinlock_t *ptl;
@@ -3375,7 +3380,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
*/
flush_tlb_range(vma, start, end);
mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);
return pages << h->order;
}
diff --git a/mm/ksm.c b/mm/ksm.c
index fb75902..21d210b 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -872,7 +872,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
mmun_start = addr;
mmun_end = addr + PAGE_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_WRITE_PROTECT);
ptep = page_check_address(page, mm, addr, &ptl, 0);
if (!ptep)
@@ -904,7 +905,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
if (pte_dirty(entry))
set_page_dirty(page);
entry = pte_mkclean(pte_wrprotect(entry));
- set_pte_at_notify(mm, addr, ptep, entry);
+ set_pte_at_notify(mm, addr, ptep, entry, MMU_WRITE_PROTECT);
}
*orig_pte = *ptep;
err = 0;
@@ -912,7 +913,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
out_unlock:
pte_unmap_unlock(ptep, ptl);
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+ MMU_WRITE_PROTECT);
out:
return err;
}
@@ -948,7 +950,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
mmun_start = addr;
mmun_end = addr + PAGE_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);
ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!pte_same(*ptep, orig_pte)) {
@@ -961,7 +964,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
flush_cache_page(vma, addr, pte_pfn(*ptep));
ptep_clear_flush(vma, addr, ptep);
- set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+ set_pte_at_notify(mm, addr, ptep,
+ mk_pte(kpage, vma->vm_page_prot),
+ MMU_MIGRATE);
page_remove_rmap(page);
if (!page_mapped(page))
@@ -971,7 +976,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
pte_unmap_unlock(ptep, ptl);
err = 0;
out_mn:
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+ MMU_MIGRATE);
out:
return err;
}
diff --git a/mm/memory.c b/mm/memory.c
index ab3537b..f679eb45a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1050,7 +1050,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
mmun_end = end;
if (is_cow)
mmu_notifier_invalidate_range_start(src_mm, mmun_start,
- mmun_end);
+ mmun_end, MMU_MIGRATE);
ret = 0;
dst_pgd = pgd_offset(dst_mm, addr);
@@ -1067,7 +1067,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
} while (dst_pgd++, src_pgd++, addr = next, addr != end);
if (is_cow)
- mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
+ MMU_MIGRATE);
return ret;
}
@@ -1371,10 +1372,12 @@ void unmap_vmas(struct mmu_gather *tlb,
{
struct mm_struct *mm = vma->vm_mm;
- mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
+ mmu_notifier_invalidate_range_start(mm, start_addr,
+ end_addr, MMU_MUNMAP);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
- mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
+ mmu_notifier_invalidate_range_end(mm, start_addr,
+ end_addr, MMU_MUNMAP);
}
/**
@@ -1396,10 +1399,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
lru_add_drain();
tlb_gather_mmu(&tlb, mm, start, end);
update_hiwater_rss(mm);
- mmu_notifier_invalidate_range_start(mm, start, end);
+ mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
unmap_single_vma(&tlb, vma, start, end, details);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
tlb_finish_mmu(&tlb, start, end);
}
@@ -1422,9 +1425,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
lru_add_drain();
tlb_gather_mmu(&tlb, mm, address, end);
update_hiwater_rss(mm);
- mmu_notifier_invalidate_range_start(mm, address, end);
+ mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
unmap_single_vma(&tlb, vma, address, end, details);
- mmu_notifier_invalidate_range_end(mm, address, end);
+ mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
tlb_finish_mmu(&tlb, address, end);
}
@@ -2208,7 +2211,8 @@ gotten:
mmun_start = address & PAGE_MASK;
mmun_end = mmun_start + PAGE_SIZE;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
/*
* Re-check the pte - we dropped the lock
@@ -2240,7 +2244,7 @@ gotten:
* mmu page tables (such as kvm shadow page tables), we want the
* new page to be mapped directly into the secondary page table.
*/
- set_pte_at_notify(mm, address, page_table, entry);
+ set_pte_at_notify(mm, address, page_table, entry, MMU_MIGRATE);
update_mmu_cache(vma, address, page_table);
if (old_page) {
/*
@@ -2279,7 +2283,8 @@ gotten:
unlock:
pte_unmap_unlock(page_table, ptl);
if (mmun_end > mmun_start)
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
if (old_page) {
/*
* Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index f78ec9b..30417d5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1819,12 +1819,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
WARN_ON(PageLRU(new_page));
/* Recheck the target PMD */
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
fail_putback:
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
/* Reverse changes made by migrate_page_copy() */
if (TestClearPageActive(new_page))
@@ -1877,7 +1879,8 @@ fail_putback:
page_remove_rmap(page);
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
/* Take an "isolate" reference and put new page on the LRU. */
get_page(new_page);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 950813b..de039e4 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -141,8 +141,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
return young;
}
-void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
- pte_t pte)
+void __mmu_notifier_change_pte(struct mm_struct *mm,
+ unsigned long address,
+ pte_t pte,
+ enum mmu_event event)
{
struct mmu_notifier *mn;
int id;
@@ -150,13 +152,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->change_pte)
- mn->ops->change_pte(mn, mm, address, pte);
+ mn->ops->change_pte(mn, mm, address, pte, event);
}
srcu_read_unlock(&srcu, id);
}
void __mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
struct mmu_notifier *mn;
int id;
@@ -164,13 +167,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->invalidate_page)
- mn->ops->invalidate_page(mn, mm, address);
+ mn->ops->invalidate_page(mn, mm, address, event);
}
srcu_read_unlock(&srcu, id);
}
void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
+
{
struct mmu_notifier *mn;
int id;
@@ -178,14 +184,17 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->invalidate_range_start)
- mn->ops->invalidate_range_start(mn, mm, start, end);
+ mn->ops->invalidate_range_start(mn, mm, start,
+ end, event);
}
srcu_read_unlock(&srcu, id);
}
EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
{
struct mmu_notifier *mn;
int id;
@@ -193,7 +202,8 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
if (mn->ops->invalidate_range_end)
- mn->ops->invalidate_range_end(mn, mm, start, end);
+ mn->ops->invalidate_range_end(mn, mm, start,
+ end, event);
}
srcu_read_unlock(&srcu, id);
}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c43d557..886405b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -157,7 +157,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
/* invoke the mmu notifier if the pmd is populated */
if (!mni_start) {
mni_start = addr;
- mmu_notifier_invalidate_range_start(mm, mni_start, end);
+ mmu_notifier_invalidate_range_start(mm, mni_start,
+ end, MMU_MPROT);
}
if (pmd_trans_huge(*pmd)) {
@@ -185,7 +186,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
} while (pmd++, addr = next, addr != end);
if (mni_start)
- mmu_notifier_invalidate_range_end(mm, mni_start, end);
+ mmu_notifier_invalidate_range_end(mm, mni_start, end, MMU_MPROT);
if (nr_huge_updates)
count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index 05f1180..6827d2f 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -177,7 +177,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
mmun_start = old_addr;
mmun_end = old_end;
- mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
cond_resched();
@@ -228,7 +229,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
if (likely(need_flush))
flush_tlb_range(vma, old_end-len, old_addr);
- mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
+ mmun_end, MMU_MIGRATE);
return len + old_addr - old_end; /* how much done */
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 3e8491c..0b67e7d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -840,7 +840,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
pte_unmap_unlock(pte, ptl);
if (ret) {
- mmu_notifier_invalidate_page(mm, address);
+ mmu_notifier_invalidate_page(mm, address, MMU_WRITE_BACK);
(*cleaned)++;
}
out:
@@ -1128,6 +1128,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
spinlock_t *ptl;
int ret = SWAP_AGAIN;
enum ttu_flags flags = (enum ttu_flags)arg;
+ enum mmu_event event = MMU_MIGRATE;
+
+ if (flags & TTU_MUNLOCK)
+ event = MMU_MUNLOCK;
pte = page_check_address(page, mm, address, &ptl, 0);
if (!pte)
@@ -1233,7 +1237,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
out_unmap:
pte_unmap_unlock(pte, ptl);
if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
- mmu_notifier_invalidate_page(mm, address);
+ mmu_notifier_invalidate_page(mm, address, event);
out:
return ret;
@@ -1287,7 +1291,9 @@ out_mlock:
#define CLUSTER_MASK (~(CLUSTER_SIZE - 1))
static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
- struct vm_area_struct *vma, struct page *check_page)
+ struct vm_area_struct *vma,
+ struct page *check_page,
+ enum ttu_flags flags)
{
struct mm_struct *mm = vma->vm_mm;
pmd_t *pmd;
@@ -1301,6 +1307,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
unsigned long end;
int ret = SWAP_AGAIN;
int locked_vma = 0;
+ enum mmu_event event = MMU_MIGRATE;
+
+ if (flags & TTU_MUNLOCK)
+ event = MMU_MUNLOCK;
address = (vma->vm_start + cursor) & CLUSTER_MASK;
end = address + CLUSTER_SIZE;
@@ -1315,7 +1325,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
mmun_start = address;
mmun_end = end;
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
/*
* If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
@@ -1380,7 +1390,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
(*mapcount)--;
}
pte_unmap_unlock(pte - 1, ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
if (locked_vma)
up_read(&vma->vm_mm->mmap_sem);
return ret;
@@ -1436,7 +1446,9 @@ static int try_to_unmap_nonlinear(struct page *page,
while (cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
if (try_to_unmap_cluster(cursor, &mapcount,
- vma, page) == SWAP_MLOCK)
+ vma, page,
+ (enum ttu_flags)arg)
+ == SWAP_MLOCK)
ret = SWAP_MLOCK;
cursor += CLUSTER_SIZE;
vma->vm_private_data = (void *) cursor;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 33712fb..0ed3e88 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -262,7 +262,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ enum mmu_event event)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int need_tlb_flush, idx;
@@ -301,7 +302,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
- pte_t pte)
+ pte_t pte,
+ enum mmu_event event)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int idx;
@@ -317,7 +319,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int need_tlb_flush = 0, idx;
@@ -343,7 +346,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ enum mmu_event event)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
--
1.9.3
* Re: [RFC PATCH 1/6] mmu_notifier: add event information to address invalidation v4
2014-08-29 19:10 ` [RFC PATCH 1/6] mmu_notifier: add event information to address invalidation v4 j.glisse
@ 2014-09-11 10:00 ` Haggai Eran
2014-09-11 14:13 ` Jerome Glisse
0 siblings, 1 reply; 9+ messages in thread
From: Haggai Eran @ 2014-09-11 10:00 UTC (permalink / raw)
To: j.glisse, linux-kernel, linux-mm, akpm
Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
Dave Airlie, Brendan Conoboy, Joe Donohue, Duncan Poole,
Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan,
Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
Laurent Morichetti, Alexander Deucher, Oded Gabbay,
Jérôme Glisse
On 29/08/2014 22:10, j.glisse@gmail.com wrote:
> + * - MMU_MUNMAP: the range is being unmapped (outcome of a munmap syscall or
> + * process destruction). However, access is still allowed, up until the
> + * invalidate_range_free_pages callback. This also implies that secondary
> + * page table can be trimmed, because the address range is no longer valid.
I couldn't find the invalidate_range_free_pages callback. Is that a left over
from a previous version of the patch?
Also, I think that you have to invalidate the secondary PTEs of the range being
unmapped immediately, because put_page may be called immediately after the
invalidate_range_start returns.
Haggai
* Re: [RFC PATCH 1/6] mmu_notifier: add event information to address invalidation v4
2014-09-11 10:00 ` Haggai Eran
@ 2014-09-11 14:13 ` Jerome Glisse
0 siblings, 0 replies; 9+ messages in thread
From: Jerome Glisse @ 2014-09-11 14:13 UTC (permalink / raw)
To: Haggai Eran
Cc: linux-kernel, linux-mm, akpm, Linus Torvalds, joro, Mel Gorman,
H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy,
Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
Arvind Gopalakrishnan, Shachar Raindel, Liran Liss, Roland Dreier,
Ben Sander, Greg Stoner, John Bridgman, Michael Mantor,
Paul Blinzer, Laurent Morichetti, Alexander Deucher, Oded Gabbay,
Jérôme Glisse
On Thu, Sep 11, 2014 at 01:00:52PM +0300, Haggai Eran wrote:
> On 29/08/2014 22:10, j.glisse@gmail.com wrote:
> > + * - MMU_MUNMAP: the range is being unmapped (outcome of a munmap syscall or
> > + * process destruction). However, access is still allowed, up until the
> > + * invalidate_range_free_pages callback. This also implies that secondary
> > + * page table can be trimmed, because the address range is no longer valid.
>
> I couldn't find the invalidate_range_free_pages callback. Is that a left over
> from a previous version of the patch?
>
> Also, I think that you have to invalidate the secondary PTEs of the range being
> unmapped immediately, because put_page may be called immediately after the
> invalidate_range_start returns.
This is because the patchset was originally on top of a variation of another
patchset :
https://lkml.org/lkml/2014/9/9/601
In which invalidate_range_free_pages was a function call made right after the
cpu page table is updated but before pages are freed. Hence the comment was
right on top of that patchset, but on top of master you are right, this
comment is wrong.
Cheers,
Jerome
>
> Haggai
* [RFC PATCH 2/6] lib: lockless generic and arch independent page table (gpt).
2014-08-29 19:10 [RFC PATCH 0/6] HMM (heterogeneous memory management) v4 j.glisse
2014-08-29 19:10 ` [RFC PATCH 1/6] mmu_notifier: add event information to address invalidation v4 j.glisse
@ 2014-08-29 19:10 ` j.glisse
2014-08-29 19:10 ` [RFC PATCH 3/6] hmm: heterogeneous memory management v5 j.glisse
` (3 subsequent siblings)
5 siblings, 0 replies; 9+ messages in thread
From: j.glisse @ 2014-08-29 19:10 UTC (permalink / raw)
To: linux-kernel, linux-mm, akpm, Haggai Eran
Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
Dave Airlie, Brendan Conoboy, Joe Donohue, Duncan Poole,
Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan,
Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
Laurent Morichetti, Alexander Deucher, Oded Gabbay,
Jérôme Glisse
From: Jérôme Glisse <jglisse@redhat.com>
A page table is a common structure format most notably used by the cpu mmu.
The arch dependent page table code has strong ties to the architecture, which
makes it unsuitable for use by other non arch specific code.
This patch implements a generic and arch independent page table. It is generic
in the sense that the entry size can be any power of two smaller than PAGE_SIZE
and each entry can cover a page size that is a different power of two than
PAGE_SIZE.
It is lockless in the sense that at any point in time you can have concurrent
threads updating the page table (removing or changing entries) and faulting in
the page table (adding new entries). This is achieved by requiring each updater
and each faulter to take a range lock. There is no exclusion on the range lock,
ie several threads can fault or update the same range concurrently, and it is
the responsibility of the user to synchronize updates to the page table entries
(pte); updates to the page table directory (pdp) are under gpt responsibility.
The API usage pattern is :
gpt_init()
gpt_lock_update(lock_range)
// User can update pte, for instance by using atomic bit operations,
// allowing completely lockless updates.
gpt_unlock_update(lock_range)
gpt_lock_fault(lock_range)
// User can fault in pte but is responsible for preventing threads
// from concurrently faulting the same pte and for properly accounting
// the number of pte faulted in the pdp structure.
gpt_unlock_fault(lock_range)
// The newly faulted pte will only become visible to other updaters
// once all faulters unlock.
Details on how the lockless concurrent updaters and faulters work are provided
in the header file.
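For illustration only, a rough C sketch of the pattern above. This is not taken
from the patch: the gpt_lock contents and the exact lock helper signatures are
assumptions (the real definitions live in include/linux/gpt.h), and the gpt
struct is assumed to have been configured by the caller before gpt_init():
static int example_gpt_usage(struct gpt *gpt)
{
	struct gpt_lock lock;	/* assumed to describe an address range */
	int ret;
	ret = gpt_init(gpt);	/* caller filled page_shift, pde_size, ... */
	if (ret)
		return ret;
	/* Updater side: lock the range, then update pte locklessly. */
	gpt_lock_update(gpt, &lock);
	/* ... atomic bit operations on the pte covered by the range ... */
	gpt_unlock_update(gpt, &lock);
	/* Faulter side: lock the range for fault, then populate pte. */
	ret = gpt_lock_fault(gpt, &lock);
	if (!ret) {
		/* ... fill new pte, account them in the pdp structure ... */
		gpt_unlock_fault(gpt, &lock);
	}
	gpt_free(gpt);
	return ret;
}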
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
include/linux/gpt.h | 625 ++++++++++++++++++++++++++++++++++++
lib/Kconfig | 3 +
lib/Makefile | 2 +
lib/gpt.c | 897 ++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 1527 insertions(+)
create mode 100644 include/linux/gpt.h
create mode 100644 lib/gpt.c
diff --git a/include/linux/gpt.h b/include/linux/gpt.h
new file mode 100644
index 0000000..192935a
--- /dev/null
+++ b/include/linux/gpt.h
@@ -0,0 +1,625 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * High level overview
+ * -------------------
+ *
+ * This is a generic arch independent page table implementation with lockless
+ * (almost lockless) access. The contents of the page table, ie the page table
+ * entries, are not protected by the gpt helper; it is up to the code using gpt
+ * to protect the page table entries from concurrent updates, with no restriction
+ * on the mechanism (can be atomic or can sleep).
+ *
+ * The gpt code only deals with protecting the page directory tree structure,
+ * which is done in a lockless way. Concurrent threads can read and/or write
+ * overlapping ranges of the gpt. There can also be concurrent insertion and
+ * removal of page directories (insertion or removal of page table levels).
+ *
+ * While removal of a page directory is completely lockless, insertion of a new
+ * page directory still requires a lock (to avoid double insertion). If the
+ * architecture has a spinlock in its page struct then several threads can
+ * concurrently insert new directories (levels) as long as they are inserting into
+ * different page directories. Otherwise insertion will serialize using a common
+ * spinlock. Note that insertion in this context only refers to inserting page
+ * directories; it does not cover page table entry insertion, and again it is
+ * the responsibility of the gpt user to properly synchronize those.
+ *
+ *
+ * Each gpt access must be done under gpt lock protection by calling gpt_lock()
+ * with a lock structure. Once a range is "locked" with gpt_lock() all accesses
+ * can be done in a lockless fashion, using either the gpt_walk or gpt_iter helpers.
+ * Note however that only directories that are considered established will be
+ * visited, ie if a thread is concurrently inserting a new directory in the
+ * locked range then this directory will be ignored by gpt_walk or gpt_iter.
+ *
+ * This restriction comes from the lockless design. A thread can hold a gpt
+ * lock for a long time, but if it holds it for a period long enough some of the
+ * internal gpt counters (unsigned long) might wrap around, breaking all further
+ * accesses (though it is self healing after a period of time). So the access
+ * pattern to gpt should be :
+ * gpt_lock(gpt, lock)
+ * gpt_walk(gpt, lock, walk)
+ * gpt_unlock(gpt, lock)
+ *
+ * The walker callback can sleep, but for no longer than it would take for
+ * other threads to wrap around the internal gpt value through :
+ * gpt_lock_fault(gpt, lock)
+ * // user faulting in new pte
+ * gpt_unlock_fault(gpt, lock)
+ *
+ * The lockless design refers to gpt_lock() and gpt_unlock() taking a spinlock
+ * only for adding/removing the lock struct to the active lock list, ie no more
+ * than a few instructions in both cases, leaving little room for lock contention.
+ *
+ * Moreover there is no memory allocation during gpt_lock() or gpt_unlock() or
+ * gpt_walk(). The only constraint is that the lock struct must be the same for
+ * gpt_lock(), gpt_unlock() and gpt_walk() so gpt_lock struct might need to be
+ * allocated. This is however a small struct.
+ *
+ *
+ * Internal of gpt synchronization :
+ * ---------------------------------
+ *
+ * For the curious, here is how gpt page directory accesses are synchronized
+ * with each other.
+ *
+ * Each time a user wants to access a range of the gpt it must take a lock on
+ * the range using gpt_lock(). Each lock is associated with a serial number
+ * that is the current serial number for insertion (ie all new page directories
+ * will be assigned this serial number). Each lock holder will take a reference
+ * on each page directory that is older than its serial number (ie each page
+ * directory that is now considered alive).
+ *
+ * So page directories and lock holders form a timeline following a regular
+ * expression :
+ * (o*)h((r)*(a)*)*s
+ * With :
+ * o : old directory page
+ * h : oldest active lock holder (ie lock holder with the smallest serial
+ * number)
+ * r : recent directory page added after oldest lock holder
+ * a : recent lock holder (ie lock holder with a serial number greater than
+ * the oldest active lock holder serial number)
+ * s : current serial number ie the serial number that is used by any thread
+ * actively adding new page directories.
+ *
+ * So a few rules are in place: 's' will only increase once all threads using a
+ * serial number are done inserting new page directories. So a new page directory
+ * goes into the 'r' state once gpt_unlock_fault() is called.
+ *
+ * Now what is important is to keep the relationship between each lock holder
+ * ('h' and 'a' state) and each recent directory page ('r' state) intact so
+ * that when unlocking, only page directories that were referenced by the lock
+ * holder at lock time are unreferenced. This is simple: if the page directory
+ * serial number is older than the lock serial number then we know that this page
+ * directory was known at lock time and thus was referenced. Otherwise this is
+ * a recent directory page that was unknown (or skipped) during the lock operation.
+ *
+ * The same rules apply when walking the directory: gpt will only consider page
+ * directories that have an older serial number than the lock serial number.
+ *
+ * There are two issues that need to be solved. The first issue is wrap around of
+ * the serial number; it is solved by using unsigned long and the jiffies helpers
+ * for comparing time (simple sign trick).
+ *
+ * However this first issue implies a second one: some page directory might sit
+ * inside the gpt without being visited for a long period, so long that the
+ * current serial number wrapped around and is now smaller than this very old
+ * serial number, leading to invalid assumptions about this page directory.
+ *
+ * Trying to update old serial numbers is cumbersome and tricky with the lockless
+ * design of gpt. Not to mention this would need to happen at regular
+ * intervals.
+ *
+ * Instead the problem can be simplified by noticing that we only care about page
+ * directories that were added after the oldest active lock 'h'; everything added
+ * before was known to all gpt lock holders and thus referenced. So all that is
+ * needed is to keep track of serial numbers for recently added page directories
+ * and move those page directories to the old status once the oldest active lock
+ * serial number is past their serial number.
+ *
+ * To do this, gpt places new page directories on a list that is naturally sorted,
+ * as new page directories are added at the end. Each time gpt_unlock() is called
+ * the new oldest lock serial number is found, the new page directory list is
+ * traversed and entries that are now older than the oldest lock serial number
+ * are removed.
+ *
+ * So if a page directory is not on the recent list then lock holders know for
+ * sure they have to unreference it during gpt_unlock() or traverse it during
+ * gpt_walk().
+ *
+ * So an issue can only happen if a thread holds a lock long enough for the serial
+ * number to wrap around, which would block page directories on the recent list
+ * from being properly removed and lead to wrong assumptions about how old a
+ * directory page is.
+ *
+ * Page directory removal is easier: each page directory keeps a count of the
+ * number of valid entries it has and the number of locks that took a reference
+ * on it. So when this count drops to 0 the gpt code knows that no thread is trying
+ * to access this page directory nor does this page directory have any valid
+ * entries left, thus it can safely be removed. This uses atomic counters and rcu
+ * read sections for synchronization.
+ */
+#ifndef __LINUX_GPT_H
+#define __LINUX_GPT_H
+
+#include <linux/mm.h>
+#include <asm/types.h>
+
+#ifdef CONFIG_64BIT
+#define GPT_PDIR_NBITS ((unsigned long)PAGE_SHIFT - 3UL)
+#else
+#define GPT_PDIR_NBITS ((unsigned long)PAGE_SHIFT - 2UL)
+#endif
+#define GPT_PDIR_MASK ((1UL << GPT_PDIR_NBITS) - 1UL)
+
+struct gpt;
+struct gpt_lock;
+struct gpt_walk;
+struct gpt_iter;
+
+/* struct gpt_ops - generic page table operations.
+ *
+ * @lock_update: Lock address range for update.
+ * @unlock_update: Unlock address range after update.
+ * @lock_fault: Lock address range for fault.
+ * @unlock_fault: Unlock address range after fault.
+ *
+ * The generic page table uses locks held across update or fault operations to
+ * synchronize concurrent updater and faulter threads with each other. Because
+ * the generic page table is configurable it needs different functions depending
+ * on whether the page table is a flat one (ie just one level) or a tree one (ie
+ * several levels).
+ */
+struct gpt_ops {
+ void (*lock_update)(struct gpt *gpt, struct gpt_lock *lock);
+ void (*unlock_update)(struct gpt *gpt, struct gpt_lock *lock);
+ int (*lock_fault)(struct gpt *gpt, struct gpt_lock *lock);
+ void (*unlock_fault)(struct gpt *gpt, struct gpt_lock *lock);
+ int (*walk)(struct gpt_walk *walk,
+ struct gpt *gpt,
+ struct gpt_lock *lock);
+ bool (*iter_addr)(struct gpt_iter *iter, unsigned long addr);
+ bool (*iter_first)(struct gpt_iter *iter,
+ unsigned long start,
+ unsigned long end);
+};
+
+
+/* struct gpt_user_ops - generic page table user provided operations.
+ *
+ * @pde_from_pdp: Return the page directory entry that corresponds to a page
+ * directory page. This allows the user to use their own custom page directory
+ * entry format for all page directory levels.
+ */
+struct gpt_user_ops {
+ unsigned long (*pde_from_pdp)(struct gpt *gpt, struct page *pdp);
+};
+
+
+/* struct gpt - generic page table structure.
+ *
+ * @ops: Generic page table operations.
+ * @user_ops: User provided gpt operation (if null use default implementation).
+ * @pgd: Page global directory if multi level (tree page table).
+ * @pte: Page table entry if single level (flat page table).
+ * @faulters: List of all concurrent fault locks.
+ * @updaters: List of all concurrent update locks.
+ * @pdp_young: List of all young page directory pages.
+ * @pdp_free: List of all page directory pages to free (delayed free).
+ * @max_addr: Maximum address that can index this page table (inclusive).
+ * @min_serial: Oldest serial number used by the oldest updater.
+ * @updater_serial: Current serial number used for updaters.
+ * @faulter_serial: Current serial number used for faulters.
+ * @page_shift: The size as power of two of each table entry.
+ * @pde_size: Size of page directory entry (sizeof(long) for instance).
+ * @pfn_mask: Mask bit significant for page frame number of directory entry.
+ * @pfn_shift: Shift value to get the pfn from a page directory entry.
+ * @pfn_valid: Mask to know if a page directory entry is valid.
+ * @pgd_shift: Shift value to get the index inside the pgd from an address.
+ * @pld_shift: Page lower directory shift. This is shift value for the lowest
+ * page directory ie the page directory containing pfn of page table entry
+ * array. For instance if page_shift is 12 (4096 bytes) and each entry require
+ * 1 bytes then :
+ * pld_shift = 12 + (PAGE_SHIFT - 0) = 12 + PAGE_SHIFT.
+ * Now if each entry cover 4096 bytes and each entry require 2 bytes then :
+ * pld_shift = 12 + (PAGE_SHIFT - 1) = 11 + PAGE_SHIFT.
+ * @pte_mask: Mask bit significant for indexing page table entry (pte) array.
+ * @pte_shift: Size shift for each page table entry (0 if each entry is one
+ * byte, 1 if each entry is 2 bytes, ...).
+ * @lock: Lock protecting serial number and updaters/faulters list.
+ * @pgd_lock: Lock protecting pgd level (and all level if arch do not have room
+ * for spinlock inside its page struct).
+ */
+struct gpt {
+ const struct gpt_ops *ops;
+ const struct gpt_user_ops *user_ops;
+ union {
+ unsigned long *pgd;
+ void *pte;
+ };
+ struct list_head faulters;
+ struct list_head updaters;
+ struct list_head pdp_young;
+ struct list_head pdp_free;
+ unsigned long max_addr;
+ unsigned long min_serial;
+ unsigned long faulter_serial;
+ unsigned long updater_serial;
+ unsigned long page_shift;
+ unsigned long pde_size;
+ unsigned long pfn_invalid;
+ unsigned long pfn_mask;
+ unsigned long pfn_shift;
+ unsigned long pfn_valid;
+ unsigned long pgd_shift;
+ unsigned long pld_shift;
+ unsigned long pte_mask;
+ unsigned long pte_shift;
+ unsigned gfp_flags;
+ spinlock_t lock;
+ spinlock_t pgd_lock;
+};
+
+void gpt_free(struct gpt *gpt);
+int gpt_init(struct gpt *gpt);
+
+static inline unsigned long gpt_align_start_addr(struct gpt *gpt,
+ unsigned long addr)
+{
+ return addr & ~((1UL << gpt->page_shift) - 1UL);
+}
+
+static inline unsigned long gpt_align_end_addr(struct gpt *gpt,
+ unsigned long addr)
+{
+ return addr | ((1UL << gpt->page_shift) - 1UL);
+}
+
+static inline unsigned long gpt_pdp_shift(struct gpt *gpt, struct page *pdp)
+{
+ if (!pdp)
+ return gpt->pgd_shift;
+ return pdp->flags & 0xff;
+}
+
+static inline unsigned long gpt_pdp_start(struct gpt *gpt, struct page *pdp)
+{
+ if (!pdp)
+ return 0UL;
+ return pdp->index;
+}
+
+static inline unsigned long gpt_pdp_end(struct gpt *gpt, struct page *pdp)
+{
+ if (!pdp)
+ return gpt->max_addr;
+ return pdp->index + (1UL << gpt_pdp_shift(gpt, pdp)) - 1UL;
+}
+
+static inline bool gpt_pdp_cover_addr(struct gpt *gpt,
+ struct page *pdp,
+ unsigned long addr)
+{
+ return (addr >= gpt_pdp_start(gpt, pdp)) &&
+ (addr <= gpt_pdp_end(gpt, pdp));
+}
+
+static inline bool gpt_pde_valid(struct gpt *gpt, unsigned long pde)
+{
+ return (pde & gpt->pfn_valid) && !(pde & gpt->pfn_invalid);
+}
+
+static inline struct page *gpt_pde_pdp(struct gpt *gpt,
+ volatile unsigned long *pde)
+{
+ unsigned long tmp = *pde;
+
+ if (!gpt_pde_valid(gpt, tmp))
+ return NULL;
+
+ return pfn_to_page((tmp & gpt->pfn_mask) >> gpt->pfn_shift);
+}
+
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+static inline void gpt_pdp_lock(struct gpt *gpt, struct page *pdp)
+{
+ if (pdp)
+ spin_lock(&pdp->ptl);
+ else
+ spin_lock(&gpt->pgd_lock);
+}
+
+static inline void gpt_pdp_unlock(struct gpt *gpt, struct page *pdp)
+{
+ if (pdp)
+ spin_unlock(&pdp->ptl);
+ else
+ spin_unlock(&gpt->pgd_lock);
+}
+#else /* USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS */
+static inline void gpt_pdp_lock(struct gpt *gpt, struct page *pdp)
+{
+ spin_lock(&gpt->pgd_lock);
+}
+
+static inline void gpt_pdp_unlock(struct gpt *gpt, struct page *pdp)
+{
+ spin_unlock(&gpt->pgd_lock);
+}
+#endif /* USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS */
+
+static inline void gpt_ptp_ref(struct gpt *gpt, struct page *ptp)
+{
+ if (ptp)
+ atomic_inc(&ptp->_mapcount);
+}
+
+static inline void gpt_ptp_unref(struct gpt *gpt, struct page *ptp)
+{
+ if (ptp && atomic_dec_and_test(&ptp->_mapcount))
+ BUG();
+}
+
+static inline void gpt_ptp_batch_ref(struct gpt *gpt, struct page *ptp, int n)
+{
+ if (ptp)
+ atomic_add(n, &ptp->_mapcount);
+}
+
+static inline void gpt_ptp_batch_unref(struct gpt *gpt,
+ struct page *ptp,
+ int n)
+{
+ if (ptp && atomic_sub_and_test(n, &ptp->_mapcount))
+ BUG();
+}
+
+static inline void *gpt_ptp_pte(struct gpt *gpt, struct page *ptp)
+{
+ if (!ptp)
+ return gpt->pte;
+ return page_address(ptp);
+}
+
+static inline void *gpt_pte_from_addr(struct gpt *gpt,
+ struct page *ptp,
+ void *pte,
+ unsigned long addr)
+{
+ addr = ((addr & gpt->pte_mask) >> gpt->page_shift) << gpt->pte_shift;
+ if (!ptp)
+ return gpt->pte + addr;
+ return (void *)(((unsigned long)pte & PAGE_MASK) + addr);
+}
+
+
+/* struct gpt_lock - generic page table range lock structure.
+ *
+ * @list: List struct for the active lock holder lists.
+ * @start: Start address of the locked range (inclusive).
+ * @end: End address of the locked range (inclusive).
+ * @serial: Serial number associated with that lock.
+ * @hold: Callback telling whether a page directory page is covered (held) by
+ * this lock.
+ *
+ * Before any read/update access to a range of the generic page table, it must
+ * be locked to synchronize with concurrent read/update and insertion. In most
+ * cases locking will complete by only taking one spinlock protecting the
+ * struct insertion in the active lock holder list (either the updaters or the
+ * faulters list, depending on whether gpt_lock_update() or gpt_lock_fault()
+ * was called). An illustrative usage sketch follows the inline helpers below.
+ */
+struct gpt_lock {
+ struct list_head list;
+ unsigned long start;
+ unsigned long end;
+ unsigned long serial;
+ bool (*hold)(struct gpt_lock *lock, struct page *pdp);
+};
+
+static inline int gpt_lock_update(struct gpt *gpt, struct gpt_lock *lock)
+{
+ lock->start = gpt_align_start_addr(gpt, lock->start);
+ lock->end = gpt_align_end_addr(gpt, lock->end);
+ if ((lock->start >= lock->end) || (lock->end > gpt->max_addr))
+ return -EINVAL;
+
+ if (gpt->ops->lock_update)
+ gpt->ops->lock_update(gpt, lock);
+ return 0;
+}
+
+static inline void gpt_unlock_update(struct gpt *gpt, struct gpt_lock *lock)
+{
+ if ((lock->start >= lock->end) || (lock->end > gpt->max_addr))
+ BUG();
+ if (list_empty(&lock->list))
+ BUG();
+
+ if (gpt->ops->unlock_update)
+ gpt->ops->unlock_update(gpt, lock);
+}
+
+static inline int gpt_lock_fault(struct gpt *gpt, struct gpt_lock *lock)
+{
+ lock->start = gpt_align_start_addr(gpt, lock->start);
+ lock->end = gpt_align_end_addr(gpt, lock->end);
+ if ((lock->start >= lock->end) || (lock->end > gpt->max_addr))
+ return -EINVAL;
+
+ if (gpt->ops->lock_fault)
+ return gpt->ops->lock_fault(gpt, lock);
+ return 0;
+}
+
+static inline void gpt_unlock_fault(struct gpt *gpt, struct gpt_lock *lock)
+{
+ if ((lock->start >= lock->end) || (lock->end > gpt->max_addr))
+ BUG();
+ if (list_empty(&lock->list))
+ BUG();
+
+ if (gpt->ops->unlock_fault)
+ gpt->ops->unlock_fault(gpt, lock);
+}
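+
+/*
+ * Example (editor's sketch, not mandated by this patch): a typical updater
+ * serializes against concurrent faulters by bracketing its work with the
+ * update lock. The clear_range() function below is hypothetical and only
+ * illustrates the expected sequence.
+ *
+ *     static int clear_range(struct gpt *gpt, unsigned long start,
+ *                            unsigned long end)
+ *     {
+ *         struct gpt_lock lock;
+ *         int ret;
+ *
+ *         lock.start = start;
+ *         lock.end = end;
+ *         ret = gpt_lock_update(gpt, &lock);
+ *         if (ret)
+ *             return ret;
+ *         // ... walk or iterate over [start, end] and update entries ...
+ *         gpt_unlock_update(gpt, &lock);
+ *         return 0;
+ *     }
+ */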
+
+
+/* struct gpt_walk - generic page table range walker structure.
+ *
+ * @pte: Callback invoked on the page table entries of each lowest level
+ * directory in the range.
+ * @pde: Callback invoked on a directory's entries before they are walked.
+ * @pde_post: Callback invoked on a directory's entries after they are walked.
+ * @lock: The lock protecting this walker.
+ * @start: Start address of the walked range (inclusive).
+ * @end: End address of the walked range (inclusive).
+ * @data: Private data pointer for the callbacks.
+ *
+ * This is similar to the cpu page table walker. It allows walking a range of
+ * the generic page table. Note that a gpt walk does not imply protection,
+ * hence you must lock the range (gpt_lock_update() or gpt_lock_fault()) prior
+ * to using gpt_walk() if you want to walk it safely, as otherwise you will be
+ * open to all kinds of synchronization issues.
+ */
+struct gpt_walk {
+ int (*pte)(struct gpt *gpt,
+ struct gpt_walk *walk,
+ struct page *pdp,
+ void *pte,
+ unsigned long start,
+ unsigned long end);
+ int (*pde)(struct gpt *gpt,
+ struct gpt_walk *walk,
+ struct page *pdp,
+ volatile unsigned long *pde,
+ unsigned long start,
+ unsigned long end,
+ unsigned long shift);
+ int (*pde_post)(struct gpt *gpt,
+ struct gpt_walk *walk,
+ struct page *pdp,
+ volatile unsigned long *pde,
+ unsigned long start,
+ unsigned long end,
+ unsigned long shift);
+ struct gpt_lock *lock;
+ unsigned long start;
+ unsigned long end;
+ void *data;
+};
+
+static inline int gpt_walk(struct gpt_walk *walk,
+ struct gpt *gpt,
+ struct gpt_lock *lock)
+{
+ if ((lock->start >= lock->end) || (lock->end > gpt->max_addr))
+ return -EINVAL;
+ if (list_empty(&lock->list))
+ return -EINVAL;
+ if (!((lock->start <= walk->start) && (lock->end >= walk->end)))
+ return -EINVAL;
+
+ walk->lock = lock;
+ return gpt->ops->walk(walk, gpt, lock);
+}
+
+
+/* struct gpt_iter - generic page table range iterator structure.
+ *
+ * @gpt: The generic page table structure.
+ * @lock: The lock protecting this iterator.
+ * @ptp: Current page directory page.
+ * @pte: Pointer to the current page table entry (inside ptp).
+ * @pte_addr: Address corresponding to the current page table entry.
+ *
+ * This allows iterating over a range of the generic page table. The range you
+ * want to iterate over must be locked. First call gpt_iter_init() and
+ * gpt_iter_first(), then simply call gpt_iter_next(), which returns false
+ * once you reach the end. A usage sketch follows the inline helpers below.
+ */
+struct gpt_iter {
+ struct gpt *gpt;
+ struct gpt_lock *lock;
+ struct page *ptp;
+ void *pte;
+ unsigned long pte_addr;
+};
+
+static inline void gpt_iter_init(struct gpt_iter *iter,
+ struct gpt *gpt,
+ struct gpt_lock *lock)
+{
+ iter->gpt = gpt;
+ iter->lock = lock;
+ iter->ptp = NULL;
+ iter->pte = NULL;
+}
+
+static inline bool gpt_iter_addr(struct gpt_iter *iter, unsigned long addr)
+{
+ addr = gpt_align_start_addr(iter->gpt, addr);
+ if ((addr < iter->lock->start) || (addr > iter->lock->end)) {
+ iter->ptp = NULL;
+ iter->pte = NULL;
+ return false;
+ }
+ if (iter->pte && gpt_pdp_cover_addr(iter->gpt, iter->ptp, addr)) {
+ struct gpt *gpt = iter->gpt;
+
+ iter->pte = gpt_pte_from_addr(gpt, iter->ptp, iter->pte, addr);
+ iter->pte_addr = addr;
+ return true;
+ }
+
+ return iter->gpt->ops->iter_addr(iter, addr);
+}
+
+static inline bool gpt_iter_first(struct gpt_iter *iter,
+ unsigned long start,
+ unsigned long end)
+{
+ start = gpt_align_start_addr(iter->gpt, start);
+ end = gpt_align_end_addr(iter->gpt, end);
+ if (!((start >= iter->lock->start) && (end <= iter->lock->end))) {
+ iter->ptp = NULL;
+ iter->pte = NULL;
+ return false;
+ }
+ if (iter->pte && gpt_pdp_cover_addr(iter->gpt, iter->ptp, start)) {
+ struct gpt *gpt = iter->gpt;
+
+ iter->pte = gpt_pte_from_addr(gpt, iter->ptp, iter->pte, start);
+ iter->pte_addr = start;
+ return true;
+ }
+
+ return iter->gpt->ops->iter_first(iter, start, end);
+}
+
+static inline bool gpt_iter_next(struct gpt_iter *iter)
+{
+ unsigned long start;
+
+ if (!iter->pte) {
+ iter->ptp = NULL;
+ iter->pte = NULL;
+ return false;
+ }
+
+ start = iter->pte_addr + (1UL << iter->gpt->page_shift);
+ if (start > iter->lock->end) {
+ iter->ptp = NULL;
+ iter->pte = NULL;
+ return false;
+ }
+
+ return iter->gpt->ops->iter_first(iter, start, iter->lock->end);
+}
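+
+/*
+ * Example (editor's sketch): iterating over a locked range. It assumes the
+ * caller already holds an update or fault lock covering [start, end].
+ *
+ *     struct gpt_iter iter;
+ *     bool found;
+ *
+ *     gpt_iter_init(&iter, gpt, lock);
+ *     for (found = gpt_iter_first(&iter, start, end); found;
+ *          found = gpt_iter_next(&iter)) {
+ *         // iter.pte points to the entry for address iter.pte_addr
+ *     }
+ */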
+
+
+#endif /* __LINUX_GPT_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index a5ce0c7..176be56 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -515,4 +515,7 @@ source "lib/fonts/Kconfig"
config ARCH_HAS_SG_CHAIN
def_bool n
+config GENERIC_PAGE_TABLE
+ bool
+
endmenu
diff --git a/lib/Makefile b/lib/Makefile
index d6b4bc4..eae4516 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -196,3 +196,5 @@ quiet_cmd_build_OID_registry = GEN $@
clean-files += oid_registry_data.c
obj-$(CONFIG_UCS2_STRING) += ucs2_string.o
+
+obj-$(CONFIG_GENERIC_PAGE_TABLE) += gpt.o
diff --git a/lib/gpt.c b/lib/gpt.c
new file mode 100644
index 0000000..5d82777
--- /dev/null
+++ b/lib/gpt.c
@@ -0,0 +1,897 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* Generic arch independent page table implementation. See include/linux/gpt.h
+ * for further information on the design.
+ */
+#include <linux/gpt.h>
+#include <linux/highmem.h>
+#include <linux/slab.h>
+
+
+/*
+ * Generic page table operations for page table with multi directory level.
+ */
+static inline struct page *gpt_pdp_upper_pdp(struct gpt *gpt,
+ struct page *pdp)
+{
+ if (!pdp)
+ return NULL;
+ return pdp->s_mem;
+}
+
+static inline unsigned long *gpt_pdp_upper_pde(struct gpt *gpt,
+ struct page *pdp)
+{
+ unsigned long idx;
+ struct page *updp;
+
+ if (!pdp)
+ return NULL;
+
+ updp = gpt_pdp_upper_pdp(gpt, pdp);
+ idx = gpt_pdp_start(gpt, pdp) - gpt_pdp_start(gpt, updp);
+ idx = (idx >> gpt_pdp_shift(gpt, pdp)) & GPT_PDIR_MASK;
+ if (!updp) {
+ return gpt->pgd + idx;
+ }
+ return ((unsigned long *)page_address(updp)) + idx;
+}
+
+static inline bool gpt_pdp_before_serial(struct page *pdp, unsigned long serial)
+{
+ /*
+ * To know if a page directory is new or old we first check whether it is on
+ * the recently added list. If it is and its serial number is newer than or
+ * equal to our lock serial number, then it is a new page directory entry and
+ * must be ignored.
+ */
+ return list_empty(&pdp->lru) || time_after(serial, pdp->private);
+}
+
+static inline bool gpt_pdp_before_eq_serial(struct page *pdp, unsigned long serial)
+{
+ /*
+ * To know if a page directory is new or old we first check whether it is on
+ * the recently added list. If it is and its serial number is newer than or
+ * equal to our lock serial number, then it is a new page directory entry and
+ * must be ignored.
+ */
+ return list_empty(&pdp->lru) || time_after_eq(serial, pdp->private);
+}
+
+static int gpt_walk_pde(struct gpt *gpt,
+ struct gpt_walk *walk,
+ struct page *pdp,
+ volatile unsigned long *pde,
+ unsigned long start,
+ unsigned long end,
+ unsigned long shift)
+{
+ unsigned long addr, idx, lshift, mask, next, npde;
+ int ret;
+
+ npde = ((end - start) >> shift) + 1;
+ mask = ~((1UL << shift) - 1UL);
+ lshift = shift - GPT_PDIR_NBITS;
+
+ if (walk->pde) {
+ ret = walk->pde(gpt, walk, pdp, pde, start, end, shift);
+ if (ret)
+ return ret;
+ }
+
+ for (addr = start, idx = 0; idx < npde; addr = next + 1UL, ++idx) {
+ struct page *lpdp;
+ void *lpde;
+
+ next = min((addr & mask) + (1UL << shift) - 1UL, end);
+ lpdp = gpt_pde_pdp(gpt, &pde[idx]);
+ if (!lpdp || !walk->lock->hold(walk->lock, lpdp))
+ continue;
+ lpde = page_address(lpdp);
+ if (lshift >= gpt->pld_shift) {
+ lpde += ((addr >> lshift) & GPT_PDIR_MASK) *
+ gpt->pde_size;
+ ret = gpt_walk_pde(gpt, walk, lpdp, lpde,
+ addr, next, lshift);
+ if (ret)
+ return ret;
+ } else if (walk->pte) {
+ lpde = gpt_pte_from_addr(gpt, lpdp, lpde, addr);
+ ret = walk->pte(gpt, walk, lpdp, lpde, addr, next);
+ if (ret)
+ return ret;
+ }
+ }
+
+ if (walk->pde_post) {
+ ret = walk->pde_post(gpt, walk, pdp, pde, start, end, shift);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+static int gpt_tree_walk(struct gpt_walk *walk,
+ struct gpt *gpt,
+ struct gpt_lock *lock)
+{
+ unsigned long idx;
+
+ walk->start = gpt_align_start_addr(gpt, walk->start);
+ walk->end = gpt_align_end_addr(gpt, walk->end);
+ if ((walk->start >= walk->end) || (walk->end > gpt->max_addr))
+ return -EINVAL;
+
+ idx = (walk->start >> gpt->pgd_shift);
+ return gpt_walk_pde(gpt, walk, NULL, &gpt->pgd[idx],
+ walk->start, walk->end,
+ gpt->pgd_shift);
+}
+
+struct gpt_lock_walk {
+ struct list_head pdp_to_free;
+ struct gpt_lock *lock;
+ unsigned long locked[BITS_TO_LONGS(1UL << GPT_PDIR_NBITS)];
+};
+
+static inline void gpt_pdp_unref(struct gpt *gpt,
+ struct page *pdp,
+ struct gpt_lock_walk *wlock,
+ struct page *updp,
+ volatile unsigned long *upde)
+{
+ if (!atomic_dec_and_test(&pdp->_mapcount))
+ return;
+
+ *upde = 0;
+ if (!list_empty(&pdp->lru)) {
+ /*
+ * This should be a rare event, it means this page directory
+ * was added recently but is about to be destroyed before it
+ * could be removed from the young list.
+ *
+ * Because it is in the young list and lock holders can access
+ * the page table without rcu protection, we cannot rely on
+ * synchronize_rcu() to know when it is safe to free the page.
+ * We have to wait for all locks that are older than this page
+ * directory to be released. Only once we reach that point do
+ * we know for sure that no thread can have a live reference on
+ * that page directory.
+ */
+ spin_lock(&gpt->pgd_lock);
+ list_add_tail(&pdp->lru, &gpt->pdp_free);
+ spin_unlock(&gpt->pgd_lock);
+ } else
+ /*
+ * This means this is an old page directory and thus any lock
+ * holder that might dereference a pointer to it would have a
+ * reference on it. Hence when the refcount reaches 0 we know
+ * for sure no lock holder will dereference this page directory
+ * and thus synchronize_rcu() is a long enough delay before
+ * freeing it.
+ */
+ list_add_tail(&pdp->lru, &wlock->pdp_to_free);
+
+ /* Un-account this entry, the caller must hold a ref on pdp. */
+ if (updp && atomic_dec_and_test(&updp->_mapcount))
+ BUG();
+}
+
+static void gpt_lock_walk_free_pdp(struct gpt_lock_walk *wlock)
+{
+ struct page *pdp, *tmp;
+
+ if (list_empty(&wlock->pdp_to_free))
+ return;
+
+ synchronize_rcu();
+
+ list_for_each_entry_safe(pdp, tmp, &wlock->pdp_to_free, lru) {
+ /* Restore page struct fields to their expected values. */
+ list_del(&pdp->lru);
+ atomic_dec(&pdp->_mapcount);
+ pdp->mapping = NULL;
+ pdp->index = 0;
+ pdp->flags &= (~0xffUL);
+ __free_page(pdp);
+ }
+}
+
+static bool gpt_lock_update_hold(struct gpt_lock *lock, struct page *pdp)
+{
+ if (!atomic_read(&pdp->_mapcount))
+ return false;
+ if (!gpt_pdp_before_serial(pdp, lock->serial))
+ return false;
+ return true;
+}
+
+static void gpt_lock_walk_update_finish(struct gpt *gpt,
+ struct gpt_lock_walk *wlock)
+{
+ struct gpt_lock *lock = wlock->lock;
+ unsigned long min_serial;
+
+ spin_lock(&gpt->lock);
+ min_serial = gpt->min_serial;
+ list_del_init(&lock->list);
+ lock = list_first_entry_or_null(&gpt->updaters, struct gpt_lock, list);
+ gpt->min_serial = lock ? lock->serial : gpt->updater_serial;
+ spin_unlock(&gpt->lock);
+
+ /*
+ * Drain the young pdp list if the new smallest serial lock holder is
+ * different from previous one.
+ */
+ if (gpt->min_serial != min_serial) {
+ struct page *pdp, *next;
+
+ spin_lock(&gpt->pgd_lock);
+ list_for_each_entry_safe(pdp, next, &gpt->pdp_young, lru) {
+ if (!gpt_pdp_before_serial(pdp, gpt->min_serial))
+ break;
+ list_del_init(&pdp->lru);
+ }
+ list_for_each_entry_safe(pdp, next, &gpt->pdp_free, lru) {
+ if (!gpt_pdp_before_serial(pdp, gpt->min_serial))
+ break;
+ list_del(&pdp->lru);
+ list_add_tail(&pdp->lru, &wlock->pdp_to_free);
+ }
+ spin_unlock(&gpt->pgd_lock);
+ }
+}
+
+static int gpt_pde_lock_update(struct gpt *gpt,
+ struct gpt_walk *walk,
+ struct page *pdp,
+ volatile unsigned long *pde,
+ unsigned long start,
+ unsigned long end,
+ unsigned long shift)
+{
+ unsigned long addr, idx, mask, next, npde;
+ struct gpt_lock_walk *wlock = walk->data;
+ struct gpt_lock *lock = wlock->lock;
+
+ npde = ((end - start) >> shift) + 1;
+ mask = ~((1UL << shift) - 1UL);
+
+ rcu_read_lock();
+ for (addr = start, idx = 0; idx < npde; addr = next + 1UL, ++idx) {
+ struct page *lpdp;
+
+ next = min((addr & mask) + (1UL << shift) - 1UL, end);
+ lpdp = gpt_pde_pdp(gpt, &pde[idx]);
+ if (!lpdp)
+ continue;
+ if (!atomic_inc_not_zero(&lpdp->_mapcount)) {
+ /*
+ * Force the page directory entry to zero; we know for
+ * sure that some other thread is deleting this entry,
+ * so it is safe to double clear the pde.
+ */
+ pde[idx] = 0;
+ continue;
+ }
+
+ if (!gpt_pdp_before_serial(lpdp, lock->serial)) {
+ /* This is a new entry; drop the reference and ignore it. */
+ gpt_pdp_unref(gpt, lpdp, wlock, pdp, &pde[idx]);
+ continue;
+ }
+ set_bit(idx, wlock->locked);
+ }
+ rcu_read_unlock();
+
+ for (addr = start, idx = 0; idx < npde; addr = next + 1UL, ++idx) {
+ struct page *lpdp;
+
+ next = min((addr & mask) + (1UL << shift) - 1UL, end);
+ if (!test_bit(idx, wlock->locked))
+ continue;
+ clear_bit(idx, wlock->locked);
+ lpdp = gpt_pde_pdp(gpt, &pde[idx]);
+ kmap(lpdp);
+ }
+
+ return 0;
+}
+
+static void gpt_tree_lock_update(struct gpt *gpt, struct gpt_lock *lock)
+{
+ struct gpt_lock_walk wlock;
+ struct gpt_walk walk;
+
+ lock->hold = &gpt_lock_update_hold;
+ spin_lock(&gpt->lock);
+ lock->serial = gpt->updater_serial;
+ list_add_tail(&lock->list, &gpt->updaters);
+ spin_unlock(&gpt->lock);
+
+ bitmap_zero(wlock.locked, 1UL << GPT_PDIR_NBITS);
+ INIT_LIST_HEAD(&wlock.pdp_to_free);
+ wlock.lock = lock;
+ walk.lock = lock;
+ walk.data = &wlock;
+ walk.pde = &gpt_pde_lock_update;
+ walk.pde_post = NULL;
+ walk.pte = NULL;
+ walk.start = lock->start;
+ walk.end = lock->end;
+
+ gpt_tree_walk(&walk, gpt, lock);
+ gpt_lock_walk_free_pdp(&wlock);
+}
+
+static int gpt_pde_unlock_update(struct gpt *gpt,
+ struct gpt_walk *walk,
+ struct page *pdp,
+ volatile unsigned long *pde,
+ unsigned long start,
+ unsigned long end,
+ unsigned long shift)
+{
+ unsigned long addr, idx, mask, next, npde;
+ struct gpt_lock_walk *wlock = walk->data;
+ struct gpt_lock *lock = wlock->lock;
+
+ npde = ((end - start) >> shift) + 1;
+ mask = ~((1UL << shift) - 1UL);
+
+ for (addr = start, idx = 0; idx < npde; addr = next + 1UL, ++idx) {
+ struct page *lpdp;
+
+ next = min((addr & mask) + (1UL << shift) - 1UL, end);
+ lpdp = gpt_pde_pdp(gpt, &pde[idx]);
+ if (!lpdp || !gpt_pdp_before_serial(lpdp, lock->serial))
+ continue;
+ kunmap(lpdp);
+ gpt_pdp_unref(gpt, lpdp, wlock, pdp, &pde[idx]);
+ }
+
+ return 0;
+}
+
+static void gpt_tree_unlock_update(struct gpt *gpt, struct gpt_lock *lock)
+{
+ struct gpt_lock_walk wlock;
+ struct gpt_walk walk;
+
+ bitmap_zero(wlock.locked, 1UL << GPT_PDIR_NBITS);
+ INIT_LIST_HEAD(&wlock.pdp_to_free);
+ wlock.lock = lock;
+ walk.lock = lock;
+ walk.data = &wlock;
+ walk.pde = NULL;
+ walk.pde_post = &gpt_pde_unlock_update;
+ walk.pte = NULL;
+ walk.start = lock->start;
+ walk.end = lock->end;
+
+ gpt_tree_walk(&walk, gpt, lock);
+
+ gpt_lock_walk_update_finish(gpt, &wlock);
+ gpt_lock_walk_free_pdp(&wlock);
+}
+
+static bool gpt_lock_fault_hold(struct gpt_lock *lock, struct page *pdp)
+{
+ if (!atomic_read(&pdp->_mapcount))
+ return false;
+ if (!gpt_pdp_before_eq_serial(pdp, lock->serial))
+ return false;
+ return true;
+}
+
+static void gpt_lock_walk_fault_finish(struct gpt *gpt,
+ struct gpt_lock_walk *wlock)
+{
+ struct gpt_lock *lock = wlock->lock;
+
+ spin_lock(&gpt->lock);
+ list_del_init(&lock->list);
+ lock = list_first_entry_or_null(&gpt->faulters, struct gpt_lock, list);
+ if (lock)
+ gpt->updater_serial = lock->serial;
+ else
+ gpt->updater_serial = gpt->faulter_serial;
+ spin_unlock(&gpt->lock);
+}
+
+static int gpt_pde_lock_fault(struct gpt *gpt,
+ struct gpt_walk *walk,
+ struct page *pdp,
+ volatile unsigned long *pde,
+ unsigned long start,
+ unsigned long end,
+ unsigned long shift)
+{
+ unsigned long addr, c, idx, mask, next, npde;
+ struct gpt_lock_walk *wlock = walk->data;
+ struct gpt_lock *lock = wlock->lock;
+ struct list_head lpdp_new, lpdp_added;
+ struct page *lpdp, *tmp;
+ int ret;
+
+ npde = ((end - start) >> shift) + 1;
+ mask = ~((1UL << shift) - 1UL);
+ INIT_LIST_HEAD(&lpdp_added);
+ INIT_LIST_HEAD(&lpdp_new);
+
+ rcu_read_lock();
+ for (addr = start, c = idx = 0; idx < npde; addr = next + 1UL, ++idx) {
+
+ next = min((addr & mask) + (1UL << shift) - 1UL, end);
+ lpdp = gpt_pde_pdp(gpt, &pde[idx]);
+ if (lpdp == NULL) {
+ c++;
+ continue;
+ }
+ if (!atomic_inc_not_zero(&lpdp->_mapcount)) {
+ /*
+ * Force the page directory entry to zero; we know for
+ * sure that some other thread is deleting this entry,
+ * so it is safe to double clear the pde.
+ */
+ c++;
+ pde[idx] = 0;
+ continue;
+ }
+ set_bit(idx, wlock->locked);
+ }
+ rcu_read_unlock();
+
+ /* Allocate missing page directory pages. */
+ for (idx = 0; idx < c; ++idx) {
+ lpdp = alloc_page(gpt->gfp_flags | __GFP_ZERO);
+ if (!lpdp) {
+ ret = -ENOMEM;
+ goto error;
+ }
+ list_add_tail(&lpdp->lru, &lpdp_new);
+ kmap(lpdp);
+ }
+
+ gpt_pdp_lock(gpt, pdp);
+ for (addr = start, idx = 0; idx < npde; addr = next + 1UL, ++idx) {
+ struct page *lpdp;
+
+ next = min((addr & mask) + (1UL << shift) - 1UL, end);
+ /* Another thread might already have populated the entry. */
+ if (test_bit(idx, wlock->locked) || gpt_pde_valid(gpt, pde[idx]))
+ continue;
+
+ lpdp = list_first_entry_or_null(&lpdp_new, struct page, lru);
+ BUG_ON(!lpdp);
+ list_del(&lpdp->lru);
+
+ /* Initialize page directory page struct. */
+ lpdp->private = lock->serial;
+ lpdp->s_mem = pdp;
+ lpdp->index = addr & ~((1UL << shift) - 1UL);
+ lpdp->flags |= shift & 0xff;
+ list_add_tail(&lpdp->lru, &lpdp_added);
+ atomic_set(&lpdp->_mapcount, 1);
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+ spin_lock_init(&lpdp->ptl);
+#endif
+
+ pde[idx] = gpt->user_ops->pde_from_pdp(gpt, lpdp);
+ /* Account this new entry inside upper directory. */
+ if (pdp)
+ atomic_inc(&pdp->_mapcount);
+ }
+ gpt_pdp_unlock(gpt, pdp);
+
+ spin_lock(&gpt->pgd_lock);
+ list_splice_tail(&lpdp_added, &gpt->pdp_young);
+ spin_unlock(&gpt->pgd_lock);
+
+ for (addr = start, idx = 0; idx < npde; addr = next + 1UL, ++idx) {
+ struct page *lpdp;
+
+ next = min((addr & mask) + (1UL << shift) - 1UL, end);
+ if (!test_bit(idx, wlock->locked))
+ continue;
+ clear_bit(idx, wlock->locked);
+ lpdp = gpt_pde_pdp(gpt, &pde[idx]);
+ kmap(lpdp);
+ }
+
+ /* Free any left over pages. */
+ list_for_each_entry_safe (lpdp, tmp, &lpdp_new, lru) {
+ list_del(&lpdp->lru);
+ kunmap(lpdp);
+ __free_page(lpdp);
+ }
+ return 0;
+
+error:
+ list_for_each_entry_safe (lpdp, tmp, &lpdp_new, lru) {
+ list_del(&lpdp->lru);
+ kunmap(lpdp);
+ __free_page(lpdp);
+ }
+ walk->end = start - 1UL;
+ return ret;
+}
+
+static int gpt_pde_unlock_fault(struct gpt *gpt,
+ struct gpt_walk *walk,
+ struct page *pdp,
+ volatile unsigned long *pde,
+ unsigned long start,
+ unsigned long end,
+ unsigned long shift)
+{
+ unsigned long addr, idx, mask, next, npde;
+ struct gpt_lock_walk *wlock = walk->data;
+ struct gpt_lock *lock = wlock->lock;
+
+ npde = ((end - start) >> shift) + 1;
+ mask = ~((1UL << shift) - 1UL);
+
+ for (addr = start, idx = 0; idx < npde; addr = next + 1UL, ++idx) {
+ struct page *lpdp;
+
+ next = min((addr & mask) + (1UL << shift) - 1UL, end);
+ lpdp = gpt_pde_pdp(gpt, &pde[idx]);
+ if (!lpdp || !lock->hold(lock, lpdp))
+ continue;
+ kunmap(lpdp);
+ gpt_pdp_unref(gpt, lpdp, wlock, pdp, &pde[idx]);
+ }
+
+ return 0;
+}
+
+static int gpt_tree_lock_fault(struct gpt *gpt, struct gpt_lock *lock)
+{
+ struct gpt_lock_walk wlock;
+ struct gpt_walk walk;
+ int ret;
+
+ lock->hold = &gpt_lock_fault_hold;
+ spin_lock(&gpt->lock);
+ lock->serial = gpt->faulter_serial++;
+ list_add_tail(&lock->list, &gpt->faulters);
+ spin_unlock(&gpt->lock);
+
+ bitmap_zero(wlock.locked, 1UL << GPT_PDIR_NBITS);
+ INIT_LIST_HEAD(&wlock.pdp_to_free);
+ wlock.lock = lock;
+ walk.lock = lock;
+ walk.data = &wlock;
+ walk.pde = &gpt_pde_lock_fault;
+ walk.pde_post = NULL;
+ walk.pte = NULL;
+ walk.start = lock->start;
+ walk.end = lock->end;
+
+ ret = gpt_tree_walk(&walk, gpt, lock);
+ if (ret) {
+ walk.pde = NULL;
+ walk.pde_post = &gpt_pde_unlock_fault;
+ gpt_tree_walk(&walk, gpt, lock);
+ gpt_lock_walk_fault_finish(gpt, &wlock);
+ }
+ gpt_lock_walk_free_pdp(&wlock);
+
+ return ret;
+}
+
+static void gpt_tree_unlock_fault(struct gpt *gpt, struct gpt_lock *lock)
+{
+ struct gpt_lock_walk wlock;
+ struct gpt_walk walk;
+
+ bitmap_zero(wlock.locked, 1UL << GPT_PDIR_NBITS);
+ INIT_LIST_HEAD(&wlock.pdp_to_free);
+ wlock.lock = lock;
+ walk.lock = lock;
+ walk.data = &wlock;
+ walk.pde = NULL;
+ walk.pde_post = &gpt_pde_unlock_fault;
+ walk.pte = NULL;
+ walk.start = lock->start;
+ walk.end = lock->end;
+
+ gpt_tree_walk(&walk, gpt, lock);
+
+ gpt_lock_walk_fault_finish(gpt, &wlock);
+ gpt_lock_walk_free_pdp(&wlock);
+}
+
+static bool gpt_tree_iter_addr_ptp(struct gpt_iter *iter,
+ unsigned long addr,
+ struct page *pdp,
+ volatile unsigned long *pde,
+ unsigned long shift)
+{
+ struct gpt_lock *lock = iter->lock;
+ struct gpt *gpt = iter->gpt;
+ struct page *lpdp;
+ void *lpde;
+
+ lpdp = gpt_pde_pdp(gpt, pde);
+ if (!lpdp || !lock->hold(lock, lpdp)) {
+ iter->ptp = NULL;
+ iter->pte = NULL;
+ return false;
+ }
+
+ lpde = page_address(lpdp);
+ if (shift == gpt->pld_shift) {
+ iter->ptp = lpdp;
+ iter->pte = gpt_pte_from_addr(gpt, lpdp, lpde, addr);
+ iter->pte_addr = addr;
+ return true;
+ }
+
+ shift -= GPT_PDIR_NBITS;
+ lpde += ((addr >> shift) & GPT_PDIR_MASK) * gpt->pde_size;
+ return gpt_tree_iter_addr_ptp(iter, addr, lpdp, lpde, shift);
+}
+
+static bool gpt_tree_iter_next_pdp(struct gpt_iter *iter,
+ struct page *pdp,
+ unsigned long addr)
+{
+ struct gpt *gpt = iter->gpt;
+ unsigned long *upde;
+ struct page *updp;
+
+ updp = gpt_pdp_upper_pdp(gpt, pdp);
+ upde = gpt_pdp_upper_pde(gpt, pdp);
+ if (gpt_pdp_cover_addr(gpt, updp, addr)) {
+ unsigned long shift = gpt_pdp_shift(gpt, updp);
+
+ return gpt_tree_iter_addr_ptp(iter, addr, updp, upde, shift);
+ }
+
+ return gpt_tree_iter_next_pdp(iter, updp, addr);
+}
+
+static bool gpt_tree_iter_addr(struct gpt_iter *iter, unsigned long addr)
+{
+ volatile unsigned long *pde;
+ struct gpt *gpt = iter->gpt;
+
+ if (iter->ptp)
+ return gpt_tree_iter_next_pdp(iter, iter->ptp, addr);
+ pde = gpt->pgd + (addr >> gpt->pgd_shift);
+ return gpt_tree_iter_addr_ptp(iter, addr, NULL, pde, gpt->pgd_shift);
+}
+
+static bool gpt_tree_iter_first_ptp(struct gpt_iter *iter,
+ unsigned long start,
+ unsigned long end,
+ struct page *pdp,
+ volatile unsigned long *pde,
+ unsigned long shift)
+{
+ unsigned long addr, idx, lshift, mask, next, npde;
+ struct gpt_lock *lock = iter->lock;
+ struct gpt *gpt = iter->gpt;
+
+ npde = ((end - start) >> shift) + 1;
+ mask = ~((1UL << shift) - 1UL);
+ lshift = shift - GPT_PDIR_NBITS;
+
+ for (addr = start, idx = 0; idx < npde; addr = next + 1UL, ++idx) {
+ struct page *lpdp;
+ void *lpde;
+
+ next = min((addr & mask) + (1UL << shift) - 1UL, end);
+ lpdp = gpt_pde_pdp(gpt, &pde[idx]);
+ if (!lpdp || !lock->hold(lock, lpdp))
+ continue;
+
+ lpde = page_address(lpdp);
+ if (gpt->pld_shift == shift) {
+ iter->ptp = lpdp;
+ iter->pte = gpt_pte_from_addr(gpt, lpdp, lpde, addr);
+ iter->pte_addr = addr;
+ return true;
+ }
+
+ lpde += ((addr >> lshift) & GPT_PDIR_MASK) * gpt->pde_size;
+ if (gpt_tree_iter_first_ptp(iter, addr, next,
+ lpdp, lpde, lshift))
+ return true;
+ }
+ return false;
+}
+
+static bool gpt_tree_iter_first_pdp(struct gpt_iter *iter,
+ struct page *pdp,
+ unsigned long start,
+ unsigned long end)
+{
+ struct gpt *gpt = iter->gpt;
+ unsigned long *upde;
+ struct page *updp;
+
+ updp = gpt_pdp_upper_pdp(gpt, pdp);
+ upde = gpt_pdp_upper_pde(gpt, pdp);
+ if (gpt_pdp_cover_addr(gpt, updp, start)) {
+ unsigned long shift = gpt_pdp_shift(gpt, updp);
+
+ if (gpt_tree_iter_first_ptp(iter, start, end,
+ updp, upde, shift))
+ return true;
+ start = gpt_pdp_end(gpt, updp) + 1UL;
+ if (start > end) {
+ iter->ptp = NULL;
+ iter->pte = NULL;
+ return false;
+ }
+ }
+
+ return gpt_tree_iter_first_pdp(iter, updp, start, end);
+}
+
+static bool gpt_tree_iter_first(struct gpt_iter *iter,
+ unsigned long start,
+ unsigned long end)
+{
+ unsigned long *pde;
+ struct gpt *gpt = iter->gpt;
+
+ if (iter->ptp)
+ return gpt_tree_iter_first_pdp(iter, iter->ptp, start, end);
+
+ pde = gpt->pgd + (start >> gpt->pgd_shift);
+ return gpt_tree_iter_first_ptp(iter, start, end, NULL,
+ pde, gpt->pgd_shift);
+}
+
+static const struct gpt_ops _gpt_ops_tree = {
+ .lock_update = gpt_tree_lock_update,
+ .unlock_update = gpt_tree_unlock_update,
+ .lock_fault = gpt_tree_lock_fault,
+ .unlock_fault = gpt_tree_unlock_fault,
+ .walk = gpt_tree_walk,
+ .iter_addr = gpt_tree_iter_addr,
+ .iter_first = gpt_tree_iter_first,
+};
+
+
+/*
+ * Generic page table operations for page table with single level (flat).
+ */
+static int gpt_flat_walk(struct gpt_walk *walk,
+ struct gpt *gpt,
+ struct gpt_lock *lock)
+{
+ void *pte;
+
+ if (!walk->pte)
+ return 0;
+
+ pte = gpt_pte_from_addr(gpt, NULL, gpt->pte, walk->start);
+ return walk->pte(gpt, walk, NULL, pte, walk->start, walk->end);
+}
+
+static bool gpt_flat_iter_addr(struct gpt_iter *iter, unsigned long addr)
+{
+ struct gpt *gpt = iter->gpt;
+
+ iter->ptp = NULL;
+ iter->pte = gpt_pte_from_addr(gpt, NULL, gpt->pte, addr);
+ iter->pte_addr = addr;
+ return true;
+}
+
+static bool gpt_flat_iter_first(struct gpt_iter *iter,
+ unsigned long start,
+ unsigned long end)
+{
+ struct gpt *gpt = iter->gpt;
+
+ iter->ptp = NULL;
+ iter->pte = gpt_pte_from_addr(gpt, NULL, gpt->pte, start);
+ iter->pte_addr = start;
+ return true;
+}
+
+static const struct gpt_ops _gpt_ops_flat = {
+ .lock_update = NULL,
+ .unlock_update = NULL,
+ .lock_fault = NULL,
+ .unlock_fault = NULL,
+ .walk = gpt_flat_walk,
+ .iter_addr = gpt_flat_iter_addr,
+ .iter_first = gpt_flat_iter_first,
+};
+
+
+/*
+ * Default user operations implementation.
+ */
+static unsigned long gpt_default_pde_from_pdp(struct gpt *gpt,
+ struct page *pdp)
+{
+ unsigned long pde;
+
+ pde = (page_to_pfn(pdp) << gpt->pfn_shift) | gpt->pfn_valid;
+ return pde;
+}
+
+static const struct gpt_user_ops _gpt_user_ops_default = {
+ .pde_from_pdp = gpt_default_pde_from_pdp,
+};
+
+
+void gpt_free(struct gpt *gpt)
+{
+ BUG_ON(!list_empty(&gpt->faulters));
+ BUG_ON(!list_empty(&gpt->updaters));
+ gpt->max_addr = 0;
+ kfree(gpt->pgd);
+}
+EXPORT_SYMBOL(gpt_free);
+
+int gpt_init(struct gpt *gpt)
+{
+ unsigned long pgd_size;
+
+ gpt->user_ops = gpt->user_ops ? gpt->user_ops : &_gpt_user_ops_default;
+ gpt->max_addr = gpt_align_end_addr(gpt, gpt->max_addr);
+ INIT_LIST_HEAD(&gpt->pdp_young);
+ INIT_LIST_HEAD(&gpt->pdp_free);
+ INIT_LIST_HEAD(&gpt->faulters);
+ INIT_LIST_HEAD(&gpt->updaters);
+ spin_lock_init(&gpt->pgd_lock);
+ spin_lock_init(&gpt->lock);
+ gpt->updater_serial = gpt->faulter_serial = gpt->min_serial = 0;
+ gpt->pde_size = sizeof(long);
+
+ /* The page table entry size must be smaller than the page size. */
+ if (gpt->pte_shift >= PAGE_SHIFT)
+ return -EINVAL;
+ gpt->pte_mask = (1UL << (PAGE_SHIFT - gpt->pte_shift)) - 1UL;
+ gpt->pte_mask = gpt->pte_mask << gpt->page_shift;
+
+ gpt->pld_shift = PAGE_SHIFT - gpt->pte_shift + gpt->page_shift;
+ if (gpt_align_end_addr(gpt, 1UL << gpt->pld_shift) >= gpt->max_addr) {
+ /* Only need one level ie this is a flat page table. */
+ gpt->pgd_shift = gpt->page_shift;
+ pgd_size = (gpt->max_addr >> gpt->pgd_shift) + 1UL;
+ pgd_size = pgd_size << gpt->pte_shift;
+ gpt->ops = &_gpt_ops_flat;
+ } else {
+ unsigned long nbits, pgd_nbits;
+
+ nbits = __fls(gpt->max_addr);
+ pgd_nbits = (nbits - gpt->pld_shift) % GPT_PDIR_NBITS;
+ pgd_nbits = pgd_nbits ? pgd_nbits : GPT_PDIR_NBITS;
+ gpt->pgd_shift = nbits - pgd_nbits;
+ pgd_size = ((gpt->max_addr >> gpt->pgd_shift) + 1UL) *
+ gpt->pde_size;
+ gpt->ops = &_gpt_ops_tree;
+ }
+
+ gpt->pgd = kzalloc(pgd_size, GFP_KERNEL);
+ if (!gpt->pgd)
+ return -ENOMEM;
+
+ return 0;
+}
+EXPORT_SYMBOL(gpt_init);
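+
+/*
+ * Example (editor's sketch, not part of the patch): before calling gpt_init()
+ * the caller fills in the geometry fields of struct gpt. The values below are
+ * only illustrative (48 bit address space, 8 bytes per entry, one entry per
+ * page).
+ *
+ *     struct gpt *gpt = kzalloc(sizeof(*gpt), GFP_KERNEL);
+ *
+ *     gpt->max_addr = (1UL << 48) - 1UL;   // highest address (inclusive)
+ *     gpt->page_shift = PAGE_SHIFT;        // each entry covers one page
+ *     gpt->pte_shift = 3;                  // 8 bytes per page table entry
+ *     gpt->gfp_flags = GFP_KERNEL;         // used for directory pages
+ *     gpt->user_ops = NULL;                // use the default pde encoding
+ *     if (gpt_init(gpt))
+ *         // handle -EINVAL or -ENOMEM
+ */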
--
1.9.3
* [RFC PATCH 3/6] hmm: heterogeneous memory management v5
2014-08-29 19:10 [RFC PATCH 0/6] HMM (heterogeneous memory management) v4 j.glisse
2014-08-29 19:10 ` [RFC PATCH 1/6] mmu_notifier: add event information to address invalidation v4 j.glisse
2014-08-29 19:10 ` [RFC PATCH 2/6] lib: lockless generic and arch independent page table (gpt) j.glisse
@ 2014-08-29 19:10 ` j.glisse
2014-08-29 19:10 ` [RFC PATCH 4/6] hmm/dummy: dummy driver to showcase the hmm api v2 j.glisse
` (2 subsequent siblings)
5 siblings, 0 replies; 9+ messages in thread
From: j.glisse @ 2014-08-29 19:10 UTC (permalink / raw)
To: linux-kernel, linux-mm, akpm, Haggai Eran
Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
Dave Airlie, Brendan Conoboy, Joe Donohue, Duncan Poole,
Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan,
Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
Laurent Morichetti, Alexander Deucher, Oded Gabbay,
Jérôme Glisse, Jatin Kumar
From: Jérôme Glisse <jglisse@redhat.com>
Motivation:
Heterogeneous memory management is intended to allow a device to transparently
access a process address space without having to lock pages of the process or
take references on them. In other words, it mirrors a process address space
while allowing regular memory management events, such as page reclamation or
page migration, to happen seamlessly.
Recent years have seen a surge in the number of specialized devices that are
part of a computer platform (from desktop to phone). So far each of those
devices has operated on its own private address space that is not linked or
exposed to the process address space that is using it. This separation often
leads to multiple memory copies between the device owned memory and the
process memory. This of course is both a waste of cpu cycles and memory.
Over the last few years most of those devices have gained a full mmu allowing
them to support multiple page tables, page faults and other features that are
found inside a cpu mmu. There is now a strong incentive to start leveraging
the capabilities of such devices and to start sharing the process address
space, avoiding any unnecessary memory copy as well as simplifying the
programming model of those devices by sharing a unique and common address
space with the process that uses them.
The aim of heterogeneous memory management is to provide a common API that can
be used by any such device in order to mirror a process address space. The hmm
code provides a unique entry point and interfaces itself with the core mm code
of the linux kernel, avoiding duplicate implementations and shielding device
driver code from core mm code.
Moreover, hmm also intends to provide support for migrating memory to device
private memory, allowing the device to work on its own fast local memory. The
hmm code would be responsible for intercepting cpu page faults on a migrated
range and migrating it back to system memory, allowing the cpu to resume its
access to the memory.
Another feature hmm intends to provide is support for atomic operations from
the device even if the bus linking the device and the cpu does not have any
such capability. On such hardware an atomic operation requires the page to be
mapped only on the device or only on the cpu, but not both at the same time.
We expect graphics processing units and network interfaces to be among the
first users of such an api.
Hardware requirement:
Because hmm is intended to be used by device drivers there are minimum feature
requirements for the hardware mmu :
- hardware has its own page table per process (can be shared between
different devices)
- the hardware mmu supports page faults and suspends execution until the page
fault is serviced by hmm code. The page fault must also trigger some form of
interrupt so that hmm code can be called by the device driver.
- hardware must support at least read only mappings (otherwise it cannot
access read only ranges of the process address space).
- hardware access to system memory must be cache coherent with the cpu.
For better memory management it is highly recommended that the device also
support the following features :
- the hardware mmu sets the access bit in its page table on memory access
(like the cpu).
- the hardware page table can be updated from the cpu or through a fast path.
- the hardware provides advanced statistics about which ranges of memory it
accesses the most.
- the hardware differentiates atomic memory accesses from regular accesses,
allowing atomic operations to be supported even on platforms that do not have
atomic support on the bus linking the device with the cpu.
Implementation:
The hmm layer provides a simple API to the device driver. Each device driver
has to register an hmm device that holds pointers to all the callbacks the hmm
code will use to synchronize the device page table with the cpu page table of
a given process.
For each process it wants to mirror, the device driver must register an hmm
mirror structure that holds all the information specific to the process being
mirrored. Each hmm mirror uniquely links an hmm device with a process address
space (the mm struct).
This design allows several different device drivers to concurrently mirror the
same process. The hmm layer will dispatch appropriately to each device driver
the modifications that are happening to the process address space.
The hmm layer relies on the mmu notifier api to monitor changes to the process
address space. Because updates to the device page table can have unbounded
completion time, the hmm layer needs the capability to sleep during mmu
notifier callbacks.
This patch only implements the core of the hmm layer and does not support
features such as migration to device memory.
Changed since v1:
- convert fence to refcounted object
- change the api to provide pte value directly avoiding useless temporary
special hmm pfn value
- cleanups & fixes ...
Changed since v2:
- fixed checkpatch.pl warnings & errors
- converted to a staging feature
Changed since v3:
- Use mmput notifier chain instead of adding hmm destroy call to mmput.
- Clear mm->hmm inside mm_init to match mmu_notifier.
- Separate cpu page table invalidation from device page table fault to
have cleaner and simpler code for synchronization between these two types
of event.
- Removing hmm_mirror kref and rely on user to manage lifetime of the
hmm_mirror.
Changed since v4:
- Invalidate either in range_start() or in range_end() depending on the
kind of mmu event.
- Use the new generic page table implementation to keep an hmm mirror of
the cpu page table.
- Get rid of the range lock exclusion as it is no longer needed.
- Simplify the driver api.
- Support for huge pages.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
include/linux/hmm.h | 364 ++++++++++++++
include/linux/mm.h | 11 +
include/linux/mm_types.h | 14 +
kernel/fork.c | 2 +
mm/Kconfig | 15 +
mm/Makefile | 1 +
mm/hmm.c | 1253 ++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 1660 insertions(+)
create mode 100644 include/linux/hmm.h
create mode 100644 mm/hmm.c
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..f7c379b
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,364 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is heterogeneous memory management (hmm). In a nutshell it provides an
+ * API to mirror a process address space on a device which has its own mmu and
+ * uses its own page table for the process. It supports everything except
+ * special vma.
+ *
+ * Mandatory hardware features :
+ * - An mmu with pagetable.
+ * - Read only flag per cpu page.
+ * - Page fault ie hardware must stop and wait for the kernel to service the
+ * fault.
+ *
+ * Optional hardware features :
+ * - Dirty bit per cpu page.
+ * - Access bit per cpu page.
+ *
+ * The hmm code handles all the interfacing with the core kernel mm code and
+ * provides a simple API. It supports migrating system memory to device memory
+ * and handles migration back to system memory on cpu page fault.
+ *
+ * Migrated memory is considered as swapped out from the cpu and core mm code
+ * point of view.
+ */
+#ifndef _HMM_H
+#define _HMM_H
+
+#ifdef CONFIG_HMM
+
+#include <linux/list.h>
+#include <linux/rwsem.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
+#include <linux/workqueue.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/mman.h>
+
+
+struct hmm_device;
+struct hmm_device_ops;
+struct hmm_mirror;
+struct hmm_event;
+struct hmm;
+
+
+/* hmm_fence - device driver fence to wait for device driver operations.
+ *
+ * In order to concurrently update the mmu of several different devices, hmm
+ * relies on device driver fences to wait for the operations it schedules to
+ * complete on the devices. It is strongly recommended to implement fences and
+ * have the hmm callbacks do as little as possible (just scheduling the update
+ * and returning a fence). Moreover the hmm code will reschedule the current
+ * process for i/o if necessary once it has scheduled all updates on all
+ * devices.
+ *
+ * Each fence is created as a result of either an update to a range of memory
+ * or a dma of remote memory to/from local memory.
+ *
+ * An update to a range of memory corresponds to a specific event type. For
+ * instance a range of memory is unmapped for page reclamation, or a range of
+ * memory is unmapped from the process address space as the result of a munmap
+ * syscall (HMM_MUNMAP), or a memory protection change happens on the range.
+ * There is one hmm_etype for each of those events, allowing the device driver
+ * to take appropriate action, for instance freeing the device page table on
+ * HMM_MUNMAP but keeping it when it is just an access protection change or a
+ * temporary unmap.
+ */
+enum hmm_etype {
+ HMM_NONE = 0,
+ HMM_ISDIRTY,
+ HMM_MIGRATE,
+ HMM_MUNMAP,
+ HMM_RFAULT,
+ HMM_WFAULT,
+ HMM_WRITE_PROTECT,
+};
+
+struct hmm_fence {
+ struct hmm_mirror *mirror;
+ struct list_head list;
+};
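+
+/*
+ * Editor's illustration (not part of this patch): a driver would typically
+ * embed struct hmm_fence inside its own fence object, for instance:
+ *
+ *     struct my_driver_fence {           // hypothetical driver type
+ *         struct hmm_fence base;
+ *         u32 seqno;                     // hypothetical hw sequence number
+ *     };
+ *
+ * and recover it with container_of(fence, struct my_driver_fence, base) from
+ * the fence_wait()/fence_ref()/fence_unref() callbacks.
+ */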
+
+
+/* struct hmm_event - used to serialize change to overlapping range of address.
+ *
+ * @list: Core hmm keep track of all active events.
+ * @start: First address (inclusive).
+ * @end: Last address (exclusive).
+ * @fences: List of device fences associated with this event.
+ * @etype: Event type (munmap, migrate, truncate, ...).
+ * @backoff: Only meaningful for device page fault.
+ */
+struct hmm_event {
+ struct list_head list;
+ unsigned long start;
+ unsigned long end;
+ struct list_head fences;
+ enum hmm_etype etype;
+ bool backoff;
+};
+
+
+/* struct hmm_range - used to communicate range infos to various callback.
+ *
+ * @pte: The hmm page table entry for the range.
+ * @ptp: The page directory page struct.
+ * @start: First address (inclusive).
+ * @end: Last address (exclusive).
+ */
+struct hmm_range {
+ unsigned long *pte;
+ struct page *ptp;
+ unsigned long start;
+ unsigned long end;
+};
+
+static inline unsigned long hmm_range_size(struct hmm_range *range)
+{
+ return range->end - range->start;
+}
+
+#define HMM_PTE_VALID_PDIR_BIT 0UL
+#define HMM_PTE_VALID_SMEM_BIT 1UL
+#define HMM_PTE_WRITE_BIT 2UL
+#define HMM_PTE_DIRTY_BIT 3UL
+
+static inline unsigned long hmm_pte_from_pfn(unsigned long pfn)
+{
+ return (pfn << PAGE_SHIFT) | (1UL << HMM_PTE_VALID_SMEM_BIT);
+}
+
+static inline void hmm_pte_mk_dirty(volatile unsigned long *hmm_pte)
+{
+ set_bit(HMM_PTE_DIRTY_BIT, hmm_pte);
+}
+
+static inline void hmm_pte_mk_write(volatile unsigned long *hmm_pte)
+{
+ set_bit(HMM_PTE_WRITE_BIT, hmm_pte);
+}
+
+static inline bool hmm_pte_clear_valid_smem(volatile unsigned long *hmm_pte)
+{
+ return test_and_clear_bit(HMM_PTE_VALID_SMEM_BIT, hmm_pte);
+}
+
+static inline bool hmm_pte_clear_write(volatile unsigned long *hmm_pte)
+{
+ return test_and_clear_bit(HMM_PTE_WRITE_BIT, hmm_pte);
+}
+
+static inline bool hmm_pte_is_valid_smem(const volatile unsigned long *hmm_pte)
+{
+ return test_bit(HMM_PTE_VALID_SMEM_BIT, hmm_pte);
+}
+
+static inline bool hmm_pte_is_write(const volatile unsigned long *hmm_pte)
+{
+ return test_bit(HMM_PTE_WRITE_BIT, hmm_pte);
+}
+
+static inline unsigned long hmm_pte_pfn(unsigned long hmm_pte)
+{
+ return hmm_pte >> PAGE_SHIFT;
+}
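+
+/*
+ * Editor's note (sketch): a valid, writable entry for a page of system memory
+ * would be built with the helpers above as:
+ *
+ *     unsigned long hmm_pte = hmm_pte_from_pfn(page_to_pfn(page));
+ *
+ *     hmm_pte_mk_write(&hmm_pte);
+ *
+ * and the backing pfn recovered later with hmm_pte_pfn(hmm_pte).
+ */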
+
+
+/* hmm_device - Each device driver must register one and only one hmm_device.
+ *
+ * The hmm_device is the link btw hmm and each device driver.
+ */
+
+/* struct hmm_device_operations - hmm device operation callback
+ */
+struct hmm_device_ops {
+ /* mirror_ref() - take reference on mirror struct.
+ *
+ * @mirror: Struct being referenced.
+ */
+ struct hmm_mirror *(*mirror_ref)(struct hmm_mirror *hmm_mirror);
+
+ /* mirror_unref() - drop reference on mirror struct.
+ *
+ * @mirror: Struct being dereferenced.
+ */
+ struct hmm_mirror *(*mirror_unref)(struct hmm_mirror *hmm_mirror);
+
+ /* mirror_release() - device must stop using the address space.
+ *
+ * @mirror: The mirror that link process address space with the device.
+ *
+ * This callback is called either on mm destruction or as a result of
+ * a call to hmm_mirror_release(). The device driver has to stop all
+ * hw threads and all usage of the address space; it has to dirty all
+ * pages that have been dirtied by the device.
+ */
+ void (*mirror_release)(struct hmm_mirror *hmm_mirror);
+
+ /* fence_wait() - to wait on device driver fence.
+ *
+ * @fence: The device driver fence struct.
+ * Returns: 0 on success, -EIO on error, -EAGAIN to wait again.
+ *
+ * Called when hmm wants to wait for all operations associated with a
+ * fence to complete (including device cache flush if the event
+ * mandates it).
+ *
+ * The device driver must free the fence and associated resources if
+ * it returns something else than -EAGAIN. On -EAGAIN the fence must
+ * not be freed as hmm will call back again.
+ *
+ * Return an error if the scheduled operation failed or if it needs to
+ * wait again.
+ * -EIO Some input/output error with the device.
+ * -EAGAIN The fence is not yet signaled, hmm reschedules the waiting
+ * thread.
+ *
+ * All other return values trigger a warning and are transformed to
+ * -EIO.
+ */
+ int (*fence_wait)(struct hmm_fence *fence);
+
+ /* fence_ref() - take a reference fence structure.
+ *
+ * @fence: Fence structure hmm is referencing.
+ */
+ void (*fence_ref)(struct hmm_fence *fence);
+
+ /* fence_unref() - drop a reference fence structure.
+ *
+ * @fence: Fence structure hmm is dereferencing.
+ */
+ void (*fence_unref)(struct hmm_fence *fence);
+
+ /* update() - update device mmu for a range of address.
+ *
+ * @mirror: The mirror that links the process address space with the device.
+ * @event: The event that triggered the update.
+ * @range: All information about the range that needs to be updated.
+ * Returns: Valid fence ptr or NULL on success, otherwise ERR_PTR.
+ *
+ * Called to update the device page table for a range of addresses.
+ * The event type provides the nature of the update :
+ * - Range is no longer valid (munmap).
+ * - Range protection changes (mprotect, COW, ...).
+ * - Range is unmapped (swap, reclaim, page migration, ...).
+ * - Device page fault.
+ * - ...
+ *
+ * Any event that blocks further writes to the memory must also trigger
+ * a device cache flush, and everything has to be flushed to local
+ * memory by the time the wait callback returns (if this callback
+ * returned a fence, otherwise everything must be flushed by the time
+ * the callback returns).
+ *
+ * The device must properly set the dirty bit using the
+ * hmm_pte_mk_dirty() helper on each hmm page table entry.
+ *
+ * The driver should return a fence pointer or NULL on success. The
+ * device driver should return a fence and delay waiting for the
+ * operation until the fence wait callback. Returning a fence allows
+ * hmm to batch updates to several devices and only wait on them once
+ * they have all scheduled the update.
+ *
+ * The device driver must not fail lightly; any failure results in the
+ * device process being killed.
+ *
+ * Return a fence or NULL on success, an error value otherwise :
+ * -ENOMEM Not enough memory for performing the operation.
+ * -EIO Some input/output error with the device.
+ *
+ * All other return values trigger a warning and are transformed to
+ * -EIO.
+ */
+ struct hmm_fence *(*update)(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ const struct hmm_range *range);
+};
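+
+/*
+ * Editor's illustration (not part of this patch): a driver wires its
+ * callbacks into hmm through a single ops table. The my_* functions below
+ * are hypothetical driver functions implementing the callbacks documented
+ * above.
+ *
+ *     static const struct hmm_device_ops my_hmm_device_ops = {
+ *         .mirror_ref     = my_mirror_ref,
+ *         .mirror_unref   = my_mirror_unref,
+ *         .mirror_release = my_mirror_release,
+ *         .fence_wait     = my_fence_wait,
+ *         .fence_ref      = my_fence_ref,
+ *         .fence_unref    = my_fence_unref,
+ *         .update         = my_update,
+ *     };
+ */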
+
+
+/* struct hmm_device - per device hmm structure
+ *
+ * @name: Device name (uniquely identify the device on the system).
+ * @ops: The hmm operations callback.
+ * @mirrors: List of all active mirrors for the device.
+ * @mutex: Mutex protecting mirrors list.
+ *
+ * Each device that wants to mirror an address space must register one of
+ * these structs (only once).
+ */
+struct hmm_device {
+ const char *name;
+ const struct hmm_device_ops *ops;
+ struct list_head mirrors;
+ struct mutex mutex;
+};
+
+int hmm_device_register(struct hmm_device *device);
+int hmm_device_unregister(struct hmm_device *device);
+
+
+/* hmm_mirror - device specific mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct
+ * associating the process address space with the device. The same process can
+ * be mirrored by several different devices at the same time.
+ */
+
+/* struct hmm_mirror - per device and per mm hmm structure
+ *
+ * @device: The hmm_device struct this hmm_mirror is associated to.
+ * @hmm: The hmm struct this hmm_mirror is associated to.
+ * @dlist: List of all hmm_mirror for same device.
+ * @mlist: List of all hmm_mirror for same process.
+ * @work: Work struct for delayed unreference.
+ *
+ * Each device that wants to mirror an address space must register one of
+ * these structs for each address space it wants to mirror. The same device
+ * can mirror several different address spaces, and the same address space can
+ * be mirrored by different devices. A registration sketch follows the
+ * declarations below.
+ */
+struct hmm_mirror {
+ struct hmm_device *device;
+ struct hmm *hmm;
+ struct list_head dlist;
+ struct list_head mlist;
+ struct work_struct work;
+};
+
+int hmm_mirror_register(struct hmm_mirror *mirror,
+ struct hmm_device *device,
+ struct mm_struct *mm);
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+static inline struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror)
+{
+ if (!mirror || !mirror->device)
+ return NULL;
+
+ return mirror->device->ops->mirror_ref(mirror);
+}
+
+static inline struct hmm_mirror *hmm_mirror_unref(struct hmm_mirror *mirror)
+{
+ if (!mirror || !mirror->device)
+ return NULL;
+
+ return mirror->device->ops->mirror_unref(mirror);
+}
+
+void hmm_mirror_release(struct hmm_mirror *mirror);
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event);
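+
+/*
+ * Example (editor's sketch, not part of this patch): the expected driver flow
+ * is roughly the following; names prefixed with my_ are hypothetical.
+ *
+ *     // once per driver
+ *     my_dev->hmm_device.name = "my-device";
+ *     my_dev->hmm_device.ops = &my_hmm_device_ops;
+ *     hmm_device_register(&my_dev->hmm_device);
+ *
+ *     // once per process that wants to use the device
+ *     hmm_mirror_register(&my_ctx->mirror, &my_dev->hmm_device,
+ *                         current->mm);
+ *
+ *     // from the device page fault handler, for a faulting address faddr
+ *     struct hmm_event event = {
+ *         .start = faddr & PAGE_MASK,
+ *         .end = (faddr & PAGE_MASK) + PAGE_SIZE,
+ *         .etype = is_write ? HMM_WFAULT : HMM_RFAULT,
+ *     };
+ *
+ *     hmm_mirror_fault(&my_ctx->mirror, &event);
+ */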
+
+
+#endif /* CONFIG_HMM */
+#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..2237ceb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2120,5 +2120,16 @@ void __init setup_nr_node_ids(void);
static inline void setup_nr_node_ids(void) {}
#endif
+#ifdef CONFIG_HMM
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+ mm->hmm = NULL;
+}
+#else /* !CONFIG_HMM */
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+}
+#endif /* !CONFIG_HMM */
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e0b286..7eeff71 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -16,6 +16,10 @@
#include <asm/page.h>
#include <asm/mmu.h>
+#ifdef CONFIG_HMM
+struct hmm;
+#endif
+
#ifndef AT_VECTOR_SIZE_ARCH
#define AT_VECTOR_SIZE_ARCH 0
#endif
@@ -425,6 +429,16 @@ struct mm_struct {
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
+#ifdef CONFIG_HMM
+ /*
+ * hmm always registers an mmu_notifier; we rely on the mmu notifier to
+ * keep a refcount on the mm struct as well as to forbid registering
+ * hmm on a dying mm.
+ *
+ * This field is set with mmap_sem held in write mode.
+ */
+ struct hmm *hmm;
+#endif
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 1380d8a..88032ca 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -27,6 +27,7 @@
#include <linux/binfmts.h>
#include <linux/mman.h>
#include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/vmacache.h>
@@ -562,6 +563,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
mm_init_aio(mm);
mm_init_owner(mm, p);
mmu_notifier_mm_init(mm);
+ hmm_mm_init(mm);
clear_tlb_flush_pending(mm);
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
mm->pmd_huge_pte = NULL;
diff --git a/mm/Kconfig b/mm/Kconfig
index 886db21..8784e7e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -610,3 +610,18 @@ config MAX_STACK_SIZE_MB
changed to a smaller value in which case that is used.
A sane initial value is 80 MB.
+
+if STAGING
+config HMM
+ bool "Enable heterogeneous memory management (HMM)"
+ depends on MMU
+ select MMU_NOTIFIER
+ select GENERIC_PAGE_TABLE
+ default n
+ help
+ Heterogeneous memory management provides infrastructure for a device
+ to mirror a process address space into a hardware mmu or into anything
+ that supports pagefault-like events.
+
+ If unsure, say N to disable hmm.
+endif # STAGING
diff --git a/mm/Makefile b/mm/Makefile
index 632ae77..35091ae 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -64,3 +64,4 @@ obj-$(CONFIG_ZBUD) += zbud.o
obj-$(CONFIG_ZSMALLOC) += zsmalloc.o
obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
obj-$(CONFIG_CMA) += cma.o
+obj-$(CONFIG_HMM) += hmm.o
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..d29a2d9
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,1253 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is the core code for heterogeneous memory management (HMM). HMM intends
+ * to provide helpers for mirroring a process address space on a device as well
+ * as for allowing migration of data between system memory and device memory,
+ * referred to as remote memory from here on out.
+ *
+ * Refer to include/linux/hmm.h for further information on the general design.
+ */
+#include <linux/export.h>
+#include <linux/bitmap.h>
+#include <linux/list.h>
+#include <linux/rculist.h>
+#include <linux/slab.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/ksm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/mmu_context.h>
+#include <linux/memcontrol.h>
+#include <linux/hmm.h>
+#include <linux/wait.h>
+#include <linux/mman.h>
+#include <linux/delay.h>
+#include <linux/workqueue.h>
+#include <linux/gpt.h>
+
+#include "internal.h"
+
+/* global SRCU for all HMMs */
+static struct srcu_struct srcu;
+
+
+/* struct hmm - per mm_struct hmm structure
+ *
+ * @mm: The mm struct this hmm is associated with.
+ * @kref: Reference counter
+ * @lock: Serialize the mirror list modifications.
+ * @device_faults: List of all active device page table faults.
+ * @invalidations: List of all active invalidations.
+ * @mirrors: List of all mirrors for this mm (one per device).
+ * @mmu_notifier: The mmu_notifier of this mm.
+ * @ndevice_faults: Number of active device page table faults.
+ * @ninvalidations: Number of active invalidations.
+ * @pt: Mirror page table (gpt) of this mm.
+ * @wait_queue: Wait queue for event synchronization.
+ *
+ * For each process address space (mm_struct) there is one and only one hmm
+ * struct. hmm functions redispatch to each device the changes made to the
+ * process address space.
+ */
+struct hmm {
+ struct mm_struct *mm;
+ struct kref kref;
+ spinlock_t lock;
+ struct list_head device_faults;
+ struct list_head invalidations;
+ struct list_head mirrors;
+ struct mmu_notifier mmu_notifier;
+ unsigned long ndevice_faults;
+ unsigned long ninvalidations;
+ struct gpt pt;
+ wait_queue_head_t wait_queue;
+};
+
+static struct mmu_notifier_ops hmm_notifier_ops;
+
+static inline struct hmm *hmm_ref(struct hmm *hmm);
+static inline struct hmm *hmm_unref(struct hmm *hmm);
+
+static void hmm_mirror_delayed_unref(struct work_struct *work);
+static void hmm_mirror_handle_error(struct hmm_mirror *mirror);
+
+static void hmm_device_fence_wait(struct hmm_device *device,
+ struct hmm_fence *fence);
+
+
+/* hmm_event - used to track information relating to an event.
+ *
+ * Each change to the cpu page table or fault from a device is considered an
+ * event by hmm. For each event there is a common set of things that need to
+ * be tracked. The hmm_event struct centralizes those and the helper functions
+ * below help deal with all of this.
+ */
+
+static inline bool hmm_event_overlap(struct hmm_event *a, struct hmm_event *b)
+{
+ return !((a->end <= b->start) || (a->start >= b->end));
+}
+
+static inline void hmm_event_init(struct hmm_event *event,
+ unsigned long start,
+ unsigned long end)
+{
+ event->start = start & PAGE_MASK;
+ event->end = PAGE_ALIGN(end);
+ INIT_LIST_HEAD(&event->fences);
+}
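+
+/* For example (editor's note, not part of this patch), with 4KiB pages
+ * hmm_event_init(&event, 0x1234, 0x5678) records the page aligned range
+ * [0x1000, 0x6000) so that an event always covers whole pages.
+ */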
+
+static inline void hmm_event_wait(struct hmm_event *event)
+{
+ struct hmm_fence *fence, *tmp;
+
+ if (list_empty(&event->fences))
+ /* Nothing to wait for. */
+ return;
+
+ io_schedule();
+
+ list_for_each_entry_safe(fence, tmp, &event->fences, list) {
+ hmm_device_fence_wait(fence->mirror->device, fence);
+ }
+}
+
+
+/* hmm_range - range helper functions.
+ *
+ * Ranges are used to communicate between the various hmm functions and the
+ * device driver.
+ */
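+
+/* Illustrative sketch (editor's example, not part of this patch): on the
+ * device driver side, the update() callback receives such a range and walks
+ * its hmm page table entries to mirror them into the device page table,
+ * along the lines of the hmm dummy driver. my_device_map_page() and mydev
+ * are hypothetical:
+ *
+ *	unsigned long i, addr;
+ *
+ *	for (i = 0, addr = range->start; addr < range->end;
+ *	     ++i, addr += PAGE_SIZE) {
+ *		if (!hmm_pte_is_valid_smem(&range->pte[i]))
+ *			continue;
+ *		my_device_map_page(mydev, addr,
+ *				   hmm_pte_pfn(range->pte[i]),
+ *				   hmm_pte_is_write(&range->pte[i]));
+ *	}
+ */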
+
+static void hmm_range_update_mirrors(struct hmm_range *range,
+ struct hmm *hmm,
+ struct hmm_event *event)
+{
+ struct hmm_mirror *mirror;
+ int id;
+
+ id = srcu_read_lock(&srcu);
+ list_for_each_entry(mirror, &hmm->mirrors, mlist) {
+ struct hmm_device *device = mirror->device;
+ struct hmm_fence *fence;
+
+ fence = device->ops->update(mirror, event, range);
+ if (fence) {
+ if (IS_ERR(fence)) {
+ hmm_mirror_handle_error(mirror);
+ } else {
+ fence->mirror = hmm_mirror_ref(mirror);
+ list_add_tail(&fence->list, &event->fences);
+ }
+ }
+ }
+ srcu_read_unlock(&srcu, id);
+}
+
+static bool hmm_range_wprot(struct hmm_range *range, struct hmm *hmm)
+{
+ unsigned long i;
+ bool update = false;
+
+ for (i = 0; i < (hmm_range_size(range) >> PAGE_SHIFT); ++i) {
+ update |= hmm_pte_clear_write(&range->pte[i]);
+ }
+ return update;
+}
+
+static void hmm_range_clear(struct hmm_range *range, struct hmm *hmm)
+{
+ unsigned long i;
+
+ for (i = 0; i < (hmm_range_size(range) >> PAGE_SHIFT); ++i)
+ if (hmm_pte_clear_valid_smem(&range->pte[i]))
+ gpt_ptp_unref(&hmm->pt, range->ptp);
+}
+
+
+/* hmm - core hmm functions.
+ *
+ * Core hmm functions that deal with all the process mm activities and use
+ * events for synchronization. Those functions are used mostly as a result
+ * of cpu mm events.
+ */
+
+static int hmm_init(struct hmm *hmm, struct mm_struct *mm)
+{
+ int ret;
+
+ hmm->mm = mm;
+ kref_init(&hmm->kref);
+ INIT_LIST_HEAD(&hmm->device_faults);
+ INIT_LIST_HEAD(&hmm->invalidations);
+ INIT_LIST_HEAD(&hmm->mirrors);
+ spin_lock_init(&hmm->lock);
+ init_waitqueue_head(&hmm->wait_queue);
+ hmm->ndevice_faults = 0;
+ hmm->ninvalidations = 0;
+
+ /* Initialize page table. */
+ hmm->pt.max_addr = mm->highest_vm_end - 1UL;
+ hmm->pt.page_shift = PAGE_SHIFT;
+ hmm->pt.pfn_invalid = 0;
+ hmm->pt.pfn_mask = PAGE_MASK;
+ hmm->pt.pfn_shift = PAGE_SHIFT;
+ hmm->pt.pfn_valid = 1UL << HMM_PTE_VALID_PDIR_BIT;
+ hmm->pt.pte_shift = ffs(BITS_PER_LONG) - 4;
+ hmm->pt.user_ops = NULL;
+ hmm->pt.gfp_flags = GFP_HIGHUSER;
+ ret = gpt_init(&hmm->pt);
+ if (ret)
+ return ret;
+
+ /* register notifier */
+ hmm->mmu_notifier.ops = &hmm_notifier_ops;
+ return __mmu_notifier_register(&hmm->mmu_notifier, mm);
+}
+
+static void hmm_del_mirror_locked(struct hmm *hmm, struct hmm_mirror *mirror)
+{
+ list_del_rcu(&mirror->mlist);
+}
+
+static int hmm_add_mirror(struct hmm *hmm, struct hmm_mirror *mirror)
+{
+ struct hmm_mirror *tmp_mirror;
+
+ spin_lock(&hmm->lock);
+ list_for_each_entry_rcu (tmp_mirror, &hmm->mirrors, mlist)
+ if (tmp_mirror->device == mirror->device) {
+ /* Same device can mirror only once. */
+ spin_unlock(&hmm->lock);
+ return -EINVAL;
+ }
+ list_add_rcu(&mirror->mlist, &hmm->mirrors);
+ spin_unlock(&hmm->lock);
+
+ return 0;
+}
+
+static inline struct hmm *hmm_ref(struct hmm *hmm)
+{
+ if (hmm) {
+ if (!kref_get_unless_zero(&hmm->kref))
+ return NULL;
+ return hmm;
+ }
+ return NULL;
+}
+
+static void hmm_destroy(struct kref *kref)
+{
+ struct hmm *hmm;
+
+ hmm = container_of(kref, struct hmm, kref);
+
+ down_write(&hmm->mm->mmap_sem);
+ /* A new hmm might have been registered before we get called. */
+ if (hmm->mm->hmm == hmm)
+ hmm->mm->hmm = NULL;
+ up_write(&hmm->mm->mmap_sem);
+ mmu_notifier_unregister_no_release(&hmm->mmu_notifier, hmm->mm);
+
+ mmu_notifier_synchronize();
+
+ gpt_free(&hmm->pt);
+ kfree(hmm);
+}
+
+static inline struct hmm *hmm_unref(struct hmm *hmm)
+{
+ if (hmm)
+ kref_put(&hmm->kref, hmm_destroy);
+ return NULL;
+}
+
+static void hmm_start_device_faults(struct hmm *hmm, struct hmm_event *fevent)
+{
+ struct hmm_event *ievent;
+ unsigned long wait_for = 0;
+
+again:
+ spin_lock(&hmm->lock);
+ list_for_each_entry (ievent, &hmm->device_faults, list) {
+ if (!hmm_event_overlap(fevent, ievent))
+ continue;
+ wait_for = hmm->ninvalidations;
+ }
+
+ if (!wait_for) {
+ fevent->backoff = false;
+ list_add_tail(&fevent->list, &hmm->device_faults);
+ hmm->ndevice_faults++;
+ spin_unlock(&hmm->lock);
+ return;
+ }
+
+ spin_unlock(&hmm->lock);
+ wait_event(hmm->wait_queue, wait_for != hmm->ninvalidations);
+ wait_for = 0;
+ goto again;
+}
+
+static void hmm_end_device_faults(struct hmm *hmm, struct hmm_event *fevent)
+{
+ spin_lock(&hmm->lock);
+ list_del_init(&fevent->list);
+ hmm->ndevice_faults--;
+ spin_unlock(&hmm->lock);
+ wake_up(&hmm->wait_queue);
+}
+
+static void hmm_start_invalidations(struct hmm *hmm, struct hmm_event *ievent)
+{
+ struct hmm_event *fevent;
+ unsigned long wait_for = 0;
+
+ spin_lock(&hmm->lock);
+ list_add_tail(&ievent->list, &hmm->invalidations);
+ hmm->ninvalidations++;
+
+again:
+ list_for_each_entry (fevent, &hmm->device_faults, list) {
+ if (!hmm_event_overlap(fevent, ievent))
+ continue;
+ fevent->backoff = true;
+ wait_for = hmm->ndevice_faults;
+ }
+ spin_unlock(&hmm->lock);
+
+ if (wait_for > 0) {
+ wait_event(hmm->wait_queue, wait_for != hmm->ndevice_faults);
+ spin_lock(&hmm->lock);
+ wait_for = 0;
+ goto again;
+ }
+}
+
+static void hmm_end_invalidations(struct hmm *hmm, struct hmm_event *ievent)
+{
+ spin_lock(&hmm->lock);
+ list_del_init(&ievent->list);
+ hmm->ninvalidations--;
+ spin_unlock(&hmm->lock);
+ wake_up(&hmm->wait_queue);
+}
+
+static void hmm_end_migrate(struct hmm *hmm, struct hmm_event *ievent)
+{
+ struct hmm_event *event;
+
+ spin_lock(&hmm->lock);
+ list_for_each_entry (event, &hmm->invalidations, list) {
+ if (event->etype != HMM_MIGRATE)
+ continue;
+ if (event->start != ievent->start || event->end != ievent->end)
+ continue;
+ list_del_init(&event->list);
+ spin_unlock(&hmm->lock);
+ wake_up(&hmm->wait_queue);
+ kfree(event);
+ return;
+ }
+ spin_unlock(&hmm->lock);
+}
+
+static void hmm_update(struct hmm *hmm,
+ struct hmm_event *event)
+{
+ struct hmm_range range;
+ struct gpt_lock lock;
+ struct gpt_iter iter;
+ struct gpt *pt = &hmm->pt;
+
+ /* This hmm is already fully stopped. */
+ if (hmm->mm->hmm != hmm)
+ return;
+
+ hmm_start_invalidations(hmm, event);
+
+ lock.start = event->start;
+ lock.end = event->end - 1UL;
+ if (gpt_lock_update(&hmm->pt, &lock)) {
+ if (event->etype != HMM_MIGRATE)
+ hmm_end_invalidations(hmm, event);
+ return;
+ }
+ gpt_iter_init(&iter, &hmm->pt, &lock);
+ if (!gpt_iter_first(&iter, event->start, event->end - 1UL)) {
+ /* Empty range, nothing to invalidate. */
+ gpt_unlock_update(&hmm->pt, &lock);
+ if (event->etype != HMM_MIGRATE)
+ hmm_end_invalidations(hmm, event);
+ return;
+ }
+
+ for (range.start = iter.pte_addr; iter.pte;) {
+ bool update_mirrors = true;
+
+ range.pte = iter.pte;
+ range.ptp = iter.ptp;
+ range.end = min(gpt_pdp_end(pt, iter.ptp) + 1UL, event->end);
+ if (event->etype == HMM_WRITE_PROTECT)
+ update_mirrors = hmm_range_wprot(&range, hmm);
+ if (update_mirrors)
+ hmm_range_update_mirrors(&range, hmm, event);
+
+ range.start = range.end;
+ gpt_iter_first(&iter, range.start, event->end - 1UL);
+ }
+
+ hmm_event_wait(event);
+
+ if (event->etype == HMM_MUNMAP || event->etype == HMM_MIGRATE) {
+ BUG_ON(!gpt_iter_first(&iter, event->start, event->end - 1UL));
+ for (range.start = iter.pte_addr; iter.pte;) {
+ range.pte = iter.pte;
+ range.ptp = iter.ptp;
+ range.end = min(gpt_pdp_end(pt, iter.ptp) + 1UL,
+ event->end);
+ hmm_range_clear(&range, hmm);
+ range.start = range.end;
+ gpt_iter_first(&iter, range.start, event->end - 1UL);
+ }
+ }
+
+ gpt_unlock_update(&hmm->pt, &lock);
+ if (event->etype != HMM_MIGRATE)
+ hmm_end_invalidations(hmm, event);
+}
+
+static int hmm_do_mm_fault(struct hmm *hmm,
+ struct hmm_event *event,
+ struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ int r;
+
+ for (; addr < event->end; addr += PAGE_SIZE) {
+ unsigned flags = 0;
+
+ flags |= event->etype == HMM_WFAULT ? FAULT_FLAG_WRITE : 0;
+ flags |= FAULT_FLAG_ALLOW_RETRY;
+ do {
+ r = handle_mm_fault(mm, vma, addr, flags);
+ if (!(r & VM_FAULT_RETRY) && (r & VM_FAULT_ERROR)) {
+ if (r & VM_FAULT_OOM)
+ return -ENOMEM;
+ /* Same error code for all other cases. */
+ return -EFAULT;
+ }
+ flags &= ~FAULT_FLAG_ALLOW_RETRY;
+ } while (r & VM_FAULT_RETRY);
+ }
+
+ return 0;
+}
+
+
+/* hmm_notifier - HMM callbacks for the mmu_notifier tracking changes to the mm.
+ *
+ * HMM uses mmu notifiers to track changes made to the process address space.
+ */
+
+static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ struct hmm_mirror *mirror;
+ struct hmm *hmm;
+
+ /* The hmm structure can not be freed because the mmu_notifier srcu is
+ * read locked; thus any concurrent hmm_mirror_unregister that would
+ * free hmm has to wait on the mmu_notifier.
+ */
+ hmm = container_of(mn, struct hmm, mmu_notifier);
+ spin_lock(&hmm->lock);
+ mirror = list_first_or_null_rcu(&hmm->mirrors,
+ struct hmm_mirror,
+ mlist);
+ while (mirror) {
+ hmm_del_mirror_locked(hmm, mirror);
+ spin_unlock(&hmm->lock);
+
+ mirror->device->ops->mirror_release(mirror);
+ INIT_WORK(&mirror->work, hmm_mirror_delayed_unref);
+ schedule_work(&mirror->work);
+
+ spin_lock(&hmm->lock);
+ mirror = list_first_or_null_rcu(&hmm->mirrors,
+ struct hmm_mirror,
+ mlist);
+ }
+ spin_unlock(&hmm->lock);
+
+ synchronize_srcu(&srcu);
+
+ wake_up(&hmm->wait_queue);
+}
+
+static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event mmu_event)
+{
+ struct hmm_event *event, static_event;
+ struct hmm *hmm;
+
+ BUG_ON(start >= mm->highest_vm_end);
+ BUG_ON(end >= mm->highest_vm_end);
+
+ switch (mmu_event) {
+ case MMU_ISDIRTY:
+ event = &static_event;
+ event->etype = HMM_ISDIRTY;
+ break;
+ case MMU_HSPLIT:
+ case MMU_MPROT:
+ case MMU_MUNLOCK:
+ case MMU_WRITE_BACK:
+ case MMU_WRITE_PROTECT:
+ return;
+ case MMU_MUNMAP:
+ event = &static_event;
+ event->etype = HMM_MUNMAP;
+ break;
+ default:
+ event = kzalloc(sizeof(*event), GFP_KERNEL);
+ if (!event) {
+ pr_warning("Out of memory killing hmm mirroring !");
+ hmm_notifier_release(mn, mm);
+ return;
+ }
+ event->etype = HMM_MIGRATE;
+ break;
+ }
+
+ hmm = container_of(mn, struct hmm, mmu_notifier);
+ hmm_event_init(event, start, end);
+
+ hmm_update(hmm, event);
+}
+
+static void hmm_mmu_mprot_to_etype(struct mm_struct *mm,
+ unsigned long addr,
+ enum mmu_event mmu_event,
+ enum hmm_etype *etype)
+{
+ struct vm_area_struct *vma;
+
+ vma = find_vma(mm, addr);
+ if (!vma || vma->vm_start > addr || !(vma->vm_flags & VM_READ)) {
+ *etype = HMM_MUNMAP;
+ return;
+ }
+
+ if (!(vma->vm_flags & VM_WRITE)) {
+ *etype = HMM_WRITE_PROTECT;
+ return;
+ }
+
+ *etype = HMM_NONE;
+}
+
+static void hmm_notifier_invalidate_range_end(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event mmu_event)
+{
+ struct hmm_event event;
+ struct hmm *hmm;
+
+ hmm = container_of(mn, struct hmm, mmu_notifier);
+ hmm_event_init(&event, start, end);
+
+ switch (mmu_event) {
+ case MMU_MIGRATE:
+ event.etype = HMM_MIGRATE;
+ hmm_end_migrate(hmm, &event);
+ return;
+ case MMU_MPROT:
+ hmm_mmu_mprot_to_etype(mm, start, mmu_event, &event.etype);
+ if (event.etype == HMM_NONE)
+ return;
+ break;
+ case MMU_WRITE_BACK:
+ case MMU_WRITE_PROTECT:
+ event.etype = HMM_WRITE_PROTECT;
+ break;
+ case MMU_HSPLIT:
+ case MMU_MUNLOCK:
+ default:
+ return;
+ }
+
+ hmm_update(hmm, &event);
+}
+
+static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long addr,
+ enum mmu_event mmu_event)
+{
+ struct hmm_event event;
+ struct hmm *hmm;
+
+ switch (mmu_event) {
+ case MMU_HSPLIT:
+ case MMU_MUNLOCK:
+ return;
+ case MMU_ISDIRTY:
+ event.etype = HMM_ISDIRTY;
+ break;
+ case MMU_MPROT:
+ hmm_mmu_mprot_to_etype(mm, addr, mmu_event, &event.etype);
+ if (event.etype == HMM_NONE)
+ return;
+ break;
+ case MMU_WRITE_BACK:
+ case MMU_WRITE_PROTECT:
+ event.etype = HMM_WRITE_PROTECT;
+ break;
+ case MMU_MUNMAP:
+ event.etype = HMM_MUNMAP;
+ break;
+ default:
+ event.etype = HMM_MIGRATE;
+ break;
+ }
+
+ hmm = container_of(mn, struct hmm, mmu_notifier);
+ hmm_event_init(&event, addr, addr + PAGE_SIZE);
+
+ hmm_update(hmm, &event);
+}
+
+static struct mmu_notifier_ops hmm_notifier_ops = {
+ .release = hmm_notifier_release,
+ /* .clear_flush_young FIXME we probably want to do something. */
+ /* .test_young FIXME we probably want to do something. */
+ /* WARNING: .change_pte must always be bracketed by range_start/end. There
+ * were patches to remove that behavior; we must make sure those patches
+ * are not included, as there are alternative solutions to the issue they
+ * are trying to solve.
+ *
+ * The fact is hmm can not use the change_pte callback, as non sleeping
+ * locks are held during the change_pte callback.
+ */
+ .change_pte = NULL,
+ .invalidate_page = hmm_notifier_invalidate_page,
+ .invalidate_range_start = hmm_notifier_invalidate_range_start,
+ .invalidate_range_end = hmm_notifier_invalidate_range_end,
+};
+
+
+/* hmm_mirror - per device mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct. A process
+ * can be mirrored by several devices at the same time.
+ *
+ * Below are all the functions and their helpers used by device drivers to
+ * mirror the process address space. Those functions either deal with updating
+ * the device page table (through the hmm callback) or provide helpers used by
+ * the device driver to fault in a range of memory in the device page table.
+ */
+
+/* hmm_mirror_register() - register a device mirror against an mm struct
+ *
+ * @mirror: The mirror that links the process address space with the device.
+ * @device: The device struct to associate this mirror with.
+ * @mm: The mm struct of the process.
+ * Returns: 0 on success, -ENOMEM on allocation failure, or -EINVAL if the
+ * process is already mirrored by this device.
+ *
+ * Call this when a device driver wants to start mirroring a process address
+ * space. The hmm shim will register an mmu_notifier and start monitoring
+ * process address space changes. Hence callbacks to the device driver might
+ * happen even before this function returns.
+ *
+ * The mm pin must also be held (either the task is current or use
+ * get_task_mm()).
+ *
+ * Only one mirror per mm and hmm_device can be created; this will return
+ * -EINVAL if the hmm_device already has an hmm_mirror for the mm.
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror,
+ struct hmm_device *device,
+ struct mm_struct *mm)
+{
+ struct hmm *hmm = NULL;
+ int ret = 0;
+
+ /* Sanity checks. */
+ BUG_ON(!mirror);
+ BUG_ON(!device);
+ BUG_ON(!mm);
+
+ /*
+ * Initialize the mirror struct fields; the mlist init and del dance is
+ * necessary to make the error path easier for the driver and for hmm.
+ */
+ INIT_LIST_HEAD(&mirror->mlist);
+ list_del(&mirror->mlist);
+ INIT_LIST_HEAD(&mirror->dlist);
+ mutex_lock(&device->mutex);
+ mirror->device = device;
+ list_add(&mirror->dlist, &device->mirrors);
+ mutex_unlock(&device->mutex);
+ mirror->hmm = NULL;
+ mirror = hmm_mirror_ref(mirror);
+ if (!mirror) {
+ mutex_lock(&device->mutex);
+ list_del_init(&mirror->dlist);
+ mutex_unlock(&device->mutex);
+ return -EINVAL;
+ }
+
+ down_write(&mm->mmap_sem);
+
+ hmm = mm->hmm ? hmm_ref(mm->hmm) : NULL;
+ if (hmm == NULL) {
+ /* no hmm registered yet so register one */
+ hmm = kzalloc(sizeof(*mm->hmm), GFP_KERNEL);
+ if (hmm == NULL) {
+ up_write(&mm->mmap_sem);
+ hmm_mirror_unref(mirror);
+ return -ENOMEM;
+ }
+
+ ret = hmm_init(hmm, mm);
+ if (ret) {
+ up_write(&mm->mmap_sem);
+ hmm_mirror_unref(mirror);
+ kfree(hmm);
+ return ret;
+ }
+
+ mm->hmm = hmm;
+ }
+
+ mirror->hmm = hmm;
+ ret = hmm_add_mirror(hmm, mirror);
+ up_write(&mm->mmap_sem);
+ if (ret) {
+ mirror->hmm = NULL;
+ hmm_mirror_unref(mirror);
+ hmm_unref(hmm);
+ return ret;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(hmm_mirror_register);
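+
+/* Illustrative sketch (editor's example, not part of this patch): a driver
+ * would typically register a mirror from its file open (or equivalent) path,
+ * while the task to mirror is current. All names are hypothetical:
+ *
+ *	struct my_mirror *m = kzalloc(sizeof(*m), GFP_KERNEL);
+ *
+ *	if (!m)
+ *		return -ENOMEM;
+ *	kref_init(&m->kref);
+ *	ret = hmm_mirror_register(&m->mirror, &mydev->hdevice, current->mm);
+ *	if (ret)
+ *		goto err_drop_driver_ref;
+ *
+ * On failure hmm drops the reference it took through mirror_ref(), so the
+ * error label only has to drop the driver's own initial reference. A -EINVAL
+ * return means this hmm_device already mirrors this mm.
+ */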
+
+static void hmm_mirror_delayed_unref(struct work_struct *work)
+{
+ struct hmm_mirror *mirror;
+
+ mirror = container_of(work, struct hmm_mirror, work);
+ hmm_mirror_unref(mirror);
+}
+
+static void hmm_mirror_handle_error(struct hmm_mirror *mirror)
+{
+ struct hmm *hmm = mirror->hmm;
+
+ spin_lock(&hmm->lock);
+ if (mirror->mlist.prev != LIST_POISON2) {
+ hmm_del_mirror_locked(hmm, mirror);
+ spin_unlock(&hmm->lock);
+
+ mirror->device->ops->mirror_release(mirror);
+ INIT_WORK(&mirror->work, hmm_mirror_delayed_unref);
+ schedule_work(&mirror->work);
+ } else
+ spin_unlock(&hmm->lock);
+}
+
+/* hmm_mirror_unregister() - unregister an hmm_mirror.
+ *
+ * @mirror: The mirror that links the process address space with the device.
+ *
+ * The device driver must call this function when it is destroying a registered
+ * mirror structure. If destruction was initiated by the device driver then
+ * it must have called hmm_mirror_release() prior to calling this function.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+ BUG_ON(!mirror || !mirror->device);
+ BUG_ON(mirror->mlist.prev != LIST_POISON2);
+
+ mirror->hmm = hmm_unref(mirror->hmm);
+
+ mutex_lock(&mirror->device->mutex);
+ list_del_init(&mirror->dlist);
+ mutex_unlock(&mirror->device->mutex);
+ mirror->device = NULL;
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+
+/* hmm_mirror_release() - release an hmm_mirror.
+ *
+ * @mirror: The mirror that links the process address space with the device.
+ *
+ * The device driver must call this function when it wants to stop mirroring
+ * the process.
+ */
+void hmm_mirror_release(struct hmm_mirror *mirror)
+{
+ if (!mirror->hmm)
+ return;
+
+ spin_lock(&mirror->hmm->lock);
+ /* Check if the mirror is already removed from the mirror list in which
+ * case there is no reason to call release.
+ */
+ if (mirror->mlist.prev != LIST_POISON2) {
+ hmm_del_mirror_locked(mirror->hmm, mirror);
+ spin_unlock(&mirror->hmm->lock);
+
+ mirror->device->ops->mirror_release(mirror);
+ synchronize_srcu(&srcu);
+
+ hmm_mirror_unref(mirror);
+ } else
+ spin_unlock(&mirror->hmm->lock);
+}
+EXPORT_SYMBOL(hmm_mirror_release);
+
+static int hmm_mirror_update(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ unsigned long *start,
+ struct gpt_iter *iter)
+{
+ unsigned long addr = *start & PAGE_MASK;
+
+ if (!gpt_iter_addr(iter, addr))
+ return -EINVAL;
+
+ do {
+ struct hmm_device *device = mirror->device;
+ unsigned long *pte = iter->pte;
+ struct hmm_fence *fence;
+ struct hmm_range range;
+
+ if (event->backoff)
+ return -EAGAIN;
+
+ range.start = addr;
+ range.end = min(gpt_pdp_end(iter->gpt, iter->ptp) + 1UL,
+ event->end);
+ range.pte = iter->pte;
+ for (; addr < range.end; addr += PAGE_SIZE, ++pte) {
+ if (!hmm_pte_is_valid_smem(pte)) {
+ *start = addr;
+ return 0;
+ }
+ if (event->etype == HMM_WFAULT &&
+ !hmm_pte_is_write(pte)) {
+ *start = addr;
+ return 0;
+ }
+ }
+
+ fence = device->ops->update(mirror, event, &range);
+ if (fence) {
+ if (IS_ERR(fence)) {
+ *start = range.start;
+ return -EIO;
+ }
+ fence->mirror = hmm_mirror_ref(mirror);
+ list_add_tail(&fence->list, &event->fences);
+ }
+
+ } while (addr < event->end && gpt_iter_addr(iter, addr));
+
+ *start = addr;
+ return 0;
+}
+
+struct hmm_mirror_fault {
+ struct hmm_mirror *mirror;
+ struct hmm_event *event;
+ struct vm_area_struct *vma;
+ unsigned long addr;
+ struct gpt_iter *iter;
+};
+
+static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ struct vm_area_struct *vma,
+ struct gpt_iter *iter,
+ pmd_t *pmdp,
+ unsigned long start,
+ unsigned long end)
+{
+ struct page *page;
+ unsigned long *hmm_pte, i;
+ unsigned flags = FOLL_TOUCH;
+ spinlock_t *ptl;
+
+ ptl = pmd_lock(mirror->hmm->mm, pmdp);
+ if (unlikely(!pmd_trans_huge(*pmdp))) {
+ spin_unlock(ptl);
+ return -EAGAIN;
+ }
+ if (unlikely(pmd_trans_splitting(*pmdp))) {
+ spin_unlock(ptl);
+ wait_split_huge_page(vma->anon_vma, pmdp);
+ return -EAGAIN;
+ }
+ flags |= event->etype == HMM_WFAULT ? FOLL_WRITE : 0;
+ page = follow_trans_huge_pmd(vma, start, pmdp, flags);
+ spin_unlock(ptl);
+
+ BUG_ON(!gpt_iter_addr(iter, start));
+ hmm_pte = iter->pte;
+
+ gpt_pdp_lock(&mirror->hmm->pt, iter->ptp);
+ for (i = 0; start < end; start += PAGE_SIZE, ++i, ++page) {
+ if (!hmm_pte_is_valid_smem(&hmm_pte[i])) {
+ hmm_pte[i] = hmm_pte_from_pfn(page_to_pfn(page));
+ gpt_ptp_ref(&mirror->hmm->pt, iter->ptp);
+ }
+ BUG_ON(hmm_pte_pfn(hmm_pte[i]) != page_to_pfn(page));
+ if (pmd_write(*pmdp))
+ hmm_pte_mk_write(&hmm_pte[i]);
+ }
+ gpt_pdp_unlock(&mirror->hmm->pt, iter->ptp);
+
+ return 0;
+}
+
+static int hmm_mirror_fault_pmd(pmd_t *pmdp,
+ unsigned long start,
+ unsigned long end,
+ struct mm_walk *walk)
+{
+ struct hmm_mirror_fault *mirror_fault = walk->private;
+ struct vm_area_struct *vma = mirror_fault->vma;
+ struct hmm_mirror *mirror = mirror_fault->mirror;
+ struct hmm_event *event = mirror_fault->event;
+ struct gpt_iter *iter = mirror_fault->iter;
+ unsigned long addr = start, i, *hmm_pte;
+ struct hmm *hmm = mirror->hmm;
+ pte_t *ptep;
+ int ret = 0;
+
+ /* Make sure there was no gap. */
+ if (start != mirror_fault->addr)
+ return -ENOENT;
+
+ if (event->backoff)
+ return -EAGAIN;
+
+ if (pmd_none(*pmdp))
+ return -ENOENT;
+
+ if (pmd_trans_huge(*pmdp)) {
+ ret = hmm_mirror_fault_hpmd(mirror, event, vma, iter,
+ pmdp, start, end);
+ mirror_fault->addr = ret ? start : end;
+ return ret;
+ }
+
+ if (pmd_none_or_trans_huge_or_clear_bad(pmdp))
+ return -EFAULT;
+
+ BUG_ON(!gpt_iter_addr(iter, start));
+ hmm_pte = iter->pte;
+
+ ptep = pte_offset_map(pmdp, start);
+ gpt_pdp_lock(&hmm->pt, iter->ptp);
+ for (i = 0; addr < end; addr += PAGE_SIZE, ++i) {
+ if (!pte_present(*ptep) ||
+ ((event->etype == HMM_WFAULT) && !pte_write(*ptep))) {
+ ptep++;
+ ret = -ENOENT;
+ break;
+ }
+
+ if (!hmm_pte_is_valid_smem(&hmm_pte[i])) {
+ hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(*ptep));
+ gpt_ptp_ref(&hmm->pt, iter->ptp);
+ }
+ BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pte_pfn(*ptep));
+ if (pte_write(*ptep))
+ hmm_pte_mk_write(&hmm_pte[i]);
+ ptep++;
+ }
+ gpt_pdp_unlock(&hmm->pt, iter->ptp);
+ pte_unmap(ptep - 1);
+ mirror_fault->addr = addr;
+
+ return ret;
+}
+
+static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ struct vm_area_struct *vma)
+{
+ struct hmm_mirror_fault mirror_fault;
+ struct mm_walk walk = {0};
+ struct gpt_lock lock;
+ struct gpt_iter iter;
+ unsigned long addr;
+ int ret;
+
+ if ((event->etype == HMM_WFAULT) && !(vma->vm_flags & VM_WRITE))
+ return -EACCES;
+
+ hmm_start_device_faults(mirror->hmm, event);
+
+ addr = event->start;
+ lock.start = event->start;
+ lock.end = event->end - 1UL;
+ ret = gpt_lock_fault(&mirror->hmm->pt, &lock);
+ if (ret) {
+ hmm_end_device_faults(mirror->hmm, event);
+ return ret;
+ }
+ gpt_iter_init(&iter, &mirror->hmm->pt, &lock);
+
+again:
+ ret = hmm_mirror_update(mirror, event, &addr, &iter);
+ if (ret)
+ goto out;
+
+ if (event->backoff) {
+ ret = -EAGAIN;
+ goto out;
+ }
+ if (addr >= event->end)
+ goto out;
+
+ mirror_fault.event = event;
+ mirror_fault.mirror = mirror;
+ mirror_fault.vma = vma;
+ mirror_fault.addr = addr;
+ mirror_fault.iter = &iter;
+ walk.mm = mirror->hmm->mm;
+ walk.private = &mirror_fault;
+ walk.pmd_entry = hmm_mirror_fault_pmd;
+ ret = walk_page_range(addr, event->end, &walk);
+ hmm_event_wait(event);
+ if (!ret)
+ goto again;
+ addr = mirror_fault.addr;
+
+out:
+ gpt_unlock_fault(&mirror->hmm->pt, &lock);
+ hmm_end_device_faults(mirror->hmm, event);
+ if (ret == -ENOENT) {
+ ret = hmm_do_mm_fault(mirror->hmm, event, vma, addr);
+ ret = ret ? ret : -EAGAIN;
+ }
+ return ret;
+}
+
+/* hmm_mirror_fault() - called by the device driver on a device memory fault.
+ *
+ * @mirror: Mirror related to the fault, if any.
+ * @event: Event describing the fault.
+ *
+ * The device driver calls this function either when it needs to fill its page
+ * table for a range of addresses or when it needs to migrate memory between
+ * system and remote memory.
+ *
+ * This function performs the vma lookup and access permission check on behalf
+ * of the device. If the device asks for range [A; D] but there is only a valid
+ * vma starting at B with B > A and B < D, then this will return -EFAULT and
+ * set event->end to B so the device driver can either report an issue back or
+ * call hmm_mirror_fault() again with the range updated to [B; D].
+ *
+ * This allows the device driver to optimistically fault a range of addresses
+ * without having to know the valid vma ranges. The device driver can then take
+ * proper action if a real memory access happens inside an invalid address
+ * range.
+ *
+ * Also the fault will clamp the requested range to the valid vma range (unless
+ * the vma into which event->start falls can grow). So in the previous example,
+ * if D is not covered by any vma then hmm_mirror_fault() will stop at C with
+ * C < D, C being the last address of a valid vma, and event->end will be set
+ * to C.
+ *
+ * All errors must be handled by the device driver and most likely result in
+ * the process's device tasks being killed by the device driver.
+ *
+ * Returns:
+ * 0 on success (the whole clamped range was faulted).
+ * -EINVAL if invalid argument.
+ * -ENOMEM if failing to allocate memory.
+ * -EACCES if trying to write to a read only address.
+ * -EFAULT if trying to access an invalid address.
+ * -ENODEV if the mirror is in the process of being destroyed.
+ * -EIO if the device driver update callback failed.
+ */
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event)
+{
+ struct vm_area_struct *vma;
+ int ret = 0;
+
+ if (!mirror || !event || event->start >= event->end)
+ return -EINVAL;
+
+ hmm_event_init(event, event->start, event->end);
+ if (event->end > mirror->hmm->mm->highest_vm_end)
+ return -EFAULT;
+
+retry:
+ if (!mirror->hmm->mm->hmm)
+ return -ENODEV;
+
+ /*
+ * Synchronization with the cpu page table is the most important and
+ * tedious aspect of device page faults. There must be a strong ordering
+ * between the device->update() call for a device page fault and the
+ * device->update() call for a cpu page table invalidation/update.
+ *
+ * Pages that are exposed to the device driver must stay valid while the
+ * callback is in progress, ie any cpu page table invalidation that
+ * renders those pages obsolete must call device->update() after the
+ * device->update() call that faulted those pages.
+ *
+ * To achieve this we rely on a few things. First, the mmap_sem insures
+ * that any munmap() syscall will serialize with us. So the issues are
+ * with unmap_mapping_range() and with page migration or merging. For
+ * those, hmm keeps track of the affected ranges of addresses and blocks
+ * device page faults that hit an overlapping range.
+ */
+ down_read(&mirror->hmm->mm->mmap_sem);
+ vma = find_vma_intersection(mirror->hmm->mm, event->start, event->end);
+ if (!vma) {
+ ret = -EFAULT;
+ goto out;
+ }
+ if (vma->vm_start > event->start) {
+ event->end = vma->vm_start;
+ ret = -EFAULT;
+ goto out;
+ }
+ event->end = min(event->end, vma->vm_end);
+ if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ switch (event->etype) {
+ case HMM_RFAULT:
+ case HMM_WFAULT:
+ ret = hmm_mirror_handle_fault(mirror, event, vma);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+out:
+ /* Drop the mmap_sem so anyone waiting on it has a chance. */
+ up_read(&mirror->hmm->mm->mmap_sem);
+ if (ret == -EAGAIN)
+ goto retry;
+ return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_fault);
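+
+/* Illustrative sketch (editor's example, not part of this patch): because
+ * event->end is clamped to the vma that event->start falls into, a driver
+ * that optimistically faults a window retries on the remainder of the range
+ * until the address it actually cares about is covered, as the hmm dummy
+ * driver does in hmm_dummy_mirror_fault():
+ *
+ *	event.start = start;
+ *	event.end = end;
+ *	event.etype = write ? HMM_WFAULT : HMM_RFAULT;
+ *	while (1) {
+ *		ret = hmm_mirror_fault(mirror, &event);
+ *		if (addr >= event.end) {
+ *			event.start = event.end;
+ *			event.end = end;
+ *			continue;
+ *		}
+ *		break;
+ *	}
+ *
+ * Here addr is the single address the driver needs; errors that only concern
+ * addresses past addr are ignored and the fault is simply retried further on.
+ */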
+
+
+/* hmm_device - Each device driver must register one and only one hmm_device.
+ *
+ * The hmm_device is the link between hmm and each device driver.
+ */
+
+/* hmm_device_register() - register a device with hmm.
+ *
+ * @device: The hmm_device struct.
+ * Returns: 0 on success, -EINVAL otherwise.
+ *
+ * Call this when a device driver wants to register itself with hmm. A device
+ * driver can only register once. It will return a reference on the device;
+ * thus to release a device the driver must unreference the device.
+ */
+int hmm_device_register(struct hmm_device *device)
+{
+ /* sanity check */
+ BUG_ON(!device);
+ BUG_ON(!device->ops);
+ BUG_ON(!device->ops->mirror_ref);
+ BUG_ON(!device->ops->mirror_unref);
+ BUG_ON(!device->ops->mirror_release);
+ BUG_ON(!device->ops->fence_wait);
+ BUG_ON(!device->ops->fence_ref);
+ BUG_ON(!device->ops->fence_unref);
+ BUG_ON(!device->ops->update);
+
+ mutex_init(&device->mutex);
+ INIT_LIST_HEAD(&device->mirrors);
+
+ return 0;
+}
+EXPORT_SYMBOL(hmm_device_register);
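+
+/* Illustrative sketch (editor's example, not part of this patch): every
+ * callback checked above is mandatory. A driver fills a static
+ * hmm_device_ops once and points its hmm_device at it before registering;
+ * the my_* names and mydev are hypothetical:
+ *
+ *	static const struct hmm_device_ops my_hmm_ops = {
+ *		.mirror_ref	= &my_mirror_ref,
+ *		.mirror_unref	= &my_mirror_unref,
+ *		.mirror_release	= &my_mirror_release,
+ *		.fence_wait	= &my_fence_wait,
+ *		.fence_ref	= &my_fence_ref,
+ *		.fence_unref	= &my_fence_unref,
+ *		.update		= &my_update,
+ *	};
+ *
+ *	mydev->hdevice.ops = &my_hmm_ops;
+ *	ret = hmm_device_register(&mydev->hdevice);
+ */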
+
+/* hmm_device_unregister() - unregister a device with hmm.
+ *
+ * @device: The hmm_device struct.
+ *
+ * Call this when a device driver wants to unregister itself with hmm. This
+ * will check if there is any active mirror and return -EBUSY if so. It is the
+ * device driver's responsibility to clean up and stop all mirrors before
+ * calling this.
+ */
+int hmm_device_unregister(struct hmm_device *device)
+{
+ struct hmm_mirror *mirror;
+
+ mutex_lock(&device->mutex);
+ mirror = list_first_entry_or_null(&device->mirrors,
+ struct hmm_mirror,
+ dlist);
+ mutex_unlock(&device->mutex);
+ if (mirror)
+ return -EBUSY;
+ return 0;
+}
+EXPORT_SYMBOL(hmm_device_unregister);
+
+static void hmm_device_fence_wait(struct hmm_device *device,
+ struct hmm_fence *fence)
+{
+ struct hmm_mirror *mirror;
+ int r;
+
+ if (fence == NULL)
+ return;
+
+ list_del_init(&fence->list);
+ do {
+ r = device->ops->fence_wait(fence);
+ if (r == -EAGAIN)
+ io_schedule();
+ } while (r == -EAGAIN);
+
+ mirror = fence->mirror;
+ device->ops->fence_unref(fence);
+ if (r)
+ hmm_mirror_handle_error(mirror);
+ hmm_mirror_unref(mirror);
+}
+
+
+static int __init hmm_subsys_init(void)
+{
+ return init_srcu_struct(&srcu);
+}
+subsys_initcall(hmm_subsys_init);
--
1.9.3
* [RFC PATCH 4/6] hmm/dummy: dummy driver to showcase the hmm api v2
2014-08-29 19:10 [RFC PATCH 0/6] HMM (heterogeneous memory management) v4 j.glisse
` (2 preceding siblings ...)
2014-08-29 19:10 ` [RFC PATCH 3/6] hmm: heterogeneous memory management v5 j.glisse
@ 2014-08-29 19:10 ` j.glisse
2014-08-29 19:10 ` [RFC PATCH 5/6] iommu: new api to map an array of page frame number into a domain j.glisse
2014-08-29 19:10 ` [RFC PATCH 6/6] hmm: add support for iommu domain j.glisse
5 siblings, 0 replies; 9+ messages in thread
From: j.glisse @ 2014-08-29 19:10 UTC (permalink / raw)
To: linux-kernel, linux-mm, akpm, Haggai Eran
Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
Dave Airlie, Brendan Conoboy, Joe Donohue, Duncan Poole,
Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan,
Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
Laurent Morichetti, Alexander Deucher, Oded Gabbay,
Jérôme Glisse
From: Jérôme Glisse <jglisse@redhat.com>
This is a dummy driver which fulfills two purposes:
- showcase the hmm api and give a reference on how to use it.
- provide an extensive user space api to stress test hmm.
This is a particularly dangerous module as it allows access to a
mirror of a process address space through its device file. Hence
it should not be enabled by default and only people actively
developing for hmm should use it.
Changed since v1:
- Fixed all checkpatch.pl issues (ignoring some over-80-character lines).
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
drivers/char/Kconfig | 9 +
drivers/char/Makefile | 1 +
drivers/char/hmm_dummy.c | 1149 ++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/hmm_dummy.h | 30 ++
4 files changed, 1189 insertions(+)
create mode 100644 drivers/char/hmm_dummy.c
create mode 100644 include/uapi/linux/hmm_dummy.h
diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 6e9f74a..199e111 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -600,5 +600,14 @@ config TILE_SROM
device appear much like a simple EEPROM, and knows
how to partition a single ROM for multiple purposes.
+config HMM_DUMMY
+ tristate "hmm dummy driver to test hmm."
+ depends on HMM
+ default n
+ help
+ Say Y here if you want to build the hmm dummy driver that allows you
+ to test the hmm infrastructure by mapping a process address space
+ through the hmm dummy driver device file. When in doubt, say "N".
+
endmenu
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index a324f93..83d89b8 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -61,3 +61,4 @@ obj-$(CONFIG_JS_RTC) += js-rtc.o
js-rtc-y = rtc.o
obj-$(CONFIG_TILE_SROM) += tile-srom.o
+obj-$(CONFIG_HMM_DUMMY) += hmm_dummy.o
diff --git a/drivers/char/hmm_dummy.c b/drivers/char/hmm_dummy.c
new file mode 100644
index 0000000..ae3a048
--- /dev/null
+++ b/drivers/char/hmm_dummy.c
@@ -0,0 +1,1149 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is a dummy driver made to exercise the HMM (heterogeneous memory
+ * management) API of the kernel. It allows a userspace program to map its
+ * whole address space through the hmm dummy driver file.
+ *
+ * In here, mirror addresses are addresses in the process address space that
+ * is being mirrored, while virtual addresses are the addresses in the current
+ * process that has the hmm dummy dev file mapped (addresses of the file
+ * mapping).
+ *
+ * You must be careful not to mix one with the other.
+ */
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/cdev.h>
+#include <linux/device.h>
+#include <linux/mutex.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/highmem.h>
+#include <linux/delay.h>
+#include <linux/hmm.h>
+
+#include <uapi/linux/hmm_dummy.h>
+
+#define HMM_DUMMY_DEVICE_NAME "hmm_dummy_device"
+#define HMM_DUMMY_MAX_DEVICES 4
+
+struct hmm_dummy_device;
+
+struct hmm_dummy_mirror {
+ struct kref kref;
+ struct file *filp;
+ struct hmm_dummy_device *ddevice;
+ struct hmm_mirror mirror;
+ unsigned minor;
+ pid_t pid;
+ struct mm_struct *mm;
+ unsigned long *pgdp;
+ struct mutex mutex;
+ bool stop;
+};
+
+struct hmm_dummy_device {
+ struct cdev cdev;
+ struct hmm_device device;
+ dev_t dev;
+ int major;
+ struct mutex mutex;
+ char name[32];
+ /* device file mapping tracking (keep track of all vma) */
+ struct hmm_dummy_mirror *dmirrors[HMM_DUMMY_MAX_DEVICES];
+ struct address_space *fmapping[HMM_DUMMY_MAX_DEVICES];
+};
+
+/* We only create 2 devices to show the inter-device rmem sharing/migration
+ * capabilities.
+ */
+static struct hmm_dummy_device ddevices[2];
+
+
+/* hmm_dummy_pt - dummy page table, the dummy device fakes its own page table.
+ *
+ * Helper functions to manage the dummy device page table.
+ */
+#define HMM_DUMMY_PTE_VALID (1UL << 0UL)
+#define HMM_DUMMY_PTE_READ (1UL << 1UL)
+#define HMM_DUMMY_PTE_WRITE (1UL << 2UL)
+#define HMM_DUMMY_PTE_DIRTY (1UL << 3UL)
+#define HMM_DUMMY_PFN_SHIFT (PAGE_SHIFT)
+
+#define ARCH_PAGE_SIZE ((unsigned long)PAGE_SIZE)
+#define ARCH_PAGE_SHIFT ((unsigned long)PAGE_SHIFT)
+
+#define HMM_DUMMY_PTRS_PER_LEVEL (ARCH_PAGE_SIZE / sizeof(long))
+#ifdef CONFIG_64BIT
+#define HMM_DUMMY_BITS_PER_LEVEL (ARCH_PAGE_SHIFT - 3UL)
+#else
+#define HMM_DUMMY_BITS_PER_LEVEL (ARCH_PAGE_SHIFT - 2UL)
+#endif
+#define HMM_DUMMY_PLD_SHIFT (ARCH_PAGE_SHIFT)
+#define HMM_DUMMY_PMD_SHIFT (HMM_DUMMY_PLD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PUD_SHIFT (HMM_DUMMY_PMD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PGD_SHIFT (HMM_DUMMY_PUD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PGD_NPTRS (1UL << HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PMD_NPTRS (1UL << HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PUD_NPTRS (1UL << HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PLD_NPTRS (1UL << HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PLD_SIZE (1UL << (HMM_DUMMY_PLD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
+#define HMM_DUMMY_PMD_SIZE (1UL << (HMM_DUMMY_PMD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
+#define HMM_DUMMY_PUD_SIZE (1UL << (HMM_DUMMY_PUD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
+#define HMM_DUMMY_PGD_SIZE (1UL << (HMM_DUMMY_PGD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
+#define HMM_DUMMY_PLD_MASK (~(HMM_DUMMY_PLD_SIZE - 1UL))
+#define HMM_DUMMY_PMD_MASK (~(HMM_DUMMY_PMD_SIZE - 1UL))
+#define HMM_DUMMY_PUD_MASK (~(HMM_DUMMY_PUD_SIZE - 1UL))
+#define HMM_DUMMY_PGD_MASK (~(HMM_DUMMY_PGD_SIZE - 1UL))
+#define HMM_DUMMY_MAX_ADDR (1UL << (HMM_DUMMY_PGD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
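+
+/*
+ * Worked example (editor's note, not part of this patch): with 4KiB pages on
+ * a 64 bit host, HMM_DUMMY_BITS_PER_LEVEL is 12 - 3 = 9, so every level holds
+ * 512 entries and the shifts are 12 (pld), 21 (pmd), 30 (pud) and 39 (pgd).
+ * A leaf table then covers 2MiB, a pmd level 1GiB, a pud level 512GiB, and
+ * HMM_DUMMY_MAX_ADDR = 1UL << 48, ie 256TiB of mirrored address space.
+ */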
+
+static inline unsigned long hmm_dummy_pld_index(unsigned long addr)
+{
+ return (addr >> HMM_DUMMY_PLD_SHIFT) & (HMM_DUMMY_PLD_NPTRS - 1UL);
+}
+
+static inline unsigned long hmm_dummy_pmd_index(unsigned long addr)
+{
+ return (addr >> HMM_DUMMY_PMD_SHIFT) & (HMM_DUMMY_PMD_NPTRS - 1UL);
+}
+
+static inline unsigned long hmm_dummy_pud_index(unsigned long addr)
+{
+ return (addr >> HMM_DUMMY_PUD_SHIFT) & (HMM_DUMMY_PUD_NPTRS - 1UL);
+}
+
+static inline unsigned long hmm_dummy_pgd_index(unsigned long addr)
+{
+ return (addr >> HMM_DUMMY_PGD_SHIFT) & (HMM_DUMMY_PGD_NPTRS - 1UL);
+}
+
+static inline unsigned long hmm_dummy_pld_base(unsigned long addr)
+{
+ return (addr & HMM_DUMMY_PLD_MASK);
+}
+
+static inline unsigned long hmm_dummy_pmd_base(unsigned long addr)
+{
+ return (addr & HMM_DUMMY_PMD_MASK);
+}
+
+static inline unsigned long hmm_dummy_pud_base(unsigned long addr)
+{
+ return (addr & HMM_DUMMY_PUD_MASK);
+}
+
+static inline unsigned long hmm_dummy_pgd_base(unsigned long addr)
+{
+ return (addr & HMM_DUMMY_PGD_MASK);
+}
+
+static inline unsigned long hmm_dummy_pld_next(unsigned long addr)
+{
+ return (addr & HMM_DUMMY_PLD_MASK) + HMM_DUMMY_PLD_SIZE;
+}
+
+static inline unsigned long hmm_dummy_pmd_next(unsigned long addr)
+{
+ return (addr & HMM_DUMMY_PMD_MASK) + HMM_DUMMY_PMD_SIZE;
+}
+
+static inline unsigned long hmm_dummy_pud_next(unsigned long addr)
+{
+ return (addr & HMM_DUMMY_PUD_MASK) + HMM_DUMMY_PUD_SIZE;
+}
+
+static inline unsigned long hmm_dummy_pgd_next(unsigned long addr)
+{
+ return (addr & HMM_DUMMY_PGD_MASK) + HMM_DUMMY_PGD_SIZE;
+}
+
+static inline struct page *hmm_dummy_pte_to_page(unsigned long pte)
+{
+ if (!(pte & HMM_DUMMY_PTE_VALID))
+ return NULL;
+ return pfn_to_page((pte >> HMM_DUMMY_PFN_SHIFT));
+}
+
+struct hmm_dummy_pt_map {
+ struct hmm_dummy_mirror *dmirror;
+ struct page *pud_page;
+ struct page *pmd_page;
+ struct page *pld_page;
+ unsigned long pgd_idx;
+ unsigned long pud_idx;
+ unsigned long pmd_idx;
+ unsigned long *pudp;
+ unsigned long *pmdp;
+ unsigned long *pldp;
+};
+
+static inline unsigned long *hmm_dummy_pt_pud_map(struct hmm_dummy_pt_map *pt_map,
+ unsigned long addr)
+{
+ struct hmm_dummy_mirror *dmirror = pt_map->dmirror;
+ unsigned long *pdep;
+
+ if (!dmirror->pgdp)
+ return NULL;
+
+ if (!pt_map->pud_page || pt_map->pgd_idx != hmm_dummy_pgd_index(addr)) {
+ if (pt_map->pud_page) {
+ kunmap(pt_map->pud_page);
+ pt_map->pud_page = NULL;
+ pt_map->pudp = NULL;
+ }
+ pt_map->pgd_idx = hmm_dummy_pgd_index(addr);
+ pdep = &dmirror->pgdp[pt_map->pgd_idx];
+ if (!((*pdep) & HMM_DUMMY_PTE_VALID))
+ return NULL;
+ pt_map->pud_page = pfn_to_page((*pdep) >> HMM_DUMMY_PFN_SHIFT);
+ pt_map->pudp = kmap(pt_map->pud_page);
+ }
+ return pt_map->pudp;
+}
+
+static inline unsigned long *hmm_dummy_pt_pmd_map(struct hmm_dummy_pt_map *pt_map,
+ unsigned long addr)
+{
+ unsigned long *pdep;
+
+ if (!hmm_dummy_pt_pud_map(pt_map, addr))
+ return NULL;
+
+ if (!pt_map->pmd_page || pt_map->pud_idx != hmm_dummy_pud_index(addr)) {
+ if (pt_map->pmd_page) {
+ kunmap(pt_map->pmd_page);
+ pt_map->pmd_page = NULL;
+ pt_map->pmdp = NULL;
+ }
+ pt_map->pud_idx = hmm_dummy_pud_index(addr);
+ pdep = &pt_map->pudp[pt_map->pud_idx];
+ if (!((*pdep) & HMM_DUMMY_PTE_VALID))
+ return NULL;
+ pt_map->pmd_page = pfn_to_page((*pdep) >> HMM_DUMMY_PFN_SHIFT);
+ pt_map->pmdp = kmap(pt_map->pmd_page);
+ }
+ return pt_map->pmdp;
+}
+
+static inline unsigned long *hmm_dummy_pt_pld_map(struct hmm_dummy_pt_map *pt_map,
+ unsigned long addr)
+{
+ unsigned long *pdep;
+
+ if (!hmm_dummy_pt_pmd_map(pt_map, addr))
+ return NULL;
+
+ if (!pt_map->pld_page || pt_map->pmd_idx != hmm_dummy_pmd_index(addr)) {
+ if (pt_map->pld_page) {
+ kunmap(pt_map->pld_page);
+ pt_map->pld_page = NULL;
+ pt_map->pldp = NULL;
+ }
+ pt_map->pmd_idx = hmm_dummy_pmd_index(addr);
+ pdep = &pt_map->pmdp[pt_map->pmd_idx];
+ if (!((*pdep) & HMM_DUMMY_PTE_VALID))
+ return NULL;
+ pt_map->pld_page = pfn_to_page((*pdep) >> HMM_DUMMY_PFN_SHIFT);
+ pt_map->pldp = kmap(pt_map->pld_page);
+ }
+ return pt_map->pldp;
+}
+
+static inline void hmm_dummy_pt_pld_unmap(struct hmm_dummy_pt_map *pt_map)
+{
+ if (pt_map->pld_page) {
+ kunmap(pt_map->pld_page);
+ pt_map->pld_page = NULL;
+ pt_map->pldp = NULL;
+ }
+}
+
+static inline void hmm_dummy_pt_pmd_unmap(struct hmm_dummy_pt_map *pt_map)
+{
+ hmm_dummy_pt_pld_unmap(pt_map);
+ if (pt_map->pmd_page) {
+ kunmap(pt_map->pmd_page);
+ pt_map->pmd_page = NULL;
+ pt_map->pmdp = NULL;
+ }
+}
+
+static inline void hmm_dummy_pt_pud_unmap(struct hmm_dummy_pt_map *pt_map)
+{
+ hmm_dummy_pt_pmd_unmap(pt_map);
+ if (pt_map->pud_page) {
+ kunmap(pt_map->pud_page);
+ pt_map->pud_page = NULL;
+ pt_map->pudp = NULL;
+ }
+}
+
+static inline void hmm_dummy_pt_unmap(struct hmm_dummy_pt_map *pt_map)
+{
+ hmm_dummy_pt_pud_unmap(pt_map);
+}
+
+static int hmm_dummy_pt_alloc(struct hmm_dummy_mirror *dmirror,
+ unsigned long start,
+ unsigned long end)
+{
+ unsigned long *pgdp, *pudp, *pmdp;
+
+ if (dmirror->stop)
+ return -EINVAL;
+
+ if (dmirror->pgdp == NULL) {
+ dmirror->pgdp = kzalloc(PAGE_SIZE, GFP_KERNEL);
+ if (dmirror->pgdp == NULL)
+ return -ENOMEM;
+ }
+
+ for (; start < end; start = hmm_dummy_pld_next(start)) {
+ struct page *pud_page, *pmd_page;
+
+ pgdp = &dmirror->pgdp[hmm_dummy_pgd_index(start)];
+ if (!((*pgdp) & HMM_DUMMY_PTE_VALID)) {
+ pud_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!pud_page)
+ return -ENOMEM;
+ *pgdp = (page_to_pfn(pud_page)<<HMM_DUMMY_PFN_SHIFT);
+ *pgdp |= HMM_DUMMY_PTE_VALID;
+ }
+
+ pud_page = pfn_to_page((*pgdp) >> HMM_DUMMY_PFN_SHIFT);
+ pudp = kmap(pud_page);
+ pudp = &pudp[hmm_dummy_pud_index(start)];
+ if (!((*pudp) & HMM_DUMMY_PTE_VALID)) {
+ pmd_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!pmd_page) {
+ kunmap(pud_page);
+ return -ENOMEM;
+ }
+ *pudp = (page_to_pfn(pmd_page)<<HMM_DUMMY_PFN_SHIFT);
+ *pudp |= HMM_DUMMY_PTE_VALID;
+ }
+
+ pmd_page = pfn_to_page((*pudp) >> HMM_DUMMY_PFN_SHIFT);
+ pmdp = kmap(pmd_page);
+ pmdp = &pmdp[hmm_dummy_pmd_index(start)];
+ if (!((*pmdp) & HMM_DUMMY_PTE_VALID)) {
+ struct page *page;
+
+ page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!page) {
+ kunmap(pmd_page);
+ kunmap(pud_page);
+ return -ENOMEM;
+ }
+ *pmdp = (page_to_pfn(page) << HMM_DUMMY_PFN_SHIFT);
+ *pmdp |= HMM_DUMMY_PTE_VALID;
+ }
+
+ kunmap(pmd_page);
+ kunmap(pud_page);
+ }
+
+ return 0;
+}
+
+static void hmm_dummy_pt_free_pmd(struct hmm_dummy_pt_map *pt_map,
+ unsigned long start,
+ unsigned long end)
+{
+ for (; start < end; start = hmm_dummy_pld_next(start)) {
+ unsigned long pfn, *pmdp, next;
+ struct page *page;
+
+ next = min(hmm_dummy_pld_next(start), end);
+ if (start > hmm_dummy_pld_base(start) || end < next)
+ continue;
+ pmdp = hmm_dummy_pt_pmd_map(pt_map, start);
+ if (!pmdp)
+ continue;
+ if (!(pmdp[hmm_dummy_pmd_index(start)] & HMM_DUMMY_PTE_VALID))
+ continue;
+ pfn = pmdp[hmm_dummy_pmd_index(start)] >> HMM_DUMMY_PFN_SHIFT;
+ page = pfn_to_page(pfn);
+ pmdp[hmm_dummy_pmd_index(start)] = 0;
+ __free_page(page);
+ }
+}
+
+static void hmm_dummy_pt_free_pud(struct hmm_dummy_pt_map *pt_map,
+ unsigned long start,
+ unsigned long end)
+{
+ for (; start < end; start = hmm_dummy_pmd_next(start)) {
+ unsigned long pfn, *pudp, next;
+ struct page *page;
+
+ next = min(hmm_dummy_pmd_next(start), end);
+ hmm_dummy_pt_free_pmd(pt_map, start, next);
+ hmm_dummy_pt_pmd_unmap(pt_map);
+ if (start > hmm_dummy_pmd_base(start) || end < next)
+ continue;
+ pudp = hmm_dummy_pt_pud_map(pt_map, start);
+ if (!pudp)
+ continue;
+ if (!(pudp[hmm_dummy_pud_index(start)] & HMM_DUMMY_PTE_VALID))
+ continue;
+ pfn = pudp[hmm_dummy_pud_index(start)] >> HMM_DUMMY_PFN_SHIFT;
+ page = pfn_to_page(pfn);
+ pudp[hmm_dummy_pud_index(start)] = 0;
+ __free_page(page);
+ }
+}
+
+static void hmm_dummy_pt_free(struct hmm_dummy_mirror *dmirror,
+ unsigned long start,
+ unsigned long end)
+{
+ struct hmm_dummy_pt_map pt_map = {0};
+
+ if (!dmirror->pgdp || (end - start) < HMM_DUMMY_PLD_SIZE)
+ return;
+
+ pt_map.dmirror = dmirror;
+
+ for (; start < end; start = hmm_dummy_pud_next(start)) {
+ unsigned long pfn, *pgdp, next;
+ struct page *page;
+
+ next = min(hmm_dummy_pud_next(start), end);
+ pgdp = dmirror->pgdp;
+ hmm_dummy_pt_free_pud(&pt_map, start, next);
+ hmm_dummy_pt_pud_unmap(&pt_map);
+ if (start > hmm_dummy_pud_base(start) || end < next)
+ continue;
+ if (!(pgdp[hmm_dummy_pgd_index(start)] & HMM_DUMMY_PTE_VALID))
+ continue;
+ pfn = pgdp[hmm_dummy_pgd_index(start)] >> HMM_DUMMY_PFN_SHIFT;
+ page = pfn_to_page(pfn);
+ pgdp[hmm_dummy_pgd_index(start)] = 0;
+ __free_page(page);
+ }
+ hmm_dummy_pt_unmap(&pt_map);
+}
+
+
+
+
+/* hmm_ops - hmm callbacks for the hmm dummy driver.
+ *
+ * Below are the various callbacks that the hmm api requires for a device. The
+ * implementation of the dummy device driver is necessarily simpler than what
+ * a real device driver would do. We have neither interrupts nor any kind of
+ * command buffer onto which to schedule memory invalidations and updates.
+ */
+static struct hmm_mirror *hmm_dummy_mirror_ref(struct hmm_mirror *mirror)
+{
+ struct hmm_dummy_mirror *dmirror;
+
+ if (!mirror)
+ return NULL;
+ dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+ if (!kref_get_unless_zero(&dmirror->kref))
+ return NULL;
+ return mirror;
+}
+
+static void hmm_dummy_mirror_destroy(struct kref *kref)
+{
+ struct hmm_dummy_mirror *dmirror;
+
+ dmirror = container_of(kref, struct hmm_dummy_mirror, kref);
+ mutex_lock(&dmirror->ddevice->mutex);
+ dmirror->ddevice->dmirrors[dmirror->minor] = NULL;
+ mutex_unlock(&dmirror->ddevice->mutex);
+
+ hmm_mirror_unregister(&dmirror->mirror);
+
+ kfree(dmirror);
+}
+
+static struct hmm_mirror *hmm_dummy_mirror_unref(struct hmm_mirror *mirror)
+{
+ struct hmm_dummy_mirror *dmirror;
+
+ if (!mirror)
+ return NULL;
+ dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+ kref_put(&dmirror->kref, hmm_dummy_mirror_destroy);
+ return NULL;
+}
+
+static void hmm_dummy_mirror_release(struct hmm_mirror *mirror)
+{
+ struct hmm_dummy_mirror *dmirror;
+
+ dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+ dmirror->stop = true;
+ mutex_lock(&dmirror->mutex);
+ hmm_dummy_pt_free(dmirror, 0, HMM_DUMMY_MAX_ADDR);
+ kfree(dmirror->pgdp);
+ dmirror->pgdp = NULL;
+ mutex_unlock(&dmirror->mutex);
+}
+
+static int hmm_dummy_fence_wait(struct hmm_fence *fence)
+{
+ /* FIXME add fake fence to showcase api */
+ return 0;
+}
+
+static void hmm_dummy_fence_ref(struct hmm_fence *fence)
+{
+ /* We never allocate fences, so how could we end up here? */
+ BUG();
+}
+
+static void hmm_dummy_fence_unref(struct hmm_fence *fence)
+{
+ /* We never allocate fences, so how could we end up here? */
+ BUG();
+}
+
+static int hmm_dummy_fault(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ const struct hmm_range *range)
+{
+ struct hmm_dummy_mirror *dmirror;
+ struct hmm_dummy_pt_map pt_map = {0};
+ unsigned long addr, i;
+ int ret = 0;
+
+ dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+ pt_map.dmirror = dmirror;
+
+ mutex_lock(&dmirror->mutex);
+ for (i = 0, addr = range->start; addr < range->end; ++i, addr += PAGE_SIZE) {
+ unsigned long *pldp, pld_idx;
+ struct page *page;
+ bool write;
+
+ pldp = hmm_dummy_pt_pld_map(&pt_map, addr);
+ if (!pldp) {
+ ret = -ENOMEM;
+ break;
+ }
+
+ if (!hmm_pte_is_valid_smem(&range->pte[i])) {
+ ret = -ENOENT;
+ break;
+ }
+ write = hmm_pte_is_write(&range->pte[i]);
+ page = pfn_to_page(hmm_pte_pfn(range->pte[i]));
+ if (event->etype == HMM_WFAULT && !write) {
+ ret = -EACCES;
+ break;
+ }
+
+ pr_info("%16s %4d [0x%016lx] pfn 0x%016lx write %d\n",
+ __func__, __LINE__, addr, page_to_pfn(page), write);
+ pld_idx = hmm_dummy_pld_index(addr);
+ pldp[pld_idx] = (page_to_pfn(page) << HMM_DUMMY_PFN_SHIFT);
+ pldp[pld_idx] |= write ? HMM_DUMMY_PTE_WRITE : 0;
+ pldp[pld_idx] |= HMM_DUMMY_PTE_VALID | HMM_DUMMY_PTE_READ;
+ }
+ hmm_dummy_pt_unmap(&pt_map);
+ mutex_unlock(&dmirror->mutex);
+ return ret;
+}
+
+static struct hmm_fence *hmm_dummy_update(struct hmm_mirror *mirror,
+ struct hmm_event *event,
+ const struct hmm_range *range)
+{
+ struct hmm_dummy_mirror *dmirror;
+ struct hmm_dummy_pt_map pt_map = {0};
+ unsigned long addr, i, mask;
+ int ret;
+
+ dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+ pt_map.dmirror = dmirror;
+
+ /* This is for debugging hmm; a real device driver does not have to do that. */
+ switch (event->etype) {
+ case HMM_MIGRATE:
+ case HMM_MUNMAP:
+ mask = 0;
+ break;
+ case HMM_ISDIRTY:
+ mask = -1UL;
+ break;
+ case HMM_WRITE_PROTECT:
+ mask = ~HMM_DUMMY_PTE_WRITE;
+ break;
+ case HMM_RFAULT:
+ case HMM_WFAULT:
+ ret = hmm_dummy_fault(mirror, event, range);
+ if (ret)
+ return ERR_PTR(ret);
+ return NULL;
+ default:
+ return ERR_PTR(-EIO);
+ }
+
+ mutex_lock(&dmirror->mutex);
+ for (i = 0, addr = range->start; addr < range->end; ++i, addr += PAGE_SIZE) {
+ unsigned long *pldp;
+
+ pldp = hmm_dummy_pt_pld_map(&pt_map, addr);
+ if (!pldp)
+ continue;
+ if (((*pldp) & HMM_DUMMY_PTE_DIRTY)) {
+ hmm_pte_mk_dirty(&range->pte[i]);
+ }
+ *pldp &= ~HMM_DUMMY_PTE_DIRTY;
+ *pldp &= mask;
+ }
+ hmm_dummy_pt_unmap(&pt_map);
+
+ if (event->etype == HMM_MUNMAP)
+ hmm_dummy_pt_free(dmirror, range->start, range->end);
+ mutex_unlock(&dmirror->mutex);
+ return NULL;
+}
+
+static const struct hmm_device_ops hmm_dummy_ops = {
+ .mirror_ref = &hmm_dummy_mirror_ref,
+ .mirror_unref = &hmm_dummy_mirror_unref,
+ .mirror_release = &hmm_dummy_mirror_release,
+ .fence_wait = &hmm_dummy_fence_wait,
+ .fence_ref = &hmm_dummy_fence_ref,
+ .fence_unref = &hmm_dummy_fence_unref,
+ .update = &hmm_dummy_update,
+};
+
+
+/* hmm_dummy_mmap - hmm dummy device file mmap operations.
+ *
+ * The hmm dummy driver does not allow mmap of its device file. The main reason
+ * is that the kernel lacks the ability to insert pages with specific custom
+ * protections inside a vma.
+ */
+static int hmm_dummy_mmap_fault(struct vm_area_struct *vma,
+ struct vm_fault *vmf)
+{
+ return VM_FAULT_SIGBUS;
+}
+
+static void hmm_dummy_mmap_open(struct vm_area_struct *vma)
+{
+ /* nop */
+}
+
+static void hmm_dummy_mmap_close(struct vm_area_struct *vma)
+{
+ /* nop */
+}
+
+static const struct vm_operations_struct mmap_mem_ops = {
+ .fault = hmm_dummy_mmap_fault,
+ .open = hmm_dummy_mmap_open,
+ .close = hmm_dummy_mmap_close,
+};
+
+
+/* hmm_dummy_fops - hmm dummy device file operations.
+ *
+ * The hmm dummy driver allows reading from and writing to the mirrored process
+ * through the device file. Below are the read, write and other device file
+ * callbacks that implement access to the mirrored address space.
+ */
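+
+/*
+ * Illustrative sketch (editor's example, not part of this patch): from
+ * userspace the mirrored address space is read by using the mirror address
+ * as the file offset, assuming the file was already bound to a mirrored
+ * process (see include/uapi/linux/hmm_dummy.h for how that is set up); fd
+ * and mirror_addr are hypothetical:
+ *
+ *	char buf[64];
+ *	ssize_t n;
+ *
+ *	do {
+ *		n = pread(fd, buf, sizeof(buf), (off_t)mirror_addr);
+ *	} while (n < 0 && errno == EINTR);
+ *
+ * The EINTR retry matters because the read path below returns -EINTR when it
+ * had to fault pages in before anything could be copied out.
+ */
+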
+#define DUMMY_WINDOW 4
+
+static int hmm_dummy_mirror_fault(struct hmm_dummy_mirror *dmirror,
+ unsigned long addr,
+ bool write)
+{
+ struct hmm_mirror *mirror = &dmirror->mirror;
+ struct hmm_event event;
+ unsigned long start, end;
+ int ret;
+
+ event.start = start = addr > ((DUMMY_WINDOW >> 1) << PAGE_SHIFT) ?
+ addr - ((DUMMY_WINDOW >> 1) << PAGE_SHIFT) : 0;
+ event.end = end = start + (DUMMY_WINDOW << PAGE_SHIFT);
+ event.etype = write ? HMM_WFAULT : HMM_RFAULT;
+
+ /* Pre-allocate device page table. */
+ mutex_lock(&dmirror->mutex);
+ ret = hmm_dummy_pt_alloc(dmirror, start, end);
+ mutex_unlock(&dmirror->mutex);
+ if (ret)
+ return ret;
+
+ while (1) {
+ ret = hmm_mirror_fault(mirror, &event);
+ /* Ignore any error that does not concern the fault address. */
+ if (addr >= event.end) {
+ event.start = event.end;
+ event.end = end;
+ continue;
+ }
+ break;
+ }
+
+ return ret;
+}
+
+static ssize_t hmm_dummy_fops_read(struct file *filp,
+ char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct hmm_dummy_device *ddevice;
+ struct hmm_dummy_mirror *dmirror;
+ struct hmm_dummy_pt_map pt_map = {0};
+ struct hmm_mirror *mirror;
+ unsigned long start, end, offset;
+ unsigned minor;
+ ssize_t retval = 0;
+ void *tmp;
+ long r;
+
+ tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!tmp)
+ return -ENOMEM;
+
+ /* Check if we are mirroring anything */
+ minor = iminor(file_inode(filp));
+ ddevice = filp->private_data;
+ mutex_lock(&ddevice->mutex);
+ if (ddevice->dmirrors[minor] == NULL) {
+ mutex_unlock(&ddevice->mutex);
+ kfree(tmp);
+ return 0;
+ }
+ mirror = hmm_mirror_ref(&ddevice->dmirrors[minor]->mirror);
+ mutex_unlock(&ddevice->mutex);
+
+ if (!mirror) {
+ kfree(tmp);
+ return 0;
+ }
+
+ dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+ if (dmirror->stop) {
+ kfree(tmp);
+ hmm_mirror_unref(mirror);
+ return 0;
+ }
+
+ /* The range of addresses to look up. */
+ start = (*ppos) & PAGE_MASK;
+ offset = (*ppos) - start;
+ end = PAGE_ALIGN(start + count);
+ BUG_ON(start == end);
+ pt_map.dmirror = dmirror;
+
+ for (; count; start += PAGE_SIZE, offset = 0) {
+ unsigned long *pldp, pld_idx;
+ unsigned long size = min(PAGE_SIZE - offset, count);
+ struct page *page;
+ char *ptr;
+
+ mutex_lock(&dmirror->mutex);
+ pldp = hmm_dummy_pt_pld_map(&pt_map, start);
+ pld_idx = hmm_dummy_pld_index(start);
+ if (!pldp || !(pldp[pld_idx] & HMM_DUMMY_PTE_VALID)) {
+ hmm_dummy_pt_unmap(&pt_map);
+ mutex_unlock(&dmirror->mutex);
+ goto fault;
+ }
+ page = hmm_dummy_pte_to_page(pldp[pld_idx]);
+ if (!page) {
+ mutex_unlock(&dmirror->mutex);
+ BUG();
+ kfree(tmp);
+ hmm_mirror_unref(mirror);
+ return -EFAULT;
+ }
+ ptr = kmap(page);
+ memcpy(tmp, ptr + offset, size);
+ kunmap(page);
+ hmm_dummy_pt_unmap(&pt_map);
+ mutex_unlock(&dmirror->mutex);
+
+ r = copy_to_user(buf, tmp, size);
+ if (r) {
+ kfree(tmp);
+ hmm_mirror_unref(mirror);
+ return -EFAULT;
+ }
+ retval += size;
+ *ppos += size;
+ count -= size;
+ buf += size;
+ }
+
+ kfree(tmp);
+ hmm_mirror_unref(mirror);
+ return retval;
+
+fault:
+ kfree(tmp);
+ r = hmm_dummy_mirror_fault(dmirror, start, false);
+ hmm_mirror_unref(mirror);
+ if (r)
+ return r;
+
+ /* Force userspace to retry read if nothing was read. */
+ return retval ? retval : -EINTR;
+}
+
+static ssize_t hmm_dummy_fops_write(struct file *filp,
+ const char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct hmm_dummy_device *ddevice;
+ struct hmm_dummy_mirror *dmirror;
+ struct hmm_dummy_pt_map pt_map = {0};
+ struct hmm_mirror *mirror;
+ unsigned long start, end, offset;
+ unsigned minor;
+ ssize_t retval = 0;
+ void *tmp;
+ long r;
+
+ tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!tmp)
+ return -ENOMEM;
+
+ /* Check if we are mirroring anything */
+ minor = iminor(file_inode(filp));
+ ddevice = filp->private_data;
+ mutex_lock(&ddevice->mutex);
+ if (ddevice->dmirrors[minor] == NULL) {
+ mutex_unlock(&ddevice->mutex);
+ kfree(tmp);
+ return 0;
+ }
+ mirror = hmm_mirror_ref(&ddevice->dmirrors[minor]->mirror);
+ mutex_unlock(&ddevice->mutex);
+
+ if (!mirror) {
+ kfree(tmp);
+ return 0;
+ }
+
+ dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+ if (dmirror->stop) {
+ kfree(tmp);
+ hmm_mirror_unref(mirror);
+ return 0;
+ }
+
+ /* The range of addresses to look up. */
+ start = (*ppos) & PAGE_MASK;
+ offset = (*ppos) - start;
+ end = PAGE_ALIGN(start + count);
+ BUG_ON(start == end);
+ pt_map.dmirror = dmirror;
+
+ for (; count; start += PAGE_SIZE, offset = 0) {
+ unsigned long *pldp, pld_idx;
+ unsigned long size = min(PAGE_SIZE - offset, count);
+ struct page *page;
+ char *ptr;
+
+ r = copy_from_user(tmp, buf, size);
+ if (r) {
+ kfree(tmp);
+ hmm_mirror_unref(mirror);
+ return -EFAULT;
+ }
+
+ mutex_lock(&dmirror->mutex);
+
+ pldp = hmm_dummy_pt_pld_map(&pt_map, start);
+ pld_idx = hmm_dummy_pld_index(start);
+ if (!pldp || !(pldp[pld_idx] & HMM_DUMMY_PTE_VALID)) {
+ hmm_dummy_pt_unmap(&pt_map);
+ mutex_unlock(&dmirror->mutex);
+ goto fault;
+ }
+ if (!(pldp[pld_idx] & HMM_DUMMY_PTE_WRITE)) {
+ hmm_dummy_pt_unmap(&pt_map);
+ mutex_unlock(&dmirror->mutex);
+ goto fault;
+ }
+ pldp[pld_idx] |= HMM_DUMMY_PTE_DIRTY;
+ page = hmm_dummy_pte_to_page(pldp[pld_idx]);
+ if (!page) {
+ mutex_unlock(&dmirror->mutex);
+ BUG();
+ kfree(tmp);
+ hmm_mirror_unref(mirror);
+ return -EFAULT;
+ }
+ ptr = kmap(page);
+ memcpy(ptr + offset, tmp, size);
+ kunmap(page);
+ hmm_dummy_pt_unmap(&pt_map);
+ mutex_unlock(&dmirror->mutex);
+
+ retval += size;
+ *ppos += size;
+ count -= size;
+ buf += size;
+ }
+
+ kfree(tmp);
+ hmm_mirror_unref(mirror);
+ return retval;
+
+fault:
+ kfree(tmp);
+ r = hmm_dummy_mirror_fault(dmirror, start, true);
+ hmm_mirror_unref(mirror);
+ if (r)
+ return r;
+
+ /* Force userspace to retry write if nothing was written. */
+ return retval ? retval : -EINTR;
+}
+
+static int hmm_dummy_fops_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+ return -EINVAL;
+}
+
+static int hmm_dummy_fops_open(struct inode *inode, struct file *filp)
+{
+ struct hmm_dummy_device *ddevice;
+ struct cdev *cdev = inode->i_cdev;
+ const int minor = iminor(inode);
+
+ /* No exclusive opens */
+ if (filp->f_flags & O_EXCL)
+ return -EINVAL;
+
+ ddevice = container_of(cdev, struct hmm_dummy_device, cdev);
+ filp->private_data = ddevice;
+ ddevice->fmapping[minor] = &inode->i_data;
+
+ return 0;
+}
+
+static int hmm_dummy_fops_release(struct inode *inode,
+ struct file *filp)
+{
+#if 0
+ struct hmm_dummy_device *ddevice;
+ struct hmm_dummy_mirror *dmirror;
+ struct cdev *cdev = inode->i_cdev;
+ const int minor = iminor(inode);
+
+ ddevice = container_of(cdev, struct hmm_dummy_device, cdev);
+ mutex_lock(&ddevice->mutex);
+ dmirror = ddevice->dmirrors[minor];
+ if (dmirror && dmirror->filp == filp) {
+ struct hmm_mirror *mirror = hmm_mirror_ref(&dmirror->mirror);
+ ddevice->dmirrors[minor] = NULL;
+ mutex_unlock(&ddevice->mutex);
+
+ if (mirror) {
+ hmm_mirror_release(mirror);
+ hmm_mirror_unref(mirror);
+ }
+ } else
+ mutex_unlock(&ddevice->mutex);
+#endif
+
+ return 0;
+}
+
+static long hmm_dummy_fops_unlocked_ioctl(struct file *filp,
+ unsigned int command,
+ unsigned long arg)
+{
+ struct hmm_dummy_device *ddevice;
+ struct hmm_dummy_mirror *dmirror;
+ unsigned minor;
+ int ret;
+
+ minor = iminor(file_inode(filp));
+ ddevice = filp->private_data;
+ switch (command) {
+ case HMM_DUMMY_EXPOSE_MM:
+ mutex_lock(&ddevice->mutex);
+ dmirror = ddevice->dmirrors[minor];
+ if (dmirror) {
+ mutex_unlock(&ddevice->mutex);
+ return -EBUSY;
+ }
+ /* Mirror this process address space */
+ dmirror = kzalloc(sizeof(*dmirror), GFP_KERNEL);
+ if (dmirror == NULL) {
+ mutex_unlock(&ddevice->mutex);
+ return -ENOMEM;
+ }
+ kref_init(&dmirror->kref);
+ dmirror->mm = NULL;
+ dmirror->stop = false;
+ dmirror->pid = task_pid_nr(current);
+ dmirror->ddevice = ddevice;
+ dmirror->minor = minor;
+ dmirror->filp = filp;
+ dmirror->pgdp = NULL;
+ mutex_init(&dmirror->mutex);
+ ddevice->dmirrors[minor] = dmirror;
+ mutex_unlock(&ddevice->mutex);
+
+ ret = hmm_mirror_register(&dmirror->mirror,
+ &ddevice->device,
+ current->mm);
+ if (ret) {
+ mutex_lock(&ddevice->mutex);
+ ddevice->dmirrors[minor] = NULL;
+ mutex_unlock(&ddevice->mutex);
+ kfree(dmirror);
+ return ret;
+ }
+ /* Success. */
+ pr_info("mirroring address space of %d\n", dmirror->pid);
+ hmm_mirror_unref(&dmirror->mirror);
+ return 0;
+ default:
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static const struct file_operations hmm_dummy_fops = {
+ .read = hmm_dummy_fops_read,
+ .write = hmm_dummy_fops_write,
+ .mmap = hmm_dummy_fops_mmap,
+ .open = hmm_dummy_fops_open,
+ .release = hmm_dummy_fops_release,
+ .unlocked_ioctl = hmm_dummy_fops_unlocked_ioctl,
+ .llseek = default_llseek,
+ .owner = THIS_MODULE,
+};
+
+
+/*
+ * char device driver
+ */
+static int hmm_dummy_device_init(struct hmm_dummy_device *ddevice)
+{
+ int ret, i;
+
+ ret = alloc_chrdev_region(&ddevice->dev, 0,
+ HMM_DUMMY_MAX_DEVICES,
+ ddevice->name);
+ if (ret < 0)
+ goto error;
+ ddevice->major = MAJOR(ddevice->dev);
+
+ cdev_init(&ddevice->cdev, &hmm_dummy_fops);
+ ret = cdev_add(&ddevice->cdev, ddevice->dev, HMM_DUMMY_MAX_DEVICES);
+ if (ret) {
+ unregister_chrdev_region(ddevice->dev, HMM_DUMMY_MAX_DEVICES);
+ goto error;
+ }
+
+ /* Register the hmm device. */
+ for (i = 0; i < HMM_DUMMY_MAX_DEVICES; i++)
+ ddevice->dmirrors[i] = NULL;
+ mutex_init(&ddevice->mutex);
+ ddevice->device.ops = &hmm_dummy_ops;
+ ddevice->device.name = ddevice->name;
+
+ ret = hmm_device_register(&ddevice->device);
+ if (ret) {
+ cdev_del(&ddevice->cdev);
+ unregister_chrdev_region(ddevice->dev, HMM_DUMMY_MAX_DEVICES);
+ goto error;
+ }
+
+ return 0;
+
+error:
+ return ret;
+}
+
+static void hmm_dummy_device_fini(struct hmm_dummy_device *ddevice)
+{
+ unsigned i;
+
+ /* First finish hmm. */
+ mutex_lock(&ddevice->mutex);
+ for (i = 0; i < HMM_DUMMY_MAX_DEVICES; i++) {
+ struct hmm_mirror *mirror = NULL;
+
+ if (ddevice->dmirrors[i]) {
+ mirror = hmm_mirror_ref(&ddevice->dmirrors[i]->mirror);
+ ddevice->dmirrors[i] = NULL;
+ }
+ if (!mirror)
+ continue;
+
+ mutex_unlock(&ddevice->mutex);
+ hmm_mirror_release(mirror);
+ hmm_mirror_unref(mirror);
+ mutex_lock(&ddevice->mutex);
+ }
+ mutex_unlock(&ddevice->mutex);
+
+ if (hmm_device_unregister(&ddevice->device))
+ BUG();
+
+ cdev_del(&ddevice->cdev);
+ unregister_chrdev_region(ddevice->dev,
+ HMM_DUMMY_MAX_DEVICES);
+}
+
+static int __init hmm_dummy_init(void)
+{
+ int ret;
+
+ snprintf(ddevices[0].name, sizeof(ddevices[0].name),
+ "%s%d", HMM_DUMMY_DEVICE_NAME, 0);
+ ret = hmm_dummy_device_init(&ddevices[0]);
+ if (ret)
+ return ret;
+
+ snprintf(ddevices[1].name, sizeof(ddevices[1].name),
+ "%s%d", HMM_DUMMY_DEVICE_NAME, 1);
+ ret = hmm_dummy_device_init(&ddevices[1]);
+ if (ret) {
+ hmm_dummy_device_fini(&ddevices[0]);
+ return ret;
+ }
+
+ pr_info("hmm_dummy loaded THIS IS A DANGEROUS MODULE !!!\n");
+ return 0;
+}
+
+static void __exit hmm_dummy_exit(void)
+{
+ hmm_dummy_device_fini(&ddevices[1]);
+ hmm_dummy_device_fini(&ddevices[0]);
+}
+
+module_init(hmm_dummy_init);
+module_exit(hmm_dummy_exit);
+MODULE_LICENSE("GPL");
diff --git a/include/uapi/linux/hmm_dummy.h b/include/uapi/linux/hmm_dummy.h
new file mode 100644
index 0000000..20eb98f
--- /dev/null
+++ b/include/uapi/linux/hmm_dummy.h
@@ -0,0 +1,30 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is a dummy driver made to exercise the HMM (heterogeneous memory
+ * management) API of the kernel. It allows a userspace program to access its
+ * whole address space through the hmm dummy device file.
+ */
+#ifndef _UAPI_LINUX_HMM_DUMMY_H
+#define _UAPI_LINUX_HMM_DUMMY_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+#include <linux/irqnr.h>
+
+/* Expose the address space of the calling process through hmm dummy dev file */
+#define HMM_DUMMY_EXPOSE_MM _IO('R', 0x00)
+
+#endif /* _UAPI_LINUX_HMM_DUMMY_H */
--
1.9.3
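For context, a minimal userspace sketch of how the dummy device is meant to be
driven. The HMM_DUMMY_EXPOSE_MM ioctl and the read-at-virtual-address semantics
come from the driver above, while the /dev/hmm_dummy0 node name and the retry
policy are assumptions, not part of this patch:

  #include <stdio.h>
  #include <string.h>
  #include <errno.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/hmm_dummy.h>

  int main(void)
  {
      const char *local = "hello through the mirror";
      char buf[64];
      ssize_t r;
      int fd;

      fd = open("/dev/hmm_dummy0", O_RDWR);    /* assumed node name */
      if (fd < 0 || ioctl(fd, HMM_DUMMY_EXPOSE_MM))
          return 1;
      /*
       * The file offset is interpreted as a virtual address inside the
       * mirrored address space; the driver returns EINTR when it had to
       * fault the page in first, so simply retry in that case.
       */
      do {
          r = pread(fd, buf, strlen(local) + 1,
                    (off_t)(unsigned long)local);
      } while (r < 0 && errno == EINTR);
      if (r > 0)
          printf("mirror sees: %s\n", buf);
      close(fd);
      return 0;
  }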
* [RFC PATCH 5/6] iommu: new api to map an array of page frame number into a domain.
2014-08-29 19:10 [RFC PATCH 0/6] HMM (heterogeneous memory management) v4 j.glisse
` (3 preceding siblings ...)
2014-08-29 19:10 ` [RFC PATCH 4/6] hmm/dummy: dummy driver to showcase the hmm api v2 j.glisse
@ 2014-08-29 19:10 ` j.glisse
2014-08-29 19:10 ` [RFC PATCH 6/6] hmm: add support for iommu domain j.glisse
5 siblings, 0 replies; 9+ messages in thread
From: j.glisse @ 2014-08-29 19:10 UTC (permalink / raw)
To: linux-kernel, linux-mm, akpm, Haggai Eran
Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
Dave Airlie, Brendan Conoboy, Joe Donohue, Duncan Poole,
Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan,
Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
Laurent Morichetti, Alexander Deucher, Oded Gabbay,
Jérôme Glisse
From: Jérôme Glisse <jglisse@redhat.com>
New users of the iommu can share the same mapping for devices in the same
domain, which allows saving resources. For this a new set of iommu domain
callbacks is needed. This adds the core support for these callbacks.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
include/linux/iommu.h | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 145 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 20f9a52..ff6983f 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -97,6 +97,9 @@ enum iommu_attr {
* @domain_has_cap: domain capabilities query
* @add_device: add device to iommu grouping
* @remove_device: remove device from iommu grouping
+ * @domain_map_directory: Map a directory of pages.
+ * @domain_update_directory: Update a directory of pages mapping.
+ * @domain_unmap_directory: Unmap a directory of pages.
* @domain_get_attr: Query domain attributes
* @domain_set_attr: Change domain attributes
* @pgsize_bitmap: bitmap of supported page sizes
@@ -130,6 +133,26 @@ struct iommu_ops {
/* Get the number of windows per domain */
u32 (*domain_get_windows)(struct iommu_domain *domain);
+ int (*domain_map_directory)(struct iommu_domain *domain,
+ unsigned long npages,
+ unsigned long *pfns,
+ unsigned long pfn_mask,
+ unsigned long pfn_shift,
+ unsigned long pfn_valid,
+ unsigned long pfn_write,
+ dma_addr_t *iova_base);
+ int (*domain_update_directory)(struct iommu_domain *domain,
+ unsigned long npages,
+ unsigned long *pfns,
+ unsigned long pfn_mask,
+ unsigned long pfn_shift,
+ unsigned long pfn_valid,
+ unsigned long pfn_write,
+ dma_addr_t iova_base);
+ int (*domain_unmap_directory)(struct iommu_domain *domain,
+ unsigned long npages,
+ dma_addr_t iova);
+
unsigned long pgsize_bitmap;
};
@@ -240,6 +263,97 @@ static inline int report_iommu_fault(struct iommu_domain *domain,
return ret;
}
+/* iommu_domain_map_directory() - Map a directory of pages into a given domain.
+ *
+ * @domain: Domain into which mapping should happen.
+ * @npages: The maximum number of pages to map.
+ * @pfns: The pfns directory array.
+ * @pfn_mask: The pfn mask (pfn_for_i = (pfns[i] & pfn_mask) >> pfn_shift).
+ * @pfn_shift: The pfn shift (pfn_for_i = (pfns[i] & pfn_mask) >> pfn_shift).
+ * @pfn_valid: An entry in the array is valid if (pfns[i] & pfn_valid).
+ * @pfn_write: An entry should be mapped write if (pfns[i] & pfn_write)
+ * @iova_base: Base io virtual address at which directory is mapped on success.
+ * Returns: Number of mapped pages on success, negative errno otherwise.
+ *
+ * This allows mapping a directory of pages contiguously into a specific domain.
+ * On success it sets the base io virtual address at which the directory is
+ * mapped and returns the number of pages successfully mapped. Each entry in
+ * the directory can be a valid page, mapped either read only or read and write
+ * depending on its flags, and there can be gaps.
+ */
+static inline int iommu_domain_map_directory(struct iommu_domain *domain,
+ unsigned long npages,
+ unsigned long *pfns,
+ unsigned long pfn_mask,
+ unsigned long pfn_shift,
+ unsigned long pfn_valid,
+ unsigned long pfn_write,
+ dma_addr_t *iova_base)
+{
+ if (!domain->ops->domain_map_directory)
+ return -EINVAL;
+ return domain->ops->domain_map_directory(domain, npages, pfns,
+ pfn_mask, pfn_shift,
+ pfn_valid, pfn_write,
+ iova_base);
+}
+
+/* iommu_domain_update_directory() - Update a directory mapping of pages.
+ *
+ * @domain: Domain in which the mapping exists.
+ * @npages: The maximum number of pages to map.
+ * @pfns: The pfns directory array.
+ * @pfn_mask: The pfn mask (pfn_for_i = (pfns[i] & pfn_mask) >> pfn_shift).
+ * @pfn_shift: The pfn shift (pfn_for_i = (pfns[i] & pfn_mask) >> pfn_shift).
+ * @pfn_valid: An entry in the array is valid if (pfns[i] & pfn_valid).
+ * @pfn_write: An entry should be mapped write if (pfns[i] & pfn_write)
+ * @iova_base: Base io virtual address at which directory is mapped.
+ * Returns: Number of mapped (positive) or unmapped (negative) pages.
+ *
+ * This allows updating a previously successful directory mapping of pages,
+ * either by adding, removing or replacing pages, or by modifying a page
+ * mapping (read only to read and write, or read and write to read only). It
+ * returns the number of new or removed mappings; modified mappings are not
+ * counted. So a positive return value means an increase in the number of
+ * valid mapped entries, while a negative one means a decrease. In all cases
+ * |return| <= npages.
+ */
+static inline int iommu_domain_update_directory(struct iommu_domain *domain,
+ unsigned long npages,
+ unsigned long *pfns,
+ unsigned long pfn_mask,
+ unsigned long pfn_shift,
+ unsigned long pfn_valid,
+ unsigned long pfn_write,
+ dma_addr_t iova_base)
+{
+ if (!domain->ops->domain_update_directory)
+ return -EINVAL;
+ return domain->ops->domain_update_directory(domain, npages, pfns,
+ pfn_mask, pfn_shift,
+ pfn_valid, pfn_write,
+ iova_base);
+}
+
+/* iommu_domain_unmap_directory() - Unmap a directory of pages in a domain.
+ *
+ * @domain: Domain in which the mapping exists.
+ * @npages: The number of pages to unmap.
+ * @iova_base: Base io virtual address at which the directory is mapped.
+ * Returns: Number of unmapped pages.
+ *
+ * This allows unmapping a previously successful directory mapping of pages. It
+ * frees the iova and returns the number of valid unmapped entries.
+ */
+static inline int iommu_domain_unmap_directory(struct iommu_domain *domain,
+ unsigned long npages,
+ dma_addr_t iova_base)
+{
+ if (!domain->ops->domain_unmap_directory)
+ return 0;
+ return domain->ops->domain_unmap_directory(domain, npages, iova_base);
+}
+
#else /* CONFIG_IOMMU_API */
struct iommu_ops {};
@@ -424,6 +538,37 @@ static inline void iommu_device_unlink(struct device *dev, struct device *link)
{
}
+static inline int iommu_domain_map_directory(struct iommu_domain *domain,
+ unsigned long npages,
+ unsigned long *pfns,
+ unsigned long pfn_mask,
+ unsigned long pfn_shift,
+ unsigned long pfn_valid,
+ unsigned long pfn_write,
+ dma_addr_t *iova_base)
+{
+ return -EINVAL;
+}
+
+static inline int iommu_domain_update_directory(struct iommu_domain *domain,
+ unsigned long npages,
+ unsigned long *pfns,
+ unsigned long pfn_mask,
+ unsigned long pfn_shift,
+ unsigned long pfn_valid,
+ unsigned long pfn_write,
+ dma_addr_t iova_base)
+{
+ return -EINVAL;
+}
+
+static inline int iommu_domain_unmap_directory(struct iommu_domain *domain,
+ unsigned long npages,
+ dma_addr_t iova_base)
+{
+ return 0;
+}
+
#endif /* CONFIG_IOMMU_API */
#endif /* __LINUX_IOMMU_H */
--
1.9.3
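To illustrate the intended calling convention, here is a sketch of a
hypothetical caller; the my_* names and the pfn flag encoding are made up for
illustration, only iommu_domain_map_directory() itself comes from this patch:

  #include <linux/iommu.h>

  /* Hypothetical pfn encoding: bit 0 = valid, bit 1 = write, pfn stored above bit 12. */
  #define MY_PFN_VALID    (1UL << 0)
  #define MY_PFN_WRITE    (1UL << 1)
  #define MY_PFN_SHIFT    12UL
  #define MY_PFN_MASK     (~((1UL << MY_PFN_SHIFT) - 1))

  static int my_map_directory(struct iommu_domain *domain,
                              unsigned long *pfns, unsigned long npages,
                              dma_addr_t *iova_base)
  {
      int mapped;

      /* Entries without MY_PFN_VALID set are skipped, gaps are allowed. */
      mapped = iommu_domain_map_directory(domain, npages, pfns,
                                          MY_PFN_MASK, MY_PFN_SHIFT,
                                          MY_PFN_VALID, MY_PFN_WRITE,
                                          iova_base);
      if (mapped < 0)
          return mapped;
      /*
       * On success *iova_base is the io virtual address of the whole
       * directory; entry i should sit at *iova_base + i * PAGE_SIZE.
       */
      return 0;
  }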
* [RFC PATCH 6/6] hmm: add support for iommu domain.
2014-08-29 19:10 [RFC PATCH 0/6] HMM (heterogeneous memory management) v4 j.glisse
` (4 preceding siblings ...)
2014-08-29 19:10 ` [RFC PATCH 5/6] iommu: new api to map an array of page frame number into a domain j.glisse
@ 2014-08-29 19:10 ` j.glisse
5 siblings, 0 replies; 9+ messages in thread
From: j.glisse @ 2014-08-29 19:10 UTC (permalink / raw)
To: linux-kernel, linux-mm, akpm, Haggai Eran
Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
Dave Airlie, Brendan Conoboy, Joe Donohue, Duncan Poole,
Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan,
Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
Laurent Morichetti, Alexander Deucher, Oded Gabbay,
Jérôme Glisse
From: Jérôme Glisse <jglisse@redhat.com>
This adds support for grouping mirrors of a process by shared iommu domain
and for mapping the necessary pages into the iommu, thus allowing hmm users
to share the dma mapping of process pages and avoiding each of them having
to individually map each page.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
---
include/linux/hmm.h | 8 ++
mm/hmm.c | 375 +++++++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 368 insertions(+), 15 deletions(-)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index f7c379b..3d85721 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -49,10 +49,12 @@
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/mman.h>
+#include <linux/iommu.h>
struct hmm_device;
struct hmm_device_ops;
+struct hmm_domain;
struct hmm_mirror;
struct hmm_event;
struct hmm;
@@ -119,12 +121,14 @@ struct hmm_event {
* @ptp: The page directory page struct.
* @start: First address (inclusive).
* @end: Last address (exclusive).
+ * @iova_base: base io virtual address for this range.
*/
struct hmm_range {
unsigned long *pte;
struct page *ptp;
unsigned long start;
unsigned long end;
+ dma_addr_t iova_base;
};
static inline unsigned long hmm_range_size(struct hmm_range *range)
@@ -288,6 +292,7 @@ struct hmm_device_ops {
/* struct hmm_device - per device hmm structure
*
+ * @iommu_domain: Iommu domain this device is associated with (NULL if none).
* @name: Device name (uniquely identify the device on the system).
* @ops: The hmm operations callback.
* @mirrors: List of all active mirrors for the device.
@@ -297,6 +302,7 @@ struct hmm_device_ops {
* struct (only once).
*/
struct hmm_device {
+ struct iommu_domain *iommu_domain;
const char *name;
const struct hmm_device_ops *ops;
struct list_head mirrors;
@@ -317,6 +323,7 @@ int hmm_device_unregister(struct hmm_device *device);
/* struct hmm_mirror - per device and per mm hmm structure
*
* @device: The hmm_device struct this hmm_mirror is associated to.
+ * @domain: The hmm domain this mirror belongs to.
* @hmm: The hmm struct this hmm_mirror is associated to.
* @dlist: List of all hmm_mirror for same device.
* @mlist: List of all hmm_mirror for same process.
@@ -329,6 +336,7 @@ int hmm_device_unregister(struct hmm_device *device);
*/
struct hmm_mirror {
struct hmm_device *device;
+ struct hmm_domain *domain;
struct hmm *hmm;
struct list_head dlist;
struct list_head mlist;
diff --git a/mm/hmm.c b/mm/hmm.c
index d29a2d9..cc6970b 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -49,8 +49,27 @@
static struct srcu_struct srcu;
+/* struct hmm_domain - per iommu domain hmm struct.
+ *
+ * @iommu_domain: Iommu domain.
+ * @mirrors: List of all mirrors of different devices in the same domain.
+ * @pt: Page table storing the dma address of each pld (page lower directory)
+ * page of the hmm page table used by the iommu.
+ * @list: Head for the hmm list of all hmm_domain.
+ *
+ * Devices that belong to the same iommu domain and that mirror the same process
+ * are grouped together so that they can share the same iommu resources.
+ */
+struct hmm_domain {
+ struct iommu_domain *iommu_domain;
+ struct gpt pt;
+ struct list_head mirrors;
+ struct list_head list;
+};
+
+
/* struct hmm - per mm_struct hmm structure
*
+ * @domains: List of hmm_domain.
* @mm: The mm struct this hmm is associated with.
* @kref: Reference counter
* @lock: Serialize the mirror list modifications.
@@ -65,6 +84,7 @@ static struct srcu_struct srcu;
* the process address space.
*/
struct hmm {
+ struct list_head domains;
struct mm_struct *mm;
struct kref kref;
spinlock_t lock;
@@ -179,6 +199,181 @@ static void hmm_range_clear(struct hmm_range *range, struct hmm *hmm)
}
+/* hmm_domain - iommu domain helper functions.
+ *
+ * To simplify and share resources, hmm handles the iommu on behalf of devices.
+ */
+
+#define HMM_DPTE_LOCK_BIT 0UL
+#define HMM_DPTE_VALID_BIT 1UL
+
+static inline bool hmm_dpte_is_valid(const volatile unsigned long *dpte)
+{
+ return test_bit(HMM_DPTE_VALID_BIT, dpte);
+}
+
+static inline void hmm_dpte_lock(volatile unsigned long *dpte, struct hmm *hmm)
+{
+ do {
+ if (likely(!test_and_set_bit_lock(HMM_DPTE_LOCK_BIT, dpte)))
+ return;
+ wait_event(hmm->wait_queue, !test_bit(HMM_DPTE_LOCK_BIT, dpte));
+ } while (1);
+}
+
+static inline void hmm_dpte_unlock(volatile unsigned long *dpte,
+ struct hmm *hmm)
+{
+ clear_bit(HMM_DPTE_LOCK_BIT, dpte);
+ wake_up(&hmm->wait_queue);
+}
+
+static inline bool hmm_dpte_clear_valid(volatile unsigned long *dpte)
+{
+ return test_and_clear_bit(HMM_DPTE_VALID_BIT, dpte);
+}
+
+static int hmm_domain_update_or_map(struct hmm_domain *domain,
+ struct hmm *hmm,
+ struct page *dptp,
+ volatile unsigned long *dpte,
+ unsigned long *pfns)
+{
+ dma_addr_t iova;
+ int ret;
+
+ pfns = (unsigned long *)((unsigned long)pfns & PAGE_MASK);
+ hmm_dpte_lock(dpte, hmm);
+ if (hmm_pte_is_valid_smem(dpte)) {
+ int n;
+
+ iova = *dpte & PAGE_MASK;
+ n = iommu_domain_update_directory(domain->iommu_domain,
+ 1UL << GPT_PDIR_NBITS,
+ pfns, PAGE_MASK, PAGE_SHIFT,
+ 1 << HMM_PTE_VALID_SMEM_BIT,
+ 1 << HMM_PTE_WRITE_BIT,
+ iova);
+ if (n > 0)
+ gpt_ptp_batch_ref(&domain->pt, dptp, n);
+ else if (n < 0)
+ gpt_ptp_batch_unref(&domain->pt, dptp, -n);
+ hmm_dpte_unlock(dpte, hmm);
+ return 0;
+ }
+
+ ret = iommu_domain_map_directory(domain->iommu_domain,
+ 1UL << GPT_PDIR_NBITS,
+ pfns, PAGE_MASK, PAGE_SHIFT,
+ 1 << HMM_PTE_VALID_SMEM_BIT,
+ 1 << HMM_PTE_WRITE_BIT,
+ &iova);
+ if (ret > 0) {
+ gpt_ptp_batch_ref(&domain->pt, dptp, ret);
+ ret = 0;
+ }
+ hmm_dpte_unlock(dpte, hmm);
+ return ret;
+}
+
+static bool hmm_domain_do_update(struct hmm_domain *domain,
+ struct hmm *hmm,
+ struct page *dptp,
+ volatile unsigned long *dpte,
+ unsigned long *pfns)
+{
+ bool present = false;
+
+ pfns = (unsigned long *)((unsigned long)pfns & PAGE_MASK);
+ hmm_dpte_lock(dpte, hmm);
+ if (hmm_pte_is_valid_smem(dpte)) {
+ dma_addr_t iova = *dpte & PAGE_MASK;
+ int n;
+
+ present = true;
+ n = iommu_domain_update_directory(domain->iommu_domain,
+ 1UL << GPT_PDIR_NBITS,
+ pfns, PAGE_MASK, PAGE_SHIFT,
+ 1 << HMM_PTE_VALID_SMEM_BIT,
+ 1 << HMM_PTE_WRITE_BIT,
+ iova);
+ if (n > 0)
+ gpt_ptp_batch_ref(&domain->pt, dptp, n);
+ else if (n < 0)
+ gpt_ptp_batch_unref(&domain->pt, dptp, -n);
+ }
+ hmm_dpte_unlock(dpte, hmm);
+
+ return present;
+}
+
+static void hmm_domain_update(struct hmm_domain *domain,
+ struct hmm *hmm,
+ const struct hmm_event *event,
+ struct gpt_iter *iter)
+{
+ struct gpt_lock dlock;
+ struct gpt_iter diter;
+ unsigned long addr;
+
+ dlock.start = event->start;
+ dlock.end = event->end - 1UL;
+ BUG_ON(gpt_lock_update(&domain->pt, &dlock));
+ gpt_iter_init(&diter, &domain->pt, &dlock);
+
+ BUG_ON(!gpt_iter_first(iter, event->start, event->end - 1UL));
+ for (addr = iter->pte_addr; iter->pte;) {
+ if (gpt_iter_addr(&diter, addr))
+ hmm_domain_do_update(domain, hmm, diter.ptp,
+ diter.pte, iter->pte);
+
+ addr = min(gpt_pdp_end(&domain->pt, iter->ptp) + 1UL,
+ event->end);
+ gpt_iter_first(iter, addr, event->end - 1UL);
+ }
+
+ gpt_unlock_update(&domain->pt, &dlock);
+}
+
+static void hmm_domain_unmap(struct hmm_domain *domain,
+ struct hmm *hmm,
+ const struct hmm_event *event)
+{
+ struct gpt_lock dlock;
+ struct gpt_iter diter;
+
+ dlock.start = event->start;
+ dlock.end = event->end - 1UL;
+ BUG_ON(gpt_lock_update(&domain->pt, &dlock));
+ gpt_iter_init(&diter, &domain->pt, &dlock);
+ if (!gpt_iter_first(&diter, dlock.start, dlock.end))
+ goto out;
+ do {
+ unsigned long npages, *dpte;
+ dma_addr_t iova;
+ int n;
+
+ dpte = diter.pte;
+ iova = *dpte & PAGE_MASK;
+ hmm_dpte_lock(dpte, hmm);
+ if (!hmm_dpte_clear_valid(dpte)) {
+ hmm_dpte_unlock(dpte, hmm);
+ continue;
+ }
+
+ npages = 1UL << (PAGE_SHIFT - hmm->pt.pte_shift);
+ n = iommu_domain_unmap_directory(domain->iommu_domain,
+ npages, iova);
+ if (n)
+ gpt_ptp_batch_unref(&domain->pt, diter.ptp, n);
+ hmm_dpte_unlock(dpte, hmm);
+ } while (gpt_iter_next(&diter));
+
+out:
+ gpt_unlock_update(&domain->pt, &dlock);
+}
+
+
/* hmm - core hmm functions.
*
* Core hmm functions that deal with all the process mm activities and use
@@ -194,6 +389,7 @@ static int hmm_init(struct hmm *hmm, struct mm_struct *mm)
kref_init(&hmm->kref);
INIT_LIST_HEAD(&hmm->device_faults);
INIT_LIST_HEAD(&hmm->invalidations);
+ INIT_LIST_HEAD(&hmm->domains);
INIT_LIST_HEAD(&hmm->mirrors);
spin_lock_init(&hmm->lock);
init_waitqueue_head(&hmm->wait_queue);
@@ -219,23 +415,114 @@ static int hmm_init(struct hmm *hmm, struct mm_struct *mm)
return __mmu_notifier_register(&hmm->mmu_notifier, mm);
}
+static struct hmm_domain *hmm_find_domain_locked(struct hmm *hmm,
+ struct iommu_domain *iommu_domain)
+{
+ struct hmm_domain *domain;
+
+ list_for_each_entry (domain, &hmm->domains, list)
+ if (domain->iommu_domain == iommu_domain)
+ return domain;
+ return NULL;
+}
+
+static struct hmm_domain *hmm_new_domain(struct hmm *hmm)
+{
+ struct hmm_domain *domain;
+
+ domain = kmalloc(sizeof(*domain), GFP_KERNEL);
+ if (!domain)
+ return NULL;
+ domain->pt.max_addr = 0;
+ INIT_LIST_HEAD(&domain->list);
+ INIT_LIST_HEAD(&domain->mirrors);
+ /*
+ * The domain page table stores a dma address for each pld (page lower
+ * directory level) of the hmm page table.
+ */
+ domain->pt.max_addr = hmm->pt.max_addr;
+ domain->pt.page_shift = 2 * PAGE_SHIFT - (ffs(BITS_PER_LONG) - 4);
+ domain->pt.pfn_invalid = 0;
+ domain->pt.pfn_mask = PAGE_MASK;
+ domain->pt.pfn_shift = PAGE_SHIFT;
+ domain->pt.pfn_valid = 1UL << HMM_PTE_VALID_SMEM_BIT;
+ domain->pt.pte_shift = ffs(BITS_PER_LONG) - 4;
+ domain->pt.user_ops = NULL;
+ if (gpt_init(&domain->pt)) {
+ kfree(domain);
+ return NULL;
+ }
+ return domain;
+}
+
+static void hmm_free_domain_locked(struct hmm *hmm, struct hmm_domain *domain)
+{
+ struct hmm_event event;
+
+ BUG_ON(!list_empty(&domain->mirrors));
+
+ event.start = 0;
+ event.end = hmm->mm->highest_vm_end;
+ event.etype = HMM_MUNMAP;
+ hmm_domain_unmap(domain, hmm, &event);
+
+ list_del(&domain->list);
+ gpt_free(&domain->pt);
+ kfree(domain);
+}
+
static void hmm_del_mirror_locked(struct hmm *hmm, struct hmm_mirror *mirror)
{
list_del_rcu(&mirror->mlist);
+ if (mirror->domain && list_empty(&mirror->domain->mirrors))
+ hmm_free_domain_locked(hmm, mirror->domain);
+ mirror->domain = NULL;
}
static int hmm_add_mirror(struct hmm *hmm, struct hmm_mirror *mirror)
{
+ struct hmm_device *device = mirror->device;
+ struct hmm_domain *domain;
struct hmm_mirror *tmp_mirror;
+ mirror->domain = NULL;
+
spin_lock(&hmm->lock);
- list_for_each_entry_rcu (tmp_mirror, &hmm->mirrors, mlist)
- if (tmp_mirror->device == mirror->device) {
- /* Same device can mirror only once. */
+ if (device->iommu_domain) {
+ domain = hmm_find_domain_locked(hmm, device->iommu_domain);
+ if (!domain) {
+ struct hmm_domain *tmp_domain;
+
spin_unlock(&hmm->lock);
- return -EINVAL;
+ tmp_domain = hmm_new_domain(hmm);
+ if (!tmp_domain)
+ return -ENOMEM;
+ spin_lock(&hmm->lock);
+ domain = hmm_find_domain_locked(hmm,
+ device->iommu_domain);
+ if (!domain) {
+ domain = tmp_domain;
+ list_add_tail(&domain->list, &hmm->domains);
+ } else
+ hmm_free_domain_locked(hmm, tmp_domain);
}
- list_add_rcu(&mirror->mlist, &hmm->mirrors);
+ list_for_each_entry_rcu (tmp_mirror, &domain->mirrors, mlist)
+ if (tmp_mirror->device == mirror->device) {
+ /* Same device can mirror only once. */
+ spin_unlock(&hmm->lock);
+ return -EINVAL;
+ }
+ mirror->domain = domain;
+ list_add_rcu(&mirror->mlist, &domain->mirrors);
+ } else {
+ list_for_each_entry_rcu (tmp_mirror, &hmm->mirrors, mlist)
+ if (tmp_mirror->device == mirror->device) {
+ /* Same device can mirror only once. */
+ spin_unlock(&hmm->lock);
+ return -EINVAL;
+ }
+ list_add_rcu(&mirror->mlist, &hmm->mirrors);
+ }
spin_unlock(&hmm->lock);
return 0;
@@ -370,10 +657,12 @@ static void hmm_end_migrate(struct hmm *hmm, struct hmm_event *ievent)
static void hmm_update(struct hmm *hmm,
struct hmm_event *event)
{
+ struct hmm_domain *domain;
struct hmm_range range;
struct gpt_lock lock;
struct gpt_iter iter;
struct gpt *pt = &hmm->pt;
+ int id;
/* This hmm is already fully stop. */
if (hmm->mm->hmm != hmm)
@@ -414,19 +703,34 @@ static void hmm_update(struct hmm *hmm,
hmm_event_wait(event);
- if (event->etype == HMM_MUNMAP || event->etype == HMM_MIGRATE) {
- BUG_ON(!gpt_iter_first(&iter, event->start, event->end - 1UL));
- for (range.start = iter.pte_addr; iter.pte;) {
- range.pte = iter.pte;
- range.ptp = iter.ptp;
- range.end = min(gpt_pdp_end(pt, iter.ptp) + 1UL,
- event->end);
- hmm_range_clear(&range, hmm);
- range.start = range.end;
- gpt_iter_first(&iter, range.start, event->end - 1UL);
+ if (event->etype == HMM_WRITE_PROTECT) {
+ id = srcu_read_lock(&srcu);
+ list_for_each_entry(domain, &hmm->domains, list) {
+ hmm_domain_update(domain, hmm, event, &iter);
}
+ srcu_read_unlock(&srcu, id);
}
+ if (event->etype != HMM_MUNMAP && event->etype != HMM_MIGRATE)
+ goto out;
+
+ BUG_ON(!gpt_iter_first(&iter, event->start, event->end - 1UL));
+ for (range.start = iter.pte_addr; iter.pte;) {
+ range.pte = iter.pte;
+ range.ptp = iter.ptp;
+ range.end = min(gpt_pdp_end(pt, iter.ptp) + 1UL,
+ event->end);
+ hmm_range_clear(&range, hmm);
+ range.start = range.end;
+ gpt_iter_first(&iter, range.start, event->end - 1UL);
+ }
+
+ id = srcu_read_lock(&srcu);
+ list_for_each_entry(domain, &hmm->domains, list)
+ hmm_domain_unmap(domain, hmm, event);
+ srcu_read_unlock(&srcu, id);
+
+out:
gpt_unlock_update(&hmm->pt, &lock);
if (event->etype != HMM_MIGRATE)
hmm_end_invalidations(hmm, event);
@@ -829,16 +1133,46 @@ void hmm_mirror_release(struct hmm_mirror *mirror)
}
EXPORT_SYMBOL(hmm_mirror_release);
+static int hmm_mirror_domain_update(struct hmm_mirror *mirror,
+ struct hmm_range *range,
+ struct gpt_iter *diter)
+{
+ unsigned long *dpte, offset;
+ struct hmm *hmm = mirror->hmm;
+ int ret;
+
+ BUG_ON(!gpt_iter_addr(diter, range->start));
+ offset = gpt_pdp_start(&mirror->domain->pt, diter->ptp) - range->start;
+ ret = hmm_domain_update_or_map(mirror->domain, hmm, diter->ptp,
+ diter->pte, range->pte);
+ dpte = diter->pte;
+ range->iova_base = (*dpte & PAGE_MASK) + offset;
+ return ret;
+}
+
static int hmm_mirror_update(struct hmm_mirror *mirror,
struct hmm_event *event,
unsigned long *start,
struct gpt_iter *iter)
{
unsigned long addr = *start & PAGE_MASK;
+ struct gpt_lock dlock;
+ struct gpt_iter diter;
if (!gpt_iter_addr(iter, addr))
return -EINVAL;
+ if (mirror->domain) {
+ int ret;
+
+ dlock.start = event->start;
+ dlock.end = event->end;
+ ret = gpt_lock_fault(&mirror->domain->pt, &dlock);
+ if (ret)
+ return ret;
+ gpt_iter_init(&diter, &mirror->domain->pt, &dlock);
+ }
+
do {
struct hmm_device *device = mirror->device;
unsigned long *pte = iter->pte;
@@ -864,6 +1198,14 @@ static int hmm_mirror_update(struct hmm_mirror *mirror,
}
}
+ if (mirror->domain) {
+ int ret;
+
+ ret = hmm_mirror_domain_update(mirror, &range, &diter);
+ if (ret)
+ return ret;
+ }
+
fence = device->ops->update(mirror, event, &range);
if (fence) {
if (IS_ERR(fence)) {
@@ -876,6 +1218,9 @@ static int hmm_mirror_update(struct hmm_mirror *mirror,
} while (addr < event->end && gpt_iter_addr(iter, addr));
+ if (mirror->domain)
+ gpt_unlock_fault(&mirror->domain->pt, &dlock);
+
*start = addr;
return 0;
}
--
1.9.3
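A short sketch of how a device driver would be expected to opt into domain
sharing; the my_* names are placeholders, while the iommu_domain field and
hmm_device_register() are the ones touched by this series:

  #include <linux/hmm.h>

  /*
   * Devices registered with the same iommu_domain end up grouped in one
   * hmm_domain per mirrored process and therefore share a single iommu
   * mapping of that process's pages; leaving iommu_domain NULL keeps the
   * previous per-device behaviour.
   */
  static int my_hmm_device_init(struct hmm_device *my_device,
                                struct iommu_domain *my_shared_domain,
                                const struct hmm_device_ops *my_ops,
                                const char *my_name)
  {
      my_device->name = my_name;
      my_device->ops = my_ops;
      my_device->iommu_domain = my_shared_domain;
      return hmm_device_register(my_device);
  }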