[RFC][PATCH 0/6] Another go at speculative page faults

* [RFC][PATCH 0/6] Another go at speculative page faults
@ 2014-10-20 21:56 Peter Zijlstra
  2014-10-20 21:56 ` [RFC][PATCH 1/6] mm: Dont assume page-table invariance during faults Peter Zijlstra
                   ` (8 more replies)
  0 siblings, 9 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-20 21:56 UTC
  To: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave
  Cc: linux-kernel, linux-mm, Peter Zijlstra

Hi,

I figured I'd give my 2010 speculative fault series another spin:

  https://lkml.org/lkml/2010/1/4/257

Since then I think many of the outstanding issues have changed sufficiently to
warrant another go. In particular Al Viro's delayed fput seems to have made it
entirely 'normal' to delay fput(). Lai Jiangshan's SRCU rewrite provided us
with call_srcu() and my preemptible mmu_gather removed the TLB flushes from
under the PTL.

The code needs way more attention but builds a kernel and runs the
micro-benchmark so I figured I'd post it before sinking more time into it.

I realize the micro-bench is about as good as it gets for this series and not
very realistic otherwise, but I think it does show the potential benefit the
approach has.

(patches apply against 3.18-rc1+)

---

Using Kamezawa's multi-fault micro-bench from: https://lkml.org/lkml/2010/1/6/28

My Ivy Bridge EP (2*10*2) has a ~58% improvement in pagefault throughput:

PRE:

root@ivb-ep:~# perf stat -e page-faults,cache-misses --repeat 5 ./multi-fault 20

 Performance counter stats for './multi-fault 20' (5 runs):

       149,441,555      page-faults                  ( +-  1.25% )
     2,153,651,828      cache-misses                 ( +-  1.09% )

      60.003082014 seconds time elapsed              ( +-  0.00% )

POST:

root@ivb-ep:~# perf stat -e page-faults,cache-misses --repeat 5 ./multi-fault 20

 Performance counter stats for './multi-fault 20' (5 runs):

       236,442,626      page-faults                  ( +-  0.08% )
     2,796,353,939      cache-misses                 ( +-  1.01% )

      60.002792431 seconds time elapsed              ( +-  0.00% )


My Ivy Bridge EX (4*15*2) has a ~78% improvement in pagefault throughput:

PRE:

root@ivb-ex:~# perf stat -e page-faults,cache-misses --repeat 5 ./multi-fault 60

 Performance counter stats for './multi-fault 60' (5 runs):

       105,789,078      page-faults                 ( +-  2.24% )
     1,314,072,090      cache-misses                ( +-  1.17% )

      60.009243533 seconds time elapsed             ( +-  0.00% )

POST:

root@ivb-ex:~# perf stat -e page-faults,cache-misses --repeat 5 ./multi-fault 60

 Performance counter stats for './multi-fault 60' (5 runs):

       187,751,767      page-faults                 ( +-  2.24% )
     1,792,758,664      cache-misses                ( +-  2.30% )

      60.011611579 seconds time elapsed             ( +-  0.00% )

(I've not yet looked at why the EX sucks chunks compared to the EP box, I
 suspect we contend on other locks, but it could be anything.)
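
For reference, the percentages above can be rechecked from the page-fault
counts alone: every run lasts about 60 seconds, so the ratio of the counts is
also the ratio of the throughputs. The helper names in this sketch are made
up for illustration and are not part of the series; it should come out at
about 58% for the EP and between 77% and 78% for the EX.

```c
#include <assert.h>
#include <stdio.h>

/* Relative throughput gain in percent: (post / pre - 1) * 100. */
double improvement_pct(unsigned long long pre, unsigned long long post)
{
	assert(pre > 0);	/* a zero baseline has no meaningful ratio */
	return ((double)post / (double)pre - 1.0) * 100.0;
}

/* Print the gain for both boxes from the perf stat counts quoted above. */
void print_improvements(void)
{
	printf("EP: %.1f%%\n", improvement_pct(149441555ULL, 236442626ULL));
	printf("EX: %.1f%%\n", improvement_pct(105789078ULL, 187751767ULL));
}
```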

---

 arch/x86/mm/fault.c      |  35 ++-
 include/linux/mm.h       |  19 +-
 include/linux/mm_types.h |   5 +
 kernel/fork.c            |   1 +
 mm/init-mm.c             |   1 +
 mm/internal.h            |  18 ++
 mm/memory.c              | 672 ++++++++++++++++++++++++++++-------------------
 mm/mmap.c                | 101 +++++--
 8 files changed, 544 insertions(+), 308 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [RFC][PATCH 1/6] mm: Dont assume page-table invariance during faults
  2014-10-20 21:56 [RFC][PATCH 0/6] Another go at speculative page faults Peter Zijlstra
@ 2014-10-20 21:56 ` Peter Zijlstra
  2014-10-20 21:56 ` [RFC][PATCH 2/6] mm: Prepare for FAULT_FLAG_SPECULATIVE Peter Zijlstra
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-20 21:56 UTC
  To: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: peterz-mm-kill-pte-pointer.patch --]
[-- Type: text/plain, Size: 6982 bytes --]

One of the side effects of speculating on faults (without holding
mmap_sem) is that we can race with free_pgtables() and therefore we
cannot assume the page-tables will stick around.

Remove the reliance on the pte pointer.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 mm/memory.c |   76 ++++++++++++++++--------------------------------------------
 1 file changed, 21 insertions(+), 55 deletions(-)
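
The pattern the changelog describes is: read the entry once by value, keep no
pointer into the table, and later re-take the lock and compare before
committing. A minimal userspace sketch of that idea follows. It is not kernel
code; struct slot and the slot_*() helpers are hypothetical stand-ins for a
PTE slot, its PTL and the pte_same() check, and a C11 atomic_flag spinlock is
swapped in for the real lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one page-table slot plus its lock (the PTL). */
struct slot {
	atomic_flag lock;
	_Atomic uint64_t entry;
};

void slot_init(struct slot *s, uint64_t entry)
{
	assert(s);
	atomic_flag_clear(&s->lock);
	atomic_init(&s->entry, entry);
}

/* Step 1: take a by-value snapshot; no pointer into the table is kept. */
uint64_t slot_snapshot(struct slot *s)
{
	return atomic_load_explicit(&s->entry, memory_order_relaxed);
}

/*
 * Step 2: after any amount of unlocked work, take the lock and install
 * new_entry only if the slot still holds the snapshot value.
 */
bool slot_install_if_same(struct slot *s, uint64_t snap, uint64_t new_entry)
{
	bool same;

	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
		;	/* spin until the lock is free */
	/* This comparison plays the role of pte_same(). */
	same = (atomic_load_explicit(&s->entry, memory_order_relaxed) == snap);
	if (same)
		atomic_store_explicit(&s->entry, new_entry, memory_order_relaxed);
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
	return same;
}
```

A caller that gets false back has lost a race with another update and simply
backs out, which mirrors the "goto unlock" paths in the diff.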

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1933,31 +1933,6 @@ int apply_to_page_range(struct mm_struct
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
-/*
- * handle_pte_fault chooses page fault handler according to an entry
- * which was read non-atomically.  Before making any commitment, on
- * those architectures or configurations (e.g. i386 with PAE) which
- * might give a mix of unmatched parts, do_swap_page and do_nonlinear_fault
- * must check under lock before unmapping the pte and proceeding
- * (but do_wp_page is only called after already making such a check;
- * and do_anonymous_page can safely check later on).
- */
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
-				pte_t *page_table, pte_t orig_pte)
-{
-	int same = 1;
-#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
-	if (sizeof(pte_t) > sizeof(unsigned long)) {
-		spinlock_t *ptl = pte_lockptr(mm, pmd);
-		spin_lock(ptl);
-		same = pte_same(*page_table, orig_pte);
-		spin_unlock(ptl);
-	}
-#endif
-	pte_unmap(page_table);
-	return same;
-}
-
 static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
 {
 	debug_dma_assert_idle(src);
@@ -2407,21 +2382,18 @@ EXPORT_SYMBOL(unmap_mapping_range);
  * as does filemap_fault().
  */
 static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
+		unsigned long address, pmd_t *pmd,
 		unsigned int flags, pte_t orig_pte)
 {
 	spinlock_t *ptl;
 	struct page *page, *swapcache;
 	struct mem_cgroup *memcg;
 	swp_entry_t entry;
-	pte_t pte;
+	pte_t *page_table, pte;
 	int locked;
 	int exclusive = 0;
 	int ret = 0;
 
-	if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
-		goto out;
-
 	entry = pte_to_swp_entry(orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
@@ -2624,15 +2596,13 @@ static inline int check_stack_guard_page
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
 static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
+		unsigned long address, pmd_t *pmd,
 		unsigned int flags)
 {
 	struct mem_cgroup *memcg;
 	struct page *page;
 	spinlock_t *ptl;
-	pte_t entry;
-
-	pte_unmap(page_table);
+	pte_t entry, *page_table;
 
 	/* Check if we need to add a guard page to the stack */
 	if (check_stack_guard_page(vma, address) < 0)
@@ -3031,13 +3001,12 @@ static int do_shared_fault(struct mm_str
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
 static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
+		unsigned long address, pmd_t *pmd,
 		unsigned int flags, pte_t orig_pte)
 {
 	pgoff_t pgoff = (((address & PAGE_MASK)
 			- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 
-	pte_unmap(page_table);
 	if (!(flags & FAULT_FLAG_WRITE))
 		return do_read_fault(mm, vma, address, pmd, pgoff, flags,
 				orig_pte);
@@ -3059,16 +3028,13 @@ static int do_linear_fault(struct mm_str
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
 static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
+		unsigned long address, pmd_t *pmd,
 		unsigned int flags, pte_t orig_pte)
 {
 	pgoff_t pgoff;
 
 	flags |= FAULT_FLAG_NONLINEAR;
 
-	if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
-		return 0;
-
 	if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) {
 		/*
 		 * Page table corrupted: show pte and kill process.
@@ -3103,7 +3069,7 @@ static int numa_migrate_prep(struct page
 }
 
 static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+		   unsigned long addr, pte_t pte, pmd_t *pmd)
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
@@ -3112,6 +3078,7 @@ static int do_numa_page(struct mm_struct
 	int target_nid;
 	bool migrated = false;
 	int flags = 0;
+	pte_t *ptep;
 
 	/*
 	* The "pte" at this point cannot be used safely without
@@ -3122,8 +3089,7 @@ static int do_numa_page(struct mm_struct
 	* the _PAGE_NUMA bit and it is not really expected that there
 	* would be concurrent hardware modifications to the PTE.
 	*/
-	ptl = pte_lockptr(mm, pmd);
-	spin_lock(ptl);
+	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (unlikely(!pte_same(*ptep, pte))) {
 		pte_unmap_unlock(ptep, ptl);
 		goto out;
@@ -3195,34 +3161,32 @@ static int do_numa_page(struct mm_struct
  */
 static int handle_pte_fault(struct mm_struct *mm,
 		     struct vm_area_struct *vma, unsigned long address,
-		     pte_t *pte, pmd_t *pmd, unsigned int flags)
+		     pte_t entry, pmd_t *pmd, unsigned int flags)
 {
-	pte_t entry;
 	spinlock_t *ptl;
+	pte_t *pte;
 
-	entry = ACCESS_ONCE(*pte);
 	if (!pte_present(entry)) {
 		if (pte_none(entry)) {
 			if (vma->vm_ops) {
 				if (likely(vma->vm_ops->fault))
 					return do_linear_fault(mm, vma, address,
-						pte, pmd, flags, entry);
+						pmd, flags, entry);
 			}
 			return do_anonymous_page(mm, vma, address,
-						 pte, pmd, flags);
+						 pmd, flags);
 		}
 		if (pte_file(entry))
 			return do_nonlinear_fault(mm, vma, address,
-					pte, pmd, flags, entry);
+					pmd, flags, entry);
 		return do_swap_page(mm, vma, address,
-					pte, pmd, flags, entry);
+					pmd, flags, entry);
 	}
 
 	if (pte_numa(entry))
-		return do_numa_page(mm, vma, address, entry, pte, pmd);
+		return do_numa_page(mm, vma, address, entry, pmd);
 
-	ptl = pte_lockptr(mm, pmd);
-	spin_lock(ptl);
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (unlikely(!pte_same(*pte, entry)))
 		goto unlock;
 	if (flags & FAULT_FLAG_WRITE) {
@@ -3261,7 +3225,7 @@ static int __handle_mm_fault(struct mm_s
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte;
+	pte_t *pte, entry;
 
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
@@ -3331,8 +3295,10 @@ static int __handle_mm_fault(struct mm_s
 	 * safe to run pte_offset_map().
 	 */
 	pte = pte_offset_map(pmd, address);
+	entry = ACCESS_ONCE(*pte);
+	pte_unmap(pte);
 
-	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+	return handle_pte_fault(mm, vma, address, entry, pmd, flags);
 }
 
 /*




* [RFC][PATCH 2/6] mm: Prepare for FAULT_FLAG_SPECULATIVE
  2014-10-20 21:56 [RFC][PATCH 0/6] Another go at speculative page faults Peter Zijlstra
  2014-10-20 21:56 ` [RFC][PATCH 1/6] mm: Dont assume page-table invariance during faults Peter Zijlstra
@ 2014-10-20 21:56 ` Peter Zijlstra
  2014-10-20 21:56 ` [RFC][PATCH 3/6] mm: VMA sequence count Peter Zijlstra
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-20 21:56 UTC
  To: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: peterz-mm-pte_map_lock.patch --]
[-- Type: text/plain, Size: 36608 bytes --]

When speculating faults (without holding mmap_sem) we need to validate
that the vma against which we loaded pages is still valid when we're
ready to install the new PTE.

Therefore, replace the pte_offset_map_lock() calls that (re)take the
PTL with pte_map_lock() which can fail in case we find the VMA changed
since we started the fault.

Instead of passing around the endless list of function arguments,
replace the lot with a single structure so we can change context
without endless function signature changes.

XXX: split this patch into two parts, the first which introduces
fault_env and the second doing the pte_map_lock bit.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/mm.h |   17 -
 mm/memory.c        |  522 +++++++++++++++++++++++++++++------------------------
 2 files changed, 297 insertions(+), 242 deletions(-)
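
The two ideas in the changelog, one context structure instead of a long
argument list and a lock step that is allowed to fail, can be sketched in
userspace as follows. This is only an illustration under stated assumptions:
in this patch pte_map_lock() still always returns true, and the change
counter used here to detect a modified area is a hypothetical stand-in for
whatever validation the rest of the series adds. struct area, struct
fault_ctx and the FAULT_* values are invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced analogue of a VMA with a change counter. */
struct area {
	atomic_uint seq;	/* bumped whenever the area is modified */
};

/* One context object instead of a long argument list (cf. fault_env). */
struct fault_ctx {
	struct area *area;
	unsigned long address;
	unsigned int seq_snap;	/* area->seq sampled when the fault began */
	bool locked;		/* stands in for holding the PTL */
};

void fault_begin(struct fault_ctx *fc, struct area *area, unsigned long address)
{
	fc->area = area;
	fc->address = address;
	fc->seq_snap = atomic_load(&area->seq);
	fc->locked = false;
}

/* Lock step that may fail; false means the area changed under us. */
bool ctx_map_lock(struct fault_ctx *fc)
{
	if (atomic_load(&fc->area->seq) != fc->seq_snap)
		return false;
	fc->locked = true;
	return true;
}

enum { FAULT_OK = 0, FAULT_RETRY = 1 };

/* A handler takes only the context and asks for a retry on failure. */
int handle_fault(struct fault_ctx *fc)
{
	assert(!fc->locked);
	/* ... unlocked preparation (allocate, charge) would go here ... */
	if (!ctx_map_lock(fc))
		return FAULT_RETRY;	/* nothing installed, caller retries */
	/* ... install the entry while "locked" ... */
	fc->locked = false;
	return FAULT_OK;
}
```

Adding a field to the context later does not touch any handler signature,
which is the point of passing the structure around.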

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -187,14 +187,15 @@ extern unsigned int kobjsize(const void
  */
 extern pgprot_t protection_map[16];
 
-#define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
-#define FAULT_FLAG_NONLINEAR	0x02	/* Fault was via a nonlinear mapping */
-#define FAULT_FLAG_MKWRITE	0x04	/* Fault was mkwrite of existing pte */
-#define FAULT_FLAG_ALLOW_RETRY	0x08	/* Retry fault if blocking */
-#define FAULT_FLAG_RETRY_NOWAIT	0x10	/* Don't drop mmap_sem and wait when retrying */
-#define FAULT_FLAG_KILLABLE	0x20	/* The fault task is in SIGKILL killable region */
-#define FAULT_FLAG_TRIED	0x40	/* second try */
-#define FAULT_FLAG_USER		0x80	/* The fault originated in userspace */
+#define FAULT_FLAG_WRITE	0x001	/* Fault was a write access */
+#define FAULT_FLAG_NONLINEAR	0x002	/* Fault was via a nonlinear mapping */
+#define FAULT_FLAG_MKWRITE	0x004	/* Fault was mkwrite of existing pte */
+#define FAULT_FLAG_ALLOW_RETRY	0x008	/* Retry fault if blocking */
+#define FAULT_FLAG_RETRY_NOWAIT	0x010	/* Don't drop mmap_sem and wait when retrying */
+#define FAULT_FLAG_KILLABLE	0x020	/* The fault task is in SIGKILL killable region */
+#define FAULT_FLAG_TRIED	0x040	/* second try */
+#define FAULT_FLAG_USER		0x080	/* The fault originated in userspace */
+#define FAULT_FLAG_SPECULATIVE	0x100	/* Speculative fault, not holding mmap_sem */
 
 /*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1993,6 +1993,23 @@ static int do_page_mkwrite(struct vm_are
 	return ret;
 }
 
+struct fault_env {
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	unsigned long address;
+	pmd_t *pmd;
+	pte_t *pte;
+	pte_t entry;
+	spinlock_t *ptl;
+	unsigned int flags;
+};
+
+static bool pte_map_lock(struct fault_env *fe)
+{
+	fe->pte = pte_offset_map_lock(fe->mm, fe->pmd, fe->address, &fe->ptl);
+	return true;
+}
+
 /*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
@@ -2011,9 +2028,7 @@ static int do_page_mkwrite(struct vm_are
  * but allow concurrent faults), with pte both mapped and locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
-static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		spinlock_t *ptl, pte_t orig_pte)
+static int do_wp_page(struct fault_env *fe)
 	__releases(ptl)
 {
 	struct page *old_page, *new_page = NULL;
@@ -2025,7 +2040,7 @@ static int do_wp_page(struct mm_struct *
 	unsigned long mmun_end = 0;	/* For mmu_notifiers */
 	struct mem_cgroup *memcg;
 
-	old_page = vm_normal_page(vma, address, orig_pte);
+	old_page = vm_normal_page(fe->vma, fe->address, fe->entry);
 	if (!old_page) {
 		/*
 		 * VM_MIXEDMAP !pfn_valid() case
@@ -2034,7 +2049,7 @@ static int do_wp_page(struct mm_struct *
 		 * Just mark the pages writable as we can't do any dirty
 		 * accounting on raw pfn maps.
 		 */
-		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+		if ((fe->vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 				     (VM_WRITE|VM_SHARED))
 			goto reuse;
 		goto gotten;
@@ -2047,14 +2062,20 @@ static int do_wp_page(struct mm_struct *
 	if (PageAnon(old_page) && !PageKsm(old_page)) {
 		if (!trylock_page(old_page)) {
 			page_cache_get(old_page);
-			pte_unmap_unlock(page_table, ptl);
+			pte_unmap_unlock(fe->pte, fe->ptl);
 			lock_page(old_page);
-			page_table = pte_offset_map_lock(mm, pmd, address,
-							 &ptl);
-			if (!pte_same(*page_table, orig_pte)) {
+
+			if (!pte_map_lock(fe)) {
+				unlock_page(old_page);
+				ret |= VM_FAULT_RETRY;
+				goto err;
+			}
+
+			if (!pte_same(*fe->pte, fe->entry)) {
 				unlock_page(old_page);
 				goto unlock;
 			}
+
 			page_cache_release(old_page);
 		}
 		if (reuse_swap_page(old_page)) {
@@ -2063,37 +2084,44 @@ static int do_wp_page(struct mm_struct *
 			 * the rmap code will not search our parent or siblings.
 			 * Protected against the rmap code by the page lock.
 			 */
-			page_move_anon_rmap(old_page, vma, address);
+			page_move_anon_rmap(old_page, fe->vma, fe->address);
 			unlock_page(old_page);
 			goto reuse;
 		}
 		unlock_page(old_page);
-	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+	} else if (unlikely((fe->vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 					(VM_WRITE|VM_SHARED))) {
 		/*
 		 * Only catch write-faults on shared writable pages,
 		 * read-only shared pages can get COWed by
 		 * get_user_pages(.write=1, .force=1).
 		 */
-		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
+		if (fe->vma->vm_ops && fe->vma->vm_ops->page_mkwrite) {
 			int tmp;
+
 			page_cache_get(old_page);
-			pte_unmap_unlock(page_table, ptl);
-			tmp = do_page_mkwrite(vma, old_page, address);
+			pte_unmap_unlock(fe->pte, fe->ptl);
+			tmp = do_page_mkwrite(fe->vma, old_page, fe->address);
 			if (unlikely(!tmp || (tmp &
 					(VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
 				page_cache_release(old_page);
 				return tmp;
 			}
+
 			/*
 			 * Since we dropped the lock we need to revalidate
 			 * the PTE as someone else may have changed it.  If
 			 * they did, we just return, as we can count on the
 			 * MMU to tell us if they didn't also make it writable.
 			 */
-			page_table = pte_offset_map_lock(mm, pmd, address,
-							 &ptl);
-			if (!pte_same(*page_table, orig_pte)) {
+
+			if (!pte_map_lock(fe)) {
+				unlock_page(old_page);
+				ret |= VM_FAULT_RETRY;
+				goto err;
+			}
+
+			if (!pte_same(*fe->pte, fe->entry)) {
 				unlock_page(old_page);
 				goto unlock;
 			}
@@ -2112,12 +2140,12 @@ static int do_wp_page(struct mm_struct *
 		if (old_page)
 			page_cpupid_xchg_last(old_page, (1 << LAST_CPUPID_SHIFT) - 1);
 
-		flush_cache_page(vma, address, pte_pfn(orig_pte));
-		entry = pte_mkyoung(orig_pte);
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-		if (ptep_set_access_flags(vma, address, page_table, entry,1))
-			update_mmu_cache(vma, address, page_table);
-		pte_unmap_unlock(page_table, ptl);
+		flush_cache_page(fe->vma, fe->address, pte_pfn(fe->entry));
+		entry = pte_mkyoung(fe->entry);
+		entry = maybe_mkwrite(pte_mkdirty(entry), fe->vma);
+		if (ptep_set_access_flags(fe->vma, fe->address, fe->pte, entry, 1))
+			update_mmu_cache(fe->vma, fe->address, fe->pte);
+		pte_unmap_unlock(fe->pte, fe->ptl);
 		ret |= VM_FAULT_WRITE;
 
 		if (!dirty_page)
@@ -2135,8 +2163,8 @@ static int do_wp_page(struct mm_struct *
 			wait_on_page_locked(dirty_page);
 			set_page_dirty_balance(dirty_page);
 			/* file_update_time outside page_lock */
-			if (vma->vm_file)
-				file_update_time(vma->vm_file);
+			if (fe->vma->vm_file)
+				file_update_time(fe->vma->vm_file);
 		}
 		put_page(dirty_page);
 		if (page_mkwrite) {
@@ -2145,7 +2173,7 @@ static int do_wp_page(struct mm_struct *
 			set_page_dirty(dirty_page);
 			unlock_page(dirty_page);
 			page_cache_release(dirty_page);
-			if (mapping)	{
+			if (mapping) {
 				/*
 				 * Some device drivers do not set page.mapping
 				 * but still dirty their pages
@@ -2162,62 +2190,68 @@ static int do_wp_page(struct mm_struct *
 	 */
 	page_cache_get(old_page);
 gotten:
-	pte_unmap_unlock(page_table, ptl);
+	pte_unmap_unlock(fe->pte, fe->ptl);
 
-	if (unlikely(anon_vma_prepare(vma)))
+	if (unlikely(anon_vma_prepare(fe->vma)))
 		goto oom;
 
-	if (is_zero_pfn(pte_pfn(orig_pte))) {
-		new_page = alloc_zeroed_user_highpage_movable(vma, address);
+	if (is_zero_pfn(pte_pfn(fe->entry))) {
+		new_page = alloc_zeroed_user_highpage_movable(fe->vma, fe->address);
 		if (!new_page)
 			goto oom;
 	} else {
-		new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+		new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, fe->vma, fe->address);
 		if (!new_page)
 			goto oom;
-		cow_user_page(new_page, old_page, address, vma);
+		cow_user_page(new_page, old_page, fe->address, fe->vma);
 	}
 	__SetPageUptodate(new_page);
 
-	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))
+	if (mem_cgroup_try_charge(new_page, fe->mm, GFP_KERNEL, &memcg))
 		goto oom_free_new;
 
-	mmun_start  = address & PAGE_MASK;
+	mmun_start  = fe->address & PAGE_MASK;
 	mmun_end    = mmun_start + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(fe->mm, mmun_start, mmun_end);
 
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
-	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (likely(pte_same(*page_table, orig_pte))) {
+	if (!pte_map_lock(fe)) {
+		mem_cgroup_cancel_charge(new_page, memcg);
+		page_cache_release(new_page);
+		ret |= VM_FAULT_RETRY;
+		goto err;
+	}
+
+	if (likely(pte_same(*fe->pte, fe->entry))) {
 		if (old_page) {
 			if (!PageAnon(old_page)) {
-				dec_mm_counter_fast(mm, MM_FILEPAGES);
-				inc_mm_counter_fast(mm, MM_ANONPAGES);
+				dec_mm_counter_fast(fe->mm, MM_FILEPAGES);
+				inc_mm_counter_fast(fe->mm, MM_ANONPAGES);
 			}
 		} else
-			inc_mm_counter_fast(mm, MM_ANONPAGES);
-		flush_cache_page(vma, address, pte_pfn(orig_pte));
-		entry = mk_pte(new_page, vma->vm_page_prot);
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			inc_mm_counter_fast(fe->mm, MM_ANONPAGES);
+		flush_cache_page(fe->vma, fe->address, pte_pfn(fe->entry));
+		entry = mk_pte(new_page, fe->vma->vm_page_prot);
+		entry = maybe_mkwrite(pte_mkdirty(entry), fe->vma);
 		/*
 		 * Clear the pte entry and flush it first, before updating the
 		 * pte with the new entry. This will avoid a race condition
 		 * seen in the presence of one thread doing SMC and another
 		 * thread doing COW.
 		 */
-		ptep_clear_flush(vma, address, page_table);
-		page_add_new_anon_rmap(new_page, vma, address);
+		ptep_clear_flush(fe->vma, fe->address, fe->pte);
+		page_add_new_anon_rmap(new_page, fe->vma, fe->address);
 		mem_cgroup_commit_charge(new_page, memcg, false);
-		lru_cache_add_active_or_unevictable(new_page, vma);
+		lru_cache_add_active_or_unevictable(new_page, fe->vma);
 		/*
 		 * We call the notify macro here because, when using secondary
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
-		set_pte_at_notify(mm, address, page_table, entry);
-		update_mmu_cache(vma, address, page_table);
+		set_pte_at_notify(fe->mm, fe->address, fe->pte, entry);
+		update_mmu_cache(fe->vma, fe->address, fe->pte);
 		if (old_page) {
 			/*
 			 * Only after switching the pte to the new page may
@@ -2253,15 +2287,16 @@ static int do_wp_page(struct mm_struct *
 	if (new_page)
 		page_cache_release(new_page);
 unlock:
-	pte_unmap_unlock(page_table, ptl);
+	pte_unmap_unlock(fe->pte, fe->ptl);
 	if (mmun_end > mmun_start)
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(fe->mm, mmun_start, mmun_end);
+err:
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
 		 * keep the mlocked page.
 		 */
-		if ((ret & VM_FAULT_WRITE) && (vma->vm_flags & VM_LOCKED)) {
+		if ((ret & VM_FAULT_WRITE) && (fe->vma->vm_flags & VM_LOCKED)) {
 			lock_page(old_page);	/* LRU manipulation */
 			munlock_vma_page(old_page);
 			unlock_page(old_page);
@@ -2269,6 +2304,7 @@ static int do_wp_page(struct mm_struct *
 		page_cache_release(old_page);
 	}
 	return ret;
+
 oom_free_new:
 	page_cache_release(new_page);
 oom:
@@ -2381,27 +2417,24 @@ EXPORT_SYMBOL(unmap_mapping_range);
  * We return with the mmap_sem locked or unlocked in the same cases
  * as does filemap_fault().
  */
-static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pmd_t *pmd,
-		unsigned int flags, pte_t orig_pte)
+static int do_swap_page(struct fault_env *fe)
 {
-	spinlock_t *ptl;
 	struct page *page, *swapcache;
 	struct mem_cgroup *memcg;
 	swp_entry_t entry;
-	pte_t *page_table, pte;
+	pte_t pte;
 	int locked;
 	int exclusive = 0;
 	int ret = 0;
 
-	entry = pte_to_swp_entry(orig_pte);
+	entry = pte_to_swp_entry(fe->entry);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
-			migration_entry_wait(mm, pmd, address);
+			migration_entry_wait(fe->mm, fe->pmd, fe->address);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
 		} else {
-			print_bad_pte(vma, address, orig_pte, NULL);
+			print_bad_pte(fe->vma, fe->address, fe->entry, NULL);
 			ret = VM_FAULT_SIGBUS;
 		}
 		goto out;
@@ -2410,14 +2443,16 @@ static int do_swap_page(struct mm_struct
 	page = lookup_swap_cache(entry);
 	if (!page) {
 		page = swapin_readahead(entry,
-					GFP_HIGHUSER_MOVABLE, vma, address);
+				GFP_HIGHUSER_MOVABLE, fe->vma, fe->address);
 		if (!page) {
 			/*
 			 * Back out if somebody else faulted in this pte
 			 * while we released the pte lock.
 			 */
-			page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-			if (likely(pte_same(*page_table, orig_pte)))
+			if (!pte_map_lock(fe))
+				return VM_FAULT_RETRY;
+
+			if (likely(pte_same(*fe->pte, fe->entry)))
 				ret = VM_FAULT_OOM;
 			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 			goto unlock;
@@ -2426,7 +2461,7 @@ static int do_swap_page(struct mm_struct
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
-		mem_cgroup_count_vm_event(mm, PGMAJFAULT);
+		mem_cgroup_count_vm_event(fe->mm, PGMAJFAULT);
 	} else if (PageHWPoison(page)) {
 		/*
 		 * hwpoisoned dirty swapcache pages are kept for killing
@@ -2439,7 +2474,7 @@ static int do_swap_page(struct mm_struct
 	}
 
 	swapcache = page;
-	locked = lock_page_or_retry(page, mm, flags);
+	locked = lock_page_or_retry(page, fe->mm, fe->flags);
 
 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 	if (!locked) {
@@ -2456,14 +2491,14 @@ static int do_swap_page(struct mm_struct
 	if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
 		goto out_page;
 
-	page = ksm_might_need_to_copy(page, vma, address);
+	page = ksm_might_need_to_copy(page, fe->vma, fe->address);
 	if (unlikely(!page)) {
 		ret = VM_FAULT_OOM;
 		page = swapcache;
 		goto out_page;
 	}
 
-	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg)) {
+	if (mem_cgroup_try_charge(page, fe->mm, GFP_KERNEL, &memcg)) {
 		ret = VM_FAULT_OOM;
 		goto out_page;
 	}
@@ -2471,8 +2506,12 @@ static int do_swap_page(struct mm_struct
 	/*
 	 * Back out if somebody else already faulted in this pte.
 	 */
-	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (unlikely(!pte_same(*page_table, orig_pte)))
+	if (!pte_map_lock(fe)) {
+		ret = VM_FAULT_RETRY;
+		goto out_charge;
+	}
+
+	if (unlikely(!pte_same(*fe->pte, fe->entry)))
 		goto out_nomap;
 
 	if (unlikely(!PageUptodate(page))) {
@@ -2490,30 +2529,30 @@ static int do_swap_page(struct mm_struct
 	 * must be called after the swap_free(), or it will never succeed.
 	 */
 
-	inc_mm_counter_fast(mm, MM_ANONPAGES);
-	dec_mm_counter_fast(mm, MM_SWAPENTS);
-	pte = mk_pte(page, vma->vm_page_prot);
-	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
-		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
-		flags &= ~FAULT_FLAG_WRITE;
+	inc_mm_counter_fast(fe->mm, MM_ANONPAGES);
+	dec_mm_counter_fast(fe->mm, MM_SWAPENTS);
+	pte = mk_pte(page, fe->vma->vm_page_prot);
+	if ((fe->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
+		pte = maybe_mkwrite(pte_mkdirty(pte), fe->vma);
+		fe->flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
 		exclusive = 1;
 	}
-	flush_icache_page(vma, page);
-	if (pte_swp_soft_dirty(orig_pte))
+	flush_icache_page(fe->vma, page);
+	if (pte_swp_soft_dirty(fe->entry))
 		pte = pte_mksoft_dirty(pte);
-	set_pte_at(mm, address, page_table, pte);
+	set_pte_at(fe->mm, fe->address, fe->pte, pte);
 	if (page == swapcache) {
-		do_page_add_anon_rmap(page, vma, address, exclusive);
+		do_page_add_anon_rmap(page, fe->vma, fe->address, exclusive);
 		mem_cgroup_commit_charge(page, memcg, true);
 	} else { /* ksm created a completely new copy */
-		page_add_new_anon_rmap(page, vma, address);
+		page_add_new_anon_rmap(page, fe->vma, fe->address);
 		mem_cgroup_commit_charge(page, memcg, false);
-		lru_cache_add_active_or_unevictable(page, vma);
+		lru_cache_add_active_or_unevictable(page, fe->vma);
 	}
 
 	swap_free(entry);
-	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
+	if (vm_swap_full() || (fe->vma->vm_flags & VM_LOCKED) || PageMlocked(page))
 		try_to_free_swap(page);
 	unlock_page(page);
 	if (page != swapcache) {
@@ -2529,22 +2568,23 @@ static int do_swap_page(struct mm_struct
 		page_cache_release(swapcache);
 	}
 
-	if (flags & FAULT_FLAG_WRITE) {
-		ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
+	if (fe->flags & FAULT_FLAG_WRITE) {
+		ret |= do_wp_page(fe);
 		if (ret & VM_FAULT_ERROR)
 			ret &= VM_FAULT_ERROR;
 		goto out;
 	}
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(vma, address, page_table);
+	update_mmu_cache(fe->vma, fe->address, fe->pte);
 unlock:
-	pte_unmap_unlock(page_table, ptl);
+	pte_unmap_unlock(fe->pte, fe->ptl);
 out:
 	return ret;
 out_nomap:
+	pte_unmap_unlock(fe->pte, fe->ptl);
+out_charge:
 	mem_cgroup_cancel_charge(page, memcg);
-	pte_unmap_unlock(page_table, ptl);
 out_page:
 	unlock_page(page);
 out_release:
@@ -2595,33 +2635,34 @@ static inline int check_stack_guard_page
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
-static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pmd_t *pmd,
-		unsigned int flags)
+static int do_anonymous_page(struct fault_env *fe)
 {
 	struct mem_cgroup *memcg;
 	struct page *page;
-	spinlock_t *ptl;
-	pte_t entry, *page_table;
+	pte_t entry;
 
 	/* Check if we need to add a guard page to the stack */
-	if (check_stack_guard_page(vma, address) < 0)
+	if (check_stack_guard_page(fe->vma, fe->address) < 0)
 		return VM_FAULT_SIGBUS;
 
 	/* Use the zero-page for reads */
-	if (!(flags & FAULT_FLAG_WRITE)) {
-		entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
-						vma->vm_page_prot));
-		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-		if (!pte_none(*page_table))
+	if (!(fe->flags & FAULT_FLAG_WRITE)) {
+		entry = pte_mkspecial(pfn_pte(my_zero_pfn(fe->address),
+					fe->vma->vm_page_prot));
+
+		if (!pte_map_lock(fe))
+			return VM_FAULT_RETRY;
+
+		if (!pte_none(*fe->pte))
 			goto unlock;
+
 		goto setpte;
 	}
 
 	/* Allocate our own private page. */
-	if (unlikely(anon_vma_prepare(vma)))
+	if (unlikely(anon_vma_prepare(fe->vma)))
 		goto oom;
-	page = alloc_zeroed_user_highpage_movable(vma, address);
+	page = alloc_zeroed_user_highpage_movable(fe->vma, fe->address);
 	if (!page)
 		goto oom;
 	/*
@@ -2631,28 +2672,33 @@ static int do_anonymous_page(struct mm_s
 	 */
 	__SetPageUptodate(page);
 
-	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
+	if (mem_cgroup_try_charge(page, fe->mm, GFP_KERNEL, &memcg))
 		goto oom_free_page;
 
-	entry = mk_pte(page, vma->vm_page_prot);
-	if (vma->vm_flags & VM_WRITE)
+	entry = mk_pte(page, fe->vma->vm_page_prot);
+	if (fe->vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
 
-	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (!pte_none(*page_table))
+	if (!pte_map_lock(fe)) {
+		mem_cgroup_cancel_charge(page, memcg);
+		page_cache_release(page);
+		return VM_FAULT_RETRY;
+	}
+
+	if (!pte_none(*fe->pte))
 		goto release;
 
-	inc_mm_counter_fast(mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, vma, address);
+	inc_mm_counter_fast(fe->mm, MM_ANONPAGES);
+	page_add_new_anon_rmap(page, fe->vma, fe->address);
 	mem_cgroup_commit_charge(page, memcg, false);
-	lru_cache_add_active_or_unevictable(page, vma);
+	lru_cache_add_active_or_unevictable(page, fe->vma);
 setpte:
-	set_pte_at(mm, address, page_table, entry);
+	set_pte_at(fe->mm, fe->address, fe->pte, entry);
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(vma, address, page_table);
+	update_mmu_cache(fe->vma, fe->address, fe->pte);
 unlock:
-	pte_unmap_unlock(page_table, ptl);
+	pte_unmap_unlock(fe->pte, fe->ptl);
 	return 0;
 release:
 	mem_cgroup_cancel_charge(page, memcg);
@@ -2688,7 +2734,7 @@ static int __do_fault(struct vm_area_str
 		if (ret & VM_FAULT_LOCKED)
 			unlock_page(vmf.page);
 		page_cache_release(vmf.page);
-		return VM_FAULT_HWPOISON;
+		return ret | VM_FAULT_HWPOISON;
 	}
 
 	if (unlikely(!(ret & VM_FAULT_LOCKED)))
@@ -2846,13 +2892,9 @@ static void do_fault_around(struct vm_ar
 	vma->vm_ops->map_pages(vma, &vmf);
 }
 
-static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pmd_t *pmd,
-		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+static int do_read_fault(struct fault_env *fe, pgoff_t pgoff)
 {
 	struct page *fault_page;
-	spinlock_t *ptl;
-	pte_t *pte;
 	int ret = 0;
 
 	/*
@@ -2860,73 +2902,86 @@ static int do_read_fault(struct mm_struc
 	 * if page by the offset is not ready to be mapped (cold cache or
 	 * something).
 	 */
-	if (vma->vm_ops->map_pages && !(flags & FAULT_FLAG_NONLINEAR) &&
+	if (fe->vma->vm_ops->map_pages && !(fe->flags & FAULT_FLAG_NONLINEAR) &&
 	    fault_around_bytes >> PAGE_SHIFT > 1) {
-		pte = pte_offset_map_lock(mm, pmd, address, &ptl);
-		do_fault_around(vma, address, pte, pgoff, flags);
-		if (!pte_same(*pte, orig_pte))
+
+		if (!pte_map_lock(fe))
+			return VM_FAULT_RETRY;
+
+		do_fault_around(fe->vma, fe->address, fe->pte, pgoff, fe->flags);
+		if (!pte_same(*fe->pte, fe->entry))
 			goto unlock_out;
-		pte_unmap_unlock(pte, ptl);
+
+		pte_unmap_unlock(fe->pte, fe->ptl);
 	}
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(fe->vma, fe->address, pgoff, fe->flags, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
-	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (unlikely(!pte_same(*pte, orig_pte))) {
-		pte_unmap_unlock(pte, ptl);
+	if (!pte_map_lock(fe)) {
+		unlock_page(fault_page);
+		page_cache_release(fault_page);
+		return VM_FAULT_RETRY;
+	}
+
+	if (unlikely(!pte_same(*fe->pte, fe->entry))) {
+		pte_unmap_unlock(fe->pte, fe->ptl);
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
 		return ret;
 	}
-	do_set_pte(vma, address, fault_page, pte, false, false);
+
+	do_set_pte(fe->vma, fe->address, fault_page, fe->pte, false, false);
 	unlock_page(fault_page);
 unlock_out:
-	pte_unmap_unlock(pte, ptl);
+	pte_unmap_unlock(fe->pte, fe->ptl);
 	return ret;
 }
 
-static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pmd_t *pmd,
-		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff)
 {
 	struct page *fault_page, *new_page;
 	struct mem_cgroup *memcg;
-	spinlock_t *ptl;
-	pte_t *pte;
 	int ret;
 
-	if (unlikely(anon_vma_prepare(vma)))
+	if (unlikely(anon_vma_prepare(fe->vma)))
 		return VM_FAULT_OOM;
 
-	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, fe->vma, fe->address);
 	if (!new_page)
 		return VM_FAULT_OOM;
 
-	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg)) {
+	if (mem_cgroup_try_charge(new_page, fe->mm, GFP_KERNEL, &memcg)) {
 		page_cache_release(new_page);
 		return VM_FAULT_OOM;
 	}
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(fe->vma, fe->address, pgoff, fe->flags, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;
 
-	copy_user_highpage(new_page, fault_page, address, vma);
+	copy_user_highpage(new_page, fault_page, fe->address, fe->vma);
 	__SetPageUptodate(new_page);
 
-	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (unlikely(!pte_same(*pte, orig_pte))) {
-		pte_unmap_unlock(pte, ptl);
+	if (!pte_map_lock(fe)) {
+		unlock_page(fault_page);
+		page_cache_release(fault_page);
+		ret |= VM_FAULT_RETRY;
+		goto uncharge_out;
+	}
+
+	if (unlikely(!pte_same(*fe->pte, fe->entry))) {
+		pte_unmap_unlock(fe->pte, fe->ptl);
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
 		goto uncharge_out;
 	}
-	do_set_pte(vma, address, new_page, pte, true, true);
+
+	do_set_pte(fe->vma, fe->address, new_page, fe->pte, true, true);
 	mem_cgroup_commit_charge(new_page, memcg, false);
-	lru_cache_add_active_or_unevictable(new_page, vma);
-	pte_unmap_unlock(pte, ptl);
+	lru_cache_add_active_or_unevictable(new_page, fe->vma);
+	pte_unmap_unlock(fe->pte, fe->ptl);
 	unlock_page(fault_page);
 	page_cache_release(fault_page);
 	return ret;
@@ -2936,18 +2991,14 @@ static int do_cow_fault(struct mm_struct
 	return ret;
 }
 
-static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pmd_t *pmd,
-		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff)
 {
 	struct page *fault_page;
 	struct address_space *mapping;
-	spinlock_t *ptl;
-	pte_t *pte;
 	int dirtied = 0;
 	int ret, tmp;
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(fe->vma, fe->address, pgoff, fe->flags, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
@@ -2955,31 +3006,35 @@ static int do_shared_fault(struct mm_str
 	 * Check if the backing address space wants to know that the page is
 	 * about to become writable
 	 */
-	if (vma->vm_ops->page_mkwrite) {
+	if (fe->vma->vm_ops->page_mkwrite) {
 		unlock_page(fault_page);
-		tmp = do_page_mkwrite(vma, fault_page, address);
-		if (unlikely(!tmp ||
-				(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
+		tmp = do_page_mkwrite(fe->vma, fault_page, fe->address);
+		if (unlikely(!tmp || (tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
 			page_cache_release(fault_page);
 			return tmp;
 		}
 	}
 
-	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (unlikely(!pte_same(*pte, orig_pte))) {
-		pte_unmap_unlock(pte, ptl);
+	if (!pte_map_lock(fe)) {
+		unlock_page(fault_page);
+		page_cache_release(fault_page);
+		return ret | VM_FAULT_RETRY;
+	}
+
+	if (unlikely(!pte_same(*fe->pte, fe->entry))) {
+		pte_unmap_unlock(fe->pte, fe->ptl);
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
 		return ret;
 	}
-	do_set_pte(vma, address, fault_page, pte, true, false);
-	pte_unmap_unlock(pte, ptl);
+	do_set_pte(fe->vma, fe->address, fault_page, fe->pte, true, false);
+	pte_unmap_unlock(fe->pte, fe->ptl);
 
 	if (set_page_dirty(fault_page))
 		dirtied = 1;
 	mapping = fault_page->mapping;
 	unlock_page(fault_page);
-	if ((dirtied || vma->vm_ops->page_mkwrite) && mapping) {
+	if ((dirtied || fe->vma->vm_ops->page_mkwrite) && mapping) {
 		/*
 		 * Some device drivers do not set page.mapping but still
 		 * dirty their pages
@@ -2988,8 +3043,8 @@ static int do_shared_fault(struct mm_str
 	}
 
 	/* file_update_time outside page_lock */
-	if (vma->vm_file && !vma->vm_ops->page_mkwrite)
-		file_update_time(vma->vm_file);
+	if (fe->vma->vm_file && !fe->vma->vm_ops->page_mkwrite)
+		file_update_time(fe->vma->vm_file);
 
 	return ret;
 }
@@ -3000,20 +3055,16 @@ static int do_shared_fault(struct mm_str
  * The mmap_sem may have been released depending on flags and our
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
-static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pmd_t *pmd,
-		unsigned int flags, pte_t orig_pte)
-{
-	pgoff_t pgoff = (((address & PAGE_MASK)
-			- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
-
-	if (!(flags & FAULT_FLAG_WRITE))
-		return do_read_fault(mm, vma, address, pmd, pgoff, flags,
-				orig_pte);
-	if (!(vma->vm_flags & VM_SHARED))
-		return do_cow_fault(mm, vma, address, pmd, pgoff, flags,
-				orig_pte);
-	return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+static int do_linear_fault(struct fault_env *fe)
+{
+	pgoff_t pgoff = (((fe->address & PAGE_MASK) -
+			 fe->vma->vm_start) >> PAGE_SHIFT) + fe->vma->vm_pgoff;
+
+	if (!(fe->flags & FAULT_FLAG_WRITE))
+		return do_read_fault(fe, pgoff);
+	if (!(fe->vma->vm_flags & VM_SHARED))
+		return do_cow_fault(fe, pgoff);
+	return do_shared_fault(fe, pgoff);
 }
 
 /*
@@ -3027,30 +3078,26 @@ static int do_linear_fault(struct mm_str
  * The mmap_sem may have been released depending on flags and our
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
-static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pmd_t *pmd,
-		unsigned int flags, pte_t orig_pte)
+static int do_nonlinear_fault(struct fault_env *fe)
 {
 	pgoff_t pgoff;
 
-	flags |= FAULT_FLAG_NONLINEAR;
+	fe->flags |= FAULT_FLAG_NONLINEAR;
 
-	if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) {
+	if (unlikely(!(fe->vma->vm_flags & VM_NONLINEAR))) {
 		/*
 		 * Page table corrupted: show pte and kill process.
 		 */
-		print_bad_pte(vma, address, orig_pte, NULL);
+		print_bad_pte(fe->vma, fe->address, fe->entry, NULL);
 		return VM_FAULT_SIGBUS;
 	}
 
-	pgoff = pte_to_pgoff(orig_pte);
-	if (!(flags & FAULT_FLAG_WRITE))
-		return do_read_fault(mm, vma, address, pmd, pgoff, flags,
-				orig_pte);
-	if (!(vma->vm_flags & VM_SHARED))
-		return do_cow_fault(mm, vma, address, pmd, pgoff, flags,
-				orig_pte);
-	return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+	pgoff = pte_to_pgoff(fe->entry);
+	if (!(fe->flags & FAULT_FLAG_WRITE))
+		return do_read_fault(fe, pgoff);
+	if (!(fe->vma->vm_flags & VM_SHARED))
+		return do_cow_fault(fe, pgoff);
+	return do_shared_fault(fe, pgoff);
 }
 
 static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
@@ -3068,17 +3115,16 @@ static int numa_migrate_prep(struct page
 	return mpol_misplaced(page, vma, addr);
 }
 
-static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		   unsigned long addr, pte_t pte, pmd_t *pmd)
+static int do_numa_page(struct fault_env *fe)
 {
 	struct page *page = NULL;
-	spinlock_t *ptl;
 	int page_nid = -1;
 	int last_cpupid;
 	int target_nid;
 	bool migrated = false;
 	int flags = 0;
-	pte_t *ptep;
+	int ret = 0;
+	pte_t entry;
 
 	/*
 	* The "pte" at this point cannot be used safely without
@@ -3089,19 +3135,23 @@ static int do_numa_page(struct mm_struct
 	* the _PAGE_NUMA bit and it is not really expected that there
 	* would be concurrent hardware modifications to the PTE.
 	*/
-	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
-	if (unlikely(!pte_same(*ptep, pte))) {
-		pte_unmap_unlock(ptep, ptl);
+	if (!pte_map_lock(fe)) {
+		ret |= VM_FAULT_RETRY;
+		goto out;
+	}
+
+	if (unlikely(!pte_same(*fe->pte, fe->entry))) {
+		pte_unmap_unlock(fe->pte, fe->ptl);
 		goto out;
 	}
 
-	pte = pte_mknonnuma(pte);
-	set_pte_at(mm, addr, ptep, pte);
-	update_mmu_cache(vma, addr, ptep);
+	entry = pte_mknonnuma(fe->entry);
+	set_pte_at(fe->mm, fe->address, fe->pte, entry);
+	update_mmu_cache(fe->vma, fe->address, fe->pte);
 
-	page = vm_normal_page(vma, addr, pte);
+	page = vm_normal_page(fe->vma, fe->address, entry);
 	if (!page) {
-		pte_unmap_unlock(ptep, ptl);
+		pte_unmap_unlock(fe->pte, fe->ptl);
 		return 0;
 	}
 	BUG_ON(is_zero_pfn(page_to_pfn(page)));
@@ -3111,27 +3161,28 @@ static int do_numa_page(struct mm_struct
 	 * in general, RO pages shouldn't hurt as much anyway since
 	 * they can be in shared cache state.
 	 */
-	if (!pte_write(pte))
+	if (!pte_write(entry))
 		flags |= TNF_NO_GROUP;
 
 	/*
 	 * Flag if the page is shared between multiple address spaces. This
 	 * is later used when determining whether to group tasks together
 	 */
-	if (page_mapcount(page) > 1 && (vma->vm_flags & VM_SHARED))
+	if (page_mapcount(page) > 1 && (fe->vma->vm_flags & VM_SHARED))
 		flags |= TNF_SHARED;
 
 	last_cpupid = page_cpupid_last(page);
 	page_nid = page_to_nid(page);
-	target_nid = numa_migrate_prep(page, vma, addr, page_nid, &flags);
-	pte_unmap_unlock(ptep, ptl);
+	target_nid = numa_migrate_prep(page, fe->vma, fe->address, page_nid, &flags);
+	pte_unmap_unlock(fe->pte, fe->ptl);
+
 	if (target_nid == -1) {
 		put_page(page);
 		goto out;
 	}
 
 	/* Migrate to the requested node */
-	migrated = migrate_misplaced_page(page, vma, target_nid);
+	migrated = migrate_misplaced_page(page, fe->vma, target_nid);
 	if (migrated) {
 		page_nid = target_nid;
 		flags |= TNF_MIGRATED;
@@ -3159,45 +3210,38 @@ static int do_numa_page(struct mm_struct
  * The mmap_sem may have been released depending on flags and our
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
-static int handle_pte_fault(struct mm_struct *mm,
-		     struct vm_area_struct *vma, unsigned long address,
-		     pte_t entry, pmd_t *pmd, unsigned int flags)
+static int handle_pte_fault(struct fault_env *fe)
 {
-	spinlock_t *ptl;
-	pte_t *pte;
+	pte_t entry = fe->entry;
 
 	if (!pte_present(entry)) {
 		if (pte_none(entry)) {
-			if (vma->vm_ops) {
-				if (likely(vma->vm_ops->fault))
-					return do_linear_fault(mm, vma, address,
-						pmd, flags, entry);
+			if (fe->vma->vm_ops) {
+				if (likely(fe->vma->vm_ops->fault))
+					return do_linear_fault(fe);
 			}
-			return do_anonymous_page(mm, vma, address,
-						 pmd, flags);
+			return do_anonymous_page(fe);
 		}
 		if (pte_file(entry))
-			return do_nonlinear_fault(mm, vma, address,
-					pmd, flags, entry);
-		return do_swap_page(mm, vma, address,
-					pmd, flags, entry);
+			return do_nonlinear_fault(fe);
+		return do_swap_page(fe);
 	}
 
 	if (pte_numa(entry))
-		return do_numa_page(mm, vma, address, entry, pmd);
-
-	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (unlikely(!pte_same(*pte, entry)))
+		return do_numa_page(fe);
+	if (!pte_map_lock(fe))
+		return VM_FAULT_RETRY;
+	if (unlikely(!pte_same(*fe->pte, entry)))
 		goto unlock;
-	if (flags & FAULT_FLAG_WRITE) {
+	if (fe->flags & FAULT_FLAG_WRITE) {
 		if (!pte_write(entry))
-			return do_wp_page(mm, vma, address,
-					pte, pmd, ptl, entry);
+			return do_wp_page(fe);
 		entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);
-	if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
-		update_mmu_cache(vma, address, pte);
+	if (ptep_set_access_flags(fe->vma, fe->address, fe->pte,
+				entry, fe->flags & FAULT_FLAG_WRITE)) {
+		update_mmu_cache(fe->vma, fe->address, fe->pte);
 	} else {
 		/*
 		 * This is needed only for protection faults but the arch code
@@ -3205,11 +3249,11 @@ static int handle_pte_fault(struct mm_st
 		 * This still avoids useless tlb flushes for .text page faults
 		 * with threads.
 		 */
-		if (flags & FAULT_FLAG_WRITE)
-			flush_tlb_fix_spurious_fault(vma, address);
+		if (fe->flags & FAULT_FLAG_WRITE)
+			flush_tlb_fix_spurious_fault(fe->vma, fe->address);
 	}
 unlock:
-	pte_unmap_unlock(pte, ptl);
+	pte_unmap_unlock(fe->pte, fe->ptl);
 	return 0;
 }
 
@@ -3222,6 +3266,7 @@ static int handle_pte_fault(struct mm_st
 static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			     unsigned long address, unsigned int flags)
 {
+	struct fault_env fe;
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
@@ -3298,7 +3343,16 @@ static int __handle_mm_fault(struct mm_s
 	entry = ACCESS_ONCE(*pte);
 	pte_unmap(pte);
 
-	return handle_pte_fault(mm, vma, address, entry, pmd, flags);
+	fe = (struct fault_env) {
+		.mm = mm,
+		.vma = vma,
+		.address = address,
+		.entry = entry,
+		.pmd = pmd,
+		.flags = flags,
+	};
+
+	return handle_pte_fault(&fe);
 }
 
 /*
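
The conversion above repeats one shape in every handler: take whatever
resources are needed, try pte_map_lock(), and on failure release them
and return VM_FAULT_RETRY; on success, re-check the PTE against the
snapshot in fe->entry before committing. A minimal userspace sketch of
that shape follows. Everything prefixed toy_ or TOY_ is hypothetical and
only for illustration. In particular, when the real pte_map_lock() fails
is not shown in this hunk, so the sketch models it as a plain flag.

```c
#include <assert.h>
#include <stdbool.h>

enum { TOY_OK = 0, TOY_RETRY = 1 };

/* Hypothetical stand-in for the fault_env bundle; fields are illustrative. */
struct toy_env {
	unsigned long *slot;     /* the "PTE" the fault wants to fill */
	unsigned long snapshot;  /* value of *slot sampled before the fault */
	bool lock_ok;            /* assumption: result of the lock attempt */
	bool locked;             /* toy lock state (single-threaded model) */
	int pending;             /* resources taken but not yet settled */
};

static bool toy_map_lock(struct toy_env *e)
{
	if (!e->lock_ok)
		return false;
	assert(!e->locked);
	e->locked = true;
	return true;
}

static void toy_unmap_unlock(struct toy_env *e)
{
	assert(e->locked);
	e->locked = false;
}

static int toy_fault(struct toy_env *e, unsigned long value)
{
	e->pending++;                    /* "allocate and charge a page" */

	if (!toy_map_lock(e)) {
		e->pending--;            /* undo before bailing out */
		return TOY_RETRY;
	}

	if (*e->slot != e->snapshot) {   /* entry changed under us */
		toy_unmap_unlock(e);
		e->pending--;            /* release, nothing left to do */
		return TOY_OK;
	}

	*e->slot = value;                /* commit under the lock */
	e->pending--;
	toy_unmap_unlock(e);
	return TOY_OK;
}
```

The point of the sketch is only the ordering: every early exit after the
resource is taken must undo it, and the commit happens strictly after
both the lock and the snapshot comparison have succeeded.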


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [RFC][PATCH 3/6] mm: VMA sequence count
  2014-10-20 21:56 [RFC][PATCH 0/6] Another go at speculative page faults Peter Zijlstra
  2014-10-20 21:56 ` [RFC][PATCH 1/6] mm: Dont assume page-table invariance during faults Peter Zijlstra
  2014-10-20 21:56 ` [RFC][PATCH 2/6] mm: Prepare for FAULT_FLAG_SPECULATIVE Peter Zijlstra
@ 2014-10-20 21:56 ` Peter Zijlstra
  2014-10-22 11:26   ` Kirill A. Shutemov
  2014-10-20 21:56 ` [RFC][PATCH 4/6] SRCU free VMAs Peter Zijlstra
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-20 21:56 UTC (permalink / raw)
  To: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: peterz-mm-vma-seq.patch --]
[-- Type: text/plain, Size: 3114 bytes --]

Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
counts such that we can easily test if a VMA is changed.

The unmap_page_range() one allows us to make assumptions about
page-tables; when we find the seqcount hasn't changed we can assume
page-tables are still valid.

The flip side is that we cannot distinguish between a vma_adjust() and
the unmap_page_range() -- where with the former we could have
re-checked the vma bounds against the address.
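
A minimal userspace sketch of the idea, assuming a C11 toolchain. The
toy_ names are hypothetical and this is not the kernel's seqcount_t
implementation; it only shows a reader taking a snapshot of the count
and rejecting its result if a writer ran in between.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical miniature VMA: bounds guarded by a sequence count. */
struct toy_vma {
	atomic_uint seq;              /* even: stable, odd: writer active */
	_Atomic unsigned long start;
	_Atomic unsigned long end;
};

static void toy_vma_init(struct toy_vma *v, unsigned long s, unsigned long e)
{
	atomic_init(&v->seq, 0);
	atomic_init(&v->start, s);
	atomic_init(&v->end, e);
}

static void toy_write_begin(struct toy_vma *v)
{
	atomic_fetch_add(&v->seq, 1);     /* count becomes odd */
}

static void toy_write_end(struct toy_vma *v)
{
	unsigned old = atomic_fetch_add(&v->seq, 1);

	assert(old & 1);                  /* must pair with toy_write_begin() */
	(void)old;
}

static unsigned toy_read_begin(struct toy_vma *v)
{
	return atomic_load(&v->seq);
}

static bool toy_read_retry(struct toy_vma *v, unsigned snap)
{
	/* Stale if a writer was active at snapshot time or has run since. */
	return (snap & 1) || atomic_load(&v->seq) != snap;
}

/* Writer side: the analogue of wrapping vma_adjust(). */
static void toy_adjust(struct toy_vma *v, unsigned long s, unsigned long e)
{
	toy_write_begin(v);
	atomic_store(&v->start, s);
	atomic_store(&v->end, e);
	toy_write_end(v);
}

/* Reader side: trust the bounds check only if the snapshot still holds. */
static bool toy_addr_valid(struct toy_vma *v, unsigned long addr, unsigned snap)
{
	bool inside = addr >= atomic_load(&v->start) &&
		      addr < atomic_load(&v->end);

	return inside && !toy_read_retry(v, snap);
}
```

Note that after toy_adjust() an old snapshot is rejected even for an
address that is still inside the new bounds. That is the flip side
described above: the count says "something changed", not what changed.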

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/mm_types.h |    2 ++
 mm/memory.c              |    2 ++
 mm/mmap.c                |   13 +++++++++++++
 3 files changed, 17 insertions(+)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -13,6 +13,7 @@
 #include <linux/page-debug-flags.h>
 #include <linux/uprobes.h>
 #include <linux/page-flags-layout.h>
+#include <linux/seqlock.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -308,6 +309,7 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+	seqcount_t vm_sequence;
 };
 
 struct core_thread {
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1293,6 +1293,7 @@ static void unmap_page_range(struct mmu_
 		details = NULL;
 
 	BUG_ON(addr >= end);
+	write_seqcount_begin(&vma->vm_sequence);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
@@ -1302,6 +1303,7 @@ static void unmap_page_range(struct mmu_
 		next = zap_pud_range(tlb, vma, pgd, addr, next, details);
 	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
+	write_seqcount_end(&vma->vm_sequence);
 }
 
 
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -596,6 +596,8 @@ void __vma_link_rb(struct mm_struct *mm,
 	else
 		mm->highest_vm_end = vma->vm_end;
 
+	seqcount_init(&vma->vm_sequence);
+
 	/*
 	 * vma->vm_prev wasn't known when we followed the rbtree to find the
 	 * correct insertion point for that vma. As a result, we could not
@@ -715,6 +717,10 @@ int vma_adjust(struct vm_area_struct *vm
 	long adjust_next = 0;
 	int remove_next = 0;
 
+	write_seqcount_begin(&vma->vm_sequence);
+	if (next)
+		write_seqcount_begin_nested(&next->vm_sequence, SINGLE_DEPTH_NESTING);
+
 	if (next && !insert) {
 		struct vm_area_struct *exporter = NULL;
 
@@ -880,7 +886,10 @@ again:			remove_next = 1 + (end > next->
 		 * we must remove another next too. It would clutter
 		 * up the code too much to do both in one go.
 		 */
+		write_seqcount_end(&next->vm_sequence);
 		next = vma->vm_next;
+		write_seqcount_begin_nested(&next->vm_sequence, SINGLE_DEPTH_NESTING);
+
 		if (remove_next == 2)
 			goto again;
 		else if (next)
@@ -891,6 +900,10 @@ again:			remove_next = 1 + (end > next->
 	if (insert && file)
 		uprobe_mmap(insert);
 
+	if (next)
+		write_seqcount_end(&next->vm_sequence);
+	write_seqcount_end(&vma->vm_sequence);
+
 	validate_mm(mm);
 
 	return 0;



* Re: [RFC][PATCH 3/6] mm: VMA sequence count
  2014-10-20 21:56 ` [RFC][PATCH 3/6] mm: VMA sequence count Peter Zijlstra
@ 2014-10-22 11:26   ` Kirill A. Shutemov
  2014-10-22 11:39     ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Kirill A. Shutemov @ 2014-10-22 11:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave, linux-kernel,
	linux-mm

On Mon, Oct 20, 2014 at 11:56:36PM +0200, Peter Zijlstra wrote:
> Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
> counts such that we can easily test if a VMA is changed.
> 
> The unmap_page_range() one allows us to make assumptions about
> page-tables; when we find the seqcount hasn't changed we can assume
> page-tables are still valid.
> 
> The flip side is that we cannot distinguish between a vma_adjust() and
> the unmap_page_range() -- where with the former we could have
> re-checked the vma bounds against the address.

You only took care of changes to the VMA's size and of unmap. What about
other aspects of the VMA? How would you handle a race with mprotect(2)?

		CPU0						CPU1
 mprotect()
   mprotect_fixup()
     vma_merge()
       [ maybe update vm_sequence ]
    						[ page fault kicks in ]
						  do_anonymous_page()
						    entry = mk_pte(page, fe->vma->vm_page_prot);
     vma_set_page_prot(vma)
       [ update vma->vm_page_prot ]
     change_protection()
						    pte_map_lock()
						      [ vm_sequence is ok ]
						    set_pte_at(entry) // With old vm_page_prot!!!

This can end up as a security issue.

This particular case can be fixed pretty easily: we should move the
vm_page_prot reference under the ptl and make sure that we walk over
virtual addresses in the same (direct) order everywhere (this seems to
be true).
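
A minimal single-threaded model of the interleaving in the diagram,
assuming the racing mprotect() lands exactly between building the entry
and taking the lock. All toy_ and TOY_ names are hypothetical; the "lock"
is a flag, so this only illustrates the ordering, not real concurrency.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PRESENT 0x1UL
#define TOY_WRITE   0x2UL

struct toy_mm {
	unsigned long prot;  /* stand-in for vma->vm_page_prot */
	unsigned long pte;   /* 0 plays the role of pte_none() */
	bool ptl;            /* toy page-table lock */
};

static void toy_lock(struct toy_mm *m)   { assert(!m->ptl); m->ptl = true; }
static void toy_unlock(struct toy_mm *m) { assert(m->ptl);  m->ptl = false; }

/* mprotect() analogue: publish the new prot, then fix up present PTEs. */
static void toy_mprotect(struct toy_mm *m, unsigned long new_prot)
{
	m->prot = new_prot;
	toy_lock(m);
	if (m->pte & TOY_PRESENT)         /* a none PTE is left untouched */
		m->pte = TOY_PRESENT | new_prot;
	toy_unlock(m);
}

/* Racy order: the entry is built before the lock is taken. */
static void toy_fault_early(struct toy_mm *m, unsigned long racing_prot)
{
	unsigned long entry = TOY_PRESENT | m->prot;

	toy_mprotect(m, racing_prot);     /* "CPU0" runs here */
	toy_lock(m);
	if (m->pte == 0)                  /* still none, so the check passes */
		m->pte = entry;           /* installs the old protection */
	toy_unlock(m);
}

/* Suggested order: vm_page_prot is only read under the lock. */
static void toy_fault_locked(struct toy_mm *m, unsigned long racing_prot)
{
	toy_mprotect(m, racing_prot);     /* same interleaving point */
	toy_lock(m);
	if (m->pte == 0)
		m->pte = TOY_PRESENT | m->prot;
	toy_unlock(m);
}
```

With the read under the lock, either the fault sees the new protection,
or it installs the old one before the PTE walk in toy_mprotect() reaches
it, in which case the walk corrects the now-present entry.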

But who knows what else we're missing?

-- 
 Kirill A. Shutemov


* Re: [RFC][PATCH 3/6] mm: VMA sequence count
  2014-10-22 11:26   ` Kirill A. Shutemov
@ 2014-10-22 11:39     ` Peter Zijlstra
  2014-10-22 11:53       ` Kirill A. Shutemov
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-22 11:39 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave, linux-kernel,
	linux-mm

On Wed, Oct 22, 2014 at 02:26:57PM +0300, Kirill A. Shutemov wrote:
> On Mon, Oct 20, 2014 at 11:56:36PM +0200, Peter Zijlstra wrote:
> > Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
> > counts such that we can easily test if a VMA is changed.
> > 
> > The unmap_page_range() one allows us to make assumptions about
> > page-tables; when we find the seqcount hasn't changed we can assume
> > page-tables are still valid.
> > 
> > The flip side is that we cannot distinguish between a vma_adjust() and
> > the unmap_page_range() -- where with the former we could have
> > re-checked the vma bounds against the address.
> 
> You only took care of changes to the VMA's size and of unmap. What about
> other aspects of the VMA? How would you handle a race with mprotect(2)?
> 
> 		CPU0						CPU1
>  mprotect()
>    mprotect_fixup()
>      vma_merge()
>        [ maybe update vm_sequence ]
>     						[ page fault kicks in ]
> 						  do_anonymous_page()
> 						    entry = mk_pte(page, fe->vma->vm_page_prot);
>      vma_set_page_prot(vma)
>        [ update vma->vm_page_prot ]
>      change_protection()
> 						    pte_map_lock()
> 						      [ vm_sequence is ok ]
> 						    set_pte_at(entry) // With old vm_page_prot!!!
> 

This won't happen; this is serialized by the PTL, and the fault
validates that the PTE is the 'same' as it started out with after
acquiring the PTL.


* Re: [RFC][PATCH 3/6] mm: VMA sequence count
  2014-10-22 11:39     ` Peter Zijlstra
@ 2014-10-22 11:53       ` Kirill A. Shutemov
  2014-10-22 12:15         ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Kirill A. Shutemov @ 2014-10-22 11:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave, linux-kernel,
	linux-mm

On Wed, Oct 22, 2014 at 01:39:51PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 22, 2014 at 02:26:57PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Oct 20, 2014 at 11:56:36PM +0200, Peter Zijlstra wrote:
> > > Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
> > > counts such that we can easily test if a VMA is changed.
> > > 
> > > The unmap_page_range() one allows us to make assumptions about
> > > page-tables; when we find the seqcount hasn't changed we can assume
> > > page-tables are still valid.
> > > 
> > > The flip side is that we cannot distinguish between a vma_adjust() and
> > > the unmap_page_range() -- where with the former we could have
> > > re-checked the vma bounds against the address.
> > 
> > You only took care of changes to the VMA's size and of unmap. What about
> > other aspects of the VMA? How would you handle a race with mprotect(2)?
> > 
> > 		CPU0						CPU1
> >  mprotect()
> >    mprotect_fixup()
> >      vma_merge()
> >        [ maybe update vm_sequence ]
> >     						[ page fault kicks in ]
> > 						  do_anonymous_page()
> > 						    entry = mk_pte(page, fe->vma->vm_page_prot);
> >      vma_set_page_prot(vma)
> >        [ update vma->vm_page_prot ]
> >      change_protection()
> > 						    pte_map_lock()
> > 						      [ vm_sequence is ok ]
> > 						    set_pte_at(entry) // With old vm_page_prot!!!
> > 
> 
> This won't happen; this is serialized by the PTL, and the fault
> validates that the PTE is the 'same' as it started out with after
> acquiring the PTL.

Em, no. In this case change_protection() will not touch the pte, since
it's pte_none() and the pte_same() check will pass just fine.

-- 
 Kirill A. Shutemov


* Re: [RFC][PATCH 3/6] mm: VMA sequence count
  2014-10-22 11:53       ` Kirill A. Shutemov
@ 2014-10-22 12:15         ` Peter Zijlstra
  2014-10-22 13:44           ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-22 12:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave, linux-kernel,
	linux-mm

On Wed, Oct 22, 2014 at 02:53:04PM +0300, Kirill A. Shutemov wrote:
> Em, no. In this case change_protection() will not touch the pte, since
> it's pte_none() and the pte_same() check will pass just fine.

Oh, that's what you meant. Yes that's a problem, yes vm_page_prot
needs wrapping too.
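
A rough sketch of what "wrapping" would buy, assuming the speculative
path re-checks vm_sequence before installing the PTE (the
"[ vm_sequence is ok ]" step in the diagram). It is a single-threaded
model with a plain counter; the toy_ and TOY_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { TOY_DONE = 0, TOY_RETRY = 1 };

#define TOY_PRESENT 0x1UL
#define TOY_WRITE   0x2UL

struct toy_vma {
	unsigned seq;         /* bumped twice around every wrapped update */
	unsigned long prot;   /* stand-in for vma->vm_page_prot */
	unsigned long pte;    /* 0 plays the role of pte_none() */
};

/* vm_page_prot update placed inside the sequence-count write section. */
static void toy_set_prot(struct toy_vma *v, unsigned long prot)
{
	v->seq++;             /* write_seqcount_begin() analogue */
	v->prot = prot;
	v->seq++;             /* write_seqcount_end() analogue */
}

static int toy_spec_fault(struct toy_vma *v, bool race,
			  unsigned long racing_prot)
{
	unsigned snap = v->seq;
	unsigned long entry = TOY_PRESENT | v->prot;

	if (race)
		toy_set_prot(v, racing_prot);   /* "CPU0" runs here */

	if (v->seq != snap)
		return TOY_RETRY;   /* the prot change is now visible */

	if (v->pte == 0)
		v->pte = entry;
	return TOY_DONE;
}
```

Because the protection change bumps the count, the stale entry is never
installed even though the PTE itself is still none; the fault is retried
and picks up the new protection.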


* Re: [RFC][PATCH 3/6] mm: VMA sequence count
  2014-10-22 12:15         ` Peter Zijlstra
@ 2014-10-22 13:44           ` Peter Zijlstra
  2014-10-23 12:36             ` Kirill A. Shutemov
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-22 13:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave, linux-kernel,
	linux-mm

On Wed, Oct 22, 2014 at 02:15:54PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 22, 2014 at 02:53:04PM +0300, Kirill A. Shutemov wrote:
> > Em, no. In this case change_protection() will not touch the pte, since
> > it's pte_none() and the pte_same() check will pass just fine.
> 
> Oh, that's what you meant. Yes that's a problem, yes vm_page_prot
> needs wrapping too.

Maybe also vm_policy, is there anything else that can change while a vma
lives?


* Re: [RFC][PATCH 3/6] mm: VMA sequence count
  2014-10-22 13:44           ` Peter Zijlstra
@ 2014-10-23 12:36             ` Kirill A. Shutemov
  2014-10-23 14:22               ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Kirill A. Shutemov @ 2014-10-23 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave, linux-kernel,
	linux-mm

On Wed, Oct 22, 2014 at 03:44:16PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 22, 2014 at 02:15:54PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 22, 2014 at 02:53:04PM +0300, Kirill A. Shutemov wrote:
> > > Em, no. In this case change_protection() will not touch the pte, since
> > > it's pte_none() and the pte_same() check will pass just fine.
> > 
> > Oh, that's what you meant. Yes that's a problem, yes vm_page_prot
> > needs wrapping too.
> 
> Maybe also vm_policy, is there anything else that can change while a vma
> lives?

 - vm_flags, obviously;
 - shared, anon_vma and anon_vma_chain (at least on the first write fault
   to private mapping);
 - vm_pgoff (mremap(2) ?);
 - vm_private_data -- it's all over drivers. Potential nightmare, but
   seems not in use for anon mappings.

-- 
 Kirill A. Shutemov


* Re: [RFC][PATCH 3/6] mm: VMA sequence count
  2014-10-23 12:36             ` Kirill A. Shutemov
@ 2014-10-23 14:22               ` Peter Zijlstra
  2014-10-23 15:05                 ` Kirill A. Shutemov
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-23 14:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave, linux-kernel,
	linux-mm

On Thu, Oct 23, 2014 at 03:36:16PM +0300, Kirill A. Shutemov wrote:
> On Wed, Oct 22, 2014 at 03:44:16PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 22, 2014 at 02:15:54PM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 22, 2014 at 02:53:04PM +0300, Kirill A. Shutemov wrote:
> > > > Em, no. In this case change_protection() will not touch the pte, since
> > > > it's pte_none() and the pte_same() check will pass just fine.
> > > 
> > > Oh, that's what you meant. Yes that's a problem, yes vm_page_prot
> > > needs wrapping too.
> > 
> > Maybe also vm_policy, is there anything else that can change while a vma
> > lives?
> 
>  - vm_flags, obviously;

Do those ever change? The only thing that jumps out is the VM_LOCKED
thing and that should not really matter one way or the other, but sure
can do.

>  - shared, anon_vma and anon_vma_chain (at least on the first write fault
>    to private mapping);
>  - vm_pgoff (mremap(2) ?);

Right you are. Never thought about that one.

>  - vm_private_data -- it's all over drivers. Potential nightmare, but
>    seems not in use for anon mappings.

Yeah, we need to either audit drivers or otherwise exclude stuff from
speculative faults, Andy already noted that drivers might not expect
.fault after .close or whatnot.

In any case, yes I'll go include them.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 3/6] mm: VMA sequence count
  2014-10-23 14:22               ` Peter Zijlstra
@ 2014-10-23 15:05                 ` Kirill A. Shutemov
  0 siblings, 0 replies; 47+ messages in thread
From: Kirill A. Shutemov @ 2014-10-23 15:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave, linux-kernel,
	linux-mm

On Thu, Oct 23, 2014 at 04:22:24PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 23, 2014 at 03:36:16PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Oct 22, 2014 at 03:44:16PM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 22, 2014 at 02:15:54PM +0200, Peter Zijlstra wrote:
> > > > On Wed, Oct 22, 2014 at 02:53:04PM +0300, Kirill A. Shutemov wrote:
> > > > > Em, no. In this case change_protection() will not touch the pte, since
> > > > > it's pte_none() and the pte_same() check will pass just fine.
> > > > 
> > > > Oh, that's what you meant. Yes that's a problem, yes vm_page_prot
> > > > needs wrapping too.
> > > 
> > > Maybe also vm_policy, is there anything else that can change while a vma
> > > lives?
> > 
> >  - vm_flags, obviously;
> 
> Do those ever change?

The flags which can change (probably incomplete):

 - prot-related: VM_READ, VM_WRITE, VM_EXEC -- mprotect();
 - VM_LOCKED - mlock();
 - VM_SEQ_READ, VM_RAND_READ, VM_DONTCOPY, VM_DONTDUMP, VM_HUGEPAGE,
   VM_NOHUGEPAGE, VM_MERGEABLE -- madvise();
 - VM_SOFTDIRTY -- through procfs;
 
> The only thing that jumps out is the VM_LOCKED thing and that should not
> really matter one way or the other, but sure can do.

I would not be that sure about VM_LOCKED. Consider munlock() vs. write
fault race.

static int do_wp_page(struct fault_env *fe)
        __releases(ptl)
{
...
err:
	if (old_page) {
		/*
		 * Don't let another task, with possibly unlocked vma,
		 * keep the mlocked page.
		 */
		if ((ret & VM_FAULT_WRITE) && (fe->vma->vm_flags & VM_LOCKED)) {
			lock_page(old_page);	/* LRU manipulation */
			munlock_vma_page(old_page);
			unlock_page(old_page);
		}
		page_cache_release(old_page);
	}
	return ret;
...
}

The page can leak out mlocked, iiuc.

Some other flags can be problematic too.

> In any case, yes I'll go include them.

I hope it will not hurt single-threaded workloads even more. :-/

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC][PATCH 4/6] SRCU free VMAs
  2014-10-20 21:56 [RFC][PATCH 0/6] Another go at speculative page faults Peter Zijlstra
                   ` (2 preceding siblings ...)
  2014-10-20 21:56 ` [RFC][PATCH 3/6] mm: VMA sequence count Peter Zijlstra
@ 2014-10-20 21:56 ` Peter Zijlstra
  2014-10-20 23:41   ` Linus Torvalds
  2014-10-23 10:14   ` Lai Jiangshan
  2014-10-20 21:56 ` [RFC][PATCH 5/6] mm: Provide speculative fault infrastructure Peter Zijlstra
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-20 21:56 UTC (permalink / raw)
  To: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: peterz-mm-srcu-vma.patch --]
[-- Type: text/plain, Size: 9279 bytes --]

Manage the VMAs with SRCU such that we can do a lockless VMA lookup.

We put the fput(vma->vm_file) in the SRCU callback, which keeps files
valid during speculative faults. This is possible due to the delayed
fput work by Al Viro -- do we need srcu_barrier() in unmount
someplace?
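
As a rough userspace illustration of the deferred-free idea above (not
the kernel implementation), the sketch below queues a callback instead
of freeing at once, and drops the file reference only when a simulated
grace period ends. Every toy_* name is hypothetical, and
toy_grace_period_elapsed() merely stands in for "all readers have left
their SRCU read-side sections".

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these are kernel APIs. */
struct toy_head {
	struct toy_head *next;
	void (*func)(struct toy_head *head);
};

struct toy_file {
	int refcount;
};

struct toy_vma {
	struct toy_file *file;
	struct toy_head head;	/* embedded, in the spirit of vm_rcu_head */
};

#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct toy_head *toy_pending;	/* callbacks awaiting a grace period */

/* Queue a callback instead of freeing right away. */
static void toy_call_deferred(struct toy_head *head,
			      void (*func)(struct toy_head *head))
{
	head->func = func;
	head->next = toy_pending;
	toy_pending = head;
}

/* Pretend every reader has finished: run all queued callbacks. */
static void toy_grace_period_elapsed(void)
{
	while (toy_pending) {
		struct toy_head *head = toy_pending;

		toy_pending = head->next;
		head->func(head);
	}
}

/* Deferred destructor: drop the file reference, then free the VMA. */
static void toy_free_vma_cb(struct toy_head *head)
{
	struct toy_vma *vma = toy_container_of(head, struct toy_vma, head);

	if (vma->file)
		vma->file->refcount--;
	free(vma);
}

static void toy_free_vma(struct toy_vma *vma)
{
	assert(vma != NULL);
	toy_call_deferred(&vma->head, toy_free_vma_cb);
}
```

Until toy_grace_period_elapsed() runs, both the VMA and its file
reference stay valid, which is the property a speculative reader would
depend on.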

We guard the mm_rb tree with a seqlock (XXX could be a seqcount but
we'd have to disable preemption around the write side in order to make
the retry loop in __read_seqcount_begin() work) such that we can know
if the rb tree walk was correct. We cannot trust the result of a
lockless tree walk in the face of concurrent tree rotations, although
we can rely on such walks terminating -- tree rotations guarantee the
end result is a tree again, after all.
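
The read/retry protocol described above can be sketched in userspace
with C11 atomics. This is a toy under stated assumptions, not the
kernel's seqlock: the toy_* names are made up, writers are assumed to
be serialized elsewhere (as mmap_sem does in the patch), and the
default sequentially consistent atomics are stronger than the WMB/RMB
pairing the kernel relies on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a sequence counter. */
struct toy_seqcount {
	atomic_uint sequence;	/* odd while an update is in flight */
};

static void toy_write_begin(struct toy_seqcount *s)
{
	unsigned int old = atomic_fetch_add(&s->sequence, 1);

	assert((old & 1) == 0);	/* writers must already be serialized */
}

static void toy_write_end(struct toy_seqcount *s)
{
	atomic_fetch_add(&s->sequence, 1);
}

static unsigned int toy_read_begin(struct toy_seqcount *s)
{
	unsigned int seq;

	/* Spin until no writer is mid-update (sequence is even). */
	while ((seq = atomic_load(&s->sequence)) & 1)
		;
	return seq;
}

/* True if a writer ran since toy_read_begin() returned 'start'. */
static bool toy_read_retry(struct toy_seqcount *s, unsigned int start)
{
	return atomic_load(&s->sequence) != start;
}
```

A reader wraps its lockless walk in a do/while loop around
toy_read_begin() and toy_read_retry(), in the same shape as
find_vma_srcu() in the patch, and discards any result obtained while
the counter moved.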

Furthermore, we rely on the WMB implied by the
write_seqlock/count_begin() to separate the VMA initialization and the
publishing stores, analogous to the RELEASE in rcu_assign_pointer().
We also rely on the RMB from read_seqretry() to separate the vma load
from further loads like the smp_read_barrier_depends() in regular
RCU.

We must not touch the vmacache while doing SRCU lookups as that is not
properly serialized against changes. We update gap information after
publishing the VMA, but A) we don't use that and B) the seqlock
read side would fix that anyhow.

We clear vma->vm_rb for nodes removed from the vma tree such that we
can easily detect such 'dead' nodes. We rely on the WMB from
write_sequnlock() to separate the tree removal and clearing the node.
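
A single-threaded sketch of the 'dead node' test, assuming the same
convention the kernel rbtree uses for an empty node (its parent field
refers to the node itself). The toy_* names are hypothetical and the
memory barriers that the real vma_is_dead() needs are left out on
purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a tree node plus its per-object sequence. */
struct toy_node {
	struct toy_node *parent;	/* points at the node itself once unlinked */
	unsigned int sequence;		/* even when no change is in flight */
};

static void toy_init_node(struct toy_node *n)
{
	n->parent = NULL;	/* NULL parent: linked as the tree root */
	n->sequence = 0;
}

/* Mark the node as removed from the tree. */
static void toy_clear_node(struct toy_node *n)
{
	n->parent = n;
}

static bool toy_node_is_empty(const struct toy_node *n)
{
	return n->parent == n;
}

/* Model one completed modification (begin + end bumps the count by 2). */
static void toy_modify(struct toy_node *n)
{
	assert((n->sequence & 1) == 0);
	n->sequence += 2;
}

/* Dead if unlinked, or if it changed since 'snapshot' was taken. */
static bool toy_is_dead(const struct toy_node *n, unsigned int snapshot)
{
	return toy_node_is_empty(n) || n->sequence != snapshot;
}
```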

Provide find_vma_srcu() which wraps the required magic.

XXX: mmap()/munmap() heavy workloads might suffer from the global lock
in call_srcu() -- this is fixable with a 'better' SRCU implementation.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/mm_types.h |    3 +
 kernel/fork.c            |    1 
 mm/init-mm.c             |    1 
 mm/internal.h            |   18 +++++++++
 mm/mmap.c                |   88 ++++++++++++++++++++++++++++++++++++-----------
 5 files changed, 91 insertions(+), 20 deletions(-)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -14,6 +14,7 @@
 #include <linux/uprobes.h>
 #include <linux/page-flags-layout.h>
 #include <linux/seqlock.h>
+#include <linux/srcu.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -310,6 +311,7 @@ struct vm_area_struct {
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
 	seqcount_t vm_sequence;
+	struct rcu_head vm_rcu_head;
 };
 
 struct core_thread {
@@ -347,6 +349,7 @@ struct kioctx_table;
 struct mm_struct {
 	struct vm_area_struct *mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
+	seqlock_t mm_seq;
 	u32 vmacache_seqnum;                   /* per-thread vmacache */
 #ifdef CONFIG_MMU
 	unsigned long (*get_unmapped_area) (struct file *filp,
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -553,6 +553,7 @@ static struct mm_struct *mm_init(struct
 	mm->mmap = NULL;
 	mm->mm_rb = RB_ROOT;
 	mm->vmacache_seqnum = 0;
+	seqlock_init(&mm->mm_seq);
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
 	init_rwsem(&mm->mmap_sem);
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -15,6 +15,7 @@
 
 struct mm_struct init_mm = {
 	.mm_rb		= RB_ROOT,
+	.mm_seq		= __SEQLOCK_UNLOCKED(init_mm.mm_seq),
 	.pgd		= swapper_pg_dir,
 	.mm_users	= ATOMIC_INIT(2),
 	.mm_count	= ATOMIC_INIT(1),
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -14,6 +14,24 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 
+extern struct srcu_struct vma_srcu;
+
+extern struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr);
+
+static inline bool vma_is_dead(struct vm_area_struct *vma, unsigned int sequence)
+{
+	int ret = RB_EMPTY_NODE(&vma->vm_rb);
+	unsigned seq = ACCESS_ONCE(vma->vm_sequence.sequence);
+
+	/*
+	 * Matches both the wmb in write_seqcount_{begin,end}() and
+	 * the wmb in vma_rb_erase().
+	 */
+	smp_rmb();
+
+	return ret || seq != sequence;
+}
+
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -247,6 +247,23 @@ void unlink_file_vma(struct vm_area_stru
 	}
 }
 
+DEFINE_SRCU(vma_srcu);
+
+static void __free_vma(struct rcu_head *head)
+{
+	struct vm_area_struct *vma =
+		container_of(head, struct vm_area_struct, vm_rcu_head);
+
+	if (vma->vm_file)
+		fput(vma->vm_file);
+	kmem_cache_free(vm_area_cachep, vma);
+}
+
+static void free_vma(struct vm_area_struct *vma)
+{
+	call_srcu(&vma_srcu, &vma->vm_rcu_head, __free_vma);
+}
+
 /*
  * Close a vm structure and free it, returning the next.
  */
@@ -257,10 +274,8 @@ static struct vm_area_struct *remove_vma
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file)
-		fput(vma->vm_file);
 	mpol_put(vma_policy(vma));
-	kmem_cache_free(vm_area_cachep, vma);
+	free_vma(vma);
 	return next;
 }
 
@@ -468,17 +483,19 @@ static void vma_gap_update(struct vm_are
 	vma_gap_callbacks_propagate(&vma->vm_rb, NULL);
 }
 
-static inline void vma_rb_insert(struct vm_area_struct *vma,
-				 struct rb_root *root)
+static inline void vma_rb_insert(struct vm_area_struct *vma, struct mm_struct *mm)
 {
+	struct rb_root *root = &mm->mm_rb;
+
 	/* All rb_subtree_gap values must be consistent prior to insertion */
 	validate_mm_rb(root, NULL);
 
 	rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
 }
 
-static void vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
+static void vma_rb_erase(struct vm_area_struct *vma, struct mm_struct *mm)
 {
+	struct rb_root *root = &mm->mm_rb;
 	/*
 	 * All rb_subtree_gap values must be consistent prior to erase,
 	 * with the possible exception of the vma being erased.
@@ -490,7 +507,15 @@ static void vma_rb_erase(struct vm_area_
 	 * so make sure we instantiate it only once with our desired
 	 * augmented rbtree callbacks.
 	 */
+	write_seqlock(&mm->mm_seq);
 	rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
+	write_sequnlock(&mm->mm_seq); /* wmb */
+
+	/*
+	 * Ensure the removal is complete before clearing the node.
+	 * Matched by vma_is_dead()/handle_speculative_fault().
+	 */
+	RB_CLEAR_NODE(&vma->vm_rb);
 }
 
 /*
@@ -607,10 +632,12 @@ void __vma_link_rb(struct mm_struct *mm,
 	 * immediately update the gap to the correct value. Finally we
 	 * rebalance the rbtree after all augmented values have been set.
 	 */
+	write_seqlock(&mm->mm_seq);
 	rb_link_node(&vma->vm_rb, rb_parent, rb_link);
 	vma->rb_subtree_gap = 0;
 	vma_gap_update(vma);
-	vma_rb_insert(vma, &mm->mm_rb);
+	vma_rb_insert(vma, mm);
+	write_sequnlock(&mm->mm_seq);
 }
 
 static void __vma_link_file(struct vm_area_struct *vma)
@@ -687,7 +714,7 @@ __vma_unlink(struct mm_struct *mm, struc
 {
 	struct vm_area_struct *next;
 
-	vma_rb_erase(vma, &mm->mm_rb);
+	vma_rb_erase(vma, mm);
 	prev->vm_next = next = vma->vm_next;
 	if (next)
 		next->vm_prev = prev;
@@ -872,15 +899,13 @@ again:			remove_next = 1 + (end > next->
 	}
 
 	if (remove_next) {
-		if (file) {
+		if (file)
 			uprobe_munmap(next, next->vm_start, next->vm_end);
-			fput(file);
-		}
 		if (next->anon_vma)
 			anon_vma_merge(vma, next);
 		mm->map_count--;
 		mpol_put(vma_policy(next));
-		kmem_cache_free(vm_area_cachep, next);
+		free_vma(next);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
 		 * we must remove another next too. It would clutter
@@ -2027,16 +2052,11 @@ get_unmapped_area(struct file *file, uns
 EXPORT_SYMBOL(get_unmapped_area);
 
 /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
-struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+static struct vm_area_struct *__find_vma(struct mm_struct *mm, unsigned long addr)
 {
 	struct rb_node *rb_node;
 	struct vm_area_struct *vma;
 
-	/* Check the cache first. */
-	vma = vmacache_find(mm, addr);
-	if (likely(vma))
-		return vma;
-
 	rb_node = mm->mm_rb.rb_node;
 	vma = NULL;
 
@@ -2054,13 +2074,41 @@ struct vm_area_struct *find_vma(struct m
 			rb_node = rb_node->rb_right;
 	}
 
+	return vma;
+}
+
+struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+{
+	struct vm_area_struct *vma;
+
+	/* Check the cache first. */
+	vma = vmacache_find(mm, addr);
+	if (likely(vma))
+		return vma;
+
+	vma = __find_vma(mm, addr);
 	if (vma)
 		vmacache_update(addr, vma);
+
 	return vma;
 }
-
 EXPORT_SYMBOL(find_vma);
 
+struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr)
+{
+	struct vm_area_struct *vma;
+	unsigned int seq;
+
+	WARN_ON_ONCE(!srcu_read_lock_held(&vma_srcu));
+
+	do {
+		seq = read_seqbegin(&mm->mm_seq);
+		vma = __find_vma(mm, addr);
+	} while (read_seqretry(&mm->mm_seq, seq));
+
+	return vma;
+}
+
 /*
  * Same as find_vma, but also return a pointer to the previous VMA in *pprev.
  */
@@ -2415,7 +2463,7 @@ detach_vmas_to_be_unmapped(struct mm_str
 	insertion_point = (prev ? &prev->vm_next : &mm->mmap);
 	vma->vm_prev = NULL;
 	do {
-		vma_rb_erase(vma, &mm->mm_rb);
+		vma_rb_erase(vma, mm);
 		mm->map_count--;
 		tail_vma = vma;
 		vma = vma->vm_next;



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 4/6] SRCU free VMAs
  2014-10-20 21:56 ` [RFC][PATCH 4/6] SRCU free VMAs Peter Zijlstra
@ 2014-10-20 23:41   ` Linus Torvalds
  2014-10-21  8:07     ` Peter Zijlstra
  2014-10-21  8:22     ` Peter Zijlstra
  2014-10-23 10:14   ` Lai Jiangshan
  1 sibling, 2 replies; 47+ messages in thread
From: Linus Torvalds @ 2014-10-20 23:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul McKenney, Thomas Gleixner, Andrew Morton, Rik van Riel,
	Mel Gorman, Oleg Nesterov, Ingo Molnar, Minchan Kim,
	KAMEZAWA Hiroyuki, Al Viro, Lai Jiangshan, Davidlohr Bueso,
	Linux Kernel Mailing List, linux-mm

On Mon, Oct 20, 2014 at 2:56 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> Manage the VMAs with SRCU such that we can do a lockless VMA lookup.

Can you explain why srcu, and not plain regular rcu?

Especially as you then *note* some of the problems srcu can have.
Making it regular rcu would also seem to make it possible to make the
seqlock be just a seqcount, no?

                  Linus


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 4/6] SRCU free VMAs
  2014-10-20 23:41   ` Linus Torvalds
@ 2014-10-21  8:07     ` Peter Zijlstra
  2014-10-24 15:16       ` Christoph Lameter
  2014-10-21  8:22     ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-21  8:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Thomas Gleixner, Andrew Morton, Rik van Riel,
	Mel Gorman, Oleg Nesterov, Ingo Molnar, Minchan Kim,
	KAMEZAWA Hiroyuki, Al Viro, Lai Jiangshan, Davidlohr Bueso,
	Linux Kernel Mailing List, linux-mm

On Mon, Oct 20, 2014 at 04:41:45PM -0700, Linus Torvalds wrote:
> On Mon, Oct 20, 2014 at 2:56 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > Manage the VMAs with SRCU such that we can do a lockless VMA lookup.
> 
> Can you explain why srcu, and not plain regular rcu?
> 
> Especially as you then *note* some of the problems srcu can have.
> Making it regular rcu would also seem to make it possible to make the
> seqlock be just a seqcount, no?

Because we need to hold onto the RCU read side lock across the entire
fault, which can involve IO and all kinds of other blocking ops.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 4/6] SRCU free VMAs
  2014-10-21  8:07     ` Peter Zijlstra
@ 2014-10-24 15:16       ` Christoph Lameter
  2014-10-24 15:51         ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2014-10-24 15:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Paul McKenney, Thomas Gleixner, Andrew Morton,
	Rik van Riel, Mel Gorman, Oleg Nesterov, Ingo Molnar, Minchan Kim,
	KAMEZAWA Hiroyuki, Al Viro, Lai Jiangshan, Davidlohr Bueso,
	Linux Kernel Mailing List, linux-mm

On Tue, 21 Oct 2014, Peter Zijlstra wrote:

> On Mon, Oct 20, 2014 at 04:41:45PM -0700, Linus Torvalds wrote:
> > On Mon, Oct 20, 2014 at 2:56 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > Manage the VMAs with SRCU such that we can do a lockless VMA lookup.
> >
> > Can you explain why srcu, and not plain regular rcu?
> >
> > Especially as you then *note* some of the problems srcu can have.
> > Making it regular rcu would also seem to make it possible to make the
> > seqlock be just a seqcount, no?
>
> Because we need to hold onto the RCU read side lock across the entire
> fault, which can involve IO and all kinds of other blocking ops.

Hmmm... One optimization to do before we get into these changes is to work
on allowing the dropping of mmap_sem before we get to sleeping and I/O and
then reevaluate when I/O etc is complete? This is probably the longest
hold on mmap_sem that is also frequent. Then it may be easier to use
standard RCU later.






^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 4/6] SRCU free VMAs
  2014-10-24 15:16       ` Christoph Lameter
@ 2014-10-24 15:51         ` Peter Zijlstra
  2014-10-24 17:08           ` Christoph Lameter
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-24 15:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Paul McKenney, Thomas Gleixner, Andrew Morton,
	Rik van Riel, Mel Gorman, Oleg Nesterov, Ingo Molnar, Minchan Kim,
	KAMEZAWA Hiroyuki, Al Viro, Lai Jiangshan, Davidlohr Bueso,
	Linux Kernel Mailing List, linux-mm

On Fri, Oct 24, 2014 at 10:16:24AM -0500, Christoph Lameter wrote:

> Hmmm... One optimization to do before we get into these changes is to work
> on allowing the dropping of mmap_sem before we get to sleeping and I/O and
> then reevaluate when I/O etc is complete? This is probably the longest
> hold on mmap_sem that is also frequent. Then it may be easier to use
> standard RCU later.

The hold time isn't relevant, in fact breaking up the mmap_sem such that
we require multiple acquisitions will just increase the cacheline
bouncing.

Also I think it makes more sense to continue an entire fault operation,
including blocking, if at all possible. Every retry will just waste more
time.

Also, there is a lot of possible blocking, there's lock_page,
page_mkwrite() -- which ends up calling into the dirty throttle etc. We
could not possibly retry on all that, the error paths involved would be
horrible for one.

That said, there's a fair bit of code that does allow the retry, and I
think most fault paths actually do the retry on IO.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 4/6] SRCU free VMAs
  2014-10-24 15:51         ` Peter Zijlstra
@ 2014-10-24 17:08           ` Christoph Lameter
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2014-10-24 17:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Paul McKenney, Thomas Gleixner, Andrew Morton,
	Rik van Riel, Mel Gorman, Oleg Nesterov, Ingo Molnar, Minchan Kim,
	KAMEZAWA Hiroyuki, Al Viro, Lai Jiangshan, Davidlohr Bueso,
	Linux Kernel Mailing List, linux-mm

On Fri, 24 Oct 2014, Peter Zijlstra wrote:

> The hold time isn't relevant, in fact breaking up the mmap_sem such that
> we require multiple acquisitions will just increase the cacheline
> bouncing.

Well this won't be happening anymore once you RCUify the stuff. If you go
to sleep then it's best to release mmap_sem and then the bouncing won't
matter.

Dropping mmap_sem there will also expose you to races you will see later
too when you RCUify the code paths. That way those can be dealt with
beforehand.

> Also I think it makes more sense to continue an entire fault operation,
> including blocking, if at all possible. Every retry will just waste more
> time.

Ok then don't retry. Just drop mmap_sem before going to sleep. When you
come back, evaluate the situation and if we can proceed do so, otherwise
retry.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 4/6] SRCU free VMAs
  2014-10-20 23:41   ` Linus Torvalds
  2014-10-21  8:07     ` Peter Zijlstra
@ 2014-10-21  8:22     ` Peter Zijlstra
  1 sibling, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-21  8:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Thomas Gleixner, Andrew Morton, Rik van Riel,
	Mel Gorman, Oleg Nesterov, Ingo Molnar, Minchan Kim,
	KAMEZAWA Hiroyuki, Al Viro, Lai Jiangshan, Davidlohr Bueso,
	Linux Kernel Mailing List, linux-mm

On Mon, Oct 20, 2014 at 04:41:45PM -0700, Linus Torvalds wrote:
> On Mon, Oct 20, 2014 at 2:56 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > Manage the VMAs with SRCU such that we can do a lockless VMA lookup.
> 
> Can you explain why srcu, and not plain regular rcu?
> 
> Especially as you then *note* some of the problems srcu can have.
> Making it regular rcu would also seem to make it possible to make the
> seqlock be just a seqcount, no?

Ah, the reason I did the seqlock is because the read side will spin-wait
for the low bit of the sequence (seq & 1, set while a write is in
progress) to go away. If the write side is preemptible that's horrid. I
used seqlock because that takes a lock (and thus disables preemption) on
the write side, but I could equally have done:

	preempt_disable();
	write_seqcount_begin();

	...

	write_seqcount_end();
	preempt_enable();

Since the lock is indeed superfluous, we're already fully serialized by
mmap_sem in this path.

Using regular RCU isn't sufficient, because of PREEMPT_RCU.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 4/6] SRCU free VMAs
  2014-10-20 21:56 ` [RFC][PATCH 4/6] SRCU free VMAs Peter Zijlstra
  2014-10-20 23:41   ` Linus Torvalds
@ 2014-10-23 10:14   ` Lai Jiangshan
  2014-10-23 11:03     ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Lai Jiangshan @ 2014-10-23 10:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, dave, linux-kernel, linux-mm


>  
> +struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr)
> +{
> +	struct vm_area_struct *vma;
> +	unsigned int seq;
> +
> +	WARN_ON_ONCE(!srcu_read_lock_held(&vma_srcu));
> +
> +	do {
> +		seq = read_seqbegin(&mm->mm_seq);
> +		vma = __find_vma(mm, addr);

Will __find_vma() loop forever due to the rotations in the RB tree?

> +	} while (read_seqretry(&mm->mm_seq, seq));
> +
> +	return vma;
> +}


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 4/6] SRCU free VMAs
  2014-10-23 10:14   ` Lai Jiangshan
@ 2014-10-23 11:03     ` Peter Zijlstra
  2014-10-24  3:33       ` Lai Jiangshan
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-23 11:03 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, dave, linux-kernel, linux-mm

On Thu, Oct 23, 2014 at 06:14:45PM +0800, Lai Jiangshan wrote:
> 
> >  
> > +struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr)
> > +{
> > +	struct vm_area_struct *vma;
> > +	unsigned int seq;
> > +
> > +	WARN_ON_ONCE(!srcu_read_lock_held(&vma_srcu));
> > +
> > +	do {
> > +		seq = read_seqbegin(&mm->mm_seq);
> > +		vma = __find_vma(mm, addr);
> 
> Will __find_vma() loop forever due to the rotations in the RB tree?

No, a rotation takes a tree and generates a tree. Furthermore, the
rotation has a fairly strict forward progress guarantee, seeing how it's
now done with preemption disabled.

Therefore, even if we're in a node that's being rotated up, we can only
'loop' for as long as it takes for the new pointer stores to become
visible on our CPU.

Thus we have a tree descent termination guarantee.
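
To make the "a rotation takes a tree and generates a tree" point
concrete, here is a small userspace sketch with hypothetical toy_*
names. It only shows that the states before and after a left rotation
are both valid search trees in which a descent terminates; it does not
model a reader observing the individual pointer stores in between,
which is the case the seqlock retry covers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical plain binary search tree node (no colours, no augmentation). */
struct toy_node {
	int key;
	struct toy_node *left, *right;
};

/* Rotate left around x and return the new subtree root. */
static struct toy_node *toy_rotate_left(struct toy_node *x)
{
	struct toy_node *y = x->right;

	assert(y != NULL);	/* a left rotation needs a right child */
	x->right = y->left;
	y->left = x;
	return y;
}

/* Ordinary descent: each step moves one level down, so it terminates. */
static struct toy_node *toy_find(struct toy_node *root, int key)
{
	struct toy_node *n = root;

	while (n && n->key != key)
		n = key < n->key ? n->left : n->right;
	return n;
}
```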


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 4/6] SRCU free VMAs
  2014-10-23 11:03     ` Peter Zijlstra
@ 2014-10-24  3:33       ` Lai Jiangshan
  2014-10-24  7:26         ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Lai Jiangshan @ 2014-10-24  3:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, dave, linux-kernel, linux-mm

On 10/23/2014 07:03 PM, Peter Zijlstra wrote:
> On Thu, Oct 23, 2014 at 06:14:45PM +0800, Lai Jiangshan wrote:
>>
>>>  
>>> +struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr)
>>> +{
>>> +	struct vm_area_struct *vma;
>>> +	unsigned int seq;
>>> +
>>> +	WARN_ON_ONCE(!srcu_read_lock_held(&vma_srcu));
>>> +
>>> +	do {
>>> +		seq = read_seqbegin(&mm->mm_seq);
>>> +		vma = __find_vma(mm, addr);
>>
>> Will __find_vma() loop forever due to the rotations in the RB tree?
> 
> No, a rotation takes a tree and generates a tree. Furthermore, the
> rotation has a fairly strict forward progress guarantee, seeing how it's
> now done with preemption disabled.

I can't get the magic.

__find_vma is visiting vma_a,
vma_a is rotated to near the top due to multiple updates to the mm.
__find_vma is visiting down to near the bottom, vma_b.
now vma_b is rotated up to near the top again.
__find_vma is visiting down to near the bottom, vma_c.
now vma_c is rotated up to near the top again.

...




> 
> Therefore, even if we're in a node that's being rotated up, we can only
> 'loop' for as long as it takes for the new pointer stores to become
> visible on our CPU.
> 
> Thus we have a tree descent termination guarantee.
> .
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 4/6] SRCU free VMAs
  2014-10-24  3:33       ` Lai Jiangshan
@ 2014-10-24  7:26         ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-24  7:26 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, dave, linux-kernel, linux-mm

On Fri, Oct 24, 2014 at 11:33:58AM +0800, Lai Jiangshan wrote:
> On 10/23/2014 07:03 PM, Peter Zijlstra wrote:
> > On Thu, Oct 23, 2014 at 06:14:45PM +0800, Lai Jiangshan wrote:
> >>
> >>>  
> >>> +struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr)
> >>> +{
> >>> +	struct vm_area_struct *vma;
> >>> +	unsigned int seq;
> >>> +
> >>> +	WARN_ON_ONCE(!srcu_read_lock_held(&vma_srcu));
> >>> +
> >>> +	do {
> >>> +		seq = read_seqbegin(&mm->mm_seq);
> >>> +		vma = __find_vma(mm, addr);
> >>
> >> Will __find_vma() loop forever due to the rotations in the RB tree?
> > 
> > No, a rotation takes a tree and generates a tree. Furthermore, the
> > rotation has a fairly strict forward progress guarantee, seeing how it's
> > now done with preemption disabled.
> 
> I can't get the magic.
> 
> __find_vma is visiting vma_a,
> vma_a is rotated to near the top due to multiple updates to the mm.
> __find_vma is visiting down to near the bottom, vma_b.
> now vma_b is rotated up to near the top again.
> __find_vma is visiting down to near the bottom, vma_c.
> now vma_c is rotated up to near the top again.
> 
> ...

Why would there be that many rotations? Is this a scenario where someone
is endlessly changing the tree?

If you stop updating the tree, the traversal will finish.

This is no different to the reader starvation already present with
seqlocks.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC][PATCH 5/6] mm: Provide speculative fault infrastructure
  2014-10-20 21:56 [RFC][PATCH 0/6] Another go at speculative page faults Peter Zijlstra
                   ` (3 preceding siblings ...)
  2014-10-20 21:56 ` [RFC][PATCH 4/6] SRCU free VMAs Peter Zijlstra
@ 2014-10-20 21:56 ` Peter Zijlstra
  2014-10-21  8:35   ` Kirill A. Shutemov
  2014-10-21 19:00   ` Peter Zijlstra
  2014-10-20 21:56 ` [RFC][PATCH 6/6] mm,x86: Add speculative pagefault handling Peter Zijlstra
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-20 21:56 UTC (permalink / raw)
  To: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: peterz-mm-speculative-fault.patch --]
[-- Type: text/plain, Size: 5269 bytes --]

Provide infrastructure to do a speculative fault (not holding
mmap_sem).

Not holding mmap_sem means we can race against VMA change/removal and
page-table destruction. We use the SRCU VMA freeing to keep the VMA
around. We use the VMA seqcount to detect change (including unmapping /
page-table deletion) and we use gup_fast() style page-table walking to
deal with page-table races.

Once we've obtained the page and are ready to update the PTE, we check
that the state we started the fault with is still valid. If not, we
fail the fault with VM_FAULT_RETRY; otherwise we update the PTE and
we're done.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/mm.h |    2 
 mm/memory.c        |  118 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 119 insertions(+), 1 deletion(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1162,6 +1162,8 @@ int generic_error_remove_page(struct add
 int invalidate_inode_page(struct page *page);
 
 #ifdef CONFIG_MMU
+extern int handle_speculative_fault(struct mm_struct *mm,
+			unsigned long address, unsigned int flags);
 extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags);
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2004,12 +2004,40 @@ struct fault_env {
 	pte_t entry;
 	spinlock_t *ptl;
 	unsigned int flags;
+	unsigned int sequence;
 };
 
 static bool pte_map_lock(struct fault_env *fe)
 {
+	bool ret = false;
+
+	if (!(fe->flags & FAULT_FLAG_SPECULATIVE)) {
+		fe->pte = pte_offset_map_lock(fe->mm, fe->pmd, fe->address, &fe->ptl);
+		return true;
+	}
+
+	/*
+	 * The first vma_is_dead() guarantees the page-tables are still valid,
+	 * having IRQs disabled ensures they stay around, hence the second
+	 * vma_is_dead() to make sure they are still valid once we've got the
+	 * lock. After that a concurrent zap_pte_range() will block on the PTL
+	 * and thus we're safe.
+	 */
+	local_irq_disable();
+	if (vma_is_dead(fe->vma, fe->sequence))
+		goto out;
+
 	fe->pte = pte_offset_map_lock(fe->mm, fe->pmd, fe->address, &fe->ptl);
-	return true;
+
+	if (vma_is_dead(fe->vma, fe->sequence)) {
+		pte_unmap_unlock(fe->pte, fe->ptl);
+		goto out;
+	}
+
+	ret = true;
+out:
+	local_irq_enable();
+	return ret;
 }
 
 /*
@@ -2432,6 +2460,7 @@ static int do_swap_page(struct fault_env
 	entry = pte_to_swp_entry(fe->entry);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
+			/* XXX fe->pmd might be dead */
 			migration_entry_wait(fe->mm, fe->pmd, fe->address);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
@@ -3357,6 +3386,93 @@ static int __handle_mm_fault(struct mm_s
 	return handle_pte_fault(&fe);
 }
 
+int handle_speculative_fault(struct mm_struct *mm, unsigned long address, unsigned int flags)
+{
+	struct fault_env fe = {
+		.mm = mm,
+		.address = address,
+		.flags = flags | FAULT_FLAG_SPECULATIVE,
+	};
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	int dead, seq, idx, ret = VM_FAULT_RETRY;
+	struct vm_area_struct *vma;
+
+	idx = srcu_read_lock(&vma_srcu);
+	vma = find_vma_srcu(mm, address);
+	if (!vma)
+		goto unlock;
+
+	/*
+	 * Validate the VMA found by the lockless lookup.
+	 */
+	dead = RB_EMPTY_NODE(&vma->vm_rb);
+	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
+	if ((seq & 1) || dead) /* XXX wait for !&1 instead? */
+		goto unlock;
+
+	if (address < vma->vm_start || vma->vm_end <= address)
+		goto unlock;
+
+	/*
+	 * We need to re-validate the VMA after checking the bounds, otherwise
+	 * we might have a false positive on the bounds.
+	 */
+	if (read_seqcount_retry(&vma->vm_sequence, seq))
+		goto unlock;
+
+	/*
+	 * Do a speculative lookup of the PTE entry.
+	 */
+	local_irq_disable();
+	pgd = pgd_offset(mm, address);
+	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+		goto out_walk;
+
+	pud = pud_offset(pgd, address);
+	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+		goto out_walk;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+		goto out_walk;
+
+	/*
+	 * The above does not allocate/instantiate page-tables because doing so
+	 * would lead to the possibility of instantiating page-tables after
+	 * free_pgtables() -- and consequently leaking them.
+	 *
+	 * The result is that we take at least one !speculative fault per PMD
+	 * in order to instantiate it.
+	 *
+	 * XXX try and fix that.. should be possible somehow.
+	 */
+
+	if (pmd_huge(*pmd)) /* XXX no huge support */
+		goto out_walk;
+
+	fe.vma = vma;
+	fe.pmd = pmd;
+	fe.sequence = seq;
+
+	pte = pte_offset_map(pmd, address);
+	fe.entry = ACCESS_ONCE(*pte); /* XXX gup_get_pte() */
+	pte_unmap(pte);
+	local_irq_enable();
+
+	ret = handle_pte_fault(&fe);
+
+unlock:
+	srcu_read_unlock(&vma_srcu, idx);
+	return ret;
+
+out_walk:
+	local_irq_enable();
+	goto unlock;
+}
+
 /*
  * By the time we get here, we already hold the mm semaphore
  *
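
The VMA validation at the top of handle_speculative_fault() follows a
snapshot / check / re-check pattern. Below is a minimal userspace
sketch of that pattern, assuming a toy VMA with a sequence counter and a
"dead" flag. All toy_* names are hypothetical; in particular
toy_vma_changed() only mirrors how the patch appears to use
vma_is_dead(), whose real definition lives elsewhere in the series and
may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_FAULT_OK	0
#define TOY_FAULT_RETRY	1

struct toy_vma {
	unsigned long start, end;	/* covers [start, end) */
	atomic_uint sequence;		/* odd while being modified */
	bool dead;			/* already unlinked from the tree */
};

/* Writer side: change the bounds inside a sequence-counter section. */
static void toy_vma_resize(struct toy_vma *vma, unsigned long start,
			   unsigned long end)
{
	assert(start < end);
	atomic_fetch_add(&vma->sequence, 1);	/* now odd */
	vma->start = start;
	vma->end = end;
	atomic_fetch_add(&vma->sequence, 1);	/* even again */
}

/*
 * Reader side: snapshot the sequence, reject a VMA that is dead or
 * mid-update, check the bounds, then confirm the sequence did not move
 * while the bounds were being read. On success the snapshot is handed
 * back so later steps can re-validate against it.
 */
static int toy_validate_vma(struct toy_vma *vma, unsigned long address,
			    unsigned int *seqp)
{
	bool dead = vma->dead;
	unsigned int seq = atomic_load(&vma->sequence);

	if ((seq & 1) || dead)
		return TOY_FAULT_RETRY;

	if (address < vma->start || vma->end <= address)
		return TOY_FAULT_RETRY;

	if (atomic_load(&vma->sequence) != seq)
		return TOY_FAULT_RETRY;

	*seqp = seq;
	return TOY_FAULT_OK;
}

/* Later re-validation, e.g. just before installing a PTE. */
static bool toy_vma_changed(struct toy_vma *vma, unsigned int seq)
{
	return vma->dead || atomic_load(&vma->sequence) != seq;
}
```

Unlike a classic seqlock reader, the sketch never loops: any sign of a
concurrent change simply reports TOY_FAULT_RETRY, leaving the caller to
fall back to the locked path.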



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 5/6] mm: Provide speculative fault infrastructure
  2014-10-20 21:56 ` [RFC][PATCH 5/6] mm: Provide speculative fault infrastructure Peter Zijlstra
@ 2014-10-21  8:35   ` Kirill A. Shutemov
  2014-10-21 10:41     ` Peter Zijlstra
  2014-10-21 19:00   ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Kirill A. Shutemov @ 2014-10-21  8:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave, linux-kernel,
	linux-mm

On Mon, Oct 20, 2014 at 11:56:38PM +0200, Peter Zijlstra wrote:
> Provide infrastructure to do a speculative fault (not holding
> mmap_sem).
> 
> The not holding of mmap_sem means we can race against VMA
> change/removal and page-table destruction. We use the SRCU VMA freeing
> to keep the VMA around. We use the VMA seqcount to detect change
> (including unmapping / page-table deletion) and we use gup_fast() style
> page-table walking to deal with page-table races.
> 
> Once we've obtained the page and are ready to update the PTE, we
> validate if the state we started the fault with is still valid, if
> not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
> PTE and we're done.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/mm.h |    2 
>  mm/memory.c        |  118 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 119 insertions(+), 1 deletion(-)
> 
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1162,6 +1162,8 @@ int generic_error_remove_page(struct add
>  int invalidate_inode_page(struct page *page);
>  
>  #ifdef CONFIG_MMU
> +extern int handle_speculative_fault(struct mm_struct *mm,
> +			unsigned long address, unsigned int flags);
>  extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unsigned long address, unsigned int flags);
>  extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2004,12 +2004,40 @@ struct fault_env {
>  	pte_t entry;
>  	spinlock_t *ptl;
>  	unsigned int flags;
> +	unsigned int sequence;
>  };
>  
>  static bool pte_map_lock(struct fault_env *fe)
>  {
> +	bool ret = false;
> +
> +	if (!(fe->flags & FAULT_FLAG_SPECULATIVE)) {
> +		fe->pte = pte_offset_map_lock(fe->mm, fe->pmd, fe->address, &fe->ptl);
> +		return true;
> +	}
> +
> +	/*
> +	 * The first vma_is_dead() guarantees the page-tables are still valid,
> +	 * having IRQs disabled ensures they stay around, hence the second
> +	 * vma_is_dead() to make sure they are still valid once we've got the
> +	 * lock. After that a concurrent zap_pte_range() will block on the PTL
> +	 * and thus we're safe.
> +	 */
> +	local_irq_disable();
> +	if (vma_is_dead(fe->vma, fe->sequence))
> +		goto out;
> +
>  	fe->pte = pte_offset_map_lock(fe->mm, fe->pmd, fe->address, &fe->ptl);
> -	return true;
> +
> +	if (vma_is_dead(fe->vma, fe->sequence)) {
> +		pte_unmap_unlock(fe->pte, fe->ptl);
> +		goto out;
> +	}
> +
> +	ret = true;
> +out:
> +	local_irq_enable();
> +	return ret;
>  }
>  
>  /*
> @@ -2432,6 +2460,7 @@ static int do_swap_page(struct fault_env
>  	entry = pte_to_swp_entry(fe->entry);
>  	if (unlikely(non_swap_entry(entry))) {
>  		if (is_migration_entry(entry)) {
> +			/* XXX fe->pmd might be dead */
>  			migration_entry_wait(fe->mm, fe->pmd, fe->address);
>  		} else if (is_hwpoison_entry(entry)) {
>  			ret = VM_FAULT_HWPOISON;
> @@ -3357,6 +3386,93 @@ static int __handle_mm_fault(struct mm_s
>  	return handle_pte_fault(&fe);
>  }
>  
> +int handle_speculative_fault(struct mm_struct *mm, unsigned long address, unsigned int flags)
> +{
> +	struct fault_env fe = {
> +		.mm = mm,
> +		.address = address,
> +		.flags = flags | FAULT_FLAG_SPECULATIVE,
> +	};
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte;
> +	int dead, seq, idx, ret = VM_FAULT_RETRY;
> +	struct vm_area_struct *vma;
> +
> +	idx = srcu_read_lock(&vma_srcu);
> +	vma = find_vma_srcu(mm, address);
> +	if (!vma)
> +		goto unlock;
> +
> +	/*
> +	 * Validate the VMA found by the lockless lookup.
> +	 */
> +	dead = RB_EMPTY_NODE(&vma->vm_rb);
> +	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
> +	if ((seq & 1) || dead) /* XXX wait for !&1 instead? */
> +		goto unlock;
> +
> +	if (address < vma->vm_start || vma->vm_end <= address)
> +		goto unlock;
> +
> +	/*
> +	 * We need to re-validate the VMA after checking the bounds, otherwise
> +	 * we might have a false positive on the bounds.
> +	 */
> +	if (read_seqcount_retry(&vma->vm_sequence, seq))
> +		goto unlock;
> +
> +	/*
> +	 * Do a speculative lookup of the PTE entry.
> +	 */
> +	local_irq_disable();
> +	pgd = pgd_offset(mm, address);
> +	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> +		goto out_walk;
> +
> +	pud = pud_offset(pgd, address);
> +	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> +		goto out_walk;

pud_huge() too. Or filter out VM_HUGETLB altogether.

BTW, what keeps mm_struct around? It seems we don't take a reference
during the page fault.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 5/6] mm: Provide speculative fault infrastructure
  2014-10-21  8:35   ` Kirill A. Shutemov
@ 2014-10-21 10:41     ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-21 10:41 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave, linux-kernel,
	linux-mm

On Tue, Oct 21, 2014 at 11:35:48AM +0300, Kirill A. Shutemov wrote:
> pud_huge() too. Or filter out VM_HUGETLB altogether.

Oh right, giga pages, all this newfangled stuff ;-) But yes, I suppose
we can exclude hugetlbfs. We should arguably make the THP muck work,
though.

> BTW, what keeps mm_struct around? It seems we don't take a reference
> during the page fault.

Last I checked tasks have a ref on their own mm, and seeing this all
runs in task context, the mm should be pretty safe.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 5/6] mm: Provide speculative fault infrastructure
  2014-10-20 21:56 ` [RFC][PATCH 5/6] mm: Provide speculative fault infrastructure Peter Zijlstra
  2014-10-21  8:35   ` Kirill A. Shutemov
@ 2014-10-21 19:00   ` Peter Zijlstra
  1 sibling, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-21 19:00 UTC (permalink / raw)
  To: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave
  Cc: linux-kernel, linux-mm

On Mon, Oct 20, 2014 at 11:56:38PM +0200, Peter Zijlstra wrote:
>  static bool pte_map_lock(struct fault_env *fe)
>  {
> +	bool ret = false;
> +
> +	if (!(fe->flags & FAULT_FLAG_SPECULATIVE)) {
> +		fe->pte = pte_offset_map_lock(fe->mm, fe->pmd, fe->address, &fe->ptl);
> +		return true;
> +	}
> +
> +	/*
> +	 * The first vma_is_dead() guarantees the page-tables are still valid,
> +	 * having IRQs disabled ensures they stay around, hence the second
> +	 * vma_is_dead() to make sure they are still valid once we've got the
> +	 * lock. After that a concurrent zap_pte_range() will block on the PTL
> +	 * and thus we're safe.
> +	 */
> +	local_irq_disable();
> +	if (vma_is_dead(fe->vma, fe->sequence))
> +		goto out;
> +
>  	fe->pte = pte_offset_map_lock(fe->mm, fe->pmd, fe->address, &fe->ptl);

Yeah, so this deadlocks just fine: I found we still do TLB flushes while
holding the PTL. Bugger that; the alternative is to either force
everybody to do RCU-freed page-tables or put back the ugly code :/

Ah well..

> +
> +	if (vma_is_dead(fe->vma, fe->sequence)) {
> +		pte_unmap_unlock(fe->pte, fe->ptl);
> +		goto out;
> +	}
> +
> +	ret = true;
> +out:
> +	local_irq_enable();
> +	return ret;
>  }


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC][PATCH 6/6] mm,x86: Add speculative pagefault handling
  2014-10-20 21:56 [RFC][PATCH 0/6] Another go at speculative page faults Peter Zijlstra
                   ` (4 preceding siblings ...)
  2014-10-20 21:56 ` [RFC][PATCH 5/6] mm: Provide speculative fault infrastructure Peter Zijlstra
@ 2014-10-20 21:56 ` Peter Zijlstra
  2014-10-21  0:07 ` [RFC][PATCH 0/6] Another go at speculative page faults Andy Lutomirski
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-20 21:56 UTC (permalink / raw)
  To: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave
  Cc: linux-kernel, linux-mm, Peter Zijlstra

[-- Attachment #1: peterz-mm-x86.patch --]
[-- Type: text/plain, Size: 3296 bytes --]

Try a speculative fault before acquiring mmap_sem. If it returns
VM_FAULT_RETRY, continue with the mmap_sem acquisition and do the
traditional fault.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/mm/fault.c |   35 ++++++++++++++++++++++-------------
 1 file changed, 22 insertions(+), 13 deletions(-)

--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -844,11 +844,8 @@ do_sigbus(struct pt_regs *regs, unsigned
 	  unsigned int fault)
 {
 	struct task_struct *tsk = current;
-	struct mm_struct *mm = tsk->mm;
 	int code = BUS_ADRERR;
 
-	up_read(&mm->mmap_sem);
-
 	/* Kernel mode? Handle exceptions or die: */
 	if (!(error_code & PF_USER)) {
 		no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
@@ -879,7 +876,6 @@ mm_fault_error(struct pt_regs *regs, uns
 	       unsigned long address, unsigned int fault)
 {
 	if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
-		up_read(&current->mm->mmap_sem);
 		no_context(regs, error_code, address, 0, 0);
 		return;
 	}
@@ -887,14 +883,11 @@ mm_fault_error(struct pt_regs *regs, uns
 	if (fault & VM_FAULT_OOM) {
 		/* Kernel mode? Handle exceptions or die: */
 		if (!(error_code & PF_USER)) {
-			up_read(&current->mm->mmap_sem);
 			no_context(regs, error_code, address,
 				   SIGSEGV, SEGV_MAPERR);
 			return;
 		}
 
-		up_read(&current->mm->mmap_sem);
-
 		/*
 		 * We ran out of memory, call the OOM killer, and return the
 		 * userspace (which will retry the fault, or kill us if we got
@@ -1141,6 +1134,16 @@ __do_page_fault(struct pt_regs *regs, un
 	if (error_code & PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
 
+	if (error_code & PF_USER) {
+		fault = handle_speculative_fault(mm, address,
+					flags & ~FAULT_FLAG_ALLOW_RETRY);
+
+		if (fault & VM_FAULT_RETRY)
+			goto retry;
+
+		goto done;
+	}
+
 	/*
 	 * When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in
@@ -1225,9 +1228,15 @@ __do_page_fault(struct pt_regs *regs, un
 	 * signal first. We do not need to release the mmap_sem because it
 	 * would already be released in __lock_page_or_retry in mm/filemap.c.
 	 */
-	if (unlikely((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)))
-		return;
+	if (unlikely(fault & VM_FAULT_RETRY)) {
+		if (fatal_signal_pending(current))
+			return;
+
+		goto done;
+	}
 
+	up_read(&mm->mmap_sem);
+done:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		mm_fault_error(regs, error_code, address, fault);
 		return;
@@ -1249,8 +1258,10 @@ __do_page_fault(struct pt_regs *regs, un
 				      regs, address);
 		}
 		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			 * of starvation. */
+			/*
+			 * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk of
+			 * starvation.
+			 */
 			flags &= ~FAULT_FLAG_ALLOW_RETRY;
 			flags |= FAULT_FLAG_TRIED;
 			goto retry;
@@ -1258,8 +1269,6 @@ __do_page_fault(struct pt_regs *regs, un
 	}
 
 	check_v8086_mode(regs, address, tsk);
-
-	up_read(&mm->mmap_sem);
 }
 NOKPROBE_SYMBOL(__do_page_fault);
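
The control flow this patch adds to __do_page_fault() boils down to
"try lockless first, fall back to the locked path on RETRY". A minimal
userspace sketch of that shape follows. The toy_* names are
hypothetical, the speculation outcome is just a flag, and the
"semaphore" is a plain reader count that only records acquisitions; it
is not a real lock and the sketch is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FAULT_OK	0
#define TOY_FAULT_RETRY	1

struct toy_mm {
	bool speculation_ok;	/* pretend outcome of the lockless path */
	int sem_readers;	/* stand-in for mmap_sem held for read */
	int sem_acquisitions;	/* how often the fallback took the lock */
	int speculative_faults;	/* faults completed without the lock */
	int locked_faults;	/* faults completed under the lock */
};

static void toy_down_read(struct toy_mm *mm)
{
	mm->sem_readers++;
	mm->sem_acquisitions++;
}

static void toy_up_read(struct toy_mm *mm)
{
	assert(mm->sem_readers > 0);
	mm->sem_readers--;
}

/* Lockless attempt: either completes the fault or asks for a retry. */
static int toy_speculative_fault(struct toy_mm *mm)
{
	if (!mm->speculation_ok)
		return TOY_FAULT_RETRY;

	mm->speculative_faults++;
	return TOY_FAULT_OK;
}

/* Traditional path: must be called with the lock held for read. */
static int toy_locked_fault(struct toy_mm *mm)
{
	assert(mm->sem_readers > 0);
	mm->locked_faults++;
	return TOY_FAULT_OK;
}

/* Try the speculative path; only take the lock when it bails out. */
static int toy_do_page_fault(struct toy_mm *mm)
{
	int fault = toy_speculative_fault(mm);

	if (fault != TOY_FAULT_RETRY)
		return fault;

	toy_down_read(mm);
	fault = toy_locked_fault(mm);
	toy_up_read(mm);

	return fault;
}
```

When speculation_ok is true the fault completes with sem_acquisitions
still at zero, which corresponds to the "entire fault without mmap_sem"
case discussed in the replies to the cover letter.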
 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 0/6] Another go at speculative page faults
  2014-10-20 21:56 [RFC][PATCH 0/6] Another go at speculative page faults Peter Zijlstra
                   ` (5 preceding siblings ...)
  2014-10-20 21:56 ` [RFC][PATCH 6/6] mm,x86: Add speculative pagefault handling Peter Zijlstra
@ 2014-10-21  0:07 ` Andy Lutomirski
  2014-10-21  8:11   ` Peter Zijlstra
  2014-10-21 16:23 ` Ingo Molnar
  2014-10-22  7:34 ` Davidlohr Bueso
  8 siblings, 1 reply; 47+ messages in thread
From: Andy Lutomirski @ 2014-10-21  0:07 UTC (permalink / raw)
  To: Peter Zijlstra, torvalds, paulmck, tglx, akpm, riel, mgorman,
	oleg, mingo, minchan, kamezawa.hiroyu, viro, laijs, dave
  Cc: linux-kernel, linux-mm

On 10/20/2014 02:56 PM, Peter Zijlstra wrote:
> Hi,
> 
> I figured I'd give my 2010 speculative fault series another spin:
> 
>   https://lkml.org/lkml/2010/1/4/257
> 
> Since then I think many of the outstanding issues have changed sufficiently to
> warrant another go. In particular Al Viro's delayed fput seems to have made it
> entirely 'normal' to delay fput(). Lai Jiangshan's SRCU rewrite provided us
> with call_srcu() and my preemptible mmu_gather removed the TLB flushes from
> under the PTL.
> 
> The code needs way more attention but builds a kernel and runs the
> micro-benchmark so I figured I'd post it before sinking more time into it.
> 
> I realize the micro-bench is about as good as it gets for this series and not
> very realistic otherwise, but I think it does show the potential benefit the
> approach has.

Does this mean that an entire fault can complete without ever taking
mmap_sem at all?  If so, that's a *huge* win.

I'm a bit concerned about drivers that assume that the vma is unchanged
during .fault processing.  In particular, is there a race between .close
and .fault?  Would it make sense to add a per-vma rw lock and hold it
during vma modification and .fault calls?

--Andy

> 
> (patches go against .18-rc1+)
> 
> ---
> 
> Using Kamezawa's multi-fault micro-bench from: https://lkml.org/lkml/2010/1/6/28
> 
> My Ivy Bridge EP (2*10*2) has a ~58% improvement in pagefault throughput:
> 
> PRE:
> 
> root@ivb-ep:~# perf stat -e page-faults,cache-misses --repeat 5 ./multi-fault 20
> 
>  Performance counter stats for './multi-fault 20' (5 runs):
> 
>        149,441,555      page-faults                  ( +-  1.25% )
>      2,153,651,828      cache-misses                 ( +-  1.09% )
> 
>       60.003082014 seconds time elapsed              ( +-  0.00% )
> 
> POST:
> 
> root@ivb-ep:~# perf stat -e page-faults,cache-misses --repeat 5 ./multi-fault 20
> 
>  Performance counter stats for './multi-fault 20' (5 runs):
> 
>        236,442,626      page-faults                  ( +-  0.08% )
>      2,796,353,939      cache-misses                 ( +-  1.01% )
> 
>       60.002792431 seconds time elapsed              ( +-  0.00% )
> 
> 
> My Ivy Bridge EX (4*15*2) has a ~78% improvement in pagefault throughput:
> 
> PRE:
> 
> root@ivb-ex:~# perf stat -e page-faults,cache-misses --repeat 5 ./multi-fault 60
> 
>  Performance counter stats for './multi-fault 60' (5 runs):
> 
>        105,789,078      page-faults                 ( +-  2.24% )
>      1,314,072,090      cache-misses                ( +-  1.17% )
> 
>       60.009243533 seconds time elapsed             ( +-  0.00% )
> 
> POST:
> 
> root@ivb-ex:~# perf stat -e page-faults,cache-misses --repeat 5 ./multi-fault 60
> 
>  Performance counter stats for './multi-fault 60' (5 runs):
> 
>        187,751,767      page-faults                 ( +-  2.24% )
>      1,792,758,664      cache-misses                ( +-  2.30% )
> 
>       60.011611579 seconds time elapsed             ( +-  0.00% )
> 
> (I've not yet looked at why the EX sucks chunks compared to the EP box, I
>  suspect we contend on other locks, but it could be anything.)
> 
> ---
> 
>  arch/x86/mm/fault.c      |  35 ++-
>  include/linux/mm.h       |  19 +-
>  include/linux/mm_types.h |   5 +
>  kernel/fork.c            |   1 +
>  mm/init-mm.c             |   1 +
>  mm/internal.h            |  18 ++
>  mm/memory.c              | 672 ++++++++++++++++++++++++++++-------------------
>  mm/mmap.c                | 101 +++++--
>  8 files changed, 544 insertions(+), 308 deletions(-)
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 0/6] Another go at speculative page faults
  2014-10-21  0:07 ` [RFC][PATCH 0/6] Another go at speculative page faults Andy Lutomirski
@ 2014-10-21  8:11   ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-21  8:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave, linux-kernel,
	linux-mm

On Mon, Oct 20, 2014 at 05:07:02PM -0700, Andy Lutomirski wrote:
> On 10/20/2014 02:56 PM, Peter Zijlstra wrote:
> > Hi,
> > 
> > I figured I'd give my 2010 speculative fault series another spin:
> > 
> >   https://lkml.org/lkml/2010/1/4/257
> > 
> > Since then I think many of the outstanding issues have changed sufficiently to
> > warrant another go. In particular Al Viro's delayed fput seems to have made it
> > entirely 'normal' to delay fput(). Lai Jiangshan's SRCU rewrite provided us
> > with call_srcu() and my preemptible mmu_gather removed the TLB flushes from
> > under the PTL.
> > 
> > The code needs way more attention but builds a kernel and runs the
> > micro-benchmark so I figured I'd post it before sinking more time into it.
> > 
> > I realize the micro-bench is about as good as it gets for this series and not
> > very realistic otherwise, but I think it does show the potential benefit the
> > approach has.
> 
> Does this mean that an entire fault can complete without ever taking
> mmap_sem at all?  If so, that's a *huge* win.

Yep.

> I'm a bit concerned about drivers that assume that the vma is unchanged
> during .fault processing.  In particular, is there a race between .close
> and .fault?  Would it make sense to add a per-vma rw lock and hold it
> during vma modification and .fault calls?

VMA granularity contention would be about as bad as mmap_sem for many
workloads. But yes, that is one of the things we need to look at. I was
_hoping_ that holding the file open would sort most of these problems,
but I'm sure there's plenty of 'interesting' cruft left.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 0/6] Another go at speculative page faults
  2014-10-20 21:56 [RFC][PATCH 0/6] Another go at speculative page faults Peter Zijlstra
                   ` (6 preceding siblings ...)
  2014-10-21  0:07 ` [RFC][PATCH 0/6] Another go at speculative page faults Andy Lutomirski
@ 2014-10-21 16:23 ` Ingo Molnar
  2014-10-21 17:09   ` Kirill A. Shutemov
  2014-10-21 17:25   ` Peter Zijlstra
  2014-10-22  7:34 ` Davidlohr Bueso
  8 siblings, 2 replies; 47+ messages in thread
From: Ingo Molnar @ 2014-10-21 16:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave, linux-kernel,
	linux-mm


* Peter Zijlstra <peterz@infradead.org> wrote:

> My Ivy Bridge EP (2*10*2) has a ~58% improvement in pagefault throughput:
> 
> PRE:
>        149,441,555      page-faults                  ( +-  1.25% )
>
> POST:
>        236,442,626      page-faults                  ( +-  0.08% )

> My Ivy Bridge EX (4*15*2) has a ~78% improvement in pagefault throughput:
> 
> PRE:
>        105,789,078      page-faults                 ( +-  2.24% )
>
> POST:
>        187,751,767      page-faults                 ( +-  2.24% )

I guess the 'PRE' and 'POST' numbers should be flipped around?

Thanks,

	Ingo


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 0/6] Another go at speculative page faults
  2014-10-21 16:23 ` Ingo Molnar
@ 2014-10-21 17:09   ` Kirill A. Shutemov
  2014-10-21 17:56     ` Peter Zijlstra
  2014-10-21 17:25   ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Kirill A. Shutemov @ 2014-10-21 17:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, torvalds, paulmck, tglx, akpm, riel, mgorman,
	oleg, mingo, minchan, kamezawa.hiroyu, viro, laijs, dave,
	linux-kernel, linux-mm

On Tue, Oct 21, 2014 at 06:23:40PM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > My Ivy Bridge EP (2*10*2) has a ~58% improvement in pagefault throughput:
> > 
> > PRE:
> >        149,441,555      page-faults                  ( +-  1.25% )
> >
> > POST:
> >        236,442,626      page-faults                  ( +-  0.08% )
> 
> > My Ivy Bridge EX (4*15*2) has a ~78% improvement in pagefault throughput:
> > 
> > PRE:
> >        105,789,078      page-faults                 ( +-  2.24% )
> >
> > POST:
> >        187,751,767      page-faults                 ( +-  2.24% )
> 
> I guess the 'PRE' and 'POST' numbers should be flipped around?

I think it's faults per second.

It would be interesting to see if the patchset affects the non-contended
case, like a single-threaded workload.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 0/6] Another go at speculative page faults
  2014-10-21 17:09   ` Kirill A. Shutemov
@ 2014-10-21 17:56     ` Peter Zijlstra
  2014-10-23 10:40       ` Lai Jiangshan
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-21 17:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ingo Molnar, torvalds, paulmck, tglx, akpm, riel, mgorman, oleg,
	mingo, minchan, kamezawa.hiroyu, viro, laijs, dave, linux-kernel,
	linux-mm

On Tue, Oct 21, 2014 at 08:09:48PM +0300, Kirill A. Shutemov wrote:
> It would be interesting to see if the patchset affects the non-contended
> case, like a single-threaded workload.

It does, and not in a good way, I'll have to look at that... :/

 Performance counter stats for './multi-fault 1' (5 runs):

        73,860,251      page-faults                                                   ( +-  0.28% )
            40,914      cache-misses                                                  ( +- 41.26% )

      60.001484913 seconds time elapsed                                          ( +-  0.00% )


 Performance counter stats for './multi-fault 1' (5 runs):

        70,700,838      page-faults                                                   ( +-  0.03% )
            31,466      cache-misses                                                  ( +-  8.62% )

      60.001753906 seconds time elapsed                                          ( +-  0.00% )
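
The page-fault counts quoted in this thread are totals over roughly
60-second runs, so a higher count means higher throughput. The helpers
below are a small sketch for turning two such counts into a relative
change and a per-second rate. The function names are hypothetical, and
treating the first single-threaded listing above as the unpatched
kernel is an assumption; the message does not label the two runs.

```c
#include <assert.h>

/* Relative change from `pre` to `post` in percent (positive = more). */
static double percent_change(double pre, double post)
{
	assert(pre > 0.0);
	return (post - pre) / pre * 100.0;
}

/* Average fault rate for a run that lasted `seconds`. */
static double faults_per_second(double faults, double seconds)
{
	assert(seconds > 0.0);
	return faults / seconds;
}
```

Fed with the numbers from the thread, percent_change() gives roughly
+58% for the EP box (149,441,555 to 236,442,626), roughly +77.5% for
the EX box (105,789,078 to 187,751,767), and roughly -4.3% for the
single-threaded run (73,860,251 to 70,700,838, under the ordering
assumption above).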


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 0/6] Another go at speculative page faults
  2014-10-21 17:56     ` Peter Zijlstra
@ 2014-10-23 10:40       ` Lai Jiangshan
  2014-10-23 11:04         ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Lai Jiangshan @ 2014-10-23 10:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill A. Shutemov, Ingo Molnar, torvalds, paulmck, tglx, akpm,
	riel, mgorman, oleg, mingo, minchan, kamezawa.hiroyu, viro, dave,
	linux-kernel, linux-mm

On 10/22/2014 01:56 AM, Peter Zijlstra wrote:
> On Tue, Oct 21, 2014 at 08:09:48PM +0300, Kirill A. Shutemov wrote:
>> It would be interesting to see if the patchset affects the non-contended
>> case, like a single-threaded workload.
> 
> It does, and not in a good way, I'll have to look at that... :/

Maybe find_vma_srcu() is to blame: it doesn't take advantage of
vmacache_find(), which could cause more cache misses.

Is it hard to use the vmacache in find_vma_srcu()?

> 
>  Performance counter stats for './multi-fault 1' (5 runs):
> 
>         73,860,251      page-faults                                                   ( +-  0.28% )
>             40,914      cache-misses                                                  ( +- 41.26% )
> 
>       60.001484913 seconds time elapsed                                          ( +-  0.00% )
> 
> 
>  Performance counter stats for './multi-fault 1' (5 runs):
> 
>         70,700,838      page-faults                                                   ( +-  0.03% )
>             31,466      cache-misses                                                  ( +-  8.62% )
> 
>       60.001753906 seconds time elapsed                                          ( +-  0.00% )
> .
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 0/6] Another go at speculative page faults
  2014-10-23 10:40       ` Lai Jiangshan
@ 2014-10-23 11:04         ` Peter Zijlstra
  2014-10-24  7:54           ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-23 11:04 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Kirill A. Shutemov, Ingo Molnar, torvalds, paulmck, tglx, akpm,
	riel, mgorman, oleg, mingo, minchan, kamezawa.hiroyu, viro, dave,
	linux-kernel, linux-mm

On Thu, Oct 23, 2014 at 06:40:05PM +0800, Lai Jiangshan wrote:
> On 10/22/2014 01:56 AM, Peter Zijlstra wrote:
> > On Tue, Oct 21, 2014 at 08:09:48PM +0300, Kirill A. Shutemov wrote:
> >> It would be interesting to see if the patchset affects the non-contended case.
> >> Like a one-threaded workload.
> > 
> > It does, and not in a good way, I'll have to look at that... :/
> 
> Maybe it is blamed to find_vma_srcu() that it doesn't take the advantage of
> the vmacache_find() and cause more cache-misses.

It's what I thought initially. I tried doing perf record with and
without, but then I ran into perf diff not quite working for me, and I've
yet to find time to kick that thing into shape.

> Is it hard to use the vmacache in the find_vma_srcu()?

I've not had time to look at it.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 0/6] Another go at speculative page faults
  2014-10-23 11:04         ` Peter Zijlstra
@ 2014-10-24  7:54           ` Ingo Molnar
  2014-10-24 13:14             ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2014-10-24  7:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Kirill A. Shutemov, torvalds, paulmck, tglx, akpm,
	riel, mgorman, oleg, mingo, minchan, kamezawa.hiroyu, viro, dave,
	linux-kernel, linux-mm


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Oct 23, 2014 at 06:40:05PM +0800, Lai Jiangshan wrote:
> > On 10/22/2014 01:56 AM, Peter Zijlstra wrote:
> > > On Tue, Oct 21, 2014 at 08:09:48PM +0300, Kirill A. Shutemov wrote:
> > >> It would be interesting to see if the patchset affects the non-contended case.
> > >> Like a one-threaded workload.
> > > 
> > > It does, and not in a good way, I'll have to look at that... :/
> > 
> > Maybe it is blamed to find_vma_srcu() that it doesn't take the advantage of
> > the vmacache_find() and cause more cache-misses.
> 
> Its what I thought initially, I tried doing perf record with and
> without, but then I ran into perf diff not quite working for me and I've
> yet to find time to kick that thing into shape.

Might be the 'perf diff' regression fixed by this:

  9ab1f50876db perf diff: Add missing hists__init() call at tool start

I just pushed it out into tip:master.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 0/6] Another go at speculative page faults
  2014-10-24  7:54           ` Ingo Molnar
@ 2014-10-24 13:14             ` Peter Zijlstra
  2014-10-28  5:32               ` Namhyung Kim
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-24 13:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Lai Jiangshan, Kirill A. Shutemov, torvalds, paulmck, tglx, akpm,
	riel, mgorman, oleg, mingo, minchan, kamezawa.hiroyu, viro, dave,
	linux-kernel, linux-mm

On Fri, Oct 24, 2014 at 09:54:23AM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Thu, Oct 23, 2014 at 06:40:05PM +0800, Lai Jiangshan wrote:
> > > On 10/22/2014 01:56 AM, Peter Zijlstra wrote:
> > > > On Tue, Oct 21, 2014 at 08:09:48PM +0300, Kirill A. Shutemov wrote:
> > > >> It would be interesting to see if the patchset affects the non-contended case.
> > > >> Like a one-threaded workload.
> > > > 
> > > > It does, and not in a good way, I'll have to look at that... :/
> > > 
> > > Maybe it is blamed to find_vma_srcu() that it doesn't take the advantage of
> > > the vmacache_find() and cause more cache-misses.
> > 
> > Its what I thought initially, I tried doing perf record with and
> > without, but then I ran into perf diff not quite working for me and I've
> > yet to find time to kick that thing into shape.
> 
> Might be the 'perf diff' regression fixed by this:
> 
>   9ab1f50876db perf diff: Add missing hists__init() call at tool start
> 
> I just pushed it out into tip:master.

I was on tip/master, so it's unlikely to be that; I most likely already
had that commit.

perf-report was affected too, for some reason my CONFIG_DEBUG_INFO=y
vmlinux wasn't showing symbols (and I double checked that KASLR crap was
disabled, so that wasn't confusing stuff either).

When I forced perf-report to use kallsyms it works, however perf-diff
doesn't have that option.

So there are two issues there: 1) perf-report failing to generate useful
output and 2) perf-diff lacking options to force it to behave.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 0/6] Another go at speculative page faults
  2014-10-24 13:14             ` Peter Zijlstra
@ 2014-10-28  5:32               ` Namhyung Kim
  0 siblings, 0 replies; 47+ messages in thread
From: Namhyung Kim @ 2014-10-28  5:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Lai Jiangshan, Kirill A. Shutemov, torvalds, paulmck,
	tglx, akpm, riel, mgorman, oleg, mingo, minchan, kamezawa.hiroyu,
	viro, dave, linux-kernel, linux-mm

Hi Peter,

On Fri, 24 Oct 2014 15:14:40 +0200, Peter Zijlstra wrote:
> On Fri, Oct 24, 2014 at 09:54:23AM +0200, Ingo Molnar wrote:
>> 
>> * Peter Zijlstra <peterz@infradead.org> wrote:
>> > Its what I thought initially, I tried doing perf record with and
>> > without, but then I ran into perf diff not quite working for me and I've
>> > yet to find time to kick that thing into shape.
>> 
>> Might be the 'perf diff' regression fixed by this:
>> 
>>   9ab1f50876db perf diff: Add missing hists__init() call at tool start
>> 
>> I just pushed it out into tip:master.
>
> I was on tip/master, so unlikely to be that as I was likely already
> having it.
>
> perf-report was affected too, for some reason my CONFIG_DEBUG_INFO=y
> vmlinux wasn't showing symbols (and I double checked that KASLR crap was
> disabled, so that wasn't confusing stuff either).
>
> When I forced perf-report to use kallsyms it works, however perf-diff
> doesn't have that option.
>
> So there's two issues there, 1) perf-report failing to generate useful
> output and 2) perf-diff lacking options to force it to behave.

Did perf-report fail to show any (kernel) symbols, or did it show the
wrong symbols?  Maybe it's related to this:

https://lkml.org/lkml/2014/9/22/78

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 0/6] Another go at speculative page faults
  2014-10-21 16:23 ` Ingo Molnar
  2014-10-21 17:09   ` Kirill A. Shutemov
@ 2014-10-21 17:25   ` Peter Zijlstra
  2014-10-22 12:35     ` Ingo Molnar
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-21 17:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave, linux-kernel,
	linux-mm

On Tue, Oct 21, 2014 at 06:23:40PM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > My Ivy Bridge EP (2*10*2) has a ~58% improvement in pagefault throughput:
> > 
> > PRE:
> >        149,441,555      page-faults                  ( +-  1.25% )
> >
> > POST:
> >        236,442,626      page-faults                  ( +-  0.08% )
> 
> > My Ivy Bridge EX (4*15*2) has a ~78% improvement in pagefault throughput:
> > 
> > PRE:
> >        105,789,078      page-faults                 ( +-  2.24% )
> >
> > POST:
> >        187,751,767      page-faults                 ( +-  2.24% )
> 
> I guess the 'PRE' and 'POST' numbers should be flipped around?

Nope, it's the number of page-faults serviced in a fixed amount of time
(60 seconds), therefore higher is better.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 0/6] Another go at speculative page faults
  2014-10-21 17:25   ` Peter Zijlstra
@ 2014-10-22 12:35     ` Ingo Molnar
  0 siblings, 0 replies; 47+ messages in thread
From: Ingo Molnar @ 2014-10-22 12:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, dave, linux-kernel,
	linux-mm


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Oct 21, 2014 at 06:23:40PM +0200, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > My Ivy Bridge EP (2*10*2) has a ~58% improvement in pagefault throughput:
> > > 
> > > PRE:
> > >        149,441,555      page-faults                  ( +-  1.25% )
> > >
> > > POST:
> > >        236,442,626      page-faults                  ( +-  0.08% )
> > 
> > > My Ivy Bridge EX (4*15*2) has a ~78% improvement in pagefault throughput:
> > > 
> > > PRE:
> > >        105,789,078      page-faults                 ( +-  2.24% )
> > >
> > > POST:
> > >        187,751,767      page-faults                 ( +-  2.24% )
> > 
> > I guess the 'PRE' and 'POST' numbers should be flipped around?
> 
> Nope, its the number of page-faults serviced in a fixed amount of time
> (60 seconds), therefore higher is better.

Ah, okay!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 0/6] Another go at speculative page faults
  2014-10-20 21:56 [RFC][PATCH 0/6] Another go at speculative page faults Peter Zijlstra
                   ` (7 preceding siblings ...)
  2014-10-21 16:23 ` Ingo Molnar
@ 2014-10-22  7:34 ` Davidlohr Bueso
  2014-10-22 11:29   ` Kirill A. Shutemov
  8 siblings, 1 reply; 47+ messages in thread
From: Davidlohr Bueso @ 2014-10-22  7:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, paulmck, tglx, akpm, riel, mgorman, oleg, mingo,
	minchan, kamezawa.hiroyu, viro, laijs, linux-kernel, linux-mm

On Mon, 2014-10-20 at 23:56 +0200, Peter Zijlstra wrote:
> Hi,
> 
> I figured I'd give my 2010 speculative fault series another spin:
> 
>   https://lkml.org/lkml/2010/1/4/257
> 
> Since then I think many of the outstanding issues have changed sufficiently to
> warrant another go. In particular Al Viro's delayed fput seems to have made it
> entirely 'normal' to delay fput(). Lai Jiangshan's SRCU rewrite provided us
> with call_srcu() and my preemptible mmu_gather removed the TLB flushes from
> under the PTL.
> 
> The code needs way more attention but builds a kernel and runs the
> micro-benchmark so I figured I'd post it before sinking more time into it.
> 
> I realize the micro-bench is about as good as it gets for this series and not
> very realistic otherwise, but I think it does show the potential benefit the
> approach has.
> 
> (patches go against .18-rc1+)

I think patch 2/6 is borken:

error: patch failed: mm/memory.c:2025
error: mm/memory.c: patch does not apply

Relatedly, as you mention, I would very much welcome having the
introduction of 'struct fault_env' as a separate cleanup patch. May I
suggest renaming it to fault_cxt?

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 0/6] Another go at speculative page faults
  2014-10-22  7:34 ` Davidlohr Bueso
@ 2014-10-22 11:29   ` Kirill A. Shutemov
  2014-10-22 11:45     ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Kirill A. Shutemov @ 2014-10-22 11:29 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, torvalds, paulmck, tglx, akpm, riel, mgorman,
	oleg, mingo, minchan, kamezawa.hiroyu, viro, laijs, linux-kernel,
	linux-mm

On Wed, Oct 22, 2014 at 12:34:49AM -0700, Davidlohr Bueso wrote:
> On Mon, 2014-10-20 at 23:56 +0200, Peter Zijlstra wrote:
> > Hi,
> > 
> > I figured I'd give my 2010 speculative fault series another spin:
> > 
> >   https://lkml.org/lkml/2010/1/4/257
> > 
> > Since then I think many of the outstanding issues have changed sufficiently to
> > warrant another go. In particular Al Viro's delayed fput seems to have made it
> > entirely 'normal' to delay fput(). Lai Jiangshan's SRCU rewrite provided us
> > with call_srcu() and my preemptible mmu_gather removed the TLB flushes from
> > under the PTL.
> > 
> > The code needs way more attention but builds a kernel and runs the
> > micro-benchmark so I figured I'd post it before sinking more time into it.
> > 
> > I realize the micro-bench is about as good as it gets for this series and not
> > very realistic otherwise, but I think it does show the potential benefit the
> > approach has.
> > 
> > (patches go against .18-rc1+)
> 
> I think patch 2/6 is borken:
> 
> error: patch failed: mm/memory.c:2025
> error: mm/memory.c: patch does not apply
> 
> and related, as you mention, I would very much welcome having the
> introduction of 'struct fault_env' as a separate cleanup patch. May I
> suggest renaming it to fault_cxt?

What about extending 'struct vm_fault' and using it earlier in the call stack?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 0/6] Another go at speculative page faults
  2014-10-22 11:29   ` Kirill A. Shutemov
@ 2014-10-22 11:45     ` Peter Zijlstra
  2014-10-22 11:55       ` Kirill A. Shutemov
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-22 11:45 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Davidlohr Bueso, torvalds, paulmck, tglx, akpm, riel, mgorman,
	oleg, mingo, minchan, kamezawa.hiroyu, viro, laijs, linux-kernel,
	linux-mm

On Wed, Oct 22, 2014 at 02:29:25PM +0300, Kirill A. Shutemov wrote:
> On Wed, Oct 22, 2014 at 12:34:49AM -0700, Davidlohr Bueso wrote:
> > On Mon, 2014-10-20 at 23:56 +0200, Peter Zijlstra wrote:
> > > Hi,
> > > 
> > > I figured I'd give my 2010 speculative fault series another spin:
> > > 
> > >   https://lkml.org/lkml/2010/1/4/257
> > > 
> > > Since then I think many of the outstanding issues have changed sufficiently to
> > > warrant another go. In particular Al Viro's delayed fput seems to have made it
> > > entirely 'normal' to delay fput(). Lai Jiangshan's SRCU rewrite provided us
> > > with call_srcu() and my preemptible mmu_gather removed the TLB flushes from
> > > under the PTL.
> > > 
> > > The code needs way more attention but builds a kernel and runs the
> > > micro-benchmark so I figured I'd post it before sinking more time into it.
> > > 
> > > I realize the micro-bench is about as good as it gets for this series and not
> > > very realistic otherwise, but I think it does show the potential benefit the
> > > approach has.
> > > 
> > > (patches go against .18-rc1+)
> > 
> > I think patch 2/6 is borken:
> > 
> > error: patch failed: mm/memory.c:2025
> > error: mm/memory.c: patch does not apply
> > 
> > and related, as you mention, I would very much welcome having the
> > introduction of 'struct fault_env' as a separate cleanup patch. May I
> > suggest renaming it to fault_cxt?
> 
> What about extend start using 'struct vm_fault' earlier by stack?

I'm not sure we should mix the environment for vm_ops::fault, which
acquires the page, and the fault path, which deals with changing the
PTE. Ideally we should not expose the page-table information to file
ops; it's a layering violation if nothing else. Drivers should not have
access to the page tables.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 0/6] Another go at speculative page faults
  2014-10-22 11:45     ` Peter Zijlstra
@ 2014-10-22 11:55       ` Kirill A. Shutemov
  0 siblings, 0 replies; 47+ messages in thread
From: Kirill A. Shutemov @ 2014-10-22 11:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Davidlohr Bueso, torvalds, paulmck, tglx, akpm, riel, mgorman,
	oleg, mingo, minchan, kamezawa.hiroyu, viro, laijs, linux-kernel,
	linux-mm

On Wed, Oct 22, 2014 at 01:45:58PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 22, 2014 at 02:29:25PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Oct 22, 2014 at 12:34:49AM -0700, Davidlohr Bueso wrote:
> > > On Mon, 2014-10-20 at 23:56 +0200, Peter Zijlstra wrote:
> > > > Hi,
> > > > 
> > > > I figured I'd give my 2010 speculative fault series another spin:
> > > > 
> > > >   https://lkml.org/lkml/2010/1/4/257
> > > > 
> > > > Since then I think many of the outstanding issues have changed sufficiently to
> > > > warrant another go. In particular Al Viro's delayed fput seems to have made it
> > > > entirely 'normal' to delay fput(). Lai Jiangshan's SRCU rewrite provided us
> > > > with call_srcu() and my preemptible mmu_gather removed the TLB flushes from
> > > > under the PTL.
> > > > 
> > > > The code needs way more attention but builds a kernel and runs the
> > > > micro-benchmark so I figured I'd post it before sinking more time into it.
> > > > 
> > > > I realize the micro-bench is about as good as it gets for this series and not
> > > > very realistic otherwise, but I think it does show the potential benefit the
> > > > approach has.
> > > > 
> > > > (patches go against .18-rc1+)
> > > 
> > > I think patch 2/6 is borken:
> > > 
> > > error: patch failed: mm/memory.c:2025
> > > error: mm/memory.c: patch does not apply
> > > 
> > > and related, as you mention, I would very much welcome having the
> > > introduction of 'struct fault_env' as a separate cleanup patch. May I
> > > suggest renaming it to fault_cxt?
> > 
> > What about extend start using 'struct vm_fault' earlier by stack?
> 
> I'm not sure we should mix the environment for vm_ops::fault, which
> acquires the page, and the fault path, which deals with changing the
> PTE. Ideally we should not expose the page-table information to file
> ops, its a layering violating if nothing else, drivers should not have
> access to the page tables.

We already have this for ->map_pages() :-P
I asked whether it's considered a layering violation, and it seems nobody
cares...

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 5/6] mm: Provide speculative fault infrastructure
@ 2014-10-21  9:07 Hillf Danton
  2014-10-21 10:42 ` Peter Zijlstra
  2014-10-21 10:43 ` Peter Zijlstra
  0 siblings, 2 replies; 47+ messages in thread
From: Hillf Danton @ 2014-10-21  9:07 UTC (permalink / raw)
  To: Peter Zijlstra, LKML, Linus Torvalds, Paul E. McKenney, tglx,
	akpm, riel, mgorman, oleg, mingo, minchan, kamezawa.hiroyu, viro,
	linux-mm
  Cc: hillf.zj

Hey Peter

> Date:	Mon, 20 Oct 2014 23:56:38 +0200
> From:	Peter Zijlstra <peterz@infradead.org>
> To:	torvalds@linux-foundation.org, paulmck@linux.vnet.ibm.com,
> tglx@linutronix.de, akpm@linux-foundation.org, riel@redhat.com,
> mgorman@suse.de, oleg@redhat.com, mingo@redhat.com, minchan@kernel.org,
> kamezawa.hiroyu@jp.fujitsu.com, viro@zeniv.linux.org.uk, la
> Cc:	linux-kernel@vger.kernel.org, linux-mm@kvack.org, "Peter Zijlstra"
> <peterz@infradead.org>
> Subject: [RFC][PATCH 5/6] mm: Provide speculative fault infrastructure
>
> Provide infrastructure to do a speculative fault (not holding
> mmap_sem).
>
> The not holding of mmap_sem means we can race against VMA
> change/removal and page-table destruction. We use the SRCU VMA freeing
> to keep the VMA around. We use the VMA seqcount to detect change
> (including unmapping / page-table deletion) and we use gup_fast() style
> page-table walking to deal with page-table races.
>
> Once we've obtained the page and are ready to update the PTE, we
> validate if the state we started the fault with is still valid, if
> not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
> PTE and we're done.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/mm.h |    2
>  mm/memory.c        |  118 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 119 insertions(+), 1 deletion(-)
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1162,6 +1162,8 @@ int generic_error_remove_page(struct add
>  int invalidate_inode_page(struct page *page);
>
>  #ifdef CONFIG_MMU
> +extern int handle_speculative_fault(struct mm_struct *mm,
> +			unsigned long address, unsigned int flags);
>  extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unsigned long address, unsigned int flags);
>  extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2004,12 +2004,40 @@ struct fault_env {
>  	pte_t entry;
>  	spinlock_t *ptl;
>  	unsigned int flags;
> +	unsigned int sequence;
>  };
>
>  static bool pte_map_lock(struct fault_env *fe)
>  {
> +	bool ret = false;
> +
> +	if (!(fe->flags & FAULT_FLAG_SPECULATIVE)) {
> +		fe->pte = pte_offset_map_lock(fe->mm, fe->pmd, fe->address, &fe->ptl);
> +		return true;
> +	}
> +
> +	/*
> +	 * The first vma_is_dead() guarantees the page-tables are still valid,
> +	 * having IRQs disabled ensures they stay around, hence the second
> +	 * vma_is_dead() to make sure they are still valid once we've got the
> +	 * lock. After that a concurrent zap_pte_range() will block on the PTL
> +	 * and thus we're safe.
> +	 */
> +	local_irq_disable();
> +	if (vma_is_dead(fe->vma, fe->sequence))
> +		goto out;
> +
>  	fe->pte = pte_offset_map_lock(fe->mm, fe->pmd, fe->address, &fe->ptl);
> -	return true;
> +
> +	if (vma_is_dead(fe->vma, fe->sequence)) {
> +		pte_unmap_unlock(fe->pte, fe->ptl);
> +		goto out;
> +	}
> +
> +	ret = true;
> +out:
> +	local_irq_enable();
> +	return ret;
>  }
>
>  /*
> @@ -2432,6 +2460,7 @@ static int do_swap_page(struct fault_env
>  	entry = pte_to_swp_entry(fe->entry);
>  	if (unlikely(non_swap_entry(entry))) {
>  		if (is_migration_entry(entry)) {
> +			/* XXX fe->pmd might be dead */
>  			migration_entry_wait(fe->mm, fe->pmd, fe->address);
>  		} else if (is_hwpoison_entry(entry)) {
>  			ret = VM_FAULT_HWPOISON;
> @@ -3357,6 +3386,93 @@ static int __handle_mm_fault(struct mm_s
>  	return handle_pte_fault(&fe);
>  }
>
> +int handle_speculative_fault(struct mm_struct *mm, unsigned long address, unsigned int flags)
> +{
> +	struct fault_env fe = {
> +		.mm = mm,
> +		.address = address,
> +		.flags = flags | FAULT_FLAG_SPECULATIVE,
> +	};
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte;
> +	int dead, seq, idx, ret = VM_FAULT_RETRY;
> +	struct vm_area_struct *vma;
> +
> +	idx = srcu_read_lock(&vma_srcu);
> +	vma = find_vma_srcu(mm, address);
> +	if (!vma)
> +		goto unlock;
> +
> +	/*
> +	 * Validate the VMA found by the lockless lookup.
> +	 */
> +	dead = RB_EMPTY_NODE(&vma->vm_rb);
> +	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
> +	if ((seq & 1) || dead) /* XXX wait for !&1 instead? */
> +		goto unlock;
> +
> +	if (address < vma->vm_start || vma->vm_end <= address)
> +		goto unlock;
> +
> +	/*
> +	 * We need to re-validate the VMA after checking the bounds, otherwise
> +	 * we might have a false positive on the bounds.
> +	 */
> +	if (read_seqcount_retry(&vma->vm_sequence, seq))
> +		goto unlock;
> +
> +	/*
> +	 * Do a speculative lookup of the PTE entry.
> +	 */
> +	local_irq_disable();
> +	pgd = pgd_offset(mm, address);
> +	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> +		goto out_walk;
> +
> +	pud = pud_offset(pgd, address);
> +	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> +		goto out_walk;
> +
> +	pmd = pmd_offset(pud, address);
> +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> +		goto out_walk;
> +
> +	/*
> +	 * The above does not allocate/instantiate page-tables because doing so
> +	 * would lead to the possibility of instantiating page-tables after
> +	 * free_pgtables() -- and consequently leaking them.
> +	 *
> +	 * The result is that we take at least one !speculative fault per PMD
> +	 * in order to instantiate it.
> +	 *
> +	 * XXX try and fix that.. should be possible somehow.
> +	 */
> +
> +	if (pmd_huge(*pmd)) /* XXX no huge support */
> +		goto out_walk;
> +
> +	fe.vma = vma;
> +	fe.pmd = pmd;
> +	fe.sequence = seq;
> +
> +	pte = pte_offset_map(pmd, address);
> +	fe.entry = ACCESS_ONCE(pte); /* XXX gup_get_pte() */

I wonder if one char, "*", is missing.

btw, and more importantly, is it still correct for me to
address you as a Redhater, Sir?

Hillf
> +	pte_unmap(pte);
> +	local_irq_enable();
> +
> +	ret = handle_pte_fault(&fe);
> +
> +unlock:
> +	srcu_read_unlock(&vma_srcu, idx);
> +	return ret;
> +
> +out_walk:
> +	local_irq_enable();
> +	goto unlock;
> +}
> +
>  /*
>   * By the time we get here, we already hold the mm semaphore
>   *

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 5/6] mm: Provide speculative fault infrastructure
  2014-10-21  9:07 [RFC][PATCH 5/6] mm: Provide speculative fault infrastructure Hillf Danton
@ 2014-10-21 10:42 ` Peter Zijlstra
  2014-10-21 10:43 ` Peter Zijlstra
  1 sibling, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-21 10:42 UTC (permalink / raw)
  To: Hillf Danton
  Cc: LKML, Linus Torvalds, Paul E. McKenney, tglx, akpm, riel, mgorman,
	oleg, mingo, minchan, kamezawa.hiroyu, viro, linux-mm, hillf.zj

On Tue, Oct 21, 2014 at 05:07:56PM +0800, Hillf Danton wrote:

> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

> btw, and more important, still correct for me to
> address you Redhater, Sir?

Clue in the above line ;-)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC][PATCH 5/6] mm: Provide speculative fault infrastructure
  2014-10-21  9:07 [RFC][PATCH 5/6] mm: Provide speculative fault infrastructure Hillf Danton
  2014-10-21 10:42 ` Peter Zijlstra
@ 2014-10-21 10:43 ` Peter Zijlstra
  1 sibling, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2014-10-21 10:43 UTC (permalink / raw)
  To: Hillf Danton
  Cc: LKML, Linus Torvalds, Paul E. McKenney, tglx, akpm, riel, mgorman,
	oleg, mingo, minchan, kamezawa.hiroyu, viro, linux-mm, hillf.zj

On Tue, Oct 21, 2014 at 05:07:56PM +0800, Hillf Danton wrote:
> > +	pte = pte_offset_map(pmd, address);
> > +	fe.entry = ACCESS_ONCE(pte); /* XXX gup_get_pte() */
> 
> I wonder if one char, "*", is missing.
>
> > +	pte_unmap(pte);

Gah yes, that was a last-minute edit. I noticed I had missed the pte_unmap()
while doing the changelogs and 'fixed' up the code.

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2014-10-28  5:32 UTC | newest]

Thread overview: 47+ messages
2014-10-20 21:56 [RFC][PATCH 0/6] Another go at speculative page faults Peter Zijlstra
2014-10-20 21:56 ` [RFC][PATCH 1/6] mm: Dont assume page-table invariance during faults Peter Zijlstra
2014-10-20 21:56 ` [RFC][PATCH 2/6] mm: Prepare for FAULT_FLAG_SPECULATIVE Peter Zijlstra
2014-10-20 21:56 ` [RFC][PATCH 3/6] mm: VMA sequence count Peter Zijlstra
2014-10-22 11:26   ` Kirill A. Shutemov
2014-10-22 11:39     ` Peter Zijlstra
2014-10-22 11:53       ` Kirill A. Shutemov
2014-10-22 12:15         ` Peter Zijlstra
2014-10-22 13:44           ` Peter Zijlstra
2014-10-23 12:36             ` Kirill A. Shutemov
2014-10-23 14:22               ` Peter Zijlstra
2014-10-23 15:05                 ` Kirill A. Shutemov
2014-10-20 21:56 ` [RFC][PATCH 4/6] SRCU free VMAs Peter Zijlstra
2014-10-20 23:41   ` Linus Torvalds
2014-10-21  8:07     ` Peter Zijlstra
2014-10-24 15:16       ` Christoph Lameter
2014-10-24 15:51         ` Peter Zijlstra
2014-10-24 17:08           ` Christoph Lameter
2014-10-21  8:22     ` Peter Zijlstra
2014-10-23 10:14   ` Lai Jiangshan
2014-10-23 11:03     ` Peter Zijlstra
2014-10-24  3:33       ` Lai Jiangshan
2014-10-24  7:26         ` Peter Zijlstra
2014-10-20 21:56 ` [RFC][PATCH 5/6] mm: Provide speculative fault infrastructure Peter Zijlstra
2014-10-21  8:35   ` Kirill A. Shutemov
2014-10-21 10:41     ` Peter Zijlstra
2014-10-21 19:00   ` Peter Zijlstra
2014-10-20 21:56 ` [RFC][PATCH 6/6] mm,x86: Add speculative pagefault handling Peter Zijlstra
2014-10-21  0:07 ` [RFC][PATCH 0/6] Another go at speculative page faults Andy Lutomirski
2014-10-21  8:11   ` Peter Zijlstra
2014-10-21 16:23 ` Ingo Molnar
2014-10-21 17:09   ` Kirill A. Shutemov
2014-10-21 17:56     ` Peter Zijlstra
2014-10-23 10:40       ` Lai Jiangshan
2014-10-23 11:04         ` Peter Zijlstra
2014-10-24  7:54           ` Ingo Molnar
2014-10-24 13:14             ` Peter Zijlstra
2014-10-28  5:32               ` Namhyung Kim
2014-10-21 17:25   ` Peter Zijlstra
2014-10-22 12:35     ` Ingo Molnar
2014-10-22  7:34 ` Davidlohr Bueso
2014-10-22 11:29   ` Kirill A. Shutemov
2014-10-22 11:45     ` Peter Zijlstra
2014-10-22 11:55       ` Kirill A. Shutemov
  -- strict thread matches above, loose matches on Subject: below --
2014-10-21  9:07 [RFC][PATCH 5/6] mm: Provide speculative fault infrastructure Hillf Danton
2014-10-21 10:42 ` Peter Zijlstra
2014-10-21 10:43 ` Peter Zijlstra
