[RFC v4 06/20] mm: Provide speculative fault infrastructure

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Laurent Dufour <ldufour@linux.vnet.ibm.com>
To: paulmck@linux.vnet.ibm.com, peterz@infradead.org,
	akpm@linux-foundation.org, kirill@shutemov.name,
	ak@linux.intel.com, mhocko@kernel.org, dave@stgolabs.net,
	jack@suse.cz, Matthew Wilcox <willy@infradead.org>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	haren@linux.vnet.ibm.com, khandual@linux.vnet.ibm.com,
	npiggin@gmail.com, bsingharora@gmail.com
Subject: [RFC v4 06/20] mm: Provide speculative fault infrastructure
Date: Fri,  9 Jun 2017 16:20:55 +0200	[thread overview]
Message-ID: <1497018069-17790-7-git-send-email-ldufour@linux.vnet.ibm.com> (raw)
In-Reply-To: <1497018069-17790-1-git-send-email-ldufour@linux.vnet.ibm.com>

From: Peter Zijlstra <peterz@infradead.org>

Provide infrastructure to do a speculative fault (not holding
mmap_sem).

The not holding of mmap_sem means we can race against VMA
change/removal and page-table destruction. We use the SRCU VMA freeing
to keep the VMA around. We use the VMA seqcount to detect change
(including umapping / page-table deletion) and we use gup_fast() style
page-table walking to deal with page-table races.

Once we've obtained the page and are ready to update the PTE, we
validate if the state we started the fault with is still valid, if
not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
PTE and we're done.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Fix newly introduced pte_spinlock() for speculative page fault]
[Rename vma_is_dead() to vma_has_changed()]
[Call p4d_alloc() as it is safe since pgd is valid]
[Call pud_alloc() as it is safe since p4d is valid]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
 include/linux/mm.h |   3 ++
 mm/memory.c        | 148 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 148 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6b7ec2a76953..671541e00d26 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -315,6 +315,7 @@ struct vm_fault {
 	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
 	pgoff_t pgoff;			/* Logical page offset based on vma */
 	unsigned long address;		/* Faulting virtual address */
+	unsigned int sequence;
 	pmd_t *pmd;			/* Pointer to pmd entry matching
 					 * the 'address' */
 	pud_t *pud;			/* Pointer to pud entry matching
@@ -1286,6 +1287,8 @@ int invalidate_inode_page(struct page *page);
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		unsigned int flags);
+extern int handle_speculative_fault(struct mm_struct *mm,
+				    unsigned long address, unsigned int flags);
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags,
 			    bool *unlocked);
diff --git a/mm/memory.c b/mm/memory.c
index 8f0468d099fc..8c43895e9310 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2244,15 +2244,66 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 
 static bool pte_spinlock(struct vm_fault *vmf)
 {
+	bool ret = false;
+
+	/* Check if vma is still valid */
+	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+		vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+		spin_lock(vmf->ptl);
+		return true;
+	}
+
+	local_irq_disable();
+	if (vma_has_changed(vmf->vma, vmf->sequence))
+		goto out;
+
 	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
 	spin_lock(vmf->ptl);
-	return true;
+
+	if (vma_has_changed(vmf->vma, vmf->sequence)) {
+		spin_unlock(vmf->ptl);
+		goto out;
+	}
+
+	ret = true;
+out:
+	local_irq_enable();
+	return ret;
 }
 
 static bool pte_map_lock(struct vm_fault *vmf)
 {
-	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
-	return true;
+	bool ret = false;
+
+	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+		vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+					       vmf->address, &vmf->ptl);
+		return true;
+	}
+
+	/*
+	 * The first vma_has_changed() guarantees the page-tables are still
+	 * valid, having IRQs disabled ensures they stay around, hence the
+	 * second vma_has_changed() to make sure they are still valid once
+	 * we've got the lock. After that a concurrent zap_pte_range() will
+	 * block on the PTL and thus we're safe.
+	 */
+	local_irq_disable();
+	if (vma_has_changed(vmf->vma, vmf->sequence))
+		goto out;
+
+	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+				       vmf->address, &vmf->ptl);
+
+	if (vma_has_changed(vmf->vma, vmf->sequence)) {
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		goto out;
+	}
+
+	ret = true;
+out:
+	local_irq_enable();
+	return ret;
 }
 
 /*
@@ -2684,6 +2735,7 @@ int do_swap_page(struct vm_fault *vmf)
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
+			/* XXX fe->pmd might be dead */
 			migration_entry_wait(vma->vm_mm, vmf->pmd,
 					     vmf->address);
 		} else if (is_hwpoison_entry(entry)) {
@@ -3868,6 +3920,96 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	return handle_pte_fault(&vmf);
 }
 
+int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
+			     unsigned int flags)
+{
+	struct vm_fault vmf = {
+		.address = address,
+		.flags = flags | FAULT_FLAG_SPECULATIVE,
+	};
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	int dead, seq, idx, ret = VM_FAULT_RETRY;
+	struct vm_area_struct *vma;
+
+	idx = srcu_read_lock(&vma_srcu);
+	vma = find_vma_srcu(mm, address);
+	if (!vma)
+		goto unlock;
+
+	/*
+	 * Validate the VMA found by the lockless lookup.
+	 */
+	dead = RB_EMPTY_NODE(&vma->vm_rb);
+	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
+	if ((seq & 1) || dead) /* XXX wait for !&1 instead? */
+		goto unlock;
+
+	if (address < vma->vm_start || vma->vm_end <= address)
+		goto unlock;
+
+	/*
+	 * We need to re-validate the VMA after checking the bounds, otherwise
+	 * we might have a false positive on the bounds.
+	 */
+	if (read_seqcount_retry(&vma->vm_sequence, seq))
+		goto unlock;
+
+	/*
+	 * Do a speculative lookup of the PTE entry.
+	 */
+	local_irq_disable();
+	pgd = pgd_offset(mm, address);
+	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+		goto out_walk;
+
+	p4d = p4d_alloc(mm, pgd, address);
+	if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
+		goto out_walk;
+
+	pud = pud_alloc(mm, p4d, address);
+	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+		goto out_walk;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+		goto out_walk;
+
+	/*
+	 * The above does not allocate/instantiate page-tables because doing so
+	 * would lead to the possibility of instantiating page-tables after
+	 * free_pgtables() -- and consequently leaking them.
+	 *
+	 * The result is that we take at least one !speculative fault per PMD
+	 * in order to instantiate it.
+	 *
+	 * XXX try and fix that.. should be possible somehow.
+	 */
+
+	if (pmd_huge(*pmd)) /* XXX no huge support */
+		goto out_walk;
+
+	vmf.vma = vma;
+	vmf.pmd = pmd;
+	vmf.pgoff = linear_page_index(vma, address);
+	vmf.gfp_mask = __get_fault_gfp_mask(vma);
+	vmf.sequence = seq;
+
+	local_irq_enable();
+
+	ret = handle_pte_fault(&vmf);
+
+unlock:
+	srcu_read_unlock(&vma_srcu, idx);
+	return ret;
+
+out_walk:
+	local_irq_enable();
+	goto unlock;
+}
+
 /*
  * By the time we get here, we already hold the mm semaphore
  *
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2017-06-09 14:21 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-06-09 14:20 [RFC v4 00/20] Speculative page faults Laurent Dufour
2017-06-09 14:20 ` [RFC v4 01/20] mm: Dont assume page-table invariance during faults Laurent Dufour
2017-06-09 14:20 ` [RFC v4 02/20] mm: Prepare for FAULT_FLAG_SPECULATIVE Laurent Dufour
2017-06-09 14:20 ` [RFC v4 03/20] mm: Introduce pte_spinlock Laurent Dufour
2017-06-09 14:20 ` [RFC v4 04/20] mm: VMA sequence count Laurent Dufour
2017-06-09 14:20 ` [RFC v4 05/20] mm: RCU free VMAs Laurent Dufour
2017-06-09 14:20 ` Laurent Dufour [this message]
2017-06-09 14:20 ` [RFC v4 07/20] mm/spf: Try spin lock in speculative path Laurent Dufour
2017-06-09 14:20 ` [RFC v4 08/20] mm/spf: Fix fe.sequence init in __handle_mm_fault() Laurent Dufour
2017-06-09 14:20 ` [RFC v4 09/20] mm/spf: don't set fault entry's fields if locking failed Laurent Dufour
2017-06-09 14:20 ` [RFC v4 10/20] mm/spf; fix lock dependency against mapping->i_mmap_rwsem Laurent Dufour
2017-06-09 14:21 ` [RFC v4 11/20] mm/spf: Protect changes to vm_flags Laurent Dufour
2017-06-09 14:21 ` [RFC v4 12/20] mm/spf Protect vm_policy's changes against speculative pf Laurent Dufour
2017-06-09 14:21 ` [RFC v4 13/20] mm/spf: Add check on the VMA's flags Laurent Dufour
2017-06-09 14:21 ` [RFC v4 14/20] mm/spf: protect madvise vs speculative pf Laurent Dufour
2017-06-09 14:21 ` [RFC v4 15/20] mm/spf: protect mremap() against " Laurent Dufour
2017-06-09 14:21 ` [RFC v4 16/20] mm/spf: Don't call user fault callback in the speculative path Laurent Dufour
2017-06-09 14:21 ` [RFC v4 17/20] x86/mm: Add speculative pagefault handling Laurent Dufour
2017-06-09 14:21 ` [RFC v4 18/20] x86/mm: Update the handle_speculative_fault's path Laurent Dufour
2017-06-09 14:21 ` [RFC v4 19/20] powerpc/mm: Add speculative page fault Laurent Dufour
2017-06-09 14:21 ` [RFC v4 20/20] mm/spf: Clear FAULT_FLAG_KILLABLE in the speculative path Laurent Dufour
2017-06-09 15:01 ` [RFC v4 00/20] Speculative page faults Michal Hocko
2017-06-09 15:25   ` Laurent Dufour
2017-06-09 16:35     ` Michal Hocko
2017-06-09 16:59       ` Tim Chen
2017-06-13 10:19         ` Laurent Dufour
2017-06-13  9:58       ` Laurent Dufour
2017-06-09 18:55 ` Paul E. McKenney
2017-06-12 10:20 ` Jan Kara
2017-06-13 10:24   ` Laurent Dufour

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:6b7ec2a7695 dfblob:671541e00d2 dfblob:8f0468d099f
dfblob:8c43895e931 )
 OR (
bs:"[RFC v4 06/20] mm: Provide speculative fault infrastructure" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1497018069-17790-7-git-send-email-ldufour@linux.vnet.ibm.com \
    --to=ldufour@linux.vnet.ibm.com \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=bsingharora@gmail.com \
    --cc=dave@stgolabs.net \
    --cc=haren@linux.vnet.ibm.com \
    --cc=jack@suse.cz \
    --cc=khandual@linux.vnet.ibm.com \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=npiggin@gmail.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).