Re: [RFC][PATCH 5/6] mm: Provide speculative fault infrastructure

From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Peter Zijlstra <peterz@infradead.org>
Cc: torvalds@linux-foundation.org, paulmck@linux.vnet.ibm.com,
	tglx@linutronix.de, akpm@linux-foundation.org, riel@redhat.com,
	mgorman@suse.de, oleg@redhat.com, mingo@redhat.com,
	minchan@kernel.org, kamezawa.hiroyu@jp.fujitsu.com,
	viro@zeniv.linux.org.uk, laijs@cn.fujitsu.com, dave@stgolabs.net,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC][PATCH 5/6] mm: Provide speculative fault infrastructure
Date: Tue, 21 Oct 2014 11:35:48 +0300
Message-ID: <20141021083548.GA22200@node.dhcp.inet.fi>
In-Reply-To: <20141020222841.490529442@infradead.org>

On Mon, Oct 20, 2014 at 11:56:38PM +0200, Peter Zijlstra wrote:
> Provide infrastructure to do a speculative fault (not holding
> mmap_sem).
> 
> The not holding of mmap_sem means we can race against VMA
> change/removal and page-table destruction. We use the SRCU VMA freeing
> to keep the VMA around. We use the VMA seqcount to detect change
> (including unmapping / page-table deletion) and we use gup_fast() style
> page-table walking to deal with page-table races.
> 
> Once we've obtained the page and are ready to update the PTE, we
> validate if the state we started the fault with is still valid, if
> not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
> PTE and we're done.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/mm.h |    2 
>  mm/memory.c        |  118 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 119 insertions(+), 1 deletion(-)
> 
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1162,6 +1162,8 @@ int generic_error_remove_page(struct add
>  int invalidate_inode_page(struct page *page);
>  
>  #ifdef CONFIG_MMU
> +extern int handle_speculative_fault(struct mm_struct *mm,
> +			unsigned long address, unsigned int flags);
>  extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unsigned long address, unsigned int flags);
>  extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2004,12 +2004,40 @@ struct fault_env {
>  	pte_t entry;
>  	spinlock_t *ptl;
>  	unsigned int flags;
> +	unsigned int sequence;
>  };
>  
>  static bool pte_map_lock(struct fault_env *fe)
>  {
> +	bool ret = false;
> +
> +	if (!(fe->flags & FAULT_FLAG_SPECULATIVE)) {
> +		fe->pte = pte_offset_map_lock(fe->mm, fe->pmd, fe->address, &fe->ptl);
> +		return true;
> +	}
> +
> +	/*
> +	 * The first vma_is_dead() guarantees the page-tables are still valid,
> +	 * having IRQs disabled ensures they stay around, hence the second
> +	 * vma_is_dead() to make sure they are still valid once we've got the
> +	 * lock. After that a concurrent zap_pte_range() will block on the PTL
> +	 * and thus we're safe.
> +	 */
> +	local_irq_disable();
> +	if (vma_is_dead(fe->vma, fe->sequence))
> +		goto out;
> +
>  	fe->pte = pte_offset_map_lock(fe->mm, fe->pmd, fe->address, &fe->ptl);
> -	return true;
> +
> +	if (vma_is_dead(fe->vma, fe->sequence)) {
> +		pte_unmap_unlock(fe->pte, fe->ptl);
> +		goto out;
> +	}
> +
> +	ret = true;
> +out:
> +	local_irq_enable();
> +	return ret;
>  }
>  
>  /*
> @@ -2432,6 +2460,7 @@ static int do_swap_page(struct fault_env
>  	entry = pte_to_swp_entry(fe->entry);
>  	if (unlikely(non_swap_entry(entry))) {
>  		if (is_migration_entry(entry)) {
> +			/* XXX fe->pmd might be dead */
>  			migration_entry_wait(fe->mm, fe->pmd, fe->address);
>  		} else if (is_hwpoison_entry(entry)) {
>  			ret = VM_FAULT_HWPOISON;
> @@ -3357,6 +3386,93 @@ static int __handle_mm_fault(struct mm_s
>  	return handle_pte_fault(&fe);
>  }
>  
> +int handle_speculative_fault(struct mm_struct *mm, unsigned long address, unsigned int flags)
> +{
> +	struct fault_env fe = {
> +		.mm = mm,
> +		.address = address,
> +		.flags = flags | FAULT_FLAG_SPECULATIVE,
> +	};
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte;
> +	int dead, seq, idx, ret = VM_FAULT_RETRY;
> +	struct vm_area_struct *vma;
> +
> +	idx = srcu_read_lock(&vma_srcu);
> +	vma = find_vma_srcu(mm, address);
> +	if (!vma)
> +		goto unlock;
> +
> +	/*
> +	 * Validate the VMA found by the lockless lookup.
> +	 */
> +	dead = RB_EMPTY_NODE(&vma->vm_rb);
> +	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
> +	if ((seq & 1) || dead) /* XXX wait for !&1 instead? */
> +		goto unlock;
> +
> +	if (address < vma->vm_start || vma->vm_end <= address)
> +		goto unlock;
> +
> +	/*
> +	 * We need to re-validate the VMA after checking the bounds, otherwise
> +	 * we might have a false positive on the bounds.
> +	 */
> +	if (read_seqcount_retry(&vma->vm_sequence, seq))
> +		goto unlock;
> +
> +	/*
> +	 * Do a speculative lookup of the PTE entry.
> +	 */
> +	local_irq_disable();
> +	pgd = pgd_offset(mm, address);
> +	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> +		goto out_walk;
> +
> +	pud = pud_offset(pgd, address);
> +	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> +		goto out_walk;

This needs a pud_huge() check too. Or filter out VM_HUGETLB VMAs altogether.
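
To illustrate the two options, here is a toy userspace model, not kernel code. Every toy_* name and the TOY_VM_HUGETLB value are made up for illustration; the real checks would use pud_huge() and vma->vm_flags. It only shows where each test would sit: either the walk refuses to descend through a huge entry, or hugetlb VMAs are rejected before the walk starts.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; names and values are NOT the kernel's. */
#define TOY_VM_HUGETLB 0x1UL

struct toy_vma {
	unsigned long vm_start, vm_end, vm_flags;
};

enum toy_pud_kind { TOY_PUD_NONE, TOY_PUD_BAD, TOY_PUD_HUGE, TOY_PUD_TABLE };

/*
 * Option 1: the walk only descends through an entry that points to a
 * lower-level table. None and bad mirror the quoted checks; huge is the
 * extra case that must also bail out.
 */
static bool toy_pud_can_descend(enum toy_pud_kind kind)
{
	return kind == TOY_PUD_TABLE;
}

/*
 * Option 2: reject hugetlb VMAs before walking at all, alongside the
 * bounds check the quoted code already does.
 */
static bool toy_vma_allows_speculation(const struct toy_vma *vma,
				       unsigned long address)
{
	if (address < vma->vm_start || vma->vm_end <= address)
		return false;
	if (vma->vm_flags & TOY_VM_HUGETLB)
		return false;
	return true;
}
```

In both cases a false result would map onto the existing "give up and return VM_FAULT_RETRY" path.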

BTW, what keeps the mm_struct around? It seems we don't take a reference
during the page fault.
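
One conventional way to pin an object reached without a lock is an increment-if-not-zero reference. The sketch below is a userspace model with C11 atomics, assuming a plain user count; struct toy_mm and both helpers are hypothetical names, and I am not claiming this is how the patch should solve it.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object with a user count. */
struct toy_mm {
	atomic_int users;
};

/*
 * Take a reference only if the count has not already dropped to zero.
 * Returns false if the object is on its way out and must not be used.
 */
static bool toy_mm_get_not_zero(struct toy_mm *mm)
{
	int old = atomic_load(&mm->users);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&mm->users, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool toy_mm_put(struct toy_mm *mm)
{
	return atomic_fetch_sub(&mm->users, 1) == 1;
}
```

The speculative path would then bail out if the get fails and do the put on every exit. Whether the extra atomic on the fault path is acceptable is a separate question.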

-- 
 Kirill A. Shutemov
