* [RFC][PATCH 0/8] Speculative pagefault -v3
@ 2010-01-04 18:24 Peter Zijlstra
2010-01-04 18:24 ` [RFC][PATCH 1/8] mm: Remove pte reference from fault path Peter Zijlstra
` (9 more replies)
0 siblings, 10 replies; 121+ messages in thread
From: Peter Zijlstra @ 2010-01-04 18:24 UTC (permalink / raw)
To: Paul E. McKenney, Peter Zijlstra, KAMEZAWA Hiroyuki,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
minchan.kim@gmail.com, cl, hugh.dickins, Nick Piggin, Ingo Molnar,
Linus Torvalds
Patch series implementing speculative page faults for x86.
Still needs lots of things sorted, like:
- call_srcu()
- ptl, irq and tlb-flush
- a 2nd VM_FAULT_LOCK? return code to distinguish between
simple-retry and must-take-mmap_sem semantics?
Comments?
* [RFC][PATCH 1/8] mm: Remove pte reference from fault path
2010-01-04 18:24 [RFC][PATCH 0/8] Speculative pagefault -v3 Peter Zijlstra
@ 2010-01-04 18:24 ` Peter Zijlstra
2010-01-04 18:24 ` [RFC][PATCH 2/8] mm: Speculative pagefault infrastructure Peter Zijlstra
` (8 subsequent siblings)
9 siblings, 0 replies; 121+ messages in thread
From: Peter Zijlstra @ 2010-01-04 18:24 UTC (permalink / raw)
To: Paul E. McKenney, Peter Zijlstra, KAMEZAWA Hiroyuki,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
minchan.kim@gmail.com, cl, hugh.dickins, Nick Piggin, Ingo Molnar,
Linus Torvalds
Cc: Peter Zijlstra
[-- Attachment #1: mm-foo-1.patch --]
[-- Type: text/plain, Size: 6032 bytes --]
Since we want to do speculative faults, where we can race against
unmap() and similar, we cannot trust pte pointers to remain valid.
Hence remove the reliance on them from the fault path.
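To illustrate the new calling convention (just a sketch condensed from the
hunks below, not extra code): handle_mm_fault() now snapshots the pte value,
drops the kmap, and passes the value down; the leaf handlers re-map and
re-check under the pte lock themselves:

        /* top of handle_mm_fault(), once the pte has been allocated */
        entry = *pte;           /* snapshot the value - it may go stale */
        pte_unmap(pte);         /* no pte pointer survives past this point */

        return handle_pte_fault(mm, vma, address, entry, pmd, flags);

        /* and in handle_pte_fault(), for the present-pte case */
        pte = pte_offset_map_lock(mm, pmd, address, &ptl);
        if (unlikely(!pte_same(*pte, entry)))
                goto unlock;    /* someone raced us; nothing to do */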
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
mm/memory.c | 72 ++++++++++++++++--------------------------------------------
1 file changed, 20 insertions(+), 52 deletions(-)
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1919,31 +1919,6 @@ int apply_to_page_range(struct mm_struct
EXPORT_SYMBOL_GPL(apply_to_page_range);
/*
- * handle_pte_fault chooses page fault handler according to an entry
- * which was read non-atomically. Before making any commitment, on
- * those architectures or configurations (e.g. i386 with PAE) which
- * might give a mix of unmatched parts, do_swap_page and do_file_page
- * must check under lock before unmapping the pte and proceeding
- * (but do_wp_page is only called after already making such a check;
- * and do_anonymous_page and do_no_page can safely check later on).
- */
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
- pte_t *page_table, pte_t orig_pte)
-{
- int same = 1;
-#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
- if (sizeof(pte_t) > sizeof(unsigned long)) {
- spinlock_t *ptl = pte_lockptr(mm, pmd);
- spin_lock(ptl);
- same = pte_same(*page_table, orig_pte);
- spin_unlock(ptl);
- }
-#endif
- pte_unmap(page_table);
- return same;
-}
-
-/*
* Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when
* servicing faults for write access. In the normal case, do always want
* pte_mkwrite. But get_user_pages can cause write faults for mappings
@@ -2508,19 +2483,16 @@ int vmtruncate_range(struct inode *inode
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
+ unsigned long address, pmd_t *pmd,
unsigned int flags, pte_t orig_pte)
{
spinlock_t *ptl;
struct page *page;
swp_entry_t entry;
- pte_t pte;
+ pte_t *page_table, pte;
struct mem_cgroup *ptr = NULL;
int ret = 0;
- if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
- goto out;
-
entry = pte_to_swp_entry(orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
@@ -2650,18 +2622,16 @@ out_release:
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- unsigned int flags)
+ unsigned long address, pmd_t *pmd, unsigned int flags)
{
struct page *page;
spinlock_t *ptl;
- pte_t entry;
+ pte_t entry, *page_table;
if (!(flags & FAULT_FLAG_WRITE)) {
entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
vma->vm_page_prot));
- ptl = pte_lockptr(mm, pmd);
- spin_lock(ptl);
+ page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (!pte_none(*page_table))
goto unlock;
goto setpte;
@@ -2900,13 +2870,12 @@ unwritable_page:
}
static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
+ unsigned long address, pmd_t *pmd,
unsigned int flags, pte_t orig_pte)
{
pgoff_t pgoff = (((address & PAGE_MASK)
- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
- pte_unmap(page_table);
return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}
@@ -2920,16 +2889,13 @@ static int do_linear_fault(struct mm_str
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
+ unsigned long address, pmd_t *pmd,
unsigned int flags, pte_t orig_pte)
{
pgoff_t pgoff;
flags |= FAULT_FLAG_NONLINEAR;
- if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
- return 0;
-
if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) {
/*
* Page table corrupted: show pte and kill process.
@@ -2957,31 +2923,29 @@ static int do_nonlinear_fault(struct mm_
*/
static inline int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long address,
- pte_t *pte, pmd_t *pmd, unsigned int flags)
+ pte_t entry, pmd_t *pmd, unsigned int flags)
{
- pte_t entry;
spinlock_t *ptl;
+ pte_t *pte;
- entry = *pte;
if (!pte_present(entry)) {
if (pte_none(entry)) {
if (vma->vm_ops) {
if (likely(vma->vm_ops->fault))
return do_linear_fault(mm, vma, address,
- pte, pmd, flags, entry);
+ pmd, flags, entry);
}
return do_anonymous_page(mm, vma, address,
- pte, pmd, flags);
+ pmd, flags);
}
if (pte_file(entry))
return do_nonlinear_fault(mm, vma, address,
- pte, pmd, flags, entry);
+ pmd, flags, entry);
return do_swap_page(mm, vma, address,
- pte, pmd, flags, entry);
+ pmd, flags, entry);
}
- ptl = pte_lockptr(mm, pmd);
- spin_lock(ptl);
+ pte = pte_offset_map_lock(mm, pmd, address, &ptl);
if (unlikely(!pte_same(*pte, entry)))
goto unlock;
if (flags & FAULT_FLAG_WRITE) {
@@ -3017,7 +2981,7 @@ int handle_mm_fault(struct mm_struct *mm
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
- pte_t *pte;
+ pte_t *pte, entry;
__set_current_state(TASK_RUNNING);
@@ -3037,7 +3001,11 @@ int handle_mm_fault(struct mm_struct *mm
if (!pte)
return VM_FAULT_OOM;
- return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+ entry = *pte;
+
+ pte_unmap(pte);
+
+ return handle_pte_fault(mm, vma, address, entry, pmd, flags);
}
#ifndef __PAGETABLE_PUD_FOLDED
* [RFC][PATCH 2/8] mm: Speculative pagefault infrastructure
2010-01-04 18:24 [RFC][PATCH 0/8] Speculative pagefault -v3 Peter Zijlstra
2010-01-04 18:24 ` [RFC][PATCH 1/8] mm: Remove pte reference from fault path Peter Zijlstra
@ 2010-01-04 18:24 ` Peter Zijlstra
2010-01-04 18:24 ` [RFC][PATCH 3/8] mm: Add vma sequence count Peter Zijlstra
` (7 subsequent siblings)
9 siblings, 0 replies; 121+ messages in thread
From: Peter Zijlstra @ 2010-01-04 18:24 UTC (permalink / raw)
To: Paul E. McKenney, Peter Zijlstra, KAMEZAWA Hiroyuki,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
minchan.kim@gmail.com, cl, hugh.dickins, Nick Piggin, Ingo Molnar,
Linus Torvalds
Cc: Peter Zijlstra
[-- Attachment #1: mm-foo-6.patch --]
[-- Type: text/plain, Size: 9926 bytes --]
Replace the pte_offset_map_lock() usage in the pagefault path with
pte_map_lock(), which, when called with FAULT_FLAG_SPECULATIVE set in
flags, can fail; in that case we return VM_FAULT_RETRY, meaning the
fault must be retried (or redone with mmap_sem held).
This patch adds FAULT_FLAG_SPECULATIVE, VM_FAULT_RETRY and the error
paths.
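The per-callsite conversion is mechanical; roughly (a sketch using the names
introduced by this patch, with the exact unwinding varying per site):

        /* was:  page_table = pte_offset_map_lock(mm, pmd, address, &ptl); */
        if (!pte_map_lock(mm, vma, address, pmd, flags, &page_table, &ptl)) {
                /* drop whatever was taken so far: page locks, refs, charges */
                ret = VM_FAULT_RETRY;
                goto err;
        }

For now pte_map_lock() always succeeds; a later patch makes it actually fail
in the FAULT_FLAG_SPECULATIVE case.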
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/mm.h | 2
mm/memory.c | 119 ++++++++++++++++++++++++++++++++++++++---------------
2 files changed, 88 insertions(+), 33 deletions(-)
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -136,6 +136,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_WRITE 0x01 /* Fault was a write access */
#define FAULT_FLAG_NONLINEAR 0x02 /* Fault was via a nonlinear mapping */
#define FAULT_FLAG_MKWRITE 0x04 /* Fault was mkwrite of existing pte */
+#define FAULT_FLAG_SPECULATIVE 0x08
/*
* This interface is used by x86 PAT code to identify a pfn mapping that is
@@ -711,6 +712,7 @@ static inline int page_mapped(struct pag
#define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */
#define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */
+#define VM_FAULT_RETRY 0x0400
#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1957,6 +1957,14 @@ static inline void cow_user_page(struct
copy_user_highpage(dst, src, va, vma);
}
+static int pte_map_lock(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd, unsigned int flags,
+ pte_t **ptep, spinlock_t **ptl)
+{
+ *ptep = pte_offset_map_lock(mm, pmd, address, ptl);
+ return 1;
+}
+
/*
* This routine handles present pages, when users try to write
* to a shared page. It is done by copying the page to a new address
@@ -1977,7 +1985,7 @@ static inline void cow_user_page(struct
*/
static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *page_table, pmd_t *pmd,
- spinlock_t *ptl, pte_t orig_pte)
+ spinlock_t *ptl, unsigned int flags, pte_t orig_pte)
{
struct page *old_page, *new_page;
pte_t entry;
@@ -2009,8 +2017,14 @@ static int do_wp_page(struct mm_struct *
page_cache_get(old_page);
pte_unmap_unlock(page_table, ptl);
lock_page(old_page);
- page_table = pte_offset_map_lock(mm, pmd, address,
- &ptl);
+
+ if (!pte_map_lock(mm, vma, address, pmd, flags,
+ &page_table, &ptl)) {
+ unlock_page(old_page);
+ ret = VM_FAULT_RETRY;
+ goto err;
+ }
+
if (!pte_same(*page_table, orig_pte)) {
unlock_page(old_page);
page_cache_release(old_page);
@@ -2052,14 +2066,14 @@ static int do_wp_page(struct mm_struct *
if (unlikely(tmp &
(VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
ret = tmp;
- goto unwritable_page;
+ goto err;
}
if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
lock_page(old_page);
if (!old_page->mapping) {
ret = 0; /* retry the fault */
unlock_page(old_page);
- goto unwritable_page;
+ goto err;
}
} else
VM_BUG_ON(!PageLocked(old_page));
@@ -2070,8 +2084,13 @@ static int do_wp_page(struct mm_struct *
* they did, we just return, as we can count on the
* MMU to tell us if they didn't also make it writable.
*/
- page_table = pte_offset_map_lock(mm, pmd, address,
- &ptl);
+ if (!pte_map_lock(mm, vma, address, pmd, flags,
+ &page_table, &ptl)) {
+ unlock_page(old_page);
+ ret = VM_FAULT_RETRY;
+ goto err;
+ }
+
if (!pte_same(*page_table, orig_pte)) {
unlock_page(old_page);
page_cache_release(old_page);
@@ -2103,17 +2122,23 @@ reuse:
gotten:
pte_unmap_unlock(page_table, ptl);
- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
+ if (unlikely(anon_vma_prepare(vma))) {
+ ret = VM_FAULT_OOM;
+ goto err;
+ }
if (is_zero_pfn(pte_pfn(orig_pte))) {
new_page = alloc_zeroed_user_highpage_movable(vma, address);
- if (!new_page)
- goto oom;
+ if (!new_page) {
+ ret = VM_FAULT_OOM;
+ goto err;
+ }
} else {
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
- if (!new_page)
- goto oom;
+ if (!new_page) {
+ ret = VM_FAULT_OOM;
+ goto err;
+ }
cow_user_page(new_page, old_page, address, vma);
}
__SetPageUptodate(new_page);
@@ -2128,13 +2153,20 @@ gotten:
unlock_page(old_page);
}
- if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
- goto oom_free_new;
+ if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL)) {
+ ret = VM_FAULT_OOM;
+ goto err_free_new;
+ }
/*
* Re-check the pte - we dropped the lock
*/
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte_map_lock(mm, vma, address, pmd, flags, &page_table, &ptl)) {
+ mem_cgroup_uncharge_page(new_page);
+ ret = VM_FAULT_RETRY;
+ goto err_free_new;
+ }
+
if (likely(pte_same(*page_table, orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
@@ -2233,9 +2265,9 @@ unlock:
file_update_time(vma->vm_file);
}
return ret;
-oom_free_new:
+err_free_new:
page_cache_release(new_page);
-oom:
+err:
if (old_page) {
if (page_mkwrite) {
unlock_page(old_page);
@@ -2243,10 +2275,6 @@ oom:
}
page_cache_release(old_page);
}
- return VM_FAULT_OOM;
-
-unwritable_page:
- page_cache_release(old_page);
return ret;
}
@@ -2496,6 +2524,10 @@ static int do_swap_page(struct mm_struct
entry = pte_to_swp_entry(orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
+ if (flags & FAULT_FLAG_SPECULATIVE) {
+ ret = VM_FAULT_RETRY;
+ goto out;
+ }
migration_entry_wait(mm, pmd, address);
} else if (is_hwpoison_entry(entry)) {
ret = VM_FAULT_HWPOISON;
@@ -2516,7 +2548,11 @@ static int do_swap_page(struct mm_struct
* Back out if somebody else faulted in this pte
* while we released the pte lock.
*/
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte_map_lock(mm, vma, address, pmd, flags,
+ &page_table, &ptl)) {
+ ret = VM_FAULT_RETRY;
+ goto out;
+ }
if (likely(pte_same(*page_table, orig_pte)))
ret = VM_FAULT_OOM;
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
@@ -2553,7 +2589,11 @@ static int do_swap_page(struct mm_struct
/*
* Back out if somebody else already faulted in this pte.
*/
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte_map_lock(mm, vma, address, pmd, flags, &page_table, &ptl)) {
+ ret = VM_FAULT_RETRY;
+ goto out_nolock;
+ }
+
if (unlikely(!pte_same(*page_table, orig_pte)))
goto out_nomap;
@@ -2594,7 +2634,7 @@ static int do_swap_page(struct mm_struct
unlock_page(page);
if (flags & FAULT_FLAG_WRITE) {
- ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
+ ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, flags, pte);
if (ret & VM_FAULT_ERROR)
ret &= VM_FAULT_ERROR;
goto out;
@@ -2607,8 +2647,9 @@ unlock:
out:
return ret;
out_nomap:
- mem_cgroup_cancel_charge_swapin(ptr);
pte_unmap_unlock(page_table, ptl);
+out_nolock:
+ mem_cgroup_cancel_charge_swapin(ptr);
out_page:
unlock_page(page);
out_release:
@@ -2631,7 +2672,9 @@ static int do_anonymous_page(struct mm_s
if (!(flags & FAULT_FLAG_WRITE)) {
entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
vma->vm_page_prot));
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte_map_lock(mm, vma, address, pmd, flags,
+ &page_table, &ptl))
+ return VM_FAULT_RETRY;
if (!pte_none(*page_table))
goto unlock;
goto setpte;
@@ -2654,7 +2697,12 @@ static int do_anonymous_page(struct mm_s
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte_map_lock(mm, vma, address, pmd, flags, &page_table, &ptl)) {
+ mem_cgroup_uncharge_page(page);
+ page_cache_release(page);
+ return VM_FAULT_RETRY;
+ }
+
if (!pte_none(*page_table))
goto release;
@@ -2793,7 +2841,10 @@ static int __do_fault(struct mm_struct *
}
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte_map_lock(mm, vma, address, pmd, flags, &page_table, &ptl)) {
+ ret = VM_FAULT_RETRY;
+ goto out_uncharge;
+ }
/*
* This silly early PAGE_DIRTY setting removes a race
@@ -2826,7 +2877,10 @@ static int __do_fault(struct mm_struct *
/* no need to invalidate: a not-present page won't be cached */
update_mmu_cache(vma, address, entry);
+ pte_unmap_unlock(page_table, ptl);
} else {
+ pte_unmap_unlock(page_table, ptl);
+out_uncharge:
if (charged)
mem_cgroup_uncharge_page(page);
if (anon)
@@ -2835,8 +2889,6 @@ static int __do_fault(struct mm_struct *
anon = 1; /* no anon but release faulted_page */
}
- pte_unmap_unlock(page_table, ptl);
-
out:
if (dirty_page) {
struct address_space *mapping = page->mapping;
@@ -2945,13 +2997,14 @@ static inline int handle_pte_fault(struc
pmd, flags, entry);
}
- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte_map_lock(mm, vma, address, pmd, flags, &pte, &ptl))
+ return VM_FAULT_RETRY;
if (unlikely(!pte_same(*pte, entry)))
goto unlock;
if (flags & FAULT_FLAG_WRITE) {
if (!pte_write(entry))
return do_wp_page(mm, vma, address,
- pte, pmd, ptl, entry);
+ pte, pmd, ptl, flags, entry);
entry = pte_mkdirty(entry);
}
entry = pte_mkyoung(entry);
* [RFC][PATCH 3/8] mm: Add vma sequence count
2010-01-04 18:24 [RFC][PATCH 0/8] Speculative pagefault -v3 Peter Zijlstra
2010-01-04 18:24 ` [RFC][PATCH 1/8] mm: Remove pte reference from fault path Peter Zijlstra
2010-01-04 18:24 ` [RFC][PATCH 2/8] mm: Speculative pagefault infrastructure Peter Zijlstra
@ 2010-01-04 18:24 ` Peter Zijlstra
2010-01-04 18:24 ` [RFC][PATCH 4/8] mm: RCU free vmas Peter Zijlstra
` (6 subsequent siblings)
9 siblings, 0 replies; 121+ messages in thread
From: Peter Zijlstra @ 2010-01-04 18:24 UTC (permalink / raw)
To: Paul E. McKenney, Peter Zijlstra, KAMEZAWA Hiroyuki,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
minchan.kim@gmail.com, cl, hugh.dickins, Nick Piggin, Ingo Molnar,
Linus Torvalds
Cc: Peter Zijlstra
[-- Attachment #1: mm-foo-5.patch --]
[-- Type: text/plain, Size: 1869 bytes --]
In order to detect VMA range changes, add a sequence count.
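This patch only adds the count and brackets vma_adjust() with
write_seqcount_begin()/end(); the readers come later in the series. As a
sketch of the intended read side (see the speculative fault patch):

        unsigned int seq;

        seq = vma->vm_sequence.sequence;        /* sample the generation */
        smp_rmb();
        if (seq & 1)
                goto retry;     /* writer (vma_adjust) in progress */
        /* ... use vma->vm_start / vma->vm_end without mmap_sem ... */
        if (read_seqcount_retry(&vma->vm_sequence, seq))
                goto retry;     /* the range changed while we used it */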
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/mm_types.h | 2 ++
mm/mmap.c | 10 ++++++++++
2 files changed, 12 insertions(+)
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -12,6 +12,7 @@
#include <linux/completion.h>
#include <linux/cpumask.h>
#include <linux/page-debug-flags.h>
+#include <linux/seqlock.h>
#include <asm/page.h>
#include <asm/mmu.h>
@@ -186,6 +187,7 @@ struct vm_area_struct {
#ifdef CONFIG_NUMA
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
#endif
+ seqcount_t vm_sequence;
};
struct core_thread {
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -512,6 +512,10 @@ void vma_adjust(struct vm_area_struct *v
long adjust_next = 0;
int remove_next = 0;
+ write_seqcount_begin(&vma->vm_sequence);
+ if (next)
+ write_seqcount_begin(&next->vm_sequence);
+
if (next && !insert) {
if (end >= next->vm_end) {
/*
@@ -647,11 +651,17 @@ again: remove_next = 1 + (end > next->
* up the code too much to do both in one go.
*/
if (remove_next == 2) {
+ write_seqcount_end(&next->vm_sequence);
next = vma->vm_next;
+ write_seqcount_begin(&next->vm_sequence);
goto again;
}
}
+ if (next)
+ write_seqcount_end(&next->vm_sequence);
+ write_seqcount_end(&vma->vm_sequence);
+
validate_mm(mm);
}
* [RFC][PATCH 4/8] mm: RCU free vmas
2010-01-04 18:24 [RFC][PATCH 0/8] Speculative pagefault -v3 Peter Zijlstra
` (2 preceding siblings ...)
2010-01-04 18:24 ` [RFC][PATCH 3/8] mm: Add vma sequence count Peter Zijlstra
@ 2010-01-04 18:24 ` Peter Zijlstra
2010-01-05 2:43 ` Paul E. McKenney
2010-01-04 18:24 ` [RFC][PATCH 5/8] mm: Speculative pte_map_lock() Peter Zijlstra
` (5 subsequent siblings)
9 siblings, 1 reply; 121+ messages in thread
From: Peter Zijlstra @ 2010-01-04 18:24 UTC (permalink / raw)
To: Paul E. McKenney, Peter Zijlstra, KAMEZAWA Hiroyuki,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
minchan.kim@gmail.com, cl, hugh.dickins, Nick Piggin, Ingo Molnar,
Linus Torvalds
Cc: Peter Zijlstra
[-- Attachment #1: mm-foo-3.patch --]
[-- Type: text/plain, Size: 6711 bytes --]
TODO:
- should be SRCU; blocked on the lack of call_srcu()
In order to allow speculative vma lookups, RCU free the struct
vm_area_struct.
We use two means of detecting whether a vma is still valid:
- firstly, we set RB_CLEAR_NODE once we remove a vma from the tree;
- secondly, we check the vma sequence number.
These two checks combined guarantee that 1) the vma is still present,
and 2) it still covers the same range as when we looked it up.
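Put together, a speculative lookup then looks roughly like this (a sketch;
the real code is in the handle_speculative_fault() patch):

        rcu_read_lock();                /* vma freeing is deferred via call_rcu() */
        vma = find_vma(mm, address);
        if (!vma || RB_EMPTY_NODE(&vma->vm_rb))
                goto fallback;          /* no vma, or already unlinked */
        seq = vma->vm_sequence.sequence;
        smp_rmb();                      /* pairs with the wmb on the unlink side */
        if (seq & 1)
                goto fallback;          /* vma_adjust() in progress */

        /* ... speculative work against this vma ... */

        if (vma_is_dead(vma, seq))
                goto fallback;          /* unlinked or resized meanwhile */
        rcu_read_unlock();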
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/mm.h | 12 ++++++++++++
include/linux/mm_types.h | 2 ++
init/Kconfig | 34 +++++++++++++++++-----------------
kernel/sched.c | 9 ++++++++-
mm/mmap.c | 33 +++++++++++++++++++++++++++++++--
5 files changed, 70 insertions(+), 20 deletions(-)
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -765,6 +765,18 @@ unsigned long unmap_vmas(struct mmu_gath
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *);
+static inline int vma_is_dead(struct vm_area_struct *vma, unsigned int sequence)
+{
+ int ret = RB_EMPTY_NODE(&vma->vm_rb);
+ unsigned seq = vma->vm_sequence.sequence;
+ /*
+ * Matches both the wmb in write_seqlock_begin/end() and
+ * the wmb in detach_vmas_to_be_unmapped()/__unlink_vma().
+ */
+ smp_rmb();
+ return ret || seq != sequence;
+}
+
/**
* mm_walk - callbacks for walk_page_range
* @pgd_entry: if set, called for each non-empty PGD (top-level) entry
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -13,6 +13,7 @@
#include <linux/cpumask.h>
#include <linux/page-debug-flags.h>
#include <linux/seqlock.h>
+#include <linux/rcupdate.h>
#include <asm/page.h>
#include <asm/mmu.h>
@@ -188,6 +189,7 @@ struct vm_area_struct {
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
#endif
seqcount_t vm_sequence;
+ struct rcu_head vm_rcu_head;
};
struct core_thread {
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -222,6 +222,19 @@ void unlink_file_vma(struct vm_area_stru
}
}
+static void free_vma_rcu(struct rcu_head *head)
+{
+ struct vm_area_struct *vma =
+ container_of(head, struct vm_area_struct, vm_rcu_head);
+
+ kmem_cache_free(vm_area_cachep, vma);
+}
+
+static void free_vma(struct vm_area_struct *vma)
+{
+ call_rcu(&vma->vm_rcu_head, free_vma_rcu);
+}
+
/*
* Close a vm structure and free it, returning the next.
*/
@@ -238,7 +251,7 @@ static struct vm_area_struct *remove_vma
removed_exe_file_vma(vma->vm_mm);
}
mpol_put(vma_policy(vma));
- kmem_cache_free(vm_area_cachep, vma);
+ free_vma(vma);
return next;
}
@@ -488,6 +501,14 @@ __vma_unlink(struct mm_struct *mm, struc
{
prev->vm_next = vma->vm_next;
rb_erase(&vma->vm_rb, &mm->mm_rb);
+ /*
+ * Ensure the removal is completely comitted to memory
+ * before clearing the node.
+ *
+ * Matched by vma_is_dead()/handle_speculative_fault().
+ */
+ smp_wmb();
+ RB_CLEAR_NODE(&vma->vm_rb);
if (mm->mmap_cache == vma)
mm->mmap_cache = prev;
}
@@ -644,7 +665,7 @@ again: remove_next = 1 + (end > next->
}
mm->map_count--;
mpol_put(vma_policy(next));
- kmem_cache_free(vm_area_cachep, next);
+ free_vma(next);
/*
* In mprotect's case 6 (see comments on vma_merge),
* we must remove another next too. It would clutter
@@ -1858,6 +1879,14 @@ detach_vmas_to_be_unmapped(struct mm_str
insertion_point = (prev ? &prev->vm_next : &mm->mmap);
do {
rb_erase(&vma->vm_rb, &mm->mm_rb);
+ /*
+ * Ensure the removal is completely comitted to memory
+ * before clearing the node.
+ *
+ * Matched by vma_is_dead()/handle_speculative_fault().
+ */
+ smp_wmb();
+ RB_CLEAR_NODE(&vma->vm_rb);
mm->map_count--;
tail_vma = vma;
vma = vma->vm_next;
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -314,19 +314,19 @@ menu "RCU Subsystem"
choice
prompt "RCU Implementation"
- default TREE_RCU
+ default TREE_PREEMPT_RCU
-config TREE_RCU
- bool "Tree-based hierarchical RCU"
- help
- This option selects the RCU implementation that is
- designed for very large SMP system with hundreds or
- thousands of CPUs. It also scales down nicely to
- smaller systems.
+#config TREE_RCU
+# bool "Tree-based hierarchical RCU"
+# help
+# This option selects the RCU implementation that is
+# designed for very large SMP system with hundreds or
+# thousands of CPUs. It also scales down nicely to
+# smaller systems.
config TREE_PREEMPT_RCU
bool "Preemptable tree-based hierarchical RCU"
- depends on PREEMPT
+# depends on PREEMPT
help
This option selects the RCU implementation that is
designed for very large SMP systems with hundreds or
@@ -334,14 +334,14 @@ config TREE_PREEMPT_RCU
is also required. It also scales down nicely to
smaller systems.
-config TINY_RCU
- bool "UP-only small-memory-footprint RCU"
- depends on !SMP
- help
- This option selects the RCU implementation that is
- designed for UP systems from which real-time response
- is not required. This option greatly reduces the
- memory footprint of RCU.
+#config TINY_RCU
+# bool "UP-only small-memory-footprint RCU"
+# depends on !SMP
+# help
+# This option selects the RCU implementation that is
+# designed for UP systems from which real-time response
+# is not required. This option greatly reduces the
+# memory footprint of RCU.
endchoice
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -9689,7 +9689,14 @@ void __init sched_init(void)
#ifdef CONFIG_DEBUG_SPINLOCK_SLEEP
static inline int preempt_count_equals(int preempt_offset)
{
- int nested = (preempt_count() & ~PREEMPT_ACTIVE) + rcu_preempt_depth();
+ int nested = (preempt_count() & ~PREEMPT_ACTIVE)
+ /*
+ * remove this for we need preemptible RCU
+ * exactly because it needs to sleep..
+ *
+ + rcu_preempt_depth()
+ */
+ ;
return (nested == PREEMPT_INATOMIC_BASE + preempt_offset);
}
* [RFC][PATCH 5/8] mm: Speculative pte_map_lock()
2010-01-04 18:24 [RFC][PATCH 0/8] Speculative pagefault -v3 Peter Zijlstra
` (3 preceding siblings ...)
2010-01-04 18:24 ` [RFC][PATCH 4/8] mm: RCU free vmas Peter Zijlstra
@ 2010-01-04 18:24 ` Peter Zijlstra
2010-01-04 18:24 ` [RFC][PATCH 6/8] mm: handle_speculative_fault() Peter Zijlstra
` (4 subsequent siblings)
9 siblings, 0 replies; 121+ messages in thread
From: Peter Zijlstra @ 2010-01-04 18:24 UTC (permalink / raw)
To: Paul E. McKenney, Peter Zijlstra, KAMEZAWA Hiroyuki,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
minchan.kim@gmail.com, cl, hugh.dickins, Nick Piggin, Ingo Molnar,
Linus Torvalds
Cc: Peter Zijlstra
[-- Attachment #1: mm-foo-7.patch --]
[-- Type: text/plain, Size: 11649 bytes --]
Implement pte_map_lock(.flags & FAULT_FLAG_SPECULATIVE), in which case
we're not holding mmap_sem, so we can race against unmap() and similar
routines.
Since we cannot rely on pagetable stability in the face of unmap, we
use the technique fast_gup() also uses for a lockless pagetable
lookup. For this we introduce the {un,}pin_page_tables() functions.
The only problem is that we do TLB flushes while holding the PTL,
which in turn means that we cannot acquire the PTL while having IRQs
disabled.
Fudge around this by open-coding a spinner which drops the page-table
pin, which on x86 amounts to an IRQ disable (that holds off the TLB
flush, which is done before freeing the pagetables).
Once we hold the PTL we can validate the VMA; if it is still valid we
know we're good to go, and holding the PTL will hold off unmap.
We need to propagate the VMA sequence count through the fault code.
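The resulting locking dance, in sketch form (error unwinding and the full
pgd/pud/pmd walk elided; see pte_map_lock() in the patch below):

again:
        pin_page_tables();      /* x86: local_irq_disable(), holds off the TLB flush IPI */

        /* walk pgd/pud/pmd; bail out if any level is _none or _bad */

        ptl = pte_lockptr(mm, pmd);
        pte = pte_offset_map(pmd, address);
        if (!spin_trylock(ptl)) {
                /* cannot spin with IRQs off: the holder may be waiting on our TLB IPI */
                pte_unmap(pte);
                unpin_page_tables();
                goto again;
        }

        if (vma_is_dead(vma, seq))
                goto unlock_out;        /* vma unmapped or resized: retry the fault */

        unpin_page_tables();            /* safe: the held PTL now pins the page table */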
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/mm.h | 2
mm/memory.c | 111 ++++++++++++++++++++++++++++++++++++++---------------
mm/util.c | 12 ++++-
3 files changed, 93 insertions(+), 32 deletions(-)
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -848,6 +848,8 @@ int get_user_pages(struct task_struct *t
struct page **pages, struct vm_area_struct **vmas);
int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
+void pin_page_tables(void);
+void unpin_page_tables(void);
struct page *get_dump_page(unsigned long addr);
extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1959,10 +1959,56 @@ static inline void cow_user_page(struct
static int pte_map_lock(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd, unsigned int flags,
- pte_t **ptep, spinlock_t **ptl)
+ unsigned int seq, pte_t **ptep, spinlock_t **ptlp)
{
- *ptep = pte_offset_map_lock(mm, pmd, address, ptl);
+ pgd_t *pgd;
+ pud_t *pud;
+
+ if (!(flags & FAULT_FLAG_SPECULATIVE)) {
+ *ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
+ return 1;
+ }
+
+again:
+ pin_page_tables();
+
+ pgd = pgd_offset(mm, address);
+ if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+ goto out;
+
+ pud = pud_offset(pgd, address);
+ if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+ goto out;
+
+ pmd = pmd_offset(pud, address);
+ if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+ goto out;
+
+ if (pmd_huge(*pmd))
+ goto out;
+
+ *ptlp = pte_lockptr(mm, pmd);
+ *ptep = pte_offset_map(pmd, address);
+ if (!spin_trylock(*ptlp)) {
+ pte_unmap(*ptep);
+ unpin_page_tables();
+ goto again;
+ }
+
+ if (!*ptep)
+ goto out;
+
+ if (vma_is_dead(vma, seq))
+ goto unlock;
+
+ unpin_page_tables();
return 1;
+
+unlock:
+ pte_unmap_unlock(*ptep, *ptlp);
+out:
+ unpin_page_tables();
+ return 0;
}
/*
@@ -1985,7 +2031,8 @@ static int pte_map_lock(struct mm_struct
*/
static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *page_table, pmd_t *pmd,
- spinlock_t *ptl, unsigned int flags, pte_t orig_pte)
+ spinlock_t *ptl, unsigned int flags, pte_t orig_pte,
+ unsigned int seq)
{
struct page *old_page, *new_page;
pte_t entry;
@@ -2018,7 +2065,7 @@ static int do_wp_page(struct mm_struct *
pte_unmap_unlock(page_table, ptl);
lock_page(old_page);
- if (!pte_map_lock(mm, vma, address, pmd, flags,
+ if (!pte_map_lock(mm, vma, address, pmd, flags, seq,
&page_table, &ptl)) {
unlock_page(old_page);
ret = VM_FAULT_RETRY;
@@ -2084,7 +2131,7 @@ static int do_wp_page(struct mm_struct *
* they did, we just return, as we can count on the
* MMU to tell us if they didn't also make it writable.
*/
- if (!pte_map_lock(mm, vma, address, pmd, flags,
+ if (!pte_map_lock(mm, vma, address, pmd, flags, seq,
&page_table, &ptl)) {
unlock_page(old_page);
ret = VM_FAULT_RETRY;
@@ -2161,7 +2208,7 @@ gotten:
/*
* Re-check the pte - we dropped the lock
*/
- if (!pte_map_lock(mm, vma, address, pmd, flags, &page_table, &ptl)) {
+ if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &page_table, &ptl)) {
mem_cgroup_uncharge_page(new_page);
ret = VM_FAULT_RETRY;
goto err_free_new;
@@ -2511,8 +2558,8 @@ int vmtruncate_range(struct inode *inode
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- unsigned int flags, pte_t orig_pte)
+ unsigned long address, pmd_t *pmd, unsigned int flags,
+ pte_t orig_pte, unsigned int seq)
{
spinlock_t *ptl;
struct page *page;
@@ -2548,7 +2595,7 @@ static int do_swap_page(struct mm_struct
* Back out if somebody else faulted in this pte
* while we released the pte lock.
*/
- if (!pte_map_lock(mm, vma, address, pmd, flags,
+ if (!pte_map_lock(mm, vma, address, pmd, flags, seq,
&page_table, &ptl)) {
ret = VM_FAULT_RETRY;
goto out;
@@ -2589,7 +2636,7 @@ static int do_swap_page(struct mm_struct
/*
* Back out if somebody else already faulted in this pte.
*/
- if (!pte_map_lock(mm, vma, address, pmd, flags, &page_table, &ptl)) {
+ if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &page_table, &ptl)) {
ret = VM_FAULT_RETRY;
goto out_nolock;
}
@@ -2634,7 +2681,8 @@ static int do_swap_page(struct mm_struct
unlock_page(page);
if (flags & FAULT_FLAG_WRITE) {
- ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, flags, pte);
+ ret |= do_wp_page(mm, vma, address, page_table, pmd,
+ ptl, flags, pte, seq);
if (ret & VM_FAULT_ERROR)
ret &= VM_FAULT_ERROR;
goto out;
@@ -2663,7 +2711,8 @@ out_release:
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd, unsigned int flags)
+ unsigned long address, pmd_t *pmd, unsigned int flags,
+ unsigned int seq)
{
struct page *page;
spinlock_t *ptl;
@@ -2672,7 +2721,7 @@ static int do_anonymous_page(struct mm_s
if (!(flags & FAULT_FLAG_WRITE)) {
entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
vma->vm_page_prot));
- if (!pte_map_lock(mm, vma, address, pmd, flags,
+ if (!pte_map_lock(mm, vma, address, pmd, flags, seq,
&page_table, &ptl))
return VM_FAULT_RETRY;
if (!pte_none(*page_table))
@@ -2697,7 +2746,7 @@ static int do_anonymous_page(struct mm_s
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));
- if (!pte_map_lock(mm, vma, address, pmd, flags, &page_table, &ptl)) {
+ if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &page_table, &ptl)) {
mem_cgroup_uncharge_page(page);
page_cache_release(page);
return VM_FAULT_RETRY;
@@ -2740,8 +2789,8 @@ oom:
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+ unsigned long address, pmd_t *pmd, pgoff_t pgoff,
+ unsigned int flags, pte_t orig_pte, unsigned int seq)
{
pte_t *page_table;
spinlock_t *ptl;
@@ -2841,7 +2890,7 @@ static int __do_fault(struct mm_struct *
}
- if (!pte_map_lock(mm, vma, address, pmd, flags, &page_table, &ptl)) {
+ if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &page_table, &ptl)) {
ret = VM_FAULT_RETRY;
goto out_uncharge;
}
@@ -2923,12 +2972,12 @@ unwritable_page:
static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
- unsigned int flags, pte_t orig_pte)
+ unsigned int flags, pte_t orig_pte, unsigned int seq)
{
pgoff_t pgoff = (((address & PAGE_MASK)
- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
- return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+ return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte, seq);
}
/*
@@ -2942,7 +2991,7 @@ static int do_linear_fault(struct mm_str
*/
static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
- unsigned int flags, pte_t orig_pte)
+ unsigned int flags, pte_t orig_pte, unsigned int seq)
{
pgoff_t pgoff;
@@ -2957,7 +3006,7 @@ static int do_nonlinear_fault(struct mm_
}
pgoff = pte_to_pgoff(orig_pte);
- return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+ return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte, seq);
}
/*
@@ -2975,7 +3024,8 @@ static int do_nonlinear_fault(struct mm_
*/
static inline int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long address,
- pte_t entry, pmd_t *pmd, unsigned int flags)
+ pte_t entry, pmd_t *pmd, unsigned int flags,
+ unsigned int seq)
{
spinlock_t *ptl;
pte_t *pte;
@@ -2985,26 +3035,27 @@ static inline int handle_pte_fault(struc
if (vma->vm_ops) {
if (likely(vma->vm_ops->fault))
return do_linear_fault(mm, vma, address,
- pmd, flags, entry);
+ pmd, flags, entry, seq);
}
return do_anonymous_page(mm, vma, address,
- pmd, flags);
+ pmd, flags, seq);
}
if (pte_file(entry))
return do_nonlinear_fault(mm, vma, address,
- pmd, flags, entry);
+ pmd, flags, entry, seq);
return do_swap_page(mm, vma, address,
- pmd, flags, entry);
+ pmd, flags, entry, seq);
}
- if (!pte_map_lock(mm, vma, address, pmd, flags, &pte, &ptl))
+ if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &pte, &ptl))
return VM_FAULT_RETRY;
if (unlikely(!pte_same(*pte, entry)))
goto unlock;
if (flags & FAULT_FLAG_WRITE) {
- if (!pte_write(entry))
+ if (!pte_write(entry)) {
return do_wp_page(mm, vma, address,
- pte, pmd, ptl, flags, entry);
+ pte, pmd, ptl, flags, entry, seq);
+ }
entry = pte_mkdirty(entry);
}
entry = pte_mkyoung(entry);
@@ -3058,7 +3109,7 @@ int handle_mm_fault(struct mm_struct *mm
pte_unmap(pte);
- return handle_pte_fault(mm, vma, address, entry, pmd, flags);
+ return handle_pte_fault(mm, vma, address, entry, pmd, flags, 0);
}
#ifndef __PAGETABLE_PUD_FOLDED
Index: linux-2.6/mm/util.c
===================================================================
--- linux-2.6.orig/mm/util.c
+++ linux-2.6/mm/util.c
@@ -253,8 +253,8 @@ void arch_pick_mmap_layout(struct mm_str
* callers need to carefully consider what to use. On many architectures,
* get_user_pages_fast simply falls back to get_user_pages.
*/
-int __attribute__((weak)) get_user_pages_fast(unsigned long start,
- int nr_pages, int write, struct page **pages)
+int __weak get_user_pages_fast(unsigned long start,
+ int nr_pages, int write, struct page **pages)
{
struct mm_struct *mm = current->mm;
int ret;
@@ -268,6 +268,14 @@ int __attribute__((weak)) get_user_pages
}
EXPORT_SYMBOL_GPL(get_user_pages_fast);
+void __weak pin_page_tables(void)
+{
+}
+
+void __weak unpin_page_tables(void)
+{
+}
+
/* Tracepoints definitions. */
EXPORT_TRACEPOINT_SYMBOL(kmalloc);
EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);
* [RFC][PATCH 6/8] mm: handle_speculative_fault()
2010-01-04 18:24 [RFC][PATCH 0/8] Speculative pagefault -v3 Peter Zijlstra
` (4 preceding siblings ...)
2010-01-04 18:24 ` [RFC][PATCH 5/8] mm: Speculative pte_map_lock() Peter Zijlstra
@ 2010-01-04 18:24 ` Peter Zijlstra
2010-01-05 0:25 ` KAMEZAWA Hiroyuki
2010-01-05 13:45 ` Arjan van de Ven
2010-01-04 18:24 ` [RFC][PATCH 7/8] mm,x86: speculative pagefault support Peter Zijlstra
` (3 subsequent siblings)
9 siblings, 2 replies; 121+ messages in thread
From: Peter Zijlstra @ 2010-01-04 18:24 UTC (permalink / raw)
To: Paul E. McKenney, Peter Zijlstra, KAMEZAWA Hiroyuki,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
minchan.kim@gmail.com, cl, hugh.dickins, Nick Piggin, Ingo Molnar,
Linus Torvalds
Cc: Peter Zijlstra
[-- Attachment #1: mm-foo-8.patch --]
[-- Type: text/plain, Size: 2857 bytes --]
Generic speculative fault handler, tries to service a pagefault
without holding mmap_sem.
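Intended usage from an architecture fault handler, roughly (the x86 patch
wires it up along these lines):

        /* try the lockless path first; fall back to mmap_sem only on retry */
        fault = handle_speculative_fault(mm, address,
                        write ? FAULT_FLAG_WRITE : 0);
        if (!(fault & VM_FAULT_RETRY))
                return fault;           /* serviced without taking mmap_sem */

        down_read(&mm->mmap_sem);
        vma = find_vma(mm, address);
        /* ... the usual vma checks ... */
        fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
        up_read(&mm->mmap_sem);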
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/mm.h | 2 +
mm/memory.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 60 insertions(+), 1 deletion(-)
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1998,7 +1998,7 @@ again:
if (!*ptep)
goto out;
- if (vma_is_dead(vma, seq))
+ if (vma && vma_is_dead(vma, seq))
goto unlock;
unpin_page_tables();
@@ -3112,6 +3112,63 @@ int handle_mm_fault(struct mm_struct *mm
return handle_pte_fault(mm, vma, address, entry, pmd, flags, 0);
}
+int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
+ unsigned int flags)
+{
+ pmd_t *pmd = NULL;
+ pte_t *pte, entry;
+ spinlock_t *ptl;
+ struct vm_area_struct *vma;
+ unsigned int seq;
+ int ret = VM_FAULT_RETRY;
+ int dead;
+
+ __set_current_state(TASK_RUNNING);
+ flags |= FAULT_FLAG_SPECULATIVE;
+
+ count_vm_event(PGFAULT);
+
+ rcu_read_lock();
+ if (!pte_map_lock(mm, NULL, address, pmd, flags, 0, &pte, &ptl))
+ goto out_unlock;
+
+ vma = find_vma(mm, address);
+
+ if (!vma)
+ goto out_unmap;
+
+ dead = RB_EMPTY_NODE(&vma->vm_rb);
+ seq = vma->vm_sequence.sequence;
+ /*
+ * Matches both the wmb in write_seqcount_begin/end() and
+ * the wmb in detach_vmas_to_be_unmapped()/__unlink_vma().
+ */
+ smp_rmb();
+ if (dead || seq & 1)
+ goto out_unmap;
+
+ if (!(vma->vm_end > address && vma->vm_start <= address))
+ goto out_unmap;
+
+ if (read_seqcount_retry(&vma->vm_sequence, seq))
+ goto out_unmap;
+
+ entry = *pte;
+
+ pte_unmap_unlock(pte, ptl);
+
+ ret = handle_pte_fault(mm, vma, address, entry, pmd, flags, seq);
+
+out_unlock:
+ rcu_read_unlock();
+ return ret;
+
+out_unmap:
+ pte_unmap_unlock(pte, ptl);
+ goto out_unlock;
+}
+
+
#ifndef __PAGETABLE_PUD_FOLDED
/*
* Allocate page upper directory.
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -829,6 +829,8 @@ int invalidate_inode_page(struct page *p
#ifdef CONFIG_MMU
extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, unsigned int flags);
+extern int handle_speculative_fault(struct mm_struct *mm,
+ unsigned long address, unsigned int flags);
#else
static inline int handle_mm_fault(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long address,
* [RFC][PATCH 7/8] mm,x86: speculative pagefault support
2010-01-04 18:24 [RFC][PATCH 0/8] Speculative pagefault -v3 Peter Zijlstra
` (5 preceding siblings ...)
2010-01-04 18:24 ` [RFC][PATCH 6/8] mm: handle_speculative_fault() Peter Zijlstra
@ 2010-01-04 18:24 ` Peter Zijlstra
2010-01-04 18:24 ` [RFC][PATCH 8/8] mm: Optimize pte_map_lock() Peter Zijlstra
` (2 subsequent siblings)
9 siblings, 0 replies; 121+ messages in thread
From: Peter Zijlstra @ 2010-01-04 18:24 UTC (permalink / raw)
To: Paul E. McKenney, Peter Zijlstra, KAMEZAWA Hiroyuki,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
minchan.kim@gmail.com, cl, hugh.dickins, Nick Piggin, Ingo Molnar,
Linus Torvalds
Cc: Peter Zijlstra
[-- Attachment #1: mm-foo-9.patch --]
[-- Type: text/plain, Size: 3031 bytes --]
Implement the x86 architecture support for speculative pagefaults.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
arch/x86/mm/fault.c | 31 +++++++++++--------------------
arch/x86/mm/gup.c | 10 ++++++++++
2 files changed, 21 insertions(+), 20 deletions(-)
Index: linux-2.6/arch/x86/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/fault.c
+++ linux-2.6/arch/x86/mm/fault.c
@@ -777,30 +777,13 @@ bad_area_access_error(struct pt_regs *re
__bad_area(regs, error_code, address, SEGV_ACCERR);
}
-/* TODO: fixup for "mm-invoke-oom-killer-from-page-fault.patch" */
-static void
-out_of_memory(struct pt_regs *regs, unsigned long error_code,
- unsigned long address)
-{
- /*
- * We ran out of memory, call the OOM killer, and return the userspace
- * (which will retry the fault, or kill us if we got oom-killed):
- */
- up_read(¤t->mm->mmap_sem);
-
- pagefault_out_of_memory();
-}
-
static void
do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
unsigned int fault)
{
struct task_struct *tsk = current;
- struct mm_struct *mm = tsk->mm;
int code = BUS_ADRERR;
- up_read(&mm->mmap_sem);
-
/* Kernel mode? Handle exceptions or die: */
if (!(error_code & PF_USER))
no_context(regs, error_code, address);
@@ -829,7 +812,7 @@ mm_fault_error(struct pt_regs *regs, uns
unsigned long address, unsigned int fault)
{
if (fault & VM_FAULT_OOM) {
- out_of_memory(regs, error_code, address);
+ pagefault_out_of_memory();
} else {
if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON))
do_sigbus(regs, error_code, address, fault);
@@ -1040,6 +1023,14 @@ do_page_fault(struct pt_regs *regs, unsi
return;
}
+ if (error_code & PF_USER) {
+ fault = handle_speculative_fault(mm, address,
+ error_code & PF_WRITE ? FAULT_FLAG_WRITE : 0);
+
+ if (!(fault & VM_FAULT_RETRY))
+ goto done;
+ }
+
/*
* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in
@@ -1119,6 +1110,8 @@ good_area:
*/
fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+ up_read(&mm->mmap_sem);
+done:
if (unlikely(fault & VM_FAULT_ERROR)) {
mm_fault_error(regs, error_code, address, fault);
return;
@@ -1135,6 +1128,4 @@ good_area:
}
check_v8086_mode(regs, address, tsk);
-
- up_read(&mm->mmap_sem);
}
Index: linux-2.6/arch/x86/mm/gup.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/gup.c
+++ linux-2.6/arch/x86/mm/gup.c
@@ -373,3 +373,13 @@ slow_irqon:
return ret;
}
}
+
+void pin_page_tables(void)
+{
+ local_irq_disable();
+}
+
+void unpin_page_tables(void)
+{
+ local_irq_enable();
+}
* [RFC][PATCH 8/8] mm: Optimize pte_map_lock()
2010-01-04 18:24 [RFC][PATCH 0/8] Speculative pagefault -v3 Peter Zijlstra
` (6 preceding siblings ...)
2010-01-04 18:24 ` [RFC][PATCH 7/8] mm,x86: speculative pagefault support Peter Zijlstra
@ 2010-01-04 18:24 ` Peter Zijlstra
2010-01-04 21:41 ` [RFC][PATCH 0/8] Speculative pagefault -v3 Rik van Riel
2010-01-05 2:26 ` Minchan Kim
9 siblings, 0 replies; 121+ messages in thread
From: Peter Zijlstra @ 2010-01-04 18:24 UTC (permalink / raw)
To: Paul E. McKenney, Peter Zijlstra, KAMEZAWA Hiroyuki,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
minchan.kim@gmail.com, cl, hugh.dickins, Nick Piggin, Ingo Molnar,
Linus Torvalds
Cc: Peter Zijlstra
[-- Attachment #1: mm-foo-11.patch --]
[-- Type: text/plain, Size: 3725 bytes --]
If we ensure the pagetable invariance by also guarding against unmap,
we can skip part of the pagetable walk by validating the vma early.
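Concretely (a sketch of the effect, not additional code): with
unmap_page_range() now bracketed by the vma sequence count, an unchanged
sequence also implies the page tables under the vma have not been torn down,
so the speculative side of pte_map_lock() can validate the vma up front
instead of re-walking pgd/pud/pmd:

        if (vma_is_dead(vma, seq))
                goto out;       /* unmapped or resized: the caller's pmd may be gone */
        /* otherwise the pmd handed in by the caller is still valid */
        ptl = pte_lockptr(mm, pmd);
        pte = pte_offset_map(pmd, address);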
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
mm/memory.c | 58 ++++++++++++++++++++++++++++++++++++----------------------
1 file changed, 36 insertions(+), 22 deletions(-)
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -956,6 +956,7 @@ static unsigned long unmap_page_range(st
details = NULL;
BUG_ON(addr >= end);
+ write_seqcount_begin(&vma->vm_sequence);
mem_cgroup_uncharge_start();
tlb_start_vma(tlb, vma);
pgd = pgd_offset(vma->vm_mm, addr);
@@ -970,6 +971,7 @@ static unsigned long unmap_page_range(st
} while (pgd++, addr = next, (addr != end && *zap_work > 0));
tlb_end_vma(tlb, vma);
mem_cgroup_uncharge_end();
+ write_seqcount_end(&vma->vm_sequence);
return addr;
}
@@ -1961,9 +1963,6 @@ static int pte_map_lock(struct mm_struct
unsigned long address, pmd_t *pmd, unsigned int flags,
unsigned int seq, pte_t **ptep, spinlock_t **ptlp)
{
- pgd_t *pgd;
- pud_t *pud;
-
if (!(flags & FAULT_FLAG_SPECULATIVE)) {
*ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
return 1;
@@ -1972,19 +1971,7 @@ static int pte_map_lock(struct mm_struct
again:
pin_page_tables();
- pgd = pgd_offset(mm, address);
- if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
- goto out;
-
- pud = pud_offset(pgd, address);
- if (pud_none(*pud) || unlikely(pud_bad(*pud)))
- goto out;
-
- pmd = pmd_offset(pud, address);
- if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
- goto out;
-
- if (pmd_huge(*pmd))
+ if (vma_is_dead(vma, seq))
goto out;
*ptlp = pte_lockptr(mm, pmd);
@@ -1998,7 +1985,7 @@ again:
if (!*ptep)
goto out;
- if (vma && vma_is_dead(vma, seq))
+ if (vma_is_dead(vma, seq))
goto unlock;
unpin_page_tables();
@@ -3115,13 +3102,14 @@ int handle_mm_fault(struct mm_struct *mm
int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
unsigned int flags)
{
- pmd_t *pmd = NULL;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
pte_t *pte, entry;
spinlock_t *ptl;
struct vm_area_struct *vma;
unsigned int seq;
- int ret = VM_FAULT_RETRY;
- int dead;
+ int dead, ret = VM_FAULT_RETRY;
__set_current_state(TASK_RUNNING);
flags |= FAULT_FLAG_SPECULATIVE;
@@ -3129,8 +3117,31 @@ int handle_speculative_fault(struct mm_s
count_vm_event(PGFAULT);
rcu_read_lock();
- if (!pte_map_lock(mm, NULL, address, pmd, flags, 0, &pte, &ptl))
- goto out_unlock;
+again:
+ pin_page_tables();
+
+ pgd = pgd_offset(mm, address);
+ if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+ goto out;
+
+ pud = pud_offset(pgd, address);
+ if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+ goto out;
+
+ pmd = pmd_offset(pud, address);
+ if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+ goto out;
+
+ if (pmd_huge(*pmd))
+ goto out;
+
+ ptl = pte_lockptr(mm, pmd);
+ pte = pte_offset_map(pmd, address);
+ if (!spin_trylock(ptl)) {
+ pte_unmap(pte);
+ unpin_page_tables();
+ goto again;
+ }
vma = find_vma(mm, address);
@@ -3156,6 +3167,7 @@ int handle_speculative_fault(struct mm_s
entry = *pte;
pte_unmap_unlock(pte, ptl);
+ unpin_page_tables();
ret = handle_pte_fault(mm, vma, address, entry, pmd, flags, seq);
@@ -3165,6 +3177,8 @@ out_unlock:
out_unmap:
pte_unmap_unlock(pte, ptl);
+out:
+ unpin_page_tables();
goto out_unlock;
}
* Re: [RFC][PATCH 0/8] Speculative pagefault -v3
2010-01-04 18:24 [RFC][PATCH 0/8] Speculative pagefault -v3 Peter Zijlstra
` (7 preceding siblings ...)
2010-01-04 18:24 ` [RFC][PATCH 8/8] mm: Optimize pte_map_lock() Peter Zijlstra
@ 2010-01-04 21:41 ` Rik van Riel
2010-01-04 21:46 ` Peter Zijlstra
2010-01-04 21:59 ` Christoph Lameter
2010-01-05 2:26 ` Minchan Kim
9 siblings, 2 replies; 121+ messages in thread
From: Rik van Riel @ 2010-01-04 21:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul E. McKenney, Peter Zijlstra, KAMEZAWA Hiroyuki,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
minchan.kim@gmail.com, cl, hugh.dickins, Nick Piggin, Ingo Molnar,
Linus Torvalds
On 01/04/2010 01:24 PM, Peter Zijlstra wrote:
> Patch series implementing speculative page faults for x86.
Fun, but why do we need this?
What improvements did you measure?
I'll take a look over the patches to see whether they're
sane...
--
All rights reversed.
* Re: [RFC][PATCH 0/8] Speculative pagefault -v3
2010-01-04 21:41 ` [RFC][PATCH 0/8] Speculative pagefault -v3 Rik van Riel
@ 2010-01-04 21:46 ` Peter Zijlstra
2010-01-04 23:20 ` Rik van Riel
2010-01-04 21:59 ` Christoph Lameter
1 sibling, 1 reply; 121+ messages in thread
From: Peter Zijlstra @ 2010-01-04 21:46 UTC (permalink / raw)
To: Rik van Riel
Cc: Paul E. McKenney, KAMEZAWA Hiroyuki, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, minchan.kim@gmail.com, cl, hugh.dickins,
Nick Piggin, Ingo Molnar, Linus Torvalds
On Mon, 2010-01-04 at 16:41 -0500, Rik van Riel wrote:
> On 01/04/2010 01:24 PM, Peter Zijlstra wrote:
> > Patch series implementing speculative page faults for x86.
>
> Fun, but why do we need this?
People were once again concerned with mmap_sem contention on threaded
apps on large machines. Kame-san posted some patches, but I felt they
weren't quite crazy enough ;-)
> What improvements did you measure?
I got it not to crash :-) Although I'd not be surprised if other people
do manage to; it needs more eyes.
> I'll take a look over the patches to see whether they're
> sane...
Much appreciated, thanks!
* Re: [RFC][PATCH 0/8] Speculative pagefault -v3
2010-01-04 21:41 ` [RFC][PATCH 0/8] Speculative pagefault -v3 Rik van Riel
2010-01-04 21:46 ` Peter Zijlstra
@ 2010-01-04 21:59 ` Christoph Lameter
2010-01-05 0:28 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 121+ messages in thread
From: Christoph Lameter @ 2010-01-04 21:59 UTC (permalink / raw)
To: Rik van Riel
Cc: Peter Zijlstra, Paul E. McKenney, Peter Zijlstra,
KAMEZAWA Hiroyuki, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, minchan.kim@gmail.com, hugh.dickins,
Nick Piggin, Ingo Molnar, Linus Torvalds
On Mon, 4 Jan 2010, Rik van Riel wrote:
> Fun, but why do we need this?
>
> What improvements did you measure?
If it measures up to Kame-san's approach then the possible pagefault rate
will at least double ...
* Re: [RFC][PATCH 0/8] Speculative pagefault -v3
2010-01-04 21:46 ` Peter Zijlstra
@ 2010-01-04 23:20 ` Rik van Riel
0 siblings, 0 replies; 121+ messages in thread
From: Rik van Riel @ 2010-01-04 23:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul E. McKenney, KAMEZAWA Hiroyuki, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, minchan.kim@gmail.com, cl, hugh.dickins,
Nick Piggin, Ingo Molnar, Linus Torvalds
On 01/04/2010 04:46 PM, Peter Zijlstra wrote:
> On Mon, 2010-01-04 at 16:41 -0500, Rik van Riel wrote:
>> On 01/04/2010 01:24 PM, Peter Zijlstra wrote:
>>> Patch series implementing speculative page faults for x86.
>>
>> Fun, but why do we need this?
>
> People were once again concerned with mmap_sem contention on threaded
> apps on large machines. Kame-san posted some patches, but I felt they
> weren't quite crazy enough ;-)
In that case, I assume that somebody else (maybe Kame-san or
Christoph) will end up posting a benchmark that shows how
these patches help.
--
All rights reversed.
* Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
2010-01-04 18:24 ` [RFC][PATCH 6/8] mm: handle_speculative_fault() Peter Zijlstra
@ 2010-01-05 0:25 ` KAMEZAWA Hiroyuki
2010-01-05 3:13 ` Linus Torvalds
2010-01-05 4:29 ` Minchan Kim
2010-01-05 13:45 ` Arjan van de Ven
1 sibling, 2 replies; 121+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-05 0:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul E. McKenney, Peter Zijlstra, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, minchan.kim@gmail.com, cl, hugh.dickins,
Nick Piggin, Ingo Molnar, Linus Torvalds
On Mon, 04 Jan 2010 19:24:35 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Generic speculative fault handler, tries to service a pagefault
> without holding mmap_sem.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
I'm sorry if I'm missing something... how does this patch series avoid
the vma being removed while __do_fault()->vma->vm_ops->fault() is called?
("vma is removed" here includes all the other teardown, such as freeing the
file struct etc.)
Thanks,
-Kame
> ---
> include/linux/mm.h | 2 +
> mm/memory.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 60 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -1998,7 +1998,7 @@ again:
> if (!*ptep)
> goto out;
>
> - if (vma_is_dead(vma, seq))
> + if (vma && vma_is_dead(vma, seq))
> goto unlock;
>
> unpin_page_tables();
> @@ -3112,6 +3112,63 @@ int handle_mm_fault(struct mm_struct *mm
> return handle_pte_fault(mm, vma, address, entry, pmd, flags, 0);
> }
>
> +int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
> + unsigned int flags)
> +{
> + pmd_t *pmd = NULL;
> + pte_t *pte, entry;
> + spinlock_t *ptl;
> + struct vm_area_struct *vma;
> + unsigned int seq;
> + int ret = VM_FAULT_RETRY;
> + int dead;
> +
> + __set_current_state(TASK_RUNNING);
> + flags |= FAULT_FLAG_SPECULATIVE;
> +
> + count_vm_event(PGFAULT);
> +
> + rcu_read_lock();
> + if (!pte_map_lock(mm, NULL, address, pmd, flags, 0, &pte, &ptl))
> + goto out_unlock;
> +
> + vma = find_vma(mm, address);
> +
> + if (!vma)
> + goto out_unmap;
> +
> + dead = RB_EMPTY_NODE(&vma->vm_rb);
> + seq = vma->vm_sequence.sequence;
> + /*
> + * Matches both the wmb in write_seqcount_begin/end() and
> + * the wmb in detach_vmas_to_be_unmapped()/__unlink_vma().
> + */
> + smp_rmb();
> + if (dead || seq & 1)
> + goto out_unmap;
> +
> + if (!(vma->vm_end > address && vma->vm_start <= address))
> + goto out_unmap;
> +
> + if (read_seqcount_retry(&vma->vm_sequence, seq))
> + goto out_unmap;
> +
> + entry = *pte;
> +
> + pte_unmap_unlock(pte, ptl);
> +
> + ret = handle_pte_fault(mm, vma, address, entry, pmd, flags, seq);
> +
> +out_unlock:
> + rcu_read_unlock();
> + return ret;
> +
> +out_unmap:
> + pte_unmap_unlock(pte, ptl);
> + goto out_unlock;
> +}
> +
> +
> #ifndef __PAGETABLE_PUD_FOLDED
> /*
> * Allocate page upper directory.
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -829,6 +829,8 @@ int invalidate_inode_page(struct page *p
> #ifdef CONFIG_MMU
> extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long address, unsigned int flags);
> +extern int handle_speculative_fault(struct mm_struct *mm,
> + unsigned long address, unsigned int flags);
> #else
> static inline int handle_mm_fault(struct mm_struct *mm,
> struct vm_area_struct *vma, unsigned long address,
>
> --
>
>
* Re: [RFC][PATCH 0/8] Speculative pagefault -v3
2010-01-04 21:59 ` Christoph Lameter
@ 2010-01-05 0:28 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 121+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-05 0:28 UTC (permalink / raw)
To: Christoph Lameter
Cc: Rik van Riel, Peter Zijlstra, Paul E. McKenney, Peter Zijlstra,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
minchan.kim@gmail.com, hugh.dickins, Nick Piggin, Ingo Molnar,
Linus Torvalds
On Mon, 4 Jan 2010 15:59:45 -0600 (CST)
Christoph Lameter <cl@linux-foundation.org> wrote:
> On Mon, 4 Jan 2010, Rik van Riel wrote:
>
> > Fun, but why do we need this?
> >
> > What improvements did you measure?
>
> If it measures up to Kame-sans approach then the possible pagefault rate
> will at least double ...
>
On a 4-core/2-socket machine ;)
More important than the page fault rate is the fact that we can reduce cache
contention by skipping mmap_sem in some situations.
And I think we have some chances there.
Thanks,
-Kame
* Re: [RFC][PATCH 0/8] Speculative pagefault -v3
2010-01-04 18:24 [RFC][PATCH 0/8] Speculative pagefault -v3 Peter Zijlstra
` (8 preceding siblings ...)
2010-01-04 21:41 ` [RFC][PATCH 0/8] Speculative pagefault -v3 Rik van Riel
@ 2010-01-05 2:26 ` Minchan Kim
9 siblings, 0 replies; 121+ messages in thread
From: Minchan Kim @ 2010-01-05 2:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul E. McKenney, Peter Zijlstra, KAMEZAWA Hiroyuki,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl,
hugh.dickins, Nick Piggin, Ingo Molnar, Linus Torvalds
Hi, Peter.
On Tue, Jan 5, 2010 at 3:24 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Patch series implementing speculative page faults for x86.
>
> Still needs lots of things sorted, like:
>
> - call_srcu()
> - ptl, irq and tlb-flush
> - a 2nd VM_FAULT_LOCK? return code to distuinguish between
> simple retry and must take mmap_sem semantics?
>
> Comments?
> --
>
>
I looked over this patch series.
It is one of the neatest series I have seen so far.
If we solve the call_srcu problem, it would be good.
I will help by testing this series on my machine to confirm it works well.
--
Kind regards,
Minchan Kim
* Re: [RFC][PATCH 4/8] mm: RCU free vmas
2010-01-04 18:24 ` [RFC][PATCH 4/8] mm: RCU free vmas Peter Zijlstra
@ 2010-01-05 2:43 ` Paul E. McKenney
2010-01-05 8:28 ` Peter Zijlstra
0 siblings, 1 reply; 121+ messages in thread
From: Paul E. McKenney @ 2010-01-05 2:43 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Peter Zijlstra, KAMEZAWA Hiroyuki, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, minchan.kim@gmail.com, cl, hugh.dickins,
Nick Piggin, Ingo Molnar, Linus Torvalds
On Mon, Jan 04, 2010 at 07:24:33PM +0100, Peter Zijlstra wrote:
> TODO:
> - should be SRCU, lack of call_srcu()
>
> In order to allow speculative vma lookups, RCU free the struct
> vm_area_struct.
>
> We use two means of detecting a vma is still valid:
> - firstly, we set RB_CLEAR_NODE once we remove a vma from the tree.
> - secondly, we check the vma sequence number.
>
> These two things combined will guarantee that 1) the vma is still
> present and 2) it still covers the same range from when we looked it
> up.
OK, I think I see what you are up to here. I could get you a very crude
throw-away call_srcu() fairly quickly. I don't yet have a good estimate
of how long it will take me to merge SRCU into the treercu infrastructure,
but am getting there.
So, which release are you thinking in terms of?
Thanx, Paul
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> include/linux/mm.h | 12 ++++++++++++
> include/linux/mm_types.h | 2 ++
> init/Kconfig | 34 +++++++++++++++++-----------------
> kernel/sched.c | 9 ++++++++-
> mm/mmap.c | 33 +++++++++++++++++++++++++++++++--
> 5 files changed, 70 insertions(+), 20 deletions(-)
>
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -765,6 +765,18 @@ unsigned long unmap_vmas(struct mmu_gath
> unsigned long end_addr, unsigned long *nr_accounted,
> struct zap_details *);
>
> +static inline int vma_is_dead(struct vm_area_struct *vma, unsigned int sequence)
> +{
> + int ret = RB_EMPTY_NODE(&vma->vm_rb);
> + unsigned seq = vma->vm_sequence.sequence;
> + /*
> +	 * Matches both the wmb in write_seqcount_begin/end() and
> + * the wmb in detach_vmas_to_be_unmapped()/__unlink_vma().
> + */
> + smp_rmb();
> + return ret || seq != sequence;
> +}
> +
> /**
> * mm_walk - callbacks for walk_page_range
> * @pgd_entry: if set, called for each non-empty PGD (top-level) entry
> Index: linux-2.6/include/linux/mm_types.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm_types.h
> +++ linux-2.6/include/linux/mm_types.h
> @@ -13,6 +13,7 @@
> #include <linux/cpumask.h>
> #include <linux/page-debug-flags.h>
> #include <linux/seqlock.h>
> +#include <linux/rcupdate.h>
> #include <asm/page.h>
> #include <asm/mmu.h>
>
> @@ -188,6 +189,7 @@ struct vm_area_struct {
> struct mempolicy *vm_policy; /* NUMA policy for the VMA */
> #endif
> seqcount_t vm_sequence;
> + struct rcu_head vm_rcu_head;
> };
>
> struct core_thread {
> Index: linux-2.6/mm/mmap.c
> ===================================================================
> --- linux-2.6.orig/mm/mmap.c
> +++ linux-2.6/mm/mmap.c
> @@ -222,6 +222,19 @@ void unlink_file_vma(struct vm_area_stru
> }
> }
>
> +static void free_vma_rcu(struct rcu_head *head)
> +{
> + struct vm_area_struct *vma =
> + container_of(head, struct vm_area_struct, vm_rcu_head);
> +
> + kmem_cache_free(vm_area_cachep, vma);
> +}
> +
> +static void free_vma(struct vm_area_struct *vma)
> +{
> + call_rcu(&vma->vm_rcu_head, free_vma_rcu);
> +}
> +
> /*
> * Close a vm structure and free it, returning the next.
> */
> @@ -238,7 +251,7 @@ static struct vm_area_struct *remove_vma
> removed_exe_file_vma(vma->vm_mm);
> }
> mpol_put(vma_policy(vma));
> - kmem_cache_free(vm_area_cachep, vma);
> + free_vma(vma);
> return next;
> }
>
> @@ -488,6 +501,14 @@ __vma_unlink(struct mm_struct *mm, struc
> {
> prev->vm_next = vma->vm_next;
> rb_erase(&vma->vm_rb, &mm->mm_rb);
> + /*
> +	 * Ensure the removal is completely committed to memory
> + * before clearing the node.
> + *
> + * Matched by vma_is_dead()/handle_speculative_fault().
> + */
> + smp_wmb();
> + RB_CLEAR_NODE(&vma->vm_rb);
> if (mm->mmap_cache == vma)
> mm->mmap_cache = prev;
> }
> @@ -644,7 +665,7 @@ again: remove_next = 1 + (end > next->
> }
> mm->map_count--;
> mpol_put(vma_policy(next));
> - kmem_cache_free(vm_area_cachep, next);
> + free_vma(next);
> /*
> * In mprotect's case 6 (see comments on vma_merge),
> * we must remove another next too. It would clutter
> @@ -1858,6 +1879,14 @@ detach_vmas_to_be_unmapped(struct mm_str
> insertion_point = (prev ? &prev->vm_next : &mm->mmap);
> do {
> rb_erase(&vma->vm_rb, &mm->mm_rb);
> + /*
> +	 * Ensure the removal is completely committed to memory
> + * before clearing the node.
> + *
> + * Matched by vma_is_dead()/handle_speculative_fault().
> + */
> + smp_wmb();
> + RB_CLEAR_NODE(&vma->vm_rb);
> mm->map_count--;
> tail_vma = vma;
> vma = vma->vm_next;
> Index: linux-2.6/init/Kconfig
> ===================================================================
> --- linux-2.6.orig/init/Kconfig
> +++ linux-2.6/init/Kconfig
> @@ -314,19 +314,19 @@ menu "RCU Subsystem"
>
> choice
> prompt "RCU Implementation"
> - default TREE_RCU
> + default TREE_PREEMPT_RCU
>
> -config TREE_RCU
> - bool "Tree-based hierarchical RCU"
> - help
> - This option selects the RCU implementation that is
> - designed for very large SMP system with hundreds or
> - thousands of CPUs. It also scales down nicely to
> - smaller systems.
> +#config TREE_RCU
> +# bool "Tree-based hierarchical RCU"
> +# help
> +# This option selects the RCU implementation that is
> +# designed for very large SMP system with hundreds or
> +# thousands of CPUs. It also scales down nicely to
> +# smaller systems.
>
> config TREE_PREEMPT_RCU
> bool "Preemptable tree-based hierarchical RCU"
> - depends on PREEMPT
> +# depends on PREEMPT
> help
> This option selects the RCU implementation that is
> designed for very large SMP systems with hundreds or
> @@ -334,14 +334,14 @@ config TREE_PREEMPT_RCU
> is also required. It also scales down nicely to
> smaller systems.
>
> -config TINY_RCU
> - bool "UP-only small-memory-footprint RCU"
> - depends on !SMP
> - help
> - This option selects the RCU implementation that is
> - designed for UP systems from which real-time response
> - is not required. This option greatly reduces the
> - memory footprint of RCU.
> +#config TINY_RCU
> +# bool "UP-only small-memory-footprint RCU"
> +# depends on !SMP
> +# help
> +# This option selects the RCU implementation that is
> +# designed for UP systems from which real-time response
> +# is not required. This option greatly reduces the
> +# memory footprint of RCU.
>
> endchoice
>
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -9689,7 +9689,14 @@ void __init sched_init(void)
> #ifdef CONFIG_DEBUG_SPINLOCK_SLEEP
> static inline int preempt_count_equals(int preempt_offset)
> {
> - int nested = (preempt_count() & ~PREEMPT_ACTIVE) + rcu_preempt_depth();
> + int nested = (preempt_count() & ~PREEMPT_ACTIVE)
> + /*
> + * remove this for we need preemptible RCU
> + * exactly because it needs to sleep..
> + *
> + + rcu_preempt_depth()
> + */
> + ;
>
> return (nested == PREEMPT_INATOMIC_BASE + preempt_offset);
> }
>
> --
>
* Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
2010-01-05 0:25 ` KAMEZAWA Hiroyuki
@ 2010-01-05 3:13 ` Linus Torvalds
2010-01-05 8:17 ` Peter Zijlstra
` (2 more replies)
2010-01-05 4:29 ` Minchan Kim
1 sibling, 3 replies; 121+ messages in thread
From: Linus Torvalds @ 2010-01-05 3:13 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Peter Zijlstra, Paul E. McKenney, Peter Zijlstra,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
minchan.kim@gmail.com, cl, hugh.dickins, Nick Piggin, Ingo Molnar
On Tue, 5 Jan 2010, KAMEZAWA Hiroyuki wrote:
>
> I'm sorry if I miss something...how does this patch series avoid
> that vma is removed while __do_fault()->vma->vm_ops->fault() is called ?
> ("vma is removed" means all other things as freeing file struct etc..)
I don't think you're missing anything.
Protecting the vma isn't enough. You need to protect the whole FS stack
with rcu. Probably by moving _all_ of "free_vma()" into the RCU path
(which means that the whole file/inode gets de-allocated at that later RCU
point, rather than synchronously). Not just the actual kfree.
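As a rough, purely illustrative sketch of that idea (it assumes the vm_rcu_head
field this series adds; the helper names are invented, and calling ->close() or
fput() from RCU callback context would itself need to be punted to process
context):

static void remove_vma_rcu_cb(struct rcu_head *head)
{
	struct vm_area_struct *vma =
		container_of(head, struct vm_area_struct, vm_rcu_head);

	/* The whole teardown now happens a grace period later ... */
	if (vma->vm_ops && vma->vm_ops->close)
		vma->vm_ops->close(vma);
	if (vma->vm_file)
		fput(vma->vm_file);	/* ... so the file stays busy until then */
	mpol_put(vma_policy(vma));
	kmem_cache_free(vm_area_cachep, vma);
}

static void remove_vma_deferred(struct vm_area_struct *vma)
{
	call_rcu(&vma->vm_rcu_head, remove_vma_rcu_cb);
}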
However, it's worth noting that that actually has some very subtle and
major consequences. If you have a temporary file that was removed, where
the mmap() was the last user that kind of delayed freeing would also delay
the final fput of that file that actually deletes it.
Or put another way: if the vma was a writable mapping, a user may do
munmap(mapping, size);
and the backing file is still active and writable AFTER THE MUNMAP! This
can be a huge problem for something that wants to unmount the volume, for
example, or depends on the whole writability-vs-executability thing. The
user may have unmapped it, and expects the file to be immediately
non-busy, but with the delayed free that isn't the case any more.
In other words, now you may well need to make munmap() wait for the RCU
grace period, so that the user who did the unmap really is synchronous wrt
the file accesses. We've had things like that before, and they have been
_huge_ performance problems (ie it may take just a timer tick or two, but
then people do tens of thousands of munmaps, and now that takes many
seconds just due to the RCU grace period waiting).
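For reference, making munmap() synchronous again would amount to something like
the sketch below at the end of do_munmap() (function names as in the 2.6.33
unmap path; the synchronize_rcu() line is the hypothetical addition, and it is
exactly the wait described above):

	/* Sketch: end of do_munmap(), forcing the deferred teardown to finish. */
	detach_vmas_to_be_unmapped(mm, vma, prev, end);
	unmap_region(mm, vma, prev, start, end);

	synchronize_rcu();	/* wait out all RCU-deferred vma/file teardown */

	/* Fix up all other VM information */
	remove_vma_list(mm, vma);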
I would say that this whole series is _very_ far from being mergeable.
Peter seems to have been thinking about the details, while missing all the
subtle big picture effects that seem to actually change semantics.
Linus
* Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
2010-01-05 0:25 ` KAMEZAWA Hiroyuki
2010-01-05 3:13 ` Linus Torvalds
@ 2010-01-05 4:29 ` Minchan Kim
2010-01-05 4:43 ` KAMEZAWA Hiroyuki
2010-01-05 4:48 ` Linus Torvalds
1 sibling, 2 replies; 121+ messages in thread
From: Minchan Kim @ 2010-01-05 4:29 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Peter Zijlstra, Paul E. McKenney, Peter Zijlstra,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl,
hugh.dickins, Nick Piggin, Ingo Molnar, Linus Torvalds
Hi, Kame.
On Tue, Jan 5, 2010 at 9:25 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 04 Jan 2010 19:24:35 +0100
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
>> Generic speculative fault handler, tries to service a pagefault
>> without holding mmap_sem.
>>
>> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>
>
> I'm sorry if I miss something...how does this patch series avoid
> that vma is removed while __do_fault()->vma->vm_ops->fault() is called ?
> ("vma is removed" means all other things as freeing file struct etc..)
Isn't it protected by get_file and iget?
Am I missing something?
>
> Thanks,
> -Kame
>
>
>
>
>> ---
>> include/linux/mm.h | 2 +
>> mm/memory.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>> 2 files changed, 60 insertions(+), 1 deletion(-)
>>
>> Index: linux-2.6/mm/memory.c
>> ===================================================================
>> --- linux-2.6.orig/mm/memory.c
>> +++ linux-2.6/mm/memory.c
>> @@ -1998,7 +1998,7 @@ again:
>> if (!*ptep)
>> goto out;
>>
>> - if (vma_is_dead(vma, seq))
>> + if (vma && vma_is_dead(vma, seq))
>> goto unlock;
>>
>> unpin_page_tables();
>> @@ -3112,6 +3112,63 @@ int handle_mm_fault(struct mm_struct *mm
>> return handle_pte_fault(mm, vma, address, entry, pmd, flags, 0);
>> }
>>
>> +int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
>> + unsigned int flags)
>> +{
>> + pmd_t *pmd = NULL;
>> + pte_t *pte, entry;
>> + spinlock_t *ptl;
>> + struct vm_area_struct *vma;
>> + unsigned int seq;
>> + int ret = VM_FAULT_RETRY;
>> + int dead;
>> +
>> + __set_current_state(TASK_RUNNING);
>> + flags |= FAULT_FLAG_SPECULATIVE;
>> +
>> + count_vm_event(PGFAULT);
>> +
>> + rcu_read_lock();
>> + if (!pte_map_lock(mm, NULL, address, pmd, flags, 0, &pte, &ptl))
>> + goto out_unlock;
>> +
>> + vma = find_vma(mm, address);
>> +
>> + if (!vma)
>> + goto out_unmap;
>> +
>> + dead = RB_EMPTY_NODE(&vma->vm_rb);
>> + seq = vma->vm_sequence.sequence;
>> + /*
>> + * Matches both the wmb in write_seqcount_begin/end() and
>> + * the wmb in detach_vmas_to_be_unmapped()/__unlink_vma().
>> + */
>> + smp_rmb();
>> + if (dead || seq & 1)
>> + goto out_unmap;
>> +
>> + if (!(vma->vm_end > address && vma->vm_start <= address))
>> + goto out_unmap;
>> +
>> + if (read_seqcount_retry(&vma->vm_sequence, seq))
>> + goto out_unmap;
>> +
>> + entry = *pte;
>> +
>> + pte_unmap_unlock(pte, ptl);
>> +
>> + ret = handle_pte_fault(mm, vma, address, entry, pmd, flags, seq);
>> +
>> +out_unlock:
>> + rcu_read_unlock();
>> + return ret;
>> +
>> +out_unmap:
>> + pte_unmap_unlock(pte, ptl);
>> + goto out_unlock;
>> +}
>> +
>> +
>> #ifndef __PAGETABLE_PUD_FOLDED
>> /*
>> * Allocate page upper directory.
>> Index: linux-2.6/include/linux/mm.h
>> ===================================================================
>> --- linux-2.6.orig/include/linux/mm.h
>> +++ linux-2.6/include/linux/mm.h
>> @@ -829,6 +829,8 @@ int invalidate_inode_page(struct page *p
>> #ifdef CONFIG_MMU
>> extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>> unsigned long address, unsigned int flags);
>> +extern int handle_speculative_fault(struct mm_struct *mm,
>> + unsigned long address, unsigned int flags);
>> #else
>> static inline int handle_mm_fault(struct mm_struct *mm,
>> struct vm_area_struct *vma, unsigned long address,
>>
>> --
>>
>>
>
>
--
Kind regards,
Minchan Kim
* Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
2010-01-05 4:29 ` Minchan Kim
@ 2010-01-05 4:43 ` KAMEZAWA Hiroyuki
2010-01-05 5:10 ` Linus Torvalds
2010-01-05 6:00 ` Minchan Kim
2010-01-05 4:48 ` Linus Torvalds
1 sibling, 2 replies; 121+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-05 4:43 UTC (permalink / raw)
To: Minchan Kim
Cc: Peter Zijlstra, Paul E. McKenney, Peter Zijlstra,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl,
hugh.dickins, Nick Piggin, Ingo Molnar, Linus Torvalds
On Tue, 5 Jan 2010 13:29:40 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:
> Hi, Kame.
>
> On Tue, Jan 5, 2010 at 9:25 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Mon, 04 Jan 2010 19:24:35 +0100
> > Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> >> Generic speculative fault handler, tries to service a pagefault
> >> without holding mmap_sem.
> >>
> >> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> >
> >
> > I'm sorry if I miss something...how does this patch series avoid
> > that vma is removed while __do_fault()->vma->vm_ops->fault() is called ?
> > ("vma is removed" means all other things as freeing file struct etc..)
>
> Isn't it protected by get_file and iget?
> Am I miss something?
>
Only the kmem_cache_free() part of the following code is modified by the patch.
==
229 static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
230 {
231 struct vm_area_struct *next = vma->vm_next;
232
233 might_sleep();
234 if (vma->vm_ops && vma->vm_ops->close)
235 vma->vm_ops->close(vma);
236 if (vma->vm_file) {
237 fput(vma->vm_file);
238 if (vma->vm_flags & VM_EXECUTABLE)
239 removed_exe_file_vma(vma->vm_mm);
240 }
241 mpol_put(vma_policy(vma));
242 kmem_cache_free(vm_area_cachep, vma);
243 return next;
244 }
==
Then, fput() can be called. The whole of the above code should be delayed
until an RCU grace period if we use RCU here.
That is why my patch dropped the speculative trial of the page fault and did
the job synchronously here. I'm still considering how to insert some barrier
to delay calling remove_vma() until all page faults are gone. One idea was a
reference count, but it was said to be not crazy enough.
Thanks,
-Kame
* Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
2010-01-05 4:29 ` Minchan Kim
2010-01-05 4:43 ` KAMEZAWA Hiroyuki
@ 2010-01-05 4:48 ` Linus Torvalds
2010-01-05 6:09 ` Minchan Kim
1 sibling, 1 reply; 121+ messages in thread
From: Linus Torvalds @ 2010-01-05 4:48 UTC (permalink / raw)
To: Minchan Kim
Cc: KAMEZAWA Hiroyuki, Peter Zijlstra, Paul E. McKenney,
Peter Zijlstra, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
cl, hugh.dickins, Nick Piggin, Ingo Molnar
On Tue, 5 Jan 2010, Minchan Kim wrote:
>
> Isn't it protected by get_file and iget?
When the vma is mapped, yes.
> Am I miss something?
remove_vma() will have done a
fput(vma->vm_file);
and other house-keeping (removing the executable info, doing
vm_ops->close() etc).
And that is _not_ delayed by RCU, and as outlined in my previous
email I think that if the code really _does_ delay it, then munmap() (and
exit) need to wait for the RCU callbacks to have been done, because
otherwise the file may end up being busy "asynchronously" in ways that
break existing semantics.
Just as an example: imagine a script that does "fork()+execve()" on a
temporary file, and then after waiting for it all to finish with wait4()
does some re-write of the file. It currently works. But what if the
open-for-writing gets ETXTBUSY because the file is still marked as being
VM_DENYWRITE, and RCU hasn't done all the callbacks?
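A userspace illustration of that pattern (hypothetical file name, sketch only;
today the open() succeeds once the child has exited, but with RCU-deferred
VM_DENYWRITE teardown it could transiently fail):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/wait.h>

int main(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		/* child: execute the temporary file */
		execl("./tmp-prog", "tmp-prog", (char *)NULL);
		_exit(127);
	}
	waitpid(pid, NULL, 0);

	/* parent: rewrite the file after the child is gone */
	int fd = open("./tmp-prog", O_WRONLY | O_TRUNC);
	if (fd < 0)
		perror("open");	/* ETXTBSY if the mapping teardown is still pending */
	else
		close(fd);
	return 0;
}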
Or if you do the i_writecount handling synchronously (which is likely fine
- it really is just for ETXTBUSY handling, and I don't think speculative
page faults matter), what about a shutdown sequence (or whatever) that
wants to unmount the filesystem, but the file is still open - as it has to
be - because the actual close is delayed by RCU.
So the patch-series as-is is fundamentally buggy - and trying to fix it
seems painful.
I'm also not entirely clear on how the race with page table tear-down vs
page-fault got handled, but I didn't read the whole patch-series very
carefully. I skimmed through it and got rather nervous about it all. It
doesn't seem too large, but it _does_ seem rather cavalier about all the
object lifetimes.
Linus
* Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
2010-01-05 4:43 ` KAMEZAWA Hiroyuki
@ 2010-01-05 5:10 ` Linus Torvalds
2010-01-05 5:30 ` KAMEZAWA Hiroyuki
2010-01-05 8:18 ` Peter Zijlstra
2010-01-05 6:00 ` Minchan Kim
1 sibling, 2 replies; 121+ messages in thread
From: Linus Torvalds @ 2010-01-05 5:10 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Minchan Kim, Peter Zijlstra, Paul E. McKenney, Peter Zijlstra,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl,
hugh.dickins, Nick Piggin, Ingo Molnar
On Tue, 5 Jan 2010, KAMEZAWA Hiroyuki wrote:
>
> Then, my patch dropped speculative trial of page fault and did synchronous
> job here. I'm still considering how to insert some barrier to delay calling
> remove_vma() until all page fault goes. One idea was reference count but
> it was said not-enough crazy.
What lock would you use to protect the vma lookup (in order to then
increase the refcount)? A sequence lock with RCU lookup of the vma?
Sounds doable. But it also sounds way more expensive than the current VM
fault handling, which is pretty close to optimal for single-threaded
cases.. That RCU lookup might be cheap, but just the refcount is generally
going to be as expensive as a lock.
Are there some particular mappings that people care about more than
others? If we limit the speculative lookup purely to anonymous memory,
that might simplify the problem space?
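If the speculative path were limited to anonymous memory as suggested, the
check could be as small as the sketch below, placed in
handle_speculative_fault() after the vma is found (there is no
vma_is_anonymous() helper in this kernel, so it is spelled out; whether
vm_file/vm_ops is the right test is an assumption):

	/* Anything file-backed or with special ops: fall back to the
	 * regular handle_mm_fault() path under mmap_sem. */
	if (vma->vm_file || vma->vm_ops)
		goto out_unmap;		/* returns VM_FAULT_RETRY */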
[ From past experiences, I suspect DB people would be upset and really
want it for the general file mapping case.. But maybe the main usage
scenario is something else this time? ]
Linus
* Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
2010-01-05 5:10 ` Linus Torvalds
@ 2010-01-05 5:30 ` KAMEZAWA Hiroyuki
2010-01-05 7:39 ` KAMEZAWA Hiroyuki
2010-01-05 15:14 ` Christoph Lameter
2010-01-05 8:18 ` Peter Zijlstra
1 sibling, 2 replies; 121+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-05 5:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Minchan Kim, Peter Zijlstra, Paul E. McKenney, Peter Zijlstra,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl,
hugh.dickins, Nick Piggin, Ingo Molnar
On Mon, 4 Jan 2010 21:10:29 -0800 (PST)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Tue, 5 Jan 2010, KAMEZAWA Hiroyuki wrote:
> >
> > Then, my patch dropped speculative trial of page fault and did synchronous
> > job here. I'm still considering how to insert some barrier to delay calling
> > remove_vma() until all page fault goes. One idea was reference count but
> > it was said not-enough crazy.
>
> What lock would you use to protect the vma lookup (in order to then
> increase the refcount)? A sequence lock with RCU lookup of the vma?
>
Ah, I just used a reference counter to show "how many threads are in a
page fault against this vma right now". Below is from my post.
==
+ rb_node = rcu_dereference(rb_node->rb_left);
+ } else
+ rb_node = rcu_dereference(rb_node->rb_right);
+ }
+ if (vma) {
+ if ((vma->vm_start <= addr) && (addr < vma->vm_end)) {
+ if (!atomic_inc_not_zero(&vma->refcnt))
+ vma = NULL;
+ } else
+ vma = NULL;
+ }
+ rcu_read_unlock();
...
+void vma_put(struct vm_area_struct *vma)
+{
+ if ((atomic_dec_return(&vma->refcnt) == 1) &&
+ waitqueue_active(&vma->wait_queue))
+ wake_up(&vma->wait_queue);
+ return;
+}
==
And wait for this reference count to drop to the right value before calling
remove_vma()
==
+/* called when vma is unlinked and wait for all racy access.*/
+static void invalidate_vma_before_free(struct vm_area_struct *vma)
+{
+ atomic_dec(&vma->refcnt);
+ wait_event(vma->wait_queue, !atomic_read(&vma->refcnt));
+}
+
....
* us to remove next before dropping the locks.
*/
__vma_unlink(mm, next, vma);
+ invalidate_vma_before_free(next);
if (file)
__remove_shared_vm_struct(next, file, mapping);
etc....
==
The above code is a bit heavy (and buggy). I have some fixes.
> Sounds doable. But it also sounds way more expensive than the current VM
> fault handling, which is pretty close to optimal for single-threaded
> cases.. That RCU lookup might be cheap, but just the refcount is generally
> going to be as expensive as a lock.
>
For single-threaded apps, my patch will have no benefit
(but will not make anything worse.)
I'll add a CONFIG option, and I wonder whether I can enable speculative_vma_lookup
only after mm_struct is shared. (but the patch may be messy...)
> Are there some particular mappings that people care about more than
> others? If we limit the speculative lookup purely to anonymous memory,
> that might simplify the problem space?
>
I wonder, for ordinary people who don't write highly optimized programs,
whether one small benefit of skipping mmap_sem is reducing the mmap_sem
ping-pong after doing fork()->exec(). That can cause some jitter in the
application. So, I'd be glad if I can help file-backed vmas.
> [ From past experiences, I suspect DB people would be upset and really
> want it for the general file mapping case.. But maybe the main usage
> scenario is something else this time? ]
>
I'd like to hear the use cases of really heavy users, too. Christoph?
Thanks,
-Kame
* Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
2010-01-05 4:43 ` KAMEZAWA Hiroyuki
2010-01-05 5:10 ` Linus Torvalds
@ 2010-01-05 6:00 ` Minchan Kim
1 sibling, 0 replies; 121+ messages in thread
From: Minchan Kim @ 2010-01-05 6:00 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Peter Zijlstra, Paul E. McKenney, Peter Zijlstra,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl,
hugh.dickins, Nick Piggin, Ingo Molnar, Linus Torvalds
On Tue, Jan 5, 2010 at 1:43 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 5 Jan 2010 13:29:40 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> Hi, Kame.
>>
>> On Tue, Jan 5, 2010 at 9:25 AM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Mon, 04 Jan 2010 19:24:35 +0100
>> > Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> >
>> >> Generic speculative fault handler, tries to service a pagefault
>> >> without holding mmap_sem.
>> >>
>> >> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>> >
>> >
>> > I'm sorry if I miss something...how does this patch series avoid
>> > that vma is removed while __do_fault()->vma->vm_ops->fault() is called ?
>> > ("vma is removed" means all other things as freeing file struct etc..)
>>
>> Isn't it protected by get_file and iget?
>> Am I miss something?
>>
> Only kmem_cache_free() part of following code is modified by the patch.
That's what I missed. Thanks, Kame. ;-)
--
Kind regards,
Minchan Kim
* Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
2010-01-05 6:09 ` Minchan Kim
@ 2010-01-05 6:09 ` KAMEZAWA Hiroyuki
2010-01-05 6:24 ` Minchan Kim
2010-01-05 8:35 ` Peter Zijlstra
1 sibling, 1 reply; 121+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-05 6:09 UTC (permalink / raw)
To: Minchan Kim
Cc: Linus Torvalds, Peter Zijlstra, Paul E. McKenney, Peter Zijlstra,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl,
hugh.dickins, Nick Piggin, Ingo Molnar
On Tue, 5 Jan 2010 15:09:47 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:
> My humble opinion is following as.
>
> Couldn't we synchronize rcu in that cases(munmap, exit and so on)?
> It can delay munap and exit but it would be better than handling them by more
> complicated things, I think. And both cases aren't often cases so we
> can achieve advantage than disadvantage?
>
In most cases, a program is single threaded. And synchronize_rcu() in the unmap
path just adds a very big overhead.
Thanks,
-Kame
* Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
2010-01-05 4:48 ` Linus Torvalds
@ 2010-01-05 6:09 ` Minchan Kim
2010-01-05 6:09 ` KAMEZAWA Hiroyuki
2010-01-05 8:35 ` Peter Zijlstra
0 siblings, 2 replies; 121+ messages in thread
From: Minchan Kim @ 2010-01-05 6:09 UTC (permalink / raw)
To: Linus Torvalds
Cc: KAMEZAWA Hiroyuki, Peter Zijlstra, Paul E. McKenney,
Peter Zijlstra, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
cl, hugh.dickins, Nick Piggin, Ingo Molnar
Hi, Linus.
On Tue, Jan 5, 2010 at 1:48 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Tue, 5 Jan 2010, Minchan Kim wrote:
>>
>> Isn't it protected by get_file and iget?
>
> When the vma is mapped, yes.
>
>> Am I miss something?
>
> remove_vma() will have done a
>
> fput(vma->vm_file);
>
> and other house-keeping (removing the executable info, doing
> vm_ops->close() etc).
>
> And that is _not_ done delayed by RCU, and as outlined in my previous
> email I think that if the code really _does_ delay it, then munmap() (and
> exit) need to wait for the RCU callbacks to have been done, because
> otherwise the file may end up being busy "asynchronously" in ways that
> break existing semantics.
>
> Just as an example: imagine a script that does "fork()+execve()" on a
> temporary file, and then after waiting for it all to finish with wait4()
> does some re-write of the file. It currently works. But what if the
> open-for-writing gets ETXTBUSY because the file is still marked as being
> VM_DENYWRITE, and RCU hasn't done all the callbacks?
>
> Or if you do the i_writecount handling synchronously (which is likely fine
> - it really is just for ETXTBUSY handling, and I don't think speculative
> page faults matter), what about a shutdown sequence (or whatever) that
> wants to unmount the filesystem, but the file is still open - as it has to
> be - because the actual close is delayed by RCU.
>
> So the patch-series as-is is fundamentally buggy - and trying to fix it
> seems painful.
>
> I'm also not entirely clear on how the race with page table tear-down vs
> page-fault got handled, but I didn't read the whole patch-series very
> carefully. I skimmed through it and got rather nervous about it all. It
> doesn't seem too large, but it _does_ seem rather cavalier about all the
> object lifetimes.
>
> Linus
>
Thanks for the careful explanation, Linus.
My humble opinion is as follows.
Couldn't we synchronize RCU in those cases (munmap, exit and so on)?
It can delay munmap and exit, but it would be better than handling them with
more complicated things, I think. And since both cases aren't frequent, the
advantage could outweigh the disadvantage?
--
Kind regards,
Minchan Kim
* Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
2010-01-05 6:09 ` KAMEZAWA Hiroyuki
@ 2010-01-05 6:24 ` Minchan Kim
0 siblings, 0 replies; 121+ messages in thread
From: Minchan Kim @ 2010-01-05 6:24 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Linus Torvalds, Peter Zijlstra, Paul E. McKenney, Peter Zijlstra,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl,
hugh.dickins, Nick Piggin, Ingo Molnar
On Tue, Jan 5, 2010 at 3:09 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 5 Jan 2010 15:09:47 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>> My humble opinion is following as.
>>
>> Couldn't we synchronize rcu in that cases(munmap, exit and so on)?
>> It can delay munap and exit but it would be better than handling them by more
>> complicated things, I think. And both cases aren't often cases so we
>> can achieve advantage than disadvantage?
>>
>
> In most case, a program is single threaded. And sychronize_rcu() in unmap path
> just adds very big overhead.
Yes.
I suggested that you consider single-threaded apps' regression, please. :)
The first thing that comes to my head is that we could count the number of threads.
Yes, the thread count is not a good choice.
As a matter of fact, I want it to work adaptively:
if the process starts to have many threads, speculative page faults turn
on or off.
I know it's not easy. I hope other guys have good ideas.
>
> Thanks,
> -Kame
>
>
--
Kind regards,
Minchan Kim
* Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
2010-01-05 5:30 ` KAMEZAWA Hiroyuki
@ 2010-01-05 7:39 ` KAMEZAWA Hiroyuki
2010-01-05 15:26 ` Linus Torvalds
2010-01-05 15:14 ` Christoph Lameter
1 sibling, 1 reply; 121+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-05 7:39 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Linus Torvalds, Minchan Kim, Peter Zijlstra, Paul E. McKenney,
Peter Zijlstra, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
cl, hugh.dickins, Nick Piggin, Ingo Molnar
[-- Attachment #1: Type: text/plain, Size: 13057 bytes --]
On Tue, 5 Jan 2010 14:30:46 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> ==
> Above codes are a bit heavy(and buggy). I have some fixes.
>
Here is my latest one. But I myself think this should be improved more,
and I don't think this patch is bug-free. I have some concerns
around fork() etc., and am still considering how to avoid atomic ops.
I post this just as a reference rather than a cut-out from old e-mails.
Here is a comparison before/after the patch. A test program does the following
in multiple threads (attached):
while (1) {
touch memory (with write)
pthread_barrier_wait().
fork() if I'm the first thread.
-> exit() immediately
wait()
pthread_barrier_wait()
}
And see the overheads with perf. This is the result on an 8-core/2-socket host.
It is highly affected by the number of sockets.
# Samples: 1164449628090
#
# Overhead Command Shared Object Symbol
# ........ ............... ........................ ......
#
43.23% multi-fault-all [kernel] [k] smp_invalidate_interrupt
16.27% multi-fault-all [kernel] [k] flush_tlb_others_ipi
11.55% multi-fault-all [kernel] [k] _raw_spin_lock_irqsave <========(*)
6.23% multi-fault-all [kernel] [k] intel_pmu_enable_all
2.17% multi-fault-all [kernel] [k] _raw_spin_unlock_irqrestore
1.59% multi-fault-all [kernel] [k] page_fault
1.45% multi-fault-all [kernel] [k] do_wp_page
1.35% multi-fault-all ./multi-fault-all-fork [.] worker
1.26% multi-fault-all [kernel] [k] handle_mm_fault
1.19% multi-fault-all [kernel] [k] _raw_spin_lock
1.03% multi-fault-all [kernel] [k] invalidate_interrupt7
1.00% multi-fault-all [kernel] [k] invalidate_interrupt6
0.99% multi-fault-all [kernel] [k] invalidate_interrupt3
# Samples: 181505050964
#
# Overhead Command Shared Object Symbol
# ........ ............... ........................ ......
#
45.08% multi-fault-all [kernel] [k] smp_invalidate_interrupt
19.45% multi-fault-all [kernel] [k] intel_pmu_enable_all
14.17% multi-fault-all [kernel] [k] flush_tlb_others_ipi
1.89% multi-fault-all [kernel] [k] do_wp_page
1.58% multi-fault-all [kernel] [k] page_fault
1.46% multi-fault-all ./multi-fault-all-fork [.] worker
1.26% multi-fault-all [kernel] [k] do_page_fault
1.14% multi-fault-all [kernel] [k] _raw_spin_lock
1.10% multi-fault-all [kernel] [k] flush_tlb_page
1.09% multi-fault-all [kernel] [k] vma_put <========(**)
0.99% multi-fault-all [kernel] [k] invalidate_interrupt0
0.98% multi-fault-all [kernel] [k] find_vma_speculative <=====(**)
0.81% multi-fault-all [kernel] [k] invalidate_interrupt3
0.81% multi-fault-all [kernel] [k] native_apic_mem_write
0.79% multi-fault-all [kernel] [k] invalidate_interrupt7
0.76% multi-fault-all [kernel] [k] invalidate_interrupt5
(*) is the removed overhead of mmap_sem and (**) is the added overhead of the atomic ops.
And yes, 1% still does not seem ideally small. And it's unknown how this false sharing
of the atomic ops can affect very big SMP systems..... maybe a lot.
BTW, I'm not sure why intel_pmu_enable_all() is so big... because of fork()?
My patch is below, just as a reference to my work from last year; it is not
very clean yet. I need more time for improvements (and I don't adhere to
this particular implementation.)
When you test this, please turn off SPINLOCK_DEBUG to enable the split page table lock.
Regards,
-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Asynchronous page fault.
This patch is for avoiding mmap_sem in the usual page fault path. When running
highly multi-threaded programs, mm->mmap_sem can consume much CPU because of
false sharing when page faults happen in parallel. (Right after fork() is a
typical case, I think.)
This patch uses a speculative vma lookup to reduce that cost.
Considering the vma lookup, i.e. the rb-tree lookup, the only operation we do is
checking node->rb_left/rb_right. There are no complicated operations.
At page fault time, there is usually no need to access the sorted vma list or
the prev/next vma. Except for stack expansion, we always need the vma
which contains the page-fault address. So we can access the vma RB-tree in a
speculative way.
Even if an RB-tree rotation occurs while we walk the tree for the lookup, we just
miss the vma without an oops. In other words, we can _try_ to find the vma in a
lockless manner. If that fails, a retry is ok.... we take the lock and access the vma.
For lockless walking, this uses RCU and adds find_vma_speculative(), plus a
per-vma wait queue and reference count. This refcnt+wait_queue combination
guarantees that no thread is accessing the vma when we call the subsystem's
unmap functions.
Changelog:
- removed rcu_xxx macros.
- fixed reference count handling bug at vma_put().
- removed atomic_add_unless() etc. But use Big number for handling race.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
arch/x86/mm/fault.c | 14 ++++++
include/linux/mm.h | 14 ++++++
include/linux/mm_types.h | 5 ++
mm/mmap.c | 98 +++++++++++++++++++++++++++++++++++++++++++++--
4 files changed, 127 insertions(+), 4 deletions(-)
Index: linux-2.6.33-rc2/include/linux/mm_types.h
===================================================================
--- linux-2.6.33-rc2.orig/include/linux/mm_types.h
+++ linux-2.6.33-rc2/include/linux/mm_types.h
@@ -11,6 +11,7 @@
#include <linux/rwsem.h>
#include <linux/completion.h>
#include <linux/cpumask.h>
+#include <linux/rcupdate.h>
#include <linux/page-debug-flags.h>
#include <asm/page.h>
#include <asm/mmu.h>
@@ -180,6 +181,10 @@ struct vm_area_struct {
void * vm_private_data; /* was vm_pte (shared mem) */
unsigned long vm_truncate_count;/* truncate_count or restart_addr */
+ atomic_t refcnt;
+ wait_queue_head_t wait_queue;
+ struct rcu_head rcuhead;
+
#ifndef CONFIG_MMU
struct vm_region *vm_region; /* NOMMU mapping region */
#endif
Index: linux-2.6.33-rc2/mm/mmap.c
===================================================================
--- linux-2.6.33-rc2.orig/mm/mmap.c
+++ linux-2.6.33-rc2/mm/mmap.c
@@ -187,6 +187,26 @@ error:
return -ENOMEM;
}
+static void __free_vma_rcu_cb(struct rcu_head *head)
+{
+ struct vm_area_struct *vma;
+ vma = container_of(head, struct vm_area_struct, rcuhead);
+ kmem_cache_free(vm_area_cachep, vma);
+}
+/* Call this if vma was linked to rb-tree */
+static void free_vma_rcu(struct vm_area_struct *vma)
+{
+ call_rcu(&vma->rcuhead, __free_vma_rcu_cb);
+}
+#define VMA_FREE_MAGIC (10000000)
+/* called when vma is unlinked and wait for all racy access.*/
+static void invalidate_vma_before_free(struct vm_area_struct *vma)
+{
+ atomic_sub(VMA_FREE_MAGIC, &vma->refcnt);
+ wait_event(vma->wait_queue,
+ (atomic_read(&vma->refcnt) == -VMA_FREE_MAGIC));
+}
+
/*
* Requires inode->i_mapping->i_mmap_lock
*/
@@ -238,7 +258,7 @@ static struct vm_area_struct *remove_vma
removed_exe_file_vma(vma->vm_mm);
}
mpol_put(vma_policy(vma));
- kmem_cache_free(vm_area_cachep, vma);
+ free_vma_rcu(vma);
return next;
}
@@ -404,8 +424,12 @@ __vma_link_list(struct mm_struct *mm, st
void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
struct rb_node **rb_link, struct rb_node *rb_parent)
{
+ atomic_set(&vma->refcnt, 0);
+ init_waitqueue_head(&vma->wait_queue);
rb_link_node(&vma->vm_rb, rb_parent, rb_link);
rb_insert_color(&vma->vm_rb, &mm->mm_rb);
+ /* For speculative vma lookup */
+ smp_wmb();
}
static void __vma_link_file(struct vm_area_struct *vma)
@@ -490,6 +514,7 @@ __vma_unlink(struct mm_struct *mm, struc
rb_erase(&vma->vm_rb, &mm->mm_rb);
if (mm->mmap_cache == vma)
mm->mmap_cache = prev;
+ smp_wmb();
}
/*
@@ -614,6 +639,7 @@ again: remove_next = 1 + (end > next->
* us to remove next before dropping the locks.
*/
__vma_unlink(mm, next, vma);
+ invalidate_vma_before_free(next);
if (file)
__remove_shared_vm_struct(next, file, mapping);
if (next->anon_vma)
@@ -640,7 +666,7 @@ again: remove_next = 1 + (end > next->
}
mm->map_count--;
mpol_put(vma_policy(next));
- kmem_cache_free(vm_area_cachep, next);
+ free_vma_rcu(next);
/*
* In mprotect's case 6 (see comments on vma_merge),
* we must remove another next too. It would clutter
@@ -1544,6 +1570,71 @@ out:
}
/*
+ * Returns the vma which contains the given address. This scans the rb-tree in a
+ * speculative way and increments a reference count if found. Even if the vma
+ * exists in the rb-tree, this function may return NULL in a racy case. So this
+ * function cannot be used for checking whether a given address is valid or not.
+ */
+
+struct vm_area_struct *
+find_vma_speculative(struct mm_struct *mm, unsigned long addr)
+{
+ struct vm_area_struct *vma = NULL;
+ struct vm_area_struct *vma_tmp;
+ struct rb_node *rb_node;
+
+ if (unlikely(!mm))
+		return NULL;
+
+ rcu_read_lock();
+ /*
+	 * Barrier against modification of the rb-tree.
+	 * An rb-tree update is not an atomic op, and no barrier is used while
+	 * modifying it, so updates to the rb-tree can be reordered. This
+	 * memory barrier pairs with vma_(un)link_rb() so that we do not read
+	 * data that is too old to reflect the changes made before rcu_read_lock.
+	 *
+	 * We may see a broken RB-tree and fail to find an existing vma. But that's ok.
+	 * We are allowed to return NULL even if a valid one exists. The caller will
+	 * then use find_vma() under the read semaphore.
+ */
+
+ smp_read_barrier_depends();
+ rb_node = mm->mm_rb.rb_node;
+ vma = NULL;
+ while (rb_node) {
+ vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
+
+ if (vma_tmp->vm_end > addr) {
+ vma = vma_tmp;
+ if (vma_tmp->vm_start <= addr)
+ break;
+ rb_node = rb_node->rb_left;
+ } else
+ rb_node = rb_node->rb_right;
+ }
+ if (vma) {
+ if ((vma->vm_start <= addr) && (addr < vma->vm_end)) {
+ if (atomic_inc_return(&vma->refcnt) < 0) {
+ vma_put(vma);
+ vma = NULL;
+ }
+ } else
+ vma = NULL;
+ }
+ rcu_read_unlock();
+ return vma;
+}
+
+void vma_put(struct vm_area_struct *vma)
+{
+ if ((atomic_dec_return(&vma->refcnt) == -VMA_FREE_MAGIC) &&
+ waitqueue_active(&vma->wait_queue))
+ wake_up(&vma->wait_queue);
+ return;
+}
+
+/*
* Verify that the stack growth is acceptable and
* update accounting. This is shared with both the
* grow-up and grow-down cases.
@@ -1808,6 +1899,7 @@ detach_vmas_to_be_unmapped(struct mm_str
insertion_point = (prev ? &prev->vm_next : &mm->mmap);
do {
rb_erase(&vma->vm_rb, &mm->mm_rb);
+ invalidate_vma_before_free(vma);
mm->map_count--;
tail_vma = vma;
vma = vma->vm_next;
Index: linux-2.6.33-rc2/include/linux/mm.h
===================================================================
--- linux-2.6.33-rc2.orig/include/linux/mm.h
+++ linux-2.6.33-rc2/include/linux/mm.h
@@ -1235,6 +1235,20 @@ static inline struct vm_area_struct * fi
return vma;
}
+/*
+ * Look up the vma for a given address in a speculative way. This allows lockless
+ * lookup of vmas, but in a racy case the vma may not be found. You have to call
+ * find_vma() under the read lock of mm->mmap_sem if this function returns NULL.
+ * The returned vma's reference count is incremented to show the vma is being
+ * accessed asynchronously; the caller has to call vma_put().
+ *
+ * Unlike find_vma(), this returns a vma which contains specified address.
+ * This doesn't return the nearest vma.
+ */
+extern struct vm_area_struct *find_vma_speculative(struct mm_struct *mm,
+ unsigned long addr);
+void vma_put(struct vm_area_struct *vma);
+
static inline unsigned long vma_pages(struct vm_area_struct *vma)
{
return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
Index: linux-2.6.33-rc2/arch/x86/mm/fault.c
===================================================================
--- linux-2.6.33-rc2.orig/arch/x86/mm/fault.c
+++ linux-2.6.33-rc2/arch/x86/mm/fault.c
@@ -952,6 +952,7 @@ do_page_fault(struct pt_regs *regs, unsi
struct mm_struct *mm;
int write;
int fault;
+ int speculative = 0;
tsk = current;
mm = tsk->mm;
@@ -1040,6 +1041,14 @@ do_page_fault(struct pt_regs *regs, unsi
return;
}
+ if ((error_code & PF_USER)) {
+ vma = find_vma_speculative(mm, address);
+ if (vma) {
+ speculative = 1;
+ goto good_area;
+ }
+ }
+
/*
* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in
@@ -1136,5 +1145,8 @@ good_area:
check_v8086_mode(regs, address, tsk);
- up_read(&mm->mmap_sem);
+ if (speculative)
+ vma_put(vma);
+ else
+ up_read(&mm->mmap_sem);
}
[-- Attachment #2: multi-fault-all-fork.c --]
[-- Type: text/x-csrc, Size: 1861 bytes --]
/*
* multi-fault.c :: causes 60secs of parallel page fault in multi-thread.
* % gcc -O2 -o multi-fault multi-fault.c -lpthread
* % multi-fault # of cpus.
*/
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <signal.h>
#define NR_THREADS 8
pthread_t threads[NR_THREADS];
/*
 * To avoid contention on the page table lock, the FAULT area is
 * sparse. If FAULT_LENGTH is too large for your cpus, decrease it.
*/
#define MMAP_LENGTH (8 * 1024 * 1024)
#define FAULT_LENGTH (2 * 1024 * 1024)
void *mmap_area[NR_THREADS];
#define PAGE_SIZE 4096
pthread_barrier_t barrier;
int name[NR_THREADS];
void segv_handler(int sig)
{
sleep(100);
}
void *worker(void *data)
{
cpu_set_t set;
int cpu;
int status;
cpu = *(int *)data;
CPU_ZERO(&set);
CPU_SET(cpu, &set);
sched_setaffinity(0, sizeof(set), &set);
while (1) {
char *c;
char *start = mmap_area[cpu];
char *end = mmap_area[cpu] + FAULT_LENGTH;
pthread_barrier_wait(&barrier);
//printf("fault into %p-%p\n",start, end);
for (c = start; c < end; c += PAGE_SIZE)
*c = 0;
pthread_barrier_wait(&barrier);
if ((cpu == 0) && !fork())
exit(0);
wait(&status);
pthread_barrier_wait(&barrier);
}
return NULL;
}
int main(int argc, char *argv[])
{
int i, ret;
unsigned int num;
if (argc < 2)
return 0;
num = atoi(argv[1]);
pthread_barrier_init(&barrier, NULL, num);
mmap_area[0] = mmap(NULL, MMAP_LENGTH * num, PROT_WRITE|PROT_READ,
MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
for (i = 1; i < num; i++) {
mmap_area[i] = mmap_area[i - 1]+ MMAP_LENGTH;
}
for (i = 0; i < num; ++i) {
name[i] = i;
ret = pthread_create(&threads[i], NULL, worker, &name[i]);
if (ret < 0) {
perror("pthread create");
return 0;
}
}
sleep(60);
return 0;
}
* Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
2010-01-05 3:13 ` Linus Torvalds
@ 2010-01-05 8:17 ` Peter Zijlstra
2010-01-05 8:57 ` Peter Zijlstra
2010-01-05 9:37 ` Peter Zijlstra
2 siblings, 0 replies; 121+ messages in thread
From: Peter Zijlstra @ 2010-01-05 8:17 UTC (permalink / raw)
To: Linus Torvalds
Cc: KAMEZAWA Hiroyuki, Paul E. McKenney, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, minchan.kim@gmail.com, cl, hugh.dickins,
Nick Piggin, Ingo Molnar
On Mon, 2010-01-04 at 19:13 -0800, Linus Torvalds wrote:
> I would say that this whole series is _very_ far from being mergeable.
Never said anything else ;-)
> Peter seems to have been thinking about the details, while missing all the
> subtle big picture effects that seem to actually change semantics.
Yes, I seem to have missed a rather significant side of the whole
issue.. back to the drawing room for me then.
* Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
2010-01-05 5:10 ` Linus Torvalds
2010-01-05 5:30 ` KAMEZAWA Hiroyuki
@ 2010-01-05 8:18 ` Peter Zijlstra
1 sibling, 0 replies; 121+ messages in thread
From: Peter Zijlstra @ 2010-01-05 8:18 UTC (permalink / raw)
To: Linus Torvalds
Cc: KAMEZAWA Hiroyuki, Minchan Kim, Paul E. McKenney,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl,
hugh.dickins, Nick Piggin, Ingo Molnar
On Mon, 2010-01-04 at 21:10 -0800, Linus Torvalds wrote:
> Sounds doable. But it also sounds way more expensive than the current VM
> fault handling, which is pretty close to optimal for single-threaded
> cases.. That RCU lookup might be cheap, but just the refcount is generally
> going to be as expensive as a lock.
Right, that refcount adds two atomic ops; the only grace it has is that
it's in the vma as opposed to the mm, but there are plenty of workloads that
concentrate on a single vma, in which case you get a cacheline just as
contended as the mmap_sem.
I was trying to avoid having to have that refcount, but then I sorta
forgot about the actual fault handlers also poking at the vma :/
* Re: [RFC][PATCH 4/8] mm: RCU free vmas
2010-01-05 2:43 ` Paul E. McKenney
@ 2010-01-05 8:28 ` Peter Zijlstra
2010-01-05 16:05 ` Paul E. McKenney
0 siblings, 1 reply; 121+ messages in thread
From: Peter Zijlstra @ 2010-01-05 8:28 UTC (permalink / raw)
To: paulmck
Cc: KAMEZAWA Hiroyuki, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, minchan.kim@gmail.com, cl, hugh.dickins,
Nick Piggin, Ingo Molnar, Linus Torvalds
On Mon, 2010-01-04 at 18:43 -0800, Paul E. McKenney wrote:
> On Mon, Jan 04, 2010 at 07:24:33PM +0100, Peter Zijlstra wrote:
> > TODO:
> > - should be SRCU, lack of call_srcu()
> >
> > In order to allow speculative vma lookups, RCU free the struct
> > vm_area_struct.
> >
> > We use two means of detecting a vma is still valid:
> > - firstly, we set RB_CLEAR_NODE once we remove a vma from the tree.
> > - secondly, we check the vma sequence number.
> >
> > These two things combined will guarantee that 1) the vma is still
> > present and two, it still covers the same range from when we looked it
> > up.
>
> OK, I think I see what you are up to here. I could get you a very crude
> throw-away call_srcu() fairly quickly. I don't yet have a good estimate
> of how long it will take me to merge SRCU into the treercu infrastructure,
> but am getting there.
>
> So, which release are you thinking in terms of?
I'm not thinking of any release yet; it's very early, and as Linus has
pointed out, I seem to have forgotten a rather big piece of the
picture :/
So I need to try and fix this glaring hole before we can continue.
* Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
2010-01-05 6:09 ` Minchan Kim
2010-01-05 6:09 ` KAMEZAWA Hiroyuki
@ 2010-01-05 8:35 ` Peter Zijlstra
1 sibling, 0 replies; 121+ messages in thread
From: Peter Zijlstra @ 2010-01-05 8:35 UTC (permalink / raw)
To: Minchan Kim
Cc: Linus Torvalds, KAMEZAWA Hiroyuki, Paul E. McKenney,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl,
hugh.dickins, Nick Piggin, Ingo Molnar
On Tue, 2010-01-05 at 15:09 +0900, Minchan Kim wrote:
> Couldn't we synchronize rcu in that cases(munmap, exit and so on)?
> It can delay munap and exit but it would be better than handling them by more
> complicated things, I think. And both cases aren't often cases so we
> can achieve advantage than disadvantage?
Sadly there are programs that mmap()/munmap() at a staggering rate
(clamav comes to mind), so munmap() performance is important too.
* Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
2010-01-05 3:13 ` Linus Torvalds
2010-01-05 8:17 ` Peter Zijlstra
@ 2010-01-05 8:57 ` Peter Zijlstra
2010-01-05 15:34 ` Linus Torvalds
2010-01-05 9:37 ` Peter Zijlstra
2 siblings, 1 reply; 121+ messages in thread
From: Peter Zijlstra @ 2010-01-05 8:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: KAMEZAWA Hiroyuki, Paul E. McKenney, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, minchan.kim@gmail.com, cl, hugh.dickins,
Nick Piggin, Ingo Molnar
On Mon, 2010-01-04 at 19:13 -0800, Linus Torvalds wrote:
> Or put another way: if the vma was a writable mapping, a user may do
>
> munmap(mapping, size);
>
> and the backing file is still active and writable AFTER THE MUNMAP! This
> can be a huge problem for something that wants to unmount the volume, for
> example, or depends on the whole writability-vs-executability thing. The
> user may have unmapped it, and expects the file to be immediately
> non-busy, but with the delayed free that isn't the case any more.
If it were only unmount, it would be rather easy to fix by putting that
RCU synchronization in unmount; unmount does a lot of sync things
anyway. But I suspect there are more cases where that non-busy property matters
(but I'd