* [PATCH v12 04/31] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
From: Laurent Dufour @ 2019-04-16 13:44 UTC
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
From: Mahendran Ganesh <opensource.ganesh@gmail.com>
Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for arm64. This
enables the speculative page fault handler.
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
---
arch/arm64/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 870ef86a64ed..8e86934d598b 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -174,6 +174,7 @@ config ARM64
select SWIOTLB
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
+ select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
help
ARM 64-bit (AArch64) Linux support.
--
2.21.0
* [PATCH v12 28/31] x86/mm: add speculative pagefault handling
From: Laurent Dufour @ 2019-04-16 13:45 UTC
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
From: Peter Zijlstra <peterz@infradead.org>
Try a speculative fault before acquiring mmap_sem. If it returns
VM_FAULT_RETRY, continue with the mmap_sem acquisition and do the
traditional fault.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Clearing of FAULT_FLAG_ALLOW_RETRY is now done in
handle_speculative_fault()]
[Retry with usual fault path in the case VM_ERROR is returned by
handle_speculative_fault(). This allows signal to be delivered]
[Don't build SPF call if !CONFIG_SPECULATIVE_PAGE_FAULT]
[Handle memory protection key fault]
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
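The control flow described in the commit message above can be sketched in userspace C. This is only a model under stated assumptions: the enum values, `struct fault_stats`, and the `speculative_fault()`/`locked_fault()` helpers are hypothetical stand-ins, not kernel definitions. It shows the lockless attempt coming first, protection-key faults skipping it, and the locked path running only when a retry is requested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's fault codes; values are illustrative. */
enum fault_result { FAULT_OK = 0, FAULT_RETRY = 1, FAULT_ERROR = 2 };

struct fault_stats {
	unsigned int speculative;	/* faults resolved without the lock */
	unsigned int locked;		/* faults that used the locked path */
};

/* Model of the speculative handler: it gives up if the VMA was not stable. */
static enum fault_result speculative_fault(bool vma_stable)
{
	return vma_stable ? FAULT_OK : FAULT_RETRY;
}

/* Model of the traditional path, which would take mmap_sem first. */
static enum fault_result locked_fault(void)
{
	return FAULT_OK;
}

/* Try the lockless path first, unless the fault is a protection-key fault. */
enum fault_result handle_fault(bool pkey_fault, bool vma_stable,
			       struct fault_stats *stats)
{
	enum fault_result ret;

	assert(stats != NULL);

	if (!pkey_fault) {
		ret = speculative_fault(vma_stable);
		if (ret != FAULT_RETRY) {
			stats->speculative++;	/* mirrors the perf event */
			return ret;
		}
	}

	stats->locked++;
	return locked_fault();
}
```

In this sketch any result other than a retry ends the fault, which matches the `goto done` in the diff below.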
arch/x86/mm/fault.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 667f1da36208..4390d207a7a1 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1401,6 +1401,18 @@ void do_user_addr_fault(struct pt_regs *regs,
}
#endif
+ /*
+ * Do not try to do a speculative page fault if the fault was due to
+ * protection keys since it can't be resolved.
+ */
+ if (!(hw_error_code & X86_PF_PK)) {
+ fault = handle_speculative_fault(mm, address, flags);
+ if (fault != VM_FAULT_RETRY) {
+ perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, address);
+ goto done;
+ }
+ }
+
/*
* Kernel-mode access to the user address space should only occur
* on well-defined single instructions listed in the exception
@@ -1499,6 +1511,8 @@ void do_user_addr_fault(struct pt_regs *regs,
}
up_read(&mm->mmap_sem);
+
+done:
if (unlikely(fault & VM_FAULT_ERROR)) {
mm_fault_error(regs, hw_error_code, address, fault);
return;
--
2.21.0
* [PATCH v12 11/31] mm: protect mremap() against SPF handler
From: Laurent Dufour @ 2019-04-16 13:45 UTC
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
If a thread is remapping an area while another one is faulting on the
destination area, the SPF handler may fetch the vma from the RB tree before
the ptes have been moved by the other thread. This means that the moved ptes
will overwrite those created by the page fault handler, leading to leaked
pages.
    CPU 1                           CPU2
    enter mremap()
    unmap the dest area
    copy_vma()                      Enter speculative page fault handler
       >> at this time the dest area is present in the RB tree
                                    fetch the vma matching dest area
                                    create a pte as the VMA matched
                                    Exit the SPF handler
                                    <data written in the new page>
    move_ptes()
      > it is assumed that the dest area is empty,
      > the move ptes overwrite the page mapped by the CPU2.
To prevent that, when the VMA matching the dest area is extended or created
by copy_vma(), it should be marked as unavailable to the SPF handler.
The usual way to do so is to rely on vm_write_begin()/end().
This is already done in __vma_adjust(), called by copy_vma() (through
vma_merge()). But __vma_adjust() calls vm_write_end() before returning,
which creates a window for another thread.
This patch introduces __vma_merge(), which takes a new keep_locked
parameter and passes it down to __vma_adjust().
The assumption is that copy_vma() returns a vma which should be
released by the caller, by calling vm_raw_write_end(), once the ptes have
been moved.
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
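The protection scheme described above can be sketched as a small userspace model. It assumes that vm_raw_write_begin()/vm_raw_write_end() behave like opening and closing a sequence-count write section, which is not shown in this patch. All `model_*` names and `struct vma_model` are hypothetical. The point it illustrates is that the destination stays protected from copy_vma() until the caller has moved the ptes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical VMA model: an odd sequence value means a writer is active. */
struct vma_model {
	unsigned int seq;
	int ptes_mapped;	/* number of pages mapped in the area */
};

void model_write_begin(struct vma_model *vma)
{
	assert((vma->seq & 1) == 0);	/* no nested write sections here */
	vma->seq++;
}

void model_write_end(struct vma_model *vma)
{
	assert((vma->seq & 1) == 1);
	vma->seq++;
}

/* Speculative fault: refuse to map a page while the VMA is being modified. */
bool model_speculative_fault(struct vma_model *vma)
{
	if (vma->seq & 1)
		return false;	/* the caller falls back to the locked path */
	vma->ptes_mapped++;
	return true;
}

/* Model of copy_vma() with keep_locked: returns with the section open. */
void model_copy_vma(struct vma_model *dst)
{
	model_write_begin(dst);
}

/* Model of move_vma(): both areas stay protected while ptes are moved. */
void model_move_vma(struct vma_model *src, struct vma_model *dst)
{
	model_copy_vma(dst);
	model_write_begin(src);
	dst->ptes_mapped += src->ptes_mapped;	/* stands in for move_ptes() */
	src->ptes_mapped = 0;
	model_write_end(src);
	model_write_end(dst);	/* the caller, not copy_vma(), releases dst */
}
```

The model is single-threaded on purpose: it only demonstrates which party opens and closes the write section, not the memory ordering a real implementation needs.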
include/linux/mm.h | 24 ++++++++++++++++-----
mm/mmap.c | 53 +++++++++++++++++++++++++++++++++++-----------
mm/mremap.c | 13 ++++++++++++
3 files changed, 73 insertions(+), 17 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 906b9e06f18e..5d45b7d8718d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2343,18 +2343,32 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
/* mmap.c */
extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
+
extern int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
- struct vm_area_struct *expand);
+ struct vm_area_struct *expand, bool keep_locked);
+
static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
{
- return __vma_adjust(vma, start, end, pgoff, insert, NULL);
+ return __vma_adjust(vma, start, end, pgoff, insert, NULL, false);
}
-extern struct vm_area_struct *vma_merge(struct mm_struct *,
+
+extern struct vm_area_struct *__vma_merge(struct mm_struct *mm,
+ struct vm_area_struct *prev, unsigned long addr, unsigned long end,
+ unsigned long vm_flags, struct anon_vma *anon, struct file *file,
+ pgoff_t pgoff, struct mempolicy *mpol,
+ struct vm_userfaultfd_ctx uff, bool keep_locked);
+
+static inline struct vm_area_struct *vma_merge(struct mm_struct *mm,
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
- unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
- struct mempolicy *, struct vm_userfaultfd_ctx);
+ unsigned long vm_flags, struct anon_vma *anon, struct file *file,
+ pgoff_t off, struct mempolicy *pol, struct vm_userfaultfd_ctx uff)
+{
+ return __vma_merge(mm, prev, addr, end, vm_flags, anon, file, off,
+ pol, uff, false);
+}
+
extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
unsigned long addr, int new_below);
diff --git a/mm/mmap.c b/mm/mmap.c
index b77ec0149249..13460b38b0fb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -714,7 +714,7 @@ static inline void __vma_unlink_prev(struct mm_struct *mm,
*/
int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
- struct vm_area_struct *expand)
+ struct vm_area_struct *expand, bool keep_locked)
{
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *next = vma->vm_next, *orig_vma = vma;
@@ -830,8 +830,12 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
importer->anon_vma = exporter->anon_vma;
error = anon_vma_clone(importer, exporter);
- if (error)
+ if (error) {
+ if (next && next != vma)
+ vm_raw_write_end(next);
+ vm_raw_write_end(vma);
return error;
+ }
}
}
again:
@@ -1025,7 +1029,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
if (next && next != vma)
vm_raw_write_end(next);
- vm_raw_write_end(vma);
+ if (!keep_locked)
+ vm_raw_write_end(vma);
validate_mm(mm);
@@ -1161,12 +1166,13 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
* parameter) may establish ptes with the wrong permissions of NNNN
* instead of the right permissions of XXXX.
*/
-struct vm_area_struct *vma_merge(struct mm_struct *mm,
+struct vm_area_struct *__vma_merge(struct mm_struct *mm,
struct vm_area_struct *prev, unsigned long addr,
unsigned long end, unsigned long vm_flags,
struct anon_vma *anon_vma, struct file *file,
pgoff_t pgoff, struct mempolicy *policy,
- struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+ bool keep_locked)
{
pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
struct vm_area_struct *area, *next;
@@ -1214,10 +1220,11 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
/* cases 1, 6 */
err = __vma_adjust(prev, prev->vm_start,
next->vm_end, prev->vm_pgoff, NULL,
- prev);
+ prev, keep_locked);
} else /* cases 2, 5, 7 */
err = __vma_adjust(prev, prev->vm_start,
- end, prev->vm_pgoff, NULL, prev);
+ end, prev->vm_pgoff, NULL, prev,
+ keep_locked);
if (err)
return NULL;
khugepaged_enter_vma_merge(prev, vm_flags);
@@ -1234,10 +1241,12 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
vm_userfaultfd_ctx)) {
if (prev && addr < prev->vm_end) /* case 4 */
err = __vma_adjust(prev, prev->vm_start,
- addr, prev->vm_pgoff, NULL, next);
+ addr, prev->vm_pgoff, NULL, next,
+ keep_locked);
else { /* cases 3, 8 */
err = __vma_adjust(area, addr, next->vm_end,
- next->vm_pgoff - pglen, NULL, next);
+ next->vm_pgoff - pglen, NULL, next,
+ keep_locked);
/*
* In case 3 area is already equal to next and
* this is a noop, but in case 8 "area" has
@@ -3259,9 +3268,20 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
return NULL; /* should never get here */
- new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
- vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
- vma->vm_userfaultfd_ctx);
+
+ /* There is 3 cases to manage here in
+ * AAAA AAAA AAAA AAAA
+ * PPPP.... PPPP......NNNN PPPP....NNNN PP........NN
+ * PPPPPPPP(A) PPPP..NNNNNNNN(B) PPPPPPPPPPPP(1) NULL
+ * PPPPPPPPNNNN(2)
+ * PPPPNNNNNNNN(3)
+ *
+ * new_vma == prev in case A,1,2
+ * new_vma == next in case B,3
+ */
+ new_vma = __vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
+ vma->anon_vma, vma->vm_file, pgoff,
+ vma_policy(vma), vma->vm_userfaultfd_ctx, true);
if (new_vma) {
/*
* Source vma may have been merged into new_vma
@@ -3299,6 +3319,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
get_file(new_vma->vm_file);
if (new_vma->vm_ops && new_vma->vm_ops->open)
new_vma->vm_ops->open(new_vma);
+ /*
+ * As the VMA is linked right now, it may be hit by the
+ * speculative page fault handler. But we don't want it to
+ * to start mapping page in this area until the caller has
+ * potentially move the pte from the moved VMA. To prevent
+ * that we protect it right now, and let the caller unprotect
+ * it once the move is done.
+ */
+ vm_raw_write_begin(new_vma);
vma_link(mm, new_vma, prev, rb_link, rb_parent);
*need_rmap_locks = false;
}
diff --git a/mm/mremap.c b/mm/mremap.c
index fc241d23cd97..ae5c3379586e 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -357,6 +357,14 @@ static unsigned long move_vma(struct vm_area_struct *vma,
if (!new_vma)
return -ENOMEM;
+ /* new_vma is returned protected by copy_vma, to prevent speculative
+ * page fault to be done in the destination area before we move the pte.
+ * Now, we must also protect the source VMA since we don't want pages
+ * to be mapped in our back while we are copying the PTEs.
+ */
+ if (vma != new_vma)
+ vm_raw_write_begin(vma);
+
moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
need_rmap_locks);
if (moved_len < old_len) {
@@ -373,6 +381,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
*/
move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
true);
+ if (vma != new_vma)
+ vm_raw_write_end(vma);
vma = new_vma;
old_len = new_len;
old_addr = new_addr;
@@ -381,7 +391,10 @@ static unsigned long move_vma(struct vm_area_struct *vma,
mremap_userfaultfd_prep(new_vma, uf);
arch_remap(mm, old_addr, old_addr + old_len,
new_addr, new_addr + new_len);
+ if (vma != new_vma)
+ vm_raw_write_end(vma);
}
+ vm_raw_write_end(new_vma);
/* Conceal VM_ACCOUNT so old reservation is not undone */
if (vm_flags & VM_ACCOUNT) {
--
2.21.0
* [PATCH v12 15/31] mm: introduce __lru_cache_add_active_or_unevictable
From: Laurent Dufour @ 2019-04-16 13:45 UTC
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
The speculative page fault handler, which runs without holding the
mmap_sem, calls lru_cache_add_active_or_unevictable(), but the vm_flags
value is not guaranteed to remain constant.
Introduce __lru_cache_add_active_or_unevictable(), which takes the vma flags
value as a parameter instead of the vma pointer.
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
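A minimal sketch of why a flags value is passed rather than the vma pointer. The `MODEL_*` flag values, `struct vma_model`, and both function names are hypothetical and do not match the kernel's definitions. The decision is made from a snapshot taken earlier, so a concurrent change to the flags cannot alter it halfway through the fault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values; the kernel's VM_LOCKED and VM_SPECIAL differ. */
#define MODEL_VM_LOCKED		0x1UL
#define MODEL_VM_SPECIAL	0x2UL

struct vma_model {
	unsigned long vm_flags;
};

/* Flag-based variant: decides from a value the caller captured earlier. */
bool model_goes_active_flags(unsigned long vma_flags)
{
	return (vma_flags & (MODEL_VM_LOCKED | MODEL_VM_SPECIAL)) !=
		MODEL_VM_LOCKED;
}

/* Pointer-based wrapper: reads the flags at call time, as callers holding
 * the lock can safely do. */
bool model_goes_active(const struct vma_model *vma)
{
	assert(vma != NULL);
	return model_goes_active_flags(vma->vm_flags);
}
```

A speculative caller would sample the flags once at the start of the fault and hand that same value to every later helper.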
include/linux/swap.h | 10 ++++++++--
mm/memory.c | 8 ++++----
mm/swap.c | 6 +++---
3 files changed, 15 insertions(+), 9 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bfb5c4ac108..d33b94eb3c69 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -343,8 +343,14 @@ extern void deactivate_file_page(struct page *page);
extern void mark_page_lazyfree(struct page *page);
extern void swap_setup(void);
-extern void lru_cache_add_active_or_unevictable(struct page *page,
- struct vm_area_struct *vma);
+extern void __lru_cache_add_active_or_unevictable(struct page *page,
+ unsigned long vma_flags);
+
+static inline void lru_cache_add_active_or_unevictable(struct page *page,
+ struct vm_area_struct *vma)
+{
+ return __lru_cache_add_active_or_unevictable(page, vma->vm_flags);
+}
/* linux/mm/vmscan.c */
extern unsigned long zone_reclaimable_pages(struct zone *zone);
diff --git a/mm/memory.c b/mm/memory.c
index 56802850e72c..85ec5ce5c0a8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2347,7 +2347,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
page_add_new_anon_rmap(new_page, vma, vmf->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
- lru_cache_add_active_or_unevictable(new_page, vma);
+ __lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
/*
* We call the notify macro here because, when using secondary
* mmu page tables (such as kvm shadow page tables), we want the
@@ -2896,7 +2896,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (unlikely(page != swapcache && swapcache)) {
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
- lru_cache_add_active_or_unevictable(page, vma);
+ __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
mem_cgroup_commit_charge(page, memcg, true, false);
@@ -3048,7 +3048,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
- lru_cache_add_active_or_unevictable(page, vma);
+ __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
setpte:
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
@@ -3327,7 +3327,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
- lru_cache_add_active_or_unevictable(page, vma);
+ __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page, false);
diff --git a/mm/swap.c b/mm/swap.c
index 3a75722e68a9..a55f0505b563 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -450,12 +450,12 @@ void lru_cache_add(struct page *page)
* directly back onto it's zone's unevictable list, it does NOT use a
* per cpu pagevec.
*/
-void lru_cache_add_active_or_unevictable(struct page *page,
- struct vm_area_struct *vma)
+void __lru_cache_add_active_or_unevictable(struct page *page,
+ unsigned long vma_flags)
{
VM_BUG_ON_PAGE(PageLRU(page), page);
- if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
+ if (likely((vma_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
SetPageActive(page);
else if (!TestSetPageMlocked(page)) {
/*
--
2.21.0
* [PATCH v12 19/31] mm: protect the RB tree with a sequence lock
From: Laurent Dufour @ 2019-04-16 13:45 UTC
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
Introduce a per-mm_struct seqlock, the mm_seq field, to protect the changes
made to the MM RB tree. This allows walking the RB tree without grabbing
the mmap_sem and, once the walk is done, double checking that the sequence
counter was stable during the walk.
The mm seqlock is held while inserting entries into and removing entries
from the MM RB tree. Later in this series, it will be checked when looking
for a VMA without holding the mmap_sem.
This is based on the initial work from Peter Zijlstra:
https://lore.kernel.org/linux-mm/20100104182813.479668508@chello.nl/
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
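The read-side protocol that the commit message anticipates can be sketched with C11 atomics. This is a userspace model, not the kernel's seqlock_t: `struct mm_model` and the `model_*` functions are hypothetical, a counter stands in for the RB tree, and writers are assumed to be serialized by the caller. A reader samples the sequence, performs its walk, and discards the result if the sequence was odd or has changed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct mm_model {
	atomic_uint seq;	/* even: stable, odd: update in progress */
	atomic_int nr_vmas;	/* stands in for the RB tree contents */
};

void model_write_seqlock(struct mm_model *mm)
{
	unsigned int old = atomic_fetch_add(&mm->seq, 1);

	assert((old & 1) == 0);	/* writers are serialized by the caller */
}

void model_write_sequnlock(struct mm_model *mm)
{
	unsigned int old = atomic_fetch_add(&mm->seq, 1);

	assert((old & 1) == 1);
}

unsigned int model_read_begin(struct mm_model *mm)
{
	return atomic_load(&mm->seq);
}

/* True if a writer was active at the start or has run since. */
bool model_read_retry(struct mm_model *mm, unsigned int start)
{
	return (start & 1) || atomic_load(&mm->seq) != start;
}

/* Lockless lookup: returns false if the snapshot may be inconsistent. */
bool model_lookup(struct mm_model *mm, int *nr_vmas)
{
	unsigned int start = model_read_begin(mm);
	int snapshot = atomic_load(&mm->nr_vmas);

	if (model_read_retry(mm, start))
		return false;
	*nr_vmas = snapshot;
	return true;
}

/* Tree update bracketed by the write side, as in __vma_link_rb(). */
void model_insert_vma(struct mm_model *mm)
{
	model_write_seqlock(mm);
	atomic_fetch_add(&mm->nr_vmas, 1);
	model_write_sequnlock(mm);
}
```

The model uses sequentially consistent atomics for simplicity. A production seqlock relies on weaker, carefully paired barriers.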
include/linux/mm_types.h | 3 +++
kernel/fork.c | 3 +++
mm/init-mm.c | 3 +++
mm/mmap.c | 48 +++++++++++++++++++++++++++++++---------
4 files changed, 46 insertions(+), 11 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e78f72eb2576..24b3f8ce9e42 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -358,6 +358,9 @@ struct mm_struct {
struct {
struct vm_area_struct *mmap; /* list of VMAs */
struct rb_root mm_rb;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ seqlock_t mm_seq;
+#endif
u64 vmacache_seqnum; /* per-thread vmacache */
#ifdef CONFIG_MMU
unsigned long (*get_unmapped_area) (struct file *filp,
diff --git a/kernel/fork.c b/kernel/fork.c
index 2992d2c95256..3a1739197ebc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1008,6 +1008,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm->mmap = NULL;
mm->mm_rb = RB_ROOT;
mm->vmacache_seqnum = 0;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ seqlock_init(&mm->mm_seq);
+#endif
atomic_set(&mm->mm_users, 1);
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index a787a319211e..69346b883a4e 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -27,6 +27,9 @@
*/
struct mm_struct init_mm = {
.mm_rb = RB_ROOT,
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ .mm_seq = __SEQLOCK_UNLOCKED(init_mm.mm_seq),
+#endif
.pgd = swapper_pg_dir,
.mm_users = ATOMIC_INIT(2),
.mm_count = ATOMIC_INIT(1),
diff --git a/mm/mmap.c b/mm/mmap.c
index 13460b38b0fb..f7f6027a7dff 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -170,6 +170,24 @@ void unlink_file_vma(struct vm_area_struct *vma)
}
}
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+static inline void mm_write_seqlock(struct mm_struct *mm)
+{
+ write_seqlock(&mm->mm_seq);
+}
+static inline void mm_write_sequnlock(struct mm_struct *mm)
+{
+ write_sequnlock(&mm->mm_seq);
+}
+#else
+static inline void mm_write_seqlock(struct mm_struct *mm)
+{
+}
+static inline void mm_write_sequnlock(struct mm_struct *mm)
+{
+}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
/*
* Close a vm structure and free it, returning the next.
*/
@@ -445,26 +463,32 @@ static void vma_gap_update(struct vm_area_struct *vma)
}
static inline void vma_rb_insert(struct vm_area_struct *vma,
- struct rb_root *root)
+ struct mm_struct *mm)
{
+ struct rb_root *root = &mm->mm_rb;
+
/* All rb_subtree_gap values must be consistent prior to insertion */
validate_mm_rb(root, NULL);
rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
}
-static void __vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
+static void __vma_rb_erase(struct vm_area_struct *vma, struct mm_struct *mm)
{
+ struct rb_root *root = &mm->mm_rb;
+
/*
* Note rb_erase_augmented is a fairly large inline function,
* so make sure we instantiate it only once with our desired
* augmented rbtree callbacks.
*/
+ mm_write_seqlock(mm);
rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
+ mm_write_sequnlock(mm); /* wmb */
}
static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
- struct rb_root *root,
+ struct mm_struct *mm,
struct vm_area_struct *ignore)
{
/*
@@ -472,21 +496,21 @@ static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
* with the possible exception of the "next" vma being erased if
* next->vm_start was reduced.
*/
- validate_mm_rb(root, ignore);
+ validate_mm_rb(&mm->mm_rb, ignore);
- __vma_rb_erase(vma, root);
+ __vma_rb_erase(vma, mm);
}
static __always_inline void vma_rb_erase(struct vm_area_struct *vma,
- struct rb_root *root)
+ struct mm_struct *mm)
{
/*
* All rb_subtree_gap values must be consistent prior to erase,
* with the possible exception of the vma being erased.
*/
- validate_mm_rb(root, vma);
+ validate_mm_rb(&mm->mm_rb, vma);
- __vma_rb_erase(vma, root);
+ __vma_rb_erase(vma, mm);
}
/*
@@ -601,10 +625,12 @@ void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
* immediately update the gap to the correct value. Finally we
* rebalance the rbtree after all augmented values have been set.
*/
+ mm_write_seqlock(mm);
rb_link_node(&vma->vm_rb, rb_parent, rb_link);
vma->rb_subtree_gap = 0;
vma_gap_update(vma);
- vma_rb_insert(vma, &mm->mm_rb);
+ vma_rb_insert(vma, mm);
+ mm_write_sequnlock(mm);
}
static void __vma_link_file(struct vm_area_struct *vma)
@@ -680,7 +706,7 @@ static __always_inline void __vma_unlink_common(struct mm_struct *mm,
{
struct vm_area_struct *next;
- vma_rb_erase_ignore(vma, &mm->mm_rb, ignore);
+ vma_rb_erase_ignore(vma, mm, ignore);
next = vma->vm_next;
if (has_prev)
prev->vm_next = next;
@@ -2674,7 +2700,7 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
insertion_point = (prev ? &prev->vm_next : &mm->mmap);
vma->vm_prev = NULL;
do {
- vma_rb_erase(vma, &mm->mm_rb);
+ vma_rb_erase(vma, mm);
mm->map_count--;
tail_vma = vma;
vma = vma->vm_next;
--
2.21.0
* [PATCH v12 26/31] perf tools: add support for the SPF perf event
From: Laurent Dufour @ 2019-04-16 13:45 UTC
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
Add support for the new speculative faults event.
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
tools/include/uapi/linux/perf_event.h | 1 +
tools/perf/util/evsel.c | 1 +
tools/perf/util/parse-events.c | 4 ++++
tools/perf/util/parse-events.l | 1 +
tools/perf/util/python.c | 1 +
5 files changed, 8 insertions(+)
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 7198ddd0c6b1..3b4356c55caa 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -112,6 +112,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_EMULATION_FAULTS = 8,
PERF_COUNT_SW_DUMMY = 9,
PERF_COUNT_SW_BPF_OUTPUT = 10,
+ PERF_COUNT_SW_SPF = 11,
PERF_COUNT_SW_MAX, /* non-ABI */
};
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 66d066f18b5b..1f3bea4379b2 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -435,6 +435,7 @@ const char *perf_evsel__sw_names[PERF_COUNT_SW_MAX] = {
"alignment-faults",
"emulation-faults",
"dummy",
+ "speculative-faults",
};
static const char *__perf_evsel__sw_name(u64 config)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 5ef4939408f2..effa8929cc90 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -140,6 +140,10 @@ struct event_symbol event_symbols_sw[PERF_COUNT_SW_MAX] = {
.symbol = "bpf-output",
.alias = "",
},
+ [PERF_COUNT_SW_SPF] = {
+ .symbol = "speculative-faults",
+ .alias = "spf",
+ },
};
#define __PERF_EVENT_FIELD(config, name) \
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 7805c71aaae2..d28a6edd0a95 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -324,6 +324,7 @@ emulation-faults { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_EM
dummy { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); }
duration_time { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); }
bpf-output { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_BPF_OUTPUT); }
+speculative-faults|spf { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_SPF); }
/*
* We have to handle the kernel PMU event cycles-ct/cycles-t/mem-loads/mem-stores separately.
diff --git a/tools/perf/util/python.c b/tools/perf/util/python.c
index dda0ac978b1e..c617a4751549 100644
--- a/tools/perf/util/python.c
+++ b/tools/perf/util/python.c
@@ -1200,6 +1200,7 @@ static struct {
PERF_CONST(COUNT_SW_ALIGNMENT_FAULTS),
PERF_CONST(COUNT_SW_EMULATION_FAULTS),
PERF_CONST(COUNT_SW_DUMMY),
+ PERF_CONST(COUNT_SW_SPF),
PERF_CONST(SAMPLE_IP),
PERF_CONST(SAMPLE_TID),
--
2.21.0
* [PATCH v12 05/31] mm: prepare for FAULT_FLAG_SPECULATIVE
From: Laurent Dufour @ 2019-04-16 13:44 UTC
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
From: Peter Zijlstra <peterz@infradead.org>
When speculating faults (without holding mmap_sem) we need to validate
that the vma against which we loaded pages is still valid when we're
ready to install the new PTE.
Therefore, replace the pte_offset_map_lock() calls that (re)take the
PTL with pte_map_lock() which can fail in case we find the VMA changed
since we started the fault.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Port to 4.12 kernel]
[Remove the comment about the fault_env structure which has been
implemented as the vm_fault structure in the kernel]
[move pte_map_lock()'s definition upper in the file]
[move the define of FAULT_FLAG_SPECULATIVE later in the series]
[review error path in do_swap_page(), do_anonymous_page() and
wp_page_copy()]
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
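The failure handling this patch prepares for can be sketched in userspace C. In the diff below pte_map_lock() always returns true. The version modelled here, which compares a sequence snapshot, is an assumption about how a later patch might make it fail. `struct fault_model` and the `MODEL_*` codes are hypothetical. The sketch shows the reworked unwinding in wp_page_copy(): a failed lock attempt must undo the charge and free the page before returning a retry.

```c
#include <assert.h>
#include <stdbool.h>

enum { MODEL_FAULT_DONE = 0, MODEL_FAULT_OOM = 1, MODEL_FAULT_RETRY = 2 };

struct fault_model {
	unsigned int vma_seq;		/* current VMA sequence value */
	unsigned int seq_snapshot;	/* value sampled at fault start */
	bool locked;			/* page table lock held */
	int pages_allocated;
	int pages_charged;
};

/* Take the lock only if the VMA is unchanged since the fault began. */
bool model_pte_map_lock(struct fault_model *f)
{
	if (f->vma_seq != f->seq_snapshot)
		return false;
	f->locked = true;
	return true;
}

int model_wp_page_copy(struct fault_model *f, bool alloc_ok)
{
	int ret = MODEL_FAULT_OOM;

	if (!alloc_ok)
		goto out;
	f->pages_allocated++;
	f->pages_charged++;

	if (!model_pte_map_lock(f)) {
		ret = MODEL_FAULT_RETRY;
		goto out_uncharge;
	}

	/* The new PTE would be installed here, then the lock dropped. */
	assert(f->locked);
	f->locked = false;
	return MODEL_FAULT_DONE;

out_uncharge:
	f->pages_charged--;	/* undo the memcg charge */
	f->pages_allocated--;	/* free the new page */
out:
	return ret;
}
```

Using a single `ret` variable with labelled exits mirrors the change from the `oom` labels to the `out` labels in the diff.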
mm/memory.c | 87 +++++++++++++++++++++++++++++++++++------------------
1 file changed, 58 insertions(+), 29 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index c6ddadd9d2b7..fc3698d13cb5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2073,6 +2073,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
}
EXPORT_SYMBOL_GPL(apply_to_page_range);
+static inline bool pte_map_lock(struct vm_fault *vmf)
+{
+ vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ return true;
+}
+
/*
* handle_pte_fault chooses page fault handler according to an entry which was
* read non-atomically. Before making any commitment, on those architectures
@@ -2261,25 +2268,26 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
int page_copied = 0;
struct mem_cgroup *memcg;
struct mmu_notifier_range range;
+ int ret = VM_FAULT_OOM;
if (unlikely(anon_vma_prepare(vma)))
- goto oom;
+ goto out;
if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
new_page = alloc_zeroed_user_highpage_movable(vma,
vmf->address);
if (!new_page)
- goto oom;
+ goto out;
} else {
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
vmf->address);
if (!new_page)
- goto oom;
+ goto out;
cow_user_page(new_page, old_page, vmf->address, vma);
}
if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg, false))
- goto oom_free_new;
+ goto out_free_new;
__SetPageUptodate(new_page);
@@ -2291,7 +2299,10 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
/*
* Re-check the pte - we dropped the lock
*/
- vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
+ if (!pte_map_lock(vmf)) {
+ ret = VM_FAULT_RETRY;
+ goto out_uncharge;
+ }
if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
@@ -2378,12 +2389,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
put_page(old_page);
}
return page_copied ? VM_FAULT_WRITE : 0;
-oom_free_new:
+out_uncharge:
+ mem_cgroup_cancel_charge(new_page, memcg, false);
+out_free_new:
put_page(new_page);
-oom:
+out:
if (old_page)
put_page(old_page);
- return VM_FAULT_OOM;
+ return ret;
}
/**
@@ -2405,8 +2418,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf)
{
WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
- vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
+ if (!pte_map_lock(vmf))
+ return VM_FAULT_RETRY;
/*
* We might have raced with another page fault while we released the
* pte_offset_map_lock.
@@ -2527,8 +2540,11 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
get_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
lock_page(vmf->page);
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
+ if (!pte_map_lock(vmf)) {
+ unlock_page(vmf->page);
+ put_page(vmf->page);
+ return VM_FAULT_RETRY;
+ }
if (!pte_same(*vmf->pte, vmf->orig_pte)) {
unlock_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2744,11 +2760,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!page) {
/*
- * Back out if somebody else faulted in this pte
- * while we released the pte lock.
+ * Back out if the VMA has changed behind our back during
+ * a speculative page fault or if somebody else
+ * faulted in this pte while we released the pte lock.
*/
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
+ if (!pte_map_lock(vmf)) {
+ delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+ ret = VM_FAULT_RETRY;
+ goto out;
+ }
if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
ret = VM_FAULT_OOM;
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
@@ -2801,10 +2821,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
/*
- * Back out if somebody else already faulted in this pte.
+ * Back out if the VMA has changed behind our back during a speculative
+ * page fault or if somebody else already faulted in this pte.
*/
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
+ if (!pte_map_lock(vmf)) {
+ ret = VM_FAULT_RETRY;
+ goto out_cancel_cgroup;
+ }
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
goto out_nomap;
@@ -2882,8 +2905,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
out:
return ret;
out_nomap:
- mem_cgroup_cancel_charge(page, memcg, false);
pte_unmap_unlock(vmf->pte, vmf->ptl);
+out_cancel_cgroup:
+ mem_cgroup_cancel_charge(page, memcg, false);
out_page:
unlock_page(page);
out_release:
@@ -2934,8 +2958,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
!mm_forbids_zeropage(vma->vm_mm)) {
entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
vma->vm_page_prot));
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
+ if (!pte_map_lock(vmf))
+ return VM_FAULT_RETRY;
if (!pte_none(*vmf->pte))
goto unlock;
ret = check_stable_address_space(vma->vm_mm);
@@ -2971,14 +2995,16 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
- if (!pte_none(*vmf->pte))
+ if (!pte_map_lock(vmf)) {
+ ret = VM_FAULT_RETRY;
goto release;
+ }
+ if (!pte_none(*vmf->pte))
+ goto unlock_and_release;
ret = check_stable_address_space(vma->vm_mm);
if (ret)
- goto release;
+ goto unlock_and_release;
/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
@@ -3000,10 +3026,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
return ret;
+unlock_and_release:
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
release:
mem_cgroup_cancel_charge(page, memcg, false);
put_page(page);
- goto unlock;
+ return ret;
oom_free_page:
put_page(page);
oom:
@@ -3118,8 +3146,9 @@ static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf)
* pte_none() under vmf->ptl protection when we return to
* alloc_set_pte().
*/
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
+ if (!pte_map_lock(vmf))
+ return VM_FAULT_RETRY;
+
return 0;
}
--
2.21.0
* [PATCH v12 30/31] arm64/mm: add speculative page fault
From: Laurent Dufour @ 2019-04-16 13:45 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
From: Mahendran Ganesh <opensource.ganesh@gmail.com>
This patch enables the speculative page fault handler on the arm64
architecture.
I ported SPF to a 4.9 kernel. In my tests, application launch time
improved by about 10% on average. For applications with more than
50 threads, the improvement reached 15% or more.
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
[handle_speculative_fault() no longer returns the vma pointer]
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
arch/arm64/mm/fault.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 4f343e603925..b5e2a93f9c21 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -485,6 +485,16 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
+ /*
+ * let's try a speculative page fault without grabbing the
+ * mmap_sem.
+ */
+ fault = handle_speculative_fault(mm, addr, mm_flags);
+ if (fault != VM_FAULT_RETRY) {
+ perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, addr);
+ goto done;
+ }
+
/*
* As per x86, we may deadlock here. However, since the kernel only
* validly references user space from well defined areas of the code,
@@ -535,6 +545,8 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
}
up_read(&mm->mmap_sem);
+done:
+
/*
* Handle the "normal" (no error) case first.
*/
--
2.21.0
* [PATCH v12 07/31] mm: make pte_unmap_same compatible with SPF
From: Laurent Dufour @ 2019-04-16 13:44 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
pte_unmap_same() assumes that the page tables are still around because the
mmap_sem is held.
This is no longer the case when running a speculative page fault, so an
additional check must be made to ensure that the final page table is still
there.
This is now done by calling pte_spinlock(), which checks the VMA's
consistency while taking the page table lock.
This requires passing a vm_fault structure to pte_unmap_same(), which
contains all the needed parameters.
As pte_spinlock() may fail during a speculative page fault if the VMA has
been changed behind our back, pte_unmap_same() now has 3 possible return
values:
1. the PTEs are the same (0)
2. the PTEs are different (VM_FAULT_PTNOTSAME)
3. a change to the VMA has been detected (VM_FAULT_RETRY)
Case 2 is handled by introducing a new VM_FAULT flag named
VM_FAULT_PTNOTSAME, which is then trapped in do_swap_page().
If VM_FAULT_RETRY is returned, it is passed up to the callers so that the
page fault is retried while holding the mmap_sem.
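The three-way contract described above can be modelled in userspace. This is only a sketch under stated assumptions: the MODEL_* codes, struct model_fault and both functions are hypothetical stand-ins rather than kernel symbols, and the lock acquisition is reduced to a boolean. It also ignores that the real check only takes the lock when pte_t is wider than unsigned long.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result codes; only their distinctness matters here. */
enum model_result {
	MODEL_SAME = 0,		/* PTE unchanged, caller may proceed */
	MODEL_PTNOTSAME = 1,	/* PTE changed, somebody else handled it */
	MODEL_RETRY = 2,	/* VMA changed, retry under the mmap_sem */
};

/* Hypothetical, reduced view of the state a fault handler carries. */
struct model_fault {
	unsigned long pte;	/* current value in the page table */
	unsigned long orig_pte;	/* value sampled when the fault started */
	bool vma_changed;	/* stands in for pte_spinlock() failing */
};

static enum model_result model_pte_unmap_same(const struct model_fault *f)
{
	assert(f);
	if (f->vma_changed)
		return MODEL_RETRY;
	if (f->pte != f->orig_pte)
		return MODEL_PTNOTSAME;
	return MODEL_SAME;
}

/*
 * Caller modelled on the do_swap_page() hunk: PTNOTSAME is folded into
 * success without doing the work, RETRY is propagated unchanged.
 */
static enum model_result model_swap_fault(const struct model_fault *f,
					  bool *did_swapin)
{
	enum model_result ret;

	assert(f && did_swapin);
	ret = model_pte_unmap_same(f);
	*did_swapin = false;
	if (ret != MODEL_SAME)
		return ret == MODEL_PTNOTSAME ? MODEL_SAME : ret;

	*did_swapin = true;	/* the real swap-in would happen here */
	return MODEL_SAME;
}
```

The point of keeping PTNOTSAME internal to the caller is that only two outcomes leave the handler: success, or a request to retry with the mmap_sem held.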
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
include/linux/mm_types.h | 6 +++++-
mm/memory.c | 37 +++++++++++++++++++++++++++----------
2 files changed, 32 insertions(+), 11 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8ec38b11b361..fd7d38ee2e33 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -652,6 +652,8 @@ typedef __bitwise unsigned int vm_fault_t;
* @VM_FAULT_NEEDDSYNC: ->fault did not modify page tables and needs
* fsync() to complete (for synchronous page faults
* in DAX)
+ * @VM_FAULT_PTNOTSAME: Page table entries have changed during a
+ * speculative page fault handling.
* @VM_FAULT_HINDEX_MASK: mask HINDEX value
*
*/
@@ -669,6 +671,7 @@ enum vm_fault_reason {
VM_FAULT_FALLBACK = (__force vm_fault_t)0x000800,
VM_FAULT_DONE_COW = (__force vm_fault_t)0x001000,
VM_FAULT_NEEDDSYNC = (__force vm_fault_t)0x002000,
+ VM_FAULT_PTNOTSAME = (__force vm_fault_t)0x004000,
VM_FAULT_HINDEX_MASK = (__force vm_fault_t)0x0f0000,
};
@@ -693,7 +696,8 @@ enum vm_fault_reason {
{ VM_FAULT_RETRY, "RETRY" }, \
{ VM_FAULT_FALLBACK, "FALLBACK" }, \
{ VM_FAULT_DONE_COW, "DONE_COW" }, \
- { VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }
+ { VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }, \
+ { VM_FAULT_PTNOTSAME, "PTNOTSAME" }
struct vm_special_mapping {
const char *name; /* The name, e.g. "[vdso]". */
diff --git a/mm/memory.c b/mm/memory.c
index 221ccdf34991..d5bebca47d98 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2094,21 +2094,29 @@ static inline bool pte_map_lock(struct vm_fault *vmf)
* parts, do_swap_page must check under lock before unmapping the pte and
* proceeding (but do_wp_page is only called after already making such a check;
* and do_anonymous_page can safely check later on).
+ *
+ * pte_unmap_same() returns:
+ * 0 if the PTEs are the same
+ * VM_FAULT_PTNOTSAME if the PTEs are different
+ * VM_FAULT_RETRY if the VMA has changed behind our back during
+ * a speculative page fault handling.
*/
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
- pte_t *page_table, pte_t orig_pte)
+static inline vm_fault_t pte_unmap_same(struct vm_fault *vmf)
{
- int same = 1;
+ int ret = 0;
+
#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
if (sizeof(pte_t) > sizeof(unsigned long)) {
- spinlock_t *ptl = pte_lockptr(mm, pmd);
- spin_lock(ptl);
- same = pte_same(*page_table, orig_pte);
- spin_unlock(ptl);
+ if (pte_spinlock(vmf)) {
+ if (!pte_same(*vmf->pte, vmf->orig_pte))
+ ret = VM_FAULT_PTNOTSAME;
+ spin_unlock(vmf->ptl);
+ } else
+ ret = VM_FAULT_RETRY;
}
#endif
- pte_unmap(page_table);
- return same;
+ pte_unmap(vmf->pte);
+ return ret;
}
static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
@@ -2714,8 +2722,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
int exclusive = 0;
vm_fault_t ret = 0;
- if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
+ ret = pte_unmap_same(vmf);
+ if (ret) {
+ /*
+ * If pte != orig_pte, this means another thread did the
+ * swap operation behind our back.
+ * So there is nothing else to do.
+ */
+ if (ret == VM_FAULT_PTNOTSAME)
+ ret = 0;
goto out;
+ }
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
--
2.21.0
* [PATCH v12 31/31] mm: Add a speculative page fault switch in sysctl
From: Laurent Dufour @ 2019-04-16 13:45 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
This allows turning the speculative page fault handler on or off through
the vm.speculative_page_fault sysctl.
It is turned on by default.
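Given the procname added to vm_table, the switch should appear as /proc/sys/vm/speculative_page_fault on a kernel built with CONFIG_SPECULATIVE_PAGE_FAULT. A minimal userspace sketch for reading and writing such an integer file follows. The helper names are hypothetical, the path is supplied by the caller, and writing the real file requires root.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write a single integer to a sysctl-style file. */
static int write_int_file(const char *path, int value)
{
	FILE *f;
	int ok;

	assert(path);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Hypothetical helper: read a single integer back from such a file. */
static int read_int_file(const char *path, int *value)
{
	FILE *f;
	int ok;

	assert(path && value);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%d", value) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

For example, write_int_file("/proc/sys/vm/speculative_page_fault", 0) would disable the speculative path, after which handle_speculative_fault() returns VM_FAULT_RETRY immediately and every fault takes the classic path.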
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
include/linux/mm.h | 3 +++
kernel/sysctl.c | 9 +++++++++
mm/memory.c | 3 +++
3 files changed, 15 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ec609cbad25a..f5bf13a2197a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1531,6 +1531,7 @@ extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
unsigned long address, unsigned int flags);
#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+extern int sysctl_speculative_page_fault;
extern vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
unsigned long address,
unsigned int flags);
@@ -1538,6 +1539,8 @@ static inline vm_fault_t handle_speculative_fault(struct mm_struct *mm,
unsigned long address,
unsigned int flags)
{
+ if (unlikely(!sysctl_speculative_page_fault))
+ return VM_FAULT_RETRY;
/*
* Try speculative page fault for multithreaded user space task only.
*/
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9df14b07a488..3a712e52c14a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1295,6 +1295,15 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &two,
},
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ {
+ .procname = "speculative_page_fault",
+ .data = &sysctl_speculative_page_fault,
+ .maxlen = sizeof(sysctl_speculative_page_fault),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+#endif
{
.procname = "panic_on_oom",
.data = &sysctl_panic_on_oom,
diff --git a/mm/memory.c b/mm/memory.c
index c65e8011d285..a12a60891350 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -83,6 +83,9 @@
#define CREATE_TRACE_POINTS
#include <trace/events/pagefault.h>
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+int sysctl_speculative_page_fault = 1;
+#endif
#if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST)
#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid.
--
2.21.0
* [PATCH v12 02/31] x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
From: Laurent Dufour @ 2019-04-16 13:44 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT, which turns on the
Speculative Page Fault handler when building for 64-bit.
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0f2ab09da060..8bd575184d0b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -30,6 +30,7 @@ config X86_64
select SWIOTLB
select X86_DEV_DMA_OPS
select ARCH_HAS_SYSCALL_WRAPPER
+ select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
#
# Arch settings
--
2.21.0
* [PATCH v12 29/31] powerpc/mm: add speculative page fault
From: Laurent Dufour @ 2019-04-16 13:45 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
This patch enables the speculative page fault handler on the PowerPC
architecture.
It tries a speculative page fault without holding the mmap_sem; if that
returns VM_FAULT_RETRY, the mmap_sem is acquired and the traditional page
fault processing is done.
The speculative path is only tried for multithreaded processes, as there is
no risk of contention on the mmap_sem otherwise.
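The control flow described above, speculative attempt first and classic path on retry, can be sketched in userspace as follows. All names are hypothetical, the thread-count test is an assumption about how "multithreaded" is decided, and the memory protection key special case of the real patch is left out.

```c
#include <assert.h>
#include <stdbool.h>

enum { MODEL_FAULT_OK = 0, MODEL_FAULT_RETRY = 1 };

/* Hypothetical counters showing which path handled each fault. */
struct model_stats {
	unsigned int speculative;	/* handled without the mmap_sem */
	unsigned int classic;		/* handled with the mmap_sem held */
};

/*
 * Stand-in for handle_speculative_fault(): single-threaded processes are
 * never handled speculatively, and a VMA that changed forces a retry.
 */
static int model_speculative_fault(int nr_threads, bool vma_stable)
{
	if (nr_threads <= 1)
		return MODEL_FAULT_RETRY;
	return vma_stable ? MODEL_FAULT_OK : MODEL_FAULT_RETRY;
}

/* Stand-in for the arch fault handler. */
static int model_do_page_fault(int nr_threads, bool vma_stable,
			       struct model_stats *st)
{
	int fault;

	assert(st);
	fault = model_speculative_fault(nr_threads, vma_stable);
	if (fault != MODEL_FAULT_RETRY) {
		st->speculative++;
		return fault;
	}

	/* Classic path: the real code takes the mmap_sem for reading here. */
	st->classic++;
	return MODEL_FAULT_OK;
}
```

In this model a retry costs one wasted speculative attempt, which is why the real handler skips the attempt entirely when no contention on the mmap_sem is possible.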
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
arch/powerpc/mm/fault.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index ec74305fa330..5d48016073cb 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -491,6 +491,21 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
if (is_exec)
flags |= FAULT_FLAG_INSTRUCTION;
+ /*
+ * Try speculative page fault before grabbing the mmap_sem.
+ * The page fault is done if VM_FAULT_RETRY is not returned.
+ * But if the memory protection keys are active, we don't know if the
+ * fault is due to a key mismatch or due to a classic protection check.
+ * To differentiate that, we would need the VMA, which we no longer
+ * have, so let's retry with the mmap_sem held.
+ */
+ fault = handle_speculative_fault(mm, address, flags);
+ if (fault != VM_FAULT_RETRY && !(IS_ENABLED(CONFIG_PPC_MEM_KEYS) &&
+ fault == VM_FAULT_SIGSEGV)) {
+ perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, address);
+ goto done;
+ }
+
/* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
* kernel and should generate an OOPS. Unfortunately, in the case of an
@@ -600,6 +615,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
up_read(&current->mm->mmap_sem);
+done:
if (unlikely(fault & VM_FAULT_ERROR))
return mm_fault_error(regs, address, fault);
--
2.21.0
* [PATCH v12 06/31] mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
From: Laurent Dufour @ 2019-04-16 13:44 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
When handling a page fault without holding the mmap_sem, the fetch of the
pte lock pointer and the locking have to be done while ensuring
that the VMA is not changed behind our back.
So move the fetch and locking operations into a dedicated function.
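In this patch pte_spinlock() always returns true; the bool return type only makes room for a variant that can fail. Purely as an illustration of that calling convention, and not as the code of any other patch in the series, a lock-then-revalidate helper might look like this in userspace. The version counter, the struct layouts and the atomic_flag spinlock are all assumptions of the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical VMA with a version counter bumped on every modification. */
struct model_vma {
	atomic_uint version;
};

/* Hypothetical fault context: VMA, version seen at fault start, PTE lock. */
struct model_fault {
	struct model_vma *vma;
	unsigned int version_seen;
	atomic_flag *ptl;
};

static void model_unlock(struct model_fault *f)
{
	atomic_flag_clear_explicit(f->ptl, memory_order_release);
}

/*
 * Take the lock, then re-validate the VMA. On failure the lock is dropped
 * again so the caller can simply return a retry code.
 */
static bool model_pte_spinlock(struct model_fault *f)
{
	assert(f && f->vma && f->ptl);
	while (atomic_flag_test_and_set_explicit(f->ptl, memory_order_acquire))
		;	/* spin until the lock is free */

	if (atomic_load(&f->vma->version) != f->version_seen) {
		model_unlock(f);
		return false;
	}
	return true;
}
```

Callers then follow the shape used in the diff: if the helper returns false, nothing is held and the fault is retried; if it returns true, the lock is held and the VMA was unchanged at the time of the check.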
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
mm/memory.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index fc3698d13cb5..221ccdf34991 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2073,6 +2073,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
}
EXPORT_SYMBOL_GPL(apply_to_page_range);
+static inline bool pte_spinlock(struct vm_fault *vmf)
+{
+ vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+ spin_lock(vmf->ptl);
+ return true;
+}
+
static inline bool pte_map_lock(struct vm_fault *vmf)
{
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
@@ -3656,8 +3663,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
* validation through pte_unmap_same(). It's of NUMA type but
* the pfn may be screwed if the read is non atomic.
*/
- vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
- spin_lock(vmf->ptl);
+ if (!pte_spinlock(vmf))
+ return VM_FAULT_RETRY;
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
goto out;
@@ -3850,8 +3857,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
- vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
- spin_lock(vmf->ptl);
+ if (!pte_spinlock(vmf))
+ return VM_FAULT_RETRY;
entry = vmf->orig_pte;
if (unlikely(!pte_same(*vmf->pte, entry)))
goto unlock;
--
2.21.0
* [PATCH v12 09/31] mm: VMA sequence count
From: Laurent Dufour @ 2019-04-16 13:45 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
From: Peter Zijlstra <peterz@infradead.org>
Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
counts such that we can easily test if a VMA is changed.
The calls to vm_write_begin/end() in unmap_page_range() are
used to detect when a VMA is being unmapped and thus that new page faults
should not be satisfied for this VMA. If the seqcount hasn't changed when
the page tables are locked, this means we are safe to satisfy the page
fault.
The flip side is that we cannot distinguish between a vma_adjust() and
the unmap_page_range() -- where with the former we could have
re-checked the vma bounds against the address.
The VMA's sequence counter is also used to detect changes to various VMA
fields used during page fault handling, such as:
- vm_start, vm_end
- vm_pgoff
- vm_flags, vm_page_prot
- anon_vma
- vm_policy
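The sequence-count idea can be illustrated with a small userspace model. This is a sketch under assumptions, not the kernel's seqcount_t: the struct and function names are hypothetical, C11 atomics replace the kernel primitives, and the data protected by the counter is left out. It follows the usage the series describes, where the reader only compares values and never waits for a writer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the VMA's vm_sequence field. */
struct model_seqcount {
	atomic_uint sequence;	/* odd while a writer is in progress */
};

static void model_write_begin(struct model_seqcount *s)
{
	unsigned int old = atomic_fetch_add(&s->sequence, 1);

	assert((old & 1) == 0);	/* writers must not nest in this model */
}

static void model_write_end(struct model_seqcount *s)
{
	unsigned int old = atomic_fetch_add(&s->sequence, 1);

	assert((old & 1) == 1);	/* must pair with model_write_begin() */
}

/* Sample the counter without waiting for a writer to finish. */
static unsigned int model_read_snapshot(struct model_seqcount *s)
{
	return atomic_load(&s->sequence);
}

/*
 * A snapshot is unusable if it was taken mid-write (odd) or if the
 * counter has moved since; either way the reader must fall back.
 */
static bool model_has_changed(struct model_seqcount *s, unsigned int snap)
{
	return (snap & 1) || atomic_load(&s->sequence) != snap;
}
```

A fault handler in this model would take a snapshot when it looks up the VMA and call model_has_changed() once the page table lock is held; any modification wrapped in the begin/end pair in between makes the check fail.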
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Port to 4.12 kernel]
[Build depends on CONFIG_SPECULATIVE_PAGE_FAULT]
[Introduce vm_write_* inline function depending on
CONFIG_SPECULATIVE_PAGE_FAULT]
[Fix lock dependency between mapping->i_mmap_rwsem and vma->vm_sequence by
using vm_raw_write* functions]
[Fix a lock dependency warning in mmap_region() when entering the error
path]
[move sequence initialisation INIT_VMA()]
[Review the patch description about unmap_page_range()]
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
include/linux/mm.h | 44 ++++++++++++++++++++++++++++++++++++++++
include/linux/mm_types.h | 3 +++
mm/memory.c | 2 ++
mm/mmap.c | 30 +++++++++++++++++++++++++++
4 files changed, 79 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2ceb1d2869a6..906b9e06f18e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1410,6 +1410,9 @@ struct zap_details {
static inline void INIT_VMA(struct vm_area_struct *vma)
{
INIT_LIST_HEAD(&vma->anon_vma_chain);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ seqcount_init(&vma->vm_sequence);
+#endif
}
struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
@@ -1534,6 +1537,47 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
unmap_mapping_range(mapping, holebegin, holelen, 0);
}
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+static inline void vm_write_begin(struct vm_area_struct *vma)
+{
+ write_seqcount_begin(&vma->vm_sequence);
+}
+static inline void vm_write_begin_nested(struct vm_area_struct *vma,
+ int subclass)
+{
+ write_seqcount_begin_nested(&vma->vm_sequence, subclass);
+}
+static inline void vm_write_end(struct vm_area_struct *vma)
+{
+ write_seqcount_end(&vma->vm_sequence);
+}
+static inline void vm_raw_write_begin(struct vm_area_struct *vma)
+{
+ raw_write_seqcount_begin(&vma->vm_sequence);
+}
+static inline void vm_raw_write_end(struct vm_area_struct *vma)
+{
+ raw_write_seqcount_end(&vma->vm_sequence);
+}
+#else
+static inline void vm_write_begin(struct vm_area_struct *vma)
+{
+}
+static inline void vm_write_begin_nested(struct vm_area_struct *vma,
+ int subclass)
+{
+}
+static inline void vm_write_end(struct vm_area_struct *vma)
+{
+}
+static inline void vm_raw_write_begin(struct vm_area_struct *vma)
+{
+}
+static inline void vm_raw_write_end(struct vm_area_struct *vma)
+{
+}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
extern int access_process_vm(struct task_struct *tsk, unsigned long addr,
void *buf, int len, unsigned int gup_flags);
extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fd7d38ee2e33..e78f72eb2576 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -337,6 +337,9 @@ struct vm_area_struct {
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
#endif
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ seqcount_t vm_sequence;
+#endif
} __randomize_layout;
struct core_thread {
diff --git a/mm/memory.c b/mm/memory.c
index d5bebca47d98..423fa8ea0569 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1256,6 +1256,7 @@ void unmap_page_range(struct mmu_gather *tlb,
unsigned long next;
BUG_ON(addr >= end);
+ vm_write_begin(vma);
tlb_start_vma(tlb, vma);
pgd = pgd_offset(vma->vm_mm, addr);
do {
@@ -1265,6 +1266,7 @@ void unmap_page_range(struct mmu_gather *tlb,
next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
} while (pgd++, addr = next, addr != end);
tlb_end_vma(tlb, vma);
+ vm_write_end(vma);
}
diff --git a/mm/mmap.c b/mm/mmap.c
index 5ad3a3228d76..a4e4d52a5148 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -726,6 +726,30 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
long adjust_next = 0;
int remove_next = 0;
+ /*
+ * Why use the vm_raw_write*() functions here? To avoid a lockdep warning.
+ *
+ * Lockdep is complaining about a theoretical lock dependency, involving
+ * 3 locks:
+ * mapping->i_mmap_rwsem --> vma->vm_sequence --> fs_reclaim
+ *
+ * Here are the major paths leading to this dependency:
+ * 1. __vma_adjust() mmap_sem -> vm_sequence -> i_mmap_rwsem
+ * 2. move_vma() mmap_sem -> vm_sequence -> fs_reclaim
+ * 3. __alloc_pages_nodemask() fs_reclaim -> i_mmap_rwsem
+ * 4. unmap_mapping_range() i_mmap_rwsem -> vm_sequence
+ *
+ * So there is no way to solve this easily, especially because in
+ * unmap_mapping_range() the i_mmap_rwsem is grabbed while the impacted
+ * VMAs are not yet known.
+ * However, the way the vm_seq is used guarantees that we will
+ * never block on it since we just check for its value and never wait
+ * for it to move, see vma_has_changed() and handle_speculative_fault().
+ */
+ vm_raw_write_begin(vma);
+ if (next)
+ vm_raw_write_begin(next);
+
if (next && !insert) {
struct vm_area_struct *exporter = NULL, *importer = NULL;
@@ -950,6 +974,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
* "vma->vm_next" gap must be updated.
*/
next = vma->vm_next;
+ if (next)
+ vm_raw_write_begin(next);
} else {
/*
* For the scope of the comment "next" and
@@ -996,6 +1022,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
if (insert && file)
uprobe_mmap(insert);
+ if (next && next != vma)
+ vm_raw_write_end(next);
+ vm_raw_write_end(vma);
+
validate_mm(mm);
return 0;
--
2.21.0
* [PATCH v12 12/31] mm: protect SPF handler against anon_vma changes
From: Laurent Dufour @ 2019-04-16 13:45 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
The speculative page fault handler must be protected against anon_vma
changes. This is because page_add_new_anon_rmap() is called during the
speculative path.
In addition, don't try a speculative page fault if the VMA doesn't have an
anon_vma structure allocated, because its allocation should be
protected by the mmap_sem.
In __vma_adjust(), when importer->anon_vma is set, there is no need to
protect against speculative page faults, since a speculative page fault
is aborted if vma->anon_vma is not set.
When calling page_add_new_anon_rmap(), vma->anon_vma is necessarily
valid, since we checked for it when locking the pte and the anon_vma is
only removed once the pte is unlocked. This holds even if the speculative
page fault handler is running concurrently with do_unmap(): the pte is
locked in unmap_region() - through unmap_vmas() - and the anon_vma is
unlinked later. The vma sequence counter is updated in
unmap_page_range() before the pte is locked, and then again in
free_pgtables(), so the change will be detected when the speculative
handler locks the pte.
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
mm/memory.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index 423fa8ea0569..2cf7b6185daa 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -377,7 +377,9 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
* Hide vma from rmap and truncate_pagecache before freeing
* pgtables
*/
+ vm_write_begin(vma);
unlink_anon_vmas(vma);
+ vm_write_end(vma);
unlink_file_vma(vma);
if (is_vm_hugetlb_page(vma)) {
@@ -391,7 +393,9 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
&& !is_vm_hugetlb_page(next)) {
vma = next;
next = vma->vm_next;
+ vm_write_begin(vma);
unlink_anon_vmas(vma);
+ vm_write_end(vma);
unlink_file_vma(vma);
}
free_pgd_range(tlb, addr, vma->vm_end,
--
2.21.0
* [PATCH v12 14/31] mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
From: Laurent Dufour @ 2019-04-16 13:45 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
migrate_misplaced_page() is only called during page fault handling, so
it's better to pass a pointer to the struct vm_fault instead of the vma.
This way, the saved vma->vm_flags can be used during the speculative page
fault path.
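The benefit of deciding from a saved copy of the flags can be shown with a small userspace model. It is a sketch only: the struct names, the MODEL_VM_EXEC value and model_fault_init() are hypothetical, and it assumes the snapshot is taken once when the fault context is set up.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_VM_EXEC 0x4UL	/* hypothetical flag value for this sketch */

struct model_vma {
	unsigned long vm_flags;	/* live value, may change concurrently */
};

/* Hypothetical fault context carrying a copy of the flags. */
struct model_vm_fault {
	struct model_vma *vma;
	unsigned long vma_flags;	/* snapshot taken when the fault started */
};

static void model_fault_init(struct model_vm_fault *vmf, struct model_vma *vma)
{
	assert(vmf && vma);
	vmf->vma = vma;
	vmf->vma_flags = vma->vm_flags;
}

/*
 * Mirrors the check in the migrate_misplaced_page() hunk: shared
 * executable file pages are left alone, decided from the snapshot only.
 */
static bool model_skip_migration(const struct model_vm_fault *vmf,
				 int mapcount, bool is_file_cache)
{
	assert(vmf);
	return mapcount != 1 && is_file_cache &&
	       (vmf->vma_flags & MODEL_VM_EXEC);
}
```

Because the decision reads only the snapshot, a concurrent change to the live vm_flags cannot produce a result that is inconsistent with the rest of the same fault.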
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
include/linux/migrate.h | 4 ++--
mm/memory.c | 2 +-
mm/migrate.c | 4 ++--
3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e13d9bf2f9a5..0197e40325f8 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -125,14 +125,14 @@ static inline void __ClearPageMovable(struct page *page)
#ifdef CONFIG_NUMA_BALANCING
extern bool pmd_trans_migrating(pmd_t pmd);
extern int migrate_misplaced_page(struct page *page,
- struct vm_area_struct *vma, int node);
+ struct vm_fault *vmf, int node);
#else
static inline bool pmd_trans_migrating(pmd_t pmd)
{
return false;
}
static inline int migrate_misplaced_page(struct page *page,
- struct vm_area_struct *vma, int node)
+ struct vm_fault *vmf, int node)
{
return -EAGAIN; /* can't migrate now */
}
diff --git a/mm/memory.c b/mm/memory.c
index d0de58464479..56802850e72c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3747,7 +3747,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
}
/* Migrate to the requested node */
- migrated = migrate_misplaced_page(page, vma, target_nid);
+ migrated = migrate_misplaced_page(page, vmf, target_nid);
if (migrated) {
page_nid = target_nid;
flags |= TNF_MIGRATED;
diff --git a/mm/migrate.c b/mm/migrate.c
index a9138093a8e2..633bd9abac54 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1938,7 +1938,7 @@ bool pmd_trans_migrating(pmd_t pmd)
* node. Caller is expected to have an elevated reference count on
* the page that will be dropped by this function before returning.
*/
-int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+int migrate_misplaced_page(struct page *page, struct vm_fault *vmf,
int node)
{
pg_data_t *pgdat = NODE_DATA(node);
@@ -1951,7 +1951,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
* with execute permissions as they are probably shared libraries.
*/
if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
- (vma->vm_flags & VM_EXEC))
+ (vmf->vma_flags & VM_EXEC))
goto out;
/*
--
2.21.0
* [PATCH v12 03/31] powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
From: Laurent Dufour @ 2019-04-16 13:44 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for BOOK3S_64. This enables
the Speculative Page Fault handler.
Support is currently only provided for BOOK3S_64 because:
- it requires CONFIG_PPC_STD_MMU because of the checks done in
set_access_flags_filter()
- it requires BOOK3S because we can't support
book3e_hugetlb_preload(), which is called by update_mmu_cache()
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
arch/powerpc/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2d0be82c3061..a29887ea5383 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -238,6 +238,7 @@ config PPC
select PCI_SYSCALL if PCI
select RTC_LIB
select SPARSE_IRQ
+ select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT if PPC_BOOK3S_64
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
select VIRT_TO_BUS if !PPC64
--
2.21.0
* [PATCH v12 16/31] mm: introduce __vm_normal_page()
From: Laurent Dufour @ 2019-04-16 13:45 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
When dealing with the speculative fault path, we should use the cached
values of the VMA's fields stored in the vm_fault structure.
Currently, vm_normal_page() uses the pointer to the VMA to fetch the
vm_flags value. This patch provides a new __vm_normal_page(), which
receives the vm_flags value as a parameter.
Note: The speculative path is only turned on for architectures providing
support for the special PTE flag, so only the first block of
vm_normal_page() is used during the speculative path.
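The split between a flag-taking helper and a thin wrapper can be illustrated with a small user-space sketch. Everything in it (the demo_ names, the flag values, the reduced VMA structure) is a hypothetical stand-in and not the kernel code; it only shows how the wrapper reads vm_flags from the VMA while a speculative caller could pass a cached value instead.

```c
#include <assert.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define DEMO_VM_PFNMAP   0x1UL
#define DEMO_VM_MIXEDMAP 0x2UL

/* Reduced stand-in for struct vm_area_struct. */
struct demo_vma {
	unsigned long vm_flags;
};

/*
 * Core helper: decides from an explicit flags value and never reads
 * vma->vm_flags, so a caller may hand in a previously cached snapshot.
 */
int demo_is_pfn_mapping_flags(const struct demo_vma *vma,
			      unsigned long vma_flags)
{
	(void)vma;	/* kept only to mirror the shape of the kernel helper */
	return (vma_flags & (DEMO_VM_PFNMAP | DEMO_VM_MIXEDMAP)) != 0;
}

/* Thin wrapper for callers that hold the VMA stable (non-speculative). */
int demo_is_pfn_mapping(const struct demo_vma *vma)
{
	return demo_is_pfn_mapping_flags(vma, vma->vm_flags);
}
```

In this model, a caller that cached the flags before the VMA changed still gets an answer consistent with its snapshot, whereas the wrapper always reflects the current value.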
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
include/linux/mm.h | 18 +++++++++++++++---
mm/memory.c | 21 ++++++++++++---------
2 files changed, 27 insertions(+), 12 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f465bb2b049e..f14b2c9ddfd4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1421,9 +1421,21 @@ static inline void INIT_VMA(struct vm_area_struct *vma)
#endif
}
-struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
- pte_t pte, bool with_public_device);
-#define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
+struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte, bool with_public_device,
+ unsigned long vma_flags);
+static inline struct page *_vm_normal_page(struct vm_area_struct *vma,
+ unsigned long addr, pte_t pte,
+ bool with_public_device)
+{
+ return __vm_normal_page(vma, addr, pte, with_public_device,
+ vma->vm_flags);
+}
+static inline struct page *vm_normal_page(struct vm_area_struct *vma,
+ unsigned long addr, pte_t pte)
+{
+ return _vm_normal_page(vma, addr, pte, false);
+}
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t pmd);
diff --git a/mm/memory.c b/mm/memory.c
index 85ec5ce5c0a8..be93f2c8ebe0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -533,7 +533,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
}
/*
- * vm_normal_page -- This function gets the "struct page" associated with a pte.
+ * __vm_normal_page -- This function gets the "struct page" associated with
+ * a pte.
*
* "Special" mappings do not wish to be associated with a "struct page" (either
* it doesn't exist, or it exists but they don't want to touch it). In this
@@ -574,8 +575,9 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
* PFNMAP mappings in order to support COWable mappings.
*
*/
-struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
- pte_t pte, bool with_public_device)
+struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte, bool with_public_device,
+ unsigned long vma_flags)
{
unsigned long pfn = pte_pfn(pte);
@@ -584,7 +586,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
goto check_pfn;
if (vma->vm_ops && vma->vm_ops->find_special_page)
return vma->vm_ops->find_special_page(vma, addr);
- if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+ if (vma_flags & (VM_PFNMAP | VM_MIXEDMAP))
return NULL;
if (is_zero_pfn(pfn))
return NULL;
@@ -620,8 +622,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
/* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */
- if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
- if (vma->vm_flags & VM_MIXEDMAP) {
+ if (unlikely(vma_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
+ if (vma_flags & VM_MIXEDMAP) {
if (!pfn_valid(pfn))
return NULL;
goto out;
@@ -630,7 +632,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
off = (addr - vma->vm_start) >> PAGE_SHIFT;
if (pfn == vma->vm_pgoff + off)
return NULL;
- if (!is_cow_mapping(vma->vm_flags))
+ if (!is_cow_mapping(vma_flags))
return NULL;
}
}
@@ -2532,7 +2534,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
- vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
+ vmf->page = __vm_normal_page(vma, vmf->address, vmf->orig_pte, false,
+ vmf->vma_flags);
if (!vmf->page) {
/*
* VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
@@ -3706,7 +3709,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
update_mmu_cache(vma, vmf->address, vmf->pte);
- page = vm_normal_page(vma, vmf->address, pte);
+ page = __vm_normal_page(vma, vmf->address, pte, false, vmf->vma_flags);
if (!page) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
return 0;
--
2.21.0
* [PATCH v12 27/31] mm: add speculative page fault vmstats
From: Laurent Dufour @ 2019-04-16 13:45 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
Add a speculative_pgfault vmstat counter to count successfully handled
speculative page faults.
Also fix a minor typo in a comment in mm/vmstat.c.
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
include/linux/vm_event_item.h | 3 +++
mm/memory.c | 3 +++
mm/vmstat.c | 5 ++++-
3 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 47a3441cf4c4..137666e91074 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -109,6 +109,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#ifdef CONFIG_SWAP
SWAP_RA,
SWAP_RA_HIT,
+#endif
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ SPECULATIVE_PGFAULT,
#endif
NR_VM_EVENT_ITEMS
};
diff --git a/mm/memory.c b/mm/memory.c
index 509851ad7c95..c65e8011d285 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4367,6 +4367,9 @@ vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
put_vma(vma);
+ if (ret != VM_FAULT_RETRY)
+ count_vm_event(SPECULATIVE_PGFAULT);
+
/*
* The task may have entered a memcg OOM situation but
* if the allocation error was handled gracefully (no
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a7d493366a65..93f54b31e150 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1288,7 +1288,10 @@ const char * const vmstat_text[] = {
"swap_ra",
"swap_ra_hit",
#endif
-#endif /* CONFIG_VM_EVENTS_COUNTERS */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ "speculative_pgfault",
+#endif
+#endif /* CONFIG_VM_EVENT_COUNTERS */
};
#endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA */
--
2.21.0
* [PATCH v12 13/31] mm: cache some VMA fields in the vm_fault structure
From: Laurent Dufour @ 2019-04-16 13:45 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
When handling a speculative page fault, the vma->vm_flags and
vma->vm_page_prot fields are read once the page table lock is released, so
there is no longer any guarantee that these fields will not change behind
our back. They are therefore saved in the vm_fault structure before the VMA
is checked for changes.
In detail, when we deal with a speculative page fault, the mmap_sem is not
taken, so parallel changes to the VMA can occur. When a VMA change that
impacts the page fault processing is made, we assume that the VMA sequence
counter is changed. During page fault processing, at the time the PTE is
locked, we check the VMA sequence counter to detect changes made behind our
back. If no change is detected, we can continue further. But this doesn't
prevent the VMA from being changed behind our back while the PTE is locked.
So the VMA fields which are used while the PTE is locked must be saved to
ensure that we are using *static* values. This is important since the PTE
changes are made with regard to these VMA fields, and they need to be
consistent. This concerns the vma->vm_flags and vma->vm_page_prot VMA
fields.
This patch also sets the fields in hugetlb_no_page() and
__collapse_huge_page_swapin(), even though the callees do not need them.
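The snapshot idea described above can be sketched in user space as follows. This is only a minimal model under stated assumptions: the demo_ types, the flag and PTE bit values, and the use of plain integers for the protection bits are all hypothetical, and the real kernel code works on struct vm_fault, pte_t and pgprot_t.

```c
#include <assert.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define DEMO_VM_WRITE  0x2UL
#define DEMO_PTE_WRITE 0x1UL

/* Reduced stand-ins for struct vm_area_struct and struct vm_fault. */
struct demo_vma {
	unsigned long vm_flags;
	unsigned long vm_page_prot;
};

struct demo_vm_fault {
	struct demo_vma *vma;
	unsigned long vma_flags;	/* snapshot of vma->vm_flags */
	unsigned long vma_page_prot;	/* snapshot of vma->vm_page_prot */
};

/* Copy the fields once, before the VMA is checked for changes. */
void demo_fault_init(struct demo_vm_fault *vmf, struct demo_vma *vma)
{
	vmf->vma = vma;
	vmf->vma_flags = vma->vm_flags;
	vmf->vma_page_prot = vma->vm_page_prot;
}

/* Like maybe_mkwrite() after this patch: takes flags, not a VMA. */
unsigned long demo_maybe_mkwrite(unsigned long pte, unsigned long vma_flags)
{
	if (vma_flags & DEMO_VM_WRITE)
		pte |= DEMO_PTE_WRITE;
	return pte;
}
```

Passing vmf->vma_flags to the helper means that the PTE is built from the same values throughout the fault, even if the VMA itself is modified in the meantime.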
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
include/linux/mm.h | 10 +++++++--
mm/huge_memory.c | 6 +++---
mm/hugetlb.c | 2 ++
mm/khugepaged.c | 2 ++
mm/memory.c | 53 ++++++++++++++++++++++++----------------------
mm/migrate.c | 2 +-
6 files changed, 44 insertions(+), 31 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5d45b7d8718d..f465bb2b049e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -439,6 +439,12 @@ struct vm_fault {
* page table to avoid allocation from
* atomic context.
*/
+ /*
+ * These entries are required when handling speculative page fault.
+ * This way the page handling is done using consistent field values.
+ */
+ unsigned long vma_flags;
+ pgprot_t vma_page_prot;
};
/* page entry size for vm->huge_fault() */
@@ -781,9 +787,9 @@ void free_compound_page(struct page *page);
* pte_mkwrite. But get_user_pages can cause write faults for mappings
* that do not have writing enabled, when used by access_process_vm.
*/
-static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+static inline pte_t maybe_mkwrite(pte_t pte, unsigned long vma_flags)
{
- if (likely(vma->vm_flags & VM_WRITE))
+ if (likely(vma_flags & VM_WRITE))
pte = pte_mkwrite(pte);
return pte;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 823688414d27..865886a689ee 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1244,8 +1244,8 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
pte_t entry;
- entry = mk_pte(pages[i], vma->vm_page_prot);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ entry = mk_pte(pages[i], vmf->vma_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
@@ -2228,7 +2228,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
entry = pte_swp_mksoft_dirty(entry);
} else {
entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
- entry = maybe_mkwrite(entry, vma);
+ entry = maybe_mkwrite(entry, vma->vm_flags);
if (!write)
entry = pte_wrprotect(entry);
if (!young)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 109f5de82910..13246da4bc50 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3812,6 +3812,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
.vma = vma,
.address = haddr,
.flags = flags,
+ .vma_flags = vma->vm_flags,
+ .vma_page_prot = vma->vm_page_prot,
/*
* Hard to debug if it ends up being
* used by a callee that assumes
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6a0cbca3885e..42469037240a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -888,6 +888,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
.flags = FAULT_FLAG_ALLOW_RETRY,
.pmd = pmd,
.pgoff = linear_page_index(vma, address),
+ .vma_flags = vma->vm_flags,
+ .vma_page_prot = vma->vm_page_prot,
};
/* we only decide to swapin, if there is enough young ptes */
diff --git a/mm/memory.c b/mm/memory.c
index 2cf7b6185daa..d0de58464479 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1560,7 +1560,8 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
goto out_unlock;
}
entry = pte_mkyoung(*pte);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ entry = maybe_mkwrite(pte_mkdirty(entry),
+ vma->vm_flags);
if (ptep_set_access_flags(vma, addr, pte, entry, 1))
update_mmu_cache(vma, addr, pte);
}
@@ -1575,7 +1576,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
if (mkwrite) {
entry = pte_mkyoung(entry);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma->vm_flags);
}
set_pte_at(mm, addr, pte, entry);
@@ -2257,7 +2258,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
entry = pte_mkyoung(vmf->orig_pte);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
update_mmu_cache(vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2335,8 +2336,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
inc_mm_counter_fast(mm, MM_ANONPAGES);
}
flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
- entry = mk_pte(new_page, vma->vm_page_prot);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ entry = mk_pte(new_page, vmf->vma_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
/*
* Clear the pte entry and flush it first, before updating the
* pte with the new entry. This will avoid a race condition
@@ -2401,7 +2402,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
* Don't let another task, with possibly unlocked vma,
* keep the mlocked page.
*/
- if (page_copied && (vma->vm_flags & VM_LOCKED)) {
+ if (page_copied && (vmf->vma_flags & VM_LOCKED)) {
lock_page(old_page); /* LRU manipulation */
if (PageMlocked(old_page))
munlock_vma_page(old_page);
@@ -2438,7 +2439,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
*/
vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf)
{
- WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
+ WARN_ON_ONCE(!(vmf->vma_flags & VM_SHARED));
if (!pte_map_lock(vmf))
return VM_FAULT_RETRY;
/*
@@ -2540,7 +2541,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
* We should not cow pages in a shared writeable mapping.
* Just mark the pages writable and/or call ops->pfn_mkwrite.
*/
- if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+ if ((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))
return wp_pfn_shared(vmf);
@@ -2599,7 +2600,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
return VM_FAULT_WRITE;
}
unlock_page(vmf->page);
- } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+ } else if (unlikely((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
return wp_page_shared(vmf);
}
@@ -2878,9 +2879,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
- pte = mk_pte(page, vma->vm_page_prot);
+ pte = mk_pte(page, vmf->vma_page_prot);
if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
- pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+ pte = maybe_mkwrite(pte_mkdirty(pte), vmf->vma_flags);
vmf->flags &= ~FAULT_FLAG_WRITE;
ret |= VM_FAULT_WRITE;
exclusive = RMAP_EXCLUSIVE;
@@ -2905,7 +2906,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
swap_free(entry);
if (mem_cgroup_swap_full(page) ||
- (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
+ (vmf->vma_flags & VM_LOCKED) || PageMlocked(page))
try_to_free_swap(page);
unlock_page(page);
if (page != swapcache && swapcache) {
@@ -2963,7 +2964,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
pte_t entry;
/* File mapping without ->vm_ops ? */
- if (vma->vm_flags & VM_SHARED)
+ if (vmf->vma_flags & VM_SHARED)
return VM_FAULT_SIGBUS;
/*
@@ -2987,7 +2988,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (!(vmf->flags & FAULT_FLAG_WRITE) &&
!mm_forbids_zeropage(vma->vm_mm)) {
entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
- vma->vm_page_prot));
+ vmf->vma_page_prot));
if (!pte_map_lock(vmf))
return VM_FAULT_RETRY;
if (!pte_none(*vmf->pte))
@@ -3021,8 +3022,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
*/
__SetPageUptodate(page);
- entry = mk_pte(page, vma->vm_page_prot);
- if (vma->vm_flags & VM_WRITE)
+ entry = mk_pte(page, vmf->vma_page_prot);
+ if (vmf->vma_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));
if (!pte_map_lock(vmf)) {
@@ -3242,7 +3243,7 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
for (i = 0; i < HPAGE_PMD_NR; i++)
flush_icache_page(vma, page + i);
- entry = mk_huge_pmd(page, vma->vm_page_prot);
+ entry = mk_huge_pmd(page, vmf->vma_page_prot);
if (write)
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -3318,11 +3319,11 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
return VM_FAULT_NOPAGE;
flush_icache_page(vma, page);
- entry = mk_pte(page, vma->vm_page_prot);
+ entry = mk_pte(page, vmf->vma_page_prot);
if (write)
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
/* copy-on-write page */
- if (write && !(vma->vm_flags & VM_SHARED)) {
+ if (write && !(vmf->vma_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
@@ -3362,7 +3363,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
/* Did we COW the page? */
if ((vmf->flags & FAULT_FLAG_WRITE) &&
- !(vmf->vma->vm_flags & VM_SHARED))
+ !(vmf->vma_flags & VM_SHARED))
page = vmf->cow_page;
else
page = vmf->page;
@@ -3641,7 +3642,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
}
} else if (!(vmf->flags & FAULT_FLAG_WRITE))
ret = do_read_fault(vmf);
- else if (!(vma->vm_flags & VM_SHARED))
+ else if (!(vmf->vma_flags & VM_SHARED))
ret = do_cow_fault(vmf);
else
ret = do_shared_fault(vmf);
@@ -3698,7 +3699,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
* accessible ptes, some can allow access by kernel mode.
*/
old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
- pte = pte_modify(old_pte, vma->vm_page_prot);
+ pte = pte_modify(old_pte, vmf->vma_page_prot);
pte = pte_mkyoung(pte);
if (was_writable)
pte = pte_mkwrite(pte);
@@ -3732,7 +3733,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
* Flag if the page is shared between multiple address spaces. This
* is later used when determining whether to group tasks together
*/
- if (page_mapcount(page) > 1 && (vma->vm_flags & VM_SHARED))
+ if (page_mapcount(page) > 1 && (vmf->vma_flags & VM_SHARED))
flags |= TNF_SHARED;
last_cpupid = page_cpupid_last(page);
@@ -3777,7 +3778,7 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
/* COW handled on pte level: split pmd */
- VM_BUG_ON_VMA(vmf->vma->vm_flags & VM_SHARED, vmf->vma);
+ VM_BUG_ON_VMA(vmf->vma_flags & VM_SHARED, vmf->vma);
__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
return VM_FAULT_FALLBACK;
@@ -3924,6 +3925,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
.flags = flags,
.pgoff = linear_page_index(vma, address),
.gfp_mask = __get_fault_gfp_mask(vma),
+ .vma_flags = vma->vm_flags,
+ .vma_page_prot = vma->vm_page_prot,
};
unsigned int dirty = flags & FAULT_FLAG_WRITE;
struct mm_struct *mm = vma->vm_mm;
diff --git a/mm/migrate.c b/mm/migrate.c
index f2ecc2855a12..a9138093a8e2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -240,7 +240,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
*/
entry = pte_to_swp_entry(*pvmw.pte);
if (is_write_migration_entry(entry))
- pte = maybe_mkwrite(pte, vma);
+ pte = maybe_mkwrite(pte, vma->vm_flags);
if (unlikely(is_zone_device_page(new))) {
if (is_device_private_page(new)) {
--
2.21.0
* [PATCH v12 21/31] mm: Introduce find_vma_rcu()
From: Laurent Dufour @ 2019-04-16 13:45 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
This allows searching for a VMA structure without holding the mmap_sem.
The search is repeated while the mm seqlock is changing and until we find
a valid VMA.
While under RCU protection, a reference is taken on the VMA, so the
caller must call put_vma() once it no longer needs the VMA structure.
At the time a VMA is inserted in the MM RB tree, in vma_rb_insert(), a
reference is taken on the VMA by calling get_vma().
When removing a VMA from the MM RB tree, the VMA is not released
immediately but at the end of the RCU grace period through vm_rcu_put().
This ensures that the VMA remains allocated until the end of the RCU grace
period.
Since the vm_file pointer, if valid, is released in put_vma(), there is no
guarantee that the file pointer will be valid on the returned VMA.
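The lookup-and-retry scheme described above can be modelled in user space with C11 atomics. This is a sketch under stated assumptions, not the kernel implementation: the demo_ names are hypothetical, a sorted array stands in for the RB tree, RCU is not modelled at all, and demo_put() only decrements the counter instead of freeing the object.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the mm sequence counter. */
struct demo_seq {
	atomic_uint sequence;
};

/* Hypothetical stand-in for a reference-counted VMA. */
struct demo_obj {
	atomic_int ref_count;
	unsigned long start, end;
};

unsigned int demo_read_seqbegin(struct demo_seq *s)
{
	unsigned int seq;

	/* An odd value means a writer is active: wait until it is done. */
	while ((seq = atomic_load_explicit(&s->sequence,
					   memory_order_acquire)) & 1)
		;
	return seq;
}

int demo_read_seqretry(struct demo_seq *s, unsigned int start)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&s->sequence,
				    memory_order_relaxed) != start;
}

void demo_get(struct demo_obj *o) { atomic_fetch_add(&o->ref_count, 1); }
void demo_put(struct demo_obj *o) { atomic_fetch_sub(&o->ref_count, 1); }

/* Like find_vma(): first entry whose end is above addr, or NULL. */
struct demo_obj *demo_lookup(struct demo_obj *table, size_t n,
			     unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < table[i].end)
			return &table[i];
	return NULL;
}

/* Retry the lookup until no writer interfered; return with a reference. */
struct demo_obj *demo_find_ref(struct demo_seq *s, struct demo_obj *table,
			       size_t n, unsigned long addr)
{
	struct demo_obj *obj = NULL;
	unsigned int seq;

	do {
		if (obj)
			demo_put(obj);	/* drop the ref of a raced attempt */
		seq = demo_read_seqbegin(s);
		obj = demo_lookup(table, n, addr);
		if (obj)
			demo_get(obj);
	} while (demo_read_seqretry(s, seq));

	return obj;
}
```

As in the description, the caller owns one reference on the returned object and is expected to drop it with the put helper once it is done.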
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
include/linux/mm_types.h | 1 +
mm/internal.h | 5 ++-
mm/mmap.c | 76 ++++++++++++++++++++++++++++++++++++++--
3 files changed, 78 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6a6159e11a3f..9af6694cb95d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -287,6 +287,7 @@ struct vm_area_struct {
#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
atomic_t vm_ref_count;
+ struct rcu_head vm_rcu;
#endif
struct rb_node vm_rb;
diff --git a/mm/internal.h b/mm/internal.h
index 302382bed406..1e368e4afe3c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -55,7 +55,10 @@ static inline void put_vma(struct vm_area_struct *vma)
__free_vma(vma);
}
-#else
+extern struct vm_area_struct *find_vma_rcu(struct mm_struct *mm,
+ unsigned long addr);
+
+#else /* CONFIG_SPECULATIVE_PAGE_FAULT */
static inline void get_vma(struct vm_area_struct *vma)
{
diff --git a/mm/mmap.c b/mm/mmap.c
index c106440dcae7..34bf261dc2c8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -179,6 +179,18 @@ static inline void mm_write_sequnlock(struct mm_struct *mm)
{
write_sequnlock(&mm->mm_seq);
}
+
+static void __vm_rcu_put(struct rcu_head *head)
+{
+ struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
+ vm_rcu);
+ put_vma(vma);
+}
+static void vm_rcu_put(struct vm_area_struct *vma)
+{
+ VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);
+ call_rcu(&vma->vm_rcu, __vm_rcu_put);
+}
#else
static inline void mm_write_seqlock(struct mm_struct *mm)
{
@@ -190,6 +202,8 @@ static inline void mm_write_sequnlock(struct mm_struct *mm)
void __free_vma(struct vm_area_struct *vma)
{
+ if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT))
+ VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);
mpol_put(vma_policy(vma));
vm_area_free(vma);
}
@@ -197,11 +211,24 @@ void __free_vma(struct vm_area_struct *vma)
/*
* Close a vm structure and free it, returning the next.
*/
-static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
+static struct vm_area_struct *__remove_vma(struct vm_area_struct *vma)
{
struct vm_area_struct *next = vma->vm_next;
might_sleep();
+ if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT) &&
+ !RB_EMPTY_NODE(&vma->vm_rb)) {
+ /*
+ * If the VMA is still linked in the RB tree, we must release
+ * that reference by calling put_vma().
+ * This should only happen when called from exit_mmap().
+ * We forcibly clear the node to satisfy the check in
+ * __free_vma(). This is safe since the RB tree is not walked
+ * anymore.
+ */
+ RB_CLEAR_NODE(&vma->vm_rb);
+ put_vma(vma);
+ }
if (vma->vm_ops && vma->vm_ops->close)
vma->vm_ops->close(vma);
if (vma->vm_file)
@@ -211,6 +238,13 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
return next;
}
+static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
+{
+ if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT))
+ VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);
+ return __remove_vma(vma);
+}
+
static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long flags,
struct list_head *uf);
SYSCALL_DEFINE1(brk, unsigned long, brk)
@@ -475,7 +509,7 @@ static inline void vma_rb_insert(struct vm_area_struct *vma,
/* All rb_subtree_gap values must be consistent prior to insertion */
validate_mm_rb(root, NULL);
-
+ get_vma(vma);
rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
}
@@ -491,6 +525,14 @@ static void __vma_rb_erase(struct vm_area_struct *vma, struct mm_struct *mm)
mm_write_seqlock(mm);
rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
mm_write_sequnlock(mm); /* wmb */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ /*
+ * Ensure the removal is complete before clearing the node.
+ * Matched by vma_has_changed()/handle_speculative_fault().
+ */
+ RB_CLEAR_NODE(&vma->vm_rb);
+ vm_rcu_put(vma);
+#endif
}
static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
@@ -2331,6 +2373,34 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
EXPORT_SYMBOL(find_vma);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+/*
+ * Like find_vma() but under the protection of RCU and the mm sequence counter.
+ * The vma returned has to be released by the caller through the call to
+ * put_vma()
+ */
+struct vm_area_struct *find_vma_rcu(struct mm_struct *mm, unsigned long addr)
+{
+ struct vm_area_struct *vma = NULL;
+ unsigned int seq;
+
+ do {
+ if (vma)
+ put_vma(vma);
+
+ seq = read_seqbegin(&mm->mm_seq);
+
+ rcu_read_lock();
+ vma = find_vma(mm, addr);
+ if (vma)
+ get_vma(vma);
+ rcu_read_unlock();
+ } while (read_seqretry(&mm->mm_seq, seq));
+
+ return vma;
+}
+#endif
+
/*
* Same as find_vma, but also return a pointer to the previous VMA in *pprev.
*/
@@ -3231,7 +3301,7 @@ void exit_mmap(struct mm_struct *mm)
while (vma) {
if (vma->vm_flags & VM_ACCOUNT)
nr_accounted += vma_pages(vma);
- vma = remove_vma(vma);
+ vma = __remove_vma(vma);
}
vm_unacct_memory(nr_accounted);
}
--
2.21.0
* [PATCH v12 01/31] mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
From: Laurent Dufour @ 2019-04-16 13:44 UTC (permalink / raw)
To: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
aneesh.kumar, benh, mpe, paulus, Thomas Gleixner, Ingo Molnar,
hpa, Will Deacon, Sergey Senozhatsky, sergey.senozhatsky.work,
Andrea Arcangeli, Alexei Starovoitov, kemi.wang, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi, zhong jiang, Haiyan Song,
Balbir Singh, sj38.park, Michel Lespinasse, Mike Rapoport
Cc: linuxppc-dev, x86, linux-kernel, npiggin, linux-mm, paulmck,
Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
This configuration variable will be used to build the code needed to
handle speculative page faults.
It is turned off by default and is activated depending on architecture
support, ARCH_HAS_PTE_SPECIAL, SMP and MMU.
The architecture support is needed since the speculative page fault handler
is called from the architecture's page faulting code, and some code has to
be added there to call the speculative handler.
The dependency on ARCH_HAS_PTE_SPECIAL is required because vm_normal_page()
does processing that is not compatible with the speculative handling when
ARCH_HAS_PTE_SPECIAL is not set.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
mm/Kconfig | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/mm/Kconfig b/mm/Kconfig
index 0eada3f818fa..ff278ac9978a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -761,4 +761,26 @@ config GUP_BENCHMARK
config ARCH_HAS_PTE_SPECIAL
bool
+config ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
+ def_bool n
+
+config SPECULATIVE_PAGE_FAULT
+ bool "Speculative page faults"
+ default y
+ depends on ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
+ depends on ARCH_HAS_PTE_SPECIAL && MMU && SMP
+ help
+ Try to handle user space page faults without holding the mmap_sem.
+
+ This should allow better concurrency for massively threaded processes
+ since the page fault handler will not wait for other threads' memory
+ layout changes to be done, assuming that these changes are done in
+ another part of the process's memory space. This type of page fault
+ is named speculative page fault.
+
+ If the speculative page fault fails because a concurrent modification
+ is detected or because underlying PMD or PTE tables are not yet
+ allocated, the speculative page fault fails and a classic page fault
+ is then tried.
+
endmenu
--
2.21.0
* Re: [PATCH v2 00/21] Convert hwmon documentation to ReST
From: Guenter Roeck @ 2019-04-16 20:31 UTC (permalink / raw)
To: Jonathan Corbet
Cc: linux-hwmon, Jean Delvare, linux-aspeed, Linux Doc Mailing List,
Andrew Jeffery, Sudeep Holla, Liviu Dudau, linux-kernel,
Mauro Carvalho Chehab, Lorenzo Pieralisi, Paul Mackerras,
Joel Stanley, Mauro Carvalho Chehab, linuxppc-dev,
linux-arm-kernel
In-Reply-To: <20190416141949.09b48789@lwn.net>
On Tue, Apr 16, 2019 at 02:19:49PM -0600, Jonathan Corbet wrote:
> On Fri, 12 Apr 2019 20:09:16 -0700
> Guenter Roeck <linux@roeck-us.net> wrote:
>
> > The big real-world question is: Is the series good enough for you to accept,
> > or do you expect some level of user/kernel separation ?
>
> I guess it can go in; it's forward progress, even if it doesn't make the
> improvements I would like to see.
>
> The real question, I guess, is who should take it. I've been seeing a
> fair amount of activity on hwmon, so I suspect that the potential for
> conflicts is real. Perhaps things would go smoother if it went through
> your tree?
>
We'll see a number of conflicts, yes. In terms of timing, this is probably
the worst release in the last few years to make such a change. I currently
have 9 patches queued in hwmon-next which touch Documentation/hwmon.
Of course the changes made in those are all not ReST compatible, and I have
no idea what to look out for to make it compatible. So this is going to be
fun (in a negative sense) either way.
I don't really have a recommendation at this point; I think the best I could
do is to take the patches which don't generate conflicts and leave the rest
alone. But that would also be bad, since the new index file would not match
reality. No idea, really, what the best or even a useful approach would be.
Maybe automated changes like this (assuming they are indeed automated)
can be generated and pushed right after a commit window closes. Would
that by any chance be possible?
Guenter
^ permalink raw reply
* [PATCH v3 10/26] compat_ioctl: use correct compat_ptr() translation in drivers
From: Arnd Bergmann @ 2019-04-16 20:19 UTC (permalink / raw)
To: Alexander Viro
Cc: Shen Jing, Arnd Bergmann, linux-scsi, y2038, Frank Haverkamp,
Kashyap Desai, Felipe Balbi, Jerry Zhang, James E.J. Bottomley,
linux-fsdevel, Vincent Pelletier, megaraidlinux.pdl, Felipe Balbi,
Martin K. Petersen, Shivasharan S, Greg Kroah-Hartman, linux-usb,
linux-kernel, Andrzej Pietrasiewicz, Sumit Saxena,
Andrew Donnellan, Frederic Barrat, linuxppc-dev
In-Reply-To: <20190416202013.4034148-1-arnd@arndb.de>
A handful of drivers all have a trivial wrapper around their ioctl
handler, but don't call the compat_ptr() conversion function at the
moment. In practice this does not matter, since none of them are used
on the s390 architecture and for all other architectures, compat_ptr()
does not do anything, but using the new compat_ptr_ioctl()
helper makes it more correct in theory, and simplifies the code.
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Acked-by: Felipe Balbi <felipe.balbi@linux.intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
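To illustrate why the translation matters only on s390, here is a small
userspace model of the idea. It is a sketch under assumptions, not the
kernel implementation: struct model_file, struct model_file_operations,
model_compat_ptr(), model_compat_ptr_ioctl() and demo_translate() are
hypothetical stand-ins, and the mask models the s390 behaviour of clearing
the top bit of a 31-bit user pointer.

```c
#include <assert.h>
#include <stddef.h>

#define ENOIOCTLCMD 515	/* same value as the kernel-internal errno */

struct model_file;

/* Hypothetical stand-in for the kernel's struct file_operations. */
struct model_file_operations {
	long (*unlocked_ioctl)(struct model_file *, unsigned int,
			       unsigned long);
};

/* Hypothetical stand-in for struct file; last_arg records what the
 * native handler actually received. */
struct model_file {
	const struct model_file_operations *f_op;
	unsigned long last_arg;
};

/* Model of s390 compat_ptr(): keep only the low 31 address bits. */
unsigned long model_compat_ptr(unsigned long uptr)
{
	return uptr & 0x7fffffffUL;
}

/* Translate the pointer argument, then reuse the native handler. */
long model_compat_ptr_ioctl(struct model_file *file, unsigned int cmd,
			    unsigned long arg)
{
	if (!file->f_op->unlocked_ioctl)
		return -ENOIOCTLCMD;

	return file->f_op->unlocked_ioctl(file, cmd, model_compat_ptr(arg));
}

/* Native handler used by the demo: remembers the argument it saw. */
long demo_ioctl(struct model_file *file, unsigned int cmd,
		unsigned long arg)
{
	(void)cmd;
	file->last_arg = arg;
	return 0;
}

/* Returns the argument as seen by the native handler. */
unsigned long demo_translate(unsigned long arg)
{
	struct model_file_operations fops = { demo_ioctl };
	struct model_file file = { &fops, 0 };
	long ret = model_compat_ptr_ioctl(&file, 0, arg);

	assert(ret == 0);
	return file.last_arg;
}
```

The removed wrappers correspond to calling demo_ioctl() directly with the
untranslated value; in this model the two differ only when the top bit of
the 32-bit argument is set.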
drivers/misc/cxl/flash.c | 8 +-------
drivers/misc/genwqe/card_dev.c | 23 +----------------------
drivers/scsi/megaraid/megaraid_mm.c | 28 +---------------------------
drivers/usb/gadget/function/f_fs.c | 12 +-----------
4 files changed, 4 insertions(+), 67 deletions(-)
diff --git a/drivers/misc/cxl/flash.c b/drivers/misc/cxl/flash.c
index 4d6836f19489..cb9cca35a226 100644
--- a/drivers/misc/cxl/flash.c
+++ b/drivers/misc/cxl/flash.c
@@ -473,12 +473,6 @@ static long device_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
return -EINVAL;
}
-static long device_compat_ioctl(struct file *file, unsigned int cmd,
- unsigned long arg)
-{
- return device_ioctl(file, cmd, arg);
-}
-
static int device_close(struct inode *inode, struct file *file)
{
struct cxl *adapter = file->private_data;
@@ -514,7 +508,7 @@ static const struct file_operations fops = {
.owner = THIS_MODULE,
.open = device_open,
.unlocked_ioctl = device_ioctl,
- .compat_ioctl = device_compat_ioctl,
+ .compat_ioctl = compat_ptr_ioctl,
.release = device_close,
};
diff --git a/drivers/misc/genwqe/card_dev.c b/drivers/misc/genwqe/card_dev.c
index 8c1b63a4337b..5de0796f2786 100644
--- a/drivers/misc/genwqe/card_dev.c
+++ b/drivers/misc/genwqe/card_dev.c
@@ -1221,34 +1221,13 @@ static long genwqe_ioctl(struct file *filp, unsigned int cmd,
return rc;
}
-#if defined(CONFIG_COMPAT)
-/**
- * genwqe_compat_ioctl() - Compatibility ioctl
- *
- * Called whenever a 32-bit process running under a 64-bit kernel
- * performs an ioctl on /dev/genwqe<n>_card.
- *
- * @filp: file pointer.
- * @cmd: command.
- * @arg: user argument.
- * Return: zero on success or negative number on failure.
- */
-static long genwqe_compat_ioctl(struct file *filp, unsigned int cmd,
- unsigned long arg)
-{
- return genwqe_ioctl(filp, cmd, arg);
-}
-#endif /* defined(CONFIG_COMPAT) */
-
static const struct file_operations genwqe_fops = {
.owner = THIS_MODULE,
.open = genwqe_open,
.fasync = genwqe_fasync,
.mmap = genwqe_mmap,
.unlocked_ioctl = genwqe_ioctl,
-#if defined(CONFIG_COMPAT)
- .compat_ioctl = genwqe_compat_ioctl,
-#endif
+ .compat_ioctl = compat_ptr_ioctl,
.release = genwqe_release,
};
diff --git a/drivers/scsi/megaraid/megaraid_mm.c b/drivers/scsi/megaraid/megaraid_mm.c
index 3ce837e4b24c..21ee5751c04e 100644
--- a/drivers/scsi/megaraid/megaraid_mm.c
+++ b/drivers/scsi/megaraid/megaraid_mm.c
@@ -45,10 +45,6 @@ static int mraid_mm_setup_dma_pools(mraid_mmadp_t *);
static void mraid_mm_free_adp_resources(mraid_mmadp_t *);
static void mraid_mm_teardown_dma_pools(mraid_mmadp_t *);
-#ifdef CONFIG_COMPAT
-static long mraid_mm_compat_ioctl(struct file *, unsigned int, unsigned long);
-#endif
-
MODULE_AUTHOR("LSI Logic Corporation");
MODULE_DESCRIPTION("LSI Logic Management Module");
MODULE_LICENSE("GPL");
@@ -72,9 +68,7 @@ static wait_queue_head_t wait_q;
static const struct file_operations lsi_fops = {
.open = mraid_mm_open,
.unlocked_ioctl = mraid_mm_unlocked_ioctl,
-#ifdef CONFIG_COMPAT
- .compat_ioctl = mraid_mm_compat_ioctl,
-#endif
+ .compat_ioctl = compat_ptr_ioctl,
.owner = THIS_MODULE,
.llseek = noop_llseek,
};
@@ -228,7 +222,6 @@ mraid_mm_unlocked_ioctl(struct file *filep, unsigned int cmd,
{
int err;
- /* inconsistent: mraid_mm_compat_ioctl doesn't take the BKL */
mutex_lock(&mraid_mm_mutex);
err = mraid_mm_ioctl(filep, cmd, arg);
mutex_unlock(&mraid_mm_mutex);
@@ -1232,25 +1225,6 @@ mraid_mm_init(void)
}
-#ifdef CONFIG_COMPAT
-/**
- * mraid_mm_compat_ioctl - 32bit to 64bit ioctl conversion routine
- * @filep : file operations pointer (ignored)
- * @cmd : ioctl command
- * @arg : user ioctl packet
- */
-static long
-mraid_mm_compat_ioctl(struct file *filep, unsigned int cmd,
- unsigned long arg)
-{
- int err;
-
- err = mraid_mm_ioctl(filep, cmd, arg);
-
- return err;
-}
-#endif
-
/**
* mraid_mm_exit - Module exit point
*/
diff --git a/drivers/usb/gadget/function/f_fs.c b/drivers/usb/gadget/function/f_fs.c
index 20413c276c61..addc210d198a 100644
--- a/drivers/usb/gadget/function/f_fs.c
+++ b/drivers/usb/gadget/function/f_fs.c
@@ -1347,14 +1347,6 @@ static long ffs_epfile_ioctl(struct file *file, unsigned code,
return ret;
}
-#ifdef CONFIG_COMPAT
-static long ffs_epfile_compat_ioctl(struct file *file, unsigned code,
- unsigned long value)
-{
- return ffs_epfile_ioctl(file, code, value);
-}
-#endif
-
static const struct file_operations ffs_epfile_operations = {
.llseek = no_llseek,
@@ -1363,9 +1355,7 @@ static const struct file_operations ffs_epfile_operations = {
.read_iter = ffs_epfile_read_iter,
.release = ffs_epfile_release,
.unlocked_ioctl = ffs_epfile_ioctl,
-#ifdef CONFIG_COMPAT
- .compat_ioctl = ffs_epfile_compat_ioctl,
-#endif
+ .compat_ioctl = compat_ptr_ioctl,
};
--
2.20.0
^ permalink raw reply related
* [PATCH v3 00/26] compat_ioctl: cleanups
From: Arnd Bergmann @ 2019-04-16 20:19 UTC (permalink / raw)
To: Alexander Viro
Cc: linux-fbdev, linux-iio, linux-remoteproc, alsa-devel, dri-devel,
platform-driver-x86, linux-ide, linux-mtd, sparclinux,
linux1394-devel, devel, linux-s390, linux-scsi, linux-bluetooth,
y2038, qat-linux, amd-gfx, linux-input, Marcel Holtmann,
linux-media, linux-rtc, Arnd Bergmann, James E.J. Bottomley,
linux-nvme, ceph-devel, linux-arm-kernel, Karsten Keil,
Martin K. Petersen, Greg Kroah-Hartman, linux-usb, linux-wireless,
linux-kernel, linux-rdma, linux-crypto, netdev, linux-fsdevel,
linux-integrity, linuxppc-dev, David S. Miller, linux-btrfs,
linux-ppp
Hi Al,
It took me way longer than I had hoped to revisit this series, see
https://lore.kernel.org/lkml/20180912150142.157913-1-arnd@arndb.de/
for the previously posted version.
I've come to the point where all conversion handlers and most
COMPATIBLE_IOCTL() entries are gone from this file, but for
now, this series only has the parts that have either been reviewed
previously, or that are simple enough to include.
The main missing piece is the SG_IO/SG_GET_REQUEST_TABLE conversion.
I'll post the patches I made for that later, as they need more
testing and review from the scsi maintainers.
I hope you can still take these for the coming merge window, unless
new problems come up.
Arnd
Arnd Bergmann (26):
compat_ioctl: pppoe: fix PPPOEIOCSFWD handling
compat_ioctl: move simple ppp command handling into driver
compat_ioctl: avoid unused function warning for do_ioctl
compat_ioctl: move PPPIOCSCOMPRESS32 to ppp-generic.c
compat_ioctl: move PPPIOCSPASS32/PPPIOCSACTIVE32 to ppp_generic.c
compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
compat_ioctl: move rtc handling into rtc-dev.c
compat_ioctl: add compat_ptr_ioctl()
compat_ioctl: move drivers to compat_ptr_ioctl
compat_ioctl: use correct compat_ptr() translation in drivers
ceph: fix compat_ioctl for ceph_dir_operations
compat_ioctl: move more drivers to compat_ptr_ioctl
compat_ioctl: move tape handling into drivers
compat_ioctl: move ATYFB_CLK handling to atyfb driver
compat_ioctl: move isdn/capi ioctl translation into driver
compat_ioctl: move rfcomm handlers into driver
compat_ioctl: move hci_sock handlers into driver
compat_ioctl: remove HCIUART handling
compat_ioctl: remove HIDIO translation
compat_ioctl: remove translation for sound ioctls
compat_ioctl: remove IGNORE_IOCTL()
compat_ioctl: remove /dev/random commands
compat_ioctl: remove joystick ioctl translation
compat_ioctl: remove PCI ioctl translation
compat_ioctl: remove /dev/raw ioctl translation
compat_ioctl: remove last RAID handling code
Documentation/networking/ppp_generic.txt | 2 +
arch/um/drivers/hostaudio_kern.c | 1 +
drivers/android/binder.c | 2 +-
drivers/char/ppdev.c | 12 +-
drivers/char/random.c | 1 +
drivers/char/tpm/tpm_vtpm_proxy.c | 12 +-
drivers/crypto/qat/qat_common/adf_ctl_drv.c | 2 +-
drivers/dma-buf/dma-buf.c | 4 +-
drivers/dma-buf/sw_sync.c | 2 +-
drivers/dma-buf/sync_file.c | 2 +-
drivers/firewire/core-cdev.c | 12 +-
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 2 +-
drivers/hid/hidraw.c | 4 +-
drivers/hid/usbhid/hiddev.c | 11 +-
drivers/hwtracing/stm/core.c | 12 +-
drivers/ide/ide-tape.c | 31 +-
drivers/iio/industrialio-core.c | 2 +-
drivers/infiniband/core/uverbs_main.c | 4 +-
drivers/isdn/capi/capi.c | 31 +
drivers/isdn/i4l/isdn_ppp.c | 14 +-
drivers/media/rc/lirc_dev.c | 4 +-
drivers/mfd/cros_ec_dev.c | 4 +-
drivers/misc/cxl/flash.c | 8 +-
drivers/misc/genwqe/card_dev.c | 23 +-
drivers/misc/mei/main.c | 22 +-
drivers/misc/vmw_vmci/vmci_host.c | 2 +-
drivers/mtd/ubi/cdev.c | 36 +-
drivers/net/ppp/ppp_generic.c | 99 +++-
drivers/net/ppp/pppoe.c | 7 +
drivers/net/ppp/pptp.c | 3 +
drivers/net/tap.c | 12 +-
drivers/nvdimm/bus.c | 4 +-
drivers/nvme/host/core.c | 2 +-
drivers/pci/switch/switchtec.c | 2 +-
drivers/platform/x86/wmi.c | 2 +-
drivers/rpmsg/rpmsg_char.c | 4 +-
drivers/rtc/dev.c | 13 +-
drivers/rtc/rtc-vr41xx.c | 10 +
drivers/s390/char/tape_char.c | 41 +-
drivers/sbus/char/display7seg.c | 2 +-
drivers/sbus/char/envctrl.c | 4 +-
drivers/scsi/3w-xxxx.c | 4 +-
drivers/scsi/cxlflash/main.c | 2 +-
drivers/scsi/esas2r/esas2r_main.c | 2 +-
drivers/scsi/megaraid/megaraid_mm.c | 28 +-
drivers/scsi/osst.c | 34 +-
drivers/scsi/pmcraid.c | 4 +-
drivers/scsi/st.c | 35 +-
drivers/staging/android/ion/ion.c | 4 +-
drivers/staging/pi433/pi433_if.c | 12 +-
drivers/staging/vme/devices/vme_user.c | 2 +-
drivers/tee/tee_core.c | 2 +-
drivers/usb/class/cdc-wdm.c | 2 +-
drivers/usb/class/usbtmc.c | 4 +-
drivers/usb/core/devio.c | 16 +-
drivers/usb/gadget/function/f_fs.c | 12 +-
drivers/vfio/vfio.c | 39 +-
drivers/vhost/net.c | 12 +-
drivers/vhost/scsi.c | 12 +-
drivers/vhost/test.c | 12 +-
drivers/vhost/vsock.c | 12 +-
drivers/video/fbdev/aty/atyfb_base.c | 12 +-
drivers/virt/fsl_hypervisor.c | 2 +-
fs/btrfs/super.c | 2 +-
fs/ceph/dir.c | 1 +
fs/ceph/file.c | 2 +-
fs/compat_ioctl.c | 602 +-------------------
fs/fat/file.c | 13 +-
fs/fuse/dev.c | 2 +-
fs/notify/fanotify/fanotify_user.c | 2 +-
fs/userfaultfd.c | 2 +-
include/linux/fs.h | 7 +
include/linux/if_pppox.h | 2 +
include/linux/mtio.h | 58 ++
include/uapi/linux/ppp-ioctl.h | 2 +
include/uapi/linux/ppp_defs.h | 14 +
net/bluetooth/hci_sock.c | 21 +-
net/bluetooth/rfcomm/sock.c | 14 +-
net/l2tp/l2tp_ppp.c | 3 +
net/rfkill/core.c | 2 +-
sound/core/oss/pcm_oss.c | 4 +
sound/oss/dmasound/dmasound_core.c | 2 +
82 files changed, 452 insertions(+), 1034 deletions(-)
create mode 100644 include/linux/mtio.h
--
2.20.0
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: "James E.J. Bottomley" <jejb@linux.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: devel@driverdev.osuosl.org
Cc: linux-integrity@vger.kernel.org
Cc: qat-linux@intel.com
Cc: linux-crypto@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linux1394-devel@lists.sourceforge.net
Cc: amd-gfx@lists.freedesktop.org
Cc: linux-input@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-ide@vger.kernel.org
Cc: linux-iio@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mtd@lists.infradead.org
Cc: linux-ppp@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: platform-driver-x86@vger.kernel.org
Cc: linux-remoteproc@vger.kernel.org
Cc: linux-rtc@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: linux-fbdev@vger.kernel.org
Cc: linux-btrfs@vger.kernel.org
Cc: ceph-devel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-bluetooth@vger.kernel.org
Cc: linux-wireless@vger.kernel.org
Cc: alsa-devel@alsa-project.org
Cc: y2038@lists.linaro.org
^ permalink raw reply