* [RFC PATCH V3 0/6] get_user_pages_fast for ARM and ARM64
@ 2014-03-12 13:40 Steve Capper
2014-03-12 13:40 ` [RFC PATCH V3 1/6] arm: mm: Introduce special ptes for LPAE Steve Capper
` (5 more replies)
0 siblings, 6 replies; 19+ messages in thread
From: Steve Capper @ 2014-03-12 13:40 UTC (permalink / raw)
To: linux-arm-kernel
Hello,
This RFC series implements get_user_pages_fast and __get_user_pages_fast.
These are required for Transparent HugePages to function correctly, as
a futex on a THP tail page will otherwise result in an infinite loop
(because the core fallback implementation of __get_user_pages_fast
always returns 0).
This series may also be beneficial for direct-IO heavy workloads and
certain KVM workloads.
The main changes since RFC V2 are:
* pte_special logic added to the fast_gup.
* mmu_gather in arm64 replaced entirely by core implementation.
* arm and arm64 both reference the same gup.c to prevent code
duplication (I'm not sure how much of a good idea that is).
I have tested the series using the Fast Model for ARM64 and an Arndale
Board. This series applies to 3.14-rc6 with one additional patch from
Will Deacon that is currently in linux-next:
1971188 ARM: 7985/1: mm: implement pte_accessible for faulting mappings
Again, I would really appreciate any comments and/or testers!
Cheers,
--
Steve
Catalin Marinas (1):
arm64: Convert asm/tlb.h to generic mmu_gather
Steve Capper (5):
arm: mm: Introduce special ptes for LPAE
arm: mm: Enable HAVE_RCU_TABLE_FREE logic
arm: mm: implement get_user_pages_fast
arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
arm64: mm: Activate get_user_pages_fast for arm64
arch/arm/Kconfig | 1 +
arch/arm/include/asm/pgtable-2level.h | 2 +
arch/arm/include/asm/pgtable-3level.h | 14 ++
arch/arm/include/asm/pgtable.h | 3 -
arch/arm/include/asm/tlb.h | 38 ++++-
arch/arm/mm/Makefile | 1 +
arch/arm/mm/gup.c | 299 ++++++++++++++++++++++++++++++++++
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/pgtable.h | 6 +
arch/arm64/include/asm/tlb.h | 140 +++-------------
arch/arm64/mm/Makefile | 4 +-
11 files changed, 389 insertions(+), 120 deletions(-)
create mode 100644 arch/arm/mm/gup.c
--
1.8.1.4
* [RFC PATCH V3 1/6] arm: mm: Introduce special ptes for LPAE
2014-03-12 13:40 [RFC PATCH V3 0/6] get_user_pages_fast for ARM and ARM64 Steve Capper
@ 2014-03-12 13:40 ` Steve Capper
2014-03-12 13:40 ` [RFC PATCH V3 2/6] arm: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
` (4 subsequent siblings)
5 siblings, 0 replies; 19+ messages in thread
From: Steve Capper @ 2014-03-12 13:40 UTC (permalink / raw)
To: linux-arm-kernel
We need a mechanism to tag ptes as being special; this indicates that
no attempt should be made to access the underlying struct page *
associated with the pte. This is used by the fast_gup, which has no
means of accessing the VMAs (which also carry this information)
locklessly.
The L_PTE_SPECIAL bit is already allocated for LPAE, this patch modifies
pte_special and pte_mkspecial to make use of it, and defines
__HAVE_ARCH_PTE_SPECIAL.
Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
arch/arm/include/asm/pgtable-2level.h | 2 ++
arch/arm/include/asm/pgtable-3level.h | 8 ++++++++
arch/arm/include/asm/pgtable.h | 3 ---
3 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/arch/arm/include/asm/pgtable-2level.h b/arch/arm/include/asm/pgtable-2level.h
index dfff709..26d1742 100644
--- a/arch/arm/include/asm/pgtable-2level.h
+++ b/arch/arm/include/asm/pgtable-2level.h
@@ -181,6 +181,8 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
#define pmd_addr_end(addr,end) (end)
#define set_pte_ext(ptep,pte,ext) cpu_set_pte_ext(ptep,pte,ext)
+#define pte_special(pte) (0)
+static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
/*
* We don't have huge page support for short descriptors, for the moment
diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index 85c60ad..b286ba9 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -207,6 +207,14 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
#define pte_huge(pte) (pte_val(pte) && !(pte_val(pte) & PTE_TABLE_BIT))
#define pte_mkhuge(pte) (__pte(pte_val(pte) & ~PTE_TABLE_BIT))
+#define pte_special(pte) (!!(pte_val(pte) & L_PTE_SPECIAL))
+static inline pte_t pte_mkspecial(pte_t pte)
+{
+ pte_val(pte) |= L_PTE_SPECIAL;
+ return pte;
+}
+#define __HAVE_ARCH_PTE_SPECIAL
+
#define pmd_young(pmd) (pmd_val(pmd) & PMD_SECT_AF)
#define __HAVE_ARCH_PMD_WRITE
diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index 5478e5d..664d243 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -222,7 +222,6 @@ static inline pte_t *pmd_page_vaddr(pmd_t pmd)
#define pte_dirty(pte) (pte_val(pte) & L_PTE_DIRTY)
#define pte_young(pte) (pte_val(pte) & L_PTE_YOUNG)
#define pte_exec(pte) (!(pte_val(pte) & L_PTE_XN))
-#define pte_special(pte) (0)
#define pte_valid_user(pte) \
(pte_valid(pte) && (pte_val(pte) & L_PTE_USER) && pte_young(pte))
@@ -260,8 +259,6 @@ PTE_BIT_FUNC(mkyoung, |= L_PTE_YOUNG);
PTE_BIT_FUNC(mkexec, &= ~L_PTE_XN);
PTE_BIT_FUNC(mknexec, |= L_PTE_XN);
-static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
-
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
{
const pteval_t mask = L_PTE_XN | L_PTE_RDONLY | L_PTE_USER |
--
1.8.1.4
* [RFC PATCH V3 2/6] arm: mm: Enable HAVE_RCU_TABLE_FREE logic
2014-03-12 13:40 [RFC PATCH V3 0/6] get_user_pages_fast for ARM and ARM64 Steve Capper
2014-03-12 13:40 ` [RFC PATCH V3 1/6] arm: mm: Introduce special ptes for LPAE Steve Capper
@ 2014-03-12 13:40 ` Steve Capper
2014-03-12 13:40 ` [RFC PATCH V3 3/6] arm: mm: implement get_user_pages_fast Steve Capper
` (3 subsequent siblings)
5 siblings, 0 replies; 19+ messages in thread
From: Steve Capper @ 2014-03-12 13:40 UTC (permalink / raw)
To: linux-arm-kernel
In order to implement get_user_pages_fast we need to ensure that the
page table walker is protected from page table pages being freed from
under it.
One way to achieve this is to have the walker disable interrupts and
rely on IPIs from the TLB flushing code: the IPIs cannot be serviced
while the walker has interrupts disabled, so the page table pages
cannot be freed until the walk has finished.
On some ARM platforms we have hardware broadcast TLB invalidation, so
the TLB flushing code won't necessarily send IPIs. Also, spuriously
broadcasting IPIs can hurt system performance if done too often.
This problem has already been solved on PowerPC and Sparc by batching
up page table pages belonging to address spaces with more than one
user, then scheduling an rcu_sched callback to free the pages. A CPU
that has interrupts disabled cannot pass through a quiescent state, so
the rcu_sched grace period (and hence the freeing of the page table
pages) is blocked until the walker re-enables interrupts. This logic
has also been promoted to core code and is activated when one enables
HAVE_RCU_TABLE_FREE.
This patch enables HAVE_RCU_TABLE_FREE and incorporates it into the
existing ARM TLB logic.
Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
arch/arm/Kconfig | 1 +
arch/arm/include/asm/tlb.h | 38 ++++++++++++++++++++++++++++++++++++--
2 files changed, 37 insertions(+), 2 deletions(-)
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 1594945..7d5340d 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -58,6 +58,7 @@ config ARM
select HAVE_PERF_EVENTS
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
+ select HAVE_RCU_TABLE_FREE if SMP
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UID16
diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h
index 0baf7f0..eaf7578 100644
--- a/arch/arm/include/asm/tlb.h
+++ b/arch/arm/include/asm/tlb.h
@@ -35,12 +35,39 @@
#define MMU_GATHER_BUNDLE 8
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+static inline void __tlb_remove_table(void *_table)
+{
+ free_page_and_swap_cache((struct page *)_table);
+}
+
+struct mmu_table_batch {
+ struct rcu_head rcu;
+ unsigned int nr;
+ void *tables[0];
+};
+
+#define MAX_TABLE_BATCH \
+ ((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(void *))
+
+extern void tlb_table_flush(struct mmu_gather *tlb);
+extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
+
+#define tlb_remove_entry(tlb, entry) tlb_remove_table(tlb, entry)
+#else
+#define tlb_remove_entry(tlb, entry) tlb_remove_page(tlb, entry)
+#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+
/*
* TLB handling. This allows us to remove pages from the page
* tables, and efficiently handle the TLB issues.
*/
struct mmu_gather {
struct mm_struct *mm;
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+ struct mmu_table_batch *batch;
+ unsigned int need_flush;
+#endif
unsigned int fullmm;
struct vm_area_struct *vma;
unsigned long start, end;
@@ -101,6 +128,9 @@ static inline void __tlb_alloc_page(struct mmu_gather *tlb)
static inline void tlb_flush_mmu(struct mmu_gather *tlb)
{
tlb_flush(tlb);
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+ tlb_table_flush(tlb);
+#endif
free_pages_and_swap_cache(tlb->pages, tlb->nr);
tlb->nr = 0;
if (tlb->pages == tlb->local)
@@ -119,6 +149,10 @@ tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start
tlb->pages = tlb->local;
tlb->nr = 0;
__tlb_alloc_page(tlb);
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+ tlb->batch = NULL;
+#endif
}
static inline void
@@ -195,7 +229,7 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
tlb_add_flush(tlb, addr + SZ_1M);
#endif
- tlb_remove_page(tlb, pte);
+ tlb_remove_entry(tlb, pte);
}
static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
@@ -203,7 +237,7 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
{
#ifdef CONFIG_ARM_LPAE
tlb_add_flush(tlb, addr);
- tlb_remove_page(tlb, virt_to_page(pmdp));
+ tlb_remove_entry(tlb, virt_to_page(pmdp));
#endif
}
--
1.8.1.4
* [RFC PATCH V3 3/6] arm: mm: implement get_user_pages_fast
2014-03-12 13:40 [RFC PATCH V3 0/6] get_user_pages_fast for ARM and ARM64 Steve Capper
2014-03-12 13:40 ` [RFC PATCH V3 1/6] arm: mm: Introduce special ptes for LPAE Steve Capper
2014-03-12 13:40 ` [RFC PATCH V3 2/6] arm: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
@ 2014-03-12 13:40 ` Steve Capper
2014-03-12 14:18 ` Peter Zijlstra
` (2 more replies)
2014-03-12 13:40 ` [RFC PATCH V3 4/6] arm64: Convert asm/tlb.h to generic mmu_gather Steve Capper
` (2 subsequent siblings)
5 siblings, 3 replies; 19+ messages in thread
From: Steve Capper @ 2014-03-12 13:40 UTC (permalink / raw)
To: linux-arm-kernel
An implementation of get_user_pages_fast for ARM. It is based loosely
on the PowerPC implementation. We disable interrupts in the walker to
prevent the call_rcu_sched pagetable freeing code from running under
us.
We also explicitly fire an IPI in the Transparent HugePage splitting
case to prevent splits from interfering with the fast_gup walker.
As THP splits are relatively rare, this should not have a noticeable
overhead.
Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
arch/arm/include/asm/pgtable-3level.h | 6 +
arch/arm/mm/Makefile | 1 +
arch/arm/mm/gup.c | 299 ++++++++++++++++++++++++++++++++++
3 files changed, 306 insertions(+)
create mode 100644 arch/arm/mm/gup.c
diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index b286ba9..fdc4a4f 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -226,6 +226,12 @@ static inline pte_t pte_mkspecial(pte_t pte)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define pmd_trans_huge(pmd) (pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT))
#define pmd_trans_splitting(pmd) (pmd_val(pmd) & PMD_SECT_SPLITTING)
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmdp);
+#endif
#endif
#define PMD_BIT_FUNC(fn,op) \
diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile
index 7f39ce2..a2c4e87 100644
--- a/arch/arm/mm/Makefile
+++ b/arch/arm/mm/Makefile
@@ -7,6 +7,7 @@ obj-y := dma-mapping.o extable.o fault.o init.o \
obj-$(CONFIG_MMU) += fault-armv.o flush.o idmap.o ioremap.o \
mmap.o pgd.o mmu.o
+obj-$(CONFIG_ARM_LPAE) += gup.o
ifneq ($(CONFIG_MMU),y)
obj-y += nommu.o
diff --git a/arch/arm/mm/gup.c b/arch/arm/mm/gup.c
new file mode 100644
index 0000000..715ab0d
--- /dev/null
+++ b/arch/arm/mm/gup.c
@@ -0,0 +1,299 @@
+/*
+ * arch/arm/mm/gup.c
+ *
+ * Copyright (C) 2014 Linaro Ltd.
+ *
+ * Based on arch/powerpc/mm/gup.c which is:
+ * Copyright (C) 2008 Nick Piggin
+ * Copyright (C) 2008 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/rwsem.h>
+#include <linux/hugetlb.h>
+#include <asm/pgtable.h>
+
+static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
+ int write, struct page **pages, int *nr)
+{
+ pte_t *ptep, *ptem;
+ int ret = 0;
+
+ ptem = ptep = pte_offset_map(&pmd, addr);
+ do {
+ pte_t pte = ACCESS_ONCE(*ptep);
+ struct page *page;
+
+ if (!pte_valid_user(pte) || pte_special(pte)
+ || (write && !pte_write(pte)))
+ goto pte_unmap;
+
+ VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+ page = pte_page(pte);
+
+ if (!page_cache_get_speculative(page))
+ goto pte_unmap;
+
+ if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+ put_page(page);
+ goto pte_unmap;
+ }
+
+ pages[*nr] = page;
+ (*nr)++;
+
+ } while (ptep++, addr += PAGE_SIZE, addr != end);
+
+ ret = 1;
+
+pte_unmap:
+ pte_unmap(ptem);
+ return ret;
+}
+
+static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
+ unsigned long end, int write, struct page **pages, int *nr)
+{
+ struct page *head, *page, *tail;
+ int refs;
+
+ if (!pmd_present(orig) || (write && !pmd_write(orig)))
+ return 0;
+
+ refs = 0;
+ head = pmd_page(orig);
+ page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+ tail = page;
+ do {
+ VM_BUG_ON(compound_head(page) != head);
+ pages[*nr] = page;
+ (*nr)++;
+ page++;
+ refs++;
+ } while (addr += PAGE_SIZE, addr != end);
+
+ if (!page_cache_add_speculative(head, refs)) {
+ *nr -= refs;
+ return 0;
+ }
+
+ if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
+ *nr -= refs;
+ while (refs--)
+ put_page(head);
+ return 0;
+ }
+
+ /*
+ * Any tail pages need their mapcount reference taken before we
+ * return. (This allows the THP code to bump their ref count when
+ * they are split into base pages).
+ */
+ while (refs--) {
+ if (PageTail(tail))
+ get_huge_page_tail(tail);
+ tail++;
+ }
+
+ return 1;
+}
+
+static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
+ unsigned long end, int write, struct page **pages, int *nr)
+{
+ struct page *head, *page, *tail;
+ pmd_t origpmd = __pmd(pud_val(orig));
+ int refs;
+
+ if (!pmd_present(origpmd) || (write && !pmd_write(origpmd)))
+ return 0;
+
+ refs = 0;
+ head = pmd_page(origpmd);
+ page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+ tail = page;
+ do {
+ VM_BUG_ON(compound_head(page) != head);
+ pages[*nr] = page;
+ (*nr)++;
+ page++;
+ refs++;
+ } while (addr += PAGE_SIZE, addr != end);
+
+ if (!page_cache_add_speculative(head, refs)) {
+ *nr -= refs;
+ return 0;
+ }
+
+ if (unlikely(pud_val(orig) != pud_val(*pudp))) {
+ *nr -= refs;
+ while (refs--)
+ put_page(head);
+ return 0;
+ }
+
+ while (refs--) {
+ if (PageTail(tail))
+ get_huge_page_tail(tail);
+ tail++;
+ }
+
+ return 1;
+}
+
+static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+ int write, struct page **pages, int *nr)
+{
+ unsigned long next;
+ pmd_t *pmdp;
+
+ pmdp = pmd_offset(&pud, addr);
+ do {
+ pmd_t pmd = ACCESS_ONCE(*pmdp);
+ next = pmd_addr_end(addr, end);
+ if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+ return 0;
+
+ if (unlikely(pmd_thp_or_huge(pmd))) {
+ if (!gup_huge_pmd(pmd, pmdp, addr, next, write,
+ pages, nr))
+ return 0;
+ } else {
+ if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+ return 0;
+ }
+ } while (pmdp++, addr = next, addr != end);
+
+ return 1;
+}
+
+static int gup_pud_range(pgd_t *pgdp, unsigned long addr, unsigned long end,
+ int write, struct page **pages, int *nr)
+{
+ unsigned long next;
+ pud_t *pudp;
+
+ pudp = pud_offset(pgdp, addr);
+ do {
+ pud_t pud = ACCESS_ONCE(*pudp);
+ next = pud_addr_end(addr, end);
+ if (pud_none(pud))
+ return 0;
+ if (pud_huge(pud)) {
+ if (!gup_huge_pud(pud, pudp, addr, next, write,
+ pages, nr))
+ return 0;
+ } else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+ return 0;
+ } while (pudp++, addr = next, addr != end);
+
+ return 1;
+}
+
+/*
+ * Like get_user_pages_fast() except its IRQ-safe in that it won't fall
+ * back to the regular GUP.
+ */
+int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
+ struct page **pages)
+{
+ struct mm_struct *mm = current->mm;
+ unsigned long addr, len, end;
+ unsigned long next, flags;
+ pgd_t *pgdp;
+ int nr = 0;
+
+ start &= PAGE_MASK;
+ addr = start;
+ len = (unsigned long) nr_pages << PAGE_SHIFT;
+ end = start + len;
+
+ if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
+ start, len)))
+ return 0;
+
+ /*
+ * Disable interrupts, we use the nested form as we can already
+ * have interrupts disabled by get_futex_key.
+ *
+ * With interrupts disabled, we block page table pages from being
+ * freed from under us. See mmu_gather_tlb in asm-generic/tlb.h
+ * for more details.
+ */
+
+ local_irq_save(flags);
+ pgdp = pgd_offset(mm, addr);
+ do {
+ next = pgd_addr_end(addr, end);
+ if (pgd_none(*pgdp))
+ break;
+ else if (!gup_pud_range(pgdp, addr, next, write, pages, &nr))
+ break;
+ } while (pgdp++, addr = next, addr != end);
+ local_irq_restore(flags);
+
+ return nr;
+}
+
+int get_user_pages_fast(unsigned long start, int nr_pages, int write,
+ struct page **pages)
+{
+ struct mm_struct *mm = current->mm;
+ int nr, ret;
+
+ start &= PAGE_MASK;
+ nr = __get_user_pages_fast(start, nr_pages, write, pages);
+ ret = nr;
+
+ if (nr < nr_pages) {
+ /* Try to get the remaining pages with get_user_pages */
+ start += nr << PAGE_SHIFT;
+ pages += nr;
+
+ down_read(&mm->mmap_sem);
+ ret = get_user_pages(current, mm, start,
+ nr_pages - nr, write, 0, pages, NULL);
+ up_read(&mm->mmap_sem);
+
+ /* Have to be a bit careful with return values */
+ if (nr > 0) {
+ if (ret < 0)
+ ret = nr;
+ else
+ ret += nr;
+ }
+ }
+
+ return ret;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+static void thp_splitting_flush_sync(void *arg)
+{
+}
+
+void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmdp)
+{
+ pmd_t pmd = pmd_mksplitting(*pmdp);
+ VM_BUG_ON(address & ~PMD_MASK);
+ set_pmd_at(vma->vm_mm, address, pmdp, pmd);
+
+ /* dummy IPI to serialise against fast_gup */
+ smp_call_function(thp_splitting_flush_sync, NULL, 1);
+}
+#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
--
1.8.1.4
* [RFC PATCH V3 4/6] arm64: Convert asm/tlb.h to generic mmu_gather
2014-03-12 13:40 [RFC PATCH V3 0/6] get_user_pages_fast for ARM and ARM64 Steve Capper
` (2 preceding siblings ...)
2014-03-12 13:40 ` [RFC PATCH V3 3/6] arm: mm: implement get_user_pages_fast Steve Capper
@ 2014-03-12 13:40 ` Steve Capper
2014-03-12 13:40 ` [RFC PATCH V3 5/6] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
2014-03-12 13:40 ` [RFC PATCH V3 6/6] arm64: mm: Activate get_user_pages_fast for arm64 Steve Capper
5 siblings, 0 replies; 19+ messages in thread
From: Steve Capper @ 2014-03-12 13:40 UTC (permalink / raw)
To: linux-arm-kernel
From: Catalin Marinas <catalin.marinas@arm.com>
Over the past couple of years, the generic mmu_gather gained range
tracking - 597e1c3580b7 (mm/mmu_gather: enable tlb flush range in generic
mmu_gather), 2b047252d087 (Fix TLB gather virtual address range
invalidation corner cases) - and tlb_fast_mode() has been removed -
29eb77825cc7 (arch, mm: Remove tlb_fast_mode()).
The new mmu_gather structure is now suitable for arm64 and this patch
converts the arch asm/tlb.h to the generic code. One functional
difference is the shift_arg_pages() case where previously the code was
flushing the full mm (no tlb_start_vma call) but now it flushes the
range given to tlb_gather_mmu() (possibly slightly more efficient
previously).
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
arch/arm64/include/asm/tlb.h | 136 +++++++------------------------------------
1 file changed, 20 insertions(+), 116 deletions(-)
diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index 717031a..72cadf5 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -19,115 +19,44 @@
#ifndef __ASM_TLB_H
#define __ASM_TLB_H
-#include <linux/pagemap.h>
-#include <linux/swap.h>
-#include <asm/pgalloc.h>
-#include <asm/tlbflush.h>
-
-#define MMU_GATHER_BUNDLE 8
-
-/*
- * TLB handling. This allows us to remove pages from the page
- * tables, and efficiently handle the TLB issues.
- */
-struct mmu_gather {
- struct mm_struct *mm;
- unsigned int fullmm;
- struct vm_area_struct *vma;
- unsigned long start, end;
- unsigned long range_start;
- unsigned long range_end;
- unsigned int nr;
- unsigned int max;
- struct page **pages;
- struct page *local[MMU_GATHER_BUNDLE];
-};
+#include <asm-generic/tlb.h>
/*
- * This is unnecessarily complex. There's three ways the TLB shootdown
- * code is used:
+ * There's three ways the TLB shootdown code is used:
* 1. Unmapping a range of vmas. See zap_page_range(), unmap_region().
* tlb->fullmm = 0, and tlb_start_vma/tlb_end_vma will be called.
- * tlb->vma will be non-NULL.
* 2. Unmapping all vmas. See exit_mmap().
* tlb->fullmm = 1, and tlb_start_vma/tlb_end_vma will be called.
- * tlb->vma will be non-NULL. Additionally, page tables will be freed.
+ * Page tables will be freed.
* 3. Unmapping argument pages. See shift_arg_pages().
* tlb->fullmm = 0, but tlb_start_vma/tlb_end_vma will not be called.
- * tlb->vma will be NULL.
*/
static inline void tlb_flush(struct mmu_gather *tlb)
{
- if (tlb->fullmm || !tlb->vma)
+ if (tlb->fullmm) {
flush_tlb_mm(tlb->mm);
- else if (tlb->range_end > 0) {
- flush_tlb_range(tlb->vma, tlb->range_start, tlb->range_end);
- tlb->range_start = TASK_SIZE;
- tlb->range_end = 0;
+ } else if (tlb->end > 0) {
+ struct vm_area_struct vma = { .vm_mm = tlb->mm, };
+ flush_tlb_range(&vma, tlb->start, tlb->end);
+ tlb->start = TASK_SIZE;
+ tlb->end = 0;
}
}
static inline void tlb_add_flush(struct mmu_gather *tlb, unsigned long addr)
{
if (!tlb->fullmm) {
- if (addr < tlb->range_start)
- tlb->range_start = addr;
- if (addr + PAGE_SIZE > tlb->range_end)
- tlb->range_end = addr + PAGE_SIZE;
- }
-}
-
-static inline void __tlb_alloc_page(struct mmu_gather *tlb)
-{
- unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
-
- if (addr) {
- tlb->pages = (void *)addr;
- tlb->max = PAGE_SIZE / sizeof(struct page *);
+ tlb->start = min(tlb->start, addr);
+ tlb->end = max(tlb->end, addr + PAGE_SIZE);
}
}
-static inline void tlb_flush_mmu(struct mmu_gather *tlb)
-{
- tlb_flush(tlb);
- free_pages_and_swap_cache(tlb->pages, tlb->nr);
- tlb->nr = 0;
- if (tlb->pages == tlb->local)
- __tlb_alloc_page(tlb);
-}
-
-static inline void
-tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start, unsigned long end)
-{
- tlb->mm = mm;
- tlb->fullmm = !(start | (end+1));
- tlb->start = start;
- tlb->end = end;
- tlb->vma = NULL;
- tlb->max = ARRAY_SIZE(tlb->local);
- tlb->pages = tlb->local;
- tlb->nr = 0;
- __tlb_alloc_page(tlb);
-}
-
-static inline void
-tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
-{
- tlb_flush_mmu(tlb);
-
- /* keep the page table cache within bounds */
- check_pgt_cache();
-
- if (tlb->pages != tlb->local)
- free_pages((unsigned long)tlb->pages, 0);
-}
-
/*
* Memorize the range for the TLB flush.
*/
-static inline void
-tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep, unsigned long addr)
+static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
+ unsigned long addr)
{
tlb_add_flush(tlb, addr);
}
@@ -137,38 +66,24 @@ tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep, unsigned long addr)
* case where we're doing a full MM flush. When we're doing a munmap,
* the vmas are adjusted to only cover the region to be torn down.
*/
-static inline void
-tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
+static inline void tlb_start_vma(struct mmu_gather *tlb,
+ struct vm_area_struct *vma)
{
if (!tlb->fullmm) {
- tlb->vma = vma;
- tlb->range_start = TASK_SIZE;
- tlb->range_end = 0;
+ tlb->start = TASK_SIZE;
+ tlb->end = 0;
}
}
-static inline void
-tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
+static inline void tlb_end_vma(struct mmu_gather *tlb,
+ struct vm_area_struct *vma)
{
if (!tlb->fullmm)
tlb_flush(tlb);
}
-static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
-{
- tlb->pages[tlb->nr++] = page;
- VM_BUG_ON(tlb->nr > tlb->max);
- return tlb->max - tlb->nr;
-}
-
-static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
-{
- if (!__tlb_remove_page(tlb, page))
- tlb_flush_mmu(tlb);
-}
-
static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
- unsigned long addr)
+ unsigned long addr)
{
pgtable_page_dtor(pte);
tlb_add_flush(tlb, addr);
@@ -184,16 +99,5 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
}
#endif
-#define pte_free_tlb(tlb, ptep, addr) __pte_free_tlb(tlb, ptep, addr)
-#define pmd_free_tlb(tlb, pmdp, addr) __pmd_free_tlb(tlb, pmdp, addr)
-#define pud_free_tlb(tlb, pudp, addr) pud_free((tlb)->mm, pudp)
-
-#define tlb_migrate_finish(mm) do { } while (0)
-
-static inline void
-tlb_remove_pmd_tlb_entry(struct mmu_gather *tlb, pmd_t *pmdp, unsigned long addr)
-{
- tlb_add_flush(tlb, addr);
-}
#endif
--
1.8.1.4
* [RFC PATCH V3 5/6] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
2014-03-12 13:40 [RFC PATCH V3 0/6] get_user_pages_fast for ARM and ARM64 Steve Capper
` (3 preceding siblings ...)
2014-03-12 13:40 ` [RFC PATCH V3 4/6] arm64: Convert asm/tlb.h to generic mmu_gather Steve Capper
@ 2014-03-12 13:40 ` Steve Capper
2014-03-12 13:40 ` [RFC PATCH V3 6/6] arm64: mm: Activate get_user_pages_fast for arm64 Steve Capper
5 siblings, 0 replies; 19+ messages in thread
From: Steve Capper @ 2014-03-12 13:40 UTC (permalink / raw)
To: linux-arm-kernel
In order to implement get_user_pages_fast we need to ensure that the
page table walker is protected from having page table pages freed from
under it.
This patch enables HAVE_RCU_TABLE_FREE; any page table pages belonging
to address spaces with multiple users will be freed via call_rcu_sched.
This means that disabling interrupts will block the free and protect
the fast gup page walker.
Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/tlb.h | 8 ++++++++
2 files changed, 9 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 27bbcfc..6185f95 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -38,6 +38,7 @@ config ARM64
select HAVE_MEMBLOCK
select HAVE_PATA_PLATFORM
select HAVE_PERF_EVENTS
+ select HAVE_RCU_TABLE_FREE
select IRQ_DOMAIN
select MODULES_USE_ELF_RELA
select NO_BOOTMEM
diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index 72cadf5..58a8b78 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -22,6 +22,14 @@
#include <asm-generic/tlb.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+
+static inline void __tlb_remove_table(void *_table)
+{
+ free_page_and_swap_cache((struct page *)_table);
+}
+
/*
* There's three ways the TLB shootdown code is used:
* 1. Unmapping a range of vmas. See zap_page_range(), unmap_region().
--
1.8.1.4
* [RFC PATCH V3 6/6] arm64: mm: Activate get_user_pages_fast for arm64
2014-03-12 13:40 [RFC PATCH V3 0/6] get_user_pages_fast for ARM and ARM64 Steve Capper
` (4 preceding siblings ...)
2014-03-12 13:40 ` [RFC PATCH V3 5/6] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
@ 2014-03-12 13:40 ` Steve Capper
5 siblings, 0 replies; 19+ messages in thread
From: Steve Capper @ 2014-03-12 13:40 UTC (permalink / raw)
To: linux-arm-kernel
The get_user_pages_fast implementation in arch/arm is valid for arm64
too. This patch references the arch/arm implementation rather than
duplicating the code.
Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
arch/arm64/include/asm/pgtable.h | 6 ++++++
arch/arm64/mm/Makefile | 4 +++-
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index aa3917c..d5ae326 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -242,9 +242,15 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
#define __HAVE_ARCH_PMD_WRITE
#define pmd_write(pmd) (!(pmd_val(pmd) & PMD_SECT_RDONLY))
+#define pmd_thp_or_huge(pmd) (pmd_huge(pmd) || pmd_trans_huge(pmd))
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define pmd_trans_huge(pmd) (pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT))
#define pmd_trans_splitting(pmd) (pmd_val(pmd) & PMD_SECT_SPLITTING)
+#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+struct vm_area_struct;
+void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmdp);
#endif
#define PMD_BIT_FUNC(fn,op) \
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index b51d364..212f229 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -1,5 +1,7 @@
+ARM=../../../arch/arm/mm
+
obj-y := dma-mapping.o extable.o fault.o init.o \
cache.o copypage.o flush.o \
ioremap.o mmap.o pgd.o mmu.o \
- context.o tlb.o proc.o
+ context.o tlb.o proc.o $(ARM)/gup.o
obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
--
1.8.1.4
* [RFC PATCH V3 3/6] arm: mm: implement get_user_pages_fast
2014-03-12 13:40 ` [RFC PATCH V3 3/6] arm: mm: implement get_user_pages_fast Steve Capper
@ 2014-03-12 14:18 ` Peter Zijlstra
2014-03-12 16:20 ` Steve Capper
2014-03-12 16:32 ` Peter Zijlstra
2014-03-12 17:15 ` Catalin Marinas
2 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-03-12 14:18 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Mar 12, 2014 at 01:40:20PM +0000, Steve Capper wrote:
> +int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> + struct page **pages)
> +{
> + struct mm_struct *mm = current->mm;
> + unsigned long addr, len, end;
> + unsigned long next, flags;
> + pgd_t *pgdp;
> + int nr = 0;
> +
> + start &= PAGE_MASK;
> + addr = start;
> + len = (unsigned long) nr_pages << PAGE_SHIFT;
> + end = start + len;
> +
> + if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
> + start, len)))
> + return 0;
> +
> + /*
> + * Disable interrupts, we use the nested form as we can already
> + * have interrupts disabled by get_futex_key.
> + *
> + * With interrupts disabled, we block page table pages from being
> + * freed from under us. See mmu_gather_tlb in asm-generic/tlb.h
> + * for more details.
> + */
> +
> + local_irq_save(flags);
> + pgdp = pgd_offset(mm, addr);
> + do {
> + next = pgd_addr_end(addr, end);
> + if (pgd_none(*pgdp))
> + break;
> + else if (!gup_pud_range(pgdp, addr, next, write, pages, &nr))
> + break;
> + } while (pgdp++, addr = next, addr != end);
> + local_irq_restore(flags);
> +
> + return nr;
> +}
Since you just went through the trouble of enabling RCU pagetable
freeing, you might also replace these local_irq_save/restore with
rcu_read_{,un}lock().
Typically rcu_read_lock() is faster than disabling interrupts; but I've
no clue about ARM.
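For illustration, a minimal sketch of the walker with the suggested RCU
read-side section in place of the irq-disabled region (not part of the
posted series; it assumes the chosen RCU flavour's read-side critical
section defers the table-free callbacks, and it leaves the THP-splitting
serialisation discussed below unresolved):

	/* replaces local_irq_save(flags) ... local_irq_restore(flags) */
	rcu_read_lock();
	pgdp = pgd_offset(mm, addr);
	do {
		next = pgd_addr_end(addr, end);
		if (pgd_none(*pgdp))
			break;
		else if (!gup_pud_range(pgdp, addr, next, write, pages, &nr))
			break;
	} while (pgdp++, addr = next, addr != end);
	rcu_read_unlock();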
* [RFC PATCH V3 3/6] arm: mm: implement get_user_pages_fast
2014-03-12 14:18 ` Peter Zijlstra
@ 2014-03-12 16:20 ` Steve Capper
2014-03-12 16:30 ` Peter Zijlstra
0 siblings, 1 reply; 19+ messages in thread
From: Steve Capper @ 2014-03-12 16:20 UTC (permalink / raw)
To: linux-arm-kernel
On 12 March 2014 14:18, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Mar 12, 2014 at 01:40:20PM +0000, Steve Capper wrote:
>> +int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>> + struct page **pages)
>> +{
>> + struct mm_struct *mm = current->mm;
>> + unsigned long addr, len, end;
>> + unsigned long next, flags;
>> + pgd_t *pgdp;
>> + int nr = 0;
>> +
>> + start &= PAGE_MASK;
>> + addr = start;
>> + len = (unsigned long) nr_pages << PAGE_SHIFT;
>> + end = start + len;
>> +
>> + if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
>> + start, len)))
>> + return 0;
>> +
>> + /*
>> + * Disable interrupts, we use the nested form as we can already
>> + * have interrupts disabled by get_futex_key.
>> + *
>> + * With interrupts disabled, we block page table pages from being
>> + * freed from under us. See mmu_gather_tlb in asm-generic/tlb.h
>> + * for more details.
>> + */
>> +
>> + local_irq_save(flags);
>> + pgdp = pgd_offset(mm, addr);
>> + do {
>> + next = pgd_addr_end(addr, end);
>> + if (pgd_none(*pgdp))
>> + break;
>> + else if (!gup_pud_range(pgdp, addr, next, write, pages, &nr))
>> + break;
>> + } while (pgdp++, addr = next, addr != end);
>> + local_irq_restore(flags);
>> +
>> + return nr;
>> +}
>
> Since you just went through the trouble of enabling RCU pagetable
> freeing, you might also replace these local_irq_save/restore with
> rcu_read_{,un}lock().
Hi Peter,
This critical section also needs to block the THP splitting code. At
the moment an IPI is broadcast in pmdp_splitting_flush. I'm not sure
how to adapt that to block on an rcu_read_lock, I'll have a think.
Cheers,
--
Steve
>
> Typically rcu_read_lock() is faster than disabling interrupts; but I've
> no clue about ARM.
* [RFC PATCH V3 3/6] arm: mm: implement get_user_pages_fast
2014-03-12 16:20 ` Steve Capper
@ 2014-03-12 16:30 ` Peter Zijlstra
2014-03-12 16:42 ` Steve Capper
0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-03-12 16:30 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Mar 12, 2014 at 04:20:15PM +0000, Steve Capper wrote:
> On 12 March 2014 14:18, Peter Zijlstra <peterz@infradead.org> wrote:
> > Since you just went through the trouble of enabling RCU pagetable
> > freeing, you might also replace these local_irq_save/restore with
> > rcu_read_{,un}lock().
>
> Hi Peter,
> This critical section also needs to block the THP splitting code. At
> the moment an IPI is broadcast in pmdp_splitting_flush. I'm not sure
> how to adapt that to block on an rcu_read_lock, I'll have a think.
Ah, I've not looked at THP much at all.
Would it be sufficient to make sure to fail the pmd get_page()
equivalent early enough?
* [RFC PATCH V3 3/6] arm: mm: implement get_user_pages_fast
2014-03-12 13:40 ` [RFC PATCH V3 3/6] arm: mm: implement get_user_pages_fast Steve Capper
2014-03-12 14:18 ` Peter Zijlstra
@ 2014-03-12 16:32 ` Peter Zijlstra
2014-03-12 16:41 ` Steve Capper
2014-03-12 16:55 ` Will Deacon
2014-03-12 17:15 ` Catalin Marinas
2 siblings, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-03-12 16:32 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Mar 12, 2014 at 01:40:20PM +0000, Steve Capper wrote:
> +void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
> + pmd_t *pmdp)
> +{
> + pmd_t pmd = pmd_mksplitting(*pmdp);
> + VM_BUG_ON(address & ~PMD_MASK);
> + set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> +
> + /* dummy IPI to serialise against fast_gup */
> + smp_call_function(thp_splitting_flush_sync, NULL, 1);
> +}
do you really need to IPI the entire machine? Wouldn't the mm's TLB
invalidate mask be sufficient?
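For illustration, a minimal sketch of the narrower IPI being suggested,
assuming mm_cpumask() is a usable approximation of where the mm is live
(the follow-up below questions exactly that assumption on ARM):

	void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
				  pmd_t *pmdp)
	{
		pmd_t pmd = pmd_mksplitting(*pmdp);

		VM_BUG_ON(address & ~PMD_MASK);
		set_pmd_at(vma->vm_mm, address, pmdp, pmd);

		/* only interrupt CPUs that may be walking this mm */
		preempt_disable();
		smp_call_function_many(mm_cpumask(vma->vm_mm),
				       thp_splitting_flush_sync, NULL, 1);
		preempt_enable();
	}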
* [RFC PATCH V3 3/6] arm: mm: implement get_user_pages_fast
2014-03-12 16:32 ` Peter Zijlstra
@ 2014-03-12 16:41 ` Steve Capper
2014-03-12 16:55 ` Will Deacon
1 sibling, 0 replies; 19+ messages in thread
From: Steve Capper @ 2014-03-12 16:41 UTC (permalink / raw)
To: linux-arm-kernel
On 12 March 2014 16:32, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Mar 12, 2014 at 01:40:20PM +0000, Steve Capper wrote:
>> +void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
>> + pmd_t *pmdp)
>> +{
>> + pmd_t pmd = pmd_mksplitting(*pmdp);
>> + VM_BUG_ON(address & ~PMD_MASK);
>> + set_pmd_at(vma->vm_mm, address, pmdp, pmd);
>> +
>> + /* dummy IPI to serialise against fast_gup */
>> + smp_call_function(thp_splitting_flush_sync, NULL, 1);
>> +}
>
> do you really need to IPI the entire machine? Wouldn't the mm's TLB
> invalidate mask be sufficient?
Thank you! Yes, that would be a much better idea. I'll correct this.
* [RFC PATCH V3 3/6] arm: mm: implement get_user_pages_fast
2014-03-12 16:30 ` Peter Zijlstra
@ 2014-03-12 16:42 ` Steve Capper
0 siblings, 0 replies; 19+ messages in thread
From: Steve Capper @ 2014-03-12 16:42 UTC (permalink / raw)
To: linux-arm-kernel
On 12 March 2014 16:30, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Mar 12, 2014 at 04:20:15PM +0000, Steve Capper wrote:
>> On 12 March 2014 14:18, Peter Zijlstra <peterz@infradead.org> wrote:
>> > Since you just went through the trouble of enabling RCU pagetable
>> > freeing, you might also replace these local_irq_save/restore with
>> > rcu_read_{,un}lock().
>>
>> Hi Peter,
>> This critical section also needs to block the THP splitting code. At
>> the moment an IPI is broadcast in pmdp_splitting_flush. I'm not sure
>> how to adapt that to block on an rcu_read_lock, I'll have a think.
>
> Ah, I've not looked at THP much at all.
>
> Would it be sufficient to make sure to fail the pmd get_page()
> equivalent early enough?
I don't think that will be enough, as we haven't locked anything. I'll
refine the IPI as per your suggestion.
* [RFC PATCH V3 3/6] arm: mm: implement get_user_pages_fast
2014-03-12 16:32 ` Peter Zijlstra
2014-03-12 16:41 ` Steve Capper
@ 2014-03-12 16:55 ` Will Deacon
2014-03-12 17:11 ` Peter Zijlstra
2014-03-13 8:24 ` Steve Capper
1 sibling, 2 replies; 19+ messages in thread
From: Will Deacon @ 2014-03-12 16:55 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Mar 12, 2014 at 04:32:00PM +0000, Peter Zijlstra wrote:
> On Wed, Mar 12, 2014 at 01:40:20PM +0000, Steve Capper wrote:
> > +void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
> > + pmd_t *pmdp)
> > +{
> > + pmd_t pmd = pmd_mksplitting(*pmdp);
> > + VM_BUG_ON(address & ~PMD_MASK);
> > + set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> > +
> > + /* dummy IPI to serialise against fast_gup */
> > + smp_call_function(thp_splitting_flush_sync, NULL, 1);
> > +}
>
> do you really need to IPI the entire machine? Wouldn't the mm's TLB
> invalidate mask be sufficient?
Are you thinking of using mm_cpumask(vma->vm_mm)? That's rarely cleared on
ARM, so it tends to identify everywhere the task has ever run, regardless of
TLB state. The reason is that the mask is also used for cache flushing
(which is further overloaded for VIVT and VIPT w/ software maintenance
broadcast).
I had a patch improving this a bit (below) but I didn't manage to see any
significant improvements so I didn't pursue it further. What we probably want
to try is nuking the mask on a h/w broadcast TLBI operation with ARMv7, but
it will mean adding horrible checks to tlbflush.h
Will
--->8
commit fd24d6170839b200cc2916c83847ca46e889f1ca
Author: Will Deacon <will.deacon@arm.com>
Date: Thu Jul 25 16:38:34 2013 +0100
ARM: mm: use mm_cpumask to keep track of dirty TLBs on v7
Signed-off-by: Will Deacon <will.deacon@arm.com>
diff --git a/arch/arm/include/asm/tlbflush.h b/arch/arm/include/asm/tlbflush.h
index def9e570199f..f2a1cb7edfca 100644
--- a/arch/arm/include/asm/tlbflush.h
+++ b/arch/arm/include/asm/tlbflush.h
@@ -202,6 +202,7 @@
#ifndef __ASSEMBLY__
#include <linux/sched.h>
+#include <asm/smp_plat.h>
struct cpu_tlb_fns {
void (*flush_user_range)(unsigned long, unsigned long, struct vm_area_struct *);
@@ -401,6 +402,17 @@ static inline void __flush_tlb_mm(struct mm_struct *mm)
{
const unsigned int __tlb_flag = __cpu_tlb_flags;
+ if (!cache_ops_need_broadcast()) {
+ int cpu = get_cpu();
+ if (cpumask_equal(mm_cpumask(mm), cpumask_of(cpu))) {
+ cpumask_clear_cpu(cpu, mm_cpumask(mm));
+ local_flush_tlb_mm(mm);
+ put_cpu();
+ return;
+ }
+ put_cpu();
+ }
+
if (tlb_flag(TLB_WB))
dsb(ishst);
@@ -459,6 +471,17 @@ __flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
{
const unsigned int __tlb_flag = __cpu_tlb_flags;
+ if (!cache_ops_need_broadcast()) {
+ int cpu = get_cpu();
+ if (cpumask_equal(mm_cpumask(vma->vm_mm), cpumask_of(cpu))) {
+ cpumask_clear_cpu(cpu, mm_cpumask(vma->vm_mm));
+ local_flush_tlb_page(vma, uaddr);
+ put_cpu();
+ return;
+ }
+ put_cpu();
+ }
+
uaddr = (uaddr & PAGE_MASK) | ASID(vma->vm_mm);
if (tlb_flag(TLB_WB))
* [RFC PATCH V3 3/6] arm: mm: implement get_user_pages_fast
2014-03-12 16:55 ` Will Deacon
@ 2014-03-12 17:11 ` Peter Zijlstra
2014-03-14 11:47 ` Peter Zijlstra
2014-03-13 8:24 ` Steve Capper
1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-03-12 17:11 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Mar 12, 2014 at 04:55:11PM +0000, Will Deacon wrote:
> On Wed, Mar 12, 2014 at 04:32:00PM +0000, Peter Zijlstra wrote:
> > On Wed, Mar 12, 2014 at 01:40:20PM +0000, Steve Capper wrote:
> > > +void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
> > > + pmd_t *pmdp)
> > > +{
> > > + pmd_t pmd = pmd_mksplitting(*pmdp);
> > > + VM_BUG_ON(address & ~PMD_MASK);
> > > + set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> > > +
> > > + /* dummy IPI to serialise against fast_gup */
> > > + smp_call_function(thp_splitting_flush_sync, NULL, 1);
> > > +}
> >
> > do you really need to IPI the entire machine? Wouldn't the mm's TLB
> > invalidate mask be sufficient?
>
> Are you thinking of using mm_cpumask(vma->vm_mm)? That's rarely cleared on
> ARM, so it tends to identify everywhere the task has ever run, regardless of
> TLB state. The reason is that the mask is also used for cache flushing
> (which is further overloaded for VIVT and VIPT w/ software maintenance
> broadcast).
>
> I had a patch improving this a bit (below) but I didn't manage to see any
> significant improvements so I didn't pursue it further. What we probably want
> to try is nuking the mask on a h/w broadcast TLBI operation with ARMv7, but
> it will mean adding horrible checks to tlbflush.h
Ah this is because you have context tagged TLBs so your context switch
doesn't locally flush TLBs and therefore you cannot keep track of this?
Too much x86 in my head I suppose.
* [RFC PATCH V3 3/6] arm: mm: implement get_user_pages_fast
2014-03-12 13:40 ` [RFC PATCH V3 3/6] arm: mm: implement get_user_pages_fast Steve Capper
2014-03-12 14:18 ` Peter Zijlstra
2014-03-12 16:32 ` Peter Zijlstra
@ 2014-03-12 17:15 ` Catalin Marinas
2014-03-13 8:03 ` Steve Capper
2 siblings, 1 reply; 19+ messages in thread
From: Catalin Marinas @ 2014-03-12 17:15 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Mar 12, 2014 at 01:40:20PM +0000, Steve Capper wrote:
> An implementation of get_user_pages_fast for ARM. It is based loosely
> on the PowerPC implementation. We disable interrupts in the walker to
> prevent the call_rcu_sched pagetable freeing code from running under
> us.
>
> We also explicitly fire an IPI in the Transparent HugePage splitting
> case to prevent splits from interfering with the fast_gup walker.
> As THP splits are relatively rare, this should not have a noticeable
> overhead.
>
> Signed-off-by: Steve Capper <steve.capper@linaro.org>
> ---
> arch/arm/include/asm/pgtable-3level.h | 6 +
> arch/arm/mm/Makefile | 1 +
> arch/arm/mm/gup.c | 299 ++++++++++++++++++++++++++++++++++
> 3 files changed, 306 insertions(+)
> create mode 100644 arch/arm/mm/gup.c
Is there anything specific to ARM in this gup.c file? Could we make it
more generic like mm/gup.c?
--
Catalin
* [RFC PATCH V3 3/6] arm: mm: implement get_user_pages_fast
2014-03-12 17:15 ` Catalin Marinas
@ 2014-03-13 8:03 ` Steve Capper
0 siblings, 0 replies; 19+ messages in thread
From: Steve Capper @ 2014-03-13 8:03 UTC (permalink / raw)
To: linux-arm-kernel
On 12 March 2014 17:15, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Wed, Mar 12, 2014 at 01:40:20PM +0000, Steve Capper wrote:
>> An implementation of get_user_pages_fast for ARM. It is based loosely
>> on the PowerPC implementation. We disable interrupts in the walker to
>> prevent the call_rcu_sched pagetable freeing code from running under
>> us.
>>
>> We also explicitly fire an IPI in the Transparent HugePage splitting
>> case to prevent splits from interfering with the fast_gup walker.
>> As THP splits are relatively rare, this should not have a noticeable
>> overhead.
>>
>> Signed-off-by: Steve Capper <steve.capper@linaro.org>
>> ---
>> arch/arm/include/asm/pgtable-3level.h | 6 +
>> arch/arm/mm/Makefile | 1 +
>> arch/arm/mm/gup.c | 299 ++++++++++++++++++++++++++++++++++
>> 3 files changed, 306 insertions(+)
>> create mode 100644 arch/arm/mm/gup.c
>
> Is there anything specific to ARM in this gup.c file? Could we make it
> more generic like mm/gup.c?
Hi Catalin,
The arm and arm64 cases assume that we can read the ptes atomically,
that hardware TLB broadcasts can occur (so we have to use the
page_cache_get_speculative logic), and that hugetlb pages have the same
pte layout as THPs.
Also, I took a quick look at the other architectures; a summary of what
I found is in this post:
http://lists.infradead.org/pipermail/linux-arm-kernel/2014-March/239326.html
Cheers,
--
Steve
>
> --
> Catalin
* [RFC PATCH V3 3/6] arm: mm: implement get_user_pages_fast
2014-03-12 16:55 ` Will Deacon
2014-03-12 17:11 ` Peter Zijlstra
@ 2014-03-13 8:24 ` Steve Capper
1 sibling, 0 replies; 19+ messages in thread
From: Steve Capper @ 2014-03-13 8:24 UTC (permalink / raw)
To: linux-arm-kernel
On 12 March 2014 16:55, Will Deacon <will.deacon@arm.com> wrote:
> On Wed, Mar 12, 2014 at 04:32:00PM +0000, Peter Zijlstra wrote:
>> On Wed, Mar 12, 2014 at 01:40:20PM +0000, Steve Capper wrote:
>> > +void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
>> > + pmd_t *pmdp)
>> > +{
>> > + pmd_t pmd = pmd_mksplitting(*pmdp);
>> > + VM_BUG_ON(address & ~PMD_MASK);
>> > + set_pmd_at(vma->vm_mm, address, pmdp, pmd);
>> > +
>> > + /* dummy IPI to serialise against fast_gup */
>> > + smp_call_function(thp_splitting_flush_sync, NULL, 1);
>> > +}
>>
>> do you really need to IPI the entire machine? Wouldn't the mm's TLB
>> invalidate mask be sufficient?
Hey Will,
>
> Are you thinking of using mm_cpumask(vma->vm_mm)? That's rarely cleared on
> ARM, so it tends to identify everywhere the task has ever run, regardless of
> TLB state. The reason is that the mask is also used for cache flushing
> (which is further overloaded for VIVT and VIPT w/ software maintenance
> broadcast).
For the THP splitting case, I want a cpu mask representing any cpu
that is touching the address space the THP belongs to. That way, the
IPI will block on any fast_gup walks that cover the THP.
>
> I had a patch improving this a bit (below) but I didn't manage to see any
> significant improvements so I didn't pursue it further. What we probably want
> to try is nuking the mask on a h/w broadcast TLBI operation with ARMv7, but
> it will mean adding horrible checks to tlbflush.h
Thanks! I'm still waking up, will have a think about this.
Cheers,
--
Steve
>
> Will
>
> --->8
>
> commit fd24d6170839b200cc2916c83847ca46e889f1ca
> Author: Will Deacon <will.deacon@arm.com>
> Date: Thu Jul 25 16:38:34 2013 +0100
>
> ARM: mm: use mm_cpumask to keep track of dirty TLBs on v7
>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
>
> diff --git a/arch/arm/include/asm/tlbflush.h b/arch/arm/include/asm/tlbflush.h
> index def9e570199f..f2a1cb7edfca 100644
> --- a/arch/arm/include/asm/tlbflush.h
> +++ b/arch/arm/include/asm/tlbflush.h
> @@ -202,6 +202,7 @@
> #ifndef __ASSEMBLY__
>
> #include <linux/sched.h>
> +#include <asm/smp_plat.h>
>
> struct cpu_tlb_fns {
> void (*flush_user_range)(unsigned long, unsigned long, struct vm_area_struct *);
> @@ -401,6 +402,17 @@ static inline void __flush_tlb_mm(struct mm_struct *mm)
> {
> const unsigned int __tlb_flag = __cpu_tlb_flags;
>
> + if (!cache_ops_need_broadcast()) {
> + int cpu = get_cpu();
> + if (cpumask_equal(mm_cpumask(mm), cpumask_of(cpu))) {
> + cpumask_clear_cpu(cpu, mm_cpumask(mm));
> + local_flush_tlb_mm(mm);
> + put_cpu();
> + return;
> + }
> + put_cpu();
> + }
> +
> if (tlb_flag(TLB_WB))
> dsb(ishst);
>
> @@ -459,6 +471,17 @@ __flush_tlb_page(struct vm_area_struct *vma, unsigned long uaddr)
> {
> const unsigned int __tlb_flag = __cpu_tlb_flags;
>
> + if (!cache_ops_need_broadcast()) {
> + int cpu = get_cpu();
> + if (cpumask_equal(mm_cpumask(vma->vm_mm), cpumask_of(cpu))) {
> + cpumask_clear_cpu(cpu, mm_cpumask(vma->vm_mm));
> + local_flush_tlb_page(vma, uaddr);
> + put_cpu();
> + return;
> + }
> + put_cpu();
> + }
> +
> uaddr = (uaddr & PAGE_MASK) | ASID(vma->vm_mm);
>
> if (tlb_flag(TLB_WB))
* [RFC PATCH V3 3/6] arm: mm: implement get_user_pages_fast
2014-03-12 17:11 ` Peter Zijlstra
@ 2014-03-14 11:47 ` Peter Zijlstra
0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-03-14 11:47 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Mar 12, 2014 at 06:11:26PM +0100, Peter Zijlstra wrote:
> Ah this is because you have context tagged TLBs so your context switch
> doesn't locally flush TLBs and therefore you cannot keep track of this?
>
> Too much x86 in my head I suppose.
Something you could consider is something like:
typedef struct {
	...
+	unsigned long tlb_flush_counter;
} mm_context_t;

struct thread_info {
	...
+	unsigned long tlb_flush_counter;
};

void flush_tlb*() {
	ACCESS_ONCE(mm->context.tlb_flush_counter)++;
	...
}

void switch_to(prev, next) {
	...
	if (prev->mm != next->mm &&
	    next->mm->context.tlb_flush_counter !=
	    task_thread_info(next)->tlb_flush_counter) {
		task_thread_info(next)->tlb_flush_counter =
			next->mm->context.tlb_flush_counter;
		local_tlb_flush(next->mm);
	}
}
That way you don't have to IPI cpus that don't currently run tasks of
that mm because the next time they get scheduled the switch_to() bit
will flush their mm for you.
And thus you can keep a tight tlb invalidate mask.
Now I'm not at all sure this is beneficial for ARM, just a thought.
Also I suppose one should think about the case where the counter
wrapped. The easy way out there is to unconditionally flush the entire
machine in flush_tlb*() when that happens.
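Sketching that wrap case in the same pseudocode, one possible reading of
"flush the entire machine" would be:

	void flush_tlb*() {
		ACCESS_ONCE(mm->context.tlb_flush_counter)++;
		if (unlikely(mm->context.tlb_flush_counter == 0))
			flush_tlb_all(); /* counter wrapped: flush everything */
		...
	}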