linux-arm-kernel.lists.infradead.org archive mirror
* [RFC PATCH V2 0/4] get_user_pages_fast for ARM and ARM64
@ 2014-02-06 16:18 Steve Capper
  2014-02-06 16:18 ` [RFC PATCH V2 1/4] arm: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Steve Capper @ 2014-02-06 16:18 UTC (permalink / raw)
  To: linux-arm-kernel

Hello,
This RFC series implements get_user_pages_fast and __get_user_pages_fast.
These are required for Transparent HugePages to function correctly, as
a futex on a THP tail page will otherwise result in an infinite loop
(because the core implementation of __get_user_pages_fast always
returns 0).
This series may also be beneficial for direct-IO heavy workloads and
certain KVM workloads.
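
For reference, callers pin pages through the usual interface:
get_user_pages_fast() takes no mmap_sem and returns the number of pages
it managed to pin. A minimal, illustrative helper (the function name and
error handling below are mine, not part of this series) might look like:

#include <linux/errno.h>
#include <linux/mm.h>

/*
 * Illustrative only: pin up to nr_pages of a page-aligned user buffer
 * without taking mmap_sem, touch the pages, then drop the references.
 * get_user_pages_fast() may pin fewer pages than requested.
 */
static int pin_user_buffer(unsigned long uaddr, int nr_pages,
			   struct page **pages)
{
	int i, pinned;

	pinned = get_user_pages_fast(uaddr, nr_pages, 1 /* write */, pages);
	if (pinned <= 0)
		return pinned ? pinned : -EFAULT;

	/* ... access the pinned pages, e.g. via kmap_atomic() ... */

	for (i = 0; i < pinned; i++)
		put_page(pages[i]);

	return pinned;
}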
 
Previous RFCs for fast_gup on arm have included one from Chanho Park:
http://lists.infradead.org/pipermail/linux-arm-kernel/2013-April/162115.html

one from Zi Shen Lim:
http://lists.infradead.org/pipermail/linux-arm-kernel/2013-October/202133.html

and my RFC V1:
http://lists.infradead.org/pipermail/linux-arm-kernel/2013-October/205951.html

The main issues with previous RFCs have been in the mechanisms used to
prevent page table pages from being freed from under the fast_gup
walker. Some other architectures disable interrupts in the fast_gup
walker, and then rely on the fact that TLB invalidations require IPIs;
thus the page table freeing code is blocked by the walker. Some ARM
platforms, however, have hardware broadcasts for TLB invalidation, so
do not always require IPIs to flush TLBs. Some extra logic is therefore
required to protect the fast_gup walker on ARM.

My previous RFC attempted to protect the fast_gup walker with atomics,
but this led to performance degradation.

This RFC V2 instead uses the RCU-sched logic from PowerPC to protect
the fast_gup walker. All page table pages belonging to an address space
with more than one user are batched together and freed from a deferred
call_rcu_sched callback. Disabling interrupts blocks the rcu_sched
grace period, and so prevents the page table pages from being freed
from under the fast_gup walker. If there is not enough memory to batch
the page tables together (which is very rare), then IPIs are raised
individually instead.
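
Concretely, the batching and freeing lives in the generic mmu_gather
code rather than in this series; condensed from mm/memory.c (comments
abridged, see that file for the authoritative version), the
HAVE_RCU_TABLE_FREE path looks roughly like this, with
__tlb_remove_table() supplied by the architecture:

static void tlb_remove_table_smp_sync(void *arg)
{
	/* IPI handler: simply deliver the interrupt */
}

static void tlb_remove_table_one(void *table)
{
	/* No memory for a batch: synchronise against walkers with an IPI */
	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
	__tlb_remove_table(table);
}

static void tlb_remove_table_rcu(struct rcu_head *head)
{
	struct mmu_table_batch *batch =
		container_of(head, struct mmu_table_batch, rcu);
	int i;

	for (i = 0; i < batch->nr; i++)
		__tlb_remove_table(batch->tables[i]);
	free_page((unsigned long)batch);
}

void tlb_table_flush(struct mmu_gather *tlb)
{
	struct mmu_table_batch **batch = &tlb->batch;

	if (*batch) {
		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
		*batch = NULL;
	}
}

void tlb_remove_table(struct mmu_gather *tlb, void *table)
{
	struct mmu_table_batch **batch = &tlb->batch;

	tlb->need_flush = 1;

	/* A single-user mm cannot have a concurrent fast_gup walker */
	if (atomic_read(&tlb->mm->mm_users) < 2) {
		__tlb_remove_table(table);
		return;
	}

	if (*batch == NULL) {
		*batch = (struct mmu_table_batch *)
			__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
		if (*batch == NULL) {
			tlb_remove_table_one(table);
			return;
		}
		(*batch)->nr = 0;
	}
	if ((*batch)->nr == MAX_TABLE_BATCH)
		tlb_table_flush(tlb);
	(*batch)->tables[(*batch)->nr++] = table;
}

Disabling interrupts in the walker therefore holds off both the
call_rcu_sched callback and the IPI used by the out-of-memory fallback,
which is exactly what patches 2 and 4 rely on.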

The RCU logic is activated by enabling HAVE_RCU_TABLE_FREE, and some
modifications are made to the mmu_gather code in ARM and ARM64 to plumb
it in. On ARM64, we could probably go one step further and switch to
the generic mmu_gather code too.

THP splitting is made to broadcast an IPI, as splits must be blocked
completely while the fast_gup walker is active. As THP splits are
relatively rare (on my machine with 22 days of uptime I count 27678),
I do not expect these IPIs to cause any performance issues.

I have tested the series using the Fast Model for ARM64 and an Arndale
Board. A series of hackbench runs on the Arndale did not turn up any
performance degradation with this patch set applied.

This series applies to 3.13, but has also been tested on 3.14-rc1.

I would really appreciate any comments and/or testers!

Cheers,
-- 
Steve

Steve Capper (4):
  arm: mm: Enable HAVE_RCU_TABLE_FREE logic
  arm: mm: implement get_user_pages_fast
  arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
  arm64: mm: implement get_user_pages_fast

 arch/arm/Kconfig                      |   1 +
 arch/arm/include/asm/pgtable-3level.h |   6 +
 arch/arm/include/asm/tlb.h            |  38 ++++-
 arch/arm/mm/Makefile                  |   2 +-
 arch/arm/mm/gup.c                     | 251 ++++++++++++++++++++++++++++
 arch/arm64/Kconfig                    |   1 +
 arch/arm64/include/asm/pgtable.h      |   4 +
 arch/arm64/include/asm/tlb.h          |  27 +++-
 arch/arm64/mm/Makefile                |   2 +-
 arch/arm64/mm/gup.c                   | 297 ++++++++++++++++++++++++++++++++++
 10 files changed, 623 insertions(+), 6 deletions(-)
 create mode 100644 arch/arm/mm/gup.c
 create mode 100644 arch/arm64/mm/gup.c

-- 
1.8.1.4

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC PATCH V2 1/4] arm: mm: Enable HAVE_RCU_TABLE_FREE logic
  2014-02-06 16:18 [RFC PATCH V2 0/4] get_user_pages_fast for ARM and ARM64 Steve Capper
@ 2014-02-06 16:18 ` Steve Capper
  2014-02-06 16:18 ` [RFC PATCH V2 2/4] arm: mm: implement get_user_pages_fast Steve Capper
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Steve Capper @ 2014-02-06 16:18 UTC (permalink / raw)
  To: linux-arm-kernel

In order to implement fast_get_user_pages we need to ensure that the
page table walker is protected from page table pages being freed from
under it.

One way to achieve this is to have the walker disable interrupts and
rely on the IPIs that the TLB flushing code broadcasts before the page
table pages are freed; those IPIs cannot complete while the walker has
interrupts disabled.

On some ARM platforms we have hardware broadcasting of TLB
invalidation, so the TLB flushing code won't necessarily raise any
IPIs. Also, spuriously broadcasting IPIs can hurt system performance
if done too often.

This problem has already been solved on PowerPC and Sparc by batching
up page table pages belonging to address spaces with more than one
user, then scheduling an rcu_sched callback to free the pages.
Disabling interrupts then blocks the callback, and with it the freeing
of the page table pages. This logic has also been promoted to core
code and is activated when one enables HAVE_RCU_TABLE_FREE.

This patch enables HAVE_RCU_TABLE_FREE and incorporates it into the
existing ARM TLB logic.

Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
 arch/arm/Kconfig           |  1 +
 arch/arm/include/asm/tlb.h | 38 ++++++++++++++++++++++++++++++++++++--
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index c1f1a7e..e4a0e59 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -55,6 +55,7 @@ config ARM
 	select HAVE_PERF_EVENTS
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
+	select HAVE_RCU_TABLE_FREE if SMP
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UID16
diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h
index 0baf7f0..8cb5552 100644
--- a/arch/arm/include/asm/tlb.h
+++ b/arch/arm/include/asm/tlb.h
@@ -35,12 +35,39 @@
 
 #define MMU_GATHER_BUNDLE	8
 
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+static inline void __tlb_remove_table(void *_table)
+{
+	free_page_and_swap_cache((struct page *)_table);
+}
+
+struct mmu_table_batch {
+	struct rcu_head		rcu;
+	unsigned int		nr;
+	void			*tables[0];
+};
+
+#define MAX_TABLE_BATCH		\
+	((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(void *))
+
+extern void tlb_table_flush(struct mmu_gather *tlb);
+extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
+
+#define tlb_remove_entry(tlb,entry)	tlb_remove_table(tlb,entry)
+#else
+#define tlb_remove_entry(tlb,entry)	tlb_remove_page(tlb,entry)
+#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+
 /*
  * TLB handling.  This allows us to remove pages from the page
  * tables, and efficiently handle the TLB issues.
  */
 struct mmu_gather {
 	struct mm_struct	*mm;
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	struct mmu_table_batch	*batch;
+	unsigned int		need_flush;
+#endif
 	unsigned int		fullmm;
 	struct vm_area_struct	*vma;
 	unsigned long		start, end;
@@ -101,6 +128,9 @@ static inline void __tlb_alloc_page(struct mmu_gather *tlb)
 static inline void tlb_flush_mmu(struct mmu_gather *tlb)
 {
 	tlb_flush(tlb);
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	tlb_table_flush(tlb);
+#endif
 	free_pages_and_swap_cache(tlb->pages, tlb->nr);
 	tlb->nr = 0;
 	if (tlb->pages == tlb->local)
@@ -119,6 +149,10 @@ tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start
 	tlb->pages = tlb->local;
 	tlb->nr = 0;
 	__tlb_alloc_page(tlb);
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	tlb->batch = NULL;
+#endif
 }
 
 static inline void
@@ -195,7 +229,7 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
 	tlb_add_flush(tlb, addr + SZ_1M);
 #endif
 
-	tlb_remove_page(tlb, pte);
+	tlb_remove_entry(tlb, pte);
 }
 
 static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
@@ -203,7 +237,7 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
 {
 #ifdef CONFIG_ARM_LPAE
 	tlb_add_flush(tlb, addr);
-	tlb_remove_page(tlb, virt_to_page(pmdp));
+	tlb_remove_entry(tlb, virt_to_page(pmdp));
 #endif
 }
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH V2 2/4] arm: mm: implement get_user_pages_fast
  2014-02-06 16:18 [RFC PATCH V2 0/4] get_user_pages_fast for ARM and ARM64 Steve Capper
  2014-02-06 16:18 ` [RFC PATCH V2 1/4] arm: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
@ 2014-02-06 16:18 ` Steve Capper
  2014-02-06 16:18 ` [RFC PATCH V2 3/4] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
  2014-02-06 16:18 ` [RFC PATCH V2 4/4] arm64: mm: implement get_user_pages_fast Steve Capper
  3 siblings, 0 replies; 12+ messages in thread
From: Steve Capper @ 2014-02-06 16:18 UTC (permalink / raw)
  To: linux-arm-kernel

An implementation of get_user_pages_fast for ARM. It is based loosely
on the PowerPC implementation. We disable interrupts in the walker to
prevent the call_rcu_sched pagetable freeing code from running under
us.

We also explicitly fire an IPI in the Transparent HugePage splitting
case to prevent splits from interfering with the fast_gup walker.
As THP splits are relatively rare, this should not have a noticeable
overhead.

Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
 arch/arm/include/asm/pgtable-3level.h |   6 +
 arch/arm/mm/Makefile                  |   2 +-
 arch/arm/mm/gup.c                     | 251 ++++++++++++++++++++++++++++++++++
 3 files changed, 258 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm/mm/gup.c

diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index 4f95039..4392c40 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -214,6 +214,12 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define pmd_trans_huge(pmd)	(pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT))
 #define pmd_trans_splitting(pmd) (pmd_val(pmd) & PMD_SECT_SPLITTING)
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
+			  pmd_t *pmdp);
+#endif
 #endif
 
 #define PMD_BIT_FUNC(fn,op) \
diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile
index ecfe6e5..45cc6d8 100644
--- a/arch/arm/mm/Makefile
+++ b/arch/arm/mm/Makefile
@@ -6,7 +6,7 @@ obj-y				:= dma-mapping.o extable.o fault.o init.o \
 				   iomap.o
 
 obj-$(CONFIG_MMU)		+= fault-armv.o flush.o idmap.o ioremap.o \
-				   mmap.o pgd.o mmu.o
+				   mmap.o pgd.o mmu.o gup.o
 
 ifneq ($(CONFIG_MMU),y)
 obj-y				+= nommu.o
diff --git a/arch/arm/mm/gup.c b/arch/arm/mm/gup.c
new file mode 100644
index 0000000..2dcacad
--- /dev/null
+++ b/arch/arm/mm/gup.c
@@ -0,0 +1,251 @@
+/*
+ * arch/arm/mm/gup.c
+ *
+ * Copyright (C) 2014 Linaro Ltd.
+ *
+ * Based on arch/powerpc/mm/gup.c which is:
+ * Copyright (C) 2008 Nick Piggin
+ * Copyright (C) 2008 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/rwsem.h>
+#include <linux/hugetlb.h>
+#include <asm/pgtable.h>
+
+static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
+			 int write, struct page **pages, int *nr)
+{
+	pte_t *ptep, *ptem;
+	int ret = 0;
+
+	ptem = ptep = pte_offset_map(&pmd, addr);
+	do {
+		pte_t pte = ACCESS_ONCE(*ptep);
+		struct page *page;
+
+		if (!pte_present_user(pte) || (write && !pte_write(pte)))
+			goto pte_unmap;
+
+		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+		page = pte_page(pte);
+
+		if (!page_cache_get_speculative(page))
+			goto pte_unmap;
+
+		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+			put_page(page);
+			goto pte_unmap;
+		}
+
+		pages[*nr] = page;
+		(*nr)++;
+
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
+
+	ret = 1;
+
+pte_unmap:
+	pte_unmap(ptem);
+	return ret;
+}
+
+static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	struct page *head, *page, *tail;
+	int refs;
+
+	if (!pmd_present(orig) || (write && !pmd_write(orig)))
+		return 0;
+
+	refs = 0;
+	head = pmd_page(orig);
+	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	tail = page;
+	do {
+		VM_BUG_ON(compound_head(page) != head);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+
+	if (!page_cache_add_speculative(head, refs)) {
+		*nr -= refs;
+		return 0;
+	}
+
+	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
+		*nr -= refs;
+		while (refs--)
+			put_page(head);
+		return 0;
+	}
+
+	/*
+	 * Any tail pages need their mapcount reference taken before we
+	 * return. (This allows the THP code to bump their ref count when
+	 * they are split into base pages).
+	 */
+	while (refs--) {
+		if (PageTail(tail))
+			get_huge_page_tail(tail);
+		tail++;
+	}
+
+	return 1;
+}
+
+static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pmd_t *pmdp;
+
+	pmdp = pmd_offset(&pud, addr);
+	do {
+		pmd_t pmd = ACCESS_ONCE(*pmdp);
+		next = pmd_addr_end(addr, end);
+		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+			return 0;
+
+		if (unlikely(pmd_thp_or_huge(pmd))) {
+			if (!gup_huge_pmd(pmd, pmdp, addr, next, write,
+				pages, nr))
+				return 0;
+		} else {
+			if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+				return 0;
+		}
+	} while (pmdp++, addr = next, addr != end);
+
+	return 1;
+}
+
+static int gup_pud_range(pgd_t *pgdp, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pud_t *pudp;
+
+	pudp = pud_offset(pgdp, addr);
+	do {
+		pud_t pud = ACCESS_ONCE(*pudp);
+		next = pud_addr_end(addr, end);
+		if (pud_none(pud))
+			return 0;
+		else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+			return 0;
+	} while (pudp++, addr = next, addr != end);
+
+	return 1;
+}
+
+/*
+ * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall
+ * back to the regular GUP.
+ */
+int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
+			  struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long addr, len, end;
+	unsigned long next, flags;
+	pgd_t *pgdp;
+	int nr = 0;
+
+	start &= PAGE_MASK;
+	addr = start;
+	len = (unsigned long) nr_pages << PAGE_SHIFT;
+	end = start + len;
+
+	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
+					start, len)))
+		return 0;
+
+	/*
+	 * Disable interrupts, we use the nested form as we can already
+	 * have interrupts disabled by get_futex_key.
+	 *
+	 * With interrupts disabled, we block page table pages from being
+	 * freed from under us. See mmu_gather_tlb in asm-generic/tlb.h
+	 * for more details.
+	 */
+
+	local_irq_save(flags);
+	pgdp = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(*pgdp))
+			break;
+		else if (!gup_pud_range(pgdp, addr, next, write, pages, &nr))
+			break;
+	} while (pgdp++, addr = next, addr != end);
+	local_irq_restore(flags);
+
+	return nr;
+}
+
+int get_user_pages_fast(unsigned long start, int nr_pages, int write,
+			struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	int nr, ret;
+
+	start &= PAGE_MASK;
+	nr = __get_user_pages_fast(start, nr_pages, write, pages);
+	ret = nr;
+
+	if (nr < nr_pages) {
+		/* Try to get the remaining pages with get_user_pages */
+		start += nr << PAGE_SHIFT;
+		pages += nr;
+
+		down_read(&mm->mmap_sem);
+		ret = get_user_pages(current, mm, start,
+				     nr_pages - nr, write, 0, pages, NULL);
+		up_read(&mm->mmap_sem);
+
+		/* Have to be a bit careful with return values */
+		if (nr > 0) {
+			if (ret < 0)
+				ret = nr;
+			else
+				ret += nr;
+		}
+	}
+
+	return ret;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+static void thp_splitting_flush_sync(void *arg)
+{
+}
+
+void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
+			  pmd_t *pmdp)
+{
+	pmd_t pmd = pmd_mksplitting(*pmdp);
+	VM_BUG_ON(address & ~PMD_MASK);
+	set_pmd_at(vma->vm_mm, address, pmdp, pmd);
+
+	/* dummy IPI to serialise against fast_gup */
+	smp_call_function(thp_splitting_flush_sync, NULL, 1);
+}
+#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH V2 3/4] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
  2014-02-06 16:18 [RFC PATCH V2 0/4] get_user_pages_fast for ARM and ARM64 Steve Capper
  2014-02-06 16:18 ` [RFC PATCH V2 1/4] arm: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
  2014-02-06 16:18 ` [RFC PATCH V2 2/4] arm: mm: implement get_user_pages_fast Steve Capper
@ 2014-02-06 16:18 ` Steve Capper
  2014-02-11  2:29   ` Ming Lei
  2014-02-11 15:42   ` Catalin Marinas
  2014-02-06 16:18 ` [RFC PATCH V2 4/4] arm64: mm: implement get_user_pages_fast Steve Capper
  3 siblings, 2 replies; 12+ messages in thread
From: Steve Capper @ 2014-02-06 16:18 UTC (permalink / raw)
  To: linux-arm-kernel

In order to implement fast_get_user_pages we need to ensure that the
page table walker is protected from page table pages being freed from
under it.

This patch enables HAVE_RCU_TABLE_FREE and incorporates it into the
existing arm64 TLB logic. Any page table pages belonging to address
spaces with multiple users will be freed via call_rcu_sched, meaning
that disabling interrupts will block the free and protect the fast_gup
page walker.

Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
 arch/arm64/Kconfig           |  1 +
 arch/arm64/include/asm/tlb.h | 27 +++++++++++++++++++++++++--
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6d4dd22..129bd6a 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -28,6 +28,7 @@ config ARM64
 	select HAVE_HW_BREAKPOINT if PERF_EVENTS
 	select HAVE_MEMBLOCK
 	select HAVE_PERF_EVENTS
+	select HAVE_RCU_TABLE_FREE
 	select IRQ_DOMAIN
 	select MODULES_USE_ELF_RELA
 	select NO_BOOTMEM
diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index 717031a..8999823 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -27,12 +27,33 @@
 
 #define MMU_GATHER_BUNDLE	8
 
+static inline void __tlb_remove_table(void *_table)
+{
+	free_page_and_swap_cache((struct page *)_table);
+}
+
+struct mmu_table_batch {
+	struct rcu_head		rcu;
+	unsigned int		nr;
+	void			*tables[0];
+};
+
+#define MAX_TABLE_BATCH		\
+	((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(void *))
+
+extern void tlb_table_flush(struct mmu_gather *tlb);
+extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
+
+#define tlb_remove_entry(tlb,entry)	tlb_remove_table(tlb,entry)
+
 /*
  * TLB handling.  This allows us to remove pages from the page
  * tables, and efficiently handle the TLB issues.
  */
 struct mmu_gather {
 	struct mm_struct	*mm;
+	struct mmu_table_batch	*batch;
+	unsigned int		need_flush;
 	unsigned int		fullmm;
 	struct vm_area_struct	*vma;
 	unsigned long		start, end;
@@ -91,6 +112,7 @@ static inline void __tlb_alloc_page(struct mmu_gather *tlb)
 static inline void tlb_flush_mmu(struct mmu_gather *tlb)
 {
 	tlb_flush(tlb);
+	tlb_table_flush(tlb);
 	free_pages_and_swap_cache(tlb->pages, tlb->nr);
 	tlb->nr = 0;
 	if (tlb->pages == tlb->local)
@@ -109,6 +131,7 @@ tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start
 	tlb->pages = tlb->local;
 	tlb->nr = 0;
 	__tlb_alloc_page(tlb);
+	tlb->batch = NULL;
 }
 
 static inline void
@@ -172,7 +195,7 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
 {
 	pgtable_page_dtor(pte);
 	tlb_add_flush(tlb, addr);
-	tlb_remove_page(tlb, pte);
+	tlb_remove_entry(tlb, pte);
 }
 
 #ifndef CONFIG_ARM64_64K_PAGES
@@ -180,7 +203,7 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
 				  unsigned long addr)
 {
 	tlb_add_flush(tlb, addr);
-	tlb_remove_page(tlb, virt_to_page(pmdp));
+	tlb_remove_entry(tlb, virt_to_page(pmdp));
 }
 #endif
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH V2 4/4] arm64: mm: implement get_user_pages_fast
  2014-02-06 16:18 [RFC PATCH V2 0/4] get_user_pages_fast for ARM and ARM64 Steve Capper
                   ` (2 preceding siblings ...)
  2014-02-06 16:18 ` [RFC PATCH V2 3/4] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
@ 2014-02-06 16:18 ` Steve Capper
  2014-02-11  2:30   ` Ming Lei
  2014-02-11 15:48   ` Catalin Marinas
  3 siblings, 2 replies; 12+ messages in thread
From: Steve Capper @ 2014-02-06 16:18 UTC (permalink / raw)
  To: linux-arm-kernel

An implementation of get_user_pages_fast for arm64. It is based on the
arm implementation (with the added ability to walk huge puds), which is
in turn loosely based on the PowerPC implementation. We disable
interrupts in the walker to prevent the call_rcu_sched pagetable
freeing code from running under us.

We also explicitly fire an IPI in the Transparent HugePage splitting
case to prevent splits from interfering with the fast_gup walker.
As THP splits are relatively rare, this should not have a noticeable
overhead.

Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
 arch/arm64/include/asm/pgtable.h |   4 +
 arch/arm64/mm/Makefile           |   2 +-
 arch/arm64/mm/gup.c              | 297 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 302 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/mm/gup.c

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7f2b60a..8cfa1aa 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -212,6 +212,10 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define pmd_trans_huge(pmd)	(pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT))
 #define pmd_trans_splitting(pmd) (pmd_val(pmd) & PMD_SECT_SPLITTING)
+#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+struct vm_area_struct;
+void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
+			  pmd_t *pmdp);
 #endif
 
 #define PMD_BIT_FUNC(fn,op) \
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index b51d364..44b2148 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -1,5 +1,5 @@
 obj-y				:= dma-mapping.o extable.o fault.o init.o \
 				   cache.o copypage.o flush.o \
 				   ioremap.o mmap.o pgd.o mmu.o \
-				   context.o tlb.o proc.o
+				   context.o tlb.o proc.o gup.o
 obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
diff --git a/arch/arm64/mm/gup.c b/arch/arm64/mm/gup.c
new file mode 100644
index 0000000..45dd908
--- /dev/null
+++ b/arch/arm64/mm/gup.c
@@ -0,0 +1,297 @@
+/*
+ * arch/arm64/mm/gup.c
+ *
+ * Copyright (C) 2014 Linaro Ltd.
+ *
+ * Based on arch/powerpc/mm/gup.c which is:
+ * Copyright (C) 2008 Nick Piggin
+ * Copyright (C) 2008 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/rwsem.h>
+#include <linux/hugetlb.h>
+#include <asm/pgtable.h>
+
+static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
+			 int write, struct page **pages, int *nr)
+{
+	pte_t *ptep, *ptem;
+	int ret = 0;
+
+	ptem = ptep = pte_offset_map(&pmd, addr);
+	do {
+		pte_t pte = ACCESS_ONCE(*ptep);
+		struct page *page;
+
+		if (!pte_valid_user(pte) || (write && !pte_write(pte)))
+			goto pte_unmap;
+
+		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+		page = pte_page(pte);
+
+		if (!page_cache_get_speculative(page))
+			goto pte_unmap;
+
+		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+			put_page(page);
+			goto pte_unmap;
+		}
+
+		pages[*nr] = page;
+		(*nr)++;
+
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
+
+	ret = 1;
+
+pte_unmap:
+	pte_unmap(ptem);
+	return ret;
+}
+
+static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	struct page *head, *page, *tail;
+	int refs;
+
+	if (!pmd_present(orig) || (write && !pmd_write(orig)))
+		return 0;
+
+	refs = 0;
+	head = pmd_page(orig);
+	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	tail = page;
+	do {
+		VM_BUG_ON(compound_head(page) != head);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+
+	if (!page_cache_add_speculative(head, refs)) {
+		*nr -= refs;
+		return 0;
+	}
+
+	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
+		*nr -= refs;
+		while (refs--)
+			put_page(head);
+		return 0;
+	}
+
+	/*
+	 * Any tail pages need their mapcount reference taken before we
+	 * return. (This allows the THP code to bump their ref count when
+	 * they are split into base pages).
+	 */
+	while (refs--) {
+		if (PageTail(tail))
+			get_huge_page_tail(tail);
+		tail++;
+	}
+
+	return 1;
+}
+
+static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	struct page *head, *page, *tail;
+	pmd_t origpmd = __pmd(pud_val(orig));
+	int refs;
+
+	if (!pmd_present(origpmd) || (write && !pmd_write(origpmd)))
+		return 0;
+
+	refs = 0;
+	head = pmd_page(origpmd);
+	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+	tail = page;
+	do {
+		VM_BUG_ON(compound_head(page) != head);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+
+	if (!page_cache_add_speculative(head, refs)) {
+		*nr -= refs;
+		return 0;
+	}
+
+	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
+		*nr -= refs;
+		while (refs--)
+			put_page(head);
+		return 0;
+	}
+
+	while (refs--) {
+		if (PageTail(tail))
+			get_huge_page_tail(tail);
+		tail++;
+	}
+
+	return 1;
+}
+
+static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pmd_t *pmdp;
+
+	pmdp = pmd_offset(&pud, addr);
+	do {
+		pmd_t pmd = ACCESS_ONCE(*pmdp);
+		next = pmd_addr_end(addr, end);
+		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+			return 0;
+
+		if (unlikely(pmd_huge(pmd) || pmd_trans_huge(pmd))) {
+			if (!gup_huge_pmd(pmd, pmdp, addr, next, write,
+					pages, nr))
+				return 0;
+		} else {
+			if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+				return 0;
+		}
+	} while (pmdp++, addr = next, addr != end);
+
+	return 1;
+}
+
+static int gup_pud_range(pgd_t *pgdp, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pud_t *pudp;
+
+	pudp = pud_offset(pgdp, addr);
+	do {
+		pud_t pud = ACCESS_ONCE(*pudp);
+		next = pud_addr_end(addr, end);
+		if (pud_none(pud))
+			return 0;
+		if (pud_huge(pud)) {
+			if (!gup_huge_pud(pud, pudp, addr, next, write,
+					pages, nr))
+				return 0;
+		}
+		else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+			return 0;
+	} while (pudp++, addr = next, addr != end);
+
+	return 1;
+}
+
+/*
+ * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall
+ * back to the regular GUP.
+ */
+int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
+			  struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long addr, len, end;
+	unsigned long next, flags;
+	pgd_t *pgdp;
+	int nr = 0;
+
+	start &= PAGE_MASK;
+	addr = start;
+	len = (unsigned long) nr_pages << PAGE_SHIFT;
+	end = start + len;
+
+	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
+					start, len)))
+		return 0;
+
+	/*
+	 * Disable interrupts, we use the nested form as we can already
+	 * have interrupts disabled by get_futex_key.
+	 *
+	 * With interrupts disabled, we block page table pages from being
+	 * freed from under us. See mmu_gather_tlb in asm-generic/tlb.h
+	 * for more details.
+	 */
+
+	local_irq_save(flags);
+	pgdp = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(*pgdp))
+			break;
+		else if (!gup_pud_range(pgdp, addr, next, write, pages, &nr))
+			break;
+	} while (pgdp++, addr = next, addr != end);
+	local_irq_restore(flags);
+
+	return nr;
+}
+
+int get_user_pages_fast(unsigned long start, int nr_pages, int write,
+			struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	int nr, ret;
+
+	start &= PAGE_MASK;
+	nr = __get_user_pages_fast(start, nr_pages, write, pages);
+	ret = nr;
+
+	if (nr < nr_pages) {
+		/* Try to get the remaining pages with get_user_pages */
+		start += nr << PAGE_SHIFT;
+		pages += nr;
+
+		down_read(&mm->mmap_sem);
+		ret = get_user_pages(current, mm, start,
+				     nr_pages - nr, write, 0, pages, NULL);
+		up_read(&mm->mmap_sem);
+
+		/* Have to be a bit careful with return values */
+		if (nr > 0) {
+			if (ret < 0)
+				ret = nr;
+			else
+				ret += nr;
+		}
+	}
+
+	return ret;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void thp_splitting_flush_sync(void *arg)
+{
+}
+
+void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
+			  pmd_t *pmdp)
+{
+	pmd_t pmd = pmd_mksplitting(*pmdp);
+	VM_BUG_ON(address & ~PMD_MASK);
+	set_pmd_at(vma->vm_mm, address, pmdp, pmd);
+
+	/* dummy IPI to serialise against fast_gup */
+	smp_call_function(thp_splitting_flush_sync, NULL, 1);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH V2 3/4] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
  2014-02-06 16:18 ` [RFC PATCH V2 3/4] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
@ 2014-02-11  2:29   ` Ming Lei
  2014-02-11 15:42   ` Catalin Marinas
  1 sibling, 0 replies; 12+ messages in thread
From: Ming Lei @ 2014-02-11  2:29 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Feb 7, 2014 at 12:18 AM, Steve Capper <steve.capper@linaro.org> wrote:
> In order to implement fast_get_user_pages we need to ensure that the
> page table walker is protected from page table pages being freed from
> under it.
>
> This patch enables HAVE_RCU_TABLE_FREE and incorporates it into the
> existing arm64 TLB logic. Any page table pages belonging to address
> spaces with multiple users will be call_rcu_sched freed. Meaning
> that disabling interrupts will block the free and protect the fast
> gup page walker.
>
> Signed-off-by: Steve Capper <steve.capper@linaro.org>

Tested-by: Ming Lei <ming.lei@canonical.com>

Without patches 3 and 4 in this patchset we can't run the go script
successfully with THP enabled on arm64; after applying the two
patches, go starts working.

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC PATCH V2 4/4] arm64: mm: implement get_user_pages_fast
  2014-02-06 16:18 ` [RFC PATCH V2 4/4] arm64: mm: implement get_user_pages_fast Steve Capper
@ 2014-02-11  2:30   ` Ming Lei
  2014-02-11 15:48   ` Catalin Marinas
  1 sibling, 0 replies; 12+ messages in thread
From: Ming Lei @ 2014-02-11  2:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Feb 7, 2014 at 12:18 AM, Steve Capper <steve.capper@linaro.org> wrote:
> An implementation of get_user_pages_fast for arm64. It is based on the
> arm implementation (it has the added ability to walk huge puds) which
> is loosely on the PowerPC implementation. We disable interrupts in the
> walker to prevent the call_rcu_sched pagetable freeing code from
> running under us.
>
> We also explicitly fire an IPI in the Transparent HugePage splitting
> case to prevent splits from interfering with the fast_gup walker.
> As THP splits are relatively rare, this should not have a noticable
> overhead.
>
> Signed-off-by: Steve Capper <steve.capper@linaro.org>

Tested-by: Ming Lei <ming.lei@canonical.com>

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC PATCH V2 3/4] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
  2014-02-06 16:18 ` [RFC PATCH V2 3/4] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
  2014-02-11  2:29   ` Ming Lei
@ 2014-02-11 15:42   ` Catalin Marinas
  2014-02-11 16:08     ` Steve Capper
  1 sibling, 1 reply; 12+ messages in thread
From: Catalin Marinas @ 2014-02-11 15:42 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Steve,

On Thu, Feb 06, 2014 at 04:18:50PM +0000, Steve Capper wrote:
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 6d4dd22..129bd6a 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -28,6 +28,7 @@ config ARM64
>  	select HAVE_HW_BREAKPOINT if PERF_EVENTS
>  	select HAVE_MEMBLOCK
>  	select HAVE_PERF_EVENTS
> +	select HAVE_RCU_TABLE_FREE
>  	select IRQ_DOMAIN
>  	select MODULES_USE_ELF_RELA
>  	select NO_BOOTMEM
> diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
> index 717031a..8999823 100644
> --- a/arch/arm64/include/asm/tlb.h
> +++ b/arch/arm64/include/asm/tlb.h
> @@ -27,12 +27,33 @@
>  
>  #define MMU_GATHER_BUNDLE	8
>  
> +static inline void __tlb_remove_table(void *_table)
> +{
> +	free_page_and_swap_cache((struct page *)_table);
> +}

I think you can reduce your patch to just the above (and a linux/swap.h
include) after the arm64 conversion to generic mmu_gather below.

I cc'ed Peter Z for a sanity check, some of the code is close to
https://lkml.org/lkml/2011/3/7/302, only that it's under arch/arm64.

And, of course, it needs a lot more testing.

-------------8<---------------------------------------

From 01a958dfc44eb7ec697625813b3b98a705bad324 Mon Sep 17 00:00:00 2001
From: Catalin Marinas <catalin.marinas@arm.com>
Date: Tue, 11 Feb 2014 15:22:01 +0000
Subject: [PATCH] arm64: Convert asm/tlb.h to generic mmu_gather

Over the past couple of years, the generic mmu_gather gained range
tracking - 597e1c3580b7 (mm/mmu_gather: enable tlb flush range in generic
mmu_gather), 2b047252d087 (Fix TLB gather virtual address range
invalidation corner cases) - and tlb_fast_mode() has been removed -
29eb77825cc7 (arch, mm: Remove tlb_fast_mode()).

The new mmu_gather structure is now suitable for arm64 and this patch
converts the arch asm/tlb.h to the generic code. One functional
difference is the shift_arg_pages() case where previously the code was
flushing the full mm (no tlb_start_vma call) but now it flushes the
range given to tlb_gather_mmu() (possibly slightly more efficient
previously).

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/arm64/include/asm/tlb.h | 136 +++++++------------------------------------
 1 file changed, 20 insertions(+), 116 deletions(-)

diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index 717031a762c2..72cadf52ca80 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -19,115 +19,44 @@
 #ifndef __ASM_TLB_H
 #define __ASM_TLB_H
 
-#include <linux/pagemap.h>
-#include <linux/swap.h>
 
-#include <asm/pgalloc.h>
-#include <asm/tlbflush.h>
-
-#define MMU_GATHER_BUNDLE	8
-
-/*
- * TLB handling.  This allows us to remove pages from the page
- * tables, and efficiently handle the TLB issues.
- */
-struct mmu_gather {
-	struct mm_struct	*mm;
-	unsigned int		fullmm;
-	struct vm_area_struct	*vma;
-	unsigned long		start, end;
-	unsigned long		range_start;
-	unsigned long		range_end;
-	unsigned int		nr;
-	unsigned int		max;
-	struct page		**pages;
-	struct page		*local[MMU_GATHER_BUNDLE];
-};
+#include <asm-generic/tlb.h>
 
 /*
- * This is unnecessarily complex.  There's three ways the TLB shootdown
- * code is used:
+ * There's three ways the TLB shootdown code is used:
  *  1. Unmapping a range of vmas.  See zap_page_range(), unmap_region().
  *     tlb->fullmm = 0, and tlb_start_vma/tlb_end_vma will be called.
- *     tlb->vma will be non-NULL.
  *  2. Unmapping all vmas.  See exit_mmap().
  *     tlb->fullmm = 1, and tlb_start_vma/tlb_end_vma will be called.
- *     tlb->vma will be non-NULL.  Additionally, page tables will be freed.
+ *     Page tables will be freed.
  *  3. Unmapping argument pages.  See shift_arg_pages().
  *     tlb->fullmm = 0, but tlb_start_vma/tlb_end_vma will not be called.
- *     tlb->vma will be NULL.
  */
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
-	if (tlb->fullmm || !tlb->vma)
+	if (tlb->fullmm) {
 		flush_tlb_mm(tlb->mm);
-	else if (tlb->range_end > 0) {
-		flush_tlb_range(tlb->vma, tlb->range_start, tlb->range_end);
-		tlb->range_start = TASK_SIZE;
-		tlb->range_end = 0;
+	} else if (tlb->end > 0) {
+		struct vm_area_struct vma = { .vm_mm = tlb->mm, };
+		flush_tlb_range(&vma, tlb->start, tlb->end);
+		tlb->start = TASK_SIZE;
+		tlb->end = 0;
 	}
 }
 
 static inline void tlb_add_flush(struct mmu_gather *tlb, unsigned long addr)
 {
 	if (!tlb->fullmm) {
-		if (addr < tlb->range_start)
-			tlb->range_start = addr;
-		if (addr + PAGE_SIZE > tlb->range_end)
-			tlb->range_end = addr + PAGE_SIZE;
-	}
-}
-
-static inline void __tlb_alloc_page(struct mmu_gather *tlb)
-{
-	unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
-
-	if (addr) {
-		tlb->pages = (void *)addr;
-		tlb->max = PAGE_SIZE / sizeof(struct page *);
+		tlb->start = min(tlb->start, addr);
+		tlb->end = max(tlb->end, addr + PAGE_SIZE);
 	}
 }
 
-static inline void tlb_flush_mmu(struct mmu_gather *tlb)
-{
-	tlb_flush(tlb);
-	free_pages_and_swap_cache(tlb->pages, tlb->nr);
-	tlb->nr = 0;
-	if (tlb->pages == tlb->local)
-		__tlb_alloc_page(tlb);
-}
-
-static inline void
-tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start, unsigned long end)
-{
-	tlb->mm = mm;
-	tlb->fullmm = !(start | (end+1));
-	tlb->start = start;
-	tlb->end = end;
-	tlb->vma = NULL;
-	tlb->max = ARRAY_SIZE(tlb->local);
-	tlb->pages = tlb->local;
-	tlb->nr = 0;
-	__tlb_alloc_page(tlb);
-}
-
-static inline void
-tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
-{
-	tlb_flush_mmu(tlb);
-
-	/* keep the page table cache within bounds */
-	check_pgt_cache();
-
-	if (tlb->pages != tlb->local)
-		free_pages((unsigned long)tlb->pages, 0);
-}
-
 /*
  * Memorize the range for the TLB flush.
  */
-static inline void
-tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep, unsigned long addr)
+static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
+					  unsigned long addr)
 {
 	tlb_add_flush(tlb, addr);
 }
@@ -137,38 +66,24 @@ tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep, unsigned long addr)
  * case where we're doing a full MM flush.  When we're doing a munmap,
  * the vmas are adjusted to only cover the region to be torn down.
  */
-static inline void
-tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
+static inline void tlb_start_vma(struct mmu_gather *tlb,
+				 struct vm_area_struct *vma)
 {
 	if (!tlb->fullmm) {
-		tlb->vma = vma;
-		tlb->range_start = TASK_SIZE;
-		tlb->range_end = 0;
+		tlb->start = TASK_SIZE;
+		tlb->end = 0;
 	}
 }
 
-static inline void
-tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
+static inline void tlb_end_vma(struct mmu_gather *tlb,
+			       struct vm_area_struct *vma)
 {
 	if (!tlb->fullmm)
 		tlb_flush(tlb);
 }
 
-static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
-{
-	tlb->pages[tlb->nr++] = page;
-	VM_BUG_ON(tlb->nr > tlb->max);
-	return tlb->max - tlb->nr;
-}
-
-static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
-{
-	if (!__tlb_remove_page(tlb, page))
-		tlb_flush_mmu(tlb);
-}
-
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
-	unsigned long addr)
+				  unsigned long addr)
 {
 	pgtable_page_dtor(pte);
 	tlb_add_flush(tlb, addr);
@@ -184,16 +99,5 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
 }
 #endif
 
-#define pte_free_tlb(tlb, ptep, addr)	__pte_free_tlb(tlb, ptep, addr)
-#define pmd_free_tlb(tlb, pmdp, addr)	__pmd_free_tlb(tlb, pmdp, addr)
-#define pud_free_tlb(tlb, pudp, addr)	pud_free((tlb)->mm, pudp)
-
-#define tlb_migrate_finish(mm)		do { } while (0)
-
-static inline void
-tlb_remove_pmd_tlb_entry(struct mmu_gather *tlb, pmd_t *pmdp, unsigned long addr)
-{
-	tlb_add_flush(tlb, addr);
-}
 
 #endif

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH V2 4/4] arm64: mm: implement get_user_pages_fast
  2014-02-06 16:18 ` [RFC PATCH V2 4/4] arm64: mm: implement get_user_pages_fast Steve Capper
  2014-02-11  2:30   ` Ming Lei
@ 2014-02-11 15:48   ` Catalin Marinas
  2014-02-11 16:13     ` Steve Capper
  2014-03-11 10:14     ` Steve Capper
  1 sibling, 2 replies; 12+ messages in thread
From: Catalin Marinas @ 2014-02-11 15:48 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Feb 06, 2014 at 04:18:51PM +0000, Steve Capper wrote:
> An implementation of get_user_pages_fast for arm64. It is based on the
> arm implementation (it has the added ability to walk huge puds) which
> is loosely on the PowerPC implementation. We disable interrupts in the
> walker to prevent the call_rcu_sched pagetable freeing code from
> running under us.
> 
> We also explicitly fire an IPI in the Transparent HugePage splitting
> case to prevent splits from interfering with the fast_gup walker.
> As THP splits are relatively rare, this should not have a noticable
> overhead.
> 
> Signed-off-by: Steve Capper <steve.capper@linaro.org>
> ---
>  arch/arm64/include/asm/pgtable.h |   4 +
>  arch/arm64/mm/Makefile           |   2 +-
>  arch/arm64/mm/gup.c              | 297 +++++++++++++++++++++++++++++++++++++++

Why don't you make a generic gup.c implementation and let architectures
select it? I don't see much arm64-specific code in here.

-- 
Catalin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC PATCH V2 3/4] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
  2014-02-11 15:42   ` Catalin Marinas
@ 2014-02-11 16:08     ` Steve Capper
  0 siblings, 0 replies; 12+ messages in thread
From: Steve Capper @ 2014-02-11 16:08 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Feb 11, 2014 at 03:42:29PM +0000, Catalin Marinas wrote:
> Hi Steve,
> 
> On Thu, Feb 06, 2014 at 04:18:50PM +0000, Steve Capper wrote:
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 6d4dd22..129bd6a 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -28,6 +28,7 @@ config ARM64
> >  	select HAVE_HW_BREAKPOINT if PERF_EVENTS
> >  	select HAVE_MEMBLOCK
> >  	select HAVE_PERF_EVENTS
> > +	select HAVE_RCU_TABLE_FREE
> >  	select IRQ_DOMAIN
> >  	select MODULES_USE_ELF_RELA
> >  	select NO_BOOTMEM
> > diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
> > index 717031a..8999823 100644
> > --- a/arch/arm64/include/asm/tlb.h
> > +++ b/arch/arm64/include/asm/tlb.h
> > @@ -27,12 +27,33 @@
> >  
> >  #define MMU_GATHER_BUNDLE	8
> >  
> > +static inline void __tlb_remove_table(void *_table)
> > +{
> > +	free_page_and_swap_cache((struct page *)_table);
> > +}
> 
> I think you can reduce your patch to just the above (and a linux/swap.h
> include) after the arm64 conversion to generic mmu_gather below.
> 
> I cc'ed Peter Z for a sanity check, some of the code is close to
> https://lkml.org/lkml/2011/3/7/302, only that it's under arch/arm64.
> 
> And, of course, it needs a lot more testing.

Okay, cheers Catalin, I'll give that a go.

Cheers,
-- 
Steve

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC PATCH V2 4/4] arm64: mm: implement get_user_pages_fast
  2014-02-11 15:48   ` Catalin Marinas
@ 2014-02-11 16:13     ` Steve Capper
  2014-03-11 10:14     ` Steve Capper
  1 sibling, 0 replies; 12+ messages in thread
From: Steve Capper @ 2014-02-11 16:13 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Feb 11, 2014 at 03:48:59PM +0000, Catalin Marinas wrote:
> On Thu, Feb 06, 2014 at 04:18:51PM +0000, Steve Capper wrote:
> > An implementation of get_user_pages_fast for arm64. It is based on the
> > arm implementation (it has the added ability to walk huge puds) which
> > is loosely on the PowerPC implementation. We disable interrupts in the
> > walker to prevent the call_rcu_sched pagetable freeing code from
> > running under us.
> > 
> > We also explicitly fire an IPI in the Transparent HugePage splitting
> > case to prevent splits from interfering with the fast_gup walker.
> > As THP splits are relatively rare, this should not have a noticable
> > overhead.
> > 
> > Signed-off-by: Steve Capper <steve.capper@linaro.org>
> > ---
> >  arch/arm64/include/asm/pgtable.h |   4 +
> >  arch/arm64/mm/Makefile           |   2 +-
> >  arch/arm64/mm/gup.c              | 297 +++++++++++++++++++++++++++++++++++++++
> 
> Why don't you make a generic gup.c implementation and let architectures
> select it? I don't see much arm64-specific code in here.

Certainly ARM and ARM64 will happily share the code, and other
architectures use basically the same technique in most places, so they
could probably make use of a tweaked version.

I'll generalise gup.c in the next revision of this series.

Cheers,
-- 
Steve

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC PATCH V2 4/4] arm64: mm: implement get_user_pages_fast
  2014-02-11 15:48   ` Catalin Marinas
  2014-02-11 16:13     ` Steve Capper
@ 2014-03-11 10:14     ` Steve Capper
  1 sibling, 0 replies; 12+ messages in thread
From: Steve Capper @ 2014-03-11 10:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Feb 11, 2014 at 03:48:59PM +0000, Catalin Marinas wrote:
> On Thu, Feb 06, 2014 at 04:18:51PM +0000, Steve Capper wrote:
> > An implementation of get_user_pages_fast for arm64. It is based on the
> > arm implementation (it has the added ability to walk huge puds) which
> > is loosely on the PowerPC implementation. We disable interrupts in the
> > walker to prevent the call_rcu_sched pagetable freeing code from
> > running under us.
> > 
> > We also explicitly fire an IPI in the Transparent HugePage splitting
> > case to prevent splits from interfering with the fast_gup walker.
> > As THP splits are relatively rare, this should not have a noticable
> > overhead.
> > 
> > Signed-off-by: Steve Capper <steve.capper@linaro.org>
> > ---
> >  arch/arm64/include/asm/pgtable.h |   4 +
> >  arch/arm64/mm/Makefile           |   2 +-
> >  arch/arm64/mm/gup.c              | 297 +++++++++++++++++++++++++++++++++++++++
> 
> Why don't you make a generic gup.c implementation and let architectures
> select it? I don't see much arm64-specific code in here.

Hi Catalin,
I've had a stab at generalising the gup, but I've found that it varies
too much between architectures to make this practical for me:
 * x86 blocks on TLB invalidate so does not need the speculative page
   cache logic. Also x86 does not have 64-bit single-copy atomicity for
   pte reads, so needs a workaround.
 * mips is similar-ish to x86.
 * powerpc has extra is_hugepd codepaths to identify huge pages.
 * superh has sub-architecture pte flags and no 64-bit single-copy
   atomicity.
 * sparc has hypervisor tlb logic for the pte flags.
 * s390 has extra pmd dereference logic and extra barriers that I do not
   quite understand.

My plan was to introduce pte_special(.) for arm with LPAE, add
pte_special logic to fast_gup and share the fast_gup between arm and
arm64.
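
Roughly, the shared walker would then use pte_special() to filter out
mappings that fast_gup must not touch, instead of the arch-specific
pte_present_user()/pte_valid_user() tests. As a sketch only (not a
final patch), the test in gup_pte_range() would become something like:

-		if (!pte_valid_user(pte) || (write && !pte_write(pte)))
+		if (!pte_present(pte) || pte_special(pte) ||
+		    (write && !pte_write(pte)))
 			goto pte_unmap;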

Does this approach sound reasonable?

Thanks,
-- 
Steve

^ permalink raw reply	[flat|nested] 12+ messages in thread

Thread overview: 12+ messages
2014-02-06 16:18 [RFC PATCH V2 0/4] get_user_pages_fast for ARM and ARM64 Steve Capper
2014-02-06 16:18 ` [RFC PATCH V2 1/4] arm: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
2014-02-06 16:18 ` [RFC PATCH V2 2/4] arm: mm: implement get_user_pages_fast Steve Capper
2014-02-06 16:18 ` [RFC PATCH V2 3/4] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
2014-02-11  2:29   ` Ming Lei
2014-02-11 15:42   ` Catalin Marinas
2014-02-11 16:08     ` Steve Capper
2014-02-06 16:18 ` [RFC PATCH V2 4/4] arm64: mm: implement get_user_pages_fast Steve Capper
2014-02-11  2:30   ` Ming Lei
2014-02-11 15:48   ` Catalin Marinas
2014-02-11 16:13     ` Steve Capper
2014-03-11 10:14     ` Steve Capper
