* [RFC PATCH V4 0/7] get_user_pages_fast for ARM and ARM64
@ 2014-03-28 15:01 Steve Capper
  2014-03-28 15:01 ` [RFC PATCH V4 1/7] mm: Introduce a general RCU get_user_pages_fast Steve Capper
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Steve Capper @ 2014-03-28 15:01 UTC
  To: linux-arm-kernel
Hello,
This RFC series implements get_user_pages_fast and __get_user_pages_fast.
These are required for Transparent HugePages to function correctly, as
a futex on a THP tail will otherwise result in an infinite loop (due to
the core implementation of __get_user_pages_fast always returning 0).
This series may also be beneficial for direct-IO heavy workloads and
certain KVM workloads.
The main changes since RFC V3 are:
 * fast_gup now generalised and moved to core code.
 * pte_special logic now extended to reduce unnecessary icache syncs.
 * dropped the pte_accessible logic in fast_gup as it is unnecessary.
I would really appreciate any comments (especially on the validity or
otherwise of the core fast_gup implementation) and/or testers.
Cheers,
--
Steve
Catalin Marinas (1):
  arm64: Convert asm/tlb.h to generic mmu_gather
Steve Capper (6):
  mm: Introduce a general RCU get_user_pages_fast.
  arm: mm: Introduce special ptes for LPAE
  arm: mm: Enable HAVE_RCU_TABLE_FREE logic
  arm: mm: Enable RCU fast_gup
  arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
  arm64: mm: Enable RCU fast_gup
 arch/arm/Kconfig                      |   4 +
 arch/arm/include/asm/pgtable-2level.h |   2 +
 arch/arm/include/asm/pgtable-3level.h |  14 ++
 arch/arm/include/asm/pgtable.h        |   6 +-
 arch/arm/include/asm/tlb.h            |  38 ++++-
 arch/arm/mm/flush.c                   |  19 +++
 arch/arm64/Kconfig                    |   4 +
 arch/arm64/include/asm/pgtable.h      |   4 +
 arch/arm64/include/asm/tlb.h          | 140 +++-------------
 arch/arm64/mm/flush.c                 |  19 +++
 mm/Kconfig                            |   3 +
 mm/Makefile                           |   1 +
 mm/gup.c                              | 297 ++++++++++++++++++++++++++++++++++
 13 files changed, 431 insertions(+), 120 deletions(-)
 create mode 100644 mm/gup.c
-- 
1.8.1.4
* [RFC PATCH V4 1/7] mm: Introduce a general RCU get_user_pages_fast.
  2014-03-28 15:01 [RFC PATCH V4 0/7] get_user_pages_fast for ARM and ARM64 Steve Capper
@ 2014-03-28 15:01 ` Steve Capper
  2014-03-28 15:01 ` [RFC PATCH V4 2/7] arm: mm: Introduce special ptes for LPAE Steve Capper
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Steve Capper @ 2014-03-28 15:01 UTC
  To: linux-arm-kernel
A general RCU implementation of get_user_pages_fast. It is based
heavily on the PowerPC implementation.
The lockless page cache protocols are used as this implementation
assumes that TLB invalidations do not necessarily need to be broadcast
via IPI.
This implementation does however assume that THP splits will broadcast
an IPI, and this is why interrupts are disabled in the fast_gup walker
(otherwise calls to rcu_read_(un)lock would suffice).
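As an illustration only, the pattern that gup_pte_range follows (read the
entry once, take a speculative reference, then re-check the entry) can be
modelled in userspace with C11 atomics. The types fake_page and fake_pte_t
and the helpers get_speculative, put_ref and lockless_pin are hypothetical
stand-ins, not kernel API. The kernel code uses
page_cache_get_speculative() and put_page() and runs with interrupts
disabled, none of which this sketch models.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page: only a reference count. */
struct fake_page {
	atomic_int refcount;
};

/* Hypothetical stand-in for a page table slot holding a page pointer. */
typedef _Atomic uintptr_t fake_pte_t;

/* Take a reference only if the count is non-zero (page still live). */
static bool get_speculative(struct fake_page *page)
{
	int old = atomic_load(&page->refcount);

	while (old != 0) {
		/* On failure, old is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&page->refcount, &old, old + 1))
			return true;
	}
	return false;
}

static void put_ref(struct fake_page *page)
{
	atomic_fetch_sub(&page->refcount, 1);
}

/*
 * Snapshot the entry once, pin the page, then confirm the entry did not
 * change while the reference was being taken. Returns the pinned page,
 * or NULL if the caller should fall back to a slow path.
 */
static struct fake_page *lockless_pin(fake_pte_t *ptep)
{
	uintptr_t snap;
	struct fake_page *page;

	assert(ptep != NULL);
	snap = atomic_load(ptep);
	page = (struct fake_page *)snap;

	if (page == NULL)
		return NULL;
	if (!get_speculative(page))
		return NULL;
	if (atomic_load(ptep) != snap) {
		/* Entry changed under us: drop the reference and bail out. */
		put_ref(page);
		return NULL;
	}
	return page;
}
```

A NULL return here corresponds to the walker giving up, at which point
get_user_pages_fast falls back to the regular get_user_pages path.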
Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
This is my first attempt to generalise fast_gup to core code. At the
moment there are two implicit assumptions that I know about:
  o) 64-bit ptes can be atomically read.
  o) hugetlb pages and thps have a similar bit layout.
Any feedback from other architectures' maintainers on how this could be
tweaked to accommodate them would be greatly appreciated, especially
as there is a lot of similarity between each architecture's fast_gup.
---
 mm/Kconfig  |   3 +
 mm/Makefile |   1 +
 mm/gup.c    | 297 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 301 insertions(+)
 create mode 100644 mm/gup.c
diff --git a/mm/Kconfig b/mm/Kconfig
index 2888024..0151e17 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -134,6 +134,9 @@ config HAVE_MEMBLOCK
 config HAVE_MEMBLOCK_NODE_MAP
 	boolean
 
+config HAVE_RCU_GUP
+	boolean
+
 config ARCH_DISCARD_MEMBLOCK
 	boolean
 
diff --git a/mm/Makefile b/mm/Makefile
index 310c90a..0f19c5f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -28,6 +28,7 @@ else
 endif
 
 obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.o
+obj-$(CONFIG_HAVE_RCU_GUP) += gup.o
 
 obj-$(CONFIG_BOUNCE)	+= bounce.o
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
diff --git a/mm/gup.c b/mm/gup.c
new file mode 100644
index 0000000..b35296f
--- /dev/null
+++ b/mm/gup.c
@@ -0,0 +1,297 @@
+/*
+ * mm/gup.c
+ *
+ * Copyright (C) 2014 Linaro Ltd.
+ *
+ * Based on arch/powerpc/mm/gup.c which is:
+ * Copyright (C) 2008 Nick Piggin
+ * Copyright (C) 2008 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/rwsem.h>
+#include <linux/hugetlb.h>
+#include <asm/pgtable.h>
+
+#ifdef __HAVE_ARCH_PTE_SPECIAL
+static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
+			 int write, struct page **pages, int *nr)
+{
+	pte_t *ptep, *ptem;
+	int ret = 0;
+
+	ptem = ptep = pte_offset_map(&pmd, addr);
+	do {
+		pte_t pte = ACCESS_ONCE(*ptep);
+		struct page *page;
+
+		if (!pte_present(pte) || pte_special(pte)
+			|| (write && !pte_write(pte)))
+			goto pte_unmap;
+
+		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+		page = pte_page(pte);
+
+		if (!page_cache_get_speculative(page))
+			goto pte_unmap;
+
+		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+			put_page(page);
+			goto pte_unmap;
+		}
+
+		pages[*nr] = page;
+		(*nr)++;
+
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
+
+	ret = 1;
+
+pte_unmap:
+	pte_unmap(ptem);
+	return ret;
+}
+#else
+
+/*
+ * If we can't determine whether or not a pte is special, then fail immediately
+ * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
+ * to be special.
+ */
+static inline int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
+			 int write, struct page **pages, int *nr)
+{
+	return 0;
+}
+#endif /* __HAVE_ARCH_PTE_SPECIAL */
+
+static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	struct page *head, *page, *tail;
+	int refs;
+
+	if (!pmd_present(orig) || (write && !pmd_write(orig)))
+		return 0;
+
+	refs = 0;
+	head = pmd_page(orig);
+	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	tail = page;
+	do {
+		VM_BUG_ON(compound_head(page) != head);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+
+	if (!page_cache_add_speculative(head, refs)) {
+		*nr -= refs;
+		return 0;
+	}
+
+	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
+		*nr -= refs;
+		while (refs--)
+			put_page(head);
+		return 0;
+	}
+
+	/*
+	 * Any tail pages need their mapcount reference taken before we
+	 * return. (This allows the THP code to bump their ref count when
+	 * they are split into base pages).
+	 */
+	while (refs--) {
+		if (PageTail(tail))
+			get_huge_page_tail(tail);
+		tail++;
+	}
+
+	return 1;
+}
+
+static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	struct page *head, *page, *tail;
+	pmd_t origpmd = __pmd(pud_val(orig));
+	int refs;
+
+	if (!pmd_present(origpmd) || (write && !pmd_write(origpmd)))
+		return 0;
+
+	refs = 0;
+	head = pmd_page(origpmd);
+	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+	tail = page;
+	do {
+		VM_BUG_ON(compound_head(page) != head);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+
+	if (!page_cache_add_speculative(head, refs)) {
+		*nr -= refs;
+		return 0;
+	}
+
+	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
+		*nr -= refs;
+		while (refs--)
+			put_page(head);
+		return 0;
+	}
+
+	while (refs--) {
+		if (PageTail(tail))
+			get_huge_page_tail(tail);
+		tail++;
+	}
+
+	return 1;
+}
+
+static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pmd_t *pmdp;
+
+	pmdp = pmd_offset(&pud, addr);
+	do {
+		pmd_t pmd = ACCESS_ONCE(*pmdp);
+		next = pmd_addr_end(addr, end);
+		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+			return 0;
+
+		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
+			if (!gup_huge_pmd(pmd, pmdp, addr, next, write,
+				pages, nr))
+				return 0;
+		} else {
+			if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+				return 0;
+		}
+	} while (pmdp++, addr = next, addr != end);
+
+	return 1;
+}
+
+static int gup_pud_range(pgd_t *pgdp, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pud_t *pudp;
+
+	pudp = pud_offset(pgdp, addr);
+	do {
+		pud_t pud = ACCESS_ONCE(*pudp);
+		next = pud_addr_end(addr, end);
+		if (pud_none(pud))
+			return 0;
+		if (pud_huge(pud)) {
+			if (!gup_huge_pud(pud, pudp, addr, next, write,
+					pages, nr))
+				return 0;
+		} else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+			return 0;
+	} while (pudp++, addr = next, addr != end);
+
+	return 1;
+}
+
+/*
+ * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall
+ * back to the regular GUP.
+ */
+int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
+			  struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long addr, len, end;
+	unsigned long next, flags;
+	pgd_t *pgdp;
+	int nr = 0;
+
+	start &= PAGE_MASK;
+	addr = start;
+	len = (unsigned long) nr_pages << PAGE_SHIFT;
+	end = start + len;
+
+	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
+					start, len)))
+		return 0;
+
+	/*
+	 * Disable interrupts. We use the nested form as interrupts may
+	 * already be disabled by get_futex_key.
+	 *
+	 * With interrupts disabled, we block page table pages from being
+	 * freed from under us. See mmu_gather_tlb in asm-generic/tlb.h
+	 * for more details.
+	 *
+	 * We do not adopt an rcu_read_lock(.) here as we also want to
+	 * block IPIs that come from THPs splitting.
+	 */
+
+	local_irq_save(flags);
+	pgdp = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(*pgdp))
+			break;
+		else if (!gup_pud_range(pgdp, addr, next, write, pages, &nr))
+			break;
+	} while (pgdp++, addr = next, addr != end);
+	local_irq_restore(flags);
+
+	return nr;
+}
+
+int get_user_pages_fast(unsigned long start, int nr_pages, int write,
+			struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	int nr, ret;
+
+	start &= PAGE_MASK;
+	nr = __get_user_pages_fast(start, nr_pages, write, pages);
+	ret = nr;
+
+	if (nr < nr_pages) {
+		/* Try to get the remaining pages with get_user_pages */
+		start += nr << PAGE_SHIFT;
+		pages += nr;
+
+		down_read(&mm->mmap_sem);
+		ret = get_user_pages(current, mm, start,
+				     nr_pages - nr, write, 0, pages, NULL);
+		up_read(&mm->mmap_sem);
+
+		/* Have to be a bit careful with return values */
+		if (nr > 0) {
+			if (ret < 0)
+				ret = nr;
+			else
+				ret += nr;
+		}
+	}
+
+	return ret;
+}
-- 
1.8.1.4
* [RFC PATCH V4 2/7] arm: mm: Introduce special ptes for LPAE
  2014-03-28 15:01 [RFC PATCH V4 0/7] get_user_pages_fast for ARM and ARM64 Steve Capper
  2014-03-28 15:01 ` [RFC PATCH V4 1/7] mm: Introduce a general RCU get_user_pages_fast Steve Capper
@ 2014-03-28 15:01 ` Steve Capper
  2014-03-28 15:01 ` [RFC PATCH V4 3/7] arm: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Steve Capper @ 2014-03-28 15:01 UTC
  To: linux-arm-kernel
We need a mechanism to tag ptes as being special. This indicates that
no attempt should be made to access the underlying struct page *
associated with the pte. This is used by fast_gup when operating on
ptes, as it has no means to access VMAs (which also contain this
information) locklessly.
The L_PTE_SPECIAL bit is already allocated for LPAE. This patch modifies
pte_special and pte_mkspecial to make use of it, and defines
__HAVE_ARCH_PTE_SPECIAL.
This patch also excludes special ptes from the icache/dcache sync logic.
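A minimal userspace sketch of the special-pte test, for illustration
only. The FAKE_* bit positions and the fake_* helpers are assumptions
made for this example and are not taken from the patch; the real
definitions live in pgtable-3level.h. fake_fast_gup_may_pin mirrors the
per-pte check that the fast_gup walker performs before taking a
reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

static_assert(sizeof(pteval_t) == 8, "this sketch models 64-bit descriptors");

/* Assumed bit positions, for illustration only. */
#define FAKE_PTE_VALID   ((pteval_t)1 << 0)
#define FAKE_PTE_RDONLY  ((pteval_t)1 << 7)
#define FAKE_PTE_SPECIAL ((pteval_t)1 << 56)

static bool fake_pte_special(pteval_t pte)
{
	return (pte & FAKE_PTE_SPECIAL) != 0;
}

static pteval_t fake_pte_mkspecial(pteval_t pte)
{
	return pte | FAKE_PTE_SPECIAL;
}

/*
 * A lockless walker may only pin the page behind a pte that is present,
 * not special, and writable when write access was requested.
 */
static bool fake_fast_gup_may_pin(pteval_t pte, bool write)
{
	if (!(pte & FAKE_PTE_VALID))
		return false;
	if (fake_pte_special(pte))
		return false;
	if (write && (pte & FAKE_PTE_RDONLY))
		return false;
	return true;
}
```

Because the special bit lives in the pte itself, the walker can reject
such mappings without consulting the VMA.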
Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
Changed in V4: set_pte_at will not perform an icache sync for special
ptes.
---
 arch/arm/include/asm/pgtable-2level.h | 2 ++
 arch/arm/include/asm/pgtable-3level.h | 8 ++++++++
 arch/arm/include/asm/pgtable.h        | 6 ++----
 3 files changed, 12 insertions(+), 4 deletions(-)
diff --git a/arch/arm/include/asm/pgtable-2level.h b/arch/arm/include/asm/pgtable-2level.h
index dfff709..26d1742 100644
--- a/arch/arm/include/asm/pgtable-2level.h
+++ b/arch/arm/include/asm/pgtable-2level.h
@@ -181,6 +181,8 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
 #define pmd_addr_end(addr,end) (end)
 
 #define set_pte_ext(ptep,pte,ext) cpu_set_pte_ext(ptep,pte,ext)
+#define pte_special(pte)	(0)
+static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
 
 /*
  * We don't have huge page support for short descriptors, for the moment
diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index 85c60ad..b286ba9 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -207,6 +207,14 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
 #define pte_huge(pte)		(pte_val(pte) && !(pte_val(pte) & PTE_TABLE_BIT))
 #define pte_mkhuge(pte)		(__pte(pte_val(pte) & ~PTE_TABLE_BIT))
 
+#define pte_special(pte)	(!!(pte_val(pte) & L_PTE_SPECIAL))
+static inline pte_t pte_mkspecial(pte_t pte)
+{
+	pte_val(pte) |= L_PTE_SPECIAL;
+	return pte;
+}
+#define	__HAVE_ARCH_PTE_SPECIAL
+
 #define pmd_young(pmd)		(pmd_val(pmd) & PMD_SECT_AF)
 
 #define __HAVE_ARCH_PMD_WRITE
diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index 7d59b52..e6139e3 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -220,7 +220,6 @@ static inline pte_t *pmd_page_vaddr(pmd_t pmd)
 #define pte_dirty(pte)		(pte_val(pte) & L_PTE_DIRTY)
 #define pte_young(pte)		(pte_val(pte) & L_PTE_YOUNG)
 #define pte_exec(pte)		(!(pte_val(pte) & L_PTE_XN))
-#define pte_special(pte)	(0)
 
 #define pte_present_user(pte)  (pte_present(pte) && (pte_val(pte) & L_PTE_USER))
 
@@ -238,7 +237,8 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 	unsigned long ext = 0;
 
 	if (addr < TASK_SIZE && pte_present_user(pteval)) {
-		__sync_icache_dcache(pteval);
+		if (!pte_special(pteval))
+			__sync_icache_dcache(pteval);
 		ext |= PTE_EXT_NG;
 	}
 
@@ -257,8 +257,6 @@ PTE_BIT_FUNC(mkyoung,   |= L_PTE_YOUNG);
 PTE_BIT_FUNC(mkexec,   &= ~L_PTE_XN);
 PTE_BIT_FUNC(mknexec,   |= L_PTE_XN);
 
-static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
-
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
 	const pteval_t mask = L_PTE_XN | L_PTE_RDONLY | L_PTE_USER |
-- 
1.8.1.4
* [RFC PATCH V4 3/7] arm: mm: Enable HAVE_RCU_TABLE_FREE logic
  2014-03-28 15:01 [RFC PATCH V4 0/7] get_user_pages_fast for ARM and ARM64 Steve Capper
  2014-03-28 15:01 ` [RFC PATCH V4 1/7] mm: Introduce a general RCU get_user_pages_fast Steve Capper
  2014-03-28 15:01 ` [RFC PATCH V4 2/7] arm: mm: Introduce special ptes for LPAE Steve Capper
@ 2014-03-28 15:01 ` Steve Capper
  2014-05-01 11:11   ` Catalin Marinas
  2014-03-28 15:01 ` [RFC PATCH V4 4/7] arm: mm: Enable RCU fast_gup Steve Capper
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Steve Capper @ 2014-03-28 15:01 UTC
  To: linux-arm-kernel
In order to implement get_user_pages_fast we need to ensure that the
page table walker is protected from page table pages being freed from
under it.
One way to achieve this is to have the walker disable interrupts, and
rely on IPIs from the TLB flushing code blocking before the page table
pages are freed.
On some ARM platforms we have hardware TLB invalidation, thus the TLB
flushing code won't necessarily broadcast IPIs. Also spuriously
broadcasting IPIs can hurt system performance if done too often.
This problem has already been solved on PowerPC and Sparc by batching
up page table pages belonging to address spaces with more than one user
(mm_users > 1), then scheduling an rcu_sched callback to free the pages.
A walker that disables interrupts holds off the rcu_sched grace period
on its CPU, and thus blocks the page table pages from being freed. This
logic has also been promoted to core code and is activated when one
enables HAVE_RCU_TABLE_FREE.
This patch enables HAVE_RCU_TABLE_FREE and incorporates it into the
existing ARM TLB logic.
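The batching idea can be sketched in userspace as follows. This is an
illustrative model, not the core implementation: fake_table_batch,
fake_gather, FAKE_BATCH_MAX and the fake_* functions are hypothetical
names, the pending list stands in for the rcu_head, and
fake_grace_period_elapsed stands in for the rcu_sched callback that the
kernel would invoke once no interrupts-disabled walker can still be
running.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define FAKE_BATCH_MAX 4 /* hypothetical; the kernel derives this from PAGE_SIZE */

struct fake_table_batch {
	struct fake_table_batch *next; /* pending-list link, stands in for rcu_head */
	unsigned int nr;
	void *tables[FAKE_BATCH_MAX];
};

struct fake_gather {
	struct fake_table_batch *batch;   /* batch currently being filled */
	struct fake_table_batch *pending; /* batches awaiting a grace period */
	unsigned int freed;               /* tables actually freed so far */
};

/* Hand the current batch to the deferred-free list; nothing is freed yet. */
static void fake_table_flush(struct fake_gather *g)
{
	if (g->batch == NULL)
		return;
	g->batch->next = g->pending;
	g->pending = g->batch;
	g->batch = NULL;
}

/* Queue one page table for deferred freeing. Returns 0, or -1 if no memory. */
static int fake_remove_table(struct fake_gather *g, void *table)
{
	assert(table != NULL);
	if (g->batch == NULL) {
		g->batch = calloc(1, sizeof(*g->batch));
		if (g->batch == NULL)
			return -1;
	}
	g->batch->tables[g->batch->nr++] = table;
	if (g->batch->nr == FAKE_BATCH_MAX)
		fake_table_flush(g);
	return 0;
}

/* Stand-in for the deferred callback: free every table in every pending batch. */
static void fake_grace_period_elapsed(struct fake_gather *g)
{
	while (g->pending != NULL) {
		struct fake_table_batch *b = g->pending;

		g->pending = b->next;
		for (unsigned int i = 0; i < b->nr; i++) {
			free(b->tables[i]);
			g->freed++;
		}
		free(b);
	}
}
```

The point of the model is that fake_remove_table and fake_table_flush
never free a table themselves; freeing happens only in the deferred
step.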
Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
 arch/arm/Kconfig           |  1 +
 arch/arm/include/asm/tlb.h | 38 ++++++++++++++++++++++++++++++++++++--
 2 files changed, 37 insertions(+), 2 deletions(-)
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 1594945..7d5340d 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -58,6 +58,7 @@ config ARM
 	select HAVE_PERF_EVENTS
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
+	select HAVE_RCU_TABLE_FREE if SMP
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UID16
diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h
index 0baf7f0..eaf7578 100644
--- a/arch/arm/include/asm/tlb.h
+++ b/arch/arm/include/asm/tlb.h
@@ -35,12 +35,39 @@
 
 #define MMU_GATHER_BUNDLE	8
 
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+static inline void __tlb_remove_table(void *_table)
+{
+	free_page_and_swap_cache((struct page *)_table);
+}
+
+struct mmu_table_batch {
+	struct rcu_head		rcu;
+	unsigned int		nr;
+	void			*tables[0];
+};
+
+#define MAX_TABLE_BATCH		\
+	((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(void *))
+
+extern void tlb_table_flush(struct mmu_gather *tlb);
+extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
+
+#define tlb_remove_entry(tlb, entry)	tlb_remove_table(tlb, entry)
+#else
+#define tlb_remove_entry(tlb, entry)	tlb_remove_page(tlb, entry)
+#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+
 /*
  * TLB handling.  This allows us to remove pages from the page
  * tables, and efficiently handle the TLB issues.
  */
 struct mmu_gather {
 	struct mm_struct	*mm;
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	struct mmu_table_batch	*batch;
+	unsigned int		need_flush;
+#endif
 	unsigned int		fullmm;
 	struct vm_area_struct	*vma;
 	unsigned long		start, end;
@@ -101,6 +128,9 @@ static inline void __tlb_alloc_page(struct mmu_gather *tlb)
 static inline void tlb_flush_mmu(struct mmu_gather *tlb)
 {
 	tlb_flush(tlb);
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	tlb_table_flush(tlb);
+#endif
 	free_pages_and_swap_cache(tlb->pages, tlb->nr);
 	tlb->nr = 0;
 	if (tlb->pages == tlb->local)
@@ -119,6 +149,10 @@ tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start
 	tlb->pages = tlb->local;
 	tlb->nr = 0;
 	__tlb_alloc_page(tlb);
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	tlb->batch = NULL;
+#endif
 }
 
 static inline void
@@ -195,7 +229,7 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
 	tlb_add_flush(tlb, addr + SZ_1M);
 #endif
 
-	tlb_remove_page(tlb, pte);
+	tlb_remove_entry(tlb, pte);
 }
 
 static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
@@ -203,7 +237,7 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
 {
 #ifdef CONFIG_ARM_LPAE
 	tlb_add_flush(tlb, addr);
-	tlb_remove_page(tlb, virt_to_page(pmdp));
+	tlb_remove_entry(tlb, virt_to_page(pmdp));
 #endif
 }
 
-- 
1.8.1.4
* [RFC PATCH V4 4/7] arm: mm: Enable RCU fast_gup
  2014-03-28 15:01 [RFC PATCH V4 0/7] get_user_pages_fast for ARM and ARM64 Steve Capper
                   ` (2 preceding siblings ...)
  2014-03-28 15:01 ` [RFC PATCH V4 3/7] arm: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
@ 2014-03-28 15:01 ` Steve Capper
  2014-03-28 15:01 ` [RFC PATCH V4 5/7] arm64: Convert asm/tlb.h to generic mmu_gather Steve Capper
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Steve Capper @ 2014-03-28 15:01 UTC
  To: linux-arm-kernel
Activate the RCU fast_gup for ARM. We also need to force THP splits to
broadcast an IPI so that we block in the fast_gup page walker. As THP
splits are comparatively rare, this should not lead to a noticeable
performance degradation.
Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
 arch/arm/Kconfig                      |  3 +++
 arch/arm/include/asm/pgtable-3level.h |  6 ++++++
 arch/arm/mm/flush.c                   | 19 +++++++++++++++++++
 3 files changed, 28 insertions(+)
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 7d5340d..3cf589e 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1788,6 +1788,9 @@ config ARCH_SELECT_MEMORY_MODEL
 config HAVE_ARCH_PFN_VALID
 	def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM
 
+config HAVE_RCU_GUP
+	def_bool y
+
 config HIGHMEM
 	bool "High Memory Support"
 	depends on MMU
diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index b286ba9..fdc4a4f 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -226,6 +226,12 @@ static inline pte_t pte_mkspecial(pte_t pte)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define pmd_trans_huge(pmd)	(pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT))
 #define pmd_trans_splitting(pmd) (pmd_val(pmd) & PMD_SECT_SPLITTING)
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
+			  pmd_t *pmdp);
+#endif
 #endif
 
 #define PMD_BIT_FUNC(fn,op) \
diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
index 3387e60..91a2b59 100644
--- a/arch/arm/mm/flush.c
+++ b/arch/arm/mm/flush.c
@@ -377,3 +377,22 @@ void __flush_anon_page(struct vm_area_struct *vma, struct page *page, unsigned l
 	 */
 	__cpuc_flush_dcache_area(page_address(page), PAGE_SIZE);
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+static void thp_splitting_flush_sync(void *arg)
+{
+}
+
+void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
+			  pmd_t *pmdp)
+{
+	pmd_t pmd = pmd_mksplitting(*pmdp);
+	VM_BUG_ON(address & ~PMD_MASK);
+	set_pmd_at(vma->vm_mm, address, pmdp, pmd);
+
+	/* dummy IPI to serialise against fast_gup */
+	smp_call_function(thp_splitting_flush_sync, NULL, 1);
+}
+#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
1.8.1.4
* [RFC PATCH V4 5/7] arm64: Convert asm/tlb.h to generic mmu_gather
  2014-03-28 15:01 [RFC PATCH V4 0/7] get_user_pages_fast for ARM and ARM64 Steve Capper
                   ` (3 preceding siblings ...)
  2014-03-28 15:01 ` [RFC PATCH V4 4/7] arm: mm: Enable RCU fast_gup Steve Capper
@ 2014-03-28 15:01 ` Steve Capper
  2014-03-28 15:01 ` [RFC PATCH V4 6/7] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
  2014-03-28 15:01 ` [RFC PATCH V4 7/7] arm64: mm: Enable RCU fast_gup Steve Capper
  6 siblings, 0 replies; 19+ messages in thread
From: Steve Capper @ 2014-03-28 15:01 UTC
  To: linux-arm-kernel
From: Catalin Marinas <catalin.marinas@arm.com>
Over the past couple of years, the generic mmu_gather gained range
tracking - 597e1c3580b7 (mm/mmu_gather: enable tlb flush range in generic
mmu_gather), 2b047252d087 (Fix TLB gather virtual address range
invalidation corner cases) - and tlb_fast_mode() has been removed -
29eb77825cc7 (arch, mm: Remove tlb_fast_mode()).
The new mmu_gather structure is now suitable for arm64 and this patch
converts the arch asm/tlb.h to the generic code. One functional
difference is the shift_arg_pages() case where previously the code was
flushing the full mm (no tlb_start_vma call) but now it flushes the
range given to tlb_gather_mmu() (the previous behaviour was possibly
slightly more efficient).
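For readers unfamiliar with the range tracking, here is a small
userspace sketch of the min/max bookkeeping that tlb_add_flush performs.
The struct fake_gather_range, the FAKE_* constants and the fake_*
helpers are hypothetical and exist only for this example; in particular
the page size and task size values are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FAKE_PAGE_SIZE UINT64_C(4096)
#define FAKE_TASK_SIZE (UINT64_C(1) << 39) /* assumed value for the sketch */

static_assert((FAKE_PAGE_SIZE & (FAKE_PAGE_SIZE - 1)) == 0,
	      "page size must be a power of two");

struct fake_gather_range {
	bool fullmm;
	uint64_t start, end;
};

/* Empty range: start above any valid address, end at zero. */
static void fake_range_reset(struct fake_gather_range *tlb)
{
	tlb->start = FAKE_TASK_SIZE;
	tlb->end = 0;
}

/* Grow the pending range so that it covers the page at addr. */
static void fake_add_flush(struct fake_gather_range *tlb, uint64_t addr)
{
	if (tlb->fullmm)
		return;
	if (addr < tlb->start)
		tlb->start = addr;
	if (addr + FAKE_PAGE_SIZE > tlb->end)
		tlb->end = addr + FAKE_PAGE_SIZE;
}

/* True when a ranged flush of [start, end) is still outstanding. */
static bool fake_range_pending(const struct fake_gather_range *tlb)
{
	return !tlb->fullmm && tlb->end > 0;
}
```

Resetting to (TASK_SIZE, 0) means the first page added always sets both
bounds, and an untouched range is recognisable by end == 0.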
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
I think Catalin already has this patch in his upstream tree; it's
included in this series for the sake of completeness.
---
 arch/arm64/include/asm/tlb.h | 136 +++++++------------------------------------
 1 file changed, 20 insertions(+), 116 deletions(-)
diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index 717031a..72cadf5 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -19,115 +19,44 @@
 #ifndef __ASM_TLB_H
 #define __ASM_TLB_H
 
-#include <linux/pagemap.h>
-#include <linux/swap.h>
 
-#include <asm/pgalloc.h>
-#include <asm/tlbflush.h>
-
-#define MMU_GATHER_BUNDLE	8
-
-/*
- * TLB handling.  This allows us to remove pages from the page
- * tables, and efficiently handle the TLB issues.
- */
-struct mmu_gather {
-	struct mm_struct	*mm;
-	unsigned int		fullmm;
-	struct vm_area_struct	*vma;
-	unsigned long		start, end;
-	unsigned long		range_start;
-	unsigned long		range_end;
-	unsigned int		nr;
-	unsigned int		max;
-	struct page		**pages;
-	struct page		*local[MMU_GATHER_BUNDLE];
-};
+#include <asm-generic/tlb.h>
 
 /*
- * This is unnecessarily complex.  There's three ways the TLB shootdown
- * code is used:
+ * There's three ways the TLB shootdown code is used:
  *  1. Unmapping a range of vmas.  See zap_page_range(), unmap_region().
  *     tlb->fullmm = 0, and tlb_start_vma/tlb_end_vma will be called.
- *     tlb->vma will be non-NULL.
  *  2. Unmapping all vmas.  See exit_mmap().
  *     tlb->fullmm = 1, and tlb_start_vma/tlb_end_vma will be called.
- *     tlb->vma will be non-NULL.  Additionally, page tables will be freed.
+ *     Page tables will be freed.
  *  3. Unmapping argument pages.  See shift_arg_pages().
  *     tlb->fullmm = 0, but tlb_start_vma/tlb_end_vma will not be called.
- *     tlb->vma will be NULL.
  */
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
-	if (tlb->fullmm || !tlb->vma)
+	if (tlb->fullmm) {
 		flush_tlb_mm(tlb->mm);
-	else if (tlb->range_end > 0) {
-		flush_tlb_range(tlb->vma, tlb->range_start, tlb->range_end);
-		tlb->range_start = TASK_SIZE;
-		tlb->range_end = 0;
+	} else if (tlb->end > 0) {
+		struct vm_area_struct vma = { .vm_mm = tlb->mm, };
+		flush_tlb_range(&vma, tlb->start, tlb->end);
+		tlb->start = TASK_SIZE;
+		tlb->end = 0;
 	}
 }
 
 static inline void tlb_add_flush(struct mmu_gather *tlb, unsigned long addr)
 {
 	if (!tlb->fullmm) {
-		if (addr < tlb->range_start)
-			tlb->range_start = addr;
-		if (addr + PAGE_SIZE > tlb->range_end)
-			tlb->range_end = addr + PAGE_SIZE;
-	}
-}
-
-static inline void __tlb_alloc_page(struct mmu_gather *tlb)
-{
-	unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
-
-	if (addr) {
-		tlb->pages = (void *)addr;
-		tlb->max = PAGE_SIZE / sizeof(struct page *);
+		tlb->start = min(tlb->start, addr);
+		tlb->end = max(tlb->end, addr + PAGE_SIZE);
 	}
 }
 
-static inline void tlb_flush_mmu(struct mmu_gather *tlb)
-{
-	tlb_flush(tlb);
-	free_pages_and_swap_cache(tlb->pages, tlb->nr);
-	tlb->nr = 0;
-	if (tlb->pages == tlb->local)
-		__tlb_alloc_page(tlb);
-}
-
-static inline void
-tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start, unsigned long end)
-{
-	tlb->mm = mm;
-	tlb->fullmm = !(start | (end+1));
-	tlb->start = start;
-	tlb->end = end;
-	tlb->vma = NULL;
-	tlb->max = ARRAY_SIZE(tlb->local);
-	tlb->pages = tlb->local;
-	tlb->nr = 0;
-	__tlb_alloc_page(tlb);
-}
-
-static inline void
-tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
-{
-	tlb_flush_mmu(tlb);
-
-	/* keep the page table cache within bounds */
-	check_pgt_cache();
-
-	if (tlb->pages != tlb->local)
-		free_pages((unsigned long)tlb->pages, 0);
-}
-
 /*
  * Memorize the range for the TLB flush.
  */
-static inline void
-tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep, unsigned long addr)
+static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
+					  unsigned long addr)
 {
 	tlb_add_flush(tlb, addr);
 }
@@ -137,38 +66,24 @@ tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep, unsigned long addr)
  * case where we're doing a full MM flush.  When we're doing a munmap,
  * the vmas are adjusted to only cover the region to be torn down.
  */
-static inline void
-tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
+static inline void tlb_start_vma(struct mmu_gather *tlb,
+				 struct vm_area_struct *vma)
 {
 	if (!tlb->fullmm) {
-		tlb->vma = vma;
-		tlb->range_start = TASK_SIZE;
-		tlb->range_end = 0;
+		tlb->start = TASK_SIZE;
+		tlb->end = 0;
 	}
 }
 
-static inline void
-tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
+static inline void tlb_end_vma(struct mmu_gather *tlb,
+			       struct vm_area_struct *vma)
 {
 	if (!tlb->fullmm)
 		tlb_flush(tlb);
 }
 
-static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
-{
-	tlb->pages[tlb->nr++] = page;
-	VM_BUG_ON(tlb->nr > tlb->max);
-	return tlb->max - tlb->nr;
-}
-
-static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
-{
-	if (!__tlb_remove_page(tlb, page))
-		tlb_flush_mmu(tlb);
-}
-
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
-	unsigned long addr)
+				  unsigned long addr)
 {
 	pgtable_page_dtor(pte);
 	tlb_add_flush(tlb, addr);
@@ -184,16 +99,5 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
 }
 #endif
 
-#define pte_free_tlb(tlb, ptep, addr)	__pte_free_tlb(tlb, ptep, addr)
-#define pmd_free_tlb(tlb, pmdp, addr)	__pmd_free_tlb(tlb, pmdp, addr)
-#define pud_free_tlb(tlb, pudp, addr)	pud_free((tlb)->mm, pudp)
-
-#define tlb_migrate_finish(mm)		do { } while (0)
-
-static inline void
-tlb_remove_pmd_tlb_entry(struct mmu_gather *tlb, pmd_t *pmdp, unsigned long addr)
-{
-	tlb_add_flush(tlb, addr);
-}
 
 #endif
-- 
1.8.1.4
* [RFC PATCH V4 6/7] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
  2014-03-28 15:01 [RFC PATCH V4 0/7] get_user_pages_fast for ARM and ARM64 Steve Capper
                   ` (4 preceding siblings ...)
  2014-03-28 15:01 ` [RFC PATCH V4 5/7] arm64: Convert asm/tlb.h to generic mmu_gather Steve Capper
@ 2014-03-28 15:01 ` Steve Capper
  2014-04-30 15:20   ` Catalin Marinas
  2014-03-28 15:01 ` [RFC PATCH V4 7/7] arm64: mm: Enable RCU fast_gup Steve Capper
  6 siblings, 1 reply; 19+ messages in thread
From: Steve Capper @ 2014-03-28 15:01 UTC
  To: linux-arm-kernel
In order to implement get_user_pages_fast we need to ensure that the
page table walker is protected from page table pages being freed from
under it.
This patch enables HAVE_RCU_TABLE_FREE. Any page table pages belonging
to address spaces with multiple users will be freed via call_rcu_sched,
meaning that disabling interrupts will block the free and protect the
fast_gup page walker.
Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
 arch/arm64/Kconfig           | 1 +
 arch/arm64/include/asm/tlb.h | 8 ++++++++
 2 files changed, 9 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 27bbcfc..6185f95 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -38,6 +38,7 @@ config ARM64
 	select HAVE_MEMBLOCK
 	select HAVE_PATA_PLATFORM
 	select HAVE_PERF_EVENTS
+	select HAVE_RCU_TABLE_FREE
 	select IRQ_DOMAIN
 	select MODULES_USE_ELF_RELA
 	select NO_BOOTMEM
diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index 72cadf5..58a8b78 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -22,6 +22,14 @@
 
 #include <asm-generic/tlb.h>
 
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+
+static inline void __tlb_remove_table(void *_table)
+{
+	free_page_and_swap_cache((struct page *)_table);
+}
+
 /*
  * There's three ways the TLB shootdown code is used:
  *  1. Unmapping a range of vmas.  See zap_page_range(), unmap_region().
-- 
1.8.1.4
* [RFC PATCH V4 7/7] arm64: mm: Enable RCU fast_gup
  2014-03-28 15:01 [RFC PATCH V4 0/7] get_user_pages_fast for ARM and ARM64 Steve Capper
                   ` (5 preceding siblings ...)
  2014-03-28 15:01 ` [RFC PATCH V4 6/7] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
@ 2014-03-28 15:01 ` Steve Capper
  6 siblings, 0 replies; 19+ messages in thread
From: Steve Capper @ 2014-03-28 15:01 UTC (permalink / raw)
  To: linux-arm-kernel
Activate the RCU fast_gup for ARM64. We also need to force THP splits
to broadcast an IPI so that a split waits for any fast_gup page walker
running with interrupts disabled. As THP splits are comparatively rare,
this should not lead to a noticeable performance degradation.
Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
 arch/arm64/Kconfig               |  3 +++
 arch/arm64/include/asm/pgtable.h |  4 ++++
 arch/arm64/mm/flush.c            | 19 +++++++++++++++++++
 3 files changed, 26 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6185f95..9f5a81a 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -86,6 +86,9 @@ config GENERIC_CSUM
 config GENERIC_CALIBRATE_DELAY
 	def_bool y
 
+config HAVE_RCU_GUP
+	def_bool y
+
 config ZONE_DMA32
 	def_bool y
 
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index aa3917c..0e148ae 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -245,6 +245,10 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define pmd_trans_huge(pmd)	(pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT))
 #define pmd_trans_splitting(pmd) (pmd_val(pmd) & PMD_SECT_SPLITTING)
+#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+struct vm_area_struct;
+void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
+			  pmd_t *pmdp);
 #endif
 
 #define PMD_BIT_FUNC(fn,op) \
diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index e4193e3..ddf96c1 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -103,3 +103,22 @@ EXPORT_SYMBOL(flush_dcache_page);
  */
 EXPORT_SYMBOL(flush_cache_all);
 EXPORT_SYMBOL(flush_icache_range);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+static void thp_splitting_flush_sync(void *arg)
+{
+}
+
+void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
+			  pmd_t *pmdp)
+{
+	pmd_t pmd = pmd_mksplitting(*pmdp);
+	VM_BUG_ON(address & ~PMD_MASK);
+	set_pmd_at(vma->vm_mm, address, pmdp, pmd);
+
+	/* dummy IPI to serialise against fast_gup */
+	smp_call_function(thp_splitting_flush_sync, NULL, 1);
+}
+#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
1.8.1.4
* [RFC PATCH V4 6/7] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
  2014-03-28 15:01 ` [RFC PATCH V4 6/7] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
@ 2014-04-30 15:20   ` Catalin Marinas
  2014-04-30 15:33     ` Catalin Marinas
  0 siblings, 1 reply; 19+ messages in thread
From: Catalin Marinas @ 2014-04-30 15:20 UTC (permalink / raw)
  To: linux-arm-kernel
On Fri, Mar 28, 2014 at 03:01:31PM +0000, Steve Capper wrote:
> In order to implement fast_get_user_pages we need to ensure that the
> page table walker is protected from page table pages being freed from
> under it.
> 
> This patch enables HAVE_RCU_TABLE_FREE, any page table pages belonging
> to address spaces with multiple users will be call_rcu_sched freed.
> Meaning that disabling interrupts will block the free and protect the
> fast gup page walker.
> 
> Signed-off-by: Steve Capper <steve.capper@linaro.org>
While this patch is simple, I'd like to better understand the reason for
it. Currently HAVE_RCU_TABLE_FREE is enabled for powerpc and sparc while
__get_user_pages_fast() is supported by a few other architectures that
don't select HAVE_RCU_TABLE_FREE. So why do we need it for fast gup on
arm/arm64 while not all the other archs need it?
Thanks.
-- 
Catalin
* [RFC PATCH V4 6/7] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
  2014-04-30 15:20   ` Catalin Marinas
@ 2014-04-30 15:33     ` Catalin Marinas
  2014-04-30 15:38       ` Steve Capper
  0 siblings, 1 reply; 19+ messages in thread
From: Catalin Marinas @ 2014-04-30 15:33 UTC (permalink / raw)
  To: linux-arm-kernel
On Wed, Apr 30, 2014 at 04:20:47PM +0100, Catalin Marinas wrote:
> On Fri, Mar 28, 2014 at 03:01:31PM +0000, Steve Capper wrote:
> > In order to implement fast_get_user_pages we need to ensure that the
> > page table walker is protected from page table pages being freed from
> > under it.
> > 
> > This patch enables HAVE_RCU_TABLE_FREE, any page table pages belonging
> > to address spaces with multiple users will be call_rcu_sched freed.
> > Meaning that disabling interrupts will block the free and protect the
> > fast gup page walker.
> > 
> > Signed-off-by: Steve Capper <steve.capper@linaro.org>
> 
> While this patch is simple, I'd like to better understand the reason for
> it. Currently HAVE_RCU_TABLE_FREE is enabled for powerpc and sparc while
> __get_user_pages_fast() is supported by a few other architectures that
> don't select HAVE_RCU_TABLE_FREE. So why do we need it for fast gup on
> arm/arm64 while not all the other archs need it?
OK, replying to myself. I assume the other architectures that don't need
HAVE_RCU_TABLE_FREE use IPI for TLB shootdown, hence they get gup_fast
synchronisation for free.
-- 
Catalin
* [RFC PATCH V4 6/7] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
  2014-04-30 15:33     ` Catalin Marinas
@ 2014-04-30 15:38       ` Steve Capper
  2014-04-30 17:21         ` Catalin Marinas
  0 siblings, 1 reply; 19+ messages in thread
From: Steve Capper @ 2014-04-30 15:38 UTC (permalink / raw)
  To: linux-arm-kernel
On Wed, Apr 30, 2014 at 04:33:17PM +0100, Catalin Marinas wrote:
> On Wed, Apr 30, 2014 at 04:20:47PM +0100, Catalin Marinas wrote:
> > On Fri, Mar 28, 2014 at 03:01:31PM +0000, Steve Capper wrote:
> > > In order to implement fast_get_user_pages we need to ensure that the
> > > page table walker is protected from page table pages being freed from
> > > under it.
> > > 
> > > This patch enables HAVE_RCU_TABLE_FREE, any page table pages belonging
> > > to address spaces with multiple users will be call_rcu_sched freed.
> > > Meaning that disabling interrupts will block the free and protect the
> > > fast gup page walker.
> > > 
> > > Signed-off-by: Steve Capper <steve.capper@linaro.org>
> > 
> > While this patch is simple, I'd like to better understand the reason for
> > it. Currently HAVE_RCU_TABLE_FREE is enabled for powerpc and sparc while
> > __get_user_pages_fast() is supported by a few other architectures that
> > don't select HAVE_RCU_TABLE_FREE. So why do we need it for fast gup on
> > arm/arm64 while not all the other archs need it?
> 
> OK, replying to myself. I assume the other architectures that don't need
> HAVE_RCU_TABLE_FREE use IPI for TLB shootdown, hence they get gup_fast
> synchronisation for free.
Hi Catalin,
Yes that is roughly the case.
Essentially we want to RCU free the page table backing pages at a
later time when we aren't walking on them.
Other arches use IPI, some others have their own RCU logic. I opted to
activate some existing logic to reduce code duplication.
Cheers,
-- 
Steve
> 
> -- 
> Catalin
* [RFC PATCH V4 6/7] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
  2014-04-30 15:38       ` Steve Capper
@ 2014-04-30 17:21         ` Catalin Marinas
  2014-05-01  7:34           ` Steve Capper
  0 siblings, 1 reply; 19+ messages in thread
From: Catalin Marinas @ 2014-04-30 17:21 UTC (permalink / raw)
  To: linux-arm-kernel
On Wed, Apr 30, 2014 at 04:38:25PM +0100, Steve Capper wrote:
> On Wed, Apr 30, 2014 at 04:33:17PM +0100, Catalin Marinas wrote:
> > On Wed, Apr 30, 2014 at 04:20:47PM +0100, Catalin Marinas wrote:
> > > On Fri, Mar 28, 2014 at 03:01:31PM +0000, Steve Capper wrote:
> > > > In order to implement fast_get_user_pages we need to ensure that the
> > > > page table walker is protected from page table pages being freed from
> > > > under it.
> > > > 
> > > > This patch enables HAVE_RCU_TABLE_FREE, any page table pages belonging
> > > > to address spaces with multiple users will be call_rcu_sched freed.
> > > > Meaning that disabling interrupts will block the free and protect the
> > > > fast gup page walker.
> > > > 
> > > > Signed-off-by: Steve Capper <steve.capper@linaro.org>
> > > 
> > > While this patch is simple, I'd like to better understand the reason for
> > > it. Currently HAVE_RCU_TABLE_FREE is enabled for powerpc and sparc while
> > > __get_user_pages_fast() is supported by a few other architectures that
> > > don't select HAVE_RCU_TABLE_FREE. So why do we need it for fast gup on
> > > arm/arm64 while not all the other archs need it?
> > 
> > OK, replying to myself. I assume the other architectures that don't need
> > HAVE_RCU_TABLE_FREE use IPI for TLB shootdown, hence they get gup_fast
> > synchronisation for free.
> 
> Yes that is roughly the case.
> Essentially we want to RCU free the page table backing pages at a
> later time when we aren't walking on them.
> 
> Other arches use IPI, some others have their own RCU logic. I opted to
> activate some existing logic to reduce code duplication.
Both powerpc and sparc use tlb_remove_table() via their __pte_free_tlb()
etc. which implies an IPI for synchronisation if mm_users > 1. For
gup_fast we may not need it since we use the RCU for protection. Am I
missing anything?
-- 
Catalin
* [RFC PATCH V4 6/7] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
  2014-04-30 17:21         ` Catalin Marinas
@ 2014-05-01  7:34           ` Steve Capper
  2014-05-01  9:52             ` Catalin Marinas
  0 siblings, 1 reply; 19+ messages in thread
From: Steve Capper @ 2014-05-01  7:34 UTC (permalink / raw)
  To: linux-arm-kernel
On Wed, Apr 30, 2014 at 06:21:14PM +0100, Catalin Marinas wrote:
> On Wed, Apr 30, 2014 at 04:38:25PM +0100, Steve Capper wrote:
> > On Wed, Apr 30, 2014 at 04:33:17PM +0100, Catalin Marinas wrote:
> > > On Wed, Apr 30, 2014 at 04:20:47PM +0100, Catalin Marinas wrote:
> > > > On Fri, Mar 28, 2014 at 03:01:31PM +0000, Steve Capper wrote:
> > > > > In order to implement fast_get_user_pages we need to ensure that the
> > > > > page table walker is protected from page table pages being freed from
> > > > > under it.
> > > > > 
> > > > > This patch enables HAVE_RCU_TABLE_FREE, any page table pages belonging
> > > > > to address spaces with multiple users will be call_rcu_sched freed.
> > > > > Meaning that disabling interrupts will block the free and protect the
> > > > > fast gup page walker.
> > > > > 
> > > > > Signed-off-by: Steve Capper <steve.capper@linaro.org>
> > > > 
> > > > While this patch is simple, I'd like to better understand the reason for
> > > > it. Currently HAVE_RCU_TABLE_FREE is enabled for powerpc and sparc while
> > > > __get_user_pages_fast() is supported by a few other architectures that
> > > > don't select HAVE_RCU_TABLE_FREE. So why do we need it for fast gup on
> > > > arm/arm64 while not all the other archs need it?
> > > 
> > > OK, replying to myself. I assume the other architectures that don't need
> > > HAVE_RCU_TABLE_FREE use IPI for TLB shootdown, hence they get gup_fast
> > > synchronisation for free.
> > 
> > Yes that is roughly the case.
> > Essentially we want to RCU free the page table backing pages at a
> > later time when we aren't walking on them.
> > 
> > Other arches use IPI, some others have their own RCU logic. I opted to
> > activate some existing logic to reduce code duplication.
> 
> Both powerpc and sparc use tlb_remove_table() via their __pte_free_tlb()
> etc. which implies an IPI for synchronisation if mm_users > 1. For
> gup_fast we may not need it since we use the RCU for protection. Am I
> missing anything?
So my understanding is:
tlb_remove_table will just immediately free any pages where there's a
single user as there's no need to consider a gup walking.
For the case of multiple users we have an mmu_table_batch structure
that holds references to pages that should be freed at a later point.
This batch is contained in a page that is allocated on the fly. If, for
any reason, we can't allocate the batch container we fall back to a slow
path which is to issue an IPI (via tlb_remove_table_one). This IPI will
block on the gup walker. We need this fallback behaviour on ARM/ARM64.
Most of the time we will be able to allocate the batch container, and
we will populate it with references to page-table pages that are freed
via an RCU-sched delayed callback to tlb_remove_table_rcu.
In the fast_gup walker, we block tlb_remove_table_rcu from running by
disabling interrupts in the critical path. Technically we could issue
a call to rcu_read_lock_sched instead to block tlb_remove_table_rcu,
but that wouldn't be sufficient to block THP splits; so we opt to
disable interrupts to block both THP and tlb_remove_table_rcu.
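As a rough illustration of the three paths described above, here is a
userspace model of the decision only. It is not the kernel code: the
enum and function names are made up for this sketch, and the two
parameters stand in for the mm_users count and for whether the batch
page could be allocated.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three outcomes described above. */
enum model_free_path {
	MODEL_FREE_IMMEDIATE,	/* single user: no concurrent walker possible */
	MODEL_FREE_RCU_BATCH,	/* queued in the batch, freed after a grace period */
	MODEL_FREE_IPI_FALLBACK	/* no batch page: synchronise with an IPI, then free */
};

/*
 * Userspace model of the decision only. mm_users and batch_alloc_ok are
 * assumptions standing in for the mm's user count and for the result of
 * the on-the-fly batch page allocation.
 */
enum model_free_path model_table_free_path(int mm_users, bool batch_alloc_ok)
{
	if (mm_users < 2)
		return MODEL_FREE_IMMEDIATE;
	if (!batch_alloc_ok)
		return MODEL_FREE_IPI_FALLBACK;
	return MODEL_FREE_RCU_BATCH;
}
```

The model leaves out batching limits and the actual flush; it is only
meant to show that the IPI path is taken solely when there are multiple
users and the batch container cannot be allocated.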
Cheers,
-- 
Steve
> 
> -- 
> Catalin
* [RFC PATCH V4 6/7] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
  2014-05-01  7:34           ` Steve Capper
@ 2014-05-01  9:52             ` Catalin Marinas
  2014-05-01  9:57               ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Catalin Marinas @ 2014-05-01  9:52 UTC (permalink / raw)
  To: linux-arm-kernel
On Thu, May 01, 2014 at 08:34:03AM +0100, Steve Capper wrote:
> On Wed, Apr 30, 2014 at 06:21:14PM +0100, Catalin Marinas wrote:
> > Both powerpc and sparc use tlb_remove_table() via their __pte_free_tlb()
> > etc. which implies an IPI for synchronisation if mm_users > 1. For
> > gup_fast we may not need it since we use the RCU for protection. Am I
> > missing anything?
> 
> So my understanding is:
> 
> tlb_remove_table will just immediately free any pages where there's a
> single user as there's no need to consider a gup walking.
Does gup_fast walking increment the mm_users? Or is it a requirement of
the calling code? I can't seem to find where this happens.
> For the case of multiple users we have an mmu_table_batch structure
> that holds references to pages that should be freed at a later point.
Yes.
> This batch is contained on a page that is allocated on the fly. If, for
> any reason, we can't allocate the batch container we fall back to a slow
> path which is to issue an IPI (via tlb_remove_table_one). This IPI will
> block on the gup walker. We need this fallback behaviour on ARM/ARM64.
That's my main point: this batch page allocation on the fly for table
pages happens in tlb_remove_table(). With your patch for arm64
HAVE_RCU_TABLE_FREE, I can comment out tlb_remove_table() and it
compiles just fine because you don't call it from functions like
__pte_free_tlb() (as powerpc and sparc do). The __tlb_remove_page() that
we currently use doesn't give us any RCU protection here.
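To make the point concrete, here is a small userspace model of an
arch-level table-free hook that routes through the table path rather
than the page path. Everything in it is a stand-in invented for this
sketch (the struct only counts calls, and the boolean replaces what
would be a compile-time CONFIG_HAVE_RCU_TABLE_FREE check); it is not
taken from any of the patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct mmu_gather: it only counts calls. */
struct model_gather {
	int table_frees;	/* requests routed through the RCU table path */
	int page_frees;		/* requests routed through the ordinary page path */
};

/* Stand-in for tlb_remove_table(): the path that defers the free. */
void model_remove_table(struct model_gather *tlb, void *table)
{
	(void)table;
	tlb->table_frees++;
}

/* Stand-in for the plain page path: no RCU protection for walkers. */
void model_remove_page(struct model_gather *tlb, void *page)
{
	(void)page;
	tlb->page_frees++;
}

/*
 * Model of an arch __pte_free_tlb() hook. rcu_table_free is an assumption
 * standing in for CONFIG_HAVE_RCU_TABLE_FREE, which would be an #ifdef in
 * real code.
 */
void model_pte_free_tlb(struct model_gather *tlb, void *pte_page,
			bool rcu_table_free)
{
	if (rcu_table_free)
		model_remove_table(tlb, pte_page);
	else
		model_remove_page(tlb, pte_page);
}
```

With the option selected but the hook still calling the page path, the
table counter in this model would never move, which mirrors the
observation that the table-removal function can be commented out without
breaking the build.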
-- 
Catalin
* [RFC PATCH V4 6/7] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
  2014-05-01  9:52             ` Catalin Marinas
@ 2014-05-01  9:57               ` Peter Zijlstra
  2014-05-01 10:04                 ` Catalin Marinas
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-05-01  9:57 UTC (permalink / raw)
  To: linux-arm-kernel
On Thu, May 01, 2014 at 10:52:47AM +0100, Catalin Marinas wrote:
> Does gup_fast walking increment the mm_users? Or is it a requirement of
> the calling code? I can't seem to find where this happens.
No, it's not required at all. One should only walk current->mm with
gup_fast, any other usage is broken.
And by delaying TLB shootdown, either through disabling IRQs and
stalling IPIs or by using RCU freeing, you're guaranteed your own page
tables won't disappear underneath your feet.
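A toy single-threaded model of that guarantee is sketched below. It only
captures the ordering: a free requested while the walker has interrupts
disabled cannot complete until they are re-enabled. All names are
hypothetical, and real IPIs, RCU grace periods and multiple CPUs are
collapsed into one pending flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-CPU state; every name here is hypothetical. */
struct model_cpu {
	bool irqs_enabled;
	bool free_pending;	/* a table free is waiting on this CPU */
	bool table_freed;
};

/* The deferred free can only make progress once interrupts are enabled. */
static void model_run_pending(struct model_cpu *cpu)
{
	if (cpu->irqs_enabled && cpu->free_pending) {
		cpu->free_pending = false;
		cpu->table_freed = true;
	}
}

void model_irq_disable(struct model_cpu *cpu)
{
	cpu->irqs_enabled = false;
}

void model_irq_enable(struct model_cpu *cpu)
{
	cpu->irqs_enabled = true;
	model_run_pending(cpu);
}

/* Another CPU asks for the table to be freed (IPI or RCU callback). */
void model_request_free(struct model_cpu *cpu)
{
	cpu->free_pending = true;
	model_run_pending(cpu);
}

/*
 * Walker: returns true if the table was still live for the whole walk,
 * even though a free was requested in the middle of it.
 */
bool model_walk(struct model_cpu *cpu)
{
	bool live;

	model_irq_disable(cpu);
	model_request_free(cpu);	/* concurrent teardown arrives mid-walk */
	live = !cpu->table_freed;	/* table is still safe to dereference */
	model_irq_enable(cpu);		/* the free completes only now */
	return live;
}
```

Outside a walk, the same request completes immediately in this model,
which is the contrast the walker relies on.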
* [RFC PATCH V4 6/7] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
  2014-05-01  9:57               ` Peter Zijlstra
@ 2014-05-01 10:04                 ` Catalin Marinas
  2014-05-01 10:15                   ` Steve Capper
  0 siblings, 1 reply; 19+ messages in thread
From: Catalin Marinas @ 2014-05-01 10:04 UTC (permalink / raw)
  To: linux-arm-kernel
On Thu, May 01, 2014 at 10:57:39AM +0100, Peter Zijlstra wrote:
> On Thu, May 01, 2014 at 10:52:47AM +0100, Catalin Marinas wrote:
> > Does gup_fast walking increment the mm_users? Or is it a requirement of
> > the calling code? I can't seem to find where this happens.
> 
> No, it's not required at all. One should only walk current->mm with
> gup_fast, any other usage is broken.
OK, I get it now.
> And by delaying TLB shootdown, either through disabling IRQs and
> stalling IPIs or by using RCU freeing, you're guaranteed your own page
> tables won't disappear underneath your feet.
And for RCU to work, we still need to use the full tlb_remove_table()
logic (Steve's patches just use tlb_remove_page() for table freeing).
Thanks.
-- 
Catalin
* [RFC PATCH V4 6/7] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic
  2014-05-01 10:04                 ` Catalin Marinas
@ 2014-05-01 10:15                   ` Steve Capper
  0 siblings, 0 replies; 19+ messages in thread
From: Steve Capper @ 2014-05-01 10:15 UTC (permalink / raw)
  To: linux-arm-kernel
On Thu, May 01, 2014 at 11:04:15AM +0100, Catalin Marinas wrote:
> On Thu, May 01, 2014 at 10:57:39AM +0100, Peter Zijlstra wrote:
> > On Thu, May 01, 2014 at 10:52:47AM +0100, Catalin Marinas wrote:
> > > Does gup_fast walking increment the mm_users? Or is it a requirement of
> > > the calling code? I can't seem to find where this happens.
> > 
> > No, it's not required at all. One should only walk current->mm with
> > gup_fast, any other usage is broken.
> 
> OK, I get it now.
> 
> > And by delaying TLB shootdown, either through disabling IRQs and
> > stalling IPIs or by using RCU freeing, you're guaranteed your own page
> > tables won't disappear underneath your feet.
> 
> And for RCU to work, we still need to use the full tlb_remove_table()
> logic (Steve's patches just use tlb_remove_page() for table freeing).
Yes, I see.
This is a bug in the arm64 patch (arm correctly calls tlb_remove_table),
I think it got eaten during a rebase.
I will fix the arm64 activation logic.
Apologies for the confusion,
-- 
Steve
> 
> Thanks.
> 
> -- 
> Catalin
> 
* [RFC PATCH V4 3/7] arm: mm: Enable HAVE_RCU_TABLE_FREE logic
  2014-03-28 15:01 ` [RFC PATCH V4 3/7] arm: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
@ 2014-05-01 11:11   ` Catalin Marinas
  2014-05-01 11:44     ` Steve Capper
  0 siblings, 1 reply; 19+ messages in thread
From: Catalin Marinas @ 2014-05-01 11:11 UTC (permalink / raw)
  To: linux-arm-kernel
On Fri, Mar 28, 2014 at 03:01:28PM +0000, Steve Capper wrote:
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 1594945..7d5340d 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -58,6 +58,7 @@ config ARM
>  	select HAVE_PERF_EVENTS
>  	select HAVE_PERF_REGS
>  	select HAVE_PERF_USER_STACK_DUMP
> +	select HAVE_RCU_TABLE_FREE if SMP
You could select this if (SMP && CPU_V7). On ARMv6 SMP systems we use IPI for
TLB maintenance already.
-- 
Catalin
* [RFC PATCH V4 3/7] arm: mm: Enable HAVE_RCU_TABLE_FREE logic
  2014-05-01 11:11   ` Catalin Marinas
@ 2014-05-01 11:44     ` Steve Capper
  0 siblings, 0 replies; 19+ messages in thread
From: Steve Capper @ 2014-05-01 11:44 UTC (permalink / raw)
  To: linux-arm-kernel
On Thu, May 01, 2014 at 12:11:21PM +0100, Catalin Marinas wrote:
> On Fri, Mar 28, 2014 at 03:01:28PM +0000, Steve Capper wrote:
> > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> > index 1594945..7d5340d 100644
> > --- a/arch/arm/Kconfig
> > +++ b/arch/arm/Kconfig
> > @@ -58,6 +58,7 @@ config ARM
> >  	select HAVE_PERF_EVENTS
> >  	select HAVE_PERF_REGS
> >  	select HAVE_PERF_USER_STACK_DUMP
> > +	select HAVE_RCU_TABLE_FREE if SMP
> 
> You could select this if (SMP && CPU_V7). On ARMv6 SMP systems we use IPI for
> TLB maintenance already.
Thanks, I'll add that to the next series.
-- 
Steve
> 
> -- 
> Catalin
Thread overview: 19+ messages
2014-03-28 15:01 [RFC PATCH V4 0/7] get_user_pages_fast for ARM and ARM64 Steve Capper
2014-03-28 15:01 ` [RFC PATCH V4 1/7] mm: Introduce a general RCU get_user_pages_fast Steve Capper
2014-03-28 15:01 ` [RFC PATCH V4 2/7] arm: mm: Introduce special ptes for LPAE Steve Capper
2014-03-28 15:01 ` [RFC PATCH V4 3/7] arm: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
2014-05-01 11:11   ` Catalin Marinas
2014-05-01 11:44     ` Steve Capper
2014-03-28 15:01 ` [RFC PATCH V4 4/7] arm: mm: Enable RCU fast_gup Steve Capper
2014-03-28 15:01 ` [RFC PATCH V4 5/7] arm64: Convert asm/tlb.h to generic mmu_gather Steve Capper
2014-03-28 15:01 ` [RFC PATCH V4 6/7] arm64: mm: Enable HAVE_RCU_TABLE_FREE logic Steve Capper
2014-04-30 15:20   ` Catalin Marinas
2014-04-30 15:33     ` Catalin Marinas
2014-04-30 15:38       ` Steve Capper
2014-04-30 17:21         ` Catalin Marinas
2014-05-01  7:34           ` Steve Capper
2014-05-01  9:52             ` Catalin Marinas
2014-05-01  9:57               ` Peter Zijlstra
2014-05-01 10:04                 ` Catalin Marinas
2014-05-01 10:15                   ` Steve Capper
2014-03-28 15:01 ` [RFC PATCH V4 7/7] arm64: mm: Enable RCU fast_gup Steve Capper