- * [PATCH 00/17] mm: mmu_gather rework
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang
Rework the existing mmu_gather infrastructure.

The direct purpose of these patches was to allow preemptible mmu_gather,
but even without that I think these patches provide an improvement to the
status quo.

The first patch is a fix to the tile architecture; the subsequent 9 patches
rework the mmu_gather infrastructure. For review purposes I've split them
into generic and per-arch patches, with the last of those a generic cleanup.
For the final commit I would provide a roll-up of these patches so as not
to wreck bisectability of non-generic archs.

The next patch provides generic RCU page-table freeing, and the follow-up
is a patch converting s390 to use this. I've also got 4 patches from
DaveM lined up (not included in this series) that use this to implement
gup_fast() for sparc64.

Then there is one patch that extends the generic mmu_gather batching.

Finally there are 4 patches that convert various architectures over
to asm-generic/tlb.h; these are compile-tested only and basically RFC.

After this only um and s390 are left -- um should be straightforward,
s390 wants a bit more, but more on that in another email.
- * [PATCH 01/17] tile: Fix __pte_free_tlb
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Chris Metcalf
[-- Attachment #1: tile-fix-pte_free_tlb.patch --]
[-- Type: text/plain, Size: 1420 bytes --]
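Tile's __pte_free_tlb() implementation makes assumptions about the
generic mmu_gather implementation: it open-codes the batching by poking
at tlb->need_flush, tlb->pages[] and tlb->nr directly. Cure this by
using tlb_remove_page() instead ;-)
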
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/tile/mm/pgtable.c |   15 ++-------------
 1 file changed, 2 insertions(+), 13 deletions(-)
Index: linux-2.6/arch/tile/mm/pgtable.c
===================================================================
--- linux-2.6.orig/arch/tile/mm/pgtable.c
+++ linux-2.6/arch/tile/mm/pgtable.c
@@ -252,19 +252,8 @@ void __pte_free_tlb(struct mmu_gather *t
 	int i;
 
 	pgtable_page_dtor(pte);
-	tlb->need_flush = 1;
-	if (tlb_fast_mode(tlb)) {
-		struct page *pte_pages[L2_USER_PGTABLE_PAGES];
-		for (i = 0; i < L2_USER_PGTABLE_PAGES; ++i)
-			pte_pages[i] = pte + i;
-		free_pages_and_swap_cache(pte_pages, L2_USER_PGTABLE_PAGES);
-		return;
-	}
-	for (i = 0; i < L2_USER_PGTABLE_PAGES; ++i) {
-		tlb->pages[tlb->nr++] = pte + i;
-		if (tlb->nr >= FREE_PTE_NR)
-			tlb_flush_mmu(tlb, 0, 0);
-	}
+	for (i = 0; i < L2_USER_PGTABLE_PAGES; ++i)
+		tlb_remove_page(tlb, pte + i);
 }
 
 #ifndef __tilegx__
 
- * [PATCH 02/17] mm: mmu_gather rework
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Martin Schwidefsky,
	Russell King, Paul Mundt, Jeff Dike, Tony Luck, Hugh Dickins
[-- Attachment #1: peter_zijlstra-mm-preemptible_mmu_gather.patch --]
[-- Type: text/plain, Size: 13618 bytes --]
Remove the first obstacle towards a fully preemptible mmu_gather.

The current scheme assumes mmu_gather is always done with preemption
disabled and uses per-cpu storage for the page batches. Change this to
try to allocate a page for batching and, in case of failure, use a
small on-stack array to make some progress.
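
For readers who want the allocate-or-fall-back idea in isolation, here is a
rough user-space sketch. It is not the kernel code: malloc() stands in for
the GFP_NOWAIT page allocation, BATCH_BYTES stands in for PAGE_SIZE, and
the gather_* names and the can_alloc switch are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define GATHER_BUNDLE 8     /* mirrors MMU_GATHER_BUNDLE in the patch */
#define BATCH_BYTES   4096  /* assumed stand-in for PAGE_SIZE */

struct gather {
	unsigned int nr;       /* entries currently queued */
	unsigned int max;      /* capacity of ->items */
	unsigned int flushed;  /* total entries released so far */
	int can_alloc;         /* hypothetical hook: 0 simulates allocation failure */
	void **items;          /* points at ->local or at a heap batch */
	void *local[GATHER_BUNDLE];
};

/* Try to upgrade to a big batch; on failure keep whatever we have. */
static void gather_try_alloc(struct gather *g)
{
	void **big = g->can_alloc ? malloc(BATCH_BYTES) : NULL;

	if (big) {
		g->items = big;
		g->max = BATCH_BYTES / sizeof(void *);
	}
}

static void gather_init(struct gather *g, int can_alloc)
{
	g->nr = 0;
	g->flushed = 0;
	g->can_alloc = can_alloc;
	g->max = GATHER_BUNDLE;   /* start on the small on-stack array */
	g->items = g->local;
	gather_try_alloc(g);
}

/* Release everything queued; retry the big allocation if still local. */
static void gather_flush(struct gather *g)
{
	g->flushed += g->nr;
	g->nr = 0;
	if (g->items == g->local)
		gather_try_alloc(g);
}

/* Queue one item; returns 1 when the batch is full and must be flushed. */
static int gather_add(struct gather *g, void *item)
{
	assert(g->nr < g->max);
	g->items[g->nr++] = item;
	return g->nr >= g->max;
}

static void gather_finish(struct gather *g)
{
	gather_flush(g);
	if (g->items != g->local)
		free(g->items);
	g->items = g->local;
	g->max = GATHER_BUNDLE;
}
```

With can_alloc set to 0 the sketch flushes every GATHER_BUNDLE entries,
which is the "make some progress" case; with a successful allocation the
batch grows to BATCH_BYTES / sizeof(void *) entries.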

Preemptible mmu_gather is desired in general and usable once
i_mmap_lock becomes a mutex. Doing it before the mutex conversion
saves us from having to rework the code by moving the mmu_gather
bits inside the pte_lock.

Also avoid flushing the tlb batches from under the pte lock;
this is useful even without the i_mmap_lock conversion as it
significantly reduces pte lock hold times.

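
The "flush outside the lock" pattern can be sketched in plain C as below.
This is only an illustration of the control flow used in zap_pte_range():
fake_lock()/fake_unlock() stand in for the pte lock, and struct batch,
batch_add(), batch_flush() and zap_range() are hypothetical names, not
kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_MAX 4  /* tiny on purpose so the example flushes often */

struct batch {
	unsigned int nr;                  /* entries currently queued */
	unsigned int flushes;             /* non-empty flushes performed */
	unsigned int flushes_under_lock;  /* should stay zero */
	int items[BATCH_MAX];
};

static int lock_held;  /* stand-in for the pte lock */

static void fake_lock(void)   { assert(!lock_held); lock_held = 1; }
static void fake_unlock(void) { assert(lock_held);  lock_held = 0; }

/* Queue one item; returns 1 when the batch is full. */
static int batch_add(struct batch *b, int item)
{
	b->items[b->nr++] = item;
	return b->nr >= BATCH_MAX;
}

static void batch_flush(struct batch *b)
{
	if (lock_held)
		b->flushes_under_lock++;
	if (b->nr)
		b->flushes++;
	b->nr = 0;
}

/* Process [pos, end); returns the position reached, which is end. */
static int zap_range(struct batch *b, int pos, int end)
{
	int need_flush;

	do {
		need_flush = 0;
		fake_lock();
		/* Only queue work while the lock is held. */
		while (pos != end && !need_flush)
			need_flush = batch_add(b, pos++);
		fake_unlock();

		/* The expensive flush runs with the lock already dropped. */
		if (need_flush)
			batch_flush(b);
	} while (pos != end);

	return pos;
}
```

The loop stops as soon as the batch fills, drops the lock, flushes, and
then retakes the lock to continue, so the lock is never held across a
flush. Entries left in the batch at the end are flushed later by the
caller, as tlb_finish_mmu() does in the patch.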
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Tony Luck <tony.luck@intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/exec.c                 |   10 ++---
 include/asm-generic/tlb.h |   77 ++++++++++++++++++++++++++++++++--------------
 include/linux/mm.h        |    2 -
 mm/memory.c               |   42 ++++++++++---------------
 mm/mmap.c                 |   18 +++++-----
 5 files changed, 87 insertions(+), 62 deletions(-)
Index: linux-2.6/fs/exec.c
===================================================================
--- linux-2.6.orig/fs/exec.c
+++ linux-2.6/fs/exec.c
@@ -550,7 +550,7 @@ static int shift_arg_pages(struct vm_are
 	unsigned long length = old_end - old_start;
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - shift;
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 
 	BUG_ON(new_start > new_end);
 
@@ -576,12 +576,12 @@ static int shift_arg_pages(struct vm_are
 		return -ENOMEM;
 
 	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
+	tlb_gather_mmu(&tlb, mm, 0);
 	if (new_end > old_start) {
 		/*
 		 * when the old and new regions overlap clear from new_end.
 		 */
-		free_pgd_range(tlb, new_end, old_end, new_end,
+		free_pgd_range(&tlb, new_end, old_end, new_end,
 			vma->vm_next ? vma->vm_next->vm_start : 0);
 	} else {
 		/*
@@ -590,10 +590,10 @@ static int shift_arg_pages(struct vm_are
 		 * have constraints on va-space that make this illegal (IA64) -
 		 * for the others its just a little faster.
 		 */
-		free_pgd_range(tlb, old_start, old_end, new_end,
+		free_pgd_range(&tlb, old_start, old_end, new_end,
 			vma->vm_next ? vma->vm_next->vm_start : 0);
 	}
-	tlb_finish_mmu(tlb, new_end, old_end);
+	tlb_finish_mmu(&tlb, new_end, old_end);
 
 	/*
 	 * Shrink the vma to just the new range.  Always succeeds.
Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -5,6 +5,8 @@
  * Copyright 2001 Red Hat, Inc.
  * Based on code from mm/memory.c Copyright Linus Torvalds and others.
  *
+ * Copyright 2011 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License
  * as published by the Free Software Foundation; either version
@@ -22,51 +24,69 @@
  * and page free order so much..
  */
 #ifdef CONFIG_SMP
-  #ifdef ARCH_FREE_PTR_NR
-    #define FREE_PTR_NR   ARCH_FREE_PTR_NR
-  #else
-    #define FREE_PTE_NR	506
-  #endif
   #define tlb_fast_mode(tlb) ((tlb)->nr == ~0U)
 #else
-  #define FREE_PTE_NR	1
   #define tlb_fast_mode(tlb) 1
 #endif
 
+/*
+ * If we can't allocate a page to make a big batch of page pointers
+ * to work on, then just handle a few from the on-stack structure.
+ */
+#define MMU_GATHER_BUNDLE	8
+
 /* struct mmu_gather is an opaque type used by the mm code for passing around
  * any data needed by arch specific code for tlb_remove_page.
  */
 struct mmu_gather {
 	struct mm_struct	*mm;
 	unsigned int		nr;	/* set to ~0U means fast mode */
+	unsigned int		max;	/* nr < max */
 	unsigned int		need_flush;/* Really unmapped some ptes? */
 	unsigned int		fullmm; /* non-zero means full mm flush */
-	struct page *		pages[FREE_PTE_NR];
+#ifdef HAVE_ARCH_MMU_GATHER
+	struct arch_mmu_gather	arch;
+#endif
+	struct page		**pages;
+	struct page		*local[MMU_GATHER_BUNDLE];
 };
 
-/* Users of the generic TLB shootdown code must declare this storage space. */
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
+static inline void __tlb_alloc_page(struct mmu_gather *tlb)
+{
+	unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
+
+	if (addr) {
+		tlb->pages = (void *)addr;
+		tlb->max = PAGE_SIZE / sizeof(struct page *);
+	}
+}
 
 /* tlb_gather_mmu
  *	Return a pointer to an initialized struct mmu_gather.
  */
-static inline struct mmu_gather *
-tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
+static inline void
+tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
-
 	tlb->mm = mm;
 
-	/* Use fast mode if only one CPU is online */
-	tlb->nr = num_online_cpus() > 1 ? 0U : ~0U;
+	tlb->max = ARRAY_SIZE(tlb->local);
+	tlb->pages = tlb->local;
+
+	if (num_online_cpus() > 1) {
+		tlb->nr = 0;
+		__tlb_alloc_page(tlb);
+	} else /* Use fast mode if only one CPU is online */
+		tlb->nr = ~0U;
 
 	tlb->fullmm = full_mm_flush;
 
-	return tlb;
+#ifdef HAVE_ARCH_MMU_GATHER
+	tlb->arch = ARCH_MMU_GATHER_INIT;
+#endif
 }
 
 static inline void
-tlb_flush_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
+tlb_flush_mmu(struct mmu_gather *tlb)
 {
 	if (!tlb->need_flush)
 		return;
@@ -75,6 +95,8 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
 	if (!tlb_fast_mode(tlb)) {
 		free_pages_and_swap_cache(tlb->pages, tlb->nr);
 		tlb->nr = 0;
+		if (tlb->pages == tlb->local)
+			__tlb_alloc_page(tlb);
 	}
 }
 
@@ -85,12 +107,13 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
 static inline void
 tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
-	tlb_flush_mmu(tlb, start, end);
+	tlb_flush_mmu(tlb);
 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
-	put_cpu_var(mmu_gathers);
+	if (tlb->pages != tlb->local)
+		free_pages((unsigned long)tlb->pages, 0);
 }
 
 /* tlb_remove_page
@@ -98,16 +121,24 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
  *	handling the additional races in SMP caused by other CPUs caching valid
  *	mappings in their TLBs.
  */
-static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 {
 	tlb->need_flush = 1;
 	if (tlb_fast_mode(tlb)) {
 		free_page_and_swap_cache(page);
-		return;
+		return 0;
 	}
 	tlb->pages[tlb->nr++] = page;
-	if (tlb->nr >= FREE_PTE_NR)
-		tlb_flush_mmu(tlb, 0, 0);
+	if (tlb->nr >= tlb->max)
+		return 1;
+
+	return 0;
+}
+
+static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+{
+	if (__tlb_remove_page(tlb, page))
+		tlb_flush_mmu(tlb);
 }
 
 /**
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -889,7 +889,7 @@ int zap_vma_ptes(struct vm_area_struct *
 		unsigned long size);
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
+unsigned long unmap_vmas(struct mmu_gather *tlb,
 		struct vm_area_struct *start_vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -912,12 +912,13 @@ static unsigned long zap_pte_range(struc
 				long *zap_work, struct zap_details *details)
 {
 	struct mm_struct *mm = tlb->mm;
+	int need_flush = 0;
 	pte_t *pte;
 	spinlock_t *ptl;
 	int rss[NR_MM_COUNTERS];
 
 	init_rss_vec(rss);
-
+again:
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
 	do {
@@ -974,7 +975,7 @@ static unsigned long zap_pte_range(struc
 			page_remove_rmap(page);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
-			tlb_remove_page(tlb, page);
+			need_flush = __tlb_remove_page(tlb, page);
 			continue;
 		}
 		/*
@@ -995,12 +996,20 @@ static unsigned long zap_pte_range(struc
 				print_bad_pte(vma, addr, ptent, NULL);
 		}
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
-	} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));
+	} while (pte++, addr += PAGE_SIZE,
+			(addr != end && *zap_work > 0 && !need_flush));
 
 	add_mm_rss_vec(mm, rss);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 
+	if (need_flush) {
+		need_flush = 0;
+		tlb_flush_mmu(tlb);
+		if (addr != end)
+			goto again;
+	}
+
 	return addr;
 }
 
@@ -1121,17 +1130,14 @@ static unsigned long unmap_page_range(st
  * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
  * drops the lock and schedules.
  */
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
+unsigned long unmap_vmas(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *details)
 {
 	long zap_work = ZAP_BLOCK_SIZE;
-	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
-	int tlb_start_valid = 0;
 	unsigned long start = start_addr;
 	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
-	int fullmm = (*tlbp)->fullmm;
 	struct mm_struct *mm = vma->vm_mm;
 
 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
@@ -1152,11 +1158,6 @@ unsigned long unmap_vmas(struct mmu_gath
 			untrack_pfn_vma(vma, 0, 0);
 
 		while (start != end) {
-			if (!tlb_start_valid) {
-				tlb_start = start;
-				tlb_start_valid = 1;
-			}
-
 			if (unlikely(is_vm_hugetlb_page(vma))) {
 				/*
 				 * It is undesirable to test vma->vm_file as it
@@ -1177,7 +1178,7 @@ unsigned long unmap_vmas(struct mmu_gath
 
 				start = end;
 			} else
-				start = unmap_page_range(*tlbp, vma,
+				start = unmap_page_range(tlb, vma,
 						start, end, &zap_work, details);
 
 			if (zap_work > 0) {
@@ -1185,19 +1186,13 @@ unsigned long unmap_vmas(struct mmu_gath
 				break;
 			}
 
-			tlb_finish_mmu(*tlbp, tlb_start, start);
-
 			if (need_resched() ||
 				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
-				if (i_mmap_lock) {
-					*tlbp = NULL;
+				if (i_mmap_lock)
 					goto out;
-				}
 				cond_resched();
 			}
 
-			*tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
-			tlb_start_valid = 0;
 			zap_work = ZAP_BLOCK_SIZE;
 		}
 	}
@@ -1217,16 +1212,15 @@ unsigned long zap_page_range(struct vm_a
 		unsigned long size, struct zap_details *details)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 	unsigned long end = address + size;
 	unsigned long nr_accounted = 0;
 
 	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
+	tlb_gather_mmu(&tlb, mm, 0);
 	update_hiwater_rss(mm);
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
-	if (tlb)
-		tlb_finish_mmu(tlb, address, end);
+	tlb_finish_mmu(&tlb, address, end);
 	return end;
 }
 
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1913,17 +1913,17 @@ static void unmap_region(struct mm_struc
 		unsigned long start, unsigned long end)
 {
 	struct vm_area_struct *next = prev? prev->vm_next: mm->mmap;
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 	unsigned long nr_accounted = 0;
 
 	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
+	tlb_gather_mmu(&tlb, mm, 0);
 	update_hiwater_rss(mm);
 	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
-				 next? next->vm_start: 0);
-	tlb_finish_mmu(tlb, start, end);
+	free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
+				 next ? next->vm_start : 0);
+	tlb_finish_mmu(&tlb, start, end);
 }
 
 /*
@@ -2265,7 +2265,7 @@ EXPORT_SYMBOL(do_brk);
 /* Release all mmaps. */
 void exit_mmap(struct mm_struct *mm)
 {
-	struct mmu_gather *tlb;
+	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
 	unsigned long end;
@@ -2290,14 +2290,14 @@ void exit_mmap(struct mm_struct *mm)
 
 	lru_add_drain();
 	flush_cache_mm(mm);
-	tlb = tlb_gather_mmu(mm, 1);
+	tlb_gather_mmu(&tlb, mm, 1);
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 
-	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
-	tlb_finish_mmu(tlb, 0, end);
+	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
+	tlb_finish_mmu(&tlb, 0, end);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
@@ -2290,14 +2290,14 @@ void exit_mmap(struct mm_struct *mm)
 
 	lru_add_drain();
 	flush_cache_mm(mm);
-	tlb = tlb_gather_mmu(mm, 1);
+	tlb_gather_mmu(&tlb, mm, 1);
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 
-	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
-	tlb_finish_mmu(tlb, 0, end);
+	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
+	tlb_finish_mmu(&tlb, 0, end);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * Re: [PATCH 02/17] mm: mmu_gather rework
  2011-02-17 16:23 ` [PATCH 02/17] mm: mmu_gather rework Peter Zijlstra
  2011-02-17 16:23   ` Peter Zijlstra
@ 2011-03-10 15:50   ` Mel Gorman
  2011-03-10 15:50     ` Mel Gorman
  2011-03-16 18:55     ` Peter Zijlstra
  1 sibling, 2 replies; 90+ messages in thread
From: Mel Gorman @ 2011-03-10 15:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Nick Piggin, Paul McKenney, Yanmin Zhang, Martin Schwidefsky,
	Russell King, Paul Mundt, Jeff Dike, Tony Luck, Hugh Dickins
On Thu, Feb 17, 2011 at 05:23:29PM +0100, Peter Zijlstra wrote:
> Remove the first obstacle towards a fully preemptible mmu_gather.
> 
> The current scheme assumes mmu_gather is always done with preemption
> disabled and uses per-cpu storage for the page batches. Change this to
> try and allocate a page for batching and in case of failure, use a
> small on-stack array to make some progress.
> 
> Preemptible mmu_gather is desired in general and usable once
> i_mmap_lock becomes a mutex. Doing it before the mutex conversion
> saves us from having to rework the code by moving the mmu_gather
> bits inside the pte_lock.
> 
> Also avoid flushing the tlb batches from under the pte lock,
> this is useful even without the i_mmap_lock conversion as it
> significantly reduces pte lock hold times.
> 
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: David Miller <davem@davemloft.net>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Russell King <rmk@arm.linux.org.uk>
> Cc: Paul Mundt <lethal@linux-sh.org>
> Cc: Jeff Dike <jdike@addtoit.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  fs/exec.c                 |   10 ++---
>  include/asm-generic/tlb.h |   77 ++++++++++++++++++++++++++++++++--------------
>  include/linux/mm.h        |    2 -
>  mm/memory.c               |   42 ++++++++++---------------
>  mm/mmap.c                 |   18 +++++-----
>  5 files changed, 87 insertions(+), 62 deletions(-)
> 
> Index: linux-2.6/fs/exec.c
> ===================================================================
> --- linux-2.6.orig/fs/exec.c
> +++ linux-2.6/fs/exec.c
> @@ -550,7 +550,7 @@ static int shift_arg_pages(struct vm_are
>  	unsigned long length = old_end - old_start;
>  	unsigned long new_start = old_start - shift;
>  	unsigned long new_end = old_end - shift;
> -	struct mmu_gather *tlb;
> +	struct mmu_gather tlb;
>  
>  	BUG_ON(new_start > new_end);
>  
> @@ -576,12 +576,12 @@ static int shift_arg_pages(struct vm_are
>  		return -ENOMEM;
>  
>  	lru_add_drain();
> -	tlb = tlb_gather_mmu(mm, 0);
> +	tlb_gather_mmu(&tlb, mm, 0);
>  	if (new_end > old_start) {
>  		/*
>  		 * when the old and new regions overlap clear from new_end.
>  		 */
> -		free_pgd_range(tlb, new_end, old_end, new_end,
> +		free_pgd_range(&tlb, new_end, old_end, new_end,
>  			vma->vm_next ? vma->vm_next->vm_start : 0);
>  	} else {
>  		/*
> @@ -590,10 +590,10 @@ static int shift_arg_pages(struct vm_are
>  		 * have constraints on va-space that make this illegal (IA64) -
>  		 * for the others its just a little faster.
>  		 */
> -		free_pgd_range(tlb, old_start, old_end, new_end,
> +		free_pgd_range(&tlb, old_start, old_end, new_end,
>  			vma->vm_next ? vma->vm_next->vm_start : 0);
>  	}
> -	tlb_finish_mmu(tlb, new_end, old_end);
> +	tlb_finish_mmu(&tlb, new_end, old_end);
>  
>  	/*
>  	 * Shrink the vma to just the new range.  Always succeeds.
> Index: linux-2.6/include/asm-generic/tlb.h
> ===================================================================
> --- linux-2.6.orig/include/asm-generic/tlb.h
> +++ linux-2.6/include/asm-generic/tlb.h
> @@ -5,6 +5,8 @@
>   * Copyright 2001 Red Hat, Inc.
>   * Based on code from mm/memory.c Copyright Linus Torvalds and others.
>   *
> + * Copyright 2011 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
> + *
>   * This program is free software; you can redistribute it and/or
>   * modify it under the terms of the GNU General Public License
>   * as published by the Free Software Foundation; either version
> @@ -22,51 +24,69 @@
>   * and page free order so much..
>   */
>  #ifdef CONFIG_SMP
> -  #ifdef ARCH_FREE_PTR_NR
> -    #define FREE_PTR_NR   ARCH_FREE_PTR_NR
> -  #else
> -    #define FREE_PTE_NR	506
> -  #endif
>    #define tlb_fast_mode(tlb) ((tlb)->nr == ~0U)
>  #else
> -  #define FREE_PTE_NR	1
>    #define tlb_fast_mode(tlb) 1
>  #endif
>  
> +/*
> + * If we can't allocate a page to make a big patch of page pointers
> + * to work on, then just handle a few from the on-stack structure.
> + */
> +#define MMU_GATHER_BUNDLE	8
> +
>  /* struct mmu_gather is an opaque type used by the mm code for passing around
>   * any data needed by arch specific code for tlb_remove_page.
>   */
>  struct mmu_gather {
>  	struct mm_struct	*mm;
>  	unsigned int		nr;	/* set to ~0U means fast mode */
> +	unsigned int		max;	/* nr < max */
>  	unsigned int		need_flush;/* Really unmapped some ptes? */
>  	unsigned int		fullmm; /* non-zero means full mm flush */
> -	struct page *		pages[FREE_PTE_NR];
> +#ifdef HAVE_ARCH_MMU_GATHER
> +	struct arch_mmu_gather	arch;
> +#endif
> +	struct page		**pages;
> +	struct page		*local[MMU_GATHER_BUNDLE];
>  };
>  
> -/* Users of the generic TLB shootdown code must declare this storage space. */
> -DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
> +static inline void __tlb_alloc_page(struct mmu_gather *tlb)
> +{
> +	unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
> +
> +	if (addr) {
> +		tlb->pages = (void *)addr;
> +		tlb->max = PAGE_SIZE / sizeof(struct page *);
> +	}
> +}
>  
>  /* tlb_gather_mmu
>   *	Return a pointer to an initialized struct mmu_gather.
>   */
> -static inline struct mmu_gather *
> -tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
> +static inline void
> +tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
>  {
checkpatch will bitch about line length.
> -	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
> -
>  	tlb->mm = mm;
>  
> -	/* Use fast mode if only one CPU is online */
> -	tlb->nr = num_online_cpus() > 1 ? 0U : ~0U;
> +	tlb->max = ARRAY_SIZE(tlb->local);
> +	tlb->pages = tlb->local;
> +
> +	if (num_online_cpus() > 1) {
> +		tlb->nr = 0;
> +		__tlb_alloc_page(tlb);
> +	} else /* Use fast mode if only one CPU is online */
> +		tlb->nr = ~0U;
>  
>  	tlb->fullmm = full_mm_flush;
>  
> -	return tlb;
> +#ifdef HAVE_ARCH_MMU_GATHER
> +	tlb->arch = ARCH_MMU_GATHER_INIT;
> +#endif
>  }
>  
>  static inline void
> -tlb_flush_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
> +tlb_flush_mmu(struct mmu_gather *tlb)
Removing start/end here is a harmless but unrelated cleanup. Is it
worth keeping start/end on the off-chance the information is ever
used to limit what portion of the TLB is flushed?
>  {
>  	if (!tlb->need_flush)
>  		return;
> @@ -75,6 +95,8 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
>  	if (!tlb_fast_mode(tlb)) {
>  		free_pages_and_swap_cache(tlb->pages, tlb->nr);
>  		tlb->nr = 0;
> +		if (tlb->pages == tlb->local)
> +			__tlb_alloc_page(tlb);
>  	}
That needs a comment. Something like
/*
 * If we are using the local on-stack array of pages for MMU gather,
 * try allocation again as we have recently freed pages
 */
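To illustrate the fallback-and-retry behaviour that comment would describe, here is a minimal userspace sketch. It is not the kernel code: malloc() stands in for __get_free_pages(), and struct gather, gather_try_alloc() and the gather_alloc_fails switch are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stdlib.h>

#define BUNDLE 8                          /* mirrors MMU_GATHER_BUNDLE */
#define BIG    (4096 / sizeof(void *))    /* one 4 KiB page worth of pointers (assumed size) */

struct gather {
	unsigned int nr;       /* entries currently queued */
	unsigned int max;      /* capacity of 'items' */
	void **items;          /* points at 'local' or at a heap buffer */
	void *local[BUNDLE];   /* small fallback array inside the struct */
};

/* Hypothetical switch so a caller can simulate allocation failure. */
static int gather_alloc_fails;

/* Try to upgrade to the big buffer; on failure keep what we have. */
static void gather_try_alloc(struct gather *g)
{
	void **buf = gather_alloc_fails ? NULL : malloc(BIG * sizeof(void *));

	if (buf) {
		g->items = buf;
		g->max = BIG;
	}
}

static void gather_init(struct gather *g)
{
	g->nr = 0;
	g->max = BUNDLE;
	g->items = g->local;
	gather_try_alloc(g);
}

static void gather_flush(struct gather *g)
{
	assert(g->nr <= g->max);
	/* A real flush would release items[0..nr) here. */
	g->nr = 0;
	/*
	 * Still on the small local array: try the allocation again,
	 * since the flush has just released memory.
	 */
	if (g->items == g->local)
		gather_try_alloc(g);
}

static void gather_finish(struct gather *g)
{
	gather_flush(g);
	if (g->items != g->local)
		free(g->items);
	g->items = g->local;
	g->max = BUNDLE;
}
```

The point of the sketch is only the shape: a failed allocation degrades to a small batch rather than failing outright, and every flush is another chance to upgrade.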
>  }
>  
> @@ -85,12 +107,13 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
>  static inline void
>  tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
>  {
> -	tlb_flush_mmu(tlb, start, end);
> +	tlb_flush_mmu(tlb);
>  
>  	/* keep the page table cache within bounds */
>  	check_pgt_cache();
>  
> -	put_cpu_var(mmu_gathers);
> +	if (tlb->pages != tlb->local)
> +		free_pages((unsigned long)tlb->pages, 0);
>  }
>  
>  /* tlb_remove_page
> @@ -98,16 +121,24 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
>   *	handling the additional races in SMP caused by other CPUs caching valid
>   *	mappings in their TLBs.
>   */
> -static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
> +static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
>  {
What does this return value mean?
Looking at the function, it's obvious that 1 is returned when pages[] is full
and needs to be freed, TLB flushed, etc. However, callers refer to the return
value as "need_flush", whereas this function sets tlb->need_flush, and the
two values have different meanings: retval need_flush means the array is full
and must be emptied, whereas tlb->need_flush just says there are some pages
that need to be freed.
It's a nit-pick but how about having it return the number of array slots
that are still available, like pagevec_add does? It would allow you
to get rid of the slightly different need_flush variable in mm/memory.c.
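A rough userspace sketch of the "return the remaining slots" convention, for illustration only. struct batch, batch_add(), batch_space() and queue_all() are hypothetical stand-ins, and assert() plays the role VM_BUG_ON() would play in the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_MAX 4   /* tiny capacity so the behaviour is easy to follow */

struct batch {
	unsigned int nr;
	void *items[BATCH_MAX];
};

/* Number of free slots left in the batch. */
static unsigned int batch_space(const struct batch *b)
{
	return BATCH_MAX - b->nr;
}

/*
 * Queue one item and return the slots still available afterwards.
 * A return of 0 tells the caller the batch is full and must be flushed
 * before the next add.
 */
static unsigned int batch_add(struct batch *b, void *item)
{
	assert(b->nr < BATCH_MAX);   /* overfilling would be a caller bug */
	b->items[b->nr++] = item;
	return batch_space(b);
}

static void batch_flush(struct batch *b)
{
	b->nr = 0;
}

/* Caller side: queue n items, flushing whenever no space is left. */
static unsigned int queue_all(struct batch *b, void **items, size_t n)
{
	unsigned int flushes = 0;

	for (size_t i = 0; i < n; i++) {
		if (!batch_add(b, items[i])) {
			batch_flush(b);
			flushes++;
		}
	}
	return flushes;
}
```

With this convention the caller tests the return value directly and needs no separately named flag whose meaning could be confused with tlb->need_flush.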
>  	tlb->need_flush = 1;
>  	if (tlb_fast_mode(tlb)) {
>  		free_page_and_swap_cache(page);
> -		return;
> +		return 0;
>  	}
>  	tlb->pages[tlb->nr++] = page;
> -	if (tlb->nr >= FREE_PTE_NR)
> -		tlb_flush_mmu(tlb, 0, 0);
> +	if (tlb->nr >= tlb->max)
> +		return 1;
> +
Use == and VM_BUG_ON(tlb->nr > tlb->max) ?
> +	return 0;
> +}
> +
> +static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
> +{
> +	if (__tlb_remove_page(tlb, page))
> +		tlb_flush_mmu(tlb);
>  }
>  
>  /**
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -889,7 +889,7 @@ int zap_vma_ptes(struct vm_area_struct *
>  		unsigned long size);
>  unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
>  		unsigned long size, struct zap_details *);
> -unsigned long unmap_vmas(struct mmu_gather **tlb,
> +unsigned long unmap_vmas(struct mmu_gather *tlb,
>  		struct vm_area_struct *start_vma, unsigned long start_addr,
>  		unsigned long end_addr, unsigned long *nr_accounted,
>  		struct zap_details *);
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -912,12 +912,13 @@ static unsigned long zap_pte_range(struc
>  				long *zap_work, struct zap_details *details)
>  {
>  	struct mm_struct *mm = tlb->mm;
> +	int need_flush = 0;
>  	pte_t *pte;
>  	spinlock_t *ptl;
>  	int rss[NR_MM_COUNTERS];
>  
>  	init_rss_vec(rss);
> -
> +again:
>  	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>  	arch_enter_lazy_mmu_mode();
>  	do {
> @@ -974,7 +975,7 @@ static unsigned long zap_pte_range(struc
>  			page_remove_rmap(page);
>  			if (unlikely(page_mapcount(page) < 0))
>  				print_bad_pte(vma, addr, ptent, page);
> -			tlb_remove_page(tlb, page);
> +			need_flush = __tlb_remove_page(tlb, page);
>  			continue;
So, if __tlb_remove_page() returns 1 (should be bool for true/false) the
caller is expected to call tlb_flush_mmu(). We call continue and, as a
side effect of the !need_flush test in the loop condition, break out of
the loop, unlock various bits and pieces and restart.
It'd be a hell of a lot clearer to just say
if (__tlb_remove_page(tlb, page))
	break;
and not check !need_flush on each iteration.
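One C detail is worth keeping in mind when comparing the two forms: in a do/while loop, continue jumps to the controlling expression, so the increments written there still run, while break leaves the loop without evaluating it. The toy functions below (hypothetical names, plain integers standing in for addresses) show the difference. Whether the skipped increment matters for zap_pte_range is not something this sketch tries to settle.

```c
#include <assert.h>

/*
 * Walk [addr, end) one step at a time; the "batch" fills at full_at.
 * Uses a flag plus continue, as in the patch. Returns the address at
 * which the loop stopped.
 */
static unsigned long walk_with_continue(unsigned long addr, unsigned long end,
					unsigned long full_at)
{
	int need_flush = 0;

	assert(addr < end);
	do {
		if (addr == full_at) {
			need_flush = 1;
			continue;   /* goes to the condition: addr still advances */
		}
	} while (addr += 1, (addr != end && !need_flush));

	return addr;
}

/* Same walk, but leaving the loop with break as suggested above. */
static unsigned long walk_with_break(unsigned long addr, unsigned long end,
				     unsigned long full_at)
{
	assert(addr < end);
	do {
		if (addr == full_at)
			break;      /* leaves at once: the increment is skipped */
	} while (addr += 1, addr != end);

	return addr;
}
```

So a plain break would restart at the entry that was just handled unless the address and pte pointer are advanced before leaving the loop.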
>  		}
>  		/*
> @@ -995,12 +996,20 @@ static unsigned long zap_pte_range(struc
>  				print_bad_pte(vma, addr, ptent, NULL);
>  		}
>  		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> -	} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));
> +	} while (pte++, addr += PAGE_SIZE,
> +			(addr != end && *zap_work > 0 && !need_flush));
>  
>  	add_mm_rss_vec(mm, rss);
>  	arch_leave_lazy_mmu_mode();
>  	pte_unmap_unlock(pte - 1, ptl);
>  
> +	if (need_flush) {
> +		need_flush = 0;
> +		tlb_flush_mmu(tlb);
> +		if (addr != end)
> +			goto again;
> +	}
So, I think the reasoning here is to update counters and release locks
regularly while tearing down pagetables. If this is true, it could do with
a comment explaining that's the intention. You can also obviate the need
for the local need_flush here with just if (tlb->need_flush), right?
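The control flow in question, sketched in userspace under loud assumptions: an integer flag stands in for the pte lock, a counter stands in for the page batch, and struct zap_state, flush() and zap_range() are made-up names. It only shows the "fill the batch under the lock, drop the lock, flush, go again" shape.

```c
#include <assert.h>

#define BATCH_MAX 4

struct zap_state {
	int locked;                       /* stand-in for the pte lock */
	unsigned int nr;                  /* entries queued in the batch */
	unsigned int flushes;             /* total flushes performed */
	unsigned int flushes_under_lock;  /* flushes done while "locked" */
};

static void flush(struct zap_state *s)
{
	if (s->locked)
		s->flushes_under_lock++;
	s->flushes++;
	s->nr = 0;
}

/* Process [addr, end), flushing only after the lock has been dropped. */
static unsigned long zap_range(struct zap_state *s, unsigned long addr,
			       unsigned long end)
{
	int need_flush;

	assert(addr < end);
again:
	need_flush = 0;
	s->locked = 1;                    /* take the lock */
	do {
		if (++s->nr >= BATCH_MAX)
			need_flush = 1;   /* batch full: stop after this entry */
	} while (addr++, (addr != end && !need_flush));
	s->locked = 0;                    /* drop the lock before flushing */

	if (need_flush) {
		flush(s);
		if (addr != end)
			goto again;
	}
	return addr;
}
```

The local flag here means "the batch is full", which is narrower than "something was queued"; that distinction is the one raised about the return value of __tlb_remove_page() above.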
> +
>  	return addr;
>  }
>  
> @@ -1121,17 +1130,14 @@ static unsigned long unmap_page_range(st
>   * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
>   * drops the lock and schedules.
>   */
> -unsigned long unmap_vmas(struct mmu_gather **tlbp,
> +unsigned long unmap_vmas(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, unsigned long start_addr,
>  		unsigned long end_addr, unsigned long *nr_accounted,
>  		struct zap_details *details)
>  {
>  	long zap_work = ZAP_BLOCK_SIZE;
> -	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
> -	int tlb_start_valid = 0;
>  	unsigned long start = start_addr;
>  	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
> -	int fullmm = (*tlbp)->fullmm;
>  	struct mm_struct *mm = vma->vm_mm;
>  
>  	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
> @@ -1152,11 +1158,6 @@ unsigned long unmap_vmas(struct mmu_gath
>  			untrack_pfn_vma(vma, 0, 0);
>  
>  		while (start != end) {
> -			if (!tlb_start_valid) {
> -				tlb_start = start;
> -				tlb_start_valid = 1;
> -			}
> -
>  			if (unlikely(is_vm_hugetlb_page(vma))) {
>  				/*
>  				 * It is undesirable to test vma->vm_file as it
> @@ -1177,7 +1178,7 @@ unsigned long unmap_vmas(struct mmu_gath
>  
>  				start = end;
>  			} else
> -				start = unmap_page_range(*tlbp, vma,
> +				start = unmap_page_range(tlb, vma,
>  						start, end, &zap_work, details);
>  
>  			if (zap_work > 0) {
> @@ -1185,19 +1186,13 @@ unsigned long unmap_vmas(struct mmu_gath
>  				break;
>  			}
>  
> -			tlb_finish_mmu(*tlbp, tlb_start, start);
> -
>  			if (need_resched() ||
>  				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
> -				if (i_mmap_lock) {
> -					*tlbp = NULL;
> +				if (i_mmap_lock)
>  					goto out;
> -				}
>  				cond_resched();
>  			}
>  
> -			*tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
> -			tlb_start_valid = 0;
>  			zap_work = ZAP_BLOCK_SIZE;
>  		}
>  	}
> @@ -1217,16 +1212,15 @@ unsigned long zap_page_range(struct vm_a
>  		unsigned long size, struct zap_details *details)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
> -	struct mmu_gather *tlb;
> +	struct mmu_gather tlb;
>  	unsigned long end = address + size;
>  	unsigned long nr_accounted = 0;
>  
>  	lru_add_drain();
> -	tlb = tlb_gather_mmu(mm, 0);
> +	tlb_gather_mmu(&tlb, mm, 0);
>  	update_hiwater_rss(mm);
>  	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
> -	if (tlb)
> -		tlb_finish_mmu(tlb, address, end);
> +	tlb_finish_mmu(&tlb, address, end);
>  	return end;
>  }
>  
> Index: linux-2.6/mm/mmap.c
> ===================================================================
> --- linux-2.6.orig/mm/mmap.c
> +++ linux-2.6/mm/mmap.c
> @@ -1913,17 +1913,17 @@ static void unmap_region(struct mm_struc
>  		unsigned long start, unsigned long end)
>  {
>  	struct vm_area_struct *next = prev? prev->vm_next: mm->mmap;
> -	struct mmu_gather *tlb;
> +	struct mmu_gather tlb;
>  	unsigned long nr_accounted = 0;
>  
>  	lru_add_drain();
> -	tlb = tlb_gather_mmu(mm, 0);
> +	tlb_gather_mmu(&tlb, mm, 0);
>  	update_hiwater_rss(mm);
>  	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
>  	vm_unacct_memory(nr_accounted);
> -	free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
> -				 next? next->vm_start: 0);
> -	tlb_finish_mmu(tlb, start, end);
> +	free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> +				 next ? next->vm_start : 0);
> +	tlb_finish_mmu(&tlb, start, end);
>  }
>  
>  /*
> @@ -2265,7 +2265,7 @@ EXPORT_SYMBOL(do_brk);
>  /* Release all mmaps. */
>  void exit_mmap(struct mm_struct *mm)
>  {
> -	struct mmu_gather *tlb;
> +	struct mmu_gather tlb;
>  	struct vm_area_struct *vma;
>  	unsigned long nr_accounted = 0;
>  	unsigned long end;
> @@ -2290,14 +2290,14 @@ void exit_mmap(struct mm_struct *mm)
>  
>  	lru_add_drain();
>  	flush_cache_mm(mm);
> -	tlb = tlb_gather_mmu(mm, 1);
> +	tlb_gather_mmu(&tlb, mm, 1);
>  	/* update_hiwater_rss(mm) here? but nobody should be looking */
>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
>  	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
>  	vm_unacct_memory(nr_accounted);
>  
> -	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
> -	tlb_finish_mmu(tlb, 0, end);
> +	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
> +	tlb_finish_mmu(&tlb, 0, end);
>  
>  	/*
>  	 * Walk the list again, actually closing and freeing it,
> 
Functionally I didn't see any problems. Comments are more about form
than function. Whether you apply them or not,
Acked-by: Mel Gorman <mel@csn.ul.ie>
-- 
Mel Gorman
SUSE Labs
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * Re: [PATCH 02/17] mm: mmu_gather rework
  2011-03-10 15:50   ` Mel Gorman
@ 2011-03-10 15:50     ` Mel Gorman
  2011-03-16 18:55     ` Peter Zijlstra
  1 sibling, 0 replies; 90+ messages in thread
From: Mel Gorman @ 2011-03-10 15:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Nick Piggin, Paul McKenney, Yanmin Zhang, Martin Schwidefsky,
	Russell King, Paul Mundt, Jeff Dike, Tony Luck, Hugh Dickins
On Thu, Feb 17, 2011 at 05:23:29PM +0100, Peter Zijlstra wrote:
> Remove the first obstacle towards a fully preemptible mmu_gather.
> 
> The current scheme assumes mmu_gather is always done with preemption
> disabled and uses per-cpu storage for the page batches. Change this to
> try and allocate a page for batching and in case of failure, use a
> small on-stack array to make some progress.
> 
> Preemptible mmu_gather is desired in general and usable once
> i_mmap_lock becomes a mutex. Doing it before the mutex conversion
> saves us from having to rework the code by moving the mmu_gather
> bits inside the pte_lock.
> 
> Also avoid flushing the tlb batches from under the pte lock,
> this is useful even without the i_mmap_lock conversion as it
> significantly reduces pte lock hold times.
> 
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: David Miller <davem@davemloft.net>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Russell King <rmk@arm.linux.org.uk>
> Cc: Paul Mundt <lethal@linux-sh.org>
> Cc: Jeff Dike <jdike@addtoit.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  fs/exec.c                 |   10 ++---
>  include/asm-generic/tlb.h |   77 ++++++++++++++++++++++++++++++++--------------
>  include/linux/mm.h        |    2 -
>  mm/memory.c               |   42 ++++++++++---------------
>  mm/mmap.c                 |   18 +++++-----
>  5 files changed, 87 insertions(+), 62 deletions(-)
> 
> Index: linux-2.6/fs/exec.c
> ===================================================================
> --- linux-2.6.orig/fs/exec.c
> +++ linux-2.6/fs/exec.c
> @@ -550,7 +550,7 @@ static int shift_arg_pages(struct vm_are
>  	unsigned long length = old_end - old_start;
>  	unsigned long new_start = old_start - shift;
>  	unsigned long new_end = old_end - shift;
> -	struct mmu_gather *tlb;
> +	struct mmu_gather tlb;
>  
>  	BUG_ON(new_start > new_end);
>  
> @@ -576,12 +576,12 @@ static int shift_arg_pages(struct vm_are
>  		return -ENOMEM;
>  
>  	lru_add_drain();
> -	tlb = tlb_gather_mmu(mm, 0);
> +	tlb_gather_mmu(&tlb, mm, 0);
>  	if (new_end > old_start) {
>  		/*
>  		 * when the old and new regions overlap clear from new_end.
>  		 */
> -		free_pgd_range(tlb, new_end, old_end, new_end,
> +		free_pgd_range(&tlb, new_end, old_end, new_end,
>  			vma->vm_next ? vma->vm_next->vm_start : 0);
>  	} else {
>  		/*
> @@ -590,10 +590,10 @@ static int shift_arg_pages(struct vm_are
>  		 * have constraints on va-space that make this illegal (IA64) -
>  		 * for the others its just a little faster.
>  		 */
> -		free_pgd_range(tlb, old_start, old_end, new_end,
> +		free_pgd_range(&tlb, old_start, old_end, new_end,
>  			vma->vm_next ? vma->vm_next->vm_start : 0);
>  	}
> -	tlb_finish_mmu(tlb, new_end, old_end);
> +	tlb_finish_mmu(&tlb, new_end, old_end);
>  
>  	/*
>  	 * Shrink the vma to just the new range.  Always succeeds.
> Index: linux-2.6/include/asm-generic/tlb.h
> ===================================================================
> --- linux-2.6.orig/include/asm-generic/tlb.h
> +++ linux-2.6/include/asm-generic/tlb.h
> @@ -5,6 +5,8 @@
>   * Copyright 2001 Red Hat, Inc.
>   * Based on code from mm/memory.c Copyright Linus Torvalds and others.
>   *
> + * Copyright 2011 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
> + *
>   * This program is free software; you can redistribute it and/or
>   * modify it under the terms of the GNU General Public License
>   * as published by the Free Software Foundation; either version
> @@ -22,51 +24,69 @@
>   * and page free order so much..
>   */
>  #ifdef CONFIG_SMP
> -  #ifdef ARCH_FREE_PTR_NR
> -    #define FREE_PTR_NR   ARCH_FREE_PTR_NR
> -  #else
> -    #define FREE_PTE_NR	506
> -  #endif
>    #define tlb_fast_mode(tlb) ((tlb)->nr == ~0U)
>  #else
> -  #define FREE_PTE_NR	1
>    #define tlb_fast_mode(tlb) 1
>  #endif
>  
> +/*
> + * If we can't allocate a page to make a big patch of page pointers
> + * to work on, then just handle a few from the on-stack structure.
> + */
> +#define MMU_GATHER_BUNDLE	8
> +
>  /* struct mmu_gather is an opaque type used by the mm code for passing around
>   * any data needed by arch specific code for tlb_remove_page.
>   */
>  struct mmu_gather {
>  	struct mm_struct	*mm;
>  	unsigned int		nr;	/* set to ~0U means fast mode */
> +	unsigned int		max;	/* nr < max */
>  	unsigned int		need_flush;/* Really unmapped some ptes? */
>  	unsigned int		fullmm; /* non-zero means full mm flush */
> -	struct page *		pages[FREE_PTE_NR];
> +#ifdef HAVE_ARCH_MMU_GATHER
> +	struct arch_mmu_gather	arch;
> +#endif
> +	struct page		**pages;
> +	struct page		*local[MMU_GATHER_BUNDLE];
>  };
>  
> -/* Users of the generic TLB shootdown code must declare this storage space. */
> -DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
> +static inline void __tlb_alloc_page(struct mmu_gather *tlb)
> +{
> +	unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
> +
> +	if (addr) {
> +		tlb->pages = (void *)addr;
> +		tlb->max = PAGE_SIZE / sizeof(struct page *);
> +	}
> +}
>  
>  /* tlb_gather_mmu
>   *	Return a pointer to an initialized struct mmu_gather.
>   */
> -static inline struct mmu_gather *
> -tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
> +static inline void
> +tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
>  {
checkpatch will bitch about line length.
> -	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
> -
>  	tlb->mm = mm;
>  
> -	/* Use fast mode if only one CPU is online */
> -	tlb->nr = num_online_cpus() > 1 ? 0U : ~0U;
> +	tlb->max = ARRAY_SIZE(tlb->local);
> +	tlb->pages = tlb->local;
> +
> +	if (num_online_cpus() > 1) {
> +		tlb->nr = 0;
> +		__tlb_alloc_page(tlb);
> +	} else /* Use fast mode if only one CPU is online */
> +		tlb->nr = ~0U;
>  
>  	tlb->fullmm = full_mm_flush;
>  
> -	return tlb;
> +#ifdef HAVE_ARCH_MMU_GATHER
> +	tlb->arch = ARCH_MMU_GATHER_INIT;
> +#endif
>  }
>  
>  static inline void
> -tlb_flush_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
> +tlb_flush_mmu(struct mmu_gather *tlb)
Removing start/end here is a harmless but unrelated cleanup. Is it
worth keeping start/end on the off-chance the information is ever
used to limit what portion of the TLB is flushed?
>  {
>  	if (!tlb->need_flush)
>  		return;
> @@ -75,6 +95,8 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
>  	if (!tlb_fast_mode(tlb)) {
>  		free_pages_and_swap_cache(tlb->pages, tlb->nr);
>  		tlb->nr = 0;
> +		if (tlb->pages == tlb->local)
> +			__tlb_alloc_page(tlb);
>  	}
That needs a comment. Something like
/*
 * If we are using the local on-stack array of pages for MMU gather,
 * try allocation again as we have recently freed pages
 */
>  }
>  
> @@ -85,12 +107,13 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
>  static inline void
>  tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
>  {
> -	tlb_flush_mmu(tlb, start, end);
> +	tlb_flush_mmu(tlb);
>  
>  	/* keep the page table cache within bounds */
>  	check_pgt_cache();
>  
> -	put_cpu_var(mmu_gathers);
> +	if (tlb->pages != tlb->local)
> +		free_pages((unsigned long)tlb->pages, 0);
>  }
>  
>  /* tlb_remove_page
> @@ -98,16 +121,24 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
>   *	handling the additional races in SMP caused by other CPUs caching valid
>   *	mappings in their TLBs.
>   */
> -static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
> +static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
>  {
What does this return value mean?
Looking at the function, it's obvious that 1 is returned when pages[] is full
and needs to be freed, TLB flushed, etc. However, callers refer to the return
value as "need_flush", whereas this function sets tlb->need_flush, and the
two values have different meanings: retval need_flush means the array is full
and must be emptied, whereas tlb->need_flush just says there are some pages
that need to be freed.
It's a nit-pick but how about having it return the number of array slots
that are still available, like pagevec_add does? It would allow you
to get rid of the slightly different need_flush variable in mm/memory.c.
>  	tlb->need_flush = 1;
>  	if (tlb_fast_mode(tlb)) {
>  		free_page_and_swap_cache(page);
> -		return;
> +		return 0;
>  	}
>  	tlb->pages[tlb->nr++] = page;
> -	if (tlb->nr >= FREE_PTE_NR)
> -		tlb_flush_mmu(tlb, 0, 0);
> +	if (tlb->nr >= tlb->max)
> +		return 1;
> +
Use == and VM_BUG_ON(tlb->nr > tlb->max) ?
> +	return 0;
> +}
> +
> +static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
> +{
> +	if (__tlb_remove_page(tlb, page))
> +		tlb_flush_mmu(tlb);
>  }
>  
>  /**
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -889,7 +889,7 @@ int zap_vma_ptes(struct vm_area_struct *
>  		unsigned long size);
>  unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
>  		unsigned long size, struct zap_details *);
> -unsigned long unmap_vmas(struct mmu_gather **tlb,
> +unsigned long unmap_vmas(struct mmu_gather *tlb,
>  		struct vm_area_struct *start_vma, unsigned long start_addr,
>  		unsigned long end_addr, unsigned long *nr_accounted,
>  		struct zap_details *);
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -912,12 +912,13 @@ static unsigned long zap_pte_range(struc
>  				long *zap_work, struct zap_details *details)
>  {
>  	struct mm_struct *mm = tlb->mm;
> +	int need_flush = 0;
>  	pte_t *pte;
>  	spinlock_t *ptl;
>  	int rss[NR_MM_COUNTERS];
>  
>  	init_rss_vec(rss);
> -
> +again:
>  	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>  	arch_enter_lazy_mmu_mode();
>  	do {
> @@ -974,7 +975,7 @@ static unsigned long zap_pte_range(struc
>  			page_remove_rmap(page);
>  			if (unlikely(page_mapcount(page) < 0))
>  				print_bad_pte(vma, addr, ptent, page);
> -			tlb_remove_page(tlb, page);
> +			need_flush = __tlb_remove_page(tlb, page);
>  			continue;
So, if __tlb_remove_page() returns 1 (should be bool for true/false) the
caller is expected to call tlb_flush_mmu(). We call continue and, as a
side effect of the !need_flush test in the loop condition, break out of
the loop, unlock various bits and pieces and restart.
It'd be a hell of a lot clearer to just say
if (__tlb_remove_page(tlb, page))
	break;
and not check !need_flush on each iteration.
>  		}
>  		/*
> @@ -995,12 +996,20 @@ static unsigned long zap_pte_range(struc
>  				print_bad_pte(vma, addr, ptent, NULL);
>  		}
>  		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> -	} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));
> +	} while (pte++, addr += PAGE_SIZE,
> +			(addr != end && *zap_work > 0 && !need_flush));
>  
>  	add_mm_rss_vec(mm, rss);
>  	arch_leave_lazy_mmu_mode();
>  	pte_unmap_unlock(pte - 1, ptl);
>  
> +	if (need_flush) {
> +		need_flush = 0;
> +		tlb_flush_mmu(tlb);
> +		if (addr != end)
> +			goto again;
> +	}
So, I think the reasoning here is to update counters and release locks
regularly while tearing down pagetables. If this is true, it could do with
a comment explaining that's the intention. You can also obviate the need
for the local need_flush here with just if (tlb->need_flush), right?
> +
>  	return addr;
>  }
>  
> @@ -1121,17 +1130,14 @@ static unsigned long unmap_page_range(st
>   * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
>   * drops the lock and schedules.
>   */
> -unsigned long unmap_vmas(struct mmu_gather **tlbp,
> +unsigned long unmap_vmas(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, unsigned long start_addr,
>  		unsigned long end_addr, unsigned long *nr_accounted,
>  		struct zap_details *details)
>  {
>  	long zap_work = ZAP_BLOCK_SIZE;
> -	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
> -	int tlb_start_valid = 0;
>  	unsigned long start = start_addr;
>  	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
> -	int fullmm = (*tlbp)->fullmm;
>  	struct mm_struct *mm = vma->vm_mm;
>  
>  	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
> @@ -1152,11 +1158,6 @@ unsigned long unmap_vmas(struct mmu_gath
>  			untrack_pfn_vma(vma, 0, 0);
>  
>  		while (start != end) {
> -			if (!tlb_start_valid) {
> -				tlb_start = start;
> -				tlb_start_valid = 1;
> -			}
> -
>  			if (unlikely(is_vm_hugetlb_page(vma))) {
>  				/*
>  				 * It is undesirable to test vma->vm_file as it
> @@ -1177,7 +1178,7 @@ unsigned long unmap_vmas(struct mmu_gath
>  
>  				start = end;
>  			} else
> -				start = unmap_page_range(*tlbp, vma,
> +				start = unmap_page_range(tlb, vma,
>  						start, end, &zap_work, details);
>  
>  			if (zap_work > 0) {
> @@ -1185,19 +1186,13 @@ unsigned long unmap_vmas(struct mmu_gath
>  				break;
>  			}
>  
> -			tlb_finish_mmu(*tlbp, tlb_start, start);
> -
>  			if (need_resched() ||
>  				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
> -				if (i_mmap_lock) {
> -					*tlbp = NULL;
> +				if (i_mmap_lock)
>  					goto out;
> -				}
>  				cond_resched();
>  			}
>  
> -			*tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
> -			tlb_start_valid = 0;
>  			zap_work = ZAP_BLOCK_SIZE;
>  		}
>  	}
> @@ -1217,16 +1212,15 @@ unsigned long zap_page_range(struct vm_a
>  		unsigned long size, struct zap_details *details)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
> -	struct mmu_gather *tlb;
> +	struct mmu_gather tlb;
>  	unsigned long end = address + size;
>  	unsigned long nr_accounted = 0;
>  
>  	lru_add_drain();
> -	tlb = tlb_gather_mmu(mm, 0);
> +	tlb_gather_mmu(&tlb, mm, 0);
>  	update_hiwater_rss(mm);
>  	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
> -	if (tlb)
> -		tlb_finish_mmu(tlb, address, end);
> +	tlb_finish_mmu(&tlb, address, end);
>  	return end;
>  }
>  
> Index: linux-2.6/mm/mmap.c
> ===================================================================
> --- linux-2.6.orig/mm/mmap.c
> +++ linux-2.6/mm/mmap.c
> @@ -1913,17 +1913,17 @@ static void unmap_region(struct mm_struc
>  		unsigned long start, unsigned long end)
>  {
>  	struct vm_area_struct *next = prev? prev->vm_next: mm->mmap;
> -	struct mmu_gather *tlb;
> +	struct mmu_gather tlb;
>  	unsigned long nr_accounted = 0;
>  
>  	lru_add_drain();
> -	tlb = tlb_gather_mmu(mm, 0);
> +	tlb_gather_mmu(&tlb, mm, 0);
>  	update_hiwater_rss(mm);
>  	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
>  	vm_unacct_memory(nr_accounted);
> -	free_pgtables(tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
> -				 next? next->vm_start: 0);
> -	tlb_finish_mmu(tlb, start, end);
> +	free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> +				 next ? next->vm_start : 0);
> +	tlb_finish_mmu(&tlb, start, end);
>  }
>  
>  /*
> @@ -2265,7 +2265,7 @@ EXPORT_SYMBOL(do_brk);
>  /* Release all mmaps. */
>  void exit_mmap(struct mm_struct *mm)
>  {
> -	struct mmu_gather *tlb;
> +	struct mmu_gather tlb;
>  	struct vm_area_struct *vma;
>  	unsigned long nr_accounted = 0;
>  	unsigned long end;
> @@ -2290,14 +2290,14 @@ void exit_mmap(struct mm_struct *mm)
>  
>  	lru_add_drain();
>  	flush_cache_mm(mm);
> -	tlb = tlb_gather_mmu(mm, 1);
> +	tlb_gather_mmu(&tlb, mm, 1);
>  	/* update_hiwater_rss(mm) here? but nobody should be looking */
>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
>  	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
>  	vm_unacct_memory(nr_accounted);
>  
> -	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
> -	tlb_finish_mmu(tlb, 0, end);
> +	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
> +	tlb_finish_mmu(&tlb, 0, end);
>  
>  	/*
>  	 * Walk the list again, actually closing and freeing it,
> 
Functionally I didn't see any problems. Comments are more about form
than function. Whether you apply them or not
Acked-by: Mel Gorman <mel@csn.ul.ie>
-- 
Mel Gorman
SUSE Labs
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * Re: [PATCH 02/17] mm: mmu_gather rework
  2011-03-10 15:50   ` Mel Gorman
  2011-03-10 15:50     ` Mel Gorman
@ 2011-03-16 18:55     ` Peter Zijlstra
  2011-03-16 18:55       ` Peter Zijlstra
                         ` (2 more replies)
  1 sibling, 3 replies; 90+ messages in thread
From: Peter Zijlstra @ 2011-03-16 18:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Nick Piggin, Paul McKenney, Yanmin Zhang, Martin Schwidefsky,
	Russell King, Paul Mundt, Jeff Dike, Tony Luck, Hugh Dickins
On Thu, 2011-03-10 at 15:50 +0000, Mel Gorman wrote:
> > +static inline void
> > +tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
> >  {
> 
> checkpatch will bitch about line length.
I did a s/full_mm_flush/fullmm/ which puts the line length at 81. At
which point I'll ignore it ;-)
> > -	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
> > -
> >  	tlb->mm = mm;
> >  
> > -	/* Use fast mode if only one CPU is online */
> > -	tlb->nr = num_online_cpus() > 1 ? 0U : ~0U;
> > +	tlb->max = ARRAY_SIZE(tlb->local);
> > +	tlb->pages = tlb->local;
> > +
> > +	if (num_online_cpus() > 1) {
> > +		tlb->nr = 0;
> > +		__tlb_alloc_page(tlb);
> > +	} else /* Use fast mode if only one CPU is online */
> > +		tlb->nr = ~0U;
> >  
> >  	tlb->fullmm = full_mm_flush;
> >  
> > -	return tlb;
> > +#ifdef HAVE_ARCH_MMU_GATHER
> > +	tlb->arch = ARCH_MMU_GATHER_INIT;
> > +#endif
> >  }
> >  
> >  static inline void
> > -tlb_flush_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
> > +tlb_flush_mmu(struct mmu_gather *tlb)
> 
> Removing start/end here is a harmless, but unrelated cleanup. Is it
> worth keeping start/end on the rough off-chance the information is ever
> used to limit what portion of the TLB is flushed?
I've got another patch that adds full range tracking to
asm-generic/tlb.h, it uses tlb_remove_tlb_entry()/p.._free_tlb() to
track the range of the things actually removed.
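A rough user-space sketch of what such range tracking could look like.
This is not the patch referred to above; struct gather_range and the
range_*() helpers are hypothetical names, and the only idea shown is
"remember the lowest and highest address actually removed".

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical model; not the kernel's struct mmu_gather. */
struct gather_range {
	unsigned long start;	/* lowest address removed so far */
	unsigned long end;	/* one past the highest address removed */
};

static void range_reset(struct gather_range *r)
{
	/* Empty range: start > end, so the first update always wins. */
	r->start = ULONG_MAX;
	r->end = 0;
}

/* Record that [addr, addr + size) was torn down. */
static void range_track(struct gather_range *r, unsigned long addr,
			unsigned long size)
{
	assert(size > 0);
	if (addr < r->start)
		r->start = addr;
	if (addr + size > r->end)
		r->end = addr + size;
}

static int range_empty(const struct gather_range *r)
{
	return r->start >= r->end;
}
```

A flush routine could then skip the flush when range_empty() is true
and otherwise limit it to [start, end).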
> >  {
> >  	if (!tlb->need_flush)
> >  		return;
> > @@ -75,6 +95,8 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
> >  	if (!tlb_fast_mode(tlb)) {
> >  		free_pages_and_swap_cache(tlb->pages, tlb->nr);
> >  		tlb->nr = 0;
> > +		if (tlb->pages == tlb->local)
> > +			__tlb_alloc_page(tlb);
> >  	}
> 
> That needs a comment. Something like
> 
> /*
>  * If we are using the local on-stack array of pages for MMU gather,
>  * try allocation again as we have recently freed pages
>  */
Fair enough, done.
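For readers following along, the fallback scheme under discussion can be
modelled in user space roughly as below. This is only a sketch under
assumptions: malloc() stands in for the page allocation done by
__tlb_alloc_page(), and struct gather, LOCAL_SLOTS, BIG_SLOTS and the
gather_*() helpers are made-up names.

```c
#include <assert.h>
#include <stdlib.h>

#define LOCAL_SLOTS 8	/* assumed size of the embedded array */
#define BIG_SLOTS   512	/* assumed size of the separately allocated batch */

struct gather {
	void **pages;			/* current batch storage */
	unsigned int nr, max;
	void *local[LOCAL_SLOTS];	/* always-available fallback */
};

/* Try to upgrade to the big batch; keep the local array on failure. */
static void gather_try_alloc(struct gather *g)
{
	void **big = malloc(BIG_SLOTS * sizeof(*big));

	if (big) {
		g->pages = big;
		g->max = BIG_SLOTS;
	}
}

static void gather_init(struct gather *g)
{
	g->pages = g->local;
	g->max = LOCAL_SLOTS;
	g->nr = 0;
	gather_try_alloc(g);
}

/* Stand-in for freeing the batched pages and flushing. */
static void gather_flush(struct gather *g)
{
	g->nr = 0;
	/* Still on the fallback: memory was just freed, so try again. */
	if (g->pages == g->local)
		gather_try_alloc(g);
}

static void gather_finish(struct gather *g)
{
	gather_flush(g);
	if (g->pages != g->local)
		free(g->pages);
	g->pages = g->local;
	g->max = LOCAL_SLOTS;
}
```

The point of the embedded array is that gathering never fails outright;
a failed allocation only means smaller batches until the next flush.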
> >  }
> >  
> > @@ -98,16 +121,24 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
> >   *	handling the additional races in SMP caused by other CPUs caching valid
> >   *	mappings in their TLBs.
> >   */
> > -static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
> > +static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
> >  {
> 
> What does this return value mean?
Like you surmise below, that we need to call tlb_flush_mmu() before
calling more of __tlb_remove_page().
> Looking at the function, it's obvious that 1 is returned when pages[] is full
> and needs to be freed, TLB flushed, etc. However, callers refer to the return
> value as "need_flush" whereas this function sets tlb->need_flush, but the
> two values have different meanings: retval need_flush means the array is full
> and must be emptied, whereas tlb->need_flush just says there are some pages
> that need to be freed.
> 
> It's a nit-pick but how about having it return the number of array slots
> that are still available like what pagevec_add does? It would allow you
> to get rid of the slightly-different need_flush variable in mm/memory.c
That might work, let me do so.
> >  	tlb->need_flush = 1;
> >  	if (tlb_fast_mode(tlb)) {
> >  		free_page_and_swap_cache(page);
> > -		return;
> > +		return 0;
> >  	}
> >  	tlb->pages[tlb->nr++] = page;
> > -	if (tlb->nr >= FREE_PTE_NR)
> > -		tlb_flush_mmu(tlb, 0, 0);
> > +	if (tlb->nr >= tlb->max)
> > +		return 1;
> > +
> 
> Use == and VM_BUG_ON(tlb->nr > tlb->max) ?
Paranoia, I like ;-)
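To make the two suggestions above concrete (return the remaining slots
like pagevec_add(), and assert on overflow), here is a minimal user-space
sketch. struct page_batch, BATCH_MAX and batch_add() are hypothetical
names, and assert() merely stands in for VM_BUG_ON(); the revised kernel
function may well look different.

```c
#include <assert.h>

#define BATCH_MAX 4	/* assumed tiny batch size, for illustration only */

struct page_batch {
	unsigned int nr;
	void *pages[BATCH_MAX];
};

/*
 * Queue one page and return the number of free slots left, in the
 * spirit of pagevec_add(). A return of 0 means "flush before adding
 * more".
 */
static unsigned int batch_add(struct page_batch *b, void *page)
{
	assert(b->nr < BATCH_MAX);	/* user-space analogue of VM_BUG_ON() */
	b->pages[b->nr++] = page;
	return BATCH_MAX - b->nr;
}
```

With this shape a caller can write "if (!batch_add(b, page)) break;"
and needs no separate local flag to remember that the batch filled up.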
> > +	return 0;
> > +}
> > +
> > @@ -974,7 +975,7 @@ static unsigned long zap_pte_range(struc
> >  			page_remove_rmap(page);
> >  			if (unlikely(page_mapcount(page) < 0))
> >  				print_bad_pte(vma, addr, ptent, page);
> > -			tlb_remove_page(tlb, page);
> > +			need_flush = __tlb_remove_page(tlb, page);
> >  			continue;
> 
> So, if __tlb_remove_page() returns 1 (should be bool for true/false) the
> caller is expected to call tlb_flush_mmu(). We call continue and, as a
> side-effect, break out of the loop, unlock various bits and pieces, and
> restart.
> 
> It'd be a hell of a lot clearer to just say
> 
> if (__tlb_remove_page(tlb, page))
> 	break;
> 
> and not check !need_flush on each iteration.
Uhm, right :-), /me wonders why he wrote it like it was.
> >  		}
> >  		/*
> > @@ -995,12 +996,20 @@ static unsigned long zap_pte_range(struc
> >  				print_bad_pte(vma, addr, ptent, NULL);
> >  		}
> >  		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> > -	} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));
> > +	} while (pte++, addr += PAGE_SIZE,
> > +			(addr != end && *zap_work > 0 && !need_flush));
> >  
> >  	add_mm_rss_vec(mm, rss);
> >  	arch_leave_lazy_mmu_mode();
> >  	pte_unmap_unlock(pte - 1, ptl);
> >  
> > +	if (need_flush) {
> > +		need_flush = 0;
> > +		tlb_flush_mmu(tlb);
> > +		if (addr != end)
> > +			goto again;
> > +	}
> 
> So, I think the reasoning here is to update counters and release locks
> regularly while tearing down pagetables. If this is true, it could do with
> a comment explaining that's the intention. You can also obviate the need
> for the local need_flush here with just if (tlb->need_flush), right?
I'll add a comment. tlb->need_flush is not quite the same: it's set as
soon as there's one page in, whereas our need_flush is set when there's
no space left. I should have spotted this confusion before.
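The difference between the two flags, and the break-then-restart loop
suggested earlier, can be sketched in user space as follows. Everything
here is a hypothetical model (struct batch, zap_range() and BATCH_MAX are
made-up names); the comments mark where the real code takes and drops the
page-table lock.

```c
#include <assert.h>
#include <stdbool.h>

#define BATCH_MAX 4	/* assumed batch size */

struct batch {
	unsigned int nr;
	bool need_flush;	/* true once anything at all is queued */
	unsigned int flushes;	/* counts flushes, for the demo only */
};

/* Returns true when the batch is full and must be flushed. */
static bool batch_add(struct batch *b)
{
	b->need_flush = true;
	return ++b->nr == BATCH_MAX;
}

static void batch_flush(struct batch *b)
{
	if (!b->need_flush)
		return;
	b->nr = 0;
	b->need_flush = false;
	b->flushes++;
}

/* Process items [i, end); returns the index it stopped at. */
static unsigned int zap_range(struct batch *b, unsigned int i,
			      unsigned int end)
{
	bool full;

again:
	full = false;
	/* the "lock" would be taken here */
	while (i != end) {
		i++;
		if (batch_add(b)) {
			full = true;
			break;
		}
	}
	/* the "lock" would be dropped here, before flushing */
	if (full) {
		batch_flush(b);
		if (i != end)
			goto again;
	}
	return i;
}
```

Note that after zap_range() returns, need_flush can still be true with a
partly filled batch; that remainder is left for the final flush.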
> 
> Functionally I didn't see any problems. Comments are more about form
> than function. Whether you apply them or not
> 
> Acked-by: Mel Gorman <mel@csn.ul.ie>
Thanks!
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * Re: [PATCH 02/17] mm: mmu_gather rework
  2011-03-16 18:55     ` Peter Zijlstra
  2011-03-16 18:55       ` Peter Zijlstra
@ 2011-03-16 20:15       ` Geert Uytterhoeven
  2011-03-16 20:15         ` Geert Uytterhoeven
  2011-03-16 21:08         ` Peter Zijlstra
  2011-03-21  8:47       ` Avi Kivity
  2 siblings, 2 replies; 90+ messages in thread
From: Geert Uytterhoeven @ 2011-03-16 20:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Andrea Arcangeli, Avi Kivity, Thomas Gleixner,
	Rik van Riel, Ingo Molnar, akpm, Linus Torvalds, linux-kernel,
	linux-arch, linux-mm, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Nick Piggin, Paul McKenney, Yanmin Zhang,
	Martin Schwidefsky, Russell King, Paul Mundt, Jeff Dike,
	Tony Luck, Hugh Dickins
On Wed, Mar 16, 2011 at 19:55, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Thu, 2011-03-10 at 15:50 +0000, Mel Gorman wrote:
>
>> > +static inline void
>> > +tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
>> >  {
>>
>> checkpatch will bitch about line length.
>
> I did a s/full_mm_flush/fullmm/ which puts the line length at 81. At
> which point I'll ignore it ;-)
But what does "fullmm" mean here? Shouldn't that be documented?
BTW, the function no longer returns a struct, but void, so the documentation
should be updated for sure.
Gr{oetje,eeting}s,
                        Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * Re: [PATCH 02/17] mm: mmu_gather rework
  2011-03-16 20:15       ` Geert Uytterhoeven
  2011-03-16 20:15         ` Geert Uytterhoeven
@ 2011-03-16 21:08         ` Peter Zijlstra
  1 sibling, 0 replies; 90+ messages in thread
From: Peter Zijlstra @ 2011-03-16 21:08 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Mel Gorman, Andrea Arcangeli, Avi Kivity, Thomas Gleixner,
	Rik van Riel, Ingo Molnar, akpm, Linus Torvalds, linux-kernel,
	linux-arch, linux-mm, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Nick Piggin, Paul McKenney, Yanmin Zhang,
	Martin Schwidefsky, Russell King, Paul Mundt, Jeff Dike,
	Tony Luck, Hugh Dickins
On Wed, 2011-03-16 at 21:15 +0100, Geert Uytterhoeven wrote:
> On Wed, Mar 16, 2011 at 19:55, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > On Thu, 2011-03-10 at 15:50 +0000, Mel Gorman wrote:
> >
> >> > +static inline void
> >> > +tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
> >> >  {
> >>
> >> checkpatch will bitch about line length.
> >
> > I did a s/full_mm_flush/fullmm/ which puts the line length at 81. At
> > which point I'll ignore it ;-)
> 
> But what does "fullmm" mean here? Shouldn't that be documented?
> BTW, the function no longer returns a struct, but void, so the documentation
> should be updated for sure.
You're talking about the comment, right? I'll update that. I was also
considering writing Documentation/mmugather.txt, but that's a slightly
bigger undertaking.
^ permalink raw reply	[flat|nested] 90+ messages in thread
 
- * Re: [PATCH 02/17] mm: mmu_gather rework
  2011-03-16 18:55     ` Peter Zijlstra
  2011-03-16 18:55       ` Peter Zijlstra
  2011-03-16 20:15       ` Geert Uytterhoeven
@ 2011-03-21  8:47       ` Avi Kivity
  2011-03-21  8:47         ` Avi Kivity
  2011-04-01 12:07         ` Peter Zijlstra
  2 siblings, 2 replies; 90+ messages in thread
From: Avi Kivity @ 2011-03-21  8:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Andrea Arcangeli, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Nick Piggin, Paul McKenney, Yanmin Zhang, Martin Schwidefsky,
	Russell King, Paul Mundt, Jeff Dike, Tony Luck, Hugh Dickins
On 03/16/2011 08:55 PM, Peter Zijlstra wrote:
> On Thu, 2011-03-10 at 15:50 +0000, Mel Gorman wrote:
>
> >  >  +static inline void
> >  >  +tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
> >  >   {
> >
> >  checkpatch will bitch about line length.
>
> I did a s/full_mm_flush/fullmm/ which puts the line length at 81. At
> which point I'll ignore it ;-)
How about s/unsigned int/bool/?  IIRC you aren't a "bool was invented 
after 1971, therefore it is evil" type.
-- 
error compiling committee.c: too many arguments to function
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * Re: [PATCH 02/17] mm: mmu_gather rework
  2011-03-21  8:47       ` Avi Kivity
  2011-03-21  8:47         ` Avi Kivity
@ 2011-04-01 12:07         ` Peter Zijlstra
  2011-04-01 12:07           ` Peter Zijlstra
  2011-04-01 16:13           ` Linus Torvalds
  1 sibling, 2 replies; 90+ messages in thread
From: Peter Zijlstra @ 2011-04-01 12:07 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Mel Gorman, Andrea Arcangeli, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Nick Piggin, Paul McKenney, Yanmin Zhang, Martin Schwidefsky,
	Russell King, Paul Mundt, Jeff Dike, Tony Luck, Hugh Dickins
On Mon, 2011-03-21 at 10:47 +0200, Avi Kivity wrote:
> On 03/16/2011 08:55 PM, Peter Zijlstra wrote:
> > On Thu, 2011-03-10 at 15:50 +0000, Mel Gorman wrote:
> >
> > >  >  +static inline void
> > >  >  +tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
> > >  >   {
> > >
> > >  checkpatch will bitch about line length.
> >
> > I did a s/full_mm_flush/fullmm/ which puts the line length at 81. At
> > which point I'll ignore it ;-)
> 
> How about s/unsigned int/bool/?  IIRC you aren't a "bool was invented 
> after 1971, therefore it is evil" type.
No, although I do try to avoid it in structures because I'm ever unsure
of the storage type used. But yes, good suggestion, thanks!
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * Re: [PATCH 02/17] mm: mmu_gather rework
  2011-04-01 12:07         ` Peter Zijlstra
  2011-04-01 12:07           ` Peter Zijlstra
@ 2011-04-01 16:13           ` Linus Torvalds
  2011-04-02  0:07             ` David Miller
  1 sibling, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2011-04-01 16:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Mel Gorman, Andrea Arcangeli, Thomas Gleixner,
	Rik van Riel, Ingo Molnar, akpm, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Nick Piggin, Paul McKenney, Yanmin Zhang, Martin Schwidefsky,
	Russell King, Paul Mundt, Jeff Dike, Tony Luck, Hugh Dickins
On Fri, Apr 1, 2011 at 5:07 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> No, although I do try to avoid it in structures because I'm ever unsure
> of the storage type used. But yes, good suggestion, thanks!
I have to admit to not being a huge fan of "bool". You never know what
it actually is in C, and it's a possible source of major confusion.
Some environments will make it "int", others "char", and others - like
the kernel - will make it a C99/C++-like "true boolean" (C99 _Bool).
What's the difference? Integer assignment makes a hell of a difference. Do this:
  long long expression = ...
  ...
  bool val = expression;
and depending on implementation it will either just truncate the value
to a random number of bits, or actually do a compare with zero.
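A small sketch of the two behaviours, assuming an 8-bit unsigned char.
char_bool is a hypothetical typedef standing in for environments that
define their own "bool"; the second function uses the C99 type from
<stdbool.h>.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical "bool" as some code bases typedef it. */
typedef unsigned char char_bool;

/* Plain integer narrowing: only the low 8 bits survive. */
static char_bool to_char_bool(long long v)
{
	return v;
}

/* C99 _Bool conversion: the value is compared against zero. */
static bool to_c99_bool(long long v)
{
	return v;
}

static void bool_demo(void)
{
	long long expression = 0x100;	/* non-zero, but low byte is 0 */

	assert(to_char_bool(expression) == 0);
	assert(to_c99_bool(expression) == true);
}
```

The same source line thus yields "false" with the typedef and "true"
with _Bool, which is the portability hazard described above.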
And while we use the C99 _Bool type, and thus get those true boolean
semantics (i.e. not just a truncated integer type), I have to say
that it's still a dangerous thing to do in C because you generally
cannot rely on it. There's _tons_ of software that just typedefs int
or char to bool.
So even outside of structures, I'm not necessarily convinced "bool" is
always such a good thing. But I'm not going to stop people from using
it (inside the kernel it should be safe), I just want to raise a
warning and ask people to not use it mindlessly. And avoid the casts -
even if they are safe in the kernel.
                      Linus
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * Re: [PATCH 02/17] mm: mmu_gather rework
  2011-04-01 16:13           ` Linus Torvalds
@ 2011-04-02  0:07             ` David Miller
  0 siblings, 0 replies; 90+ messages in thread
From: David Miller @ 2011-04-02  0:07 UTC (permalink / raw)
  To: torvalds
  Cc: a.p.zijlstra, avi, mel, aarcange, tglx, riel, mingo, akpm,
	linux-kernel, linux-arch, linux-mm, benh, hugh.dickins, npiggin,
	paulmck, yanmin_zhang, schwidefsky, rmk, lethal, jdike, tony.luck,
	hughd
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 1 Apr 2011 09:13:51 -0700
> What's the difference? Integer assignment makes a hell of a difference. Do this:
> 
>   long long expression = ...
>   ...
>   bool val = expression;
> 
> and depending on implementation it will either just truncate the value
> to a random number of bits, or actually do a compare with zero.
But note that, as you indicate, using ints to store boolean values
has this exact problem.
And most of the time people are converting an "int used as a boolean
value" into a "bool".
At least the "bool" has a chance of giving true boolean semantics in
the case you describe, whereas the 'int' always has the potential
truncation issue.
So, personally, I see it as a net positive to convert int to bool when
the variable is being used to take on true/false values.
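To make that concrete, here is a small sketch under stated assumptions: `FLAG_HIGH` and the helper names are hypothetical, and unsigned int is used instead of int so that the narrowing is well defined by the C standard (it assumes a 32-bit unsigned int). The same flag test is stored once in an integer and once in a bool:

```c
#include <assert.h>
#include <stdbool.h>

#define FLAG_HIGH	(1ULL << 40)	/* hypothetical flag above bit 31 */

/* Integer used as a boolean: the masked value is narrowed on assignment. */
static unsigned int flag_as_uint(unsigned long long flags)
{
	unsigned int set = (unsigned int)(flags & FLAG_HIGH);

	return set;
}

/* bool: the masked value is compared with zero on assignment. */
static bool flag_as_bool(unsigned long long flags)
{
	bool set = flags & FLAG_HIGH;

	return set;
}

static void flag_demo(void)
{
	/* Bit 40 is set, yet the 32-bit integer sees none of it. */
	assert(flag_as_uint(FLAG_HIGH) == 0);
	assert(flag_as_bool(FLAG_HIGH) == true);
}
```

The integer version loses the flag entirely, while the bool version keeps it, which is the asymmetry the paragraph above is pointing at.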
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * [PATCH 03/17] powerpc: mmu_gather rework
  2011-02-17 16:23 [PATCH 00/17] mm: mmu_gather rework Peter Zijlstra
                   ` (2 preceding siblings ...)
  2011-02-17 16:23 ` [PATCH 02/17] mm: mmu_gather rework Peter Zijlstra
@ 2011-02-17 16:23 ` Peter Zijlstra
  2011-02-17 16:23   ` Peter Zijlstra
  2011-02-17 16:23 ` [PATCH 04/17] sparc: " Peter Zijlstra
                   ` (15 subsequent siblings)
  19 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang
[-- Attachment #1: peter_zijlstra-powerpc-preemptible_mmu_gather.patch --]
[-- Type: text/plain, Size: 9098 bytes --]
Fix up powerpc to the new mmu_gather stuffs.
PPC has an extra batching queue to RCU free the actual pagetable
allocations, use the ARCH extensions for that for now.
For the ppc64_tlb_batch, which tracks the vaddrs to unhash from the
hardware hash-table, keep using per-cpu arrays but flush on context
switch and use a TLF bit to track the lazy_mmu state.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
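The context-switch handling described in the changelog above can be modelled in plain user-space C. This is only a sketch of the idea, not the kernel code: every `model_` name, the batch capacity, and the flag value are assumptions made for illustration. The outgoing task records that lazy MMU mode was on and empties the CPU-local batch; the incoming task re-arms the batch of whichever CPU it resumes on.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_BATCH_NR		4		/* hypothetical capacity */
#define MODEL_TLF_LAZY_MMU	(1u << 3)	/* stands in for _TLF_LAZY_MMU */

struct batch_model {
	int active;			/* lazy MMU mode is on for this CPU */
	size_t index;			/* number of queued addresses */
	unsigned long vaddr[MODEL_BATCH_NR];
	size_t flushed;			/* total addresses flushed so far */
};

struct task_model {
	unsigned int local_flags;
};

static void model_flush_pending(struct batch_model *batch)
{
	assert(batch->index <= MODEL_BATCH_NR);
	batch->flushed += batch->index;
	batch->index = 0;
}

/* Outgoing task: remember lazy mode in the task, empty the CPU's batch. */
static void model_switch_out(struct task_model *prev, struct batch_model *batch)
{
	if (batch->active) {
		prev->local_flags |= MODEL_TLF_LAZY_MMU;
		if (batch->index)
			model_flush_pending(batch);
		batch->active = 0;
	}
}

/* Incoming task: re-arm the batch of whichever CPU it now runs on. */
static void model_switch_in(struct task_model *next, struct batch_model *batch)
{
	if (next->local_flags & MODEL_TLF_LAZY_MMU) {
		next->local_flags &= ~MODEL_TLF_LAZY_MMU;
		batch->active = 1;
	}
}
```

Because the lazy state travels with the task and the queued addresses never outlive a context switch, the per-cpu array stays valid even if the task is preempted and later resumes on a different CPU.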
 arch/powerpc/include/asm/pgalloc.h     |    4 ++--
 arch/powerpc/include/asm/thread_info.h |    2 ++
 arch/powerpc/include/asm/tlb.h         |   10 ++++++++++
 arch/powerpc/kernel/process.c          |   23 ++++++++++++++++++++++-
 arch/powerpc/mm/pgtable.c              |   14 ++++----------
 arch/powerpc/mm/tlb_hash32.c           |    2 +-
 arch/powerpc/mm/tlb_hash64.c           |   12 +++++++-----
 arch/powerpc/mm/tlb_nohash.c           |    2 +-
 8 files changed, 49 insertions(+), 20 deletions(-)
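The other half of the change, moving the page-table free batch out of a per-cpu pointer and into the gather structure itself, can be sketched the same way. All names and the tiny capacity below are hypothetical, and where the kernel would hand a full batch over for RCU freeing, this model simply counts the entries and releases the batch immediately.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_BATCH_MAX	2	/* hypothetical, deliberately tiny */

struct freelist_model {
	unsigned int index;
	void *tables[MODEL_BATCH_MAX];
};

/* Arch-private state carried inside the gather, not in per-cpu data. */
struct arch_gather_model {
	struct freelist_model *batch;
};

struct gather_model {
	struct arch_gather_model arch;
	unsigned int submitted;		/* tables handed off so far */
};

static void model_submit(struct gather_model *g)
{
	g->submitted += g->arch.batch->index;
	free(g->arch.batch);
	g->arch.batch = NULL;
}

/* Queue one table; returns -1 if no batch could be allocated. */
static int model_free_table(struct gather_model *g, void *table)
{
	struct freelist_model **batchp = &g->arch.batch;

	if (*batchp == NULL) {
		*batchp = calloc(1, sizeof(**batchp));
		if (*batchp == NULL)
			return -1;
	}
	assert((*batchp)->index < MODEL_BATCH_MAX);
	(*batchp)->tables[(*batchp)->index++] = table;
	if ((*batchp)->index == MODEL_BATCH_MAX)
		model_submit(g);
	return 0;
}

/* Push out whatever is still queued when the gather finishes. */
static void model_finish(struct gather_model *g)
{
	if (g->arch.batch == NULL)
		return;
	model_submit(g);
}
```

Since the batch pointer is reached through the gather that the caller already owns, nothing here depends on staying on one CPU, which is what lets the "preemption is disabled" comments go away.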
Index: linux-2.6/arch/powerpc/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/tlb.h
+++ linux-2.6/arch/powerpc/include/asm/tlb.h
@@ -28,6 +28,16 @@
 #define tlb_start_vma(tlb, vma)	do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
 
+#define HAVE_ARCH_MMU_GATHER 1
+
+struct pte_freelist_batch;
+
+struct arch_mmu_gather {
+	struct pte_freelist_batch *batch;
+};
+
+#define ARCH_MMU_GATHER_INIT (struct arch_mmu_gather){ .batch = NULL, }
+
 extern void tlb_flush(struct mmu_gather *tlb);
 
 /* Get the generic bits... */
Index: linux-2.6/arch/powerpc/kernel/process.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/process.c
+++ linux-2.6/arch/powerpc/kernel/process.c
@@ -393,6 +393,9 @@ struct task_struct *__switch_to(struct t
 	struct thread_struct *new_thread, *old_thread;
 	unsigned long flags;
 	struct task_struct *last;
+#ifdef CONFIG_PPC_BOOK3S_64
+	struct ppc64_tlb_batch *batch;
+#endif
 
 #ifdef CONFIG_SMP
 	/* avoid complexity of lazy save/restore of fpu
@@ -511,7 +514,17 @@ struct task_struct *__switch_to(struct t
 		old_thread->accum_tb += (current_tb - start_tb);
 		new_thread->start_tb = current_tb;
 	}
-#endif
+#endif /* CONFIG_PPC64 */
+
+#ifdef CONFIG_PPC_BOOK3S_64
+	batch = &__get_cpu_var(ppc64_tlb_batch);
+	if (batch->active) {
+		current_thread_info()->local_flags |= _TLF_LAZY_MMU;
+		if (batch->index)
+			__flush_tlb_pending(batch);
+		batch->active = 0;
+	}
+#endif /* CONFIG_PPC_BOOK3S_64 */
 
 	local_irq_save(flags);
 
@@ -526,6 +539,14 @@ struct task_struct *__switch_to(struct t
 	hard_irq_disable();
 	last = _switch(old_thread, new_thread);
 
+#ifdef CONFIG_PPC_BOOK3S_64
+	if (current_thread_info()->local_flags & _TLF_LAZY_MMU) {
+		current_thread_info()->local_flags &= ~_TLF_LAZY_MMU;
+		batch = &__get_cpu_var(ppc64_tlb_batch);
+		batch->active = 1;
+	}
+#endif /* CONFIG_PPC_BOOK3S_64 */
+
 	local_irq_restore(flags);
 
 	return last;
Index: linux-2.6/arch/powerpc/mm/pgtable.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/pgtable.c
+++ linux-2.6/arch/powerpc/mm/pgtable.c
@@ -33,8 +33,6 @@
 
 #include "mmu_decl.h"
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 #ifdef CONFIG_SMP
 
 /*
@@ -43,7 +41,6 @@ DEFINE_PER_CPU(struct mmu_gather, mmu_ga
  * freeing a page table page that is being walked without locks
  */
 
-static DEFINE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur);
 static unsigned long pte_freelist_forced_free;
 
 struct pte_freelist_batch
@@ -97,12 +94,10 @@ static void pte_free_submit(struct pte_f
 
 void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
-	/* This is safe since tlb_gather_mmu has disabled preemption */
-	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
+	struct pte_freelist_batch **batchp = &tlb->arch.batch;
 	unsigned long pgf;
 
-	if (atomic_read(&tlb->mm->mm_users) < 2 ||
-	    cpumask_equal(mm_cpumask(tlb->mm), cpumask_of(smp_processor_id()))){
+	if (atomic_read(&tlb->mm->mm_users) < 2) {
 		pgtable_free(table, shift);
 		return;
 	}
@@ -124,10 +119,9 @@ void pgtable_free_tlb(struct mmu_gather 
 	}
 }
 
-void pte_free_finish(void)
+void pte_free_finish(struct mmu_gather *tlb)
 {
-	/* This is safe since tlb_gather_mmu has disabled preemption */
-	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
+	struct pte_freelist_batch **batchp = &tlb->arch.batch;
 
 	if (*batchp == NULL)
 		return;
Index: linux-2.6/arch/powerpc/mm/tlb_hash64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash64.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash64.c
@@ -38,13 +38,11 @@ DEFINE_PER_CPU(struct ppc64_tlb_batch, p
  * neesd to be flushed. This function will either perform the flush
  * immediately or will batch it up if the current CPU has an active
  * batch on it.
- *
- * Must be called from within some kind of spinlock/non-preempt region...
  */
 void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
 		     pte_t *ptep, unsigned long pte, int huge)
 {
-	struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
+	struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
 	unsigned long vsid, vaddr;
 	unsigned int psize;
 	int ssize;
@@ -99,6 +97,7 @@ void hpte_need_flush(struct mm_struct *m
 	 */
 	if (!batch->active) {
 		flush_hash_page(vaddr, rpte, psize, ssize, 0);
+		put_cpu_var(ppc64_tlb_batch);
 		return;
 	}
 
@@ -127,6 +126,7 @@ void hpte_need_flush(struct mm_struct *m
 	batch->index = ++i;
 	if (i >= PPC64_TLB_BATCH_NR)
 		__flush_tlb_pending(batch);
+	put_cpu_var(ppc64_tlb_batch);
 }
 
 /*
@@ -155,7 +155,7 @@ void __flush_tlb_pending(struct ppc64_tl
 
 void tlb_flush(struct mmu_gather *tlb)
 {
-	struct ppc64_tlb_batch *tlbbatch = &__get_cpu_var(ppc64_tlb_batch);
+	struct ppc64_tlb_batch *tlbbatch = &get_cpu_var(ppc64_tlb_batch);
 
 	/* If there's a TLB batch pending, then we must flush it because the
 	 * pages are going to be freed and we really don't want to have a CPU
@@ -164,8 +164,10 @@ void tlb_flush(struct mmu_gather *tlb)
 	if (tlbbatch->index)
 		__flush_tlb_pending(tlbbatch);
 
+	put_cpu_var(ppc64_tlb_batch);
+
 	/* Push out batch of freed page tables */
-	pte_free_finish();
+	pte_free_finish(tlb);
 }
 
 /**
Index: linux-2.6/arch/powerpc/include/asm/thread_info.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/thread_info.h
+++ linux-2.6/arch/powerpc/include/asm/thread_info.h
@@ -139,10 +139,12 @@ static inline struct thread_info *curren
 #define TLF_NAPPING		0	/* idle thread enabled NAP mode */
 #define TLF_SLEEPING		1	/* suspend code enabled SLEEP mode */
 #define TLF_RESTORE_SIGMASK	2	/* Restore signal mask in do_signal */
+#define TLF_LAZY_MMU		3	/* tlb_batch is active */
 
 #define _TLF_NAPPING		(1 << TLF_NAPPING)
 #define _TLF_SLEEPING		(1 << TLF_SLEEPING)
 #define _TLF_RESTORE_SIGMASK	(1 << TLF_RESTORE_SIGMASK)
+#define _TLF_LAZY_MMU		(1 << TLF_LAZY_MMU)
 
 #ifndef __ASSEMBLY__
 #define HAVE_SET_RESTORE_SIGMASK	1
Index: linux-2.6/arch/powerpc/include/asm/pgalloc.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/pgalloc.h
+++ linux-2.6/arch/powerpc/include/asm/pgalloc.h
@@ -32,13 +32,13 @@ static inline void pte_free(struct mm_st
 
 #ifdef CONFIG_SMP
 extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
-extern void pte_free_finish(void);
+extern void pte_free_finish(struct mmu_gather *tlb);
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
 	pgtable_free(table, shift);
 }
-static inline void pte_free_finish(void) { }
+static inline void pte_free_finish(struct mmu_gather *tlb) { }
 #endif /* !CONFIG_SMP */
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
Index: linux-2.6/arch/powerpc/mm/tlb_hash32.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash32.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash32.c
@@ -73,7 +73,7 @@ void tlb_flush(struct mmu_gather *tlb)
 	}
 
 	/* Push out batch of freed page tables */
-	pte_free_finish();
+	pte_free_finish(tlb);
 }
 
 /*
Index: linux-2.6/arch/powerpc/mm/tlb_nohash.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_nohash.c
+++ linux-2.6/arch/powerpc/mm/tlb_nohash.c
@@ -301,7 +301,7 @@ void tlb_flush(struct mmu_gather *tlb)
 	flush_tlb_mm(tlb->mm);
 
 	/* Push out batch of freed page tables */
-	pte_free_finish();
+	pte_free_finish(tlb);
 }
 
 /*
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply	[flat|nested] 90+ messages in thread
 
- * [PATCH 04/17] sparc: mmu_gather rework
  2011-02-17 16:23 [PATCH 00/17] mm: mmu_gather rework Peter Zijlstra
                   ` (3 preceding siblings ...)
  2011-02-17 16:23 ` [PATCH 03/17] powerpc: " Peter Zijlstra
@ 2011-02-17 16:23 ` Peter Zijlstra
  2011-02-17 16:23   ` Peter Zijlstra
  2011-02-17 16:23 ` [PATCH 05/17] s390: " Peter Zijlstra
                   ` (14 subsequent siblings)
  19 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang
[-- Attachment #1: peter_zijlstra-sparc-preemptible_mmu_gather.patch --]
[-- Type: text/plain, Size: 10878 bytes --]
Rework the sparc mmu_gather usage to conform to the new world order :-)
Sparc mmu_gather does two things:
 - tracks vaddrs to unhash
 - tracks pages to free
Split these two things like powerpc has done and keep the vaddrs
in per-cpu data structures and flush them on context switch.
The remaining bits can then use the generic mmu_gather.
Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
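The vaddr-tracking half of the split described in the changelog above can be modelled in user-space C. This is a sketch under assumptions, not the sparc code: the `model_` names and the capacity are made up, and a "flush" is reduced to a counter. An address is queued unless the whole address space is being torn down, and the queue is flushed when it fills up or when a different mm shows up.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_BATCH_NR	3	/* hypothetical; the patch uses TLB_BATCH_NR */

struct mm_model {
	int id;
};

struct tlb_batch_model {
	struct mm_model *mm;		/* owner of the queued addresses */
	unsigned long tlb_nr;		/* number of queued addresses */
	unsigned long vaddrs[MODEL_BATCH_NR];
	unsigned long flushes;		/* how many flushes were issued */
};

static void model_flush_pending(struct tlb_batch_model *tb)
{
	if (tb->tlb_nr) {
		tb->flushes++;
		tb->tlb_nr = 0;
	}
}

static void model_batch_add(struct tlb_batch_model *tb, struct mm_model *mm,
			    unsigned long vaddr, int fullmm)
{
	unsigned long nr;

	/* Full teardown: the whole context goes away, nothing to queue. */
	if (fullmm)
		return;

	nr = tb->tlb_nr;

	/* Entries from another mm must go out before we start a new run. */
	if (nr != 0 && mm != tb->mm) {
		model_flush_pending(tb);
		nr = 0;
	}

	if (nr == 0)
		tb->mm = mm;

	assert(nr < MODEL_BATCH_NR);
	tb->vaddrs[nr] = vaddr;
	tb->tlb_nr = ++nr;
	if (nr >= MODEL_BATCH_NR)
		model_flush_pending(tb);
}
```

Nothing in this structure knows about pages to free, which is the point of the split: page freeing can be left to the generic mmu_gather.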
 arch/sparc/include/asm/pgalloc_64.h  |    3 +
 arch/sparc/include/asm/pgtable_64.h  |   15 ++++-
 arch/sparc/include/asm/tlb_64.h      |   91 ++---------------------------------
 arch/sparc/include/asm/tlbflush_64.h |   12 +++-
 arch/sparc/mm/tlb.c                  |   43 +++++++++-------
 arch/sparc/mm/tsb.c                  |   15 +++--
 6 files changed, 63 insertions(+), 116 deletions(-)
Index: linux-2.6/arch/sparc/include/asm/pgalloc_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/pgalloc_64.h
+++ linux-2.6/arch/sparc/include/asm/pgalloc_64.h
@@ -78,4 +78,7 @@ static inline void check_pgt_cache(void)
 	quicklist_trim(0, NULL, 25, 16);
 }
 
+#define __pte_free_tlb(tlb, pte, addr)	pte_free((tlb)->mm, pte)
+#define __pmd_free_tlb(tlb, pmd, addr)	pmd_free((tlb)->mm, pmd)
+
 #endif /* _SPARC64_PGALLOC_H */
Index: linux-2.6/arch/sparc/include/asm/tlb_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/tlb_64.h
+++ linux-2.6/arch/sparc/include/asm/tlb_64.h
@@ -7,66 +7,11 @@
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
 
-#define TLB_BATCH_NR	192
-
-/*
- * For UP we don't need to worry about TLB flush
- * and page free order so much..
- */
-#ifdef CONFIG_SMP
-  #define FREE_PTE_NR	506
-  #define tlb_fast_mode(bp) ((bp)->pages_nr == ~0U)
-#else
-  #define FREE_PTE_NR	1
-  #define tlb_fast_mode(bp) 1
-#endif
-
-struct mmu_gather {
-	struct mm_struct *mm;
-	unsigned int pages_nr;
-	unsigned int need_flush;
-	unsigned int fullmm;
-	unsigned int tlb_nr;
-	unsigned long vaddrs[TLB_BATCH_NR];
-	struct page *pages[FREE_PTE_NR];
-};
-
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 #ifdef CONFIG_SMP
 extern void smp_flush_tlb_pending(struct mm_struct *,
 				  unsigned long, unsigned long *);
 #endif
 
-extern void __flush_tlb_pending(unsigned long, unsigned long, unsigned long *);
-extern void flush_tlb_pending(void);
-
-static inline struct mmu_gather *tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
-{
-	struct mmu_gather *mp = &get_cpu_var(mmu_gathers);
-
-	BUG_ON(mp->tlb_nr);
-
-	mp->mm = mm;
-	mp->pages_nr = num_online_cpus() > 1 ? 0U : ~0U;
-	mp->fullmm = full_mm_flush;
-
-	return mp;
-}
-
-
-static inline void tlb_flush_mmu(struct mmu_gather *mp)
-{
-	if (!mp->fullmm)
-		flush_tlb_pending();
-	if (mp->need_flush) {
-		free_pages_and_swap_cache(mp->pages, mp->pages_nr);
-		mp->pages_nr = 0;
-		mp->need_flush = 0;
-	}
-
-}
-
 #ifdef CONFIG_SMP
 extern void smp_flush_tlb_mm(struct mm_struct *mm);
 #define do_flush_tlb_mm(mm) smp_flush_tlb_mm(mm)
@@ -74,38 +19,14 @@ extern void smp_flush_tlb_mm(struct mm_s
 #define do_flush_tlb_mm(mm) __flush_tlb_mm(CTX_HWBITS(mm->context), SECONDARY_CONTEXT)
 #endif
 
-static inline void tlb_finish_mmu(struct mmu_gather *mp, unsigned long start, unsigned long end)
-{
-	tlb_flush_mmu(mp);
-
-	if (mp->fullmm)
-		mp->fullmm = 0;
-
-	/* keep the page table cache within bounds */
-	check_pgt_cache();
-
-	put_cpu_var(mmu_gathers);
-}
-
-static inline void tlb_remove_page(struct mmu_gather *mp, struct page *page)
-{
-	if (tlb_fast_mode(mp)) {
-		free_page_and_swap_cache(page);
-		return;
-	}
-	mp->need_flush = 1;
-	mp->pages[mp->pages_nr++] = page;
-	if (mp->pages_nr >= FREE_PTE_NR)
-		tlb_flush_mmu(mp);
-}
-
-#define tlb_remove_tlb_entry(mp,ptep,addr) do { } while (0)
-#define pte_free_tlb(mp, ptepage, addr) pte_free((mp)->mm, ptepage)
-#define pmd_free_tlb(mp, pmdp, addr) pmd_free((mp)->mm, pmdp)
-#define pud_free_tlb(tlb,pudp, addr) __pud_free_tlb(tlb,pudp,addr)
+extern void __flush_tlb_pending(unsigned long, unsigned long, unsigned long *);
+extern void flush_tlb_pending(void);
 
-#define tlb_migrate_finish(mm)	do { } while (0)
 #define tlb_start_vma(tlb, vma) do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
+#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
+#define tlb_flush(tlb)	flush_tlb_pending()
+
+#include <asm-generic/tlb.h>
 
 #endif /* _SPARC64_TLB_H */
Index: linux-2.6/arch/sparc/include/asm/tlbflush_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/tlbflush_64.h
+++ linux-2.6/arch/sparc/include/asm/tlbflush_64.h
@@ -5,9 +5,17 @@
 #include <asm/mmu_context.h>
 
 /* TSB flush operations. */
-struct mmu_gather;
+
+#define TLB_BATCH_NR	192
+
+struct tlb_batch {
+	struct mm_struct *mm;
+	unsigned long tlb_nr;
+	unsigned long vaddrs[TLB_BATCH_NR];
+};
+
 extern void flush_tsb_kernel_range(unsigned long start, unsigned long end);
-extern void flush_tsb_user(struct mmu_gather *mp);
+extern void flush_tsb_user(struct tlb_batch *tb);
 
 /* TLB flush operations. */
 
Index: linux-2.6/arch/sparc/mm/tlb.c
===================================================================
--- linux-2.6.orig/arch/sparc/mm/tlb.c
+++ linux-2.6/arch/sparc/mm/tlb.c
@@ -19,33 +19,34 @@
 
 /* Heavily inspired by the ppc64 code.  */
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
+static DEFINE_PER_CPU(struct tlb_batch, tlb_batch);
 
 void flush_tlb_pending(void)
 {
-	struct mmu_gather *mp = &get_cpu_var(mmu_gathers);
+	struct tlb_batch *tb = &get_cpu_var(tlb_batch);
 
-	if (mp->tlb_nr) {
-		flush_tsb_user(mp);
+	if (tb->tlb_nr) {
+		flush_tsb_user(tb);
 
-		if (CTX_VALID(mp->mm->context)) {
+		if (CTX_VALID(tb->mm->context)) {
 #ifdef CONFIG_SMP
-			smp_flush_tlb_pending(mp->mm, mp->tlb_nr,
-					      &mp->vaddrs[0]);
+			smp_flush_tlb_pending(tb->mm, tb->tlb_nr,
+					      &tb->vaddrs[0]);
 #else
-			__flush_tlb_pending(CTX_HWBITS(mp->mm->context),
-					    mp->tlb_nr, &mp->vaddrs[0]);
+			__flush_tlb_pending(CTX_HWBITS(tb->mm->context),
+					    tb->tlb_nr, &tb->vaddrs[0]);
 #endif
 		}
-		mp->tlb_nr = 0;
+		tb->tlb_nr = 0;
 	}
 
-	put_cpu_var(mmu_gathers);
+	put_cpu_var(tlb_batch);
 }
 
-void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr, pte_t *ptep, pte_t orig)
+void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
+		   pte_t *ptep, pte_t orig, int fullmm)
 {
-	struct mmu_gather *mp = &__get_cpu_var(mmu_gathers);
+	struct tlb_batch *tb = &get_cpu_var(tlb_batch);
 	unsigned long nr;
 
 	vaddr &= PAGE_MASK;
@@ -77,21 +78,25 @@ void tlb_batch_add(struct mm_struct *mm,
 
 no_cache_flush:
 
-	if (mp->fullmm)
+	if (fullmm) {
+		put_cpu_var(tlb_batch);
 		return;
+	}
 
-	nr = mp->tlb_nr;
+	nr = tb->tlb_nr;
 
-	if (unlikely(nr != 0 && mm != mp->mm)) {
+	if (unlikely(nr != 0 && mm != tb->mm)) {
 		flush_tlb_pending();
 		nr = 0;
 	}
 
 	if (nr == 0)
-		mp->mm = mm;
+		tb->mm = mm;
 
-	mp->vaddrs[nr] = vaddr;
-	mp->tlb_nr = ++nr;
+	tb->vaddrs[nr] = vaddr;
+	tb->tlb_nr = ++nr;
 	if (nr >= TLB_BATCH_NR)
 		flush_tlb_pending();
+
+	put_cpu_var(tlb_batch);
 }
Index: linux-2.6/arch/sparc/mm/tsb.c
===================================================================
--- linux-2.6.orig/arch/sparc/mm/tsb.c
+++ linux-2.6/arch/sparc/mm/tsb.c
@@ -47,12 +47,13 @@ void flush_tsb_kernel_range(unsigned lon
 	}
 }
 
-static void __flush_tsb_one(struct mmu_gather *mp, unsigned long hash_shift, unsigned long tsb, unsigned long nentries)
+static void __flush_tsb_one(struct tlb_batch *tb, unsigned long hash_shift,
+			    unsigned long tsb, unsigned long nentries)
 {
 	unsigned long i;
 
-	for (i = 0; i < mp->tlb_nr; i++) {
-		unsigned long v = mp->vaddrs[i];
+	for (i = 0; i < tb->tlb_nr; i++) {
+		unsigned long v = tb->vaddrs[i];
 		unsigned long tag, ent, hash;
 
 		v &= ~0x1UL;
@@ -65,9 +66,9 @@ static void __flush_tsb_one(struct mmu_g
 	}
 }
 
-void flush_tsb_user(struct mmu_gather *mp)
+void flush_tsb_user(struct tlb_batch *tb)
 {
-	struct mm_struct *mm = mp->mm;
+	struct mm_struct *mm = tb->mm;
 	unsigned long nentries, base, flags;
 
 	spin_lock_irqsave(&mm->context.lock, flags);
@@ -76,7 +77,7 @@ void flush_tsb_user(struct mmu_gather *m
 	nentries = mm->context.tsb_block[MM_TSB_BASE].tsb_nentries;
 	if (tlb_type == cheetah_plus || tlb_type == hypervisor)
 		base = __pa(base);
-	__flush_tsb_one(mp, PAGE_SHIFT, base, nentries);
+	__flush_tsb_one(tb, PAGE_SHIFT, base, nentries);
 
 #ifdef CONFIG_HUGETLB_PAGE
 	if (mm->context.tsb_block[MM_TSB_HUGE].tsb) {
@@ -84,7 +85,7 @@ void flush_tsb_user(struct mmu_gather *m
 		nentries = mm->context.tsb_block[MM_TSB_HUGE].tsb_nentries;
 		if (tlb_type == cheetah_plus || tlb_type == hypervisor)
 			base = __pa(base);
-		__flush_tsb_one(mp, HPAGE_SHIFT, base, nentries);
+		__flush_tsb_one(tb, HPAGE_SHIFT, base, nentries);
 	}
 #endif
 	spin_unlock_irqrestore(&mm->context.lock, flags);
Index: linux-2.6/arch/sparc/include/asm/pgtable_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/pgtable_64.h
+++ linux-2.6/arch/sparc/include/asm/pgtable_64.h
@@ -655,9 +655,11 @@ static inline int pte_special(pte_t pte)
 #define pte_unmap(pte)			do { } while (0)
 
 /* Actual page table PTE updates.  */
-extern void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr, pte_t *ptep, pte_t orig);
+extern void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
+			  pte_t *ptep, pte_t orig, int fullmm);
 
-static inline void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte)
+static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
+			     pte_t *ptep, pte_t pte, int fullmm)
 {
 	pte_t orig = *ptep;
 
@@ -670,12 +672,19 @@ static inline void set_pte_at(struct mm_
 	 *             and SUN4V pte layout, so this inline test is fine.
 	 */
 	if (likely(mm != &init_mm) && (pte_val(orig) & _PAGE_VALID))
-		tlb_batch_add(mm, addr, ptep, orig);
+		tlb_batch_add(mm, addr, ptep, orig, fullmm);
 }
 
+#define set_pte_at(mm,addr,ptep,pte)	\
+	__set_pte_at((mm), (addr), (ptep), (pte), 0)
+
 #define pte_clear(mm,addr,ptep)		\
 	set_pte_at((mm), (addr), (ptep), __pte(0UL))
 
+#define __HAVE_ARCH_PTE_CLEAR_NOT_PRESENT_FULL
+#define pte_clear_not_present_full(mm,addr,ptep,fullmm)	\
+	__set_pte_at((mm), (addr), (ptep), __pte(0UL), (fullmm))
+
 #ifdef DCACHE_ALIASING_POSSIBLE
 #define __HAVE_ARCH_MOVE_PTE
 #define move_pte(pte, prot, old_addr, new_addr)				\
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * [PATCH 04/17] sparc: mmu_gather rework
  2011-02-17 16:23 ` [PATCH 04/17] sparc: " Peter Zijlstra
@ 2011-02-17 16:23   ` Peter Zijlstra
  0 siblings, 0 replies; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang
[-- Attachment #1: peter_zijlstra-sparc-preemptible_mmu_gather.patch --]
[-- Type: text/plain, Size: 10575 bytes --]
Rework the sparc mmu_gather usage to conform to the new world order :-)
Sparc mmu_gather does two things:
 - tracks vaddrs to unhash
 - tracks pages to free
Split these two things like powerpc has done and keep the vaddrs
in per-cpu data structures and flush them on context switch.
The remaining bits can then use the generic mmu_gather.
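
As a rough illustration of the vaddr-tracking half described above, here is a
minimal user-space sketch. It is not the sparc code: struct batch, batch_add(),
batch_flush(), the flushes counter and the tiny BATCH_NR are all hypothetical
stand-ins, and the real per-cpu handling and cache flushing are left out.

```c
#include <assert.h>

#define BATCH_NR 4	/* hypothetical; the patch uses TLB_BATCH_NR = 192 */

struct batch {
	const void *mm;			/* address space the queued vaddrs belong to */
	unsigned long nr;		/* number of queued vaddrs */
	unsigned long vaddrs[BATCH_NR];
	unsigned long flushes;		/* demo counter only, not in the patch */
};

/* Stand-in for flush_tlb_pending(): drop everything that is queued. */
static void batch_flush(struct batch *b)
{
	if (b->nr) {
		b->flushes++;
		b->nr = 0;
	}
}

/* Stand-in for tlb_batch_add(): queue one vaddr, flushing when needed. */
static void batch_add(struct batch *b, const void *mm, unsigned long vaddr,
		      int fullmm)
{
	if (fullmm)
		return;			/* whole mm goes away, nothing to track */

	if (b->nr != 0 && mm != b->mm)
		batch_flush(b);		/* never mix two address spaces */

	if (b->nr == 0)
		b->mm = mm;

	assert(b->nr < BATCH_NR);	/* a full batch was flushed already */
	b->vaddrs[b->nr++] = vaddr;
	if (b->nr >= BATCH_NR)
		batch_flush(b);
}
```

The point of the split is that a batch like this needs no page array, so page
freeing can be left to the generic mmu_gather.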
Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/sparc/include/asm/pgalloc_64.h  |    3 +
 arch/sparc/include/asm/pgtable_64.h  |   15 ++++-
 arch/sparc/include/asm/tlb_64.h      |   91 ++---------------------------------
 arch/sparc/include/asm/tlbflush_64.h |   12 +++-
 arch/sparc/mm/tlb.c                  |   43 +++++++++-------
 arch/sparc/mm/tsb.c                  |   15 +++--
 6 files changed, 63 insertions(+), 116 deletions(-)
Index: linux-2.6/arch/sparc/include/asm/pgalloc_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/pgalloc_64.h
+++ linux-2.6/arch/sparc/include/asm/pgalloc_64.h
@@ -78,4 +78,7 @@ static inline void check_pgt_cache(void)
 	quicklist_trim(0, NULL, 25, 16);
 }
 
+#define __pte_free_tlb(tlb, pte, addr)	pte_free((tlb)->mm, pte)
+#define __pmd_free_tlb(tlb, pmd, addr)	pmd_free((tlb)->mm, pmd)
+
 #endif /* _SPARC64_PGALLOC_H */
Index: linux-2.6/arch/sparc/include/asm/tlb_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/tlb_64.h
+++ linux-2.6/arch/sparc/include/asm/tlb_64.h
@@ -7,66 +7,11 @@
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
 
-#define TLB_BATCH_NR	192
-
-/*
- * For UP we don't need to worry about TLB flush
- * and page free order so much..
- */
-#ifdef CONFIG_SMP
-  #define FREE_PTE_NR	506
-  #define tlb_fast_mode(bp) ((bp)->pages_nr == ~0U)
-#else
-  #define FREE_PTE_NR	1
-  #define tlb_fast_mode(bp) 1
-#endif
-
-struct mmu_gather {
-	struct mm_struct *mm;
-	unsigned int pages_nr;
-	unsigned int need_flush;
-	unsigned int fullmm;
-	unsigned int tlb_nr;
-	unsigned long vaddrs[TLB_BATCH_NR];
-	struct page *pages[FREE_PTE_NR];
-};
-
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 #ifdef CONFIG_SMP
 extern void smp_flush_tlb_pending(struct mm_struct *,
 				  unsigned long, unsigned long *);
 #endif
 
-extern void __flush_tlb_pending(unsigned long, unsigned long, unsigned long *);
-extern void flush_tlb_pending(void);
-
-static inline struct mmu_gather *tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
-{
-	struct mmu_gather *mp = &get_cpu_var(mmu_gathers);
-
-	BUG_ON(mp->tlb_nr);
-
-	mp->mm = mm;
-	mp->pages_nr = num_online_cpus() > 1 ? 0U : ~0U;
-	mp->fullmm = full_mm_flush;
-
-	return mp;
-}
-
-
-static inline void tlb_flush_mmu(struct mmu_gather *mp)
-{
-	if (!mp->fullmm)
-		flush_tlb_pending();
-	if (mp->need_flush) {
-		free_pages_and_swap_cache(mp->pages, mp->pages_nr);
-		mp->pages_nr = 0;
-		mp->need_flush = 0;
-	}
-
-}
-
 #ifdef CONFIG_SMP
 extern void smp_flush_tlb_mm(struct mm_struct *mm);
 #define do_flush_tlb_mm(mm) smp_flush_tlb_mm(mm)
@@ -74,38 +19,14 @@ extern void smp_flush_tlb_mm(struct mm_s
 #define do_flush_tlb_mm(mm) __flush_tlb_mm(CTX_HWBITS(mm->context), SECONDARY_CONTEXT)
 #endif
 
-static inline void tlb_finish_mmu(struct mmu_gather *mp, unsigned long start, unsigned long end)
-{
-	tlb_flush_mmu(mp);
-
-	if (mp->fullmm)
-		mp->fullmm = 0;
-
-	/* keep the page table cache within bounds */
-	check_pgt_cache();
-
-	put_cpu_var(mmu_gathers);
-}
-
-static inline void tlb_remove_page(struct mmu_gather *mp, struct page *page)
-{
-	if (tlb_fast_mode(mp)) {
-		free_page_and_swap_cache(page);
-		return;
-	}
-	mp->need_flush = 1;
-	mp->pages[mp->pages_nr++] = page;
-	if (mp->pages_nr >= FREE_PTE_NR)
-		tlb_flush_mmu(mp);
-}
-
-#define tlb_remove_tlb_entry(mp,ptep,addr) do { } while (0)
-#define pte_free_tlb(mp, ptepage, addr) pte_free((mp)->mm, ptepage)
-#define pmd_free_tlb(mp, pmdp, addr) pmd_free((mp)->mm, pmdp)
-#define pud_free_tlb(tlb,pudp, addr) __pud_free_tlb(tlb,pudp,addr)
+extern void __flush_tlb_pending(unsigned long, unsigned long, unsigned long *);
+extern void flush_tlb_pending(void);
 
-#define tlb_migrate_finish(mm)	do { } while (0)
 #define tlb_start_vma(tlb, vma) do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
+#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
+#define tlb_flush(tlb)	flush_tlb_pending()
+
+#include <asm-generic/tlb.h>
 
 #endif /* _SPARC64_TLB_H */
Index: linux-2.6/arch/sparc/include/asm/tlbflush_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/tlbflush_64.h
+++ linux-2.6/arch/sparc/include/asm/tlbflush_64.h
@@ -5,9 +5,17 @@
 #include <asm/mmu_context.h>
 
 /* TSB flush operations. */
-struct mmu_gather;
+
+#define TLB_BATCH_NR	192
+
+struct tlb_batch {
+	struct mm_struct *mm;
+	unsigned long tlb_nr;
+	unsigned long vaddrs[TLB_BATCH_NR];
+};
+
 extern void flush_tsb_kernel_range(unsigned long start, unsigned long end);
-extern void flush_tsb_user(struct mmu_gather *mp);
+extern void flush_tsb_user(struct tlb_batch *tb);
 
 /* TLB flush operations. */
 
Index: linux-2.6/arch/sparc/mm/tlb.c
===================================================================
--- linux-2.6.orig/arch/sparc/mm/tlb.c
+++ linux-2.6/arch/sparc/mm/tlb.c
@@ -19,33 +19,34 @@
 
 /* Heavily inspired by the ppc64 code.  */
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
+static DEFINE_PER_CPU(struct tlb_batch, tlb_batch);
 
 void flush_tlb_pending(void)
 {
-	struct mmu_gather *mp = &get_cpu_var(mmu_gathers);
+	struct tlb_batch *tb = &get_cpu_var(tlb_batch);
 
-	if (mp->tlb_nr) {
-		flush_tsb_user(mp);
+	if (tb->tlb_nr) {
+		flush_tsb_user(tb);
 
-		if (CTX_VALID(mp->mm->context)) {
+		if (CTX_VALID(tb->mm->context)) {
 #ifdef CONFIG_SMP
-			smp_flush_tlb_pending(mp->mm, mp->tlb_nr,
-					      &mp->vaddrs[0]);
+			smp_flush_tlb_pending(tb->mm, tb->tlb_nr,
+					      &tb->vaddrs[0]);
 #else
-			__flush_tlb_pending(CTX_HWBITS(mp->mm->context),
-					    mp->tlb_nr, &mp->vaddrs[0]);
+			__flush_tlb_pending(CTX_HWBITS(tb->mm->context),
+					    tb->tlb_nr, &tb->vaddrs[0]);
 #endif
 		}
-		mp->tlb_nr = 0;
+		tb->tlb_nr = 0;
 	}
 
-	put_cpu_var(mmu_gathers);
+	put_cpu_var(tlb_batch);
 }
 
-void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr, pte_t *ptep, pte_t orig)
+void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
+		   pte_t *ptep, pte_t orig, int fullmm)
 {
-	struct mmu_gather *mp = &__get_cpu_var(mmu_gathers);
+	struct tlb_batch *tb = &get_cpu_var(tlb_batch);
 	unsigned long nr;
 
 	vaddr &= PAGE_MASK;
@@ -77,21 +78,25 @@ void tlb_batch_add(struct mm_struct *mm,
 
 no_cache_flush:
 
-	if (mp->fullmm)
+	if (fullmm) {
+		put_cpu_var(tlb_batch);
 		return;
+	}
 
-	nr = mp->tlb_nr;
+	nr = tb->tlb_nr;
 
-	if (unlikely(nr != 0 && mm != mp->mm)) {
+	if (unlikely(nr != 0 && mm != tb->mm)) {
 		flush_tlb_pending();
 		nr = 0;
 	}
 
 	if (nr == 0)
-		mp->mm = mm;
+		tb->mm = mm;
 
-	mp->vaddrs[nr] = vaddr;
-	mp->tlb_nr = ++nr;
+	tb->vaddrs[nr] = vaddr;
+	tb->tlb_nr = ++nr;
 	if (nr >= TLB_BATCH_NR)
 		flush_tlb_pending();
+
+	put_cpu_var(tlb_batch);
 }
Index: linux-2.6/arch/sparc/mm/tsb.c
===================================================================
--- linux-2.6.orig/arch/sparc/mm/tsb.c
+++ linux-2.6/arch/sparc/mm/tsb.c
@@ -47,12 +47,13 @@ void flush_tsb_kernel_range(unsigned lon
 	}
 }
 
-static void __flush_tsb_one(struct mmu_gather *mp, unsigned long hash_shift, unsigned long tsb, unsigned long nentries)
+static void __flush_tsb_one(struct tlb_batch *tb, unsigned long hash_shift,
+			    unsigned long tsb, unsigned long nentries)
 {
 	unsigned long i;
 
-	for (i = 0; i < mp->tlb_nr; i++) {
-		unsigned long v = mp->vaddrs[i];
+	for (i = 0; i < tb->tlb_nr; i++) {
+		unsigned long v = tb->vaddrs[i];
 		unsigned long tag, ent, hash;
 
 		v &= ~0x1UL;
@@ -65,9 +66,9 @@ static void __flush_tsb_one(struct mmu_g
 	}
 }
 
-void flush_tsb_user(struct mmu_gather *mp)
+void flush_tsb_user(struct tlb_batch *tb)
 {
-	struct mm_struct *mm = mp->mm;
+	struct mm_struct *mm = tb->mm;
 	unsigned long nentries, base, flags;
 
 	spin_lock_irqsave(&mm->context.lock, flags);
@@ -76,7 +77,7 @@ void flush_tsb_user(struct mmu_gather *m
 	nentries = mm->context.tsb_block[MM_TSB_BASE].tsb_nentries;
 	if (tlb_type == cheetah_plus || tlb_type == hypervisor)
 		base = __pa(base);
-	__flush_tsb_one(mp, PAGE_SHIFT, base, nentries);
+	__flush_tsb_one(tb, PAGE_SHIFT, base, nentries);
 
 #ifdef CONFIG_HUGETLB_PAGE
 	if (mm->context.tsb_block[MM_TSB_HUGE].tsb) {
@@ -84,7 +85,7 @@ void flush_tsb_user(struct mmu_gather *m
 		nentries = mm->context.tsb_block[MM_TSB_HUGE].tsb_nentries;
 		if (tlb_type == cheetah_plus || tlb_type == hypervisor)
 			base = __pa(base);
-		__flush_tsb_one(mp, HPAGE_SHIFT, base, nentries);
+		__flush_tsb_one(tb, HPAGE_SHIFT, base, nentries);
 	}
 #endif
 	spin_unlock_irqrestore(&mm->context.lock, flags);
Index: linux-2.6/arch/sparc/include/asm/pgtable_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/pgtable_64.h
+++ linux-2.6/arch/sparc/include/asm/pgtable_64.h
@@ -655,9 +655,11 @@ static inline int pte_special(pte_t pte)
 #define pte_unmap(pte)			do { } while (0)
 
 /* Actual page table PTE updates.  */
-extern void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr, pte_t *ptep, pte_t orig);
+extern void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
+			  pte_t *ptep, pte_t orig, int fullmm);
 
-static inline void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte)
+static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
+			     pte_t *ptep, pte_t pte, int fullmm)
 {
 	pte_t orig = *ptep;
 
@@ -670,12 +672,19 @@ static inline void set_pte_at(struct mm_
 	 *             and SUN4V pte layout, so this inline test is fine.
 	 */
 	if (likely(mm != &init_mm) && (pte_val(orig) & _PAGE_VALID))
-		tlb_batch_add(mm, addr, ptep, orig);
+		tlb_batch_add(mm, addr, ptep, orig, fullmm);
 }
 
+#define set_pte_at(mm,addr,ptep,pte)	\
+	__set_pte_at((mm), (addr), (ptep), (pte), 0)
+
 #define pte_clear(mm,addr,ptep)		\
 	set_pte_at((mm), (addr), (ptep), __pte(0UL))
 
+#define __HAVE_ARCH_PTE_CLEAR_NOT_PRESENT_FULL
+#define pte_clear_not_present_full(mm,addr,ptep,fullmm)	\
+	__set_pte_at((mm), (addr), (ptep), __pte(0UL), (fullmm))
+
 #ifdef DCACHE_ALIASING_POSSIBLE
 #define __HAVE_ARCH_MOVE_PTE
 #define move_pte(pte, prot, old_addr, new_addr)				\
^ permalink raw reply	[flat|nested] 90+ messages in thread
 
- * [PATCH 05/17] s390: mmu_gather rework
  2011-02-17 16:23 [PATCH 00/17] mm: mmu_gather rework Peter Zijlstra
                   ` (4 preceding siblings ...)
  2011-02-17 16:23 ` [PATCH 04/17] sparc: " Peter Zijlstra
@ 2011-02-17 16:23 ` Peter Zijlstra
  2011-02-17 16:23   ` Peter Zijlstra
  2011-02-17 16:23 ` [PATCH 06/17] arm: " Peter Zijlstra
                   ` (13 subsequent siblings)
  19 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Martin Schwidefsky
[-- Attachment #1: martin_schwidefsky-s390-preemptible_mmu_gather.patch --]
[-- Type: text/plain, Size: 4429 bytes --]
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
Adapt the stand-alone s390 mmu_gather implementation to the new
preemptible mmu_gather interface.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101126145410.881573395@chello.nl>
---
 arch/s390/include/asm/tlb.h |   60 ++++++++++++++++++++++++++------------------
 1 file changed, 36 insertions(+), 24 deletions(-)
Index: linux-2.6/arch/s390/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/s390/include/asm/tlb.h
+++ linux-2.6/arch/s390/include/asm/tlb.h
@@ -29,58 +29,64 @@
 #include <asm/smp.h>
 #include <asm/tlbflush.h>
 
-#ifndef CONFIG_SMP
-#define TLB_NR_PTRS	1
-#else
-#define TLB_NR_PTRS	508
-#endif
-
 struct mmu_gather {
 	struct mm_struct *mm;
 	unsigned int fullmm;
 	unsigned int nr_ptes;
 	unsigned int nr_pxds;
-	void *array[TLB_NR_PTRS];
+	unsigned int max;
+	void **array;
+	void *local[8];
 };
 
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
-static inline struct mmu_gather *tlb_gather_mmu(struct mm_struct *mm,
-						unsigned int full_mm_flush)
+static inline void __tlb_alloc_page(struct mmu_gather *tlb)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
+	unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
 
+	if (addr) {
+		tlb->array = (void *) addr;
+		tlb->max = PAGE_SIZE / sizeof(void *);
+	}
+}
+
+static inline void tlb_gather_mmu(struct mmu_gather *tlb,
+				  struct mm_struct *mm,
+				  unsigned int full_mm_flush)
+{
 	tlb->mm = mm;
+	tlb->max = ARRAY_SIZE(tlb->local);
+	tlb->array = tlb->local;
 	tlb->fullmm = full_mm_flush;
-	tlb->nr_ptes = 0;
-	tlb->nr_pxds = TLB_NR_PTRS;
 	if (tlb->fullmm)
 		__tlb_flush_mm(mm);
-	return tlb;
+	else
+		__tlb_alloc_page(tlb);
+	tlb->nr_ptes = 0;
+	tlb->nr_pxds = tlb->max;
 }
 
-static inline void tlb_flush_mmu(struct mmu_gather *tlb,
-				 unsigned long start, unsigned long end)
+static inline void tlb_flush_mmu(struct mmu_gather *tlb)
 {
-	if (!tlb->fullmm && (tlb->nr_ptes > 0 || tlb->nr_pxds < TLB_NR_PTRS))
+	if (!tlb->fullmm && (tlb->nr_ptes > 0 || tlb->nr_pxds < tlb->max))
 		__tlb_flush_mm(tlb->mm);
 	while (tlb->nr_ptes > 0)
 		page_table_free_rcu(tlb->mm, tlb->array[--tlb->nr_ptes]);
-	while (tlb->nr_pxds < TLB_NR_PTRS)
+	while (tlb->nr_pxds < tlb->max)
 		crst_table_free_rcu(tlb->mm, tlb->array[tlb->nr_pxds++]);
 }
 
 static inline void tlb_finish_mmu(struct mmu_gather *tlb,
 				  unsigned long start, unsigned long end)
 {
-	tlb_flush_mmu(tlb, start, end);
+	tlb_flush_mmu(tlb);
 
 	rcu_table_freelist_finish();
 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
-	put_cpu_var(mmu_gathers);
+	if (tlb->array != tlb->local)
+		free_pages((unsigned long) tlb->array, 0);
 }
 
 /*
@@ -88,6 +94,12 @@ static inline void tlb_finish_mmu(struct
  * tlb_ptep_clear_flush. In both flush modes the tlb fo a page cache page
  * has already been freed, so just do free_page_and_swap_cache.
  */
+static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+{
+	free_page_and_swap_cache(page);
+	return 0;
+}
+
 static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 {
 	free_page_and_swap_cache(page);
@@ -103,7 +115,7 @@ static inline void pte_free_tlb(struct m
 	if (!tlb->fullmm) {
 		tlb->array[tlb->nr_ptes++] = pte;
 		if (tlb->nr_ptes >= tlb->nr_pxds)
-			tlb_flush_mmu(tlb, 0, 0);
+			tlb_flush_mmu(tlb);
 	} else
 		page_table_free(tlb->mm, (unsigned long *) pte);
 }
@@ -124,7 +136,7 @@ static inline void pmd_free_tlb(struct m
 	if (!tlb->fullmm) {
 		tlb->array[--tlb->nr_pxds] = pmd;
 		if (tlb->nr_ptes >= tlb->nr_pxds)
-			tlb_flush_mmu(tlb, 0, 0);
+			tlb_flush_mmu(tlb);
 	} else
 		crst_table_free(tlb->mm, (unsigned long *) pmd);
 #endif
@@ -146,7 +158,7 @@ static inline void pud_free_tlb(struct m
 	if (!tlb->fullmm) {
 		tlb->array[--tlb->nr_pxds] = pud;
 		if (tlb->nr_ptes >= tlb->nr_pxds)
-			tlb_flush_mmu(tlb, 0, 0);
+			tlb_flush_mmu(tlb);
 	} else
 		crst_table_free(tlb->mm, (unsigned long *) pud);
 #endif
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * [PATCH 05/17] s390: mmu_gather rework
  2011-02-17 16:23 ` [PATCH 05/17] s390: " Peter Zijlstra
@ 2011-02-17 16:23   ` Peter Zijlstra
  0 siblings, 0 replies; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Martin Schwidefsky
[-- Attachment #1: martin_schwidefsky-s390-preemptible_mmu_gather.patch --]
[-- Type: text/plain, Size: 4126 bytes --]
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
Adapt the stand-alone s390 mmu_gather implementation to the new
preemptible mmu_gather interface.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20101126145410.881573395@chello.nl>
---
 arch/s390/include/asm/tlb.h |   60 ++++++++++++++++++++++++++------------------
 1 file changed, 36 insertions(+), 24 deletions(-)
Index: linux-2.6/arch/s390/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/s390/include/asm/tlb.h
+++ linux-2.6/arch/s390/include/asm/tlb.h
@@ -29,58 +29,64 @@
 #include <asm/smp.h>
 #include <asm/tlbflush.h>
 
-#ifndef CONFIG_SMP
-#define TLB_NR_PTRS	1
-#else
-#define TLB_NR_PTRS	508
-#endif
-
 struct mmu_gather {
 	struct mm_struct *mm;
 	unsigned int fullmm;
 	unsigned int nr_ptes;
 	unsigned int nr_pxds;
-	void *array[TLB_NR_PTRS];
+	unsigned int max;
+	void **array;
+	void *local[8];
 };
 
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
-static inline struct mmu_gather *tlb_gather_mmu(struct mm_struct *mm,
-						unsigned int full_mm_flush)
+static inline void __tlb_alloc_page(struct mmu_gather *tlb)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
+	unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
 
+	if (addr) {
+		tlb->array = (void *) addr;
+		tlb->max = PAGE_SIZE / sizeof(void *);
+	}
+}
+
+static inline void tlb_gather_mmu(struct mmu_gather *tlb,
+				  struct mm_struct *mm,
+				  unsigned int full_mm_flush)
+{
 	tlb->mm = mm;
+	tlb->max = ARRAY_SIZE(tlb->local);
+	tlb->array = tlb->local;
 	tlb->fullmm = full_mm_flush;
-	tlb->nr_ptes = 0;
-	tlb->nr_pxds = TLB_NR_PTRS;
 	if (tlb->fullmm)
 		__tlb_flush_mm(mm);
-	return tlb;
+	else
+		__tlb_alloc_page(tlb);
+	tlb->nr_ptes = 0;
+	tlb->nr_pxds = tlb->max;
 }
 
-static inline void tlb_flush_mmu(struct mmu_gather *tlb,
-				 unsigned long start, unsigned long end)
+static inline void tlb_flush_mmu(struct mmu_gather *tlb)
 {
-	if (!tlb->fullmm && (tlb->nr_ptes > 0 || tlb->nr_pxds < TLB_NR_PTRS))
+	if (!tlb->fullmm && (tlb->nr_ptes > 0 || tlb->nr_pxds < tlb->max))
 		__tlb_flush_mm(tlb->mm);
 	while (tlb->nr_ptes > 0)
 		page_table_free_rcu(tlb->mm, tlb->array[--tlb->nr_ptes]);
-	while (tlb->nr_pxds < TLB_NR_PTRS)
+	while (tlb->nr_pxds < tlb->max)
 		crst_table_free_rcu(tlb->mm, tlb->array[tlb->nr_pxds++]);
 }
 
 static inline void tlb_finish_mmu(struct mmu_gather *tlb,
 				  unsigned long start, unsigned long end)
 {
-	tlb_flush_mmu(tlb, start, end);
+	tlb_flush_mmu(tlb);
 
 	rcu_table_freelist_finish();
 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
-	put_cpu_var(mmu_gathers);
+	if (tlb->array != tlb->local)
+		free_pages((unsigned long) tlb->array, 0);
 }
 
 /*
@@ -88,6 +94,12 @@ static inline void tlb_finish_mmu(struct
  * tlb_ptep_clear_flush. In both flush modes the tlb fo a page cache page
  * has already been freed, so just do free_page_and_swap_cache.
  */
+static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+{
+	free_page_and_swap_cache(page);
+	return 0;
+}
+
 static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 {
 	free_page_and_swap_cache(page);
@@ -103,7 +115,7 @@ static inline void pte_free_tlb(struct m
 	if (!tlb->fullmm) {
 		tlb->array[tlb->nr_ptes++] = pte;
 		if (tlb->nr_ptes >= tlb->nr_pxds)
-			tlb_flush_mmu(tlb, 0, 0);
+			tlb_flush_mmu(tlb);
 	} else
 		page_table_free(tlb->mm, (unsigned long *) pte);
 }
@@ -124,7 +136,7 @@ static inline void pmd_free_tlb(struct m
 	if (!tlb->fullmm) {
 		tlb->array[--tlb->nr_pxds] = pmd;
 		if (tlb->nr_ptes >= tlb->nr_pxds)
-			tlb_flush_mmu(tlb, 0, 0);
+			tlb_flush_mmu(tlb);
 	} else
 		crst_table_free(tlb->mm, (unsigned long *) pmd);
 #endif
@@ -146,7 +158,7 @@ static inline void pud_free_tlb(struct m
 	if (!tlb->fullmm) {
 		tlb->array[--tlb->nr_pxds] = pud;
 		if (tlb->nr_ptes >= tlb->nr_pxds)
-			tlb_flush_mmu(tlb, 0, 0);
+			tlb_flush_mmu(tlb);
 	} else
 		crst_table_free(tlb->mm, (unsigned long *) pud);
 #endif
^ permalink raw reply	[flat|nested] 90+ messages in thread
 
- * [PATCH 06/17] arm: mmu_gather rework
  2011-02-17 16:23 [PATCH 00/17] mm: mmu_gather rework Peter Zijlstra
                   ` (5 preceding siblings ...)
  2011-02-17 16:23 ` [PATCH 05/17] s390: " Peter Zijlstra
@ 2011-02-17 16:23 ` Peter Zijlstra
  2011-02-17 16:23   ` Peter Zijlstra
  2011-02-24 16:34   ` Peter Zijlstra
  2011-02-17 16:23 ` [PATCH 07/17] sh: " Peter Zijlstra
                   ` (12 subsequent siblings)
  19 siblings, 2 replies; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Russell King
[-- Attachment #1: peter_zijlstra-arm-preemptible_mmu_gather.patch --]
[-- Type: text/plain, Size: 2091 bytes --]
Fix up the arm mmu_gather code to conform to the new API.
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/arm/include/asm/tlb.h |   29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)
Index: linux-2.6/arch/arm/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/arm/include/asm/tlb.h
+++ linux-2.6/arch/arm/include/asm/tlb.h
@@ -40,17 +40,11 @@ struct mmu_gather {
 	unsigned long		range_end;
 };
 
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
-static inline struct mmu_gather *
-tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
+static inline void
+tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
-
 	tlb->mm = mm;
 	tlb->fullmm = full_mm_flush;
-
-	return tlb;
 }
 
 static inline void
@@ -61,8 +55,6 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
-
-	put_cpu_var(mmu_gathers);
 }
 
 /*
@@ -101,7 +93,22 @@ tlb_end_vma(struct mmu_gather *tlb, stru
 		flush_tlb_range(vma, tlb->range_start, tlb->range_end);
 }
 
-#define tlb_remove_page(tlb,page)	free_page_and_swap_cache(page)
+static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+{
+	free_page_and_swap_cache(page);
+	return 0;
+}
+
+static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+{
+	might_sleep();
+	__tlb_remove_page(tlb, page);
+}
+
+static inline void tlb_flush_mmu(struct mmu_gather *tlb)
+{
+}
+
 #define pte_free_tlb(tlb, ptep, addr)	pte_free((tlb)->mm, ptep)
 #define pmd_free_tlb(tlb, pmdp, addr)	pmd_free((tlb)->mm, pmdp)
 
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * [PATCH 06/17] arm: mmu_gather rework
  2011-02-17 16:23 ` [PATCH 06/17] arm: " Peter Zijlstra
@ 2011-02-17 16:23   ` Peter Zijlstra
  2011-02-24 16:34   ` Peter Zijlstra
  1 sibling, 0 replies; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Russell King
[-- Attachment #1: peter_zijlstra-arm-preemptible_mmu_gather.patch --]
[-- Type: text/plain, Size: 1788 bytes --]
Fix up the arm mmu_gather code to conform to the new API.
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/arm/include/asm/tlb.h |   29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)
Index: linux-2.6/arch/arm/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/arm/include/asm/tlb.h
+++ linux-2.6/arch/arm/include/asm/tlb.h
@@ -40,17 +40,11 @@ struct mmu_gather {
 	unsigned long		range_end;
 };
 
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
-static inline struct mmu_gather *
-tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
+static inline void
+tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
-
 	tlb->mm = mm;
 	tlb->fullmm = full_mm_flush;
-
-	return tlb;
 }
 
 static inline void
@@ -61,8 +55,6 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
-
-	put_cpu_var(mmu_gathers);
 }
 
 /*
@@ -101,7 +93,22 @@ tlb_end_vma(struct mmu_gather *tlb, stru
 		flush_tlb_range(vma, tlb->range_start, tlb->range_end);
 }
 
-#define tlb_remove_page(tlb,page)	free_page_and_swap_cache(page)
+static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+{
+	free_page_and_swap_cache(page);
+	return 0;
+}
+
+static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+{
+	might_sleep();
+	__tlb_remove_page(tlb, page);
+}
+
+static inline void tlb_flush_mmu(struct mmu_gather *tlb)
+{
+}
+
 #define pte_free_tlb(tlb, ptep, addr)	pte_free((tlb)->mm, ptep)
 #define pmd_free_tlb(tlb, pmdp, addr)	pmd_free((tlb)->mm, pmdp)
 
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-17 16:23 ` [PATCH 06/17] arm: " Peter Zijlstra
  2011-02-17 16:23   ` Peter Zijlstra
@ 2011-02-24 16:34   ` Peter Zijlstra
  2011-02-24 16:34     ` Peter Zijlstra
  2011-02-25 18:04     ` Peter Zijlstra
  1 sibling, 2 replies; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-24 16:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, linux-mm,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin, Paul McKenney, Yanmin Zhang, Russell King, Tony Luck,
	Paul Mundt
On Thu, 2011-02-17 at 17:23 +0100, Peter Zijlstra wrote:
> plain text document attachment
> (peter_zijlstra-arm-preemptible_mmu_gather.patch)
> Fix up the arm mmu_gather code to conform to the new API.
So akpm noted that this one doesn't apply anymore because of:
commit 06824ba824b3e9f2fedb38bee79af0643198ed7f
Author: Russell King <rmk+kernel@arm.linux.org.uk>
Date:   Sun Feb 20 12:16:45 2011 +0000
    ARM: tlb: delay page freeing for SMP and ARMv7 CPUs
    
    We need to delay freeing any mapped page on SMP and ARMv7 systems to
    ensure that the data is not accessed by other CPUs, or is used for
    speculative prefetch with ARMv7.  This includes not only mapped pages
    but also pages used for the page tables themselves.
    
    This avoids races with the MMU/other CPUs accessing pages after they've
    been freed but before we've invalidated the TLB.
    
    Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Which raises a nice point about shift_arg_pages(), which calls
free_pgd_range(). The other architectures that look similar to arm in
this respect are ia64 and sh; do they suffer from the same problem?
It doesn't look hard to fold the requirements for this into the generic
tlb range support (patch 14 in this series).
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-24 16:34   ` Peter Zijlstra
@ 2011-02-24 16:34     ` Peter Zijlstra
  2011-02-25 18:04     ` Peter Zijlstra
  1 sibling, 0 replies; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-24 16:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, linux-mm,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin, Paul McKenney, Yanmin Zhang, Russell King, Tony Luck,
	Paul Mundt
On Thu, 2011-02-17 at 17:23 +0100, Peter Zijlstra wrote:
> plain text document attachment
> (peter_zijlstra-arm-preemptible_mmu_gather.patch)
> Fix up the arm mmu_gather code to conform to the new API.
So akpm noted that this one doesn't apply anymore because of:
commit 06824ba824b3e9f2fedb38bee79af0643198ed7f
Author: Russell King <rmk+kernel@arm.linux.org.uk>
Date:   Sun Feb 20 12:16:45 2011 +0000
    ARM: tlb: delay page freeing for SMP and ARMv7 CPUs
    
    We need to delay freeing any mapped page on SMP and ARMv7 systems to
    ensure that the data is not accessed by other CPUs, or is used for
    speculative prefetch with ARMv7.  This includes not only mapped pages
    but also pages used for the page tables themselves.
    
    This avoids races with the MMU/other CPUs accessing pages after they've
    been freed but before we've invalidated the TLB.
    
    Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Which raises a nice point about shift_arg_pages(), which calls
free_pgd_range(). The other architectures that look similar to arm in
this respect are ia64 and sh; do they suffer from the same problem?
It doesn't look hard to fold the requirements for this into the generic
tlb range support (patch 14 in this series).
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-24 16:34   ` Peter Zijlstra
  2011-02-24 16:34     ` Peter Zijlstra
@ 2011-02-25 18:04     ` Peter Zijlstra
  2011-02-25 18:04       ` Peter Zijlstra
                         ` (2 more replies)
  1 sibling, 3 replies; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-25 18:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, linux-mm,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin, Paul McKenney, Yanmin Zhang, Russell King, Tony Luck,
	Paul Mundt
On Thu, 2011-02-24 at 17:34 +0100, Peter Zijlstra wrote:
> On Thu, 2011-02-17 at 17:23 +0100, Peter Zijlstra wrote:
> > plain text document attachment
> > (peter_zijlstra-arm-preemptible_mmu_gather.patch)
> > Fix up the arm mmu_gather code to conform to the new API.
> 
> So akpm noted that this one doesn't apply anymore because of:
> 
> commit 06824ba824b3e9f2fedb38bee79af0643198ed7f
> Author: Russell King <rmk+kernel@arm.linux.org.uk>
> Date:   Sun Feb 20 12:16:45 2011 +0000
> 
>     ARM: tlb: delay page freeing for SMP and ARMv7 CPUs
>     
>     We need to delay freeing any mapped page on SMP and ARMv7 systems to
>     ensure that the data is not accessed by other CPUs, or is used for
>     speculative prefetch with ARMv7.  This includes not only mapped pages
>     but also pages used for the page tables themselves.
>     
>     This avoids races with the MMU/other CPUs accessing pages after they've
>     been freed but before we've invalidated the TLB.
>     
>     Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
> 
> 
> Which raises a nice point about shift_arg_pages() which calls
> free_pgd_range(), the other architectures that look similar to arm in
> this respect are ia64 and sh, do they suffer the same problem?
> 
> It doesn't look hard to fold the requirements for this into the generic
> tlb range support (patch 14 in this series).
It looks like both ia64 and sh do indeed suffer there.
I've pulled my generic range tracking to the head of the series so that
I can convert ARM, IA64 and SH to generic tlb solving it for those.
Russell, generic tlb doesn't look to need the extra logic you added for
the fs/exec.c case, but please double check the patches when I post
them.
In short, tlb_end_vma() will call flush_tlb_range() on the tracked range
and clear ->need_flush, so things like zap_page_range() will not then
also call tlb_flush().
In case of shift_arg_pages() and unmap_region() however we first call
free_pgtables() which might end up calling p??_free_tlb() which will
then set ->need_flush, and tlb_finish_mmu() will then end up calling
tlb_flush().
I'm not quite sure why you chose to add range tracking on
pte_free_tlb(), the only affected code path seems to be unmap_region()
where you'll use a flush_tlb_range(), but it's buggy, the pte_free_tlb()
range is much larger than 1 page, and if you do it there you also need
it for all the other p??_free_tlb() functions.
The tlb flush after freeing page-tables is needed for things like
gup_fast() which needs to sync against them being freed.
So the stuff I have now will try its best to track ranges on zap_* while
clearing the page mapping, will use flush_cache_range() and
flush_tlb_range(). But when it comes to tearing down the page-tables
themselves we'll punt and use a full mm flush, which seems a waste of
all that careful range tracking by zap_*.
One possibility would be to add tlb_start/end_vma() in
unmap_page_range(), except we don't need to flush the cache again, also,
it would be nice to not have to flush on tlb_end_vma() but delay it all
to tlb_finish_mmu() where possible.
OK, let me try and hack up proper range tracking for free_*, that way I
can move the flush_tlb_range() from tlb_end_vma() and into
tlb_flush_mmu().
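To make the ->need_flush interplay described above concrete, here is a toy userspace model. It is only a sketch: every toy_* name and the 4 KiB page size are made up for illustration and none of it is the kernel's mmu_gather code. It shows that a zap followed by tlb_end_vma() leaves nothing for tlb_finish_mmu() to do, while a page-table free sets the flag again and ends in a full flush.

```c
#include <assert.h>	/* for callers that want to check the counters */
#include <stdbool.h>

#define TOY_PAGE_SIZE 4096UL	/* assumed page size */

/* Toy stand-in for struct mmu_gather; not the kernel's layout. */
struct toy_gather {
	unsigned long start, end;	/* range tracked while PTEs are cleared */
	bool need_flush;		/* something changed since the last flush */
	int range_flushes;		/* simulated flush_tlb_range() calls */
	int full_flushes;		/* simulated mm-wide tlb_flush() calls */
};

static void toy_gather_init(struct toy_gather *tlb)
{
	tlb->start = ~0UL;
	tlb->end = 0;
	tlb->need_flush = false;
	tlb->range_flushes = 0;
	tlb->full_flushes = 0;
}

/* Models tlb_remove_tlb_entry(): the PTE mapping addr was cleared. */
static void toy_remove_tlb_entry(struct toy_gather *tlb, unsigned long addr)
{
	if (addr < tlb->start)
		tlb->start = addr;
	if (addr + TOY_PAGE_SIZE > tlb->end)
		tlb->end = addr + TOY_PAGE_SIZE;
	tlb->need_flush = true;
}

/* Models tlb_end_vma(): flush the tracked range and clear need_flush. */
static void toy_end_vma(struct toy_gather *tlb)
{
	if (!tlb->need_flush)
		return;
	tlb->range_flushes++;
	tlb->start = ~0UL;
	tlb->end = 0;
	tlb->need_flush = false;
}

/* Models p??_free_tlb(): a page-table page was freed, no range recorded. */
static void toy_free_tlb(struct toy_gather *tlb)
{
	tlb->need_flush = true;
}

/* Models tlb_finish_mmu(): anything still pending gets a full flush. */
static void toy_finish_mmu(struct toy_gather *tlb)
{
	if (tlb->need_flush) {
		tlb->full_flushes++;
		tlb->need_flush = false;
	}
}
```

In this model a zap_page_range()-like sequence costs one range flush and no full flush, whereas a shift_arg_pages()-like sequence (only toy_free_tlb() followed by toy_finish_mmu()) costs one full flush.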
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-25 18:04     ` Peter Zijlstra
  2011-02-25 18:04       ` Peter Zijlstra
@ 2011-02-25 19:45       ` Peter Zijlstra
  2011-02-25 19:45         ` Peter Zijlstra
  2011-02-25 19:59         ` Hugh Dickins
  2011-02-25 21:51       ` Russell King
  2 siblings, 2 replies; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-25 19:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, linux-mm,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin, Paul McKenney, Yanmin Zhang, Russell King, Luck,Tony,
	PaulMundt
On Fri, 2011-02-25 at 19:04 +0100, Peter Zijlstra wrote:
> On Thu, 2011-02-24 at 17:34 +0100, Peter Zijlstra wrote:
> > On Thu, 2011-02-17 at 17:23 +0100, Peter Zijlstra wrote:
> > > plain text document attachment
> > > (peter_zijlstra-arm-preemptible_mmu_gather.patch)
> > > Fix up the arm mmu_gather code to conform to the new API.
> > 
> > So akpm noted that this one doesn't apply anymore because of:
> > 
> > commit 06824ba824b3e9f2fedb38bee79af0643198ed7f
> > Author: Russell King <rmk+kernel@arm.linux.org.uk>
> > Date:   Sun Feb 20 12:16:45 2011 +0000
> > 
> >     ARM: tlb: delay page freeing for SMP and ARMv7 CPUs
> >     
> >     We need to delay freeing any mapped page on SMP and ARMv7 systems to
> >     ensure that the data is not accessed by other CPUs, or is used for
> >     speculative prefetch with ARMv7.  This includes not only mapped pages
> >     but also pages used for the page tables themselves.
> >     
> >     This avoids races with the MMU/other CPUs accessing pages after they've
> >     been freed but before we've invalidated the TLB.
> >     
> >     Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
> > 
> > 
> > Which raises a nice point about shift_arg_pages() which calls
> > free_pgd_range(), the other architectures that look similar to arm in
> > this respect are ia64 and sh, do they suffer the same problem?
> > 
> > It doesn't look hard to fold the requirements for this into the generic
> > tlb range support (patch 14 in this series).
> 
> It looks like both ia64 and sh do indeed suffer there.
> 
> I've pulled my generic range tracking to the head of the series so that
> I can convert ARM, IA64 and SH to generic tlb solving it for those.
> 
> Russell, generic tlb doesn't look to need the extra logic you added for
> the fs/exec.c case, but please double check the patches when I post
> them.
> 
> In short, tlb_end_vma() will call flush_tlb_range() on the tracked range
> and clear ->need_flush, so things like zap_page_range() will not then
> also call tlb_flush().
> 
> In case of shift_arg_pages() and unmap_region() however we first call
> free_pgtables() which might end up calling p??_free_tlb() which will
> then set ->need_flush, and tlb_finish_mmu() will then end up calling
> tlb_flush().
> 
> I'm not quite sure why you chose to add range tracking on
> pte_free_tlb(), the only affected code path seems to be unmap_region()
> where you'll use a flush_tlb_range(), but it's buggy, the pte_free_tlb()
> range is much larger than 1 page, and if you do it there you also need
> it for all the other p??_free_tlb() functions.
> 
> The tlb flush after freeing page-tables is needed for things like
> gup_fast() which needs to sync against them being freed.
> 
> So the stuff I have now will try its best to track ranges on zap_* while
> clearing the page mapping, will use flush_cache_range() and
> flush_tlb_range(). But when it comes to tearing down the page-tables
> themselves we'll punt and use a full mm flush, which seems a waste of
> all that careful range tracking by zap_*.
> 
> One possibility would be to add tlb_start/end_vma() in
> unmap_page_range(), except we don't need to flush the cache again, also,
> it would be nice to not have to flush on tlb_end_vma() but delay it all
> to tlb_finish_mmu() where possible.
> 
> OK, let me try and hack up proper range tracking for free_*, that way I
> can move the flush_tlb_range() from tlb_end_vma() and into
> tlb_flush_mmu().
Grmbl.. so doing that would require flush_tlb_range() to take an mm, not
a vma, but tile and arm both use the vma->vm_flags & VM_EXEC test to avoid
flushing their i-tlbs.
I'm tempted to make them flush i-tlbs unconditionally as it's still
better than hitting an mm wide tlb flush due to the page table free.
Ideas?
^ permalink raw reply	[flat|nested] 90+ messages in thread 
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-25 19:45       ` Peter Zijlstra
  2011-02-25 19:45         ` Peter Zijlstra
@ 2011-02-25 19:59         ` Hugh Dickins
  1 sibling, 0 replies; 90+ messages in thread
From: Hugh Dickins @ 2011-02-25 19:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Mel Gorman,
	Nick Piggin, Paul McKenney, Yanmin Zhang, Russell King, Luck,Tony,
	PaulMundt
On Fri, Feb 25, 2011 at 11:45 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> Grmbl.. so doing that would require flush_tlb_range() to take an mm, not
> a vma, but tile and arm both use the vma->vm_flags & VM_EXEC test to avoid
> flushing their i-tlbs.
>
> I'm tempted to make them flush i-tlbs unconditionally as it's still
> better than hitting an mm wide tlb flush due to the page table free.
>
> Ideas?
What's wrong with using vma->vm_mm?
Hugh
^ permalink raw reply	[flat|nested] 90+ messages in thread 
 
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-25 18:04     ` Peter Zijlstra
  2011-02-25 18:04       ` Peter Zijlstra
  2011-02-25 19:45       ` Peter Zijlstra
@ 2011-02-25 21:51       ` Russell King
  2011-02-25 21:51         ` Russell King
  2011-02-28 11:44         ` Peter Zijlstra
  2 siblings, 2 replies; 90+ messages in thread
From: Russell King @ 2011-02-25 21:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang, Luck,Tony,
	PaulMundt
On Fri, Feb 25, 2011 at 07:04:43PM +0100, Peter Zijlstra wrote:
> I'm not quite sure why you chose to add range tracking on
> pte_free_tlb(), the only affected code path seems to be unmap_region()
> where you'll use a flush_tlb_range(), but it's buggy, the pte_free_tlb()
> range is much larger than 1 page, and if you do it there you also need
> it for all the other p??_free_tlb() functions.
My reasoning is to do with the way the LPAE stuff works.  For the
explanation below, I'm going to assume a 2 level page table system
for simplicity.
The first thing to realise is that if we have L2 entries, then we'll
have unmapped them first using the usual tlb shootdown interfaces.
However, when we're freeing the page tables themselves, we should
already have removed the L2 entries, so all we have are the L1 entries.
In most 'normal' processors, these aren't cached in any way.
However, with LPAE, these are cached.  I'm told that any TLB flush for an
address which is covered by the L1 entry will cause that cached entry to
be invalidated.
So really this is about getting rid of cached L1 entries, and not the
usual TLB lookaside entries that you'd come to expect.
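As a rough illustration of "an address which is covered by the L1 entry", the arithmetic can be sketched as below. The geometry is an assumption made for the example (4 KiB pages, 512 PTEs per table, so one L1 entry spans 2 MiB), and the TOY_*/toy_* names are hypothetical, not ARM or kernel definitions. It also shows why the range behind one freed PTE table is much larger than a single page.

```c
#include <assert.h>	/* for callers that want to check the results */

/* Assumed geometry, for illustration only. */
#define TOY_PAGE_SHIFT		12
#define TOY_PAGE_SIZE		(1UL << TOY_PAGE_SHIFT)
#define TOY_PTRS_PER_PTE	512UL
#define TOY_L1_SIZE		(TOY_PTRS_PER_PTE * TOY_PAGE_SIZE)	/* 2 MiB */
#define TOY_L1_MASK		(~(TOY_L1_SIZE - 1))

/* First address translated through the L1 entry that covers addr. */
static unsigned long toy_l1_start(unsigned long addr)
{
	return addr & TOY_L1_MASK;
}

/* One past the last address translated through that L1 entry. */
static unsigned long toy_l1_end(unsigned long addr)
{
	return toy_l1_start(addr) + TOY_L1_SIZE;
}

/* Nonzero if a and b are translated through the same L1 entry. */
static int toy_same_l1(unsigned long a, unsigned long b)
{
	return toy_l1_start(a) == toy_l1_start(b);
}
```

Under these assumptions, a flush of any single address inside [toy_l1_start(addr), toy_l1_end(addr)) would hit the cached L1 entry, which is the behaviour described above as hearsay ("I'm told").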
-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
^ permalink raw reply	[flat|nested] 90+ messages in thread 
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-25 21:51       ` Russell King
  2011-02-25 21:51         ` Russell King
@ 2011-02-28 11:44         ` Peter Zijlstra
  2011-02-28 11:44           ` Peter Zijlstra
                             ` (3 more replies)
  1 sibling, 4 replies; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-28 11:44 UTC (permalink / raw)
  To: Russell King
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang, Luck,Tony,
	PaulMundt, Chris Metcalf
On Fri, 2011-02-25 at 21:51 +0000, Russell King wrote:
> On Fri, Feb 25, 2011 at 07:04:43PM +0100, Peter Zijlstra wrote:
> > I'm not quite sure why you chose to add range tracking on
> > pte_free_tlb(), the only affected code path seems to be unmap_region()
> > where you'll use a flush_tlb_range(), but it's buggy, the pte_free_tlb()
> > range is much larger than 1 page, and if you do it there you also need
> > it for all the other p??_free_tlb() functions.
> 
> My reasoning is to do with the way the LPAE stuff works.  For the
> explanation below, I'm going to assume a 2 level page table system
> for simplicity.
> 
> The first thing to realise is that if we have L2 entries, then we'll
> have unmapped them first using the usual tlb shootdown interfaces.
> 
> However, when we're freeing the page tables themselves, we should
> already have removed the L2 entries, so all we have are the L1 entries.
> In most 'normal' processors, these aren't cached in any way.
> 
> However, with LPAE, these are cached.  I'm told that any TLB flush for an
> address which is covered by the L1 entry will cause that cached entry to
> be invalidated.
> 
> So really this is about getting rid of cached L1 entries, and not the
> usual TLB lookaside entries that you'd come to expect.
Right, so the normal case is:
  unmap_region()
    tlb_gather_mmu()
    unmap_vmas()
      for (; vma; vma = vma->vm_next)
        unmap_page_range()
          tlb_start_vma() -> flush cache range
          zap_*_range()
            ptep_get_and_clear_full() -> batch/track external tlbs
            tlb_remove_tlb_entry() -> batch/track external tlbs
            tlb_remove_page() -> track range/batch page
          tlb_end_vma() -> flush tlb range
 [ for architectures that have hardware page table walkers
   concurrent faults can still load the page tables ]
    free_pgtables()
      while (vma)
        unlink_*_vma()
        free_*_range()
          *_free_tlb()
    tlb_finish_mmu()
  free vmas
Now, if we want to track ranges _and_ have hardware page table walkers
(ARM seems to be one such), we must flush TLBs at tlb_end_vma() because
flush_tlb_range() requires a vma pointer (ARM and TILE actually use more
than ->vm_mm), and on tlb_finish_mmu() issue a full mm wide invalidate
because the hardware walker could have re-populated the cache after
clearing the PTEs but before freeing the page tables.
What ARM does is it retains the last vma pointer and tracks
pte_free_tlb() range and uses that in tlb_finish_mmu(), which is a tad
hacky.
Mostly because of shift_arg_pages(), where we have:
  shift_arg_pages()
    tlb_gather_mmu()
    free_*_range()
    tlb_finish_mmu()
For which ARM now punts and does a full tlb invalidate (no vma pointer).
But also because it drags along that vma pointer, which might not at all
match the range it's actually going to invalidate (and hence its vm_flags
might not accurately reflect things -- at worst more expensive than
needed).
The reason I wanted flush_tlb_range() to take an mm_struct and not the
current vm_area_struct is because we can avoid doing the
flush_tlb_range() from tlb_end_vma() and delay the thing until
tlb_finish_mmu() without having to resort to such games as above. We
could simply track the full range over all VMAs and free'd page-tables
and do one range invalidate.
ARM uses vm_flags & VM_EXEC to see if it also needs to invalidate
I-TLBs, and TILE uses VM_EXEC and VM_HUGETLB.
For the I-TLBs we could easily use
ptep_get_and_clear_full()/tlb_remove_tlb_entry() and see if any of the
cleared pte's had its executable bit set (both ARM and TILE seem to have
such a PTE bit).
I'm not sure what we can do about TILE's VM_HUGETLB usage though, if it
needs explicit flushes for huge ptes it might just have to issue
multiple tlb invalidates and do them from tlb_start_vma()/tlb_end_vma().
So my current proposal for generic mmu_gather (not fully coded yet) is
to provide a number of CONFIG_goo switches:
  CONFIG_HAVE_RCU_TABLE_FREE - for architectures that do not walk the
linux page tables in hardware (Sparc64, PowerPC, etc), and others where
TLB flushing isn't delayed by disabling IRQs (Xen, s390).
  CONFIG_HAVE_MMU_GATHER_RANGE - will track start,end ranges from
tlb_remove_tlb_entry() and p*_free_tlb() and issue
flush_tlb_range(mm,start,end) instead of mm-wide invalidates.
  CONFIG_HAVE_MMU_GATHER_ITLB - will use
ptep_get_and_clear_full()/tlb_remove_tlb_entry() to test pte_exec() and
issue flush_itlb_range(mm,start,end).
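A minimal userspace sketch of how the RANGE and ITLB switches above could fit together, assuming a gather that takes only an mm. This is not the posted patch: struct toy_mm, struct toy_gather and all toy_* functions are hypothetical, and the two counters merely stand in for flush_tlb_range(mm,start,end) and the proposed flush_itlb_range(mm,start,end).

```c
#include <assert.h>	/* for callers that want to check the counters */
#include <stdbool.h>

#define TOY_PAGE_SIZE 4096UL	/* assumed page size */

/* Toy mm that just records what would have been flushed. */
struct toy_mm {
	int tlb_range_flushes;		/* stand-in for flush_tlb_range(mm, ...) */
	int itlb_range_flushes;		/* stand-in for flush_itlb_range(mm, ...) */
	unsigned long last_start, last_end;
};

struct toy_gather {
	struct toy_mm *mm;
	unsigned long start, end;	/* union of everything seen so far */
	bool need_flush;
	bool saw_exec;			/* some cleared PTE was executable */
};

static void toy_gather_mmu(struct toy_gather *tlb, struct toy_mm *mm)
{
	tlb->mm = mm;
	tlb->start = ~0UL;
	tlb->end = 0;
	tlb->need_flush = false;
	tlb->saw_exec = false;
}

static void toy_track(struct toy_gather *tlb, unsigned long start,
		      unsigned long end)
{
	if (start < tlb->start)
		tlb->start = start;
	if (end > tlb->end)
		tlb->end = end;
	tlb->need_flush = true;
}

/* Models tlb_remove_tlb_entry() plus the pte_exec() test. */
static void toy_remove_tlb_entry(struct toy_gather *tlb, unsigned long addr,
				 bool pte_exec)
{
	toy_track(tlb, addr, addr + TOY_PAGE_SIZE);
	if (pte_exec)
		tlb->saw_exec = true;
}

/* Models p*_free_tlb(): the freed table covered [start, end). */
static void toy_free_tlb(struct toy_gather *tlb, unsigned long start,
			 unsigned long end)
{
	toy_track(tlb, start, end);
}

/* Models tlb_finish_mmu(): one range invalidate, no vma needed. */
static void toy_finish_mmu(struct toy_gather *tlb)
{
	if (!tlb->need_flush)
		return;
	tlb->mm->tlb_range_flushes++;
	tlb->mm->last_start = tlb->start;
	tlb->mm->last_end = tlb->end;
	if (tlb->saw_exec)
		tlb->mm->itlb_range_flushes++;
	toy_gather_mmu(tlb, tlb->mm);	/* reset tracking state */
}
```

The point of the design is that the executable-ness is learned from the PTEs being cleared rather than from vm_flags, so nothing in the flush path needs a vma pointer and the flush can be delayed to the end.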
Then there is the optimization s390 wants, which is to do a full mm tlb
flush for fullmm (exit_mmap()) at tlb_gather_mmu() and never again after
that, since there is guaranteed no concurrency to poke at anything.
AFAICT that should work on all architectures so we can do that
unconditionally.
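The fullmm idea can be modelled in a few lines. Again this is only a sketch with hypothetical toy_* names, assuming that "flush" can be reduced to a counter: for fullmm the single flush happens when the gather is set up, and later unmaps no longer request one.

```c
#include <assert.h>	/* for callers that want to check the counter */
#include <stdbool.h>

struct toy_gather {
	bool fullmm;		/* whole address space is going away */
	bool need_flush;
	int flushes;		/* simulated mm-wide TLB flushes */
};

/* Models tlb_gather_mmu(): for fullmm, flush once up front. */
static void toy_gather_mmu(struct toy_gather *tlb, bool fullmm)
{
	tlb->fullmm = fullmm;
	tlb->need_flush = false;
	tlb->flushes = 0;
	if (fullmm)
		tlb->flushes++;
}

/* Models clearing a PTE: only matters when others can still look. */
static void toy_remove_tlb_entry(struct toy_gather *tlb)
{
	if (!tlb->fullmm)
		tlb->need_flush = true;
}

/* Models tlb_finish_mmu(): flush only if something is still pending. */
static void toy_finish_mmu(struct toy_gather *tlb)
{
	if (tlb->need_flush) {
		tlb->flushes++;
		tlb->need_flush = false;
	}
}
```

In the fullmm case the counter stays at one no matter how many entries are removed, which is only safe under the stated premise that nothing can run concurrently against the dying mm.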
So the biggest problem with implementing the above is TILE, where we
need to figure out wth to do with its hugetlb stuff.
The second biggest problem is with ARM and TILE, where we'd need to
implement flush_itlb_range(). I've already got a patch for all other
architectures to convert flush_tlb_range() to mm_struct.
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-28 11:44         ` Peter Zijlstra
  2011-02-28 11:44           ` Peter Zijlstra
@ 2011-02-28 11:59           ` Russell King
  2011-02-28 11:59             ` Russell King
                               ` (3 more replies)
  2011-02-28 14:18           ` Peter Zijlstra
  2011-03-01 22:05           ` Chris Metcalf
  3 siblings, 4 replies; 90+ messages in thread
From: Russell King @ 2011-02-28 11:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang, Luck,Tony,
	PaulMundt, Chris Metcalf
On Mon, Feb 28, 2011 at 12:44:47PM +0100, Peter Zijlstra wrote:
> Right, so the normal case is:
> 
>   unmap_region()
>     tlb_gather_mmu()
The fullmm argument is important here as it specifies the mode.
      tlb_gather_mmu(, 0)
>     unmap_vmas()
>       for (; vma; vma = vma->vm_next)
>         unmap_page_range()
>           tlb_start_vma() -> flush cache range
>           zap_*_range()
>             ptep_get_and_clear_full() -> batch/track external tlbs
>             tlb_remove_tlb_entry() -> batch/track external tlbs
>             tlb_remove_page() -> track range/batch page
>           tlb_end_vma() -> flush tlb range
       tlb_finish_mmu() -> nothing
> 
>  [ for architectures that have hardware page table walkers
>    concurrent faults can still load the page tables ]
> 
>     free_pgtables()
        tlb_gather_mmu(, 1)
>       while (vma)
>         unlink_*_vma()
>         free_*_range()
>           *_free_tlb()
>     tlb_finish_mmu()
      tlb_finish_mmu() -> flush tlb mm
> 
>   free vmas
So this is all fine.  Note that we *don't* use the range stuff here.
> Now, if we want to track ranges _and_ have hardware page table walkers
> (ARM seems to be one such), we must flush TLBs at tlb_end_vma() because
> flush_tlb_range() requires a vma pointer (ARM and TILE actually use more
> than ->vm_mm), and on tlb_finish_mmu() issue a full mm wide invalidate
> because the hardware walker could have re-populated the cache after
> clearing the PTEs but before freeing the page tables.
No.  The hardware walker won't re-populate the TLB after the page table
entries have been cleared - where would it get this information from if
not from the page tables?
> What ARM does is it retains the last vma pointer and tracks
> pte_free_tlb() range and uses that in tlb_finish_mmu(), which is a tad
> hacky.
It may be hacky but then the TLB shootdown interface is hacky too.  We
don't keep the vma around to re-use after tlb_end_vma() - if you think
that then you misunderstand what's going on.  The vma pointer is kept
around as a cheap way of allowing tlb_finish_mmu() to distinguish
between the unmap_region() mode and the shift_arg_pages() mode.
> Mostly because of shift_arg_pages(), where we have:
> 
>   shift_arg_pages()
>     tlb_gather_mmu()
      tlb_gather_mmu(, 0)
>     free_*_range()
>     tlb_finish_mmu()
      tlb_finish_mmu() does nothing without the ARM change.
      tlb_finish_mmu() -> flush_tlb_mm() with the ARM change.
And this is where the bug was - these page table entries could find
their way into the TLB and persist after they've been torn down.
> For which ARM now punts and does a full tlb invalidate (no vma pointer).
> But also because it drags along that vma pointer, which might not at all
> match the range it's actually going to invalidate (and hence its vm_flags
> might not accurately reflect things -- at worst more expensive than
> needed).
Where do you get that from?  Where exactly in the above code would the
VMA pointer get set?  In this case, it will be NULL, so we do a
flush_tlb_mm() for this case.  We have to - we don't have any VMA to
deal with at this point.
> The reason I wanted flush_tlb_range() to take an mm_struct and not the
> current vm_area_struct is because we can avoid doing the
> flush_tlb_range() from tlb_end_vma() and delay the thing until
> tlb_finish_mmu() without having to resort to such games as above. We
> could simply track the full range over all VMAs and free'd page-tables
> and do one range invalidate.
No.  That's stupid.  Consider the case where you have to loop one page
at a time over the range (we do on ARM.)  If we ended up with your
suggestion above, that means we could potentially have to loop 4K at a
time over 3GB of address space.  That's idiotic when we have an
instruction which can flush the entire TLB for a particular thread.
-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-28 11:59           ` Russell King
  2011-02-28 11:59             ` Russell King
  2011-02-28 12:06             ` Russell King
@ 2011-02-28 12:06             ` Russell King
  2011-02-28 12:25               ` Peter Zijlstra
  2011-02-28 12:20             ` Peter Zijlstra
  3 siblings, 1 reply; 90+ messages in thread
From: Russell King @ 2011-02-28 12:06 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli, Avi Kivity, Thomas Gleixner,
	Rik van Riel, Ingo Molnar, akpm, Linus Torvalds, linux-kernel,
	linux-arch, linux-mm, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Paul McKenney,
	Yanmin Zhang, Luck,Tony, PaulMundt, Chris Metcalf
On Mon, Feb 28, 2011 at 11:59:07AM +0000, Russell King wrote:
> It may be hacky but then the TLB shootdown interface is hacky too.  We
> don't keep the vma around to re-use after tlb_end_vma() - if you think
> that then you misunderstand what's going on.  The vma pointer is kept
> around as a cheap way of allowing tlb_finish_mmu() to distinguish
> between the unmap_region() mode and the shift_arg_pages() mode.
As I think I mentioned, the TLB shootdown interface either needs rewriting
from scratch as it's currently a broken design, or it needs tlb_gather_mmu()
to take a proper mode argument, rather than this useless 'fullmm' argument
which only gives half the story.
The fact is that the interface has three modes, and distinguishing between
them requires a certain amount of black magic.  Explicitly, the !fullmm
case has two modes, and it requires implementations to remember whether
tlb_start_vma() has been called before tlb_finish_mmu() or not.
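
To make the three modes concrete, the sketch below spells them out as an explicit enum derived from the two flags the ARM code carries (fullmm and byvma). The enum and both helper names are hypothetical and are not part of any kernel interface; the decision in finish_needs_flush_mm() mirrors the "fullmm || !byvma" test in the tlb_flush() of the patch that follows.

```c
#include <assert.h>

/* Hypothetical names for the three ways the gather code is used. */
enum gather_mode {
	GATHER_VMA_RANGE,	/* unmap_region(): fullmm = 0, per-vma hooks run */
	GATHER_FULL_MM,		/* exit_mmap(): fullmm = 1 */
	GATHER_NO_VMA,		/* shift_arg_pages(): fullmm = 0, no vma hooks */
};

static enum gather_mode gather_mode_of(unsigned int fullmm, unsigned int byvma)
{
	if (fullmm)
		return GATHER_FULL_MM;
	return byvma ? GATHER_VMA_RANGE : GATHER_NO_VMA;
}

/* Does the final flush have to be a whole-mm invalidate? */
static int finish_needs_flush_mm(enum gather_mode mode)
{
	return mode != GATHER_VMA_RANGE;
}
```

Passing such a mode into tlb_gather_mmu() directly would remove the need to infer it from whether tlb_start_vma() happened to run.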
Maybe this will help you understand the ARM implementation - this doesn't
change the functionality, but may make things clearer.
diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h
index 82dfe5d..73fb813 100644
--- a/arch/arm/include/asm/tlb.h
+++ b/arch/arm/include/asm/tlb.h
@@ -54,7 +54,7 @@
 struct mmu_gather {
 	struct mm_struct	*mm;
 	unsigned int		fullmm;
-	struct vm_area_struct	*vma;
+	unsigned int		byvma;
 	unsigned long		range_start;
 	unsigned long		range_end;
 	unsigned int		nr;
@@ -68,23 +68,18 @@ DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
  * code is used:
  *  1. Unmapping a range of vmas.  See zap_page_range(), unmap_region().
  *     tlb->fullmm = 0, and tlb_start_vma/tlb_end_vma will be called.
- *     tlb->vma will be non-NULL.
+ *     tlb->byvma will be true.
  *  2. Unmapping all vmas.  See exit_mmap().
  *     tlb->fullmm = 1, and tlb_start_vma/tlb_end_vma will be called.
- *     tlb->vma will be non-NULL.  Additionally, page tables will be freed.
+ *     tlb->byvma will be true.  Additionally, page tables will be freed.
  *  3. Unmapping argument pages.  See shift_arg_pages().
  *     tlb->fullmm = 0, but tlb_start_vma/tlb_end_vma will not be called.
- *     tlb->vma will be NULL.
+ *     tlb->byvma will be false.
  */
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
-	if (tlb->fullmm || !tlb->vma)
+	if (tlb->fullmm || !tlb->byvma)
 		flush_tlb_mm(tlb->mm);
-	else if (tlb->range_end > 0) {
-		flush_tlb_range(tlb->vma, tlb->range_start, tlb->range_end);
-		tlb->range_start = TASK_SIZE;
-		tlb->range_end = 0;
-	}
 }
 
 static inline void tlb_add_flush(struct mmu_gather *tlb, unsigned long addr)
@@ -113,7 +108,7 @@ tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
 
 	tlb->mm = mm;
 	tlb->fullmm = full_mm_flush;
-	tlb->vma = NULL;
+	tlb->byvma = 0;
 	tlb->nr = 0;
 
 	return tlb;
@@ -149,7 +144,7 @@ tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
 {
 	if (!tlb->fullmm) {
 		flush_cache_range(vma, vma->vm_start, vma->vm_end);
-		tlb->vma = vma;
+		tlb->byvma = 1;
 		tlb->range_start = TASK_SIZE;
 		tlb->range_end = 0;
 	}
@@ -158,8 +153,11 @@ tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
 static inline void
 tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
 {
-	if (!tlb->fullmm)
-		tlb_flush(tlb);
+	if (!tlb->fullmm && tlb->range_end > 0) {
+		flush_tlb_range(vma, tlb->range_start, tlb->range_end);
+		tlb->range_start = TASK_SIZE;
+		tlb->range_end = 0;
+	}
 }
 
 static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
^ permalink raw reply related	[flat|nested] 90+ messages in thread
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-28 12:06             ` Russell King
@ 2011-02-28 12:25               ` Peter Zijlstra
  2011-02-28 12:25                 ` Peter Zijlstra
  0 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-28 12:25 UTC (permalink / raw)
  To: Russell King
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang, Luck,Tony,
	PaulMundt, Chris Metcalf
On Mon, 2011-02-28 at 12:06 +0000, Russell King wrote:
> 
> As I think I mentioned, the TLB shootdown interface either needs rewriting
> from scratch as it's currently a broken design, or it needs tlb_gather_mmu()
> to take a proper mode argument, rather than this useless 'fullmm' argument
> which only gives half the story.
> 
> The fact is that the interface has three modes, and distinguishing between
> them requires a certain amount of black magic.  Explicitly, the !fullmm
> case has two modes, and it requires implementations to remember whether
> tlb_start_vma() has been called before tlb_finish_mmu() or not.
> 
> Maybe this will help you understand the ARM implementation - this doesn't
> change the functionality, but may make things clearer.
I've actually implemented that, but it didn't really help much.
Mostly because you want your TLB flush to be after freeing the
page-tables, not before it.
So I want to avoid having to flush at tlb_end_vma() _and_ at
tlb_finish_mmu(), and doing that needs a flush_tlb_range() that doesn't
need a vma.
ARM also does the whole IPI thing on TLB flush, so a gup_fast()
implementation for arm would also need that TLB flush after page-table
tear-down, not on tlb_end_vma().
And once you want a single TLB invalidate, it doesn't matter if you want
to track ranges for p*_free_tlb() too.
^ permalink raw reply	[flat|nested] 90+ messages in thread 
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-28 11:59           ` Russell King
                               ` (2 preceding siblings ...)
  2011-02-28 12:06             ` Russell King
@ 2011-02-28 12:20             ` Peter Zijlstra
  2011-02-28 12:20               ` Peter Zijlstra
  2011-02-28 12:28               ` Russell King
  3 siblings, 2 replies; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-28 12:20 UTC (permalink / raw)
  To: Russell King
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang, Luck,Tony,
	PaulMundt, Chris Metcalf
On Mon, 2011-02-28 at 11:59 +0000, Russell King wrote:
> On Mon, Feb 28, 2011 at 12:44:47PM +0100, Peter Zijlstra wrote:
> > Right, so the normal case is:
> > 
> >   unmap_region()
> >     tlb_gather_mmu()
> 
> The fullmm argument is important here as it specifies the mode.
Well, unmap_region() always passes 0 there; I've mentioned the fullmm mode
separately below, it's in many ways the easiest case to deal with.
>       tlb_gather_mmu(, 0)
> 
> >     unmap_vmas()
> >       for (; vma; vma = vma->vm_next)
> >         unmap_page_range()
> >           tlb_start_vma() -> flush cache range
> >           zap_*_range()
> >             ptep_get_and_clear_full() -> batch/track external tlbs
> >             tlb_remove_tlb_entry() -> batch/track external tlbs
> >             tlb_remove_page() -> track range/batch page
> >           tlb_end_vma() -> flush tlb range
> 
>        tlb_finish_mmu() -> nothing
> 
> > 
> >  [ for architectures that have hardware page table walkers
> >    concurrent faults can still load the page tables ]
> > 
> >     free_pgtables()
> 
>         tlb_gather_mmu(, 1)
> 
> >       while (vma)
> >         unlink_*_vma()
> >         free_*_range()
> >           *_free_tlb()
> >     tlb_finish_mmu()
> 
>       tlb_finish_mmu() -> flush tlb mm
> 
> > 
> >   free vmas
> 
> So this is all fine.  Note that we *don't* use the range stuff here.
> 
> > Now, if we want to track ranges _and_ have hardware page table walkers
> > (ARM seems to be one such), we must flush TLBs at tlb_end_vma() because
> > flush_tlb_range() requires a vma pointer (ARM and TILE actually use more
> > than ->vm_mm), and on tlb_finish_mmu() issue a full mm wide invalidate
> > because the hardware walker could have re-populated the cache after
> > clearing the PTEs but before freeing the page tables.
> 
> No.  The hardware walker won't re-populate the TLB after the page table
Never said it would repopulate the TLB, just said it could repopulate
your cache thing and that it might still walk the page tables.
> entries have been cleared - where would it get this information from if
> not from the page tables?
> 
> > What ARM does is it retains the last vma pointer and tracks
> > pte_free_tlb() range and uses that in tlb_finish_mmu(), which is a tad
> > hacky.
> 
> It may be hacky but then the TLB shootdown interface is hacky too.  We
> don't keep the vma around to re-use after tlb_end_vma() - if you think
> that then you misunderstand what's going on.  The vma pointer is kept
> around as a cheap way of allowing tlb_finish_mmu() to distinguish
> between the unmap_region() mode and the shift_arg_pages() mode.
Well, you most certainly use it in the unmap_region() case above.
tlb_end_vma() will do a flush_tlb_range(), but then your
__pte_free_tlb() will also track range and the tlb_finish_mmu() will
then again issue a flush_tlb_range() using the last vma pointer.
You argued you need that second tlb flush to flush your cached level1
entries for your LPAE mode (btw arm sucks for having all those docs
private).
> > Mostly because of shift_arg_pages(), where we have:
> > 
> >   shift_arg_pages()
> >     tlb_gather_mmu()
> 
>       tlb_gather_mmu(, 0)
> 
> >     free_*_range()
> >     tlb_finish_mmu()
> 
>       tlb_finish_mmu() does nothing without the ARM change.
>       tlb_finish_mmu() -> flush_tlb_mm() with the ARM change.
> 
> And this is where the bug was - these page table entries could find
> their way into the TLB and persist after they've been torn down.
Sure, I got that, you punt and do a full mm tlb invalidate (IA64 and SH
seem similarly challenged).
> > For which ARM now punts and does a full tlb invalidate (no vma pointer).
> > But also because it drags along that vma pointer, which might not at all
> > match the range it's actually going to invalidate (and hence its vm_flags
> > might not accurately reflect things -- at worst more expensive than
> > needed).
> 
> Where do you get that from?  Where exactly in the above code would the
> VMA pointer get set?  In this case, it will be NULL, so we do a
> flush_tlb_mm() for this case.  We have to - we don't have any VMA to
> deal with at this point.
unmap_region()'s last tlb_start_vma(), with __pte_free_tlb() tracking
range will then get tlb_finish_mmu() to issue a second
flush_tlb_range().
> > The reason I wanted flush_tlb_range() to take an mm_struct and not the
> > current vm_area_struct is because we can avoid doing the
> > flush_tlb_range() from tlb_end_vma() and delay the thing until
> > tlb_finish_mmu() without having to resort to such games as above. We
> > could simply track the full range over all VMAs and free'd page-tables
> > and do one range invalidate.
> 
> No.  That's stupid.  Consider the case where you have to loop one page
> at a time over the range (we do on ARM.)  If we ended up with your
> suggestion above, that means we could potentially have to loop 4K at a
> time over 3GB of address space.  That's idiotic when we have an
> instruction which can flush the entire TLB for a particular thread.
*blink* so you've implemented flush_tlb_range() as an iteration of
single page invalidates?
x86 could have done the same I think, instead we chose to implement it
as a full mm invalidate simply because that's way cheaper in general.
You could also put a threshold in, if (end-start) >> PAGE_SHIFT > n,
flush everything if you want.
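
The threshold idea can be sketched as below. This is a userspace model, not any architecture's flush_tlb_range(): choose_flush(), the flush_kind enum and MODEL_PAGE_SHIFT are hypothetical, and the cut-off is a parameter because no particular value of n is suggested above. For scale, a 3GB range at 4K pages is 786432 single-page invalidates.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12	/* 4K pages, as in the example above */

enum flush_kind {
	FLUSH_NONE,	/* empty range, nothing to do */
	FLUSH_PAGES,	/* walk the range one page at a time */
	FLUSH_ALL,	/* give up on the range, flush the whole mm */
};

/*
 * Pick a strategy for invalidating [start, end): per-page invalidates
 * for small ranges, a full flush once the page count exceeds max_pages.
 */
static enum flush_kind choose_flush(unsigned long start, unsigned long end,
				    unsigned long max_pages)
{
	if (end <= start)
		return FLUSH_NONE;
	if (((end - start) >> MODEL_PAGE_SHIFT) > max_pages)
		return FLUSH_ALL;
	return FLUSH_PAGES;
}
```

With a check like this a merged range never costs more than max_pages single-page operations, which addresses the "loop 4K at a time over 3GB" worry without needing a vma.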
Anyway, I don't see how that's related to the I-TLB thing?
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-28 12:20             ` Peter Zijlstra
@ 2011-02-28 12:20               ` Peter Zijlstra
  2011-02-28 12:28               ` Russell King
  1 sibling, 0 replies; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-28 12:20 UTC (permalink / raw)
  To: Russell King
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang, Luck,Tony,
	PaulMundt, Chris Metcalf
On Mon, 2011-02-28 at 11:59 +0000, Russell King wrote:
> On Mon, Feb 28, 2011 at 12:44:47PM +0100, Peter Zijlstra wrote:
> > Right, so the normal case is:
> > 
> >   unmap_region()
> >     tlb_gather_mmu()
> 
> The fullmm argument is important here as it specifies the mode.
Well, unmap_region() always passes 0 there; I've mentioned the fullmm mode
separately below, it's in many ways the easiest case to deal with.
>       tlb_gather_mmu(, 0)
> 
> >     unmap_vmas()
> >       for (; vma; vma = vma->vm_next)
> >         unmap_page_range()
> >           tlb_start_vma() -> flush cache range
> >           zap_*_range()
> >             ptep_get_and_clear_full() -> batch/track external tlbs
> >             tlb_remove_tlb_entry() -> batch/track external tlbs
> >             tlb_remove_page() -> track range/batch page
> >           tlb_end_vma() -> flush tlb range
> 
>        tlb_finish_mmu() -> nothing
> 
> > 
> >  [ for architectures that have hardware page table walkers
> >    concurrent faults can still load the page tables ]
> > 
> >     free_pgtables()
> 
>         tlb_gather_mmu(, 1)
> 
> >       while (vma)
> >         unlink_*_vma()
> >         free_*_range()
> >           *_free_tlb()
> >     tlb_finish_mmu()
> 
>       tlb_finish_mmu() -> flush tlb mm
> 
> > 
> >   free vmas
> 
> So this is all fine.  Note that we *don't* use the range stuff here.
> 
> > Now, if we want to track ranges _and_ have hardware page table walkers
> > (ARM seems to be one such), we must flush TLBs at tlb_end_vma() because
> > flush_tlb_range() requires a vma pointer (ARM and TILE actually use more
> > than ->vm_mm), and on tlb_finish_mmu() issue a full mm wide invalidate
> > because the hardware walker could have re-populated the cache after
> > clearing the PTEs but before freeing the page tables.
> 
> No.  The hardware walker won't re-populate the TLB after the page table
Never said it would repopulate the TLB, just said it could repopulate
your cache thing and that it might still walk the page tables.
> entries have been cleared - where would it get this information from if
> not from the page tables?
> 
> > What ARM does is it retains the last vma pointer and tracks
> > pte_free_tlb() range and uses that in tlb_finish_mmu(), which is a tad
> > hacky.
> 
> It may be hacky but then the TLB shootdown interface is hacky too.  We
> don't keep the vma around to re-use after tlb_end_vma() - if you think
> that then you misunderstand what's going on.  The vma pointer is kept
> around as a cheap way of allowing tlb_finish_mmu() to distinguish
> between the unmap_region() mode and the shift_arg_pages() mode.
Well, you most certainly use it in the unmap_region() case above.
tlb_end_vma() will do a flush_tlb_range(), but then your
__pte_free_tlb() will also track range and the tlb_finish_mmu() will
then again issue a flush_tlb_range() using the last vma pointer.
You argued you need that second tlb flush to flush your cached level1
entries for your LPAE mode (btw arm sucks for having all those docs
private).
> > Mostly because of shift_arg_pages(), where we have:
> > 
> >   shift_arg_pages()
> >     tlb_gather_mmu()
> 
>       tlb_gather_mmu(, 0)
> 
> >     free_*_range()
> >     tlb_finish_mmu()
> 
>       tlb_finish_mmu() does nothing without the ARM change.
>       tlb_finish_mmu() -> flush_tlb_mm() with the ARM change.
> 
> And this is where the bug was - these page table entries could find
> their way into the TLB and persist after they've been torn down.
Sure, I got that, you punt and do a full mm tlb invalidate (IA64 and SH
seem similarly challenged).
> > For which ARM now punts and does a full tlb invalidate (no vma pointer).
> > But also because it drags along that vma pointer, which might not at all
> > match the range its actually going to invalidate (and hence its vm_flags
> > might not accurately reflect things -- at worst more expensive than
> > needed).
> 
> Where do you get that from?  Where exactly in the above code would the
> VMA pointer get set?  In this case, it will be NULL, so we do a
> flush_tlb_mm() for this case.  We have to - we don't have any VMA to
> deal with at this point.
unmap_region()'s last tlb_start_vma(), with __pte_free_tlb() tracking
range will then get tlb_finish_mmu() to issue a second
flush_tlb_range().
> > The reason I wanted flush_tlb_range() to take an mm_struct and not the
> > current vm_area_struct is because we can avoid doing the
> > flush_tlb_range() from tlb_end_vma() and delay the thing until
> > tlb_finish_mmu() without having to resort to such games as above. We
> > could simply track the full range over all VMAs and free'd page-tables
> > and do one range invalidate.
> 
> No.  That's stupid.  Consider the case where you have to loop one page
> at a time over the range (we do on ARM.)  If we ended up with your
> suggestion above, that means we could potentially have to loop 4K at a
> time over 3GB of address space.  That's idiotic when we have an
> instruction which can flush the entire TLB for a particular thread.
*blink* so you've implemented flush_tlb_range() as an iteration of
single page invalidates?
x86 could have done the same I think, instead we chose to implement it
as a full mm invalidate simply because that's way cheaper in general.
You could also put a threshold in, if (end-start) >> PAGE_SHIFT > n,
flush everything if you want.
Anyway, I don't see how that's related to the I-TLB thing?
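The range-tracking and threshold ideas in this message can be sketched in plain user-space C. This is only an illustration, not kernel code: `struct toy_gather`, `toy_track_range()`, `toy_finish()` and the 64-page cut-off are hypothetical names and values, and a 4K page size is assumed.

```c
#define TOY_PAGE_SHIFT 12          /* assumption: 4K pages */
#define TOY_FLUSH_THRESHOLD 64UL   /* hypothetical cut-off, in pages */

enum toy_flush { TOY_FLUSH_NONE, TOY_FLUSH_RANGE, TOY_FLUSH_MM };

/* Accumulates a single [start, end) range over everything that was unmapped. */
struct toy_gather {
	unsigned long start;
	unsigned long end;
};

static void toy_gather_init(struct toy_gather *tlb)
{
	tlb->start = ~0UL;	/* empty range: start >= end */
	tlb->end = 0;
}

/* Widen the tracked range so that it also covers [start, end). */
static void toy_track_range(struct toy_gather *tlb,
			    unsigned long start, unsigned long end)
{
	if (start < tlb->start)
		tlb->start = start;
	if (end > tlb->end)
		tlb->end = end;
}

/* Decide once, at the end of the batch, which invalidate to issue. */
static enum toy_flush toy_finish(const struct toy_gather *tlb)
{
	if (tlb->start >= tlb->end)
		return TOY_FLUSH_NONE;
	if (((tlb->end - tlb->start) >> TOY_PAGE_SHIFT) > TOY_FLUSH_THRESHOLD)
		return TOY_FLUSH_MM;	/* too many pages: flush everything */
	return TOY_FLUSH_RANGE;		/* small range: invalidate page by page */
}
```

With these assumed values, a tracked range of two pages yields a range invalidate, while widening it to a few megabytes tips the decision over to a full mm invalidate.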
^ permalink raw reply	[flat|nested] 90+ messages in thread 
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-28 12:20             ` Peter Zijlstra
  2011-02-28 12:20               ` Peter Zijlstra
@ 2011-02-28 12:28               ` Russell King
  2011-02-28 12:28                 ` Russell King
  2011-02-28 12:49                 ` Peter Zijlstra
  1 sibling, 2 replies; 90+ messages in thread
From: Russell King @ 2011-02-28 12:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang, Luck,Tony,
	PaulMundt, Chris Metcalf
On Mon, Feb 28, 2011 at 01:20:12PM +0100, Peter Zijlstra wrote:
> On Mon, 2011-02-28 at 11:59 +0000, Russell King wrote:
> > On Mon, Feb 28, 2011 at 12:44:47PM +0100, Peter Zijlstra wrote:
> > > Right, so the normal case is:
> > > 
> > >   unmap_region()
> > >     tlb_gather_mmu()
> > 
> > The fullmm argument is important here as it specifies the mode.
> 
> Well, unmap_region() always passes 0 there. I've mentioned the fullmm mode
> separately below; it's in many ways the easiest case to deal with.
> 
> >       tlb_gather_mmu(, 0)
> > 
> > >     unmap_vmas()
> > >       for (; vma; vma = vma->vm_next)
> > >         unmap_page_range()
> > >           tlb_start_vma() -> flush cache range
> > >           zap_*_range()
> > >             ptep_get_and_clear_full() -> batch/track external tlbs
> > >             tlb_remove_tlb_entry() -> batch/track external tlbs
> > >             tlb_remove_page() -> track range/batch page
> > >           tlb_end_vma() -> flush tlb range
> > 
> >        tlb_finish_mmu() -> nothing
> > 
> > > 
> > >  [ for architectures that have hardware page table walkers
> > >    concurrent faults can still load the page tables ]
> > > 
> > >     free_pgtables()
> > 
> >         tlb_gather_mmu(, 1)
> > 
> > >       while (vma)
> > >         unlink_*_vma()
> > >         free_*_range()
> > >           *_free_tlb()
> > >     tlb_finish_mmu()
> > 
> >       tlb_finish_mmu() -> flush tlb mm
> > 
> > > 
> > >   free vmas
> > 
> > So this is all fine.  Note that we *don't* use the range stuff here.
> > 
> > > Now, if we want to track ranges _and_ have hardware page table walkers
> > > (ARM seems to be one such), we must flush TLBs at tlb_end_vma() because
> > > flush_tlb_range() requires a vma pointer (ARM and TILE actually use more
> > > than ->vm_mm), and on tlb_finish_mmu() issue a full mm wide invalidate
> > > because the hardware walker could have re-populated the cache after
> > > clearing the PTEs but before freeing the page tables.
> > 
> > No.  The hardware walker won't re-populate the TLB after the page table
> 
> Never said it would repopulate the TLB, just said it could repopulate
> your cache thing and that it might still walk the page tables.
> 
> > entries have been cleared - where would it get this information from if
> > not from the page tables?
> > 
> > > What ARM does is it retains the last vma pointer and tracks
> > > pte_free_tlb() range and uses that in tlb_finish_mmu(), which is a tad
> > > hacky.
> > 
> > It may be hacky but then the TLB shootdown interface is hacky too.  We
> > don't keep the vma around to re-use after tlb_end_vma() - if you think
> > that then you misunderstand what's going on.  The vma pointer is kept
> > around as a cheap way of allowing tlb_finish_mmu() to distinguish
> > between the unmap_region() mode and the shift_arg_pages() mode.
> 
> Well, you most certainly use it in the unmap_region() case above.
> tlb_end_vma() will do a flush_tlb_range(), but then your
> __pte_free_tlb() will also track range and the tlb_finish_mmu() will
> then again issue a flush_tlb_range() using the last vma pointer.
Can you point out where pte_free_tlb() is used with unmap_region()?
> unmap_region()'s last tlb_start_vma(), with __pte_free_tlb() tracking
> range will then get tlb_finish_mmu() to issue a second
> flush_tlb_range().
I don't think it will because afaics pte_free_tlb() is never called in
the unmap_region() case.
> > No.  That's stupid.  Consider the case where you have to loop one page
> > at a time over the range (we do on ARM.)  If we ended up with your
> > suggestion above, that means we could potentially have to loop 4K at a
> > time over 3GB of address space.  That's idiotic when we have an
> > instruction which can flush the entire TLB for a particular thread.
> 
> *blink* so you've implemented flush_tlb_range() as an iteration of
> single page invalidates?
Yes, because flush_tlb_range() is used at most over one VMA, which
typically will not be in the GB range, but a few MB at most.
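To put rough numbers on the cost argument: with 4K pages, a per-page loop over a 2 MB VMA issues 512 invalidates, whereas the same loop over 3 GB of address space would issue 786,432. A small sketch of that arithmetic follows; `toy_invalidate_count()` is a hypothetical helper, not a kernel function, and the 4K page size is an assumption.

```c
#define TOY_PAGE_SIZE 4096ULL	/* assumption: 4K pages */

/*
 * Number of single-page invalidates a per-page flush loop would issue
 * for the byte range [start, end), rounding a partial page up.
 */
static unsigned long long toy_invalidate_count(unsigned long long start,
					       unsigned long long end)
{
	return (end - start + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
}
```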
> Anyway, I don't see how that's related to the I-TLB thing?
It's all related because I don't think you understand what's going on
here properly yet, and as such are getting rather mixed up and confused
about when flush_tlb_range() is called.  As such, the whole
does-it-take-vma-or-mm argument is irrelevant, and therefore so is
the I-TLB stuff.
I put to you that pte_free_tlb() is not called in unmap_vmas(), and
as such the double-tlb-invalidate you talk about can't happen.
-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
^ permalink raw reply	[flat|nested] 90+ messages in thread 
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-28 12:28               ` Russell King
  2011-02-28 12:28                 ` Russell King
@ 2011-02-28 12:49                 ` Peter Zijlstra
  2011-02-28 12:49                   ` Peter Zijlstra
  2011-02-28 12:50                   ` Russell King
  1 sibling, 2 replies; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-28 12:49 UTC (permalink / raw)
  To: Russell King
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang, Luck,Tony,
	PaulMundt, Chris Metcalf
On Mon, 2011-02-28 at 12:28 +0000, Russell King wrote:
> Can you point out where pte_free_tlb() is used with unmap_region()?
unmap_region()
  free_pgtables()
    free_pgd_range()
      free_pud_range()
        free_pmd_range()
          free_pte_range()
            pte_free_tlb()
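The call chain above is what makes the second flush possible. A toy user-space model of the behaviour as it is described in this thread (not the actual arch/arm code) can count the flushes; every `toy_*` name and the 2 MB span per page-table page are assumptions made for illustration.

```c
#include <stddef.h>

#define TOY_PMD_SIZE (1UL << 21)	/* assumption: span of one PTE page */

struct toy_vma {
	unsigned long start;
	unsigned long end;
};

struct toy_tlb {
	const struct toy_vma *vma;	/* last vma seen by toy_start_vma() */
	unsigned long range_start;	/* range tracked by toy_pte_free_tlb() */
	unsigned long range_end;
	int range_flushes;		/* modeled flush_tlb_range() calls */
	int mm_flushes;			/* modeled flush_tlb_mm() calls */
};

static void toy_gather_mmu(struct toy_tlb *tlb)
{
	tlb->vma = NULL;
	tlb->range_start = ~0UL;
	tlb->range_end = 0;
	tlb->range_flushes = 0;
	tlb->mm_flushes = 0;
}

static void toy_start_vma(struct toy_tlb *tlb, const struct toy_vma *vma)
{
	tlb->vma = vma;			/* pointer is retained past end_vma */
}

static void toy_end_vma(struct toy_tlb *tlb)
{
	tlb->range_flushes++;		/* first flush: per-vma range */
}

static void toy_pte_free_tlb(struct toy_tlb *tlb, unsigned long addr)
{
	if (addr < tlb->range_start)
		tlb->range_start = addr;
	if (addr + TOY_PMD_SIZE > tlb->range_end)
		tlb->range_end = addr + TOY_PMD_SIZE;
}

static void toy_finish_mmu(struct toy_tlb *tlb)
{
	if (tlb->range_start >= tlb->range_end)
		return;			/* no page tables were freed */
	if (tlb->vma)
		tlb->range_flushes++;	/* second range flush, via the kept vma */
	else
		tlb->mm_flushes++;	/* no vma: full mm invalidate */
}

/* unmap_region(): unmap_vmas() and free_pgtables() share one gather. */
static void toy_unmap_region(struct toy_tlb *tlb, const struct toy_vma *vma)
{
	toy_gather_mmu(tlb);
	toy_start_vma(tlb, vma);
	toy_end_vma(tlb);
	toy_pte_free_tlb(tlb, vma->start);
	toy_finish_mmu(tlb);
}

/* shift_arg_pages(): page tables are freed without any vma callbacks. */
static void toy_shift_arg_pages(struct toy_tlb *tlb, unsigned long addr)
{
	toy_gather_mmu(tlb);
	toy_pte_free_tlb(tlb, addr);
	toy_finish_mmu(tlb);
}
```

In this model the unmap_region() path ends with two range flushes for a single vma, and the shift_arg_pages() path ends with one full mm flush, which matches the two modes discussed earlier in the thread.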
        
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-28 12:49                 ` Peter Zijlstra
  2011-02-28 12:49                   ` Peter Zijlstra
@ 2011-02-28 12:50                   ` Russell King
  2011-02-28 12:50                     ` Russell King
  2011-02-28 13:03                     ` Peter Zijlstra
  1 sibling, 2 replies; 90+ messages in thread
From: Russell King @ 2011-02-28 12:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang, Luck,Tony,
	PaulMundt, Chris Metcalf
On Mon, Feb 28, 2011 at 01:49:02PM +0100, Peter Zijlstra wrote:
> On Mon, 2011-02-28 at 12:28 +0000, Russell King wrote:
> > Can you point out where pte_free_tlb() is used with unmap_region()?
> 
> unmap_region()
>   free_pgtables()
>     free_pgd_range()
>       free_pud_range()
>         free_pmd_range()
>           free_pte_range()
>             pte_free_tlb()
Damn it.  Okay, I give up with this.  The TLB shootdown interface is
total crap.
-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
^ permalink raw reply	[flat|nested] 90+ messages in thread 
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-28 12:50                   ` Russell King
  2011-02-28 12:50                     ` Russell King
@ 2011-02-28 13:03                     ` Peter Zijlstra
  2011-02-28 13:03                       ` Peter Zijlstra
  1 sibling, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-28 13:03 UTC (permalink / raw)
  To: Russell King
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang, Luck,Tony,
	PaulMundt, Chris Metcalf
On Mon, 2011-02-28 at 12:50 +0000, Russell King wrote:
> On Mon, Feb 28, 2011 at 01:49:02PM +0100, Peter Zijlstra wrote:
> > On Mon, 2011-02-28 at 12:28 +0000, Russell King wrote:
> > > Can you point out where pte_free_tlb() is used with unmap_region()?
> > 
> > unmap_region()
> >   free_pgtables()
> >     free_pgd_range()
> >       free_pud_range()
> >         free_pmd_range()
> >           free_pte_range()
> >             pte_free_tlb()
> 
> Damn it.  Okay, I give up with this.  The TLB shootdown interface is
> total crap.
:-)
There's a reason I'd like to make everybody use asm-generic/tlb.h and
unify all the crazy bits. Once there's common code everybody is forced
to think about this stuff instead of endlessly hacking their own
architecture to make it work without consideration for the rest of us.
Furthermore, I don't think it's actually too hard to do... (famous last
words).
^ permalink raw reply	[flat|nested] 90+ messages in thread 
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-28 11:44         ` Peter Zijlstra
  2011-02-28 11:44           ` Peter Zijlstra
  2011-02-28 11:59           ` Russell King
@ 2011-02-28 14:18           ` Peter Zijlstra
  2011-02-28 14:18             ` Peter Zijlstra
  2011-02-28 14:57             ` Russell King
  2011-03-01 22:05           ` Chris Metcalf
  3 siblings, 2 replies; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-28 14:18 UTC (permalink / raw)
  To: Russell King
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang, Luck,Tony,
	PaulMundt, Chris Metcalf
On Mon, 2011-02-28 at 12:44 +0100, Peter Zijlstra wrote:
>   unmap_region()
>     tlb_gather_mmu()
>     unmap_vmas()
>       for (; vma; vma = vma->vm_next)
>         unmap_page_range()
>           tlb_start_vma() -> flush cache range
So why is this correct? Can't we race with a concurrent access to the
memory region (munmap() vs. access from another thread)? While
unmap_region() callers will have removed the vma from the tree so faults
will not be satisfied, TLB entries might still be present and allow us to
access the memory, thereby reloading it into the cache.
>           zap_*_range()
>             ptep_get_and_clear_full() -> batch/track external tlbs
>             tlb_remove_tlb_entry() -> batch/track external tlbs
>             tlb_remove_page() -> track range/batch page
>           tlb_end_vma() -> flush tlb range
> 
>  [ for architectures that have hardware page table walkers
>    concurrent faults can still load the page tables ]
> 
>     free_pgtables()
>       while (vma)
>         unlink_*_vma()
>         free_*_range()
>           *_free_tlb()
>     tlb_finish_mmu()
> 
>   free vmas 
^ permalink raw reply	[flat|nested] 90+ messages in thread 
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-28 14:18           ` Peter Zijlstra
  2011-02-28 14:18             ` Peter Zijlstra
@ 2011-02-28 14:57             ` Russell King
  2011-02-28 14:57               ` Russell King
  2011-02-28 15:05               ` Peter Zijlstra
  1 sibling, 2 replies; 90+ messages in thread
From: Russell King @ 2011-02-28 14:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang, Luck,Tony,
	PaulMundt, Chris Metcalf
On Mon, Feb 28, 2011 at 03:18:47PM +0100, Peter Zijlstra wrote:
> On Mon, 2011-02-28 at 12:44 +0100, Peter Zijlstra wrote:
> >   unmap_region()
> >     tlb_gather_mmu()
> >     unmap_vmas()
> >       for (; vma; vma = vma->vm_next)
> >         unmap_page_range()
> >           tlb_start_vma() -> flush cache range
> 
> So why is this correct? Can't we race with a concurrent access to the
> memory region (munmap() vs. access from another thread)? While
> unmap_region() callers will have removed the vma from the tree so faults
> will not be satisfied, TLB entries might still be present and allow us to
> access the memory, thereby reloading it into the cache.
It is my understanding that code sections between tlb_gather_mmu() and
tlb_finish_mmu() are non-preemptible - that was the case once upon a
time when this stuff first appeared.  If that's changed then that
change has introduced an unnoticed bug.
-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
^ permalink raw reply	[flat|nested] 90+ messages in thread 
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-28 14:57             ` Russell King
  2011-02-28 14:57               ` Russell King
@ 2011-02-28 15:05               ` Peter Zijlstra
  2011-02-28 15:15                 ` Russell King
  1 sibling, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-28 15:05 UTC (permalink / raw)
  To: Russell King
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang, Luck,Tony,
	PaulMundt, Chris Metcalf
On Mon, 2011-02-28 at 14:57 +0000, Russell King wrote:
> On Mon, Feb 28, 2011 at 03:18:47PM +0100, Peter Zijlstra wrote:
> > On Mon, 2011-02-28 at 12:44 +0100, Peter Zijlstra wrote:
> > >   unmap_region()
> > >     tlb_gather_mmu()
> > >     unmap_vmas()
> > >       for (; vma; vma = vma->vm_next)
> > >         unmap_page_range()
> > >           tlb_start_vma() -> flush cache range
> > 
> > So why is this correct? Can't we race with a concurrent access to the
> > memory region (munmap() vs other thread access race)? While
> > unmap_region() callers will have removed the vma from the tree so faults
> > will not be satisfied, TLB entries might still be present and allow us to
> > access the memory, thereby reloading it into the cache.
> 
> It is my understanding that code sections between tlb_gather_mmu() and
> tlb_finish_mmu() are non-preemptible - that was the case once upon a
> time when this stuff first appeared.  
It is still so, but that doesn't help with SMP. The case mentioned above
has two threads running, one doing munmap() and the other is poking at
the memory being unmapped.
Afaict, even when it's all non-preemptible, the remote cpu can
re-populate the cache you just flushed through existing TLB entries.
> If that's changed then that change has introduced an unnoticed bug.
I've got such a patch-set pending, but I cannot see how that would
change the semantics other than that the above race becomes possible on
a single CPU.
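To make the ordering concern concrete, here is a toy userspace model of it.
This is not kernel code: struct remote_view and every function below are
made-up names, and the model assumes a cache that can only be refilled
through a live TLB entry. It only shows that flushing the cache before
invalidating the TLB leaves a window in which another CPU can refill it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not kernel code: one page as seen by a remote CPU. */
struct remote_view {
	bool tlb_valid;    /* remote CPU still holds a TLB entry for the page */
	bool cache_loaded; /* lines of the page are present in the cache */
};

/* The remote thread touches the page; only a live TLB entry lets it through. */
static bool remote_access(struct remote_view *v)
{
	if (!v->tlb_valid)
		return false;       /* would fault: the vma is already gone */
	v->cache_loaded = true;     /* access succeeds and refills the cache */
	return true;
}

static void flush_cache(struct remote_view *v) { v->cache_loaded = false; }
static void flush_tlb(struct remote_view *v)   { v->tlb_valid = false; }

/* Order questioned above: cache flush first, TLB invalidate later. */
static bool cache_loaded_after_flush_then_invalidate(void)
{
	struct remote_view v = { .tlb_valid = true, .cache_loaded = true };

	flush_cache(&v);           /* models the flush in tlb_start_vma() */
	(void)remote_access(&v);   /* other thread pokes at the memory */
	flush_tlb(&v);             /* TLB invalidate arrives too late */
	return v.cache_loaded;
}

/* Alternative order: invalidate the TLB before flushing the cache. */
static bool cache_loaded_after_invalidate_then_flush(void)
{
	struct remote_view v = { .tlb_valid = true, .cache_loaded = true };

	flush_tlb(&v);
	(void)remote_access(&v);   /* no TLB entry, so no refill */
	flush_cache(&v);
	(void)remote_access(&v);   /* still no way back in */
	return v.cache_loaded;
}
```

In this model only the first ordering ends with the cache repopulated. Whether
that matters on real hardware depends on the cache type.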
^ permalink raw reply	[flat|nested] 90+ messages in thread 
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-28 15:05               ` Peter Zijlstra
@ 2011-02-28 15:15                 ` Russell King
  2011-02-28 15:15                   ` Russell King
  0 siblings, 1 reply; 90+ messages in thread
From: Russell King @ 2011-02-28 15:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds, linux-kernel, linux-arch,
	linux-mm, Benjamin Herrenschmidt, David Miller, Hugh Dickins,
	Mel Gorman, Nick Piggin, Paul McKenney, Yanmin Zhang, Luck,Tony,
	PaulMundt, Chris Metcalf
On Mon, Feb 28, 2011 at 04:05:48PM +0100, Peter Zijlstra wrote:
> On Mon, 2011-02-28 at 14:57 +0000, Russell King wrote:
> > On Mon, Feb 28, 2011 at 03:18:47PM +0100, Peter Zijlstra wrote:
> > > On Mon, 2011-02-28 at 12:44 +0100, Peter Zijlstra wrote:
> > > >   unmap_region()
> > > >     tlb_gather_mmu()
> > > >     unmap_vmas()
> > > >       for (; vma; vma = vma->vm_next)
> > > >         unmap_page_range()
> > > >           tlb_start_vma() -> flush cache range
> > > 
> > > So why is this correct? Can't we race with a concurrent access to the
> > > memory region (munmap() vs other thread access race)? While
> > > unmap_region() callers will have removed the vma from the tree so faults
> > > will not be satisfied, TLB entries might still be present and allow us to
> > > access the memory, thereby reloading it into the cache.
> > 
> > It is my understanding that code sections between tlb_gather_mmu() and
> > tlb_finish_mmu() are non-preemptible - that was the case once upon a
> > time when this stuff first appeared.  
> 
> It is still so, but that doesn't help with SMP. The case mentioned above
> has two threads running, one doing munmap() and the other is poking at
> the memory being unmapped.
Luckily that cache flush is a no-op on SMP-capable CPUs (and is actually
also a no-op on any PIPT or VIPT ARM CPU).
-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
^ permalink raw reply	[flat|nested] 90+ messages in thread 
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-02-28 11:44         ` Peter Zijlstra
                             ` (2 preceding siblings ...)
  2011-02-28 14:18           ` Peter Zijlstra
@ 2011-03-01 22:05           ` Chris Metcalf
  2011-03-01 22:05             ` Chris Metcalf
  2011-03-02 10:54             ` Peter Zijlstra
  3 siblings, 2 replies; 90+ messages in thread
From: Chris Metcalf @ 2011-03-01 22:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Russell King, Andrea Arcangeli, Avi Kivity, Thomas Gleixner,
	Rik van Riel, Ingo Molnar, akpm, Linus Torvalds, linux-kernel,
	linux-arch, linux-mm, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Paul McKenney,
	Yanmin Zhang, Luck,Tony, PaulMundt
On 2/28/2011 6:44 AM, Peter Zijlstra wrote:
> [...]
> Now, if we want to track ranges _and_ have hardware page table walkers
> (ARM seems to be one such), we must flush TLBs at tlb_end_vma() because
> flush_tlb_range() requires a vma pointer (ARM and TILE actually use more
> than ->vm_mm), and on tlb_finish_mmu() issue a full mm wide invalidate
> because the hardware walker could have re-populated the cache after
> clearing the PTEs but before freeing the page tables.
>
> What ARM does is it retains the last vma pointer and tracks
> pte_free_tlb() range and uses that in tlb_finish_mmu(), which is a tad
> hacky.
>
> Mostly because of shift_arg_pages(), where we have:
>
>   shift_arg_pages()
>     tlb_gather_mmu()
>     free_*_range()
>     tlb_finish_mmu()
>
> For which ARM now punts and does a full tlb invalidate (no vma pointer).
> But also because it drags along that vma pointer, which might not at all
> match the range it's actually going to invalidate (and hence its vm_flags
> might not accurately reflect things -- at worst more expensive than
> needed).
>
> The reason I wanted flush_tlb_range() to take an mm_struct and not the
> current vm_area_struct is because we can avoid doing the
> flush_tlb_range() from tlb_end_vma() and delay the thing until
> tlb_finish_mmu() without having to resort to such games as above. We
> could simply track the full range over all VMAs and free'd page-tables
> and do one range invalidate.
>
> ARM uses vm_flags & VM_EXEC to see if it also needs to invalidate
> I-TLBs, and TILE uses VM_EXEC and VM_HUGETLB.
>
> For the I-TLBs we could easily use
> ptep_get_and_clear_full()/tlb_remove_tlb_entry() and see if any of the
> cleared pte's had its executable bit set (both ARM and TILE seem to have
> such a PTE bit).
For Tile, the concern is that we want to make sure to invalidate the
i-cache.  The I-TLB is handled by the regular TLB flush just fine, like the
other architectures.  So our concern is that once we have cleared the page
table entries and invalidated the TLBs, we still have to deal with i-cache
lines in any core that may have run code from that page.  The risk is that
the kernel might free, reallocate, and then run code from one of those
pages, all before the stale i-cache lines happened to be evicted.
The current Tile code flushes the icache explicitly at two different times:
1. Whenever we flush the TLB, since this is one time when we know who might
currently be using the page (via cpu_vm_mask) and we can flush all of them
easily, piggybacking on the infrastructure we use to flush remote TLBs.
2. Whenever we context switch, to handle the case where cpu 1 is running
process A, then switches to B, but another cpu still running process A
unmaps an executable page that was in cpu 1's icache.  This way when cpu 1
switches back to A, it doesn't have to worry about any unmaps that occurred
while it was switched out.
> I'm not sure what we can do about TILE's VM_HUGETLB usage though, if it
> needs explicit flushes for huge ptes it might just have to issue
> multiple tlb invalidates and do them from tlb_start_vma()/tlb_end_vma().
I'm not too concerned about this.  We can make the flush code check both
page sizes at a small cost in efficiency, relative to the overall cost of
global TLB invalidation.
>   CONFIG_HAVE_MMU_GATHER_ITLB - will use
> ptep_get_and_clear_full()/tlb_remove_tlb_entry() to test pte_exec() and
> issue flush_itlb_range(mm,start,end).
So it sounds like the proposal for tile would be to piggy-back on
flush_itlb_range() and use it to flush the i-cache?  It does seem like
there must be other Linux architectures with incoherent icache out there,
and some existing solution we could just repurpose.
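As a rough illustration of the proposal quoted above (track one range over
everything that was cleared, remember whether any cleared PTE was executable,
and issue the flushes once at the end), here is a minimal userspace sketch.
All names (toy_gather, TOY_PTE_EXEC, toy_remove_tlb_entry, ...) are
hypothetical stand-ins, not the kernel's mmu_gather API, and the "flushes" are
just counters.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SIZE 4096UL
#define TOY_PTE_EXEC  0x1UL  /* hypothetical "executable" PTE bit */

/* Hypothetical gather state; not the kernel's struct mmu_gather. */
struct toy_gather {
	unsigned long start, end;   /* union of all cleared ranges */
	bool need_itlb;             /* saw at least one executable PTE */
	unsigned int tlb_flushes;   /* counts range invalidates issued */
	unsigned int itlb_flushes;  /* counts I-side flushes issued */
};

static void toy_gather_init(struct toy_gather *g)
{
	g->start = ~0UL;
	g->end = 0;
	g->need_itlb = false;
	g->tlb_flushes = 0;
	g->itlb_flushes = 0;
}

/* Called for each PTE that was cleared at address addr. */
static void toy_remove_tlb_entry(struct toy_gather *g, unsigned long pte,
				 unsigned long addr)
{
	if (addr < g->start)
		g->start = addr;
	if (addr + TOY_PAGE_SIZE > g->end)
		g->end = addr + TOY_PAGE_SIZE;
	if (pte & TOY_PTE_EXEC)
		g->need_itlb = true;
}

/* One range invalidate at the end, plus an I-side flush only if needed. */
static void toy_finish(struct toy_gather *g)
{
	if (g->end <= g->start)
		return;             /* nothing was cleared */
	g->tlb_flushes++;           /* would be flush_tlb_range(mm, start, end) */
	if (g->need_itlb)
		g->itlb_flushes++;  /* would be the I-TLB or i-cache flush */
}
```

The point of the sketch is that no vma pointer is needed at flush time: the
executable information comes from the PTEs themselves rather than vm_flags.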
-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * Re: [PATCH 06/17] arm: mmu_gather rework
  2011-03-01 22:05           ` Chris Metcalf
  2011-03-01 22:05             ` Chris Metcalf
@ 2011-03-02 10:54             ` Peter Zijlstra
  2011-03-02 10:54               ` Peter Zijlstra
  1 sibling, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2011-03-02 10:54 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Russell King, Andrea Arcangeli, Avi Kivity, Thomas Gleixner,
	Rik van Riel, Ingo Molnar, akpm, Linus Torvalds, linux-kernel,
	linux-arch, linux-mm, Benjamin Herrenschmidt, David Miller,
	Hugh Dickins, Mel Gorman, Nick Piggin, Paul McKenney,
	Yanmin Zhang, Luck,Tony, PaulMundt
On Tue, 2011-03-01 at 17:05 -0500, Chris Metcalf wrote:
> For Tile, the concern is that we want to make sure to invalidate the
> i-cache.  The I-TLB is handled by the regular TLB flush just fine, like the
> other architectures.  So our concern is that once we have cleared the page
> table entries and invalidated the TLBs, we still have to deal with i-cache
> lines in any core that may have run code from that page.  The risk is that
> the kernel might free, reallocate, and then run code from one of those
> pages, all before the stale i-cache lines happened to be evicted.
From reading Documentation/cachetlb.txt, update_mmu_cache() can be used
to flush i-cache whenever you install a pte with executable permissions,
and covers the particular case you mention above.
DaveM, any comment? You seem to be the one who wrote that document :-)
> The current Tile code flushes the icache explicitly at two different times:
> 
> 1. Whenever we flush the TLB, since this is one time when we know who might
> currently be using the page (via cpu_vm_mask) and we can flush all of them
> easily, piggybacking on the infrastructure we use to flush remote TLBs.
> 
> 2. Whenever we context switch, to handle the case where cpu 1 is running
> process A, then switches to B, but another cpu still running process A
> unmaps an executable page that was in cpu 1's icache.  This way when cpu 1
> switches back to A, it doesn't have to worry about any unmaps that occurred
> while it was switched out.
> 
> 
> > I'm not sure what we can do about TILE's VM_HUGETLB usage though, if it
> > needs explicit flushes for huge ptes it might just have to issue
> > multiple tlb invalidates and do them from tlb_start_vma()/tlb_end_vma().
> 
> I'm not too concerned about this.  We can make the flush code check both
> page sizes at a small cost in efficiency, relative to the overall cost of
> global TLB invalidation.
OK, that's basically what I made it do now:
Index: linux-2.6/arch/tile/kernel/tlb.c
===================================================================
--- linux-2.6.orig/arch/tile/kernel/tlb.c
+++ linux-2.6/arch/tile/kernel/tlb.c
@@ -64,14 +64,13 @@ void flush_tlb_page(const struct vm_area
 }
 EXPORT_SYMBOL(flush_tlb_page);
-void flush_tlb_range(const struct vm_area_struct *vma,
+void flush_tlb_range(const struct mm_struct *mm,
                     unsigned long start, unsigned long end)
 {
-       unsigned long size = hv_page_size(vma);
-       struct mm_struct *mm = vma->vm_mm;
-       int cache = (vma->vm_flags & VM_EXEC) ? HV_FLUSH_EVICT_L1I : 0;
-       flush_remote(0, cache, &mm->cpu_vm_mask, start, end - start, size,
-                    &mm->cpu_vm_mask, NULL, 0);
+       flush_remote(0, HV_FLUSH_EVICT_L1I, &mm->cpu_vm_mask,
+                    start, end - start, PAGE_SIZE, &mm->cpu_vm_mask, NULL, 0);
+       flush_remote(0, 0, &mm->cpu_vm_mask,
+                    start, end - start, HPAGE_SIZE, &mm->cpu_vm_mask, NULL, 0);
 }
And I guess that if the update_mmu_cache() thing works out we can remove
the HV_FLUSH_EVICT_L1I thing.
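The update_mmu_cache() idea, as I read it, moves the i-cache flush from unmap
time to the moment an executable PTE is installed. A toy userspace sketch of
that shape follows. The names (toy_page, toy_update_mmu_cache, ...) are
hypothetical and the hook here does not use the real update_mmu_cache()
signature; it only models "flush on executable install".

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PTE_EXEC 0x1UL  /* hypothetical "executable" PTE bit */

/* Toy page state; not a kernel structure. */
struct toy_page {
	bool icache_stale;            /* i-cache may hold a previous user's lines */
	unsigned int icache_flushes;  /* counts flushes issued for this page */
};

/* Unmapping without an i-cache flush leaves possibly stale lines behind. */
static void toy_unmap(struct toy_page *page)
{
	page->icache_stale = true;
}

/* Stand-in for an update_mmu_cache()-style hook (hypothetical signature). */
static void toy_update_mmu_cache(struct toy_page *page, unsigned long pte)
{
	if (!(pte & TOY_PTE_EXEC))
		return;                /* data mappings never fetch from i-cache */
	page->icache_stale = false;    /* flush before any code can run */
	page->icache_flushes++;
}

/* Installing a PTE always goes through the hook. */
static void toy_install_pte(struct toy_page *page, unsigned long pte)
{
	/* the page-table write itself would go here */
	toy_update_mmu_cache(page, pte);
}

/* True if code could be fetched from stale lines through this mapping. */
static bool toy_can_run_stale_code(const struct toy_page *page,
				   unsigned long pte)
{
	return (pte & TOY_PTE_EXEC) && page->icache_stale;
}
```

In this model a page that is freed, reallocated and mapped executable gets
exactly one flush, at install time, and non-executable mappings pay nothing.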
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * [PATCH 07/17] sh: mmu_gather rework
  2011-02-17 16:23 [PATCH 00/17] mm: mmu_gather rework Peter Zijlstra
                   ` (6 preceding siblings ...)
  2011-02-17 16:23 ` [PATCH 06/17] arm: " Peter Zijlstra
@ 2011-02-17 16:23 ` Peter Zijlstra
  2011-02-17 16:23   ` Peter Zijlstra
  2011-02-17 16:23 ` [PATCH 08/17] um: " Peter Zijlstra
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Paul Mundt
[-- Attachment #1: peter_zijlstra-sh-preemptible_mmu_gather.patch --]
[-- Type: text/plain, Size: 2272 bytes --]
Fix up the sh mmu_gather code to conform to the new API.
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/sh/include/asm/tlb.h |   28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)
Index: linux-2.6/arch/sh/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/sh/include/asm/tlb.h
+++ linux-2.6/arch/sh/include/asm/tlb.h
@@ -23,8 +23,6 @@ struct mmu_gather {
 	unsigned long		start, end;
 };
 
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 static inline void init_tlb_gather(struct mmu_gather *tlb)
 {
 	tlb->start = TASK_SIZE;
@@ -36,17 +34,13 @@ static inline void init_tlb_gather(struc
 	}
 }
 
-static inline struct mmu_gather *
-tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
+static inline void
+tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
-
 	tlb->mm = mm;
 	tlb->fullmm = full_mm_flush;
 
 	init_tlb_gather(tlb);
-
-	return tlb;
 }
 
 static inline void
@@ -57,8 +51,6 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
-
-	put_cpu_var(mmu_gathers);
 }
 
 static inline void
@@ -91,7 +83,21 @@ tlb_end_vma(struct mmu_gather *tlb, stru
 	}
 }
 
-#define tlb_remove_page(tlb,page)	free_page_and_swap_cache(page)
+static inline void tlb_flush_mmu(struct mmu_gather *tlb)
+{
+}
+
+static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+{
+	free_page_and_swap_cache(page);
+	return 0;
+}
+
+static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+{
+	__tlb_remove_page(tlb, page);
+}
+
 #define pte_free_tlb(tlb, ptep, addr)	pte_free((tlb)->mm, ptep)
 #define pmd_free_tlb(tlb, pmdp, addr)	pmd_free((tlb)->mm, pmdp)
 #define pud_free_tlb(tlb, pudp, addr)	pud_free((tlb)->mm, pudp)
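For readers who want to see what the API change means for a caller, here is a
small userspace sketch: the gather structure is supplied by the caller
(typically on the stack) instead of being fetched from a per-CPU variable, so
nothing has to pin the task to a CPU. The toy_ types and functions are
hypothetical stand-ins modelled on the shape of the patch above, not the real
kernel definitions.

```c
#include <assert.h>

/* Hypothetical stand-ins for kernel types. */
struct toy_mm { int id; };
struct toy_page { int freed; };

struct toy_mmu_gather {
	struct toy_mm *mm;
	unsigned int fullmm;
	unsigned long start, end;
	int finished;
};

/* New style: the caller passes in storage, nothing is returned. */
static void toy_tlb_gather_mmu(struct toy_mmu_gather *tlb, struct toy_mm *mm,
			       unsigned int full_mm_flush)
{
	tlb->mm = mm;
	tlb->fullmm = full_mm_flush;
	tlb->start = ~0UL;
	tlb->end = 0;
	tlb->finished = 0;
}

/* Models the immediate free done in the sh patch; tlb is unused here. */
static void toy_tlb_remove_page(struct toy_mmu_gather *tlb,
				struct toy_page *page)
{
	(void)tlb;
	page->freed = 1;
}

/* No put_cpu_var() counterpart is needed any more. */
static void toy_tlb_finish_mmu(struct toy_mmu_gather *tlb,
			       unsigned long start, unsigned long end)
{
	tlb->start = start;
	tlb->end = end;
	tlb->finished = 1;
}

/* Caller side: the gather lives on the caller's stack. */
static int toy_unmap_one(struct toy_mm *mm, struct toy_page *page)
{
	struct toy_mmu_gather tlb;

	toy_tlb_gather_mmu(&tlb, mm, 0);
	toy_tlb_remove_page(&tlb, page);
	toy_tlb_finish_mmu(&tlb, 0, 4096);
	assert(tlb.finished);
	return page->freed;
}
```

Because the structure is no longer per-CPU, the section between gather and
finish does not need get_cpu_var()/put_cpu_var(), which is what opens the door
to making it preemptible.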
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * [PATCH 08/17] um: mmu_gather rework
  2011-02-17 16:23 [PATCH 00/17] mm: mmu_gather rework Peter Zijlstra
                   ` (7 preceding siblings ...)
  2011-02-17 16:23 ` [PATCH 07/17] sh: " Peter Zijlstra
@ 2011-02-17 16:23 ` Peter Zijlstra
  2011-02-17 16:23   ` Peter Zijlstra
  2011-02-17 16:23 ` [PATCH 09/17] ia64: " Peter Zijlstra
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Jeff Dike
[-- Attachment #1: peter_zijlstra-um-preemptible_mmu_gather.patch --]
[-- Type: text/plain, Size: 2809 bytes --]
Fix up the um mmu_gather code to conform to the new API.
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/um/include/asm/tlb.h |   29 +++++++++++------------------
 1 file changed, 11 insertions(+), 18 deletions(-)
Index: linux-2.6/arch/um/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/um/include/asm/tlb.h
+++ linux-2.6/arch/um/include/asm/tlb.h
@@ -22,9 +22,6 @@ struct mmu_gather {
 	unsigned int		fullmm; /* non-zero means full mm flush */
 };
 
-/* Users of the generic TLB shootdown code must declare this storage space. */
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
 					  unsigned long address)
 {
@@ -47,27 +44,20 @@ static inline void init_tlb_gather(struc
 	}
 }
 
-/* tlb_gather_mmu
- *	Return a pointer to an initialized struct mmu_gather.
- */
-static inline struct mmu_gather *
-tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
+static inline void
+tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
-
 	tlb->mm = mm;
 	tlb->fullmm = full_mm_flush;
 
 	init_tlb_gather(tlb);
-
-	return tlb;
 }
 
 extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 			       unsigned long end);
 
 static inline void
-tlb_flush_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
+tlb_flush_mmu(struct mmu_gather *tlb)
 {
 	if (!tlb->need_flush)
 		return;
@@ -83,12 +73,10 @@ tlb_flush_mmu(struct mmu_gather *tlb, un
 static inline void
 tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
-	tlb_flush_mmu(tlb, start, end);
+	tlb_flush_mmu(tlb);
 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
-
-	put_cpu_var(mmu_gathers);
 }
 
 /* tlb_remove_page
@@ -96,11 +84,16 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
  *	while handling the additional races in SMP caused by other CPUs
  *	caching valid mappings in their TLBs.
  */
-static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 {
 	tlb->need_flush = 1;
 	free_page_and_swap_cache(page);
-	return;
+	return 0;
+}
+
+static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+{
+	__tlb_remove_page(tlb, page);
 }
 
 /**
^ permalink raw reply	[flat|nested] 90+ messages in thread
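Editorial sketch of the um conversion above, not part of the posted patch. It is a user-space model of the new calling convention, in which the caller owns the struct mmu_gather (typically on its stack) and passes it to tlb_gather_mmu(), rather than borrowing a per-CPU instance via get_cpu_var(). The struct layouts, the `flushes` counter and demo_unmap() are assumptions made for illustration; only the function names and argument order follow the diff.

```c
/* User-space model of the reworked um API; illustrative only, not kernel code. */
#include <assert.h>
#include <stddef.h>

struct mm_struct { int id; };   /* stand-in for the kernel type */
struct page { int freed; };     /* stand-in for the kernel type */

struct mmu_gather {
	struct mm_struct *mm;
	unsigned int need_flush;
	unsigned int fullmm;
	unsigned int flushes;   /* hypothetical counter, exists only for this demo */
};

/* New style: the caller supplies the storage and nothing is returned. */
static void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
			   unsigned int full_mm_flush)
{
	tlb->mm = mm;
	tlb->fullmm = full_mm_flush;
	tlb->need_flush = 0;    /* models init_tlb_gather() */
	tlb->flushes = 0;
}

/* Non-zero would mean "batch full, please flush"; um frees at once, so 0. */
static int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
{
	tlb->need_flush = 1;
	page->freed = 1;        /* models free_page_and_swap_cache() */
	return 0;
}

static void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
{
	__tlb_remove_page(tlb, page);
}

/* The range arguments are gone; the gather itself carries the state. */
static void tlb_flush_mmu(struct mmu_gather *tlb)
{
	if (!tlb->need_flush)
		return;
	tlb->need_flush = 0;
	tlb->flushes++;         /* models the actual TLB flush */
}

static void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start,
			   unsigned long end)
{
	(void)start;
	(void)end;
	tlb_flush_mmu(tlb);     /* no put_cpu_var(): nothing per-CPU to release */
}

/* Hypothetical caller: unmap n pages using an on-stack gather. */
static unsigned int demo_unmap(struct mm_struct *mm, struct page *pages, size_t n)
{
	struct mmu_gather tlb;  /* lives on the caller's stack */
	size_t i;

	assert(n == 0 || pages != NULL);
	tlb_gather_mmu(&tlb, mm, 0);
	for (i = 0; i < n; i++)
		tlb_remove_page(&tlb, &pages[i]);
	tlb_finish_mmu(&tlb, 0, 0);
	return tlb.flushes;
}
```

Because nothing per-CPU is held between tlb_gather_mmu() and tlb_finish_mmu(), this shape does not by itself require preemption to stay disabled, which is the stated goal of the series.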
 
- * [PATCH 09/17] ia64: mmu_gather rework
  2011-02-17 16:23 [PATCH 00/17] mm: mmu_gather rework Peter Zijlstra
                   ` (8 preceding siblings ...)
  2011-02-17 16:23 ` [PATCH 08/17] um: " Peter Zijlstra
@ 2011-02-17 16:23 ` Peter Zijlstra
  2011-02-17 16:23   ` Peter Zijlstra
  2011-02-17 16:23 ` [PATCH 10/17] mm: Now that all old mmu_gather code is gone, remove the storage Peter Zijlstra
                   ` (9 subsequent siblings)
  19 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Tony Luck
[-- Attachment #1: peter_zijlstra-ia64-preemptible_mmu_gather.patch --]
[-- Type: text/plain, Size: 4600 bytes --]
Fix up the ia64 mmu_gather code to conform to the new API.
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/ia64/include/asm/tlb.h |   67 ++++++++++++++++++++++++++++++--------------
 1 file changed, 47 insertions(+), 20 deletions(-)
Index: linux-2.6/arch/ia64/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/ia64/include/asm/tlb.h
+++ linux-2.6/arch/ia64/include/asm/tlb.h
@@ -47,21 +47,27 @@
 #include <asm/machvec.h>
 
 #ifdef CONFIG_SMP
-# define FREE_PTE_NR		2048
 # define tlb_fast_mode(tlb)	((tlb)->nr == ~0U)
 #else
-# define FREE_PTE_NR		0
 # define tlb_fast_mode(tlb)	(1)
 #endif
 
+/*
+ * If we can't allocate a page to make a big batch of page pointers
+ * to work on, then just handle a few from the on-stack structure.
+ */
+#define	IA64_GATHER_BUNDLE	8
+
 struct mmu_gather {
 	struct mm_struct	*mm;
 	unsigned int		nr;		/* == ~0U => fast mode */
+	unsigned int		max;
 	unsigned char		fullmm;		/* non-zero means full mm flush */
 	unsigned char		need_flush;	/* really unmapped some PTEs? */
 	unsigned long		start_addr;
 	unsigned long		end_addr;
-	struct page 		*pages[FREE_PTE_NR];
+	struct page		**pages;
+	struct page		*local[IA64_GATHER_BUNDLE];
 };
 
 struct ia64_tr_entry {
@@ -90,9 +96,6 @@ extern struct ia64_tr_entry *ia64_idtrs[
 #define RR_RID_MASK	0x00000000ffffff00L
 #define RR_TO_RID(val) 	((val >> 8) & 0xffffff)
 
-/* Users of the generic TLB shootdown code must declare this storage space. */
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 /*
  * Flush the TLB for address range START to END and, if not in fast mode, release the
  * freed pages that where gathered up to this point.
@@ -147,15 +150,23 @@ ia64_tlb_flush_mmu (struct mmu_gather *t
 	}
 }
 
-/*
- * Return a pointer to an initialized struct mmu_gather.
- */
-static inline struct mmu_gather *
-tlb_gather_mmu (struct mm_struct *mm, unsigned int full_mm_flush)
+static inline void __tlb_alloc_page(struct mmu_gather *tlb)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
+	unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
 
+	if (addr) {
+		tlb->pages = (void *)addr;
+		tlb->max = PAGE_SIZE / sizeof(void *);
+	}
+}
+
+
+static inline void
+tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
+{
 	tlb->mm = mm;
+	tlb->max = ARRAY_SIZE(tlb->local);
+	tlb->pages = tlb->local;
 	/*
 	 * Use fast mode if only 1 CPU is online.
 	 *
@@ -172,7 +183,6 @@ tlb_gather_mmu (struct mm_struct *mm, un
 	tlb->nr = (num_online_cpus() == 1) ? ~0U : 0;
 	tlb->fullmm = full_mm_flush;
 	tlb->start_addr = ~0UL;
-	return tlb;
 }
 
 /*
@@ -180,7 +190,7 @@ tlb_gather_mmu (struct mm_struct *mm, un
  * collected.
  */
 static inline void
-tlb_finish_mmu (struct mmu_gather *tlb, unsigned long start, unsigned long end)
+tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
 	/*
 	 * Note: tlb->nr may be 0 at this point, so we can't rely on tlb->start_addr and
@@ -191,7 +201,8 @@ tlb_finish_mmu (struct mmu_gather *tlb, 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
-	put_cpu_var(mmu_gathers);
+	if (tlb->pages != tlb->local)
+		free_pages((unsigned long)tlb->pages, 0);
 }
 
 /*
@@ -199,18 +210,34 @@ tlb_finish_mmu (struct mmu_gather *tlb, 
  * must be delayed until after the TLB has been flushed (see comments at the beginning of
  * this file).
  */
-static inline void
-tlb_remove_page (struct mmu_gather *tlb, struct page *page)
+static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 {
 	tlb->need_flush = 1;
 
 	if (tlb_fast_mode(tlb)) {
 		free_page_and_swap_cache(page);
-		return;
+		return 0;
 	}
+
+	if (!tlb->nr && tlb->pages == tlb->local)
+		__tlb_alloc_page(tlb);
+
 	tlb->pages[tlb->nr++] = page;
-	if (tlb->nr >= FREE_PTE_NR)
-		ia64_tlb_flush_mmu(tlb, tlb->start_addr, tlb->end_addr);
+	if (tlb->nr >= tlb->max)
+		return 1;
+
+	return 0;
+}
+
+static inline void tlb_flush_mmu(struct mmu_gather *tlb)
+{
+	ia64_tlb_flush_mmu(tlb, tlb->start_addr, tlb->end_addr);
+}
+
+static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+{
+	if (__tlb_remove_page(tlb, page))
+		tlb_flush_mmu(tlb);
 }
 
 /*
^ permalink raw reply	[flat|nested] 90+ messages in thread
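Editorial sketch of the ia64 batching scheme above, not part of the posted patch. It models, in user space, the idea that the gather starts with a small on-stack array (IA64_GATHER_BUNDLE entries), opportunistically swaps in one larger allocated array, and reports "full" to the caller through the return value of the add function. The names gather_*, BIG_BATCH, the `allow_alloc` switch and the `flushes` counter are hypothetical; malloc() stands in for __get_free_pages(), and BIG_BATCH for PAGE_SIZE / sizeof(void *).

```c
/* User-space model of the ia64 gather batching; illustrative only. */
#include <assert.h>
#include <stdlib.h>

#define GATHER_BUNDLE 8     /* mirrors IA64_GATHER_BUNDLE */
#define BIG_BATCH     512   /* assumption: stands in for PAGE_SIZE / sizeof(void *) */

struct page { int freed; }; /* stand-in for the kernel type */

struct gather {
	unsigned int nr;        /* pages currently queued */
	unsigned int max;       /* capacity of the active array */
	unsigned int flushes;   /* hypothetical counter for the demo */
	int allow_alloc;        /* hypothetical switch: 0 simulates allocation failure */
	struct page **pages;    /* active array: local[] or the allocated one */
	struct page *local[GATHER_BUNDLE];
};

static void gather_init(struct gather *g, int allow_alloc)
{
	g->nr = 0;
	g->max = GATHER_BUNDLE;
	g->flushes = 0;
	g->allow_alloc = allow_alloc;
	g->pages = g->local;
}

/* Try to switch to the big array; on failure keep using local[]. */
static void gather_try_grow(struct gather *g)
{
	struct page **big = NULL;

	if (g->allow_alloc)
		big = malloc(BIG_BATCH * sizeof(*big));
	if (big) {
		g->pages = big;
		g->max = BIG_BATCH;
	}
}

/* Release everything queued so far; models flush-then-free. */
static void gather_flush(struct gather *g)
{
	unsigned int i;

	if (!g->nr)
		return;
	for (i = 0; i < g->nr; i++)
		g->pages[i]->freed = 1;
	g->nr = 0;
	g->flushes++;
}

/* Queue one page; returns 1 when the batch is full and must be flushed. */
static int gather_add(struct gather *g, struct page *page)
{
	if (!g->nr && g->pages == g->local)
		gather_try_grow(g);
	g->pages[g->nr++] = page;
	return g->nr >= g->max;
}

static void gather_remove_page(struct gather *g, struct page *page)
{
	if (gather_add(g, page))
		gather_flush(g);
}

static void gather_finish(struct gather *g)
{
	gather_flush(g);
	if (g->pages != g->local)
		free(g->pages);
}

/* Hypothetical driver: returns how many flushes n pages needed. */
static unsigned int demo_batches(struct page *pages, unsigned int n, int allow_alloc)
{
	struct gather g;
	unsigned int i;

	gather_init(&g, allow_alloc);
	for (i = 0; i < n; i++)
		gather_remove_page(&g, &pages[i]);
	gather_finish(&g);
	return g.flushes;
}
```

Note that the grow attempt is retried whenever the batch is empty and still backed by local[], so a transient allocation failure only costs smaller batches for a while. Splitting the "queue" step from the "flush" step lets a caller that cannot flush at the point of insertion act on the return value later.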
 
- * [PATCH 10/17] mm: Now that all old mmu_gather code is gone, remove the storage
  2011-02-17 16:23 [PATCH 00/17] mm: mmu_gather rework Peter Zijlstra
                   ` (9 preceding siblings ...)
  2011-02-17 16:23 ` [PATCH 09/17] ia64: " Peter Zijlstra
@ 2011-02-17 16:23 ` Peter Zijlstra
  2011-02-17 16:23   ` Peter Zijlstra
  2011-02-17 16:23 ` [PATCH 11/17] mm, powerpc: Move the RCU page-table freeing into generic code Peter Zijlstra
                   ` (8 subsequent siblings)
  19 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang
[-- Attachment #1: arch-remove-mmu_gather.patch --]
[-- Type: text/plain, Size: 9157 bytes --]
XXX fold all the mmu_gather rework patches into one for submission
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/alpha/mm/init.c      |    2 --
 arch/arm/mm/mmu.c         |    2 --
 arch/avr32/mm/init.c      |    2 --
 arch/cris/mm/init.c       |    2 --
 arch/frv/mm/init.c        |    2 --
 arch/ia64/mm/init.c       |    2 --
 arch/m32r/mm/init.c       |    2 --
 arch/m68k/mm/init.c       |    2 --
 arch/microblaze/mm/init.c |    2 --
 arch/mips/mm/init.c       |    2 --
 arch/mn10300/mm/init.c    |    2 --
 arch/parisc/mm/init.c     |    2 --
 arch/s390/mm/pgtable.c    |    1 -
 arch/score/mm/init.c      |    2 --
 arch/sh/mm/init.c         |    1 -
 arch/sparc/mm/init_32.c   |    2 --
 arch/tile/mm/init.c       |    2 --
 arch/um/kernel/smp.c      |    3 ---
 arch/x86/mm/init.c        |    2 --
 arch/xtensa/mm/mmu.c      |    2 --
 20 files changed, 39 deletions(-)
Index: linux-2.6/arch/alpha/mm/init.c
===================================================================
--- linux-2.6.orig/arch/alpha/mm/init.c
+++ linux-2.6/arch/alpha/mm/init.c
@@ -32,8 +32,6 @@
 #include <asm/console.h>
 #include <asm/tlb.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 extern void die_if_kernel(char *,struct pt_regs *,long);
 
 static struct pcb_struct original_pcb;
Index: linux-2.6/arch/arm/mm/mmu.c
===================================================================
--- linux-2.6.orig/arch/arm/mm/mmu.c
+++ linux-2.6/arch/arm/mm/mmu.c
@@ -31,8 +31,6 @@
 
 #include "mm.h"
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 /*
  * empty_zero_page is a special page that is used for
  * zero-initialized data and COW.
Index: linux-2.6/arch/avr32/mm/init.c
===================================================================
--- linux-2.6.orig/arch/avr32/mm/init.c
+++ linux-2.6/arch/avr32/mm/init.c
@@ -25,8 +25,6 @@
 #include <asm/setup.h>
 #include <asm/sections.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_data;
 
 struct page *empty_zero_page;
Index: linux-2.6/arch/cris/mm/init.c
===================================================================
--- linux-2.6.orig/arch/cris/mm/init.c
+++ linux-2.6/arch/cris/mm/init.c
@@ -13,8 +13,6 @@
 #include <linux/bootmem.h>
 #include <asm/tlb.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 unsigned long empty_zero_page;
 
 extern char _stext, _edata, _etext; /* From linkerscript */
Index: linux-2.6/arch/frv/mm/init.c
===================================================================
--- linux-2.6.orig/arch/frv/mm/init.c
+++ linux-2.6/arch/frv/mm/init.c
@@ -41,8 +41,6 @@
 
 #undef DEBUG
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 /*
  * BAD_PAGE is the page that is used for page faults when linux
  * is out-of-memory. Older versions of linux just did a
Index: linux-2.6/arch/ia64/mm/init.c
===================================================================
--- linux-2.6.orig/arch/ia64/mm/init.c
+++ linux-2.6/arch/ia64/mm/init.c
@@ -36,8 +36,6 @@
 #include <asm/mca.h>
 #include <asm/paravirt.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 extern void ia64_tlb_init (void);
 
 unsigned long MAX_DMA_ADDRESS = PAGE_OFFSET + 0x100000000UL;
Index: linux-2.6/arch/m32r/mm/init.c
===================================================================
--- linux-2.6.orig/arch/m32r/mm/init.c
+++ linux-2.6/arch/m32r/mm/init.c
@@ -35,8 +35,6 @@ extern char __init_begin, __init_end;
 
 pgd_t swapper_pg_dir[1024];
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 /*
  * Cache of MMU context last used.
  */
Index: linux-2.6/arch/m68k/mm/init.c
===================================================================
--- linux-2.6.orig/arch/m68k/mm/init.c
+++ linux-2.6/arch/m68k/mm/init.c
@@ -32,8 +32,6 @@
 #include <asm/sections.h>
 #include <asm/tlb.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 pg_data_t pg_data_map[MAX_NUMNODES];
 EXPORT_SYMBOL(pg_data_map);
 
Index: linux-2.6/arch/microblaze/mm/init.c
===================================================================
--- linux-2.6.orig/arch/microblaze/mm/init.c
+++ linux-2.6/arch/microblaze/mm/init.c
@@ -32,8 +32,6 @@ unsigned int __page_offset;
 EXPORT_SYMBOL(__page_offset);
 
 #else
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 static int init_bootmem_done;
 #endif /* CONFIG_MMU */
 
Index: linux-2.6/arch/mips/mm/init.c
===================================================================
--- linux-2.6.orig/arch/mips/mm/init.c
+++ linux-2.6/arch/mips/mm/init.c
@@ -64,8 +64,6 @@
 
 #endif /* CONFIG_MIPS_MT_SMTC */
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 /*
  * We have up to 8 empty zeroed pages so we can map one of the right colour
  * when needed.  This is necessary only on R4000 / R4400 SC and MC versions
Index: linux-2.6/arch/mn10300/mm/init.c
===================================================================
--- linux-2.6.orig/arch/mn10300/mm/init.c
+++ linux-2.6/arch/mn10300/mm/init.c
@@ -37,8 +37,6 @@
 #include <asm/tlb.h>
 #include <asm/sections.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 unsigned long highstart_pfn, highend_pfn;
 
 #ifdef CONFIG_MN10300_HAS_ATOMIC_OPS_UNIT
Index: linux-2.6/arch/parisc/mm/init.c
===================================================================
--- linux-2.6.orig/arch/parisc/mm/init.c
+++ linux-2.6/arch/parisc/mm/init.c
@@ -31,8 +31,6 @@
 #include <asm/mmzone.h>
 #include <asm/sections.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 extern int  data_start;
 
 #ifdef CONFIG_DISCONTIGMEM
Index: linux-2.6/arch/s390/mm/pgtable.c
===================================================================
--- linux-2.6.orig/arch/s390/mm/pgtable.c
+++ linux-2.6/arch/s390/mm/pgtable.c
@@ -36,7 +36,6 @@ struct rcu_table_freelist {
 	((PAGE_SIZE - sizeof(struct rcu_table_freelist)) \
 	  / sizeof(unsigned long))
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
 static DEFINE_PER_CPU(struct rcu_table_freelist *, rcu_table_freelist);
 
 static void __page_table_free(struct mm_struct *mm, unsigned long *table);
Index: linux-2.6/arch/score/mm/init.c
===================================================================
--- linux-2.6.orig/arch/score/mm/init.c
+++ linux-2.6/arch/score/mm/init.c
@@ -38,8 +38,6 @@
 #include <asm/sections.h>
 #include <asm/tlb.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 unsigned long empty_zero_page;
 EXPORT_SYMBOL_GPL(empty_zero_page);
 
Index: linux-2.6/arch/sh/mm/init.c
===================================================================
--- linux-2.6.orig/arch/sh/mm/init.c
+++ linux-2.6/arch/sh/mm/init.c
@@ -28,7 +28,6 @@
 #include <asm/cache.h>
 #include <asm/sizes.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
 pgd_t swapper_pg_dir[PTRS_PER_PGD];
 
 void __init generic_mem_init(void)
Index: linux-2.6/arch/sparc/mm/init_32.c
===================================================================
--- linux-2.6.orig/arch/sparc/mm/init_32.c
+++ linux-2.6/arch/sparc/mm/init_32.c
@@ -37,8 +37,6 @@
 #include <asm/prom.h>
 #include <asm/leon.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 unsigned long *sparc_valid_addr_bitmap;
 EXPORT_SYMBOL(sparc_valid_addr_bitmap);
 
Index: linux-2.6/arch/tile/mm/init.c
===================================================================
--- linux-2.6.orig/arch/tile/mm/init.c
+++ linux-2.6/arch/tile/mm/init.c
@@ -71,8 +71,6 @@
 unsigned long VMALLOC_RESERVE = CONFIG_VMALLOC_RESERVE;
 #endif
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 /* Create an L2 page table */
 static pte_t * __init alloc_pte(void)
 {
Index: linux-2.6/arch/um/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/um/kernel/smp.c
+++ linux-2.6/arch/um/kernel/smp.c
@@ -7,9 +7,6 @@
 #include "asm/pgalloc.h"
 #include "asm/tlb.h"
 
-/* For some reason, mmu_gathers are referenced when CONFIG_SMP is off. */
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 #ifdef CONFIG_SMP
 
 #include "linux/sched.h"
Index: linux-2.6/arch/x86/mm/init.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init.c
+++ linux-2.6/arch/x86/mm/init.c
@@ -16,8 +16,6 @@
 #include <asm/tlb.h>
 #include <asm/proto.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 unsigned long __initdata e820_table_start;
 unsigned long __meminitdata e820_table_end;
 unsigned long __meminitdata e820_table_top;
Index: linux-2.6/arch/xtensa/mm/mmu.c
===================================================================
--- linux-2.6.orig/arch/xtensa/mm/mmu.c
+++ linux-2.6/arch/xtensa/mm/mmu.c
@@ -14,8 +14,6 @@
 #include <asm/mmu_context.h>
 #include <asm/page.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 void __init paging_init(void)
 {
 	memset(swapper_pg_dir, 0, PAGE_SIZE);
^ permalink raw reply	[flat|nested] 90+ messages in thread
- * [PATCH 10/17] mm: Now that all old mmu_gather code is gone, remove the storage
  2011-02-17 16:23 ` [PATCH 10/17] mm: Now that all old mmu_gather code is gone, remove the storage Peter Zijlstra
@ 2011-02-17 16:23   ` Peter Zijlstra
  0 siblings, 0 replies; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm, Linus Torvalds
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang
[-- Attachment #1: arch-remove-mmu_gather.patch --]
[-- Type: text/plain, Size: 8854 bytes --]
XXX fold all the mmu_gather rework patches into one for submission
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/alpha/mm/init.c      |    2 --
 arch/arm/mm/mmu.c         |    2 --
 arch/avr32/mm/init.c      |    2 --
 arch/cris/mm/init.c       |    2 --
 arch/frv/mm/init.c        |    2 --
 arch/ia64/mm/init.c       |    2 --
 arch/m32r/mm/init.c       |    2 --
 arch/m68k/mm/init.c       |    2 --
 arch/microblaze/mm/init.c |    2 --
 arch/mips/mm/init.c       |    2 --
 arch/mn10300/mm/init.c    |    2 --
 arch/parisc/mm/init.c     |    2 --
 arch/s390/mm/pgtable.c    |    1 -
 arch/score/mm/init.c      |    2 --
 arch/sh/mm/init.c         |    1 -
 arch/sparc/mm/init_32.c   |    2 --
 arch/tile/mm/init.c       |    2 --
 arch/um/kernel/smp.c      |    3 ---
 arch/x86/mm/init.c        |    2 --
 arch/xtensa/mm/mmu.c      |    2 --
 20 files changed, 39 deletions(-)
Index: linux-2.6/arch/alpha/mm/init.c
===================================================================
--- linux-2.6.orig/arch/alpha/mm/init.c
+++ linux-2.6/arch/alpha/mm/init.c
@@ -32,8 +32,6 @@
 #include <asm/console.h>
 #include <asm/tlb.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 extern void die_if_kernel(char *,struct pt_regs *,long);
 
 static struct pcb_struct original_pcb;
Index: linux-2.6/arch/arm/mm/mmu.c
===================================================================
--- linux-2.6.orig/arch/arm/mm/mmu.c
+++ linux-2.6/arch/arm/mm/mmu.c
@@ -31,8 +31,6 @@
 
 #include "mm.h"
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 /*
  * empty_zero_page is a special page that is used for
  * zero-initialized data and COW.
Index: linux-2.6/arch/avr32/mm/init.c
===================================================================
--- linux-2.6.orig/arch/avr32/mm/init.c
+++ linux-2.6/arch/avr32/mm/init.c
@@ -25,8 +25,6 @@
 #include <asm/setup.h>
 #include <asm/sections.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_data;
 
 struct page *empty_zero_page;
Index: linux-2.6/arch/cris/mm/init.c
===================================================================
--- linux-2.6.orig/arch/cris/mm/init.c
+++ linux-2.6/arch/cris/mm/init.c
@@ -13,8 +13,6 @@
 #include <linux/bootmem.h>
 #include <asm/tlb.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 unsigned long empty_zero_page;
 
 extern char _stext, _edata, _etext; /* From linkerscript */
Index: linux-2.6/arch/frv/mm/init.c
===================================================================
--- linux-2.6.orig/arch/frv/mm/init.c
+++ linux-2.6/arch/frv/mm/init.c
@@ -41,8 +41,6 @@
 
 #undef DEBUG
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 /*
  * BAD_PAGE is the page that is used for page faults when linux
  * is out-of-memory. Older versions of linux just did a
Index: linux-2.6/arch/ia64/mm/init.c
===================================================================
--- linux-2.6.orig/arch/ia64/mm/init.c
+++ linux-2.6/arch/ia64/mm/init.c
@@ -36,8 +36,6 @@
 #include <asm/mca.h>
 #include <asm/paravirt.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 extern void ia64_tlb_init (void);
 
 unsigned long MAX_DMA_ADDRESS = PAGE_OFFSET + 0x100000000UL;
Index: linux-2.6/arch/m32r/mm/init.c
===================================================================
--- linux-2.6.orig/arch/m32r/mm/init.c
+++ linux-2.6/arch/m32r/mm/init.c
@@ -35,8 +35,6 @@ extern char __init_begin, __init_end;
 
 pgd_t swapper_pg_dir[1024];
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 /*
  * Cache of MMU context last used.
  */
Index: linux-2.6/arch/m68k/mm/init.c
===================================================================
--- linux-2.6.orig/arch/m68k/mm/init.c
+++ linux-2.6/arch/m68k/mm/init.c
@@ -32,8 +32,6 @@
 #include <asm/sections.h>
 #include <asm/tlb.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 pg_data_t pg_data_map[MAX_NUMNODES];
 EXPORT_SYMBOL(pg_data_map);
 
Index: linux-2.6/arch/microblaze/mm/init.c
===================================================================
--- linux-2.6.orig/arch/microblaze/mm/init.c
+++ linux-2.6/arch/microblaze/mm/init.c
@@ -32,8 +32,6 @@ unsigned int __page_offset;
 EXPORT_SYMBOL(__page_offset);
 
 #else
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 static int init_bootmem_done;
 #endif /* CONFIG_MMU */
 
Index: linux-2.6/arch/mips/mm/init.c
===================================================================
--- linux-2.6.orig/arch/mips/mm/init.c
+++ linux-2.6/arch/mips/mm/init.c
@@ -64,8 +64,6 @@
 
 #endif /* CONFIG_MIPS_MT_SMTC */
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 /*
  * We have up to 8 empty zeroed pages so we can map one of the right colour
  * when needed.  This is necessary only on R4000 / R4400 SC and MC versions
Index: linux-2.6/arch/mn10300/mm/init.c
===================================================================
--- linux-2.6.orig/arch/mn10300/mm/init.c
+++ linux-2.6/arch/mn10300/mm/init.c
@@ -37,8 +37,6 @@
 #include <asm/tlb.h>
 #include <asm/sections.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 unsigned long highstart_pfn, highend_pfn;
 
 #ifdef CONFIG_MN10300_HAS_ATOMIC_OPS_UNIT
Index: linux-2.6/arch/parisc/mm/init.c
===================================================================
--- linux-2.6.orig/arch/parisc/mm/init.c
+++ linux-2.6/arch/parisc/mm/init.c
@@ -31,8 +31,6 @@
 #include <asm/mmzone.h>
 #include <asm/sections.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 extern int  data_start;
 
 #ifdef CONFIG_DISCONTIGMEM
Index: linux-2.6/arch/s390/mm/pgtable.c
===================================================================
--- linux-2.6.orig/arch/s390/mm/pgtable.c
+++ linux-2.6/arch/s390/mm/pgtable.c
@@ -36,7 +36,6 @@ struct rcu_table_freelist {
 	((PAGE_SIZE - sizeof(struct rcu_table_freelist)) \
 	  / sizeof(unsigned long))
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
 static DEFINE_PER_CPU(struct rcu_table_freelist *, rcu_table_freelist);
 
 static void __page_table_free(struct mm_struct *mm, unsigned long *table);
Index: linux-2.6/arch/score/mm/init.c
===================================================================
--- linux-2.6.orig/arch/score/mm/init.c
+++ linux-2.6/arch/score/mm/init.c
@@ -38,8 +38,6 @@
 #include <asm/sections.h>
 #include <asm/tlb.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 unsigned long empty_zero_page;
 EXPORT_SYMBOL_GPL(empty_zero_page);
 
Index: linux-2.6/arch/sh/mm/init.c
===================================================================
--- linux-2.6.orig/arch/sh/mm/init.c
+++ linux-2.6/arch/sh/mm/init.c
@@ -28,7 +28,6 @@
 #include <asm/cache.h>
 #include <asm/sizes.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
 pgd_t swapper_pg_dir[PTRS_PER_PGD];
 
 void __init generic_mem_init(void)
Index: linux-2.6/arch/sparc/mm/init_32.c
===================================================================
--- linux-2.6.orig/arch/sparc/mm/init_32.c
+++ linux-2.6/arch/sparc/mm/init_32.c
@@ -37,8 +37,6 @@
 #include <asm/prom.h>
 #include <asm/leon.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 unsigned long *sparc_valid_addr_bitmap;
 EXPORT_SYMBOL(sparc_valid_addr_bitmap);
 
Index: linux-2.6/arch/tile/mm/init.c
===================================================================
--- linux-2.6.orig/arch/tile/mm/init.c
+++ linux-2.6/arch/tile/mm/init.c
@@ -71,8 +71,6 @@
 unsigned long VMALLOC_RESERVE = CONFIG_VMALLOC_RESERVE;
 #endif
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 /* Create an L2 page table */
 static pte_t * __init alloc_pte(void)
 {
Index: linux-2.6/arch/um/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/um/kernel/smp.c
+++ linux-2.6/arch/um/kernel/smp.c
@@ -7,9 +7,6 @@
 #include "asm/pgalloc.h"
 #include "asm/tlb.h"
 
-/* For some reason, mmu_gathers are referenced when CONFIG_SMP is off. */
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 #ifdef CONFIG_SMP
 
 #include "linux/sched.h"
Index: linux-2.6/arch/x86/mm/init.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init.c
+++ linux-2.6/arch/x86/mm/init.c
@@ -16,8 +16,6 @@
 #include <asm/tlb.h>
 #include <asm/proto.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 unsigned long __initdata e820_table_start;
 unsigned long __meminitdata e820_table_end;
 unsigned long __meminitdata e820_table_top;
Index: linux-2.6/arch/xtensa/mm/mmu.c
===================================================================
--- linux-2.6.orig/arch/xtensa/mm/mmu.c
+++ linux-2.6/arch/xtensa/mm/mmu.c
@@ -14,8 +14,6 @@
 #include <asm/mmu_context.h>
 #include <asm/page.h>
 
-DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
-
 void __init paging_init(void)
 {
 	memset(swapper_pg_dir, 0, PAGE_SIZE);
 
- * [PATCH 11/17] mm, powerpc: Move the RCU page-table freeing into generic code
  2011-02-17 16:23 [PATCH 00/17] mm: mmu_gather rework Peter Zijlstra
                   ` (10 preceding siblings ...)
  2011-02-17 16:23 ` [PATCH 10/17] mm: Now that all old mmu_gather code is gone, remove the storage Peter Zijlstra
@ 2011-02-17 16:23 ` Peter Zijlstra
  2011-02-17 16:23   ` Peter Zijlstra
  2011-02-17 16:23 ` [PATCH 12/17] s390: use generic RCP page-table freeing Peter Zijlstra
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang
[-- Attachment #1: peter_zijlstra-mm_powerpc-move_the_rcu_page-table_freeing_into.patch --]
[-- Type: text/plain, Size: 12860 bytes --]
In case other architectures require RCU freed page-tables to implement
gup_fast() and software filled hashes and similar things, provide the
means to do so by moving the logic into generic code.
Requested-by: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
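Illustration only, not part of the patch: a minimal userspace sketch of the
batching logic that this patch moves into generic code. Tables are collected
in a page-sized batch, the batch is handed off when it fills up, and a single
table is freed directly when no batch page can be allocated. The names
struct gather, gather_remove_table(), gather_flush(), BATCH_BYTES and the
fail_alloc hook are hypothetical; the deferred free is modelled as an
immediate call where the kernel would use call_rcu_sched(), and the direct
path omits the IPI and the mm_users shortcut.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BATCH_BYTES 4096	/* assumed stand-in for PAGE_SIZE */

struct table_batch {
	unsigned int nr;
	void *tables[];		/* fills the rest of the page-sized block */
};

#define MAX_BATCH \
	((BATCH_BYTES - sizeof(struct table_batch)) / sizeof(void *))

struct gather {
	struct table_batch *batch;
	int fail_alloc;			/* hypothetical hook: simulate allocation failure */
	unsigned int freed_deferred;	/* tables released through a batch */
	unsigned int freed_direct;	/* tables released by the fallback path */
};

/* Stand-in for the RCU callback: account for every table, drop the batch. */
static void batch_free(struct gather *g, struct table_batch *b)
{
	g->freed_deferred += b->nr;
	free(b);
}

/* Hand off the current batch, if any. The kernel would defer this via RCU. */
static void gather_flush(struct gather *g)
{
	if (g->batch) {
		batch_free(g, g->batch);
		g->batch = NULL;
	}
}

static void gather_remove_table(struct gather *g, void *table)
{
	if (!g->batch) {
		g->batch = g->fail_alloc ? NULL : malloc(BATCH_BYTES);
		if (!g->batch) {
			/* Fallback: release this one table right away. */
			(void)table;
			g->freed_direct++;
			return;
		}
		g->batch->nr = 0;
	}
	assert(g->batch->nr < MAX_BATCH);
	g->batch->tables[g->batch->nr++] = table;
	if (g->batch->nr == MAX_BATCH)
		gather_flush(g);
}
```

Starting from a zero-initialised struct gather, MAX_BATCH calls to
gather_remove_table() fill and hand off exactly one batch. With fail_alloc
set, each call takes the direct path instead, so forward progress does not
depend on the allocation succeeding.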
 arch/Kconfig                       |    3 +
 arch/powerpc/Kconfig               |    1 
 arch/powerpc/include/asm/pgalloc.h |   21 ++++++-
 arch/powerpc/include/asm/tlb.h     |   10 ---
 arch/powerpc/mm/pgtable.c          |   98 -------------------------------------
 arch/powerpc/mm/tlb_hash32.c       |    3 -
 arch/powerpc/mm/tlb_hash64.c       |    3 -
 arch/powerpc/mm/tlb_nohash.c       |    3 -
 include/asm-generic/tlb.h          |   57 +++++++++++++++++++--
 mm/memory.c                        |   77 +++++++++++++++++++++++++++++
 10 files changed, 151 insertions(+), 125 deletions(-)
Index: linux-2.6/arch/powerpc/include/asm/pgalloc.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/pgalloc.h
+++ linux-2.6/arch/powerpc/include/asm/pgalloc.h
@@ -31,14 +31,29 @@ static inline void pte_free(struct mm_st
 #endif
 
 #ifdef CONFIG_SMP
-extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
-extern void pte_free_finish(struct mmu_gather *tlb);
+struct mmu_gather;
+extern void tlb_remove_table(struct mmu_gather *, void *);
+
+static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
+{
+	unsigned long pgf = (unsigned long)table;
+	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+	pgf |= shift;
+	tlb_remove_table(tlb, (void *)pgf);
+}
+
+static inline void __tlb_remove_table(void *_table)
+{
+	void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
+	unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
+
+	pgtable_free(table, shift);
+}
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
 	pgtable_free(table, shift);
 }
-static inline void pte_free_finish(struct mmu_gather *tlb) { }
 #endif /* !CONFIG_SMP */
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
Index: linux-2.6/arch/powerpc/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/tlb.h
+++ linux-2.6/arch/powerpc/include/asm/tlb.h
@@ -28,16 +28,6 @@
 #define tlb_start_vma(tlb, vma)	do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
 
-#define HAVE_ARCH_MMU_GATHER 1
-
-struct pte_freelist_batch;
-
-struct arch_mmu_gather {
-	struct pte_freelist_batch *batch;
-};
-
-#define ARCH_MMU_GATHER_INIT (struct arch_mmu_gather){ .batch = NULL, }
-
 extern void tlb_flush(struct mmu_gather *tlb);
 
 /* Get the generic bits... */
Index: linux-2.6/arch/powerpc/mm/pgtable.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/pgtable.c
+++ linux-2.6/arch/powerpc/mm/pgtable.c
@@ -33,104 +33,6 @@
 
 #include "mmu_decl.h"
 
-#ifdef CONFIG_SMP
-
-/*
- * Handle batching of page table freeing on SMP. Page tables are
- * queued up and send to be freed later by RCU in order to avoid
- * freeing a page table page that is being walked without locks
- */
-
-static unsigned long pte_freelist_forced_free;
-
-struct pte_freelist_batch
-{
-	struct rcu_head	rcu;
-	unsigned int	index;
-	unsigned long	tables[0];
-};
-
-#define PTE_FREELIST_SIZE \
-	((PAGE_SIZE - sizeof(struct pte_freelist_batch)) \
-	  / sizeof(unsigned long))
-
-static void pte_free_smp_sync(void *arg)
-{
-	/* Do nothing, just ensure we sync with all CPUs */
-}
-
-/* This is only called when we are critically out of memory
- * (and fail to get a page in pte_free_tlb).
- */
-static void pgtable_free_now(void *table, unsigned shift)
-{
-	pte_freelist_forced_free++;
-
-	smp_call_function(pte_free_smp_sync, NULL, 1);
-
-	pgtable_free(table, shift);
-}
-
-static void pte_free_rcu_callback(struct rcu_head *head)
-{
-	struct pte_freelist_batch *batch =
-		container_of(head, struct pte_freelist_batch, rcu);
-	unsigned int i;
-
-	for (i = 0; i < batch->index; i++) {
-		void *table = (void *)(batch->tables[i] & ~MAX_PGTABLE_INDEX_SIZE);
-		unsigned shift = batch->tables[i] & MAX_PGTABLE_INDEX_SIZE;
-
-		pgtable_free(table, shift);
-	}
-
-	free_page((unsigned long)batch);
-}
-
-static void pte_free_submit(struct pte_freelist_batch *batch)
-{
-	call_rcu_sched(&batch->rcu, pte_free_rcu_callback);
-}
-
-void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
-{
-	struct pte_freelist_batch **batchp = &tlb->arch.batch;
-	unsigned long pgf;
-
-	if (atomic_read(&tlb->mm->mm_users) < 2) {
-		pgtable_free(table, shift);
-		return;
-	}
-
-	if (*batchp == NULL) {
-		*batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC);
-		if (*batchp == NULL) {
-			pgtable_free_now(table, shift);
-			return;
-		}
-		(*batchp)->index = 0;
-	}
-	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
-	pgf = (unsigned long)table | shift;
-	(*batchp)->tables[(*batchp)->index++] = pgf;
-	if ((*batchp)->index == PTE_FREELIST_SIZE) {
-		pte_free_submit(*batchp);
-		*batchp = NULL;
-	}
-}
-
-void pte_free_finish(struct mmu_gather *tlb)
-{
-	struct pte_freelist_batch **batchp = &tlb->arch.batch;
-
-	if (*batchp == NULL)
-		return;
-	pte_free_submit(*batchp);
-	*batchp = NULL;
-}
-
-#endif /* CONFIG_SMP */
-
 static inline int is_exec_fault(void)
 {
 	return current->thread.regs && TRAP(current->thread.regs) == 0x400;
Index: linux-2.6/arch/powerpc/mm/tlb_hash32.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash32.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash32.c
@@ -71,9 +71,6 @@ void tlb_flush(struct mmu_gather *tlb)
 		 */
 		_tlbia();
 	}
-
-	/* Push out batch of freed page tables */
-	pte_free_finish(tlb);
 }
 
 /*
Index: linux-2.6/arch/powerpc/mm/tlb_hash64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_hash64.c
+++ linux-2.6/arch/powerpc/mm/tlb_hash64.c
@@ -165,9 +165,6 @@ void tlb_flush(struct mmu_gather *tlb)
 		__flush_tlb_pending(tlbbatch);
 
 	put_cpu_var(ppc64_tlb_batch);
-
-	/* Push out batch of freed page tables */
-	pte_free_finish(tlb);
 }
 
 /**
Index: linux-2.6/arch/powerpc/mm/tlb_nohash.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_nohash.c
+++ linux-2.6/arch/powerpc/mm/tlb_nohash.c
@@ -299,9 +299,6 @@ EXPORT_SYMBOL(flush_tlb_range);
 void tlb_flush(struct mmu_gather *tlb)
 {
 	flush_tlb_mm(tlb->mm);
-
-	/* Push out batch of freed page tables */
-	pte_free_finish(tlb);
 }
 
 /*
Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -29,6 +29,49 @@
   #define tlb_fast_mode(tlb) 1
 #endif
 
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+/*
+ * Semi RCU freeing of the page directories.
+ *
+ * This is needed by some architectures to implement software pagetable walkers.
+ *
+ * gup_fast() and other software pagetable walkers do a lockless page-table
+ * walk and therefore need some synchronization with the freeing of the page
+ * directories. The chosen means to accomplish that is by disabling IRQs over
+ * the walk.
+ *
+ * Architectures that use IPIs to flush TLBs will then automagically DTRT,
+ * since we unlink the page, flush TLBs, free the page. Since the disabling of
+ * IRQs delays the completion of the TLB flush we can never observe an already
+ * freed page.
+ *
+ * Architectures that do not have this (PPC) need to delay the freeing by some
+ * other means, this is that means.
+ *
+ * What we do is batch the freed directory pages (tables) and RCU free them.
+ * We use the sched RCU variant, as that guarantees that IRQ/preempt disabling
+ * holds off grace periods.
+ *
+ * However, in order to batch these pages we need to allocate storage, this
+ * allocation is deep inside the MM code and can thus easily fail on memory
+ * pressure. To guarantee progress we fall back to single table freeing, see
+ * the implementation of tlb_remove_table_one().
+ *
+ */
+struct mmu_table_batch {
+	struct rcu_head		rcu;
+	unsigned int		nr;
+	void			*tables[0];
+};
+
+#define MAX_TABLE_BATCH		\
+	((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(void *))
+
+extern void tlb_table_flush(struct mmu_gather *tlb);
+extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
+
+#endif
+
 /*
  * If we can't allocate a page to make a big patch of page pointers
  * to work on, then just handle a few from the on-stack structure.
@@ -44,11 +87,12 @@ struct mmu_gather {
 	unsigned int		max;	/* nr < max */
 	unsigned int		need_flush;/* Really unmapped some ptes? */
 	unsigned int		fullmm; /* non-zero means full mm flush */
-#ifdef HAVE_ARCH_MMU_GATHER
-	struct arch_mmu_gather	arch;
-#endif
 	struct page		**pages;
 	struct page		*local[MMU_GATHER_BUNDLE];
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	struct mmu_table_batch	*batch;
+#endif
 };
 
 static inline void __tlb_alloc_page(struct mmu_gather *tlb)
@@ -80,8 +124,8 @@ tlb_gather_mmu(struct mmu_gather *tlb, s
 
 	tlb->fullmm = full_mm_flush;
 
-#ifdef HAVE_ARCH_MMU_GATHER
-	tlb->arch = ARCH_MMU_GATHER_INIT;
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	tlb->batch = NULL;
 #endif
 }
 
@@ -92,6 +136,9 @@ tlb_flush_mmu(struct mmu_gather *tlb)
 		return;
 	tlb->need_flush = 0;
 	tlb_flush(tlb);
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	tlb_table_flush(tlb);
+#endif
 	if (!tlb_fast_mode(tlb)) {
 		free_pages_and_swap_cache(tlb->pages, tlb->nr);
 		tlb->nr = 0;
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -193,6 +193,83 @@ static void check_sync_rss_stat(struct t
 
 #endif
 
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+
+/*
+ * See the comment near struct mmu_table_batch.
+ */
+
+static void tlb_remove_table_smp_sync(void *arg)
+{
+	/* Simply deliver the interrupt */
+}
+
+static void tlb_remove_table_one(void *table)
+{
+	/*
+	 * This isn't an RCU grace period and hence the page-tables cannot be
+	 * assumed to be actually RCU-freed.
+	 *
+	 * It is however sufficient for software page-table walkers that rely on
+	 * IRQ disabling. See the comment near struct mmu_table_batch.
+	 */
+	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
+	__tlb_remove_table(table);
+}
+
+static void tlb_remove_table_rcu(struct rcu_head *head)
+{
+	struct mmu_table_batch *batch;
+	int i;
+
+	batch = container_of(head, struct mmu_table_batch, rcu);
+
+	for (i = 0; i < batch->nr; i++)
+		__tlb_remove_table(batch->tables[i]);
+
+	free_page((unsigned long)batch);
+}
+
+void tlb_table_flush(struct mmu_gather *tlb)
+{
+	struct mmu_table_batch **batch = &tlb->batch;
+
+	if (*batch) {
+		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
+		*batch = NULL;
+	}
+}
+
+void tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+	struct mmu_table_batch **batch = &tlb->batch;
+
+	tlb->need_flush = 1;
+
+	/*
+	 * When there are fewer than two users of this mm there cannot be a
+	 * concurrent page-table walk.
+	 */
+	if (atomic_read(&tlb->mm->mm_users) < 2) {
+		__tlb_remove_table(table);
+		return;
+	}
+
+	if (*batch == NULL) {
+		*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
+		if (*batch == NULL) {
+			tlb_remove_table_one(table);
+			return;
+		}
+		(*batch)->nr = 0;
+	}
+	(*batch)->tables[(*batch)->nr++] = table;
+	if ((*batch)->nr == MAX_TABLE_BATCH)
+		tlb_table_flush(tlb);
+}
+
+#endif
+
 /*
  * If a p?d_bad entry is found while walking page tables, report
  * the error, before resetting entry to p?d_none.  Usually (but
Index: linux-2.6/arch/Kconfig
===================================================================
--- linux-2.6.orig/arch/Kconfig
+++ linux-2.6/arch/Kconfig
@@ -178,4 +178,7 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_ARCH_MUTEX_CPU_RELAX
 	bool
 
+config HAVE_RCU_TABLE_FREE
+	bool
+
 source "kernel/gcov/Kconfig"
Index: linux-2.6/arch/powerpc/Kconfig
===================================================================
--- linux-2.6.orig/arch/powerpc/Kconfig
+++ linux-2.6/arch/powerpc/Kconfig
@@ -134,6 +134,7 @@ config PPC
 	select HAVE_GENERIC_HARDIRQS
 	select HAVE_SPARSE_IRQ
 	select IRQ_PER_CPU
+	select HAVE_RCU_TABLE_FREE if SMP
 
 config EARLY_PRINTK
 	bool
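
Illustration only, not part of the patch: the powerpc pgtable_free_tlb() and
__tlb_remove_table() helpers above pass a table pointer and its index size
through a single void * by storing the index size in the low bits of the
pointer. A minimal userspace sketch of that encoding follows. TAG_MASK is an
assumed stand-in for MAX_PGTABLE_INDEX_SIZE, and pack_table(),
unpack_table() and unpack_shift() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed stand-in for MAX_PGTABLE_INDEX_SIZE: low pointer bits used as a tag. */
#define TAG_MASK ((uintptr_t)0xf)

/* Store a small index in the low bits of a suitably aligned pointer. */
static void *pack_table(void *table, unsigned int shift)
{
	uintptr_t v = (uintptr_t)table;

	assert(shift <= TAG_MASK);	/* mirrors the BUG_ON() in the patch */
	assert((v & TAG_MASK) == 0);	/* the low bits must be free */
	return (void *)(v | shift);
}

/* Recover the original pointer by clearing the tag bits. */
static void *unpack_table(void *packed)
{
	return (void *)((uintptr_t)packed & ~TAG_MASK);
}

/* Recover the index stored in the tag bits. */
static unsigned int unpack_shift(void *packed)
{
	return (unsigned int)((uintptr_t)packed & TAG_MASK);
}
```

The encoding only round-trips when the table is aligned to at least
TAG_MASK + 1 bytes, which is why the sketch asserts on the low bits before
packing.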
 
- * [PATCH 12/17] s390: use generic RCP page-table freeing
  2011-02-17 16:23 [PATCH 00/17] mm: mmu_gather rework Peter Zijlstra
                   ` (11 preceding siblings ...)
  2011-02-17 16:23 ` [PATCH 11/17] mm, powerpc: Move the RCU page-table freeing into generic code Peter Zijlstra
@ 2011-02-17 16:23 ` Peter Zijlstra
  2011-02-17 16:23   ` Peter Zijlstra
  2011-02-17 16:23 ` [PATCH 13/17] mm: Extended batches for generic mmu_gather Peter Zijlstra
                   ` (6 subsequent siblings)
  19 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Martin Schwidefsky
[-- Attachment #1: martin_schwidefsky-s390-use_generic_rcp_page-table_freeing.patch --]
[-- Type: text/plain, Size: 16522 bytes --]
Now that we have a generic implementation for RCU based page table
freeing, use it for s390 as well. It saves a couple of lines.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20110215193717.30c2bb0a@mschwide.boeblingen.de.ibm.com>
---
 arch/s390/Kconfig               |    1 
 arch/s390/include/asm/pgalloc.h |   19 +--
 arch/s390/include/asm/tlb.h     |   92 ++++++++-----------
 arch/s390/mm/pgtable.c          |  192 +++++-----------------------------------
 4 files changed, 76 insertions(+), 228 deletions(-)
Index: linux-2.6/arch/s390/Kconfig
===================================================================
--- linux-2.6.orig/arch/s390/Kconfig
+++ linux-2.6/arch/s390/Kconfig
@@ -87,6 +87,7 @@ config S390
 	select HAVE_KERNEL_LZO
 	select HAVE_GET_USER_PAGES_FAST
 	select HAVE_ARCH_MUTEX_CPU_RELAX
+	select HAVE_RCU_TABLE_FREE
 	select ARCH_INLINE_SPIN_TRYLOCK
 	select ARCH_INLINE_SPIN_TRYLOCK_BH
 	select ARCH_INLINE_SPIN_LOCK
Index: linux-2.6/arch/s390/include/asm/pgalloc.h
===================================================================
--- linux-2.6.orig/arch/s390/include/asm/pgalloc.h
+++ linux-2.6/arch/s390/include/asm/pgalloc.h
@@ -20,12 +20,11 @@
 #define check_pgt_cache()	do {} while (0)
 
 unsigned long *crst_table_alloc(struct mm_struct *, int);
-void crst_table_free(struct mm_struct *, unsigned long *);
-void crst_table_free_rcu(struct mm_struct *, unsigned long *);
+void crst_table_free(unsigned long *);
 
 unsigned long *page_table_alloc(struct mm_struct *);
-void page_table_free(struct mm_struct *, unsigned long *);
-void page_table_free_rcu(struct mm_struct *, unsigned long *);
+void page_table_free(unsigned long *);
+
 void disable_noexec(struct mm_struct *, struct task_struct *);
 
 static inline void clear_table(unsigned long *s, unsigned long val, size_t n)
@@ -95,7 +94,7 @@ static inline pud_t *pud_alloc_one(struc
 		crst_table_init(table, _REGION3_ENTRY_EMPTY);
 	return (pud_t *) table;
 }
-#define pud_free(mm, pud) crst_table_free(mm, (unsigned long *) pud)
+#define pud_free(mm, pud) crst_table_free((unsigned long *) pud)
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long vmaddr)
 {
@@ -104,7 +103,7 @@ static inline pmd_t *pmd_alloc_one(struc
 		crst_table_init(table, _SEGMENT_ENTRY_EMPTY);
 	return (pmd_t *) table;
 }
-#define pmd_free(mm, pmd) crst_table_free(mm, (unsigned long *) pmd)
+#define pmd_free(mm, pmd) crst_table_free((unsigned long *) pmd)
 
 static inline void pgd_populate_kernel(struct mm_struct *mm,
 				       pgd_t *pgd, pud_t *pud)
@@ -148,7 +147,7 @@ static inline pgd_t *pgd_alloc(struct mm
 	return (pgd_t *)
 		crst_table_alloc(mm, user_mode == SECONDARY_SPACE_MODE);
 }
-#define pgd_free(mm, pgd) crst_table_free(mm, (unsigned long *) pgd)
+#define pgd_free(mm, pgd) crst_table_free((unsigned long *) pgd)
 
 static inline void pmd_populate_kernel(struct mm_struct *mm,
 				       pmd_t *pmd, pte_t *pte)
@@ -175,9 +174,7 @@ static inline void pmd_populate(struct m
 #define pte_alloc_one_kernel(mm, vmaddr) ((pte_t *) page_table_alloc(mm))
 #define pte_alloc_one(mm, vmaddr) ((pte_t *) page_table_alloc(mm))
 
-#define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
-#define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
-
-extern void rcu_table_freelist_finish(void);
+#define pte_free_kernel(mm, pte) page_table_free((unsigned long *) pte)
+#define pte_free(mm, pte) page_table_free((unsigned long *) pte)
 
 #endif /* _S390_PGALLOC_H */
Index: linux-2.6/arch/s390/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/s390/include/asm/tlb.h
+++ linux-2.6/arch/s390/include/asm/tlb.h
@@ -29,50 +29,45 @@
 #include <asm/smp.h>
 #include <asm/tlbflush.h>
 
+struct mmu_table_batch {
+	struct rcu_head rcu;
+	unsigned int nr;
+	void *tables[0];
+};
+
+#define MAX_TABLE_BATCH \
+	((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(void *))
+
+void tlb_table_flush(struct mmu_gather *);
+void tlb_remove_table(struct mmu_gather *, void *);
+
 struct mmu_gather {
 	struct mm_struct *mm;
 	unsigned int fullmm;
-	unsigned int nr_ptes;
-	unsigned int nr_pxds;
-	unsigned int max;
-	void **array;
-	void *local[8];
+	struct mmu_table_batch *batch;
+	/* need_flush is used only for page tables */
+	unsigned int need_flush : 1;
 };
 
-static inline void __tlb_alloc_page(struct mmu_gather *tlb)
-{
-	unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
-
-	if (addr) {
-		tlb->array = (void *) addr;
-		tlb->max = PAGE_SIZE / sizeof(void *);
-	}
-}
-
 static inline void tlb_gather_mmu(struct mmu_gather *tlb,
 				  struct mm_struct *mm,
 				  unsigned int full_mm_flush)
 {
 	tlb->mm = mm;
-	tlb->max = ARRAY_SIZE(tlb->local);
-	tlb->array = tlb->local;
 	tlb->fullmm = full_mm_flush;
 	if (tlb->fullmm)
 		__tlb_flush_mm(mm);
-	else
-		__tlb_alloc_page(tlb);
-	tlb->nr_ptes = 0;
-	tlb->nr_pxds = tlb->max;
+	tlb->batch = NULL;
+	tlb->need_flush = 0;
 }
 
 static inline void tlb_flush_mmu(struct mmu_gather *tlb)
 {
-	if (!tlb->fullmm && (tlb->nr_ptes > 0 || tlb->nr_pxds < tlb->max))
+	if (tlb->need_flush) {
+		tlb->need_flush = 0;
 		__tlb_flush_mm(tlb->mm);
-	while (tlb->nr_ptes > 0)
-		page_table_free_rcu(tlb->mm, tlb->array[--tlb->nr_ptes]);
-	while (tlb->nr_pxds < tlb->max)
-		crst_table_free_rcu(tlb->mm, tlb->array[tlb->nr_pxds++]);
+		tlb_table_flush(tlb);
+	}
 }
 
 static inline void tlb_finish_mmu(struct mmu_gather *tlb,
@@ -80,13 +75,8 @@ static inline void tlb_finish_mmu(struct
 {
 	tlb_flush_mmu(tlb);
 
-	rcu_table_freelist_finish();
-
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
-
-	if (tlb->array != tlb->local)
-		free_pages((unsigned long) tlb->array, 0);
 }
 
 /*
@@ -112,12 +102,11 @@ static inline void tlb_remove_page(struc
 static inline void pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
 				unsigned long address)
 {
-	if (!tlb->fullmm) {
-		tlb->array[tlb->nr_ptes++] = pte;
-		if (tlb->nr_ptes >= tlb->nr_pxds)
-			tlb_flush_mmu(tlb);
-	} else
-		page_table_free(tlb->mm, (unsigned long *) pte);
+	if (!tlb->fullmm)
+		/* Use LSB to distinguish crst table vs. page table */
+		tlb_remove_table(tlb, (void *) pte + 1);
+	else
+		page_table_free((unsigned long *) pte);
 }
 
 /*
@@ -133,12 +122,10 @@ static inline void pmd_free_tlb(struct m
 #ifdef __s390x__
 	if (tlb->mm->context.asce_limit <= (1UL << 31))
 		return;
-	if (!tlb->fullmm) {
-		tlb->array[--tlb->nr_pxds] = pmd;
-		if (tlb->nr_ptes >= tlb->nr_pxds)
-			tlb_flush_mmu(tlb);
-	} else
-		crst_table_free(tlb->mm, (unsigned long *) pmd);
+	if (!tlb->fullmm)
+		tlb_remove_table(tlb, pmd);
+	else
+		crst_table_free((unsigned long *) pmd);
 #endif
 }
 
@@ -155,15 +142,22 @@ static inline void pud_free_tlb(struct m
 #ifdef __s390x__
 	if (tlb->mm->context.asce_limit <= (1UL << 42))
 		return;
-	if (!tlb->fullmm) {
-		tlb->array[--tlb->nr_pxds] = pud;
-		if (tlb->nr_ptes >= tlb->nr_pxds)
-			tlb_flush_mmu(tlb);
-	} else
-		crst_table_free(tlb->mm, (unsigned long *) pud);
+	if (!tlb->fullmm)
+		tlb_remove_table(tlb, pud);
+	else
+		crst_table_free((unsigned long *) pud);
 #endif
 }
 
+static inline void __tlb_remove_table(void *table)
+{
+	/* Use LSB to distinguish crst table vs. page table */
+	if ((unsigned long) table & 1)
+		page_table_free(table - 1);
+	else
+		crst_table_free(table);
+}
+
 #define tlb_start_vma(tlb, vma)			do { } while (0)
 #define tlb_end_vma(tlb, vma)			do { } while (0)
 #define tlb_remove_tlb_entry(tlb, ptep, addr)	do { } while (0)
Index: linux-2.6/arch/s390/mm/pgtable.c
===================================================================
--- linux-2.6.orig/arch/s390/mm/pgtable.c
+++ linux-2.6/arch/s390/mm/pgtable.c
@@ -24,92 +24,17 @@
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
 
-struct rcu_table_freelist {
-	struct rcu_head rcu;
-	struct mm_struct *mm;
-	unsigned int pgt_index;
-	unsigned int crst_index;
-	unsigned long *table[0];
-};
-
-#define RCU_FREELIST_SIZE \
-	((PAGE_SIZE - sizeof(struct rcu_table_freelist)) \
-	  / sizeof(unsigned long))
-
-static DEFINE_PER_CPU(struct rcu_table_freelist *, rcu_table_freelist);
-
-static void __page_table_free(struct mm_struct *mm, unsigned long *table);
-static void __crst_table_free(struct mm_struct *mm, unsigned long *table);
-
-static struct rcu_table_freelist *rcu_table_freelist_get(struct mm_struct *mm)
-{
-	struct rcu_table_freelist **batchp = &__get_cpu_var(rcu_table_freelist);
-	struct rcu_table_freelist *batch = *batchp;
-
-	if (batch)
-		return batch;
-	batch = (struct rcu_table_freelist *) __get_free_page(GFP_ATOMIC);
-	if (batch) {
-		batch->mm = mm;
-		batch->pgt_index = 0;
-		batch->crst_index = RCU_FREELIST_SIZE;
-		*batchp = batch;
-	}
-	return batch;
-}
-
-static void rcu_table_freelist_callback(struct rcu_head *head)
-{
-	struct rcu_table_freelist *batch =
-		container_of(head, struct rcu_table_freelist, rcu);
-
-	while (batch->pgt_index > 0)
-		__page_table_free(batch->mm, batch->table[--batch->pgt_index]);
-	while (batch->crst_index < RCU_FREELIST_SIZE)
-		__crst_table_free(batch->mm, batch->table[batch->crst_index++]);
-	free_page((unsigned long) batch);
-}
-
-void rcu_table_freelist_finish(void)
-{
-	struct rcu_table_freelist *batch = __get_cpu_var(rcu_table_freelist);
-
-	if (!batch)
-		return;
-	call_rcu(&batch->rcu, rcu_table_freelist_callback);
-	__get_cpu_var(rcu_table_freelist) = NULL;
-}
-
-static void smp_sync(void *arg)
-{
-}
 
 #ifndef CONFIG_64BIT
 #define ALLOC_ORDER	1
 #define TABLES_PER_PAGE	4
 #define FRAG_MASK	15UL
 #define SECOND_HALVES	10UL
-
-void clear_table_pgstes(unsigned long *table)
-{
-	clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE/4);
-	memset(table + 256, 0, PAGE_SIZE/4);
-	clear_table(table + 512, _PAGE_TYPE_EMPTY, PAGE_SIZE/4);
-	memset(table + 768, 0, PAGE_SIZE/4);
-}
-
 #else
 #define ALLOC_ORDER	2
 #define TABLES_PER_PAGE	2
 #define FRAG_MASK	3UL
 #define SECOND_HALVES	2UL
-
-void clear_table_pgstes(unsigned long *table)
-{
-	clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE/2);
-	memset(table + 256, 0, PAGE_SIZE/2);
-}
-
 #endif
 
 unsigned long VMALLOC_START = VMALLOC_END - VMALLOC_SIZE;
@@ -138,6 +63,7 @@ unsigned long *crst_table_alloc(struct m
 			return NULL;
 		}
 		page->index = page_to_phys(shadow);
+		page->private = (unsigned long) mm;
 	}
 	spin_lock_bh(&mm->context.list_lock);
 	list_add(&page->lru, &mm->context.crst_list);
@@ -145,47 +71,19 @@ unsigned long *crst_table_alloc(struct m
 	return (unsigned long *) page_to_phys(page);
 }
 
-static void __crst_table_free(struct mm_struct *mm, unsigned long *table)
-{
-	unsigned long *shadow = get_shadow_table(table);
-
-	if (shadow)
-		free_pages((unsigned long) shadow, ALLOC_ORDER);
-	free_pages((unsigned long) table, ALLOC_ORDER);
-}
-
-void crst_table_free(struct mm_struct *mm, unsigned long *table)
-{
-	struct page *page = virt_to_page(table);
-
-	spin_lock_bh(&mm->context.list_lock);
-	list_del(&page->lru);
-	spin_unlock_bh(&mm->context.list_lock);
-	__crst_table_free(mm, table);
-}
-
-void crst_table_free_rcu(struct mm_struct *mm, unsigned long *table)
+void crst_table_free(unsigned long *table)
 {
-	struct rcu_table_freelist *batch;
 	struct page *page = virt_to_page(table);
+	struct mm_struct *mm = (struct mm_struct *) page->private;
+	unsigned long *shadow = get_shadow_table(table);
 
 	spin_lock_bh(&mm->context.list_lock);
 	list_del(&page->lru);
+	page->private = 0;
 	spin_unlock_bh(&mm->context.list_lock);
-	if (atomic_read(&mm->mm_users) < 2 &&
-	    cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id()))) {
-		__crst_table_free(mm, table);
-		return;
-	}
-	batch = rcu_table_freelist_get(mm);
-	if (!batch) {
-		smp_call_function(smp_sync, NULL, 1);
-		__crst_table_free(mm, table);
-		return;
-	}
-	batch->table[--batch->crst_index] = table;
-	if (batch->pgt_index >= batch->crst_index)
-		rcu_table_freelist_finish();
+	if (shadow)
+		free_pages((unsigned long) shadow, ALLOC_ORDER);
+	free_pages((unsigned long) table, ALLOC_ORDER);
 }
 
 #ifdef CONFIG_64BIT
@@ -223,7 +121,7 @@ int crst_table_upgrade(struct mm_struct 
 	}
 	spin_unlock_bh(&mm->page_table_lock);
 	if (table)
-		crst_table_free(mm, table);
+		crst_table_free(table);
 	if (mm->context.asce_limit < limit)
 		goto repeat;
 	update_mm(mm, current);
@@ -257,7 +155,7 @@ void crst_table_downgrade(struct mm_stru
 		}
 		mm->pgd = (pgd_t *) (pgd_val(*pgd) & _REGION_ENTRY_ORIGIN);
 		mm->task_size = mm->context.asce_limit;
-		crst_table_free(mm, (unsigned long *) pgd);
+		crst_table_free((unsigned long *) pgd);
 	}
 	update_mm(mm, current);
 }
@@ -288,11 +186,7 @@ unsigned long *page_table_alloc(struct m
 			return NULL;
 		pgtable_page_ctor(page);
 		page->flags &= ~FRAG_MASK;
-		table = (unsigned long *) page_to_phys(page);
-		if (mm->context.has_pgste)
-			clear_table_pgstes(table);
-		else
-			clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE);
+		page->private = (unsigned long) mm;
 		spin_lock_bh(&mm->context.list_lock);
 		list_add(&page->lru, &mm->context.pgtable_list);
 	}
@@ -305,42 +199,34 @@ unsigned long *page_table_alloc(struct m
 	if ((page->flags & FRAG_MASK) == ((1UL << TABLES_PER_PAGE) - 1))
 		list_move_tail(&page->lru, &mm->context.pgtable_list);
 	spin_unlock_bh(&mm->context.list_lock);
+	clear_table(table, _PAGE_TYPE_EMPTY, PTRS_PER_PTE * sizeof(long));
+	if (mm->context.noexec)
+		clear_table(table + 256, _PAGE_TYPE_EMPTY,
+			    PTRS_PER_PTE * sizeof(long));
+	else if (mm->context.has_pgste)
+		clear_table(table + 256, 0, PTRS_PER_PTE * sizeof(long));
 	return table;
 }
 
-static void __page_table_free(struct mm_struct *mm, unsigned long *table)
+void page_table_free(unsigned long *table)
 {
-	struct page *page;
-	unsigned long bits;
-
-	bits = ((unsigned long) table) & 15;
-	table = (unsigned long *)(((unsigned long) table) ^ bits);
-	page = pfn_to_page(__pa(table) >> PAGE_SHIFT);
-	page->flags ^= bits;
-	if (!(page->flags & FRAG_MASK)) {
-		pgtable_page_dtor(page);
-		__free_page(page);
-	}
-}
-
-void page_table_free(struct mm_struct *mm, unsigned long *table)
-{
-	struct page *page;
+	struct page *page = virt_to_page(table);
+	struct mm_struct *mm = (struct mm_struct *) page->private;
 	unsigned long bits;
 
 	bits = (mm->context.noexec || mm->context.has_pgste) ? 3UL : 1UL;
 	bits <<= (__pa(table) & (PAGE_SIZE - 1)) / 256 / sizeof(unsigned long);
-	page = pfn_to_page(__pa(table) >> PAGE_SHIFT);
 	spin_lock_bh(&mm->context.list_lock);
 	page->flags ^= bits;
 	if (page->flags & FRAG_MASK) {
 		/* Page now has some free pgtable fragments. */
-		if (!list_empty(&page->lru))
-			list_move(&page->lru, &mm->context.pgtable_list);
+		list_move(&page->lru, &mm->context.pgtable_list);
 		page = NULL;
-	} else
+	} else {
 		/* All fragments of the 4K page have been freed. */
 		list_del(&page->lru);
+		page->private = 0;
+	}
 	spin_unlock_bh(&mm->context.list_lock);
 	if (page) {
 		pgtable_page_dtor(page);
@@ -348,36 +234,6 @@ void page_table_free(struct mm_struct *m
 	}
 }
 
-void page_table_free_rcu(struct mm_struct *mm, unsigned long *table)
-{
-	struct rcu_table_freelist *batch;
-	struct page *page;
-	unsigned long bits;
-
-	if (atomic_read(&mm->mm_users) < 2 &&
-	    cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id()))) {
-		page_table_free(mm, table);
-		return;
-	}
-	batch = rcu_table_freelist_get(mm);
-	if (!batch) {
-		smp_call_function(smp_sync, NULL, 1);
-		page_table_free(mm, table);
-		return;
-	}
-	bits = (mm->context.noexec || mm->context.has_pgste) ? 3UL : 1UL;
-	bits <<= (__pa(table) & (PAGE_SIZE - 1)) / 256 / sizeof(unsigned long);
-	page = pfn_to_page(__pa(table) >> PAGE_SHIFT);
-	spin_lock_bh(&mm->context.list_lock);
-	/* Delayed freeing with rcu prevents reuse of pgtable fragments */
-	list_del_init(&page->lru);
-	spin_unlock_bh(&mm->context.list_lock);
-	table = (unsigned long *)(((unsigned long) table) | bits);
-	batch->table[batch->pgt_index++] = table;
-	if (batch->pgt_index >= batch->crst_index)
-		rcu_table_freelist_finish();
-}
-
 void disable_noexec(struct mm_struct *mm, struct task_struct *tsk)
 {
 	struct page *page;
^ permalink raw reply	[flat|nested] 90+ messages in thread
 
- * [PATCH 13/17] mm: Extended batches for generic mmu_gather
  2011-02-17 16:23 [PATCH 00/17] mm: mmu_gather rework Peter Zijlstra
                   ` (12 preceding siblings ...)
  2011-02-17 16:23 ` [PATCH 12/17] s390: use generic RCU page-table freeing Peter Zijlstra
@ 2011-02-17 16:23 ` Peter Zijlstra
  2011-02-17 16:23   ` Peter Zijlstra
  2011-02-17 16:23 ` [PATCH 14/17] mm: Provide generic range tracking and flushing Peter Zijlstra
                   ` (5 subsequent siblings)
  19 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang
[-- Attachment #1: peter_zijlstra-mm-extended_batches_for_generic_mmu_gather.patch --]
[-- Type: text/plain, Size: 5506 bytes --]
Instead of using a single batch (either the small on-stack array or one
allocated page), try to extend the batch with another page every time it
fills up, and only flush once the extension fails or we're done.
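As a rough illustration of the scheme described above, here is a user-space
sketch in plain C. It is not the kernel code: malloc() stands in for
__get_free_pages(), a counter stands in for free_pages_and_swap_cache(), the
slot array is reached through a pointer rather than the patch's pages[0]
member, and alloc_budget, BATCH_BYTES and the gather_* names are made up for
the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BUNDLE      8     /* plays the role of MMU_GATHER_BUNDLE */
#define BATCH_BYTES 4096  /* assumed stand-in for PAGE_SIZE */

struct batch {
	struct batch *next;
	unsigned int nr;       /* slots in use */
	unsigned int max;      /* slots available */
	void **pages;          /* slot array (the patch uses pages[0] instead) */
};

/* Slots that fit in one allocation after the header. */
#define MAX_BATCH ((BATCH_BYTES - sizeof(struct batch)) / sizeof(void *))

struct gather {
	struct batch *active;      /* batch currently being filled */
	struct batch local;        /* small embedded first batch */
	void *local_pages[BUNDLE];
	unsigned long freed;       /* hypothetical counter replacing real frees */
};

/* Hypothetical knob: how many batch allocations may still succeed. */
static int alloc_budget = 1;

static void gather_init(struct gather *g)
{
	g->local.next = NULL;
	g->local.nr = 0;
	g->local.max = BUNDLE;
	g->local.pages = g->local_pages;
	g->active = &g->local;
	g->freed = 0;
}

/* Advance to the next batch; reuse a chained one before allocating. */
static int gather_next_batch(struct gather *g)
{
	struct batch *b = g->active;

	if (b->next) {
		g->active = b->next;
		return 1;
	}
	if (alloc_budget <= 0)
		return 0;               /* simulated allocation failure */
	b = malloc(BATCH_BYTES);
	if (!b)
		return 0;
	alloc_budget--;

	b->next = NULL;
	b->nr = 0;
	b->max = (unsigned int)MAX_BATCH;
	b->pages = (void **)(b + 1);    /* slots follow the header */

	g->active->next = b;
	g->active = b;
	return 1;
}

/* "Free" everything gathered so far but keep the chain for reuse. */
static void gather_flush(struct gather *g)
{
	struct batch *b;

	for (b = &g->local; b; b = b->next) {
		g->freed += b->nr;
		b->nr = 0;
	}
	g->active = &g->local;
}

/* Returns 1 when the batch is full and could not be extended. */
static int gather_add(struct gather *g, void *page)
{
	struct batch *b = g->active;

	assert(b->nr < b->max);
	b->pages[b->nr++] = page;
	if (b->nr == b->max && !gather_next_batch(g))
		return 1;
	return 0;
}

/* Final flush, then give back every allocated batch. */
static void gather_finish(struct gather *g)
{
	struct batch *b, *next;

	gather_flush(g);
	for (b = g->local.next; b; b = next) {
		next = b->next;
		free(b);
	}
	g->local.next = NULL;
}
```

A caller would loop over gather_add() and call gather_flush() whenever it
returns 1, then call gather_finish() once at the end. Note that a flush
keeps the chain, so the next round reuses the extra batch without
allocating again.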
Requested-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/asm-generic/tlb.h |  120 ++++++++++++++++++++++++++++++----------------
 1 file changed, 80 insertions(+), 40 deletions(-)
Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -19,16 +19,6 @@
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 
-/*
- * For UP we don't need to worry about TLB flush
- * and page free order so much..
- */
-#ifdef CONFIG_SMP
-  #define tlb_fast_mode(tlb) ((tlb)->nr == ~0U)
-#else
-  #define tlb_fast_mode(tlb) 1
-#endif
-
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 /*
  * Semi RCU freeing of the page directories.
@@ -78,31 +68,66 @@ extern void tlb_remove_table(struct mmu_
  */
 #define MMU_GATHER_BUNDLE	8
 
+struct mmu_gather_batch {
+	struct mmu_gather_batch	*next;
+	unsigned int		nr;
+	unsigned int		max;
+	struct page		*pages[0];
+};
+
+#define MAX_GATHER_BATCH	\
+	((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))
+
 /* struct mmu_gather is an opaque type used by the mm code for passing around
  * any data needed by arch specific code for tlb_remove_page.
  */
 struct mmu_gather {
 	struct mm_struct	*mm;
-	unsigned int		nr;	/* set to ~0U means fast mode */
-	unsigned int		max;	/* nr < max */
-	unsigned int		need_flush;/* Really unmapped some ptes? */
-	unsigned int		fullmm; /* non-zero means full mm flush */
-	struct page		**pages;
-	struct page		*local[MMU_GATHER_BUNDLE];
+	unsigned int		need_flush : 1,	/* Did free PTEs */
+				fast_mode  : 1; /* No batching   */
+	unsigned int		fullmm;		/* Flush full mm */
+
+	struct mmu_gather_batch *active;
+	struct mmu_gather_batch	local;
+	struct page		*__pages[MMU_GATHER_BUNDLE];
 
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	struct mmu_table_batch	*batch;
 #endif
 };
 
-static inline void __tlb_alloc_page(struct mmu_gather *tlb)
+/*
+ * For UP we don't need to worry about TLB flush
+ * and page free order so much..
+ */
+#ifdef CONFIG_SMP
+  #define tlb_fast_mode(tlb) ((tlb)->fast_mode)
+#else
+  #define tlb_fast_mode(tlb) 1
+#endif
+
+static inline int tlb_next_batch(struct mmu_gather *tlb)
 {
-	unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
+	struct mmu_gather_batch *batch;
 
-	if (addr) {
-		tlb->pages = (void *)addr;
-		tlb->max = PAGE_SIZE / sizeof(struct page *);
+	batch = tlb->active;
+	if (batch->next) {
+		tlb->active = batch->next;
+		return 1;
 	}
+
+	batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
+	if (!batch)
+		return 0;
+
+	batch->next = NULL;
+	batch->nr   = 0;
+	batch->max  = MAX_GATHER_BATCH;
+
+	tlb->active->next = batch;
+	tlb->active = batch;
+
+	return 1;
 }
 
 /* tlb_gather_mmu
@@ -113,17 +138,16 @@ tlb_gather_mmu(struct mmu_gather *tlb, s
 {
 	tlb->mm = mm;
 
-	tlb->max = ARRAY_SIZE(tlb->local);
-	tlb->pages = tlb->local;
-
-	if (num_online_cpus() > 1) {
-		tlb->nr = 0;
-		__tlb_alloc_page(tlb);
-	} else /* Use fast mode if only one CPU is online */
-		tlb->nr = ~0U;
-
+	tlb->need_flush = 0;
+	/* Set fast_mode both ways; an on-stack tlb starts uninitialized. */
+	tlb->fast_mode = (num_online_cpus() == 1);
 	tlb->fullmm = full_mm_flush;
 
+	tlb->local.next = NULL;
+	tlb->local.nr   = 0;
+	tlb->local.max  = ARRAY_SIZE(tlb->__pages);
+	tlb->active     = &tlb->local;
+
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb->batch = NULL;
 #endif
@@ -132,6 +156,8 @@ tlb_gather_mmu(struct mmu_gather *tlb, s
 static inline void
 tlb_flush_mmu(struct mmu_gather *tlb)
 {
+	struct mmu_gather_batch *batch;
+
 	if (!tlb->need_flush)
 		return;
 	tlb->need_flush = 0;
@@ -139,12 +165,14 @@ tlb_flush_mmu(struct mmu_gather *tlb)
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb_table_flush(tlb);
 #endif
-	if (!tlb_fast_mode(tlb)) {
-		free_pages_and_swap_cache(tlb->pages, tlb->nr);
-		tlb->nr = 0;
-		if (tlb->pages == tlb->local)
-			__tlb_alloc_page(tlb);
+	if (tlb_fast_mode(tlb))
+		return;
+
+	for (batch = &tlb->local; batch; batch = batch->next) {
+		free_pages_and_swap_cache(batch->pages, batch->nr);
+		batch->nr = 0;
 	}
+	tlb->active = &tlb->local;
 }
 
 /* tlb_finish_mmu
@@ -154,13 +182,18 @@ tlb_flush_mmu(struct mmu_gather *tlb)
 static inline void
 tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
+	struct mmu_gather_batch *batch, *next;
+
 	tlb_flush_mmu(tlb);
 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
-	if (tlb->pages != tlb->local)
-		free_pages((unsigned long)tlb->pages, 0);
+	for (batch = tlb->local.next; batch; batch = next) {
+		next = batch->next;
+		free_pages((unsigned long)batch, 0);
+	}
+	tlb->local.next = NULL;
 }
 
 /* tlb_remove_page
@@ -170,14 +203,21 @@ tlb_finish_mmu(struct mmu_gather *tlb, u
  */
 static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 {
+	struct mmu_gather_batch *batch;
+
 	tlb->need_flush = 1;
+
 	if (tlb_fast_mode(tlb)) {
 		free_page_and_swap_cache(page);
 		return 0;
 	}
-	tlb->pages[tlb->nr++] = page;
-	if (tlb->nr >= tlb->max)
-		return 1;
+
+	batch = tlb->active;
+	batch->pages[batch->nr++] = page;
+	if (batch->nr == batch->max) {
+		if (!tlb_next_batch(tlb))
+			return 1;
+	}
 
 	return 0;
 }
^ permalink raw reply	[flat|nested] 90+ messages in thread
 
- * [PATCH 14/17] mm: Provide generic range tracking and flushing
  2011-02-17 16:23 [PATCH 00/17] mm: mmu_gather rework Peter Zijlstra
                   ` (13 preceding siblings ...)
  2011-02-17 16:23 ` [PATCH 13/17] mm: Extended batches for generic mmu_gather Peter Zijlstra
@ 2011-02-17 16:23 ` Peter Zijlstra
  2011-02-17 16:23   ` Peter Zijlstra
  2011-02-17 16:23 ` [PATCH 15/17] arm, mm: Convert arm to generic tlb Peter Zijlstra
                   ` (4 subsequent siblings)
  19 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang
[-- Attachment #1: mm-generic-tlb-range.patch --]
[-- Type: text/plain, Size: 2273 bytes --]
In order to convert ia64, arm and sh to generic tlb we need to provide
some extra infrastructure to track the range of the flushed page
tables.
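To make the intended bookkeeping concrete, here is a small user-space model
in plain C of the range tracking. It is only a sketch under assumptions:
PAGE_SZ and TASK_TOP are made-up stand-ins for PAGE_SIZE and TASK_SIZE, the
range_* names are hypothetical, and the "flush" merely records the range
that would have been passed to flush_tlb_range().

```c
#include <assert.h>

#define PAGE_SZ  4096UL        /* assumed page size */
#define TASK_TOP 0xC0000000UL  /* hypothetical stand-in for TASK_SIZE */

struct range_gather {
	int fullmm;                    /* non-zero: whole mm, skip tracking */
	unsigned long start, end;      /* range touched in the current vma */
	unsigned long flushed_start;   /* last range handed to the "flush" */
	unsigned long flushed_end;
	int range_flushes;             /* number of range flushes issued */
};

/* Reset to an empty range: start above any valid address, end at 0. */
static void range_start_vma(struct range_gather *g)
{
	if (!g->fullmm) {
		g->start = TASK_TOP;
		g->end = 0;
	}
}

/* Widen the range so that it covers the page at addr. */
static void range_note_entry(struct range_gather *g, unsigned long addr)
{
	assert(addr + PAGE_SZ > addr); /* no wrap-around in this model */
	if (!g->fullmm) {
		if (addr < g->start)
			g->start = addr;
		if (addr + PAGE_SZ > g->end)
			g->end = addr + PAGE_SZ;
	}
}

/* Flush only if at least one entry was noted since range_start_vma(). */
static void range_end_vma(struct range_gather *g)
{
	if (!g->fullmm && g->end) {
		g->flushed_start = g->start;
		g->flushed_end = g->end;
		g->range_flushes++;
	}
}
```

Starting from the inverted range (start = TASK_TOP, end = 0) means the first
noted entry sets both bounds, and a vma in which nothing was unmapped leaves
end at 0 and so triggers no flush.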
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/Kconfig              |    3 +++
 include/asm-generic/tlb.h |   35 +++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)
Index: linux-2.6/arch/Kconfig
===================================================================
--- linux-2.6.orig/arch/Kconfig
+++ linux-2.6/arch/Kconfig
@@ -181,4 +181,7 @@ config HAVE_ARCH_MUTEX_CPU_RELAX
 config HAVE_RCU_TABLE_FREE
 	bool
 
+config HAVE_MMU_GATHER_RANGE
+	bool
+
 source "kernel/gcov/Kconfig"
Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -87,6 +87,10 @@ struct mmu_gather {
 				fast_mode  : 1; /* No batching   */
 	unsigned int		fullmm;		/* Flush full mm */
 
+#ifdef CONFIG_HAVE_MMU_GATHER_RANGE
+	unsigned long		start, end;
+#endif
+
 	struct mmu_gather_batch *active;
 	struct mmu_gather_batch	local;
 	struct page		*__pages[MMU_GATHER_BUNDLE];
@@ -228,6 +232,35 @@ static inline void tlb_remove_page(struc
 		tlb_flush_mmu(tlb);
 }
 
+#ifdef CONFIG_HAVE_MMU_GATHER_RANGE
+static inline void
+__tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep, unsigned long address)
+{
+	if (!tlb->fullmm) {
+		tlb->start = min(tlb->start, address);
+		address += PAGE_SIZE;
+		tlb->end = max(tlb->end, address);
+	}
+}
+
+static inline void
+tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
+{
+	if (!tlb->fullmm) {
+		flush_cache_range(vma, vma->vm_start, vma->vm_end);
+		tlb->start = TASK_SIZE;
+		tlb->end = 0;
+	}
+}
+
+static inline void
+tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
+{
+	if (!tlb->fullmm && tlb->end)
+		flush_tlb_range(vma, tlb->start, tlb->end);
+}
+#endif
+
 /**
  * tlb_remove_tlb_entry - remember a pte unmapping for later tlb invalidation.
  *
@@ -261,6 +294,8 @@ static inline void tlb_remove_page(struc
 		__pmd_free_tlb(tlb, pmdp, address);		\
 	} while (0)
 
+#ifndef tlb_migrate_finish
 #define tlb_migrate_finish(mm) do {} while (0)
+#endif
 
 #endif /* _ASM_GENERIC__TLB_H */
^ permalink raw reply	[flat|nested] 90+ messages in thread
 
- * [PATCH 15/17] arm, mm: Convert arm to generic tlb
  2011-02-17 16:23 [PATCH 00/17] mm: mmu_gather rework Peter Zijlstra
                   ` (14 preceding siblings ...)
  2011-02-17 16:23 ` [PATCH 14/17] mm: Provide generic range tracking and flushing Peter Zijlstra
@ 2011-02-17 16:23 ` Peter Zijlstra
  2011-02-17 16:23   ` Peter Zijlstra
  2011-02-17 16:23 ` [PATCH 16/17] ia64, mm: Convert ia64 " Peter Zijlstra
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Russell King
[-- Attachment #1: mm-arm-tlb-range.patch --]
[-- Type: text/plain, Size: 4228 bytes --]
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/arm/Kconfig                |    1 
 arch/arm/include/asm/tlb.h      |   90 +---------------------------------------
 arch/arm/include/asm/tlbflush.h |    5 --
 3 files changed, 6 insertions(+), 90 deletions(-)
Index: linux-2.6/arch/arm/Kconfig
===================================================================
--- linux-2.6.orig/arch/arm/Kconfig
+++ linux-2.6/arch/arm/Kconfig
@@ -28,6 +28,7 @@ config ARM
 	select HAVE_C_RECORDMCOUNT
 	select HAVE_GENERIC_HARDIRQS
 	select HAVE_SPARSE_IRQ
+	select HAVE_MMU_GATHER_RANGE if MMU
 	help
 	  The ARM series is a line of low-power-consumption RISC chip designs
 	  licensed by ARM Ltd and targeted at embedded applications and
Index: linux-2.6/arch/arm/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/arm/include/asm/tlb.h
+++ linux-2.6/arch/arm/include/asm/tlb.h
@@ -23,96 +23,14 @@
 #ifndef CONFIG_MMU
 
 #include <linux/pagemap.h>
-#include <asm-generic/tlb.h>
 
 #else /* !CONFIG_MMU */
 
-#include <asm/pgalloc.h>
-
-/*
- * TLB handling.  This allows us to remove pages from the page
- * tables, and efficiently handle the TLB issues.
- */
-struct mmu_gather {
-	struct mm_struct	*mm;
-	unsigned int		fullmm;
-	unsigned long		range_start;
-	unsigned long		range_end;
-};
-
-static inline void
-tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
-{
-	tlb->mm = mm;
-	tlb->fullmm = full_mm_flush;
-}
-
-static inline void
-tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
-{
-	if (tlb->fullmm)
-		flush_tlb_mm(tlb->mm);
-
-	/* keep the page table cache within bounds */
-	check_pgt_cache();
-}
-
-/*
- * Memorize the range for the TLB flush.
- */
-static inline void
-tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep, unsigned long addr)
-{
-	if (!tlb->fullmm) {
-		if (addr < tlb->range_start)
-			tlb->range_start = addr;
-		if (addr + PAGE_SIZE > tlb->range_end)
-			tlb->range_end = addr + PAGE_SIZE;
-	}
-}
-
-/*
- * In the case of tlb vma handling, we can optimise these away in the
- * case where we're doing a full MM flush.  When we're doing a munmap,
- * the vmas are adjusted to only cover the region to be torn down.
- */
-static inline void
-tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
-{
-	if (!tlb->fullmm) {
-		flush_cache_range(vma, vma->vm_start, vma->vm_end);
-		tlb->range_start = TASK_SIZE;
-		tlb->range_end = 0;
-	}
-}
-
-static inline void
-tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
-{
-	if (!tlb->fullmm && tlb->range_end > 0)
-		flush_tlb_range(vma, tlb->range_start, tlb->range_end);
-}
-
-static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
-{
-	free_page_and_swap_cache(page);
-	return 0;
-}
-
-static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
-{
-	might_sleep();
-	__tlb_remove_page(tlb, page);
-}
-
-static inline void tlb_flush_mmu(struct mmu_gather *tlb)
-{
-}
+#define __pte_free_tlb(tlb, ptep, addr)	pte_free((tlb)->mm, ptep)
+#define __pmd_free_tlb(tlb, pmdp, addr)	pmd_free((tlb)->mm, pmdp)
 
-#define pte_free_tlb(tlb, ptep, addr)	pte_free((tlb)->mm, ptep)
-#define pmd_free_tlb(tlb, pmdp, addr)	pmd_free((tlb)->mm, pmdp)
+#endif /* CONFIG_MMU */
 
-#define tlb_migrate_finish(mm)		do { } while (0)
+#include <asm-generic/tlb.h>
 
-#endif /* CONFIG_MMU */
 #endif
Index: linux-2.6/arch/arm/include/asm/tlbflush.h
===================================================================
--- linux-2.6.orig/arch/arm/include/asm/tlbflush.h
+++ linux-2.6/arch/arm/include/asm/tlbflush.h
@@ -10,12 +10,9 @@
 #ifndef _ASMARM_TLBFLUSH_H
 #define _ASMARM_TLBFLUSH_H
 
-
-#ifndef CONFIG_MMU
-
 #define tlb_flush(tlb)	((void) tlb)
 
-#else /* CONFIG_MMU */
+#ifdef CONFIG_MMU
 
 #include <asm/glue.h>
 
^ permalink raw reply	[flat|nested] 90+ messages in thread
 
- * [PATCH 16/17] ia64, mm: Convert ia64 to generic tlb
  2011-02-17 16:23 [PATCH 00/17] mm: mmu_gather rework Peter Zijlstra
                   ` (15 preceding siblings ...)
  2011-02-17 16:23 ` [PATCH 15/17] arm, mm: Convert arm to generic tlb Peter Zijlstra
@ 2011-02-17 16:23 ` Peter Zijlstra
  2011-02-17 16:23   ` Peter Zijlstra
  2011-02-17 16:23 ` [PATCH 17/17] sh, mm: Convert sh " Peter Zijlstra
                   ` (2 subsequent siblings)
  19 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Tony Luck
[-- Attachment #1: mm-ia64-tlb-range.patch --]
[-- Type: text/plain, Size: 6774 bytes --]
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/ia64/Kconfig           |    1 
 arch/ia64/include/asm/tlb.h |  176 ++------------------------------------------
 2 files changed, 9 insertions(+), 168 deletions(-)
Index: linux-2.6/arch/ia64/Kconfig
===================================================================
--- linux-2.6.orig/arch/ia64/Kconfig
+++ linux-2.6/arch/ia64/Kconfig
@@ -25,6 +25,7 @@ config IA64
 	select HAVE_GENERIC_HARDIRQS
 	select GENERIC_IRQ_PROBE
 	select GENERIC_PENDING_IRQ if SMP
+	select HAVE_MMU_GATHER_RANGE
 	select IRQ_PER_CPU
 	default y
 	help
Index: linux-2.6/arch/ia64/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/ia64/include/asm/tlb.h
+++ linux-2.6/arch/ia64/include/asm/tlb.h
@@ -46,30 +46,6 @@
 #include <asm/tlbflush.h>
 #include <asm/machvec.h>
 
-#ifdef CONFIG_SMP
-# define tlb_fast_mode(tlb)	((tlb)->nr == ~0U)
-#else
-# define tlb_fast_mode(tlb)	(1)
-#endif
-
-/*
- * If we can't allocate a page to make a big batch of page pointers
- * to work on, then just handle a few from the on-stack structure.
- */
-#define	IA64_GATHER_BUNDLE	8
-
-struct mmu_gather {
-	struct mm_struct	*mm;
-	unsigned int		nr;		/* == ~0U => fast mode */
-	unsigned int		max;
-	unsigned char		fullmm;		/* non-zero means full mm flush */
-	unsigned char		need_flush;	/* really unmapped some PTEs? */
-	unsigned long		start_addr;
-	unsigned long		end_addr;
-	struct page		**pages;
-	struct page		*local[IA64_GATHER_BUNDLE];
-};
-
 struct ia64_tr_entry {
 	u64 ifa;
 	u64 itir;
@@ -96,6 +72,12 @@ extern struct ia64_tr_entry *ia64_idtrs[
 #define RR_RID_MASK	0x00000000ffffff00L
 #define RR_TO_RID(val) 	((val >> 8) & 0xffffff)
 
+static inline void tlb_flush(struct mmu_gather *tlb);
+
+#define tlb_migrate_finish(mm)	platform_tlb_migrate_finish(mm)
+
+#include <asm-generic/tlb.h>
+
 /*
  * Flush the TLB for address range START to END and, if not in fast mode, release the
  * freed pages that were gathered up to this point.
@@ -103,12 +85,6 @@ extern struct ia64_tr_entry *ia64_idtrs[
 static inline void
 ia64_tlb_flush_mmu (struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
-	unsigned int nr;
-
-	if (!tlb->need_flush)
-		return;
-	tlb->need_flush = 0;
-
 	if (tlb->fullmm) {
 		/*
 		 * Tearing down the entire address space.  This happens both as a result
@@ -138,147 +114,11 @@ ia64_tlb_flush_mmu (struct mmu_gather *t
 		/* now flush the virt. page-table area mapping the address range: */
 		flush_tlb_range(&vma, ia64_thash(start), ia64_thash(end));
 	}
-
-	/* lastly, release the freed pages */
-	nr = tlb->nr;
-	if (!tlb_fast_mode(tlb)) {
-		unsigned long i;
-		tlb->nr = 0;
-		tlb->start_addr = ~0UL;
-		for (i = 0; i < nr; ++i)
-			free_page_and_swap_cache(tlb->pages[i]);
-	}
-}
-
-static inline void __tlb_alloc_page(struct mmu_gather *tlb)
-{
-	unsigned long addr = __get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
-
-	if (addr) {
-		tlb->pages = (void *)addr;
-		tlb->max = PAGE_SIZE / sizeof(void *);
-	}
-}
-
-
-static inline void
-tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
-{
-	tlb->mm = mm;
-	tlb->max = ARRAY_SIZE(tlb->local);
-	tlb->pages = tlb->local;
-	/*
-	 * Use fast mode if only 1 CPU is online.
-	 *
-	 * It would be tempting to turn on fast-mode for full_mm_flush as well.  But this
-	 * doesn't work because of speculative accesses and software prefetching: the page
-	 * table of "mm" may (and usually is) the currently active page table and even
-	 * though the kernel won't do any user-space accesses during the TLB shoot down, a
-	 * compiler might use speculation or lfetch.fault on what happens to be a valid
-	 * user-space address.  This in turn could trigger a TLB miss fault (or a VHPT
-	 * walk) and re-insert a TLB entry we just removed.  Slow mode avoids such
-	 * problems.  (We could make fast-mode work by switching the current task to a
-	 * different "mm" during the shootdown.) --davidm 08/02/2002
-	 */
-	tlb->nr = (num_online_cpus() == 1) ? ~0U : 0;
-	tlb->fullmm = full_mm_flush;
-	tlb->start_addr = ~0UL;
-}
-
-/*
- * Called at the end of the shootdown operation to free up any resources that were
- * collected.
- */
-static inline void
-tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
-{
-	/*
-	 * Note: tlb->nr may be 0 at this point, so we can't rely on tlb->start_addr and
-	 * tlb->end_addr.
-	 */
-	ia64_tlb_flush_mmu(tlb, start, end);
-
-	/* keep the page table cache within bounds */
-	check_pgt_cache();
-
-	if (tlb->pages != tlb->local)
-		free_pages((unsigned long)tlb->pages, 0);
-}
-
-/*
- * Logically, this routine frees PAGE.  On MP machines, the actual freeing of the page
- * must be delayed until after the TLB has been flushed (see comments at the beginning of
- * this file).
- */
-static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
-{
-	tlb->need_flush = 1;
-
-	if (tlb_fast_mode(tlb)) {
-		free_page_and_swap_cache(page);
-		return 0;
-	}
-
-	if (!tlb->nr && tlb->pages == tlb->local)
-		__tlb_alloc_page(tlb);
-
-	tlb->pages[tlb->nr++] = page;
-	if (tlb->nr >= tlb->max)
-		return 1;
-
-	return 0;
 }
 
-static inline void tlb_flush_mmu(struct mmu_gather *tlb)
+static inline void tlb_flush(struct mmu_gather *tlb)
 {
-	ia64_tlb_flush_mmu(tlb, tlb->start_addr, tlb->end_addr);
+	ia64_tlb_flush_mmu(tlb, tlb->start, tlb->end);
 }
 
-static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
-{
-	if (__tlb_remove_page(tlb, page))
-		tlb_flush_mmu(tlb);
-}
-
-/*
- * Remove TLB entry for PTE mapped at virtual address ADDRESS.  This is called for any
- * PTE, not just those pointing to (normal) physical memory.
- */
-static inline void
-__tlb_remove_tlb_entry (struct mmu_gather *tlb, pte_t *ptep, unsigned long address)
-{
-	if (tlb->start_addr == ~0UL)
-		tlb->start_addr = address;
-	tlb->end_addr = address + PAGE_SIZE;
-}
-
-#define tlb_migrate_finish(mm)	platform_tlb_migrate_finish(mm)
-
-#define tlb_start_vma(tlb, vma)			do { } while (0)
-#define tlb_end_vma(tlb, vma)			do { } while (0)
-
-#define tlb_remove_tlb_entry(tlb, ptep, addr)		\
-do {							\
-	tlb->need_flush = 1;				\
-	__tlb_remove_tlb_entry(tlb, ptep, addr);	\
-} while (0)
-
-#define pte_free_tlb(tlb, ptep, address)		\
-do {							\
-	tlb->need_flush = 1;				\
-	__pte_free_tlb(tlb, ptep, address);		\
-} while (0)
-
-#define pmd_free_tlb(tlb, ptep, address)		\
-do {							\
-	tlb->need_flush = 1;				\
-	__pmd_free_tlb(tlb, ptep, address);		\
-} while (0)
-
-#define pud_free_tlb(tlb, pudp, address)		\
-do {							\
-	tlb->need_flush = 1;				\
-	__pud_free_tlb(tlb, pudp, address);		\
-} while (0)
-
 #endif /* _ASM_IA64_TLB_H */
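
For readers following the removal above: the ia64-specific code being deleted gathered page pointers into a small on-stack array, tried once to swap in a larger array, and flushed whenever the array filled up. Below is a minimal userspace sketch of that batching pattern, not kernel code. Every name in it (batch_model, batch_add, MODEL_BUNDLE, MODEL_BIG_BATCH, the allow_alloc switch) is made up for illustration, malloc() stands in for __get_free_pages(), and fast mode is left out entirely.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_BUNDLE	8	/* size of the always-available local array */
#define MODEL_BIG_BATCH	512	/* hypothetical size of the larger batch */

/* Hypothetical userspace model of a gather batch; not the kernel struct. */
struct batch_model {
	void		**pages;	/* current batch array */
	unsigned int	nr;		/* entries gathered so far */
	unsigned int	max;		/* capacity of the current array */
	void		*local[MODEL_BUNDLE];
	unsigned int	flushes;	/* flushes issued (stands in for TLB flushes) */
	unsigned int	freed;		/* pages released so far */
};

static void batch_init(struct batch_model *b)
{
	assert(b != NULL);
	b->pages = b->local;
	b->max = MODEL_BUNDLE;
	b->nr = 0;
	b->flushes = 0;
	b->freed = 0;
}

/* Flush first, then release: pages only count as freed after the flush. */
static void batch_flush(struct batch_model *b)
{
	b->flushes++;
	b->freed += b->nr;
	b->nr = 0;
}

/*
 * Queue one page.  On the first use of the local array, optionally try to
 * get a bigger one; if that fails (or allow_alloc is 0) keep the local
 * array and simply flush more often.
 */
static void batch_add(struct batch_model *b, void *page, int allow_alloc)
{
	if (!b->nr && b->pages == b->local && allow_alloc) {
		void **big = malloc(MODEL_BIG_BATCH * sizeof(*big));

		if (big) {
			b->pages = big;
			b->max = MODEL_BIG_BATCH;
		}
	}

	b->pages[b->nr++] = page;
	if (b->nr >= b->max)
		batch_flush(b);
}

/* Drain whatever is left and drop the larger array, if one was obtained. */
static void batch_finish(struct batch_model *b)
{
	if (b->nr)
		batch_flush(b);
	if (b->pages != b->local)
		free(b->pages);
	b->pages = b->local;
	b->max = MODEL_BUNDLE;
}
```

With allow_alloc set to 0, twenty queued pages cost two flushes on the way (after the 8th and 16th page) plus one in batch_finish(); with the larger array the same twenty pages would normally need only the final flush.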
^ permalink raw reply	[flat|nested] 90+ messages in thread
 
- * [PATCH 17/17] sh, mm: Convert sh to generic tlb
  2011-02-17 16:23 [PATCH 00/17] mm: mmu_gather rework Peter Zijlstra
                   ` (16 preceding siblings ...)
  2011-02-17 16:23 ` [PATCH 16/17] ia64, mm: Convert ia64 " Peter Zijlstra
@ 2011-02-17 16:23 ` Peter Zijlstra
  2011-02-17 16:23   ` Peter Zijlstra
  2011-02-17 17:36 ` [PATCH 00/17] mm: mmu_gather rework Peter Zijlstra
  2011-02-17 17:42 ` Peter Zijlstra
  19 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 16:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Thomas Gleixner, Rik van Riel,
	Ingo Molnar, akpm
  Cc: linux-kernel, linux-arch, linux-mm, Benjamin Herrenschmidt,
	David Miller, Hugh Dickins, Mel Gorman, Nick Piggin,
	Peter Zijlstra, Paul McKenney, Yanmin Zhang, Paul Mundt
[-- Attachment #1: mm-sh-tlb-range.patch --]
[-- Type: text/plain, Size: 4232 bytes --]
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/sh/Kconfig           |    1 
 arch/sh/include/asm/tlb.h |   98 ++--------------------------------------------
 2 files changed, 6 insertions(+), 93 deletions(-)
Index: linux-2.6/arch/sh/Kconfig
===================================================================
--- linux-2.6.orig/arch/sh/Kconfig
+++ linux-2.6/arch/sh/Kconfig
@@ -23,6 +23,7 @@ config SUPERH
 	select HAVE_SPARSE_IRQ
 	select RTC_LIB
 	select GENERIC_ATOMIC64
+	select HAVE_MMU_GATHER_RANGE if MMU
 	# Support the deprecated APIs until MFD and GPIOLIB catch up.
 	select GENERIC_HARDIRQS_NO_DEPRECATED if !MFD_SUPPORT && !GPIOLIB
 	help
Index: linux-2.6/arch/sh/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/sh/include/asm/tlb.h
+++ linux-2.6/arch/sh/include/asm/tlb.h
@@ -9,100 +9,11 @@
 #include <linux/pagemap.h>
 
 #ifdef CONFIG_MMU
-#include <asm/pgalloc.h>
-#include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
 
-/*
- * TLB handling.  This allows us to remove pages from the page
- * tables, and efficiently handle the TLB issues.
- */
-struct mmu_gather {
-	struct mm_struct	*mm;
-	unsigned int		fullmm;
-	unsigned long		start, end;
-};
-
-static inline void init_tlb_gather(struct mmu_gather *tlb)
-{
-	tlb->start = TASK_SIZE;
-	tlb->end = 0;
-
-	if (tlb->fullmm) {
-		tlb->start = 0;
-		tlb->end = TASK_SIZE;
-	}
-}
-
-static inline void
-tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
-{
-	tlb->mm = mm;
-	tlb->fullmm = full_mm_flush;
-
-	init_tlb_gather(tlb);
-}
-
-static inline void
-tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
-{
-	if (tlb->fullmm)
-		flush_tlb_mm(tlb->mm);
-
-	/* keep the page table cache within bounds */
-	check_pgt_cache();
-}
-
-static inline void
-tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep, unsigned long address)
-{
-	if (tlb->start > address)
-		tlb->start = address;
-	if (tlb->end < address + PAGE_SIZE)
-		tlb->end = address + PAGE_SIZE;
-}
-
-/*
- * In the case of tlb vma handling, we can optimise these away in the
- * case where we're doing a full MM flush.  When we're doing a munmap,
- * the vmas are adjusted to only cover the region to be torn down.
- */
-static inline void
-tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
-{
-	if (!tlb->fullmm)
-		flush_cache_range(vma, vma->vm_start, vma->vm_end);
-}
-
-static inline void
-tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
-{
-	if (!tlb->fullmm && tlb->end) {
-		flush_tlb_range(vma, tlb->start, tlb->end);
-		init_tlb_gather(tlb);
-	}
-}
-
-static inline void tlb_flush_mmu(struct mmu_gather *tlb)
-{
-}
-
-static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
-{
-	free_page_and_swap_cache(page);
-	return 0;
-}
-
-static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
-{
-	__tlb_remove_page(tlb, page);
-}
-
-#define pte_free_tlb(tlb, ptep, addr)	pte_free((tlb)->mm, ptep)
-#define pmd_free_tlb(tlb, pmdp, addr)	pmd_free((tlb)->mm, pmdp)
-#define pud_free_tlb(tlb, pudp, addr)	pud_free((tlb)->mm, pudp)
-
-#define tlb_migrate_finish(mm)		do { } while (0)
+#define __pte_free_tlb(tlb, ptep, addr)	pte_free((tlb)->mm, ptep)
+#define __pmd_free_tlb(tlb, pmdp, addr)	pmd_free((tlb)->mm, pmdp)
+#define __pud_free_tlb(tlb, pudp, addr)	pud_free((tlb)->mm, pudp)
 
 #if defined(CONFIG_CPU_SH4) || defined(CONFIG_SUPERH64)
 extern void tlb_wire_entry(struct vm_area_struct *, unsigned long, pte_t);
@@ -127,8 +38,9 @@ static inline void tlb_unwire_entry(void
 #define __tlb_remove_tlb_entry(tlb, pte, address)	do { } while (0)
 #define tlb_flush(tlb)					do { } while (0)
 
+#endif /* CONFIG_MMU */
+
 #include <asm-generic/tlb.h>
 
-#endif /* CONFIG_MMU */
 #endif /* __ASSEMBLY__ */
 #endif /* __ASM_SH_TLB_H */
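
As a side note on what the deleted sh helpers did: init_tlb_gather() and tlb_remove_tlb_entry() together kept a running [start, end) range covering every unmapped address, and tlb_end_vma() flushed that range when it was non-empty. Here is a minimal userspace sketch of that min/max bookkeeping, assuming a 4 KiB page and an arbitrary task size. The names (range_model, range_track, range_pending, MODEL_PAGE_SIZE, MODEL_TASK_SIZE) are invented for the example and are not kernel identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE	4096UL		/* assumed page size */
#define MODEL_TASK_SIZE	0x7c000000UL	/* arbitrary upper bound for the model */

/* Hypothetical model of the range fields in the removed sh mmu_gather. */
struct range_model {
	bool		fullmm;
	unsigned long	start, end;
};

/* Empty range is encoded as start = top, end = 0; fullmm covers everything. */
static void range_init(struct range_model *r)
{
	assert(r != NULL);
	if (r->fullmm) {
		r->start = 0;
		r->end = MODEL_TASK_SIZE;
	} else {
		r->start = MODEL_TASK_SIZE;
		r->end = 0;
	}
}

/* Widen the range so that it covers the page at addr. */
static void range_track(struct range_model *r, unsigned long addr)
{
	if (r->start > addr)
		r->start = addr;
	if (r->end < addr + MODEL_PAGE_SIZE)
		r->end = addr + MODEL_PAGE_SIZE;
}

/* Mirrors the "!fullmm && end" test the old tlb_end_vma() used. */
static bool range_pending(const struct range_model *r)
{
	return !r->fullmm && r->end != 0;
}
```

Tracking addresses 0x3000, 0x1000 and 0x5000 in any order leaves start at 0x1000 and end at 0x6000, which is the range a single flush_tlb_range() call would then have to cover.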
^ permalink raw reply	[flat|nested] 90+ messages in thread
 
- * Re: [PATCH 00/17] mm: mmu_gather rework
  2011-02-17 16:23 [PATCH 00/17] mm: mmu_gather rework Peter Zijlstra
                   ` (17 preceding siblings ...)
  2011-02-17 16:23 ` [PATCH 17/17] sh, mm: Convert sh " Peter Zijlstra
@ 2011-02-17 17:36 ` Peter Zijlstra
  2011-02-17 17:36   ` Peter Zijlstra
  2011-02-17 17:42 ` Peter Zijlstra
  19 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 17:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, linux-mm,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin, Paul McKenney, Yanmin Zhang
On Thu, 2011-02-17 at 17:23 +0100, Peter Zijlstra wrote:
> Rework the existing mmu_gather infrastructure.
> 
> The direct purpose of these patches was to allow preemptible mmu_gather,
> but even without that I think these patches provide an improvement to the
> status quo.
> 
> The first patch is a fix to the tile architecture, the subsequent 9 patches
> rework the mmu_gather infrastructure. For review purpose I've split them
> into generic and per-arch patches with the last of those a generic cleanup.
> 
> For the final commit I would provide a roll-up of these patches so as not
> to wreck bisectability of non generic archs.
> 
> The next patch provides generic RCU page-table freeing, and the follow up
> is a patch converting s390 to use this. I've also got 4 patches from
> DaveM lined up (not included in this series) that uses this to implement
> gup_fast() for sparc64.
> 
> Then there is one patch that extends the generic mmu_gather batching.
> 
> Finally there are 4 patches that convert various architectures over
> to asm-generic/tlb.h, these are compile tested only and basically RFC.
> 
> After this only um and s390 are left -- um should be straight forward,
> s390 wants a bit more, but more on that in another email.
---
 arch/Kconfig                           |    6 
 arch/alpha/mm/init.c                   |    2 
 arch/arm/Kconfig                       |    1 
 arch/arm/include/asm/tlb.h             |   83 ------------
 arch/arm/include/asm/tlbflush.h        |    5 
 arch/arm/mm/mmu.c                      |    2 
 arch/avr32/mm/init.c                   |    2 
 arch/cris/mm/init.c                    |    2 
 arch/frv/mm/init.c                     |    2 
 arch/ia64/Kconfig                      |    1 
 arch/ia64/include/asm/tlb.h            |  147 +--------------------
 arch/ia64/mm/init.c                    |    2 
 arch/m32r/mm/init.c                    |    2 
 arch/m68k/mm/init.c                    |    2 
 arch/microblaze/mm/init.c              |    2 
 arch/mips/mm/init.c                    |    2 
 arch/mn10300/mm/init.c                 |    2 
 arch/parisc/mm/init.c                  |    2 
 arch/powerpc/Kconfig                   |    1 
 arch/powerpc/include/asm/pgalloc.h     |   21 ++-
 arch/powerpc/include/asm/thread_info.h |    2 
 arch/powerpc/kernel/process.c          |   23 +++
 arch/powerpc/mm/pgtable.c              |  104 ---------------
 arch/powerpc/mm/tlb_hash32.c           |    3 
 arch/powerpc/mm/tlb_hash64.c           |   11 -
 arch/powerpc/mm/tlb_nohash.c           |    3 
 arch/s390/Kconfig                      |    1 
 arch/s390/include/asm/pgalloc.h        |   19 +-
 arch/s390/include/asm/tlb.h            |  100 +++++++-------
 arch/s390/mm/pgtable.c                 |  193 +++-------------------------
 arch/score/mm/init.c                   |    2 
 arch/sh/Kconfig                        |    1 
 arch/sh/include/asm/tlb.h              |   92 -------------
 arch/sh/mm/init.c                      |    1 
 arch/sparc/include/asm/pgalloc_64.h    |    3 
 arch/sparc/include/asm/pgtable_64.h    |   15 +-
 arch/sparc/include/asm/tlb_64.h        |   91 -------------
 arch/sparc/include/asm/tlbflush_64.h   |   12 +
 arch/sparc/mm/init_32.c                |    2 
 arch/sparc/mm/tlb.c                    |   43 +++---
 arch/sparc/mm/tsb.c                    |   15 +-
 arch/tile/mm/init.c                    |    2 
 arch/tile/mm/pgtable.c                 |   15 --
 arch/um/include/asm/tlb.h              |   29 +---
 arch/um/kernel/smp.c                   |    3 
 arch/x86/mm/init.c                     |    2 
 arch/xtensa/mm/mmu.c                   |    2 
 fs/exec.c                              |   10 -
 include/asm-generic/tlb.h              |  227 +++++++++++++++++++++++++++------
 include/linux/mm.h                     |    2 
 mm/memory.c                            |  119 +++++++++++++----
 mm/mmap.c                              |   18 +-
 52 files changed, 536 insertions(+), 918 deletions(-)
^ permalink raw reply	[flat|nested] 90+ messages in thread
 
- * Re: [PATCH 00/17] mm: mmu_gather rework
  2011-02-17 16:23 [PATCH 00/17] mm: mmu_gather rework Peter Zijlstra
                   ` (18 preceding siblings ...)
  2011-02-17 17:36 ` [PATCH 00/17] mm: mmu_gather rework Peter Zijlstra
@ 2011-02-17 17:42 ` Peter Zijlstra
  19 siblings, 0 replies; 90+ messages in thread
From: Peter Zijlstra @ 2011-02-17 17:42 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Avi Kivity, Thomas Gleixner, Rik van Riel, Ingo Molnar, akpm,
	Linus Torvalds, linux-kernel, linux-arch, linux-mm,
	Benjamin Herrenschmidt, David Miller, Hugh Dickins, Mel Gorman,
	Nick Piggin, Paul McKenney, Yanmin Zhang, Martin Schwidefsky
On Thu, 2011-02-17 at 17:23 +0100, Peter Zijlstra wrote:
> s390 wants a bit more, but more on that in another email.
So what s390 wants is something like the patch below, where a fullmm gather
flushes a priori and then simply gathers and frees the pages.
I can't see why something like this shouldn't work on x86 and power (the
only two archs I really looked in depth at), but it certainly is
something that needs a close look.
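
To make the ordering concrete, and separate from the actual patch that follows, here is a small userspace model of the idea: a fullmm gather flushes once when it is set up and never again, while a partial gather flushes each time pages are about to be released. It is only a sketch under that reading of the change. The names (gather_model, model_gather, model_remove_page, model_flush_mmu) are invented, and the counters stand in for real TLB flushes and page frees.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel API. */
struct gather_model {
	bool		fullmm;		/* tearing down the whole address space */
	bool		need_flush;	/* something was unmapped since the last flush */
	unsigned int	flushes;	/* TLB flushes issued */
	unsigned int	pending;	/* pages gathered but not yet freed */
	unsigned int	freed;		/* pages released */
};

static void model_tlb_flush(struct gather_model *g)
{
	g->flushes++;
}

/* Set up a gather; a fullmm gather flushes up front. */
static void model_gather(struct gather_model *g, bool fullmm)
{
	assert(g != NULL);
	g->fullmm = fullmm;
	g->need_flush = false;
	g->flushes = 0;
	g->pending = 0;
	g->freed = 0;

	if (g->fullmm)
		model_tlb_flush(g);
}

/* Queue one page for freeing. */
static void model_remove_page(struct gather_model *g)
{
	g->need_flush = true;
	g->pending++;
}

/* Flush only for partial gathers, then release what was gathered. */
static void model_flush_mmu(struct gather_model *g)
{
	if (g->need_flush && !g->fullmm) {
		model_tlb_flush(g);
		g->need_flush = false;
	}

	g->freed += g->pending;
	g->pending = 0;
}
```

In this model two remove/flush rounds cost one flush in total for a fullmm gather and two for a partial one. Whether freeing pages without a later flush is safe on a given architecture is exactly the open question raised above; the model says nothing about that.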
---
Index: linux-2.6/include/asm-generic/tlb.h
===================================================================
--- linux-2.6.orig/include/asm-generic/tlb.h
+++ linux-2.6/include/asm-generic/tlb.h
@@ -145,7 +145,10 @@ tlb_gather_mmu(struct mmu_gather *tlb, s
 	tlb->need_flush = 0;
 	if (num_online_cpus() == 1)
 		tlb->fast_mode = 1;
+
 	tlb->fullmm = full_mm_flush;
+	if (tlb->fullmm)
+		tlb_flush(tlb);
 
 	tlb->local.next = NULL;
 	tlb->local.nr   = 0;
@@ -162,13 +165,15 @@ tlb_flush_mmu(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch;
 
-	if (!tlb->need_flush)
-		return;
-	tlb->need_flush = 0;
-	tlb_flush(tlb);
+	if (tlb->need_flush && !tlb->fullmm) {
+		tlb_flush(tlb);
+		tlb->need_flush = 0;
+	}
+
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb_table_flush(tlb);
 #endif
+
 	if (tlb_fast_mode(tlb))
 		return;
 
^ permalink raw reply	[flat|nested] 90+ messages in thread