[RFC 00/20] TLB batching consolidation and enhancements

public inbox for linux-s390@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC 00/20] TLB batching consolidation and enhancements
@ 2021-01-31  0:11 Nadav Amit
  2021-01-31  0:11 ` [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma() Nadav Amit
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, linux-csky, linuxppc-dev, linux-s390, Mel Gorman,
	Nick Piggin, Peter Zijlstra, Thomas Gleixner, Will Deacon, x86,
	Yu Zhao

From: Nadav Amit <namit@vmware.com>

There are currently (at least?) 5 different TLB batching schemes in the
kernel:

1. Using mmu_gather (e.g., zap_page_range()).

2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
   ongoing deferred TLB flush and flushing the entire range eventually
   (e.g., change_protection_range()).

3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).

4. Batching per-table flushes (move_ptes()).

5. By setting a flag on that a deferred TLB flush operation takes place,
   flushing when (try_to_unmap_one() on x86).

It seems that (1)-(4) can be consolidated. In addition, it seems that
(5) is racy. It also seems there can be many redundant TLB flushes, and
potentially TLB-shootdown storms, for instance during batched
reclamation (using try_to_unmap_one()) if at the same time mmu_gather
defers TLB flushes.

More aggressive TLB batching may be possible, but this patch-set does
not add such batching. The proposed changes would enable such batching
in a later time.

Admittedly, I do not understand how things are not broken today, which
frightens me to make further batching before getting things in order.
For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes
for each page-table (but not in greater granularity). Can't
ClearPageDirty() be called before the flush, causing writes after
ClearPageDirty() and before the flush to be lost?

This patch-set therefore performs the following changes:

1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
   instead of {inc|dec}_tlb_flush_pending().

2. Avoid TLB flushes if PTE permission is not demoted.

3. Cleans up mmu_gather to be less arch-dependant.

4. Uses mm's generations to track in finer granularity, either per-VMA
   or per page-table, whether a pending mmu_gather operation is
   outstanding. This should allow to avoid some TLB flushes when KSM or
   memory reclamation takes place while another operation such as
   munmap() or mprotect() is running.

5. Changes try_to_unmap_one() flushing scheme, as the current seems
   broken to track in a bitmap which CPUs have outstanding TLB flushes
   instead of having a flag.

Further optimizations are possible, such as changing move_ptes() to use
mmu_gather.

The patches were very very lightly tested. I am looking forward for your
feedback regarding the overall approaches, and whether to split them
into multiple patch-sets.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-csky@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: x86@kernel.org
Cc: Yu Zhao <yuzhao@google.com>

Nadav Amit (20):
  mm/tlb: fix fullmm semantics
  mm/mprotect: use mmu_gather
  mm/mprotect: do not flush on permission promotion
  mm/mapping_dirty_helpers: use mmu_gather
  mm/tlb: move BATCHED_UNMAP_TLB_FLUSH to tlb.h
  fs/task_mmu: use mmu_gather interface of clear-soft-dirty
  mm: move x86 tlb_gen to generic code
  mm: store completed TLB generation
  mm: create pte/pmd_tlb_flush_pending()
  mm: add pte_to_page()
  mm/tlb: remove arch-specific tlb_start/end_vma()
  mm/tlb: save the VMA that is flushed during tlb_start_vma()
  mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes()
  mm: move inc/dec_tlb_flush_pending() to mmu_gather.c
  mm: detect deferred TLB flushes in vma granularity
  mm/tlb: per-page table generation tracking
  mm/tlb: updated completed deferred TLB flush conditionally
  mm: make mm_cpumask() volatile
  lib/cpumask: introduce cpumask_atomic_or()
  mm/rmap: avoid potential races

 arch/arm/include/asm/bitops.h         |   4 +-
 arch/arm/include/asm/pgtable.h        |   4 +-
 arch/arm64/include/asm/pgtable.h      |   4 +-
 arch/csky/Kconfig                     |   1 +
 arch/csky/include/asm/tlb.h           |  12 --
 arch/powerpc/Kconfig                  |   1 +
 arch/powerpc/include/asm/tlb.h        |   2 -
 arch/s390/Kconfig                     |   1 +
 arch/s390/include/asm/tlb.h           |   3 -
 arch/sparc/Kconfig                    |   1 +
 arch/sparc/include/asm/pgtable_64.h   |   9 +-
 arch/sparc/include/asm/tlb_64.h       |   2 -
 arch/sparc/mm/init_64.c               |   2 +-
 arch/x86/Kconfig                      |   3 +
 arch/x86/hyperv/mmu.c                 |   2 +-
 arch/x86/include/asm/mmu.h            |  10 -
 arch/x86/include/asm/mmu_context.h    |   1 -
 arch/x86/include/asm/paravirt_types.h |   2 +-
 arch/x86/include/asm/pgtable.h        |  24 +--
 arch/x86/include/asm/tlb.h            |  21 +-
 arch/x86/include/asm/tlbbatch.h       |  15 --
 arch/x86/include/asm/tlbflush.h       |  61 ++++--
 arch/x86/mm/tlb.c                     |  52 +++--
 arch/x86/xen/mmu_pv.c                 |   2 +-
 drivers/firmware/efi/efi.c            |   1 +
 fs/proc/task_mmu.c                    |  29 ++-
 include/asm-generic/bitops/find.h     |   8 +-
 include/asm-generic/tlb.h             | 291 +++++++++++++++++++++-----
 include/linux/bitmap.h                |  21 +-
 include/linux/cpumask.h               |  40 ++--
 include/linux/huge_mm.h               |   3 +-
 include/linux/mm.h                    |  29 ++-
 include/linux/mm_types.h              | 166 ++++++++++-----
 include/linux/mm_types_task.h         |  13 --
 include/linux/pgtable.h               |   2 +-
 include/linux/smp.h                   |   6 +-
 init/Kconfig                          |  21 ++
 kernel/fork.c                         |   2 +
 kernel/smp.c                          |   8 +-
 lib/bitmap.c                          |  33 ++-
 lib/cpumask.c                         |   8 +-
 lib/find_bit.c                        |  10 +-
 mm/huge_memory.c                      |   6 +-
 mm/init-mm.c                          |   1 +
 mm/internal.h                         |  16 --
 mm/ksm.c                              |   2 +-
 mm/madvise.c                          |   6 +-
 mm/mapping_dirty_helpers.c            |  52 +++--
 mm/memory.c                           |   2 +
 mm/mmap.c                             |   1 +
 mm/mmu_gather.c                       |  59 +++++-
 mm/mprotect.c                         |  55 ++---
 mm/mremap.c                           |   2 +-
 mm/pgtable-generic.c                  |   2 +-
 mm/rmap.c                             |  42 ++--
 mm/vmscan.c                           |   1 +
 56 files changed, 803 insertions(+), 374 deletions(-)
 delete mode 100644 arch/x86/include/asm/tlbbatch.h

-- 
2.25.1

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
@ 2021-01-31  0:11 ` Nadav Amit
  2021-02-01 12:09   ` Peter Zijlstra
  2021-01-31  0:39 ` [RFC 00/20] TLB batching consolidation and enhancements Andy Lutomirski
  2021-01-31  3:30 ` Nicholas Piggin
  2 siblings, 1 reply; 15+ messages in thread
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin, linux-csky, linuxppc-dev, linux-s390, x86

From: Nadav Amit <namit@vmware.com>

Architecture-specific tlb_start_vma() and tlb_end_vma() seem
unnecessary. They are currently used for:

1. Avoid per-VMA TLB flushes. This can be determined by introducing
   a new config option.

2. Avoid saving information on the vma that is being flushed. Saving
   this information, even for architectures that do not need it, is
   cheap and we will need it for per-VMA deferred TLB flushing.

3. Avoid calling flush_cache_range().

Remove the architecture specific tlb_start_vma() and tlb_end_vma() in
the following manner, corresponding to the previous requirements:

1. Introduce a new config option -
   ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING - to allow architectures to
   define whether they want aggressive TLB flush batching (instead of
   flushing mappings of each VMA separately).

2. Save information on the vma regardless of architecture. Saving this
   information should have negligible overhead, and they will be
   needed for fine granularity TLB flushes.

3. flush_cache_range() is anyhow not defined for the architectures that
   implement tlb_start/end_vma().

No functional change intended.

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: linux-csky@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: x86@kernel.org
---
 arch/csky/Kconfig               |  1 +
 arch/csky/include/asm/tlb.h     | 12 ------------
 arch/powerpc/Kconfig            |  1 +
 arch/powerpc/include/asm/tlb.h  |  2 --
 arch/s390/Kconfig               |  1 +
 arch/s390/include/asm/tlb.h     |  3 ---
 arch/sparc/Kconfig              |  1 +
 arch/sparc/include/asm/tlb_64.h |  2 --
 arch/x86/Kconfig                |  1 +
 arch/x86/include/asm/tlb.h      |  3 ---
 include/asm-generic/tlb.h       | 15 +++++----------
 init/Kconfig                    |  8 ++++++++
 12 files changed, 18 insertions(+), 32 deletions(-)

diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 89dd2fcf38fa..924ff5721240 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -8,6 +8,7 @@ config CSKY
 	select ARCH_HAS_SYNC_DMA_FOR_DEVICE
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_QUEUED_RWLOCKS if NR_CPUS>2
+	select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
 	select ARCH_WANT_FRAME_POINTERS if !CPU_CK610
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
 	select COMMON_CLK
diff --git a/arch/csky/include/asm/tlb.h b/arch/csky/include/asm/tlb.h
index fdff9b8d70c8..8130a5f09a6b 100644
--- a/arch/csky/include/asm/tlb.h
+++ b/arch/csky/include/asm/tlb.h
@@ -6,18 +6,6 @@
 
 #include <asm/cacheflush.h>
 
-#define tlb_start_vma(tlb, vma) \
-	do { \
-		if (!(tlb)->fullmm) \
-			flush_cache_range(vma, (vma)->vm_start, (vma)->vm_end); \
-	}  while (0)
-
-#define tlb_end_vma(tlb, vma) \
-	do { \
-		if (!(tlb)->fullmm) \
-			flush_tlb_range(vma, (vma)->vm_start, (vma)->vm_end); \
-	}  while (0)
-
 #define tlb_flush(tlb) flush_tlb_mm((tlb)->mm)
 
 #include <asm-generic/tlb.h>
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 107bb4319e0e..d9761b6f192a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -151,6 +151,7 @@ config PPC
 	select ARCH_USE_CMPXCHG_LOCKREF		if PPC64
 	select ARCH_USE_QUEUED_RWLOCKS		if PPC_QUEUED_SPINLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS	if PPC_QUEUED_SPINLOCKS
+	select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	select ARCH_WANT_LD_ORPHAN_WARN
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index 160422a439aa..880b7daf904e 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -19,8 +19,6 @@
 
 #include <linux/pagemap.h>
 
-#define tlb_start_vma(tlb, vma)	do { } while (0)
-#define tlb_end_vma(tlb, vma)	do { } while (0)
 #define __tlb_remove_tlb_entry	__tlb_remove_tlb_entry
 
 #define tlb_flush tlb_flush
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index c72874f09741..5b3dc5ca9873 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -113,6 +113,7 @@ config S390
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_CMPXCHG_LOCKREF
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
+	select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
 	select ARCH_WANT_DEFAULT_BPF_JIT
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select BUILDTIME_TABLE_SORT
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 954fa8ca6cbd..03f31d59f97c 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -27,9 +27,6 @@ static inline void tlb_flush(struct mmu_gather *tlb);
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
 					  struct page *page, int page_size);
 
-#define tlb_start_vma(tlb, vma)			do { } while (0)
-#define tlb_end_vma(tlb, vma)			do { } while (0)
-
 #define tlb_flush tlb_flush
 #define pte_free_tlb pte_free_tlb
 #define pmd_free_tlb pmd_free_tlb
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index c9c34dc52b7d..fb46e1b6f177 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -51,6 +51,7 @@ config SPARC
 	select NEED_DMA_MAP_STATE
 	select NEED_SG_DMA_LENGTH
 	select SET_FS
+	select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
 
 config SPARC32
 	def_bool !64BIT
diff --git a/arch/sparc/include/asm/tlb_64.h b/arch/sparc/include/asm/tlb_64.h
index 779a5a0f0608..3037187482db 100644
--- a/arch/sparc/include/asm/tlb_64.h
+++ b/arch/sparc/include/asm/tlb_64.h
@@ -22,8 +22,6 @@ void smp_flush_tlb_mm(struct mm_struct *mm);
 void __flush_tlb_pending(unsigned long, unsigned long, unsigned long *);
 void flush_tlb_pending(void);
 
-#define tlb_start_vma(tlb, vma) do { } while (0)
-#define tlb_end_vma(tlb, vma)	do { } while (0)
 #define tlb_flush(tlb)	flush_tlb_pending()
 
 /*
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6bd4d626a6b3..d56b0f5cb00c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -101,6 +101,7 @@ config X86
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
 	select ARCH_USE_SYM_ANNOTATIONS
+	select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANT_DEFAULT_BPF_JIT	if X86_64
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 1bfe979bb9bc..580636cdc257 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -2,9 +2,6 @@
 #ifndef _ASM_X86_TLB_H
 #define _ASM_X86_TLB_H
 
-#define tlb_start_vma(tlb, vma) do { } while (0)
-#define tlb_end_vma(tlb, vma) do { } while (0)
-
 #define tlb_flush tlb_flush
 static inline void tlb_flush(struct mmu_gather *tlb);
 
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 427bfcc6cdec..b97136b7010b 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -334,8 +334,8 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 
 #ifdef CONFIG_MMU_GATHER_NO_RANGE
 
-#if defined(tlb_flush) || defined(tlb_start_vma) || defined(tlb_end_vma)
-#error MMU_GATHER_NO_RANGE relies on default tlb_flush(), tlb_start_vma() and tlb_end_vma()
+#if defined(tlb_flush)
+#error MMU_GATHER_NO_RANGE relies on default tlb_flush()
 #endif
 
 /*
@@ -362,10 +362,6 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
 
 #ifndef tlb_flush
 
-#if defined(tlb_start_vma) || defined(tlb_end_vma)
-#error Default tlb_flush() relies on default tlb_start_vma() and tlb_end_vma()
-#endif
-
 /*
  * When an architecture does not provide its own tlb_flush() implementation
  * but does have a reasonably efficient flush_vma_range() implementation
@@ -486,7 +482,6 @@ static inline unsigned long tlb_get_unmap_size(struct mmu_gather *tlb)
  * case where we're doing a full MM flush.  When we're doing a munmap,
  * the vmas are adjusted to only cover the region to be torn down.
  */
-#ifndef tlb_start_vma
 static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
 {
 	if (tlb->fullmm)
@@ -495,14 +490,15 @@ static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *
 	tlb_update_vma_flags(tlb, vma);
 	flush_cache_range(vma, vma->vm_start, vma->vm_end);
 }
-#endif
 
-#ifndef tlb_end_vma
 static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
 {
 	if (tlb->fullmm)
 		return;
 
+	if (IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING))
+		return;
+
 	/*
 	 * Do a TLB flush and reset the range at VMA boundaries; this avoids
 	 * the ranges growing with the unused space between consecutive VMAs,
@@ -511,7 +507,6 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
 	 */
 	tlb_flush_mmu_tlbonly(tlb);
 }
-#endif
 
 #ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
 
diff --git a/init/Kconfig b/init/Kconfig
index 3d11a0f7c8cc..14a599a48738 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -849,6 +849,14 @@ config ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 config ARCH_HAS_TLB_GENERATIONS
 	bool
 
+#
+# For architectures that prefer to batch TLB flushes aggressively, i.e.,
+# not to flush after changing or removing each VMA. The architecture must
+# provide its own tlb_flush() function.
+config ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
+	bool
+	depends on !CONFIG_MMU_GATHER_NO_GATHER
+
 config CC_HAS_INT128
 	def_bool !$(cc-option,$(m64-flag) -D__SIZEOF_INT128__=0) && 64BIT
 
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
  2021-01-31  0:11 ` [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma() Nadav Amit
@ 2021-02-01 12:09   ` Peter Zijlstra
  2021-02-02  6:41     ` Nicholas Piggin
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2021-02-01 12:09 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Nadav Amit, Andrea Arcangeli,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin, linux-csky, linuxppc-dev,
	linux-s390, x86

On Sat, Jan 30, 2021 at 04:11:23PM -0800, Nadav Amit wrote:

> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 427bfcc6cdec..b97136b7010b 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -334,8 +334,8 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
>  
>  #ifdef CONFIG_MMU_GATHER_NO_RANGE
>  
> -#if defined(tlb_flush) || defined(tlb_start_vma) || defined(tlb_end_vma)
> -#error MMU_GATHER_NO_RANGE relies on default tlb_flush(), tlb_start_vma() and tlb_end_vma()
> +#if defined(tlb_flush)
> +#error MMU_GATHER_NO_RANGE relies on default tlb_flush()
>  #endif
>  
>  /*
> @@ -362,10 +362,6 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
>  
>  #ifndef tlb_flush
>  
> -#if defined(tlb_start_vma) || defined(tlb_end_vma)
> -#error Default tlb_flush() relies on default tlb_start_vma() and tlb_end_vma()
> -#endif

#ifdef CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
#error ....
#endif

goes here...


>  static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
>  {
>  	if (tlb->fullmm)
>  		return;
>  
> +	if (IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING))
> +		return;

Also, can you please stick to the CONFIG_MMU_GATHER_* namespace?

I also don't think AGRESSIVE_FLUSH_BATCHING quite captures what it does.
How about:

	CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH

?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
  2021-02-01 12:09   ` Peter Zijlstra
@ 2021-02-02  6:41     ` Nicholas Piggin
  2021-02-02  7:20       ` Nadav Amit
  0 siblings, 1 reply; 15+ messages in thread
From: Nicholas Piggin @ 2021-02-02  6:41 UTC (permalink / raw)
  To: Nadav Amit, Peter Zijlstra
  Cc: Andrea Arcangeli, Andrew Morton, Dave Hansen, linux-csky,
	linux-kernel, linux-mm, linuxppc-dev, linux-s390, Andy Lutomirski,
	Nadav Amit, Thomas Gleixner, Will Deacon, x86, Yu Zhao

Excerpts from Peter Zijlstra's message of February 1, 2021 10:09 pm:
> On Sat, Jan 30, 2021 at 04:11:23PM -0800, Nadav Amit wrote:
> 
>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>> index 427bfcc6cdec..b97136b7010b 100644
>> --- a/include/asm-generic/tlb.h
>> +++ b/include/asm-generic/tlb.h
>> @@ -334,8 +334,8 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
>>  
>>  #ifdef CONFIG_MMU_GATHER_NO_RANGE
>>  
>> -#if defined(tlb_flush) || defined(tlb_start_vma) || defined(tlb_end_vma)
>> -#error MMU_GATHER_NO_RANGE relies on default tlb_flush(), tlb_start_vma() and tlb_end_vma()
>> +#if defined(tlb_flush)
>> +#error MMU_GATHER_NO_RANGE relies on default tlb_flush()
>>  #endif
>>  
>>  /*
>> @@ -362,10 +362,6 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
>>  
>>  #ifndef tlb_flush
>>  
>> -#if defined(tlb_start_vma) || defined(tlb_end_vma)
>> -#error Default tlb_flush() relies on default tlb_start_vma() and tlb_end_vma()
>> -#endif
> 
> #ifdef CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
> #error ....
> #endif
> 
> goes here...
> 
> 
>>  static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
>>  {
>>  	if (tlb->fullmm)
>>  		return;
>>  
>> +	if (IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING))
>> +		return;
> 
> Also, can you please stick to the CONFIG_MMU_GATHER_* namespace?
> 
> I also don't think AGRESSIVE_FLUSH_BATCHING quite captures what it does.
> How about:
> 
> 	CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH

Yes please, have to have descriptive names.

I didn't quite see why this was much of an improvement though. Maybe 
follow up patches take advantage of it? I didn't see how they all fit 
together.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
  2021-02-02  6:41     ` Nicholas Piggin
@ 2021-02-02  7:20       ` Nadav Amit
  2021-02-02  9:31         ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Nadav Amit @ 2021-02-02  7:20 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Peter Zijlstra, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	linux-csky@vger.kernel.org, LKML, Linux-MM, linuxppc-dev,
	linux-s390, Andy Lutomirski, Thomas Gleixner, Will Deacon, X86 ML,
	Yu Zhao

> On Feb 1, 2021, at 10:41 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> Excerpts from Peter Zijlstra's message of February 1, 2021 10:09 pm:
>> I also don't think AGRESSIVE_FLUSH_BATCHING quite captures what it does.
>> How about:
>> 
>> 	CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH
> 
> Yes please, have to have descriptive names.

Point taken. I will fix it.

> 
> I didn't quite see why this was much of an improvement though. Maybe 
> follow up patches take advantage of it? I didn't see how they all fit 
> together.

They do, but I realized as I said in other emails that I have a serious bug
in the deferred invalidation scheme.

Having said that, I think there is an advantage of having an explicit config
option instead of relying on whether tlb_end_vma is defined. For instance,
Arm does not define tlb_end_vma, and consequently it flushes the TLB after
each VMA. I suspect it is not intentional.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
  2021-02-02  7:20       ` Nadav Amit
@ 2021-02-02  9:31         ` Peter Zijlstra
  2021-02-02  9:54           ` Nadav Amit
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2021-02-02  9:31 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Nicholas Piggin, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	linux-csky@vger.kernel.org, LKML, Linux-MM, linuxppc-dev,
	linux-s390, Andy Lutomirski, Thomas Gleixner, Will Deacon, X86 ML,
	Yu Zhao

On Tue, Feb 02, 2021 at 07:20:55AM +0000, Nadav Amit wrote:
> Arm does not define tlb_end_vma, and consequently it flushes the TLB after
> each VMA. I suspect it is not intentional.

ARM is one of those that look at the VM_EXEC bit to explicitly flush
ITLB IIRC, so it has to.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
  2021-02-02  9:31         ` Peter Zijlstra
@ 2021-02-02  9:54           ` Nadav Amit
  2021-02-02 11:04             ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Nadav Amit @ 2021-02-02  9:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nicholas Piggin, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	linux-csky@vger.kernel.org, LKML, Linux-MM, linuxppc-dev,
	linux-s390, Andy Lutomirski, Thomas Gleixner, Will Deacon, X86 ML,
	Yu Zhao

> On Feb 2, 2021, at 1:31 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Tue, Feb 02, 2021 at 07:20:55AM +0000, Nadav Amit wrote:
>> Arm does not define tlb_end_vma, and consequently it flushes the TLB after
>> each VMA. I suspect it is not intentional.
> 
> ARM is one of those that look at the VM_EXEC bit to explicitly flush
> ITLB IIRC, so it has to.

Hmm… I don’t think Arm is doing that. At least arm64 does not use the
default tlb_flush(), and it does not seem to consider VM_EXEC (at least in
this path):

static inline void tlb_flush(struct mmu_gather *tlb)
{
        struct vm_area_struct vma = TLB_FLUSH_VMA(tlb->mm, 0);
        bool last_level = !tlb->freed_tables;
        unsigned long stride = tlb_get_unmap_size(tlb);
        int tlb_level = tlb_get_level(tlb);
        
        /*
         * If we're tearing down the address space then we only care about
         * invalidating the walk-cache, since the ASID allocator won't
         * reallocate our ASID without invalidating the entire TLB.
         */
        if (tlb->mm_exiting) {
                if (!last_level)
                        flush_tlb_mm(tlb->mm);
                return;
        }       
        
        __flush_tlb_range(&vma, tlb->start, tlb->end, stride,
                          last_level, tlb_level);
}

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
  2021-02-02  9:54           ` Nadav Amit
@ 2021-02-02 11:04             ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2021-02-02 11:04 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Nicholas Piggin, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	linux-csky@vger.kernel.org, LKML, Linux-MM, linuxppc-dev,
	linux-s390, Andy Lutomirski, Thomas Gleixner, Will Deacon, X86 ML,
	Yu Zhao

On Tue, Feb 02, 2021 at 09:54:36AM +0000, Nadav Amit wrote:
> > On Feb 2, 2021, at 1:31 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Tue, Feb 02, 2021 at 07:20:55AM +0000, Nadav Amit wrote:
> >> Arm does not define tlb_end_vma, and consequently it flushes the TLB after
> >> each VMA. I suspect it is not intentional.
> > 
> > ARM is one of those that look at the VM_EXEC bit to explicitly flush
> > ITLB IIRC, so it has to.
> 
> Hmm… I don’t think Arm is doing that. At least arm64 does not use the
> default tlb_flush(), and it does not seem to consider VM_EXEC (at least in
> this path):
> 

ARM != ARM64. ARM certainly does, but you're right, I don't think ARM64
does this.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 00/20] TLB batching consolidation and enhancements
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
  2021-01-31  0:11 ` [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma() Nadav Amit
@ 2021-01-31  0:39 ` Andy Lutomirski
  2021-01-31  1:08   ` Nadav Amit
  2021-01-31  3:30 ` Nicholas Piggin
  2 siblings, 1 reply; 15+ messages in thread
From: Andy Lutomirski @ 2021-01-31  0:39 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux-MM, LKML, Nadav Amit, Andrea Arcangeli, Andrew Morton,
	Andy Lutomirski, Dave Hansen, linux-csky, linuxppc-dev,
	linux-s390, Mel Gorman, Nick Piggin, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, X86 ML, Yu Zhao

On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> From: Nadav Amit <namit@vmware.com>
>
> There are currently (at least?) 5 different TLB batching schemes in the
> kernel:
>
> 1. Using mmu_gather (e.g., zap_page_range()).
>
> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>    ongoing deferred TLB flush and flushing the entire range eventually
>    (e.g., change_protection_range()).
>
> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>
> 4. Batching per-table flushes (move_ptes()).
>
> 5. By setting a flag on that a deferred TLB flush operation takes place,
>    flushing when (try_to_unmap_one() on x86).

Are you referring to the arch_tlbbatch_add_mm/flush mechanism?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 00/20] TLB batching consolidation and enhancements
  2021-01-31  0:39 ` [RFC 00/20] TLB batching consolidation and enhancements Andy Lutomirski
@ 2021-01-31  1:08   ` Nadav Amit
  0 siblings, 0 replies; 15+ messages in thread
From: Nadav Amit @ 2021-01-31  1:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux-MM, LKML, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	linux-csky@vger.kernel.org, linuxppc-dev, linux-s390, Mel Gorman,
	Nick Piggin, Peter Zijlstra, Thomas Gleixner, Will Deacon, X86 ML,
	Yu Zhao

> On Jan 30, 2021, at 4:39 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> There are currently (at least?) 5 different TLB batching schemes in the
>> kernel:
>> 
>> 1. Using mmu_gather (e.g., zap_page_range()).
>> 
>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>>   ongoing deferred TLB flush and flushing the entire range eventually
>>   (e.g., change_protection_range()).
>> 
>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>> 
>> 4. Batching per-table flushes (move_ptes()).
>> 
>> 5. By setting a flag on that a deferred TLB flush operation takes place,
>>   flushing when (try_to_unmap_one() on x86).
> 
> Are you referring to the arch_tlbbatch_add_mm/flush mechanism?

Yes.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 00/20] TLB batching consolidation and enhancements
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
  2021-01-31  0:11 ` [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma() Nadav Amit
  2021-01-31  0:39 ` [RFC 00/20] TLB batching consolidation and enhancements Andy Lutomirski
@ 2021-01-31  3:30 ` Nicholas Piggin
  2021-01-31  7:57   ` Nadav Amit
  2 siblings, 1 reply; 15+ messages in thread
From: Nicholas Piggin @ 2021-01-31  3:30 UTC (permalink / raw)
  To: linux-kernel, linux-mm, Nadav Amit
  Cc: Andrea Arcangeli, Andrew Morton, Dave Hansen, linux-csky,
	linuxppc-dev, linux-s390, Andy Lutomirski, Mel Gorman, Nadav Amit,
	Peter Zijlstra, Thomas Gleixner, Will Deacon, x86, Yu Zhao

Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
> From: Nadav Amit <namit@vmware.com>
> 
> There are currently (at least?) 5 different TLB batching schemes in the
> kernel:
> 
> 1. Using mmu_gather (e.g., zap_page_range()).
> 
> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>    ongoing deferred TLB flush and flushing the entire range eventually
>    (e.g., change_protection_range()).
> 
> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
> 
> 4. Batching per-table flushes (move_ptes()).
> 
> 5. By setting a flag on that a deferred TLB flush operation takes place,
>    flushing when (try_to_unmap_one() on x86).
> 
> It seems that (1)-(4) can be consolidated. In addition, it seems that
> (5) is racy. It also seems there can be many redundant TLB flushes, and
> potentially TLB-shootdown storms, for instance during batched
> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
> defers TLB flushes.
> 
> More aggressive TLB batching may be possible, but this patch-set does
> not add such batching. The proposed changes would enable such batching
> in a later time.
> 
> Admittedly, I do not understand how things are not broken today, which
> frightens me to make further batching before getting things in order.
> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes
> for each page-table (but not in greater granularity). Can't
> ClearPageDirty() be called before the flush, causing writes after
> ClearPageDirty() and before the flush to be lost?

Because it's holding the page table lock which stops page_mkclean from 
cleaning the page. Or am I misunderstanding the question?

I'll go through the patches a bit more closely when they all come 
through. Sparc and powerpc of course need the arch lazy mode to get 
per-page/pte information for operations that are not freeing pages, 
which is what mmu gather is designed for.

I wouldn't mind using a similar API so it's less of a black box when 
reading generic code, but it might not quite fit the mmu gather API
exactly (most of these paths don't want a full mmu_gather on stack).

> 
> This patch-set therefore performs the following changes:
> 
> 1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
>    instead of {inc|dec}_tlb_flush_pending().
> 
> 2. Avoid TLB flushes if PTE permission is not demoted.
> 
> 3. Cleans up mmu_gather to be less arch-dependant.
> 
> 4. Uses mm's generations to track in finer granularity, either per-VMA
>    or per page-table, whether a pending mmu_gather operation is
>    outstanding. This should allow to avoid some TLB flushes when KSM or
>    memory reclamation takes place while another operation such as
>    munmap() or mprotect() is running.
> 
> 5. Changes try_to_unmap_one() flushing scheme, as the current seems
>    broken to track in a bitmap which CPUs have outstanding TLB flushes
>    instead of having a flag.

Putting fixes first, and cleanups and independent patches (like #2) next
would help with getting stuff merged and backported.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 00/20] TLB batching consolidation and enhancements
  2021-01-31  3:30 ` Nicholas Piggin
@ 2021-01-31  7:57   ` Nadav Amit
  2021-01-31  8:14     ` Nadav Amit
  2021-02-01 12:44     ` Peter Zijlstra
  0 siblings, 2 replies; 15+ messages in thread
From: Nadav Amit @ 2021-01-31  7:57 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: LKML, Linux-MM, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	linux-csky@vger.kernel.org, linuxppc-dev, linux-s390,
	Andy Lutomirski, Mel Gorman, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, X86 ML, Yu Zhao

> On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> There are currently (at least?) 5 different TLB batching schemes in the
>> kernel:
>> 
>> 1. Using mmu_gather (e.g., zap_page_range()).
>> 
>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>>   ongoing deferred TLB flush and flushing the entire range eventually
>>   (e.g., change_protection_range()).
>> 
>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>> 
>> 4. Batching per-table flushes (move_ptes()).
>> 
>> 5. By setting a flag on that a deferred TLB flush operation takes place,
>>   flushing when (try_to_unmap_one() on x86).
>> 
>> It seems that (1)-(4) can be consolidated. In addition, it seems that
>> (5) is racy. It also seems there can be many redundant TLB flushes, and
>> potentially TLB-shootdown storms, for instance during batched
>> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
>> defers TLB flushes.
>> 
>> More aggressive TLB batching may be possible, but this patch-set does
>> not add such batching. The proposed changes would enable such batching
>> in a later time.
>> 
>> Admittedly, I do not understand how things are not broken today, which
>> frightens me to make further batching before getting things in order.
>> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes
>> for each page-table (but not in greater granularity). Can't
>> ClearPageDirty() be called before the flush, causing writes after
>> ClearPageDirty() and before the flush to be lost?
> 
> Because it's holding the page table lock which stops page_mkclean from 
> cleaning the page. Or am I misunderstanding the question?

Thanks. I understood this part. Looking again at the code, I now understand
my confusion: I forgot that the reverse mapping is removed after the PTE is
zapped.

Makes me wonder whether it is ok to defer the TLB flush to tlb_finish_mmu(),
by performing set_page_dirty() for the batched pages when needed in
tlb_finish_mmu() [ i.e., by marking for each batched page whether
set_page_dirty() should be issued for that page while collecting them ].

> I'll go through the patches a bit more closely when they all come 
> through. Sparc and powerpc of course need the arch lazy mode to get 
> per-page/pte information for operations that are not freeing pages, 
> which is what mmu gather is designed for.

IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE
where no previous PTE was set, right?

> I wouldn't mind using a similar API so it's less of a black box when 
> reading generic code, but it might not quite fit the mmu gather API
> exactly (most of these paths don't want a full mmu_gather on stack).

I see your point. It may be possible to create two mmu_gather structs: a
small one that only holds the flush information and another that also holds
the pages. 

>> This patch-set therefore performs the following changes:
>> 
>> 1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
>>   instead of {inc|dec}_tlb_flush_pending().
>> 
>> 2. Avoid TLB flushes if PTE permission is not demoted.
>> 
>> 3. Cleans up mmu_gather to be less arch-dependant.
>> 
>> 4. Uses mm's generations to track in finer granularity, either per-VMA
>>   or per page-table, whether a pending mmu_gather operation is
>>   outstanding. This should allow to avoid some TLB flushes when KSM or
>>   memory reclamation takes place while another operation such as
>>   munmap() or mprotect() is running.
>> 
>> 5. Changes try_to_unmap_one() flushing scheme, as the current seems
>>   broken to track in a bitmap which CPUs have outstanding TLB flushes
>>   instead of having a flag.
> 
> Putting fixes first, and cleanups and independent patches (like #2) next
> would help with getting stuff merged and backported.

I tried to do it mostly this way. There are some theoretical races which
I did not manage (or try hard enough) to create, so I did not include in
the “fixes” section. I will restructure the patch-set according to the
feedback.

Thanks,
Nadav

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 00/20] TLB batching consolidation and enhancements
  2021-01-31  7:57   ` Nadav Amit
@ 2021-01-31  8:14     ` Nadav Amit
  2021-02-01 12:44     ` Peter Zijlstra
  1 sibling, 0 replies; 15+ messages in thread
From: Nadav Amit @ 2021-01-31  8:14 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Nicholas Piggin, LKML, Linux-MM, Andrea Arcangeli, Andrew Morton,
	Dave Hansen, linux-csky@vger.kernel.org, linuxppc-dev, linux-s390,
	Andy Lutomirski, Mel Gorman, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, X86 ML, Yu Zhao

> On Jan 30, 2021, at 11:57 PM, Nadav Amit <namit@vmware.com> wrote:
> 
>> On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
>> 
>> Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
>>> From: Nadav Amit <namit@vmware.com>
>>> 
>>> There are currently (at least?) 5 different TLB batching schemes in the
>>> kernel:
>>> 
>>> 1. Using mmu_gather (e.g., zap_page_range()).
>>> 
>>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>>>  ongoing deferred TLB flush and flushing the entire range eventually
>>>  (e.g., change_protection_range()).
>>> 
>>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>>> 
>>> 4. Batching per-table flushes (move_ptes()).
>>> 
>>> 5. By setting a flag on that a deferred TLB flush operation takes place,
>>>  flushing when (try_to_unmap_one() on x86).
>>> 
>>> It seems that (1)-(4) can be consolidated. In addition, it seems that
>>> (5) is racy. It also seems there can be many redundant TLB flushes, and
>>> potentially TLB-shootdown storms, for instance during batched
>>> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
>>> defers TLB flushes.
>>> 
>>> More aggressive TLB batching may be possible, but this patch-set does
>>> not add such batching. The proposed changes would enable such batching
>>> in a later time.
>>> 
>>> Admittedly, I do not understand how things are not broken today, which
>>> frightens me to make further batching before getting things in order.
>>> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes
>>> for each page-table (but not in greater granularity). Can't
>>> ClearPageDirty() be called before the flush, causing writes after
>>> ClearPageDirty() and before the flush to be lost?
>> 
>> Because it's holding the page table lock which stops page_mkclean from 
>> cleaning the page. Or am I misunderstanding the question?
> 
> Thanks. I understood this part. Looking again at the code, I now understand
> my confusion: I forgot that the reverse mapping is removed after the PTE is
> zapped.
> 
> Makes me wonder whether it is ok to defer the TLB flush to tlb_finish_mmu(),
> by performing set_page_dirty() for the batched pages when needed in
> tlb_finish_mmu() [ i.e., by marking for each batched page whether
> set_page_dirty() should be issued for that page while collecting them ].

Correcting myself (I hope): no we cannot do so, since the buffers might be
remove from the page at that point.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 00/20] TLB batching consolidation and enhancements
  2021-01-31  7:57   ` Nadav Amit
  2021-01-31  8:14     ` Nadav Amit
@ 2021-02-01 12:44     ` Peter Zijlstra
  2021-02-02  7:14       ` Nicholas Piggin
  1 sibling, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2021-02-01 12:44 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Nicholas Piggin, LKML, Linux-MM, Andrea Arcangeli, Andrew Morton,
	Dave Hansen, linux-csky@vger.kernel.org, linuxppc-dev, linux-s390,
	Andy Lutomirski, Mel Gorman, Thomas Gleixner, Will Deacon, X86 ML,
	Yu Zhao

On Sun, Jan 31, 2021 at 07:57:01AM +0000, Nadav Amit wrote:
> > On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote:

> > I'll go through the patches a bit more closely when they all come 
> > through. Sparc and powerpc of course need the arch lazy mode to get 
> > per-page/pte information for operations that are not freeing pages, 
> > which is what mmu gather is designed for.
> 
> IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE
> where no previous PTE was set, right?

These are the HASH architectures. Their hardware doesn't walk the
page-tables, but it consults a hash-table to resolve page translations.

They _MUST_ flush the entries under the PTL to avoid ever seeing
conflicting information, which will make them really unhappy. They can
do this because they have TLBI broadcast.

There's a few more details I'm sure, but those seem to have slipped from
my mind.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC 00/20] TLB batching consolidation and enhancements
  2021-02-01 12:44     ` Peter Zijlstra
@ 2021-02-02  7:14       ` Nicholas Piggin
  0 siblings, 0 replies; 15+ messages in thread
From: Nicholas Piggin @ 2021-02-02  7:14 UTC (permalink / raw)
  To: Nadav Amit, Peter Zijlstra
  Cc: Andrea Arcangeli, Andrew Morton, Dave Hansen,
	linux-csky@vger.kernel.org, LKML, Linux-MM, linuxppc-dev,
	linux-s390, Andy Lutomirski, Mel Gorman, Thomas Gleixner,
	Will Deacon, X86 ML, Yu Zhao

Excerpts from Peter Zijlstra's message of February 1, 2021 10:44 pm:
> On Sun, Jan 31, 2021 at 07:57:01AM +0000, Nadav Amit wrote:
>> > On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
> 
>> > I'll go through the patches a bit more closely when they all come 
>> > through. Sparc and powerpc of course need the arch lazy mode to get 
>> > per-page/pte information for operations that are not freeing pages, 
>> > which is what mmu gather is designed for.
>> 
>> IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE
>> where no previous PTE was set, right?

In cases of increasing permissiveness of access, yes it may want to 
update the "TLB" (read hash table) to avoid taking hash table faults.

But whatever the reason for the flush, there may have to be more
data carried than just the virtual address range and/or physical
pages.

If you clear out the PTE then you have no guarantee of actually being
able to go back and address the the in-memory or in-hardware translation 
structures to update them, depending on what exact scheme is used
(powerpc probably could if all page sizes were the same, but THP or 
64k/4k sub pages would throw a spanner in those works).

> These are the HASH architectures. Their hardware doesn't walk the
> page-tables, but it consults a hash-table to resolve page translations.

Yeah, it's very cool in a masochistic way.

I actually don't know if it's worth doing a big rework of it, as much 
as I'd like to. Rather than just keep it in place and eventually
dismantling some of the go-fast hooks from core code if we can one day
deprecate it in favour of the much easier radix mode.

The whole thing is like a big steam train, years ago Paul and Ben and 
Anton and co got the boiler stoked up and set all the valves just right 
so it runs unbelievably well for what it's actually doing but look at it
the wrong way and the whole thing could blow up. (at least that's what 
it feels like to me probably because I don't know the code that well).

Sparc could probably do the same, not sure about Xen. I don't suppose
vmware is intending to add any kind of paravirt mode related to this stuff?

Thanks,
Nick

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-02-02 11:05 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
2021-01-31  0:11 ` [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma() Nadav Amit
2021-02-01 12:09   ` Peter Zijlstra
2021-02-02  6:41     ` Nicholas Piggin
2021-02-02  7:20       ` Nadav Amit
2021-02-02  9:31         ` Peter Zijlstra
2021-02-02  9:54           ` Nadav Amit
2021-02-02 11:04             ` Peter Zijlstra
2021-01-31  0:39 ` [RFC 00/20] TLB batching consolidation and enhancements Andy Lutomirski
2021-01-31  1:08   ` Nadav Amit
2021-01-31  3:30 ` Nicholas Piggin
2021-01-31  7:57   ` Nadav Amit
2021-01-31  8:14     ` Nadav Amit
2021-02-01 12:44     ` Peter Zijlstra
2021-02-02  7:14       ` Nicholas Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox