* [RFC 00/20] TLB batching consolidation and enhancements
@ 2021-01-31 0:11 Nadav Amit
From: Nadav Amit @ 2021-01-31 0:11 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
Dave Hansen, linux-csky, linuxppc-dev, linux-s390, Mel Gorman,
Nick Piggin, Peter Zijlstra, Thomas Gleixner, Will Deacon, x86,
Yu Zhao
From: Nadav Amit <namit@vmware.com>
There are currently (at least?) 5 different TLB batching schemes in the
kernel:
1. Using mmu_gather (e.g., zap_page_range()).
2. Using {inc|dec}_tlb_flush_pending() to inform other threads of an
ongoing deferred TLB flush, and flushing the entire range eventually
(e.g., change_protection_range()).
3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
4. Batching per-table flushes (move_ptes()).
5. Setting a flag to indicate that a deferred TLB flush is pending, and
performing the flush when needed (try_to_unmap_one() on x86).
It seems that (1)-(4) can be consolidated. In addition, it seems that
(5) is racy. It also seems there can be many redundant TLB flushes, and
potentially TLB-shootdown storms, for instance during batched
reclamation (using try_to_unmap_one()) if at the same time mmu_gather
defers TLB flushes.
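
For readers less familiar with these interfaces, below is a minimal sketch of
how schemes (1) and (2) above are typically driven. The sketch_*() function
names are made up for illustration, and the exact kernel signatures (notably
tlb_gather_mmu()/tlb_finish_mmu()) have changed across versions, so treat the
calls as approximate rather than as a copy of any real caller:

```c
#include <linux/mm.h>
#include <asm/tlb.h>
#include <asm/tlbflush.h>

/*
 * Scheme (1): mmu_gather records the range to flush and the pages to free,
 * flushes the TLB once, and only then releases the pages.  Signatures are
 * approximate; they differ between kernel versions.
 */
static void sketch_zap_range(struct mm_struct *mm, struct vm_area_struct *vma,
			     unsigned long start, unsigned long end)
{
	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm, start, end);
	tlb_start_vma(&tlb, vma);

	/*
	 * Page-table walk elided.  For each present PTE one would roughly do:
	 *   pte = ptep_get_and_clear(mm, addr, ptep);
	 *   tlb_remove_tlb_entry(&tlb, ptep, addr);   // record range to flush
	 *   tlb_remove_page(&tlb, page);              // defer freeing the page
	 */

	tlb_end_vma(&tlb, vma);
	tlb_finish_mmu(&tlb, start, end);	/* flush TLBs, then free pages */
}

/*
 * Scheme (2): mark a deferred flush as pending so that concurrent code
 * (e.g., try_to_unmap_one()) knows PTE changes may not be visible yet,
 * then flush the entire range once, as change_protection_range() does.
 */
static void sketch_change_protection(struct mm_struct *mm,
				     struct vm_area_struct *vma,
				     unsigned long start, unsigned long end)
{
	inc_tlb_flush_pending(mm);
	/* ... modify the PTEs in [start, end) under the page-table lock ... */
	flush_tlb_range(vma, start, end);
	dec_tlb_flush_pending(mm);
}
```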
More aggressive TLB batching may be possible, but this patch-set does
not add it. The proposed changes would enable such batching at a later
time.
Admittedly, I do not understand how things are not broken today, which
makes me wary of adding further batching before getting things in order.
For instance, why is it OK for zap_pte_range() to batch dirty-PTE flushes
per page-table (but not at a coarser granularity)? Can't
ClearPageDirty() be called before the flush, causing writes after
ClearPageDirty() and before the flush to be lost?
This patch-set therefore performs the following changes:
1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
instead of {inc|dec}_tlb_flush_pending().
2. Avoid TLB flushes if PTE permission is not demoted (a sketch follows
this list).
3. Clean up mmu_gather to be less arch-dependent.
4. Use the mm's generations to track, at a finer granularity (either
per-VMA or per page-table), whether a pending mmu_gather operation is
outstanding (a second sketch after this list illustrates the idea). This
should make it possible to avoid some TLB flushes when KSM or memory
reclamation takes place while another operation such as munmap() or
mprotect() is running.
5. Change the try_to_unmap_one() flushing scheme, as the current one
seems broken: track in a bitmap which CPUs have outstanding TLB flushes
instead of using a single flag.
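
To make change (2) above more concrete, the kind of check involved looks
roughly like the following. The helper name pte_may_need_flush() and the exact
set of conditions are assumptions for illustration, not the code from the
series; the real logic also has to consider the dirty and accessed bits, which
are omitted here:

```c
#include <linux/pgtable.h>

/*
 * Illustrative only: decide whether replacing old_pte with new_pte requires
 * a TLB flush.  Pure "promotions" (e.g., only gaining write permission)
 * leave nothing harmful in the TLB, so the flush can be skipped.
 */
static inline bool pte_may_need_flush(pte_t old_pte, pte_t new_pte)
{
	/* A PTE that was not present cannot be cached in the TLB. */
	if (!pte_present(old_pte))
		return false;

	/* Losing write permission is a demotion and must be flushed. */
	if (pte_write(old_pte) && !pte_write(new_pte))
		return true;

	/* Making the page inaccessible (e.g., PROT_NONE) must be flushed. */
	if (!pte_protnone(old_pte) && pte_protnone(new_pte))
		return true;

	/* Pointing at a different page frame must be flushed. */
	if (pte_pfn(old_pte) != pte_pfn(new_pte))
		return true;

	/*
	 * Otherwise the change only adds permissions; a stale TLB entry can
	 * at worst cause a spurious fault, not a correctness problem.
	 */
	return false;
}
```

A change_protection()-style loop would consult such a check for every PTE it
rewrites and only mark the range for flushing when the check returns true.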
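
Similarly, a hedged sketch of the generation tracking behind change (4); the
structure and helper names are made up for illustration, and the series tracks
generations at a finer granularity (per-VMA or per page-table) than shown:

```c
#include <linux/atomic.h>
#include <linux/types.h>

/*
 * Illustrative only: deferring a flush bumps the mm's TLB generation, and a
 * later operation can skip its own flush if the generation it depends on has
 * already been completed by whoever flushed in the meantime.
 */
struct sketch_tlb_gens {
	atomic64_t tlb_gen;		/* bumped when a flush is deferred  */
	atomic64_t tlb_gen_completed;	/* highest generation fully flushed */
};

/* Record a deferred flush; returns the generation the caller depends on. */
static inline u64 sketch_defer_flush(struct sketch_tlb_gens *gens)
{
	return atomic64_inc_return(&gens->tlb_gen);
}

/* True if the TLBs may still hold entries from generation @gen. */
static inline bool sketch_flush_still_needed(struct sketch_tlb_gens *gens,
					     u64 gen)
{
	return atomic64_read(&gens->tlb_gen_completed) < gen;
}
```

Whoever performs the flush advances tlb_gen_completed afterwards; doing that
without moving it backwards requires care (e.g., a cmpxchg loop), which is
omitted from this sketch.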
Further optimizations are possible, such as changing move_ptes() to use
mmu_gather.
The patches were only very lightly tested. I am looking forward to your
feedback on the overall approach, and on whether to split the series into
multiple patch-sets.
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-csky@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: x86@kernel.org
Cc: Yu Zhao <yuzhao@google.com>
Nadav Amit (20):
mm/tlb: fix fullmm semantics
mm/mprotect: use mmu_gather
mm/mprotect: do not flush on permission promotion
mm/mapping_dirty_helpers: use mmu_gather
mm/tlb: move BATCHED_UNMAP_TLB_FLUSH to tlb.h
fs/task_mmu: use mmu_gather interface of clear-soft-dirty
mm: move x86 tlb_gen to generic code
mm: store completed TLB generation
mm: create pte/pmd_tlb_flush_pending()
mm: add pte_to_page()
mm/tlb: remove arch-specific tlb_start/end_vma()
mm/tlb: save the VMA that is flushed during tlb_start_vma()
mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes()
mm: move inc/dec_tlb_flush_pending() to mmu_gather.c
mm: detect deferred TLB flushes in vma granularity
mm/tlb: per-page table generation tracking
mm/tlb: updated completed deferred TLB flush conditionally
mm: make mm_cpumask() volatile
lib/cpumask: introduce cpumask_atomic_or()
mm/rmap: avoid potential races
arch/arm/include/asm/bitops.h | 4 +-
arch/arm/include/asm/pgtable.h | 4 +-
arch/arm64/include/asm/pgtable.h | 4 +-
arch/csky/Kconfig | 1 +
arch/csky/include/asm/tlb.h | 12 --
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/tlb.h | 2 -
arch/s390/Kconfig | 1 +
arch/s390/include/asm/tlb.h | 3 -
arch/sparc/Kconfig | 1 +
arch/sparc/include/asm/pgtable_64.h | 9 +-
arch/sparc/include/asm/tlb_64.h | 2 -
arch/sparc/mm/init_64.c | 2 +-
arch/x86/Kconfig | 3 +
arch/x86/hyperv/mmu.c | 2 +-
arch/x86/include/asm/mmu.h | 10 -
arch/x86/include/asm/mmu_context.h | 1 -
arch/x86/include/asm/paravirt_types.h | 2 +-
arch/x86/include/asm/pgtable.h | 24 +--
arch/x86/include/asm/tlb.h | 21 +-
arch/x86/include/asm/tlbbatch.h | 15 --
arch/x86/include/asm/tlbflush.h | 61 ++++--
arch/x86/mm/tlb.c | 52 +++--
arch/x86/xen/mmu_pv.c | 2 +-
drivers/firmware/efi/efi.c | 1 +
fs/proc/task_mmu.c | 29 ++-
include/asm-generic/bitops/find.h | 8 +-
include/asm-generic/tlb.h | 291 +++++++++++++++++++++-----
include/linux/bitmap.h | 21 +-
include/linux/cpumask.h | 40 ++--
include/linux/huge_mm.h | 3 +-
include/linux/mm.h | 29 ++-
include/linux/mm_types.h | 166 ++++++++++-----
include/linux/mm_types_task.h | 13 --
include/linux/pgtable.h | 2 +-
include/linux/smp.h | 6 +-
init/Kconfig | 21 ++
kernel/fork.c | 2 +
kernel/smp.c | 8 +-
lib/bitmap.c | 33 ++-
lib/cpumask.c | 8 +-
lib/find_bit.c | 10 +-
mm/huge_memory.c | 6 +-
mm/init-mm.c | 1 +
mm/internal.h | 16 --
mm/ksm.c | 2 +-
mm/madvise.c | 6 +-
mm/mapping_dirty_helpers.c | 52 +++--
mm/memory.c | 2 +
mm/mmap.c | 1 +
mm/mmu_gather.c | 59 +++++-
mm/mprotect.c | 55 ++---
mm/mremap.c | 2 +-
mm/pgtable-generic.c | 2 +-
mm/rmap.c | 42 ++--
mm/vmscan.c | 1 +
56 files changed, 803 insertions(+), 374 deletions(-)
delete mode 100644 arch/x86/include/asm/tlbbatch.h
--
2.25.1
^ permalink raw reply [flat|nested] 15+ messages in thread* [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma() 2021-01-31 0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit @ 2021-01-31 0:11 ` Nadav Amit 2021-02-01 12:09 ` Peter Zijlstra 2021-01-31 0:39 ` [RFC 00/20] TLB batching consolidation and enhancements Andy Lutomirski 2021-01-31 3:30 ` Nicholas Piggin 2 siblings, 1 reply; 15+ messages in thread From: Nadav Amit @ 2021-01-31 0:11 UTC (permalink / raw) To: linux-mm, linux-kernel Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin, linux-csky, linuxppc-dev, linux-s390, x86 From: Nadav Amit <namit@vmware.com> Architecture-specific tlb_start_vma() and tlb_end_vma() seem unnecessary. They are currently used for: 1. Avoid per-VMA TLB flushes. This can be determined by introducing a new config option. 2. Avoid saving information on the vma that is being flushed. Saving this information, even for architectures that do not need it, is cheap and we will need it for per-VMA deferred TLB flushing. 3. Avoid calling flush_cache_range(). Remove the architecture specific tlb_start_vma() and tlb_end_vma() in the following manner, corresponding to the previous requirements: 1. Introduce a new config option - ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING - to allow architectures to define whether they want aggressive TLB flush batching (instead of flushing mappings of each VMA separately). 2. Save information on the vma regardless of architecture. Saving this information should have negligible overhead, and they will be needed for fine granularity TLB flushes. 3. flush_cache_range() is anyhow not defined for the architectures that implement tlb_start/end_vma(). No functional change intended. 
Signed-off-by: Nadav Amit <namit@vmware.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Nick Piggin <npiggin@gmail.com> Cc: linux-csky@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s390@vger.kernel.org Cc: x86@kernel.org --- arch/csky/Kconfig | 1 + arch/csky/include/asm/tlb.h | 12 ------------ arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/tlb.h | 2 -- arch/s390/Kconfig | 1 + arch/s390/include/asm/tlb.h | 3 --- arch/sparc/Kconfig | 1 + arch/sparc/include/asm/tlb_64.h | 2 -- arch/x86/Kconfig | 1 + arch/x86/include/asm/tlb.h | 3 --- include/asm-generic/tlb.h | 15 +++++---------- init/Kconfig | 8 ++++++++ 12 files changed, 18 insertions(+), 32 deletions(-) diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig index 89dd2fcf38fa..924ff5721240 100644 --- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -8,6 +8,7 @@ config CSKY select ARCH_HAS_SYNC_DMA_FOR_DEVICE select ARCH_USE_BUILTIN_BSWAP select ARCH_USE_QUEUED_RWLOCKS if NR_CPUS>2 + select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING select ARCH_WANT_FRAME_POINTERS if !CPU_CK610 select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT select COMMON_CLK diff --git a/arch/csky/include/asm/tlb.h b/arch/csky/include/asm/tlb.h index fdff9b8d70c8..8130a5f09a6b 100644 --- a/arch/csky/include/asm/tlb.h +++ b/arch/csky/include/asm/tlb.h @@ -6,18 +6,6 @@ #include <asm/cacheflush.h> -#define tlb_start_vma(tlb, vma) \ - do { \ - if (!(tlb)->fullmm) \ - flush_cache_range(vma, (vma)->vm_start, (vma)->vm_end); \ - } while (0) - -#define tlb_end_vma(tlb, vma) \ - do { \ - if (!(tlb)->fullmm) \ - flush_tlb_range(vma, (vma)->vm_start, (vma)->vm_end); \ - } while (0) - #define tlb_flush(tlb) flush_tlb_mm((tlb)->mm) #include <asm-generic/tlb.h> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 107bb4319e0e..d9761b6f192a 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -151,6 +151,7 @@ config PPC select ARCH_USE_CMPXCHG_LOCKREF if PPC64 select ARCH_USE_QUEUED_RWLOCKS if PPC_QUEUED_SPINLOCKS select ARCH_USE_QUEUED_SPINLOCKS if PPC_QUEUED_SPINLOCKS + select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING select ARCH_WANT_IPC_PARSE_VERSION select ARCH_WANT_IRQS_OFF_ACTIVATE_MM select ARCH_WANT_LD_ORPHAN_WARN diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h index 160422a439aa..880b7daf904e 100644 --- a/arch/powerpc/include/asm/tlb.h +++ b/arch/powerpc/include/asm/tlb.h @@ -19,8 +19,6 @@ #include <linux/pagemap.h> -#define tlb_start_vma(tlb, vma) do { } while (0) -#define tlb_end_vma(tlb, vma) do { } while (0) #define __tlb_remove_tlb_entry __tlb_remove_tlb_entry #define tlb_flush tlb_flush diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index c72874f09741..5b3dc5ca9873 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -113,6 +113,7 @@ config S390 select ARCH_USE_BUILTIN_BSWAP select ARCH_USE_CMPXCHG_LOCKREF select ARCH_WANTS_DYNAMIC_TASK_STRUCT + select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING select ARCH_WANT_DEFAULT_BPF_JIT select ARCH_WANT_IPC_PARSE_VERSION select BUILDTIME_TABLE_SORT diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h index 954fa8ca6cbd..03f31d59f97c 100644 --- a/arch/s390/include/asm/tlb.h +++ b/arch/s390/include/asm/tlb.h @@ -27,9 +27,6 @@ static inline void tlb_flush(struct mmu_gather *tlb); 
static inline bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size); -#define tlb_start_vma(tlb, vma) do { } while (0) -#define tlb_end_vma(tlb, vma) do { } while (0) - #define tlb_flush tlb_flush #define pte_free_tlb pte_free_tlb #define pmd_free_tlb pmd_free_tlb diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig index c9c34dc52b7d..fb46e1b6f177 100644 --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -51,6 +51,7 @@ config SPARC select NEED_DMA_MAP_STATE select NEED_SG_DMA_LENGTH select SET_FS + select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING config SPARC32 def_bool !64BIT diff --git a/arch/sparc/include/asm/tlb_64.h b/arch/sparc/include/asm/tlb_64.h index 779a5a0f0608..3037187482db 100644 --- a/arch/sparc/include/asm/tlb_64.h +++ b/arch/sparc/include/asm/tlb_64.h @@ -22,8 +22,6 @@ void smp_flush_tlb_mm(struct mm_struct *mm); void __flush_tlb_pending(unsigned long, unsigned long, unsigned long *); void flush_tlb_pending(void); -#define tlb_start_vma(tlb, vma) do { } while (0) -#define tlb_end_vma(tlb, vma) do { } while (0) #define tlb_flush(tlb) flush_tlb_pending() /* diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 6bd4d626a6b3..d56b0f5cb00c 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -101,6 +101,7 @@ config X86 select ARCH_USE_QUEUED_RWLOCKS select ARCH_USE_QUEUED_SPINLOCKS select ARCH_USE_SYM_ANNOTATIONS + select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH select ARCH_WANT_DEFAULT_BPF_JIT if X86_64 select ARCH_WANTS_DYNAMIC_TASK_STRUCT diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h index 1bfe979bb9bc..580636cdc257 100644 --- a/arch/x86/include/asm/tlb.h +++ b/arch/x86/include/asm/tlb.h @@ -2,9 +2,6 @@ #ifndef _ASM_X86_TLB_H #define _ASM_X86_TLB_H -#define tlb_start_vma(tlb, vma) do { } while (0) -#define tlb_end_vma(tlb, vma) do { } while (0) - #define tlb_flush tlb_flush static inline void tlb_flush(struct mmu_gather *tlb); diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 427bfcc6cdec..b97136b7010b 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -334,8 +334,8 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb) #ifdef CONFIG_MMU_GATHER_NO_RANGE -#if defined(tlb_flush) || defined(tlb_start_vma) || defined(tlb_end_vma) -#error MMU_GATHER_NO_RANGE relies on default tlb_flush(), tlb_start_vma() and tlb_end_vma() +#if defined(tlb_flush) +#error MMU_GATHER_NO_RANGE relies on default tlb_flush() #endif /* @@ -362,10 +362,6 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm #ifndef tlb_flush -#if defined(tlb_start_vma) || defined(tlb_end_vma) -#error Default tlb_flush() relies on default tlb_start_vma() and tlb_end_vma() -#endif - /* * When an architecture does not provide its own tlb_flush() implementation * but does have a reasonably efficient flush_vma_range() implementation @@ -486,7 +482,6 @@ static inline unsigned long tlb_get_unmap_size(struct mmu_gather *tlb) * case where we're doing a full MM flush. When we're doing a munmap, * the vmas are adjusted to only cover the region to be torn down. 
*/ -#ifndef tlb_start_vma static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *vma) { if (tlb->fullmm) @@ -495,14 +490,15 @@ static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct * tlb_update_vma_flags(tlb, vma); flush_cache_range(vma, vma->vm_start, vma->vm_end); } -#endif -#ifndef tlb_end_vma static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma) { if (tlb->fullmm) return; + if (IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING)) + return; + /* * Do a TLB flush and reset the range at VMA boundaries; this avoids * the ranges growing with the unused space between consecutive VMAs, @@ -511,7 +507,6 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm */ tlb_flush_mmu_tlbonly(tlb); } -#endif #ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS diff --git a/init/Kconfig b/init/Kconfig index 3d11a0f7c8cc..14a599a48738 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -849,6 +849,14 @@ config ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH config ARCH_HAS_TLB_GENERATIONS bool +# +# For architectures that prefer to batch TLB flushes aggressively, i.e., +# not to flush after changing or removing each VMA. The architecture must +# provide its own tlb_flush() function. +config ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING + bool + depends on !CONFIG_MMU_GATHER_NO_GATHER + config CC_HAS_INT128 def_bool !$(cc-option,$(m64-flag) -D__SIZEOF_INT128__=0) && 64BIT -- 2.25.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma() 2021-01-31 0:11 ` [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma() Nadav Amit @ 2021-02-01 12:09 ` Peter Zijlstra 2021-02-02 6:41 ` Nicholas Piggin 0 siblings, 1 reply; 15+ messages in thread From: Peter Zijlstra @ 2021-02-01 12:09 UTC (permalink / raw) To: Nadav Amit Cc: linux-mm, linux-kernel, Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski, Dave Hansen, Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin, linux-csky, linuxppc-dev, linux-s390, x86 On Sat, Jan 30, 2021 at 04:11:23PM -0800, Nadav Amit wrote: > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h > index 427bfcc6cdec..b97136b7010b 100644 > --- a/include/asm-generic/tlb.h > +++ b/include/asm-generic/tlb.h > @@ -334,8 +334,8 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb) > > #ifdef CONFIG_MMU_GATHER_NO_RANGE > > -#if defined(tlb_flush) || defined(tlb_start_vma) || defined(tlb_end_vma) > -#error MMU_GATHER_NO_RANGE relies on default tlb_flush(), tlb_start_vma() and tlb_end_vma() > +#if defined(tlb_flush) > +#error MMU_GATHER_NO_RANGE relies on default tlb_flush() > #endif > > /* > @@ -362,10 +362,6 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm > > #ifndef tlb_flush > > -#if defined(tlb_start_vma) || defined(tlb_end_vma) > -#error Default tlb_flush() relies on default tlb_start_vma() and tlb_end_vma() > -#endif #ifdef CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING #error .... #endif goes here... > static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma) > { > if (tlb->fullmm) > return; > > + if (IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING)) > + return; Also, can you please stick to the CONFIG_MMU_GATHER_* namespace? I also don't think AGRESSIVE_FLUSH_BATCHING quite captures what it does. How about: CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH ? ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma() 2021-02-01 12:09 ` Peter Zijlstra @ 2021-02-02 6:41 ` Nicholas Piggin 2021-02-02 7:20 ` Nadav Amit 0 siblings, 1 reply; 15+ messages in thread From: Nicholas Piggin @ 2021-02-02 6:41 UTC (permalink / raw) To: Nadav Amit, Peter Zijlstra Cc: Andrea Arcangeli, Andrew Morton, Dave Hansen, linux-csky, linux-kernel, linux-mm, linuxppc-dev, linux-s390, Andy Lutomirski, Nadav Amit, Thomas Gleixner, Will Deacon, x86, Yu Zhao Excerpts from Peter Zijlstra's message of February 1, 2021 10:09 pm: > On Sat, Jan 30, 2021 at 04:11:23PM -0800, Nadav Amit wrote: > >> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h >> index 427bfcc6cdec..b97136b7010b 100644 >> --- a/include/asm-generic/tlb.h >> +++ b/include/asm-generic/tlb.h >> @@ -334,8 +334,8 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb) >> >> #ifdef CONFIG_MMU_GATHER_NO_RANGE >> >> -#if defined(tlb_flush) || defined(tlb_start_vma) || defined(tlb_end_vma) >> -#error MMU_GATHER_NO_RANGE relies on default tlb_flush(), tlb_start_vma() and tlb_end_vma() >> +#if defined(tlb_flush) >> +#error MMU_GATHER_NO_RANGE relies on default tlb_flush() >> #endif >> >> /* >> @@ -362,10 +362,6 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm >> >> #ifndef tlb_flush >> >> -#if defined(tlb_start_vma) || defined(tlb_end_vma) >> -#error Default tlb_flush() relies on default tlb_start_vma() and tlb_end_vma() >> -#endif > > #ifdef CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING > #error .... > #endif > > goes here... > > >> static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma) >> { >> if (tlb->fullmm) >> return; >> >> + if (IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING)) >> + return; > > Also, can you please stick to the CONFIG_MMU_GATHER_* namespace? > > I also don't think AGRESSIVE_FLUSH_BATCHING quite captures what it does. > How about: > > CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH Yes please, have to have descriptive names. I didn't quite see why this was much of an improvement though. Maybe follow up patches take advantage of it? I didn't see how they all fit together. Thanks, Nick ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
  2021-02-02  6:41 ` Nicholas Piggin
@ 2021-02-02  7:20 ` Nadav Amit
  2021-02-02  9:31   ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Nadav Amit @ 2021-02-02 7:20 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Peter Zijlstra, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	linux-csky@vger.kernel.org, LKML, Linux-MM, linuxppc-dev,
	linux-s390, Andy Lutomirski, Thomas Gleixner, Will Deacon,
	X86 ML, Yu Zhao

> On Feb 1, 2021, at 10:41 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
>
> Excerpts from Peter Zijlstra's message of February 1, 2021 10:09 pm:
>> I also don't think AGRESSIVE_FLUSH_BATCHING quite captures what it does.
>> How about:
>>
>> CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH
>
> Yes please, have to have descriptive names.

Point taken. I will fix it.

>
> I didn't quite see why this was much of an improvement though. Maybe
> follow up patches take advantage of it? I didn't see how they all fit together.

They do, but I realized as I said in other emails that I have a serious bug
in the deferred invalidation scheme.

Having said that, I think there is an advantage of having an explicit config
option instead of relying on whether tlb_end_vma is defined. For instance,
Arm does not define tlb_end_vma, and consequently it flushes the TLB after
each VMA. I suspect it is not intentional.

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
  2021-02-02  7:20 ` Nadav Amit
@ 2021-02-02  9:31 ` Peter Zijlstra
  2021-02-02  9:54   ` Nadav Amit
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2021-02-02 9:31 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Nicholas Piggin, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	linux-csky@vger.kernel.org, LKML, Linux-MM, linuxppc-dev,
	linux-s390, Andy Lutomirski, Thomas Gleixner, Will Deacon,
	X86 ML, Yu Zhao

On Tue, Feb 02, 2021 at 07:20:55AM +0000, Nadav Amit wrote:
> Arm does not define tlb_end_vma, and consequently it flushes the TLB after
> each VMA. I suspect it is not intentional.

ARM is one of those that look at the VM_EXEC bit to explicitly flush
ITLB IIRC, so it has to.

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
  2021-02-02  9:31 ` Peter Zijlstra
@ 2021-02-02  9:54 ` Nadav Amit
  2021-02-02 11:04   ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Nadav Amit @ 2021-02-02 9:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nicholas Piggin, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	linux-csky@vger.kernel.org, LKML, Linux-MM, linuxppc-dev,
	linux-s390, Andy Lutomirski, Thomas Gleixner, Will Deacon,
	X86 ML, Yu Zhao

> On Feb 2, 2021, at 1:31 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Feb 02, 2021 at 07:20:55AM +0000, Nadav Amit wrote:
>> Arm does not define tlb_end_vma, and consequently it flushes the TLB after
>> each VMA. I suspect it is not intentional.
>
> ARM is one of those that look at the VM_EXEC bit to explicitly flush
> ITLB IIRC, so it has to.

Hmm… I don’t think Arm is doing that. At least arm64 does not use the
default tlb_flush(), and it does not seem to consider VM_EXEC (at least in
this path):

static inline void tlb_flush(struct mmu_gather *tlb)
{
	struct vm_area_struct vma = TLB_FLUSH_VMA(tlb->mm, 0);
	bool last_level = !tlb->freed_tables;
	unsigned long stride = tlb_get_unmap_size(tlb);
	int tlb_level = tlb_get_level(tlb);

	/*
	 * If we're tearing down the address space then we only care about
	 * invalidating the walk-cache, since the ASID allocator won't
	 * reallocate our ASID without invalidating the entire TLB.
	 */
	if (tlb->mm_exiting) {
		if (!last_level)
			flush_tlb_mm(tlb->mm);
		return;
	}

	__flush_tlb_range(&vma, tlb->start, tlb->end, stride,
			  last_level, tlb_level);
}

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
  2021-02-02  9:54 ` Nadav Amit
@ 2021-02-02 11:04 ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2021-02-02 11:04 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Nicholas Piggin, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	linux-csky@vger.kernel.org, LKML, Linux-MM, linuxppc-dev,
	linux-s390, Andy Lutomirski, Thomas Gleixner, Will Deacon,
	X86 ML, Yu Zhao

On Tue, Feb 02, 2021 at 09:54:36AM +0000, Nadav Amit wrote:
> > On Feb 2, 2021, at 1:31 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Feb 02, 2021 at 07:20:55AM +0000, Nadav Amit wrote:
> >> Arm does not define tlb_end_vma, and consequently it flushes the TLB after
> >> each VMA. I suspect it is not intentional.
> >
> > ARM is one of those that look at the VM_EXEC bit to explicitly flush
> > ITLB IIRC, so it has to.
>
> Hmm… I don’t think Arm is doing that. At least arm64 does not use the
> default tlb_flush(), and it does not seem to consider VM_EXEC (at least in
> this path):
>

ARM != ARM64. ARM certainly does, but you're right, I don't think ARM64
does this.

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC 00/20] TLB batching consolidation and enhancements
  2021-01-31  0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit
  2021-01-31  0:11 ` [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma() Nadav Amit
@ 2021-01-31  0:39 ` Andy Lutomirski
  2021-01-31  1:08   ` Nadav Amit
  2021-01-31  3:30 ` Nicholas Piggin
  2 siblings, 1 reply; 15+ messages in thread
From: Andy Lutomirski @ 2021-01-31 0:39 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux-MM, LKML, Nadav Amit, Andrea Arcangeli, Andrew Morton,
	Andy Lutomirski, Dave Hansen, linux-csky, linuxppc-dev,
	linux-s390, Mel Gorman, Nick Piggin, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, X86 ML, Yu Zhao

On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> From: Nadav Amit <namit@vmware.com>
>
> There are currently (at least?) 5 different TLB batching schemes in the
> kernel:
>
> 1. Using mmu_gather (e.g., zap_page_range()).
>
> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
> ongoing deferred TLB flush and flushing the entire range eventually
> (e.g., change_protection_range()).
>
> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>
> 4. Batching per-table flushes (move_ptes()).
>
> 5. By setting a flag on that a deferred TLB flush operation takes place,
> flushing when (try_to_unmap_one() on x86).

Are you referring to the arch_tlbbatch_add_mm/flush mechanism?

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC 00/20] TLB batching consolidation and enhancements
  2021-01-31  0:39 ` [RFC 00/20] TLB batching consolidation and enhancements Andy Lutomirski
@ 2021-01-31  1:08 ` Nadav Amit
  0 siblings, 0 replies; 15+ messages in thread
From: Nadav Amit @ 2021-01-31 1:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux-MM, LKML, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	linux-csky@vger.kernel.org, linuxppc-dev, linux-s390, Mel Gorman,
	Nick Piggin, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	X86 ML, Yu Zhao

> On Jan 30, 2021, at 4:39 PM, Andy Lutomirski <luto@kernel.org> wrote:
>
> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>> From: Nadav Amit <namit@vmware.com>
>>
>> There are currently (at least?) 5 different TLB batching schemes in the
>> kernel:
>>
>> 1. Using mmu_gather (e.g., zap_page_range()).
>>
>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>> ongoing deferred TLB flush and flushing the entire range eventually
>> (e.g., change_protection_range()).
>>
>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>>
>> 4. Batching per-table flushes (move_ptes()).
>>
>> 5. By setting a flag on that a deferred TLB flush operation takes place,
>> flushing when (try_to_unmap_one() on x86).
>
> Are you referring to the arch_tlbbatch_add_mm/flush mechanism?

Yes.

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC 00/20] TLB batching consolidation and enhancements 2021-01-31 0:11 [RFC 00/20] TLB batching consolidation and enhancements Nadav Amit 2021-01-31 0:11 ` [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma() Nadav Amit 2021-01-31 0:39 ` [RFC 00/20] TLB batching consolidation and enhancements Andy Lutomirski @ 2021-01-31 3:30 ` Nicholas Piggin 2021-01-31 7:57 ` Nadav Amit 2 siblings, 1 reply; 15+ messages in thread From: Nicholas Piggin @ 2021-01-31 3:30 UTC (permalink / raw) To: linux-kernel, linux-mm, Nadav Amit Cc: Andrea Arcangeli, Andrew Morton, Dave Hansen, linux-csky, linuxppc-dev, linux-s390, Andy Lutomirski, Mel Gorman, Nadav Amit, Peter Zijlstra, Thomas Gleixner, Will Deacon, x86, Yu Zhao Excerpts from Nadav Amit's message of January 31, 2021 10:11 am: > From: Nadav Amit <namit@vmware.com> > > There are currently (at least?) 5 different TLB batching schemes in the > kernel: > > 1. Using mmu_gather (e.g., zap_page_range()). > > 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the > ongoing deferred TLB flush and flushing the entire range eventually > (e.g., change_protection_range()). > > 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?). > > 4. Batching per-table flushes (move_ptes()). > > 5. By setting a flag on that a deferred TLB flush operation takes place, > flushing when (try_to_unmap_one() on x86). > > It seems that (1)-(4) can be consolidated. In addition, it seems that > (5) is racy. It also seems there can be many redundant TLB flushes, and > potentially TLB-shootdown storms, for instance during batched > reclamation (using try_to_unmap_one()) if at the same time mmu_gather > defers TLB flushes. > > More aggressive TLB batching may be possible, but this patch-set does > not add such batching. The proposed changes would enable such batching > in a later time. > > Admittedly, I do not understand how things are not broken today, which > frightens me to make further batching before getting things in order. > For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes > for each page-table (but not in greater granularity). Can't > ClearPageDirty() be called before the flush, causing writes after > ClearPageDirty() and before the flush to be lost? Because it's holding the page table lock which stops page_mkclean from cleaning the page. Or am I misunderstanding the question? I'll go through the patches a bit more closely when they all come through. Sparc and powerpc of course need the arch lazy mode to get per-page/pte information for operations that are not freeing pages, which is what mmu gather is designed for. I wouldn't mind using a similar API so it's less of a black box when reading generic code, but it might not quite fit the mmu gather API exactly (most of these paths don't want a full mmu_gather on stack). > > This patch-set therefore performs the following changes: > > 1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather > instead of {inc|dec}_tlb_flush_pending(). > > 2. Avoid TLB flushes if PTE permission is not demoted. > > 3. Cleans up mmu_gather to be less arch-dependant. > > 4. Uses mm's generations to track in finer granularity, either per-VMA > or per page-table, whether a pending mmu_gather operation is > outstanding. This should allow to avoid some TLB flushes when KSM or > memory reclamation takes place while another operation such as > munmap() or mprotect() is running. > > 5. 
Changes try_to_unmap_one() flushing scheme, as the current seems > broken to track in a bitmap which CPUs have outstanding TLB flushes > instead of having a flag. Putting fixes first, and cleanups and independent patches (like #2) next would help with getting stuff merged and backported. Thanks, Nick ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC 00/20] TLB batching consolidation and enhancements 2021-01-31 3:30 ` Nicholas Piggin @ 2021-01-31 7:57 ` Nadav Amit 2021-01-31 8:14 ` Nadav Amit 2021-02-01 12:44 ` Peter Zijlstra 0 siblings, 2 replies; 15+ messages in thread From: Nadav Amit @ 2021-01-31 7:57 UTC (permalink / raw) To: Nicholas Piggin Cc: LKML, Linux-MM, Andrea Arcangeli, Andrew Morton, Dave Hansen, linux-csky@vger.kernel.org, linuxppc-dev, linux-s390, Andy Lutomirski, Mel Gorman, Peter Zijlstra, Thomas Gleixner, Will Deacon, X86 ML, Yu Zhao > On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote: > > Excerpts from Nadav Amit's message of January 31, 2021 10:11 am: >> From: Nadav Amit <namit@vmware.com> >> >> There are currently (at least?) 5 different TLB batching schemes in the >> kernel: >> >> 1. Using mmu_gather (e.g., zap_page_range()). >> >> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the >> ongoing deferred TLB flush and flushing the entire range eventually >> (e.g., change_protection_range()). >> >> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?). >> >> 4. Batching per-table flushes (move_ptes()). >> >> 5. By setting a flag on that a deferred TLB flush operation takes place, >> flushing when (try_to_unmap_one() on x86). >> >> It seems that (1)-(4) can be consolidated. In addition, it seems that >> (5) is racy. It also seems there can be many redundant TLB flushes, and >> potentially TLB-shootdown storms, for instance during batched >> reclamation (using try_to_unmap_one()) if at the same time mmu_gather >> defers TLB flushes. >> >> More aggressive TLB batching may be possible, but this patch-set does >> not add such batching. The proposed changes would enable such batching >> in a later time. >> >> Admittedly, I do not understand how things are not broken today, which >> frightens me to make further batching before getting things in order. >> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes >> for each page-table (but not in greater granularity). Can't >> ClearPageDirty() be called before the flush, causing writes after >> ClearPageDirty() and before the flush to be lost? > > Because it's holding the page table lock which stops page_mkclean from > cleaning the page. Or am I misunderstanding the question? Thanks. I understood this part. Looking again at the code, I now understand my confusion: I forgot that the reverse mapping is removed after the PTE is zapped. Makes me wonder whether it is ok to defer the TLB flush to tlb_finish_mmu(), by performing set_page_dirty() for the batched pages when needed in tlb_finish_mmu() [ i.e., by marking for each batched page whether set_page_dirty() should be issued for that page while collecting them ]. > I'll go through the patches a bit more closely when they all come > through. Sparc and powerpc of course need the arch lazy mode to get > per-page/pte information for operations that are not freeing pages, > which is what mmu gather is designed for. IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE where no previous PTE was set, right? > I wouldn't mind using a similar API so it's less of a black box when > reading generic code, but it might not quite fit the mmu gather API > exactly (most of these paths don't want a full mmu_gather on stack). I see your point. It may be possible to create two mmu_gather structs: a small one that only holds the flush information and another that also holds the pages. 
>> This patch-set therefore performs the following changes: >> >> 1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather >> instead of {inc|dec}_tlb_flush_pending(). >> >> 2. Avoid TLB flushes if PTE permission is not demoted. >> >> 3. Cleans up mmu_gather to be less arch-dependant. >> >> 4. Uses mm's generations to track in finer granularity, either per-VMA >> or per page-table, whether a pending mmu_gather operation is >> outstanding. This should allow to avoid some TLB flushes when KSM or >> memory reclamation takes place while another operation such as >> munmap() or mprotect() is running. >> >> 5. Changes try_to_unmap_one() flushing scheme, as the current seems >> broken to track in a bitmap which CPUs have outstanding TLB flushes >> instead of having a flag. > > Putting fixes first, and cleanups and independent patches (like #2) next > would help with getting stuff merged and backported. I tried to do it mostly this way. There are some theoretical races which I did not manage (or try hard enough) to create, so I did not include in the “fixes” section. I will restructure the patch-set according to the feedback. Thanks, Nadav ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC 00/20] TLB batching consolidation and enhancements 2021-01-31 7:57 ` Nadav Amit @ 2021-01-31 8:14 ` Nadav Amit 2021-02-01 12:44 ` Peter Zijlstra 1 sibling, 0 replies; 15+ messages in thread From: Nadav Amit @ 2021-01-31 8:14 UTC (permalink / raw) To: Nadav Amit Cc: Nicholas Piggin, LKML, Linux-MM, Andrea Arcangeli, Andrew Morton, Dave Hansen, linux-csky@vger.kernel.org, linuxppc-dev, linux-s390, Andy Lutomirski, Mel Gorman, Peter Zijlstra, Thomas Gleixner, Will Deacon, X86 ML, Yu Zhao > On Jan 30, 2021, at 11:57 PM, Nadav Amit <namit@vmware.com> wrote: > >> On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote: >> >> Excerpts from Nadav Amit's message of January 31, 2021 10:11 am: >>> From: Nadav Amit <namit@vmware.com> >>> >>> There are currently (at least?) 5 different TLB batching schemes in the >>> kernel: >>> >>> 1. Using mmu_gather (e.g., zap_page_range()). >>> >>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the >>> ongoing deferred TLB flush and flushing the entire range eventually >>> (e.g., change_protection_range()). >>> >>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?). >>> >>> 4. Batching per-table flushes (move_ptes()). >>> >>> 5. By setting a flag on that a deferred TLB flush operation takes place, >>> flushing when (try_to_unmap_one() on x86). >>> >>> It seems that (1)-(4) can be consolidated. In addition, it seems that >>> (5) is racy. It also seems there can be many redundant TLB flushes, and >>> potentially TLB-shootdown storms, for instance during batched >>> reclamation (using try_to_unmap_one()) if at the same time mmu_gather >>> defers TLB flushes. >>> >>> More aggressive TLB batching may be possible, but this patch-set does >>> not add such batching. The proposed changes would enable such batching >>> in a later time. >>> >>> Admittedly, I do not understand how things are not broken today, which >>> frightens me to make further batching before getting things in order. >>> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes >>> for each page-table (but not in greater granularity). Can't >>> ClearPageDirty() be called before the flush, causing writes after >>> ClearPageDirty() and before the flush to be lost? >> >> Because it's holding the page table lock which stops page_mkclean from >> cleaning the page. Or am I misunderstanding the question? > > Thanks. I understood this part. Looking again at the code, I now understand > my confusion: I forgot that the reverse mapping is removed after the PTE is > zapped. > > Makes me wonder whether it is ok to defer the TLB flush to tlb_finish_mmu(), > by performing set_page_dirty() for the batched pages when needed in > tlb_finish_mmu() [ i.e., by marking for each batched page whether > set_page_dirty() should be issued for that page while collecting them ]. Correcting myself (I hope): no we cannot do so, since the buffers might be remove from the page at that point. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC 00/20] TLB batching consolidation and enhancements 2021-01-31 7:57 ` Nadav Amit 2021-01-31 8:14 ` Nadav Amit @ 2021-02-01 12:44 ` Peter Zijlstra 2021-02-02 7:14 ` Nicholas Piggin 1 sibling, 1 reply; 15+ messages in thread From: Peter Zijlstra @ 2021-02-01 12:44 UTC (permalink / raw) To: Nadav Amit Cc: Nicholas Piggin, LKML, Linux-MM, Andrea Arcangeli, Andrew Morton, Dave Hansen, linux-csky@vger.kernel.org, linuxppc-dev, linux-s390, Andy Lutomirski, Mel Gorman, Thomas Gleixner, Will Deacon, X86 ML, Yu Zhao On Sun, Jan 31, 2021 at 07:57:01AM +0000, Nadav Amit wrote: > > On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote: > > I'll go through the patches a bit more closely when they all come > > through. Sparc and powerpc of course need the arch lazy mode to get > > per-page/pte information for operations that are not freeing pages, > > which is what mmu gather is designed for. > > IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE > where no previous PTE was set, right? These are the HASH architectures. Their hardware doesn't walk the page-tables, but it consults a hash-table to resolve page translations. They _MUST_ flush the entries under the PTL to avoid ever seeing conflicting information, which will make them really unhappy. They can do this because they have TLBI broadcast. There's a few more details I'm sure, but those seem to have slipped from my mind. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC 00/20] TLB batching consolidation and enhancements 2021-02-01 12:44 ` Peter Zijlstra @ 2021-02-02 7:14 ` Nicholas Piggin 0 siblings, 0 replies; 15+ messages in thread From: Nicholas Piggin @ 2021-02-02 7:14 UTC (permalink / raw) To: Nadav Amit, Peter Zijlstra Cc: Andrea Arcangeli, Andrew Morton, Dave Hansen, linux-csky@vger.kernel.org, LKML, Linux-MM, linuxppc-dev, linux-s390, Andy Lutomirski, Mel Gorman, Thomas Gleixner, Will Deacon, X86 ML, Yu Zhao Excerpts from Peter Zijlstra's message of February 1, 2021 10:44 pm: > On Sun, Jan 31, 2021 at 07:57:01AM +0000, Nadav Amit wrote: >> > On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote: > >> > I'll go through the patches a bit more closely when they all come >> > through. Sparc and powerpc of course need the arch lazy mode to get >> > per-page/pte information for operations that are not freeing pages, >> > which is what mmu gather is designed for. >> >> IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE >> where no previous PTE was set, right? In cases of increasing permissiveness of access, yes it may want to update the "TLB" (read hash table) to avoid taking hash table faults. But whatever the reason for the flush, there may have to be more data carried than just the virtual address range and/or physical pages. If you clear out the PTE then you have no guarantee of actually being able to go back and address the the in-memory or in-hardware translation structures to update them, depending on what exact scheme is used (powerpc probably could if all page sizes were the same, but THP or 64k/4k sub pages would throw a spanner in those works). > These are the HASH architectures. Their hardware doesn't walk the > page-tables, but it consults a hash-table to resolve page translations. Yeah, it's very cool in a masochistic way. I actually don't know if it's worth doing a big rework of it, as much as I'd like to. Rather than just keep it in place and eventually dismantling some of the go-fast hooks from core code if we can one day deprecate it in favour of the much easier radix mode. The whole thing is like a big steam train, years ago Paul and Ben and Anton and co got the boiler stoked up and set all the valves just right so it runs unbelievably well for what it's actually doing but look at it the wrong way and the whole thing could blow up. (at least that's what it feels like to me probably because I don't know the code that well). Sparc could probably do the same, not sure about Xen. I don't suppose vmware is intending to add any kind of paravirt mode related to this stuff? Thanks, Nick ^ permalink raw reply [flat|nested] 15+ messages in thread