[PATCH RFC/RFT 0/4] Remove preventive sfence.vma

linux-mips.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC/RFT 0/4] Remove preventive sfence.vma
@ 2023-12-07 15:03 Alexandre Ghiti
  2023-12-07 15:03 ` [PATCH RFC/RFT 1/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings Alexandre Ghiti
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Alexandre Ghiti @ 2023-12-07 15:03 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm
  Cc: Alexandre Ghiti

In RISC-V, after a new mapping is established, a sfence.vma needs to be
emitted for different reasons:

- if the uarch caches invalid entries, we need to invalidate it otherwise
  we would trap on this invalid entry,
- if the uarch does not cache invalid entries, a reordered access could fail
  to see the new mapping and then trap (sfence.vma acts as a fence).

We can actually avoid emitting those (mostly) useless and costly sfence.vma
by handling the traps instead:

- for new kernel mappings: only vmalloc mappings need to be taken care of,
  other new mapping are rare and already emit the required sfence.vma if
  needed.
  That must be achieved very early in the exception path as explained in
  patch 1, and this also fixes our fragile way of dealing with vmalloc faults.

- for new user mappings: that can be handled in the page fault path as done
  in patch 3.

Patch 2 is certainly a TEMP patch which allows to detect at runtime if a
uarch caches invalid TLB entries.

Patch 4 is a TEMP patch which allows to expose through debugfs the different
sfence.vma that are emitted, which can be used for benchmarking.

On our uarch that does not cache invalid entries and a 6.5 kernel, the
gains are measurable:

* Kernel boot:                  6%
* ltp - mmapstress01:           8%
* lmbench - lat_pagefault:      20%
* lmbench - lat_mmap:           5%

On uarchs that cache invalid entries, the results are more mitigated and
need to be explored more thoroughly (if anyone is interested!): that can
be explained by the extra page faults, which depending on "how much" the
uarch caches invalid entries, could kill the benefits of removing the
preventive sfence.vma.

Ved Shanbhogue has prepared a new extension to be used by uarchs that do
not cache invalid entries, which will certainly be used instead of patch 2.

Thanks to Ved and Matt Evans for triggering the discussion that led to
this patchset!

That's an RFC, so please don't mind the checkpatch warnings and dirty
comments. It applies on 6.6.

Any feedback, test or relevant benchmark are welcome :)

Alexandre Ghiti (4):
  riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
  riscv: Add a runtime detection of invalid TLB entries caching
  riscv: Stop emitting preventive sfence.vma for new userspace mappings
  TEMP: riscv: Add debugfs interface to retrieve #sfence.vma

 arch/arm64/include/asm/pgtable.h              |   2 +-
 arch/mips/include/asm/pgtable.h               |   6 +-
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   8 +-
 arch/riscv/include/asm/cacheflush.h           |  19 ++-
 arch/riscv/include/asm/pgtable.h              |  45 ++++---
 arch/riscv/include/asm/thread_info.h          |   5 +
 arch/riscv/include/asm/tlbflush.h             |   4 +
 arch/riscv/kernel/asm-offsets.c               |   5 +
 arch/riscv/kernel/entry.S                     |  94 +++++++++++++
 arch/riscv/kernel/sbi.c                       |  12 ++
 arch/riscv/mm/init.c                          | 126 ++++++++++++++++++
 arch/riscv/mm/tlbflush.c                      |  17 +++
 include/linux/pgtable.h                       |   8 +-
 mm/memory.c                                   |  12 +-
 14 files changed, 331 insertions(+), 32 deletions(-)

-- 
2.39.2


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH RFC/RFT 1/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
  2023-12-07 15:03 [PATCH RFC/RFT 0/4] Remove preventive sfence.vma Alexandre Ghiti
@ 2023-12-07 15:03 ` Alexandre Ghiti
  2023-12-07 15:52   ` Christophe Leroy
  2023-12-07 15:03 ` [PATCH RFC/RFT 2/4] riscv: Add a runtime detection of invalid TLB entries caching Alexandre Ghiti
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Alexandre Ghiti @ 2023-12-07 15:03 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm
  Cc: Alexandre Ghiti

In 6.5, we removed the vmalloc fault path because that can't work (see
[1] [2]). Then in order to make sure that new page table entries were
seen by the page table walker, we had to preventively emit a sfence.vma
on all harts [3] but this solution is very costly since it relies on IPI.

And even there, we could end up in a loop of vmalloc faults if a vmalloc
allocation is done in the IPI path (for example if it is traced, see
[4]), which could result in a kernel stack overflow.

Those preventive sfence.vma needed to be emitted because:

- if the uarch caches invalid entries, the new mapping may not be
  observed by the page table walker and an invalidation may be needed.
- if the uarch does not cache invalid entries, a reordered access
  could "miss" the new mapping and traps: in that case, we would actually
  only need to retry the access, no sfence.vma is required.

So this patch removes those preventive sfence.vma and actually handles
the possible (and unlikely) exceptions. And since the kernel stacks
mappings lie in the vmalloc area, this handling must be done very early
when the trap is taken, at the very beginning of handle_exception: this
also rules out the vmalloc allocations in the fault path.

Note that for now, we emit a sfence.vma even for uarchs that do not
cache invalid entries as we have no means to know that: that will be
fixed in the next patch.

Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1]
Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2]
Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3]
Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4]
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
 arch/riscv/include/asm/cacheflush.h  | 19 +++++-
 arch/riscv/include/asm/thread_info.h |  5 ++
 arch/riscv/kernel/asm-offsets.c      |  5 ++
 arch/riscv/kernel/entry.S            | 94 ++++++++++++++++++++++++++++
 arch/riscv/mm/init.c                 |  2 +
 5 files changed, 124 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
index 3cb53c4df27c..a916cbc69d47 100644
--- a/arch/riscv/include/asm/cacheflush.h
+++ b/arch/riscv/include/asm/cacheflush.h
@@ -37,7 +37,24 @@ static inline void flush_dcache_page(struct page *page)
 	flush_icache_mm(vma->vm_mm, 0)
 
 #ifdef CONFIG_64BIT
-#define flush_cache_vmap(start, end)	flush_tlb_kernel_range(start, end)
+extern u64 new_vmalloc[];
+extern char _end[];
+#define flush_cache_vmap flush_cache_vmap
+static inline void flush_cache_vmap(unsigned long start, unsigned long end)
+{
+	if ((start < VMALLOC_END && end > VMALLOC_START) ||
+	    (start < MODULES_END && end > MODULES_VADDR)) {
+		int i;
+
+		/*
+		 * We don't care if concurrently a cpu resets this value since
+		 * the only place this can happen is in handle_exception() where
+		 * an sfence.vma is emitted.
+		 */
+		for (i = 0; i < NR_CPUS / sizeof(u64) + 1; ++i)
+			new_vmalloc[i] = -1ULL;
+	}
+}
 #endif
 
 #ifndef CONFIG_SMP
diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
index 1833beb00489..8fe12fa6c329 100644
--- a/arch/riscv/include/asm/thread_info.h
+++ b/arch/riscv/include/asm/thread_info.h
@@ -60,6 +60,11 @@ struct thread_info {
 	long			user_sp;	/* User stack pointer */
 	int			cpu;
 	unsigned long		syscall_work;	/* SYSCALL_WORK_ flags */
+	/*
+	 * Used in handle_exception() to save a0, a1 and a2 before knowing if we
+	 * can access the kernel stack.
+	 */
+	unsigned long		a0, a1, a2;
 };
 
 /*
diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
index d6a75aac1d27..340c1c84560d 100644
--- a/arch/riscv/kernel/asm-offsets.c
+++ b/arch/riscv/kernel/asm-offsets.c
@@ -34,10 +34,15 @@ void asm_offsets(void)
 	OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
 	OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
 	OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
+
+	OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
 	OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
 	OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
 	OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
 	OFFSET(TASK_TI_USER_SP, task_struct, thread_info.user_sp);
+	OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
+	OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
+	OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
 
 	OFFSET(TASK_THREAD_F0,  task_struct, thread.fstate.f[0]);
 	OFFSET(TASK_THREAD_F1,  task_struct, thread.fstate.f[1]);
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 143a2bb3e697..3a3c7b563816 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -14,6 +14,88 @@
 #include <asm/asm-offsets.h>
 #include <asm/errata_list.h>
 
+.macro new_vmalloc_check
+	REG_S 	a0, TASK_TI_A0(tp)
+	REG_S 	a1, TASK_TI_A1(tp)
+	REG_S	a2, TASK_TI_A2(tp)
+
+	csrr 	a0, CSR_CAUSE
+	/* Exclude IRQs */
+	blt  	a0, zero, _new_vmalloc_restore_context
+	/* Only check new_vmalloc if we are in page/protection fault */
+	li   	a1, EXC_LOAD_PAGE_FAULT
+	beq  	a0, a1, _new_vmalloc_kernel_address
+	li   	a1, EXC_STORE_PAGE_FAULT
+	beq  	a0, a1, _new_vmalloc_kernel_address
+	li   	a1, EXC_INST_PAGE_FAULT
+	bne  	a0, a1, _new_vmalloc_restore_context
+
+_new_vmalloc_kernel_address:
+	/* Is it a kernel address? */
+	csrr 	a0, CSR_TVAL
+	bge 	a0, zero, _new_vmalloc_restore_context
+
+	/* Check if a new vmalloc mapping appeared that could explain the trap */
+
+	/*
+	 * Computes:
+	 * a0 = &new_vmalloc[BIT_WORD(cpu)]
+	 * a1 = BIT_MASK(cpu)
+	 */
+	REG_L 	a2, TASK_TI_CPU(tp)
+	/*
+	 * Compute the new_vmalloc element position:
+	 * (cpu / 64) * 8 = (cpu >> 6) << 3
+	 */
+	srli	a1, a2, 6
+	slli	a1, a1, 3
+	la	a0, new_vmalloc
+	add	a0, a0, a1
+	/*
+	 * Compute the bit position in the new_vmalloc element:
+	 * bit_pos = cpu % 64 = cpu - (cpu / 64) * 64 = cpu - (cpu >> 6) << 6
+	 * 	   = cpu - ((cpu >> 6) << 3) << 3
+	 */
+	slli	a1, a1, 3
+	sub	a1, a2, a1
+	/* Compute the "get mask": 1 << bit_pos */
+	li	a2, 1
+	sll	a1, a2, a1
+
+	/* Check the value of new_vmalloc for this cpu */
+	ld	a2, 0(a0)
+	and	a2, a2, a1
+	beq	a2, zero, _new_vmalloc_restore_context
+
+	ld	a2, 0(a0)
+	not	a1, a1
+	and	a1, a2, a1
+	sd	a1, 0(a0)
+
+	/* Only emit a sfence.vma if the uarch caches invalid entries */
+	la	a0, tlb_caching_invalid_entries
+	lb	a0, 0(a0)
+	beqz	a0, _new_vmalloc_no_caching_invalid_entries
+	sfence.vma
+_new_vmalloc_no_caching_invalid_entries:
+	// debug
+	la	a0, nr_sfence_vma_handle_exception
+	li	a1, 1
+	amoadd.w    a0, a1, (a0)
+	// end debug
+	REG_L	a0, TASK_TI_A0(tp)
+	REG_L	a1, TASK_TI_A1(tp)
+	REG_L	a2, TASK_TI_A2(tp)
+	csrw	CSR_SCRATCH, x0
+	sret
+
+_new_vmalloc_restore_context:
+	REG_L	a0, TASK_TI_A0(tp)
+	REG_L 	a1, TASK_TI_A1(tp)
+	REG_L 	a2, TASK_TI_A2(tp)
+.endm
+
+
 SYM_CODE_START(handle_exception)
 	/*
 	 * If coming from userspace, preserve the user thread pointer and load
@@ -25,6 +107,18 @@ SYM_CODE_START(handle_exception)
 
 _restore_kernel_tpsp:
 	csrr tp, CSR_SCRATCH
+
+	/*
+	 * The RISC-V kernel does not eagerly emit a sfence.vma after each
+	 * new vmalloc mapping, which may result in exceptions:
+	 * - if the uarch caches invalid entries, the new mapping would not be
+	 *   observed by the page table walker and an invalidation is needed.
+	 * - if the uarch does not cache invalid entries, a reordered access
+	 *   could "miss" the new mapping and traps: in that case, we only need
+	 *   to retry the access, no sfence.vma is required.
+	 */
+	new_vmalloc_check
+
 	REG_S sp, TASK_TI_KERNEL_SP(tp)
 
 #ifdef CONFIG_VMAP_STACK
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 0798bd861dcb..379403de6c6f 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -36,6 +36,8 @@
 
 #include "../kernel/head.h"
 
+u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
+
 struct kernel_mapping kernel_map __ro_after_init;
 EXPORT_SYMBOL(kernel_map);
 #ifdef CONFIG_XIP_KERNEL
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC/RFT 1/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
  2023-12-07 15:03 ` [PATCH RFC/RFT 1/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings Alexandre Ghiti
@ 2023-12-07 15:52   ` Christophe Leroy
  2023-12-08 14:28     ` Alexandre Ghiti
  0 siblings, 1 reply; 11+ messages in thread
From: Christophe Leroy @ 2023-12-07 15:52 UTC (permalink / raw)
  To: Alexandre Ghiti, Catalin Marinas, Will Deacon,
	Thomas Bogendoerfer, Michael Ellerman, Nicholas Piggin,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-mips@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org,
	linux-mm@kvack.org



Le 07/12/2023 à 16:03, Alexandre Ghiti a écrit :
> In 6.5, we removed the vmalloc fault path because that can't work (see
> [1] [2]). Then in order to make sure that new page table entries were
> seen by the page table walker, we had to preventively emit a sfence.vma
> on all harts [3] but this solution is very costly since it relies on IPI.
> 
> And even there, we could end up in a loop of vmalloc faults if a vmalloc
> allocation is done in the IPI path (for example if it is traced, see
> [4]), which could result in a kernel stack overflow.
> 
> Those preventive sfence.vma needed to be emitted because:
> 
> - if the uarch caches invalid entries, the new mapping may not be
>    observed by the page table walker and an invalidation may be needed.
> - if the uarch does not cache invalid entries, a reordered access
>    could "miss" the new mapping and traps: in that case, we would actually
>    only need to retry the access, no sfence.vma is required.
> 
> So this patch removes those preventive sfence.vma and actually handles
> the possible (and unlikely) exceptions. And since the kernel stacks
> mappings lie in the vmalloc area, this handling must be done very early
> when the trap is taken, at the very beginning of handle_exception: this
> also rules out the vmalloc allocations in the fault path.
> 
> Note that for now, we emit a sfence.vma even for uarchs that do not
> cache invalid entries as we have no means to know that: that will be
> fixed in the next patch.
> 
> Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1]
> Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2]
> Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3]
> Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4]
> Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> ---
>   arch/riscv/include/asm/cacheflush.h  | 19 +++++-
>   arch/riscv/include/asm/thread_info.h |  5 ++
>   arch/riscv/kernel/asm-offsets.c      |  5 ++
>   arch/riscv/kernel/entry.S            | 94 ++++++++++++++++++++++++++++
>   arch/riscv/mm/init.c                 |  2 +
>   5 files changed, 124 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> index 3cb53c4df27c..a916cbc69d47 100644
> --- a/arch/riscv/include/asm/cacheflush.h
> +++ b/arch/riscv/include/asm/cacheflush.h
> @@ -37,7 +37,24 @@ static inline void flush_dcache_page(struct page *page)
>   	flush_icache_mm(vma->vm_mm, 0)
>   
>   #ifdef CONFIG_64BIT
> -#define flush_cache_vmap(start, end)	flush_tlb_kernel_range(start, end)
> +extern u64 new_vmalloc[];

Can you have the table size here ? Would help GCC static analysis for 
boundary checking.

> +extern char _end[];
> +#define flush_cache_vmap flush_cache_vmap
> +static inline void flush_cache_vmap(unsigned long start, unsigned long end)
> +{
> +	if ((start < VMALLOC_END && end > VMALLOC_START) ||
> +	    (start < MODULES_END && end > MODULES_VADDR)) {

Can you use is_vmalloc_or_module_addr() instead ?


> +		int i;
> +
> +		/*
> +		 * We don't care if concurrently a cpu resets this value since
> +		 * the only place this can happen is in handle_exception() where
> +		 * an sfence.vma is emitted.
> +		 */
> +		for (i = 0; i < NR_CPUS / sizeof(u64) + 1; ++i)

Use ARRAY_SIZE() ?

> +			new_vmalloc[i] = -1ULL;
> +	}
> +}
>   #endif
>   
>   #ifndef CONFIG_SMP
> diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
> index 1833beb00489..8fe12fa6c329 100644
> --- a/arch/riscv/include/asm/thread_info.h
> +++ b/arch/riscv/include/asm/thread_info.h
> @@ -60,6 +60,11 @@ struct thread_info {
>   	long			user_sp;	/* User stack pointer */
>   	int			cpu;
>   	unsigned long		syscall_work;	/* SYSCALL_WORK_ flags */
> +	/*
> +	 * Used in handle_exception() to save a0, a1 and a2 before knowing if we
> +	 * can access the kernel stack.
> +	 */
> +	unsigned long		a0, a1, a2;
>   };
>   
>   /*
> diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> index d6a75aac1d27..340c1c84560d 100644
> --- a/arch/riscv/kernel/asm-offsets.c
> +++ b/arch/riscv/kernel/asm-offsets.c
> @@ -34,10 +34,15 @@ void asm_offsets(void)
>   	OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
>   	OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
>   	OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
> +
> +	OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
>   	OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
>   	OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
>   	OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
>   	OFFSET(TASK_TI_USER_SP, task_struct, thread_info.user_sp);
> +	OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
> +	OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
> +	OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
>   
>   	OFFSET(TASK_THREAD_F0,  task_struct, thread.fstate.f[0]);
>   	OFFSET(TASK_THREAD_F1,  task_struct, thread.fstate.f[1]);
> diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> index 143a2bb3e697..3a3c7b563816 100644
> --- a/arch/riscv/kernel/entry.S
> +++ b/arch/riscv/kernel/entry.S
> @@ -14,6 +14,88 @@
>   #include <asm/asm-offsets.h>
>   #include <asm/errata_list.h>
>   
> +.macro new_vmalloc_check
> +	REG_S 	a0, TASK_TI_A0(tp)
> +	REG_S 	a1, TASK_TI_A1(tp)
> +	REG_S	a2, TASK_TI_A2(tp)
> +
> +	csrr 	a0, CSR_CAUSE
> +	/* Exclude IRQs */
> +	blt  	a0, zero, _new_vmalloc_restore_context
> +	/* Only check new_vmalloc if we are in page/protection fault */
> +	li   	a1, EXC_LOAD_PAGE_FAULT
> +	beq  	a0, a1, _new_vmalloc_kernel_address
> +	li   	a1, EXC_STORE_PAGE_FAULT
> +	beq  	a0, a1, _new_vmalloc_kernel_address
> +	li   	a1, EXC_INST_PAGE_FAULT
> +	bne  	a0, a1, _new_vmalloc_restore_context
> +
> +_new_vmalloc_kernel_address:
> +	/* Is it a kernel address? */
> +	csrr 	a0, CSR_TVAL
> +	bge 	a0, zero, _new_vmalloc_restore_context
> +
> +	/* Check if a new vmalloc mapping appeared that could explain the trap */
> +
> +	/*
> +	 * Computes:
> +	 * a0 = &new_vmalloc[BIT_WORD(cpu)]
> +	 * a1 = BIT_MASK(cpu)
> +	 */
> +	REG_L 	a2, TASK_TI_CPU(tp)
> +	/*
> +	 * Compute the new_vmalloc element position:
> +	 * (cpu / 64) * 8 = (cpu >> 6) << 3
> +	 */
> +	srli	a1, a2, 6
> +	slli	a1, a1, 3
> +	la	a0, new_vmalloc
> +	add	a0, a0, a1
> +	/*
> +	 * Compute the bit position in the new_vmalloc element:
> +	 * bit_pos = cpu % 64 = cpu - (cpu / 64) * 64 = cpu - (cpu >> 6) << 6
> +	 * 	   = cpu - ((cpu >> 6) << 3) << 3
> +	 */
> +	slli	a1, a1, 3
> +	sub	a1, a2, a1
> +	/* Compute the "get mask": 1 << bit_pos */
> +	li	a2, 1
> +	sll	a1, a2, a1
> +
> +	/* Check the value of new_vmalloc for this cpu */
> +	ld	a2, 0(a0)
> +	and	a2, a2, a1
> +	beq	a2, zero, _new_vmalloc_restore_context
> +
> +	ld	a2, 0(a0)
> +	not	a1, a1
> +	and	a1, a2, a1
> +	sd	a1, 0(a0)
> +
> +	/* Only emit a sfence.vma if the uarch caches invalid entries */
> +	la	a0, tlb_caching_invalid_entries
> +	lb	a0, 0(a0)
> +	beqz	a0, _new_vmalloc_no_caching_invalid_entries
> +	sfence.vma
> +_new_vmalloc_no_caching_invalid_entries:
> +	// debug
> +	la	a0, nr_sfence_vma_handle_exception
> +	li	a1, 1
> +	amoadd.w    a0, a1, (a0)
> +	// end debug
> +	REG_L	a0, TASK_TI_A0(tp)
> +	REG_L	a1, TASK_TI_A1(tp)
> +	REG_L	a2, TASK_TI_A2(tp)
> +	csrw	CSR_SCRATCH, x0
> +	sret
> +
> +_new_vmalloc_restore_context:
> +	REG_L	a0, TASK_TI_A0(tp)
> +	REG_L 	a1, TASK_TI_A1(tp)
> +	REG_L 	a2, TASK_TI_A2(tp)
> +.endm
> +
> +
>   SYM_CODE_START(handle_exception)
>   	/*
>   	 * If coming from userspace, preserve the user thread pointer and load
> @@ -25,6 +107,18 @@ SYM_CODE_START(handle_exception)
>   
>   _restore_kernel_tpsp:
>   	csrr tp, CSR_SCRATCH
> +
> +	/*
> +	 * The RISC-V kernel does not eagerly emit a sfence.vma after each
> +	 * new vmalloc mapping, which may result in exceptions:
> +	 * - if the uarch caches invalid entries, the new mapping would not be
> +	 *   observed by the page table walker and an invalidation is needed.
> +	 * - if the uarch does not cache invalid entries, a reordered access
> +	 *   could "miss" the new mapping and traps: in that case, we only need
> +	 *   to retry the access, no sfence.vma is required.
> +	 */
> +	new_vmalloc_check
> +
>   	REG_S sp, TASK_TI_KERNEL_SP(tp)
>   
>   #ifdef CONFIG_VMAP_STACK
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 0798bd861dcb..379403de6c6f 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -36,6 +36,8 @@
>   
>   #include "../kernel/head.h"
>   
> +u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> +
>   struct kernel_mapping kernel_map __ro_after_init;
>   EXPORT_SYMBOL(kernel_map);
>   #ifdef CONFIG_XIP_KERNEL

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC/RFT 1/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
  2023-12-07 15:52   ` Christophe Leroy
@ 2023-12-08 14:28     ` Alexandre Ghiti
  0 siblings, 0 replies; 11+ messages in thread
From: Alexandre Ghiti @ 2023-12-08 14:28 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Andrew Morton, Ved Shanbhogue, Matt Evans, Dylan Jhong,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-mips@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org,
	linux-mm@kvack.org

Hi Christophe,

On Thu, Dec 7, 2023 at 4:52 PM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
>
>
> Le 07/12/2023 à 16:03, Alexandre Ghiti a écrit :
> > In 6.5, we removed the vmalloc fault path because that can't work (see
> > [1] [2]). Then in order to make sure that new page table entries were
> > seen by the page table walker, we had to preventively emit a sfence.vma
> > on all harts [3] but this solution is very costly since it relies on IPI.
> >
> > And even there, we could end up in a loop of vmalloc faults if a vmalloc
> > allocation is done in the IPI path (for example if it is traced, see
> > [4]), which could result in a kernel stack overflow.
> >
> > Those preventive sfence.vma needed to be emitted because:
> >
> > - if the uarch caches invalid entries, the new mapping may not be
> >    observed by the page table walker and an invalidation may be needed.
> > - if the uarch does not cache invalid entries, a reordered access
> >    could "miss" the new mapping and traps: in that case, we would actually
> >    only need to retry the access, no sfence.vma is required.
> >
> > So this patch removes those preventive sfence.vma and actually handles
> > the possible (and unlikely) exceptions. And since the kernel stacks
> > mappings lie in the vmalloc area, this handling must be done very early
> > when the trap is taken, at the very beginning of handle_exception: this
> > also rules out the vmalloc allocations in the fault path.
> >
> > Note that for now, we emit a sfence.vma even for uarchs that do not
> > cache invalid entries as we have no means to know that: that will be
> > fixed in the next patch.
> >
> > Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1]
> > Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2]
> > Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3]
> > Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4]
> > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > ---
> >   arch/riscv/include/asm/cacheflush.h  | 19 +++++-
> >   arch/riscv/include/asm/thread_info.h |  5 ++
> >   arch/riscv/kernel/asm-offsets.c      |  5 ++
> >   arch/riscv/kernel/entry.S            | 94 ++++++++++++++++++++++++++++
> >   arch/riscv/mm/init.c                 |  2 +
> >   5 files changed, 124 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > index 3cb53c4df27c..a916cbc69d47 100644
> > --- a/arch/riscv/include/asm/cacheflush.h
> > +++ b/arch/riscv/include/asm/cacheflush.h
> > @@ -37,7 +37,24 @@ static inline void flush_dcache_page(struct page *page)
> >       flush_icache_mm(vma->vm_mm, 0)
> >
> >   #ifdef CONFIG_64BIT
> > -#define flush_cache_vmap(start, end) flush_tlb_kernel_range(start, end)
> > +extern u64 new_vmalloc[];
>
> Can you have the table size here ? Would help GCC static analysis for
> boundary checking.

Yes, I'll do

>
> > +extern char _end[];
> > +#define flush_cache_vmap flush_cache_vmap
> > +static inline void flush_cache_vmap(unsigned long start, unsigned long end)
> > +{
> > +     if ((start < VMALLOC_END && end > VMALLOC_START) ||
> > +         (start < MODULES_END && end > MODULES_VADDR)) {
>
> Can you use is_vmalloc_or_module_addr() instead ?

Yes, I'll do

>
>
> > +             int i;
> > +
> > +             /*
> > +              * We don't care if concurrently a cpu resets this value since
> > +              * the only place this can happen is in handle_exception() where
> > +              * an sfence.vma is emitted.
> > +              */
> > +             for (i = 0; i < NR_CPUS / sizeof(u64) + 1; ++i)
>
> Use ARRAY_SIZE() ?

And that too :)

Thanks for the review,

Alex

>
> > +                     new_vmalloc[i] = -1ULL;
> > +     }
> > +}
> >   #endif
> >
> >   #ifndef CONFIG_SMP
> > diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
> > index 1833beb00489..8fe12fa6c329 100644
> > --- a/arch/riscv/include/asm/thread_info.h
> > +++ b/arch/riscv/include/asm/thread_info.h
> > @@ -60,6 +60,11 @@ struct thread_info {
> >       long                    user_sp;        /* User stack pointer */
> >       int                     cpu;
> >       unsigned long           syscall_work;   /* SYSCALL_WORK_ flags */
> > +     /*
> > +      * Used in handle_exception() to save a0, a1 and a2 before knowing if we
> > +      * can access the kernel stack.
> > +      */
> > +     unsigned long           a0, a1, a2;
> >   };
> >
> >   /*
> > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > index d6a75aac1d27..340c1c84560d 100644
> > --- a/arch/riscv/kernel/asm-offsets.c
> > +++ b/arch/riscv/kernel/asm-offsets.c
> > @@ -34,10 +34,15 @@ void asm_offsets(void)
> >       OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
> >       OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
> >       OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
> > +
> > +     OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
> >       OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
> >       OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
> >       OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
> >       OFFSET(TASK_TI_USER_SP, task_struct, thread_info.user_sp);
> > +     OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
> > +     OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
> > +     OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
> >
> >       OFFSET(TASK_THREAD_F0,  task_struct, thread.fstate.f[0]);
> >       OFFSET(TASK_THREAD_F1,  task_struct, thread.fstate.f[1]);
> > diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> > index 143a2bb3e697..3a3c7b563816 100644
> > --- a/arch/riscv/kernel/entry.S
> > +++ b/arch/riscv/kernel/entry.S
> > @@ -14,6 +14,88 @@
> >   #include <asm/asm-offsets.h>
> >   #include <asm/errata_list.h>
> >
> > +.macro new_vmalloc_check
> > +     REG_S   a0, TASK_TI_A0(tp)
> > +     REG_S   a1, TASK_TI_A1(tp)
> > +     REG_S   a2, TASK_TI_A2(tp)
> > +
> > +     csrr    a0, CSR_CAUSE
> > +     /* Exclude IRQs */
> > +     blt     a0, zero, _new_vmalloc_restore_context
> > +     /* Only check new_vmalloc if we are in page/protection fault */
> > +     li      a1, EXC_LOAD_PAGE_FAULT
> > +     beq     a0, a1, _new_vmalloc_kernel_address
> > +     li      a1, EXC_STORE_PAGE_FAULT
> > +     beq     a0, a1, _new_vmalloc_kernel_address
> > +     li      a1, EXC_INST_PAGE_FAULT
> > +     bne     a0, a1, _new_vmalloc_restore_context
> > +
> > +_new_vmalloc_kernel_address:
> > +     /* Is it a kernel address? */
> > +     csrr    a0, CSR_TVAL
> > +     bge     a0, zero, _new_vmalloc_restore_context
> > +
> > +     /* Check if a new vmalloc mapping appeared that could explain the trap */
> > +
> > +     /*
> > +      * Computes:
> > +      * a0 = &new_vmalloc[BIT_WORD(cpu)]
> > +      * a1 = BIT_MASK(cpu)
> > +      */
> > +     REG_L   a2, TASK_TI_CPU(tp)
> > +     /*
> > +      * Compute the new_vmalloc element position:
> > +      * (cpu / 64) * 8 = (cpu >> 6) << 3
> > +      */
> > +     srli    a1, a2, 6
> > +     slli    a1, a1, 3
> > +     la      a0, new_vmalloc
> > +     add     a0, a0, a1
> > +     /*
> > +      * Compute the bit position in the new_vmalloc element:
> > +      * bit_pos = cpu % 64 = cpu - (cpu / 64) * 64 = cpu - (cpu >> 6) << 6
> > +      *         = cpu - ((cpu >> 6) << 3) << 3
> > +      */
> > +     slli    a1, a1, 3
> > +     sub     a1, a2, a1
> > +     /* Compute the "get mask": 1 << bit_pos */
> > +     li      a2, 1
> > +     sll     a1, a2, a1
> > +
> > +     /* Check the value of new_vmalloc for this cpu */
> > +     ld      a2, 0(a0)
> > +     and     a2, a2, a1
> > +     beq     a2, zero, _new_vmalloc_restore_context
> > +
> > +     ld      a2, 0(a0)
> > +     not     a1, a1
> > +     and     a1, a2, a1
> > +     sd      a1, 0(a0)
> > +
> > +     /* Only emit a sfence.vma if the uarch caches invalid entries */
> > +     la      a0, tlb_caching_invalid_entries
> > +     lb      a0, 0(a0)
> > +     beqz    a0, _new_vmalloc_no_caching_invalid_entries
> > +     sfence.vma
> > +_new_vmalloc_no_caching_invalid_entries:
> > +     // debug
> > +     la      a0, nr_sfence_vma_handle_exception
> > +     li      a1, 1
> > +     amoadd.w    a0, a1, (a0)
> > +     // end debug
> > +     REG_L   a0, TASK_TI_A0(tp)
> > +     REG_L   a1, TASK_TI_A1(tp)
> > +     REG_L   a2, TASK_TI_A2(tp)
> > +     csrw    CSR_SCRATCH, x0
> > +     sret
> > +
> > +_new_vmalloc_restore_context:
> > +     REG_L   a0, TASK_TI_A0(tp)
> > +     REG_L   a1, TASK_TI_A1(tp)
> > +     REG_L   a2, TASK_TI_A2(tp)
> > +.endm
> > +
> > +
> >   SYM_CODE_START(handle_exception)
> >       /*
> >        * If coming from userspace, preserve the user thread pointer and load
> > @@ -25,6 +107,18 @@ SYM_CODE_START(handle_exception)
> >
> >   _restore_kernel_tpsp:
> >       csrr tp, CSR_SCRATCH
> > +
> > +     /*
> > +      * The RISC-V kernel does not eagerly emit a sfence.vma after each
> > +      * new vmalloc mapping, which may result in exceptions:
> > +      * - if the uarch caches invalid entries, the new mapping would not be
> > +      *   observed by the page table walker and an invalidation is needed.
> > +      * - if the uarch does not cache invalid entries, a reordered access
> > +      *   could "miss" the new mapping and traps: in that case, we only need
> > +      *   to retry the access, no sfence.vma is required.
> > +      */
> > +     new_vmalloc_check
> > +
> >       REG_S sp, TASK_TI_KERNEL_SP(tp)
> >
> >   #ifdef CONFIG_VMAP_STACK
> > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > index 0798bd861dcb..379403de6c6f 100644
> > --- a/arch/riscv/mm/init.c
> > +++ b/arch/riscv/mm/init.c
> > @@ -36,6 +36,8 @@
> >
> >   #include "../kernel/head.h"
> >
> > +u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > +
> >   struct kernel_mapping kernel_map __ro_after_init;
> >   EXPORT_SYMBOL(kernel_map);
> >   #ifdef CONFIG_XIP_KERNEL

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH RFC/RFT 2/4] riscv: Add a runtime detection of invalid TLB entries caching
  2023-12-07 15:03 [PATCH RFC/RFT 0/4] Remove preventive sfence.vma Alexandre Ghiti
  2023-12-07 15:03 ` [PATCH RFC/RFT 1/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings Alexandre Ghiti
@ 2023-12-07 15:03 ` Alexandre Ghiti
  2023-12-07 15:55   ` Christophe Leroy
  2023-12-07 15:03 ` [PATCH RFC/RFT 3/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings Alexandre Ghiti
  2023-12-07 15:03 ` [PATCH RFC/RFT 4/4] TEMP: riscv: Add debugfs interface to retrieve #sfence.vma Alexandre Ghiti
  3 siblings, 1 reply; 11+ messages in thread
From: Alexandre Ghiti @ 2023-12-07 15:03 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm
  Cc: Alexandre Ghiti

This mechanism allows to completely bypass the sfence.vma introduced by
the previous commit for uarchs that do not cache invalid TLB entries.

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
 arch/riscv/mm/init.c | 124 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 124 insertions(+)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 379403de6c6f..2e854613740c 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -56,6 +56,8 @@ bool pgtable_l5_enabled = IS_ENABLED(CONFIG_64BIT) && !IS_ENABLED(CONFIG_XIP_KER
 EXPORT_SYMBOL(pgtable_l4_enabled);
 EXPORT_SYMBOL(pgtable_l5_enabled);
 
+bool tlb_caching_invalid_entries;
+
 phys_addr_t phys_ram_base __ro_after_init;
 EXPORT_SYMBOL(phys_ram_base);
 
@@ -750,6 +752,18 @@ static void __init disable_pgtable_l4(void)
 	satp_mode = SATP_MODE_39;
 }
 
+static void __init enable_pgtable_l5(void)
+{
+	pgtable_l5_enabled = true;
+	satp_mode = SATP_MODE_57;
+}
+
+static void __init enable_pgtable_l4(void)
+{
+	pgtable_l4_enabled = true;
+	satp_mode = SATP_MODE_48;
+}
+
 static int __init print_no4lvl(char *p)
 {
 	pr_info("Disabled 4-level and 5-level paging");
@@ -826,6 +840,112 @@ static __init void set_satp_mode(uintptr_t dtb_pa)
 	memset(early_pud, 0, PAGE_SIZE);
 	memset(early_pmd, 0, PAGE_SIZE);
 }
+
+/* Determine at runtime if the uarch caches invalid TLB entries */
+static __init void set_tlb_caching_invalid_entries(void)
+{
+#define NR_RETRIES_CACHING_INVALID_ENTRIES	50
+	uintptr_t set_tlb_caching_invalid_entries_pmd = ((unsigned long)set_tlb_caching_invalid_entries) & PMD_MASK;
+	// TODO the test_addr as defined below could go into another pud...
+	uintptr_t test_addr = set_tlb_caching_invalid_entries_pmd + 2 * PMD_SIZE;
+	pmd_t valid_pmd;
+	u64 satp;
+	int i = 0;
+
+	/* To ease the page table creation */
+	disable_pgtable_l5();
+	disable_pgtable_l4();
+
+	/* Establish a mapping for set_tlb_caching_invalid_entries() in sv39 */
+	create_pgd_mapping(early_pg_dir,
+			   set_tlb_caching_invalid_entries_pmd,
+			   (uintptr_t)early_pmd,
+			   PGDIR_SIZE, PAGE_TABLE);
+
+	/* Handle the case where set_tlb_caching_invalid_entries straddles 2 PMDs */
+	create_pmd_mapping(early_pmd,
+			   set_tlb_caching_invalid_entries_pmd,
+			   set_tlb_caching_invalid_entries_pmd,
+			   PMD_SIZE, PAGE_KERNEL_EXEC);
+	create_pmd_mapping(early_pmd,
+			   set_tlb_caching_invalid_entries_pmd + PMD_SIZE,
+			   set_tlb_caching_invalid_entries_pmd + PMD_SIZE,
+			   PMD_SIZE, PAGE_KERNEL_EXEC);
+
+	/* Establish an invalid mapping */
+	create_pmd_mapping(early_pmd, test_addr, 0, PMD_SIZE, __pgprot(0));
+
+	/* Precompute the valid pmd here because the mapping for pfn_pmd() won't exist */
+	valid_pmd = pfn_pmd(PFN_DOWN(set_tlb_caching_invalid_entries_pmd), PAGE_KERNEL);
+
+	local_flush_tlb_all();
+	satp = PFN_DOWN((uintptr_t)&early_pg_dir) | SATP_MODE_39;
+	csr_write(CSR_SATP, satp);
+
+	/*
+	 * Set stvec to after the trapping access, access this invalid mapping
+	 * and legitimately trap
+	 */
+	// TODO: Should I save the previous stvec?
+#define ASM_STR(x)	__ASM_STR(x)
+	asm volatile(
+		"la a0, 1f				\n"
+		"csrw " ASM_STR(CSR_TVEC) ", a0		\n"
+		"ld a0, 0(%0)				\n"
+		".align 2				\n"
+		"1:					\n"
+		:
+		: "r" (test_addr)
+		: "a0"
+	);
+
+	/* Now establish a valid mapping to check if the invalid one is cached */
+	early_pmd[pmd_index(test_addr)] = valid_pmd;
+
+	/*
+	 * Access the valid mapping multiple times: indeed, we can't use
+	 * sfence.vma as a barrier to make sure the cpu did not reorder accesses
+	 * so we may trap even if the uarch does not cache invalid entries. By
+	 * trying a few times, we make sure that those uarchs will see the right
+	 * mapping at some point.
+	 */
+
+	i = NR_RETRIES_CACHING_INVALID_ENTRIES;
+
+#define ASM_STR(x)	__ASM_STR(x)
+	asm_volatile_goto(
+		"la a0, 1f					\n"
+		"csrw " ASM_STR(CSR_TVEC) ", a0			\n"
+		".align 2					\n"
+		"1:						\n"
+		"addi %0, %0, -1				\n"
+		"blt %0, zero, %l[caching_invalid_entries]	\n"
+		"ld a0, 0(%1)					\n"
+		:
+		: "r" (i), "r" (test_addr)
+		: "a0"
+		: caching_invalid_entries
+	);
+
+	csr_write(CSR_SATP, 0ULL);
+	local_flush_tlb_all();
+
+	/* If we don't trap, the uarch does not cache invalid entries! */
+	tlb_caching_invalid_entries = false;
+	goto clean;
+
+caching_invalid_entries:
+	csr_write(CSR_SATP, 0ULL);
+	local_flush_tlb_all();
+
+	tlb_caching_invalid_entries = true;
+clean:
+	memset(early_pg_dir, 0, PAGE_SIZE);
+	memset(early_pmd, 0, PAGE_SIZE);
+
+	enable_pgtable_l4();
+	enable_pgtable_l5();
+}
 #endif
 
 /*
@@ -1072,6 +1192,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 #endif
 
 #if defined(CONFIG_64BIT) && !defined(CONFIG_XIP_KERNEL)
+	set_tlb_caching_invalid_entries();
 	set_satp_mode(dtb_pa);
 #endif
 
@@ -1322,6 +1443,9 @@ static void __init setup_vm_final(void)
 	local_flush_tlb_all();
 
 	pt_ops_set_late();
+
+	pr_info("uarch caches invalid entries: %s",
+		tlb_caching_invalid_entries ? "yes" : "no");
 }
 #else
 asmlinkage void __init setup_vm(uintptr_t dtb_pa)
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC/RFT 2/4] riscv: Add a runtime detection of invalid TLB entries caching
  2023-12-07 15:03 ` [PATCH RFC/RFT 2/4] riscv: Add a runtime detection of invalid TLB entries caching Alexandre Ghiti
@ 2023-12-07 15:55   ` Christophe Leroy
  2023-12-08 14:30     ` Alexandre Ghiti
  0 siblings, 1 reply; 11+ messages in thread
From: Christophe Leroy @ 2023-12-07 15:55 UTC (permalink / raw)
  To: Alexandre Ghiti, Catalin Marinas, Will Deacon,
	Thomas Bogendoerfer, Michael Ellerman, Nicholas Piggin,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-mips@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org,
	linux-mm@kvack.org



Le 07/12/2023 à 16:03, Alexandre Ghiti a écrit :
> This mechanism allows to completely bypass the sfence.vma introduced by
> the previous commit for uarchs that do not cache invalid TLB entries.
> 
> Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> ---
>   arch/riscv/mm/init.c | 124 +++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 124 insertions(+)
> 
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 379403de6c6f..2e854613740c 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -56,6 +56,8 @@ bool pgtable_l5_enabled = IS_ENABLED(CONFIG_64BIT) && !IS_ENABLED(CONFIG_XIP_KER
>   EXPORT_SYMBOL(pgtable_l4_enabled);
>   EXPORT_SYMBOL(pgtable_l5_enabled);
>   
> +bool tlb_caching_invalid_entries;
> +
>   phys_addr_t phys_ram_base __ro_after_init;
>   EXPORT_SYMBOL(phys_ram_base);
>   
> @@ -750,6 +752,18 @@ static void __init disable_pgtable_l4(void)
>   	satp_mode = SATP_MODE_39;
>   }
>   
> +static void __init enable_pgtable_l5(void)
> +{
> +	pgtable_l5_enabled = true;
> +	satp_mode = SATP_MODE_57;
> +}
> +
> +static void __init enable_pgtable_l4(void)
> +{
> +	pgtable_l4_enabled = true;
> +	satp_mode = SATP_MODE_48;
> +}
> +
>   static int __init print_no4lvl(char *p)
>   {
>   	pr_info("Disabled 4-level and 5-level paging");
> @@ -826,6 +840,112 @@ static __init void set_satp_mode(uintptr_t dtb_pa)
>   	memset(early_pud, 0, PAGE_SIZE);
>   	memset(early_pmd, 0, PAGE_SIZE);
>   }
> +
> +/* Determine at runtime if the uarch caches invalid TLB entries */
> +static __init void set_tlb_caching_invalid_entries(void)
> +{
> +#define NR_RETRIES_CACHING_INVALID_ENTRIES	50

Looks odd to have macros nested in the middle of a function.

> +	uintptr_t set_tlb_caching_invalid_entries_pmd = ((unsigned long)set_tlb_caching_invalid_entries) & PMD_MASK;
> +	// TODO the test_addr as defined below could go into another pud...
> +	uintptr_t test_addr = set_tlb_caching_invalid_entries_pmd + 2 * PMD_SIZE;
> +	pmd_t valid_pmd;
> +	u64 satp;
> +	int i = 0;
> +
> +	/* To ease the page table creation */
> +	disable_pgtable_l5();
> +	disable_pgtable_l4();
> +
> +	/* Establish a mapping for set_tlb_caching_invalid_entries() in sv39 */
> +	create_pgd_mapping(early_pg_dir,
> +			   set_tlb_caching_invalid_entries_pmd,
> +			   (uintptr_t)early_pmd,
> +			   PGDIR_SIZE, PAGE_TABLE);
> +
> +	/* Handle the case where set_tlb_caching_invalid_entries straddles 2 PMDs */
> +	create_pmd_mapping(early_pmd,
> +			   set_tlb_caching_invalid_entries_pmd,
> +			   set_tlb_caching_invalid_entries_pmd,
> +			   PMD_SIZE, PAGE_KERNEL_EXEC);
> +	create_pmd_mapping(early_pmd,
> +			   set_tlb_caching_invalid_entries_pmd + PMD_SIZE,
> +			   set_tlb_caching_invalid_entries_pmd + PMD_SIZE,
> +			   PMD_SIZE, PAGE_KERNEL_EXEC);
> +
> +	/* Establish an invalid mapping */
> +	create_pmd_mapping(early_pmd, test_addr, 0, PMD_SIZE, __pgprot(0));
> +
> +	/* Precompute the valid pmd here because the mapping for pfn_pmd() won't exist */
> +	valid_pmd = pfn_pmd(PFN_DOWN(set_tlb_caching_invalid_entries_pmd), PAGE_KERNEL);
> +
> +	local_flush_tlb_all();
> +	satp = PFN_DOWN((uintptr_t)&early_pg_dir) | SATP_MODE_39;
> +	csr_write(CSR_SATP, satp);
> +
> +	/*
> +	 * Set stvec to after the trapping access, access this invalid mapping
> +	 * and legitimately trap
> +	 */
> +	// TODO: Should I save the previous stvec?
> +#define ASM_STR(x)	__ASM_STR(x)

Looks odd to have macros nested in the middle of a function.


> +	asm volatile(
> +		"la a0, 1f				\n"
> +		"csrw " ASM_STR(CSR_TVEC) ", a0		\n"
> +		"ld a0, 0(%0)				\n"
> +		".align 2				\n"
> +		"1:					\n"
> +		:
> +		: "r" (test_addr)
> +		: "a0"
> +	);
> +
> +	/* Now establish a valid mapping to check if the invalid one is cached */
> +	early_pmd[pmd_index(test_addr)] = valid_pmd;
> +
> +	/*
> +	 * Access the valid mapping multiple times: indeed, we can't use
> +	 * sfence.vma as a barrier to make sure the cpu did not reorder accesses
> +	 * so we may trap even if the uarch does not cache invalid entries. By
> +	 * trying a few times, we make sure that those uarchs will see the right
> +	 * mapping at some point.
> +	 */
> +
> +	i = NR_RETRIES_CACHING_INVALID_ENTRIES;
> +
> +#define ASM_STR(x)	__ASM_STR(x)

Deplicate define ?

> +	asm_volatile_goto(
> +		"la a0, 1f					\n"
> +		"csrw " ASM_STR(CSR_TVEC) ", a0			\n"
> +		".align 2					\n"
> +		"1:						\n"
> +		"addi %0, %0, -1				\n"
> +		"blt %0, zero, %l[caching_invalid_entries]	\n"
> +		"ld a0, 0(%1)					\n"
> +		:
> +		: "r" (i), "r" (test_addr)
> +		: "a0"
> +		: caching_invalid_entries
> +	);
> +
> +	csr_write(CSR_SATP, 0ULL);
> +	local_flush_tlb_all();
> +
> +	/* If we don't trap, the uarch does not cache invalid entries! */
> +	tlb_caching_invalid_entries = false;
> +	goto clean;
> +
> +caching_invalid_entries:
> +	csr_write(CSR_SATP, 0ULL);
> +	local_flush_tlb_all();
> +
> +	tlb_caching_invalid_entries = true;
> +clean:
> +	memset(early_pg_dir, 0, PAGE_SIZE);
> +	memset(early_pmd, 0, PAGE_SIZE);

Use clear_page() instead ?

> +
> +	enable_pgtable_l4();
> +	enable_pgtable_l5();
> +}
>   #endif
>   
>   /*
> @@ -1072,6 +1192,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>   #endif
>   
>   #if defined(CONFIG_64BIT) && !defined(CONFIG_XIP_KERNEL)
> +	set_tlb_caching_invalid_entries();
>   	set_satp_mode(dtb_pa);
>   #endif
>   
> @@ -1322,6 +1443,9 @@ static void __init setup_vm_final(void)
>   	local_flush_tlb_all();
>   
>   	pt_ops_set_late();
> +
> +	pr_info("uarch caches invalid entries: %s",
> +		tlb_caching_invalid_entries ? "yes" : "no");
>   }
>   #else
>   asmlinkage void __init setup_vm(uintptr_t dtb_pa)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC/RFT 2/4] riscv: Add a runtime detection of invalid TLB entries caching
  2023-12-07 15:55   ` Christophe Leroy
@ 2023-12-08 14:30     ` Alexandre Ghiti
  0 siblings, 0 replies; 11+ messages in thread
From: Alexandre Ghiti @ 2023-12-08 14:30 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Andrew Morton, Ved Shanbhogue, Matt Evans, Dylan Jhong,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-mips@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org,
	linux-mm@kvack.org

On Thu, Dec 7, 2023 at 4:55 PM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
>
>
> Le 07/12/2023 à 16:03, Alexandre Ghiti a écrit :
> > This mechanism allows to completely bypass the sfence.vma introduced by
> > the previous commit for uarchs that do not cache invalid TLB entries.
> >
> > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > ---
> >   arch/riscv/mm/init.c | 124 +++++++++++++++++++++++++++++++++++++++++++
> >   1 file changed, 124 insertions(+)
> >
> > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > index 379403de6c6f..2e854613740c 100644
> > --- a/arch/riscv/mm/init.c
> > +++ b/arch/riscv/mm/init.c
> > @@ -56,6 +56,8 @@ bool pgtable_l5_enabled = IS_ENABLED(CONFIG_64BIT) && !IS_ENABLED(CONFIG_XIP_KER
> >   EXPORT_SYMBOL(pgtable_l4_enabled);
> >   EXPORT_SYMBOL(pgtable_l5_enabled);
> >
> > +bool tlb_caching_invalid_entries;
> > +
> >   phys_addr_t phys_ram_base __ro_after_init;
> >   EXPORT_SYMBOL(phys_ram_base);
> >
> > @@ -750,6 +752,18 @@ static void __init disable_pgtable_l4(void)
> >       satp_mode = SATP_MODE_39;
> >   }
> >
> > +static void __init enable_pgtable_l5(void)
> > +{
> > +     pgtable_l5_enabled = true;
> > +     satp_mode = SATP_MODE_57;
> > +}
> > +
> > +static void __init enable_pgtable_l4(void)
> > +{
> > +     pgtable_l4_enabled = true;
> > +     satp_mode = SATP_MODE_48;
> > +}
> > +
> >   static int __init print_no4lvl(char *p)
> >   {
> >       pr_info("Disabled 4-level and 5-level paging");
> > @@ -826,6 +840,112 @@ static __init void set_satp_mode(uintptr_t dtb_pa)
> >       memset(early_pud, 0, PAGE_SIZE);
> >       memset(early_pmd, 0, PAGE_SIZE);
> >   }
> > +
> > +/* Determine at runtime if the uarch caches invalid TLB entries */
> > +static __init void set_tlb_caching_invalid_entries(void)
> > +{
> > +#define NR_RETRIES_CACHING_INVALID_ENTRIES   50
>
> Looks odd to have macros nested in the middle of a function.
>
> > +     uintptr_t set_tlb_caching_invalid_entries_pmd = ((unsigned long)set_tlb_caching_invalid_entries) & PMD_MASK;
> > +     // TODO the test_addr as defined below could go into another pud...
> > +     uintptr_t test_addr = set_tlb_caching_invalid_entries_pmd + 2 * PMD_SIZE;
> > +     pmd_t valid_pmd;
> > +     u64 satp;
> > +     int i = 0;
> > +
> > +     /* To ease the page table creation */
> > +     disable_pgtable_l5();
> > +     disable_pgtable_l4();
> > +
> > +     /* Establish a mapping for set_tlb_caching_invalid_entries() in sv39 */
> > +     create_pgd_mapping(early_pg_dir,
> > +                        set_tlb_caching_invalid_entries_pmd,
> > +                        (uintptr_t)early_pmd,
> > +                        PGDIR_SIZE, PAGE_TABLE);
> > +
> > +     /* Handle the case where set_tlb_caching_invalid_entries straddles 2 PMDs */
> > +     create_pmd_mapping(early_pmd,
> > +                        set_tlb_caching_invalid_entries_pmd,
> > +                        set_tlb_caching_invalid_entries_pmd,
> > +                        PMD_SIZE, PAGE_KERNEL_EXEC);
> > +     create_pmd_mapping(early_pmd,
> > +                        set_tlb_caching_invalid_entries_pmd + PMD_SIZE,
> > +                        set_tlb_caching_invalid_entries_pmd + PMD_SIZE,
> > +                        PMD_SIZE, PAGE_KERNEL_EXEC);
> > +
> > +     /* Establish an invalid mapping */
> > +     create_pmd_mapping(early_pmd, test_addr, 0, PMD_SIZE, __pgprot(0));
> > +
> > +     /* Precompute the valid pmd here because the mapping for pfn_pmd() won't exist */
> > +     valid_pmd = pfn_pmd(PFN_DOWN(set_tlb_caching_invalid_entries_pmd), PAGE_KERNEL);
> > +
> > +     local_flush_tlb_all();
> > +     satp = PFN_DOWN((uintptr_t)&early_pg_dir) | SATP_MODE_39;
> > +     csr_write(CSR_SATP, satp);
> > +
> > +     /*
> > +      * Set stvec to after the trapping access, access this invalid mapping
> > +      * and legitimately trap
> > +      */
> > +     // TODO: Should I save the previous stvec?
> > +#define ASM_STR(x)   __ASM_STR(x)
>
> Looks odd to have macros nested in the middle of a function.
>
>
> > +     asm volatile(
> > +             "la a0, 1f                              \n"
> > +             "csrw " ASM_STR(CSR_TVEC) ", a0         \n"
> > +             "ld a0, 0(%0)                           \n"
> > +             ".align 2                               \n"
> > +             "1:                                     \n"
> > +             :
> > +             : "r" (test_addr)
> > +             : "a0"
> > +     );
> > +
> > +     /* Now establish a valid mapping to check if the invalid one is cached */
> > +     early_pmd[pmd_index(test_addr)] = valid_pmd;
> > +
> > +     /*
> > +      * Access the valid mapping multiple times: indeed, we can't use
> > +      * sfence.vma as a barrier to make sure the cpu did not reorder accesses
> > +      * so we may trap even if the uarch does not cache invalid entries. By
> > +      * trying a few times, we make sure that those uarchs will see the right
> > +      * mapping at some point.
> > +      */
> > +
> > +     i = NR_RETRIES_CACHING_INVALID_ENTRIES;
> > +
> > +#define ASM_STR(x)   __ASM_STR(x)
>
> Deplicate define ?
>
> > +     asm_volatile_goto(
> > +             "la a0, 1f                                      \n"
> > +             "csrw " ASM_STR(CSR_TVEC) ", a0                 \n"
> > +             ".align 2                                       \n"
> > +             "1:                                             \n"
> > +             "addi %0, %0, -1                                \n"
> > +             "blt %0, zero, %l[caching_invalid_entries]      \n"
> > +             "ld a0, 0(%1)                                   \n"
> > +             :
> > +             : "r" (i), "r" (test_addr)
> > +             : "a0"
> > +             : caching_invalid_entries
> > +     );
> > +
> > +     csr_write(CSR_SATP, 0ULL);
> > +     local_flush_tlb_all();
> > +
> > +     /* If we don't trap, the uarch does not cache invalid entries! */
> > +     tlb_caching_invalid_entries = false;
> > +     goto clean;
> > +
> > +caching_invalid_entries:
> > +     csr_write(CSR_SATP, 0ULL);
> > +     local_flush_tlb_all();
> > +
> > +     tlb_caching_invalid_entries = true;
> > +clean:
> > +     memset(early_pg_dir, 0, PAGE_SIZE);
> > +     memset(early_pmd, 0, PAGE_SIZE);
>
> Use clear_page() instead ?
>
> > +
> > +     enable_pgtable_l4();
> > +     enable_pgtable_l5();
> > +}
> >   #endif
> >
> >   /*
> > @@ -1072,6 +1192,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> >   #endif
> >
> >   #if defined(CONFIG_64BIT) && !defined(CONFIG_XIP_KERNEL)
> > +     set_tlb_caching_invalid_entries();
> >       set_satp_mode(dtb_pa);
> >   #endif
> >
> > @@ -1322,6 +1443,9 @@ static void __init setup_vm_final(void)
> >       local_flush_tlb_all();
> >
> >       pt_ops_set_late();
> > +
> > +     pr_info("uarch caches invalid entries: %s",
> > +             tlb_caching_invalid_entries ? "yes" : "no");
> >   }
> >   #else
> >   asmlinkage void __init setup_vm(uintptr_t dtb_pa)

I left this patch so that people can easily test this without knowing
what their uarch is actually doing, but it will very likely be dropped
as a new extension has just been proposed for that.

Thanks anyway, I should have been more clear in the patch title,

Alex

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH RFC/RFT 3/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings
  2023-12-07 15:03 [PATCH RFC/RFT 0/4] Remove preventive sfence.vma Alexandre Ghiti
  2023-12-07 15:03 ` [PATCH RFC/RFT 1/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings Alexandre Ghiti
  2023-12-07 15:03 ` [PATCH RFC/RFT 2/4] riscv: Add a runtime detection of invalid TLB entries caching Alexandre Ghiti
@ 2023-12-07 15:03 ` Alexandre Ghiti
  2023-12-07 16:37   ` Christophe Leroy
  2023-12-07 15:03 ` [PATCH RFC/RFT 4/4] TEMP: riscv: Add debugfs interface to retrieve #sfence.vma Alexandre Ghiti
  3 siblings, 1 reply; 11+ messages in thread
From: Alexandre Ghiti @ 2023-12-07 15:03 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm
  Cc: Alexandre Ghiti

The preventive sfence.vma were emitted because new mappings must be made
visible to the page table walker, either the uarch caches invalid
entries or not.

Actually, there is no need to preventively sfence.vma on new mappings for
userspace, this should be handled only in the page fault path.

This allows to drastically reduce the number of sfence.vma emitted:

* Ubuntu boot to login:
Before: ~630k sfence.vma
After:  ~200k sfence.vma

* ltp - mmapstress01
Before: ~45k
After:  ~6.3k

* lmbench - lat_pagefault
Before: ~665k
After:   832 (!)

* lmbench - lat_mmap
Before: ~546k
After:   718 (!)

The only issue with the removal of sfence.vma in update_mmu_cache() is
that on uarchs that cache invalid entries, those won't be invalidated
until the process takes a fault: so that's an additional fault in those
cases.

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
 arch/arm64/include/asm/pgtable.h              |  2 +-
 arch/mips/include/asm/pgtable.h               |  6 +--
 arch/powerpc/include/asm/book3s/64/tlbflush.h |  8 ++--
 arch/riscv/include/asm/pgtable.h              | 43 +++++++++++--------
 include/linux/pgtable.h                       |  8 +++-
 mm/memory.c                                   | 12 +++++-
 6 files changed, 48 insertions(+), 31 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7f7d9b1df4e5..728f25f529a5 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -57,7 +57,7 @@ static inline bool arch_thp_swp_supported(void)
  * fault on one CPU which has been handled concurrently by another CPU
  * does not need to perform additional invalidation.
  */
-#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
+#define flush_tlb_fix_spurious_write_fault(vma, address, ptep) do { } while (0)
 
 /*
  * ZERO_PAGE is a global shared page that is always zero: used
diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index 430b208c0130..84439fe6ed29 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -478,9 +478,9 @@ static inline pgprot_t pgprot_writecombine(pgprot_t _prot)
 	return __pgprot(prot);
 }
 
-static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
-						unsigned long address,
-						pte_t *ptep)
+static inline void flush_tlb_fix_spurious_write_fault(struct vm_area_struct *vma,
+						      unsigned long address,
+						      pte_t *ptep)
 {
 }
 
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index 1950c1b825b4..7166d56f90db 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -128,10 +128,10 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 #define flush_tlb_page(vma, addr)	local_flush_tlb_page(vma, addr)
 #endif /* CONFIG_SMP */
 
-#define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault
-static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
-						unsigned long address,
-						pte_t *ptep)
+#define flush_tlb_fix_spurious_write_fault flush_tlb_fix_spurious_write_fault
+static inline void flush_tlb_fix_spurious_write_fault(struct vm_area_struct *vma,
+						      unsigned long address,
+						      pte_t *ptep)
 {
 	/*
 	 * Book3S 64 does not require spurious fault flushes because the PTE
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index b2ba3f79cfe9..89aa5650f104 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -472,28 +472,20 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
 		struct vm_area_struct *vma, unsigned long address,
 		pte_t *ptep, unsigned int nr)
 {
-	/*
-	 * The kernel assumes that TLBs don't cache invalid entries, but
-	 * in RISC-V, SFENCE.VMA specifies an ordering constraint, not a
-	 * cache flush; it is necessary even after writing invalid entries.
-	 * Relying on flush_tlb_fix_spurious_fault would suffice, but
-	 * the extra traps reduce performance.  So, eagerly SFENCE.VMA.
-	 */
-	while (nr--)
-		local_flush_tlb_page(address + nr * PAGE_SIZE);
 }
 #define update_mmu_cache(vma, addr, ptep) \
 	update_mmu_cache_range(NULL, vma, addr, ptep, 1)
 
 #define __HAVE_ARCH_UPDATE_MMU_TLB
-#define update_mmu_tlb update_mmu_cache
+static inline void update_mmu_tlb(struct vm_area_struct *vma,
+				  unsigned long address, pte_t *ptep)
+{
+	flush_tlb_range(vma, address, address + PAGE_SIZE);
+}
 
 static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmdp)
 {
-	pte_t *ptep = (pte_t *)pmdp;
-
-	update_mmu_cache(vma, address, ptep);
 }
 
 #define __HAVE_ARCH_PTE_SAME
@@ -548,13 +540,26 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
 					unsigned long address, pte_t *ptep,
 					pte_t entry, int dirty)
 {
-	if (!pte_same(*ptep, entry))
+	if (!pte_same(*ptep, entry)) {
 		__set_pte_at(ptep, entry);
-	/*
-	 * update_mmu_cache will unconditionally execute, handling both
-	 * the case that the PTE changed and the spurious fault case.
-	 */
-	return true;
+		/* Here only not svadu is impacted */
+		flush_tlb_page(vma, address);
+		return true;
+	}
+
+	return false;
+}
+
+extern u64 nr_sfence_vma_handle_exception;
+extern bool tlb_caching_invalid_entries;
+
+#define flush_tlb_fix_spurious_read_fault flush_tlb_fix_spurious_read_fault
+static inline void flush_tlb_fix_spurious_read_fault(struct vm_area_struct *vma,
+						     unsigned long address,
+						     pte_t *ptep)
+{
+	if (tlb_caching_invalid_entries)
+		flush_tlb_page(vma, address);
 }
 
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index af7639c3b0a3..7abaf42ef612 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -931,8 +931,12 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
 # define pte_accessible(mm, pte)	((void)(pte), 1)
 #endif
 
-#ifndef flush_tlb_fix_spurious_fault
-#define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address)
+#ifndef flush_tlb_fix_spurious_write_fault
+#define flush_tlb_fix_spurious_write_fault(vma, address, ptep) flush_tlb_page(vma, address)
+#endif
+
+#ifndef flush_tlb_fix_spurious_read_fault
+#define flush_tlb_fix_spurious_read_fault(vma, address, ptep)
 #endif
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index 517221f01303..5cb0ccf0c03f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5014,8 +5014,16 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		 * with threads.
 		 */
 		if (vmf->flags & FAULT_FLAG_WRITE)
-			flush_tlb_fix_spurious_fault(vmf->vma, vmf->address,
-						     vmf->pte);
+			flush_tlb_fix_spurious_write_fault(vmf->vma, vmf->address,
+							   vmf->pte);
+		else
+			/*
+			 * With the pte_same(ptep_get(vmf->pte), entry) check
+			 * that calls update_mmu_tlb() above, multiple threads
+			 * faulting at the same time won't get there.
+			 */
+			flush_tlb_fix_spurious_read_fault(vmf->vma, vmf->address,
+							  vmf->pte);
 	}
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC/RFT 3/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings
  2023-12-07 15:03 ` [PATCH RFC/RFT 3/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings Alexandre Ghiti
@ 2023-12-07 16:37   ` Christophe Leroy
  2023-12-08 14:39     ` Alexandre Ghiti
  0 siblings, 1 reply; 11+ messages in thread
From: Christophe Leroy @ 2023-12-07 16:37 UTC (permalink / raw)
  To: Alexandre Ghiti, Catalin Marinas, Will Deacon,
	Thomas Bogendoerfer, Michael Ellerman, Nicholas Piggin,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-mips@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org,
	linux-mm@kvack.org

The subject says "riscv:" but it changes core part and several arch. 
Maybe this commit should be split in two commits, one for API changes 
that changes flush_tlb_fix_spurious_fault() to 
flush_tlb_fix_spurious_write_fault() and adds 
flush_tlb_fix_spurious_read_fault() including the change in memory.c, 
then a second patch with the changes to riscv.

Le 07/12/2023 à 16:03, Alexandre Ghiti a écrit :
> The preventive sfence.vma were emitted because new mappings must be made
> visible to the page table walker, either the uarch caches invalid
> entries or not.
> 
> Actually, there is no need to preventively sfence.vma on new mappings for
> userspace, this should be handled only in the page fault path.
> 
> This allows to drastically reduce the number of sfence.vma emitted:
> 
> * Ubuntu boot to login:
> Before: ~630k sfence.vma
> After:  ~200k sfence.vma
> 
> * ltp - mmapstress01
> Before: ~45k
> After:  ~6.3k
> 
> * lmbench - lat_pagefault
> Before: ~665k
> After:   832 (!)
> 
> * lmbench - lat_mmap
> Before: ~546k
> After:   718 (!)
> 
> The only issue with the removal of sfence.vma in update_mmu_cache() is
> that on uarchs that cache invalid entries, those won't be invalidated
> until the process takes a fault: so that's an additional fault in those
> cases.
> 
> Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> ---
>   arch/arm64/include/asm/pgtable.h              |  2 +-
>   arch/mips/include/asm/pgtable.h               |  6 +--
>   arch/powerpc/include/asm/book3s/64/tlbflush.h |  8 ++--
>   arch/riscv/include/asm/pgtable.h              | 43 +++++++++++--------
>   include/linux/pgtable.h                       |  8 +++-
>   mm/memory.c                                   | 12 +++++-
>   6 files changed, 48 insertions(+), 31 deletions(-)

Did you forget mm/pgtable-generic.c ?

> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 7f7d9b1df4e5..728f25f529a5 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -57,7 +57,7 @@ static inline bool arch_thp_swp_supported(void)
>    * fault on one CPU which has been handled concurrently by another CPU
>    * does not need to perform additional invalidation.
>    */
> -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
> +#define flush_tlb_fix_spurious_write_fault(vma, address, ptep) do { } while (0)

Why do you need to do that change ? Nothing is explained about that in 
the commit message.

>   
>   /*
>    * ZERO_PAGE is a global shared page that is always zero: used
> diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
> index 430b208c0130..84439fe6ed29 100644
> --- a/arch/mips/include/asm/pgtable.h
> +++ b/arch/mips/include/asm/pgtable.h
> @@ -478,9 +478,9 @@ static inline pgprot_t pgprot_writecombine(pgprot_t _prot)
>   	return __pgprot(prot);
>   }
>   
> -static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
> -						unsigned long address,
> -						pte_t *ptep)
> +static inline void flush_tlb_fix_spurious_write_fault(struct vm_area_struct *vma,
> +						      unsigned long address,
> +						      pte_t *ptep)
>   {
>   }
>   
> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h b/arch/powerpc/include/asm/book3s/64/tlbflush.h
> index 1950c1b825b4..7166d56f90db 100644
> --- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
> @@ -128,10 +128,10 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
>   #define flush_tlb_page(vma, addr)	local_flush_tlb_page(vma, addr)
>   #endif /* CONFIG_SMP */
>   
> -#define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault
> -static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
> -						unsigned long address,
> -						pte_t *ptep)
> +#define flush_tlb_fix_spurious_write_fault flush_tlb_fix_spurious_write_fault
> +static inline void flush_tlb_fix_spurious_write_fault(struct vm_area_struct *vma,
> +						      unsigned long address,
> +						      pte_t *ptep)
>   {
>   	/*
>   	 * Book3S 64 does not require spurious fault flushes because the PTE
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index b2ba3f79cfe9..89aa5650f104 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -472,28 +472,20 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
>   		struct vm_area_struct *vma, unsigned long address,
>   		pte_t *ptep, unsigned int nr)
>   {
> -	/*
> -	 * The kernel assumes that TLBs don't cache invalid entries, but
> -	 * in RISC-V, SFENCE.VMA specifies an ordering constraint, not a
> -	 * cache flush; it is necessary even after writing invalid entries.
> -	 * Relying on flush_tlb_fix_spurious_fault would suffice, but
> -	 * the extra traps reduce performance.  So, eagerly SFENCE.VMA.
> -	 */
> -	while (nr--)
> -		local_flush_tlb_page(address + nr * PAGE_SIZE);
>   }
>   #define update_mmu_cache(vma, addr, ptep) \
>   	update_mmu_cache_range(NULL, vma, addr, ptep, 1)
>   
>   #define __HAVE_ARCH_UPDATE_MMU_TLB
> -#define update_mmu_tlb update_mmu_cache
> +static inline void update_mmu_tlb(struct vm_area_struct *vma,
> +				  unsigned long address, pte_t *ptep)
> +{
> +	flush_tlb_range(vma, address, address + PAGE_SIZE);
> +}
>   
>   static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
>   		unsigned long address, pmd_t *pmdp)
>   {
> -	pte_t *ptep = (pte_t *)pmdp;
> -
> -	update_mmu_cache(vma, address, ptep);
>   }
>   
>   #define __HAVE_ARCH_PTE_SAME
> @@ -548,13 +540,26 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>   					unsigned long address, pte_t *ptep,
>   					pte_t entry, int dirty)
>   {
> -	if (!pte_same(*ptep, entry))
> +	if (!pte_same(*ptep, entry)) {
>   		__set_pte_at(ptep, entry);
> -	/*
> -	 * update_mmu_cache will unconditionally execute, handling both
> -	 * the case that the PTE changed and the spurious fault case.
> -	 */
> -	return true;
> +		/* Here only not svadu is impacted */
> +		flush_tlb_page(vma, address);
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
> +extern u64 nr_sfence_vma_handle_exception;
> +extern bool tlb_caching_invalid_entries;
> +
> +#define flush_tlb_fix_spurious_read_fault flush_tlb_fix_spurious_read_fault
> +static inline void flush_tlb_fix_spurious_read_fault(struct vm_area_struct *vma,
> +						     unsigned long address,
> +						     pte_t *ptep)
> +{
> +	if (tlb_caching_invalid_entries)
> +		flush_tlb_page(vma, address);
>   }
>   
>   #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index af7639c3b0a3..7abaf42ef612 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -931,8 +931,12 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>   # define pte_accessible(mm, pte)	((void)(pte), 1)
>   #endif
>   
> -#ifndef flush_tlb_fix_spurious_fault
> -#define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address)
> +#ifndef flush_tlb_fix_spurious_write_fault
> +#define flush_tlb_fix_spurious_write_fault(vma, address, ptep) flush_tlb_page(vma, address)
> +#endif
> +
> +#ifndef flush_tlb_fix_spurious_read_fault
> +#define flush_tlb_fix_spurious_read_fault(vma, address, ptep)
>   #endif
>   
>   /*
> diff --git a/mm/memory.c b/mm/memory.c
> index 517221f01303..5cb0ccf0c03f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5014,8 +5014,16 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>   		 * with threads.
>   		 */
>   		if (vmf->flags & FAULT_FLAG_WRITE)
> -			flush_tlb_fix_spurious_fault(vmf->vma, vmf->address,
> -						     vmf->pte);
> +			flush_tlb_fix_spurious_write_fault(vmf->vma, vmf->address,
> +							   vmf->pte);
> +		else
> +			/*
> +			 * With the pte_same(ptep_get(vmf->pte), entry) check
> +			 * that calls update_mmu_tlb() above, multiple threads
> +			 * faulting at the same time won't get there.
> +			 */
> +			flush_tlb_fix_spurious_read_fault(vmf->vma, vmf->address,
> +							  vmf->pte);
>   	}
>   unlock:
>   	pte_unmap_unlock(vmf->pte, vmf->ptl);

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC/RFT 3/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings
  2023-12-07 16:37   ` Christophe Leroy
@ 2023-12-08 14:39     ` Alexandre Ghiti
  0 siblings, 0 replies; 11+ messages in thread
From: Alexandre Ghiti @ 2023-12-08 14:39 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Andrew Morton, Ved Shanbhogue, Matt Evans, Dylan Jhong,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-mips@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org,
	linux-mm@kvack.org

On Thu, Dec 7, 2023 at 5:37 PM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
> The subject says "riscv:" but it changes core part and several arch.
> Maybe this commit should be split in two commits, one for API changes
> that changes flush_tlb_fix_spurious_fault() to
> flush_tlb_fix_spurious_write_fault() and adds
> flush_tlb_fix_spurious_read_fault() including the change in memory.c,
> then a second patch with the changes to riscv.

You're right, I'll do that, thanks.

>
> Le 07/12/2023 à 16:03, Alexandre Ghiti a écrit :
> > The preventive sfence.vma were emitted because new mappings must be made
> > visible to the page table walker, either the uarch caches invalid
> > entries or not.
> >
> > Actually, there is no need to preventively sfence.vma on new mappings for
> > userspace, this should be handled only in the page fault path.
> >
> > This allows to drastically reduce the number of sfence.vma emitted:
> >
> > * Ubuntu boot to login:
> > Before: ~630k sfence.vma
> > After:  ~200k sfence.vma
> >
> > * ltp - mmapstress01
> > Before: ~45k
> > After:  ~6.3k
> >
> > * lmbench - lat_pagefault
> > Before: ~665k
> > After:   832 (!)
> >
> > * lmbench - lat_mmap
> > Before: ~546k
> > After:   718 (!)
> >
> > The only issue with the removal of sfence.vma in update_mmu_cache() is
> > that on uarchs that cache invalid entries, those won't be invalidated
> > until the process takes a fault: so that's an additional fault in those
> > cases.
> >
> > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > ---
> >   arch/arm64/include/asm/pgtable.h              |  2 +-
> >   arch/mips/include/asm/pgtable.h               |  6 +--
> >   arch/powerpc/include/asm/book3s/64/tlbflush.h |  8 ++--
> >   arch/riscv/include/asm/pgtable.h              | 43 +++++++++++--------
> >   include/linux/pgtable.h                       |  8 +++-
> >   mm/memory.c                                   | 12 +++++-
> >   6 files changed, 48 insertions(+), 31 deletions(-)
>
> Did you forget mm/pgtable-generic.c ?

Indeed, I "missed" the occurrence of flush_tlb_fix_spurious_fault()
there, thanks.

>
> >
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 7f7d9b1df4e5..728f25f529a5 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -57,7 +57,7 @@ static inline bool arch_thp_swp_supported(void)
> >    * fault on one CPU which has been handled concurrently by another CPU
> >    * does not need to perform additional invalidation.
> >    */
> > -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
> > +#define flush_tlb_fix_spurious_write_fault(vma, address, ptep) do { } while (0)
>
> Why do you need to do that change ? Nothing is explained about that in
> the commit message.

I renamed this macro because in the page fault path,
flush_tlb_fix_spurious_fault() is called only when the fault is a
write fault (see
https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L5016).
I'll check if that fits the occurrence in mm/pgtable-generic.c too.

Thanks again for the review,

Alex

>
> >
> >   /*
> >    * ZERO_PAGE is a global shared page that is always zero: used
> > diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
> > index 430b208c0130..84439fe6ed29 100644
> > --- a/arch/mips/include/asm/pgtable.h
> > +++ b/arch/mips/include/asm/pgtable.h
> > @@ -478,9 +478,9 @@ static inline pgprot_t pgprot_writecombine(pgprot_t _prot)
> >       return __pgprot(prot);
> >   }
> >
> > -static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
> > -                                             unsigned long address,
> > -                                             pte_t *ptep)
> > +static inline void flush_tlb_fix_spurious_write_fault(struct vm_area_struct *vma,
> > +                                                   unsigned long address,
> > +                                                   pte_t *ptep)
> >   {
> >   }
> >
> > diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h b/arch/powerpc/include/asm/book3s/64/tlbflush.h
> > index 1950c1b825b4..7166d56f90db 100644
> > --- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
> > +++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
> > @@ -128,10 +128,10 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
> >   #define flush_tlb_page(vma, addr)   local_flush_tlb_page(vma, addr)
> >   #endif /* CONFIG_SMP */
> >
> > -#define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault
> > -static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
> > -                                             unsigned long address,
> > -                                             pte_t *ptep)
> > +#define flush_tlb_fix_spurious_write_fault flush_tlb_fix_spurious_write_fault
> > +static inline void flush_tlb_fix_spurious_write_fault(struct vm_area_struct *vma,
> > +                                                   unsigned long address,
> > +                                                   pte_t *ptep)
> >   {
> >       /*
> >        * Book3S 64 does not require spurious fault flushes because the PTE
> > diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> > index b2ba3f79cfe9..89aa5650f104 100644
> > --- a/arch/riscv/include/asm/pgtable.h
> > +++ b/arch/riscv/include/asm/pgtable.h
> > @@ -472,28 +472,20 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
> >               struct vm_area_struct *vma, unsigned long address,
> >               pte_t *ptep, unsigned int nr)
> >   {
> > -     /*
> > -      * The kernel assumes that TLBs don't cache invalid entries, but
> > -      * in RISC-V, SFENCE.VMA specifies an ordering constraint, not a
> > -      * cache flush; it is necessary even after writing invalid entries.
> > -      * Relying on flush_tlb_fix_spurious_fault would suffice, but
> > -      * the extra traps reduce performance.  So, eagerly SFENCE.VMA.
> > -      */
> > -     while (nr--)
> > -             local_flush_tlb_page(address + nr * PAGE_SIZE);
> >   }
> >   #define update_mmu_cache(vma, addr, ptep) \
> >       update_mmu_cache_range(NULL, vma, addr, ptep, 1)
> >
> >   #define __HAVE_ARCH_UPDATE_MMU_TLB
> > -#define update_mmu_tlb update_mmu_cache
> > +static inline void update_mmu_tlb(struct vm_area_struct *vma,
> > +                               unsigned long address, pte_t *ptep)
> > +{
> > +     flush_tlb_range(vma, address, address + PAGE_SIZE);
> > +}
> >
> >   static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
> >               unsigned long address, pmd_t *pmdp)
> >   {
> > -     pte_t *ptep = (pte_t *)pmdp;
> > -
> > -     update_mmu_cache(vma, address, ptep);
> >   }
> >
> >   #define __HAVE_ARCH_PTE_SAME
> > @@ -548,13 +540,26 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
> >                                       unsigned long address, pte_t *ptep,
> >                                       pte_t entry, int dirty)
> >   {
> > -     if (!pte_same(*ptep, entry))
> > +     if (!pte_same(*ptep, entry)) {
> >               __set_pte_at(ptep, entry);
> > -     /*
> > -      * update_mmu_cache will unconditionally execute, handling both
> > -      * the case that the PTE changed and the spurious fault case.
> > -      */
> > -     return true;
> > +             /* Here only not svadu is impacted */
> > +             flush_tlb_page(vma, address);
> > +             return true;
> > +     }
> > +
> > +     return false;
> > +}
> > +
> > +extern u64 nr_sfence_vma_handle_exception;
> > +extern bool tlb_caching_invalid_entries;
> > +
> > +#define flush_tlb_fix_spurious_read_fault flush_tlb_fix_spurious_read_fault
> > +static inline void flush_tlb_fix_spurious_read_fault(struct vm_area_struct *vma,
> > +                                                  unsigned long address,
> > +                                                  pte_t *ptep)
> > +{
> > +     if (tlb_caching_invalid_entries)
> > +             flush_tlb_page(vma, address);
> >   }
> >
> >   #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index af7639c3b0a3..7abaf42ef612 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -931,8 +931,12 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> >   # define pte_accessible(mm, pte)    ((void)(pte), 1)
> >   #endif
> >
> > -#ifndef flush_tlb_fix_spurious_fault
> > -#define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address)
> > +#ifndef flush_tlb_fix_spurious_write_fault
> > +#define flush_tlb_fix_spurious_write_fault(vma, address, ptep) flush_tlb_page(vma, address)
> > +#endif
> > +
> > +#ifndef flush_tlb_fix_spurious_read_fault
> > +#define flush_tlb_fix_spurious_read_fault(vma, address, ptep)
> >   #endif
> >
> >   /*
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 517221f01303..5cb0ccf0c03f 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5014,8 +5014,16 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> >                * with threads.
> >                */
> >               if (vmf->flags & FAULT_FLAG_WRITE)
> > -                     flush_tlb_fix_spurious_fault(vmf->vma, vmf->address,
> > -                                                  vmf->pte);
> > +                     flush_tlb_fix_spurious_write_fault(vmf->vma, vmf->address,
> > +                                                        vmf->pte);
> > +             else
> > +                     /*
> > +                      * With the pte_same(ptep_get(vmf->pte), entry) check
> > +                      * that calls update_mmu_tlb() above, multiple threads
> > +                      * faulting at the same time won't get there.
> > +                      */
> > +                     flush_tlb_fix_spurious_read_fault(vmf->vma, vmf->address,
> > +                                                       vmf->pte);
> >       }
> >   unlock:
> >       pte_unmap_unlock(vmf->pte, vmf->ptl);

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH RFC/RFT 4/4] TEMP: riscv: Add debugfs interface to retrieve #sfence.vma
  2023-12-07 15:03 [PATCH RFC/RFT 0/4] Remove preventive sfence.vma Alexandre Ghiti
                   ` (2 preceding siblings ...)
  2023-12-07 15:03 ` [PATCH RFC/RFT 3/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings Alexandre Ghiti
@ 2023-12-07 15:03 ` Alexandre Ghiti
  3 siblings, 0 replies; 11+ messages in thread
From: Alexandre Ghiti @ 2023-12-07 15:03 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm
  Cc: Alexandre Ghiti

This is useful for testing/benchmarking.

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
 arch/riscv/include/asm/pgtable.h  |  6 ++++--
 arch/riscv/include/asm/tlbflush.h |  4 ++++
 arch/riscv/kernel/sbi.c           | 12 ++++++++++++
 arch/riscv/mm/tlbflush.c          | 17 +++++++++++++++++
 4 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 89aa5650f104..b0855a620cfd 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -550,7 +550,7 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
 	return false;
 }
 
-extern u64 nr_sfence_vma_handle_exception;
+extern u64 nr_sfence_vma_spurious_read;
 extern bool tlb_caching_invalid_entries;
 
 #define flush_tlb_fix_spurious_read_fault flush_tlb_fix_spurious_read_fault
@@ -558,8 +558,10 @@ static inline void flush_tlb_fix_spurious_read_fault(struct vm_area_struct *vma,
 						     unsigned long address,
 						     pte_t *ptep)
 {
-	if (tlb_caching_invalid_entries)
+	if (tlb_caching_invalid_entries) {
+		__sync_fetch_and_add(&nr_sfence_vma_spurious_read, 1UL);
 		flush_tlb_page(vma, address);
+	}
 }
 
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
index a09196f8de68..f419ec9d2207 100644
--- a/arch/riscv/include/asm/tlbflush.h
+++ b/arch/riscv/include/asm/tlbflush.h
@@ -14,14 +14,18 @@
 #ifdef CONFIG_MMU
 extern unsigned long asid_mask;
 
+extern u64 nr_sfence_vma, nr_sfence_vma_all, nr_sfence_vma_all_asid;
+
 static inline void local_flush_tlb_all(void)
 {
+	__sync_fetch_and_add(&nr_sfence_vma_all, 1UL);
 	__asm__ __volatile__ ("sfence.vma" : : : "memory");
 }
 
 /* Flush one page from local TLB */
 static inline void local_flush_tlb_page(unsigned long addr)
 {
+	__sync_fetch_and_add(&nr_sfence_vma, 1UL);
 	ALT_FLUSH_TLB_PAGE(__asm__ __volatile__ ("sfence.vma %0" : : "r" (addr) : "memory"));
 }
 #else /* CONFIG_MMU */
diff --git a/arch/riscv/kernel/sbi.c b/arch/riscv/kernel/sbi.c
index c672c8ba9a2a..ac1617759583 100644
--- a/arch/riscv/kernel/sbi.c
+++ b/arch/riscv/kernel/sbi.c
@@ -376,6 +376,8 @@ int sbi_remote_fence_i(const struct cpumask *cpu_mask)
 }
 EXPORT_SYMBOL(sbi_remote_fence_i);
 
+extern u64 nr_sfence_vma, nr_sfence_vma_all, nr_sfence_vma_all_asid;
+
 /**
  * sbi_remote_sfence_vma() - Execute SFENCE.VMA instructions on given remote
  *			     harts for the specified virtual address range.
@@ -389,6 +391,11 @@ int sbi_remote_sfence_vma(const struct cpumask *cpu_mask,
 			   unsigned long start,
 			   unsigned long size)
 {
+	if (size == (unsigned long)-1)
+		__sync_fetch_and_add(&nr_sfence_vma_all, 1UL);
+	else
+		__sync_fetch_and_add(&nr_sfence_vma, ALIGN(size, PAGE_SIZE) / PAGE_SIZE);
+
 	return __sbi_rfence(SBI_EXT_RFENCE_REMOTE_SFENCE_VMA,
 			    cpu_mask, start, size, 0, 0);
 }
@@ -410,6 +417,11 @@ int sbi_remote_sfence_vma_asid(const struct cpumask *cpu_mask,
 				unsigned long size,
 				unsigned long asid)
 {
+	if (size == (unsigned long)-1)
+		__sync_fetch_and_add(&nr_sfence_vma_all_asid, 1UL);
+	else
+		__sync_fetch_and_add(&nr_sfence_vma, ALIGN(size, PAGE_SIZE) / PAGE_SIZE);
+
 	return __sbi_rfence(SBI_EXT_RFENCE_REMOTE_SFENCE_VMA_ASID,
 			    cpu_mask, start, size, asid, 0);
 }
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 77be59aadc73..75a3e2dff16a 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -3,11 +3,16 @@
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/sched.h>
+#include <linux/debugfs.h>
 #include <asm/sbi.h>
 #include <asm/mmu_context.h>
 
+u64 nr_sfence_vma, nr_sfence_vma_all, nr_sfence_vma_all_asid,
+	nr_sfence_vma_handle_exception, nr_sfence_vma_spurious_read;
+
 static inline void local_flush_tlb_all_asid(unsigned long asid)
 {
+	__sync_fetch_and_add(&nr_sfence_vma_all_asid, 1);
 	__asm__ __volatile__ ("sfence.vma x0, %0"
 			:
 			: "r" (asid)
@@ -17,6 +22,7 @@ static inline void local_flush_tlb_all_asid(unsigned long asid)
 static inline void local_flush_tlb_page_asid(unsigned long addr,
 		unsigned long asid)
 {
+	__sync_fetch_and_add(&nr_sfence_vma, 1);
 	__asm__ __volatile__ ("sfence.vma %0, %1"
 			:
 			: "r" (addr), "r" (asid)
@@ -149,3 +155,14 @@ void flush_pmd_tlb_range(struct vm_area_struct *vma, unsigned long start,
 	__flush_tlb_range(vma->vm_mm, start, end - start, PMD_SIZE);
 }
 #endif
+
+static int debugfs_nr_sfence_vma(void)
+{
+	debugfs_create_u64("nr_sfence_vma", 0444, NULL, &nr_sfence_vma);
+	debugfs_create_u64("nr_sfence_vma_all", 0444, NULL, &nr_sfence_vma_all);
+	debugfs_create_u64("nr_sfence_vma_all_asid", 0444, NULL, &nr_sfence_vma_all_asid);
+	debugfs_create_u64("nr_sfence_vma_handle_exception", 0444, NULL, &nr_sfence_vma_handle_exception);
+	debugfs_create_u64("nr_sfence_vma_spurious_read", 0444, NULL, &nr_sfence_vma_spurious_read);
+	return 0;
+}
+device_initcall(debugfs_nr_sfence_vma);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2023-12-08 14:39 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-12-07 15:03 [PATCH RFC/RFT 0/4] Remove preventive sfence.vma Alexandre Ghiti
2023-12-07 15:03 ` [PATCH RFC/RFT 1/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings Alexandre Ghiti
2023-12-07 15:52   ` Christophe Leroy
2023-12-08 14:28     ` Alexandre Ghiti
2023-12-07 15:03 ` [PATCH RFC/RFT 2/4] riscv: Add a runtime detection of invalid TLB entries caching Alexandre Ghiti
2023-12-07 15:55   ` Christophe Leroy
2023-12-08 14:30     ` Alexandre Ghiti
2023-12-07 15:03 ` [PATCH RFC/RFT 3/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings Alexandre Ghiti
2023-12-07 16:37   ` Christophe Leroy
2023-12-08 14:39     ` Alexandre Ghiti
2023-12-07 15:03 ` [PATCH RFC/RFT 4/4] TEMP: riscv: Add debugfs interface to retrieve #sfence.vma Alexandre Ghiti

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).