* [PATCH v3 0/4] Svvptc extension to remove preventive sfence.vma
@ 2024-07-02 8:50 Alexandre Ghiti
2024-07-02 8:50 ` [PATCH v3 1/4] riscv: Add ISA extension parsing for Svvptc Alexandre Ghiti
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: Alexandre Ghiti @ 2024-07-02 8:50 UTC (permalink / raw)
To: Conor Dooley, Rob Herring, Krzysztof Kozlowski, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Ved Shanbhogue, Matt Evans,
linux-kernel, linux-riscv, devicetree
Cc: Alexandre Ghiti
In RISC-V, after a new mapping is established, a sfence.vma needs to be
emitted for different reasons:
- if the uarch caches invalid entries, we need to invalidate them, otherwise
we would trap on a stale invalid entry,
- if the uarch does not cache invalid entries, a reordered access could fail
to see the new mapping and then trap (sfence.vma acts as a fence).
We can actually avoid emitting those (mostly) useless and costly sfence.vma
by handling the traps instead:
- for new kernel mappings: only vmalloc mappings need to be taken care of,
other new mappings are rare and already emit the required sfence.vma if
needed. That must be done very early in the exception path, as explained in
patch 3, and it also fixes our fragile way of dealing with vmalloc faults
(see the C-level sketch below).
- for new user mappings: Svvptc makes update_mmu_cache() a no-op, but we can
take some gratuitous page faults (which are very unlikely though).
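As an illustration, here is a C-level sketch of the vmalloc trap handling
added in patch 3 (the real implementation is RISC-V assembly at the very
top of handle_exception, before the kernel stack is touched; the
cause_is_page_fault(), is_kernel_address() and retry_faulting_access()
helpers below are made up for this sketch):

    /* new_vmalloc is a cpu bitmap set by flush_cache_vmap() */
    if (cause_is_page_fault(cause) && is_kernel_address(tval) &&
        test_and_clear_bit(cpu, new_vmalloc)) {
            /* A new vmalloc mapping can explain this trap */
            local_flush_tlb_all();    /* skipped when Svvptc is present */
            retry_faulting_access();  /* sret, without entering the usual fault path */
    }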
Patches 1 and 2 introduce Svvptc extension probing.
On our uarch, which does not cache invalid entries, and with a 6.5 kernel,
the gains are measurable:
* Kernel boot: 6%
* ltp - mmapstress01: 8%
* lmbench - lat_pagefault: 20%
* lmbench - lat_mmap: 5%
Here are the corresponding numbers of sfence.vma emitted:
* Ubuntu boot to login:
Before: ~630k sfence.vma
After: ~200k sfence.vma
* ltp - mmapstress01
Before: ~45k
After: ~6.3k
* lmbench - lat_pagefault
Before: ~665k
After: 832 (!)
* lmbench - lat_mmap
Before: ~546k
After: 718 (!)
Thanks to Ved and Matt Evans for triggering the discussion that led to
this patchset!
Any feedback, tests or relevant benchmarks are welcome :)
Changes in v3:
- Rebase on top of 6.10
- Remove the comment about xRET acting as a fence which is not part of
the ratified specification
- Add #sfence.vma to the cover letter (Andrea)
- Remove the RFC tag as Svvptc was ratified on June 28th, 2024
Changes in v2:
- Rebase on top of 6.8-rc1
- Remove patch with runtime detection of tlb caching and debugfs patch
- Add patch that probes Svvptc
- Add patch that defines the new Svvptc dt-binding
- Leave the behaviour as-is for uarchs that cache invalid TLB entries since
I don't have any good perf numbers
- Address comments from Christoph on v1
- Fix a race condition in new_vmalloc update:
ld a2, 0(a0) <= this could load something which is != -1
not a1, a1 <= here, or in the instruction after, flush_cache_vmap()
could set the whole bitmap to 1
and a1, a2, a1
sd a1, 0(a0) <= here we would clear bits that should not be cleared!
Instead, replace the whole sequence with:
amoxor.w a0, a1, (a0)
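In C terms, the race is roughly the following (sketch only, using the GCC
__atomic builtin instead of the actual assembly):

    static void racy_clear_cpu_bit(unsigned long *word, unsigned long mask)
    {
            unsigned long val = *word; /* may observe a value != -1      */
                                       /* <- flush_cache_vmap() can set  */
                                       /*    all bits of the word here   */
            *word = val & ~mask;       /* ...and this store would lose them */
    }

    static void atomic_clear_cpu_bit(unsigned long *word, unsigned long mask)
    {
            /* Atomically flip only this cpu's bit, which is known to be set */
            __atomic_fetch_xor(word, mask, __ATOMIC_RELAXED);
    }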
Alexandre Ghiti (4):
riscv: Add ISA extension parsing for Svvptc
dt-bindings: riscv: Add Svvptc ISA extension description
riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
riscv: Stop emitting preventive sfence.vma for new userspace mappings
with Svvptc
.../devicetree/bindings/riscv/extensions.yaml | 7 ++
arch/riscv/include/asm/cacheflush.h | 18 +++-
arch/riscv/include/asm/hwcap.h | 1 +
arch/riscv/include/asm/pgtable.h | 16 +++-
arch/riscv/include/asm/thread_info.h | 5 ++
arch/riscv/kernel/asm-offsets.c | 5 ++
arch/riscv/kernel/cpufeature.c | 1 +
arch/riscv/kernel/entry.S | 84 +++++++++++++++++++
arch/riscv/mm/init.c | 2 +
arch/riscv/mm/pgtable.c | 13 +++
10 files changed, 150 insertions(+), 2 deletions(-)
--
2.39.2
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v3 1/4] riscv: Add ISA extension parsing for Svvptc
2024-07-02 8:50 [PATCH v3 0/4] Svvptc extension to remove preventive sfence.vma Alexandre Ghiti
@ 2024-07-02 8:50 ` Alexandre Ghiti
2024-07-02 15:02 ` Conor Dooley
2024-07-02 8:50 ` [PATCH v3 2/4] dt-bindings: riscv: Add Svvptc ISA extension description Alexandre Ghiti
` (2 subsequent siblings)
3 siblings, 1 reply; 13+ messages in thread
From: Alexandre Ghiti @ 2024-07-02 8:50 UTC (permalink / raw)
To: Conor Dooley, Rob Herring, Krzysztof Kozlowski, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Ved Shanbhogue, Matt Evans,
linux-kernel, linux-riscv, devicetree
Cc: Alexandre Ghiti
Add support to parse the Svvptc string in the riscv,isa string.
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
arch/riscv/include/asm/hwcap.h | 1 +
arch/riscv/kernel/cpufeature.c | 1 +
2 files changed, 2 insertions(+)
diff --git a/arch/riscv/include/asm/hwcap.h b/arch/riscv/include/asm/hwcap.h
index e17d0078a651..6dd0dd8beb30 100644
--- a/arch/riscv/include/asm/hwcap.h
+++ b/arch/riscv/include/asm/hwcap.h
@@ -81,6 +81,7 @@
#define RISCV_ISA_EXT_ZTSO 72
#define RISCV_ISA_EXT_ZACAS 73
#define RISCV_ISA_EXT_XANDESPMU 74
+#define RISCV_ISA_EXT_SVVPTC 75
#define RISCV_ISA_EXT_XLINUXENVCFG 127
diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
index 5ef48cb20ee1..60780d246743 100644
--- a/arch/riscv/kernel/cpufeature.c
+++ b/arch/riscv/kernel/cpufeature.c
@@ -305,6 +305,7 @@ const struct riscv_isa_ext_data riscv_isa_ext[] = {
__RISCV_ISA_EXT_DATA(svnapot, RISCV_ISA_EXT_SVNAPOT),
__RISCV_ISA_EXT_DATA(svpbmt, RISCV_ISA_EXT_SVPBMT),
__RISCV_ISA_EXT_DATA(xandespmu, RISCV_ISA_EXT_XANDESPMU),
+ __RISCV_ISA_EXT_DATA(svvptc, RISCV_ISA_EXT_SVVPTC),
};
const size_t riscv_isa_ext_count = ARRAY_SIZE(riscv_isa_ext);
--
2.39.2
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v3 2/4] dt-bindings: riscv: Add Svvptc ISA extension description
2024-07-02 8:50 [PATCH v3 0/4] Svvptc extension to remove preventive sfence.vma Alexandre Ghiti
2024-07-02 8:50 ` [PATCH v3 1/4] riscv: Add ISA extension parsing for Svvptc Alexandre Ghiti
@ 2024-07-02 8:50 ` Alexandre Ghiti
2024-07-02 15:02 ` Conor Dooley
2024-07-02 8:50 ` [PATCH v3 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings Alexandre Ghiti
2024-07-02 8:50 ` [PATCH v3 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc Alexandre Ghiti
3 siblings, 1 reply; 13+ messages in thread
From: Alexandre Ghiti @ 2024-07-02 8:50 UTC (permalink / raw)
To: Conor Dooley, Rob Herring, Krzysztof Kozlowski, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Ved Shanbhogue, Matt Evans,
linux-kernel, linux-riscv, devicetree
Cc: Alexandre Ghiti
Add description for the Svvptc ISA extension which was ratified recently.
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
Documentation/devicetree/bindings/riscv/extensions.yaml | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/Documentation/devicetree/bindings/riscv/extensions.yaml b/Documentation/devicetree/bindings/riscv/extensions.yaml
index 468c646247aa..b52375bea512 100644
--- a/Documentation/devicetree/bindings/riscv/extensions.yaml
+++ b/Documentation/devicetree/bindings/riscv/extensions.yaml
@@ -171,6 +171,13 @@ properties:
memory types as ratified in the 20191213 version of the privileged
ISA specification.
+ - const: svvptc
+ description:
+ The standard Svvptc supervisor-level extension for
+ address-translation cache behaviour with respect to invalid entries
+ as ratified at commit 4a69197e5617 ("Update to ratified state") of
+ riscv-svvptc.
+
- const: zacas
description: |
The Zacas extension for Atomic Compare-and-Swap (CAS) instructions
--
2.39.2
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v3 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
2024-07-02 8:50 [PATCH v3 0/4] Svvptc extension to remove preventive sfence.vma Alexandre Ghiti
2024-07-02 8:50 ` [PATCH v3 1/4] riscv: Add ISA extension parsing for Svvptc Alexandre Ghiti
2024-07-02 8:50 ` [PATCH v3 2/4] dt-bindings: riscv: Add Svvptc ISA extension description Alexandre Ghiti
@ 2024-07-02 8:50 ` Alexandre Ghiti
2024-07-02 9:48 ` [External] " yunhui cui
2024-07-02 10:12 ` Anup Patel
2024-07-02 8:50 ` [PATCH v3 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc Alexandre Ghiti
3 siblings, 2 replies; 13+ messages in thread
From: Alexandre Ghiti @ 2024-07-02 8:50 UTC (permalink / raw)
To: Conor Dooley, Rob Herring, Krzysztof Kozlowski, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Ved Shanbhogue, Matt Evans,
linux-kernel, linux-riscv, devicetree
Cc: Alexandre Ghiti
In 6.5, we removed the vmalloc fault path because that can't work (see
[1] [2]). Then in order to make sure that new page table entries were
seen by the page table walker, we had to preventively emit a sfence.vma
on all harts [3], but this solution is very costly since it relies on IPIs.
And even there, we could end up in a loop of vmalloc faults if a vmalloc
allocation is done in the IPI path (for example if it is traced, see
[4]), which could result in a kernel stack overflow.
Those preventive sfence.vma needed to be emitted because:
- if the uarch caches invalid entries, the new mapping may not be
observed by the page table walker and an invalidation may be needed.
- if the uarch does not cache invalid entries, a reordered access
could "miss" the new mapping and traps: in that case, we would actually
only need to retry the access, no sfence.vma is required.
So this patch removes those preventive sfence.vma and actually handles
the possible (and unlikely) exceptions. And since the kernel stack
mappings lie in the vmalloc area, this handling must be done very early
when the trap is taken, at the very beginning of handle_exception: this
also rules out the vmalloc allocations in the fault path.
Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1]
Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2]
Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3]
Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4]
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
arch/riscv/include/asm/cacheflush.h | 18 +++++-
arch/riscv/include/asm/thread_info.h | 5 ++
arch/riscv/kernel/asm-offsets.c | 5 ++
arch/riscv/kernel/entry.S | 84 ++++++++++++++++++++++++++++
arch/riscv/mm/init.c | 2 +
5 files changed, 113 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
index ce79c558a4c8..8de73f91bfa3 100644
--- a/arch/riscv/include/asm/cacheflush.h
+++ b/arch/riscv/include/asm/cacheflush.h
@@ -46,7 +46,23 @@ do { \
} while (0)
#ifdef CONFIG_64BIT
-#define flush_cache_vmap(start, end) flush_tlb_kernel_range(start, end)
+extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
+extern char _end[];
+#define flush_cache_vmap flush_cache_vmap
+static inline void flush_cache_vmap(unsigned long start, unsigned long end)
+{
+ if (is_vmalloc_or_module_addr((void *)start)) {
+ int i;
+
+ /*
+ * We don't care if concurrently a cpu resets this value since
+ * the only place this can happen is in handle_exception() where
+ * an sfence.vma is emitted.
+ */
+ for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
+ new_vmalloc[i] = -1ULL;
+ }
+}
#define flush_cache_vmap_early(start, end) local_flush_tlb_kernel_range(start, end)
#endif
diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
index 5d473343634b..32631acdcdd4 100644
--- a/arch/riscv/include/asm/thread_info.h
+++ b/arch/riscv/include/asm/thread_info.h
@@ -60,6 +60,11 @@ struct thread_info {
void *scs_base;
void *scs_sp;
#endif
+ /*
+ * Used in handle_exception() to save a0, a1 and a2 before knowing if we
+ * can access the kernel stack.
+ */
+ unsigned long a0, a1, a2;
};
#ifdef CONFIG_SHADOW_CALL_STACK
diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
index b09ca5f944f7..29c0734f2972 100644
--- a/arch/riscv/kernel/asm-offsets.c
+++ b/arch/riscv/kernel/asm-offsets.c
@@ -36,6 +36,8 @@ void asm_offsets(void)
OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
+
+ OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
@@ -43,6 +45,9 @@ void asm_offsets(void)
#ifdef CONFIG_SHADOW_CALL_STACK
OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
#endif
+ OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
+ OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
+ OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu);
OFFSET(TASK_THREAD_F0, task_struct, thread.fstate.f[0]);
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 68a24cf9481a..822311266a12 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -19,6 +19,78 @@
.section .irqentry.text, "ax"
+.macro new_vmalloc_check
+ REG_S a0, TASK_TI_A0(tp)
+ REG_S a1, TASK_TI_A1(tp)
+ REG_S a2, TASK_TI_A2(tp)
+
+ csrr a0, CSR_CAUSE
+ /* Exclude IRQs */
+ blt a0, zero, _new_vmalloc_restore_context
+ /* Only check new_vmalloc if we are in page/protection fault */
+ li a1, EXC_LOAD_PAGE_FAULT
+ beq a0, a1, _new_vmalloc_kernel_address
+ li a1, EXC_STORE_PAGE_FAULT
+ beq a0, a1, _new_vmalloc_kernel_address
+ li a1, EXC_INST_PAGE_FAULT
+ bne a0, a1, _new_vmalloc_restore_context
+
+_new_vmalloc_kernel_address:
+ /* Is it a kernel address? */
+ csrr a0, CSR_TVAL
+ bge a0, zero, _new_vmalloc_restore_context
+
+ /* Check if a new vmalloc mapping appeared that could explain the trap */
+
+ /*
+ * Computes:
+ * a0 = &new_vmalloc[BIT_WORD(cpu)]
+ * a1 = BIT_MASK(cpu)
+ */
+ REG_L a2, TASK_TI_CPU(tp)
+ /*
+ * Compute the new_vmalloc element position:
+ * (cpu / 64) * 8 = (cpu >> 6) << 3
+ */
+ srli a1, a2, 6
+ slli a1, a1, 3
+ la a0, new_vmalloc
+ add a0, a0, a1
+ /*
+ * Compute the bit position in the new_vmalloc element:
+ * bit_pos = cpu % 64 = cpu - (cpu / 64) * 64 = cpu - (cpu >> 6) << 6
+ * = cpu - ((cpu >> 6) << 3) << 3
+ */
+ slli a1, a1, 3
+ sub a1, a2, a1
+ /* Compute the "get mask": 1 << bit_pos */
+ li a2, 1
+ sll a1, a2, a1
+
+ /* Check the value of new_vmalloc for this cpu */
+ REG_L a2, 0(a0)
+ and a2, a2, a1
+ beq a2, zero, _new_vmalloc_restore_context
+
+ /* Atomically reset the current cpu bit in new_vmalloc */
+ amoxor.w a0, a1, (a0)
+
+ /* Only emit a sfence.vma if the uarch caches invalid entries */
+ ALTERNATIVE("sfence.vma", "nop", 0, RISCV_ISA_EXT_SVVPTC, 1)
+
+ REG_L a0, TASK_TI_A0(tp)
+ REG_L a1, TASK_TI_A1(tp)
+ REG_L a2, TASK_TI_A2(tp)
+ csrw CSR_SCRATCH, x0
+ sret
+
+_new_vmalloc_restore_context:
+ REG_L a0, TASK_TI_A0(tp)
+ REG_L a1, TASK_TI_A1(tp)
+ REG_L a2, TASK_TI_A2(tp)
+.endm
+
+
SYM_CODE_START(handle_exception)
/*
* If coming from userspace, preserve the user thread pointer and load
@@ -30,6 +102,18 @@ SYM_CODE_START(handle_exception)
.Lrestore_kernel_tpsp:
csrr tp, CSR_SCRATCH
+
+ /*
+ * The RISC-V kernel does not eagerly emit a sfence.vma after each
+ * new vmalloc mapping, which may result in exceptions:
+ * - if the uarch caches invalid entries, the new mapping would not be
+ * observed by the page table walker and an invalidation is needed.
+ * - if the uarch does not cache invalid entries, a reordered access
+ * could "miss" the new mapping and traps: in that case, we only need
+ * to retry the access, no sfence.vma is required.
+ */
+ new_vmalloc_check
+
REG_S sp, TASK_TI_KERNEL_SP(tp)
#ifdef CONFIG_VMAP_STACK
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index e3405e4b99af..2367a156c33b 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -36,6 +36,8 @@
#include "../kernel/head.h"
+u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
+
struct kernel_mapping kernel_map __ro_after_init;
EXPORT_SYMBOL(kernel_map);
#ifdef CONFIG_XIP_KERNEL
--
2.39.2
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v3 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc
2024-07-02 8:50 [PATCH v3 0/4] Svvptc extension to remove preventive sfence.vma Alexandre Ghiti
` (2 preceding siblings ...)
2024-07-02 8:50 ` [PATCH v3 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings Alexandre Ghiti
@ 2024-07-02 8:50 ` Alexandre Ghiti
2024-07-11 1:27 ` kernel test robot
2024-07-11 3:58 ` kernel test robot
3 siblings, 2 replies; 13+ messages in thread
From: Alexandre Ghiti @ 2024-07-02 8:50 UTC (permalink / raw)
To: Conor Dooley, Rob Herring, Krzysztof Kozlowski, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Ved Shanbhogue, Matt Evans,
linux-kernel, linux-riscv, devicetree
Cc: Alexandre Ghiti
The preventive sfence.vma were emitted because new mappings must be made
visible to the page table walker. Svvptc guarantees that this will happen
within a bounded timeframe, so there is no need to emit a sfence.vma on
uarchs that implement this extension: we will instead take gratuitous (but
very unlikely) page faults, similarly to x86 and arm64.
This drastically reduces the number of sfence.vma emitted:
* Ubuntu boot to login:
Before: ~630k sfence.vma
After: ~200k sfence.vma
* ltp - mmapstress01
Before: ~45k
After: ~6.3k
* lmbench - lat_pagefault
Before: ~665k
After: 832 (!)
* lmbench - lat_mmap
Before: ~546k
After: 718 (!)
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
arch/riscv/include/asm/pgtable.h | 16 +++++++++++++++-
arch/riscv/mm/pgtable.c | 13 +++++++++++++
2 files changed, 28 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index aad8b8ca51f1..816147e25ca9 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -476,6 +476,9 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
struct vm_area_struct *vma, unsigned long address,
pte_t *ptep, unsigned int nr)
{
+ asm goto(ALTERNATIVE("nop", "j %l[svvptc]", 0, RISCV_ISA_EXT_SVVPTC, 1)
+ : : : : svvptc);
+
/*
* The kernel assumes that TLBs don't cache invalid entries, but
* in RISC-V, SFENCE.VMA specifies an ordering constraint, not a
@@ -485,12 +488,23 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
*/
while (nr--)
local_flush_tlb_page(address + nr * PAGE_SIZE);
+
+svvptc:
+ /*
+ * Svvptc guarantees that the new valid pte will be visible within
+ * a bounded timeframe, so when the uarch does not cache invalid
+ * entries, we don't have to do anything.
+ */
}
#define update_mmu_cache(vma, addr, ptep) \
update_mmu_cache_range(NULL, vma, addr, ptep, 1)
#define __HAVE_ARCH_UPDATE_MMU_TLB
-#define update_mmu_tlb update_mmu_cache
+static inline void update_mmu_tlb(struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep)
+{
+ flush_tlb_range(vma, address, address + PAGE_SIZE);
+}
static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp)
diff --git a/arch/riscv/mm/pgtable.c b/arch/riscv/mm/pgtable.c
index 533ec9055fa0..4ae67324f992 100644
--- a/arch/riscv/mm/pgtable.c
+++ b/arch/riscv/mm/pgtable.c
@@ -9,6 +9,9 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
pte_t entry, int dirty)
{
+ asm goto(ALTERNATIVE("nop", "j %l[svvptc]", 0, RISCV_ISA_EXT_SVVPTC, 1)
+ : : : : svvptc);
+
if (!pte_same(ptep_get(ptep), entry))
__set_pte_at(vma->vm_mm, ptep, entry);
/*
@@ -16,6 +19,16 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
* the case that the PTE changed and the spurious fault case.
*/
return true;
+
+svvptc:
+ if (!pte_same(ptep_get(ptep), entry)) {
+ __set_pte_at(vma->vm_mm, ptep, entry);
+ /* Here only not svadu is impacted */
+ flush_tlb_page(vma, address);
+ return true;
+ }
+
+ return false;
}
int ptep_test_and_clear_young(struct vm_area_struct *vma,
--
2.39.2
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [External] [PATCH v3 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
2024-07-02 8:50 ` [PATCH v3 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings Alexandre Ghiti
@ 2024-07-02 9:48 ` yunhui cui
2024-07-02 12:36 ` Alexandre Ghiti
2024-07-02 10:12 ` Anup Patel
1 sibling, 1 reply; 13+ messages in thread
From: yunhui cui @ 2024-07-02 9:48 UTC (permalink / raw)
To: Alexandre Ghiti
Cc: Conor Dooley, Rob Herring, Krzysztof Kozlowski, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Ved Shanbhogue, Matt Evans,
linux-kernel, linux-riscv, devicetree
Hi Alexandre,
On Tue, Jul 2, 2024 at 4:54 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> In 6.5, we removed the vmalloc fault path because that can't work (see
> [1] [2]). Then in order to make sure that new page table entries were
> seen by the page table walker, we had to preventively emit a sfence.vma
> on all harts [3] but this solution is very costly since it relies on IPI.
>
> And even there, we could end up in a loop of vmalloc faults if a vmalloc
> allocation is done in the IPI path (for example if it is traced, see
> [4]), which could result in a kernel stack overflow.
>
> Those preventive sfence.vma needed to be emitted because:
>
> - if the uarch caches invalid entries, the new mapping may not be
> observed by the page table walker and an invalidation may be needed.
> - if the uarch does not cache invalid entries, a reordered access
> could "miss" the new mapping and traps: in that case, we would actually
> only need to retry the access, no sfence.vma is required.
>
> So this patch removes those preventive sfence.vma and actually handles
> the possible (and unlikely) exceptions. And since the kernel stacks
> mappings lie in the vmalloc area, this handling must be done very early
> when the trap is taken, at the very beginning of handle_exception: this
> also rules out the vmalloc allocations in the fault path.
>
> Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1]
> Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2]
> Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3]
> Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4]
> Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> ---
> arch/riscv/include/asm/cacheflush.h | 18 +++++-
> arch/riscv/include/asm/thread_info.h | 5 ++
> arch/riscv/kernel/asm-offsets.c | 5 ++
> arch/riscv/kernel/entry.S | 84 ++++++++++++++++++++++++++++
> arch/riscv/mm/init.c | 2 +
> 5 files changed, 113 insertions(+), 1 deletion(-)
>
> diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> index ce79c558a4c8..8de73f91bfa3 100644
> --- a/arch/riscv/include/asm/cacheflush.h
> +++ b/arch/riscv/include/asm/cacheflush.h
> @@ -46,7 +46,23 @@ do { \
> } while (0)
>
> #ifdef CONFIG_64BIT
> -#define flush_cache_vmap(start, end) flush_tlb_kernel_range(start, end)
> +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> +extern char _end[];
> +#define flush_cache_vmap flush_cache_vmap
> +static inline void flush_cache_vmap(unsigned long start, unsigned long end)
> +{
> + if (is_vmalloc_or_module_addr((void *)start)) {
> + int i;
> +
> + /*
> + * We don't care if concurrently a cpu resets this value since
> + * the only place this can happen is in handle_exception() where
> + * an sfence.vma is emitted.
> + */
> + for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
> + new_vmalloc[i] = -1ULL;
> + }
> +}
> #define flush_cache_vmap_early(start, end) local_flush_tlb_kernel_range(start, end)
> #endif
>
> diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
> index 5d473343634b..32631acdcdd4 100644
> --- a/arch/riscv/include/asm/thread_info.h
> +++ b/arch/riscv/include/asm/thread_info.h
> @@ -60,6 +60,11 @@ struct thread_info {
> void *scs_base;
> void *scs_sp;
> #endif
> + /*
> + * Used in handle_exception() to save a0, a1 and a2 before knowing if we
> + * can access the kernel stack.
> + */
> + unsigned long a0, a1, a2;
> };
>
> #ifdef CONFIG_SHADOW_CALL_STACK
> diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> index b09ca5f944f7..29c0734f2972 100644
> --- a/arch/riscv/kernel/asm-offsets.c
> +++ b/arch/riscv/kernel/asm-offsets.c
> @@ -36,6 +36,8 @@ void asm_offsets(void)
> OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
> OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
> OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
> +
> + OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
> OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
> OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
> OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
> @@ -43,6 +45,9 @@ void asm_offsets(void)
> #ifdef CONFIG_SHADOW_CALL_STACK
> OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
> #endif
> + OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
> + OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
> + OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
>
> OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu);
> OFFSET(TASK_THREAD_F0, task_struct, thread.fstate.f[0]);
> diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> index 68a24cf9481a..822311266a12 100644
> --- a/arch/riscv/kernel/entry.S
> +++ b/arch/riscv/kernel/entry.S
> @@ -19,6 +19,78 @@
>
> .section .irqentry.text, "ax"
>
> +.macro new_vmalloc_check
> + REG_S a0, TASK_TI_A0(tp)
> + REG_S a1, TASK_TI_A1(tp)
> + REG_S a2, TASK_TI_A2(tp)
We discussed in the previous version that when executing blt a0, zero,
_new_vmalloc_restore_context, there is no need to save a1, a2 first,
right?
> +
> + csrr a0, CSR_CAUSE
> + /* Exclude IRQs */
> + blt a0, zero, _new_vmalloc_restore_context
> + /* Only check new_vmalloc if we are in page/protection fault */
> + li a1, EXC_LOAD_PAGE_FAULT
> + beq a0, a1, _new_vmalloc_kernel_address
> + li a1, EXC_STORE_PAGE_FAULT
> + beq a0, a1, _new_vmalloc_kernel_address
> + li a1, EXC_INST_PAGE_FAULT
> + bne a0, a1, _new_vmalloc_restore_context
> +
> +_new_vmalloc_kernel_address:
> + /* Is it a kernel address? */
> + csrr a0, CSR_TVAL
> + bge a0, zero, _new_vmalloc_restore_context
> +
> + /* Check if a new vmalloc mapping appeared that could explain the trap */
> +
> + /*
> + * Computes:
> + * a0 = &new_vmalloc[BIT_WORD(cpu)]
> + * a1 = BIT_MASK(cpu)
> + */
> + REG_L a2, TASK_TI_CPU(tp)
> + /*
> + * Compute the new_vmalloc element position:
> + * (cpu / 64) * 8 = (cpu >> 6) << 3
> + */
> + srli a1, a2, 6
> + slli a1, a1, 3
> + la a0, new_vmalloc
> + add a0, a0, a1
> + /*
> + * Compute the bit position in the new_vmalloc element:
> + * bit_pos = cpu % 64 = cpu - (cpu / 64) * 64 = cpu - (cpu >> 6) << 6
> + * = cpu - ((cpu >> 6) << 3) << 3
> + */
> + slli a1, a1, 3
> + sub a1, a2, a1
> + /* Compute the "get mask": 1 << bit_pos */
> + li a2, 1
> + sll a1, a2, a1
> +
> + /* Check the value of new_vmalloc for this cpu */
> + REG_L a2, 0(a0)
> + and a2, a2, a1
> + beq a2, zero, _new_vmalloc_restore_context
> +
> + /* Atomically reset the current cpu bit in new_vmalloc */
> + amoxor.w a0, a1, (a0)
> +
> + /* Only emit a sfence.vma if the uarch caches invalid entries */
> + ALTERNATIVE("sfence.vma", "nop", 0, RISCV_ISA_EXT_SVVPTC, 1)
> +
> + REG_L a0, TASK_TI_A0(tp)
> + REG_L a1, TASK_TI_A1(tp)
> + REG_L a2, TASK_TI_A2(tp)
> + csrw CSR_SCRATCH, x0
> + sret
> +
> +_new_vmalloc_restore_context:
> + REG_L a0, TASK_TI_A0(tp)
> + REG_L a1, TASK_TI_A1(tp)
> + REG_L a2, TASK_TI_A2(tp)
> +.endm
> +
> +
> SYM_CODE_START(handle_exception)
> /*
> * If coming from userspace, preserve the user thread pointer and load
> @@ -30,6 +102,18 @@ SYM_CODE_START(handle_exception)
>
> .Lrestore_kernel_tpsp:
> csrr tp, CSR_SCRATCH
> +
> + /*
> + * The RISC-V kernel does not eagerly emit a sfence.vma after each
> + * new vmalloc mapping, which may result in exceptions:
> + * - if the uarch caches invalid entries, the new mapping would not be
> + * observed by the page table walker and an invalidation is needed.
> + * - if the uarch does not cache invalid entries, a reordered access
> + * could "miss" the new mapping and traps: in that case, we only need
> + * to retry the access, no sfence.vma is required.
> + */
> + new_vmalloc_check
> +
> REG_S sp, TASK_TI_KERNEL_SP(tp)
>
> #ifdef CONFIG_VMAP_STACK
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index e3405e4b99af..2367a156c33b 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -36,6 +36,8 @@
>
> #include "../kernel/head.h"
>
> +u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> +
> struct kernel_mapping kernel_map __ro_after_init;
> EXPORT_SYMBOL(kernel_map);
> #ifdef CONFIG_XIP_KERNEL
> --
> 2.39.2
>
Thanks,
Yunhui
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
2024-07-02 8:50 ` [PATCH v3 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings Alexandre Ghiti
2024-07-02 9:48 ` [External] " yunhui cui
@ 2024-07-02 10:12 ` Anup Patel
2024-07-02 12:40 ` Alexandre Ghiti
1 sibling, 1 reply; 13+ messages in thread
From: Anup Patel @ 2024-07-02 10:12 UTC (permalink / raw)
To: Alexandre Ghiti
Cc: Conor Dooley, Rob Herring, Krzysztof Kozlowski, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Ved Shanbhogue, Matt Evans,
linux-kernel, linux-riscv, devicetree
On Tue, Jul 2, 2024 at 2:24 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> In 6.5, we removed the vmalloc fault path because that can't work (see
> [1] [2]). Then in order to make sure that new page table entries were
> seen by the page table walker, we had to preventively emit a sfence.vma
> on all harts [3] but this solution is very costly since it relies on IPI.
>
> And even there, we could end up in a loop of vmalloc faults if a vmalloc
> allocation is done in the IPI path (for example if it is traced, see
> [4]), which could result in a kernel stack overflow.
>
> Those preventive sfence.vma needed to be emitted because:
>
> - if the uarch caches invalid entries, the new mapping may not be
> observed by the page table walker and an invalidation may be needed.
> - if the uarch does not cache invalid entries, a reordered access
> could "miss" the new mapping and traps: in that case, we would actually
> only need to retry the access, no sfence.vma is required.
>
> So this patch removes those preventive sfence.vma and actually handles
> the possible (and unlikely) exceptions. And since the kernel stacks
> mappings lie in the vmalloc area, this handling must be done very early
> when the trap is taken, at the very beginning of handle_exception: this
> also rules out the vmalloc allocations in the fault path.
>
> Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1]
> Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2]
> Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3]
> Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4]
> Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> ---
> arch/riscv/include/asm/cacheflush.h | 18 +++++-
> arch/riscv/include/asm/thread_info.h | 5 ++
> arch/riscv/kernel/asm-offsets.c | 5 ++
> arch/riscv/kernel/entry.S | 84 ++++++++++++++++++++++++++++
> arch/riscv/mm/init.c | 2 +
> 5 files changed, 113 insertions(+), 1 deletion(-)
>
> diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> index ce79c558a4c8..8de73f91bfa3 100644
> --- a/arch/riscv/include/asm/cacheflush.h
> +++ b/arch/riscv/include/asm/cacheflush.h
> @@ -46,7 +46,23 @@ do { \
> } while (0)
>
> #ifdef CONFIG_64BIT
> -#define flush_cache_vmap(start, end) flush_tlb_kernel_range(start, end)
> +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
Why is this u64 and not "unsigned long" ?
Was this tested on rv32 ?
> +extern char _end[];
> +#define flush_cache_vmap flush_cache_vmap
> +static inline void flush_cache_vmap(unsigned long start, unsigned long end)
> +{
> + if (is_vmalloc_or_module_addr((void *)start)) {
> + int i;
> +
> + /*
> + * We don't care if concurrently a cpu resets this value since
> + * the only place this can happen is in handle_exception() where
> + * an sfence.vma is emitted.
> + */
> + for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
> + new_vmalloc[i] = -1ULL;
> + }
> +}
> #define flush_cache_vmap_early(start, end) local_flush_tlb_kernel_range(start, end)
> #endif
>
> diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
> index 5d473343634b..32631acdcdd4 100644
> --- a/arch/riscv/include/asm/thread_info.h
> +++ b/arch/riscv/include/asm/thread_info.h
> @@ -60,6 +60,11 @@ struct thread_info {
> void *scs_base;
> void *scs_sp;
> #endif
> + /*
> + * Used in handle_exception() to save a0, a1 and a2 before knowing if we
> + * can access the kernel stack.
> + */
> + unsigned long a0, a1, a2;
> };
>
> #ifdef CONFIG_SHADOW_CALL_STACK
> diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> index b09ca5f944f7..29c0734f2972 100644
> --- a/arch/riscv/kernel/asm-offsets.c
> +++ b/arch/riscv/kernel/asm-offsets.c
> @@ -36,6 +36,8 @@ void asm_offsets(void)
> OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
> OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
> OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
> +
> + OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
> OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
> OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
> OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
> @@ -43,6 +45,9 @@ void asm_offsets(void)
> #ifdef CONFIG_SHADOW_CALL_STACK
> OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
> #endif
> + OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
> + OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
> + OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
>
> OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu);
> OFFSET(TASK_THREAD_F0, task_struct, thread.fstate.f[0]);
> diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> index 68a24cf9481a..822311266a12 100644
> --- a/arch/riscv/kernel/entry.S
> +++ b/arch/riscv/kernel/entry.S
> @@ -19,6 +19,78 @@
>
> .section .irqentry.text, "ax"
>
> +.macro new_vmalloc_check
> + REG_S a0, TASK_TI_A0(tp)
> + REG_S a1, TASK_TI_A1(tp)
> + REG_S a2, TASK_TI_A2(tp)
> +
> + csrr a0, CSR_CAUSE
> + /* Exclude IRQs */
> + blt a0, zero, _new_vmalloc_restore_context
> + /* Only check new_vmalloc if we are in page/protection fault */
> + li a1, EXC_LOAD_PAGE_FAULT
> + beq a0, a1, _new_vmalloc_kernel_address
> + li a1, EXC_STORE_PAGE_FAULT
> + beq a0, a1, _new_vmalloc_kernel_address
> + li a1, EXC_INST_PAGE_FAULT
> + bne a0, a1, _new_vmalloc_restore_context
> +
> +_new_vmalloc_kernel_address:
> + /* Is it a kernel address? */
> + csrr a0, CSR_TVAL
> + bge a0, zero, _new_vmalloc_restore_context
> +
> + /* Check if a new vmalloc mapping appeared that could explain the trap */
> +
> + /*
> + * Computes:
> + * a0 = &new_vmalloc[BIT_WORD(cpu)]
> + * a1 = BIT_MASK(cpu)
> + */
> + REG_L a2, TASK_TI_CPU(tp)
> + /*
> + * Compute the new_vmalloc element position:
> + * (cpu / 64) * 8 = (cpu >> 6) << 3
> + */
> + srli a1, a2, 6
> + slli a1, a1, 3
> + la a0, new_vmalloc
> + add a0, a0, a1
> + /*
> + * Compute the bit position in the new_vmalloc element:
> + * bit_pos = cpu % 64 = cpu - (cpu / 64) * 64 = cpu - (cpu >> 6) << 6
> + * = cpu - ((cpu >> 6) << 3) << 3
> + */
> + slli a1, a1, 3
> + sub a1, a2, a1
> + /* Compute the "get mask": 1 << bit_pos */
> + li a2, 1
> + sll a1, a2, a1
> +
> + /* Check the value of new_vmalloc for this cpu */
> + REG_L a2, 0(a0)
> + and a2, a2, a1
> + beq a2, zero, _new_vmalloc_restore_context
> +
> + /* Atomically reset the current cpu bit in new_vmalloc */
> + amoxor.w a0, a1, (a0)
Doing only 32bit atomic here, is this intentional ?
> +
> + /* Only emit a sfence.vma if the uarch caches invalid entries */
> + ALTERNATIVE("sfence.vma", "nop", 0, RISCV_ISA_EXT_SVVPTC, 1)
> +
> + REG_L a0, TASK_TI_A0(tp)
> + REG_L a1, TASK_TI_A1(tp)
> + REG_L a2, TASK_TI_A2(tp)
> + csrw CSR_SCRATCH, x0
> + sret
> +
> +_new_vmalloc_restore_context:
> + REG_L a0, TASK_TI_A0(tp)
> + REG_L a1, TASK_TI_A1(tp)
> + REG_L a2, TASK_TI_A2(tp)
> +.endm
> +
> +
> SYM_CODE_START(handle_exception)
> /*
> * If coming from userspace, preserve the user thread pointer and load
> @@ -30,6 +102,18 @@ SYM_CODE_START(handle_exception)
>
> .Lrestore_kernel_tpsp:
> csrr tp, CSR_SCRATCH
> +
> + /*
> + * The RISC-V kernel does not eagerly emit a sfence.vma after each
> + * new vmalloc mapping, which may result in exceptions:
> + * - if the uarch caches invalid entries, the new mapping would not be
> + * observed by the page table walker and an invalidation is needed.
> + * - if the uarch does not cache invalid entries, a reordered access
> + * could "miss" the new mapping and traps: in that case, we only need
> + * to retry the access, no sfence.vma is required.
> + */
> + new_vmalloc_check
> +
> REG_S sp, TASK_TI_KERNEL_SP(tp)
>
> #ifdef CONFIG_VMAP_STACK
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index e3405e4b99af..2367a156c33b 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -36,6 +36,8 @@
>
> #include "../kernel/head.h"
>
> +u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> +
> struct kernel_mapping kernel_map __ro_after_init;
> EXPORT_SYMBOL(kernel_map);
> #ifdef CONFIG_XIP_KERNEL
> --
> 2.39.2
>
>
Regards,
Anup
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [External] [PATCH v3 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
2024-07-02 9:48 ` [External] " yunhui cui
@ 2024-07-02 12:36 ` Alexandre Ghiti
0 siblings, 0 replies; 13+ messages in thread
From: Alexandre Ghiti @ 2024-07-02 12:36 UTC (permalink / raw)
To: yunhui cui
Cc: Conor Dooley, Rob Herring, Krzysztof Kozlowski, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Ved Shanbhogue, Matt Evans,
linux-kernel, linux-riscv, devicetree
Hi Yunhui,
On Tue, Jul 2, 2024 at 11:48 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
>
> Hi Alexandre,
>
> On Tue, Jul 2, 2024 at 4:54 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> >
> > In 6.5, we removed the vmalloc fault path because that can't work (see
> > [1] [2]). Then in order to make sure that new page table entries were
> > seen by the page table walker, we had to preventively emit a sfence.vma
> > on all harts [3] but this solution is very costly since it relies on IPI.
> >
> > And even there, we could end up in a loop of vmalloc faults if a vmalloc
> > allocation is done in the IPI path (for example if it is traced, see
> > [4]), which could result in a kernel stack overflow.
> >
> > Those preventive sfence.vma needed to be emitted because:
> >
> > - if the uarch caches invalid entries, the new mapping may not be
> > observed by the page table walker and an invalidation may be needed.
> > - if the uarch does not cache invalid entries, a reordered access
> > could "miss" the new mapping and traps: in that case, we would actually
> > only need to retry the access, no sfence.vma is required.
> >
> > So this patch removes those preventive sfence.vma and actually handles
> > the possible (and unlikely) exceptions. And since the kernel stacks
> > mappings lie in the vmalloc area, this handling must be done very early
> > when the trap is taken, at the very beginning of handle_exception: this
> > also rules out the vmalloc allocations in the fault path.
> >
> > Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1]
> > Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2]
> > Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3]
> > Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4]
> > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > ---
> > arch/riscv/include/asm/cacheflush.h | 18 +++++-
> > arch/riscv/include/asm/thread_info.h | 5 ++
> > arch/riscv/kernel/asm-offsets.c | 5 ++
> > arch/riscv/kernel/entry.S | 84 ++++++++++++++++++++++++++++
> > arch/riscv/mm/init.c | 2 +
> > 5 files changed, 113 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > index ce79c558a4c8..8de73f91bfa3 100644
> > --- a/arch/riscv/include/asm/cacheflush.h
> > +++ b/arch/riscv/include/asm/cacheflush.h
> > @@ -46,7 +46,23 @@ do { \
> > } while (0)
> >
> > #ifdef CONFIG_64BIT
> > -#define flush_cache_vmap(start, end) flush_tlb_kernel_range(start, end)
> > +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > +extern char _end[];
> > +#define flush_cache_vmap flush_cache_vmap
> > +static inline void flush_cache_vmap(unsigned long start, unsigned long end)
> > +{
> > + if (is_vmalloc_or_module_addr((void *)start)) {
> > + int i;
> > +
> > + /*
> > + * We don't care if concurrently a cpu resets this value since
> > + * the only place this can happen is in handle_exception() where
> > + * an sfence.vma is emitted.
> > + */
> > + for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
> > + new_vmalloc[i] = -1ULL;
> > + }
> > +}
> > #define flush_cache_vmap_early(start, end) local_flush_tlb_kernel_range(start, end)
> > #endif
> >
> > diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
> > index 5d473343634b..32631acdcdd4 100644
> > --- a/arch/riscv/include/asm/thread_info.h
> > +++ b/arch/riscv/include/asm/thread_info.h
> > @@ -60,6 +60,11 @@ struct thread_info {
> > void *scs_base;
> > void *scs_sp;
> > #endif
> > + /*
> > + * Used in handle_exception() to save a0, a1 and a2 before knowing if we
> > + * can access the kernel stack.
> > + */
> > + unsigned long a0, a1, a2;
> > };
> >
> > #ifdef CONFIG_SHADOW_CALL_STACK
> > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > index b09ca5f944f7..29c0734f2972 100644
> > --- a/arch/riscv/kernel/asm-offsets.c
> > +++ b/arch/riscv/kernel/asm-offsets.c
> > @@ -36,6 +36,8 @@ void asm_offsets(void)
> > OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
> > OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
> > OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
> > +
> > + OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
> > OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
> > OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
> > OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
> > @@ -43,6 +45,9 @@ void asm_offsets(void)
> > #ifdef CONFIG_SHADOW_CALL_STACK
> > OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
> > #endif
> > + OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
> > + OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
> > + OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
> >
> > OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu);
> > OFFSET(TASK_THREAD_F0, task_struct, thread.fstate.f[0]);
> > diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> > index 68a24cf9481a..822311266a12 100644
> > --- a/arch/riscv/kernel/entry.S
> > +++ b/arch/riscv/kernel/entry.S
> > @@ -19,6 +19,78 @@
> >
> > .section .irqentry.text, "ax"
> >
> > +.macro new_vmalloc_check
> > + REG_S a0, TASK_TI_A0(tp)
> > + REG_S a1, TASK_TI_A1(tp)
> > + REG_S a2, TASK_TI_A2(tp)
>
> We discussed in the previous version that when executing blt a0, zero,
> _new_vmalloc_restore_context, there is no need to save a1, a2 first,
> right?
And you're totally right, I forgot to do so... Thanks for bringing that
up again, as it's important we do the minimum amount of work here. I'll
respin a new version, which I should send in a couple of days.
Thanks,
Alex
>
> > +
> > + csrr a0, CSR_CAUSE
> > + /* Exclude IRQs */
> > + blt a0, zero, _new_vmalloc_restore_context
> > + /* Only check new_vmalloc if we are in page/protection fault */
> > + li a1, EXC_LOAD_PAGE_FAULT
> > + beq a0, a1, _new_vmalloc_kernel_address
> > + li a1, EXC_STORE_PAGE_FAULT
> > + beq a0, a1, _new_vmalloc_kernel_address
> > + li a1, EXC_INST_PAGE_FAULT
> > + bne a0, a1, _new_vmalloc_restore_context
> > +
> > +_new_vmalloc_kernel_address:
> > + /* Is it a kernel address? */
> > + csrr a0, CSR_TVAL
> > + bge a0, zero, _new_vmalloc_restore_context
> > +
> > + /* Check if a new vmalloc mapping appeared that could explain the trap */
> > +
> > + /*
> > + * Computes:
> > + * a0 = &new_vmalloc[BIT_WORD(cpu)]
> > + * a1 = BIT_MASK(cpu)
> > + */
> > + REG_L a2, TASK_TI_CPU(tp)
> > + /*
> > + * Compute the new_vmalloc element position:
> > + * (cpu / 64) * 8 = (cpu >> 6) << 3
> > + */
> > + srli a1, a2, 6
> > + slli a1, a1, 3
> > + la a0, new_vmalloc
> > + add a0, a0, a1
> > + /*
> > + * Compute the bit position in the new_vmalloc element:
> > + * bit_pos = cpu % 64 = cpu - (cpu / 64) * 64 = cpu - (cpu >> 6) << 6
> > + * = cpu - ((cpu >> 6) << 3) << 3
> > + */
> > + slli a1, a1, 3
> > + sub a1, a2, a1
> > + /* Compute the "get mask": 1 << bit_pos */
> > + li a2, 1
> > + sll a1, a2, a1
> > +
> > + /* Check the value of new_vmalloc for this cpu */
> > + REG_L a2, 0(a0)
> > + and a2, a2, a1
> > + beq a2, zero, _new_vmalloc_restore_context
> > +
> > + /* Atomically reset the current cpu bit in new_vmalloc */
> > + amoxor.w a0, a1, (a0)
> > +
> > + /* Only emit a sfence.vma if the uarch caches invalid entries */
> > + ALTERNATIVE("sfence.vma", "nop", 0, RISCV_ISA_EXT_SVVPTC, 1)
> > +
> > + REG_L a0, TASK_TI_A0(tp)
> > + REG_L a1, TASK_TI_A1(tp)
> > + REG_L a2, TASK_TI_A2(tp)
> > + csrw CSR_SCRATCH, x0
> > + sret
> > +
> > +_new_vmalloc_restore_context:
> > + REG_L a0, TASK_TI_A0(tp)
> > + REG_L a1, TASK_TI_A1(tp)
> > + REG_L a2, TASK_TI_A2(tp)
> > +.endm
> > +
> > +
> > SYM_CODE_START(handle_exception)
> > /*
> > * If coming from userspace, preserve the user thread pointer and load
> > @@ -30,6 +102,18 @@ SYM_CODE_START(handle_exception)
> >
> > .Lrestore_kernel_tpsp:
> > csrr tp, CSR_SCRATCH
> > +
> > + /*
> > + * The RISC-V kernel does not eagerly emit a sfence.vma after each
> > + * new vmalloc mapping, which may result in exceptions:
> > + * - if the uarch caches invalid entries, the new mapping would not be
> > + * observed by the page table walker and an invalidation is needed.
> > + * - if the uarch does not cache invalid entries, a reordered access
> > + * could "miss" the new mapping and traps: in that case, we only need
> > + * to retry the access, no sfence.vma is required.
> > + */
> > + new_vmalloc_check
> > +
> > REG_S sp, TASK_TI_KERNEL_SP(tp)
> >
> > #ifdef CONFIG_VMAP_STACK
> > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > index e3405e4b99af..2367a156c33b 100644
> > --- a/arch/riscv/mm/init.c
> > +++ b/arch/riscv/mm/init.c
> > @@ -36,6 +36,8 @@
> >
> > #include "../kernel/head.h"
> >
> > +u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > +
> > struct kernel_mapping kernel_map __ro_after_init;
> > EXPORT_SYMBOL(kernel_map);
> > #ifdef CONFIG_XIP_KERNEL
> > --
> > 2.39.2
> >
>
> Thanks,
> Yunhui
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
2024-07-02 10:12 ` Anup Patel
@ 2024-07-02 12:40 ` Alexandre Ghiti
0 siblings, 0 replies; 13+ messages in thread
From: Alexandre Ghiti @ 2024-07-02 12:40 UTC (permalink / raw)
To: Anup Patel
Cc: Conor Dooley, Rob Herring, Krzysztof Kozlowski, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Ved Shanbhogue, Matt Evans,
linux-kernel, linux-riscv, devicetree
Hi Anup,
On Tue, Jul 2, 2024 at 12:13 PM Anup Patel <anup@brainfault.org> wrote:
>
> On Tue, Jul 2, 2024 at 2:24 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> >
> > In 6.5, we removed the vmalloc fault path because that can't work (see
> > [1] [2]). Then in order to make sure that new page table entries were
> > seen by the page table walker, we had to preventively emit a sfence.vma
> > on all harts [3] but this solution is very costly since it relies on IPI.
> >
> > And even there, we could end up in a loop of vmalloc faults if a vmalloc
> > allocation is done in the IPI path (for example if it is traced, see
> > [4]), which could result in a kernel stack overflow.
> >
> > Those preventive sfence.vma needed to be emitted because:
> >
> > - if the uarch caches invalid entries, the new mapping may not be
> > observed by the page table walker and an invalidation may be needed.
> > - if the uarch does not cache invalid entries, a reordered access
> > could "miss" the new mapping and traps: in that case, we would actually
> > only need to retry the access, no sfence.vma is required.
> >
> > So this patch removes those preventive sfence.vma and actually handles
> > the possible (and unlikely) exceptions. And since the kernel stacks
> > mappings lie in the vmalloc area, this handling must be done very early
> > when the trap is taken, at the very beginning of handle_exception: this
> > also rules out the vmalloc allocations in the fault path.
> >
> > Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1]
> > Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2]
> > Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3]
> > Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4]
> > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > ---
> > arch/riscv/include/asm/cacheflush.h | 18 +++++-
> > arch/riscv/include/asm/thread_info.h | 5 ++
> > arch/riscv/kernel/asm-offsets.c | 5 ++
> > arch/riscv/kernel/entry.S | 84 ++++++++++++++++++++++++++++
> > arch/riscv/mm/init.c | 2 +
> > 5 files changed, 113 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > index ce79c558a4c8..8de73f91bfa3 100644
> > --- a/arch/riscv/include/asm/cacheflush.h
> > +++ b/arch/riscv/include/asm/cacheflush.h
> > @@ -46,7 +46,23 @@ do { \
> > } while (0)
> >
> > #ifdef CONFIG_64BIT
> > -#define flush_cache_vmap(start, end) flush_tlb_kernel_range(start, end)
> > +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
>
> Why is this u64 and not "unsigned long" ?
I prefer the explicit types, but I'm not opposed to using unsigned long
if you prefer.
>
> Was this tested on rv32 ?
It is not intended to work on rv32 as rv32 still uses the vmalloc
fault path. But then new_vmalloc_check should only be called for rv64,
so I'll fix that in the next version, thanks for asking.
>
> > +extern char _end[];
> > +#define flush_cache_vmap flush_cache_vmap
> > +static inline void flush_cache_vmap(unsigned long start, unsigned long end)
> > +{
> > + if (is_vmalloc_or_module_addr((void *)start)) {
> > + int i;
> > +
> > + /*
> > + * We don't care if concurrently a cpu resets this value since
> > + * the only place this can happen is in handle_exception() where
> > + * an sfence.vma is emitted.
> > + */
> > + for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
> > + new_vmalloc[i] = -1ULL;
> > + }
> > +}
> > #define flush_cache_vmap_early(start, end) local_flush_tlb_kernel_range(start, end)
> > #endif
> >
> > diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
> > index 5d473343634b..32631acdcdd4 100644
> > --- a/arch/riscv/include/asm/thread_info.h
> > +++ b/arch/riscv/include/asm/thread_info.h
> > @@ -60,6 +60,11 @@ struct thread_info {
> > void *scs_base;
> > void *scs_sp;
> > #endif
> > + /*
> > + * Used in handle_exception() to save a0, a1 and a2 before knowing if we
> > + * can access the kernel stack.
> > + */
> > + unsigned long a0, a1, a2;
> > };
> >
> > #ifdef CONFIG_SHADOW_CALL_STACK
> > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > index b09ca5f944f7..29c0734f2972 100644
> > --- a/arch/riscv/kernel/asm-offsets.c
> > +++ b/arch/riscv/kernel/asm-offsets.c
> > @@ -36,6 +36,8 @@ void asm_offsets(void)
> > OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
> > OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
> > OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
> > +
> > + OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
> > OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
> > OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
> > OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
> > @@ -43,6 +45,9 @@ void asm_offsets(void)
> > #ifdef CONFIG_SHADOW_CALL_STACK
> > OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
> > #endif
> > + OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
> > + OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
> > + OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
> >
> > OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu);
> > OFFSET(TASK_THREAD_F0, task_struct, thread.fstate.f[0]);
> > diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> > index 68a24cf9481a..822311266a12 100644
> > --- a/arch/riscv/kernel/entry.S
> > +++ b/arch/riscv/kernel/entry.S
> > @@ -19,6 +19,78 @@
> >
> > .section .irqentry.text, "ax"
> >
> > +.macro new_vmalloc_check
> > + REG_S a0, TASK_TI_A0(tp)
> > + REG_S a1, TASK_TI_A1(tp)
> > + REG_S a2, TASK_TI_A2(tp)
> > +
> > + csrr a0, CSR_CAUSE
> > + /* Exclude IRQs */
> > + blt a0, zero, _new_vmalloc_restore_context
> > + /* Only check new_vmalloc if we are in page/protection fault */
> > + li a1, EXC_LOAD_PAGE_FAULT
> > + beq a0, a1, _new_vmalloc_kernel_address
> > + li a1, EXC_STORE_PAGE_FAULT
> > + beq a0, a1, _new_vmalloc_kernel_address
> > + li a1, EXC_INST_PAGE_FAULT
> > + bne a0, a1, _new_vmalloc_restore_context
> > +
> > +_new_vmalloc_kernel_address:
> > + /* Is it a kernel address? */
> > + csrr a0, CSR_TVAL
> > + bge a0, zero, _new_vmalloc_restore_context
> > +
> > + /* Check if a new vmalloc mapping appeared that could explain the trap */
> > +
> > + /*
> > + * Computes:
> > + * a0 = &new_vmalloc[BIT_WORD(cpu)]
> > + * a1 = BIT_MASK(cpu)
> > + */
> > + REG_L a2, TASK_TI_CPU(tp)
> > + /*
> > + * Compute the new_vmalloc element position:
> > + * (cpu / 64) * 8 = (cpu >> 6) << 3
> > + */
> > + srli a1, a2, 6
> > + slli a1, a1, 3
> > + la a0, new_vmalloc
> > + add a0, a0, a1
> > + /*
> > + * Compute the bit position in the new_vmalloc element:
> > + * bit_pos = cpu % 64 = cpu - (cpu / 64) * 64 = cpu - (cpu >> 6) << 6
> > + * = cpu - ((cpu >> 6) << 3) << 3
> > + */
> > + slli a1, a1, 3
> > + sub a1, a2, a1
> > + /* Compute the "get mask": 1 << bit_pos */
> > + li a2, 1
> > + sll a1, a2, a1
> > +
> > + /* Check the value of new_vmalloc for this cpu */
> > + REG_L a2, 0(a0)
> > + and a2, a2, a1
> > + beq a2, zero, _new_vmalloc_restore_context
> > +
> > + /* Atomically reset the current cpu bit in new_vmalloc */
> > + amoxor.w a0, a1, (a0)
>
> Doing only 32bit atomic here, is this intentional ?
Oh my, that's a big mistake. Thanks
>
> > +
> > + /* Only emit a sfence.vma if the uarch caches invalid entries */
> > + ALTERNATIVE("sfence.vma", "nop", 0, RISCV_ISA_EXT_SVVPTC, 1)
> > +
> > + REG_L a0, TASK_TI_A0(tp)
> > + REG_L a1, TASK_TI_A1(tp)
> > + REG_L a2, TASK_TI_A2(tp)
> > + csrw CSR_SCRATCH, x0
> > + sret
> > +
> > +_new_vmalloc_restore_context:
> > + REG_L a0, TASK_TI_A0(tp)
> > + REG_L a1, TASK_TI_A1(tp)
> > + REG_L a2, TASK_TI_A2(tp)
> > +.endm
> > +
> > +
> > SYM_CODE_START(handle_exception)
> > /*
> > * If coming from userspace, preserve the user thread pointer and load
> > @@ -30,6 +102,18 @@ SYM_CODE_START(handle_exception)
> >
> > .Lrestore_kernel_tpsp:
> > csrr tp, CSR_SCRATCH
> > +
> > + /*
> > + * The RISC-V kernel does not eagerly emit a sfence.vma after each
> > + * new vmalloc mapping, which may result in exceptions:
> > + * - if the uarch caches invalid entries, the new mapping would not be
> > + * observed by the page table walker and an invalidation is needed.
> > + * - if the uarch does not cache invalid entries, a reordered access
> > + * could "miss" the new mapping and traps: in that case, we only need
> > + * to retry the access, no sfence.vma is required.
> > + */
> > + new_vmalloc_check
> > +
> > REG_S sp, TASK_TI_KERNEL_SP(tp)
> >
> > #ifdef CONFIG_VMAP_STACK
> > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > index e3405e4b99af..2367a156c33b 100644
> > --- a/arch/riscv/mm/init.c
> > +++ b/arch/riscv/mm/init.c
> > @@ -36,6 +36,8 @@
> >
> > #include "../kernel/head.h"
> >
> > +u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > +
> > struct kernel_mapping kernel_map __ro_after_init;
> > EXPORT_SYMBOL(kernel_map);
> > #ifdef CONFIG_XIP_KERNEL
> > --
> > 2.39.2
> >
> >
>
> Regards,
> Anup
Thanks for taking a look Anup,
Alex
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 2/4] dt-bindings: riscv: Add Svvptc ISA extension description
2024-07-02 8:50 ` [PATCH v3 2/4] dt-bindings: riscv: Add Svvptc ISA extension description Alexandre Ghiti
@ 2024-07-02 15:02 ` Conor Dooley
0 siblings, 0 replies; 13+ messages in thread
From: Conor Dooley @ 2024-07-02 15:02 UTC (permalink / raw)
To: Alexandre Ghiti
Cc: Rob Herring, Krzysztof Kozlowski, Paul Walmsley, Palmer Dabbelt,
Albert Ou, Ved Shanbhogue, Matt Evans, linux-kernel, linux-riscv,
devicetree
[-- Attachment #1: Type: text/plain, Size: 261 bytes --]
On Tue, Jul 02, 2024 at 10:50:32AM +0200, Alexandre Ghiti wrote:
> Add description for the Svvptc ISA extension which was ratified recently.
>
> Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 1/4] riscv: Add ISA extension parsing for Svvptc
2024-07-02 8:50 ` [PATCH v3 1/4] riscv: Add ISA extension parsing for Svvptc Alexandre Ghiti
@ 2024-07-02 15:02 ` Conor Dooley
0 siblings, 0 replies; 13+ messages in thread
From: Conor Dooley @ 2024-07-02 15:02 UTC (permalink / raw)
To: Alexandre Ghiti
Cc: Rob Herring, Krzysztof Kozlowski, Paul Walmsley, Palmer Dabbelt,
Albert Ou, Ved Shanbhogue, Matt Evans, linux-kernel, linux-riscv,
devicetree
[-- Attachment #1: Type: text/plain, Size: 275 bytes --]
On Tue, Jul 02, 2024 at 10:50:31AM +0200, Alexandre Ghiti wrote:
> Add support to parse the Svvptc string in the riscv,isa string.
>
> Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Cheers,
Conor.
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc
2024-07-02 8:50 ` [PATCH v3 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc Alexandre Ghiti
@ 2024-07-11 1:27 ` kernel test robot
2024-07-11 3:58 ` kernel test robot
1 sibling, 0 replies; 13+ messages in thread
From: kernel test robot @ 2024-07-11 1:27 UTC (permalink / raw)
To: Alexandre Ghiti, Conor Dooley, Rob Herring, Krzysztof Kozlowski,
Paul Walmsley, Palmer Dabbelt, Albert Ou, Ved Shanbhogue,
Matt Evans, linux-kernel, linux-riscv, devicetree
Cc: llvm, oe-kbuild-all, Alexandre Ghiti
Hi Alexandre,
kernel test robot noticed the following build errors:
[auto build test ERROR on robh/for-next]
[also build test ERROR on linus/master v6.10-rc7]
[cannot apply to next-20240710]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patches, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Alexandre-Ghiti/riscv-Add-ISA-extension-parsing-for-Svvptc/20240702-171920
base: https://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git for-next
patch link: https://lore.kernel.org/r/20240702085034.48395-5-alexghiti%40rivosinc.com
patch subject: [PATCH v3 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc
config: riscv-randconfig-001-20240711 (https://download.01.org/0day-ci/archive/20240711/202407110946.e0VNIrJP-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240711/202407110946.e0VNIrJP-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407110946.e0VNIrJP-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from arch/riscv/kernel/asm-offsets.c:10:
In file included from include/linux/mm.h:30:
In file included from include/linux/pgtable.h:6:
>> arch/riscv/include/asm/pgtable.h:498:1: error: expected statement
}
^
In file included from arch/riscv/kernel/asm-offsets.c:10:
In file included from include/linux/mm.h:1115:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:98:11: warning: array index 3 is past the end of the array (which contains 1 element) [-Warray-bounds]
return (set->sig[3] | set->sig[2] |
^ ~
include/uapi/asm-generic/signal.h:62:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from arch/riscv/kernel/asm-offsets.c:10:
In file included from include/linux/mm.h:1115:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:98:25: warning: array index 2 is past the end of the array (which contains 1 element) [-Warray-bounds]
return (set->sig[3] | set->sig[2] |
^ ~
include/uapi/asm-generic/signal.h:62:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from arch/riscv/kernel/asm-offsets.c:10:
In file included from include/linux/mm.h:1115:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:99:4: warning: array index 1 is past the end of the array (which contains 1 element) [-Warray-bounds]
set->sig[1] | set->sig[0]) == 0;
^ ~
include/uapi/asm-generic/signal.h:62:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from arch/riscv/kernel/asm-offsets.c:10:
In file included from include/linux/mm.h:1115:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:101:11: warning: array index 1 is past the end of the array (which contains 1 element) [-Warray-bounds]
return (set->sig[1] | set->sig[0]) == 0;
^ ~
include/uapi/asm-generic/signal.h:62:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from arch/riscv/kernel/asm-offsets.c:10:
In file included from include/linux/mm.h:1115:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:114:11: warning: array index 3 is past the end of the array (which contains 1 element) [-Warray-bounds]
return (set1->sig[3] == set2->sig[3]) &&
^ ~
include/uapi/asm-generic/signal.h:62:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from arch/riscv/kernel/asm-offsets.c:10:
In file included from include/linux/mm.h:1115:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:114:27: warning: array index 3 is past the end of the array (which contains 1 element) [-Warray-bounds]
return (set1->sig[3] == set2->sig[3]) &&
^ ~
include/uapi/asm-generic/signal.h:62:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from arch/riscv/kernel/asm-offsets.c:10:
In file included from include/linux/mm.h:1115:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:115:5: warning: array index 2 is past the end of the array (which contains 1 element) [-Warray-bounds]
(set1->sig[2] == set2->sig[2]) &&
^ ~
include/uapi/asm-generic/signal.h:62:2: note: array 'sig' declared here
unsigned long sig[_NSIG_WORDS];
^
In file included from arch/riscv/kernel/asm-offsets.c:10:
In file included from include/linux/mm.h:1115:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
vim +498 arch/riscv/include/asm/pgtable.h
07037db5d479f9 Palmer Dabbelt 2017-07-10 469
07037db5d479f9 Palmer Dabbelt 2017-07-10 470 #define pgd_ERROR(e) \
07037db5d479f9 Palmer Dabbelt 2017-07-10 471 pr_err("%s:%d: bad pgd " PTE_FMT ".\n", __FILE__, __LINE__, pgd_val(e))
07037db5d479f9 Palmer Dabbelt 2017-07-10 472
07037db5d479f9 Palmer Dabbelt 2017-07-10 473
07037db5d479f9 Palmer Dabbelt 2017-07-10 474 /* Commit new configuration to MMU hardware */
864609c6a0b5f0 Matthew Wilcox (Oracle 2023-08-02 475) static inline void update_mmu_cache_range(struct vm_fault *vmf,
864609c6a0b5f0 Matthew Wilcox (Oracle 2023-08-02 476) struct vm_area_struct *vma, unsigned long address,
864609c6a0b5f0 Matthew Wilcox (Oracle 2023-08-02 477) pte_t *ptep, unsigned int nr)
07037db5d479f9 Palmer Dabbelt 2017-07-10 478 {
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 479 asm goto(ALTERNATIVE("nop", "j %l[svvptc]", 0, RISCV_ISA_EXT_SVVPTC, 1)
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 480 : : : : svvptc);
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 481
07037db5d479f9 Palmer Dabbelt 2017-07-10 482 /*
07037db5d479f9 Palmer Dabbelt 2017-07-10 483 * The kernel assumes that TLBs don't cache invalid entries, but
07037db5d479f9 Palmer Dabbelt 2017-07-10 484 * in RISC-V, SFENCE.VMA specifies an ordering constraint, not a
07037db5d479f9 Palmer Dabbelt 2017-07-10 485 * cache flush; it is necessary even after writing invalid entries.
07037db5d479f9 Palmer Dabbelt 2017-07-10 486 * Relying on flush_tlb_fix_spurious_fault would suffice, but
07037db5d479f9 Palmer Dabbelt 2017-07-10 487 * the extra traps reduce performance. So, eagerly SFENCE.VMA.
07037db5d479f9 Palmer Dabbelt 2017-07-10 488 */
864609c6a0b5f0 Matthew Wilcox (Oracle 2023-08-02 489) while (nr--)
864609c6a0b5f0 Matthew Wilcox (Oracle 2023-08-02 490) local_flush_tlb_page(address + nr * PAGE_SIZE);
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 491
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 492 svvptc:
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 493 /*
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 494 * Svvptc guarantees that the new valid pte will be visible within
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 495 * a bounded timeframe, so when the uarch does not cache invalid
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 496 * entries, we don't have to do anything.
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 497 */
07037db5d479f9 Palmer Dabbelt 2017-07-10 @498 }
864609c6a0b5f0 Matthew Wilcox (Oracle 2023-08-02 499) #define update_mmu_cache(vma, addr, ptep) \
864609c6a0b5f0 Matthew Wilcox (Oracle 2023-08-02 500) update_mmu_cache_range(NULL, vma, addr, ptep, 1)
07037db5d479f9 Palmer Dabbelt 2017-07-10 501
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 13+ messages in thread
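The root cause of the error above is not specific to the patch's logic: before C23, a label may
not be the last item in a compound statement, so once the Svvptc path leaves the svvptc: label
followed only by a comment and the closing brace, clang 14 reports "expected statement" (newer
compilers treat the construct as a C23 extension). A minimal stand-alone reproduction is sketched
below; the function name f is hypothetical and the plain goto stands in for the asm goto, and the
snippet is intentionally non-compiling under pre-C23 modes:

/* Minimal reproduction of "label at end of compound statement" (pre-C23). */
void f(void)
{
	goto out;
out:	/* error: a label must be followed by a statement before C23 */
}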
* Re: [PATCH v3 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc
2024-07-02 8:50 ` [PATCH v3 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc Alexandre Ghiti
2024-07-11 1:27 ` kernel test robot
@ 2024-07-11 3:58 ` kernel test robot
1 sibling, 0 replies; 13+ messages in thread
From: kernel test robot @ 2024-07-11 3:58 UTC (permalink / raw)
To: Alexandre Ghiti, Conor Dooley, Rob Herring, Krzysztof Kozlowski,
Paul Walmsley, Palmer Dabbelt, Albert Ou, Ved Shanbhogue,
Matt Evans, linux-kernel, linux-riscv, devicetree
Cc: llvm, oe-kbuild-all, Alexandre Ghiti
Hi Alexandre,
kernel test robot noticed the following build errors:
[auto build test ERROR on robh/for-next]
[also build test ERROR on linus/master v6.10-rc7]
[cannot apply to next-20240710]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patches, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Alexandre-Ghiti/riscv-Add-ISA-extension-parsing-for-Svvptc/20240702-171920
base: https://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git for-next
patch link: https://lore.kernel.org/r/20240702085034.48395-5-alexghiti%40rivosinc.com
patch subject: [PATCH v3 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc
config: riscv-allmodconfig (https://download.01.org/0day-ci/archive/20240711/202407111151.m5cK0E6R-lkp@intel.com/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project a0c6b8aef853eedaa0980f07c0a502a5a8a9740e)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240711/202407111151.m5cK0E6R-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407111151.m5cK0E6R-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from lib/test_bitops.c:10:
In file included from include/linux/module.h:19:
In file included from include/linux/elf.h:6:
In file included from arch/riscv/include/asm/elf.h:12:
In file included from include/linux/compat.h:17:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:9:
In file included from include/linux/sched/task.h:13:
In file included from include/linux/uaccess.h:11:
In file included from arch/riscv/include/asm/uaccess.h:12:
>> arch/riscv/include/asm/pgtable.h:498:1: error: label at end of compound statement is a C23 extension [-Werror,-Wc23-extensions]
498 | }
| ^
1 error generated.
vim +498 arch/riscv/include/asm/pgtable.h
07037db5d479f9 Palmer Dabbelt 2017-07-10 469
07037db5d479f9 Palmer Dabbelt 2017-07-10 470 #define pgd_ERROR(e) \
07037db5d479f9 Palmer Dabbelt 2017-07-10 471 pr_err("%s:%d: bad pgd " PTE_FMT ".\n", __FILE__, __LINE__, pgd_val(e))
07037db5d479f9 Palmer Dabbelt 2017-07-10 472
07037db5d479f9 Palmer Dabbelt 2017-07-10 473
07037db5d479f9 Palmer Dabbelt 2017-07-10 474 /* Commit new configuration to MMU hardware */
864609c6a0b5f0 Matthew Wilcox (Oracle 2023-08-02 475) static inline void update_mmu_cache_range(struct vm_fault *vmf,
864609c6a0b5f0 Matthew Wilcox (Oracle 2023-08-02 476) struct vm_area_struct *vma, unsigned long address,
864609c6a0b5f0 Matthew Wilcox (Oracle 2023-08-02 477) pte_t *ptep, unsigned int nr)
07037db5d479f9 Palmer Dabbelt 2017-07-10 478 {
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 479 asm goto(ALTERNATIVE("nop", "j %l[svvptc]", 0, RISCV_ISA_EXT_SVVPTC, 1)
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 480 : : : : svvptc);
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 481
07037db5d479f9 Palmer Dabbelt 2017-07-10 482 /*
07037db5d479f9 Palmer Dabbelt 2017-07-10 483 * The kernel assumes that TLBs don't cache invalid entries, but
07037db5d479f9 Palmer Dabbelt 2017-07-10 484 * in RISC-V, SFENCE.VMA specifies an ordering constraint, not a
07037db5d479f9 Palmer Dabbelt 2017-07-10 485 * cache flush; it is necessary even after writing invalid entries.
07037db5d479f9 Palmer Dabbelt 2017-07-10 486 * Relying on flush_tlb_fix_spurious_fault would suffice, but
07037db5d479f9 Palmer Dabbelt 2017-07-10 487 * the extra traps reduce performance. So, eagerly SFENCE.VMA.
07037db5d479f9 Palmer Dabbelt 2017-07-10 488 */
864609c6a0b5f0 Matthew Wilcox (Oracle 2023-08-02 489) while (nr--)
864609c6a0b5f0 Matthew Wilcox (Oracle 2023-08-02 490) local_flush_tlb_page(address + nr * PAGE_SIZE);
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 491
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 492 svvptc:
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 493 /*
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 494 * Svvptc guarantees that the new valid pte will be visible within
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 495 * a bounded timeframe, so when the uarch does not cache invalid
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 496 * entries, we don't have to do anything.
b5bdff9ee1fdca Alexandre Ghiti 2024-07-02 497 */
07037db5d479f9 Palmer Dabbelt 2017-07-10 @498 }
864609c6a0b5f0 Matthew Wilcox (Oracle 2023-08-02 499) #define update_mmu_cache(vma, addr, ptep) \
864609c6a0b5f0 Matthew Wilcox (Oracle 2023-08-02 500) update_mmu_cache_range(NULL, vma, addr, ptep, 1)
07037db5d479f9 Palmer Dabbelt 2017-07-10 501
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 13+ messages in thread
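Both robot reports boil down to the same pre-C23 rule about trailing labels. One conventional way
to address it, sketched below and not necessarily what a later revision of the patch does, is to
make sure the svvptc: label labels an actual statement, for example a null statement before the
closing brace. The function name trailing_label_fixed is hypothetical and the plain goto stands
in for the asm goto + ALTERNATIVE in update_mmu_cache_range():

/* Same shape as the failing function, with the conventional null-statement fix. */
static inline void trailing_label_fixed(int svvptc_supported)
{
	if (svvptc_supported)
		goto svvptc;	/* stands in for the asm goto + ALTERNATIVE */

	/* pre-Svvptc path: the eager sfence.vma loop would go here */
	return;

svvptc:
	/*
	 * Svvptc: nothing to do, the new pte becomes visible within a
	 * bounded timeframe.
	 */
	;	/* null statement: a label must precede a statement before C23 */
}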
end of thread, other threads:[~2024-07-11 3:59 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-02 8:50 [PATCH v3 0/4] Svvptc extension to remove preventive sfence.vma Alexandre Ghiti
2024-07-02 8:50 ` [PATCH v3 1/4] riscv: Add ISA extension parsing for Svvptc Alexandre Ghiti
2024-07-02 15:02 ` Conor Dooley
2024-07-02 8:50 ` [PATCH v3 2/4] dt-bindings: riscv: Add Svvptc ISA extension description Alexandre Ghiti
2024-07-02 15:02 ` Conor Dooley
2024-07-02 8:50 ` [PATCH v3 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings Alexandre Ghiti
2024-07-02 9:48 ` [External] " yunhui cui
2024-07-02 12:36 ` Alexandre Ghiti
2024-07-02 10:12 ` Anup Patel
2024-07-02 12:40 ` Alexandre Ghiti
2024-07-02 8:50 ` [PATCH v3 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc Alexandre Ghiti
2024-07-11 1:27 ` kernel test robot
2024-07-11 3:58 ` kernel test robot