* [PATCH 01/10] Add X86_FEATURE_INVLPGB definition.
2024-12-22 4:06 [RFC PATCH 00/10] AMD broadcast TLB invalidation Rik van Riel
@ 2024-12-22 4:06 ` Rik van Riel
2024-12-22 4:06 ` [PATCH 02/10] x86,tlb: get INVLPGB count max from CPUID Rik van Riel
` (9 subsequent siblings)
10 siblings, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2024-12-22 4:06 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx, mingo,
bp, hpa, akpm, Rik van Riel
Add the INVPLGB CPUID definition, allowing the kernel to recognize
whether the CPU supports the INVLPGB instruction.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/cpufeatures.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 17b6590748c0..b7209d6c3a5f 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -338,6 +338,7 @@
#define X86_FEATURE_CLZERO (13*32+ 0) /* "clzero" CLZERO instruction */
#define X86_FEATURE_IRPERF (13*32+ 1) /* "irperf" Instructions Retired Count */
#define X86_FEATURE_XSAVEERPTR (13*32+ 2) /* "xsaveerptr" Always save/restore FP error pointers */
+#define X86_FEATURE_INVLPGB (13*32+ 3) /* "invlpgb" INVLPGB instruction */
#define X86_FEATURE_RDPRU (13*32+ 4) /* "rdpru" Read processor register at user level */
#define X86_FEATURE_WBNOINVD (13*32+ 9) /* "wbnoinvd" WBNOINVD instruction */
#define X86_FEATURE_AMD_IBPB (13*32+12) /* Indirect Branch Prediction Barrier */
--
2.47.1
^ permalink raw reply related [flat|nested] 28+ messages in thread* [PATCH 02/10] x86,tlb: get INVLPGB count max from CPUID
2024-12-22 4:06 [RFC PATCH 00/10] AMD broadcast TLB invalidation Rik van Riel
2024-12-22 4:06 ` [PATCH 01/10] Add X86_FEATURE_INVLPGB definition Rik van Riel
@ 2024-12-22 4:06 ` Rik van Riel
2024-12-22 11:46 ` Borislav Petkov
2024-12-22 4:06 ` [PATCH 03/10] x86,mm: add INVLPGB support code Rik van Riel
` (8 subsequent siblings)
10 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2024-12-22 4:06 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx, mingo,
bp, hpa, akpm, Rik van Riel
The CPU advertises the maximum number of pages that can be shot down
with one INVLPGB instruction in the CPUID data.
Save that information for later use.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/processor.h | 2 ++
arch/x86/kernel/cpu/amd.c | 8 ++++++++
2 files changed, 10 insertions(+)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 20e6009381ed..dd32a75d5da8 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -185,6 +185,8 @@ struct cpuinfo_x86 {
u16 booted_cores;
/* Index into per_cpu list: */
u16 cpu_index;
+ /* Max number of pages invalidated with one INVLPGB */
+ u16 invlpgb_count_max;
/* Is SMT active on this core? */
bool smt_active;
u32 microcode;
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 79d2e17f6582..6a6adbe9ae54 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1135,6 +1135,14 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
tlb_lli_2m[ENTRIES] = eax & mask;
tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
+
+ if (c->extended_cpuid_level < 0x80000008)
+ return;
+
+ cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
+
+ /* Max number of pages INVLPGB can invalidate in one shot */
+ c->invlpgb_count_max = (edx & 0xffff) + 1;
}
static const struct cpu_dev amd_cpu_dev = {
--
2.47.1
^ permalink raw reply related [flat|nested] 28+ messages in thread* Re: [PATCH 02/10] x86,tlb: get INVLPGB count max from CPUID
2024-12-22 4:06 ` [PATCH 02/10] x86,tlb: get INVLPGB count max from CPUID Rik van Riel
@ 2024-12-22 11:46 ` Borislav Petkov
2024-12-22 14:39 ` Rik van Riel
0 siblings, 1 reply; 28+ messages in thread
From: Borislav Petkov @ 2024-12-22 11:46 UTC (permalink / raw)
To: Rik van Riel, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx, mingo,
hpa, akpm
On December 22, 2024 5:06:34 AM GMT+01:00, Rik van Riel <riel@surriel.com> wrote:
>The CPU advertises the maximum number of pages that can be shot down
>with one INVLPGB instruction in the CPUID data.
>
>Save that information for later use.
Definitely not replicate a single datum into every CPU's descriptor.
--
Sent from a small device: formatting sucks and brevity is inevitable.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 02/10] x86,tlb: get INVLPGB count max from CPUID
2024-12-22 11:46 ` Borislav Petkov
@ 2024-12-22 14:39 ` Rik van Riel
0 siblings, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2024-12-22 14:39 UTC (permalink / raw)
To: Borislav Petkov, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx, mingo,
hpa, akpm
On Sun, 2024-12-22 at 12:46 +0100, Borislav Petkov wrote:
> On December 22, 2024 5:06:34 AM GMT+01:00, Rik van Riel
> <riel@surriel.com> wrote:
> > The CPU advertises the maximum number of pages that can be shot
> > down
> > with one INVLPGB instruction in the CPUID data.
> >
> > Save that information for later use.
>
> Definitely not replicate a single datum into every CPU's descriptor.
>
Oops. I was under the impression this was only done for
the boot CPU.
I'll turn it into a global variable.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 03/10] x86,mm: add INVLPGB support code
2024-12-22 4:06 [RFC PATCH 00/10] AMD broadcast TLB invalidation Rik van Riel
2024-12-22 4:06 ` [PATCH 01/10] Add X86_FEATURE_INVLPGB definition Rik van Riel
2024-12-22 4:06 ` [PATCH 02/10] x86,tlb: get INVLPGB count max from CPUID Rik van Riel
@ 2024-12-22 4:06 ` Rik van Riel
2024-12-22 11:05 ` Peter Zijlstra
2024-12-22 4:06 ` [PATCH 04/10] x86,mm: use INVLPGB for kernel TLB flushes Rik van Riel
` (7 subsequent siblings)
10 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2024-12-22 4:06 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx, mingo,
bp, hpa, akpm, Rik van Riel
Add invlpgb.h with the helper functions and definitions needed to use
broadcast TLB invalidation on AMD EPYC 3 and newer CPUs.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/invlpgb.h | 98 +++++++++++++++++++++++++++++++++
arch/x86/include/asm/tlbflush.h | 1 +
2 files changed, 99 insertions(+)
create mode 100644 arch/x86/include/asm/invlpgb.h
diff --git a/arch/x86/include/asm/invlpgb.h b/arch/x86/include/asm/invlpgb.h
new file mode 100644
index 000000000000..ef12ea4a8d65
--- /dev/null
+++ b/arch/x86/include/asm/invlpgb.h
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_INVLPGB
+#define _ASM_X86_INVLPGB
+
+#include <vdso/bits.h>
+
+/*
+ * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
+ *
+ * The INVLPGB instruction is weakly ordered, and a batch of invalidations can be done in
+ * a parallel fashion.
+ *
+ * TLBSYNC is used to ensure that pending INVLPGB invalidations initiated from this CPU
+ * have completed.
+ */
+static inline void __invlpgb(unsigned long asid, unsigned long pcid, unsigned long addr,
+ int extra_count, bool pmd_stride, unsigned long flags)
+{
+ u64 rax = addr | flags;
+ u32 ecx = (pmd_stride << 31) | extra_count;
+ u32 edx = (pcid << 16) | asid;
+
+ /*
+ * The memory clobber is because the whole point is to invalidate
+ * stale TLB entries and, especially if we're flushing global
+ * mappings, we don't want the compiler to reorder any subsequent
+ * memory accesses before the TLB flush.
+ */
+ asm volatile("invlpgb" : : "a" (rax), "c" (ecx), "d" (edx));
+}
+
+/*
+ * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination of the
+ * three. For example:
+ * - INVLPGB_VA | INVLPGB_INCLUDE_GLOBAL: invalidate all TLB entries at the address
+ * - INVLPGB_PCID: invalidate all TLB entries matching the PCID
+ * - INVLPGB_VA | INVLPGB_ASID: invalidate TLB entries matching the address & ASID
+ *
+ * The first can be used to invalidate kernel mappings across all processes.
+ * The last can be used to invalidate mappings across both PCIDs of a process when using PTI.
+ */
+#define INVLPGB_VA BIT(0)
+#define INVLPGB_PCID BIT(1)
+#define INVLPGB_ASID BIT(2)
+#define INVLPGB_INCLUDE_GLOBAL BIT(3)
+#define INVLPGB_FINAL_ONLY BIT(4)
+#define INVLPGB_INCLUDE_NESTED BIT(5)
+
+/* Flush all mappings for a given pcid and addr, not including globals. */
+static inline void invlpgb_flush_user(unsigned long pcid,
+ unsigned long addr)
+{
+ __invlpgb(0, pcid, addr, 0, 0, INVLPGB_PCID | INVLPGB_VA);
+}
+
+static inline void invlpgb_flush_user_nr(unsigned long pcid, unsigned long addr, int nr,
+ bool pmd_stride)
+{
+ __invlpgb(0, pcid, addr, nr - 1, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
+}
+
+/* Flush all mappings for a given ASID, not including globals. */
+static inline void invlpgb_flush_single_asid(unsigned long asid)
+{
+ __invlpgb(asid, 0, 0, 0, 0, INVLPGB_ASID);
+}
+
+/* Flush all mappings for a given PCID, not including globals. */
+static inline void invlpgb_flush_single_pcid(unsigned long pcid)
+{
+ __invlpgb(0, pcid, 0, 0, 0, INVLPGB_PCID);
+}
+
+/* Flush all mappings, including globals, for all PCIDs. */
+static inline void invlpgb_flush_all(void)
+{
+ __invlpgb(0, 0, 0, 0, 0, INVLPGB_INCLUDE_GLOBAL);
+}
+
+/* Flush addr, including globals, for all PCIDs. */
+static inline void invlpgb_flush_addr(unsigned long addr, int nr)
+{
+ __invlpgb(0, 0, addr, nr - 1, 0, INVLPGB_INCLUDE_GLOBAL);
+}
+
+/* Flush all mappings for all PCIDs except globals. */
+static inline void invlpgb_flush_all_nonglobals(void)
+{
+ __invlpgb(0, 0, 0, 0, 0, 0);
+}
+
+/* Wait for INVLPGB originated by this CPU to complete. */
+static inline void tlbsync(void)
+{
+ asm volatile("tlbsync");
+}
+
+#endif /* _ASM_X86_INVLPGB */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 02fc2aa06e9e..1f518fb9abba 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -10,6 +10,7 @@
#include <asm/cpufeature.h>
#include <asm/special_insns.h>
#include <asm/smp.h>
+#include <asm/invlpgb.h>
#include <asm/invpcid.h>
#include <asm/pti.h>
#include <asm/processor-flags.h>
--
2.47.1
^ permalink raw reply related [flat|nested] 28+ messages in thread* Re: [PATCH 03/10] x86,mm: add INVLPGB support code
2024-12-22 4:06 ` [PATCH 03/10] x86,mm: add INVLPGB support code Rik van Riel
@ 2024-12-22 11:05 ` Peter Zijlstra
2024-12-22 21:24 ` Rik van Riel
0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2024-12-22 11:05 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, tglx, mingo,
bp, hpa, akpm
On Sat, Dec 21, 2024 at 11:06:35PM -0500, Rik van Riel wrote:
> +static inline void __invlpgb(unsigned long asid, unsigned long pcid, unsigned long addr,
> + int extra_count, bool pmd_stride, unsigned long flags)
> +{
> + u64 rax = addr | flags;
> + u32 ecx = (pmd_stride << 31) | extra_count;
> + u32 edx = (pcid << 16) | asid;
> +
> + /*
> + * The memory clobber is because the whole point is to invalidate
> + * stale TLB entries and, especially if we're flushing global
> + * mappings, we don't want the compiler to reorder any subsequent
> + * memory accesses before the TLB flush.
> + */
> + asm volatile("invlpgb" : : "a" (rax), "c" (ecx), "d" (edx));
What memory clobber? Is "memory" gone missing?
> +}
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH 03/10] x86,mm: add INVLPGB support code
2024-12-22 11:05 ` Peter Zijlstra
@ 2024-12-22 21:24 ` Rik van Riel
0 siblings, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2024-12-22 21:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, tglx, mingo,
bp, hpa, akpm
On Sun, 2024-12-22 at 12:05 +0100, Peter Zijlstra wrote:
> On Sat, Dec 21, 2024 at 11:06:35PM -0500, Rik van Riel wrote:
>
> > +static inline void __invlpgb(unsigned long asid, unsigned long
> > pcid, unsigned long addr,
> > + int extra_count, bool pmd_stride,
> > unsigned long flags)
> > +{
> > + u64 rax = addr | flags;
> > + u32 ecx = (pmd_stride << 31) | extra_count;
> > + u32 edx = (pcid << 16) | asid;
> > +
> > + /*
> > + * The memory clobber is because the whole point is to
> > invalidate
> > + * stale TLB entries and, especially if we're flushing
> > global
> > + * mappings, we don't want the compiler to reorder any
> > subsequent
> > + * memory accesses before the TLB flush.
> > + */
> > + asm volatile("invlpgb" : : "a" (rax), "c" (ecx), "d"
> > (edx));
>
> What memory clobber? Is "memory" gone missing?
I'm not even sure where that comment came from any more.
We do not need any barriers here, because one of the
big points of INVLPGB is that the flush is kicked off
asynchronously, and flushes may end up being done out
of order, many can be pending simultaneously, etc.
The sync point is the TLBSYNC instruction.
I removed that comment.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 04/10] x86,mm: use INVLPGB for kernel TLB flushes
2024-12-22 4:06 [RFC PATCH 00/10] AMD broadcast TLB invalidation Rik van Riel
` (2 preceding siblings ...)
2024-12-22 4:06 ` [PATCH 03/10] x86,mm: add INVLPGB support code Rik van Riel
@ 2024-12-22 4:06 ` Rik van Riel
2024-12-22 11:16 ` Peter Zijlstra
2024-12-22 11:47 ` Borislav Petkov
2024-12-22 4:06 ` [PATCH 05/10] x86,tlb: use INVLPGB in flush_tlb_all Rik van Riel
` (6 subsequent siblings)
10 siblings, 2 replies; 28+ messages in thread
From: Rik van Riel @ 2024-12-22 4:06 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx, mingo,
bp, hpa, akpm, Rik van Riel
Use broadcast TLB invalidation for kernel addresses when available.
This stops us from having to send IPIs for kernel TLB flushes.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/mm/tlb.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6cf881a942bb..09980fb17907 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1077,6 +1077,33 @@ void flush_tlb_all(void)
on_each_cpu(do_flush_tlb_all, NULL, 1);
}
+static void broadcast_kernel_range_flush(unsigned long start, unsigned long end)
+{
+ unsigned long addr;
+ unsigned long maxnr = boot_cpu_data.invlpgb_count_max;
+ unsigned long threshold = tlb_single_page_flush_ceiling * maxnr;
+
+ /*
+ * TLBSYNC only waits for flushes originating on the same CPU.
+ * Disabling migration allows us to wait on all flushes.
+ */
+ migrate_disable();
+
+ if (end == TLB_FLUSH_ALL ||
+ (end - start) > threshold << PAGE_SHIFT) {
+ invlpgb_flush_all();
+ } else {
+ unsigned long nr;
+ for (addr = start; addr < end; addr += nr << PAGE_SHIFT) {
+ nr = min((end - addr) >> PAGE_SHIFT, maxnr);
+ invlpgb_flush_addr(addr, nr);
+ }
+ }
+
+ tlbsync();
+ migrate_enable();
+}
+
static void do_kernel_range_flush(void *info)
{
struct flush_tlb_info *f = info;
@@ -1089,6 +1116,11 @@ static void do_kernel_range_flush(void *info)
void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
+ if (static_cpu_has(X86_FEATURE_INVLPGB)) {
+ broadcast_kernel_range_flush(start, end);
+ return;
+ }
+
/* Balance as user space task's flush, a bit conservative */
if (end == TLB_FLUSH_ALL ||
(end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
--
2.47.1
^ permalink raw reply related [flat|nested] 28+ messages in thread* Re: [PATCH 04/10] x86,mm: use INVLPGB for kernel TLB flushes
2024-12-22 4:06 ` [PATCH 04/10] x86,mm: use INVLPGB for kernel TLB flushes Rik van Riel
@ 2024-12-22 11:16 ` Peter Zijlstra
2024-12-22 15:12 ` Rik van Riel
2024-12-22 11:47 ` Borislav Petkov
1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2024-12-22 11:16 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, tglx, mingo,
bp, hpa, akpm
On Sat, Dec 21, 2024 at 11:06:36PM -0500, Rik van Riel wrote:
> Use broadcast TLB invalidation for kernel addresses when available.
>
> This stops us from having to send IPIs for kernel TLB flushes.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> arch/x86/mm/tlb.c | 32 ++++++++++++++++++++++++++++++++
> 1 file changed, 32 insertions(+)
>
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 6cf881a942bb..09980fb17907 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1077,6 +1077,33 @@ void flush_tlb_all(void)
> on_each_cpu(do_flush_tlb_all, NULL, 1);
> }
>
> +static void broadcast_kernel_range_flush(unsigned long start, unsigned long end)
> +{
> + unsigned long addr;
> + unsigned long maxnr = boot_cpu_data.invlpgb_count_max;
> + unsigned long threshold = tlb_single_page_flush_ceiling * maxnr;
> +
> + /*
> + * TLBSYNC only waits for flushes originating on the same CPU.
> + * Disabling migration allows us to wait on all flushes.
> + */
> + migrate_disable();
So how expensive is all this? That is, I think I would feel better is
this were preempt_disable().
> +
> + if (end == TLB_FLUSH_ALL ||
> + (end - start) > threshold << PAGE_SHIFT) {
> + invlpgb_flush_all();
> + } else {
> + unsigned long nr;
> + for (addr = start; addr < end; addr += nr << PAGE_SHIFT) {
> + nr = min((end - addr) >> PAGE_SHIFT, maxnr);
> + invlpgb_flush_addr(addr, nr);
> + }
> + }
> +
> + tlbsync();
> + migrate_enable();
> +}
> +
> static void do_kernel_range_flush(void *info)
> {
> struct flush_tlb_info *f = info;
> @@ -1089,6 +1116,11 @@ static void do_kernel_range_flush(void *info)
>
> void flush_tlb_kernel_range(unsigned long start, unsigned long end)
> {
> + if (static_cpu_has(X86_FEATURE_INVLPGB)) {
> + broadcast_kernel_range_flush(start, end);
> + return;
> + }
> +
> /* Balance as user space task's flush, a bit conservative */
> if (end == TLB_FLUSH_ALL ||
> (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
> --
> 2.47.1
>
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH 04/10] x86,mm: use INVLPGB for kernel TLB flushes
2024-12-22 11:16 ` Peter Zijlstra
@ 2024-12-22 15:12 ` Rik van Riel
2024-12-24 18:13 ` Peter Zijlstra
0 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2024-12-22 15:12 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, tglx, mingo,
bp, hpa, akpm
On Sun, 2024-12-22 at 12:16 +0100, Peter Zijlstra wrote:
> On Sat, Dec 21, 2024 at 11:06:36PM -0500, Rik van Riel wrote:
> > Use broadcast TLB invalidation for kernel addresses when available.
> >
> > +static void broadcast_kernel_range_flush(unsigned long start,
> > unsigned long end)
> > +{
> > + unsigned long addr;
> > + unsigned long maxnr = boot_cpu_data.invlpgb_count_max;
> > + unsigned long threshold = tlb_single_page_flush_ceiling *
> > maxnr;
> > +
> > + /*
> > + * TLBSYNC only waits for flushes originating on the same
> > CPU.
> > + * Disabling migration allows us to wait on all flushes.
> > + */
> > + migrate_disable();
>
> So how expensive is all this? That is, I think I would feel better is
> this were preempt_disable().
Either should work. If preempt_disable() is cheaper,
I'm happy to use that.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH 04/10] x86,mm: use INVLPGB for kernel TLB flushes
2024-12-22 15:12 ` Rik van Riel
@ 2024-12-24 18:13 ` Peter Zijlstra
0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2024-12-24 18:13 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, tglx, mingo,
bp, hpa, akpm
On Sun, Dec 22, 2024 at 10:12:56AM -0500, Rik van Riel wrote:
> On Sun, 2024-12-22 at 12:16 +0100, Peter Zijlstra wrote:
> > On Sat, Dec 21, 2024 at 11:06:36PM -0500, Rik van Riel wrote:
> > > Use broadcast TLB invalidation for kernel addresses when available.
> > >
> > > +static void broadcast_kernel_range_flush(unsigned long start,
> > > unsigned long end)
> > > +{
> > > + unsigned long addr;
> > > + unsigned long maxnr = boot_cpu_data.invlpgb_count_max;
> > > + unsigned long threshold = tlb_single_page_flush_ceiling *
> > > maxnr;
> > > +
> > > + /*
> > > + * TLBSYNC only waits for flushes originating on the same
> > > CPU.
> > > + * Disabling migration allows us to wait on all flushes.
> > > + */
> > > + migrate_disable();
> >
> > So how expensive is all this? That is, I think I would feel better is
> > this were preempt_disable().
>
> Either should work. If preempt_disable() is cheaper,
> I'm happy to use that.
I'm not sure about cheaper -- but getting arbitrary scheduling delays in
the middle of TLBi sounds like waaay to much 'fun'.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 04/10] x86,mm: use INVLPGB for kernel TLB flushes
2024-12-22 4:06 ` [PATCH 04/10] x86,mm: use INVLPGB for kernel TLB flushes Rik van Riel
2024-12-22 11:16 ` Peter Zijlstra
@ 2024-12-22 11:47 ` Borislav Petkov
1 sibling, 0 replies; 28+ messages in thread
From: Borislav Petkov @ 2024-12-22 11:47 UTC (permalink / raw)
To: Rik van Riel, x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx, mingo,
hpa, akpm
On December 22, 2024 5:06:36 AM GMT+01:00, Rik van Riel <riel@surriel.com> wrote:
>@@ -1089,6 +1116,11 @@ static void do_kernel_range_flush(void *info)
>
> void flush_tlb_kernel_range(unsigned long start, unsigned long end)
> {
>+ if (static_cpu_has(X86_FEATURE_INVLPGB)) {
cpu_feature_enabled
--
Sent from a small device: formatting sucks and brevity is inevitable.
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 05/10] x86,tlb: use INVLPGB in flush_tlb_all
2024-12-22 4:06 [RFC PATCH 00/10] AMD broadcast TLB invalidation Rik van Riel
` (3 preceding siblings ...)
2024-12-22 4:06 ` [PATCH 04/10] x86,mm: use INVLPGB for kernel TLB flushes Rik van Riel
@ 2024-12-22 4:06 ` Rik van Riel
2024-12-22 11:17 ` Peter Zijlstra
2024-12-22 4:06 ` [PATCH 06/10] x86,mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
` (5 subsequent siblings)
10 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2024-12-22 4:06 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx, mingo,
bp, hpa, akpm, Rik van Riel
The flush_tlb_all() function is not used a whole lot, but we might
as well use broadcast TLB flushing there, too.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/mm/tlb.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 09980fb17907..bf85cd0590d5 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1074,6 +1074,11 @@ static void do_flush_tlb_all(void *info)
void flush_tlb_all(void)
{
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+ if (static_cpu_has(X86_FEATURE_INVLPGB)) {
+ invlpgb_flush_all();
+ tlbsync();
+ return;
+ }
on_each_cpu(do_flush_tlb_all, NULL, 1);
}
--
2.47.1
^ permalink raw reply related [flat|nested] 28+ messages in thread* Re: [PATCH 05/10] x86,tlb: use INVLPGB in flush_tlb_all
2024-12-22 4:06 ` [PATCH 05/10] x86,tlb: use INVLPGB in flush_tlb_all Rik van Riel
@ 2024-12-22 11:17 ` Peter Zijlstra
0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2024-12-22 11:17 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, tglx, mingo,
bp, hpa, akpm
On Sat, Dec 21, 2024 at 11:06:37PM -0500, Rik van Riel wrote:
> The flush_tlb_all() function is not used a whole lot, but we might
> as well use broadcast TLB flushing there, too.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> arch/x86/mm/tlb.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 09980fb17907..bf85cd0590d5 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1074,6 +1074,11 @@ static void do_flush_tlb_all(void *info)
> void flush_tlb_all(void)
> {
> count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
> + if (static_cpu_has(X86_FEATURE_INVLPGB)) {
guard(preempt)();
?
> + invlpgb_flush_all();
> + tlbsync();
> + return;
> + }
> on_each_cpu(do_flush_tlb_all, NULL, 1);
> }
>
> --
> 2.47.1
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 06/10] x86,mm: use broadcast TLB flushing for page reclaim TLB flushing
2024-12-22 4:06 [RFC PATCH 00/10] AMD broadcast TLB invalidation Rik van Riel
` (4 preceding siblings ...)
2024-12-22 4:06 ` [PATCH 05/10] x86,tlb: use INVLPGB in flush_tlb_all Rik van Riel
@ 2024-12-22 4:06 ` Rik van Riel
2024-12-22 11:18 ` Peter Zijlstra
2024-12-22 4:06 ` [PATCH 07/10] x86,mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
` (4 subsequent siblings)
10 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2024-12-22 4:06 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx, mingo,
bp, hpa, akpm, Rik van Riel
In the page reclaim code, we only track the CPU(s) where the TLB needs
to be flushed, rather than all the individual mappings that may be getting
invalidated.
Use broadcast TLB flushing when that is available.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/mm/tlb.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index bf85cd0590d5..9422b10edec1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1313,6 +1313,12 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
int cpu = get_cpu();
+ if (static_cpu_has(X86_FEATURE_INVLPGB)) {
+ invlpgb_flush_all_nonglobals();
+ tlbsync();
+ goto out_put_cpu;
+ }
+
info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
TLB_GENERATION_INVALID);
/*
@@ -1332,6 +1338,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
cpumask_clear(&batch->cpumask);
put_flush_tlb_info();
+out_put_cpu:
put_cpu();
}
--
2.47.1
^ permalink raw reply related [flat|nested] 28+ messages in thread* Re: [PATCH 06/10] x86,mm: use broadcast TLB flushing for page reclaim TLB flushing
2024-12-22 4:06 ` [PATCH 06/10] x86,mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
@ 2024-12-22 11:18 ` Peter Zijlstra
0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2024-12-22 11:18 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, tglx, mingo,
bp, hpa, akpm
On Sat, Dec 21, 2024 at 11:06:38PM -0500, Rik van Riel wrote:
> In the page reclaim code, we only track the CPU(s) where the TLB needs
> to be flushed, rather than all the individual mappings that may be getting
> invalidated.
>
> Use broadcast TLB flushing when that is available.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> arch/x86/mm/tlb.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index bf85cd0590d5..9422b10edec1 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1313,6 +1313,12 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>
> int cpu = get_cpu();
>
> + if (static_cpu_has(X86_FEATURE_INVLPGB)) {
> + invlpgb_flush_all_nonglobals();
> + tlbsync();
> + goto out_put_cpu;
> + }
Urgh, move this before the get_cpu(), and write it like:
if (static_cpu_has(X86_FEATURE_INVLPGB)) {
guard(preempt)();
invlpgb_flush_all_nonglobals();
tlbsync();
return;
}
?
> +
> info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
> TLB_GENERATION_INVALID);
> /*
> @@ -1332,6 +1338,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> cpumask_clear(&batch->cpumask);
>
> put_flush_tlb_info();
> +out_put_cpu:
> put_cpu();
> }
>
> --
> 2.47.1
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 07/10] x86,mm: enable broadcast TLB invalidation for multi-threaded processes
2024-12-22 4:06 [RFC PATCH 00/10] AMD broadcast TLB invalidation Rik van Riel
` (5 preceding siblings ...)
2024-12-22 4:06 ` [PATCH 06/10] x86,mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
@ 2024-12-22 4:06 ` Rik van Riel
2024-12-22 11:36 ` Peter Zijlstra
2024-12-22 4:06 ` [PATCH 08/10] x86,tlb: do targeted broadcast flushing from tlbbatch code Rik van Riel
` (3 subsequent siblings)
10 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2024-12-22 4:06 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx, mingo,
bp, hpa, akpm, Rik van Riel
Use broadcast TLB invalidation, using the INVPLGB instruction, on AMD EPYC 3
and newer CPUs.
In order to not exhaust PCID space, and keep TLB flushes local for single
threaded processes, we only hand out broadcast ASIDs to processes active on
3 or more CPUs, and gradually increase the threshold as broadcast ASID space
is depleted.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/mmu.h | 6 +
arch/x86/include/asm/mmu_context.h | 12 ++
arch/x86/include/asm/tlbflush.h | 15 ++
arch/x86/mm/tlb.c | 322 ++++++++++++++++++++++++++++-
4 files changed, 346 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 3b496cdcb74b..a8e8dfa5a520 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -48,6 +48,12 @@ typedef struct {
unsigned long flags;
#endif
+#ifdef CONFIG_CPU_SUP_AMD
+ struct list_head broadcast_asid_list;
+ u16 broadcast_asid;
+ bool asid_transition;
+#endif
+
#ifdef CONFIG_ADDRESS_MASKING
/* Active LAM mode: X86_CR3_LAM_U48 or X86_CR3_LAM_U57 or 0 (disabled) */
unsigned long lam_cr3_mask;
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 795fdd53bd0a..0dc446c427d2 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -139,6 +139,8 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
#define enter_lazy_tlb enter_lazy_tlb
extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
+extern void destroy_context_free_broadcast_asid(struct mm_struct *mm);
+
/*
* Init a new mm. Used on mm copies, like at fork()
* and on mm's that are brand-new, like at execve().
@@ -161,6 +163,13 @@ static inline int init_new_context(struct task_struct *tsk,
mm->context.execute_only_pkey = -1;
}
#endif
+
+#ifdef CONFIG_CPU_SUP_AMD
+ INIT_LIST_HEAD(&mm->context.broadcast_asid_list);
+ mm->context.broadcast_asid = 0;
+ mm->context.asid_transition = false;
+#endif
+
mm_reset_untag_mask(mm);
init_new_context_ldt(mm);
return 0;
@@ -170,6 +179,9 @@ static inline int init_new_context(struct task_struct *tsk,
static inline void destroy_context(struct mm_struct *mm)
{
destroy_context_ldt(mm);
+#ifdef CONFIG_CPU_SUP_AMD
+ destroy_context_free_broadcast_asid(mm);
+#endif
}
extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 1f518fb9abba..a59f56c9b355 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -65,6 +65,21 @@ static inline void cr4_clear_bits(unsigned long mask)
*/
#define TLB_NR_DYN_ASIDS 6
+#ifdef CONFIG_CPU_SUP_AMD
+#define is_dyn_asid(asid) (asid) < TLB_NR_DYN_ASIDS
+#define is_broadcast_asid(asid) (asid) >= TLB_NR_DYN_ASIDS
+#define in_asid_transition(info) (info->mm && info->mm->context.asid_transition)
+#else
+#define is_dyn_asid(asid) true
+#define is_broadcast_asid(asid) false
+#define in_asid_transition(info) false
+
+inline bool needs_broadcast_asid_reload(struct mm_struct *next, u16 prev_asid)
+{
+ return false;
+}
+#endif
+
struct tlb_context {
u64 ctx_id;
u64 tlb_gen;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 9422b10edec1..11ecffa26567 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -74,13 +74,15 @@
* use different names for each of them:
*
* ASID - [0, TLB_NR_DYN_ASIDS-1]
- * the canonical identifier for an mm
+ * the canonical identifier for an mm, dynamically allocated on each CPU
+ * [TLB_NR_DYN_ASIDS, MAX_ASID_AVAILABLE-1]
+ * the canonical, global identifier for an mm, identical across all CPUs
*
- * kPCID - [1, TLB_NR_DYN_ASIDS]
+ * kPCID - [1, MAX_ASID_AVAILABLE]
* the value we write into the PCID part of CR3; corresponds to the
* ASID+1, because PCID 0 is special.
*
- * uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
+ * uPCID - [2048 + 1, 2048 + MAX_ASID_AVAILABLE]
* for KPTI each mm has two address spaces and thus needs two
* PCID values, but we can still do with a single ASID denomination
* for each mm. Corresponds to kPCID + 2048.
@@ -225,6 +227,18 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
return;
}
+ /*
+ * TLB consistency for this ASID is maintained with INVLPGB;
+ * TLB flushes happen even while the process isn't running.
+ */
+#ifdef CONFIG_CPU_SUP_AMD
+ if (static_cpu_has(X86_FEATURE_INVLPGB) && next->context.broadcast_asid) {
+ *new_asid = next->context.broadcast_asid;
+ *need_flush = false;
+ return;
+ }
+#endif
+
if (this_cpu_read(cpu_tlbstate.invalidate_other))
clear_asid_other();
@@ -251,6 +265,257 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
*need_flush = true;
}
+#ifdef CONFIG_CPU_SUP_AMD
+/*
+ * Logic for AMD INVLPGB support.
+ */
+static DEFINE_SPINLOCK(broadcast_asid_lock);
+static u16 last_broadcast_asid = TLB_NR_DYN_ASIDS;
+static DECLARE_BITMAP(broadcast_asid_used, MAX_ASID_AVAILABLE) = { 0 };
+static LIST_HEAD(broadcast_asid_list);
+static int broadcast_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
+
+static void reset_broadcast_asid_space(void)
+{
+ mm_context_t *context;
+
+ assert_spin_locked(&broadcast_asid_lock);
+
+ /*
+ * Flush once when we wrap around the ASID space, so we won't need
+ * to flush every time we allocate an ASID for boradcast flushing.
+ */
+ invlpgb_flush_all_nonglobals();
+ tlbsync();
+
+ /*
+ * Leave the currently used broadcast ASIDs set in the bitmap, since
+ * those cannot be reused before the next wraparound and flush..
+ */
+ bitmap_clear(broadcast_asid_used, 0, MAX_ASID_AVAILABLE);
+ list_for_each_entry(context, &broadcast_asid_list, broadcast_asid_list)
+ __set_bit(context->broadcast_asid, broadcast_asid_used);
+
+ last_broadcast_asid = TLB_NR_DYN_ASIDS;
+}
+
+static u16 get_broadcast_asid(void)
+{
+ assert_spin_locked(&broadcast_asid_lock);
+
+ do {
+ u16 start = last_broadcast_asid;
+ u16 asid = find_next_zero_bit(broadcast_asid_used, MAX_ASID_AVAILABLE, start);
+
+ if (asid >= MAX_ASID_AVAILABLE) {
+ reset_broadcast_asid_space();
+ continue;
+ }
+
+ /* Try claiming this broadcast ASID. */
+ if (!test_and_set_bit(asid, broadcast_asid_used)) {
+ last_broadcast_asid = asid;
+ return asid;
+ }
+ } while (1);
+}
+
+/*
+ * Returns true if the mm is transitioning from a CPU-local ASID to a broadcast
+ * (INVLPGB) ASID, or the other way around.
+ */
+static bool needs_broadcast_asid_reload(struct mm_struct *next, u16 prev_asid)
+{
+ u16 broadcast_asid = next->context.broadcast_asid;
+
+ if (broadcast_asid && prev_asid != broadcast_asid) {
+ return true;
+ }
+
+ if (!broadcast_asid && is_broadcast_asid(prev_asid)) {
+ return true;
+ }
+
+ return false;
+}
+
+void destroy_context_free_broadcast_asid(struct mm_struct *mm) {
+ unsigned long flags;
+
+ if (!mm->context.broadcast_asid)
+ return;
+
+ spin_lock_irqsave(&broadcast_asid_lock, flags);
+ mm->context.broadcast_asid = 0;
+ list_del(&mm->context.broadcast_asid_list);
+ broadcast_asid_available++;
+ spin_unlock_irqrestore(&broadcast_asid_lock, flags);
+}
+
+static int mm_active_cpus(struct mm_struct *mm)
+{
+ int count = 0;
+ int cpu;
+
+ for_each_cpu(cpu, mm_cpumask(mm)) {
+ /* Skip the CPUs that aren't really running this process. */
+ if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
+ continue;
+
+ if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu))
+ continue;
+
+ count++;
+ }
+ return count;
+}
+
+/*
+ * Assign a broadcast ASID to the current process, protecting against
+ * races between multiple threads in the process.
+ */
+static void use_broadcast_asid(struct mm_struct *mm)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&broadcast_asid_lock, flags);
+
+ /* This process is already using broadcast TLB invalidation. */
+ if (mm->context.broadcast_asid)
+ goto out_unlock;
+
+ mm->context.broadcast_asid = get_broadcast_asid();
+ mm->context.asid_transition = true;
+ list_add(&mm->context.broadcast_asid_list, &broadcast_asid_list);
+ broadcast_asid_available--;
+
+out_unlock:
+ spin_unlock_irqrestore(&broadcast_asid_lock, flags);
+}
+
+/*
+ * Figure out whether to assign a broadcast (global) ASID to a process.
+ * We vary the threshold by how empty or full broadcast ASID space is.
+ * 1/4 full: >= 4 active threads
+ * 1/2 full: >= 8 active threads
+ * 3/4 full: >= 16 active threads
+ * 7/8 full: >= 32 active threads
+ * etc
+ *
+ * This way we should never exhaust the broadcast ASID space, even on very
+ * large systems, and the processes with the largest number of active
+ * threads should be able to use broadcast TLB invalidation.
+ */
+#define HALFFULL_THRESHOLD 8
+static bool meets_broadcast_asid_threshold(struct mm_struct *mm)
+{
+ int avail = broadcast_asid_available;
+ int threshold = HALFFULL_THRESHOLD;
+ int mm_active_threads;
+
+ if (!avail)
+ return false;
+
+ mm_active_threads = mm_active_cpus(mm);
+
+ /* Small processes can just use IPI TLB flushing. */
+ if (mm_active_threads < 3)
+ return false;
+
+ if (avail > MAX_ASID_AVAILABLE * 3 / 4) {
+ threshold = HALFFULL_THRESHOLD / 4;
+ } else if (avail > MAX_ASID_AVAILABLE / 2) {
+ threshold = HALFFULL_THRESHOLD / 2;
+ } else if (avail < MAX_ASID_AVAILABLE / 3) {
+ do {
+ avail *= 2;
+ threshold *= 2;
+ } while ((avail + threshold ) < MAX_ASID_AVAILABLE / 2);
+ }
+
+ return mm_active_threads > threshold;
+}
+
+static void count_tlb_flush(struct mm_struct *mm)
+{
+ if (!static_cpu_has(X86_FEATURE_INVLPGB))
+ return;
+
+ /* Check every once in a while. */
+ if ((current->pid & 0x1f) != (jiffies & 0x1f))
+ return;
+
+ if (meets_broadcast_asid_threshold(mm))
+ use_broadcast_asid(mm);
+}
+
+static void finish_asid_transition(struct flush_tlb_info *info)
+{
+ struct mm_struct *mm = info->mm;
+ int bc_asid = mm->context.broadcast_asid;
+ int cpu;
+
+ if (!mm->context.asid_transition)
+ return;
+
+ for_each_cpu(cpu, mm_cpumask(mm)) {
+ if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
+ continue;
+
+ /*
+ * If at least one CPU is not using the broadcast ASID yet,
+ * send a TLB flush IPI. The IPI should cause stragglers
+ * to transition soon.
+ */
+ if (per_cpu(cpu_tlbstate.loaded_mm_asid, cpu) != bc_asid) {
+ flush_tlb_multi(mm_cpumask(info->mm), info);
+ return;
+ }
+ }
+
+ /* All the CPUs running this process are using the broadcast ASID. */
+ mm->context.asid_transition = 0;
+}
+
+static void broadcast_tlb_flush(struct flush_tlb_info *info)
+{
+ bool pmd = info->stride_shift == PMD_SHIFT;
+ unsigned long maxnr = boot_cpu_data.invlpgb_count_max;
+ unsigned long asid = info->mm->context.broadcast_asid;
+ unsigned long addr = info->start;
+ unsigned long nr;
+
+ /* Flushing multiple pages at once is not supported with 1GB pages. */
+ if (info->stride_shift > PMD_SHIFT)
+ maxnr = 1;
+
+ if (info->end == TLB_FLUSH_ALL) {
+ invlpgb_flush_single_pcid(kern_pcid(asid));
+ /* Do any CPUs supporting INVLPGB need PTI? */
+ if (static_cpu_has(X86_FEATURE_PTI))
+ invlpgb_flush_single_pcid(user_pcid(asid));
+ } else do {
+ /*
+ * Calculate how many pages can be flushed at once; if the
+ * remainder of the range is less than one page, flush one.
+ */
+ nr = min(maxnr, (info->end - addr) >> info->stride_shift);
+ nr = max(nr, 1);
+
+ invlpgb_flush_user_nr(kern_pcid(asid), addr, nr, pmd);
+ /* Do any CPUs supporting INVLPGB need PTI? */
+ if (static_cpu_has(X86_FEATURE_PTI))
+ invlpgb_flush_user_nr(user_pcid(asid), addr, nr, pmd);
+ addr += nr << info->stride_shift;
+ } while (addr < info->end);
+
+ finish_asid_transition(info);
+
+ /* Wait for the INVLPGBs kicked off above to finish. */
+ tlbsync();
+}
+#endif /* CONFIG_CPU_SUP_AMD */
+
/*
* Given an ASID, flush the corresponding user ASID. We can delay this
* until the next time we switch to it.
@@ -556,8 +821,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
*/
if (prev == next) {
/* Not actually switching mm's */
- VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
- next->context.ctx_id);
+ if (is_dyn_asid(prev_asid))
+ VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
+ next->context.ctx_id);
/*
* If this races with another thread that enables lam, 'new_lam'
@@ -573,6 +839,23 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
!cpumask_test_cpu(cpu, mm_cpumask(next))))
cpumask_set_cpu(cpu, mm_cpumask(next));
+ /*
+ * Check if the current mm is transitioning to a new ASID.
+ */
+ if (needs_broadcast_asid_reload(next, prev_asid)) {
+ next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+
+ choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
+ goto reload_tlb;
+ }
+
+ /*
+ * Broadcast TLB invalidation keeps this PCID up to date
+ * all the time.
+ */
+ if (is_broadcast_asid(prev_asid))
+ return;
+
/*
* If the CPU is not in lazy TLB mode, we are just switching
* from one thread in a process to another thread in the same
@@ -626,8 +909,10 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
barrier();
}
+reload_tlb:
new_lam = mm_lam_cr3_mask(next);
if (need_flush) {
+ VM_BUG_ON(is_broadcast_asid(new_asid));
this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
load_new_mm_cr3(next->pgd, new_asid, new_lam, true);
@@ -746,7 +1031,7 @@ static void flush_tlb_func(void *info)
const struct flush_tlb_info *f = info;
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
- u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+ u64 local_tlb_gen;
bool local = smp_processor_id() == f->initiating_cpu;
unsigned long nr_invalidate = 0;
u64 mm_tlb_gen;
@@ -769,6 +1054,16 @@ static void flush_tlb_func(void *info)
if (unlikely(loaded_mm == &init_mm))
return;
+ /* Reload the ASID if transitioning into or out of a broadcast ASID */
+ if (needs_broadcast_asid_reload(loaded_mm, loaded_mm_asid)) {
+ switch_mm_irqs_off(NULL, loaded_mm, NULL);
+ loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+ }
+
+ /* Broadcast ASIDs are always kept up to date with INVLPGB. */
+ if (is_broadcast_asid(loaded_mm_asid))
+ return;
+
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
loaded_mm->context.ctx_id);
@@ -786,6 +1081,8 @@ static void flush_tlb_func(void *info)
return;
}
+ local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+
if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
f->new_tlb_gen <= local_tlb_gen)) {
/*
@@ -953,7 +1250,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
* up on the new contents of what used to be page tables, while
* doing a speculative memory access.
*/
- if (info->freed_tables)
+ if (info->freed_tables || in_asid_transition(info))
on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
else
on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
@@ -1026,14 +1323,18 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
bool freed_tables)
{
struct flush_tlb_info *info;
+ unsigned long threshold = tlb_single_page_flush_ceiling;
u64 new_tlb_gen;
int cpu;
+ if (static_cpu_has(X86_FEATURE_INVLPGB))
+ threshold *= boot_cpu_data.invlpgb_count_max;
+
cpu = get_cpu();
/* Should we flush just the requested range? */
if ((end == TLB_FLUSH_ALL) ||
- ((end - start) >> stride_shift) > tlb_single_page_flush_ceiling) {
+ ((end - start) >> stride_shift) > threshold) {
start = 0;
end = TLB_FLUSH_ALL;
}
@@ -1049,9 +1350,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
* a local TLB flush is needed. Optimize this use-case by calling
* flush_tlb_func_local() directly in this case.
*/
- if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+ if (IS_ENABLED(CONFIG_CPU_SUP_AMD) && mm->context.broadcast_asid) {
+ broadcast_tlb_flush(info);
+ } else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
info->trim_cpumask = should_trim_cpumask(mm);
flush_tlb_multi(mm_cpumask(mm), info);
+ count_tlb_flush(mm);
} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
lockdep_assert_irqs_enabled();
local_irq_disable();
--
2.47.1
^ permalink raw reply related [flat|nested] 28+ messages in thread* Re: [PATCH 07/10] x86,mm: enable broadcast TLB invalidation for multi-threaded processes
2024-12-22 4:06 ` [PATCH 07/10] x86,mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
@ 2024-12-22 11:36 ` Peter Zijlstra
2024-12-22 22:45 ` Rik van Riel
0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2024-12-22 11:36 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, tglx, mingo,
bp, hpa, akpm
On Sat, Dec 21, 2024 at 11:06:39PM -0500, Rik van Riel wrote:
> +#ifdef CONFIG_CPU_SUP_AMD
> +/*
> + * Logic for AMD INVLPGB support.
> + */
> +static DEFINE_SPINLOCK(broadcast_asid_lock);
RAW_SPINLOCK ?
> +static u16 last_broadcast_asid = TLB_NR_DYN_ASIDS;
> +static DECLARE_BITMAP(broadcast_asid_used, MAX_ASID_AVAILABLE) = { 0 };
> +static LIST_HEAD(broadcast_asid_list);
> +static int broadcast_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
> +
> +static void reset_broadcast_asid_space(void)
> +{
> + mm_context_t *context;
> +
> + assert_spin_locked(&broadcast_asid_lock);
lockdep_assert_locked(&broadcast_asid_lock);
> +
> + /*
> + * Flush once when we wrap around the ASID space, so we won't need
> + * to flush every time we allocate an ASID for boradcast flushing.
> + */
> + invlpgb_flush_all_nonglobals();
> + tlbsync();
> +
> + /*
> + * Leave the currently used broadcast ASIDs set in the bitmap, since
> + * those cannot be reused before the next wraparound and flush..
> + */
> + bitmap_clear(broadcast_asid_used, 0, MAX_ASID_AVAILABLE);
> + list_for_each_entry(context, &broadcast_asid_list, broadcast_asid_list)
> + __set_bit(context->broadcast_asid, broadcast_asid_used);
> +
> + last_broadcast_asid = TLB_NR_DYN_ASIDS;
> +}
> +
> +static u16 get_broadcast_asid(void)
> +{
> + assert_spin_locked(&broadcast_asid_lock);
lockdep_assert_locked
> +
> + do {
> + u16 start = last_broadcast_asid;
> + u16 asid = find_next_zero_bit(broadcast_asid_used, MAX_ASID_AVAILABLE, start);
> +
> + if (asid >= MAX_ASID_AVAILABLE) {
> + reset_broadcast_asid_space();
> + continue;
> + }
> +
> + /* Try claiming this broadcast ASID. */
> + if (!test_and_set_bit(asid, broadcast_asid_used)) {
> + last_broadcast_asid = asid;
> + return asid;
> + }
> + } while (1);
> +}
> +
> +/*
> + * Returns true if the mm is transitioning from a CPU-local ASID to a broadcast
> + * (INVLPGB) ASID, or the other way around.
> + */
> +static bool needs_broadcast_asid_reload(struct mm_struct *next, u16 prev_asid)
> +{
> + u16 broadcast_asid = next->context.broadcast_asid;
> +
> + if (broadcast_asid && prev_asid != broadcast_asid) {
> + return true;
> + }
> +
> + if (!broadcast_asid && is_broadcast_asid(prev_asid)) {
> + return true;
> + }
> +
> + return false;
> +}
Those return statements don't really need {} on.
> +
> +void destroy_context_free_broadcast_asid(struct mm_struct *mm) {
{ goes on a new line.
> + unsigned long flags;
> +
> + if (!mm->context.broadcast_asid)
> + return;
> +
> + spin_lock_irqsave(&broadcast_asid_lock, flags);
guard(raw_spin_lock_irqsave)(&broadcast_asid_lock);
> + mm->context.broadcast_asid = 0;
> + list_del(&mm->context.broadcast_asid_list);
> + broadcast_asid_available++;
> + spin_unlock_irqrestore(&broadcast_asid_lock, flags);
> +}
> +
> +static int mm_active_cpus(struct mm_struct *mm)
> +{
> + int count = 0;
> + int cpu;
> +
> + for_each_cpu(cpu, mm_cpumask(mm)) {
> + /* Skip the CPUs that aren't really running this process. */
> + if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
> + continue;
> +
> + if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu))
> + continue;
> +
> + count++;
> + }
> + return count;
> +}
> +
> +/*
> + * Assign a broadcast ASID to the current process, protecting against
> + * races between multiple threads in the process.
> + */
> +static void use_broadcast_asid(struct mm_struct *mm)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&broadcast_asid_lock, flags);
guard(raw_spin_lock_irqsave)(&broadcast_asid_lock);
> +
> + /* This process is already using broadcast TLB invalidation. */
> + if (mm->context.broadcast_asid)
> + goto out_unlock;
return;
> + mm->context.broadcast_asid = get_broadcast_asid();
> + mm->context.asid_transition = true;
> + list_add(&mm->context.broadcast_asid_list, &broadcast_asid_list);
> + broadcast_asid_available--;
> +
> +out_unlock:
Notably we're really wanting to get away from the whole goto unlock
pattern.
> + spin_unlock_irqrestore(&broadcast_asid_lock, flags);
> +}
> +
> +/*
> + * Figure out whether to assign a broadcast (global) ASID to a process.
> + * We vary the threshold by how empty or full broadcast ASID space is.
> + * 1/4 full: >= 4 active threads
> + * 1/2 full: >= 8 active threads
> + * 3/4 full: >= 16 active threads
> + * 7/8 full: >= 32 active threads
> + * etc
> + *
> + * This way we should never exhaust the broadcast ASID space, even on very
> + * large systems, and the processes with the largest number of active
> + * threads should be able to use broadcast TLB invalidation.
I'm a little confused, at most we need one ASID per CPU, IIRC we have
something like 4k ASIDs (page-offset bits in the physical address bits)
so for anything with less than 4K CPUs we're good, but with anything
having more CPUs we're up a creek irrespective of the above scheme, no?
> + */
> +#define HALFFULL_THRESHOLD 8
> +static bool meets_broadcast_asid_threshold(struct mm_struct *mm)
> +{
> + int avail = broadcast_asid_available;
> + int threshold = HALFFULL_THRESHOLD;
> + int mm_active_threads;
> +
> + if (!avail)
> + return false;
> +
> + mm_active_threads = mm_active_cpus(mm);
> +
> + /* Small processes can just use IPI TLB flushing. */
> + if (mm_active_threads < 3)
> + return false;
> +
> + if (avail > MAX_ASID_AVAILABLE * 3 / 4) {
> + threshold = HALFFULL_THRESHOLD / 4;
> + } else if (avail > MAX_ASID_AVAILABLE / 2) {
> + threshold = HALFFULL_THRESHOLD / 2;
> + } else if (avail < MAX_ASID_AVAILABLE / 3) {
> + do {
> + avail *= 2;
> + threshold *= 2;
> + } while ((avail + threshold ) < MAX_ASID_AVAILABLE / 2);
> + }
> +
> + return mm_active_threads > threshold;
> +}
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH 07/10] x86,mm: enable broadcast TLB invalidation for multi-threaded processes
2024-12-22 11:36 ` Peter Zijlstra
@ 2024-12-22 22:45 ` Rik van Riel
0 siblings, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2024-12-22 22:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, tglx, mingo,
bp, hpa, akpm
On Sun, 2024-12-22 at 12:36 +0100, Peter Zijlstra wrote:
> On Sat, Dec 21, 2024 at 11:06:39PM -0500, Rik van Riel wrote:
>
> > +/*
> > + * Figure out whether to assign a broadcast (global) ASID to a
> > process.
> > + * We vary the threshold by how empty or full broadcast ASID space
> > is.
> > + * 1/4 full: >= 4 active threads
> > + * 1/2 full: >= 8 active threads
> > + * 3/4 full: >= 16 active threads
> > + * 7/8 full: >= 32 active threads
> > + * etc
> > + *
> > + * This way we should never exhaust the broadcast ASID space, even
> > on very
> > + * large systems, and the processes with the largest number of
> > active
> > + * threads should be able to use broadcast TLB invalidation.
>
> I'm a little confused, at most we need one ASID per CPU, IIRC we have
> something like 4k ASIDs (page-offset bits in the physical address
> bits)
> so for anything with less than 4K CPUs we're good, but with anything
> having more CPUs we're up a creek irrespective of the above scheme,
> no?
We don't need to use broadcast TLB flushing for every
process in the system.
If a system with 8196 CPUs runs a few large processes
(hundreds, or thousands of active CPUs), we want to
ensure those use broadcast TLB flushing with INVLPGB.
We don't really care if some random systemd helper
task ends up being relegated to IPI TLB flushing.
The code is meant to ensure that the tasks where
INVLPGB will help the most are the most likely to
be able to use it.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 08/10] x86,tlb: do targeted broadcast flushing from tlbbatch code
2024-12-22 4:06 [RFC PATCH 00/10] AMD broadcast TLB invalidation Rik van Riel
` (6 preceding siblings ...)
2024-12-22 4:06 ` [PATCH 07/10] x86,mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
@ 2024-12-22 4:06 ` Rik van Riel
2024-12-22 4:06 ` [PATCH 09/10] x86/mm: enable AMD translation cache extensions Rik van Riel
` (2 subsequent siblings)
10 siblings, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2024-12-22 4:06 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx, mingo,
bp, hpa, akpm, Rik van Riel
Instead of doing a system-wide TLB flush from arch_tlbbatch_flush,
queue up asynchronous, targeted flushes from arch_tlbbatch_add_pending.
This also allows us to avoid adding the CPUs of processes using broadcast
flushing to the batch->cpumask, and will hopefully further reduce TLB
flushing from the reclaim and compaction paths.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/tlbbatch.h | 1 +
arch/x86/include/asm/tlbflush.h | 12 +++---------
arch/x86/mm/tlb.c | 33 ++++++++++++++++++++++++++++++---
3 files changed, 34 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/tlbbatch.h b/arch/x86/include/asm/tlbbatch.h
index 1ad56eb3e8a8..f9a17edf63ad 100644
--- a/arch/x86/include/asm/tlbbatch.h
+++ b/arch/x86/include/asm/tlbbatch.h
@@ -10,6 +10,7 @@ struct arch_tlbflush_unmap_batch {
* the PFNs being flushed..
*/
struct cpumask cpumask;
+ bool used_invlpgb;
};
#endif /* _ARCH_X86_TLBBATCH_H */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index a59f56c9b355..87f9a3725d95 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -294,21 +294,15 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
return atomic64_inc_return(&mm->context.tlb_gen);
}
-static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
- struct mm_struct *mm,
- unsigned long uaddr)
-{
- inc_mm_tlb_gen(mm);
- cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
- mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
-}
-
static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
{
flush_tlb_mm(mm);
}
extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+ struct mm_struct *mm,
+ unsigned long uaddr);
static inline bool pte_flags_need_flush(unsigned long oldflags,
unsigned long newflags,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 11ecffa26567..0482042e011c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1614,12 +1614,13 @@ EXPORT_SYMBOL_GPL(__flush_tlb_all);
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
struct flush_tlb_info *info;
-
int cpu = get_cpu();
- if (static_cpu_has(X86_FEATURE_INVLPGB)) {
- invlpgb_flush_all_nonglobals();
+ /* If we issued (asynchronous) INVLPGB flushes, wait for them here. */
+ if (static_cpu_has(X86_FEATURE_INVLPGB) && batch->used_invlpgb) {
tlbsync();
+ migrate_enable();
+ batch->used_invlpgb = false;
goto out_put_cpu;
}
@@ -1646,6 +1647,32 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
put_cpu();
}
+void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+ struct mm_struct *mm,
+ unsigned long uaddr)
+{
+ if (static_cpu_has(X86_FEATURE_INVLPGB) && mm->context.broadcast_asid) {
+ u16 asid = mm->context.broadcast_asid;
+ /*
+ * Queue up an asynchronous invalidation. The corresponding
+ * TLBSYNC is done in arch_tlbbatch_flush(), and must be done
+ * on the same CPU.
+ */
+ if (!batch->used_invlpgb) {
+ batch->used_invlpgb = true;
+ migrate_disable();
+ }
+ invlpgb_flush_user_nr(kern_pcid(asid), uaddr, 1, 0);
+ /* Do any CPUs supporting INVLPGB need PTI? */
+ if (static_cpu_has(X86_FEATURE_PTI))
+ invlpgb_flush_user_nr(user_pcid(asid), uaddr, 1, 0);
+ } else {
+ inc_mm_tlb_gen(mm);
+ cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
+ }
+ mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
+}
+
/*
* Blindly accessing user memory from NMI context can be dangerous
* if we're in the middle of switching the current user task or
--
2.47.1
^ permalink raw reply related [flat|nested] 28+ messages in thread* [PATCH 09/10] x86/mm: enable AMD translation cache extensions
2024-12-22 4:06 [RFC PATCH 00/10] AMD broadcast TLB invalidation Rik van Riel
` (7 preceding siblings ...)
2024-12-22 4:06 ` [PATCH 08/10] x86,tlb: do targeted broadcast flushing from tlbbatch code Rik van Riel
@ 2024-12-22 4:06 ` Rik van Riel
2024-12-22 11:38 ` Peter Zijlstra
2024-12-22 4:06 ` [PATCH 10/10] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
2024-12-22 11:11 ` [RFC PATCH 00/10] AMD broadcast TLB invalidation Peter Zijlstra
10 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2024-12-22 4:06 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx, mingo,
bp, hpa, akpm, Rik van Riel
With AMD TCE (translation cache extensions) only the intermediate mappings
that cover the address range zapped by INVLPG / INVLPGB get invalidated,
rather than all intermediate mappings getting zapped at every TLB invalidation.
This can help reduce the TLB miss rate, by keeping more intermediate
mappings in the cache.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/kernel/cpu/amd.c | 8 ++++++++
arch/x86/mm/tlb.c | 10 +++++++---
2 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 6a6adbe9ae54..34f85aa18fca 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1143,6 +1143,14 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
/* Max number of pages INVLPGB can invalidate in one shot */
c->invlpgb_count_max = (edx & 0xffff) + 1;
+
+ /* If supported, enable translation cache extensions (TCE) */
+ cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
+ if (ecx & BIT(17)) {
+ u64 msr = native_read_msr(MSR_EFER);;
+ msr |= BIT(15);
+ wrmsrl(MSR_EFER, msr);
+ }
}
static const struct cpu_dev amd_cpu_dev = {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 0482042e011c..9b13d97d0fb5 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -489,7 +489,7 @@ static void broadcast_tlb_flush(struct flush_tlb_info *info)
if (info->stride_shift > PMD_SHIFT)
maxnr = 1;
- if (info->end == TLB_FLUSH_ALL) {
+ if (info->end == TLB_FLUSH_ALL || info->freed_tables) {
invlpgb_flush_single_pcid(kern_pcid(asid));
/* Do any CPUs supporting INVLPGB need PTI? */
if (static_cpu_has(X86_FEATURE_PTI))
@@ -1122,7 +1122,7 @@ static void flush_tlb_func(void *info)
*
* The only question is whether to do a full or partial flush.
*
- * We do a partial flush if requested and two extra conditions
+ * We do a partial flush if requested and three extra conditions
* are met:
*
* 1. f->new_tlb_gen == local_tlb_gen + 1. We have an invariant that
@@ -1149,10 +1149,14 @@ static void flush_tlb_func(void *info)
* date. By doing a full flush instead, we can increase
* local_tlb_gen all the way to mm_tlb_gen and we can probably
* avoid another flush in the very near future.
+ *
+ * 3. No page tables were freed. If page tables were freed, a full
+ * flush ensures intermediate translations in the TLB get flushed.
*/
if (f->end != TLB_FLUSH_ALL &&
f->new_tlb_gen == local_tlb_gen + 1 &&
- f->new_tlb_gen == mm_tlb_gen) {
+ f->new_tlb_gen == mm_tlb_gen &&
+ !f->freed_tables) {
/* Partial flush */
unsigned long addr = f->start;
--
2.47.1
^ permalink raw reply related [flat|nested] 28+ messages in thread* Re: [PATCH 09/10] x86/mm: enable AMD translation cache extensions
2024-12-22 4:06 ` [PATCH 09/10] x86/mm: enable AMD translation cache extensions Rik van Riel
@ 2024-12-22 11:38 ` Peter Zijlstra
2024-12-22 15:37 ` Rik van Riel
0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2024-12-22 11:38 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, tglx, mingo,
bp, hpa, akpm
On Sat, Dec 21, 2024 at 11:06:41PM -0500, Rik van Riel wrote:
> With AMD TCE (translation cache extensions) only the intermediate mappings
Only the leave mapings, as written this all don't make sense,
> that cover the address range zapped by INVLPG / INVLPGB get invalidated,
> rather than all intermediate mappings getting zapped at every TLB invalidation.
>
> This can help reduce the TLB miss rate, by keeping more intermediate
> mappings in the cache.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 09/10] x86/mm: enable AMD translation cache extensions
2024-12-22 11:38 ` Peter Zijlstra
@ 2024-12-22 15:37 ` Rik van Riel
2024-12-24 18:25 ` Peter Zijlstra
0 siblings, 1 reply; 28+ messages in thread
From: Rik van Riel @ 2024-12-22 15:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, tglx, mingo,
bp, hpa, akpm
On Sun, 2024-12-22 at 12:38 +0100, Peter Zijlstra wrote:
> On Sat, Dec 21, 2024 at 11:06:41PM -0500, Rik van Riel wrote:
> > With AMD TCE (translation cache extensions) only the intermediate
> > mappings
>
> Only the leave mapings, as written this all don't make sense,
Check out page 513 of the AMD manual:
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/40332.pdf
"Translation Cache Extension (TCE) Bit. Bit 15, read/write.
Setting this bit to 1 changes how the INVLPG, INVLPGB, and INVPCID
instructions operate on TLB entries. When this bit is 0, these
instructions remove the target PTE from the TLB as well as all
upper-level table entries that are cached in the TLB, whether or
not they are associated with the target PTE. When this bit is set,
these instructions will remove the target PTE and only those
upper-level entries that lead to the target PTE in the page table
hierarchy, leaving unrelated upper-level entries intact. This may
provide a performance benefit.
Page table management software must be written in a way that takes
this behavior into account. Software that was written for a
processor that does not cache upper-level table entries may result
in stale entries being incorrectly used for translations when TCE
is enabled. Software that is compatible with TCE mode will operate
in either mode.
For software using INVLPGB to broadcast TLB invalidations, the
invalidations are controlled by the EFER.TCE value on the processor
executing the INVLPGB instruction.
Before setting TCE, system software should verify that this feature
is supported by examining the feature flag CPUID Fn8000_0001_ECX[TCE].
See Section 3.3 “Processor Feature Identification,” on
page 71 for information on using the CPUID instruction"
This suggests that:
1) TCE does control the "don't make sense" behavior :)
2) Wait, does EFER.TCE need to be set on every CPU
in the system? Could a system run with TCE set
on some CPUs, and cleared on another?!
It would be nice if somebody from AMD could chime in on
the latter, and ensure that this patch actually does
the right thing :)
I'll incorporate all the other feedback from you and
Borislav into the next version of the patch series!
Thank you for the feedback.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 09/10] x86/mm: enable AMD translation cache extensions
2024-12-22 15:37 ` Rik van Riel
@ 2024-12-24 18:25 ` Peter Zijlstra
0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2024-12-24 18:25 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, tglx, mingo,
bp, hpa, akpm
On Sun, Dec 22, 2024 at 10:37:01AM -0500, Rik van Riel wrote:
> On Sun, 2024-12-22 at 12:38 +0100, Peter Zijlstra wrote:
> > On Sat, Dec 21, 2024 at 11:06:41PM -0500, Rik van Riel wrote:
> > > With AMD TCE (translation cache extensions) only the intermediate
> > > mappings
> >
> > Only the leave mapings, as written this all don't make sense,
>
> Check out page 513 of the AMD manual:
>
> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/40332.pdf
>
> "Translation Cache Extension (TCE) Bit. Bit 15, read/write.
>
> Setting this bit to 1 changes how the INVLPG, INVLPGB, and INVPCID
> instructions operate on TLB entries. When this bit is 0, these
> instructions remove the target PTE from the TLB as well as all
> upper-level table entries that are cached in the TLB, whether or
> not they are associated with the target PTE. When this bit is set,
> these instructions will remove the target PTE and only those
> upper-level entries that lead to the target PTE in the page table
> hierarchy, leaving unrelated upper-level entries intact. This may
> provide a performance benefit.
>
> Page table management software must be written in a way that takes
> this behavior into account. Software that was written for a
> processor that does not cache upper-level table entries may result
> in stale entries being incorrectly used for translations when TCE
> is enabled. Software that is compatible with TCE mode will operate
> in either mode.
>
> For software using INVLPGB to broadcast TLB invalidations, the
> invalidations are controlled by the EFER.TCE value on the processor
> executing the INVLPGB instruction.
>
> Before setting TCE, system software should verify that this feature
> is supported by examining the feature flag CPUID Fn8000_0001_ECX[TCE].
> See Section 3.3 “Processor Feature Identification,” on
> page 71 for information on using the CPUID instruction"
So that makes a ton more sense.
>
> This suggests that:
> 1) TCE does control the "don't make sense" behavior :)
Well, you wrote:
> > With AMD TCE (translation cache extensions) only the intermediate mappings
> > that cover the address range zapped by INVLPG / INVLPGB get invalidated,
> > rather than all intermediate mappings getting zapped at every TLB invalidation.
And I read that like it would zap only the intermediate mappings rather
than the intermediate mappings.
Reading it a wee bit more carefully, I see it's not quite as bad, but
still not very clear.
> 2) Wait, does EFER.TCE need to be set on every CPU
> in the system? Could a system run with TCE set
> on some CPUs, and cleared on another?!
I would imagine it can; I don't think they would recommend anybody do
this though.
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 10/10] x86/mm: only invalidate final translations with INVLPGB
2024-12-22 4:06 [RFC PATCH 00/10] AMD broadcast TLB invalidation Rik van Riel
` (8 preceding siblings ...)
2024-12-22 4:06 ` [PATCH 09/10] x86/mm: enable AMD translation cache extensions Rik van Riel
@ 2024-12-22 4:06 ` Rik van Riel
2024-12-22 11:11 ` [RFC PATCH 00/10] AMD broadcast TLB invalidation Peter Zijlstra
10 siblings, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2024-12-22 4:06 UTC (permalink / raw)
To: x86
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, tglx, mingo,
bp, hpa, akpm, Rik van Riel
Use the INVLPGB_FINAL_ONLY flag when invalidating mappings with INVPLGB.
This leaves intermediate translations cached in the TLB.
On the (rare) occasions where we free page tables we do a full flush,
ensuring intermediate translations get flushed from the TLB.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/invlpgb.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/invlpgb.h b/arch/x86/include/asm/invlpgb.h
index ef12ea4a8d65..c1478277b9ea 100644
--- a/arch/x86/include/asm/invlpgb.h
+++ b/arch/x86/include/asm/invlpgb.h
@@ -56,7 +56,7 @@ static inline void invlpgb_flush_user(unsigned long pcid,
static inline void invlpgb_flush_user_nr(unsigned long pcid, unsigned long addr, int nr,
bool pmd_stride)
{
- __invlpgb(0, pcid, addr, nr - 1, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
+ __invlpgb(0, pcid, addr, nr - 1, pmd_stride, INVLPGB_PCID | INVLPGB_VA | INVLPGB_FINAL_ONLY);
}
/* Flush all mappings for a given ASID, not including globals. */
--
2.47.1
^ permalink raw reply related [flat|nested] 28+ messages in thread* Re: [RFC PATCH 00/10] AMD broadcast TLB invalidation
2024-12-22 4:06 [RFC PATCH 00/10] AMD broadcast TLB invalidation Rik van Riel
` (9 preceding siblings ...)
2024-12-22 4:06 ` [PATCH 10/10] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
@ 2024-12-22 11:11 ` Peter Zijlstra
2024-12-22 15:06 ` Rik van Riel
10 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2024-12-22 11:11 UTC (permalink / raw)
To: Rik van Riel
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, tglx, mingo,
bp, hpa, akpm
On Sat, Dec 21, 2024 at 11:06:32PM -0500, Rik van Riel wrote:
> Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
>
> This allows the kernel to invalidate TLB entries on remote CPUs without
> needing to send IPIs, without having to wait for remote CPUs to handle
> those interrupts, and with less interruption to what was running on
> those CPUs.
You're going to be needing this somewhere at the start of this series.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9d7bd0ae48c4..e8743f8c9fd0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -274,7 +274,7 @@ config X86
select HAVE_PCI
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
- select MMU_GATHER_RCU_TABLE_FREE if PARAVIRT
+ select MMU_GATHER_RCU_TABLE_FREE
select MMU_GATHER_MERGE_VMAS
select HAVE_POSIX_CPU_TIMERS_TASK_WORK
select HAVE_REGS_AND_STACK_ACCESS_API
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index fec381533555..c037592c67ef 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -61,7 +61,7 @@ void __init native_pv_lock_init(void)
static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
{
- tlb_remove_page(tlb, table);
+ tlb_remove_table(tlb, table);
}
struct static_key paravirt_steal_enabled;
^ permalink raw reply related [flat|nested] 28+ messages in thread* Re: [RFC PATCH 00/10] AMD broadcast TLB invalidation
2024-12-22 11:11 ` [RFC PATCH 00/10] AMD broadcast TLB invalidation Peter Zijlstra
@ 2024-12-22 15:06 ` Rik van Riel
0 siblings, 0 replies; 28+ messages in thread
From: Rik van Riel @ 2024-12-22 15:06 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, linux-kernel, kernel-team, dave.hansen, luto, tglx, mingo,
bp, hpa, akpm
On Sun, 22 Dec 2024 12:11:13 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, Dec 21, 2024 at 11:06:32PM -0500, Rik van Riel wrote:
> > Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
> >
> > This allows the kernel to invalidate TLB entries on remote CPUs without
> > needing to send IPIs, without having to wait for remote CPUs to handle
> > those interrupts, and with less interruption to what was running on
> > those CPUs.
>
> You're going to be needing this somewhere at the start of this series.
I like this change, and would be happy to add it.
What exactly is it needed for?
When I add this change, what other things do
I need to be looking for?
Can we just set the function pointer to
tlb_remove_table, and get rid of the function,
like this?
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9d7bd0ae48c4..e8743f8c9fd0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -274,7 +274,7 @@ config X86
select HAVE_PCI
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
- select MMU_GATHER_RCU_TABLE_FREE if PARAVIRT
+ select MMU_GATHER_RCU_TABLE_FREE
select MMU_GATHER_MERGE_VMAS
select HAVE_POSIX_CPU_TIMERS_TASK_WORK
select HAVE_REGS_AND_STACK_ACCESS_API
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index fec381533555..2b78a6b466ed 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -59,11 +59,6 @@ void __init native_pv_lock_init(void)
static_branch_enable(&virt_spin_lock_key);
}
-static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
-{
- tlb_remove_page(tlb, table);
-}
-
struct static_key paravirt_steal_enabled;
struct static_key paravirt_steal_rq_enabled;
@@ -191,7 +186,7 @@ struct paravirt_patch_template pv_ops = {
.mmu.flush_tlb_kernel = native_flush_tlb_global,
.mmu.flush_tlb_one_user = native_flush_tlb_one_user,
.mmu.flush_tlb_multi = native_flush_tlb_multi,
- .mmu.tlb_remove_table = native_tlb_remove_table,
+ .mmu.tlb_remove_table = tlb_remove_table,
.mmu.exit_mmap = paravirt_nop,
.mmu.notify_page_enc_status_changed = paravirt_nop,
^ permalink raw reply related [flat|nested] 28+ messages in thread