linux-kernel.vger.kernel.org archive mirror
* [PATCH v14 00/13] AMD broadcast TLB invalidation
@ 2025-02-26  3:00 Rik van Riel
  2025-02-26  3:00 ` [PATCH v14 01/13] x86/mm: consolidate full flush threshold decision Rik van Riel
                   ` (14 more replies)
  0 siblings, 15 replies; 86+ messages in thread
From: Rik van Riel @ 2025-02-26  3:00 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.

This allows the kernel to invalidate TLB entries on remote CPUs without
needing to send IPIs, without having to wait for remote CPUs to handle
those interrupts, and with less interruption to what was running on
those CPUs.

Because x86 PCID space is limited, and there are some very large
systems out there, broadcast TLB invalidation is only used for
processes that are active on 3 or more CPUs, with the threshold
being gradually increased the more the PCID space gets exhausted.

Combined with the removal of unnecessary lru_add_drain calls
(see https://lkml.org/lkml/2024/12/19/1388) this results in a
nice performance boost for the will-it-scale tlb_flush2_threads
test on an AMD Milan system with 36 cores:

- vanilla kernel:           527k loops/second
- lru_add_drain removal:    731k loops/second
- only INVLPGB:             527k loops/second
- lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that while
TLB invalidation went down from 40% of the total CPU time
to only around 4%, the contention simply moved to the
LRU lock.

Fixing both at the same time roughly doubles the
number of iterations per second in this test.

Some numbers closer to real world performance
can be found at Phoronix, thanks to Michael:

https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits

My current plan is to implement support for Intel's RAR
(Remote Action Request) TLB flushing in a follow-up series,
after this thing has been merged into -tip. Making things
any larger would just be unwieldy for reviewers.

v14:
 - code & comment cleanups (Boris)
 - drop "noinvlpgb" commandline option (Boris)
 - fix the !CONFIG_X86_BROADCAST_TLB_FLUSH build at every point in the series
v13:
 - move invlpgb_count_max back to amd.c for resume (Boris, Oleksandr)
 - fix Kconfig circular dependency (Tom, Boris)
 - add performance numbers to the patch adding invlpgb for userspace (Ingo)
 - drop page table RCU free patches (already in -tip)
v12:
 - make sure "nopcid" command line option turns off invlpgb (Brendan)
 - add "noinvlpgb" kernel command line option
 - split out kernel TLB flushing differently (Dave & Yosry)
 - split up the patch that does invlpgb flushing for user processes (Dave)
 - clean up get_flush_tlb_info (Boris)
 - move invlpgb_count_max initialization to get_cpu_cap (Boris)
 - bunch more comments as requested
v11:
 - resolve conflict with CONFIG_PT_RECLAIM code
 - a few more cleanups (Peter, Brendan, Nadav)
v10:
 - simplify partial pages with min(nr, 1) in the invlpgb loop (Peter)
 - document x86 paravirt, AMD invlpgb, and ARM64 flush without IPI (Brendan)
 - remove IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH) (Brendan)
 - various cleanups (Brendan)
v9:
 - print warning when start or end address was rounded (Peter)
 - in the reclaim code, tlbsync at context switch time (Peter)
 - fix !CONFIG_CPU_SUP_AMD compile error in arch_tlbbatch_add_pending (Jan)
v8:
 - round start & end to handle non-page-aligned callers (Steven & Jan)
 - fix up changelog & add tested-by tags (Manali)
v7:
 - a few small code cleanups (Nadav)
 - fix spurious VM_WARN_ON_ONCE in mm_global_asid
 - code simplifications & better barriers (Peter & Dave)
v6:
 - fix info->end check in flush_tlb_kernel_range (Michael)
 - disable broadcast TLB flushing on 32 bit x86
v5:
 - use byte assembly for compatibility with older toolchains (Borislav, Michael)
 - ensure a panic on an invalid number of extra pages (Dave, Tom)
 - add cant_migrate() assertion to tlbsync (Jann)
 - a bunch more cleanups (Nadav)
 - key TCE enabling off X86_FEATURE_TCE (Andrew)
 - fix a race between reclaim and ASID transition (Jann)
v4:
 - Use only bitmaps to track free global ASIDs (Nadav)
 - Improved AMD initialization (Borislav & Tom)
 - Various naming and documentation improvements (Peter, Nadav, Tom, Dave)
 - Fixes for subtle race conditions (Jann)
v3:
 - Remove paravirt tlb_remove_table call (thank you Qi Zheng)
 - More suggested cleanups and changelog fixes by Peter and Nadav
v2:
 - Apply suggestions by Peter and Borislav (thank you!)
 - Fix bug in arch_tlbbatch_flush, where we need to do both
   the TLBSYNC, and flush the CPUs that are in the cpumask.
 - Some updates to comments and changelogs based on questions.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH v14 01/13] x86/mm: consolidate full flush threshold decision
  2025-02-26  3:00 [PATCH v14 00/13] AMD broadcast TLB invalidation Rik van Riel
@ 2025-02-26  3:00 ` Rik van Riel
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Consolidate " tip-bot2 for Rik van Riel
                     ` (2 more replies)
  2025-02-26  3:00 ` [PATCH v14 02/13] x86/mm: get INVLPGB count max from CPUID Rik van Riel
                   ` (13 subsequent siblings)
  14 siblings, 3 replies; 86+ messages in thread
From: Rik van Riel @ 2025-02-26  3:00 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel,
	Dave Hansen

Reduce code duplication by consolidating the decision point
for whether to do individual invalidations or a full flush
inside get_flush_tlb_info.
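
As a concrete illustration of the heuristic being consolidated, a
minimal user-space sketch (the default ceiling of 33 pages and the
TLB_FLUSH_ALL sentinel are assumptions mirroring the hunk below;
everything else is simplified):

#include <stdio.h>

#define TLB_FLUSH_ALL		(-1UL)

/* Assumed default; the kernel tunes this value separately. */
static unsigned long tlb_single_page_flush_ceiling = 33;

/*
 * Mirrors the decision moved into get_flush_tlb_info(): ranges
 * covering more pages than the ceiling become a full flush.
 */
static void decide_flush(unsigned long *start, unsigned long *end,
			 unsigned int stride_shift)
{
	if ((*end - *start) >> stride_shift > tlb_single_page_flush_ceiling) {
		*start = 0;
		*end = TLB_FLUSH_ALL;
	}
}

int main(void)
{
	unsigned long start = 0x100000, end = 0x100000 + 64 * 4096;

	/* 64 4k pages > 33, so this becomes a full flush. */
	decide_flush(&start, &end, 12);
	printf("%s\n", end == TLB_FLUSH_ALL ? "full flush" : "ranged flush");
	return 0;
}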

Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
---
 arch/x86/mm/tlb.c | 41 +++++++++++++++++++----------------------
 1 file changed, 19 insertions(+), 22 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ffc25b348041..dbcb5c968ff9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1000,6 +1000,15 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
 	BUG_ON(this_cpu_inc_return(flush_tlb_info_idx) != 1);
 #endif
 
+	/*
+	 * If the number of flushes is so large that a full flush
+	 * would be faster, do a full flush.
+	 */
+	if ((end - start) >> stride_shift > tlb_single_page_flush_ceiling) {
+		start = 0;
+		end = TLB_FLUSH_ALL;
+	}
+
 	info->start		= start;
 	info->end		= end;
 	info->mm		= mm;
@@ -1026,17 +1035,8 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				bool freed_tables)
 {
 	struct flush_tlb_info *info;
+	int cpu = get_cpu();
 	u64 new_tlb_gen;
-	int cpu;
-
-	cpu = get_cpu();
-
-	/* Should we flush just the requested range? */
-	if ((end == TLB_FLUSH_ALL) ||
-	    ((end - start) >> stride_shift) > tlb_single_page_flush_ceiling) {
-		start = 0;
-		end = TLB_FLUSH_ALL;
-	}
 
 	/* This is also a barrier that synchronizes with switch_mm(). */
 	new_tlb_gen = inc_mm_tlb_gen(mm);
@@ -1089,22 +1089,19 @@ static void do_kernel_range_flush(void *info)
 
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
-	/* Balance as user space task's flush, a bit conservative */
-	if (end == TLB_FLUSH_ALL ||
-	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
-	} else {
-		struct flush_tlb_info *info;
+	struct flush_tlb_info *info;
+
+	guard(preempt)();
 
-		preempt_disable();
-		info = get_flush_tlb_info(NULL, start, end, 0, false,
-					  TLB_GENERATION_INVALID);
+	info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
+				  TLB_GENERATION_INVALID);
 
+	if (info->end == TLB_FLUSH_ALL)
+		on_each_cpu(do_flush_tlb_all, NULL, 1);
+	else
 		on_each_cpu(do_kernel_range_flush, info, 1);
 
-		put_flush_tlb_info();
-		preempt_enable();
-	}
+	put_flush_tlb_info();
 }
 
 /*
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v14 02/13] x86/mm: get INVLPGB count max from CPUID
  2025-02-26  3:00 [PATCH v14 00/13] AMD broadcast TLB invalidation Rik van Riel
  2025-02-26  3:00 ` [PATCH v14 01/13] x86/mm: consolidate full flush threshold decision Rik van Riel
@ 2025-02-26  3:00 ` Rik van Riel
  2025-02-28 16:21   ` Borislav Petkov
                     ` (3 more replies)
  2025-02-26  3:00 ` [PATCH v14 03/13] x86/mm: add INVLPGB support code Rik van Riel
                   ` (12 subsequent siblings)
  14 siblings, 4 replies; 86+ messages in thread
From: Rik van Riel @ 2025-02-26  3:00 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel,
	Dave Hansen

The CPU advertises the maximum number of pages that can be shot down
with one INVLPGB instruction in the CPUID data.

Save that information for later use.
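
For reference, a small user-space sketch of reading the same field
(the EDX[15:0] decode and the +1 mirror the hunk below; the
assumption that INVLPGB support is reported in EBX bit 3 of leaf
0x80000008 follows from the X86_FEATURE_INVLPGB (13*32+3) define):

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx))
		return 1;

	/* INVLPGB/TLBSYNC support bit, assumed to be EBX bit 3. */
	if (!(ebx & (1u << 3))) {
		printf("INVLPGB not supported\n");
		return 0;
	}

	/* EDX[15:0] holds the count; the kernel saves it + 1. */
	printf("invlpgb_count_max = %u\n", (edx & 0xffff) + 1);
	return 0;
}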

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
---
 arch/x86/Kconfig.cpu               | 4 ++++
 arch/x86/include/asm/cpufeatures.h | 1 +
 arch/x86/include/asm/tlbflush.h    | 3 +++
 arch/x86/kernel/cpu/amd.c          | 6 ++++++
 4 files changed, 14 insertions(+)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 2a7279d80460..981def9cbfac 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -401,6 +401,10 @@ menuconfig PROCESSOR_SELECT
 	  This lets you choose what x86 vendor support code your kernel
 	  will include.
 
+config X86_BROADCAST_TLB_FLUSH
+	def_bool y
+	depends on CPU_SUP_AMD && 64BIT
+
 config CPU_SUP_INTEL
 	default y
 	bool "Support Intel processors" if PROCESSOR_SELECT
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 508c0dad116b..b5c66b7465ba 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -338,6 +338,7 @@
 #define X86_FEATURE_CLZERO		(13*32+ 0) /* "clzero" CLZERO instruction */
 #define X86_FEATURE_IRPERF		(13*32+ 1) /* "irperf" Instructions Retired Count */
 #define X86_FEATURE_XSAVEERPTR		(13*32+ 2) /* "xsaveerptr" Always save/restore FP error pointers */
+#define X86_FEATURE_INVLPGB		(13*32+ 3) /* INVLPGB and TLBSYNC instruction supported. */
 #define X86_FEATURE_RDPRU		(13*32+ 4) /* "rdpru" Read processor register at user level */
 #define X86_FEATURE_WBNOINVD		(13*32+ 9) /* "wbnoinvd" WBNOINVD instruction */
 #define X86_FEATURE_AMD_IBPB		(13*32+12) /* Indirect Branch Prediction Barrier */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3da645139748..855c13da2045 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -183,6 +183,9 @@ static inline void cr4_init_shadow(void)
 extern unsigned long mmu_cr4_features;
 extern u32 *trampoline_cr4_features;
 
+/* How many pages can be invalidated with one INVLPGB. */
+extern u16 invlpgb_count_max;
+
 extern void initialize_tlbstate_and_flush(void);
 
 /*
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 54194f5995de..3c75c174a274 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -29,6 +29,8 @@
 
 #include "cpu.h"
 
+u16 invlpgb_count_max __ro_after_init;
+
 static inline int rdmsrl_amd_safe(unsigned msr, unsigned long long *p)
 {
 	u32 gprs[8] = { 0 };
@@ -1139,6 +1141,10 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
 		tlb_lli_2m[ENTRIES] = eax & mask;
 
 	tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
+
+	/* Max number of pages INVLPGB can invalidate in one shot */
+	if (boot_cpu_has(X86_FEATURE_INVLPGB))
+		invlpgb_count_max = (cpuid_edx(0x80000008) & 0xffff) + 1;
 }
 
 static const struct cpu_dev amd_cpu_dev = {
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v14 03/13] x86/mm: add INVLPGB support code
  2025-02-26  3:00 [PATCH v14 00/13] AMD broadcast TLB invalidation Rik van Riel
  2025-02-26  3:00 ` [PATCH v14 01/13] x86/mm: consolidate full flush threshold decision Rik van Riel
  2025-02-26  3:00 ` [PATCH v14 02/13] x86/mm: get INVLPGB count max from CPUID Rik van Riel
@ 2025-02-26  3:00 ` Rik van Riel
  2025-02-28 18:46   ` Borislav Petkov
                     ` (4 more replies)
  2025-02-26  3:00 ` [PATCH v14 04/13] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
                   ` (11 subsequent siblings)
  14 siblings, 5 replies; 86+ messages in thread
From: Rik van Riel @ 2025-02-26  3:00 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel,
	Dave Hansen

Add helper functions and definitions needed to use broadcast TLB
invalidation on AMD EPYC 3 and newer CPUs.

All the functions defined in invlpgb.h are used later in the series.

Disabling X86_FEATURE_INVLPGB at compile time when the config
option is not set allows the compiler to omit unnecessary code.
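
A user-space sketch of just the operand packing done by __invlpgb()
in the hunk below; no instruction is executed, and the struct is
purely illustrative:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

struct invlpgb_regs {
	uint64_t rax;	/* address | flag bits */
	uint32_t ecx;	/* pmd_stride << 31 | (nr_pages - 1) */
	uint32_t edx;	/* pcid << 16 | asid */
};

static struct invlpgb_regs pack_invlpgb(unsigned long asid, unsigned long pcid,
					unsigned long addr, uint16_t nr_pages,
					bool pmd_stride, uint8_t flags)
{
	struct invlpgb_regs r = {
		.rax = addr | flags,
		.ecx = ((uint32_t)pmd_stride << 31) | (nr_pages - 1),
		.edx = (pcid << 16) | asid,
	};
	return r;
}

int main(void)
{
	/* Flush 8 pages at one address for PCID 5: INVLPGB_VA | INVLPGB_PCID. */
	struct invlpgb_regs r = pack_invlpgb(0, 5, 0x7f0000000000UL, 8, false, 0x3);

	printf("rax=%#llx ecx=%#x edx=%#x\n",
	       (unsigned long long)r.rax, r.ecx, r.edx);
	return 0;
}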

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
---
 arch/x86/include/asm/disabled-features.h |  8 +-
 arch/x86/include/asm/tlb.h               | 98 ++++++++++++++++++++++++
 2 files changed, 105 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index c492bdc97b05..625a89259968 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -129,6 +129,12 @@
 #define DISABLE_SEV_SNP		(1 << (X86_FEATURE_SEV_SNP & 31))
 #endif
 
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+#define DISABLE_INVLPGB		0
+#else
+#define DISABLE_INVLPGB		(1 << (X86_FEATURE_INVLPGB & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -146,7 +152,7 @@
 #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
 			 DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
 #define DISABLED_MASK12	(DISABLE_FRED|DISABLE_LAM)
-#define DISABLED_MASK13	0
+#define DISABLED_MASK13	(DISABLE_INVLPGB)
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 77f52bc1578a..91c9a4da3ace 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -6,6 +6,9 @@
 static inline void tlb_flush(struct mmu_gather *tlb);
 
 #include <asm-generic/tlb.h>
+#include <linux/kernel.h>
+#include <vdso/bits.h>
+#include <vdso/page.h>
 
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
@@ -25,4 +28,99 @@ static inline void invlpg(unsigned long addr)
 	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
 }
 
+
+/*
+ * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
+ *
+ * The INVLPGB instruction is weakly ordered, and a batch of invalidations can
+ * be done in a parallel fashion.
+ *
+ * The instruction takes the number of extra pages to invalidate, beyond
+ * the first page, while __invlpgb gets the more human readable number of
+ * pages to invalidate.
+ *
+ * TLBSYNC is used to ensure that pending INVLPGB invalidations initiated from
+ * this CPU have completed.
+ */
+static inline void __invlpgb(unsigned long asid, unsigned long pcid,
+			     unsigned long addr, u16 nr_pages,
+			     bool pmd_stride, u8 flags)
+{
+	u32 edx = (pcid << 16) | asid;
+	u32 ecx = (pmd_stride << 31) | (nr_pages - 1);
+	u64 rax = addr | flags;
+
+	/* The low bits in rax are for flags. Verify addr is clean. */
+	VM_WARN_ON_ONCE(addr & ~PAGE_MASK);
+
+	/* INVLPGB; supported in binutils >= 2.36. */
+	asm volatile(".byte 0x0f, 0x01, 0xfe" : : "a" (rax), "c" (ecx), "d" (edx));
+}
+
+static inline void __tlbsync(void)
+{
+	/*
+	 * tlbsync waits for invlpgb instructions originating on the
+	 * same CPU to have completed. Print a warning if we could have
+	 * migrated, and might not be waiting on all the invlpgbs issued
+	 * during this TLB invalidation sequence.
+	 */
+	cant_migrate();
+
+	/* TLBSYNC: supported in binutils >= 2.36. */
+	asm volatile(".byte 0x0f, 0x01, 0xff" ::: "memory");
+}
+
+/*
+ * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination
+ * of the three. For example:
+ * - INVLPGB_VA | INVLPGB_INCLUDE_GLOBAL: invalidate all TLB entries at the address
+ * - INVLPGB_PCID:			  invalidate all TLB entries matching the PCID
+ *
+ * The first can be used to invalidate (kernel) mappings at a particular
+ * address across all processes.
+ *
+ * The latter invalidates all TLB entries matching a PCID.
+ */
+#define INVLPGB_VA			BIT(0)
+#define INVLPGB_PCID			BIT(1)
+#define INVLPGB_ASID			BIT(2)
+#define INVLPGB_INCLUDE_GLOBAL		BIT(3)
+#define INVLPGB_FINAL_ONLY		BIT(4)
+#define INVLPGB_INCLUDE_NESTED		BIT(5)
+
+static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
+						unsigned long addr,
+						u16 nr,
+						bool pmd_stride)
+{
+	__invlpgb(0, pcid, addr, nr, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
+}
+
+/* Flush all mappings for a given PCID, not including globals. */
+static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
+{
+	__invlpgb(0, pcid, 0, 1, 0, INVLPGB_PCID);
+}
+
+/* Flush all mappings, including globals, for all PCIDs. */
+static inline void invlpgb_flush_all(void)
+{
+	__invlpgb(0, 0, 0, 1, 0, INVLPGB_INCLUDE_GLOBAL);
+	__tlbsync();
+}
+
+/* Flush addr, including globals, for all PCIDs. */
+static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
+{
+	__invlpgb(0, 0, addr, nr, 0, INVLPGB_INCLUDE_GLOBAL);
+}
+
+/* Flush all mappings for all PCIDs except globals. */
+static inline void invlpgb_flush_all_nonglobals(void)
+{
+	__invlpgb(0, 0, 0, 1, 0, 0);
+	__tlbsync();
+}
+
 #endif /* _ASM_X86_TLB_H */
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v14 04/13] x86/mm: use INVLPGB for kernel TLB flushes
  2025-02-26  3:00 [PATCH v14 00/13] AMD broadcast TLB invalidation Rik van Riel
                   ` (2 preceding siblings ...)
  2025-02-26  3:00 ` [PATCH v14 03/13] x86/mm: add INVLPGB support code Rik van Riel
@ 2025-02-26  3:00 ` Rik van Riel
  2025-02-28 19:00   ` Dave Hansen
                     ` (3 more replies)
  2025-02-26  3:00 ` [PATCH v14 05/13] x86/mm: use INVLPGB in flush_tlb_all Rik van Riel
                   ` (10 subsequent siblings)
  14 siblings, 4 replies; 86+ messages in thread
From: Rik van Riel @ 2025-02-26  3:00 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

Use broadcast TLB invalidation for kernel addresses when available.

Remove the need to send IPIs for kernel TLB flushes.
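
A stand-alone sketch of the chunking arithmetic used by
invlpgb_kernel_range_flush() in the hunk below, with the flush
itself stubbed out (invlpgb_count_max = 8 is an arbitrary value
for the demo; the kernel reads it from CPUID):

#include <stdio.h>

#define PAGE_SHIFT	12

static unsigned long invlpgb_count_max = 8;	/* assumed for the demo */

/* Stand-in for the real broadcast flush of 'nr' pages at 'addr'. */
static void flush_chunk(unsigned long addr, unsigned long nr)
{
	printf("INVLPGB addr=%#lx nr=%lu\n", addr, nr);
}

/*
 * Mirrors the loop below: walk the range in chunks of at most
 * invlpgb_count_max pages, then issue a single TLBSYNC.
 */
static void kernel_range_flush(unsigned long start, unsigned long end)
{
	unsigned long addr, nr;

	for (addr = start; addr < end; addr += nr << PAGE_SHIFT) {
		nr = (end - addr) >> PAGE_SHIFT;
		if (nr > invlpgb_count_max)
			nr = invlpgb_count_max;
		if (nr < 1)
			nr = 1;
		flush_chunk(addr, nr);
	}
	printf("TLBSYNC\n");
}

int main(void)
{
	unsigned long start = 0xffff800000000000UL;

	kernel_range_flush(start, start + 20 * 4096);	/* chunks of 8, 8, 4 */
	return 0;
}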

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/mm/tlb.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index dbcb5c968ff9..f44a03bca41c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1077,6 +1077,18 @@ void flush_tlb_all(void)
 	on_each_cpu(do_flush_tlb_all, NULL, 1);
 }
 
+static void invlpgb_kernel_range_flush(struct flush_tlb_info *info)
+{
+	unsigned long addr, nr;
+
+	for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
+		nr = (info->end - addr) >> PAGE_SHIFT;
+		nr = clamp_val(nr, 1, invlpgb_count_max);
+		invlpgb_flush_addr_nosync(addr, nr);
+	}
+	__tlbsync();
+}
+
 static void do_kernel_range_flush(void *info)
 {
 	struct flush_tlb_info *f = info;
@@ -1087,6 +1099,22 @@ static void do_kernel_range_flush(void *info)
 		flush_tlb_one_kernel(addr);
 }
 
+static void kernel_tlb_flush_all(struct flush_tlb_info *info)
+{
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		invlpgb_flush_all();
+	else
+		on_each_cpu(do_flush_tlb_all, NULL, 1);
+}
+
+static void kernel_tlb_flush_range(struct flush_tlb_info *info)
+{
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		invlpgb_kernel_range_flush(info);
+	else
+		on_each_cpu(do_kernel_range_flush, info, 1);
+}
+
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
 	struct flush_tlb_info *info;
@@ -1097,9 +1125,9 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 				  TLB_GENERATION_INVALID);
 
 	if (info->end == TLB_FLUSH_ALL)
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
+		kernel_tlb_flush_all(info);
 	else
-		on_each_cpu(do_kernel_range_flush, info, 1);
+		kernel_tlb_flush_range(info);
 
 	put_flush_tlb_info();
 }
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v14 05/13] x86/mm: use INVLPGB in flush_tlb_all
  2025-02-26  3:00 [PATCH v14 00/13] AMD broadcast TLB invalidation Rik van Riel
                   ` (3 preceding siblings ...)
  2025-02-26  3:00 ` [PATCH v14 04/13] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
@ 2025-02-26  3:00 ` Rik van Riel
  2025-02-28 19:18   ` Dave Hansen
  2025-02-28 22:20   ` Borislav Petkov
  2025-02-26  3:00 ` [PATCH v14 06/13] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
                   ` (9 subsequent siblings)
  14 siblings, 2 replies; 86+ messages in thread
From: Rik van Riel @ 2025-02-26  3:00 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

The flush_tlb_all() function is not used a whole lot, but we might
as well use broadcast TLB flushing there, too.

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/mm/tlb.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index f44a03bca41c..a6cd61d5f423 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1064,7 +1064,6 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
 }
 
-
 static void do_flush_tlb_all(void *info)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
@@ -1074,6 +1073,15 @@ static void do_flush_tlb_all(void *info)
 void flush_tlb_all(void)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+
+	/* First try (faster) hardware-assisted TLB invalidation. */
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+		guard(preempt)();
+		invlpgb_flush_all();
+		return;
+	}
+
+	/* Fall back to the IPI-based invalidation. */
 	on_each_cpu(do_flush_tlb_all, NULL, 1);
 }
 
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v14 06/13] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
  2025-02-26  3:00 [PATCH v14 00/13] AMD broadcast TLB invalidation Rik van Riel
                   ` (4 preceding siblings ...)
  2025-02-26  3:00 ` [PATCH v14 05/13] x86/mm: use INVLPGB in flush_tlb_all Rik van Riel
@ 2025-02-26  3:00 ` Rik van Riel
  2025-02-28 18:57   ` Borislav Petkov
                     ` (2 more replies)
  2025-02-26  3:00 ` [PATCH v14 07/13] x86/mm: add global ASID allocation helper functions Rik van Riel
                   ` (8 subsequent siblings)
  14 siblings, 3 replies; 86+ messages in thread
From: Rik van Riel @ 2025-02-26  3:00 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

In the page reclaim code, we only track the CPU(s) where the TLB needs
to be flushed, rather than all the individual mappings that may be getting
invalidated.

Use broadcast TLB flushing when that is available.

This is a temporary hack to ensure that the PCID context of
tasks that get a global ASID in the next patch is properly
flushed from the page reclaim code, because the IPI-based
flushing in arch_tlbbatch_flush only flushes the currently
loaded TLB context on each CPU.

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/mm/tlb.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a6cd61d5f423..1cc25e83bd34 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1316,7 +1316,9 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+		invlpgb_flush_all_nonglobals();
+	} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
 		flush_tlb_multi(&batch->cpumask, info);
 	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v14 07/13] x86/mm: add global ASID allocation helper functions
  2025-02-26  3:00 [PATCH v14 00/13] AMD broadcast TLB invalidation Rik van Riel
                   ` (5 preceding siblings ...)
  2025-02-26  3:00 ` [PATCH v14 06/13] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
@ 2025-02-26  3:00 ` Rik van Riel
  2025-03-02  7:06   ` Borislav Petkov
                     ` (2 more replies)
  2025-02-26  3:00 ` [PATCH v14 08/13] x86/mm: global ASID context switch & TLB flush handling Rik van Riel
                   ` (7 subsequent siblings)
  14 siblings, 3 replies; 86+ messages in thread
From: Rik van Riel @ 2025-02-26  3:00 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

Add functions to manage global ASID space. Multithreaded processes that
are simultaneously active on 4 or more CPUs can get a global ASID,
resulting in the same PCID being used for that process on every CPU.

This in turn will allow the kernel to use hardware-assisted TLB flushing
through AMD INVLPGB or Intel RAR for these processes.
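
A compact user-space model of the allocation scheme below: a "used"
bitmap for live global ASIDs, a "freed" bitmap for ASIDs waiting on
the next rollover flush, and a rollover that makes freed ASIDs
reusable after one global flush. The 64-entry space and the stub
flush are assumptions for the demo:

#include <stdio.h>
#include <stdbool.h>

#define MAX_ASID	64	/* tiny stand-in for MAX_ASID_AVAILABLE */
#define FIRST_GLOBAL	8	/* stand-in for TLB_NR_DYN_ASIDS */

static bool asid_used[MAX_ASID];
static bool asid_freed[MAX_ASID];
static int last_asid = MAX_ASID;	/* force a rollover on first use */

static void flush_all_nonglobals(void)
{
	printf("INVLPGB: flush all non-global mappings\n");
}

/* After one global flush, previously freed ASIDs become reusable. */
static void reset_asid_space(void)
{
	flush_all_nonglobals();
	for (int i = 0; i < MAX_ASID; i++) {
		if (asid_freed[i])
			asid_used[i] = false;
		asid_freed[i] = false;
	}
	last_asid = FIRST_GLOBAL;
}

static int allocate_asid(void)
{
	if (last_asid >= MAX_ASID - 1)
		reset_asid_space();

	for (int asid = last_asid; asid < MAX_ASID; asid++) {
		if (!asid_used[asid]) {
			asid_used[asid] = true;
			last_asid = asid;
			return asid;
		}
	}
	return 0;	/* out of global ASIDs */
}

/* Freed ASIDs stay "used" until the next rollover flush. */
static void free_asid(int asid)
{
	asid_freed[asid] = true;
}

int main(void)
{
	int a = allocate_asid(), b = allocate_asid();

	free_asid(a);
	printf("allocated %d and %d, freed %d\n", a, b, a);
	return 0;
}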

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/include/asm/mmu.h         |  11 +++
 arch/x86/include/asm/mmu_context.h |   2 +
 arch/x86/include/asm/tlbflush.h    |  43 +++++++++
 arch/x86/mm/tlb.c                  | 146 ++++++++++++++++++++++++++++-
 4 files changed, 199 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 3b496cdcb74b..edb5942d4829 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -69,6 +69,17 @@ typedef struct {
 	u16 pkey_allocation_map;
 	s16 execute_only_pkey;
 #endif
+
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+	/*
+	 * The global ASID will be a non-zero value when the process has
+	 * the same ASID across all CPUs, allowing it to make use of
+	 * hardware-assisted remote TLB invalidation like AMD INVLPGB.
+	 */
+	u16 global_asid;
+	/* The process is transitioning to a new global ASID number. */
+	bool asid_transition;
+#endif
 } mm_context_t;
 
 #define INIT_MM_CONTEXT(mm)						\
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 795fdd53bd0a..a2c70e495b1b 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -139,6 +139,8 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
 #define enter_lazy_tlb enter_lazy_tlb
 extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
+extern void mm_free_global_asid(struct mm_struct *mm);
+
 /*
  * Init a new mm.  Used on mm copies, like at fork()
  * and on mm's that are brand-new, like at execve().
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 855c13da2045..8e7df0ed7005 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -6,6 +6,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/sched.h>
 
+#include <asm/barrier.h>
 #include <asm/processor.h>
 #include <asm/cpufeature.h>
 #include <asm/special_insns.h>
@@ -234,6 +235,48 @@ void flush_tlb_one_kernel(unsigned long addr);
 void flush_tlb_multi(const struct cpumask *cpumask,
 		      const struct flush_tlb_info *info);
 
+static inline bool is_dyn_asid(u16 asid)
+{
+	return asid < TLB_NR_DYN_ASIDS;
+}
+
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+static inline u16 mm_global_asid(struct mm_struct *mm)
+{
+	u16 asid;
+
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return 0;
+
+	asid = smp_load_acquire(&mm->context.global_asid);
+
+	/* mm->context.global_asid is either 0, or a global ASID */
+	VM_WARN_ON_ONCE(asid && is_dyn_asid(asid));
+
+	return asid;
+}
+
+static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
+{
+	/*
+	 * Notably flush_tlb_mm_range() -> broadcast_tlb_flush() ->
+	 * finish_asid_transition() needs to observe asid_transition = true
+	 * once it observes global_asid.
+	 */
+	mm->context.asid_transition = true;
+	smp_store_release(&mm->context.global_asid, asid);
+}
+#else
+static inline u16 mm_global_asid(struct mm_struct *mm)
+{
+	return 0;
+}
+
+static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
+{
+}
+#endif
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 1cc25e83bd34..9b1652c02452 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -74,13 +74,15 @@
  * use different names for each of them:
  *
  * ASID  - [0, TLB_NR_DYN_ASIDS-1]
- *         the canonical identifier for an mm
+ *         the canonical identifier for an mm, dynamically allocated on each CPU
+ *         [TLB_NR_DYN_ASIDS, MAX_ASID_AVAILABLE-1]
+ *         the canonical, global identifier for an mm, identical across all CPUs
  *
- * kPCID - [1, TLB_NR_DYN_ASIDS]
+ * kPCID - [1, MAX_ASID_AVAILABLE]
  *         the value we write into the PCID part of CR3; corresponds to the
  *         ASID+1, because PCID 0 is special.
  *
- * uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
+ * uPCID - [2048 + 1, 2048 + MAX_ASID_AVAILABLE]
  *         for KPTI each mm has two address spaces and thus needs two
  *         PCID values, but we can still do with a single ASID denomination
  *         for each mm. Corresponds to kPCID + 2048.
@@ -251,6 +253,144 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 	*need_flush = true;
 }
 
+/*
+ * Global ASIDs are allocated for multi-threaded processes that are
+ * active on multiple CPUs simultaneously, giving each of those
+ * processes the same PCID on every CPU, for use with hardware-assisted
+ * TLB shootdown on remote CPUs, like AMD INVLPGB or Intel RAR.
+ *
+ * These global ASIDs are held for the lifetime of the process.
+ */
+static DEFINE_RAW_SPINLOCK(global_asid_lock);
+static u16 last_global_asid = MAX_ASID_AVAILABLE;
+static DECLARE_BITMAP(global_asid_used, MAX_ASID_AVAILABLE);
+static DECLARE_BITMAP(global_asid_freed, MAX_ASID_AVAILABLE);
+static int global_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
+
+/*
+ * When the search for a free ASID in the global ASID space reaches
+ * MAX_ASID_AVAILABLE, a global TLB flush guarantees that previously
+ * freed global ASIDs are safe to re-use.
+ *
+ * This way the global flush only needs to happen at ASID rollover
+ * time, and not at ASID allocation time.
+ */
+static void reset_global_asid_space(void)
+{
+	lockdep_assert_held(&global_asid_lock);
+
+	invlpgb_flush_all_nonglobals();
+
+	/*
+	 * The TLB flush above makes it safe to re-use the previously
+	 * freed global ASIDs.
+	 */
+	bitmap_andnot(global_asid_used, global_asid_used,
+			global_asid_freed, MAX_ASID_AVAILABLE);
+	bitmap_clear(global_asid_freed, 0, MAX_ASID_AVAILABLE);
+
+	/* Restart the search from the start of global ASID space. */
+	last_global_asid = TLB_NR_DYN_ASIDS;
+}
+
+static u16 allocate_global_asid(void)
+{
+	u16 asid;
+
+	lockdep_assert_held(&global_asid_lock);
+
+	/* The previous allocation hit the edge of available address space */
+	if (last_global_asid >= MAX_ASID_AVAILABLE - 1)
+		reset_global_asid_space();
+
+	asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, last_global_asid);
+
+	if (asid >= MAX_ASID_AVAILABLE && !global_asid_available) {
+		/* This should never happen. */
+		VM_WARN_ONCE(1, "Unable to allocate global ASID despite %d available\n",
+				global_asid_available);
+		return 0;
+	}
+
+	/* Claim this global ASID. */
+	__set_bit(asid, global_asid_used);
+	last_global_asid = asid;
+	global_asid_available--;
+	return asid;
+}
+
+/*
+ * Check whether a process is currently active on more than @threshold CPUs.
+ * This is a cheap estimation on whether or not it may make sense to assign
+ * a global ASID to this process, and use broadcast TLB invalidation.
+ */
+static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
+{
+	int count = 0;
+	int cpu;
+
+	/* This quick check should eliminate most single threaded programs. */
+	if (cpumask_weight(mm_cpumask(mm)) <= threshold)
+		return false;
+
+	/* Slower check to make sure. */
+	for_each_cpu(cpu, mm_cpumask(mm)) {
+		/* Skip the CPUs that aren't really running this process. */
+		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
+			continue;
+
+		if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu))
+			continue;
+
+		if (++count > threshold)
+			return true;
+	}
+	return false;
+}
+
+/*
+ * Assign a global ASID to the current process, protecting against
+ * races between multiple threads in the process.
+ */
+static void use_global_asid(struct mm_struct *mm)
+{
+	u16 asid;
+
+	guard(raw_spinlock_irqsave)(&global_asid_lock);
+
+	/* This process is already using broadcast TLB invalidation. */
+	if (mm_global_asid(mm))
+		return;
+
+	/* The last global ASID was consumed while waiting for the lock. */
+	if (!global_asid_available) {
+		VM_WARN_ONCE(1, "Ran out of global ASIDs\n");
+		return;
+	}
+
+	asid = allocate_global_asid();
+	if (!asid)
+		return;
+
+	mm_assign_global_asid(mm, asid);
+}
+
+void mm_free_global_asid(struct mm_struct *mm)
+{
+	if (!mm_global_asid(mm))
+		return;
+
+	guard(raw_spinlock_irqsave)(&global_asid_lock);
+
+	/* The global ASID can be re-used only after flush at wrap-around. */
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+	__set_bit(mm->context.global_asid, global_asid_freed);
+
+	mm->context.global_asid = 0;
+	global_asid_available++;
+#endif
+}
+
 /*
  * Given an ASID, flush the corresponding user ASID.  We can delay this
  * until the next time we switch to it.
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v14 08/13] x86/mm: global ASID context switch & TLB flush handling
  2025-02-26  3:00 [PATCH v14 00/13] AMD broadcast TLB invalidation Rik van Riel
                   ` (6 preceding siblings ...)
  2025-02-26  3:00 ` [PATCH v14 07/13] x86/mm: add global ASID allocation helper functions Rik van Riel
@ 2025-02-26  3:00 ` Rik van Riel
  2025-03-02  7:58   ` Borislav Petkov
                     ` (2 more replies)
  2025-02-26  3:00 ` [PATCH v14 09/13] x86/mm: global ASID process exit helpers Rik van Riel
                   ` (6 subsequent siblings)
  14 siblings, 3 replies; 86+ messages in thread
From: Rik van Riel @ 2025-02-26  3:00 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

Context switch and TLB flush support for processes that use a global
ASID & PCID across all CPUs.

At both context switch time and TLB flush time, we need to check
whether a task is switching to a global ASID, and reload the TLB
with the new ASID as appropriate.

In both code paths, we also short-circuit the TLB flush if we
are using a global ASID, because the global ASIDs are always
kept up to date across CPUs, even while the process is not
running on a CPU.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/tlbflush.h | 18 ++++++++
 arch/x86/mm/tlb.c               | 77 ++++++++++++++++++++++++++++++---
 2 files changed, 88 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8e7df0ed7005..37b735dcf025 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -240,6 +240,11 @@ static inline bool is_dyn_asid(u16 asid)
 	return asid < TLB_NR_DYN_ASIDS;
 }
 
+static inline bool is_global_asid(u16 asid)
+{
+	return !is_dyn_asid(asid);
+}
+
 #ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
 static inline u16 mm_global_asid(struct mm_struct *mm)
 {
@@ -266,6 +271,14 @@ static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
 	mm->context.asid_transition = true;
 	smp_store_release(&mm->context.global_asid, asid);
 }
+
+static inline bool in_asid_transition(struct mm_struct *mm)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return false;
+
+	return mm && READ_ONCE(mm->context.asid_transition);
+}
 #else
 static inline u16 mm_global_asid(struct mm_struct *mm)
 {
@@ -275,6 +288,11 @@ static inline u16 mm_global_asid(struct mm_struct *mm)
 static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
 {
 }
+
+static inline bool in_asid_transition(struct mm_struct *mm)
+{
+	return false;
+}
 #endif
 
 #ifdef CONFIG_PARAVIRT
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 9b1652c02452..b7d461db1b08 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -227,6 +227,20 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 		return;
 	}
 
+	/*
+	 * TLB consistency for global ASIDs is maintained with hardware assisted
+	 * remote TLB flushing. Global ASIDs are always up to date.
+	 */
+	if (static_cpu_has(X86_FEATURE_INVLPGB)) {
+		u16 global_asid = mm_global_asid(next);
+
+		if (global_asid) {
+			*new_asid = global_asid;
+			*need_flush = false;
+			return;
+		}
+	}
+
 	if (this_cpu_read(cpu_tlbstate.invalidate_other))
 		clear_asid_other();
 
@@ -391,6 +405,23 @@ void mm_free_global_asid(struct mm_struct *mm)
 #endif
 }
 
+/*
+ * Is the mm transitioning from a CPU-local ASID to a global ASID?
+ */
+static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
+{
+	u16 global_asid = mm_global_asid(next);
+
+	if (!static_cpu_has(X86_FEATURE_INVLPGB))
+		return false;
+
+	/* Process is transitioning to a global ASID */
+	if (global_asid && prev_asid != global_asid)
+		return true;
+
+	return false;
+}
+
 /*
  * Given an ASID, flush the corresponding user ASID.  We can delay this
  * until the next time we switch to it.
@@ -696,7 +727,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 	 */
 	if (prev == next) {
 		/* Not actually switching mm's */
-		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
+		VM_WARN_ON(is_dyn_asid(prev_asid) &&
+			   this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
 			   next->context.ctx_id);
 
 		/*
@@ -713,6 +745,20 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
 			cpumask_set_cpu(cpu, mm_cpumask(next));
 
+		/* Check if the current mm is transitioning to a global ASID */
+		if (needs_global_asid_reload(next, prev_asid)) {
+			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+			choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
+			goto reload_tlb;
+		}
+
+		/*
+		 * Broadcast TLB invalidation keeps this PCID up to date
+		 * all the time.
+		 */
+		if (is_global_asid(prev_asid))
+			return;
+
 		/*
 		 * If the CPU is not in lazy TLB mode, we are just switching
 		 * from one thread in a process to another thread in the same
@@ -746,6 +792,13 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		 */
 		cond_mitigation(tsk);
 
+		/*
+		 * Let nmi_uaccess_okay() and finish_asid_transition()
+		 * know that we're changing CR3.
+		 */
+		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
+		barrier();
+
 		/*
 		 * Leave this CPU in prev's mm_cpumask. Atomic writes to
 		 * mm_cpumask can be expensive under contention. The CPU
@@ -760,14 +813,12 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 
 		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
-
-		/* Let nmi_uaccess_okay() know that we're changing CR3. */
-		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
-		barrier();
 	}
 
+reload_tlb:
 	new_lam = mm_lam_cr3_mask(next);
 	if (need_flush) {
+		VM_WARN_ON_ONCE(is_global_asid(new_asid));
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
 		load_new_mm_cr3(next->pgd, new_asid, new_lam, true);
@@ -886,7 +937,7 @@ static void flush_tlb_func(void *info)
 	const struct flush_tlb_info *f = info;
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+	u64 local_tlb_gen;
 	bool local = smp_processor_id() == f->initiating_cpu;
 	unsigned long nr_invalidate = 0;
 	u64 mm_tlb_gen;
@@ -909,6 +960,16 @@ static void flush_tlb_func(void *info)
 	if (unlikely(loaded_mm == &init_mm))
 		return;
 
+	/* Reload the ASID if transitioning into or out of a global ASID */
+	if (needs_global_asid_reload(loaded_mm, loaded_mm_asid)) {
+		switch_mm_irqs_off(NULL, loaded_mm, NULL);
+		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	}
+
+	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
+	if (is_global_asid(loaded_mm_asid))
+		return;
+
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
@@ -926,6 +987,8 @@ static void flush_tlb_func(void *info)
 		return;
 	}
 
+	local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+
 	if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
 		     f->new_tlb_gen <= local_tlb_gen)) {
 		/*
@@ -1093,7 +1156,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
 	 * up on the new contents of what used to be page tables, while
 	 * doing a speculative memory access.
 	 */
-	if (info->freed_tables)
+	if (info->freed_tables || in_asid_transition(info->mm))
 		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
 	else
 		on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v14 09/13] x86/mm: global ASID process exit helpers
  2025-02-26  3:00 [PATCH v14 00/13] AMD broadcast TLB invalidation Rik van Riel
                   ` (7 preceding siblings ...)
  2025-02-26  3:00 ` [PATCH v14 08/13] x86/mm: global ASID context switch & TLB flush handling Rik van Riel
@ 2025-02-26  3:00 ` Rik van Riel
  2025-03-02 12:38   ` Borislav Petkov
                     ` (2 more replies)
  2025-02-26  3:00 ` [PATCH v14 10/13] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
                   ` (5 subsequent siblings)
  14 siblings, 3 replies; 86+ messages in thread
From: Rik van Riel @ 2025-02-26  3:00 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

A global ASID is allocated for the lifetime of a process.

Free the global ASID at process exit time.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/mmu_context.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index a2c70e495b1b..b47ac6d270e6 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -163,6 +163,14 @@ static inline int init_new_context(struct task_struct *tsk,
 		mm->context.execute_only_pkey = -1;
 	}
 #endif
+
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+		mm->context.global_asid = 0;
+		mm->context.asid_transition = false;
+	}
+#endif
+
 	mm_reset_untag_mask(mm);
 	init_new_context_ldt(mm);
 	return 0;
@@ -172,6 +180,10 @@ static inline int init_new_context(struct task_struct *tsk,
 static inline void destroy_context(struct mm_struct *mm)
 {
 	destroy_context_ldt(mm);
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		mm_free_global_asid(mm);
+#endif
 }
 
 extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v14 10/13] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
  2025-02-26  3:00 [PATCH v14 00/13] AMD broadcast TLB invalidation Rik van Riel
                   ` (8 preceding siblings ...)
  2025-02-26  3:00 ` [PATCH v14 09/13] x86/mm: global ASID process exit helpers Rik van Riel
@ 2025-02-26  3:00 ` Rik van Riel
  2025-03-03 10:57   ` Borislav Petkov
                     ` (2 more replies)
  2025-02-26  3:00 ` [PATCH v14 11/13] x86/mm: do targeted broadcast flushing from tlbbatch code Rik van Riel
                   ` (4 subsequent siblings)
  14 siblings, 3 replies; 86+ messages in thread
From: Rik van Riel @ 2025-02-26  3:00 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

Use broadcast TLB invalidation with the INVLPGB instruction.

There is not enough room in the 12-bit ASID address space to hand
out broadcast ASIDs to every process. Only hand out broadcast ASIDs
to processes when they are observed to be simultaneously running
on 4 or more CPUs.

This also allows single-threaded processes to continue using the
cheaper, local TLB invalidation instructions like INVLPG.

Due to the structure of flush_tlb_mm_range, the INVLPGB
flushing is done in a generically named broadcast_tlb_flush
function, which can later also be used for Intel RAR.
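
One detail worth calling out from the hunk below: consider_global_asid()
only re-evaluates a process on roughly one TLB flush in 32, by comparing
the low bits of the PID against the low bits of jiffies. A user-space
sketch of that sampling idea (the pid value is made up):

#include <stdio.h>

/*
 * The check passes only when the low 5 bits of the pid match the
 * low 5 bits of the current tick, i.e. about once every 32 ticks.
 */
static int should_consider(int pid, unsigned long jiffies)
{
	return (pid & 0x1f) == (jiffies & 0x1f);
}

int main(void)
{
	int pid = 1234;
	int hits = 0;

	for (unsigned long jiffies = 0; jiffies < 320; jiffies++)
		hits += should_consider(pid, jiffies);

	printf("considered %d times out of 320 ticks\n", hits);	/* 10 */
	return 0;
}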

Combined with the removal of unnecessary lru_add_drain calls
(see https://lkml.org/lkml/2024/12/19/1388) this results in a
nice performance boost for the will-it-scale tlb_flush2_threads
test on an AMD Milan system with 36 cores:

- vanilla kernel:           527k loops/second
- lru_add_drain removal:    731k loops/second
- only INVLPGB:             527k loops/second
- lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that while
TLB invalidation went down from 40% of the total CPU time
to only around 4%, the contention simply moved to the
LRU lock.

Fixing both at the same time roughly doubles the
number of iterations per second in this test.

Comparing will-it-scale tlb_flush2_threads with
several different numbers of threads on a 72 CPU
AMD Milan shows similar results. The number
represents the total number of loops per second
across all the threads:

threads		tip		invlpgb

1		315k		304k
2		423k		424k
4		644k		1032k
8		652k		1267k
16		737k		1368k
32		759k		1199k
64		636k		1094k
72		609k		993k

1 and 2 thread performance is similar with and
without invlpgb, because invlpgb is only used
on processes using 4 or more CPUs simultaneously.

Each number is the median across 5 runs.

Some numbers closer to real world performance
can be found at Phoronix, thanks to Michael:

https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits

Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/include/asm/tlbflush.h |   9 +++
 arch/x86/mm/tlb.c               | 107 +++++++++++++++++++++++++++++++-
 2 files changed, 115 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 37b735dcf025..811dd70eb6b8 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -272,6 +272,11 @@ static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
 	smp_store_release(&mm->context.global_asid, asid);
 }
 
+static inline void clear_asid_transition(struct mm_struct *mm)
+{
+	WRITE_ONCE(mm->context.asid_transition, false);
+}
+
 static inline bool in_asid_transition(struct mm_struct *mm)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
@@ -289,6 +294,10 @@ static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
 {
 }
 
+static inline void clear_asid_transition(struct mm_struct *mm)
+{
+}
+
 static inline bool in_asid_transition(struct mm_struct *mm)
 {
 	return false;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b7d461db1b08..cd109bdf0dd9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -422,6 +422,108 @@ static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
 	return false;
 }
 
+/*
+ * x86 has 4k ASIDs (2k when compiled with KPTI), but the largest
+ * x86 systems have over 8k CPUs. Because of this potential ASID
+ * shortage, global ASIDs are handed out to processes that have
+ * frequent TLB flushes and are active on 4 or more CPUs simultaneously.
+ */
+static void consider_global_asid(struct mm_struct *mm)
+{
+	if (!static_cpu_has(X86_FEATURE_INVLPGB))
+		return;
+
+	/* Check every once in a while. */
+	if ((current->pid & 0x1f) != (jiffies & 0x1f))
+		return;
+
+	if (!READ_ONCE(global_asid_available))
+		return;
+
+	/*
+	 * Assign a global ASID if the process is active on
+	 * 4 or more CPUs simultaneously.
+	 */
+	if (mm_active_cpus_exceeds(mm, 3))
+		use_global_asid(mm);
+}
+
+static void finish_asid_transition(struct flush_tlb_info *info)
+{
+	struct mm_struct *mm = info->mm;
+	int bc_asid = mm_global_asid(mm);
+	int cpu;
+
+	if (!in_asid_transition(mm))
+		return;
+
+	for_each_cpu(cpu, mm_cpumask(mm)) {
+		/*
+		 * The remote CPU is context switching. Wait for that to
+		 * finish, to catch the unlikely case of it switching to
+		 * the target mm with an out of date ASID.
+		 */
+		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
+			cpu_relax();
+
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
+			continue;
+
+		/*
+		 * If at least one CPU is not using the global ASID yet,
+		 * send a TLB flush IPI. The IPI should cause stragglers
+		 * to transition soon.
+		 *
+		 * This can race with the CPU switching to another task;
+		 * that results in a (harmless) extra IPI.
+		 */
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
+			flush_tlb_multi(mm_cpumask(info->mm), info);
+			return;
+		}
+	}
+
+	/* All the CPUs running this process are using the global ASID. */
+	clear_asid_transition(mm);
+}
+
+static void broadcast_tlb_flush(struct flush_tlb_info *info)
+{
+	bool pmd = info->stride_shift == PMD_SHIFT;
+	unsigned long asid = mm_global_asid(info->mm);
+	unsigned long addr = info->start;
+
+	/*
+	 * TLB flushes with INVLPGB are kicked off asynchronously.
+	 * The inc_mm_tlb_gen() guarantees page table updates are done
+	 * before these TLB flushes happen.
+	 */
+	if (info->end == TLB_FLUSH_ALL) {
+		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
+		/* Do any CPUs supporting INVLPGB need PTI? */
+		if (static_cpu_has(X86_FEATURE_PTI))
+			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
+	} else do {
+		unsigned long nr = 1;
+
+		if (info->stride_shift <= PMD_SHIFT) {
+			nr = (info->end - addr) >> info->stride_shift;
+			nr = clamp_val(nr, 1, invlpgb_count_max);
+		}
+
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
+		if (static_cpu_has(X86_FEATURE_PTI))
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
+
+		addr += nr << info->stride_shift;
+	} while (addr < info->end);
+
+	finish_asid_transition(info);
+
+	/* Wait for the INVLPGBs kicked off above to finish. */
+	__tlbsync();
+}
+
 /*
  * Given an ASID, flush the corresponding user ASID.  We can delay this
  * until the next time we switch to it.
@@ -1252,9 +1354,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+	if (mm_global_asid(mm)) {
+		broadcast_tlb_flush(info);
+	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
 		info->trim_cpumask = should_trim_cpumask(mm);
 		flush_tlb_multi(mm_cpumask(mm), info);
+		consider_global_asid(mm);
 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v14 11/13] x86/mm: do targeted broadcast flushing from tlbbatch code
  2025-02-26  3:00 [PATCH v14 00/13] AMD broadcast TLB invalidation Rik van Riel
                   ` (9 preceding siblings ...)
  2025-02-26  3:00 ` [PATCH v14 10/13] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
@ 2025-02-26  3:00 ` Rik van Riel
  2025-03-03 11:46   ` Borislav Petkov
  2025-02-26  3:00 ` [PATCH v14 12/13] x86/mm: enable AMD translation cache extensions Rik van Riel
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 86+ messages in thread
From: Rik van Riel @ 2025-02-26  3:00 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

Instead of doing a system-wide TLB flush from arch_tlbbatch_flush,
queue up asynchronous, targeted flushes from arch_tlbbatch_add_pending.

This also allows us to avoid adding the CPUs of processes using broadcast
flushing to the batch->cpumask, and will hopefully further reduce TLB
flushing from the reclaim and compaction paths.
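
A minimal model of the deferred-TLBSYNC bookkeeping this patch adds:
every nosync flush marks the CPU as owing a TLBSYNC, and tlbsync()
only executes (and clears the flag) when something is pending. The
thread-local flag and the printouts stand in for the per-CPU state
and the real instructions:

#include <stdio.h>
#include <stdbool.h>

/* Stand-in for cpu_tlbstate.need_tlbsync. */
static _Thread_local bool need_tlbsync;

static void invlpgb_flush_addr_nosync(unsigned long addr, unsigned int nr)
{
	printf("INVLPGB addr=%#lx nr=%u (async)\n", addr, nr);
	need_tlbsync = true;	/* this CPU now owes a TLBSYNC */
}

static void tlbsync(void)
{
	if (!need_tlbsync)
		return;		/* nothing pending, skip the barrier */
	printf("TLBSYNC\n");
	need_tlbsync = false;
}

int main(void)
{
	tlbsync();				/* no-op: nothing queued yet */

	invlpgb_flush_addr_nosync(0x1000, 1);
	invlpgb_flush_addr_nosync(0x5000, 4);
	tlbsync();				/* waits for both flushes above */
	return 0;
}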

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/include/asm/tlb.h      | 12 ++---
 arch/x86/include/asm/tlbflush.h | 34 ++++++++++----
 arch/x86/mm/tlb.c               | 79 +++++++++++++++++++++++++++++++--
 3 files changed, 107 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 91c9a4da3ace..e645884a1877 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -89,16 +89,16 @@ static inline void __tlbsync(void)
 #define INVLPGB_FINAL_ONLY		BIT(4)
 #define INVLPGB_INCLUDE_NESTED		BIT(5)
 
-static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
-						unsigned long addr,
-						u16 nr,
-						bool pmd_stride)
+static inline void __invlpgb_flush_user_nr_nosync(unsigned long pcid,
+						  unsigned long addr,
+						  u16 nr,
+						  bool pmd_stride)
 {
 	__invlpgb(0, pcid, addr, nr, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
 }
 
 /* Flush all mappings for a given PCID, not including globals. */
-static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
+static inline void __invlpgb_flush_single_pcid_nosync(unsigned long pcid)
 {
 	__invlpgb(0, pcid, 0, 1, 0, INVLPGB_PCID);
 }
@@ -111,7 +111,7 @@ static inline void invlpgb_flush_all(void)
 }
 
 /* Flush addr, including globals, for all PCIDs. */
-static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
+static inline void __invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
 {
 	__invlpgb(0, 0, addr, nr, 0, INVLPGB_INCLUDE_GLOBAL);
 }
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 811dd70eb6b8..22462bd4b1ee 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -105,6 +105,9 @@ struct tlb_state {
 	 * need to be invalidated.
 	 */
 	bool invalidate_other;
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+	bool need_tlbsync;
+#endif
 
 #ifdef CONFIG_ADDRESS_MASKING
 	/*
@@ -284,6 +287,16 @@ static inline bool in_asid_transition(struct mm_struct *mm)
 
 	return mm && READ_ONCE(mm->context.asid_transition);
 }
+
+static inline bool cpu_need_tlbsync(void)
+{
+	return this_cpu_read(cpu_tlbstate.need_tlbsync);
+}
+
+static inline void cpu_write_tlbsync(bool state)
+{
+	this_cpu_write(cpu_tlbstate.need_tlbsync, state);
+}
 #else
 static inline u16 mm_global_asid(struct mm_struct *mm)
 {
@@ -302,6 +315,15 @@ static inline bool in_asid_transition(struct mm_struct *mm)
 {
 	return false;
 }
+
+static inline bool cpu_need_tlbsync(void)
+{
+	return false;
+}
+
+static inline void cpu_write_tlbsync(bool state)
+{
+}
 #endif
 
 #ifdef CONFIG_PARAVIRT
@@ -351,21 +373,15 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 	return atomic64_inc_return(&mm->context.tlb_gen);
 }
 
-static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
-					     struct mm_struct *mm,
-					     unsigned long uaddr)
-{
-	inc_mm_tlb_gen(mm);
-	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
-	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
-}
-
 static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
 {
 	flush_tlb_mm(mm);
 }
 
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+					     struct mm_struct *mm,
+					     unsigned long uaddr);
 
 static inline bool pte_flags_need_flush(unsigned long oldflags,
 					unsigned long newflags,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index cd109bdf0dd9..4d56d22b9893 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -487,6 +487,37 @@ static void finish_asid_transition(struct flush_tlb_info *info)
 	clear_asid_transition(mm);
 }
 
+static inline void tlbsync(void)
+{
+	if (!cpu_need_tlbsync())
+		return;
+	__tlbsync();
+	cpu_write_tlbsync(false);
+}
+
+static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
+						unsigned long addr,
+						u16 nr, bool pmd_stride)
+{
+	__invlpgb_flush_user_nr_nosync(pcid, addr, nr, pmd_stride);
+	if (!cpu_need_tlbsync())
+		cpu_write_tlbsync(true);
+}
+
+static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
+{
+	__invlpgb_flush_single_pcid_nosync(pcid);
+	if (!cpu_need_tlbsync())
+		cpu_write_tlbsync(true);
+}
+
+static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
+{
+	__invlpgb_flush_addr_nosync(addr, nr);
+	if (!cpu_need_tlbsync())
+		cpu_write_tlbsync(true);
+}
+
 static void broadcast_tlb_flush(struct flush_tlb_info *info)
 {
 	bool pmd = info->stride_shift == PMD_SHIFT;
@@ -785,6 +816,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
 		WARN_ON_ONCE(!irqs_disabled());
 
+	tlbsync();
+
 	/*
 	 * Verify that CR3 is what we think it is.  This will catch
 	 * hypothetical buggy code that directly switches to swapper_pg_dir
@@ -961,6 +994,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
  */
 void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
+	tlbsync();
+
 	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
 		return;
 
@@ -1624,9 +1659,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
-		invlpgb_flush_all_nonglobals();
-	} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
 		flush_tlb_multi(&batch->cpumask, info);
 	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();
@@ -1635,12 +1668,52 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 		local_irq_enable();
 	}
 
+	/*
+	 * If we issued (asynchronous) INVLPGB flushes, wait for them here.
+	 * The cpumask above contains only CPUs that were running tasks
+	 * not using broadcast TLB flushing.
+	 */
+	tlbsync();
+
 	cpumask_clear(&batch->cpumask);
 
 	put_flush_tlb_info();
 	put_cpu();
 }
 
+void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+					     struct mm_struct *mm,
+					     unsigned long uaddr)
+{
+	u16 asid = mm_global_asid(mm);
+
+	if (asid) {
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
+		/* Do any CPUs supporting INVLPGB need PTI? */
+		if (static_cpu_has(X86_FEATURE_PTI))
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false);
+
+		/*
+		 * Some CPUs might still be using a local ASID for this
+		 * process, and require IPIs, while others are using the
+		 * global ASID.
+		 *
+		 * In this corner case we need to do both the broadcast
+		 * TLB invalidation, and send IPIs. The IPIs will help
+		 * stragglers transition to the broadcast ASID.
+		 */
+		if (in_asid_transition(mm))
+			asid = 0;
+	}
+
+	if (!asid) {
+		inc_mm_tlb_gen(mm);
+		cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
+	}
+
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
+}
+
 /*
  * Blindly accessing user memory from NMI context can be dangerous
  * if we're in the middle of switching the current user task or
-- 
2.47.1


* [PATCH v14 12/13] x86/mm: enable AMD translation cache extensions
  2025-02-26  3:00 [PATCH v14 00/13] AMD broadcast TLB invalidation Rik van Riel
                   ` (10 preceding siblings ...)
  2025-02-26  3:00 ` [PATCH v14 11/13] x86/mm: do targeted broadcast flushing from tlbbatch code Rik van Riel
@ 2025-02-26  3:00 ` Rik van Riel
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Enable " tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
  2025-02-26  3:00 ` [PATCH v14 13/13] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
                   ` (2 subsequent siblings)
  14 siblings, 2 replies; 86+ messages in thread
From: Rik van Riel @ 2025-02-26  3:00 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

With AMD TCE (Translation Cache Extensions), INVLPG / INVLPGB invalidate only
the intermediate mappings that cover the zapped address range, rather than
zapping all cached intermediate mappings on every TLB invalidation.

This can help reduce the TLB miss rate, by keeping more intermediate
mappings in the cache.

From the AMD manual:

Translation Cache Extension (TCE) Bit. Bit 15, read/write. Setting this bit
to 1 changes how the INVLPG, INVLPGB, and INVPCID instructions operate on
TLB entries. When this bit is 0, these instructions remove the target PTE
from the TLB as well as all upper-level table entries that are cached
in the TLB, whether or not they are associated with the target PTE.
When this bit is set, these instructions will remove the target PTE and
only those upper-level entries that lead to the target PTE in
the page table hierarchy, leaving unrelated upper-level entries intact.
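
To make the difference concrete, a hypothetical example (the address is
made up) of what a single 4k invalidation leaves cached:

	/*
	 * invlpg on user address 0x00007f0012345000:
	 *
	 *  EFER.TCE == 0: the target PTE is dropped, and so are *all*
	 *                 cached upper-level entries (PDE, PDPE, PML4E),
	 *                 including ones covering unrelated addresses.
	 *
	 *  EFER.TCE == 1: the target PTE is dropped, plus only the
	 *                 upper-level entries on the walk to that PTE;
	 *                 unrelated intermediate translations stay cached.
	 */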

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/include/asm/msr-index.h       | 2 ++
 arch/x86/kernel/cpu/amd.c              | 4 ++++
 tools/arch/x86/include/asm/msr-index.h | 2 ++
 3 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 9a71880eec07..a7ea9720ba3c 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -25,6 +25,7 @@
 #define _EFER_SVME		12 /* Enable virtualization */
 #define _EFER_LMSLE		13 /* Long Mode Segment Limit Enable */
 #define _EFER_FFXSR		14 /* Enable Fast FXSAVE/FXRSTOR */
+#define _EFER_TCE		15 /* Enable Translation Cache Extensions */
 #define _EFER_AUTOIBRS		21 /* Enable Automatic IBRS */
 
 #define EFER_SCE		(1<<_EFER_SCE)
@@ -34,6 +35,7 @@
 #define EFER_SVME		(1<<_EFER_SVME)
 #define EFER_LMSLE		(1<<_EFER_LMSLE)
 #define EFER_FFXSR		(1<<_EFER_FFXSR)
+#define EFER_TCE		(1<<_EFER_TCE)
 #define EFER_AUTOIBRS		(1<<_EFER_AUTOIBRS)
 
 /*
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 3c75c174a274..2bd512a1b4d0 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1075,6 +1075,10 @@ static void init_amd(struct cpuinfo_x86 *c)
 
 	/* AMD CPUs don't need fencing after x2APIC/TSC_DEADLINE MSR writes. */
 	clear_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
+
+	/* Enable Translation Cache Extension */
+	if (cpu_feature_enabled(X86_FEATURE_TCE))
+		msr_set_bit(MSR_EFER, _EFER_TCE);
 }
 
 #ifdef CONFIG_X86_32
diff --git a/tools/arch/x86/include/asm/msr-index.h b/tools/arch/x86/include/asm/msr-index.h
index 3ae84c3b8e6d..dc1c1057f26e 100644
--- a/tools/arch/x86/include/asm/msr-index.h
+++ b/tools/arch/x86/include/asm/msr-index.h
@@ -25,6 +25,7 @@
 #define _EFER_SVME		12 /* Enable virtualization */
 #define _EFER_LMSLE		13 /* Long Mode Segment Limit Enable */
 #define _EFER_FFXSR		14 /* Enable Fast FXSAVE/FXRSTOR */
+#define _EFER_TCE		15 /* Enable Translation Cache Extensions */
 #define _EFER_AUTOIBRS		21 /* Enable Automatic IBRS */
 
 #define EFER_SCE		(1<<_EFER_SCE)
@@ -34,6 +35,7 @@
 #define EFER_SVME		(1<<_EFER_SVME)
 #define EFER_LMSLE		(1<<_EFER_LMSLE)
 #define EFER_FFXSR		(1<<_EFER_FFXSR)
+#define EFER_TCE		(1<<_EFER_TCE)
 #define EFER_AUTOIBRS		(1<<_EFER_AUTOIBRS)
 
 /*
-- 
2.47.1


* [PATCH v14 13/13] x86/mm: only invalidate final translations with INVLPGB
  2025-02-26  3:00 [PATCH v14 00/13] AMD broadcast TLB invalidation Rik van Riel
                   ` (11 preceding siblings ...)
  2025-02-26  3:00 ` [PATCH v14 12/13] x86/mm: enable AMD translation cache extensions Rik van Riel
@ 2025-02-26  3:00 ` Rik van Riel
  2025-03-03 22:40   ` Dave Hansen
  2025-03-03 12:42 ` [PATCH v14 00/13] AMD broadcast TLB invalidation Borislav Petkov
  2025-03-04 12:04 ` [PATCH] x86/mm: Always set the ASID valid bit for the INVLPGB instruction Borislav Petkov
  14 siblings, 1 reply; 86+ messages in thread
From: Rik van Riel @ 2025-02-26  3:00 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

Use the INVLPGB_FINAL_ONLY flag when invalidating mappings with INVLPGB.
This way only leaf mappings get removed from the TLB, leaving intermediate
translations cached.

On the (rare) occasions when we free page tables, we do a full flush,
ensuring intermediate translations get flushed from the TLB.
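
For illustration, the distinction as seen by a caller of
flush_tlb_mm_range() (a sketch; real callers get freed_tables from the
mmu_gather / flush_tlb_info plumbing rather than passing it by hand):

	/* A PTE changed, no page tables freed: a leaf-only flush is enough. */
	flush_tlb_mm_range(mm, addr, addr + PAGE_SIZE, PAGE_SHIFT, false);

	/* Page tables were freed: intermediate translations must go too. */
	flush_tlb_mm_range(mm, start, end, PAGE_SHIFT, true);

Only in the second case is INVLPGB_FINAL_ONLY left out, so the cached
upper-level entries get dropped as well.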

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/include/asm/tlb.h | 10 ++++++++--
 arch/x86/mm/tlb.c          | 13 +++++++------
 2 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index e645884a1877..8d78667a2d1b 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -92,9 +92,15 @@ static inline void __tlbsync(void)
 static inline void __invlpgb_flush_user_nr_nosync(unsigned long pcid,
 						  unsigned long addr,
 						  u16 nr,
-						  bool pmd_stride)
+						  bool pmd_stride,
+						  bool freed_tables)
 {
-	__invlpgb(0, pcid, addr, nr, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
+	u8 flags = INVLPGB_PCID | INVLPGB_VA;
+
+	if (!freed_tables)
+		flags |= INVLPGB_FINAL_ONLY;
+
+	__invlpgb(0, pcid, addr, nr, pmd_stride, flags);
 }
 
 /* Flush all mappings for a given PCID, not including globals. */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 4d56d22b9893..91680cfd5868 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -497,9 +497,10 @@ static inline void tlbsync(void)
 
 static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
 						unsigned long addr,
-						u16 nr, bool pmd_stride)
+						u16 nr, bool pmd_stride,
+						bool freed_tables)
 {
-	__invlpgb_flush_user_nr_nosync(pcid, addr, nr, pmd_stride);
+	__invlpgb_flush_user_nr_nosync(pcid, addr, nr, pmd_stride, freed_tables);
 	if (!cpu_need_tlbsync())
 		cpu_write_tlbsync(true);
 }
@@ -542,9 +543,9 @@ static void broadcast_tlb_flush(struct flush_tlb_info *info)
 			nr = clamp_val(nr, 1, invlpgb_count_max);
 		}
 
-		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd, info->freed_tables);
 		if (static_cpu_has(X86_FEATURE_PTI))
-			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd, info->freed_tables);
 
 		addr += nr << info->stride_shift;
 	} while (addr < info->end);
@@ -1688,10 +1689,10 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
 	u16 asid = mm_global_asid(mm);
 
 	if (asid) {
-		invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false, false);
 		/* Do any CPUs supporting INVLPGB need PTI? */
 		if (static_cpu_has(X86_FEATURE_PTI))
-			invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false);
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false, false);
 
 		/*
 		 * Some CPUs might still be using a local ASID for this
-- 
2.47.1


* Re: [PATCH v14 02/13] x86/mm: get INVLPGB count max from CPUID
  2025-02-26  3:00 ` [PATCH v14 02/13] x86/mm: get INVLPGB count max from CPUID Rik van Riel
@ 2025-02-28 16:21   ` Borislav Petkov
  2025-02-28 19:27   ` Borislav Petkov
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 86+ messages in thread
From: Borislav Petkov @ 2025-02-28 16:21 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Dave Hansen

On Tue, Feb 25, 2025 at 10:00:37PM -0500, Rik van Riel wrote:
> @@ -1139,6 +1141,10 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
>  		tlb_lli_2m[ENTRIES] = eax & mask;
>  
>  	tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
> +
> +	/* Max number of pages INVLPGB can invalidate in one shot */
> +	if (boot_cpu_has(X86_FEATURE_INVLPGB))
	    ^^^^^^^^^^^^^

I even sent you an *actual* hunk which you could've merged ontop of yours!

https://lore.kernel.org/r/20250224120029.GDZ7xfXV3jMjnbdbGl@fat_crate.local

What is the best way to convey to you what needs to be done?

Next diff ontop:

---
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 9f48596da4f0..0e2e9af18cef 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1143,7 +1143,7 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
 	tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
 
 	/* Max number of pages INVLPGB can invalidate in one shot */
-	if (boot_cpu_has(X86_FEATURE_INVLPGB))
+	if (cpu_has(c, X86_FEATURE_INVLPGB))
 		invlpgb_count_max = (cpuid_edx(0x80000008) & 0xffff) + 1;
 }
 

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCH v14 03/13] x86/mm: add INVLPGB support code
  2025-02-26  3:00 ` [PATCH v14 03/13] x86/mm: add INVLPGB support code Rik van Riel
@ 2025-02-28 18:46   ` Borislav Petkov
  2025-02-28 18:51   ` Dave Hansen
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 86+ messages in thread
From: Borislav Petkov @ 2025-02-28 18:46 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Dave Hansen

On Tue, Feb 25, 2025 at 10:00:38PM -0500, Rik van Riel wrote:
> Add helper functions and definitions needed to use broadcast TLB
> invalidation on AMD EPYC 3 and newer CPUs.
> 
> All the functions defined in invlpgb.h are used later in the series.

Uff, that's tlb.h now. As already said. :-\

Btw, this is why there's no point to write *what* the patch does - that is
visible from the diff itself. This sentence is simply not needed.

> Compile time disabling X86_FEATURE_INVLPGB when the config
> option is not set allows the compiler to omit unnecessary code.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Tested-by: Manali Shukla <Manali.Shukla@amd.com>
> Tested-by: Brendan Jackman <jackmanb@google.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Acked-by: Dave Hansen <dave.hansen@intel.com>

And I asked you already but still crickets:

What do those Tested-by tags mean if you keep changing the patches?!

https://lore.kernel.org/r/20250224123142.GFZ7xmruuyrc2Wy0r7@fat_crate.local

...

IOW, you need to drop those tags.

> +/* Flush all mappings for all PCIDs except globals. */

This comment should state that addr=0 means both rax[1] (valid PCID) and
rax[2] (valid ASID) are clear, and that this means: flush *any* PCID and ASID,
so that it is clear.

> +static inline void invlpgb_flush_all_nonglobals(void)
> +{
> +	__invlpgb(0, 0, 0, 1, 0, 0);
> +	__tlbsync();
> +}
> +
>  #endif /* _ASM_X86_TLB_H */
> -- 

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCH v14 03/13] x86/mm: add INVLPGB support code
  2025-02-26  3:00 ` [PATCH v14 03/13] x86/mm: add INVLPGB support code Rik van Riel
  2025-02-28 18:46   ` Borislav Petkov
@ 2025-02-28 18:51   ` Dave Hansen
  2025-02-28 19:47   ` Borislav Petkov
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2025-02-28 18:51 UTC (permalink / raw)
  To: Rik van Riel, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On 2/25/25 19:00, Rik van Riel wrote:
> Add helper functions and definitions needed to use broadcast TLB
> invalidation on AMD EPYC 3 and newer CPUs.

I don't know if I mentioned it earlier, but I'd leave this explanation
of where the feature shows up for the cover letter or the Documentation/.

* Re: [PATCH v14 06/13] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
  2025-02-26  3:00 ` [PATCH v14 06/13] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
@ 2025-02-28 18:57   ` Borislav Petkov
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Use broadcast TLB flushing in page reclaim tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
  2 siblings, 0 replies; 86+ messages in thread
From: Borislav Petkov @ 2025-02-28 18:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Tue, Feb 25, 2025 at 10:00:41PM -0500, Rik van Riel wrote:
> In the page reclaim code, we only track the CPU(s) where the TLB needs
> to be flushed, rather than all the individual mappings that may be getting
> invalidated.
> 
> Use broadcast TLB flushing when that is available.
> 
> This is a temporary hack to ensure that the PCID context for
> tasks in the next patch gets properly flushed from the page
> reclaim code, because the IPI based flushing in arch_tlbbatch_flush
> only flushes the currently loaded TLB context on each CPU.

Ok, why am I bothering to review stuff again if you're going to ignore it?

https://lore.kernel.org/r/20250224132711.GHZ7xzr0vdhva3-TvK@fat_crate.local

Please go through all review feedback to v13 and incorporate it properly.

It is not nice to ignore review feedback: you either agree and you incorporate
it or you don't and you explain why.

But ignoring it silently is very unfriendly. :-(

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCH v14 04/13] x86/mm: use INVLPGB for kernel TLB flushes
  2025-02-26  3:00 ` [PATCH v14 04/13] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
@ 2025-02-28 19:00   ` Dave Hansen
  2025-02-28 21:43   ` Borislav Petkov
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2025-02-28 19:00 UTC (permalink / raw)
  To: Rik van Riel, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On 2/25/25 19:00, Rik van Riel wrote:
> Use broadcast TLB invalidation for kernel addresses when available.
> 
> Remove the need to send IPIs for kernel TLB flushes.

Nit: the changelog doesn't address the refactoring.

*Ideally*, you'd create the helpers and move the code there in one patch
and then actually "use INVLPGB for kernel TLB flushes" in the next. It's
compact enough here that it's not a deal breaker.

> +static void invlpgb_kernel_range_flush(struct flush_tlb_info *info)
> +{
> +	unsigned long addr, nr;
> +
> +	for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
> +		nr = (info->end - addr) >> PAGE_SHIFT;
> +		nr = clamp_val(nr, 1, invlpgb_count_max);
> +		invlpgb_flush_addr_nosync(addr, nr);
> +	}
> +	__tlbsync();
> +}

This needs a comment or two explaining that the function can take large
sizes:

/*
 * Flush an arbitrarily large range of memory with INVLPGB
 */

But the fact that the _instruction_ cannot is important. This would be great
in the loop just above the clamp:

		/*
		 * INVLPGB has a limit on the size of ranges
		 * it can flush. Break large flushes up.
		 */

>  static void do_kernel_range_flush(void *info)
>  {
>  	struct flush_tlb_info *f = info;
> @@ -1087,6 +1099,22 @@ static void do_kernel_range_flush(void *info)
>  		flush_tlb_one_kernel(addr);
>  }
>  
> +static void kernel_tlb_flush_all(struct flush_tlb_info *info)
> +{
> +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		invlpgb_flush_all();
> +	else
> +		on_each_cpu(do_flush_tlb_all, NULL, 1);
> +}
> +
> +static void kernel_tlb_flush_range(struct flush_tlb_info *info)
> +{
> +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		invlpgb_kernel_range_flush(info);
> +	else
> +		on_each_cpu(do_kernel_range_flush, info, 1);
> +}
> +
>  void flush_tlb_kernel_range(unsigned long start, unsigned long end)
>  {
>  	struct flush_tlb_info *info;
> @@ -1097,9 +1125,9 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
>  				  TLB_GENERATION_INVALID);
>  
>  	if (info->end == TLB_FLUSH_ALL)
> -		on_each_cpu(do_flush_tlb_all, NULL, 1);
> +		kernel_tlb_flush_all(info);
>  	else
> -		on_each_cpu(do_kernel_range_flush, info, 1);
> +		kernel_tlb_flush_range(info);
>  
>  	put_flush_tlb_info();
>  }

But the structure of this code is much better than previous versions.
With the comments fixed:

Acked-by: Dave Hansen <dave.hansen@intel.com>

* Re: [PATCH v14 05/13] x86/mm: use INVLPGB in flush_tlb_all
  2025-02-26  3:00 ` [PATCH v14 05/13] x86/mm: use INVLPGB in flush_tlb_all Rik van Riel
@ 2025-02-28 19:18   ` Dave Hansen
  2025-03-01 12:20     ` Borislav Petkov
  2025-02-28 22:20   ` Borislav Petkov
  1 sibling, 1 reply; 86+ messages in thread
From: Dave Hansen @ 2025-02-28 19:18 UTC (permalink / raw)
  To: Rik van Riel, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On 2/25/25 19:00, Rik van Riel wrote:
>  void flush_tlb_all(void)
>  {
>  	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
> +
> +	/* First try (faster) hardware-assisted TLB invalidation. */
> +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> +		guard(preempt)();
> +		invlpgb_flush_all();
> +		return;
> +	}

We haven't talked at all about the locking rules for
invlpgb_flush_all(). It was used once in this series without any
explicit preempt twiddling. I assume that was because it was used in a
path where preempt is disabled.

If it does need a universal rule about preempt, can we please add an:

	lockdep_assert_preemption_disabled()

along with a comment about why it needs preempt disabled?
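
Concretely, something like this, as a sketch of the proposed rule
(assuming the convention is that callers must keep the INVLPGB +
TLBSYNC pair on one CPU by disabling preemption):

	/* Flush all mappings, including globals, for all PCIDs. */
	static inline void invlpgb_flush_all(void)
	{
		/* TLBSYNC only waits for INVLPGBs issued on this CPU. */
		lockdep_assert_preemption_disabled();

		__invlpgb(0, 0, 0, 1, 0, INVLPGB_INCLUDE_GLOBAL);
		__tlbsync();
	}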

Also, the previous code did:

	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
		invlpgb_foo();
	else
		old_foo();

Is there a reason to break with that pattern? It would be nice to be
consistent.

* Re: [PATCH v14 02/13] x86/mm: get INVLPGB count max from CPUID
  2025-02-26  3:00 ` [PATCH v14 02/13] x86/mm: get INVLPGB count max from CPUID Rik van Riel
  2025-02-28 16:21   ` Borislav Petkov
@ 2025-02-28 19:27   ` Borislav Petkov
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Add INVLPGB feature and Kconfig entry tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
  3 siblings, 0 replies; 86+ messages in thread
From: Borislav Petkov @ 2025-02-28 19:27 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Dave Hansen

On Tue, Feb 25, 2025 at 10:00:37PM -0500, Rik van Riel wrote:
> The CPU advertises the maximum number of pages that can be shot down
> with one INVLPGB instruction in the CPUID data.
> 
> Save that information for later use.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Tested-by: Manali Shukla <Manali.Shukla@amd.com>
> Tested-by: Brendan Jackman <jackmanb@google.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Acked-by: Dave Hansen <dave.hansen@intel.com>
> ---
>  arch/x86/Kconfig.cpu               | 4 ++++
>  arch/x86/include/asm/cpufeatures.h | 1 +
>  arch/x86/include/asm/tlbflush.h    | 3 +++
>  arch/x86/kernel/cpu/amd.c          | 6 ++++++
>  4 files changed, 14 insertions(+)

As discussed, changed ontop.

 diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
-index 2a7279d80460..981def9cbfac 100644
+index f8b3296fe2e1..63864f5348f2 100644
 --- a/arch/x86/Kconfig.cpu
 +++ b/arch/x86/Kconfig.cpu
-@@ -401,6 +401,10 @@ menuconfig PROCESSOR_SELECT
+@@ -334,6 +334,10 @@ menuconfig PROCESSOR_SELECT
  	  This lets you choose what x86 vendor support code your kernel
  	  will include.
  
@@ -37,14 +36,14 @@ index 2a7279d80460..981def9cbfac 100644
  	default y
  	bool "Support Intel processors" if PROCESSOR_SELECT
 diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
-index 508c0dad116b..b5c66b7465ba 100644
+index d5985e8eef29..c0462be0c5f6 100644
 --- a/arch/x86/include/asm/cpufeatures.h
 +++ b/arch/x86/include/asm/cpufeatures.h
-@@ -338,6 +338,7 @@
+@@ -330,6 +330,7 @@
  #define X86_FEATURE_CLZERO		(13*32+ 0) /* "clzero" CLZERO instruction */
  #define X86_FEATURE_IRPERF		(13*32+ 1) /* "irperf" Instructions Retired Count */
  #define X86_FEATURE_XSAVEERPTR		(13*32+ 2) /* "xsaveerptr" Always save/restore FP error pointers */
-+#define X86_FEATURE_INVLPGB		(13*32+ 3) /* INVLPGB and TLBSYNC instruction supported. */
++#define X86_FEATURE_INVLPGB		(13*32+ 3) /* INVLPGB and TLBSYNC instructions supported */
  #define X86_FEATURE_RDPRU		(13*32+ 4) /* "rdpru" Read processor register at user level */
  #define X86_FEATURE_WBNOINVD		(13*32+ 9) /* "wbnoinvd" WBNOINVD instruction */
  #define X86_FEATURE_AMD_IBPB		(13*32+12) /* Indirect Branch Prediction Barrier */
@@ -63,7 +62,7 @@ index 3da645139748..855c13da2045 100644
  
  /*
 diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
-index 54194f5995de..3c75c174a274 100644
+index d747515ad013..0e2e9af18cef 100644
 --- a/arch/x86/kernel/cpu/amd.c
 +++ b/arch/x86/kernel/cpu/amd.c
 @@ -29,6 +29,8 @@
@@ -81,11 +80,11 @@ index 54194f5995de..3c75c174a274 100644
  	tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
 +
 +	/* Max number of pages INVLPGB can invalidate in one shot */
-+	if (boot_cpu_has(X86_FEATURE_INVLPGB))
++	if (cpu_has(c, X86_FEATURE_INVLPGB))
 +		invlpgb_count_max = (cpuid_edx(0x80000008) & 0xffff) + 1;
  }
  
  static const struct cpu_dev amd_cpu_dev = {
 -- 

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCH v14 03/13] x86/mm: add INVLPGB support code
  2025-02-26  3:00 ` [PATCH v14 03/13] x86/mm: add INVLPGB support code Rik van Riel
  2025-02-28 18:46   ` Borislav Petkov
  2025-02-28 18:51   ` Dave Hansen
@ 2025-02-28 19:47   ` Borislav Petkov
  2025-03-03 18:41     ` Dave Hansen
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Add " tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
  4 siblings, 1 reply; 86+ messages in thread
From: Borislav Petkov @ 2025-02-28 19:47 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Dave Hansen

On Tue, Feb 25, 2025 at 10:00:38PM -0500, Rik van Riel wrote:
> Add helper functions and definitions needed to use broadcast TLB
> invalidation on AMD EPYC 3 and newer CPUs.
> 
> All the functions defined in invlpgb.h are used later in the series.
> 
> Compile time disabling X86_FEATURE_INVLPGB when the config
> option is not set allows the compiler to omit unnecessary code.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Tested-by: Manali Shukla <Manali.Shukla@amd.com>
> Tested-by: Brendan Jackman <jackmanb@google.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Acked-by: Dave Hansen <dave.hansen@intel.com>
> ---
>  arch/x86/include/asm/disabled-features.h |  8 +-
>  arch/x86/include/asm/tlb.h               | 98 ++++++++++++++++++++++++
>  2 files changed, 105 insertions(+), 1 deletion(-)

My edits ontop.

x86/cpu has dropped {disabled,required}-features.h in favor of a new, better
mechanism to compile-time disable X86 features, see below.

--- /tmp/current.patch	2025-02-28 20:44:40.765404608 +0100
+++ /tmp/0001-x86-mm-Add-INVLPGB-support-code.patch	2025-02-28 20:44:18.492326903 +0100
@@ -1,55 +1,38 @@
+From ce22946ea806ae459b4d88767a59b010e70682d5 Mon Sep 17 00:00:00 2001
 From: Rik van Riel <riel@surriel.com>
-Date: Tue, 25 Feb 2025 22:00:38 -0500
-Subject: x86/mm: Add INVLPGB support code
+Date: Fri, 28 Feb 2025 20:32:30 +0100
+Subject: [PATCH]  x86/mm: Add INVLPGB support code
 
 Add helper functions and definitions needed to use broadcast TLB
-invalidation on AMD EPYC 3 and newer CPUs.
+invalidation on AMD CPUs.
 
-All the functions defined in invlpgb.h are used later in the series.
-
-Compile time disabling X86_FEATURE_INVLPGB when the config
-option is not set allows the compiler to omit unnecessary code.
+  [ bp:
+      - Cleanup commit message
+      - port it to new Kconfig.cpufeatures machinery
+      - add a comment about flushing any PCID and ASID ]
 
 Signed-off-by: Rik van Riel <riel@surriel.com>
 Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
-Acked-by: Dave Hansen <dave.hansen@intel.com>
-Tested-by: Manali Shukla <Manali.Shukla@amd.com>
-Tested-by: Brendan Jackman <jackmanb@google.com>
-Tested-by: Michael Kelley <mhklinux@outlook.com>
 Link: https://lore.kernel.org/r/20250226030129.530345-4-riel@surriel.com
 ---
- arch/x86/include/asm/disabled-features.h |  8 +-
- arch/x86/include/asm/tlb.h               | 98 ++++++++++++++++++++++++
- 2 files changed, 105 insertions(+), 1 deletion(-)
-
-diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
-index c492bdc97b05..625a89259968 100644
---- a/arch/x86/include/asm/disabled-features.h
-+++ b/arch/x86/include/asm/disabled-features.h
-@@ -129,6 +129,12 @@
- #define DISABLE_SEV_SNP		(1 << (X86_FEATURE_SEV_SNP & 31))
- #endif
- 
-+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
-+#define DISABLE_INVLPGB		0
-+#else
-+#define DISABLE_INVLPGB		(1 << (X86_FEATURE_INVLPGB & 31))
-+#endif
-+
- /*
-  * Make sure to add features to the correct mask
-  */
-@@ -146,7 +152,7 @@
- #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
- 			 DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
- #define DISABLED_MASK12	(DISABLE_FRED|DISABLE_LAM)
--#define DISABLED_MASK13	0
-+#define DISABLED_MASK13	(DISABLE_INVLPGB)
- #define DISABLED_MASK14	0
- #define DISABLED_MASK15	0
- #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
+ arch/x86/Kconfig.cpufeatures |   4 ++
+ arch/x86/include/asm/tlb.h   | 101 +++++++++++++++++++++++++++++++++++
+ 2 files changed, 105 insertions(+)
+
+diff --git a/arch/x86/Kconfig.cpufeatures b/arch/x86/Kconfig.cpufeatures
+index 5dcc49d928c5..f9af51205f07 100644
+--- a/arch/x86/Kconfig.cpufeatures
++++ b/arch/x86/Kconfig.cpufeatures
+@@ -195,3 +195,7 @@ config X86_DISABLED_FEATURE_FRED
+ config X86_DISABLED_FEATURE_SEV_SNP
+ 	def_bool y
+ 	depends on !KVM_AMD_SEV
++
++config X86_DISABLED_FEATURE_BROADCAST_TLB_FLUSH
++	def_bool y
++	depends on !X86_BROADCAST_TLB_FLUSH
 diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
-index 77f52bc1578a..91c9a4da3ace 100644
+index 77f52bc1578a..45d9c7687d61 100644
 --- a/arch/x86/include/asm/tlb.h
 +++ b/arch/x86/include/asm/tlb.h
 @@ -6,6 +6,9 @@
@@ -62,7 +45,7 @@ index 77f52bc1578a..91c9a4da3ace 100644
  
  static inline void tlb_flush(struct mmu_gather *tlb)
  {
-@@ -25,4 +28,99 @@ static inline void invlpg(unsigned long addr)
+@@ -25,4 +28,102 @@ static inline void invlpg(unsigned long addr)
  	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
  }
  
@@ -157,11 +140,14 @@ index 77f52bc1578a..91c9a4da3ace 100644
 +/* Flush all mappings for all PCIDs except globals. */
 +static inline void invlpgb_flush_all_nonglobals(void)
 +{
++	/*
++	 * @addr=0 means both rax[1] (valid PCID) and rax[2] (valid ASID) are clear
++	 * so flush *any* PCID and ASID.
++	 */
 +	__invlpgb(0, 0, 0, 1, 0, 0);
 +	__tlbsync();
 +}
-+
  #endif /* _ASM_X86_TLB_H */
 -- 

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCH v14 04/13] x86/mm: use INVLPGB for kernel TLB flushes
  2025-02-26  3:00 ` [PATCH v14 04/13] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
  2025-02-28 19:00   ` Dave Hansen
@ 2025-02-28 21:43   ` Borislav Petkov
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Use " tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
  3 siblings, 0 replies; 86+ messages in thread
From: Borislav Petkov @ 2025-02-28 21:43 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Tue, Feb 25, 2025 at 10:00:39PM -0500, Rik van Riel wrote:
> Use broadcast TLB invalidation for kernel addresses when available.
> 
> Remove the need to send IPIs for kernel TLB flushes.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Tested-by: Manali Shukla <Manali.Shukla@amd.com>
> Tested-by: Brendan Jackman <jackmanb@google.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> ---
>  arch/x86/mm/tlb.c | 32 ++++++++++++++++++++++++++++++--
>  1 file changed, 30 insertions(+), 2 deletions(-)

Changes ontop:

--- /tmp/current.patch	2025-02-28 22:39:33.236465716 +0100
+++ /tmp/0001-x86-mm-Use-INVLPGB-for-kernel-TLB-flushes.patch	2025-02-28 22:39:59.432310072 +0100
@@ -1,36 +1,43 @@
+From b97ae5e31069cd536b563df185de52d33a565077 Mon Sep 17 00:00:00 2001
 From: Rik van Riel <riel@surriel.com>
 Date: Tue, 25 Feb 2025 22:00:39 -0500
-Subject: x86/mm: Use INVLPGB for kernel TLB flushes
+Subject: [PATCH] x86/mm: Use INVLPGB for kernel TLB flushes
 
 Use broadcast TLB invalidation for kernel addresses when available.
-
 Remove the need to send IPIs for kernel TLB flushes.
 
+   [ bp: Integrate dhansen's comments additions. ]
+
 Signed-off-by: Rik van Riel <riel@surriel.com>
 Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
-Tested-by: Manali Shukla <Manali.Shukla@amd.com>
-Tested-by: Brendan Jackman <jackmanb@google.com>
-Tested-by: Michael Kelley <mhklinux@outlook.com>
+Acked-by: Dave Hansen <dave.hansen@intel.com>
 Link: https://lore.kernel.org/r/20250226030129.530345-5-riel@surriel.com
 ---
- arch/x86/mm/tlb.c | 32 ++++++++++++++++++++++++++++++--
- 1 file changed, 30 insertions(+), 2 deletions(-)
+ arch/x86/mm/tlb.c | 39 +++++++++++++++++++++++++++++++++++++--
+ 1 file changed, 37 insertions(+), 2 deletions(-)
 
 diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
-index dbcb5c968ff9..f44a03bca41c 100644
+index dbcb5c968ff9..5c44b94ad5af 100644
 --- a/arch/x86/mm/tlb.c
 +++ b/arch/x86/mm/tlb.c
-@@ -1077,6 +1077,18 @@ void flush_tlb_all(void)
+@@ -1077,6 +1077,25 @@ void flush_tlb_all(void)
  	on_each_cpu(do_flush_tlb_all, NULL, 1);
  }
  
++/* Flush an arbitrarily large range of memory with INVLPGB. */
 +static void invlpgb_kernel_range_flush(struct flush_tlb_info *info)
 +{
 +	unsigned long addr, nr;
 +
 +	for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
 +		nr = (info->end - addr) >> PAGE_SHIFT;
++
++		/*
++		 * INVLPGB has a limit on the size of ranges it can
++		 * flush. Break up large flushes.
++		 */
 +		nr = clamp_val(nr, 1, invlpgb_count_max);
++
 +		invlpgb_flush_addr_nosync(addr, nr);
 +	}
 +	__tlbsync();
 

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCH v14 05/13] x86/mm: use INVLPGB in flush_tlb_all
  2025-02-26  3:00 ` [PATCH v14 05/13] x86/mm: use INVLPGB in flush_tlb_all Rik van Riel
  2025-02-28 19:18   ` Dave Hansen
@ 2025-02-28 22:20   ` Borislav Petkov
  1 sibling, 0 replies; 86+ messages in thread
From: Borislav Petkov @ 2025-02-28 22:20 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Tue, Feb 25, 2025 at 10:00:40PM -0500, Rik van Riel wrote:
> The flush_tlb_all() function is not used a whole lot, but we might
> as well use broadcast TLB flushing there, too.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Tested-by: Manali Shukla <Manali.Shukla@amd.com>
> Tested-by: Brendan Jackman <jackmanb@google.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> ---
>  arch/x86/mm/tlb.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)

Edits ontop:

--- /tmp/current.patch	2025-02-28 23:18:51.670490799 +0100
+++ /tmp/0001-x86-mm-Use-INVLPGB-in-flush_tlb_all.patch	2025-02-28 23:17:48.590844991 +0100
@@ -1,22 +1,23 @@
+From 5bdf59c0589b71328bd340ea48a00917def62dc0 Mon Sep 17 00:00:00 2001
 From: Rik van Riel <riel@surriel.com>
 Date: Tue, 25 Feb 2025 22:00:40 -0500
-Subject: x86/mm: Use INVLPGB in flush_tlb_all
+Subject: [PATCH] x86/mm: Use INVLPGB in flush_tlb_all()
 
-The flush_tlb_all() function is not used a whole lot, but we might
-as well use broadcast TLB flushing there, too.
+The flush_tlb_all() function is not used a whole lot, but it might as
+well use broadcast TLB flushing there, too.
+
+  [ bp: Massage, restore balanced if-else branches in the function,
+    comment some. ]
 
 Signed-off-by: Rik van Riel <riel@surriel.com>
 Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
-Tested-by: Manali Shukla <Manali.Shukla@amd.com>
-Tested-by: Brendan Jackman <jackmanb@google.com>
-Tested-by: Michael Kelley <mhklinux@outlook.com>
 Link: https://lore.kernel.org/r/20250226030129.530345-6-riel@surriel.com
 ---
- arch/x86/mm/tlb.c | 10 +++++++++-
- 1 file changed, 9 insertions(+), 1 deletion(-)
+ arch/x86/mm/tlb.c | 17 +++++++++++++++--
+ 1 file changed, 15 insertions(+), 2 deletions(-)
 
 diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
-index f44a03bca41c..a6cd61d5f423 100644
+index 5c44b94ad5af..f49627e02311 100644
 --- a/arch/x86/mm/tlb.c
 +++ b/arch/x86/mm/tlb.c
 @@ -1064,7 +1064,6 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
@@ -27,22 +28,29 @@ index f44a03bca41c..a6cd61d5f423 100644
  static void do_flush_tlb_all(void *info)
  {
  	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
-@@ -1074,6 +1073,15 @@ static void do_flush_tlb_all(void *info)
+@@ -1074,7 +1073,21 @@ static void do_flush_tlb_all(void *info)
  void flush_tlb_all(void)
  {
  	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+-	on_each_cpu(do_flush_tlb_all, NULL, 1);
 +
 +	/* First try (faster) hardware-assisted TLB invalidation. */
 +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
++		/*
++		 * TLBSYNC at the end needs to make sure all flushes done
++		 * on the current CPU have been executed system-wide.
++		 * Therefore, make sure nothing gets migrated
++		 * in-between but disable preemption as it is cheaper.
++		 */
 +		guard(preempt)();
 +		invlpgb_flush_all();
-+		return;
++	} else {
++		/* Fall back to the IPI-based invalidation. */
++		on_each_cpu(do_flush_tlb_all, NULL, 1);
 +	}
-+
-+	/* Fall back to the IPI-based invalidation. */
- 	on_each_cpu(do_flush_tlb_all, NULL, 1);
  }
  
+ /* Flush an arbitrarily large range of memory with INVLPGB. */
 -- 

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCH v14 05/13] x86/mm: use INVLPGB in flush_tlb_all
  2025-02-28 19:18   ` Dave Hansen
@ 2025-03-01 12:20     ` Borislav Petkov
  2025-03-01 15:54       ` Rik van Riel
  0 siblings, 1 reply; 86+ messages in thread
From: Borislav Petkov @ 2025-03-01 12:20 UTC (permalink / raw)
  To: Dave Hansen, Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Fri, Feb 28, 2025 at 11:18:04AM -0800, Dave Hansen wrote:
> We haven't talked at all about the locking rules for
> invlpgb_flush_all(). It was used once in this series without any
> explicit preempt twiddling. I assume that was because it was used in a
> path where preempt is disabled.
> 
> If it does need a universal rule about preempt, can we please add an:
> 
> 	lockdep_assert_preemption_disabled()
> 
> along with a comment about why it needs preempt disabled?

So, after talking on IRC last night, below is what I think we should do ontop.

More specifically:

- I've pushed the preemption guard inside the functions which do
  INVLPGB+TLBSYNC so that callers do not have to think about it.

- invlpgb_kernel_range_flush() I still don't like: there we have to rely on
  the cant_migrate() in __tlbsync(). I'd like all of them to be nicely packed,
  but I don't have an idea yet how to do that cleanly...

- document what it means for bits rax[0:2] to be clear when issuing INVLPGB


That ok?

Anything I've missed?

If not, I'll integrate this into the patches.

Thx.

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 45d9c7687d61..0d90ceeb472b 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -39,6 +39,10 @@ static inline void invlpg(unsigned long addr)
  * the first page, while __invlpgb gets the more human readable number of
  * pages to invalidate.
  *
+ * The bits in rax[0:2] determine respectively which components of the address
+ * (VA, PCID, ASID) get compared when flushing. If neither bits are set, *any*
+ * address in the specified range matches.
+ *
  * TLBSYNC is used to ensure that pending INVLPGB invalidations initiated from
  * this CPU have completed.
  */
@@ -60,10 +64,10 @@ static inline void __invlpgb(unsigned long asid, unsigned long pcid,
 static inline void __tlbsync(void)
 {
 	/*
-	 * tlbsync waits for invlpgb instructions originating on the
-	 * same CPU to have completed. Print a warning if we could have
-	 * migrated, and might not be waiting on all the invlpgbs issued
-	 * during this TLB invalidation sequence.
+	 * TLBSYNC waits for INVLPGB instructions originating on the same CPU
+	 * to have completed. Print a warning if the task has been migrated,
+	 * and might not be waiting on all the INVLPGBs issued during this TLB
+	 * invalidation sequence.
 	 */
 	cant_migrate();
 
@@ -106,6 +110,13 @@ static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
 /* Flush all mappings, including globals, for all PCIDs. */
 static inline void invlpgb_flush_all(void)
 {
+	/*
+	 * TLBSYNC at the end needs to make sure all flushes done on the
+	 * current CPU have been executed system-wide. Therefore, make
+	 * sure nothing gets migrated in-between but disable preemption
+	 * as it is cheaper.
+	 */
+	guard(preempt)();
 	__invlpgb(0, 0, 0, 1, 0, INVLPGB_INCLUDE_GLOBAL);
 	__tlbsync();
 }
@@ -119,10 +130,7 @@ static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
 /* Flush all mappings for all PCIDs except globals. */
 static inline void invlpgb_flush_all_nonglobals(void)
 {
-	/*
-	 * @addr=0 means both rax[1] (valid PCID) and rax[2] (valid ASID) are clear
-	 * so flush *any* PCID and ASID.
-	 */
+	guard(preempt)();
 	__invlpgb(0, 0, 0, 1, 0, 0);
 	__tlbsync();
 }
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index f49627e02311..8cd084bc3d98 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1075,19 +1075,11 @@ void flush_tlb_all(void)
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
 
 	/* First try (faster) hardware-assisted TLB invalidation. */
-	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
-		/*
-		 * TLBSYNC at the end needs to make sure all flushes done
-		 * on the current CPU have been executed system-wide.
-		 * Therefore, make sure nothing gets migrated
-		 * in-between but disable preemption as it is cheaper.
-		 */
-		guard(preempt)();
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
 		invlpgb_flush_all();
-	} else {
+	else
 		/* Fall back to the IPI-based invalidation. */
 		on_each_cpu(do_flush_tlb_all, NULL, 1);
-	}
 }
 
 /* Flush an arbitrarily large range of memory with INVLPGB. */

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCH v14 05/13] x86/mm: use INVLPGB in flush_tlb_all
  2025-03-01 12:20     ` Borislav Petkov
@ 2025-03-01 15:54       ` Rik van Riel
  0 siblings, 0 replies; 86+ messages in thread
From: Rik van Riel @ 2025-03-01 15:54 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Sat, 2025-03-01 at 13:20 +0100, Borislav Petkov wrote:
> 
> So, after talking on IRC last night, below is what I think we should
> do ontop.

This all looks great! Thank you.

-- 
All Rights Reversed.

* Re: [PATCH v14 07/13] x86/mm: add global ASID allocation helper functions
  2025-02-26  3:00 ` [PATCH v14 07/13] x86/mm: add global ASID allocation helper functions Rik van Riel
@ 2025-03-02  7:06   ` Borislav Petkov
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Add " tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
  2 siblings, 0 replies; 86+ messages in thread
From: Borislav Petkov @ 2025-03-02  7:06 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Tue, Feb 25, 2025 at 10:00:42PM -0500, Rik van Riel wrote:
> Add functions to manage global ASID space. Multithreaded processes that
> are simultaneously active on 4 or more CPUs can get a global ASID,
> resulting in the same PCID being used for that process on every CPU.
> 
> This in turn will allow the kernel to use hardware-assisted TLB flushing
> through AMD INVLPGB or Intel RAR for these processes.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Tested-by: Manali Shukla <Manali.Shukla@amd.com>
> Tested-by: Brendan Jackman <jackmanb@google.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> ---
>  arch/x86/include/asm/mmu.h         |  11 +++
>  arch/x86/include/asm/mmu_context.h |   2 +
>  arch/x86/include/asm/tlbflush.h    |  43 +++++++++
>  arch/x86/mm/tlb.c                  | 146 ++++++++++++++++++++++++++++-
>  4 files changed, 199 insertions(+), 3 deletions(-)

Some small touchups ontop:

--- /tmp/current.patch	2025-03-02 07:33:13.913105249 +0100
+++ /tmp/0001-x86-mm-Add-global-ASID-allocation-helper-functions.patch	2025-03-02 08:05:23.613262232 +0100
 diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
-index 3b496cdcb74b..edb5942d4829 100644
+index 3b496cdcb74b..7fbefea5fdae 100644
 --- a/arch/x86/include/asm/mmu.h
 +++ b/arch/x86/include/asm/mmu.h
@@ -35,6 +38,7 @@ index 3b496cdcb74b..edb5942d4829 100644
 +	 * hardware-assisted remote TLB invalidation like AMD INVLPGB.
 +	 */
 +	u16 global_asid;
++
 +	/* The process is transitioning to a new global ASID number. */
 +	bool asid_transition;
 +#endif
@@ -251,7 +255,12 @@ index 1cc25e83bd34..9b1652c02452 100644
 +	if (mm_global_asid(mm))
 +		return;
 +
-+	/* The last global ASID was consumed while waiting for the lock. */
++	/*
++	 * The last global ASID was consumed while waiting for the lock.
++	 *
++	 * If this fires, a more aggressive ASID reuse scheme might be
++	 * needed.
++	 */
 +	if (!global_asid_available) {
 +		VM_WARN_ONCE(1, "Ran out of global ASIDs\n");
 +		return;
@@ -284,5 +293,5 @@ index 1cc25e83bd34..9b1652c02452 100644
   * Given an ASID, flush the corresponding user ASID.  We can delay this
   * until the next time we switch to it.
 -- 

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCH v14 08/13] x86/mm: global ASID context switch & TLB flush handling
  2025-02-26  3:00 ` [PATCH v14 08/13] x86/mm: global ASID context switch & TLB flush handling Rik van Riel
@ 2025-03-02  7:58   ` Borislav Petkov
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Handle global ASID context switch and TLB flush tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
  2 siblings, 0 replies; 86+ messages in thread
From: Borislav Petkov @ 2025-03-02  7:58 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Tue, Feb 25, 2025 at 10:00:43PM -0500, Rik van Riel wrote:
> Context switch and TLB flush support for processes that use a global
> ASID & PCID across all CPUs.
> 
> At both context switch time and TLB flush time, we need to check
> whether a task is switching to a global ASID, and reload the TLB
> with the new ASID as appropriate.
> 
> In both code paths, we also short-circuit the TLB flush if we
> are using a global ASID, because the global ASIDs are always
> kept up to date across CPUs, even while the process is not
> running on a CPU.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  arch/x86/include/asm/tlbflush.h | 18 ++++++++
>  arch/x86/mm/tlb.c               | 77 ++++++++++++++++++++++++++++++---
>  2 files changed, 88 insertions(+), 7 deletions(-)

Some touchups:

--- /tmp/current.patch	2025-03-02 08:54:44.821408308 +0100
+++ /tmp/0001-x86-mm-Handle-global-ASID-context-switch-and-TLB-flu.patch	2025-03-02 08:55:27.029190935 +0100
@@ -1,18 +1,23 @@
+From a92847ac925d2849708d036d8bb4920d9b6f2a59 Mon Sep 17 00:00:00 2001
 From: Rik van Riel <riel@surriel.com>
 Date: Tue, 25 Feb 2025 22:00:43 -0500
-Subject: x86/mm: Global ASID context switch & TLB flush handling
+Subject: [PATCH] x86/mm: Handle global ASID context switch and TLB flush
 
-Context switch and TLB flush support for processes that use a global
-ASID & PCID across all CPUs.
+Do context switch and TLB flush support for processes that use a global
+ASID and PCID across all CPUs.
 
-At both context switch time and TLB flush time, we need to check
-whether a task is switching to a global ASID, and reload the TLB
-with the new ASID as appropriate.
-
-In both code paths, we also short-circuit the TLB flush if we
-are using a global ASID, because the global ASIDs are always
-kept up to date across CPUs, even while the process is not
-running on a CPU.
+At both context switch time and TLB flush time, it needs to be checked whether
+a task is switching to a global ASID, and, if so, reload the TLB with the new
+ASID as appropriate.
+
+In both code paths, the TLB flush is avoided if a global ASID is used, because
+the global ASIDs are always kept up to date across CPUs, even when the
+process is not running on a CPU.
+
+  [ bp:
+   - Massage
+   - :%s/\<static_cpu_has\>/cpu_feature_enabled/cgi
+  ]
 
 Signed-off-by: Rik van Riel <riel@surriel.com>
 Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
@@ -66,7 +71,7 @@ index 8e7df0ed7005..37b735dcf025 100644
  
  #ifdef CONFIG_PARAVIRT
 diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
-index 9b1652c02452..b7d461db1b08 100644
+index d79ebdf095e1..cb43ab08ea4a 100644
 --- a/arch/x86/mm/tlb.c
 +++ b/arch/x86/mm/tlb.c
 @@ -227,6 +227,20 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
@@ -77,7 +82,7 @@ index 9b1652c02452..b7d461db1b08 100644
 +	 * TLB consistency for global ASIDs is maintained with hardware assisted
 +	 * remote TLB flushing. Global ASIDs are always up to date.
 +	 */
-+	if (static_cpu_has(X86_FEATURE_INVLPGB)) {
++	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
 +		u16 global_asid = mm_global_asid(next);
 +
 +		if (global_asid) {
@@ -90,22 +95,22 @@ index 9b1652c02452..b7d461db1b08 100644
  	if (this_cpu_read(cpu_tlbstate.invalidate_other))
  		clear_asid_other();
  
-@@ -391,6 +405,23 @@ void mm_free_global_asid(struct mm_struct *mm)
+@@ -396,6 +410,23 @@ void mm_free_global_asid(struct mm_struct *mm)
  #endif
  }
  
 +/*
 + * Is the mm transitioning from a CPU-local ASID to a global ASID?
 + */
-+static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
++static bool mm_needs_global_asid(struct mm_struct *mm, u16 asid)
 +{
-+	u16 global_asid = mm_global_asid(next);
++	u16 global_asid = mm_global_asid(mm);
 +
-+	if (!static_cpu_has(X86_FEATURE_INVLPGB))
++	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
 +		return false;
 +
 +	/* Process is transitioning to a global ASID */
-+	if (global_asid && prev_asid != global_asid)
++	if (global_asid && asid != global_asid)
 +		return true;
 +
 +	return false;
@@ -124,19 +129,19 @@ index 9b1652c02452..b7d461db1b08 100644
  			   next->context.ctx_id);
  
  		/*
-@@ -713,6 +745,20 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
+@@ -718,6 +750,20 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
  				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
  			cpumask_set_cpu(cpu, mm_cpumask(next));
  
 +		/* Check if the current mm is transitioning to a global ASID */
-+		if (needs_global_asid_reload(next, prev_asid)) {
++		if (mm_needs_global_asid(next, prev_asid)) {
 +			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 +			choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
 +			goto reload_tlb;
 +		}
 +
 +		/*
-+		 * Broadcast TLB invalidation keeps this PCID up to date
++		 * Broadcast TLB invalidation keeps this ASID up to date
 +		 * all the time.
 +		 */
 +		if (is_global_asid(prev_asid))
@@ -145,13 +150,13 @@ index 9b1652c02452..b7d461db1b08 100644
  		/*
  		 * If the CPU is not in lazy TLB mode, we are just switching
  		 * from one thread in a process to another thread in the same
-@@ -746,6 +792,13 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
+@@ -751,6 +797,13 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
  		 */
  		cond_mitigation(tsk);
  
 +		/*
 +		 * Let nmi_uaccess_okay() and finish_asid_transition()
-+		 * know that we're changing CR3.
++		 * know that CR3 is changing.
 +		 */
 +		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
 +		barrier();
@@ -185,12 +190,12 @@ index 9b1652c02452..b7d461db1b08 100644
  	bool local = smp_processor_id() == f->initiating_cpu;
  	unsigned long nr_invalidate = 0;
  	u64 mm_tlb_gen;
-@@ -909,6 +960,16 @@ static void flush_tlb_func(void *info)
+@@ -914,6 +965,16 @@ static void flush_tlb_func(void *info)
  	if (unlikely(loaded_mm == &init_mm))
  		return;
  
 +	/* Reload the ASID if transitioning into or out of a global ASID */
-+	if (needs_global_asid_reload(loaded_mm, loaded_mm_asid)) {
++	if (mm_needs_global_asid(loaded_mm, loaded_mm_asid)) {
 +		switch_mm_irqs_off(NULL, loaded_mm, NULL);
 +		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
 +	}

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 09/13] x86/mm: global ASID process exit helpers
  2025-02-26  3:00 ` [PATCH v14 09/13] x86/mm: global ASID process exit helpers Rik van Riel
@ 2025-03-02 12:38   ` Borislav Petkov
  2025-03-02 13:53     ` Rik van Riel
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Add " tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
  2 siblings, 1 reply; 86+ messages in thread
From: Borislav Petkov @ 2025-03-02 12:38 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Tue, Feb 25, 2025 at 10:00:44PM -0500, Rik van Riel wrote:
> A global ASID is allocated for the lifetime of a process.
> 
> Free the global ASID at process exit time.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  arch/x86/include/asm/mmu_context.h | 12 ++++++++++++
>  1 file changed, 12 insertions(+)

So I don't like the ifdeffery and tried removing it, see below.

So I added helpers.

Then I entered the include hell.

And then I caught a bug with the DISABLED_FEATURE stuff.

Anyway, see below. It builds allnoconfig, defconfig and the config I use on
this machine, so more testing is definitely needed. But the end result looks
a lot better now.

I'll integrate the other changes into the respective patches, and I should
probably push my branch somewhere: we have accumulated a lot of changes by
now, so tracking them over mail is getting difficult.

Will do so on Monday.

diff --git a/arch/x86/Kconfig.cpufeatures b/arch/x86/Kconfig.cpufeatures
index f9af51205f07..3207e5546438 100644
--- a/arch/x86/Kconfig.cpufeatures
+++ b/arch/x86/Kconfig.cpufeatures
@@ -196,6 +196,6 @@ config X86_DISABLED_FEATURE_SEV_SNP
 	def_bool y
 	depends on !KVM_AMD_SEV
 
-config X86_DISABLED_FEATURE_BROADCAST_TLB_FLUSH
+config X86_DISABLED_FEATURE_INVLPGB
 	def_bool y
 	depends on !X86_BROADCAST_TLB_FLUSH
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index a2c70e495b1b..2398058b6e83 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -2,7 +2,6 @@
 #ifndef _ASM_X86_MMU_CONTEXT_H
 #define _ASM_X86_MMU_CONTEXT_H
 
-#include <asm/desc.h>
 #include <linux/atomic.h>
 #include <linux/mm_types.h>
 #include <linux/pkeys.h>
@@ -13,6 +12,7 @@
 #include <asm/paravirt.h>
 #include <asm/debugreg.h>
 #include <asm/gsseg.h>
+#include <asm/desc.h>
 
 extern atomic64_t last_mm_ctx_id;
 
@@ -139,6 +139,9 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
 #define enter_lazy_tlb enter_lazy_tlb
 extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
+#define mm_init_global_asid mm_init_global_asid
+extern void mm_init_global_asid(struct mm_struct *mm);
+
 extern void mm_free_global_asid(struct mm_struct *mm);
 
 /*
@@ -163,6 +166,8 @@ static inline int init_new_context(struct task_struct *tsk,
 		mm->context.execute_only_pkey = -1;
 	}
 #endif
+
+	mm_init_global_asid(mm);
 	mm_reset_untag_mask(mm);
 	init_new_context_ldt(mm);
 	return 0;
@@ -172,6 +177,7 @@ static inline int init_new_context(struct task_struct *tsk,
 static inline void destroy_context(struct mm_struct *mm)
 {
 	destroy_context_ldt(mm);
+	mm_free_global_asid(mm);
 }
 
 extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 37b735dcf025..01d8a152f04a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -261,6 +261,14 @@ static inline u16 mm_global_asid(struct mm_struct *mm)
 	return asid;
 }
 
+static inline void mm_init_global_asid(struct mm_struct *mm)
+{
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+		mm->context.global_asid = 0;
+		mm->context.asid_transition = false;
+	}
+}
+
 static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
 {
 	/*
@@ -272,7 +280,7 @@ static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
 	smp_store_release(&mm->context.global_asid, asid);
 }
 
-static inline bool in_asid_transition(struct mm_struct *mm)
+static inline bool mm_in_asid_transition(struct mm_struct *mm)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
 		return false;
@@ -280,19 +288,10 @@ static inline bool in_asid_transition(struct mm_struct *mm)
 	return mm && READ_ONCE(mm->context.asid_transition);
 }
 #else
-static inline u16 mm_global_asid(struct mm_struct *mm)
-{
-	return 0;
-}
-
-static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
-{
-}
-
-static inline bool in_asid_transition(struct mm_struct *mm)
-{
-	return false;
-}
+static inline u16 mm_global_asid(struct mm_struct *mm) { return 0; }
+static inline void mm_init_global_asid(struct mm_struct *mm) { }
+static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid) { }
+static inline bool mm_in_asid_transition(struct mm_struct *mm) { return false; }
 #endif
 
 #ifdef CONFIG_PARAVIRT
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index cb43ab08ea4a..c2167b331bbe 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -396,6 +396,9 @@ static void use_global_asid(struct mm_struct *mm)
 
 void mm_free_global_asid(struct mm_struct *mm)
 {
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return;
+
 	if (!mm_global_asid(mm))
 		return;
 
@@ -1161,7 +1164,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
 	 * up on the new contents of what used to be page tables, while
 	 * doing a speculative memory access.
 	 */
-	if (info->freed_tables || in_asid_transition(info->mm))
+	if (info->freed_tables || mm_in_asid_transition(info->mm))
 		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
 	else
 		on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
-- 
2.43.0

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 09/13] x86/mm: global ASID process exit helpers
  2025-03-02 12:38   ` Borislav Petkov
@ 2025-03-02 13:53     ` Rik van Riel
  2025-03-03 10:16       ` Borislav Petkov
  0 siblings, 1 reply; 86+ messages in thread
From: Rik van Riel @ 2025-03-02 13:53 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Sun, 2025-03-02 at 13:38 +0100, Borislav Petkov wrote:
> On Tue, Feb 25, 2025 at 10:00:44PM -0500, Rik van Riel wrote:
> > A global ASID is allocated for the lifetime of a process.
> > 
> > Free the global ASID at process exit time.
> > 
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > ---
> >  arch/x86/include/asm/mmu_context.h | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> 
> So I don't like the ifdeffery and tried removing it, see below.
> 
> So I added helpers.
> 
> Then I entered the include hell.
> 
> And then I caught a bug with the DISABLED_FEATURE stuff.

I've been there. Repeatedly :)

Thank you for these changes; it does look better
than before.


-- 
All Rights Reversed.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 09/13] x86/mm: global ASID process exit helpers
  2025-03-02 13:53     ` Rik van Riel
@ 2025-03-03 10:16       ` Borislav Petkov
  0 siblings, 0 replies; 86+ messages in thread
From: Borislav Petkov @ 2025-03-03 10:16 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Sun, Mar 02, 2025 at 08:53:10AM -0500, Rik van Riel wrote:
> I've been there. Repeatedly :)

Yap, it is. And despite all the compile-time disabling fun, clang still can't
do proper DCE and complains about the inline asm in __invlpgb() using a u64 on
32-bit builds.

So I did the fix below on top, and with that all randconfig builds pass fine.

What I have so far is here:

https://web.git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git/log/?h=tip-x86-cpu-tlbi

Lemme go through the rest of your patches now.

Thx.

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 5375145eb959..04f2c6f4cee3 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -29,6 +29,7 @@ static inline void invlpg(unsigned long addr)
 }
 
 
+#ifdef CONFIG_BROADCAST_TLB_FLUSH
 /*
  * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
  *
@@ -74,6 +75,14 @@ static inline void __tlbsync(void)
 	/* TLBSYNC: supported in binutils >= 0.36. */
 	asm volatile(".byte 0x0f, 0x01, 0xff" ::: "memory");
 }
+#else
+/* Some compilers simply can't do DCE */
+static inline void __invlpgb(unsigned long asid, unsigned long pcid,
+			     unsigned long addr, u16 nr_pages,
+			     bool pmd_stride, u8 flags) { }
+
+static inline void __tlbsync(void) { }
+#endif
 
 /*
  * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination


-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 10/13] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
  2025-02-26  3:00 ` [PATCH v14 10/13] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
@ 2025-03-03 10:57   ` Borislav Petkov
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Enable " tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
  2 siblings, 0 replies; 86+ messages in thread
From: Borislav Petkov @ 2025-03-03 10:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Tue, Feb 25, 2025 at 10:00:45PM -0500, Rik van Riel wrote:
> +/*
> + * x86 has 4k ASIDs (2k when compiled with KPTI), but the largest
> + * x86 systems have over 8k CPUs. Because of this potential ASID
> + * shortage, global ASIDs are handed out to processes that have
> + * frequent TLB flushes and are active on 4 or more CPUs simultaneously.
> + */
> +static void consider_global_asid(struct mm_struct *mm)
> +{
> +	if (!static_cpu_has(X86_FEATURE_INVLPGB))
> +		return;
> +
> +	/* Check every once in a while. */
> +	if ((current->pid & 0x1f) != (jiffies & 0x1f))
> +		return;

Uff, this looks funky.
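
For context, this looks like a cheap rate limiter: the expensive
mm_active_cpus_exceeds() test only runs when the low five bits of the PID line
up with the low five bits of jiffies, i.e. during roughly one in every 32
timer ticks for a given process. A stand-alone sketch of that assumed
behaviour (illustrative only, not kernel code; the helper name is made up):

#include <stdbool.h>

/*
 * Do the expensive check only when the low bits of the caller's id match the
 * low bits of a coarse clock: roughly 1 in every 32 opportunities, with the
 * sampling points staggered per caller.
 */
static bool time_to_check(int pid, unsigned long ticks)
{
	return (pid & 0x1f) == (ticks & 0x1f);
}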

> +
> +	if (!READ_ONCE(global_asid_available))
> +		return;

use_global_asid() will do that check for us already and it'll even warn.

> +
> +	/*
> +	 * Assign a global ASID if the process is active on
> +	 * 4 or more CPUs simultaneously.
> +	 */
> +	if (mm_active_cpus_exceeds(mm, 3))
> +		use_global_asid(mm);
> +}
> +

...

> +static void broadcast_tlb_flush(struct flush_tlb_info *info)
> +{
> +	bool pmd = info->stride_shift == PMD_SHIFT;
> +	unsigned long asid = mm_global_asid(info->mm);
> +	unsigned long addr = info->start;
> +
> +	/*
> +	 * TLB flushes with INVLPGB are kicked off asynchronously.
> +	 * The inc_mm_tlb_gen() guarantees page table updates are done
> +	 * before these TLB flushes happen.
> +	 */
> +	if (info->end == TLB_FLUSH_ALL) {
> +		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
> +		/* Do any CPUs supporting INVLPGB need PTI? */

I hope not. :)

However, I think one can force-enable PTI on AMD so yeah, let's keep that.

...

Final result:

From: Rik van Riel <riel@surriel.com>
Date: Tue, 25 Feb 2025 22:00:45 -0500
Subject: [PATCH] x86/mm: Enable broadcast TLB invalidation for multi-threaded
 processes

There is not enough room in the 12-bit ASID address space to hand out
broadcast ASIDs to every process. Only hand out broadcast ASIDs to processes
when they are observed to be simultaneously running on 4 or more CPUs.

This also allows single-threaded processes to continue using the cheaper, local
TLB invalidation instructions like INVLPG.

Due to the structure of flush_tlb_mm_range(), the INVLPGB flushing is done in
a generically named broadcast_tlb_flush() function which can later also be
used for Intel RAR.

Combined with the removal of unnecessary lru_add_drain() calls (see
https://lore.kernel.org/r/20241219153253.3da9e8aa@fangorn) this results in
a nice performance boost for the will-it-scale tlb_flush2_threads test on an
AMD Milan system with 36 cores:

  - vanilla kernel:           527k loops/second
  - lru_add_drain removal:    731k loops/second
  - only INVLPGB:             527k loops/second
  - lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that while TLB invalidation went
down from 40% of the total CPU time to only around 4% of CPU time, the
contention simply moved to the LRU lock.

Fixing both at the same time about doubles the number of iterations per second
for this case.

Comparing will-it-scale tlb_flush2_threads with several different numbers of
threads on a 72 CPU AMD Milan shows similar results. The number represents the
total number of loops per second across all the threads:

  threads	tip		INVLPGB

  1		315k		304k
  2		423k		424k
  4		644k		1032k
  8		652k		1267k
  16		737k		1368k
  32		759k		1199k
  64		636k		1094k
  72		609k		993k

1- and 2-thread performance is similar with and without INVLPGB, because
INVLPGB is only used on processes using 4 or more CPUs simultaneously.

Each number is the median across 5 runs.

Some numbers closer to real world performance can be found at Phoronix, thanks
to Michael:

https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits

  [ bp:
   - Massage
   - :%s/\<static_cpu_has\>/cpu_feature_enabled/cgi
   - :%s/\<clear_asid_transition\>/mm_clear_asid_transition/cgi
   ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Link: https://lore.kernel.org/r/20250226030129.530345-11-riel@surriel.com
---
 arch/x86/include/asm/tlbflush.h |   5 ++
 arch/x86/mm/tlb.c               | 104 +++++++++++++++++++++++++++++++-
 2 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index e6c3be06dd21..8c21030269ff 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -280,6 +280,11 @@ static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
 	smp_store_release(&mm->context.global_asid, asid);
 }
 
+static inline void mm_clear_asid_transition(struct mm_struct *mm)
+{
+	WRITE_ONCE(mm->context.asid_transition, false);
+}
+
 static inline bool mm_in_asid_transition(struct mm_struct *mm)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b5681e6f2333..0efd99053c09 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -430,6 +430,105 @@ static bool mm_needs_global_asid(struct mm_struct *mm, u16 asid)
 	return false;
 }
 
+/*
+ * x86 has 4k ASIDs (2k when compiled with KPTI), but the largest x86
+ * systems have over 8k CPUs. Because of this potential ASID shortage,
+ * global ASIDs are handed out to processes that have frequent TLB
+ * flushes and are active on 4 or more CPUs simultaneously.
+ */
+static void consider_global_asid(struct mm_struct *mm)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return;
+
+	/* Check every once in a while. */
+	if ((current->pid & 0x1f) != (jiffies & 0x1f))
+		return;
+
+	/*
+	 * Assign a global ASID if the process is active on
+	 * 4 or more CPUs simultaneously.
+	 */
+	if (mm_active_cpus_exceeds(mm, 3))
+		use_global_asid(mm);
+}
+
+static void finish_asid_transition(struct flush_tlb_info *info)
+{
+	struct mm_struct *mm = info->mm;
+	int bc_asid = mm_global_asid(mm);
+	int cpu;
+
+	if (!mm_in_asid_transition(mm))
+		return;
+
+	for_each_cpu(cpu, mm_cpumask(mm)) {
+		/*
+		 * The remote CPU is context switching. Wait for that to
+		 * finish, to catch the unlikely case of it switching to
+		 * the target mm with an out of date ASID.
+		 */
+		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
+			cpu_relax();
+
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
+			continue;
+
+		/*
+		 * If at least one CPU is not using the global ASID yet,
+		 * send a TLB flush IPI. The IPI should cause stragglers
+		 * to transition soon.
+		 *
+		 * This can race with the CPU switching to another task;
+		 * that results in a (harmless) extra IPI.
+		 */
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
+			flush_tlb_multi(mm_cpumask(info->mm), info);
+			return;
+		}
+	}
+
+	/* All the CPUs running this process are using the global ASID. */
+	mm_clear_asid_transition(mm);
+}
+
+static void broadcast_tlb_flush(struct flush_tlb_info *info)
+{
+	bool pmd = info->stride_shift == PMD_SHIFT;
+	unsigned long asid = mm_global_asid(info->mm);
+	unsigned long addr = info->start;
+
+	/*
+	 * TLB flushes with INVLPGB are kicked off asynchronously.
+	 * The inc_mm_tlb_gen() guarantees page table updates are done
+	 * before these TLB flushes happen.
+	 */
+	if (info->end == TLB_FLUSH_ALL) {
+		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
+		/* Do any CPUs supporting INVLPGB need PTI? */
+		if (cpu_feature_enabled(X86_FEATURE_PTI))
+			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
+	} else do {
+		unsigned long nr = 1;
+
+		if (info->stride_shift <= PMD_SHIFT) {
+			nr = (info->end - addr) >> info->stride_shift;
+			nr = clamp_val(nr, 1, invlpgb_count_max);
+		}
+
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
+		if (cpu_feature_enabled(X86_FEATURE_PTI))
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
+
+		addr += nr << info->stride_shift;
+	} while (addr < info->end);
+
+	finish_asid_transition(info);
+
+	/* Wait for the INVLPGBs kicked off above to finish. */
+	__tlbsync();
+}
+
 /*
  * Given an ASID, flush the corresponding user ASID.  We can delay this
  * until the next time we switch to it.
@@ -1260,9 +1359,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+	if (mm_global_asid(mm)) {
+		broadcast_tlb_flush(info);
+	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
 		info->trim_cpumask = should_trim_cpumask(mm);
 		flush_tlb_multi(mm_cpumask(mm), info);
+		consider_global_asid(mm);
 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
-- 
2.43.0

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 11/13] x86/mm: do targeted broadcast flushing from tlbbatch code
  2025-02-26  3:00 ` [PATCH v14 11/13] x86/mm: do targeted broadcast flushing from tlbbatch code Rik van Riel
@ 2025-03-03 11:46   ` Borislav Petkov
  2025-03-03 21:47     ` Dave Hansen
  0 siblings, 1 reply; 86+ messages in thread
From: Borislav Petkov @ 2025-03-03 11:46 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Tue, Feb 25, 2025 at 10:00:46PM -0500, Rik van Riel wrote:
> +static inline bool cpu_need_tlbsync(void)
> +{
> +	return this_cpu_read(cpu_tlbstate.need_tlbsync);
> +}
> +
> +static inline void cpu_write_tlbsync(bool state)

That thing feels better as "cpu_set_tlbsync" in the code...

> +{
> +	this_cpu_write(cpu_tlbstate.need_tlbsync, state);
> +}
>  #else
>  static inline u16 mm_global_asid(struct mm_struct *mm)
>  {

...

> +static inline void tlbsync(void)
> +{
> +	if (!cpu_need_tlbsync())
> +		return;
> +	__tlbsync();
> +	cpu_write_tlbsync(false);
> +}

Easier to parse visually:

static inline void tlbsync(void)
{
        if (cpu_need_tlbsync()) {
                __tlbsync();
                cpu_write_tlbsync(false);
        }
}

Final:

From: Rik van Riel <riel@surriel.com>
Date: Tue, 25 Feb 2025 22:00:46 -0500
Subject: [PATCH] x86/mm: Do targeted broadcast flushing from tlbbatch code

Instead of doing a system-wide TLB flush from arch_tlbbatch_flush(), queue up
asynchronous, targeted flushes from arch_tlbbatch_add_pending().

This also makes it possible to avoid adding the CPUs of processes using
broadcast flushing to the batch->cpumask, and will hopefully further reduce
TLB flushing from the reclaim and compaction paths.

  [ bp:
   - Massage
   - :%s/\<static_cpu_has\>/cpu_feature_enabled/cgi ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250226030129.530345-12-riel@surriel.com
---
 arch/x86/include/asm/tlb.h      | 12 ++---
 arch/x86/include/asm/tlbflush.h | 27 +++++++----
 arch/x86/mm/tlb.c               | 79 +++++++++++++++++++++++++++++++--
 3 files changed, 100 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 04f2c6f4cee3..b5c2005725cf 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -102,16 +102,16 @@ static inline void __tlbsync(void) { }
 #define INVLPGB_FINAL_ONLY		BIT(4)
 #define INVLPGB_INCLUDE_NESTED		BIT(5)
 
-static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
-						unsigned long addr,
-						u16 nr,
-						bool pmd_stride)
+static inline void __invlpgb_flush_user_nr_nosync(unsigned long pcid,
+						  unsigned long addr,
+						  u16 nr,
+						  bool pmd_stride)
 {
 	__invlpgb(0, pcid, addr, nr, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
 }
 
 /* Flush all mappings for a given PCID, not including globals. */
-static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
+static inline void __invlpgb_flush_single_pcid_nosync(unsigned long pcid)
 {
 	__invlpgb(0, pcid, 0, 1, 0, INVLPGB_PCID);
 }
@@ -131,7 +131,7 @@ static inline void invlpgb_flush_all(void)
 }
 
 /* Flush addr, including globals, for all PCIDs. */
-static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
+static inline void __invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
 {
 	__invlpgb(0, 0, addr, nr, 0, INVLPGB_INCLUDE_GLOBAL);
 }
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8c21030269ff..cbdb86d58301 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -105,6 +105,9 @@ struct tlb_state {
 	 * need to be invalidated.
 	 */
 	bool invalidate_other;
+#ifdef CONFIG_BROADCAST_TLB_FLUSH
+	bool need_tlbsync;
+#endif
 
 #ifdef CONFIG_ADDRESS_MASKING
 	/*
@@ -292,11 +295,23 @@ static inline bool mm_in_asid_transition(struct mm_struct *mm)
 
 	return mm && READ_ONCE(mm->context.asid_transition);
 }
+
+static inline bool cpu_need_tlbsync(void)
+{
+	return this_cpu_read(cpu_tlbstate.need_tlbsync);
+}
+
+static inline void cpu_set_tlbsync(bool state)
+{
+	this_cpu_write(cpu_tlbstate.need_tlbsync, state);
+}
 #else
 static inline u16 mm_global_asid(struct mm_struct *mm) { return 0; }
 static inline void mm_init_global_asid(struct mm_struct *mm) { }
 static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid) { }
 static inline bool mm_in_asid_transition(struct mm_struct *mm) { return false; }
+static inline bool cpu_need_tlbsync(void) { return false; }
+static inline void cpu_set_tlbsync(bool state) { }
 #endif /* CONFIG_BROADCAST_TLB_FLUSH */
 
 #ifdef CONFIG_PARAVIRT
@@ -346,21 +361,15 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 	return atomic64_inc_return(&mm->context.tlb_gen);
 }
 
-static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
-					     struct mm_struct *mm,
-					     unsigned long uaddr)
-{
-	inc_mm_tlb_gen(mm);
-	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
-	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
-}
-
 static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
 {
 	flush_tlb_mm(mm);
 }
 
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+					     struct mm_struct *mm,
+					     unsigned long uaddr);
 
 static inline bool pte_flags_need_flush(unsigned long oldflags,
 					unsigned long newflags,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 0efd99053c09..83ba6876adbf 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -492,6 +492,37 @@ static void finish_asid_transition(struct flush_tlb_info *info)
 	mm_clear_asid_transition(mm);
 }
 
+static inline void tlbsync(void)
+{
+	if (cpu_need_tlbsync()) {
+		__tlbsync();
+		cpu_set_tlbsync(false);
+	}
+}
+
+static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
+						unsigned long addr,
+						u16 nr, bool pmd_stride)
+{
+	__invlpgb_flush_user_nr_nosync(pcid, addr, nr, pmd_stride);
+	if (!cpu_need_tlbsync())
+		cpu_set_tlbsync(true);
+}
+
+static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
+{
+	__invlpgb_flush_single_pcid_nosync(pcid);
+	if (!cpu_need_tlbsync())
+		cpu_set_tlbsync(true);
+}
+
+static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
+{
+	__invlpgb_flush_addr_nosync(addr, nr);
+	if (!cpu_need_tlbsync())
+		cpu_set_tlbsync(true);
+}
+
 static void broadcast_tlb_flush(struct flush_tlb_info *info)
 {
 	bool pmd = info->stride_shift == PMD_SHIFT;
@@ -790,6 +821,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
 		WARN_ON_ONCE(!irqs_disabled());
 
+	tlbsync();
+
 	/*
 	 * Verify that CR3 is what we think it is.  This will catch
 	 * hypothetical buggy code that directly switches to swapper_pg_dir
@@ -966,6 +999,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
  */
 void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
+	tlbsync();
+
 	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
 		return;
 
@@ -1633,9 +1668,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
-		invlpgb_flush_all_nonglobals();
-	} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
 		flush_tlb_multi(&batch->cpumask, info);
 	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();
@@ -1644,12 +1677,52 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 		local_irq_enable();
 	}
 
+	/*
+	 * If (asynchronous) INVLPGB flushes were issued, wait for them here.
+	 * The cpumask above contains only CPUs that were running tasks
+	 * not using broadcast TLB flushing.
+	 */
+	tlbsync();
+
 	cpumask_clear(&batch->cpumask);
 
 	put_flush_tlb_info();
 	put_cpu();
 }
 
+void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+					     struct mm_struct *mm,
+					     unsigned long uaddr)
+{
+	u16 asid = mm_global_asid(mm);
+
+	if (asid) {
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
+		/* Do any CPUs supporting INVLPGB need PTI? */
+		if (cpu_feature_enabled(X86_FEATURE_PTI))
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false);
+
+		/*
+		 * Some CPUs might still be using a local ASID for this
+		 * process, and require IPIs, while others are using the
+		 * global ASID.
+		 *
+		 * In this corner case, both broadcast TLB invalidation
+		 * and IPIs need to be sent. The IPIs will help
+		 * stragglers transition to the broadcast ASID.
+		 */
+		if (mm_in_asid_transition(mm))
+			asid = 0;
+	}
+
+	if (!asid) {
+		inc_mm_tlb_gen(mm);
+		cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
+	}
+
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
+}
+
 /*
  * Blindly accessing user memory from NMI context can be dangerous
  * if we're in the middle of switching the current user task or
-- 
2.43.0

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 00/13] AMD broadcast TLB invalidation
  2025-02-26  3:00 [PATCH v14 00/13] AMD broadcast TLB invalidation Rik van Riel
                   ` (12 preceding siblings ...)
  2025-02-26  3:00 ` [PATCH v14 13/13] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
@ 2025-03-03 12:42 ` Borislav Petkov
  2025-03-03 13:29   ` Borislav Petkov
  2025-03-04 12:04 ` [PATCH] x86/mm: Always set the ASID valid bit for the INVLPGB instruction Borislav Petkov
  14 siblings, 1 reply; 86+ messages in thread
From: Borislav Petkov @ 2025-03-03 12:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Tue, Feb 25, 2025 at 10:00:35PM -0500, Rik van Riel wrote:
> Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.

Ok, I've got the whole thing here now:

https://web.git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git/log/?h=tip-x86-cpu-tlbi

Lemme take it for a spin on my machines, see whether the cat catches fire...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 00/13] AMD broadcast TLB invalidation
  2025-03-03 12:42 ` [PATCH v14 00/13] AMD broadcast TLB invalidation Borislav Petkov
@ 2025-03-03 13:29   ` Borislav Petkov
  0 siblings, 0 replies; 86+ messages in thread
From: Borislav Petkov @ 2025-03-03 13:29 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Mon, Mar 03, 2025 at 01:42:14PM +0100, Borislav Petkov wrote:
> On Tue, Feb 25, 2025 at 10:00:35PM -0500, Rik van Riel wrote:
> > Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
> 
> Ok, I've got the whole thing here now:
> 
> https://web.git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git/log/?h=tip-x86-cpu-tlbi
> 
> Lemme take it for a spin on my machines, see whether the cat catches fire...

Well, it boots on Zen3 and 4, laptops and servers - although only the server
has the broadcast thing - but it all looks good.

I guess ship it!

:-)

Srsly, I'll send out the whole set tomorrow to have it on the ML and see
whether people have more opinions.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 03/13] x86/mm: add INVLPGB support code
  2025-02-28 19:47   ` Borislav Petkov
@ 2025-03-03 18:41     ` Dave Hansen
  2025-03-03 19:23       ` Dave Hansen
  0 siblings, 1 reply; 86+ messages in thread
From: Dave Hansen @ 2025-03-03 18:41 UTC (permalink / raw)
  To: Borislav Petkov, Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

[-- Attachment #1: Type: text/plain, Size: 870 bytes --]

On 2/28/25 11:47, Borislav Petkov wrote:
> @@ -157,11 +140,14 @@ index 77f52bc1578a..91c9a4da3ace 100644
>  +/* Flush all mappings for all PCIDs except globals. */
>  +static inline void invlpgb_flush_all_nonglobals(void)
>  +{
> ++	/*
> ++	 * @addr=0 means both rax[1] (valid PCID) and rax[2] (valid ASID) are clear
> ++	 * so flush *any* PCID and ASID.
> ++	 */
>  +	__invlpgb(0, 0, 0, 1, 0, 0);
>  +	__tlbsync();
>  +}

I had a bit of an allergic reaction to all of the magic numbers.

Could we do something like the attached where we give a _few_ of the
magic numbers some symbolic names?

For instance, instead of passing around a bool for pmd_stride, this uses
an enum. It also explicitly separates things that are setting
pmd_stride=0 but are really saying "this is a 4k stride" from things
that set pmd_stride=0 but are for operations that don't _have_ a stride.

[-- Attachment #2: supportcode.patch --]
[-- Type: text/x-patch, Size: 4430 bytes --]

--- c83449680170170f55a0ab2eb498b92ce97c0624.patch	2025-03-03 10:35:47.422277335 -0800
+++ 11ce4b22643be.patch	2025-03-03 10:38:05.692509993 -0800
@@ -1,8 +1,8 @@
-commit c83449680170170f55a0ab2eb498b92ce97c0624
+commit 11ce4b22643be54b2c70cf6b4743e6b73b461814
 Author: Rik van Riel <riel@surriel.com>
 Date:   Fri Feb 28 20:32:30 2025 +0100
 
-     x86/mm: Add INVLPGB support code
+    x86/mm: Add INVLPGB support code
     
     Add helper functions and definitions needed to use broadcast TLB
     invalidation on AMD CPUs.
@@ -17,7 +17,7 @@
     Link: https://lore.kernel.org/r/20250226030129.530345-4-riel@surriel.com
 
 diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
-index 77f52bc1578a7..5375145eb9596 100644
+index 77f52bc1578a7..3bd617c204346 100644
 --- a/arch/x86/include/asm/tlb.h
 +++ b/arch/x86/include/asm/tlb.h
 @@ -6,6 +6,9 @@
@@ -30,10 +30,15 @@
  
  static inline void tlb_flush(struct mmu_gather *tlb)
  {
-@@ -25,4 +28,110 @@ static inline void invlpg(unsigned long addr)
+@@ -25,4 +28,119 @@ static inline void invlpg(unsigned long addr)
  	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
  }
  
++enum invlpgb_stride {
++	NO_STRIDE  = 0,
++	PTE_STRIDE = 0,
++	PMD_STRIDE = 1
++};
 +
 +/*
 + * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
@@ -54,10 +59,10 @@
 + */
 +static inline void __invlpgb(unsigned long asid, unsigned long pcid,
 +			     unsigned long addr, u16 nr_pages,
-+			     bool pmd_stride, u8 flags)
++			     enum invlpgb_stride stride, u8 flags)
 +{
 +	u32 edx = (pcid << 16) | asid;
-+	u32 ecx = (pmd_stride << 31) | (nr_pages - 1);
++	u32 ecx = (stride << 31) | (nr_pages - 1);
 +	u64 rax = addr | flags;
 +
 +	/* The low bits in rax are for flags. Verify addr is clean. */
@@ -84,33 +89,37 @@
 +/*
 + * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination
 + * of the three. For example:
-+ * - INVLPGB_VA | INVLPGB_INCLUDE_GLOBAL: invalidate all TLB entries at the address
-+ * - INVLPGB_PCID:			  invalidate all TLB entries matching the PCID
++ * - FLAG_VA | FLAG_INCLUDE_GLOBAL: invalidate all TLB entries at the address
++ * - FLAG_PCID:			    invalidate all TLB entries matching the PCID
 + *
-+ * The first can be used to invalidate (kernel) mappings at a particular
++ * The first is used to invalidate (kernel) mappings at a particular
 + * address across all processes.
 + *
 + * The latter invalidates all TLB entries matching a PCID.
 + */
-+#define INVLPGB_VA			BIT(0)
-+#define INVLPGB_PCID			BIT(1)
-+#define INVLPGB_ASID			BIT(2)
-+#define INVLPGB_INCLUDE_GLOBAL		BIT(3)
-+#define INVLPGB_FINAL_ONLY		BIT(4)
-+#define INVLPGB_INCLUDE_NESTED		BIT(5)
++#define INVLPGB_FLAG_VA			BIT(0)
++#define INVLPGB_FLAG_PCID		BIT(1)
++#define INVLPGB_FLAG_ASID		BIT(2)
++#define INVLPGB_FLAG_INCLUDE_GLOBAL	BIT(3)
++#define INVLPGB_FLAG_FINAL_ONLY		BIT(4)
++#define INVLPGB_FLAG_INCLUDE_NESTED	BIT(5)
++
++/* The implied mode when all bits are clear: */
++#define INVLPGB_MODE_ALL_NONGLOBALS	0UL
 +
 +static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
 +						unsigned long addr,
 +						u16 nr,
 +						bool pmd_stride)
 +{
-+	__invlpgb(0, pcid, addr, nr, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
++	__invlpgb(0, pcid, addr, nr, pmd_stride, INVLPGB_FLAG_PCID |
++		  				 INVLPGB_FLAG_VA);
 +}
 +
 +/* Flush all mappings for a given PCID, not including globals. */
 +static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
 +{
-+	__invlpgb(0, pcid, 0, 1, 0, INVLPGB_PCID);
++	__invlpgb(0, pcid, 0, 1, NO_STRIDE, INVLPGB_FLAG_PCID);
 +}
 +
 +/* Flush all mappings, including globals, for all PCIDs. */
@@ -123,21 +132,21 @@
 +	 * as it is cheaper.
 +	 */
 +	guard(preempt)();
-+	__invlpgb(0, 0, 0, 1, 0, INVLPGB_INCLUDE_GLOBAL);
++	__invlpgb(0, 0, 0, 1, NO_STRIDE, INVLPGB_FLAG_INCLUDE_GLOBAL);
 +	__tlbsync();
 +}
 +
 +/* Flush addr, including globals, for all PCIDs. */
 +static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
 +{
-+	__invlpgb(0, 0, addr, nr, 0, INVLPGB_INCLUDE_GLOBAL);
++	__invlpgb(0, 0, addr, nr, PTE_STRIDE, INVLPGB_FLAG_INCLUDE_GLOBAL);
 +}
 +
 +/* Flush all mappings for all PCIDs except globals. */
 +static inline void invlpgb_flush_all_nonglobals(void)
 +{
 +	guard(preempt)();
-+	__invlpgb(0, 0, 0, 1, 0, 0);
++	__invlpgb(0, 0, 0, 1, NO_STRIDE, INVLPGB_MODE_ALL_NONGLOBALS);
 +	__tlbsync();
 +}
  #endif /* _ASM_X86_TLB_H */

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 03/13] x86/mm: add INVLPGB support code
  2025-03-03 18:41     ` Dave Hansen
@ 2025-03-03 19:23       ` Dave Hansen
  2025-03-04 11:00         ` Borislav Petkov
  0 siblings, 1 reply; 86+ messages in thread
From: Dave Hansen @ 2025-03-03 19:23 UTC (permalink / raw)
  To: Borislav Petkov, Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

[-- Attachment #1: Type: text/plain, Size: 53 bytes --]

Here's a plain diff if you just want to squish it in.

[-- Attachment #2: supportcode1.patch --]
[-- Type: text/x-patch, Size: 3618 bytes --]

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 5375145eb9596..3bd617c204346 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -28,6 +28,11 @@ static inline void invlpg(unsigned long addr)
 	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
 }
 
+enum invlpgb_stride {
+	NO_STRIDE  = 0,
+	PTE_STRIDE = 0,
+	PMD_STRIDE = 1
+};
 
 /*
  * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
@@ -48,10 +53,10 @@ static inline void invlpg(unsigned long addr)
  */
 static inline void __invlpgb(unsigned long asid, unsigned long pcid,
 			     unsigned long addr, u16 nr_pages,
-			     bool pmd_stride, u8 flags)
+			     enum invlpgb_stride stride, u8 flags)
 {
 	u32 edx = (pcid << 16) | asid;
-	u32 ecx = (pmd_stride << 31) | (nr_pages - 1);
+	u32 ecx = (stride << 31) | (nr_pages - 1);
 	u64 rax = addr | flags;
 
 	/* The low bits in rax are for flags. Verify addr is clean. */
@@ -78,33 +83,37 @@ static inline void __tlbsync(void)
 /*
  * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination
  * of the three. For example:
- * - INVLPGB_VA | INVLPGB_INCLUDE_GLOBAL: invalidate all TLB entries at the address
- * - INVLPGB_PCID:			  invalidate all TLB entries matching the PCID
+ * - FLAG_VA | FLAG_INCLUDE_GLOBAL: invalidate all TLB entries at the address
+ * - FLAG_PCID:			    invalidate all TLB entries matching the PCID
  *
- * The first can be used to invalidate (kernel) mappings at a particular
+ * The first is used to invalidate (kernel) mappings at a particular
  * address across all processes.
  *
  * The latter invalidates all TLB entries matching a PCID.
  */
-#define INVLPGB_VA			BIT(0)
-#define INVLPGB_PCID			BIT(1)
-#define INVLPGB_ASID			BIT(2)
-#define INVLPGB_INCLUDE_GLOBAL		BIT(3)
-#define INVLPGB_FINAL_ONLY		BIT(4)
-#define INVLPGB_INCLUDE_NESTED		BIT(5)
+#define INVLPGB_FLAG_VA			BIT(0)
+#define INVLPGB_FLAG_PCID		BIT(1)
+#define INVLPGB_FLAG_ASID		BIT(2)
+#define INVLPGB_FLAG_INCLUDE_GLOBAL	BIT(3)
+#define INVLPGB_FLAG_FINAL_ONLY		BIT(4)
+#define INVLPGB_FLAG_INCLUDE_NESTED	BIT(5)
+
+/* The implied mode when all bits are clear: */
+#define INVLPGB_MODE_ALL_NONGLOBALS	0UL
 
 static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
 						unsigned long addr,
 						u16 nr,
 						bool pmd_stride)
 {
-	__invlpgb(0, pcid, addr, nr, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
+	__invlpgb(0, pcid, addr, nr, pmd_stride, INVLPGB_FLAG_PCID |
+		  				 INVLPGB_FLAG_VA);
 }
 
 /* Flush all mappings for a given PCID, not including globals. */
 static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
 {
-	__invlpgb(0, pcid, 0, 1, 0, INVLPGB_PCID);
+	__invlpgb(0, pcid, 0, 1, NO_STRIDE, INVLPGB_FLAG_PCID);
 }
 
 /* Flush all mappings, including globals, for all PCIDs. */
@@ -117,21 +126,21 @@ static inline void invlpgb_flush_all(void)
 	 * as it is cheaper.
 	 */
 	guard(preempt)();
-	__invlpgb(0, 0, 0, 1, 0, INVLPGB_INCLUDE_GLOBAL);
+	__invlpgb(0, 0, 0, 1, NO_STRIDE, INVLPGB_FLAG_INCLUDE_GLOBAL);
 	__tlbsync();
 }
 
 /* Flush addr, including globals, for all PCIDs. */
 static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
 {
-	__invlpgb(0, 0, addr, nr, 0, INVLPGB_INCLUDE_GLOBAL);
+	__invlpgb(0, 0, addr, nr, PTE_STRIDE, INVLPGB_FLAG_INCLUDE_GLOBAL);
 }
 
 /* Flush all mappings for all PCIDs except globals. */
 static inline void invlpgb_flush_all_nonglobals(void)
 {
 	guard(preempt)();
-	__invlpgb(0, 0, 0, 1, 0, 0);
+	__invlpgb(0, 0, 0, 1, NO_STRIDE, INVLPGB_MODE_ALL_NONGLOBALS);
 	__tlbsync();
 }
 #endif /* _ASM_X86_TLB_H */

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 11/13] x86/mm: do targeted broadcast flushing from tlbbatch code
  2025-03-03 11:46   ` Borislav Petkov
@ 2025-03-03 21:47     ` Dave Hansen
  2025-03-04 11:52       ` Borislav Petkov
  2025-03-04 12:52       ` Brendan Jackman
  0 siblings, 2 replies; 86+ messages in thread
From: Dave Hansen @ 2025-03-03 21:47 UTC (permalink / raw)
  To: Borislav Petkov, Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

[-- Attachment #1: Type: text/plain, Size: 10750 bytes --]

On 3/3/25 03:46, Borislav Petkov wrote:
> On Tue, Feb 25, 2025 at 10:00:46PM -0500, Rik van Riel wrote:
>> +static inline bool cpu_need_tlbsync(void)
>> +{
>> +	return this_cpu_read(cpu_tlbstate.need_tlbsync);
>> +}
>> +
>> +static inline void cpu_write_tlbsync(bool state)
> 
> That thing feels better like "cpu_set_tlbsync" in the code...
> 
>> +{
>> +	this_cpu_write(cpu_tlbstate.need_tlbsync, state);
>> +}
>>  #else
>>  static inline u16 mm_global_asid(struct mm_struct *mm)
>>  {
> 
> ...
> 
>> +static inline void tlbsync(void)
>> +{
>> +	if (!cpu_need_tlbsync())
>> +		return;
>> +	__tlbsync();
>> +	cpu_write_tlbsync(false);
>> +}
> 
> Easier to parse visually:
> 
> static inline void tlbsync(void)
> {
>         if (cpu_need_tlbsync()) {
>                 __tlbsync();
>                 cpu_write_tlbsync(false);
>         }
> }
> 
> Final:
> 
> From: Rik van Riel <riel@surriel.com>
> Date: Tue, 25 Feb 2025 22:00:46 -0500
> Subject: [PATCH] x86/mm: Do targeted broadcast flushing from tlbbatch code
> 
> Instead of doing a system-wide TLB flush from arch_tlbbatch_flush(), queue up
> asynchronous, targeted flushes from arch_tlbbatch_add_pending().
> 
> This also allows to avoid adding the CPUs of processes using broadcast
> flushing to the batch->cpumask, and will hopefully further reduce TLB flushing
> from the reclaim and compaction paths.
> 
>   [ bp:
>    - Massage
>    - :%s/\<static_cpu_has\>/cpu_feature_enabled/cgi ]
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
> Link: https://lore.kernel.org/r/20250226030129.530345-12-riel@surriel.com
> ---
>  arch/x86/include/asm/tlb.h      | 12 ++---
>  arch/x86/include/asm/tlbflush.h | 27 +++++++----
>  arch/x86/mm/tlb.c               | 79 +++++++++++++++++++++++++++++++--
>  3 files changed, 100 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
> index 04f2c6f4cee3..b5c2005725cf 100644
> --- a/arch/x86/include/asm/tlb.h
> +++ b/arch/x86/include/asm/tlb.h
> @@ -102,16 +102,16 @@ static inline void __tlbsync(void) { }
>  #define INVLPGB_FINAL_ONLY		BIT(4)
>  #define INVLPGB_INCLUDE_NESTED		BIT(5)
>  
> -static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
> -						unsigned long addr,
> -						u16 nr,
> -						bool pmd_stride)
> +static inline void __invlpgb_flush_user_nr_nosync(unsigned long pcid,
> +						  unsigned long addr,
> +						  u16 nr,
> +						  bool pmd_stride)
>  {
>  	__invlpgb(0, pcid, addr, nr, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
>  }
>  
>  /* Flush all mappings for a given PCID, not including globals. */
> -static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
> +static inline void __invlpgb_flush_single_pcid_nosync(unsigned long pcid)
>  {
>  	__invlpgb(0, pcid, 0, 1, 0, INVLPGB_PCID);
>  }
> @@ -131,7 +131,7 @@ static inline void invlpgb_flush_all(void)
>  }
>  
>  /* Flush addr, including globals, for all PCIDs. */
> -static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
> +static inline void __invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
>  {
>  	__invlpgb(0, 0, addr, nr, 0, INVLPGB_INCLUDE_GLOBAL);
>  }
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 8c21030269ff..cbdb86d58301 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -105,6 +105,9 @@ struct tlb_state {
>  	 * need to be invalidated.
>  	 */
>  	bool invalidate_other;
> +#ifdef CONFIG_BROADCAST_TLB_FLUSH
> +	bool need_tlbsync;
> +#endif
>  
>  #ifdef CONFIG_ADDRESS_MASKING
>  	/*
> @@ -292,11 +295,23 @@ static inline bool mm_in_asid_transition(struct mm_struct *mm)
>  
>  	return mm && READ_ONCE(mm->context.asid_transition);
>  }
> +
> +static inline bool cpu_need_tlbsync(void)
> +{
> +	return this_cpu_read(cpu_tlbstate.need_tlbsync);
> +}
> +
> +static inline void cpu_set_tlbsync(bool state)
> +{
> +	this_cpu_write(cpu_tlbstate.need_tlbsync, state);
> +}
>  #else
>  static inline u16 mm_global_asid(struct mm_struct *mm) { return 0; }
>  static inline void mm_init_global_asid(struct mm_struct *mm) { }
>  static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid) { }
>  static inline bool mm_in_asid_transition(struct mm_struct *mm) { return false; }
> +static inline bool cpu_need_tlbsync(void) { return false; }
> +static inline void cpu_set_tlbsync(bool state) { }
>  #endif /* CONFIG_BROADCAST_TLB_FLUSH */
>  
>  #ifdef CONFIG_PARAVIRT
> @@ -346,21 +361,15 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
>  	return atomic64_inc_return(&mm->context.tlb_gen);
>  }
>  
> -static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> -					     struct mm_struct *mm,
> -					     unsigned long uaddr)
> -{
> -	inc_mm_tlb_gen(mm);
> -	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
> -	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
> -}
> -
>  static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
>  {
>  	flush_tlb_mm(mm);
>  }
>  
>  extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
> +extern void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> +					     struct mm_struct *mm,
> +					     unsigned long uaddr);
>  
>  static inline bool pte_flags_need_flush(unsigned long oldflags,
>  					unsigned long newflags,
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 0efd99053c09..83ba6876adbf 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -492,6 +492,37 @@ static void finish_asid_transition(struct flush_tlb_info *info)
>  	mm_clear_asid_transition(mm);
>  }
>  
> +static inline void tlbsync(void)
> +{
> +	if (cpu_need_tlbsync()) {
> +		__tlbsync();
> +		cpu_set_tlbsync(false);
> +	}
> +}
> +
> +static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
> +						unsigned long addr,
> +						u16 nr, bool pmd_stride)
> +{
> +	__invlpgb_flush_user_nr_nosync(pcid, addr, nr, pmd_stride);
> +	if (!cpu_need_tlbsync())
> +		cpu_set_tlbsync(true);
> +}
> +
> +static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
> +{
> +	__invlpgb_flush_single_pcid_nosync(pcid);
> +	if (!cpu_need_tlbsync())
> +		cpu_set_tlbsync(true);
> +}
> +
> +static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
> +{
> +	__invlpgb_flush_addr_nosync(addr, nr);
> +	if (!cpu_need_tlbsync())
> +		cpu_set_tlbsync(true);
> +}

One thought on these:

Instead of having three functions:

	1. A raw __invlpgb_*_nosync()
	2. A wrapper invlpgb_*_nosync() that flips cpu_set_tlbsync()
	3. A wrapper invlpgb_*()

Could we get away with just two?  For instance, what if we had *ALL*
__invlpgb()'s do cpu_set_tlbsync()? Then we'd universally call tlbsync().

static inline void invlpgb_flush_all_nonglobals(void)
{
        guard(preempt)();
        __invlpgb(0, 0, 0, 1, NO_STRIDE, INVLPGB_MODE_ALL_NONGLOBALS);
        tlbsync();
}

Then we wouldn't need any of those three new wrappers. The only downside
is that we'd end up with paths that logically do:

	__invlpgb()
	cpu_set_tlbsync(true);
	if (cpu_need_tlbsync()) { // always true
		__tlbsync();
		cpu_set_tlbsync(false);
	}

In other words, a possibly superfluous set/check/clear of the
"need_tlbsync" state. But I'd expect that to be a pittance compared to
the actual cost of INVLPGB/TLBSYNC.
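
A sketch of the shape that suggestion implies (assumed placement only, not the
final code; the .byte sequence is the INVLPGB encoding, analogous to the
TLBSYNC bytes used elsewhere in the series):

static inline void __invlpgb(unsigned long asid, unsigned long pcid,
			     unsigned long addr, u16 nr_pages,
			     bool pmd_stride, u8 flags)
{
	u32 edx = (pcid << 16) | asid;
	u32 ecx = (pmd_stride << 31) | (nr_pages - 1);
	u64 rax = addr | flags;

	/* INVLPGB: broadcast the invalidation described by rax/ecx/edx. */
	asm volatile(".byte 0x0f, 0x01, 0xfe"
		     : : "a" (rax), "c" (ecx), "d" (edx));

	/* Every broadcast flush queues a TLBSYNC on this CPU. */
	cpu_set_tlbsync(true);
}

with every synchronous wrapper then ending in an unconditional tlbsync(), as
in the invlpgb_flush_all_nonglobals() example above.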

>  static void broadcast_tlb_flush(struct flush_tlb_info *info)
>  {
>  	bool pmd = info->stride_shift == PMD_SHIFT;
> @@ -790,6 +821,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
>  	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
>  		WARN_ON_ONCE(!irqs_disabled());
>  
> +	tlbsync();

This one is in dire need of comments.

>  	/*
>  	 * Verify that CR3 is what we think it is.  This will catch
>  	 * hypothetical buggy code that directly switches to swapper_pg_dir
> @@ -966,6 +999,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
>   */
>  void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
>  {
> +	tlbsync();

Ditto, *especially* if this hits the init_mm state. There really
shouldn't be any deferred flushes for the init_mm.

>  	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
>  		return;
>  
> @@ -1633,9 +1668,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>  	 * a local TLB flush is needed. Optimize this use-case by calling
>  	 * flush_tlb_func_local() directly in this case.
>  	 */
> -	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> -		invlpgb_flush_all_nonglobals();
> -	} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
> +	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
>  		flush_tlb_multi(&batch->cpumask, info);
>  	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
>  		lockdep_assert_irqs_enabled();
> @@ -1644,12 +1677,52 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>  		local_irq_enable();
>  	}
>  
> +	/*
> +	 * If (asynchronous) INVLPGB flushes were issued, wait for them here.
> +	 * The cpumask above contains only CPUs that were running tasks
> +	 * not using broadcast TLB flushing.
> +	 */
> +	tlbsync();

I have a suggested comment in the attached fixups. This makes it sound
like tasks are either flushed with INVLPGB *or* tracked in ->cpumask.
Transitioning mms can be both, I think.

>  	cpumask_clear(&batch->cpumask);
>  
>  	put_flush_tlb_info();
>  	put_cpu();
>  }
>  
> +void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> +					     struct mm_struct *mm,
> +					     unsigned long uaddr)
> +{
> +	u16 asid = mm_global_asid(mm);
> +
> +	if (asid) {
> +		invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
> +		/* Do any CPUs supporting INVLPGB need PTI? */
> +		if (cpu_feature_enabled(X86_FEATURE_PTI))
> +			invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false);
> +
> +		/*
> +		 * Some CPUs might still be using a local ASID for this
> +		 * process, and require IPIs, while others are using the
> +		 * global ASID.
> +		 *
> +		 * In this corner case, both broadcast TLB invalidation
> +		 * and IPIs need to be sent. The IPIs will help
> +		 * stragglers transition to the broadcast ASID.
> +		 */
> +		if (mm_in_asid_transition(mm))
> +			asid = 0;
> +	}
> +
> +	if (!asid) {
> +		inc_mm_tlb_gen(mm);
> +		cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
> +	}
> +
> +	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
> +}

I stuck some better comments in the patch. I think this should both
mention that a later tlbsync() is required and *also* match the two
cases better.

Also, the "if (asid)" isn't super nice naming. Let's please call it
"global_asid" because it's either a global ASID or 0.

[-- Attachment #2: tlbbatch.patch --]
[-- Type: text/x-patch, Size: 1955 bytes --]

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 8e596de963a8f..8e1ac8be123b2 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -821,6 +821,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
 		WARN_ON_ONCE(!irqs_disabled());
 
+	// FIXME
+	// This is totally unexplained and needs justification and
+	// commenting
 	tlbsync();
 
 	/*
@@ -1678,9 +1681,10 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	}
 
 	/*
-	 * If (asynchronous) INVLPGB flushes were issued, wait for them here.
-	 * The cpumask above contains only CPUs that were running tasks
-	 * not using broadcast TLB flushing.
+	 * Wait for outstanding INVLPGB flushes. batch->cpumask will
+	 * be empty when the batch was handled completely by INVLPGB.
+	 * Note that mm_in_asid_transition() mm's may use INVLPGB and
+	 * the flush_tlb_multi() IPIs at the same time.
 	 */
 	tlbsync();
 
@@ -1693,9 +1697,14 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
 			       struct mm_struct *mm, unsigned long uaddr)
 {
-	u16 asid = mm_global_asid(mm);
+	u16 global_asid = mm_global_asid(mm);
 
-	if (asid) {
+	if (global_asid) {
+		/*
+		 * Global ASIDs can be flushed with INVLPGB. Flush
+		 * now instead of batching them for later. A later
+		 * tlbsync() is required to ensure these completed.
+		 */
 		invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
 		/* Do any CPUs supporting INVLPGB need PTI? */
 		if (cpu_feature_enabled(X86_FEATURE_PTI))
@@ -1714,7 +1723,11 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
 			asid = 0;
 	}
 
-	if (!asid) {
+	if (!global_asid) {
+		/*
+		 * Mark the mm and the CPU so that
+		 * the TLB gets flushed later.
+		 */
 		inc_mm_tlb_gen(mm);
 		cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
 	}

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 13/13] x86/mm: only invalidate final translations with INVLPGB
  2025-02-26  3:00 ` [PATCH v14 13/13] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
@ 2025-03-03 22:40   ` Dave Hansen
  2025-03-04 11:53     ` Borislav Petkov
  0 siblings, 1 reply; 86+ messages in thread
From: Dave Hansen @ 2025-03-03 22:40 UTC (permalink / raw)
  To: Rik van Riel, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On 2/25/25 19:00, Rik van Riel wrote:
>  static inline void __invlpgb_flush_user_nr_nosync(unsigned long pcid,
>  						  unsigned long addr,
>  						  u16 nr,
> -						  bool pmd_stride)
> +						  bool pmd_stride,
> +						  bool freed_tables)
>  {
> -	__invlpgb(0, pcid, addr, nr, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
> +	u8 flags = INVLPGB_PCID | INVLPGB_VA;
> +
> +	if (!freed_tables)
> +		flags |= INVLPGB_FINAL_ONLY;
> +
> +	__invlpgb(0, pcid, addr, nr, pmd_stride, flags);
>  }

I'm not sure this is OK.

Think of a hugetlbfs mapping with shared page tables. Say you had a
1GB-sized and 1GB-aligned mapping. It might zap the one PUD that it
needs, set tlb->cleared_puds=1 but it never sets ->freed_tables because
it didn't actually free the shared page table page.

I'd honestly just throw this patch out of the series for now. All of the
other TLB invalidation that the kernel does implicitly tosses out the
mid-level paging structure caches.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 03/13] x86/mm: add INVLPGB support code
  2025-03-03 19:23       ` Dave Hansen
@ 2025-03-04 11:00         ` Borislav Petkov
  2025-03-04 15:10           ` Dave Hansen
  0 siblings, 1 reply; 86+ messages in thread
From: Borislav Petkov @ 2025-03-04 11:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Rik van Riel, x86, linux-kernel, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jackmanb, jannh, mhklinux, andrew.cooper3, Manali.Shukla,
	mingo

On Mon, Mar 03, 2025 at 11:23:58AM -0800, Dave Hansen wrote:
> Here's a plain diff if you just want to squish it in.

> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
> index 5375145eb9596..3bd617c204346 100644
> --- a/arch/x86/include/asm/tlb.h
> +++ b/arch/x86/include/asm/tlb.h
> @@ -28,6 +28,11 @@ static inline void invlpg(unsigned long addr)
>  	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
>  }
>  
> +enum invlpgb_stride {

Right, this is an address stride, as the text calls it.

> +	NO_STRIDE  = 0,
> +	PTE_STRIDE = 0,

Ok, so those are confusing. No stride is PTE stride so let's just zap
NO_STRIDE.

> +	PMD_STRIDE = 1
> +};
>  
>  /*
>   * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.

...

>  static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
>  						unsigned long addr,
>  						u16 nr,
>  						bool pmd_stride)

You're relying on the fact that true == PMD_STRIDE and false == PTE_STRIDE,
but let's make it Right(tm), see below.

Rest looks ok.

IOW, I'm merging this into patch 3:

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 5375145eb959..6718835c3b0c 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -28,6 +28,10 @@ static inline void invlpg(unsigned long addr)
 	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
 }
 
+enum addr_stride {
+	PTE_STRIDE = 0,
+	PMD_STRIDE = 1
+};
 
 /*
  * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
@@ -48,10 +52,10 @@ static inline void invlpg(unsigned long addr)
  */
 static inline void __invlpgb(unsigned long asid, unsigned long pcid,
 			     unsigned long addr, u16 nr_pages,
-			     bool pmd_stride, u8 flags)
+			     enum addr_stride stride, u8 flags)
 {
 	u32 edx = (pcid << 16) | asid;
-	u32 ecx = (pmd_stride << 31) | (nr_pages - 1);
+	u32 ecx = (stride << 31) | (nr_pages - 1);
 	u64 rax = addr | flags;
 
 	/* The low bits in rax are for flags. Verify addr is clean. */
@@ -78,33 +82,38 @@ static inline void __tlbsync(void)
 /*
  * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination
  * of the three. For example:
- * - INVLPGB_VA | INVLPGB_INCLUDE_GLOBAL: invalidate all TLB entries at the address
- * - INVLPGB_PCID:			  invalidate all TLB entries matching the PCID
+ * - FLAG_VA | FLAG_INCLUDE_GLOBAL: invalidate all TLB entries at the address
+ * - FLAG_PCID:			    invalidate all TLB entries matching the PCID
  *
- * The first can be used to invalidate (kernel) mappings at a particular
+ * The first is used to invalidate (kernel) mappings at a particular
  * address across all processes.
  *
  * The latter invalidates all TLB entries matching a PCID.
  */
-#define INVLPGB_VA			BIT(0)
-#define INVLPGB_PCID			BIT(1)
-#define INVLPGB_ASID			BIT(2)
-#define INVLPGB_INCLUDE_GLOBAL		BIT(3)
-#define INVLPGB_FINAL_ONLY		BIT(4)
-#define INVLPGB_INCLUDE_NESTED		BIT(5)
+#define INVLPGB_FLAG_VA			BIT(0)
+#define INVLPGB_FLAG_PCID		BIT(1)
+#define INVLPGB_FLAG_ASID		BIT(2)
+#define INVLPGB_FLAG_INCLUDE_GLOBAL	BIT(3)
+#define INVLPGB_FLAG_FINAL_ONLY		BIT(4)
+#define INVLPGB_FLAG_INCLUDE_NESTED	BIT(5)
+
+/* The implied mode when all bits are clear: */
+#define INVLPGB_MODE_ALL_NONGLOBALS	0UL
 
 static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
 						unsigned long addr,
-						u16 nr,
-						bool pmd_stride)
+						u16 nr, bool stride)
 {
-	__invlpgb(0, pcid, addr, nr, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
+	enum addr_stride str = stride ? PMD_STRIDE : PTE_STRIDE;
+	u8 flags = INVLPGB_FLAG_PCID | INVLPGB_FLAG_VA;
+
+	__invlpgb(0, pcid, addr, nr, str, flags);
 }
 
 /* Flush all mappings for a given PCID, not including globals. */
 static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
 {
-	__invlpgb(0, pcid, 0, 1, 0, INVLPGB_PCID);
+	__invlpgb(0, pcid, 0, 1, PTE_STRIDE, INVLPGB_FLAG_PCID);
 }
 
 /* Flush all mappings, including globals, for all PCIDs. */
@@ -117,21 +126,21 @@ static inline void invlpgb_flush_all(void)
 	 * as it is cheaper.
 	 */
 	guard(preempt)();
-	__invlpgb(0, 0, 0, 1, 0, INVLPGB_INCLUDE_GLOBAL);
+	__invlpgb(0, 0, 0, 1, PTE_STRIDE, INVLPGB_FLAG_INCLUDE_GLOBAL);
 	__tlbsync();
 }
 
 /* Flush addr, including globals, for all PCIDs. */
 static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
 {
-	__invlpgb(0, 0, addr, nr, 0, INVLPGB_INCLUDE_GLOBAL);
+	__invlpgb(0, 0, addr, nr, PTE_STRIDE, INVLPGB_FLAG_INCLUDE_GLOBAL);
 }
 
 /* Flush all mappings for all PCIDs except globals. */
 static inline void invlpgb_flush_all_nonglobals(void)
 {
 	guard(preempt)();
-	__invlpgb(0, 0, 0, 1, 0, 0);
+	__invlpgb(0, 0, 0, 1, PTE_STRIDE, INVLPGB_MODE_ALL_NONGLOBALS);
 	__tlbsync();
 }
 #endif /* _ASM_X86_TLB_H */

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 11/13] x86/mm: do targeted broadcast flushing from tlbbatch code
  2025-03-03 21:47     ` Dave Hansen
@ 2025-03-04 11:52       ` Borislav Petkov
  2025-03-04 15:24         ` Dave Hansen
  2025-03-04 12:52       ` Brendan Jackman
  1 sibling, 1 reply; 86+ messages in thread
From: Borislav Petkov @ 2025-03-04 11:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Rik van Riel, x86, linux-kernel, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jackmanb, jannh, mhklinux, andrew.cooper3, Manali.Shukla,
	mingo

On Mon, Mar 03, 2025 at 01:47:42PM -0800, Dave Hansen wrote:
> > +static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
> > +{
> > +	__invlpgb_flush_addr_nosync(addr, nr);
> > +	if (!cpu_need_tlbsync())
> > +		cpu_set_tlbsync(true);
> > +}
> 
> One thought on these:
> 
> Instead of having three functions:
> 
> 	1. A raw __invlpgb_*_nosync()
> 	2. A wrapper invlpgb_*_nosync() that flips cpu_set_tlbsync()
> 	3. A wrapper invlpgb_*()
> 
> Could we get away with just two?  For instance, what if we had *ALL*
> __invlpgb()'s do cpu_set_tlbsync()? Then we'd universally call tlbsync().
> 
> static inline void invlpgb_flush_all_nonglobals(void)
> {
>         guard(preempt)();
>         __invlpgb(0, 0, 0, 1, NO_STRIDE, INVLPGB_MODE_ALL_NONGLOBALS);
>         tlbsync();
> }
> 
> Then we wouldn't need any of those three new wrappers. The only downside
> is that we'd end up with paths that logically do:
> 
> 	__invlpgb()
> 	cpu_set_tlbsync(true);
> 	if (cpu_need_tlbsync()) { // always true
> 		__tlbsync();
> 		cpu_set_tlbsync(true);
> 	}
> 
> In other words, a possibly superfluous set/check/clear of the
> "need_tlbsync" state. But I'd expect that to be a pittance compared to
> the actual cost of INVLPGB/TLBSYNC.

Lemme see whether I can grasp properly what you mean:

What you really want is for the _nosync() variants to set need_tlbsync, right?

Because looking at all the call sites which do set tlbsync, the flow is this:

        __invlpgb_flush_user_nr_nosync(pcid, addr, nr, pmd_stride, freed_tables);
        if (!cpu_need_tlbsync())
                cpu_set_tlbsync(true);

So we should move that

	cpu_set_tlbsync(true);

into the __invlpgb_*_nosync() variant.

And in order to make it even simpler, we should drop the testing too:

IOW, this:

/* Flush all mappings for a given PCID, not including globals. */
static inline void __invlpgb_flush_single_pcid_nosync(unsigned long pcid)
{
        __invlpgb(0, pcid, 0, 1, 0, INVLPGB_PCID);
        cpu_set_tlbsync(true);
}

Right?

> >  static void broadcast_tlb_flush(struct flush_tlb_info *info)
> >  {
> >  	bool pmd = info->stride_shift == PMD_SHIFT;
> > @@ -790,6 +821,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> >  	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
> >  		WARN_ON_ONCE(!irqs_disabled());
> >  
> > +	tlbsync();
> 
> This one is in dire need of comments.

Maybe this:

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 08672350536f..b97249ffff1f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -822,6 +822,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
        if (IS_ENABLED(CONFIG_PROVE_LOCKING))
                WARN_ON_ONCE(!irqs_disabled());
 
+       /*
+        * Finish any remote TLB flushes pending from this CPU:
+        */
        tlbsync();
 
        /*

> Ditto, *especially* if this hits the init_mm state. There really
> shouldn't be any deferred flushes for the init_mm.

Basically what you said but as a comment. :-P

Merged in the rest.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 13/13] x86/mm: only invalidate final translations with INVLPGB
  2025-03-03 22:40   ` Dave Hansen
@ 2025-03-04 11:53     ` Borislav Petkov
  0 siblings, 0 replies; 86+ messages in thread
From: Borislav Petkov @ 2025-03-04 11:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Rik van Riel, x86, linux-kernel, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jackmanb, jannh, mhklinux, andrew.cooper3, Manali.Shukla,
	mingo

On Mon, Mar 03, 2025 at 02:40:49PM -0800, Dave Hansen wrote:
> On 2/25/25 19:00, Rik van Riel wrote:
> >  static inline void __invlpgb_flush_user_nr_nosync(unsigned long pcid,
> >  						  unsigned long addr,
> >  						  u16 nr,
> > -						  bool pmd_stride)
> > +						  bool pmd_stride,
> > +						  bool freed_tables)
> >  {
> > -	__invlpgb(0, pcid, addr, nr, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
> > +	u8 flags = INVLPGB_PCID | INVLPGB_VA;
> > +
> > +	if (!freed_tables)
> > +		flags |= INVLPGB_FINAL_ONLY;
> > +
> > +	__invlpgb(0, pcid, addr, nr, pmd_stride, flags);
> >  }
> 
> I'm not sure this is OK.
> 
> Think of a hugetlbfs mapping with shared page tables. Say you had a
> 1GB-sized and 1GB-aligned mapping. It might zap the one PUD that it
> needs, set tlb->cleared_puds=1 but it never sets ->freed_tables because
> it didn't actually free the shared page table page.
> 
> I'd honestly just throw this patch out of the series for now. All of the
> other TLB invalidation that the kernel does implicitly tosses out the
> mid-level paging structure caches.

Right, I guess we can revisit this later, once the dust settles.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH] x86/mm: Always set the ASID valid bit for the INVLPGB instruction
  2025-02-26  3:00 [PATCH v14 00/13] AMD broadcast TLB invalidation Rik van Riel
                   ` (13 preceding siblings ...)
  2025-03-03 12:42 ` [PATCH v14 00/13] AMD broadcast TLB invalidation Borislav Petkov
@ 2025-03-04 12:04 ` Borislav Petkov
  2025-03-04 12:43   ` Borislav Petkov
                     ` (2 more replies)
  14 siblings, 3 replies; 86+ messages in thread
From: Borislav Petkov @ 2025-03-04 12:04 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Tue, Feb 25, 2025 at 10:00:35PM -0500, Rik van Riel wrote:
> Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.

One more patch on top from Tom.

Now lemme test this a bit again...

From: Tom Lendacky <thomas.lendacky@amd.com>
Date: Tue, 4 Mar 2025 12:59:56 +0100
Subject: [PATCH] x86/mm: Always set the ASID valid bit for the INVLPGB instruction

When executing the INVLPGB instruction on a bare-metal host or hypervisor, if
the ASID valid bit is not set, the instruction will flush the TLB entries that
match the specified criteria for any ASID, not just those of the host. If
virtual machines are running on the system, this may result in inadvertent
flushes of guest TLB entries.

When executing the INVLPGB instruction in a guest and the INVLPGB instruction is
not intercepted by the hypervisor, the hardware will replace the requested ASID
with the guest ASID and set the ASID valid bit before doing the broadcast
invalidation. Thus a guest is only able to flush its own TLB entries.

So to limit the host TLB flushing reach, always set the ASID valid bit using an
ASID value of 0 (which represents the host/hypervisor). This will result in
the desired effect in both host and guest.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
---
 arch/x86/include/asm/tlb.h | 58 +++++++++++++++++++++-----------------
 1 file changed, 32 insertions(+), 26 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index e8561a846754..56fe331fb797 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -33,6 +33,27 @@ enum addr_stride {
 	PMD_STRIDE = 1
 };
 
+/*
+ * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination
+ * of the three. For example:
+ * - FLAG_VA | FLAG_INCLUDE_GLOBAL: invalidate all TLB entries at the address
+ * - FLAG_PCID:			    invalidate all TLB entries matching the PCID
+ *
+ * The first is used to invalidate (kernel) mappings at a particular
+ * address across all processes.
+ *
+ * The latter invalidates all TLB entries matching a PCID.
+ */
+#define INVLPGB_FLAG_VA			BIT(0)
+#define INVLPGB_FLAG_PCID		BIT(1)
+#define INVLPGB_FLAG_ASID		BIT(2)
+#define INVLPGB_FLAG_INCLUDE_GLOBAL	BIT(3)
+#define INVLPGB_FLAG_FINAL_ONLY		BIT(4)
+#define INVLPGB_FLAG_INCLUDE_NESTED	BIT(5)
+
+/* The implied mode when all bits are clear: */
+#define INVLPGB_MODE_ALL_NONGLOBALS	0UL
+
 #ifdef CONFIG_BROADCAST_TLB_FLUSH
 /*
  * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
@@ -40,14 +61,20 @@ enum addr_stride {
  * The INVLPGB instruction is weakly ordered, and a batch of invalidations can
  * be done in a parallel fashion.
  *
- * The instruction takes the number of extra pages to invalidate, beyond
- * the first page, while __invlpgb gets the more human readable number of
- * pages to invalidate.
+ * The instruction takes the number of extra pages to invalidate, beyond the
+ * first page, while __invlpgb gets the more human readable number of pages to
+ * invalidate.
  *
  * The bits in rax[0:2] determine respectively which components of the address
  * (VA, PCID, ASID) get compared when flushing. If neither bits are set, *any*
  * address in the specified range matches.
  *
+ * Since it is desired to only flush TLB entries for the ASID that is executing
+ * the instruction (a host/hypervisor or a guest), the ASID valid bit should
+ * always be set. On a host/hypervisor, the hardware will use the ASID value
+ * specified in EDX[15:0] (which should be 0). On a guest, the hardware will
+ * use the actual ASID value of the guest.
+ *
  * TLBSYNC is used to ensure that pending INVLPGB invalidations initiated from
  * this CPU have completed.
  */
@@ -55,9 +82,9 @@ static inline void __invlpgb(unsigned long asid, unsigned long pcid,
 			     unsigned long addr, u16 nr_pages,
 			     enum addr_stride stride, u8 flags)
 {
-	u32 edx = (pcid << 16) | asid;
+	u64 rax = addr | flags | INVLPGB_FLAG_ASID;
 	u32 ecx = (stride << 31) | (nr_pages - 1);
-	u64 rax = addr | flags;
+	u32 edx = (pcid << 16) | asid;
 
 	/* The low bits in rax are for flags. Verify addr is clean. */
 	VM_WARN_ON_ONCE(addr & ~PAGE_MASK);
@@ -87,27 +114,6 @@ static inline void __invlpgb(unsigned long asid, unsigned long pcid,
 static inline void __tlbsync(void) { }
 #endif
 
-/*
- * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination
- * of the three. For example:
- * - FLAG_VA | FLAG_INCLUDE_GLOBAL: invalidate all TLB entries at the address
- * - FLAG_PCID:			    invalidate all TLB entries matching the PCID
- *
- * The first is used to invalidate (kernel) mappings at a particular
- * address across all processes.
- *
- * The latter invalidates all TLB entries matching a PCID.
- */
-#define INVLPGB_FLAG_VA			BIT(0)
-#define INVLPGB_FLAG_PCID		BIT(1)
-#define INVLPGB_FLAG_ASID		BIT(2)
-#define INVLPGB_FLAG_INCLUDE_GLOBAL	BIT(3)
-#define INVLPGB_FLAG_FINAL_ONLY		BIT(4)
-#define INVLPGB_FLAG_INCLUDE_NESTED	BIT(5)
-
-/* The implied mode when all bits are clear: */
-#define INVLPGB_MODE_ALL_NONGLOBALS	0UL
-
 static inline void __invlpgb_flush_user_nr_nosync(unsigned long pcid,
 						  unsigned long addr,
 						  u16 nr, bool stride)
-- 
2.43.0

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH] x86/mm: Always set the ASID valid bit for the INVLPGB instruction
  2025-03-04 12:04 ` [PATCH] x86/mm: Always set the ASID valid bit for the INVLPGB instruction Borislav Petkov
@ 2025-03-04 12:43   ` Borislav Petkov
  2025-03-05 21:36   ` [tip: x86/mm] " tip-bot2 for Tom Lendacky
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Tom Lendacky
  2 siblings, 0 replies; 86+ messages in thread
From: Borislav Petkov @ 2025-03-04 12:43 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Rik van Riel, x86, linux-kernel, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jackmanb, jannh, mhklinux, andrew.cooper3, Manali.Shukla,
	mingo

On Tue, Mar 04, 2025 at 01:04:49PM +0100, Borislav Petkov wrote:
> @@ -55,9 +82,9 @@ static inline void __invlpgb(unsigned long asid, unsigned long pcid,
>  			     unsigned long addr, u16 nr_pages,
>  			     enum addr_stride stride, u8 flags)
>  {
> -	u32 edx = (pcid << 16) | asid;
> +	u64 rax = addr | flags | INVLPGB_FLAG_ASID;

So I did it this way instead of how you sent it to me privately and I think
doing it this way is equivalent.

However, my Zen3 doesn't like it somehow.

If I zap your patch, it boots fine.

So we need to stare at this some more first...

Thx.

[   13.834550] (sd-mkdcre[570]: Changing mount flags /dev/shm (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_NOSYMFOLLOW|MS_BIND "")...
[   13.834605] (sd-mkdcre[570]: Moving mount /dev/shm → /run/credentials/systemd-tmpfiles-setup-dev-early.service (MS_MOVE "")...
[   13.835200] EXT4-fs (sda3): re-mounted 69500787-6b3d-4b03-96d6-6bb20668e924 r/w. Quota mode: none.
[   13.867434] (journald)[561]: systemd-journald.service: Enabled MemoryDenyWriteExecute= with PR_SET_MDWE
[   13.868764] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 8-.... } 8 jiffies s: 129 root: 0x1/.
[   13.868779] rcu: blocking rcu_node structures (internal RCU debug): l=1:0-15:0x100/.
[   13.868789] Sending NMI from CPU 19 to CPUs 8:
[   13.868797] NMI backtrace for cpu 8
[   13.868799] CPU: 8 UID: 0 PID: 566 Comm: mount Not tainted 6.14.0-rc3+ #1
[   13.868802] Hardware name: Supermicro Super Server/H12SSL-i, BIOS 2.5 09/08/2022
[   13.868804] RIP: 0010:delay_halt_mwaitx+0x38/0x40
[   13.868809] Code: 31 d2 48 89 d1 48 05 00 60 00 00 0f 01 fa b8 ff ff ff ff b9 02 00 00 00 48 39 c6 48 0f 46 c6 48 89 c3 b8 f0 00 00 00 0f 01 fb <5b> e9 cd ea 23 ff 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[   13.868810] RSP: 0018:ffffc90003b67ae0 EFLAGS: 00000097
[   13.868813] RAX: 00000000000000f0 RBX: 0000000000000b9b RCX: 0000000000000002
[   13.868814] RDX: 0000000000000000 RSI: 0000000000000b9b RDI: 00000036410a0588
[   13.868816] RBP: 00000036410a0588 R08: 000000000000005b R09: 0000000000000006
[   13.868817] R10: 0000000000000000 R11: 0000000000000006 R12: 0000000000000020
[   13.868818] R13: ffffffff83fb5900 R14: 0000000000000000 R15: 0000000000000001
[   13.868819] FS:  00007f04ffb9c800(0000) GS:ffff889002e00000(0000) knlGS:0000000000000000
[   13.868821] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.868822] CR2: 00007f92418c8350 CR3: 000000012bbe4004 CR4: 0000000000770ef0
[   13.868823] PKRU: 55555554
[   13.868824] Call Trace:
[   13.868826]  <NMI>
[   13.868828]  ? nmi_cpu_backtrace.cold+0x16/0x6b
[   13.868832]  ? nmi_cpu_backtrace_handler+0xd/0x20
[   13.868835]  ? nmi_handle+0xcd/0x240
[   13.868838]  ? delay_halt_mwaitx+0x38/0x40
[   13.868841]  ? default_do_nmi+0x6b/0x180
[   13.868844]  ? exc_nmi+0x12e/0x1c0
[   13.868847]  ? end_repeat_nmi+0xf/0x53
[   13.868854]  ? delay_halt_mwaitx+0x38/0x40
[   13.868857]  ? delay_halt_mwaitx+0x38/0x40
[   13.868860]  ? delay_halt_mwaitx+0x38/0x40
[   13.868863]  </NMI>
[   13.868864]  <TASK>
[   13.868865]  delay_halt+0x3b/0x60
[   13.868868]  wait_for_lsr+0x3f/0xa0
[   13.868873]  serial8250_console_write+0x270/0x670
[   13.868877]  ? srso_alias_return_thunk+0x5/0xfbef5
[   13.868879]  ? lock_release+0x183/0x2d0
[   13.868885]  ? console_flush_all+0x381/0x6b0
[   13.868888]  console_flush_all+0x3ad/0x6b0
[   13.868891]  ? console_flush_all+0x381/0x6b0
[   13.868893]  ? console_flush_all+0x39/0x6b0
[   13.868900]  console_unlock+0x157/0x240
[   13.868904]  vprintk_emit+0x304/0x550
[   13.868910]  _printk+0x58/0x80
[   13.868914]  ? ___ratelimit+0x85/0x100
[   13.868919]  __ext4_msg.cold+0x7f/0x84
[   13.868926]  ? srso_alias_return_thunk+0x5/0xfbef5
[   13.868928]  ? ext4_register_li_request+0x1c9/0x2a0
[   13.868933]  ext4_reconfigure+0x503/0xc40
[   13.868939]  ? shrink_dentry_list+0x17/0x320
[   13.868942]  ? shrink_dentry_list+0x89/0x320
[   13.868944]  ? srso_alias_return_thunk+0x5/0xfbef5
[   13.868946]  ? shrink_dentry_list+0x16c/0x320
[   13.868953]  reconfigure_super+0xc5/0x210
[   13.868958]  __do_sys_fsconfig+0x607/0x750
[   13.868967]  do_syscall_64+0x5d/0x100
[   13.868970]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   13.868973] RIP: 0033:0x7f04ffdb481a
[   13.868976] Code: 73 01 c3 48 8b 0d 06 56 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 af 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d6 55 0d 00 f7 d8 64 89 01 48
[   13.868977] RSP: 002b:00007ffe3e177008 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
[   13.868980] RAX: ffffffffffffffda RBX: 000055ab49f74a50 RCX: 00007f04ffdb481a
[   13.868981] RDX: 0000000000000000 RSI: 0000000000000007 RDI: 0000000000000004
[   13.868982] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
[   13.868983] R10: 0000000000000000 R11: 0000000000000246 R12: 000055ab49f75ee0
[   13.868984] R13: 00007f04fff3b5e0 R14: 00007f04fff3d244 R15: 0000000000000065
[   13.868993]  </TASK>
[   13.880199] (tmpfiles)[564]: (sd-mkdcreds) succeeded.
[   13.936943] (journald)[561]: Restricting namespace to: n/a.
[   13.937918] (tmpfiles)[564]: systemd-tmpfiles-setup-dev-early.service: Executing: systemd-tmpfiles --prefix=/dev --create --boot --graceful
[   14.025641] (journald)[561]: systemd-journald.service: Executing: /usr/lib/systemd/systemd-journald
[   14.137455] ACPI: bus type drm_connector registered
[   14.306514] (udevadm)[569]: Found cgroup2 on /sys/fs/cgroup/, full unified hierarchy
[   14.357107] (udevadm)[569]: Found cgroup2 on /sys/fs/cgroup/, full unified hierarch

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 11/13] x86/mm: do targeted broadcast flushing from tlbbatch code
  2025-03-03 21:47     ` Dave Hansen
  2025-03-04 11:52       ` Borislav Petkov
@ 2025-03-04 12:52       ` Brendan Jackman
  2025-03-04 14:11         ` Borislav Petkov
  1 sibling, 1 reply; 86+ messages in thread
From: Brendan Jackman @ 2025-03-04 12:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Borislav Petkov, Rik van Riel, x86, linux-kernel, peterz,
	dave.hansen, zhengqi.arch, nadav.amit, thomas.lendacky,
	kernel-team, linux-mm, akpm, jannh, mhklinux, andrew.cooper3,
	Manali.Shukla, mingo

On Mon, Mar 03, 2025 at 01:47:42PM -0800, Dave Hansen wrote:
> >  static void broadcast_tlb_flush(struct flush_tlb_info *info)
> >  {
> >  	bool pmd = info->stride_shift == PMD_SHIFT;
> > @@ -790,6 +821,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> >  	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
> >  		WARN_ON_ONCE(!irqs_disabled());
> >  
> > +	tlbsync();
> 
> This one is in dire need of comments.

I have been advocating to move this to a place where it's clearer. I
also think it needs comments:

https://lore.kernel.org/all/CA+i-1C31TrceZiizC_tng_cc-zcvKsfXLAZD_XDftXnp9B2Tdw@mail.gmail.com/

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 11/13] x86/mm: do targeted broadcast flushing from tlbbatch code
  2025-03-04 12:52       ` Brendan Jackman
@ 2025-03-04 14:11         ` Borislav Petkov
  2025-03-04 15:33           ` Brendan Jackman
  0 siblings, 1 reply; 86+ messages in thread
From: Borislav Petkov @ 2025-03-04 14:11 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Dave Hansen, Rik van Riel, x86, linux-kernel, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jannh, mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Tue, Mar 04, 2025 at 12:52:47PM +0000, Brendan Jackman wrote:
> https://lore.kernel.org/all/CA+i-1C31TrceZiizC_tng_cc-zcvKsfXLAZD_XDftXnp9B2Tdw@mail.gmail.com/

Lemme try to understand what you're suggesting on that subthread:

> static inline void arch_start_context_switch(struct task_struct *prev)
> {
>     arch_paravirt_start_context_switch(prev);
>     tlb_start_context_switch(prev);
> }

This kinda makes sense to me...

> Now I think about it... if we always tlbsync() before a context switch, is the
> cant_migrate() above actually required? I think with that, even if we migrated
> in the middle of e.g.  broadcast_kernel_range_flush(), we'd be fine? (At
> least, from the specific perspective of the invplgb code, presumably having
> preemption on there would break things horribly in other ways).

I think we still need it because you need to TLBSYNC on the same CPU you've
issued the INVLPGB on and, actually, you want all TLBs to have been synched
system-wide.
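
For the synchronous helpers the series already encodes that constraint with
a preemption guard - this is just the invlpgb_flush_all_nonglobals() body
from the <asm/tlb.h> helpers, quoted here as an illustration, not new code:

	guard(preempt)();	/* no migration between the INVLPGB ... */
	__invlpgb(0, 0, 0, 1, PTE_STRIDE, INVLPGB_MODE_ALL_NONGLOBALS);
	__tlbsync();		/* ... and the TLBSYNC waiting for it */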

Or am I misunderstanding it?

Anything else I missed?

Btw, I just sent v15 - if you wanna continue commenting there...

https://lore.kernel.org/r/20250304135816.12356-1-bp@kernel.org

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 03/13] x86/mm: add INVLPGB support code
  2025-03-04 11:00         ` Borislav Petkov
@ 2025-03-04 15:10           ` Dave Hansen
  2025-03-04 16:19             ` Borislav Petkov
  0 siblings, 1 reply; 86+ messages in thread
From: Dave Hansen @ 2025-03-04 15:10 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rik van Riel, x86, linux-kernel, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jackmanb, jannh, mhklinux, andrew.cooper3, Manali.Shukla,
	mingo

On 3/4/25 03:00, Borislav Petkov wrote:
> On Mon, Mar 03, 2025 at 11:23:58AM -0800, Dave Hansen wrote:
>> Here's a plain diff if you just want to squish it in.
> 
>> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
>> index 5375145eb9596..3bd617c204346 100644
>> --- a/arch/x86/include/asm/tlb.h
>> +++ b/arch/x86/include/asm/tlb.h
>> @@ -28,6 +28,11 @@ static inline void invlpg(unsigned long addr)
>>  	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
>>  }
>>  
>> +enum invlpgb_stride {
> 
> Right, this is an address stride, as the text calls it.
> 
>> +	NO_STRIDE  = 0,
>> +	PTE_STRIDE = 0,
> 
> Ok, so those are confusing. No stride is PTE stride so let's just zap
> NO_STRIDE.

Passing "PTE_STRIDE" to an operation that doesn't have a stride is
pretty confusing too.

...
>  /* Flush all mappings, including globals, for all PCIDs. */
> @@ -117,21 +126,21 @@ static inline void invlpgb_flush_all(void)
>  	 * as it is cheaper.
>  	 */
>  	guard(preempt)();
> -	__invlpgb(0, 0, 0, 1, 0, INVLPGB_INCLUDE_GLOBAL);
> +	__invlpgb(0, 0, 0, 1, PTE_STRIDE, INVLPGB_FLAG_INCLUDE_GLOBAL);
>  	__tlbsync();
>  }

This one, for example. It's not flushing PTEs and doesn't have a start
address or nr>0.

So, we could have the enum be totally divorced from the hardware type:

	NO_STRIDE,
	PTE_STRIDE,
	PMD_STRIDE

and decode it at the end:

	if (stride == PMD_STRIDE)
		foo | PMD_STRIDE_BIT;


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 11/13] x86/mm: do targeted broadcast flushing from tlbbatch code
  2025-03-04 11:52       ` Borislav Petkov
@ 2025-03-04 15:24         ` Dave Hansen
  0 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2025-03-04 15:24 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rik van Riel, x86, linux-kernel, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jackmanb, jannh, mhklinux, andrew.cooper3, Manali.Shukla,
	mingo

On 3/4/25 03:52, Borislav Petkov wrote:
> On Mon, Mar 03, 2025 at 01:47:42PM -0800, Dave Hansen wrote:
...
> IOW, this:
> 
> /* Flush all mappings for a given PCID, not including globals. */
> static inline void __invlpgb_flush_single_pcid_nosync(unsigned long pcid)
> {
>         __invlpgb(0, pcid, 0, 1, 0, INVLPGB_PCID);
>         cpu_set_tlbsync(true);
> }
> 
> Right?

Yep, that works.

Optimizing out the writes like the old code did is certainly a good
thought. But I suspect the cacheline is hot the majority of the time.

>>>  static void broadcast_tlb_flush(struct flush_tlb_info *info)
>>>  {
>>>  	bool pmd = info->stride_shift == PMD_SHIFT;
>>> @@ -790,6 +821,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
>>>  	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
>>>  		WARN_ON_ONCE(!irqs_disabled());
>>>  
>>> +	tlbsync();
>>
>> This one is in dire need of comments.
> 
> Maybe this:
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 08672350536f..b97249ffff1f 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -822,6 +822,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
>         if (IS_ENABLED(CONFIG_PROVE_LOCKING))
>                 WARN_ON_ONCE(!irqs_disabled());
>  
> +       /*
> +        * Finish any remote TLB flushes pending from this CPU:
> +        */
>         tlbsync();

That's a prototypical "what" comment and not "why", though. It makes a
lot of sense that any flushes that the old task did should complete
before a new one gets activated. But I honestly can't think of a _specific_
problem that it causes.

I don't doubt that this does _some_ good, but I just don't know what
good it does. ;)


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 11/13] x86/mm: do targeted broadcast flushing from tlbbatch code
  2025-03-04 14:11         ` Borislav Petkov
@ 2025-03-04 15:33           ` Brendan Jackman
  2025-03-04 17:51             ` Dave Hansen
  0 siblings, 1 reply; 86+ messages in thread
From: Brendan Jackman @ 2025-03-04 15:33 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, Rik van Riel, x86, linux-kernel, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jannh, mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Tue, Mar 04, 2025 at 03:11:34PM +0100, Borislav Petkov wrote:
> On Tue, Mar 04, 2025 at 12:52:47PM +0000, Brendan Jackman wrote:
> > https://lore.kernel.org/all/CA+i-1C31TrceZiizC_tng_cc-zcvKsfXLAZD_XDftXnp9B2Tdw@mail.gmail.com/
> 
> Lemme try to understand what you're suggesting on that subthread:
> 
> > static inline void arch_start_context_switch(struct task_struct *prev)
> > {
> >     arch_paravirt_start_context_switch(prev);
> >     tlb_start_context_switch(prev);
> > }
> 
> This kinda makes sense to me...

Yeah so basically my concern here is that we are doing something
that's about context switching, but we're doing it in mm-switching
code, entangling an assumption that "context_switch() must either call
this function or that function". Whereas if we just call it explicitly
from context_switch() it will be much clearer.
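
As a rough sketch of what I mean (the helper body below is hypothetical and
not from the posted series; only the name comes from the snippet quoted
above, and tlbsync() is the wrapper the series already provides):

	static inline void tlb_start_context_switch(struct task_struct *prev)
	{
		/*
		 * Finish any INVLPGBs this CPU issued on behalf of the
		 * outgoing task before context_switch() proceeds.
		 */
		tlbsync();
	}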

> > Now I think about it... if we always tlbsync() before a context switch, is the
> > cant_migrate() above actually required? I think with that, even if we migrated
> > in the middle of e.g.  broadcast_kernel_range_flush(), we'd be fine? (At
> > least, from the specific perspective of the invplgb code, presumably having
> > preemption on there would break things horribly in other ways).
> 
> I think we still need it because you need to TLBSYNC on the same CPU you've
> issued the INVLPGB and actually, you want all TLBs to have been synched
> system-wide.
> 
> Or am I misunderstanding it?

Er, I might be exposing my own ignorance here. I was thinking that you
always go through context_switch() before you get migrated, but I
might not understand how migration happens.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 03/13] x86/mm: add INVLPGB support code
  2025-03-04 15:10           ` Dave Hansen
@ 2025-03-04 16:19             ` Borislav Petkov
  2025-03-04 16:57               ` Dave Hansen
  0 siblings, 1 reply; 86+ messages in thread
From: Borislav Petkov @ 2025-03-04 16:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Rik van Riel, x86, linux-kernel, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jackmanb, jannh, mhklinux, andrew.cooper3, Manali.Shukla,
	mingo

On Tue, Mar 04, 2025 at 07:10:13AM -0800, Dave Hansen wrote:
> So, we could have the enum be totally divorced from the hardware type:
> 
> 	NO_STRIDE,
> 	PTE_STRIDE,
> 	PMD_STRIDE

How about we completely hide that NO_STRIDE thing and add a __invlpgb_all()
"sub-helper" whose name makes clear that it invalidates all kinds of TLB
entries, so a stride does not apply there:

---

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index e8561a846754..361b3dde2656 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -66,6 +66,12 @@ static inline void __invlpgb(unsigned long asid, unsigned long pcid,
 	asm volatile(".byte 0x0f, 0x01, 0xfe" :: "a" (rax), "c" (ecx), "d" (edx));
 }
 
+static inline void __invlpgb_all(unsigned long asid, unsigned long pcid,
+				 unsigned long addr, u16 nr_pages, u8 flags)
+{
+	__invlpgb(asid, pcid, addr, nr_pages, 0, flags);
+}
+
 static inline void __tlbsync(void)
 {
 	/*
@@ -84,6 +90,8 @@ static inline void __tlbsync(void)
 static inline void __invlpgb(unsigned long asid, unsigned long pcid,
 			     unsigned long addr, u16 nr_pages,
 			     enum addr_stride s, u8 flags) { }
+static inline void __invlpgb_all(unsigned long asid, unsigned long pcid,
+				 unsigned long addr, u16 nr_pages, u8 flags) { }
 static inline void __tlbsync(void) { }
 #endif
 
@@ -121,7 +129,7 @@ static inline void __invlpgb_flush_user_nr_nosync(unsigned long pcid,
 /* Flush all mappings for a given PCID, not including globals. */
 static inline void __invlpgb_flush_single_pcid_nosync(unsigned long pcid)
 {
-	__invlpgb(0, pcid, 0, 1, PTE_STRIDE, INVLPGB_FLAG_PCID);
+	__invlpgb_all(0, pcid, 0, 1, INVLPGB_FLAG_PCID);
 }
 
 /* Flush all mappings, including globals, for all PCIDs. */
@@ -134,7 +142,7 @@ static inline void invlpgb_flush_all(void)
 	 * as it is cheaper.
 	 */
 	guard(preempt)();
-	__invlpgb(0, 0, 0, 1, PTE_STRIDE, INVLPGB_FLAG_INCLUDE_GLOBAL);
+	__invlpgb_all(0, 0, 0, 1, INVLPGB_FLAG_INCLUDE_GLOBAL);
 	__tlbsync();
 }
 
@@ -148,7 +156,7 @@ static inline void __invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
 static inline void invlpgb_flush_all_nonglobals(void)
 {
 	guard(preempt)();
-	__invlpgb(0, 0, 0, 1, PTE_STRIDE, INVLPGB_MODE_ALL_NONGLOBALS);
+	__invlpgb_all(0, 0, 0, 1, INVLPGB_MODE_ALL_NONGLOBALS);
 	__tlbsync();
 }
 #endif /* _ASM_X86_TLB_H */

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 03/13] x86/mm: add INVLPGB support code
  2025-03-04 16:19             ` Borislav Petkov
@ 2025-03-04 16:57               ` Dave Hansen
  2025-03-04 21:12                 ` Borislav Petkov
  0 siblings, 1 reply; 86+ messages in thread
From: Dave Hansen @ 2025-03-04 16:57 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rik van Riel, x86, linux-kernel, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jackmanb, jannh, mhklinux, andrew.cooper3, Manali.Shukla,
	mingo

On 3/4/25 08:19, Borislav Petkov wrote:
> +static inline void __invlpgb_all(unsigned long asid, unsigned long pcid,
> +				 unsigned long addr, u16 nr_pages, u8 flags)
> +{
> +	__invlpgb(asid, pcid, addr, nr_pages, 0, flags);
> +}

Why would __invlpgb_all() need an 'addr' or 'nr_pages'? Shouldn't those be 0?

It's _better_ of course when it happens at a single site and it's close
to a prototype for __invlpgb(). But it's still a magic '0' that is
impossible to make sense of without looking at the prototype.

Looking at the APM again... there really are three possible values for
ECX[31]:

 0: increment by 4k
 1: increment by 2M
 X: Don't care, no increment is going to happen

What you wrote above could actually be written:

	__invlpgb(asid, pcid, addr, nr_pages, 1, flags);

so the 0/1 is _actually_ completely random and arbitrary as far as the
spec goes.

Why does it matter?

It enables you to do sanity checking. For example, we could actually
enforce a rule that "no stride" can't be paired with any of the
per-address invalidation characteristics:

	if (stride == NO_STRIDE) {
		WARN_ON(flags & INVLPGB_FLAG_VA);
		WARN_ON(addr);
		WARN_ON(nr_pages);
	}

That's impossible if you pass a 'bool' in.

But, honestly, I'm deep into nitpick mode here. I think differentiating
the three cases is worth it, but it's also not the hill I'm going to die
on. ;)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 11/13] x86/mm: do targeted broadcast flushing from tlbbatch code
  2025-03-04 15:33           ` Brendan Jackman
@ 2025-03-04 17:51             ` Dave Hansen
  0 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2025-03-04 17:51 UTC (permalink / raw)
  To: Brendan Jackman, Borislav Petkov
  Cc: Rik van Riel, x86, linux-kernel, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jannh, mhklinux, andrew.cooper3, Manali.Shukla, mingo

On 3/4/25 07:33, Brendan Jackman wrote:
> On Tue, Mar 04, 2025 at 03:11:34PM +0100, Borislav Petkov wrote:
>> On Tue, Mar 04, 2025 at 12:52:47PM +0000, Brendan Jackman wrote:
>>> https://lore.kernel.org/all/CA+i-1C31TrceZiizC_tng_cc-zcvKsfXLAZD_XDftXnp9B2Tdw@mail.gmail.com/
>>
>> Lemme try to understand what you're suggesting on that subthread:
>>
>>> static inline void arch_start_context_switch(struct task_struct *prev)
>>> {
>>>     arch_paravirt_start_context_switch(prev);
>>>     tlb_start_context_switch(prev);
>>> }
>>
>> This kinda makes sense to me...
> 
> Yeah so basically my concern here is that we are doing something
> that's about context switching, but we're doing it in mm-switching
> code, entangling an assumption that "context_switch() must either call
> this function or that function". Whereas if we just call it explicitly
> from context_switch() it will be much clearer.

I was coming to a similar conclusion. All of the nastiness here would
come from an operation like:

	INVLPGB
	=> get scheduled on another CPU
	TLBSYNC

But there's no nastiness with:

	INVLPGB
	=> switch to init_mm
	TLBSYNC

at *all*. Because the TLBSYNC still works just fine. In fact, it *has*
to work just fine because you can get a TLB flush IPI in that window
already.

>>> Now I think about it... if we always tlbsync() before a context switch, is the
>>> cant_migrate() above actually required? I think with that, even if we migrated
>>> in the middle of e.g.  broadcast_kernel_range_flush(), we'd be fine? (At
>>> least, from the specific perspective of the invplgb code, presumably having
>>> preemption on there would break things horribly in other ways).
>>
>> I think we still need it because you need to TLBSYNC on the same CPU you've
>> issued the INVLPGB and actually, you want all TLBs to have been synched
>> system-wide.
>>
>> Or am I misunderstanding it?
> 
> Er, I might be exposing my own ignorance here. I was thinking that you
> always go through context_switch() before you get migrated, but I
> might not understand hwo migration happens.

Let's take a step back. Most of our IPI-based TLB flushes end up in this
code:

        preempt_disable();
        smp_call_function_many_cond(...);
        preempt_enable();

We don't have any fanciness around to keep the initiating thread on the
same CPU or check for pending TLB flushes at context switch time or lazy
tlb entry. We don't send asynchronous IPIs from the tlbbatch code and
then check for completion at batch finish time.

There's a lot of complexity to stuffing these TLB flushes into a
microarchitectural buffer and making *sure* they are flushed out.
INVLPGB is not free. It's not clear at all to me that doing loads of
them in reclaim code is superior to doing loads of:

	inc_mm_tlb_gen(mm);
	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));

and then just zapping the whole TLB on the next context switch.

I think we should defer this for now. Let's look at it again when there
are some performance numbers to justify it.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v14 03/13] x86/mm: add INVLPGB support code
  2025-03-04 16:57               ` Dave Hansen
@ 2025-03-04 21:12                 ` Borislav Petkov
  0 siblings, 0 replies; 86+ messages in thread
From: Borislav Petkov @ 2025-03-04 21:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Rik van Riel, x86, linux-kernel, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jackmanb, jannh, mhklinux, andrew.cooper3, Manali.Shukla,
	mingo

On Tue, Mar 04, 2025 at 08:57:30AM -0800, Dave Hansen wrote:
> Why would __invlpg_all() need an 'addr' or 'nr_pages'? Shouldn't those be 0?

Yap, good idea. It makes the _all helper even better:

static inline void __invlpgb_all(unsigned long asid, unsigned long pcid, u8 flags)
{
        __invlpgb(asid, pcid, 0, 1, 0, flags);
}

> It's _better_ of course when it happens at a single site and it's close
> to a prototype for __invlpgb(). But it's still a magic '0' that it's
> impossible to make sense of without looking at the prototype.

Yes.

> Looking at the APM again... there really are three possible values for
> ECX[31]:
> 
>  0: increment by 4k
>  1: increment by 2M
>  X: Don't care, no increment is going to happen
> 
> What you wrote above could actually be written:
> 
> 	__invlpgb(asid, pcid, addr, nr_pages, 1, flags);
> 
> so the 0/1 is _actually_ completely random and arbitrary as far as the
> spec goes.

Yes.

> Why does it matter?
> 
> It enables you to do sanity checking. For example, we could actually
> enforce a rule that "no stride" can't be paired with any of the
> per-address invalidation characteristics:
> 
> 	if (stride == NO_STRIDE) {
> 		WARN_ON(flags & INVLPGB_FLAG_VA);
> 		WARN_ON(addr);
> 		WARN_ON(nr_pages);
> 	}
> 
> That's impossible if you pass a 'bool' in.
> 
> But, honestly, I'm deep into nitpick mode here. I think differentiating
> the three cases is worth it, but it's also not the hill I'm going to die
> on. ;)

Yap, and now I've massaged it so much that it doesn't really need that
checking. Because I have exactly two calls which use the stride:

1.

static inline void __invlpgb_flush_user_nr_nosync(unsigned long pcid,
                                                  unsigned long addr,
                                                  u16 nr, bool stride)
{
        enum addr_stride str = stride ? PMD_STRIDE : PTE_STRIDE;
        u8 flags = INVLPGB_FLAG_PCID | INVLPGB_FLAG_VA;

        __invlpgb(0, pcid, addr, nr, str, flags);
}

This one is fine - I verify it.

2.

/* Flush addr, including globals, for all PCIDs. */
static inline void __invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
{
	__invlpgb(0, 0, addr, nr, PTE_STRIDE, INVLPGB_FLAG_INCLUDE_GLOBAL);
}

This one controls it already.

So the only case where something could go bad is if someone uses __invlpgb()
directly, and that should hopefully be caught early enough.

But if you really want, I could add sanitization to __invlpgb() to massage it
into the right stride. And print a single warning - the big fat WARN*s in
inline functions are probably too much. Hm, I dunno...

Current diff on top:

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index e8561a846754..8ab21487d6ee 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -66,6 +66,11 @@ static inline void __invlpgb(unsigned long asid, unsigned long pcid,
 	asm volatile(".byte 0x0f, 0x01, 0xfe" :: "a" (rax), "c" (ecx), "d" (edx));
 }
 
+static inline void __invlpgb_all(unsigned long asid, unsigned long pcid, u8 flags)
+{
+	__invlpgb(asid, pcid, 0, 1, 0, flags);
+}
+
 static inline void __tlbsync(void)
 {
 	/*
@@ -84,6 +89,7 @@ static inline void __tlbsync(void)
 static inline void __invlpgb(unsigned long asid, unsigned long pcid,
 			     unsigned long addr, u16 nr_pages,
 			     enum addr_stride s, u8 flags) { }
+static inline void __invlpgb_all(unsigned long asid, unsigned long pcid, u8 flags) { }
 static inline void __tlbsync(void) { }
 #endif
 
@@ -121,7 +127,7 @@ static inline void __invlpgb_flush_user_nr_nosync(unsigned long pcid,
 /* Flush all mappings for a given PCID, not including globals. */
 static inline void __invlpgb_flush_single_pcid_nosync(unsigned long pcid)
 {
-	__invlpgb(0, pcid, 0, 1, PTE_STRIDE, INVLPGB_FLAG_PCID);
+	__invlpgb_all(0, pcid, INVLPGB_FLAG_PCID);
 }
 
 /* Flush all mappings, including globals, for all PCIDs. */
@@ -134,7 +140,7 @@ static inline void invlpgb_flush_all(void)
 	 * as it is cheaper.
 	 */
 	guard(preempt)();
-	__invlpgb(0, 0, 0, 1, PTE_STRIDE, INVLPGB_FLAG_INCLUDE_GLOBAL);
+	__invlpgb_all(0, 0, INVLPGB_FLAG_INCLUDE_GLOBAL);
 	__tlbsync();
 }
 
@@ -148,7 +154,7 @@ static inline void __invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
 static inline void invlpgb_flush_all_nonglobals(void)
 {
 	guard(preempt)();
-	__invlpgb(0, 0, 0, 1, PTE_STRIDE, INVLPGB_MODE_ALL_NONGLOBALS);
+	__invlpgb_all(0, 0, INVLPGB_MODE_ALL_NONGLOBALS);
 	__tlbsync();
 }
 #endif /* _ASM_X86_TLB_H */

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [tip: x86/mm] x86/mm: Always set the ASID valid bit for the INVLPGB instruction
  2025-03-04 12:04 ` [PATCH] x86/mm: Always set the ASID valid bit for the INVLPGB instruction Borislav Petkov
  2025-03-04 12:43   ` Borislav Petkov
@ 2025-03-05 21:36   ` tip-bot2 for Tom Lendacky
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Tom Lendacky
  2 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Tom Lendacky @ 2025-03-05 21:36 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Tom Lendacky, Borislav Petkov (AMD), x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     0896acd80782ec49c6d36e576fcd53786f0a2bfb
Gitweb:        https://git.kernel.org/tip/0896acd80782ec49c6d36e576fcd53786f0a2bfb
Author:        Tom Lendacky <thomas.lendacky@amd.com>
AuthorDate:    Tue, 04 Mar 2025 12:59:56 +01:00
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Wed, 05 Mar 2025 22:18:16 +01:00

x86/mm: Always set the ASID valid bit for the INVLPGB instruction

When executing the INVLPGB instruction on a bare-metal host or hypervisor, if
the ASID valid bit is not set, the instruction will flush the TLB entries that
match the specified criteria for any ASID, not just those of the host. If
virtual machines are running on the system, this may result in inadvertent
flushes of guest TLB entries.

When executing the INVLPGB instruction in a guest and the INVLPGB instruction is
not intercepted by the hypervisor, the hardware will replace the requested ASID
with the guest ASID and set the ASID valid bit before doing the broadcast
invalidation. Thus a guest is only able to flush its own TLB entries.

So to limit the host TLB flushing reach, always set the ASID valid bit using an
ASID value of 0 (which represents the host/hypervisor). This will result in
the desired effect in both host and guest.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250304120449.GHZ8bsYYyEBOKQIxBm@fat_crate.local
---
 arch/x86/include/asm/tlb.h | 58 ++++++++++++++++++++-----------------
 1 file changed, 32 insertions(+), 26 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 31f6db4..866ea78 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -33,6 +33,27 @@ enum addr_stride {
 	PMD_STRIDE = 1
 };
 
+/*
+ * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination
+ * of the three. For example:
+ * - FLAG_VA | FLAG_INCLUDE_GLOBAL: invalidate all TLB entries at the address
+ * - FLAG_PCID:			    invalidate all TLB entries matching the PCID
+ *
+ * The first is used to invalidate (kernel) mappings at a particular
+ * address across all processes.
+ *
+ * The latter invalidates all TLB entries matching a PCID.
+ */
+#define INVLPGB_FLAG_VA			BIT(0)
+#define INVLPGB_FLAG_PCID		BIT(1)
+#define INVLPGB_FLAG_ASID		BIT(2)
+#define INVLPGB_FLAG_INCLUDE_GLOBAL	BIT(3)
+#define INVLPGB_FLAG_FINAL_ONLY		BIT(4)
+#define INVLPGB_FLAG_INCLUDE_NESTED	BIT(5)
+
+/* The implied mode when all bits are clear: */
+#define INVLPGB_MODE_ALL_NONGLOBALS	0UL
+
 #ifdef CONFIG_BROADCAST_TLB_FLUSH
 /*
  * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
@@ -40,14 +61,20 @@ enum addr_stride {
  * The INVLPGB instruction is weakly ordered, and a batch of invalidations can
  * be done in a parallel fashion.
  *
- * The instruction takes the number of extra pages to invalidate, beyond
- * the first page, while __invlpgb gets the more human readable number of
- * pages to invalidate.
+ * The instruction takes the number of extra pages to invalidate, beyond the
+ * first page, while __invlpgb gets the more human readable number of pages to
+ * invalidate.
  *
  * The bits in rax[0:2] determine respectively which components of the address
  * (VA, PCID, ASID) get compared when flushing. If neither bits are set, *any*
  * address in the specified range matches.
  *
+ * Since it is desired to only flush TLB entries for the ASID that is executing
+ * the instruction (a host/hypervisor or a guest), the ASID valid bit should
+ * always be set. On a host/hypervisor, the hardware will use the ASID value
+ * specified in EDX[15:0] (which should be 0). On a guest, the hardware will
+ * use the actual ASID value of the guest.
+ *
  * TLBSYNC is used to ensure that pending INVLPGB invalidations initiated from
  * this CPU have completed.
  */
@@ -55,9 +82,9 @@ static inline void __invlpgb(unsigned long asid, unsigned long pcid,
 			     unsigned long addr, u16 nr_pages,
 			     enum addr_stride stride, u8 flags)
 {
-	u32 edx = (pcid << 16) | asid;
+	u64 rax = addr | flags | INVLPGB_FLAG_ASID;
 	u32 ecx = (stride << 31) | (nr_pages - 1);
-	u64 rax = addr | flags;
+	u32 edx = (pcid << 16) | asid;
 
 	/* The low bits in rax are for flags. Verify addr is clean. */
 	VM_WARN_ON_ONCE(addr & ~PAGE_MASK);
@@ -93,27 +120,6 @@ static inline void __invlpgb_all(unsigned long asid, unsigned long pcid, u8 flag
 static inline void __tlbsync(void) { }
 #endif
 
-/*
- * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination
- * of the three. For example:
- * - FLAG_VA | FLAG_INCLUDE_GLOBAL: invalidate all TLB entries at the address
- * - FLAG_PCID:			    invalidate all TLB entries matching the PCID
- *
- * The first is used to invalidate (kernel) mappings at a particular
- * address across all processes.
- *
- * The latter invalidates all TLB entries matching a PCID.
- */
-#define INVLPGB_FLAG_VA			BIT(0)
-#define INVLPGB_FLAG_PCID		BIT(1)
-#define INVLPGB_FLAG_ASID		BIT(2)
-#define INVLPGB_FLAG_INCLUDE_GLOBAL	BIT(3)
-#define INVLPGB_FLAG_FINAL_ONLY		BIT(4)
-#define INVLPGB_FLAG_INCLUDE_NESTED	BIT(5)
-
-/* The implied mode when all bits are clear: */
-#define INVLPGB_MODE_ALL_NONGLOBALS	0UL
-
 static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
 						unsigned long addr,
 						u16 nr, bool stride)

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [tip: x86/mm] x86/mm: Enable AMD translation cache extensions
  2025-02-26  3:00 ` [PATCH v14 12/13] x86/mm: enable AMD translation cache extensions Rik van Riel
@ 2025-03-05 21:36   ` tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
  1 sibling, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-05 21:36 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Rik van Riel, Borislav Petkov (AMD), x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     b95ef870dcd67c0fbfc57f808c4a0efd0bc2f144
Gitweb:        https://git.kernel.org/tip/b95ef870dcd67c0fbfc57f808c4a0efd0bc2f144
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 25 Feb 2025 22:00:47 -05:00
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Wed, 05 Mar 2025 22:18:16 +01:00

x86/mm: Enable AMD translation cache extensions

With AMD TCE (translation cache extensions) only the intermediate mappings
that cover the address range zapped by INVLPG / INVLPGB get invalidated,
rather than all intermediate mappings getting zapped at every TLB invalidation.

This can help reduce the TLB miss rate, by keeping more intermediate mappings
in the cache.

From the AMD manual:

Translation Cache Extension (TCE) Bit. Bit 15, read/write. Setting this bit to
1 changes how the INVLPG, INVLPGB, and INVPCID instructions operate on TLB
entries. When this bit is 0, these instructions remove the target PTE from the
TLB as well as all upper-level table entries that are cached in the TLB,
whether or not they are associated with the target PTE.  When this bit is set,
these instructions will remove the target PTE and only those upper-level
entries that lead to the target PTE in the page table hierarchy, leaving
unrelated upper-level entries intact.

  [ bp: use cpu_has()... I know, it is a mess. ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250226030129.530345-13-riel@surriel.com
---
 arch/x86/include/asm/msr-index.h       | 2 ++
 arch/x86/kernel/cpu/amd.c              | 4 ++++
 tools/arch/x86/include/asm/msr-index.h | 2 ++
 3 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 72765b2..1aacd6b 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -25,6 +25,7 @@
 #define _EFER_SVME		12 /* Enable virtualization */
 #define _EFER_LMSLE		13 /* Long Mode Segment Limit Enable */
 #define _EFER_FFXSR		14 /* Enable Fast FXSAVE/FXRSTOR */
+#define _EFER_TCE		15 /* Enable Translation Cache Extensions */
 #define _EFER_AUTOIBRS		21 /* Enable Automatic IBRS */
 
 #define EFER_SCE		(1<<_EFER_SCE)
@@ -34,6 +35,7 @@
 #define EFER_SVME		(1<<_EFER_SVME)
 #define EFER_LMSLE		(1<<_EFER_LMSLE)
 #define EFER_FFXSR		(1<<_EFER_FFXSR)
+#define EFER_TCE		(1<<_EFER_TCE)
 #define EFER_AUTOIBRS		(1<<_EFER_AUTOIBRS)
 
 /*
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 7a72ef4..7058533 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1075,6 +1075,10 @@ static void init_amd(struct cpuinfo_x86 *c)
 
 	/* AMD CPUs don't need fencing after x2APIC/TSC_DEADLINE MSR writes. */
 	clear_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
+
+	/* Enable Translation Cache Extension */
+	if (cpu_has(c, X86_FEATURE_TCE))
+		msr_set_bit(MSR_EFER, _EFER_TCE);
 }
 
 #ifdef CONFIG_X86_32
diff --git a/tools/arch/x86/include/asm/msr-index.h b/tools/arch/x86/include/asm/msr-index.h
index 3ae84c3..dc1c105 100644
--- a/tools/arch/x86/include/asm/msr-index.h
+++ b/tools/arch/x86/include/asm/msr-index.h
@@ -25,6 +25,7 @@
 #define _EFER_SVME		12 /* Enable virtualization */
 #define _EFER_LMSLE		13 /* Long Mode Segment Limit Enable */
 #define _EFER_FFXSR		14 /* Enable Fast FXSAVE/FXRSTOR */
+#define _EFER_TCE		15 /* Enable Translation Cache Extensions */
 #define _EFER_AUTOIBRS		21 /* Enable Automatic IBRS */
 
 #define EFER_SCE		(1<<_EFER_SCE)
@@ -34,6 +35,7 @@
 #define EFER_SVME		(1<<_EFER_SVME)
 #define EFER_LMSLE		(1<<_EFER_LMSLE)
 #define EFER_FFXSR		(1<<_EFER_FFXSR)
+#define EFER_TCE		(1<<_EFER_TCE)
 #define EFER_AUTOIBRS		(1<<_EFER_AUTOIBRS)
 
 /*


* [tip: x86/mm] x86/mm: Enable broadcast TLB invalidation for multi-threaded processes
  2025-02-26  3:00 ` [PATCH v14 10/13] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
  2025-03-03 10:57   ` Borislav Petkov
@ 2025-03-05 21:36   ` tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
  2 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-05 21:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Rik van Riel, Borislav Petkov (AMD), Nadav Amit, x86,
	linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     f37ff575b1fe47ff3eeccf751455b4ab05306008
Gitweb:        https://git.kernel.org/tip/f37ff575b1fe47ff3eeccf751455b4ab05306008
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 25 Feb 2025 22:00:45 -05:00
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Wed, 05 Mar 2025 22:18:10 +01:00

x86/mm: Enable broadcast TLB invalidation for multi-threaded processes

There is not enough room in the 12-bit ASID address space to hand out
broadcast ASIDs to every process. Only hand out broadcast ASIDs to processes
when they are observed to be simultaneously running on 4 or more CPUs.

This also allows single-threaded processes to continue using the cheaper, local
TLB invalidation instructions like INVLPG.
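
For reference, the shortage is easy to quantify; a throwaway sketch (names are
illustrative, the 4k/2k figures match the comment added to tlb.c below):

  #include <stdio.h>

  #define PCID_BITS	12			/* width of the CR3 PCID field */
  #define NR_PCIDS	(1 << PCID_BITS)	/* 4096 ASIDs */
  #define NR_ASIDS_KPTI	(NR_PCIDS / 2)		/* each mm needs a kernel and a user PCID */

  int main(void)
  {
  	/* Far fewer than the 8k+ CPUs on the largest x86 systems. */
  	printf("ASIDs: %d without KPTI, %d with KPTI\n", NR_PCIDS, NR_ASIDS_KPTI);
  	return 0;
  }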

Due to the structure of flush_tlb_mm_range(), the INVLPGB flushing is done in
a generically named broadcast_tlb_flush() function which can later also be
used for Intel RAR.

Combined with the removal of unnecessary lru_add_drain() calls (see
https://lore.kernel.org/r/20241219153253.3da9e8aa@fangorn) this results in
a nice performance boost for the will-it-scale tlb_flush2_threads test on an
AMD Milan system with 36 cores:

  - vanilla kernel:           527k loops/second
  - lru_add_drain removal:    731k loops/second
  - only INVLPGB:             527k loops/second
  - lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that, while TLB invalidation
went down from 40% of the total CPU time to only around 4% of CPU time, the
contention simply moved to the LRU lock.

Fixing both at the same time roughly doubles the number of iterations per
second for this case.

Comparing will-it-scale tlb_flush2_threads with several different numbers of
threads on a 72 CPU AMD Milan shows similar results. The number represents the
total number of loops per second across all the threads:

  threads	tip		INVLPGB

  1		315k		304k
  2		423k		424k
  4		644k		1032k
  8		652k		1267k
  16		737k		1368k
  32		759k		1199k
  64		636k		1094k
  72		609k		993k

1 and 2 thread performance is similar with and without INVLPGB, because
INVLPGB is only used on processes using 4 or more CPUs simultaneously.

The number is the median across 5 runs.

Some numbers closer to real world performance can be found at Phoronix, thanks
to Michael:

https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits

  [ bp:
   - Massage
   - :%s/\<static_cpu_has\>/cpu_feature_enabled/cgi
   - :%s/\<clear_asid_transition\>/mm_clear_asid_transition/cgi
   - Fold in a 0day bot fix: https://lore.kernel.org/oe-kbuild-all/202503040000.GtiWUsBm-lkp@intel.com
   ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Link: https://lore.kernel.org/r/20250226030129.530345-11-riel@surriel.com
---
 arch/x86/include/asm/tlbflush.h |   6 ++-
 arch/x86/mm/tlb.c               | 104 ++++++++++++++++++++++++++++++-
 2 files changed, 109 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index e6c3be0..7cad283 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -280,6 +280,11 @@ static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
 	smp_store_release(&mm->context.global_asid, asid);
 }
 
+static inline void mm_clear_asid_transition(struct mm_struct *mm)
+{
+	WRITE_ONCE(mm->context.asid_transition, false);
+}
+
 static inline bool mm_in_asid_transition(struct mm_struct *mm)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
@@ -291,6 +296,7 @@ static inline bool mm_in_asid_transition(struct mm_struct *mm)
 static inline u16 mm_global_asid(struct mm_struct *mm) { return 0; }
 static inline void mm_init_global_asid(struct mm_struct *mm) { }
 static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid) { }
+static inline void mm_clear_asid_transition(struct mm_struct *mm) { }
 static inline bool mm_in_asid_transition(struct mm_struct *mm) { return false; }
 #endif /* CONFIG_BROADCAST_TLB_FLUSH */
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b5681e6..0efd990 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -431,6 +431,105 @@ static bool mm_needs_global_asid(struct mm_struct *mm, u16 asid)
 }
 
 /*
+ * x86 has 4k ASIDs (2k when compiled with KPTI), but the largest x86
+ * systems have over 8k CPUs. Because of this potential ASID shortage,
+ * global ASIDs are handed out to processes that have frequent TLB
+ * flushes and are active on 4 or more CPUs simultaneously.
+ */
+static void consider_global_asid(struct mm_struct *mm)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return;
+
+	/* Check every once in a while. */
+	if ((current->pid & 0x1f) != (jiffies & 0x1f))
+		return;
+
+	/*
+	 * Assign a global ASID if the process is active on
+	 * 4 or more CPUs simultaneously.
+	 */
+	if (mm_active_cpus_exceeds(mm, 3))
+		use_global_asid(mm);
+}
+
+static void finish_asid_transition(struct flush_tlb_info *info)
+{
+	struct mm_struct *mm = info->mm;
+	int bc_asid = mm_global_asid(mm);
+	int cpu;
+
+	if (!mm_in_asid_transition(mm))
+		return;
+
+	for_each_cpu(cpu, mm_cpumask(mm)) {
+		/*
+		 * The remote CPU is context switching. Wait for that to
+		 * finish, to catch the unlikely case of it switching to
+		 * the target mm with an out of date ASID.
+		 */
+		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
+			cpu_relax();
+
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
+			continue;
+
+		/*
+		 * If at least one CPU is not using the global ASID yet,
+		 * send a TLB flush IPI. The IPI should cause stragglers
+		 * to transition soon.
+		 *
+		 * This can race with the CPU switching to another task;
+		 * that results in a (harmless) extra IPI.
+		 */
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
+			flush_tlb_multi(mm_cpumask(info->mm), info);
+			return;
+		}
+	}
+
+	/* All the CPUs running this process are using the global ASID. */
+	mm_clear_asid_transition(mm);
+}
+
+static void broadcast_tlb_flush(struct flush_tlb_info *info)
+{
+	bool pmd = info->stride_shift == PMD_SHIFT;
+	unsigned long asid = mm_global_asid(info->mm);
+	unsigned long addr = info->start;
+
+	/*
+	 * TLB flushes with INVLPGB are kicked off asynchronously.
+	 * The inc_mm_tlb_gen() guarantees page table updates are done
+	 * before these TLB flushes happen.
+	 */
+	if (info->end == TLB_FLUSH_ALL) {
+		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
+		/* Do any CPUs supporting INVLPGB need PTI? */
+		if (cpu_feature_enabled(X86_FEATURE_PTI))
+			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
+	} else do {
+		unsigned long nr = 1;
+
+		if (info->stride_shift <= PMD_SHIFT) {
+			nr = (info->end - addr) >> info->stride_shift;
+			nr = clamp_val(nr, 1, invlpgb_count_max);
+		}
+
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
+		if (cpu_feature_enabled(X86_FEATURE_PTI))
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
+
+		addr += nr << info->stride_shift;
+	} while (addr < info->end);
+
+	finish_asid_transition(info);
+
+	/* Wait for the INVLPGBs kicked off above to finish. */
+	__tlbsync();
+}
+
+/*
  * Given an ASID, flush the corresponding user ASID.  We can delay this
  * until the next time we switch to it.
  *
@@ -1260,9 +1359,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+	if (mm_global_asid(mm)) {
+		broadcast_tlb_flush(info);
+	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
 		info->trim_cpumask = should_trim_cpumask(mm);
 		flush_tlb_multi(mm_cpumask(mm), info);
+		consider_global_asid(mm);
 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();


* [tip: x86/mm] x86/mm: Add global ASID process exit helpers
  2025-02-26  3:00 ` [PATCH v14 09/13] x86/mm: global ASID process exit helpers Rik van Riel
  2025-03-02 12:38   ` Borislav Petkov
@ 2025-03-05 21:36   ` tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
  2 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-05 21:36 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Rik van Riel, Borislav Petkov (AMD), x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     26a3633bfddfcdd57f5f2e3d157806d68aff5ac4
Gitweb:        https://git.kernel.org/tip/26a3633bfddfcdd57f5f2e3d157806d68aff5ac4
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 25 Feb 2025 22:00:44 -05:00
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Wed, 05 Mar 2025 17:19:53 +01:00

x86/mm: Add global ASID process exit helpers

A global ASID is allocated for the lifetime of a process. Free the global ASID
at process exit time.

  [ bp: Massage, create helpers, hide details inside them. ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250226030129.530345-10-riel@surriel.com
---
 arch/x86/include/asm/mmu_context.h |  8 +++++++-
 arch/x86/include/asm/tlbflush.h    |  9 +++++++++
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index a2c70e4..2398058 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -2,7 +2,6 @@
 #ifndef _ASM_X86_MMU_CONTEXT_H
 #define _ASM_X86_MMU_CONTEXT_H
 
-#include <asm/desc.h>
 #include <linux/atomic.h>
 #include <linux/mm_types.h>
 #include <linux/pkeys.h>
@@ -13,6 +12,7 @@
 #include <asm/paravirt.h>
 #include <asm/debugreg.h>
 #include <asm/gsseg.h>
+#include <asm/desc.h>
 
 extern atomic64_t last_mm_ctx_id;
 
@@ -139,6 +139,9 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
 #define enter_lazy_tlb enter_lazy_tlb
 extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
+#define mm_init_global_asid mm_init_global_asid
+extern void mm_init_global_asid(struct mm_struct *mm);
+
 extern void mm_free_global_asid(struct mm_struct *mm);
 
 /*
@@ -163,6 +166,8 @@ static inline int init_new_context(struct task_struct *tsk,
 		mm->context.execute_only_pkey = -1;
 	}
 #endif
+
+	mm_init_global_asid(mm);
 	mm_reset_untag_mask(mm);
 	init_new_context_ldt(mm);
 	return 0;
@@ -172,6 +177,7 @@ static inline int init_new_context(struct task_struct *tsk,
 static inline void destroy_context(struct mm_struct *mm)
 {
 	destroy_context_ldt(mm);
+	mm_free_global_asid(mm);
 }
 
 extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 1f61a39..e6c3be0 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -261,6 +261,14 @@ static inline u16 mm_global_asid(struct mm_struct *mm)
 	return asid;
 }
 
+static inline void mm_init_global_asid(struct mm_struct *mm)
+{
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+		mm->context.global_asid = 0;
+		mm->context.asid_transition = false;
+	}
+}
+
 static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
 {
 	/*
@@ -281,6 +289,7 @@ static inline bool mm_in_asid_transition(struct mm_struct *mm)
 }
 #else
 static inline u16 mm_global_asid(struct mm_struct *mm) { return 0; }
+static inline void mm_init_global_asid(struct mm_struct *mm) { }
 static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid) { }
 static inline bool mm_in_asid_transition(struct mm_struct *mm) { return false; }
 #endif /* CONFIG_BROADCAST_TLB_FLUSH */


* [tip: x86/mm] x86/mm: Handle global ASID context switch and TLB flush
  2025-02-26  3:00 ` [PATCH v14 08/13] x86/mm: global ASID context switch & TLB flush handling Rik van Riel
  2025-03-02  7:58   ` Borislav Petkov
@ 2025-03-05 21:36   ` tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
  2 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-05 21:36 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Rik van Riel, Borislav Petkov (AMD), x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     aac0a3aefb99eeaa26c6ddbd4fa208a794648a36
Gitweb:        https://git.kernel.org/tip/aac0a3aefb99eeaa26c6ddbd4fa208a794648a36
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 25 Feb 2025 22:00:43 -05:00
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Wed, 05 Mar 2025 17:19:53 +01:00

x86/mm: Handle global ASID context switch and TLB flush

Add context switch and TLB flush support for processes that use a global
ASID, and thus the same PCID, across all CPUs.

At both context switch time and TLB flush time, check whether the task is
switching to a global ASID and, if so, reload the TLB with the new ASID as
appropriate.

In both code paths, the TLB flush is avoided if a global ASID is used, because
the global ASIDs are always kept up to date across CPUs, even when the
process is not running on a CPU.
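
The core of that decision can be modelled in a few lines of standalone C. This
is only a simplified sketch, not the patch's code; the constant and helper
names are illustrative stand-ins for the real choose_new_asid() change below:

  #include <stdbool.h>
  #include <stdio.h>

  #define NR_DYN_ASIDS	6	/* illustrative small per-CPU dynamic pool */

  static bool is_global_asid(int asid)
  {
  	return asid >= NR_DYN_ASIDS;
  }

  static void pick_asid(int global_asid, int *new_asid, bool *need_flush)
  {
  	if (global_asid) {
  		/* INVLPGB keeps global ASIDs coherent: no flush needed. */
  		*new_asid = global_asid;
  		*need_flush = false;
  		return;
  	}
  	/* Otherwise fall back to a per-CPU dynamic ASID, flushing if stale. */
  	*new_asid = 1;
  	*need_flush = true;
  }

  int main(void)
  {
  	int asid;
  	bool flush;

  	pick_asid(100, &asid, &flush);
  	printf("global mm:  asid=%d flush=%d global=%d\n", asid, flush, is_global_asid(asid));
  	pick_asid(0, &asid, &flush);
  	printf("dynamic mm: asid=%d flush=%d global=%d\n", asid, flush, is_global_asid(asid));
  	return 0;
  }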

  [ bp:
   - Massage
   - :%s/\<static_cpu_has\>/cpu_feature_enabled/cgi
  ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250226030129.530345-9-riel@surriel.com
---
 arch/x86/include/asm/tlbflush.h | 14 ++++++-
 arch/x86/mm/tlb.c               | 77 +++++++++++++++++++++++++++++---
 2 files changed, 84 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index f7b374b..1f61a39 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -240,6 +240,11 @@ static inline bool is_dyn_asid(u16 asid)
 	return asid < TLB_NR_DYN_ASIDS;
 }
 
+static inline bool is_global_asid(u16 asid)
+{
+	return !is_dyn_asid(asid);
+}
+
 #ifdef CONFIG_BROADCAST_TLB_FLUSH
 static inline u16 mm_global_asid(struct mm_struct *mm)
 {
@@ -266,9 +271,18 @@ static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
 	mm->context.asid_transition = true;
 	smp_store_release(&mm->context.global_asid, asid);
 }
+
+static inline bool mm_in_asid_transition(struct mm_struct *mm)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return false;
+
+	return mm && READ_ONCE(mm->context.asid_transition);
+}
 #else
 static inline u16 mm_global_asid(struct mm_struct *mm) { return 0; }
 static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid) { }
+static inline bool mm_in_asid_transition(struct mm_struct *mm) { return false; }
 #endif /* CONFIG_BROADCAST_TLB_FLUSH */
 
 #ifdef CONFIG_PARAVIRT
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6c24d96..b5681e6 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -227,6 +227,20 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 		return;
 	}
 
+	/*
+	 * TLB consistency for global ASIDs is maintained with hardware assisted
+	 * remote TLB flushing. Global ASIDs are always up to date.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+		u16 global_asid = mm_global_asid(next);
+
+		if (global_asid) {
+			*new_asid = global_asid;
+			*need_flush = false;
+			return;
+		}
+	}
+
 	if (this_cpu_read(cpu_tlbstate.invalidate_other))
 		clear_asid_other();
 
@@ -400,6 +414,23 @@ void mm_free_global_asid(struct mm_struct *mm)
 }
 
 /*
+ * Is the mm transitioning from a CPU-local ASID to a global ASID?
+ */
+static bool mm_needs_global_asid(struct mm_struct *mm, u16 asid)
+{
+	u16 global_asid = mm_global_asid(mm);
+
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return false;
+
+	/* Process is transitioning to a global ASID */
+	if (global_asid && asid != global_asid)
+		return true;
+
+	return false;
+}
+
+/*
  * Given an ASID, flush the corresponding user ASID.  We can delay this
  * until the next time we switch to it.
  *
@@ -704,7 +735,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 	 */
 	if (prev == next) {
 		/* Not actually switching mm's */
-		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
+		VM_WARN_ON(is_dyn_asid(prev_asid) &&
+			   this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
 			   next->context.ctx_id);
 
 		/*
@@ -721,6 +753,20 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
 			cpumask_set_cpu(cpu, mm_cpumask(next));
 
+		/* Check if the current mm is transitioning to a global ASID */
+		if (mm_needs_global_asid(next, prev_asid)) {
+			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+			choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
+			goto reload_tlb;
+		}
+
+		/*
+		 * Broadcast TLB invalidation keeps this ASID up to date
+		 * all the time.
+		 */
+		if (is_global_asid(prev_asid))
+			return;
+
 		/*
 		 * If the CPU is not in lazy TLB mode, we are just switching
 		 * from one thread in a process to another thread in the same
@@ -755,6 +801,13 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		cond_mitigation(tsk);
 
 		/*
+		 * Let nmi_uaccess_okay() and finish_asid_transition()
+		 * know that CR3 is changing.
+		 */
+		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
+		barrier();
+
+		/*
 		 * Leave this CPU in prev's mm_cpumask. Atomic writes to
 		 * mm_cpumask can be expensive under contention. The CPU
 		 * will be removed lazily at TLB flush time.
@@ -768,14 +821,12 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 
 		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
-
-		/* Let nmi_uaccess_okay() know that we're changing CR3. */
-		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
-		barrier();
 	}
 
+reload_tlb:
 	new_lam = mm_lam_cr3_mask(next);
 	if (need_flush) {
+		VM_WARN_ON_ONCE(is_global_asid(new_asid));
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
 		load_new_mm_cr3(next->pgd, new_asid, new_lam, true);
@@ -894,7 +945,7 @@ static void flush_tlb_func(void *info)
 	const struct flush_tlb_info *f = info;
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+	u64 local_tlb_gen;
 	bool local = smp_processor_id() == f->initiating_cpu;
 	unsigned long nr_invalidate = 0;
 	u64 mm_tlb_gen;
@@ -917,6 +968,16 @@ static void flush_tlb_func(void *info)
 	if (unlikely(loaded_mm == &init_mm))
 		return;
 
+	/* Reload the ASID if transitioning into or out of a global ASID */
+	if (mm_needs_global_asid(loaded_mm, loaded_mm_asid)) {
+		switch_mm_irqs_off(NULL, loaded_mm, NULL);
+		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	}
+
+	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
+	if (is_global_asid(loaded_mm_asid))
+		return;
+
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
@@ -934,6 +995,8 @@ static void flush_tlb_func(void *info)
 		return;
 	}
 
+	local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+
 	if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
 		     f->new_tlb_gen <= local_tlb_gen)) {
 		/*
@@ -1101,7 +1164,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
 	 * up on the new contents of what used to be page tables, while
 	 * doing a speculative memory access.
 	 */
-	if (info->freed_tables)
+	if (info->freed_tables || mm_in_asid_transition(info->mm))
 		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
 	else
 		on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,


* [tip: x86/mm] x86/mm: Use broadcast TLB flushing in page reclaim
  2025-02-26  3:00 ` [PATCH v14 06/13] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
  2025-02-28 18:57   ` Borislav Petkov
@ 2025-03-05 21:36   ` tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
  2 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-05 21:36 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Rik van Riel, Borislav Petkov (AMD), x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     0b3a2f246a5bf016b80a136a9f8c24d886107dc3
Gitweb:        https://git.kernel.org/tip/0b3a2f246a5bf016b80a136a9f8c24d886107dc3
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 25 Feb 2025 22:00:41 -05:00
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Wed, 05 Mar 2025 17:19:52 +01:00

x86/mm: Use broadcast TLB flushing in page reclaim

Page reclaim tracks only the CPU(s) where the TLB needs to be flushed, rather
than all the individual mappings that may be getting invalidated.

Use broadcast TLB flushing when that is available.

  [ bp: Massage commit message. ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250226030129.530345-7-riel@surriel.com
---
 arch/x86/mm/tlb.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 8cd084b..76b4a88 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1320,7 +1320,9 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+		invlpgb_flush_all_nonglobals();
+	} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
 		flush_tlb_multi(&batch->cpumask, info);
 	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();


* [tip: x86/mm] x86/mm: Add global ASID allocation helper functions
  2025-02-26  3:00 ` [PATCH v14 07/13] x86/mm: add global ASID allocation helper functions Rik van Riel
  2025-03-02  7:06   ` Borislav Petkov
@ 2025-03-05 21:36   ` tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
  2 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-05 21:36 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Rik van Riel, Borislav Petkov (AMD), x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     6a7a65ddfd6d78ca2f075b0fe0a582792081fc03
Gitweb:        https://git.kernel.org/tip/6a7a65ddfd6d78ca2f075b0fe0a582792081fc03
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 25 Feb 2025 22:00:42 -05:00
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Wed, 05 Mar 2025 17:19:52 +01:00

x86/mm: Add global ASID allocation helper functions

Add functions to manage global ASID space. Multithreaded processes that are
simultaneously active on 4 or more CPUs can get a global ASID, resulting in the
same PCID being used for that process on every CPU.

This in turn will allow the kernel to use hardware-assisted TLB flushing
through AMD INVLPGB or Intel RAR for these processes.
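
The lazy-reuse scheme implemented by the helpers below, where freed ASIDs only
become allocatable again after the rollover flush, can be modelled in
standalone C. A toy sketch with invented names and sizes:

  #include <stdio.h>

  #define NR_ASIDS 16			/* toy-sized ASID space */
  static unsigned char used[NR_ASIDS], freed[NR_ASIDS];
  static int next_asid = 1;

  static int alloc_asid(void)
  {
  	if (next_asid >= NR_ASIDS) {
  		/* Rollover: one broadcast flush, then recycle freed ASIDs. */
  		puts("invlpgb_flush_all_nonglobals()   /* modelled */");
  		for (int i = 0; i < NR_ASIDS; i++)
  			if (freed[i])
  				used[i] = freed[i] = 0;
  		next_asid = 1;
  	}
  	while (next_asid < NR_ASIDS && used[next_asid])
  		next_asid++;
  	if (next_asid >= NR_ASIDS)
  		return 0;		/* out of ASIDs */
  	used[next_asid] = 1;
  	return next_asid++;
  }

  static void free_asid(int asid)
  {
  	freed[asid] = 1;		/* reuse is deferred until rollover */
  }

  int main(void)
  {
  	int a = alloc_asid();		/* gets 1 */

  	free_asid(a);			/* not reusable yet */
  	for (int i = 2; i < NR_ASIDS; i++)
  		alloc_asid();		/* exhaust 2..15 */

  	/* The next allocation triggers the flush and hands ASID 1 back out. */
  	printf("recycled: %d\n", alloc_asid());
  	return 0;
  }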

  [ bp:
   - Extend use_global_asid() comment
   - s/X86_BROADCAST_TLB_FLUSH/BROADCAST_TLB_FLUSH/g
   - other touchups ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250226030129.530345-8-riel@surriel.com
---
 arch/x86/include/asm/mmu.h         |  12 ++-
 arch/x86/include/asm/mmu_context.h |   2 +-
 arch/x86/include/asm/tlbflush.h    |  37 +++++++-
 arch/x86/mm/tlb.c                  | 154 +++++++++++++++++++++++++++-
 4 files changed, 202 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 3b496cd..8b8055a 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -69,6 +69,18 @@ typedef struct {
 	u16 pkey_allocation_map;
 	s16 execute_only_pkey;
 #endif
+
+#ifdef CONFIG_BROADCAST_TLB_FLUSH
+	/*
+	 * The global ASID will be a non-zero value when the process has
+	 * the same ASID across all CPUs, allowing it to make use of
+	 * hardware-assisted remote TLB invalidation like AMD INVLPGB.
+	 */
+	u16 global_asid;
+
+	/* The process is transitioning to a new global ASID number. */
+	bool asid_transition;
+#endif
 } mm_context_t;
 
 #define INIT_MM_CONTEXT(mm)						\
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 795fdd5..a2c70e4 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -139,6 +139,8 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
 #define enter_lazy_tlb enter_lazy_tlb
 extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
+extern void mm_free_global_asid(struct mm_struct *mm);
+
 /*
  * Init a new mm.  Used on mm copies, like at fork()
  * and on mm's that are brand-new, like at execve().
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 855c13d..f7b374b 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -6,6 +6,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/sched.h>
 
+#include <asm/barrier.h>
 #include <asm/processor.h>
 #include <asm/cpufeature.h>
 #include <asm/special_insns.h>
@@ -234,6 +235,42 @@ void flush_tlb_one_kernel(unsigned long addr);
 void flush_tlb_multi(const struct cpumask *cpumask,
 		      const struct flush_tlb_info *info);
 
+static inline bool is_dyn_asid(u16 asid)
+{
+	return asid < TLB_NR_DYN_ASIDS;
+}
+
+#ifdef CONFIG_BROADCAST_TLB_FLUSH
+static inline u16 mm_global_asid(struct mm_struct *mm)
+{
+	u16 asid;
+
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return 0;
+
+	asid = smp_load_acquire(&mm->context.global_asid);
+
+	/* mm->context.global_asid is either 0, or a global ASID */
+	VM_WARN_ON_ONCE(asid && is_dyn_asid(asid));
+
+	return asid;
+}
+
+static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
+{
+	/*
+	 * Notably flush_tlb_mm_range() -> broadcast_tlb_flush() ->
+	 * finish_asid_transition() needs to observe asid_transition = true
+	 * once it observes global_asid.
+	 */
+	mm->context.asid_transition = true;
+	smp_store_release(&mm->context.global_asid, asid);
+}
+#else
+static inline u16 mm_global_asid(struct mm_struct *mm) { return 0; }
+static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid) { }
+#endif /* CONFIG_BROADCAST_TLB_FLUSH */
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 76b4a88..6c24d96 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -74,13 +74,15 @@
  * use different names for each of them:
  *
  * ASID  - [0, TLB_NR_DYN_ASIDS-1]
- *         the canonical identifier for an mm
+ *         the canonical identifier for an mm, dynamically allocated on each CPU
+ *         [TLB_NR_DYN_ASIDS, MAX_ASID_AVAILABLE-1]
+ *         the canonical, global identifier for an mm, identical across all CPUs
  *
- * kPCID - [1, TLB_NR_DYN_ASIDS]
+ * kPCID - [1, MAX_ASID_AVAILABLE]
  *         the value we write into the PCID part of CR3; corresponds to the
  *         ASID+1, because PCID 0 is special.
  *
- * uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
+ * uPCID - [2048 + 1, 2048 + MAX_ASID_AVAILABLE]
  *         for KPTI each mm has two address spaces and thus needs two
  *         PCID values, but we can still do with a single ASID denomination
  *         for each mm. Corresponds to kPCID + 2048.
@@ -252,6 +254,152 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 }
 
 /*
+ * Global ASIDs are allocated for multi-threaded processes that are
+ * active on multiple CPUs simultaneously, giving each of those
+ * processes the same PCID on every CPU, for use with hardware-assisted
+ * TLB shootdown on remote CPUs, like AMD INVLPGB or Intel RAR.
+ *
+ * These global ASIDs are held for the lifetime of the process.
+ */
+static DEFINE_RAW_SPINLOCK(global_asid_lock);
+static u16 last_global_asid = MAX_ASID_AVAILABLE;
+static DECLARE_BITMAP(global_asid_used, MAX_ASID_AVAILABLE);
+static DECLARE_BITMAP(global_asid_freed, MAX_ASID_AVAILABLE);
+static int global_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
+
+/*
+ * When the search for a free ASID in the global ASID space reaches
+ * MAX_ASID_AVAILABLE, a global TLB flush guarantees that previously
+ * freed global ASIDs are safe to re-use.
+ *
+ * This way the global flush only needs to happen at ASID rollover
+ * time, and not at ASID allocation time.
+ */
+static void reset_global_asid_space(void)
+{
+	lockdep_assert_held(&global_asid_lock);
+
+	invlpgb_flush_all_nonglobals();
+
+	/*
+	 * The TLB flush above makes it safe to re-use the previously
+	 * freed global ASIDs.
+	 */
+	bitmap_andnot(global_asid_used, global_asid_used,
+			global_asid_freed, MAX_ASID_AVAILABLE);
+	bitmap_clear(global_asid_freed, 0, MAX_ASID_AVAILABLE);
+
+	/* Restart the search from the start of global ASID space. */
+	last_global_asid = TLB_NR_DYN_ASIDS;
+}
+
+static u16 allocate_global_asid(void)
+{
+	u16 asid;
+
+	lockdep_assert_held(&global_asid_lock);
+
+	/* The previous allocation hit the edge of available address space */
+	if (last_global_asid >= MAX_ASID_AVAILABLE - 1)
+		reset_global_asid_space();
+
+	asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, last_global_asid);
+
+	if (asid >= MAX_ASID_AVAILABLE && !global_asid_available) {
+		/* This should never happen. */
+		VM_WARN_ONCE(1, "Unable to allocate global ASID despite %d available\n",
+				global_asid_available);
+		return 0;
+	}
+
+	/* Claim this global ASID. */
+	__set_bit(asid, global_asid_used);
+	last_global_asid = asid;
+	global_asid_available--;
+	return asid;
+}
+
+/*
+ * Check whether a process is currently active on more than @threshold CPUs.
+ * This is a cheap estimation on whether or not it may make sense to assign
+ * a global ASID to this process, and use broadcast TLB invalidation.
+ */
+static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
+{
+	int count = 0;
+	int cpu;
+
+	/* This quick check should eliminate most single threaded programs. */
+	if (cpumask_weight(mm_cpumask(mm)) <= threshold)
+		return false;
+
+	/* Slower check to make sure. */
+	for_each_cpu(cpu, mm_cpumask(mm)) {
+		/* Skip the CPUs that aren't really running this process. */
+		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
+			continue;
+
+		if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu))
+			continue;
+
+		if (++count > threshold)
+			return true;
+	}
+	return false;
+}
+
+/*
+ * Assign a global ASID to the current process, protecting against
+ * races between multiple threads in the process.
+ */
+static void use_global_asid(struct mm_struct *mm)
+{
+	u16 asid;
+
+	guard(raw_spinlock_irqsave)(&global_asid_lock);
+
+	/* This process is already using broadcast TLB invalidation. */
+	if (mm_global_asid(mm))
+		return;
+
+	/*
+	 * The last global ASID was consumed while waiting for the lock.
+	 *
+	 * If this fires, a more aggressive ASID reuse scheme might be
+	 * needed.
+	 */
+	if (!global_asid_available) {
+		VM_WARN_ONCE(1, "Ran out of global ASIDs\n");
+		return;
+	}
+
+	asid = allocate_global_asid();
+	if (!asid)
+		return;
+
+	mm_assign_global_asid(mm, asid);
+}
+
+void mm_free_global_asid(struct mm_struct *mm)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return;
+
+	if (!mm_global_asid(mm))
+		return;
+
+	guard(raw_spinlock_irqsave)(&global_asid_lock);
+
+	/* The global ASID can be re-used only after flush at wrap-around. */
+#ifdef CONFIG_BROADCAST_TLB_FLUSH
+	__set_bit(mm->context.global_asid, global_asid_freed);
+
+	mm->context.global_asid = 0;
+	global_asid_available++;
+#endif
+}
+
+/*
  * Given an ASID, flush the corresponding user ASID.  We can delay this
  * until the next time we switch to it.
  *


* [tip: x86/mm] x86/mm: Use INVLPGB for kernel TLB flushes
  2025-02-26  3:00 ` [PATCH v14 04/13] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
  2025-02-28 19:00   ` Dave Hansen
  2025-02-28 21:43   ` Borislav Petkov
@ 2025-03-05 21:36   ` tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
  3 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-05 21:36 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Rik van Riel, Borislav Petkov (AMD), x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     ccc19c694b0fe063a90dd27470e9f4ba22990ea1
Gitweb:        https://git.kernel.org/tip/ccc19c694b0fe063a90dd27470e9f4ba22990ea1
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 25 Feb 2025 22:00:39 -05:00
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Wed, 05 Mar 2025 17:19:52 +01:00

x86/mm: Use INVLPGB for kernel TLB flushes

Use broadcast TLB invalidation for kernel addresses when available.
Remove the need to send IPIs for kernel TLB flushes.

   [ bp: Integrate dhansen's comments additions, merge the
     flush_tlb_all() change into this one too. ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250226030129.530345-5-riel@surriel.com
---
 arch/x86/mm/tlb.c | 48 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index dbcb5c9..8cd084b 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1064,7 +1064,6 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
 }
 
-
 static void do_flush_tlb_all(void *info)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
@@ -1074,7 +1073,32 @@ static void do_flush_tlb_all(void *info)
 void flush_tlb_all(void)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
-	on_each_cpu(do_flush_tlb_all, NULL, 1);
+
+	/* First try (faster) hardware-assisted TLB invalidation. */
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		invlpgb_flush_all();
+	else
+		/* Fall back to the IPI-based invalidation. */
+		on_each_cpu(do_flush_tlb_all, NULL, 1);
+}
+
+/* Flush an arbitrarily large range of memory with INVLPGB. */
+static void invlpgb_kernel_range_flush(struct flush_tlb_info *info)
+{
+	unsigned long addr, nr;
+
+	for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
+		nr = (info->end - addr) >> PAGE_SHIFT;
+
+		/*
+		 * INVLPGB has a limit on the size of ranges it can
+		 * flush. Break up large flushes.
+		 */
+		nr = clamp_val(nr, 1, invlpgb_count_max);
+
+		invlpgb_flush_addr_nosync(addr, nr);
+	}
+	__tlbsync();
 }
 
 static void do_kernel_range_flush(void *info)
@@ -1087,6 +1111,22 @@ static void do_kernel_range_flush(void *info)
 		flush_tlb_one_kernel(addr);
 }
 
+static void kernel_tlb_flush_all(struct flush_tlb_info *info)
+{
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		invlpgb_flush_all();
+	else
+		on_each_cpu(do_flush_tlb_all, NULL, 1);
+}
+
+static void kernel_tlb_flush_range(struct flush_tlb_info *info)
+{
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		invlpgb_kernel_range_flush(info);
+	else
+		on_each_cpu(do_kernel_range_flush, info, 1);
+}
+
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
 	struct flush_tlb_info *info;
@@ -1097,9 +1137,9 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 				  TLB_GENERATION_INVALID);
 
 	if (info->end == TLB_FLUSH_ALL)
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
+		kernel_tlb_flush_all(info);
 	else
-		on_each_cpu(do_kernel_range_flush, info, 1);
+		kernel_tlb_flush_range(info);
 
 	put_flush_tlb_info();
 }


* [tip: x86/mm] x86/mm: Add INVLPGB support code
  2025-02-26  3:00 ` [PATCH v14 03/13] x86/mm: add INVLPGB support code Rik van Riel
                     ` (2 preceding siblings ...)
  2025-02-28 19:47   ` Borislav Petkov
@ 2025-03-05 21:36   ` tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
  4 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-05 21:36 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Rik van Riel, Borislav Petkov (AMD), x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     6272c3a217e5837c72c6714d1e7eddd34254fac3
Gitweb:        https://git.kernel.org/tip/6272c3a217e5837c72c6714d1e7eddd34254fac3
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Fri, 28 Feb 2025 20:32:30 +01:00
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Wed, 05 Mar 2025 17:19:46 +01:00

x86/mm: Add INVLPGB support code

Add helper functions and definitions needed to use broadcast TLB
invalidation on AMD CPUs.
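
The register packing done by the __invlpgb() helper in the hunk below can be
previewed from userspace without executing the instruction; a small sketch
with arbitrary example values:

  #include <stdint.h>
  #include <stdio.h>

  #define INVLPGB_FLAG_VA		(1u << 0)
  #define INVLPGB_FLAG_PCID	(1u << 1)

  int main(void)
  {
  	unsigned long addr = 0x7f0000000000UL;	/* example user address */
  	unsigned long pcid = 5, asid = 0;
  	unsigned int nr_pages = 16, stride = 0;	/* PTE_STRIDE */
  	uint8_t flags = INVLPGB_FLAG_PCID | INVLPGB_FLAG_VA;

  	uint64_t rax = addr | flags;
  	uint32_t ecx = (stride << 31) | (nr_pages - 1);
  	uint32_t edx = (pcid << 16) | asid;

  	printf("rax=%#llx ecx=%#x edx=%#x\n", (unsigned long long)rax, ecx, edx);
  	return 0;
  }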

  [ bp:
      - Cleanup commit message
      - Improve and expand comments
      - push the preemption guards inside the invlpgb* helpers
      - merge improvements from dhansen
      - add !CONFIG_BROADCAST_TLB_FLUSH function stubs because Clang
	can't do DCE properly yet and looks at the inline asm and
	complains about it getting a u64 argument on 32-bit code ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250226030129.530345-4-riel@surriel.com
---
 arch/x86/include/asm/tlb.h | 132 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 132 insertions(+)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 77f52bc..31f6db4 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -6,6 +6,9 @@
 static inline void tlb_flush(struct mmu_gather *tlb);
 
 #include <asm-generic/tlb.h>
+#include <linux/kernel.h>
+#include <vdso/bits.h>
+#include <vdso/page.h>
 
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
@@ -25,4 +28,133 @@ static inline void invlpg(unsigned long addr)
 	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
 }
 
+enum addr_stride {
+	PTE_STRIDE = 0,
+	PMD_STRIDE = 1
+};
+
+#ifdef CONFIG_BROADCAST_TLB_FLUSH
+/*
+ * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
+ *
+ * The INVLPGB instruction is weakly ordered, and a batch of invalidations can
+ * be done in a parallel fashion.
+ *
+ * The instruction takes the number of extra pages to invalidate, beyond
+ * the first page, while __invlpgb gets the more human readable number of
+ * pages to invalidate.
+ *
+ * The bits in rax[0:2] determine respectively which components of the address
+ * (VA, PCID, ASID) get compared when flushing. If neither bits are set, *any*
+ * address in the specified range matches.
+ *
+ * TLBSYNC is used to ensure that pending INVLPGB invalidations initiated from
+ * this CPU have completed.
+ */
+static inline void __invlpgb(unsigned long asid, unsigned long pcid,
+			     unsigned long addr, u16 nr_pages,
+			     enum addr_stride stride, u8 flags)
+{
+	u32 edx = (pcid << 16) | asid;
+	u32 ecx = (stride << 31) | (nr_pages - 1);
+	u64 rax = addr | flags;
+
+	/* The low bits in rax are for flags. Verify addr is clean. */
+	VM_WARN_ON_ONCE(addr & ~PAGE_MASK);
+
+	/* INVLPGB; supported in binutils >= 2.36. */
+	asm volatile(".byte 0x0f, 0x01, 0xfe" :: "a" (rax), "c" (ecx), "d" (edx));
+}
+
+static inline void __invlpgb_all(unsigned long asid, unsigned long pcid, u8 flags)
+{
+	__invlpgb(asid, pcid, 0, 1, 0, flags);
+}
+
+static inline void __tlbsync(void)
+{
+	/*
+	 * TLBSYNC waits for INVLPGB instructions originating on the same CPU
+	 * to have completed. Print a warning if the task has been migrated,
+	 * and might not be waiting on all the INVLPGBs issued during this TLB
+	 * invalidation sequence.
+	 */
+	cant_migrate();
+
+	/* TLBSYNC: supported in binutils >= 2.36. */
+	asm volatile(".byte 0x0f, 0x01, 0xff" ::: "memory");
+}
+#else
+/* Some compilers (I'm looking at you clang!) simply can't do DCE */
+static inline void __invlpgb(unsigned long asid, unsigned long pcid,
+			     unsigned long addr, u16 nr_pages,
+			     enum addr_stride s, u8 flags) { }
+static inline void __invlpgb_all(unsigned long asid, unsigned long pcid, u8 flags) { }
+static inline void __tlbsync(void) { }
+#endif
+
+/*
+ * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination
+ * of the three. For example:
+ * - FLAG_VA | FLAG_INCLUDE_GLOBAL: invalidate all TLB entries at the address
+ * - FLAG_PCID:			    invalidate all TLB entries matching the PCID
+ *
+ * The first is used to invalidate (kernel) mappings at a particular
+ * address across all processes.
+ *
+ * The latter invalidates all TLB entries matching a PCID.
+ */
+#define INVLPGB_FLAG_VA			BIT(0)
+#define INVLPGB_FLAG_PCID		BIT(1)
+#define INVLPGB_FLAG_ASID		BIT(2)
+#define INVLPGB_FLAG_INCLUDE_GLOBAL	BIT(3)
+#define INVLPGB_FLAG_FINAL_ONLY		BIT(4)
+#define INVLPGB_FLAG_INCLUDE_NESTED	BIT(5)
+
+/* The implied mode when all bits are clear: */
+#define INVLPGB_MODE_ALL_NONGLOBALS	0UL
+
+static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
+						unsigned long addr,
+						u16 nr, bool stride)
+{
+	enum addr_stride str = stride ? PMD_STRIDE : PTE_STRIDE;
+	u8 flags = INVLPGB_FLAG_PCID | INVLPGB_FLAG_VA;
+
+	__invlpgb(0, pcid, addr, nr, str, flags);
+}
+
+/* Flush all mappings for a given PCID, not including globals. */
+static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
+{
+	__invlpgb_all(0, pcid, INVLPGB_FLAG_PCID);
+}
+
+/* Flush all mappings, including globals, for all PCIDs. */
+static inline void invlpgb_flush_all(void)
+{
+	/*
+	 * TLBSYNC at the end needs to make sure all flushes done on the
+	 * current CPU have been executed system-wide. Therefore, make
+	 * sure nothing gets migrated in-between but disable preemption
+	 * as it is cheaper.
+	 */
+	guard(preempt)();
+	__invlpgb_all(0, 0, INVLPGB_FLAG_INCLUDE_GLOBAL);
+	__tlbsync();
+}
+
+/* Flush addr, including globals, for all PCIDs. */
+static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
+{
+	__invlpgb(0, 0, addr, nr, PTE_STRIDE, INVLPGB_FLAG_INCLUDE_GLOBAL);
+}
+
+/* Flush all mappings for all PCIDs except globals. */
+static inline void invlpgb_flush_all_nonglobals(void)
+{
+	guard(preempt)();
+	__invlpgb_all(0, 0, INVLPGB_MODE_ALL_NONGLOBALS);
+	__tlbsync();
+}
 #endif /* _ASM_X86_TLB_H */


* [tip: x86/mm] x86/mm: Add INVLPGB feature and Kconfig entry
  2025-02-26  3:00 ` [PATCH v14 02/13] x86/mm: get INVLPGB count max from CPUID Rik van Riel
  2025-02-28 16:21   ` Borislav Petkov
  2025-02-28 19:27   ` Borislav Petkov
@ 2025-03-05 21:36   ` tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
  3 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-05 21:36 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Rik van Riel, Borislav Petkov (AMD), x86, linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     dc9a260e96d57cc60482a147205e60de35833d08
Gitweb:        https://git.kernel.org/tip/dc9a260e96d57cc60482a147205e60de35833d08
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 25 Feb 2025 22:00:37 -05:00
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Wed, 05 Mar 2025 17:19:08 +01:00

x86/mm: Add INVLPGB feature and Kconfig entry

In addition, the CPU advertises the maximum number of pages that can be
shot down with one INVLPGB instruction in CPUID. Save that information
for later use.
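
Not from the patch: the same CPUID fields can be inspected from userspace to
see what a given machine reports. A minimal sketch using leaf 0x80000008
(EBX bit 3 for the feature bit, EDX[15:0] for the page count, as read by the
amd.c hunk below):

  #include <cpuid.h>
  #include <stdio.h>

  int main(void)
  {
  	unsigned int eax, ebx, ecx, edx;

  	if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx))
  		return 1;

  	printf("INVLPGB/TLBSYNC: %s\n", (ebx & (1u << 3)) ? "yes" : "no");
  	/* EDX[15:0] holds the number of *extra* pages per INVLPGB, hence the +1. */
  	printf("invlpgb_count_max: %u\n", (edx & 0xffff) + 1);
  	return 0;
  }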

  [ bp: use cpu_has(), typos, massage. ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250226030129.530345-3-riel@surriel.com
---
 arch/x86/Kconfig.cpu                     | 4 ++++
 arch/x86/include/asm/cpufeatures.h       | 1 +
 arch/x86/include/asm/disabled-features.h | 8 +++++++-
 arch/x86/include/asm/tlbflush.h          | 3 +++
 arch/x86/kernel/cpu/amd.c                | 6 ++++++
 5 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 2a7279d..25c55cc 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -401,6 +401,10 @@ menuconfig PROCESSOR_SELECT
 	  This lets you choose what x86 vendor support code your kernel
 	  will include.
 
+config BROADCAST_TLB_FLUSH
+	def_bool y
+	depends on CPU_SUP_AMD && 64BIT
+
 config CPU_SUP_INTEL
 	default y
 	bool "Support Intel processors" if PROCESSOR_SELECT
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 508c0da..8770dc1 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -338,6 +338,7 @@
 #define X86_FEATURE_CLZERO		(13*32+ 0) /* "clzero" CLZERO instruction */
 #define X86_FEATURE_IRPERF		(13*32+ 1) /* "irperf" Instructions Retired Count */
 #define X86_FEATURE_XSAVEERPTR		(13*32+ 2) /* "xsaveerptr" Always save/restore FP error pointers */
+#define X86_FEATURE_INVLPGB		(13*32+ 3) /* INVLPGB and TLBSYNC instructions supported */
 #define X86_FEATURE_RDPRU		(13*32+ 4) /* "rdpru" Read processor register at user level */
 #define X86_FEATURE_WBNOINVD		(13*32+ 9) /* "wbnoinvd" WBNOINVD instruction */
 #define X86_FEATURE_AMD_IBPB		(13*32+12) /* Indirect Branch Prediction Barrier */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index c492bdc..be8c388 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -129,6 +129,12 @@
 #define DISABLE_SEV_SNP		(1 << (X86_FEATURE_SEV_SNP & 31))
 #endif
 
+#ifdef CONFIG_BROADCAST_TLB_FLUSH
+#define DISABLE_INVLPGB		0
+#else
+#define DISABLE_INVLPGB		(1 << (X86_FEATURE_INVLPGB & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -146,7 +152,7 @@
 #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
 			 DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
 #define DISABLED_MASK12	(DISABLE_FRED|DISABLE_LAM)
-#define DISABLED_MASK13	0
+#define DISABLED_MASK13	(DISABLE_INVLPGB)
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3da6451..855c13d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -183,6 +183,9 @@ static inline void cr4_init_shadow(void)
 extern unsigned long mmu_cr4_features;
 extern u32 *trampoline_cr4_features;
 
+/* How many pages can be invalidated with one INVLPGB. */
+extern u16 invlpgb_count_max;
+
 extern void initialize_tlbstate_and_flush(void);
 
 /*
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 54194f5..7a72ef4 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -29,6 +29,8 @@
 
 #include "cpu.h"
 
+u16 invlpgb_count_max __ro_after_init;
+
 static inline int rdmsrl_amd_safe(unsigned msr, unsigned long long *p)
 {
 	u32 gprs[8] = { 0 };
@@ -1139,6 +1141,10 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
 		tlb_lli_2m[ENTRIES] = eax & mask;
 
 	tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
+
+	/* Max number of pages INVLPGB can invalidate in one shot */
+	if (cpu_has(c, X86_FEATURE_INVLPGB))
+		invlpgb_count_max = (cpuid_edx(0x80000008) & 0xffff) + 1;
 }
 
 static const struct cpu_dev amd_cpu_dev = {


* [tip: x86/mm] x86/mm: Consolidate full flush threshold decision
  2025-02-26  3:00 ` [PATCH v14 01/13] x86/mm: consolidate full flush threshold decision Rik van Riel
@ 2025-03-05 21:36   ` tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
  2025-09-02 15:44   ` [BUG] x86/mm: regression after 4a02ed8e1cc3 Giovanni Cabiddu
  2 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-05 21:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Dave Hansen, Rik van Riel, Borislav Petkov (AMD), x86,
	linux-kernel

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     fd521566cc0cd5e327683fea70cba93b1966ebac
Gitweb:        https://git.kernel.org/tip/fd521566cc0cd5e327683fea70cba93b1966ebac
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 25 Feb 2025 22:00:36 -05:00
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Tue, 04 Mar 2025 11:39:38 +01:00

x86/mm: Consolidate full flush threshold decision

Reduce code duplication by consolidating the decision point for whether to do
individual invalidations or a full flush inside get_flush_tlb_info().
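
A worked example of the consolidated check (the ceiling value here is
illustrative; the real one is the tlb_single_page_flush_ceiling tunable):

  #include <stdio.h>

  int main(void)
  {
  	unsigned long ceiling = 33;	/* illustrative flush ceiling, in pages */
  	unsigned long start = 0x400000, end = 0x400000 + 64 * 4096;
  	unsigned int stride_shift = 12;	/* 4 KiB stride */

  	if ((end - start) >> stride_shift > ceiling)
  		printf("64 pages > %lu: start=0, end=TLB_FLUSH_ALL (full flush)\n", ceiling);
  	else
  		printf("flush %lu individual pages\n", (end - start) >> stride_shift);
  	return 0;
  }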

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lore.kernel.org/r/20250226030129.530345-2-riel@surriel.com
---
 arch/x86/mm/tlb.c | 41 +++++++++++++++++++----------------------
 1 file changed, 19 insertions(+), 22 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ffc25b3..dbcb5c9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1000,6 +1000,15 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
 	BUG_ON(this_cpu_inc_return(flush_tlb_info_idx) != 1);
 #endif
 
+	/*
+	 * If the number of flushes is so large that a full flush
+	 * would be faster, do a full flush.
+	 */
+	if ((end - start) >> stride_shift > tlb_single_page_flush_ceiling) {
+		start = 0;
+		end = TLB_FLUSH_ALL;
+	}
+
 	info->start		= start;
 	info->end		= end;
 	info->mm		= mm;
@@ -1026,17 +1035,8 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				bool freed_tables)
 {
 	struct flush_tlb_info *info;
+	int cpu = get_cpu();
 	u64 new_tlb_gen;
-	int cpu;
-
-	cpu = get_cpu();
-
-	/* Should we flush just the requested range? */
-	if ((end == TLB_FLUSH_ALL) ||
-	    ((end - start) >> stride_shift) > tlb_single_page_flush_ceiling) {
-		start = 0;
-		end = TLB_FLUSH_ALL;
-	}
 
 	/* This is also a barrier that synchronizes with switch_mm(). */
 	new_tlb_gen = inc_mm_tlb_gen(mm);
@@ -1089,22 +1089,19 @@ static void do_kernel_range_flush(void *info)
 
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
-	/* Balance as user space task's flush, a bit conservative */
-	if (end == TLB_FLUSH_ALL ||
-	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
-	} else {
-		struct flush_tlb_info *info;
+	struct flush_tlb_info *info;
+
+	guard(preempt)();
 
-		preempt_disable();
-		info = get_flush_tlb_info(NULL, start, end, 0, false,
-					  TLB_GENERATION_INVALID);
+	info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
+				  TLB_GENERATION_INVALID);
 
+	if (info->end == TLB_FLUSH_ALL)
+		on_each_cpu(do_flush_tlb_all, NULL, 1);
+	else
 		on_each_cpu(do_kernel_range_flush, info, 1);
 
-		put_flush_tlb_info();
-		preempt_enable();
-	}
+	put_flush_tlb_info();
 }
 
 /*


* [tip: x86/core] x86/mm: Always set the ASID valid bit for the INVLPGB instruction
  2025-03-04 12:04 ` [PATCH] x86/mm: Always set the ASID valid bit for the INVLPGB instruction Borislav Petkov
  2025-03-04 12:43   ` Borislav Petkov
  2025-03-05 21:36   ` [tip: x86/mm] " tip-bot2 for Tom Lendacky
@ 2025-03-19 11:04   ` tip-bot2 for Tom Lendacky
  2 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Tom Lendacky @ 2025-03-19 11:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Tom Lendacky, Borislav Petkov (AMD), Ingo Molnar, x86,
	linux-kernel

The following commit has been merged into the x86/core branch of tip:

Commit-ID:     634ab76159a842f890743fb435f74c228e06bb04
Gitweb:        https://git.kernel.org/tip/634ab76159a842f890743fb435f74c228e06bb04
Author:        Tom Lendacky <thomas.lendacky@amd.com>
AuthorDate:    Tue, 04 Mar 2025 12:59:56 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Mar 2025 11:12:29 +01:00

x86/mm: Always set the ASID valid bit for the INVLPGB instruction

When executing the INVLPGB instruction on a bare-metal host or hypervisor, if
the ASID valid bit is not set, the instruction will flush the TLB entries that
match the specified criteria for any ASID, not just those of the host. If
virtual machines are running on the system, this may result in inadvertent
flushes of guest TLB entries.

When executing the INVLPGB instruction in a guest and the INVLPGB instruction is
not intercepted by the hypervisor, the hardware will replace the requested ASID
with the guest ASID and set the ASID valid bit before doing the broadcast
invalidation. Thus a guest is only able to flush its own TLB entries.

So to limit the host TLB flushing reach, always set the ASID valid bit using an
ASID value of 0 (which represents the host/hypervisor). This will result in
the desired effect in both host and guest.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250304120449.GHZ8bsYYyEBOKQIxBm@fat_crate.local
---
 arch/x86/include/asm/tlb.h | 58 ++++++++++++++++++++-----------------
 1 file changed, 32 insertions(+), 26 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 31f6db4..866ea78 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -33,6 +33,27 @@ enum addr_stride {
 	PMD_STRIDE = 1
 };
 
+/*
+ * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination
+ * of the three. For example:
+ * - FLAG_VA | FLAG_INCLUDE_GLOBAL: invalidate all TLB entries at the address
+ * - FLAG_PCID:			    invalidate all TLB entries matching the PCID
+ *
+ * The first is used to invalidate (kernel) mappings at a particular
+ * address across all processes.
+ *
+ * The latter invalidates all TLB entries matching a PCID.
+ */
+#define INVLPGB_FLAG_VA			BIT(0)
+#define INVLPGB_FLAG_PCID		BIT(1)
+#define INVLPGB_FLAG_ASID		BIT(2)
+#define INVLPGB_FLAG_INCLUDE_GLOBAL	BIT(3)
+#define INVLPGB_FLAG_FINAL_ONLY		BIT(4)
+#define INVLPGB_FLAG_INCLUDE_NESTED	BIT(5)
+
+/* The implied mode when all bits are clear: */
+#define INVLPGB_MODE_ALL_NONGLOBALS	0UL
+
 #ifdef CONFIG_BROADCAST_TLB_FLUSH
 /*
  * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
@@ -40,14 +61,20 @@ enum addr_stride {
  * The INVLPGB instruction is weakly ordered, and a batch of invalidations can
  * be done in a parallel fashion.
  *
- * The instruction takes the number of extra pages to invalidate, beyond
- * the first page, while __invlpgb gets the more human readable number of
- * pages to invalidate.
+ * The instruction takes the number of extra pages to invalidate, beyond the
+ * first page, while __invlpgb gets the more human readable number of pages to
+ * invalidate.
  *
  * The bits in rax[0:2] determine respectively which components of the address
  * (VA, PCID, ASID) get compared when flushing. If neither bits are set, *any*
  * address in the specified range matches.
  *
+ * Since it is desired to only flush TLB entries for the ASID that is executing
+ * the instruction (a host/hypervisor or a guest), the ASID valid bit should
+ * always be set. On a host/hypervisor, the hardware will use the ASID value
+ * specified in EDX[15:0] (which should be 0). On a guest, the hardware will
+ * use the actual ASID value of the guest.
+ *
  * TLBSYNC is used to ensure that pending INVLPGB invalidations initiated from
  * this CPU have completed.
  */
@@ -55,9 +82,9 @@ static inline void __invlpgb(unsigned long asid, unsigned long pcid,
 			     unsigned long addr, u16 nr_pages,
 			     enum addr_stride stride, u8 flags)
 {
-	u32 edx = (pcid << 16) | asid;
+	u64 rax = addr | flags | INVLPGB_FLAG_ASID;
 	u32 ecx = (stride << 31) | (nr_pages - 1);
-	u64 rax = addr | flags;
+	u32 edx = (pcid << 16) | asid;
 
 	/* The low bits in rax are for flags. Verify addr is clean. */
 	VM_WARN_ON_ONCE(addr & ~PAGE_MASK);
@@ -93,27 +120,6 @@ static inline void __invlpgb_all(unsigned long asid, unsigned long pcid, u8 flag
 static inline void __tlbsync(void) { }
 #endif
 
-/*
- * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination
- * of the three. For example:
- * - FLAG_VA | FLAG_INCLUDE_GLOBAL: invalidate all TLB entries at the address
- * - FLAG_PCID:			    invalidate all TLB entries matching the PCID
- *
- * The first is used to invalidate (kernel) mappings at a particular
- * address across all processes.
- *
- * The latter invalidates all TLB entries matching a PCID.
- */
-#define INVLPGB_FLAG_VA			BIT(0)
-#define INVLPGB_FLAG_PCID		BIT(1)
-#define INVLPGB_FLAG_ASID		BIT(2)
-#define INVLPGB_FLAG_INCLUDE_GLOBAL	BIT(3)
-#define INVLPGB_FLAG_FINAL_ONLY		BIT(4)
-#define INVLPGB_FLAG_INCLUDE_NESTED	BIT(5)
-
-/* The implied mode when all bits are clear: */
-#define INVLPGB_MODE_ALL_NONGLOBALS	0UL
-
 static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
 						unsigned long addr,
 						u16 nr, bool stride)
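
To make the new encoding concrete, here is a minimal, illustrative sketch
(not part of the patch) of where the ASID valid bit ends up; it reuses the
INVLPGB_FLAG_ASID name and the .byte encoding from the hunk above:

  /* Illustrative only: single page, PTE stride, ASID 0 (host/hypervisor). */
  static inline void invlpgb_host_example(unsigned long pcid, unsigned long addr,
  					  u8 flags)
  {
  	/*
  	 * Always OR in the ASID valid bit and pass ASID 0. On bare metal
  	 * this keeps the flush away from guest ASIDs; in a guest, hardware
  	 * substitutes the guest's own ASID regardless. addr must be
  	 * page aligned, since the low bits of RAX carry the flags.
  	 */
  	u64 rax = addr | flags | INVLPGB_FLAG_ASID;
  	u32 ecx = 0;			/* PTE stride, nr_pages - 1 == 0 */
  	u32 edx = (pcid << 16) | 0;	/* ASID 0 == host/hypervisor */

  	asm volatile(".byte 0x0f, 0x01, 0xfe" :: "a" (rax), "c" (ecx), "d" (edx));
  }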


* [tip: x86/core] x86/mm: Enable AMD translation cache extensions
  2025-02-26  3:00 ` [PATCH v14 12/13] x86/mm: enable AMD translation cache extensions Rik van Riel
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Enable " tip-bot2 for Rik van Riel
@ 2025-03-19 11:04   ` tip-bot2 for Rik van Riel
  1 sibling, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-19 11:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Rik van Riel, Borislav Petkov (AMD), Ingo Molnar, x86,
	linux-kernel

The following commit has been merged into the x86/core branch of tip:

Commit-ID:     440a65b7d25fb06f85ee5d99c5ac492d49a15370
Gitweb:        https://git.kernel.org/tip/440a65b7d25fb06f85ee5d99c5ac492d49a15370
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 25 Feb 2025 22:00:47 -05:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Mar 2025 11:12:29 +01:00

x86/mm: Enable AMD translation cache extensions

With AMD TCE (translation cache extensions) only the intermediate mappings
that cover the address range zapped by INVLPG / INVLPGB get invalidated,
rather than all intermediate mappings getting zapped at every TLB invalidation.

This can help reduce the TLB miss rate, by keeping more intermediate mappings
in the cache.

From the AMD manual:

Translation Cache Extension (TCE) Bit. Bit 15, read/write. Setting this bit to
1 changes how the INVLPG, INVLPGB, and INVPCID instructions operate on TLB
entries. When this bit is 0, these instructions remove the target PTE from the
TLB as well as all upper-level table entries that are cached in the TLB,
whether or not they are associated with the target PTE.  When this bit is set,
these instructions will remove the target PTE and only those upper-level
entries that lead to the target PTE in the page table hierarchy, leaving
unrelated upper-level entries intact.

  [ bp: use cpu_has()... I know, it is a mess. ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226030129.530345-13-riel@surriel.com
---
 arch/x86/include/asm/msr-index.h       | 2 ++
 arch/x86/kernel/cpu/amd.c              | 4 ++++
 tools/arch/x86/include/asm/msr-index.h | 2 ++
 3 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 72765b2..1aacd6b 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -25,6 +25,7 @@
 #define _EFER_SVME		12 /* Enable virtualization */
 #define _EFER_LMSLE		13 /* Long Mode Segment Limit Enable */
 #define _EFER_FFXSR		14 /* Enable Fast FXSAVE/FXRSTOR */
+#define _EFER_TCE		15 /* Enable Translation Cache Extensions */
 #define _EFER_AUTOIBRS		21 /* Enable Automatic IBRS */
 
 #define EFER_SCE		(1<<_EFER_SCE)
@@ -34,6 +35,7 @@
 #define EFER_SVME		(1<<_EFER_SVME)
 #define EFER_LMSLE		(1<<_EFER_LMSLE)
 #define EFER_FFXSR		(1<<_EFER_FFXSR)
+#define EFER_TCE		(1<<_EFER_TCE)
 #define EFER_AUTOIBRS		(1<<_EFER_AUTOIBRS)
 
 /*
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 351c030..79569f7 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1075,6 +1075,10 @@ static void init_amd(struct cpuinfo_x86 *c)
 
 	/* AMD CPUs don't need fencing after x2APIC/TSC_DEADLINE MSR writes. */
 	clear_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
+
+	/* Enable Translation Cache Extension */
+	if (cpu_has(c, X86_FEATURE_TCE))
+		msr_set_bit(MSR_EFER, _EFER_TCE);
 }
 
 #ifdef CONFIG_X86_32
diff --git a/tools/arch/x86/include/asm/msr-index.h b/tools/arch/x86/include/asm/msr-index.h
index 3ae84c3..dc1c105 100644
--- a/tools/arch/x86/include/asm/msr-index.h
+++ b/tools/arch/x86/include/asm/msr-index.h
@@ -25,6 +25,7 @@
 #define _EFER_SVME		12 /* Enable virtualization */
 #define _EFER_LMSLE		13 /* Long Mode Segment Limit Enable */
 #define _EFER_FFXSR		14 /* Enable Fast FXSAVE/FXRSTOR */
+#define _EFER_TCE		15 /* Enable Translation Cache Extensions */
 #define _EFER_AUTOIBRS		21 /* Enable Automatic IBRS */
 
 #define EFER_SCE		(1<<_EFER_SCE)
@@ -34,6 +35,7 @@
 #define EFER_SVME		(1<<_EFER_SVME)
 #define EFER_LMSLE		(1<<_EFER_LMSLE)
 #define EFER_FFXSR		(1<<_EFER_FFXSR)
+#define EFER_TCE		(1<<_EFER_TCE)
 #define EFER_AUTOIBRS		(1<<_EFER_AUTOIBRS)
 
 /*
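
As a quick way to confirm the effect of the msr_set_bit() call above, a
hypothetical debug helper (not part of the patch; kernel context and the
usual <asm/msr.h> accessors assumed) could read EFER back:

  static void report_tce_state(void)
  {
  	u64 efer;

  	/* MSR_EFER / EFER_TCE as defined in msr-index.h above */
  	rdmsrl(MSR_EFER, efer);
  	pr_info("EFER.TCE is %s\n", (efer & EFER_TCE) ? "enabled" : "disabled");
  }

From user space the same bit can be checked with msr-tools, e.g.
"rdmsr -p 0 0xc0000080" and testing bit 15, assuming the msr module is loaded.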


* [tip: x86/core] x86/mm: Enable broadcast TLB invalidation for multi-threaded processes
  2025-02-26  3:00 ` [PATCH v14 10/13] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
  2025-03-03 10:57   ` Borislav Petkov
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Enable " tip-bot2 for Rik van Riel
@ 2025-03-19 11:04   ` tip-bot2 for Rik van Riel
  2 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-19 11:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Rik van Riel, Borislav Petkov (AMD), Ingo Molnar, Nadav Amit, x86,
	linux-kernel

The following commit has been merged into the x86/core branch of tip:

Commit-ID:     4afeb0ed1753ebcad93ee3b45427ce85e9c8ec40
Gitweb:        https://git.kernel.org/tip/4afeb0ed1753ebcad93ee3b45427ce85e9c8ec40
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 25 Feb 2025 22:00:45 -05:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Mar 2025 11:12:29 +01:00

x86/mm: Enable broadcast TLB invalidation for multi-threaded processes

There is not enough room in the 12-bit ASID address space to hand out
broadcast ASIDs to every process. Only hand out broadcast ASIDs to processes
when they are observed to be simultaneously running on 4 or more CPUs.

This also allows single-threaded processes to continue using the cheaper, local
TLB invalidation instructions like INVLPG.

Due to the structure of flush_tlb_mm_range(), the INVLPGB flushing is done in
a generically named broadcast_tlb_flush() function which can later also be
used for Intel RAR.

Combined with the removal of unnecessary lru_add_drain() calls (see
https://lore.kernel.org/r/20241219153253.3da9e8aa@fangorn) this results in
a nice performance boost for the will-it-scale tlb_flush2_threads test on an
AMD Milan system with 36 cores:

  - vanilla kernel:           527k loops/second
  - lru_add_drain removal:    731k loops/second
  - only INVLPGB:             527k loops/second
  - lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that, while TLB invalidation went
down from 40% of the total CPU time to only around 4% of CPU time, the
contention simply moved to the LRU lock.

Fixing both at the same time roughly doubles the number of iterations per second
for this case.

Comparing will-it-scale tlb_flush2_threads with several different numbers of
threads on a 72 CPU AMD Milan shows similar results. The number represents the
total number of loops per second across all the threads:

  threads	tip		INVLPGB

  1		315k		304k
  2		423k		424k
  4		644k		1032k
  8		652k		1267k
  16		737k		1368k
  32		759k		1199k
  64		636k		1094k
  72		609k		993k

1 and 2 thread performance is similar with and without INVLPGB, because
INVLPGB is only used on processes using 4 or more CPUs simultaneously.

The number is the median across 5 runs.

Some numbers closer to real world performance can be found at Phoronix, thanks
to Michael:

https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits

  [ bp:
   - Massage
   - :%s/\<static_cpu_has\>/cpu_feature_enabled/cgi
   - :%s/\<clear_asid_transition\>/mm_clear_asid_transition/cgi
   - Fold in a 0day bot fix: https://lore.kernel.org/oe-kbuild-all/202503040000.GtiWUsBm-lkp@intel.com
   ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Link: https://lore.kernel.org/r/20250226030129.530345-11-riel@surriel.com
---
 arch/x86/include/asm/tlbflush.h |   6 ++-
 arch/x86/mm/tlb.c               | 104 ++++++++++++++++++++++++++++++-
 2 files changed, 109 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index e6c3be0..7cad283 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -280,6 +280,11 @@ static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
 	smp_store_release(&mm->context.global_asid, asid);
 }
 
+static inline void mm_clear_asid_transition(struct mm_struct *mm)
+{
+	WRITE_ONCE(mm->context.asid_transition, false);
+}
+
 static inline bool mm_in_asid_transition(struct mm_struct *mm)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
@@ -291,6 +296,7 @@ static inline bool mm_in_asid_transition(struct mm_struct *mm)
 static inline u16 mm_global_asid(struct mm_struct *mm) { return 0; }
 static inline void mm_init_global_asid(struct mm_struct *mm) { }
 static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid) { }
+static inline void mm_clear_asid_transition(struct mm_struct *mm) { }
 static inline bool mm_in_asid_transition(struct mm_struct *mm) { return false; }
 #endif /* CONFIG_BROADCAST_TLB_FLUSH */
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b5681e6..0efd990 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -431,6 +431,105 @@ static bool mm_needs_global_asid(struct mm_struct *mm, u16 asid)
 }
 
 /*
+ * x86 has 4k ASIDs (2k when compiled with KPTI), but the largest x86
+ * systems have over 8k CPUs. Because of this potential ASID shortage,
+ * global ASIDs are handed out to processes that have frequent TLB
+ * flushes and are active on 4 or more CPUs simultaneously.
+ */
+static void consider_global_asid(struct mm_struct *mm)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return;
+
+	/* Check every once in a while. */
+	if ((current->pid & 0x1f) != (jiffies & 0x1f))
+		return;
+
+	/*
+	 * Assign a global ASID if the process is active on
+	 * 4 or more CPUs simultaneously.
+	 */
+	if (mm_active_cpus_exceeds(mm, 3))
+		use_global_asid(mm);
+}
+
+static void finish_asid_transition(struct flush_tlb_info *info)
+{
+	struct mm_struct *mm = info->mm;
+	int bc_asid = mm_global_asid(mm);
+	int cpu;
+
+	if (!mm_in_asid_transition(mm))
+		return;
+
+	for_each_cpu(cpu, mm_cpumask(mm)) {
+		/*
+		 * The remote CPU is context switching. Wait for that to
+		 * finish, to catch the unlikely case of it switching to
+		 * the target mm with an out of date ASID.
+		 */
+		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
+			cpu_relax();
+
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
+			continue;
+
+		/*
+		 * If at least one CPU is not using the global ASID yet,
+		 * send a TLB flush IPI. The IPI should cause stragglers
+		 * to transition soon.
+		 *
+		 * This can race with the CPU switching to another task;
+		 * that results in a (harmless) extra IPI.
+		 */
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
+			flush_tlb_multi(mm_cpumask(info->mm), info);
+			return;
+		}
+	}
+
+	/* All the CPUs running this process are using the global ASID. */
+	mm_clear_asid_transition(mm);
+}
+
+static void broadcast_tlb_flush(struct flush_tlb_info *info)
+{
+	bool pmd = info->stride_shift == PMD_SHIFT;
+	unsigned long asid = mm_global_asid(info->mm);
+	unsigned long addr = info->start;
+
+	/*
+	 * TLB flushes with INVLPGB are kicked off asynchronously.
+	 * The inc_mm_tlb_gen() guarantees page table updates are done
+	 * before these TLB flushes happen.
+	 */
+	if (info->end == TLB_FLUSH_ALL) {
+		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
+		/* Do any CPUs supporting INVLPGB need PTI? */
+		if (cpu_feature_enabled(X86_FEATURE_PTI))
+			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
+	} else do {
+		unsigned long nr = 1;
+
+		if (info->stride_shift <= PMD_SHIFT) {
+			nr = (info->end - addr) >> info->stride_shift;
+			nr = clamp_val(nr, 1, invlpgb_count_max);
+		}
+
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
+		if (cpu_feature_enabled(X86_FEATURE_PTI))
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
+
+		addr += nr << info->stride_shift;
+	} while (addr < info->end);
+
+	finish_asid_transition(info);
+
+	/* Wait for the INVLPGBs kicked off above to finish. */
+	__tlbsync();
+}
+
+/*
  * Given an ASID, flush the corresponding user ASID.  We can delay this
  * until the next time we switch to it.
  *
@@ -1260,9 +1359,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+	if (mm_global_asid(mm)) {
+		broadcast_tlb_flush(info);
+	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
 		info->trim_cpumask = should_trim_cpumask(mm);
 		flush_tlb_multi(mm_cpumask(mm), info);
+		consider_global_asid(mm);
 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
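
For context, the kind of workload that benefits is one where a single mm stays
active on four or more CPUs and flushes often. A hypothetical mini-reproducer in
the spirit of will-it-scale's tlb_flush2 (not the actual benchmark) might look
like this:

  #include <pthread.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>

  #define NTHREADS 8			/* enough to exceed the 4-CPU threshold */
  #define MAP_SIZE (2UL << 20)

  static void *worker(void *arg)
  {
  	for (;;) {
  		char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
  			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

  		if (p == MAP_FAILED)
  			exit(1);
  		memset(p, 1, MAP_SIZE);		/* fault the pages in */
  		munmap(p, MAP_SIZE);		/* triggers a TLB flush */
  	}
  	return NULL;
  }

  int main(void)
  {
  	pthread_t tid[NTHREADS];

  	for (int i = 0; i < NTHREADS; i++)
  		pthread_create(&tid[i], NULL, worker, NULL);

  	pthread_join(tid[0], NULL);	/* workers never exit; run under timeout(1) */
  	return 0;
  }

Built with "gcc -O2 -pthread", such a process quickly becomes active on many
CPUs, so after a short while its flushes should go through
broadcast_tlb_flush() rather than IPIs.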


* [tip: x86/core] x86/mm: Add global ASID process exit helpers
  2025-02-26  3:00 ` [PATCH v14 09/13] x86/mm: global ASID process exit helpers Rik van Riel
  2025-03-02 12:38   ` Borislav Petkov
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Add " tip-bot2 for Rik van Riel
@ 2025-03-19 11:04   ` tip-bot2 for Rik van Riel
  2 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-19 11:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Rik van Riel, Borislav Petkov (AMD), Ingo Molnar, x86,
	linux-kernel

The following commit has been merged into the x86/core branch of tip:

Commit-ID:     c9826613a9cbbf889b1fec6b7b943fe88ec2c212
Gitweb:        https://git.kernel.org/tip/c9826613a9cbbf889b1fec6b7b943fe88ec2c212
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 25 Feb 2025 22:00:44 -05:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Mar 2025 11:12:29 +01:00

x86/mm: Add global ASID process exit helpers

A global ASID is allocated for the lifetime of a process. Free the global ASID
at process exit time.

  [ bp: Massage, create helpers, hide details inside them. ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226030129.530345-10-riel@surriel.com
---
 arch/x86/include/asm/mmu_context.h |  8 +++++++-
 arch/x86/include/asm/tlbflush.h    |  9 +++++++++
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index a2c70e4..2398058 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -2,7 +2,6 @@
 #ifndef _ASM_X86_MMU_CONTEXT_H
 #define _ASM_X86_MMU_CONTEXT_H
 
-#include <asm/desc.h>
 #include <linux/atomic.h>
 #include <linux/mm_types.h>
 #include <linux/pkeys.h>
@@ -13,6 +12,7 @@
 #include <asm/paravirt.h>
 #include <asm/debugreg.h>
 #include <asm/gsseg.h>
+#include <asm/desc.h>
 
 extern atomic64_t last_mm_ctx_id;
 
@@ -139,6 +139,9 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
 #define enter_lazy_tlb enter_lazy_tlb
 extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
+#define mm_init_global_asid mm_init_global_asid
+extern void mm_init_global_asid(struct mm_struct *mm);
+
 extern void mm_free_global_asid(struct mm_struct *mm);
 
 /*
@@ -163,6 +166,8 @@ static inline int init_new_context(struct task_struct *tsk,
 		mm->context.execute_only_pkey = -1;
 	}
 #endif
+
+	mm_init_global_asid(mm);
 	mm_reset_untag_mask(mm);
 	init_new_context_ldt(mm);
 	return 0;
@@ -172,6 +177,7 @@ static inline int init_new_context(struct task_struct *tsk,
 static inline void destroy_context(struct mm_struct *mm)
 {
 	destroy_context_ldt(mm);
+	mm_free_global_asid(mm);
 }
 
 extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 1f61a39..e6c3be0 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -261,6 +261,14 @@ static inline u16 mm_global_asid(struct mm_struct *mm)
 	return asid;
 }
 
+static inline void mm_init_global_asid(struct mm_struct *mm)
+{
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+		mm->context.global_asid = 0;
+		mm->context.asid_transition = false;
+	}
+}
+
 static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
 {
 	/*
@@ -281,6 +289,7 @@ static inline bool mm_in_asid_transition(struct mm_struct *mm)
 }
 #else
 static inline u16 mm_global_asid(struct mm_struct *mm) { return 0; }
+static inline void mm_init_global_asid(struct mm_struct *mm) { }
 static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid) { }
 static inline bool mm_in_asid_transition(struct mm_struct *mm) { return false; }
 #endif /* CONFIG_BROADCAST_TLB_FLUSH */


* [tip: x86/core] x86/mm: Handle global ASID context switch and TLB flush
  2025-02-26  3:00 ` [PATCH v14 08/13] x86/mm: global ASID context switch & TLB flush handling Rik van Riel
  2025-03-02  7:58   ` Borislav Petkov
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Handle global ASID context switch and TLB flush tip-bot2 for Rik van Riel
@ 2025-03-19 11:04   ` tip-bot2 for Rik van Riel
  2 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-19 11:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Rik van Riel, Borislav Petkov (AMD), Ingo Molnar, x86,
	linux-kernel

The following commit has been merged into the x86/core branch of tip:

Commit-ID:     be88a1dd6112bbcf40d0fe9da02fb71bfb427cfe
Gitweb:        https://git.kernel.org/tip/be88a1dd6112bbcf40d0fe9da02fb71bfb427cfe
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 25 Feb 2025 22:00:43 -05:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Mar 2025 11:12:29 +01:00

x86/mm: Handle global ASID context switch and TLB flush

Add context switch and TLB flush support for processes that use a global
ASID and PCID across all CPUs.

At both context switch time and TLB flush time, check whether a task is
switching to a global ASID and, if so, reload the TLB with the new ASID as
appropriate.

In both code paths, the TLB flush is avoided if a global ASID is used, because
the global ASIDs are always kept up to date across CPUs, even when the
process is not running on a CPU.

  [ bp:
   - Massage
   - :%s/\<static_cpu_has\>/cpu_feature_enabled/cgi
  ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226030129.530345-9-riel@surriel.com
---
 arch/x86/include/asm/tlbflush.h | 14 ++++++-
 arch/x86/mm/tlb.c               | 77 +++++++++++++++++++++++++++++---
 2 files changed, 84 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index f7b374b..1f61a39 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -240,6 +240,11 @@ static inline bool is_dyn_asid(u16 asid)
 	return asid < TLB_NR_DYN_ASIDS;
 }
 
+static inline bool is_global_asid(u16 asid)
+{
+	return !is_dyn_asid(asid);
+}
+
 #ifdef CONFIG_BROADCAST_TLB_FLUSH
 static inline u16 mm_global_asid(struct mm_struct *mm)
 {
@@ -266,9 +271,18 @@ static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
 	mm->context.asid_transition = true;
 	smp_store_release(&mm->context.global_asid, asid);
 }
+
+static inline bool mm_in_asid_transition(struct mm_struct *mm)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return false;
+
+	return mm && READ_ONCE(mm->context.asid_transition);
+}
 #else
 static inline u16 mm_global_asid(struct mm_struct *mm) { return 0; }
 static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid) { }
+static inline bool mm_in_asid_transition(struct mm_struct *mm) { return false; }
 #endif /* CONFIG_BROADCAST_TLB_FLUSH */
 
 #ifdef CONFIG_PARAVIRT
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6c24d96..b5681e6 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -227,6 +227,20 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 		return;
 	}
 
+	/*
+	 * TLB consistency for global ASIDs is maintained with hardware assisted
+	 * remote TLB flushing. Global ASIDs are always up to date.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+		u16 global_asid = mm_global_asid(next);
+
+		if (global_asid) {
+			*new_asid = global_asid;
+			*need_flush = false;
+			return;
+		}
+	}
+
 	if (this_cpu_read(cpu_tlbstate.invalidate_other))
 		clear_asid_other();
 
@@ -400,6 +414,23 @@ void mm_free_global_asid(struct mm_struct *mm)
 }
 
 /*
+ * Is the mm transitioning from a CPU-local ASID to a global ASID?
+ */
+static bool mm_needs_global_asid(struct mm_struct *mm, u16 asid)
+{
+	u16 global_asid = mm_global_asid(mm);
+
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return false;
+
+	/* Process is transitioning to a global ASID */
+	if (global_asid && asid != global_asid)
+		return true;
+
+	return false;
+}
+
+/*
  * Given an ASID, flush the corresponding user ASID.  We can delay this
  * until the next time we switch to it.
  *
@@ -704,7 +735,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 	 */
 	if (prev == next) {
 		/* Not actually switching mm's */
-		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
+		VM_WARN_ON(is_dyn_asid(prev_asid) &&
+			   this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
 			   next->context.ctx_id);
 
 		/*
@@ -721,6 +753,20 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
 			cpumask_set_cpu(cpu, mm_cpumask(next));
 
+		/* Check if the current mm is transitioning to a global ASID */
+		if (mm_needs_global_asid(next, prev_asid)) {
+			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+			choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
+			goto reload_tlb;
+		}
+
+		/*
+		 * Broadcast TLB invalidation keeps this ASID up to date
+		 * all the time.
+		 */
+		if (is_global_asid(prev_asid))
+			return;
+
 		/*
 		 * If the CPU is not in lazy TLB mode, we are just switching
 		 * from one thread in a process to another thread in the same
@@ -755,6 +801,13 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		cond_mitigation(tsk);
 
 		/*
+		 * Let nmi_uaccess_okay() and finish_asid_transition()
+		 * know that CR3 is changing.
+		 */
+		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
+		barrier();
+
+		/*
 		 * Leave this CPU in prev's mm_cpumask. Atomic writes to
 		 * mm_cpumask can be expensive under contention. The CPU
 		 * will be removed lazily at TLB flush time.
@@ -768,14 +821,12 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 
 		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
-
-		/* Let nmi_uaccess_okay() know that we're changing CR3. */
-		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
-		barrier();
 	}
 
+reload_tlb:
 	new_lam = mm_lam_cr3_mask(next);
 	if (need_flush) {
+		VM_WARN_ON_ONCE(is_global_asid(new_asid));
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
 		load_new_mm_cr3(next->pgd, new_asid, new_lam, true);
@@ -894,7 +945,7 @@ static void flush_tlb_func(void *info)
 	const struct flush_tlb_info *f = info;
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+	u64 local_tlb_gen;
 	bool local = smp_processor_id() == f->initiating_cpu;
 	unsigned long nr_invalidate = 0;
 	u64 mm_tlb_gen;
@@ -917,6 +968,16 @@ static void flush_tlb_func(void *info)
 	if (unlikely(loaded_mm == &init_mm))
 		return;
 
+	/* Reload the ASID if transitioning into or out of a global ASID */
+	if (mm_needs_global_asid(loaded_mm, loaded_mm_asid)) {
+		switch_mm_irqs_off(NULL, loaded_mm, NULL);
+		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	}
+
+	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
+	if (is_global_asid(loaded_mm_asid))
+		return;
+
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
@@ -934,6 +995,8 @@ static void flush_tlb_func(void *info)
 		return;
 	}
 
+	local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+
 	if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
 		     f->new_tlb_gen <= local_tlb_gen)) {
 		/*
@@ -1101,7 +1164,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
 	 * up on the new contents of what used to be page tables, while
 	 * doing a speculative memory access.
 	 */
-	if (info->freed_tables)
+	if (info->freed_tables || mm_in_asid_transition(info->mm))
 		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
 	else
 		on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
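
Condensed to its essence (illustrative only, not the patch itself), the new
ASID choice on context switch amounts to the following short-circuit, using
the helpers introduced earlier in the series:

  static bool try_global_asid(struct mm_struct *next, u16 *new_asid, bool *need_flush)
  {
  	u16 global_asid = mm_global_asid(next);

  	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB) || !global_asid)
  		return false;

  	/* INVLPGB keeps global ASIDs coherent on every CPU: no flush needed. */
  	*new_asid = global_asid;
  	*need_flush = false;
  	return true;
  }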


* [tip: x86/core] x86/mm: Use broadcast TLB flushing in page reclaim
  2025-02-26  3:00 ` [PATCH v14 06/13] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
  2025-02-28 18:57   ` Borislav Petkov
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Use broadcast TLB flushing in page reclaim tip-bot2 for Rik van Riel
@ 2025-03-19 11:04   ` tip-bot2 for Rik van Riel
  2 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-19 11:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Rik van Riel, Borislav Petkov (AMD), Ingo Molnar, x86,
	linux-kernel

The following commit has been merged into the x86/core branch of tip:

Commit-ID:     72a920eacd8ab02a519fde94ef4fdffe8740d84b
Gitweb:        https://git.kernel.org/tip/72a920eacd8ab02a519fde94ef4fdffe8740d84b
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 25 Feb 2025 22:00:41 -05:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Mar 2025 11:12:29 +01:00

x86/mm: Use broadcast TLB flushing in page reclaim

Page reclaim tracks only the CPU(s) where the TLB needs to be flushed, rather
than all the individual mappings that may be getting invalidated.

Use broadcast TLB flushing when that is available.

  [ bp: Massage commit message. ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226030129.530345-7-riel@surriel.com
---
 arch/x86/mm/tlb.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 8cd084b..76b4a88 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1320,7 +1320,9 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+		invlpgb_flush_all_nonglobals();
+	} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
 		flush_tlb_multi(&batch->cpumask, info);
 	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();


* [tip: x86/core] x86/mm: Add global ASID allocation helper functions
  2025-02-26  3:00 ` [PATCH v14 07/13] x86/mm: add global ASID allocation helper functions Rik van Riel
  2025-03-02  7:06   ` Borislav Petkov
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Add " tip-bot2 for Rik van Riel
@ 2025-03-19 11:04   ` tip-bot2 for Rik van Riel
  2 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-19 11:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Rik van Riel, Borislav Petkov (AMD), Ingo Molnar, x86,
	linux-kernel

The following commit has been merged into the x86/core branch of tip:

Commit-ID:     d504d1247e369e82c30588b296cc45a85a1ecc12
Gitweb:        https://git.kernel.org/tip/d504d1247e369e82c30588b296cc45a85a1ecc12
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 25 Feb 2025 22:00:42 -05:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Mar 2025 11:12:29 +01:00

x86/mm: Add global ASID allocation helper functions

Add functions to manage global ASID space. Multithreaded processes that are
simultaneously active on 4 or more CPUs can get a global ASID, resulting in the
same PCID being used for that process on every CPU.

This in turn will allow the kernel to use hardware-assisted TLB flushing
through AMD INVLPGB or Intel RAR for these processes.

  [ bp:
   - Extend use_global_asid() comment
   - s/X86_BROADCAST_TLB_FLUSH/BROADCAST_TLB_FLUSH/g
   - other touchups ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226030129.530345-8-riel@surriel.com
---
 arch/x86/include/asm/mmu.h         |  12 ++-
 arch/x86/include/asm/mmu_context.h |   2 +-
 arch/x86/include/asm/tlbflush.h    |  37 +++++++-
 arch/x86/mm/tlb.c                  | 154 +++++++++++++++++++++++++++-
 4 files changed, 202 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 3b496cd..8b8055a 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -69,6 +69,18 @@ typedef struct {
 	u16 pkey_allocation_map;
 	s16 execute_only_pkey;
 #endif
+
+#ifdef CONFIG_BROADCAST_TLB_FLUSH
+	/*
+	 * The global ASID will be a non-zero value when the process has
+	 * the same ASID across all CPUs, allowing it to make use of
+	 * hardware-assisted remote TLB invalidation like AMD INVLPGB.
+	 */
+	u16 global_asid;
+
+	/* The process is transitioning to a new global ASID number. */
+	bool asid_transition;
+#endif
 } mm_context_t;
 
 #define INIT_MM_CONTEXT(mm)						\
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 795fdd5..a2c70e4 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -139,6 +139,8 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
 #define enter_lazy_tlb enter_lazy_tlb
 extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
+extern void mm_free_global_asid(struct mm_struct *mm);
+
 /*
  * Init a new mm.  Used on mm copies, like at fork()
  * and on mm's that are brand-new, like at execve().
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 855c13d..f7b374b 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -6,6 +6,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/sched.h>
 
+#include <asm/barrier.h>
 #include <asm/processor.h>
 #include <asm/cpufeature.h>
 #include <asm/special_insns.h>
@@ -234,6 +235,42 @@ void flush_tlb_one_kernel(unsigned long addr);
 void flush_tlb_multi(const struct cpumask *cpumask,
 		      const struct flush_tlb_info *info);
 
+static inline bool is_dyn_asid(u16 asid)
+{
+	return asid < TLB_NR_DYN_ASIDS;
+}
+
+#ifdef CONFIG_BROADCAST_TLB_FLUSH
+static inline u16 mm_global_asid(struct mm_struct *mm)
+{
+	u16 asid;
+
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return 0;
+
+	asid = smp_load_acquire(&mm->context.global_asid);
+
+	/* mm->context.global_asid is either 0, or a global ASID */
+	VM_WARN_ON_ONCE(asid && is_dyn_asid(asid));
+
+	return asid;
+}
+
+static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
+{
+	/*
+	 * Notably flush_tlb_mm_range() -> broadcast_tlb_flush() ->
+	 * finish_asid_transition() needs to observe asid_transition = true
+	 * once it observes global_asid.
+	 */
+	mm->context.asid_transition = true;
+	smp_store_release(&mm->context.global_asid, asid);
+}
+#else
+static inline u16 mm_global_asid(struct mm_struct *mm) { return 0; }
+static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid) { }
+#endif /* CONFIG_BROADCAST_TLB_FLUSH */
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 76b4a88..6c24d96 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -74,13 +74,15 @@
  * use different names for each of them:
  *
  * ASID  - [0, TLB_NR_DYN_ASIDS-1]
- *         the canonical identifier for an mm
+ *         the canonical identifier for an mm, dynamically allocated on each CPU
+ *         [TLB_NR_DYN_ASIDS, MAX_ASID_AVAILABLE-1]
+ *         the canonical, global identifier for an mm, identical across all CPUs
  *
- * kPCID - [1, TLB_NR_DYN_ASIDS]
+ * kPCID - [1, MAX_ASID_AVAILABLE]
  *         the value we write into the PCID part of CR3; corresponds to the
  *         ASID+1, because PCID 0 is special.
  *
- * uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
+ * uPCID - [2048 + 1, 2048 + MAX_ASID_AVAILABLE]
  *         for KPTI each mm has two address spaces and thus needs two
  *         PCID values, but we can still do with a single ASID denomination
  *         for each mm. Corresponds to kPCID + 2048.
@@ -252,6 +254,152 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 }
 
 /*
+ * Global ASIDs are allocated for multi-threaded processes that are
+ * active on multiple CPUs simultaneously, giving each of those
+ * processes the same PCID on every CPU, for use with hardware-assisted
+ * TLB shootdown on remote CPUs, like AMD INVLPGB or Intel RAR.
+ *
+ * These global ASIDs are held for the lifetime of the process.
+ */
+static DEFINE_RAW_SPINLOCK(global_asid_lock);
+static u16 last_global_asid = MAX_ASID_AVAILABLE;
+static DECLARE_BITMAP(global_asid_used, MAX_ASID_AVAILABLE);
+static DECLARE_BITMAP(global_asid_freed, MAX_ASID_AVAILABLE);
+static int global_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
+
+/*
+ * When the search for a free ASID in the global ASID space reaches
+ * MAX_ASID_AVAILABLE, a global TLB flush guarantees that previously
+ * freed global ASIDs are safe to re-use.
+ *
+ * This way the global flush only needs to happen at ASID rollover
+ * time, and not at ASID allocation time.
+ */
+static void reset_global_asid_space(void)
+{
+	lockdep_assert_held(&global_asid_lock);
+
+	invlpgb_flush_all_nonglobals();
+
+	/*
+	 * The TLB flush above makes it safe to re-use the previously
+	 * freed global ASIDs.
+	 */
+	bitmap_andnot(global_asid_used, global_asid_used,
+			global_asid_freed, MAX_ASID_AVAILABLE);
+	bitmap_clear(global_asid_freed, 0, MAX_ASID_AVAILABLE);
+
+	/* Restart the search from the start of global ASID space. */
+	last_global_asid = TLB_NR_DYN_ASIDS;
+}
+
+static u16 allocate_global_asid(void)
+{
+	u16 asid;
+
+	lockdep_assert_held(&global_asid_lock);
+
+	/* The previous allocation hit the edge of available address space */
+	if (last_global_asid >= MAX_ASID_AVAILABLE - 1)
+		reset_global_asid_space();
+
+	asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, last_global_asid);
+
+	if (asid >= MAX_ASID_AVAILABLE && !global_asid_available) {
+		/* This should never happen. */
+		VM_WARN_ONCE(1, "Unable to allocate global ASID despite %d available\n",
+				global_asid_available);
+		return 0;
+	}
+
+	/* Claim this global ASID. */
+	__set_bit(asid, global_asid_used);
+	last_global_asid = asid;
+	global_asid_available--;
+	return asid;
+}
+
+/*
+ * Check whether a process is currently active on more than @threshold CPUs.
+ * This is a cheap estimation on whether or not it may make sense to assign
+ * a global ASID to this process, and use broadcast TLB invalidation.
+ */
+static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
+{
+	int count = 0;
+	int cpu;
+
+	/* This quick check should eliminate most single threaded programs. */
+	if (cpumask_weight(mm_cpumask(mm)) <= threshold)
+		return false;
+
+	/* Slower check to make sure. */
+	for_each_cpu(cpu, mm_cpumask(mm)) {
+		/* Skip the CPUs that aren't really running this process. */
+		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
+			continue;
+
+		if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu))
+			continue;
+
+		if (++count > threshold)
+			return true;
+	}
+	return false;
+}
+
+/*
+ * Assign a global ASID to the current process, protecting against
+ * races between multiple threads in the process.
+ */
+static void use_global_asid(struct mm_struct *mm)
+{
+	u16 asid;
+
+	guard(raw_spinlock_irqsave)(&global_asid_lock);
+
+	/* This process is already using broadcast TLB invalidation. */
+	if (mm_global_asid(mm))
+		return;
+
+	/*
+	 * The last global ASID was consumed while waiting for the lock.
+	 *
+	 * If this fires, a more aggressive ASID reuse scheme might be
+	 * needed.
+	 */
+	if (!global_asid_available) {
+		VM_WARN_ONCE(1, "Ran out of global ASIDs\n");
+		return;
+	}
+
+	asid = allocate_global_asid();
+	if (!asid)
+		return;
+
+	mm_assign_global_asid(mm, asid);
+}
+
+void mm_free_global_asid(struct mm_struct *mm)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return;
+
+	if (!mm_global_asid(mm))
+		return;
+
+	guard(raw_spinlock_irqsave)(&global_asid_lock);
+
+	/* The global ASID can be re-used only after flush at wrap-around. */
+#ifdef CONFIG_BROADCAST_TLB_FLUSH
+	__set_bit(mm->context.global_asid, global_asid_freed);
+
+	mm->context.global_asid = 0;
+	global_asid_available++;
+#endif
+}
+
+/*
  * Given an ASID, flush the corresponding user ASID.  We can delay this
  * until the next time we switch to it.
  *
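
As a rough sanity check of the pool size implied above (assuming the usual
constants, MAX_ASID_AVAILABLE of 4096, or 2048 with KPTI, and
TLB_NR_DYN_ASIDS of 6):

  /* Matches global_asid_available's initializer above; numbers are illustrative. */
  int global_asids = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
  /* 4096 - 6 - 1 = 4089 global ASIDs (2048 - 6 - 1 = 2041 with KPTI) */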


* [tip: x86/core] x86/mm: Use INVLPGB for kernel TLB flushes
  2025-02-26  3:00 ` [PATCH v14 04/13] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
                     ` (2 preceding siblings ...)
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Use " tip-bot2 for Rik van Riel
@ 2025-03-19 11:04   ` tip-bot2 for Rik van Riel
  3 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-19 11:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Rik van Riel, Borislav Petkov (AMD), Ingo Molnar, x86,
	linux-kernel

The following commit has been merged into the x86/core branch of tip:

Commit-ID:     82378c6c2f435dba66145609de16bf44a9de6303
Gitweb:        https://git.kernel.org/tip/82378c6c2f435dba66145609de16bf44a9de6303
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 25 Feb 2025 22:00:39 -05:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Mar 2025 11:12:29 +01:00

x86/mm: Use INVLPGB for kernel TLB flushes

Use broadcast TLB invalidation for kernel addresses when available.
Remove the need to send IPIs for kernel TLB flushes.

   [ bp: Integrate dhansen's comments additions, merge the
     flush_tlb_all() change into this one too. ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226030129.530345-5-riel@surriel.com
---
 arch/x86/mm/tlb.c | 48 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index dbcb5c9..8cd084b 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1064,7 +1064,6 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
 }
 
-
 static void do_flush_tlb_all(void *info)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
@@ -1074,7 +1073,32 @@ static void do_flush_tlb_all(void *info)
 void flush_tlb_all(void)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
-	on_each_cpu(do_flush_tlb_all, NULL, 1);
+
+	/* First try (faster) hardware-assisted TLB invalidation. */
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		invlpgb_flush_all();
+	else
+		/* Fall back to the IPI-based invalidation. */
+		on_each_cpu(do_flush_tlb_all, NULL, 1);
+}
+
+/* Flush an arbitrarily large range of memory with INVLPGB. */
+static void invlpgb_kernel_range_flush(struct flush_tlb_info *info)
+{
+	unsigned long addr, nr;
+
+	for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
+		nr = (info->end - addr) >> PAGE_SHIFT;
+
+		/*
+		 * INVLPGB has a limit on the size of ranges it can
+		 * flush. Break up large flushes.
+		 */
+		nr = clamp_val(nr, 1, invlpgb_count_max);
+
+		invlpgb_flush_addr_nosync(addr, nr);
+	}
+	__tlbsync();
 }
 
 static void do_kernel_range_flush(void *info)
@@ -1087,6 +1111,22 @@ static void do_kernel_range_flush(void *info)
 		flush_tlb_one_kernel(addr);
 }
 
+static void kernel_tlb_flush_all(struct flush_tlb_info *info)
+{
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		invlpgb_flush_all();
+	else
+		on_each_cpu(do_flush_tlb_all, NULL, 1);
+}
+
+static void kernel_tlb_flush_range(struct flush_tlb_info *info)
+{
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		invlpgb_kernel_range_flush(info);
+	else
+		on_each_cpu(do_kernel_range_flush, info, 1);
+}
+
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
 	struct flush_tlb_info *info;
@@ -1097,9 +1137,9 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 				  TLB_GENERATION_INVALID);
 
 	if (info->end == TLB_FLUSH_ALL)
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
+		kernel_tlb_flush_all(info);
 	else
-		on_each_cpu(do_kernel_range_flush, info, 1);
+		kernel_tlb_flush_range(info);
 
 	put_flush_tlb_info();
 }
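
A small worked example of the chunking loop above (the numbers are
illustrative; invlpgb_count_max comes from CPUID):

  /* Flushing a 100-page kernel range with invlpgb_count_max == 8 issues
   * DIV_ROUND_UP(100, 8) == 13 INVLPGB operations (twelve of 8 pages and
   * one of 4), followed by a single TLBSYNC that waits for all of them.
   */
  unsigned long pages = 100, max = 8;
  unsigned long ops = DIV_ROUND_UP(pages, max);	/* 13 */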


* [tip: x86/core] x86/mm: Add INVLPGB support code
  2025-02-26  3:00 ` [PATCH v14 03/13] x86/mm: add INVLPGB support code Rik van Riel
                     ` (3 preceding siblings ...)
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Add " tip-bot2 for Rik van Riel
@ 2025-03-19 11:04   ` tip-bot2 for Rik van Riel
  4 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-19 11:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Rik van Riel, Borislav Petkov (AMD), Ingo Molnar, x86,
	linux-kernel

The following commit has been merged into the x86/core branch of tip:

Commit-ID:     b7aa05cbdc52d61119b0e736bb3e288735f860fe
Gitweb:        https://git.kernel.org/tip/b7aa05cbdc52d61119b0e736bb3e288735f860fe
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Fri, 28 Feb 2025 20:32:30 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Mar 2025 11:12:25 +01:00

x86/mm: Add INVLPGB support code

Add helper functions and definitions needed to use broadcast TLB
invalidation on AMD CPUs.

  [ bp:
      - Cleanup commit message
      - Improve and expand comments
      - push the preemption guards inside the invlpgb* helpers
      - merge improvements from dhansen
      - add !CONFIG_BROADCAST_TLB_FLUSH function stubs because Clang
	can't do DCE properly yet and looks at the inline asm and
	complains about it getting a u64 argument on 32-bit code ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226030129.530345-4-riel@surriel.com
---
 arch/x86/include/asm/tlb.h | 132 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 132 insertions(+)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 77f52bc..31f6db4 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -6,6 +6,9 @@
 static inline void tlb_flush(struct mmu_gather *tlb);
 
 #include <asm-generic/tlb.h>
+#include <linux/kernel.h>
+#include <vdso/bits.h>
+#include <vdso/page.h>
 
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
@@ -25,4 +28,133 @@ static inline void invlpg(unsigned long addr)
 	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
 }
 
+enum addr_stride {
+	PTE_STRIDE = 0,
+	PMD_STRIDE = 1
+};
+
+#ifdef CONFIG_BROADCAST_TLB_FLUSH
+/*
+ * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
+ *
+ * The INVLPGB instruction is weakly ordered, and a batch of invalidations can
+ * be done in a parallel fashion.
+ *
+ * The instruction takes the number of extra pages to invalidate, beyond
+ * the first page, while __invlpgb gets the more human readable number of
+ * pages to invalidate.
+ *
+ * The bits in rax[0:2] determine respectively which components of the address
+ * (VA, PCID, ASID) get compared when flushing. If neither bits are set, *any*
+ * address in the specified range matches.
+ *
+ * TLBSYNC is used to ensure that pending INVLPGB invalidations initiated from
+ * this CPU have completed.
+ */
+static inline void __invlpgb(unsigned long asid, unsigned long pcid,
+			     unsigned long addr, u16 nr_pages,
+			     enum addr_stride stride, u8 flags)
+{
+	u32 edx = (pcid << 16) | asid;
+	u32 ecx = (stride << 31) | (nr_pages - 1);
+	u64 rax = addr | flags;
+
+	/* The low bits in rax are for flags. Verify addr is clean. */
+	VM_WARN_ON_ONCE(addr & ~PAGE_MASK);
+
+	/* INVLPGB; supported in binutils >= 2.36. */
+	asm volatile(".byte 0x0f, 0x01, 0xfe" :: "a" (rax), "c" (ecx), "d" (edx));
+}
+
+static inline void __invlpgb_all(unsigned long asid, unsigned long pcid, u8 flags)
+{
+	__invlpgb(asid, pcid, 0, 1, 0, flags);
+}
+
+static inline void __tlbsync(void)
+{
+	/*
+	 * TLBSYNC waits for INVLPGB instructions originating on the same CPU
+	 * to have completed. Print a warning if the task has been migrated,
+	 * and might not be waiting on all the INVLPGBs issued during this TLB
+	 * invalidation sequence.
+	 */
+	cant_migrate();
+
+	/* TLBSYNC: supported in binutils >= 2.36. */
+	asm volatile(".byte 0x0f, 0x01, 0xff" ::: "memory");
+}
+#else
+/* Some compilers (I'm looking at you clang!) simply can't do DCE */
+static inline void __invlpgb(unsigned long asid, unsigned long pcid,
+			     unsigned long addr, u16 nr_pages,
+			     enum addr_stride s, u8 flags) { }
+static inline void __invlpgb_all(unsigned long asid, unsigned long pcid, u8 flags) { }
+static inline void __tlbsync(void) { }
+#endif
+
+/*
+ * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination
+ * of the three. For example:
+ * - FLAG_VA | FLAG_INCLUDE_GLOBAL: invalidate all TLB entries at the address
+ * - FLAG_PCID:			    invalidate all TLB entries matching the PCID
+ *
+ * The first is used to invalidate (kernel) mappings at a particular
+ * address across all processes.
+ *
+ * The latter invalidates all TLB entries matching a PCID.
+ */
+#define INVLPGB_FLAG_VA			BIT(0)
+#define INVLPGB_FLAG_PCID		BIT(1)
+#define INVLPGB_FLAG_ASID		BIT(2)
+#define INVLPGB_FLAG_INCLUDE_GLOBAL	BIT(3)
+#define INVLPGB_FLAG_FINAL_ONLY		BIT(4)
+#define INVLPGB_FLAG_INCLUDE_NESTED	BIT(5)
+
+/* The implied mode when all bits are clear: */
+#define INVLPGB_MODE_ALL_NONGLOBALS	0UL
+
+static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
+						unsigned long addr,
+						u16 nr, bool stride)
+{
+	enum addr_stride str = stride ? PMD_STRIDE : PTE_STRIDE;
+	u8 flags = INVLPGB_FLAG_PCID | INVLPGB_FLAG_VA;
+
+	__invlpgb(0, pcid, addr, nr, str, flags);
+}
+
+/* Flush all mappings for a given PCID, not including globals. */
+static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
+{
+	__invlpgb_all(0, pcid, INVLPGB_FLAG_PCID);
+}
+
+/* Flush all mappings, including globals, for all PCIDs. */
+static inline void invlpgb_flush_all(void)
+{
+	/*
+	 * TLBSYNC at the end needs to make sure all flushes done on the
+	 * current CPU have been executed system-wide. Therefore, make
+	 * sure nothing gets migrated in-between but disable preemption
+	 * as it is cheaper.
+	 */
+	guard(preempt)();
+	__invlpgb_all(0, 0, INVLPGB_FLAG_INCLUDE_GLOBAL);
+	__tlbsync();
+}
+
+/* Flush addr, including globals, for all PCIDs. */
+static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
+{
+	__invlpgb(0, 0, addr, nr, PTE_STRIDE, INVLPGB_FLAG_INCLUDE_GLOBAL);
+}
+
+/* Flush all mappings for all PCIDs except globals. */
+static inline void invlpgb_flush_all_nonglobals(void)
+{
+	guard(preempt)();
+	__invlpgb_all(0, 0, INVLPGB_MODE_ALL_NONGLOBALS);
+	__tlbsync();
+}
 #endif /* _ASM_X86_TLB_H */
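
To show how the pieces above are meant to be combined, here is an illustrative
sketch only (later patches in the series do the real thing in
broadcast_tlb_flush()): queue one or more *_nosync invalidations, then wait
once with __tlbsync(), all without migrating off the issuing CPU:

  static void example_flush_two_ranges(unsigned long pcid,
  				     unsigned long addr1, unsigned long addr2)
  {
  	/* TLBSYNC only waits for INVLPGBs issued from this CPU. */
  	guard(preempt)();

  	invlpgb_flush_user_nr_nosync(pcid, addr1, 4, false);	/* four 4K pages */
  	invlpgb_flush_user_nr_nosync(pcid, addr2, 1, false);	/* one 4K page  */
  	__tlbsync();
  }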


* [tip: x86/core] x86/mm: Add INVLPGB feature and Kconfig entry
  2025-02-26  3:00 ` [PATCH v14 02/13] x86/mm: get INVLPGB count max from CPUID Rik van Riel
                     ` (2 preceding siblings ...)
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Add INVLPGB feature and Kconfig entry tip-bot2 for Rik van Riel
@ 2025-03-19 11:04   ` tip-bot2 for Rik van Riel
  3 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-19 11:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Rik van Riel, Borislav Petkov (AMD), Ingo Molnar, x86,
	linux-kernel

The following commit has been merged into the x86/core branch of tip:

Commit-ID:     767ae437a32d644786c0779d0d54492ff9cbe574
Gitweb:        https://git.kernel.org/tip/767ae437a32d644786c0779d0d54492ff9cbe574
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Wed, 19 Mar 2025 11:08:26 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Mar 2025 11:08:52 +01:00

x86/mm: Add INVLPGB feature and Kconfig entry

In addition to the INVLPGB feature bit, the CPU advertises in CPUID the
maximum number of pages that can be shot down with one INVLPGB
instruction. Save that information for later use.

  [ bp: use cpu_has(), typos, massage. ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226030129.530345-3-riel@surriel.com
---
 arch/x86/Kconfig.cpu                     | 4 ++++
 arch/x86/include/asm/cpufeatures.h       | 1 +
 arch/x86/include/asm/disabled-features.h | 8 +++++++-
 arch/x86/include/asm/tlbflush.h          | 3 +++
 arch/x86/kernel/cpu/amd.c                | 6 ++++++
 5 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index f8b3296..753b876 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -334,6 +334,10 @@ menuconfig PROCESSOR_SELECT
 	  This lets you choose what x86 vendor support code your kernel
 	  will include.
 
+config BROADCAST_TLB_FLUSH
+	def_bool y
+	depends on CPU_SUP_AMD && 64BIT
+
 config CPU_SUP_INTEL
 	default y
 	bool "Support Intel processors" if PROCESSOR_SELECT
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 508c0da..8770dc1 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -338,6 +338,7 @@
 #define X86_FEATURE_CLZERO		(13*32+ 0) /* "clzero" CLZERO instruction */
 #define X86_FEATURE_IRPERF		(13*32+ 1) /* "irperf" Instructions Retired Count */
 #define X86_FEATURE_XSAVEERPTR		(13*32+ 2) /* "xsaveerptr" Always save/restore FP error pointers */
+#define X86_FEATURE_INVLPGB		(13*32+ 3) /* INVLPGB and TLBSYNC instructions supported */
 #define X86_FEATURE_RDPRU		(13*32+ 4) /* "rdpru" Read processor register at user level */
 #define X86_FEATURE_WBNOINVD		(13*32+ 9) /* "wbnoinvd" WBNOINVD instruction */
 #define X86_FEATURE_AMD_IBPB		(13*32+12) /* Indirect Branch Prediction Barrier */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index c492bdc..be8c388 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -129,6 +129,12 @@
 #define DISABLE_SEV_SNP		(1 << (X86_FEATURE_SEV_SNP & 31))
 #endif
 
+#ifdef CONFIG_BROADCAST_TLB_FLUSH
+#define DISABLE_INVLPGB		0
+#else
+#define DISABLE_INVLPGB		(1 << (X86_FEATURE_INVLPGB & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -146,7 +152,7 @@
 #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
 			 DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
 #define DISABLED_MASK12	(DISABLE_FRED|DISABLE_LAM)
-#define DISABLED_MASK13	0
+#define DISABLED_MASK13	(DISABLE_INVLPGB)
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3da6451..855c13d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -183,6 +183,9 @@ static inline void cr4_init_shadow(void)
 extern unsigned long mmu_cr4_features;
 extern u32 *trampoline_cr4_features;
 
+/* How many pages can be invalidated with one INVLPGB. */
+extern u16 invlpgb_count_max;
+
 extern void initialize_tlbstate_and_flush(void);
 
 /*
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 3157664..351c030 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -29,6 +29,8 @@
 
 #include "cpu.h"
 
+u16 invlpgb_count_max __ro_after_init;
+
 static inline int rdmsrl_amd_safe(unsigned msr, unsigned long long *p)
 {
 	u32 gprs[8] = { 0 };
@@ -1139,6 +1141,10 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
 		tlb_lli_2m = eax & mask;
 
 	tlb_lli_4m = tlb_lli_2m >> 1;
+
+	/* Max number of pages INVLPGB can invalidate in one shot */
+	if (cpu_has(c, X86_FEATURE_INVLPGB))
+		invlpgb_count_max = (cpuid_edx(0x80000008) & 0xffff) + 1;
 }
 
 static const struct cpu_dev amd_cpu_dev = {
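
A minimal user-space sketch, for illustration only (not part of the patch),
that reads the same CPUID leaf the amd.c hunk above consumes. It assumes
GCC/Clang's <cpuid.h>; the EBX bit position and the +1 adjustment simply
mirror the cpufeatures.h and amd.c changes in this diff:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx))
                return 1;

        /* EBX bit 3 matches X86_FEATURE_INVLPGB (word 13, bit 3) above. */
        if (!(ebx & (1u << 3))) {
                printf("INVLPGB not supported\n");
                return 0;
        }

        /* EDX[15:0] holds the count that amd.c above turns into invlpgb_count_max. */
        printf("invlpgb_count_max = %u\n", (edx & 0xffff) + 1);
        return 0;
}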

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [tip: x86/core] x86/mm: Consolidate full flush threshold decision
  2025-02-26  3:00 ` [PATCH v14 01/13] x86/mm: consolidate full flush threshold decision Rik van Riel
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Consolidate " tip-bot2 for Rik van Riel
@ 2025-03-19 11:04   ` tip-bot2 for Rik van Riel
  2025-09-02 15:44   ` [BUG] x86/mm: regression after 4a02ed8e1cc3 Giovanni Cabiddu
  2 siblings, 0 replies; 86+ messages in thread
From: tip-bot2 for Rik van Riel @ 2025-03-19 11:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Dave Hansen, Rik van Riel, Borislav Petkov (AMD), Ingo Molnar,
	x86, linux-kernel

The following commit has been merged into the x86/core branch of tip:

Commit-ID:     4a02ed8e1cc33acd04d8f5b89751d3bbb6be35d8
Gitweb:        https://git.kernel.org/tip/4a02ed8e1cc33acd04d8f5b89751d3bbb6be35d8
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 25 Feb 2025 22:00:36 -05:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Mar 2025 11:08:07 +01:00

x86/mm: Consolidate full flush threshold decision

Reduce code duplication by consolidating the decision point for whether to do
individual invalidations or a full flush inside get_flush_tlb_info().

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lore.kernel.org/r/20250226030129.530345-2-riel@surriel.com
---
 arch/x86/mm/tlb.c | 41 +++++++++++++++++++----------------------
 1 file changed, 19 insertions(+), 22 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ffc25b3..dbcb5c9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1000,6 +1000,15 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
 	BUG_ON(this_cpu_inc_return(flush_tlb_info_idx) != 1);
 #endif
 
+	/*
+	 * If the number of flushes is so large that a full flush
+	 * would be faster, do a full flush.
+	 */
+	if ((end - start) >> stride_shift > tlb_single_page_flush_ceiling) {
+		start = 0;
+		end = TLB_FLUSH_ALL;
+	}
+
 	info->start		= start;
 	info->end		= end;
 	info->mm		= mm;
@@ -1026,17 +1035,8 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				bool freed_tables)
 {
 	struct flush_tlb_info *info;
+	int cpu = get_cpu();
 	u64 new_tlb_gen;
-	int cpu;
-
-	cpu = get_cpu();
-
-	/* Should we flush just the requested range? */
-	if ((end == TLB_FLUSH_ALL) ||
-	    ((end - start) >> stride_shift) > tlb_single_page_flush_ceiling) {
-		start = 0;
-		end = TLB_FLUSH_ALL;
-	}
 
 	/* This is also a barrier that synchronizes with switch_mm(). */
 	new_tlb_gen = inc_mm_tlb_gen(mm);
@@ -1089,22 +1089,19 @@ static void do_kernel_range_flush(void *info)
 
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
-	/* Balance as user space task's flush, a bit conservative */
-	if (end == TLB_FLUSH_ALL ||
-	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
-	} else {
-		struct flush_tlb_info *info;
+	struct flush_tlb_info *info;
+
+	guard(preempt)();
 
-		preempt_disable();
-		info = get_flush_tlb_info(NULL, start, end, 0, false,
-					  TLB_GENERATION_INVALID);
+	info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
+				  TLB_GENERATION_INVALID);
 
+	if (info->end == TLB_FLUSH_ALL)
+		on_each_cpu(do_flush_tlb_all, NULL, 1);
+	else
 		on_each_cpu(do_kernel_range_flush, info, 1);
 
-		put_flush_tlb_info();
-		preempt_enable();
-	}
+	put_flush_tlb_info();
 }
 
 /*

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [BUG] x86/mm: regression after 4a02ed8e1cc3
  2025-02-26  3:00 ` [PATCH v14 01/13] x86/mm: consolidate full flush threshold decision Rik van Riel
  2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Consolidate " tip-bot2 for Rik van Riel
  2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
@ 2025-09-02 15:44   ` Giovanni Cabiddu
  2025-09-02 15:50     ` Dave Hansen
                       ` (2 more replies)
  2 siblings, 3 replies; 86+ messages in thread
From: Giovanni Cabiddu @ 2025-09-02 15:44 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
	nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm,
	jackmanb, jannh, mhklinux, andrew.cooper3, Manali.Shukla, mingo,
	Dave Hansen, baolu.lu, david.guckian, damian.muszynski

On Tue, Feb 25, 2025 at 10:00:36PM -0500, Rik van Riel wrote:
> Reduce code duplication by consolidating the decision point
> for whether to do individual invalidations or a full flush
> inside get_flush_tlb_info.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Acked-by: Dave Hansen <dave.hansen@intel.com>
> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
> ---
After 4a02ed8e1cc3 ("x86/mm: Consolidate full flush threshold
decision"), we've seen data corruption in DMA'd buffers when testing SVA.

From our preliminary analysis, it appears that get_flush_tlb_info()
modifies the start and end parameters for full TLB flushes (setting
start=0, end=TLB_FLUSH_ALL). However, the MMU notifier call at the end
of the function still uses the original parameters instead of the
updated info->start and info->end.

The change below appears to solve the problem; however, we are not sure
whether this is the right way to fix it.

----8<----
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 39f80111e6f1..e66c7662c254 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1459,7 +1459,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 
 	put_flush_tlb_info();
 	put_cpu();
-	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, info->start, info->end);
 }
 
 static void do_flush_tlb_all(void *info)
-- 
2.51.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [BUG] x86/mm: regression after 4a02ed8e1cc3
  2025-09-02 15:44   ` [BUG] x86/mm: regression after 4a02ed8e1cc3 Giovanni Cabiddu
@ 2025-09-02 15:50     ` Dave Hansen
  2025-09-02 16:08       ` Nadav Amit
  2025-09-03 14:00       ` Rik van Riel
  2025-09-02 16:05     ` Jann Horn
  2025-09-02 16:31     ` Jann Horn
  2 siblings, 2 replies; 86+ messages in thread
From: Dave Hansen @ 2025-09-02 15:50 UTC (permalink / raw)
  To: Giovanni Cabiddu, Rik van Riel
  Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
	nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm,
	jackmanb, jannh, mhklinux, andrew.cooper3, Manali.Shukla, mingo,
	baolu.lu, david.guckian, damian.muszynski

On 9/2/25 08:44, Giovanni Cabiddu wrote:
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 39f80111e6f1..e66c7662c254 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1459,7 +1459,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>  
>  	put_flush_tlb_info();
>  	put_cpu();
> -	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
> +	mmu_notifier_arch_invalidate_secondary_tlbs(mm, info->start, info->end);
>  }

That does look like the right solution.

This is the downside of wrapping everything up in that 'info' struct;
it's not obvious that the canonical source of the start/end information
moved from those variables into the structure.

Rik, is that your read on it too?

In any case, Giovanni, do you want to send that as a "real" patch that
we can apply (changelog, SoB, Fixes, Cc:stable@, etc...)?

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [BUG] x86/mm: regression after 4a02ed8e1cc3
  2025-09-02 15:44   ` [BUG] x86/mm: regression after 4a02ed8e1cc3 Giovanni Cabiddu
  2025-09-02 15:50     ` Dave Hansen
@ 2025-09-02 16:05     ` Jann Horn
  2025-09-02 16:13       ` Jann Horn
  2025-09-03 14:18       ` Nadav Amit
  2025-09-02 16:31     ` Jann Horn
  2 siblings, 2 replies; 86+ messages in thread
From: Jann Horn @ 2025-09-02 16:05 UTC (permalink / raw)
  To: Giovanni Cabiddu
  Cc: Rik van Riel, x86, linux-kernel, bp, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jackmanb, mhklinux, andrew.cooper3, Manali.Shukla, mingo,
	Dave Hansen, baolu.lu, david.guckian, damian.muszynski

On Tue, Sep 2, 2025 at 5:44 PM Giovanni Cabiddu
<giovanni.cabiddu@intel.com> wrote:
> On Tue, Feb 25, 2025 at 10:00:36PM -0500, Rik van Riel wrote:
> > Reduce code duplication by consolidating the decision point
> > for whether to do individual invalidations or a full flush
> > inside get_flush_tlb_info.
> >
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > Tested-by: Michael Kelley <mhklinux@outlook.com>
> > Acked-by: Dave Hansen <dave.hansen@intel.com>
> > Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
> > ---
> After 4a02ed8e1cc3 ("x86/mm: Consolidate full flush threshold
> decision"), we've seen data corruption in DMAd buffers when testing SVA.
>
> From our preliminary analysis, it appears that get_flush_tlb_info()
> modifies the start and end parameters for full TLB flushes (setting
> start=0, end=TLB_FLUSH_ALL). However, the MMU notifier call at the end
> of the function still uses the original parameters instead of the
> updated info->start and info->end.
>
> The change below appears to solve the problem, however we are not sure if
> this is the right way to fix the problem.
>
> ----8<----
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 39f80111e6f1..e66c7662c254 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1459,7 +1459,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>
>         put_flush_tlb_info();
>         put_cpu();
> -       mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
> +       mmu_notifier_arch_invalidate_secondary_tlbs(mm, info->start, info->end);
>  }

I don't see why the IOMMU flush should be broadened just because the
CPU flush got broadened.

On x86, IOMMU flushes happen from arch_tlbbatch_add_pending() and
flush_tlb_mm_range(); the IOMMU folks might know better, but as far as
I know, there is nothing that elides IOMMU flushes depending on the
state of x86-internal flush generation tracking or the like.

To me this looks like a change that is correct but makes it easier to
hit IOMMU flushing issues in other places.

Are you encountering these issues on an Intel system or an AMD system?

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [BUG] x86/mm: regression after 4a02ed8e1cc3
  2025-09-02 15:50     ` Dave Hansen
@ 2025-09-02 16:08       ` Nadav Amit
  2025-09-02 16:11         ` Dave Hansen
  2025-09-03 14:00       ` Rik van Riel
  1 sibling, 1 reply; 86+ messages in thread
From: Nadav Amit @ 2025-09-02 16:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Giovanni Cabiddu, Rik van Riel, the arch/x86 maintainers,
	Linux Kernel Mailing List, Borislav Petkov, peterz, Dave Hansen,
	zhengqi.arch, thomas.lendacky, kernel-team,
	open list:MEMORY MANAGEMENT, Andrew Morton, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, Ingo Molnar, baolu.lu,
	david.guckian, damian.muszynski



> On 2 Sep 2025, at 18:50, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 9/2/25 08:44, Giovanni Cabiddu wrote:
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index 39f80111e6f1..e66c7662c254 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -1459,7 +1459,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>> 
>> 	put_flush_tlb_info();
>> 	put_cpu();
>> - 	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
>> + 	mmu_notifier_arch_invalidate_secondary_tlbs(mm, info->start, info->end);
>> }
> 
> That does look like the right solution.
> 
> This is the downside of wrapping everything up in that 'info' struct;
> it's not obvious that the canonical source of the start/end information
> moved from those variables into the structure.

Just a reminder (which I am sure you know): the main motivation behind
this info struct is to keep the payload required for the TLB shootdown
in a single cache line.

It would be nice (if possible) for callees like broadcast_tlb_flush()
to take a const flush_tlb_info (I'm not encouraging lots of
constification, but maybe it's appropriate here) to prevent misuse of
flush_tlb_info.
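
A minimal sketch of that constification, as a stand-alone user-space mock
(not the kernel code; broadcast_tlb_flush() is just a name borrowed from
this series):

#include <stdio.h>

/* Simplified stand-in for the kernel's flush_tlb_info. */
struct flush_tlb_info {
        unsigned long start;
        unsigned long end;
};

/*
 * A consumer taking a const pointer cannot silently modify the producer's
 * data; e.g. "info->end = 0;" in here would fail to compile.
 */
static void broadcast_tlb_flush(const struct flush_tlb_info *info)
{
        printf("flush %#lx-%#lx\n", info->start, info->end);
}

int main(void)
{
        struct flush_tlb_info info = { .start = 0x1000, .end = 0x9000 };

        /* A non-const pointer converts to const implicitly at the call site. */
        broadcast_tlb_flush(&info);
        return 0;
}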


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [BUG] x86/mm: regression after 4a02ed8e1cc3
  2025-09-02 16:08       ` Nadav Amit
@ 2025-09-02 16:11         ` Dave Hansen
  0 siblings, 0 replies; 86+ messages in thread
From: Dave Hansen @ 2025-09-02 16:11 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Giovanni Cabiddu, Rik van Riel, the arch/x86 maintainers,
	Linux Kernel Mailing List, Borislav Petkov, peterz, Dave Hansen,
	zhengqi.arch, thomas.lendacky, kernel-team,
	open list:MEMORY MANAGEMENT, Andrew Morton, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, Ingo Molnar, baolu.lu,
	david.guckian, damian.muszynski

On 9/2/25 09:08, Nadav Amit wrote:
> It would be nice (if possible) that callees like broadcast_tlb_flush()
> would constify flush_tlb_info (I’m not encouraging lots of consification,
> but maybe it’s appropriate here) to prevent the misuse of flush_tlb_info.

Yeah, agreed. It would be nice if there was a clear and enforced line
between producers and consumers here.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [BUG] x86/mm: regression after 4a02ed8e1cc3
  2025-09-02 16:05     ` Jann Horn
@ 2025-09-02 16:13       ` Jann Horn
  2025-09-03 14:18       ` Nadav Amit
  1 sibling, 0 replies; 86+ messages in thread
From: Jann Horn @ 2025-09-02 16:13 UTC (permalink / raw)
  To: Giovanni Cabiddu
  Cc: Rik van Riel, x86, linux-kernel, bp, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jackmanb, mhklinux, andrew.cooper3, Manali.Shukla, mingo,
	Dave Hansen, baolu.lu, david.guckian, damian.muszynski

On Tue, Sep 2, 2025 at 6:05 PM Jann Horn <jannh@google.com> wrote:
> To me this looks like a change that is correct but makes it easier to
> hit IOMMU flushing issues in other places.

(This was ambiguous; to be clear, I meant that 4a02ed8e1cc3 looks
correct to me, and that the suggested fix doesn't look to me like it
fixes anything.)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [BUG] x86/mm: regression after 4a02ed8e1cc3
  2025-09-02 15:44   ` [BUG] x86/mm: regression after 4a02ed8e1cc3 Giovanni Cabiddu
  2025-09-02 15:50     ` Dave Hansen
  2025-09-02 16:05     ` Jann Horn
@ 2025-09-02 16:31     ` Jann Horn
  2025-09-02 16:57       ` Giovanni Cabiddu
  2 siblings, 1 reply; 86+ messages in thread
From: Jann Horn @ 2025-09-02 16:31 UTC (permalink / raw)
  To: Giovanni Cabiddu
  Cc: Rik van Riel, x86, linux-kernel, bp, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jackmanb, mhklinux, andrew.cooper3, Manali.Shukla, mingo,
	Dave Hansen, baolu.lu, david.guckian, damian.muszynski

On Tue, Sep 2, 2025 at 5:44 PM Giovanni Cabiddu
<giovanni.cabiddu@intel.com> wrote:
> On Tue, Feb 25, 2025 at 10:00:36PM -0500, Rik van Riel wrote:
> > Reduce code duplication by consolidating the decision point
> > for whether to do individual invalidations or a full flush
> > inside get_flush_tlb_info.
> >
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > Tested-by: Michael Kelley <mhklinux@outlook.com>
> > Acked-by: Dave Hansen <dave.hansen@intel.com>
> > Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
> > ---
> After 4a02ed8e1cc3 ("x86/mm: Consolidate full flush threshold
> decision"), we've seen data corruption in DMAd buffers when testing SVA.

If it's not too much effort, you could try to see whether bumping
/sys/kernel/debug/x86/tlb_single_page_flush_ceiling to some relatively
large value (maybe something like 262144, which is 512*512) causes you
to see the same kind of issue before commit 4a02ed8e1cc3. (Note that
increasing tlb_single_page_flush_ceiling costs performance, and
putting an overly big value in there will probably make the system
completely unresponsive.)
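
For what it's worth, a tiny C sketch of poking that knob (assumes root and
debugfs mounted at /sys/kernel/debug; 262144 is just the example value from
above):

#include <stdio.h>

int main(void)
{
        /* A larger ceiling keeps more flushes ranged instead of widening
         * them to a full flush. */
        FILE *f = fopen("/sys/kernel/debug/x86/tlb_single_page_flush_ceiling", "w");

        if (!f) {
                perror("fopen");
                return 1;
        }
        fprintf(f, "262144\n");
        return fclose(f) ? 1 : 0;
}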

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [BUG] x86/mm: regression after 4a02ed8e1cc3
  2025-09-02 16:31     ` Jann Horn
@ 2025-09-02 16:57       ` Giovanni Cabiddu
  0 siblings, 0 replies; 86+ messages in thread
From: Giovanni Cabiddu @ 2025-09-02 16:57 UTC (permalink / raw)
  To: Jann Horn
  Cc: Rik van Riel, x86, linux-kernel, bp, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jackmanb, mhklinux, andrew.cooper3, Manali.Shukla, mingo,
	Dave Hansen, baolu.lu, david.guckian, damian.muszynski

On Tue, Sep 02, 2025 at 06:31:49PM +0200, Jann Horn wrote:
> On Tue, Sep 2, 2025 at 5:44 PM Giovanni Cabiddu
> <giovanni.cabiddu@intel.com> wrote:
> > On Tue, Feb 25, 2025 at 10:00:36PM -0500, Rik van Riel wrote:
> > > Reduce code duplication by consolidating the decision point
> > > for whether to do individual invalidations or a full flush
> > > inside get_flush_tlb_info.
> > >
> > > Signed-off-by: Rik van Riel <riel@surriel.com>
> > > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > > Tested-by: Michael Kelley <mhklinux@outlook.com>
> > > Acked-by: Dave Hansen <dave.hansen@intel.com>
> > > Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
> > > ---
> > After 4a02ed8e1cc3 ("x86/mm: Consolidate full flush threshold
> > decision"), we've seen data corruption in DMAd buffers when testing SVA.
> 
> If it's not too much effort, you could try to see whether bumping
> /sys/kernel/debug/x86/tlb_single_page_flush_ceiling to some relatively
> large value (maybe something like 262144, which is 512*512) causes you
> to see the same kind of issue before commit 4a02ed8e1cc3. (Note that
> increasing tlb_single_page_flush_ceiling costs performance, and
> putting an overly big value in there will probably make the system
> completely unresponsive.)
Thanks. We will try increasing tlb_single_page_flush_ceiling on a kernel
from before 4a02ed8e1cc3.

I'm not familiar with this code, but based on the commit message,
4a02ed8e1cc3 appears to be a pure refactor. However, that doesn't seem to
be the case: before the commit, mmu_notifier_arch_invalidate_secondary_tlbs()
was getting the modified values for start and end, while after the commit it
appears to use the original values instead. Was this intentional?

Regards,

-- 
Giovanni

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [BUG] x86/mm: regression after 4a02ed8e1cc3
  2025-09-02 15:50     ` Dave Hansen
  2025-09-02 16:08       ` Nadav Amit
@ 2025-09-03 14:00       ` Rik van Riel
  1 sibling, 0 replies; 86+ messages in thread
From: Rik van Riel @ 2025-09-03 14:00 UTC (permalink / raw)
  To: Dave Hansen, Giovanni Cabiddu
  Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
	nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm,
	jackmanb, jannh, mhklinux, andrew.cooper3, Manali.Shukla, mingo,
	baolu.lu, david.guckian, damian.muszynski

On Tue, 2025-09-02 at 08:50 -0700, Dave Hansen wrote:
> On 9/2/25 08:44, Giovanni Cabiddu wrote:
> > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> > index 39f80111e6f1..e66c7662c254 100644
> > --- a/arch/x86/mm/tlb.c
> > +++ b/arch/x86/mm/tlb.c
> > @@ -1459,7 +1459,7 @@ void flush_tlb_mm_range(struct mm_struct *mm,
> > unsigned long start,
> >  
> >  	put_flush_tlb_info();
> >  	put_cpu();
> > -	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start,
> > end);
> > +	mmu_notifier_arch_invalidate_secondary_tlbs(mm, info-
> > >start, info->end);
> >  }
> 
> That does look like the right solution.
> 
> This is the downside of wrapping everything up in that 'info' struct;
> it's not obvious that the canonical source of the start/end
> information
> moved from those variables into the structure.
> 
> Rik, is that your read on it too?
> 
In flush_tlb_mm_range(), we only need to flush from
start to end. The same should be true for the mmu notifier.

The only reason info->start and info->end can be modified
is that a global TLB flush can be faster than a ranged
one.

I can't think of any reason why that should affect the
range the mmu notifier needs to flush.
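
A toy mock of that distinction (simplified, not the actual kernel code):
the CPU-side range may be widened purely as an optimization, while the
secondary TLB only ever needs the caller's original range:

#include <stdio.h>

#define TLB_FLUSH_ALL   (~0UL)
#define PAGE_SIZE       4096UL

/* Simplified stand-in for the kernel's flush_tlb_info. */
struct flush_tlb_info {
        unsigned long start;
        unsigned long end;
};

/* The CPU flush may be widened when a full flush is cheaper. */
static void cpu_flush(struct flush_tlb_info *info, unsigned long ceiling)
{
        if ((info->end - info->start) / PAGE_SIZE > ceiling) {
                info->start = 0;
                info->end = TLB_FLUSH_ALL;
        }
        printf("CPU flush:   %#lx-%#lx\n", info->start, info->end);
}

/* The secondary (IOMMU) TLB keeps the range the caller asked for. */
static void secondary_flush(unsigned long start, unsigned long end)
{
        printf("IOMMU flush: %#lx-%#lx\n", start, end);
}

int main(void)
{
        unsigned long start = 0x100000, end = 0x10000000;
        struct flush_tlb_info info = { .start = start, .end = end };

        cpu_flush(&info, 33);           /* may set start=0, end=TLB_FLUSH_ALL */
        secondary_flush(start, end);    /* still the original range */
        return 0;
}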

-- 
All Rights Reversed.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [BUG] x86/mm: regression after 4a02ed8e1cc3
  2025-09-02 16:05     ` Jann Horn
  2025-09-02 16:13       ` Jann Horn
@ 2025-09-03 14:18       ` Nadav Amit
  2025-09-03 14:42         ` Jann Horn
  1 sibling, 1 reply; 86+ messages in thread
From: Nadav Amit @ 2025-09-03 14:18 UTC (permalink / raw)
  To: Jann Horn
  Cc: Giovanni Cabiddu, Rik van Riel, the arch/x86 maintainers,
	Linux Kernel Mailing List, Borislav Petkov, peterz, Dave Hansen,
	zhengqi.arch, thomas.lendacky, kernel-team,
	open list:MEMORY MANAGEMENT, Andrew Morton, jackmanb, mhklinux,
	andrew.cooper3, Manali.Shukla, Ingo Molnar, Dave Hansen, baolu.lu,
	david.guckian, damian.muszynski

I just noticed a few things that need clarification.

> On 2 Sep 2025, at 19:05, Jann Horn <jannh@google.com> wrote:
> 
> On Tue, Sep 2, 2025 at 5:44 PM Giovanni Cabiddu
> <giovanni.cabiddu@intel.com> wrote:
>> 

[snip]

>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index 39f80111e6f1..e66c7662c254 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -1459,7 +1459,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>> 
>>        put_flush_tlb_info();
>>        put_cpu();
>> -       mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
>> +       mmu_notifier_arch_invalidate_secondary_tlbs(mm, info->start, info->end);
>> }
> 
> I don't see why the IOMMU flush should be broadened just because the
> CPU flush got broadened.

Agreed (as Rik also indicated now)

> 
> On x86, IOMMU flushes happen from arch_tlbbatch_add_pending() and
> flush_tlb_mm_range(); the IOMMU folks might know better, but as far as
> I know, there is nothing that elides IOMMU flushes depending on the
> state of X86-internal flush generation tracking or such.
> 
> To me this looks like a change that is correct but makes it easier to
> hit IOMMU flushing issues in other places.

This change is not correct. Do not reference info after calling
put_flush_tlb_info().
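
To make the lifetime rule concrete (independently of whether the widened
range should be used at all), a mock sketch; the heap allocation here is
purely for illustration, the kernel helpers use a per-CPU object, but the
rule is the same: capture whatever is still needed before the put.

#include <stdio.h>
#include <stdlib.h>

struct flush_tlb_info { unsigned long start, end; };

static struct flush_tlb_info *get_flush_tlb_info(unsigned long start,
                                                 unsigned long end)
{
        struct flush_tlb_info *info = calloc(1, sizeof(*info));

        if (!info)
                exit(1);
        info->start = start;
        info->end = end;
        return info;
}

static void put_flush_tlb_info(struct flush_tlb_info *info)
{
        free(info);     /* from here on, touching *info is a use-after-free */
}

int main(void)
{
        struct flush_tlb_info *info = get_flush_tlb_info(0x1000, 0x9000);

        /* Read the fields *before* the put, never after. */
        unsigned long start = info->start, end = info->end;

        put_flush_tlb_info(info);
        printf("range: %#lx-%#lx\n", start, end);
        return 0;
}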

> 
> Are you encountering these issues on an Intel system or an AMD system?

I would note that on AMD IOMMUs there is some masking functionality
that allows large yet selective flushes.

This means both that we should not force the IOMMU flush range to be
large just because we decided to do so for the CPU (as you correctly
said), and that there might be an unrelated bug, for instance in the
mask computation.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [BUG] x86/mm: regression after 4a02ed8e1cc3
  2025-09-03 14:18       ` Nadav Amit
@ 2025-09-03 14:42         ` Jann Horn
  0 siblings, 0 replies; 86+ messages in thread
From: Jann Horn @ 2025-09-03 14:42 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Giovanni Cabiddu, Rik van Riel, the arch/x86 maintainers,
	Linux Kernel Mailing List, Borislav Petkov, peterz, Dave Hansen,
	zhengqi.arch, thomas.lendacky, kernel-team,
	open list:MEMORY MANAGEMENT, Andrew Morton, jackmanb, mhklinux,
	andrew.cooper3, Manali.Shukla, Ingo Molnar, Dave Hansen, baolu.lu,
	david.guckian, damian.muszynski

On Wed, Sep 3, 2025 at 4:18 PM Nadav Amit <nadav.amit@gmail.com> wrote:
> > On 2 Sep 2025, at 19:05, Jann Horn <jannh@google.com> wrote:
> > On x86, IOMMU flushes happen from arch_tlbbatch_add_pending() and
> > flush_tlb_mm_range(); the IOMMU folks might know better, but as far as
> > I know, there is nothing that elides IOMMU flushes depending on the
> > state of X86-internal flush generation tracking or such.
> >
> > To me this looks like a change that is correct but makes it easier to
> > hit IOMMU flushing issues in other places.
>
> This change is not correct. Do not reference info after calling
> put_flush_tlb_info().

To be clear, I worded that very confusingly, I meant that Rik's commit
4a02ed8e1cc3 looks correct to me.

^ permalink raw reply	[flat|nested] 86+ messages in thread

end of thread (newest message: 2025-09-03 14:43 UTC)

Thread overview: 86+ messages
2025-02-26  3:00 [PATCH v14 00/13] AMD broadcast TLB invalidation Rik van Riel
2025-02-26  3:00 ` [PATCH v14 01/13] x86/mm: consolidate full flush threshold decision Rik van Riel
2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Consolidate " tip-bot2 for Rik van Riel
2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
2025-09-02 15:44   ` [BUG] x86/mm: regression after 4a02ed8e1cc3 Giovanni Cabiddu
2025-09-02 15:50     ` Dave Hansen
2025-09-02 16:08       ` Nadav Amit
2025-09-02 16:11         ` Dave Hansen
2025-09-03 14:00       ` Rik van Riel
2025-09-02 16:05     ` Jann Horn
2025-09-02 16:13       ` Jann Horn
2025-09-03 14:18       ` Nadav Amit
2025-09-03 14:42         ` Jann Horn
2025-09-02 16:31     ` Jann Horn
2025-09-02 16:57       ` Giovanni Cabiddu
2025-02-26  3:00 ` [PATCH v14 02/13] x86/mm: get INVLPGB count max from CPUID Rik van Riel
2025-02-28 16:21   ` Borislav Petkov
2025-02-28 19:27   ` Borislav Petkov
2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Add INVLPGB feature and Kconfig entry tip-bot2 for Rik van Riel
2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
2025-02-26  3:00 ` [PATCH v14 03/13] x86/mm: add INVLPGB support code Rik van Riel
2025-02-28 18:46   ` Borislav Petkov
2025-02-28 18:51   ` Dave Hansen
2025-02-28 19:47   ` Borislav Petkov
2025-03-03 18:41     ` Dave Hansen
2025-03-03 19:23       ` Dave Hansen
2025-03-04 11:00         ` Borislav Petkov
2025-03-04 15:10           ` Dave Hansen
2025-03-04 16:19             ` Borislav Petkov
2025-03-04 16:57               ` Dave Hansen
2025-03-04 21:12                 ` Borislav Petkov
2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Add " tip-bot2 for Rik van Riel
2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
2025-02-26  3:00 ` [PATCH v14 04/13] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
2025-02-28 19:00   ` Dave Hansen
2025-02-28 21:43   ` Borislav Petkov
2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Use " tip-bot2 for Rik van Riel
2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
2025-02-26  3:00 ` [PATCH v14 05/13] x86/mm: use INVLPGB in flush_tlb_all Rik van Riel
2025-02-28 19:18   ` Dave Hansen
2025-03-01 12:20     ` Borislav Petkov
2025-03-01 15:54       ` Rik van Riel
2025-02-28 22:20   ` Borislav Petkov
2025-02-26  3:00 ` [PATCH v14 06/13] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
2025-02-28 18:57   ` Borislav Petkov
2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Use broadcast TLB flushing in page reclaim tip-bot2 for Rik van Riel
2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
2025-02-26  3:00 ` [PATCH v14 07/13] x86/mm: add global ASID allocation helper functions Rik van Riel
2025-03-02  7:06   ` Borislav Petkov
2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Add " tip-bot2 for Rik van Riel
2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
2025-02-26  3:00 ` [PATCH v14 08/13] x86/mm: global ASID context switch & TLB flush handling Rik van Riel
2025-03-02  7:58   ` Borislav Petkov
2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Handle global ASID context switch and TLB flush tip-bot2 for Rik van Riel
2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
2025-02-26  3:00 ` [PATCH v14 09/13] x86/mm: global ASID process exit helpers Rik van Riel
2025-03-02 12:38   ` Borislav Petkov
2025-03-02 13:53     ` Rik van Riel
2025-03-03 10:16       ` Borislav Petkov
2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Add " tip-bot2 for Rik van Riel
2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
2025-02-26  3:00 ` [PATCH v14 10/13] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
2025-03-03 10:57   ` Borislav Petkov
2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Enable " tip-bot2 for Rik van Riel
2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
2025-02-26  3:00 ` [PATCH v14 11/13] x86/mm: do targeted broadcast flushing from tlbbatch code Rik van Riel
2025-03-03 11:46   ` Borislav Petkov
2025-03-03 21:47     ` Dave Hansen
2025-03-04 11:52       ` Borislav Petkov
2025-03-04 15:24         ` Dave Hansen
2025-03-04 12:52       ` Brendan Jackman
2025-03-04 14:11         ` Borislav Petkov
2025-03-04 15:33           ` Brendan Jackman
2025-03-04 17:51             ` Dave Hansen
2025-02-26  3:00 ` [PATCH v14 12/13] x86/mm: enable AMD translation cache extensions Rik van Riel
2025-03-05 21:36   ` [tip: x86/mm] x86/mm: Enable " tip-bot2 for Rik van Riel
2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Rik van Riel
2025-02-26  3:00 ` [PATCH v14 13/13] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
2025-03-03 22:40   ` Dave Hansen
2025-03-04 11:53     ` Borislav Petkov
2025-03-03 12:42 ` [PATCH v14 00/13] AMD broadcast TLB invalidation Borislav Petkov
2025-03-03 13:29   ` Borislav Petkov
2025-03-04 12:04 ` [PATCH] x86/mm: Always set the ASID valid bit for the INVLPGB instruction Borislav Petkov
2025-03-04 12:43   ` Borislav Petkov
2025-03-05 21:36   ` [tip: x86/mm] " tip-bot2 for Tom Lendacky
2025-03-19 11:04   ` [tip: x86/core] " tip-bot2 for Tom Lendacky
