* [PATCH v13 00/14] AMD broadcast TLB invalidation
From: Rik van Riel @ 2025-02-23 19:48 UTC
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.

This allows the kernel to invalidate TLB entries on remote CPUs without
needing to send IPIs, without having to wait for remote CPUs to handle
those interrupts, and with less interruption to what was running on
those CPUs.
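
In rough sketch form (simplified pseudocode using helper names
introduced later in this series, not the exact call chains), the
difference between the two mechanisms is:

	/* IPI-based flush: interrupt every CPU that may hold the mapping. */
	on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);

	/* INVLPGB: queue one broadcast invalidation, then wait for it. */
	invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, false);
	__tlbsync();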

Because x86 PCID space is limited, and there are some very large
systems out there, broadcast TLB invalidation is only used for
processes that are active on 4 or more CPUs, with the threshold
being gradually increased the more the PCID space gets exhausted.

Combined with the removal of unnecessary lru_add_drain calls
(see https://lkml.org/lkml/2024/12/19/1388) this results in a
nice performance boost for the will-it-scale tlb_flush2_threads
test on an AMD Milan system with 36 cores:

- vanilla kernel:           527k loops/second
- lru_add_drain removal:    731k loops/second
- only INVLPGB:             527k loops/second
- lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that while
TLB invalidation went down from 40% of the total CPU
time to only around 4%, the contention simply moved
to the LRU lock.

Fixing both at the same time roughly doubles the
number of iterations per second for this test.

Some numbers closer to real world performance
can be found at Phoronix, thanks to Michael:

https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits

My current plan is to implement support for Intel's RAR
(Remote Action Request) TLB flushing in a follow-up series,
after this thing has been merged into -tip. Making things
any larger would just be unwieldy for reviewers.

v13:
 - move invlpgb_count_max back to amd.c for resume (Boris, Oleksandr)
 - fix Kconfig circular dependency (Tom, Boris)
 - add performance numbers to the patch adding invlpgb for userspace (Ingo)
 - drop page table RCU free patches (already in -tip)
v12:
 - make sure "nopcid" command line option turns off invlpgb (Brendan)
 - add "noinvlpgb" kernel command line option
 - split out kernel TLB flushing differently (Dave & Yosry)
 - split up the patch that does invlpgb flushing for user processes (Dave)
 - clean up get_flush_tlb_info (Boris)
 - move invlpgb_count_max initialization to get_cpu_cap (Boris)
 - bunch more comments as requested
v11:
 - resolve conflict with CONFIG_PT_RECLAIM code
 - a few more cleanups (Peter, Brendan, Nadav)
v10:
 - simplify partial pages with min(nr, 1) in the invlpgb loop (Peter)
 - document x86 paravirt, AMD invlpgb, and ARM64 flush without IPI (Brendan)
 - remove IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH) (Brendan)
 - various cleanups (Brendan)
v9:
 - print warning when start or end address was rounded (Peter)
 - in the reclaim code, tlbsync at context switch time (Peter)
 - fix !CONFIG_CPU_SUP_AMD compile error in arch_tlbbatch_add_pending (Jan)
v8:
 - round start & end to handle non-page-aligned callers (Steven & Jan)
 - fix up changelog & add tested-by tags (Manali)
v7:
 - a few small code cleanups (Nadav)
 - fix spurious VM_WARN_ON_ONCE in mm_global_asid
 - code simplifications & better barriers (Peter & Dave)
v6:
 - fix info->end check in flush_tlb_kernel_range (Michael)
 - disable broadcast TLB flushing on 32 bit x86
v5:
 - use byte assembly for compatibility with older toolchains (Borislav, Michael)
 - ensure a panic on an invalid number of extra pages (Dave, Tom)
 - add cant_migrate() assertion to tlbsync (Jann)
 - a bunch more cleanups (Nadav)
 - key TCE enabling off X86_FEATURE_TCE (Andrew)
 - fix a race between reclaim and ASID transition (Jann)
v4:
 - Use only bitmaps to track free global ASIDs (Nadav)
 - Improved AMD initialization (Borislav & Tom)
 - Various naming and documentation improvements (Peter, Nadav, Tom, Dave)
 - Fixes for subtle race conditions (Jann)
v3:
 - Remove paravirt tlb_remove_table call (thank you Qi Zheng)
 - More suggested cleanups and changelog fixes by Peter and Nadav
v2:
 - Apply suggestions by Peter and Borislav (thank you!)
 - Fix bug in arch_tlbbatch_flush, where we need to do both
   the TLBSYNC, and flush the CPUs that are in the cpumask.
 - Some updates to comments and changelogs based on questions.



* [PATCH v13 01/14] x86/mm: consolidate full flush threshold decision
From: Rik van Riel @ 2025-02-23 19:48 UTC
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel,
	Dave Hansen

Reduce code duplication by consolidating the decision point
for whether to do individual invalidations or a full flush
inside get_flush_tlb_info.

Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
---
 arch/x86/mm/tlb.c | 41 +++++++++++++++++++----------------------
 1 file changed, 19 insertions(+), 22 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ffc25b348041..dbcb5c968ff9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1000,6 +1000,15 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
 	BUG_ON(this_cpu_inc_return(flush_tlb_info_idx) != 1);
 #endif
 
+	/*
+	 * If the number of flushes is so large that a full flush
+	 * would be faster, do a full flush.
+	 */
+	if ((end - start) >> stride_shift > tlb_single_page_flush_ceiling) {
+		start = 0;
+		end = TLB_FLUSH_ALL;
+	}
+
 	info->start		= start;
 	info->end		= end;
 	info->mm		= mm;
@@ -1026,17 +1035,8 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				bool freed_tables)
 {
 	struct flush_tlb_info *info;
+	int cpu = get_cpu();
 	u64 new_tlb_gen;
-	int cpu;
-
-	cpu = get_cpu();
-
-	/* Should we flush just the requested range? */
-	if ((end == TLB_FLUSH_ALL) ||
-	    ((end - start) >> stride_shift) > tlb_single_page_flush_ceiling) {
-		start = 0;
-		end = TLB_FLUSH_ALL;
-	}
 
 	/* This is also a barrier that synchronizes with switch_mm(). */
 	new_tlb_gen = inc_mm_tlb_gen(mm);
@@ -1089,22 +1089,19 @@ static void do_kernel_range_flush(void *info)
 
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
-	/* Balance as user space task's flush, a bit conservative */
-	if (end == TLB_FLUSH_ALL ||
-	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
-	} else {
-		struct flush_tlb_info *info;
+	struct flush_tlb_info *info;
+
+	guard(preempt)();
 
-		preempt_disable();
-		info = get_flush_tlb_info(NULL, start, end, 0, false,
-					  TLB_GENERATION_INVALID);
+	info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
+				  TLB_GENERATION_INVALID);
 
+	if (info->end == TLB_FLUSH_ALL)
+		on_each_cpu(do_flush_tlb_all, NULL, 1);
+	else
 		on_each_cpu(do_kernel_range_flush, info, 1);
 
-		put_flush_tlb_info();
-		preempt_enable();
-	}
+	put_flush_tlb_info();
 }
 
 /*
-- 
2.47.1



* [PATCH v13 02/14] x86/mm: get INVLPGB count max from CPUID
From: Rik van Riel @ 2025-02-23 19:48 UTC
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel,
	Dave Hansen

The CPU advertises the maximum number of pages that can be shot down
with one INVLPGB instruction in its CPUID data (leaf 0x8000_0008,
EDX[15:0]). The field is a zero-based count, hence the +1 below.

Save that information for later use.

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
---
 arch/x86/Kconfig.cpu               | 4 ++++
 arch/x86/include/asm/cpufeatures.h | 1 +
 arch/x86/include/asm/tlbflush.h    | 3 +++
 arch/x86/kernel/cpu/amd.c          | 8 ++++++++
 4 files changed, 16 insertions(+)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 2a7279d80460..981def9cbfac 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -401,6 +401,10 @@ menuconfig PROCESSOR_SELECT
 	  This lets you choose what x86 vendor support code your kernel
 	  will include.
 
+config X86_BROADCAST_TLB_FLUSH
+	def_bool y
+	depends on CPU_SUP_AMD && 64BIT
+
 config CPU_SUP_INTEL
 	default y
 	bool "Support Intel processors" if PROCESSOR_SELECT
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 508c0dad116b..b5c66b7465ba 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -338,6 +338,7 @@
 #define X86_FEATURE_CLZERO		(13*32+ 0) /* "clzero" CLZERO instruction */
 #define X86_FEATURE_IRPERF		(13*32+ 1) /* "irperf" Instructions Retired Count */
 #define X86_FEATURE_XSAVEERPTR		(13*32+ 2) /* "xsaveerptr" Always save/restore FP error pointers */
+#define X86_FEATURE_INVLPGB		(13*32+ 3) /* INVLPGB and TLBSYNC instructions supported */
 #define X86_FEATURE_RDPRU		(13*32+ 4) /* "rdpru" Read processor register at user level */
 #define X86_FEATURE_WBNOINVD		(13*32+ 9) /* "wbnoinvd" WBNOINVD instruction */
 #define X86_FEATURE_AMD_IBPB		(13*32+12) /* Indirect Branch Prediction Barrier */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3da645139748..09463a2fb05f 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -183,6 +183,9 @@ static inline void cr4_init_shadow(void)
 extern unsigned long mmu_cr4_features;
 extern u32 *trampoline_cr4_features;
 
+/* How many pages can we invalidate with one INVLPGB. */
+extern u16 invlpgb_count_max;
+
 extern void initialize_tlbstate_and_flush(void);
 
 /*
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 54194f5995de..3e8180354303 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -29,6 +29,8 @@
 
 #include "cpu.h"
 
+u16 invlpgb_count_max __ro_after_init;
+
 static inline int rdmsrl_amd_safe(unsigned msr, unsigned long long *p)
 {
 	u32 gprs[8] = { 0 };
@@ -1139,6 +1141,12 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
 		tlb_lli_2m[ENTRIES] = eax & mask;
 
 	tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
+
+	/* Max number of pages INVLPGB can invalidate in one shot */
+	if (boot_cpu_has(X86_FEATURE_INVLPGB)) {
+		cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
+		invlpgb_count_max = (edx & 0xffff) + 1;
+	}
 }
 
 static const struct cpu_dev amd_cpu_dev = {
-- 
2.47.1



* [PATCH v13 03/14] x86/mm: add INVLPGB support code
From: Rik van Riel @ 2025-02-23 19:48 UTC
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel,
	Dave Hansen

Add the helper functions and definitions needed to use broadcast
TLB invalidation on AMD EPYC 3 and newer CPUs to <asm/tlb.h>.

All the functions defined there are used later in the series.

Disabling X86_FEATURE_INVLPGB at compile time, when the config
option is not set, allows the compiler to omit the unnecessary code.
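
As a usage sketch, the pattern the later patches follow is to queue
one or more broadcast invalidations and then wait for them once:

	/* Queue a broadcast invalidation of a kernel address range... */
	invlpgb_flush_addr_nosync(addr, nr);
	/* ...then wait for all INVLPGBs issued from this CPU. */
	__tlbsync();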

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
---
 arch/x86/include/asm/disabled-features.h |  9 ++-
 arch/x86/include/asm/tlb.h               | 92 ++++++++++++++++++++++++
 2 files changed, 100 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index c492bdc97b05..95997caf0935 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -129,6 +129,13 @@
 #define DISABLE_SEV_SNP		(1 << (X86_FEATURE_SEV_SNP & 31))
 #endif
 
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+#define DISABLE_INVLPGB		0
+#else
+/* Keep 32 bit kernels smaller by compiling out the INVLPGB code. */
+#define DISABLE_INVLPGB		(1 << (X86_FEATURE_INVLPGB & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -146,7 +153,7 @@
 #define DISABLED_MASK11	(DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
 			 DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
 #define DISABLED_MASK12	(DISABLE_FRED|DISABLE_LAM)
-#define DISABLED_MASK13	0
+#define DISABLED_MASK13	(DISABLE_INVLPGB)
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 77f52bc1578a..b3cd521e5e2f 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -6,6 +6,9 @@
 static inline void tlb_flush(struct mmu_gather *tlb);
 
 #include <asm-generic/tlb.h>
+#include <linux/kernel.h>
+#include <vdso/bits.h>
+#include <vdso/page.h>
 
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
@@ -25,4 +28,93 @@ static inline void invlpg(unsigned long addr)
 	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
 }
 
+
+/*
+ * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
+ *
+ * The INVLPGB instruction is weakly ordered, and a batch of invalidations can
+ * be done in a parallel fashion.
+ *
+ * The instruction takes the number of extra pages to invalidate, beyond
+ * the first page, while __invlpgb gets the more human readable number of
+ * pages to invalidate.
+ *
+ * TLBSYNC is used to ensure that pending INVLPGB invalidations initiated from
+ * this CPU have completed.
+ */
+static inline void __invlpgb(unsigned long asid, unsigned long pcid,
+			     unsigned long addr, u16 nr_pages,
+			     bool pmd_stride, u8 flags)
+{
+	u32 edx = (pcid << 16) | asid;
+	u32 ecx = (pmd_stride << 31) | (nr_pages - 1);
+	u64 rax = addr | flags;
+
+	/* The low bits in rax are for flags. Verify addr is clean. */
+	VM_WARN_ON_ONCE(addr & ~PAGE_MASK);
+
+	/* INVLPGB; supported in binutils >= 2.36. */
+	asm volatile(".byte 0x0f, 0x01, 0xfe" : : "a" (rax), "c" (ecx), "d" (edx));
+}
+
+/* Wait for INVLPGB originated by this CPU to complete. */
+static inline void __tlbsync(void)
+{
+	cant_migrate();
+	/* TLBSYNC; supported in binutils >= 2.36. */
+	asm volatile(".byte 0x0f, 0x01, 0xff" ::: "memory");
+}
+
+/*
+ * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination
+ * of the three. For example:
+ * - INVLPGB_VA | INVLPGB_INCLUDE_GLOBAL: invalidate all TLB entries at the address
+ * - INVLPGB_PCID:			  invalidate all TLB entries matching the PCID
+ *
+ * The first can be used to invalidate (kernel) mappings at a particular
+ * address across all processes.
+ *
+ * The second invalidates all TLB entries matching a PCID.
+ */
+#define INVLPGB_VA			BIT(0)
+#define INVLPGB_PCID			BIT(1)
+#define INVLPGB_ASID			BIT(2)
+#define INVLPGB_INCLUDE_GLOBAL		BIT(3)
+#define INVLPGB_FINAL_ONLY		BIT(4)
+#define INVLPGB_INCLUDE_NESTED		BIT(5)
+
+static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
+						unsigned long addr,
+						u16 nr,
+						bool pmd_stride)
+{
+	__invlpgb(0, pcid, addr, nr, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
+}
+
+/* Flush all mappings for a given PCID, not including globals. */
+static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
+{
+	__invlpgb(0, pcid, 0, 1, 0, INVLPGB_PCID);
+}
+
+/* Flush all mappings, including globals, for all PCIDs. */
+static inline void invlpgb_flush_all(void)
+{
+	__invlpgb(0, 0, 0, 1, 0, INVLPGB_INCLUDE_GLOBAL);
+	__tlbsync();
+}
+
+/* Flush addr, including globals, for all PCIDs. */
+static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
+{
+	__invlpgb(0, 0, addr, nr, 0, INVLPGB_INCLUDE_GLOBAL);
+}
+
+/* Flush all mappings for all PCIDs except globals. */
+static inline void invlpgb_flush_all_nonglobals(void)
+{
+	__invlpgb(0, 0, 0, 1, 0, 0);
+	__tlbsync();
+}
+
 #endif /* _ASM_X86_TLB_H */
-- 
2.47.1



* [PATCH v13 04/14] x86/mm: use INVLPGB for kernel TLB flushes
From: Rik van Riel @ 2025-02-23 19:48 UTC
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

Use broadcast TLB invalidation for kernel addresses when available.

Remove the need to send IPIs for kernel TLB flushes.
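
For example, with a hypothetical invlpgb_count_max of 8 (the real
limit comes from CPUID, see patch 2), flushing a 20 page kernel
range would break down into:

	invlpgb_flush_addr_nosync(addr,                  8);
	invlpgb_flush_addr_nosync(addr +  8 * PAGE_SIZE, 8);
	invlpgb_flush_addr_nosync(addr + 16 * PAGE_SIZE, 4);
	__tlbsync();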

Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/mm/tlb.c | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index dbcb5c968ff9..59396a3c6e9c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1077,6 +1077,20 @@ void flush_tlb_all(void)
 	on_each_cpu(do_flush_tlb_all, NULL, 1);
 }
 
+static bool invlpgb_kernel_range_flush(struct flush_tlb_info *info)
+{
+	unsigned long addr;
+	unsigned long nr;
+
+	for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
+		nr = (info->end - addr) >> PAGE_SHIFT;
+		nr = clamp_val(nr, 1, invlpgb_count_max);
+		invlpgb_flush_addr_nosync(addr, nr);
+	}
+	__tlbsync();
+	return true;
+}
+
 static void do_kernel_range_flush(void *info)
 {
 	struct flush_tlb_info *f = info;
@@ -1087,6 +1101,22 @@ static void do_kernel_range_flush(void *info)
 		flush_tlb_one_kernel(addr);
 }
 
+static void kernel_tlb_flush_all(struct flush_tlb_info *info)
+{
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		invlpgb_flush_all();
+	else
+		on_each_cpu(do_flush_tlb_all, NULL, 1);
+}
+
+static void kernel_tlb_flush_range(struct flush_tlb_info *info)
+{
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		invlpgb_kernel_range_flush(info);
+	else
+		on_each_cpu(do_kernel_range_flush, info, 1);
+}
+
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
 	struct flush_tlb_info *info;
@@ -1097,9 +1127,9 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 				  TLB_GENERATION_INVALID);
 
 	if (info->end == TLB_FLUSH_ALL)
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
+		kernel_tlb_flush_all(info);
 	else
-		on_each_cpu(do_kernel_range_flush, info, 1);
+		kernel_tlb_flush_range(info);
 
 	put_flush_tlb_info();
 }
-- 
2.47.1



* [PATCH v13 05/14] x86/mm: use INVLPGB in flush_tlb_all
From: Rik van Riel @ 2025-02-23 19:48 UTC
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

The flush_tlb_all() function is not used a whole lot, but we might
as well use broadcast TLB flushing there, too.

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/mm/tlb.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 59396a3c6e9c..2d7ed0fda61f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1065,6 +1065,16 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 }
 
 
+static bool broadcast_flush_tlb_all(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return false;
+
+	guard(preempt)();
+	invlpgb_flush_all();
+	return true;
+}
+
 static void do_flush_tlb_all(void *info)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
@@ -1074,6 +1084,12 @@ static void do_flush_tlb_all(void *info)
 void flush_tlb_all(void)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+
+	/* First try (faster) hardware-assisted TLB invalidation. */
+	if (broadcast_flush_tlb_all())
+		return;
+
+	/* Fall back to the IPI-based invalidation. */
 	on_each_cpu(do_flush_tlb_all, NULL, 1);
 }
 
-- 
2.47.1



* [PATCH v13 06/14] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
From: Rik van Riel @ 2025-02-23 19:48 UTC
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

In the page reclaim code, we only track the CPU(s) where the TLB needs
to be flushed, rather than all the individual mappings that may be getting
invalidated.

Use broadcast TLB flushing when that is available.

This is a temporary hack to ensure that the PCID context of
tasks that get a global ASID in the next patches is properly
flushed from the page reclaim code, because the IPI-based
flushing in arch_tlbbatch_flush only flushes the currently
loaded TLB context on each CPU.

Patch 11 replaces this with the actual mechanism used to do
broadcast TLB flushing from the page reclaim code.

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/mm/tlb.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2d7ed0fda61f..16839651f67f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1326,7 +1326,9 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+		invlpgb_flush_all_nonglobals();
+	} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
 		flush_tlb_multi(&batch->cpumask, info);
 	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();
-- 
2.47.1



* [PATCH v13 07/14] x86/mm: global ASID allocation helper functions
From: Rik van Riel @ 2025-02-23 19:48 UTC
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

Add functions to manage the global ASID space. Multithreaded
processes that are simultaneously active on 4 or more CPUs can
get a global ASID, resulting in the same PCID being used for
that process on every CPU.

This in turn will allow the kernel to use hardware-assisted TLB flushing
through AMD INVLPGB or Intel RAR for these processes.

These helper functions were split out into a separate patch by request.
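
In sketch form, the intended lifecycle (the callers arrive in
patches 9 and 10) is:

	/* TLB flush path (patch 10): hand out a global ASID. */
	if (mm_active_cpus_exceeds(mm, 3))	/* 4 or more CPUs */
		use_global_asid(mm);

	/* Process exit (patch 9): mark the ASID freed. It is only
	 * re-used after the flush in reset_global_asid_space(). */
	destroy_context_free_global_asid(mm);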

Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/include/asm/mmu.h      |  11 +++
 arch/x86/include/asm/tlbflush.h |  43 ++++++++++
 arch/x86/mm/tlb.c               | 144 +++++++++++++++++++++++++++++++-
 3 files changed, 195 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 3b496cdcb74b..edb5942d4829 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -69,6 +69,17 @@ typedef struct {
 	u16 pkey_allocation_map;
 	s16 execute_only_pkey;
 #endif
+
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+	/*
+	 * The global ASID will be a non-zero value when the process has
+	 * the same ASID across all CPUs, allowing it to make use of
+	 * hardware-assisted remote TLB invalidation like AMD INVLPGB.
+	 */
+	u16 global_asid;
+	/* The process is transitioning to a new global ASID number. */
+	bool asid_transition;
+#endif
 } mm_context_t;
 
 #define INIT_MM_CONTEXT(mm)						\
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 09463a2fb05f..83f1da2f1e4a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -6,6 +6,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/sched.h>
 
+#include <asm/barrier.h>
 #include <asm/processor.h>
 #include <asm/cpufeature.h>
 #include <asm/special_insns.h>
@@ -234,6 +235,48 @@ void flush_tlb_one_kernel(unsigned long addr);
 void flush_tlb_multi(const struct cpumask *cpumask,
 		      const struct flush_tlb_info *info);
 
+static inline bool is_dyn_asid(u16 asid)
+{
+	return asid < TLB_NR_DYN_ASIDS;
+}
+
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+static inline u16 mm_global_asid(struct mm_struct *mm)
+{
+	u16 asid;
+
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return 0;
+
+	asid = smp_load_acquire(&mm->context.global_asid);
+
+	/* mm->context.global_asid is either 0, or a global ASID */
+	VM_WARN_ON_ONCE(asid && is_dyn_asid(asid));
+
+	return asid;
+}
+
+static inline void assign_mm_global_asid(struct mm_struct *mm, u16 asid)
+{
+	/*
+	 * Notably flush_tlb_mm_range() -> broadcast_tlb_flush() ->
+	 * finish_asid_transition() needs to observe asid_transition = true
+	 * once it observes global_asid.
+	 */
+	mm->context.asid_transition = true;
+	smp_store_release(&mm->context.global_asid, asid);
+}
+#else
+static inline u16 mm_global_asid(struct mm_struct *mm)
+{
+	return 0;
+}
+
+static inline void assign_mm_global_asid(struct mm_struct *mm, u16 asid)
+{
+}
+#endif
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 16839651f67f..405630479b90 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -74,13 +74,15 @@
  * use different names for each of them:
  *
  * ASID  - [0, TLB_NR_DYN_ASIDS-1]
- *         the canonical identifier for an mm
+ *         the canonical identifier for an mm, dynamically allocated on each CPU
+ *         [TLB_NR_DYN_ASIDS, MAX_ASID_AVAILABLE-1]
+ *         the canonical, global identifier for an mm, identical across all CPUs
  *
- * kPCID - [1, TLB_NR_DYN_ASIDS]
+ * kPCID - [1, MAX_ASID_AVAILABLE]
  *         the value we write into the PCID part of CR3; corresponds to the
  *         ASID+1, because PCID 0 is special.
  *
- * uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
+ * uPCID - [2048 + 1, 2048 + MAX_ASID_AVAILABLE]
  *         for KPTI each mm has two address spaces and thus needs two
  *         PCID values, but we can still do with a single ASID denomination
  *         for each mm. Corresponds to kPCID + 2048.
@@ -251,6 +253,142 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 	*need_flush = true;
 }
 
+/*
+ * Global ASIDs are allocated for multi-threaded processes that are
+ * active on multiple CPUs simultaneously, giving each of those
+ * processes the same PCIDs on every CPU, for use with hardware-assisted
+ * TLB shootdown on remote CPUs, like AMD INVLPGB or Intel RAR.
+ *
+ * These global ASIDs are held for the lifetime of the process.
+ */
+static DEFINE_RAW_SPINLOCK(global_asid_lock);
+static u16 last_global_asid = MAX_ASID_AVAILABLE;
+static DECLARE_BITMAP(global_asid_used, MAX_ASID_AVAILABLE);
+static DECLARE_BITMAP(global_asid_freed, MAX_ASID_AVAILABLE);
+static int global_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
+
+/*
+ * When the search for a free ASID in the global ASID space reaches
+ * MAX_ASID_AVAILABLE, a global TLB flush guarantees that previously
+ * freed global ASIDs are safe to re-use.
+ *
+ * This way the global flush only needs to happen at ASID rollover
+ * time, and not at ASID allocation time.
+ */
+static void reset_global_asid_space(void)
+{
+	lockdep_assert_held(&global_asid_lock);
+
+	invlpgb_flush_all_nonglobals();
+
+	/*
+	 * The TLB flush above makes it safe to re-use the previously
+	 * freed global ASIDs.
+	 */
+	bitmap_andnot(global_asid_used, global_asid_used,
+			global_asid_freed, MAX_ASID_AVAILABLE);
+	bitmap_clear(global_asid_freed, 0, MAX_ASID_AVAILABLE);
+
+	/* Restart the search from the start of global ASID space. */
+	last_global_asid = TLB_NR_DYN_ASIDS;
+}
+
+static u16 allocate_global_asid(void)
+{
+	u16 asid;
+
+	lockdep_assert_held(&global_asid_lock);
+
+	/* The previous allocation hit the edge of available address space */
+	if (last_global_asid >= MAX_ASID_AVAILABLE - 1)
+		reset_global_asid_space();
+
+	asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, last_global_asid);
+
+	if (asid >= MAX_ASID_AVAILABLE) {
+		/* This should never happen. */
+		VM_WARN_ONCE(1, "Unable to allocate global ASID despite %d available\n",
+				global_asid_available);
+		return 0;
+	}
+
+	/* Claim this global ASID. */
+	__set_bit(asid, global_asid_used);
+	last_global_asid = asid;
+	global_asid_available--;
+	return asid;
+}
+
+/*
+ * Check whether a process is currently active on more than "threshold" CPUs.
+ * This is a cheap estimation on whether or not it may make sense to assign
+ * a global ASID to this process, and use broadcast TLB invalidation.
+ */
+static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
+{
+	int count = 0;
+	int cpu;
+
+	/* This quick check should eliminate most single threaded programs. */
+	if (cpumask_weight(mm_cpumask(mm)) <= threshold)
+		return false;
+
+	/* Slower check to make sure. */
+	for_each_cpu(cpu, mm_cpumask(mm)) {
+		/* Skip the CPUs that aren't really running this process. */
+		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
+			continue;
+
+		if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu))
+			continue;
+
+		if (++count > threshold)
+			return true;
+	}
+	return false;
+}
+
+/*
+ * Assign a global ASID to the current process, protecting against
+ * races between multiple threads in the process.
+ */
+static void use_global_asid(struct mm_struct *mm)
+{
+	u16 asid;
+
+	guard(raw_spinlock_irqsave)(&global_asid_lock);
+
+	/* This process is already using broadcast TLB invalidation. */
+	if (mm_global_asid(mm))
+		return;
+
+	/* The last global ASID was consumed while waiting for the lock. */
+	if (!global_asid_available) {
+		VM_WARN_ONCE(1, "Ran out of global ASIDs\n");
+		return;
+	}
+
+	asid = allocate_global_asid();
+	if (!asid)
+		return;
+
+	assign_mm_global_asid(mm, asid);
+}
+
+void destroy_context_free_global_asid(struct mm_struct *mm)
+{
+	if (!mm_global_asid(mm))
+		return;
+
+	guard(raw_spinlock_irqsave)(&global_asid_lock);
+
+	/* The global ASID can be re-used only after flush at wrap-around. */
+	__set_bit(mm->context.global_asid, global_asid_freed);
+
+	mm->context.global_asid = 0;
+	global_asid_available++;
+}
+
 /*
  * Given an ASID, flush the corresponding user ASID.  We can delay this
  * until the next time we switch to it.
-- 
2.47.1



* [PATCH v13 08/14] x86/mm: global ASID context switch & TLB flush handling
From: Rik van Riel @ 2025-02-23 19:48 UTC
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

Add context switch and TLB flush support for processes that use
a global ASID & PCID across all CPUs.

At both context switch time and TLB flush time, we need to check
whether a task is switching to a global ASID, and reload the TLB
with the new ASID as appropriate.

In both code paths, we also short-circuit the TLB flush if we
are using a global ASID, because the global ASIDs are always
kept up to date across CPUs, even while the process is not
running on a CPU.
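
In sketch form, the two checks added below are:

	/* Did this mm get a global ASID while we were running? */
	if (needs_global_asid_reload(next, prev_asid)) {
		/* Pick up the global ASID via choose_new_asid(). */
	}

	/* Global ASIDs are kept current by INVLPGB; skip the flush. */
	if (is_global_asid(loaded_mm_asid))
		return;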

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/tlbflush.h | 13 ++++++
 arch/x86/mm/tlb.c               | 77 ++++++++++++++++++++++++++++++---
 2 files changed, 83 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 83f1da2f1e4a..f1f82571249b 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -240,6 +240,19 @@ static inline bool is_dyn_asid(u16 asid)
 	return asid < TLB_NR_DYN_ASIDS;
 }
 
+static inline bool is_global_asid(u16 asid)
+{
+	return !is_dyn_asid(asid);
+}
+
+static inline bool in_asid_transition(struct mm_struct *mm)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return false;
+
+	return mm && READ_ONCE(mm->context.asid_transition);
+}
+
 #ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
 static inline u16 mm_global_asid(struct mm_struct *mm)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 405630479b90..d8a04e398615 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -227,6 +227,20 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 		return;
 	}
 
+	/*
+	 * TLB consistency for global ASIDs is maintained with hardware assisted
+	 * remote TLB flushing. Global ASIDs are always up to date.
+	 */
+	if (static_cpu_has(X86_FEATURE_INVLPGB)) {
+		u16 global_asid = mm_global_asid(next);
+
+		if (global_asid) {
+			*new_asid = global_asid;
+			*need_flush = false;
+			return;
+		}
+	}
+
 	if (this_cpu_read(cpu_tlbstate.invalidate_other))
 		clear_asid_other();
 
@@ -389,6 +403,23 @@ void destroy_context_free_global_asid(struct mm_struct *mm)
 	global_asid_available++;
 }
 
+/*
+ * Is the mm transitioning from a CPU-local ASID to a global ASID?
+ */
+static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
+{
+	u16 global_asid = mm_global_asid(next);
+
+	if (!static_cpu_has(X86_FEATURE_INVLPGB))
+		return false;
+
+	/* Process is transitioning to a global ASID */
+	if (global_asid && prev_asid != global_asid)
+		return true;
+
+	return false;
+}
+
 /*
  * Given an ASID, flush the corresponding user ASID.  We can delay this
  * until the next time we switch to it.
@@ -694,7 +725,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 	 */
 	if (prev == next) {
 		/* Not actually switching mm's */
-		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
+		VM_WARN_ON(is_dyn_asid(prev_asid) &&
+			   this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
 			   next->context.ctx_id);
 
 		/*
@@ -711,6 +743,20 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
 			cpumask_set_cpu(cpu, mm_cpumask(next));
 
+		/* Check if the current mm is transitioning to a global ASID */
+		if (needs_global_asid_reload(next, prev_asid)) {
+			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+			choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
+			goto reload_tlb;
+		}
+
+		/*
+		 * Broadcast TLB invalidation keeps this PCID up to date
+		 * all the time.
+		 */
+		if (is_global_asid(prev_asid))
+			return;
+
 		/*
 		 * If the CPU is not in lazy TLB mode, we are just switching
 		 * from one thread in a process to another thread in the same
@@ -744,6 +790,13 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		 */
 		cond_mitigation(tsk);
 
+		/*
+		 * Let nmi_uaccess_okay() and finish_asid_transition()
+		 * know that we're changing CR3.
+		 */
+		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
+		barrier();
+
 		/*
 		 * Leave this CPU in prev's mm_cpumask. Atomic writes to
 		 * mm_cpumask can be expensive under contention. The CPU
@@ -758,14 +811,12 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 
 		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
-
-		/* Let nmi_uaccess_okay() know that we're changing CR3. */
-		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
-		barrier();
 	}
 
+reload_tlb:
 	new_lam = mm_lam_cr3_mask(next);
 	if (need_flush) {
+		VM_WARN_ON_ONCE(is_global_asid(new_asid));
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
 		load_new_mm_cr3(next->pgd, new_asid, new_lam, true);
@@ -884,7 +935,7 @@ static void flush_tlb_func(void *info)
 	const struct flush_tlb_info *f = info;
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+	u64 local_tlb_gen;
 	bool local = smp_processor_id() == f->initiating_cpu;
 	unsigned long nr_invalidate = 0;
 	u64 mm_tlb_gen;
@@ -907,6 +958,16 @@ static void flush_tlb_func(void *info)
 	if (unlikely(loaded_mm == &init_mm))
 		return;
 
+	/* Reload the ASID if transitioning into or out of a global ASID */
+	if (needs_global_asid_reload(loaded_mm, loaded_mm_asid)) {
+		switch_mm_irqs_off(NULL, loaded_mm, NULL);
+		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	}
+
+	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
+	if (is_global_asid(loaded_mm_asid))
+		return;
+
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
@@ -924,6 +985,8 @@ static void flush_tlb_func(void *info)
 		return;
 	}
 
+	local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+
 	if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
 		     f->new_tlb_gen <= local_tlb_gen)) {
 		/*
@@ -1091,7 +1154,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
 	 * up on the new contents of what used to be page tables, while
 	 * doing a speculative memory access.
 	 */
-	if (info->freed_tables)
+	if (info->freed_tables || in_asid_transition(info->mm))
 		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
 	else
 		on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
-- 
2.47.1



* [PATCH v13 09/14] x86/mm: global ASID process exit helpers
From: Rik van Riel @ 2025-02-23 19:48 UTC
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

A global ASID is allocated for the lifetime of a process.

Free the global ASID at process exit time.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/mmu_context.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 795fdd53bd0a..d670699d32c2 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -139,6 +139,8 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
 #define enter_lazy_tlb enter_lazy_tlb
 extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
+extern void destroy_context_free_global_asid(struct mm_struct *mm);
+
 /*
  * Init a new mm.  Used on mm copies, like at fork()
  * and on mm's that are brand-new, like at execve().
@@ -161,6 +163,14 @@ static inline int init_new_context(struct task_struct *tsk,
 		mm->context.execute_only_pkey = -1;
 	}
 #endif
+
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+		mm->context.global_asid = 0;
+		mm->context.asid_transition = false;
+	}
+#endif
+
 	mm_reset_untag_mask(mm);
 	init_new_context_ldt(mm);
 	return 0;
@@ -170,6 +180,10 @@ static inline int init_new_context(struct task_struct *tsk,
 static inline void destroy_context(struct mm_struct *mm)
 {
 	destroy_context_ldt(mm);
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		destroy_context_free_global_asid(mm);
+#endif
 }
 
 extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
-- 
2.47.1



* [PATCH v13 10/14] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
From: Rik van Riel @ 2025-02-23 19:49 UTC
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

Use broadcast TLB invalidation with the INVLPGB instruction.

There is not enough room in the 12-bit ASID address space to hand
out broadcast ASIDs to every process. Only hand out broadcast ASIDs
to processes when they are observed to be simultaneously running
on 4 or more CPUs.

This also allows single threaded processes to continue using the
cheaper, local TLB invalidation instructions like INVLPG.
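
The heuristic, in sketch form (from consider_global_asid() in
this patch):

	/* Only check every once in a while. */
	if ((current->pid & 0x1f) != (jiffies & 0x1f))
		return;

	/* Hand out a global ASID once the process spans 4 or more CPUs. */
	if (mm_active_cpus_exceeds(mm, 3))
		use_global_asid(mm);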

Combined with the removal of unnecessary lru_add_drain calls
(see https://lkml.org/lkml/2024/12/19/1388) this results in a
nice performance boost for the will-it-scale tlb_flush2_threads
test on an AMD Milan system with 36 cores:

- vanilla kernel:           527k loops/second
- lru_add_drain removal:    731k loops/second
- only INVLPGB:             527k loops/second
- lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that while
TLB invalidation went down from 40% of the total CPU
time to only around 4%, the contention simply moved
to the LRU lock.

Fixing both at the same time roughly doubles the
number of iterations per second for this test.

Comparing will-it-scale tlb_flush2_threads with
several different numbers of threads on a 72 CPU
AMD Milan shows similar results. Each number is
the total number of loops per second across
all the threads:

threads		tip		invlpgb

1		315k		304k
2		423k		424k
4		644k		1032k
8		652k		1267k
16		737k		1368k
32		759k		1199k
64		636k		1094k
72		609k		993k

1 and 2 thread performance is similar with and
without invlpgb, because invlpgb is only used
on processes using 4 or more CPUs simultaneously.

The number is the median across 5 runs.

Some numbers closer to real world performance
can be found at Phoronix, thanks to Michael:

https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits

Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/mm/tlb.c | 107 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 106 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d8a04e398615..01a5edb51ebe 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -420,6 +420,108 @@ static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
 	return false;
 }
 
+/*
+ * x86 has 4k ASIDs (2k when compiled with KPTI), but the largest
+ * x86 systems have over 8k CPUs. Because of this potential ASID
+ * shortage, global ASIDs are handed out to processes that have
+ * frequent TLB flushes and are active on 4 or more CPUs simultaneously.
+ */
+static void consider_global_asid(struct mm_struct *mm)
+{
+	if (!static_cpu_has(X86_FEATURE_INVLPGB))
+		return;
+
+	/* Check every once in a while. */
+	if ((current->pid & 0x1f) != (jiffies & 0x1f))
+		return;
+
+	if (!READ_ONCE(global_asid_available))
+		return;
+
+	/*
+	 * Assign a global ASID if the process is active on
+	 * 4 or more CPUs simultaneously.
+	 */
+	if (mm_active_cpus_exceeds(mm, 3))
+		use_global_asid(mm);
+}
+
+static void finish_asid_transition(struct flush_tlb_info *info)
+{
+	struct mm_struct *mm = info->mm;
+	int bc_asid = mm_global_asid(mm);
+	int cpu;
+
+	if (!READ_ONCE(mm->context.asid_transition))
+		return;
+
+	for_each_cpu(cpu, mm_cpumask(mm)) {
+		/*
+		 * The remote CPU is context switching. Wait for that to
+		 * finish, to catch the unlikely case of it switching to
+		 * the target mm with an out of date ASID.
+		 */
+		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
+			cpu_relax();
+
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
+			continue;
+
+		/*
+		 * If at least one CPU is not using the global ASID yet,
+		 * send a TLB flush IPI. The IPI should cause stragglers
+		 * to transition soon.
+		 *
+		 * This can race with the CPU switching to another task;
+		 * that results in a (harmless) extra IPI.
+		 */
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
+			flush_tlb_multi(mm_cpumask(info->mm), info);
+			return;
+		}
+	}
+
+	/* All the CPUs running this process are using the global ASID. */
+	WRITE_ONCE(mm->context.asid_transition, false);
+}
+
+static void broadcast_tlb_flush(struct flush_tlb_info *info)
+{
+	bool pmd = info->stride_shift == PMD_SHIFT;
+	unsigned long asid = info->mm->context.global_asid;
+	unsigned long addr = info->start;
+
+	/*
+	 * TLB flushes with INVLPGB are kicked off asynchronously.
+	 * The inc_mm_tlb_gen() guarantees page table updates are done
+	 * before these TLB flushes happen.
+	 */
+	if (info->end == TLB_FLUSH_ALL) {
+		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
+		/* Do any CPUs supporting INVLPGB need PTI? */
+		if (static_cpu_has(X86_FEATURE_PTI))
+			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
+	} else do {
+		unsigned long nr = 1;
+
+		if (info->stride_shift <= PMD_SHIFT) {
+			nr = (info->end - addr) >> info->stride_shift;
+			nr = clamp_val(nr, 1, invlpgb_count_max);
+		}
+
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
+		if (static_cpu_has(X86_FEATURE_PTI))
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
+
+		addr += nr << info->stride_shift;
+	} while (addr < info->end);
+
+	finish_asid_transition(info);
+
+	/* Wait for the INVLPGBs kicked off above to finish. */
+	__tlbsync();
+}
+
 /*
  * Given an ASID, flush the corresponding user ASID.  We can delay this
  * until the next time we switch to it.
@@ -1250,9 +1352,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+	if (mm_global_asid(mm)) {
+		broadcast_tlb_flush(info);
+	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
 		info->trim_cpumask = should_trim_cpumask(mm);
 		flush_tlb_multi(mm_cpumask(mm), info);
+		consider_global_asid(mm);
 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
-- 
2.47.1



* [PATCH v13 11/14] x86/mm: do targeted broadcast flushing from tlbbatch code
From: Rik van Riel @ 2025-02-23 19:49 UTC
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

Instead of doing a system-wide TLB flush from arch_tlbbatch_flush,
queue up asynchronous, targeted flushes from arch_tlbbatch_add_pending.

This also allows us to avoid adding the CPUs of processes using broadcast
flushing to the batch->cpumask, and will hopefully further reduce TLB
flushing from the reclaim and compaction paths.
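
The resulting reclaim side flow, in simplified sketch form:

	/* Per unmapped page: queue an INVLPGB for global-ASID mms
	 * (setting need_tlbsync), or add CPUs to batch->cpumask. */
	arch_tlbbatch_add_pending(batch, mm, uaddr);

	/* At the end of the batch: IPI the remaining cpumask, then
	 * tlbsync() waits for the INVLPGBs queued above. */
	arch_tlbbatch_flush(batch);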

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/include/asm/tlb.h      | 12 ++---
 arch/x86/include/asm/tlbflush.h | 19 ++++----
 arch/x86/mm/tlb.c               | 79 +++++++++++++++++++++++++++++++--
 3 files changed, 92 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index b3cd521e5e2f..f69b243683e1 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -83,16 +83,16 @@ static inline void __tlbsync(void)
 #define INVLPGB_FINAL_ONLY		BIT(4)
 #define INVLPGB_INCLUDE_NESTED		BIT(5)
 
-static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
-						unsigned long addr,
-						u16 nr,
-						bool pmd_stride)
+static inline void __invlpgb_flush_user_nr_nosync(unsigned long pcid,
+						  unsigned long addr,
+						  u16 nr,
+						  bool pmd_stride)
 {
 	__invlpgb(0, pcid, addr, nr, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
 }
 
 /* Flush all mappings for a given PCID, not including globals. */
-static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
+static inline void __invlpgb_flush_single_pcid_nosync(unsigned long pcid)
 {
 	__invlpgb(0, pcid, 0, 1, 0, INVLPGB_PCID);
 }
@@ -105,7 +105,7 @@ static inline void invlpgb_flush_all(void)
 }
 
 /* Flush addr, including globals, for all PCIDs. */
-static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
+static inline void __invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
 {
 	__invlpgb(0, 0, addr, nr, 0, INVLPGB_INCLUDE_GLOBAL);
 }
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index f1f82571249b..241fa1435375 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -105,6 +105,9 @@ struct tlb_state {
 	 * need to be invalidated.
 	 */
 	bool invalidate_other;
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+	bool need_tlbsync;
+#endif
 
 #ifdef CONFIG_ADDRESS_MASKING
 	/*
@@ -288,6 +291,10 @@ static inline u16 mm_global_asid(struct mm_struct *mm)
 static inline void assign_mm_global_asid(struct mm_struct *mm, u16 asid)
 {
 }
+
+static inline void tlbsync(void)
+{
+}
 #endif
 
 #ifdef CONFIG_PARAVIRT
@@ -337,21 +344,15 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 	return atomic64_inc_return(&mm->context.tlb_gen);
 }
 
-static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
-					     struct mm_struct *mm,
-					     unsigned long uaddr)
-{
-	inc_mm_tlb_gen(mm);
-	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
-	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
-}
-
 static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
 {
 	flush_tlb_mm(mm);
 }
 
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+					     struct mm_struct *mm,
+					     unsigned long uaddr);
 
 static inline bool pte_flags_need_flush(unsigned long oldflags,
 					unsigned long newflags,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 01a5edb51ebe..9ca22c504f82 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -485,6 +485,37 @@ static void finish_asid_transition(struct flush_tlb_info *info)
 	WRITE_ONCE(mm->context.asid_transition, false);
 }
 
+static inline void tlbsync(void)
+{
+	if (!this_cpu_read(cpu_tlbstate.need_tlbsync))
+		return;
+	__tlbsync();
+	this_cpu_write(cpu_tlbstate.need_tlbsync, false);
+}
+
+static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
+						unsigned long addr,
+						u16 nr, bool pmd_stride)
+{
+	__invlpgb_flush_user_nr_nosync(pcid, addr, nr, pmd_stride);
+	if (!this_cpu_read(cpu_tlbstate.need_tlbsync))
+		this_cpu_write(cpu_tlbstate.need_tlbsync, true);
+}
+
+static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
+{
+	__invlpgb_flush_single_pcid_nosync(pcid);
+	if (!this_cpu_read(cpu_tlbstate.need_tlbsync))
+		this_cpu_write(cpu_tlbstate.need_tlbsync, true);
+}
+
+static inline void invlpgb_flush_addr_nosync(unsigned long addr, u16 nr)
+{
+	__invlpgb_flush_addr_nosync(addr, nr);
+	if (!this_cpu_read(cpu_tlbstate.need_tlbsync))
+		this_cpu_write(cpu_tlbstate.need_tlbsync, true);
+}
+
 static void broadcast_tlb_flush(struct flush_tlb_info *info)
 {
 	bool pmd = info->stride_shift == PMD_SHIFT;
@@ -783,6 +814,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
 		WARN_ON_ONCE(!irqs_disabled());
 
+	tlbsync();
+
 	/*
 	 * Verify that CR3 is what we think it is.  This will catch
 	 * hypothetical buggy code that directly switches to swapper_pg_dir
@@ -959,6 +992,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
  */
 void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
+	tlbsync();
+
 	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
 		return;
 
@@ -1632,9 +1667,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
-		invlpgb_flush_all_nonglobals();
-	} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
 		flush_tlb_multi(&batch->cpumask, info);
 	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();
@@ -1643,12 +1676,52 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 		local_irq_enable();
 	}
 
+	/*
+	 * If we issued (asynchronous) INVLPGB flushes, wait for them here.
+	 * The cpumask above contains only CPUs that were running tasks
+	 * not using broadcast TLB flushing.
+	 */
+	tlbsync();
+
 	cpumask_clear(&batch->cpumask);
 
 	put_flush_tlb_info();
 	put_cpu();
 }
 
+void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+					     struct mm_struct *mm,
+					     unsigned long uaddr)
+{
+	u16 asid = mm_global_asid(mm);
+
+	if (asid) {
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
+		/* Do any CPUs supporting INVLPGB need PTI? */
+		if (static_cpu_has(X86_FEATURE_PTI))
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false);
+
+		/*
+		 * Some CPUs might still be using a local ASID for this
+		 * process, and require IPIs, while others are using the
+		 * global ASID.
+		 *
+		 * In this corner case we need to do both the broadcast
+		 * TLB invalidation, and send IPIs. The IPIs will help
+		 * stragglers transition to the broadcast ASID.
+		 */
+		if (in_asid_transition(mm))
+			asid = 0;
+	}
+
+	if (!asid) {
+		inc_mm_tlb_gen(mm);
+		cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
+	}
+
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
+}
+
 /*
  * Blindly accessing user memory from NMI context can be dangerous
  * if we're in the middle of switching the current user task or
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v13 12/14] x86/mm: enable AMD translation cache extensions
  2025-02-23 19:48 [PATCH v13 00/14] AMD broadcast TLB invalidation Rik van Riel
                   ` (10 preceding siblings ...)
  2025-02-23 19:49 ` [PATCH v13 11/14] x86/mm: do targeted broadcast flushing from tlbbatch code Rik van Riel
@ 2025-02-23 19:49 ` Rik van Riel
  2025-02-23 19:49 ` [PATCH v13 13/14] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
  2025-02-23 19:49 ` [PATCH v13 14/14] x86/mm: add noinvlpgb commandline option Rik van Riel
  13 siblings, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2025-02-23 19:49 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

With AMD TCE (Translation Cache Extensions), only the intermediate
mappings that cover the address range zapped by INVLPG / INVLPGB get
invalidated, rather than all intermediate mappings being zapped at
every TLB invalidation.

This can help reduce the TLB miss rate by keeping more intermediate
mappings in the cache.

From the AMD manual:

Translation Cache Extension (TCE) Bit. Bit 15, read/write. Setting this bit
to 1 changes how the INVLPG, INVLPGB, and INVPCID instructions operate on
TLB entries. When this bit is 0, these instructions remove the target PTE
from the TLB as well as all upper-level table entries that are cached
in the TLB, whether or not they are associated with the target PTE.
When this bit is set, these instructions will remove the target PTE and
only those upper-level entries that lead to the target PTE in
the page table hierarchy, leaving unrelated upper-level entries intact.
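
As an aside, here is a minimal sketch of how the new EFER_TCE define
could be used to check the bit at runtime (the function below is made
up for illustration; it is not part of this patch):

	#include <linux/printk.h>
	#include <asm/msr.h>

	static void report_tce_state(void)
	{
		u64 efer;

		/* Read this CPU's EFER and report the TCE bit. */
		rdmsrl(MSR_EFER, efer);
		pr_info("EFER.TCE: %s\n", (efer & EFER_TCE) ? "on" : "off");
	}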

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/include/asm/msr-index.h       | 2 ++
 arch/x86/kernel/cpu/amd.c              | 4 ++++
 tools/arch/x86/include/asm/msr-index.h | 2 ++
 3 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 9a71880eec07..a7ea9720ba3c 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -25,6 +25,7 @@
 #define _EFER_SVME		12 /* Enable virtualization */
 #define _EFER_LMSLE		13 /* Long Mode Segment Limit Enable */
 #define _EFER_FFXSR		14 /* Enable Fast FXSAVE/FXRSTOR */
+#define _EFER_TCE		15 /* Enable Translation Cache Extensions */
 #define _EFER_AUTOIBRS		21 /* Enable Automatic IBRS */
 
 #define EFER_SCE		(1<<_EFER_SCE)
@@ -34,6 +35,7 @@
 #define EFER_SVME		(1<<_EFER_SVME)
 #define EFER_LMSLE		(1<<_EFER_LMSLE)
 #define EFER_FFXSR		(1<<_EFER_FFXSR)
+#define EFER_TCE		(1<<_EFER_TCE)
 #define EFER_AUTOIBRS		(1<<_EFER_AUTOIBRS)
 
 /*
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 3e8180354303..38f454671c88 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1075,6 +1075,10 @@ static void init_amd(struct cpuinfo_x86 *c)
 
 	/* AMD CPUs don't need fencing after x2APIC/TSC_DEADLINE MSR writes. */
 	clear_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
+
+	/* Enable Translation Cache Extension */
+	if (cpu_feature_enabled(X86_FEATURE_TCE))
+		msr_set_bit(MSR_EFER, _EFER_TCE);
 }
 
 #ifdef CONFIG_X86_32
diff --git a/tools/arch/x86/include/asm/msr-index.h b/tools/arch/x86/include/asm/msr-index.h
index 3ae84c3b8e6d..dc1c1057f26e 100644
--- a/tools/arch/x86/include/asm/msr-index.h
+++ b/tools/arch/x86/include/asm/msr-index.h
@@ -25,6 +25,7 @@
 #define _EFER_SVME		12 /* Enable virtualization */
 #define _EFER_LMSLE		13 /* Long Mode Segment Limit Enable */
 #define _EFER_FFXSR		14 /* Enable Fast FXSAVE/FXRSTOR */
+#define _EFER_TCE		15 /* Enable Translation Cache Extensions */
 #define _EFER_AUTOIBRS		21 /* Enable Automatic IBRS */
 
 #define EFER_SCE		(1<<_EFER_SCE)
@@ -34,6 +35,7 @@
 #define EFER_SVME		(1<<_EFER_SVME)
 #define EFER_LMSLE		(1<<_EFER_LMSLE)
 #define EFER_FFXSR		(1<<_EFER_FFXSR)
+#define EFER_TCE		(1<<_EFER_TCE)
 #define EFER_AUTOIBRS		(1<<_EFER_AUTOIBRS)
 
 /*
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v13 13/14] x86/mm: only invalidate final translations with INVLPGB
  2025-02-23 19:48 [PATCH v13 00/14] AMD broadcast TLB invalidation Rik van Riel
                   ` (11 preceding siblings ...)
  2025-02-23 19:49 ` [PATCH v13 12/14] x86/mm: enable AMD translation cache extensions Rik van Riel
@ 2025-02-23 19:49 ` Rik van Riel
  2025-02-23 19:49 ` [PATCH v13 14/14] x86/mm: add noinvlpgb commandline option Rik van Riel
  13 siblings, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2025-02-23 19:49 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

Use the INVLPGB_FINAL_ONLY flag when invalidating mappings with INVLPGB.
This way only leaf mappings get removed from the TLB, leaving intermediate
translations cached.

On the (rare) occasions when we free page tables, we do a full flush,
ensuring intermediate translations get flushed from the TLB.
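
To illustrate the new parameter (a sketch only, reusing the function
signature from this patch; asid and addr are assumed to be set up by
the caller):

	/* Leaf-only flush: the page went away, its page tables did not. */
	invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, 1, false, false);

	/* Page tables were freed: intermediate translations must go too. */
	invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, 1, false, true);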

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/include/asm/tlb.h | 10 ++++++++--
 arch/x86/mm/tlb.c          | 13 +++++++------
 2 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index f69b243683e1..b1a18fe30d9b 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -86,9 +86,15 @@ static inline void __tlbsync(void)
 static inline void __invlpgb_flush_user_nr_nosync(unsigned long pcid,
 						  unsigned long addr,
 						  u16 nr,
-						  bool pmd_stride)
+						  bool pmd_stride,
+						  bool freed_tables)
 {
-	__invlpgb(0, pcid, addr, nr, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
+	u8 flags = INVLPGB_PCID | INVLPGB_VA;
+
+	if (!freed_tables)
+		flags |= INVLPGB_FINAL_ONLY;
+
+	__invlpgb(0, pcid, addr, nr, pmd_stride, flags);
 }
 
 /* Flush all mappings for a given PCID, not including globals. */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 9ca22c504f82..8494d14d2fb7 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -495,9 +495,10 @@ static inline void tlbsync(void)
 
 static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
 						unsigned long addr,
-						u16 nr, bool pmd_stride)
+						u16 nr, bool pmd_stride,
+						bool freed_tables)
 {
-	__invlpgb_flush_user_nr_nosync(pcid, addr, nr, pmd_stride);
+	__invlpgb_flush_user_nr_nosync(pcid, addr, nr, pmd_stride, freed_tables);
 	if (!this_cpu_read(cpu_tlbstate.need_tlbsync))
 		this_cpu_write(cpu_tlbstate.need_tlbsync, true);
 }
@@ -540,9 +541,9 @@ static void broadcast_tlb_flush(struct flush_tlb_info *info)
 			nr = clamp_val(nr, 1, invlpgb_count_max);
 		}
 
-		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd, info->freed_tables);
 		if (static_cpu_has(X86_FEATURE_PTI))
-			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd, info->freed_tables);
 
 		addr += nr << info->stride_shift;
 	} while (addr < info->end);
@@ -1696,10 +1697,10 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
 	u16 asid = mm_global_asid(mm);
 
 	if (asid) {
-		invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false, false);
 		/* Do any CPUs supporting INVLPGB need PTI? */
 		if (static_cpu_has(X86_FEATURE_PTI))
-			invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false);
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false, false);
 
 		/*
 		 * Some CPUs might still be using a local ASID for this
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v13 14/14] x86/mm: add noinvlpgb commandline option
  2025-02-23 19:48 [PATCH v13 00/14] AMD broadcast TLB invalidation Rik van Riel
                   ` (12 preceding siblings ...)
  2025-02-23 19:49 ` [PATCH v13 13/14] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
@ 2025-02-23 19:49 ` Rik van Riel
  2025-02-23 21:29   ` Borislav Petkov
  13 siblings, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2025-02-23 19:49 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Rik van Riel

Add a "noinvlpgb" commandline option to disable AMD
broadcast TLB flushing at boot time.

Also fix up the "nopcid" boot option to automatically
disable INVLPGB functionality, which relies on processes
running on globally allocated PCIDs.
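
For example, with a GRUB-based setup (the exact way to edit the
command line varies by distribution):

	# /etc/default/grub
	GRUB_CMDLINE_LINUX="... noinvlpgb"   # keep PCID, skip broadcast flushes
	GRUB_CMDLINE_LINUX="... nopcid"      # disables PCID and, with it, INVLPGB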

Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Brendan Jackman <jackmanb@google.com>
---
 .../admin-guide/kernel-parameters.txt         |  3 ++
 arch/x86/kernel/cpu/common.c                  | 28 +++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb8752b42ec8..91260e1949fb 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4182,6 +4182,8 @@
 
 	nomodule	Disable module load
 
+	noinvlpgb	[X86-64,EARLY] Disable the INVLPGB cpu feature.
+
 	nonmi_ipi	[X86] Disable using NMI IPIs during panic/reboot to
 			shutdown the other cpus.  Instead use the REBOOT_VECTOR
 			irq.
@@ -4190,6 +4192,7 @@
 			pagetables) support.
 
 	nopcid		[X86-64,EARLY] Disable the PCID cpu feature.
+			This also disables INVLPGB, which relies on PCID.
 
 	nopku		[X86] Disable Memory Protection Keys CPU feature found
 			in some Intel CPUs.
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 7cce91b19fb2..b038c9b50b32 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -243,6 +243,33 @@ DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
 } };
 EXPORT_PER_CPU_SYMBOL_GPL(gdt_page);
 
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+static void disable_invlpgb(void)
+{
+	/* do not emit a message if the feature is not present */
+	if (!boot_cpu_has(X86_FEATURE_INVLPGB))
+		return;
+
+	setup_clear_cpu_cap(X86_FEATURE_INVLPGB);
+	pr_info("INVLPGB feature disabled\n");
+}
+
+static int __init x86_noinvlpgb_setup(char *s)
+{
+	/* noinvlpgb doesn't accept parameters */
+	if (s)
+		return -EINVAL;
+
+	disable_invlpgb();
+	return 0;
+}
+early_param("noinvlpgb", x86_noinvlpgb_setup);
+#else
+static void disable_invlpgb(void)
+{
+}
+#endif
+
 #ifdef CONFIG_X86_64
 static int __init x86_nopcid_setup(char *s)
 {
@@ -256,6 +283,7 @@ static int __init x86_nopcid_setup(char *s)
 
 	setup_clear_cpu_cap(X86_FEATURE_PCID);
 	pr_info("nopcid: PCID feature disabled\n");
+	disable_invlpgb();
 	return 0;
 }
 early_param("nopcid", x86_nopcid_setup);
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 14/14] x86/mm: add noinvlpgb commandline option
  2025-02-23 19:49 ` [PATCH v13 14/14] x86/mm: add noinvlpgb commandline option Rik van Riel
@ 2025-02-23 21:29   ` Borislav Petkov
  2025-02-24  0:34     ` Rik van Riel
  0 siblings, 1 reply; 45+ messages in thread
From: Borislav Petkov @ 2025-02-23 21:29 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Sun, Feb 23, 2025 at 02:49:04PM -0500, Rik van Riel wrote:
> Add a "noinvlpgb" commandline option to disable AMD
> broadcast TLB flushing at boot time.

See if clearcpuid= works too pls.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 08/14] x86/mm: global ASID context switch & TLB flush handling
  2025-02-23 19:48 ` [PATCH v13 08/14] x86/mm: global ASID context switch & TLB flush handling Rik van Riel
@ 2025-02-23 23:08   ` kernel test robot
  2025-02-24  1:26     ` Rik van Riel
  2025-02-24  2:01     ` Rik van Riel
  2025-02-24 18:41   ` kernel test robot
  1 sibling, 2 replies; 45+ messages in thread
From: kernel test robot @ 2025-02-23 23:08 UTC (permalink / raw)
  To: Rik van Riel, x86
  Cc: oe-kbuild-all, linux-kernel, bp, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jackmanb, jannh, mhklinux, andrew.cooper3, Manali.Shukla,
	mingo, Rik van Riel

Hi Rik,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/x86/core]
[also build test ERROR on tip/x86/mm tip/master linus/master v6.14-rc4 next-20250221]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Rik-van-Riel/x86-mm-consolidate-full-flush-threshold-decision/20250224-035335
base:   tip/x86/core
patch link:    https://lore.kernel.org/r/20250223194943.3518952-9-riel%40surriel.com
patch subject: [PATCH v13 08/14] x86/mm: global ASID context switch & TLB flush handling
config: x86_64-buildonly-randconfig-004-20250224 (https://download.01.org/0day-ci/archive/20250224/202502240650.kzshiji7-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250224/202502240650.kzshiji7-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502240650.kzshiji7-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from <command-line>:
   arch/x86/include/asm/tlbflush.h: In function 'in_asid_transition':
>> arch/x86/include/asm/tlbflush.h:253:43: error: 'mm_context_t' has no member named 'asid_transition'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                                           ^
   include/linux/compiler_types.h:522:23: note: in definition of macro '__compiletime_assert'
     522 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:542:9: note: in expansion of macro '_compiletime_assert'
     542 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/x86/include/asm/tlbflush.h:253:22: note: in expansion of macro 'READ_ONCE'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                      ^~~~~~~~~
>> arch/x86/include/asm/tlbflush.h:253:43: error: 'mm_context_t' has no member named 'asid_transition'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                                           ^
   include/linux/compiler_types.h:522:23: note: in definition of macro '__compiletime_assert'
     522 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:542:9: note: in expansion of macro '_compiletime_assert'
     542 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/x86/include/asm/tlbflush.h:253:22: note: in expansion of macro 'READ_ONCE'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                      ^~~~~~~~~
>> arch/x86/include/asm/tlbflush.h:253:43: error: 'mm_context_t' has no member named 'asid_transition'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                                           ^
   include/linux/compiler_types.h:522:23: note: in definition of macro '__compiletime_assert'
     522 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:542:9: note: in expansion of macro '_compiletime_assert'
     542 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/x86/include/asm/tlbflush.h:253:22: note: in expansion of macro 'READ_ONCE'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                      ^~~~~~~~~
>> arch/x86/include/asm/tlbflush.h:253:43: error: 'mm_context_t' has no member named 'asid_transition'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                                           ^
   include/linux/compiler_types.h:522:23: note: in definition of macro '__compiletime_assert'
     522 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:542:9: note: in expansion of macro '_compiletime_assert'
     542 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/x86/include/asm/tlbflush.h:253:22: note: in expansion of macro 'READ_ONCE'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                      ^~~~~~~~~
>> arch/x86/include/asm/tlbflush.h:253:43: error: 'mm_context_t' has no member named 'asid_transition'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                                           ^
   include/linux/compiler_types.h:522:23: note: in definition of macro '__compiletime_assert'
     522 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:542:9: note: in expansion of macro '_compiletime_assert'
     542 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/x86/include/asm/tlbflush.h:253:22: note: in expansion of macro 'READ_ONCE'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                      ^~~~~~~~~
>> arch/x86/include/asm/tlbflush.h:253:43: error: 'mm_context_t' has no member named 'asid_transition'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                                           ^
   include/linux/compiler_types.h:498:27: note: in definition of macro '__unqual_scalar_typeof'
     498 |                 _Generic((x),                                           \
         |                           ^
   include/asm-generic/rwonce.h:50:9: note: in expansion of macro '__READ_ONCE'
      50 |         __READ_ONCE(x);                                                 \
         |         ^~~~~~~~~~~
   arch/x86/include/asm/tlbflush.h:253:22: note: in expansion of macro 'READ_ONCE'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                      ^~~~~~~~~
   In file included from ./arch/x86/include/generated/asm/rwonce.h:1,
                    from include/linux/compiler.h:354,
                    from include/linux/build_bug.h:5,
                    from include/linux/container_of.h:5,
                    from include/linux/list.h:5,
                    from include/linux/swait.h:5,
                    from include/linux/completion.h:12,
                    from include/linux/crypto.h:15,
                    from arch/x86/kernel/asm-offsets.c:9:
>> arch/x86/include/asm/tlbflush.h:253:43: error: 'mm_context_t' has no member named 'asid_transition'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                                           ^
   include/asm-generic/rwonce.h:44:73: note: in definition of macro '__READ_ONCE'
      44 | #define __READ_ONCE(x)  (*(const volatile __unqual_scalar_typeof(x) *)&(x))
         |                                                                         ^
   arch/x86/include/asm/tlbflush.h:253:22: note: in expansion of macro 'READ_ONCE'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                      ^~~~~~~~~
   make[3]: *** [scripts/Makefile.build:102: arch/x86/kernel/asm-offsets.s] Error 1 shuffle=3847572294
   make[3]: Target 'prepare' not remade because of errors.
   make[2]: *** [Makefile:1264: prepare0] Error 2 shuffle=3847572294
   make[2]: Target 'prepare' not remade because of errors.
   make[1]: *** [Makefile:251: __sub-make] Error 2 shuffle=3847572294
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:251: __sub-make] Error 2 shuffle=3847572294
   make: Target 'prepare' not remade because of errors.


vim +253 arch/x86/include/asm/tlbflush.h

   247	
   248	static inline bool in_asid_transition(struct mm_struct *mm)
   249	{
   250		if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
   251			return false;
   252	
 > 253		return mm && READ_ONCE(mm->context.asid_transition);
   254	}
   255	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 14/14] x86/mm: add noinvlpgb commandline option
  2025-02-23 21:29   ` Borislav Petkov
@ 2025-02-24  0:34     ` Rik van Riel
  2025-02-24  6:29       ` Borislav Petkov
  0 siblings, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2025-02-24  0:34 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Sun, 2025-02-23 at 22:29 +0100, Borislav Petkov wrote:
> On Sun, Feb 23, 2025 at 02:49:04PM -0500, Rik van Riel wrote:
> > Add a "noinvlpgb" commandline option to disable AMD
> > broadcast TLB flushing at boot time.
> 
> See if clearcpuid= works too pls.
> 
What specifically are the things you want me to test here?

Just that clearcpuid=419 properly disables INVLPGB, or
also whether clearcpuid=145 (PCID) automatically
disables INVLPGB as well?
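
(For reference, the two variants would look like this on the kernel
command line, one at a time:)

	clearcpuid=419		# clear X86_FEATURE_INVLPGB
	clearcpuid=145		# clear X86_FEATURE_PCID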

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 08/14] x86/mm: global ASID context switch & TLB flush handling
  2025-02-23 23:08   ` kernel test robot
@ 2025-02-24  1:26     ` Rik van Riel
  2025-02-24  2:01     ` Rik van Riel
  1 sibling, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2025-02-24  1:26 UTC (permalink / raw)
  To: kernel test robot, x86
  Cc: oe-kbuild-all, linux-kernel, bp, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jackmanb, jannh, mhklinux, andrew.cooper3, Manali.Shukla,
	mingo

On Mon, 2025-02-24 at 07:08 +0800, kernel test robot wrote:
> 
> All errors (new ones prefixed by >>):
> 
>    In file included from <command-line>:
>    arch/x86/include/asm/tlbflush.h: In function 'in_asid_transition':
> > > arch/x86/include/asm/tlbflush.h:253:43: error: 'mm_context_t' has no member named 'asid_transition'
>      253 |         return mm && READ_ONCE(mm->context.asid_transition);
>          |                                           ^

Looks like this one needs to be moved under
the #ifdef, too.

Hard-disabling X86_FEATURE_INVLPGB, or even adding an
explicit if (!IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH))
check, is not enough to get the compiler to discard the
rest of the code in that function.

We might need an #ifdef inside destroy_context_free_global_asid
as well; there is no good way to move that one into a .h file
without making the global ASID data structures in tlb.c visible
to the entire kernel instead of keeping them static in tlb.c.
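
A standalone illustration of the problem (hypothetical config and
struct names): dead code elimination only runs after the front end
has type-checked the whole function, so a reference to a member that
was compiled out is an error even in a statically-false branch.

	#include <linux/kconfig.h>

	struct ctx {
		int always_present;
	#ifdef CONFIG_FEATURE_X
		int only_with_x;
	#endif
	};

	static int read_member(struct ctx *c)
	{
		/*
		 * With CONFIG_FEATURE_X unset this branch is provably dead,
		 * but "c->only_with_x" must still parse and type-check, so
		 * the build fails before the optimizer can discard it.
		 */
		if (IS_ENABLED(CONFIG_FEATURE_X))
			return c->only_with_x;

		return c->always_present;
	}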

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 08/14] x86/mm: global ASID context switch & TLB flush handling
  2025-02-23 23:08   ` kernel test robot
  2025-02-24  1:26     ` Rik van Riel
@ 2025-02-24  2:01     ` Rik van Riel
  1 sibling, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2025-02-24  2:01 UTC (permalink / raw)
  To: kernel test robot
  Cc: x86, oe-kbuild-all, linux-kernel, bp, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jackmanb, jannh, mhklinux, andrew.cooper3, Manali.Shukla,
	mingo

On Mon, 24 Feb 2025 07:08:49 +0800
kernel test robot <lkp@intel.com> wrote:

> Hi Rik,
> 
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on tip/x86/core]
> [also build test ERROR on tip/x86/mm tip/master linus/master v6.14-rc4 next-20250221]

This version of patch 8 fixes these compile errors.

I hoped the compiler was smart enough to elide the code when
broadcast TLB invalidation was disabled at the config level,
but maybe that happens at a later stage in the compilation,
after it's already thrown the errors...

I'm not entirely happy with this, but it does seem simple enough.

---8<---

x86/mm: global ASID context switch & TLB flush handling

Context switch and TLB flush support for processes that use a global
ASID & PCID across all CPUs.

At both context switch time and TLB flush time, we need to check
whether a task is switching to a global ASID, and reload the TLB
with the new ASID as appropriate.

In both code paths, we also short-circuit the TLB flush if we
are using a global ASID, because the global ASIDs are always
kept up to date across CPUs, even while the process is not
running on a CPU.

Signed-off-by: Rik van Riel <riel@surriel.com>

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 83f1da2f1e4a..24e6531e0b1a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -240,6 +240,11 @@ static inline bool is_dyn_asid(u16 asid)
 	return asid < TLB_NR_DYN_ASIDS;
 }
 
+static inline bool is_global_asid(u16 asid)
+{
+	return !is_dyn_asid(asid);
+}
+
 #ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
 static inline u16 mm_global_asid(struct mm_struct *mm)
 {
@@ -266,6 +271,14 @@ static inline void assign_mm_global_asid(struct mm_struct *mm, u16 asid)
 	mm->context.asid_transition = true;
 	smp_store_release(&mm->context.global_asid, asid);
 }
+
+static inline bool in_asid_transition(struct mm_struct *mm)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return false;
+
+	return mm && READ_ONCE(mm->context.asid_transition);
+}
 #else
 static inline u16 mm_global_asid(struct mm_struct *mm)
 {
@@ -275,6 +288,11 @@ static inline u16 mm_global_asid(struct mm_struct *mm)
 static inline void assign_mm_global_asid(struct mm_struct *mm, u16 asid)
 {
 }
+
+static inline bool in_asid_transition(struct mm_struct *mm)
+{
+	return false;
+}
 #endif
 
 #ifdef CONFIG_PARAVIRT
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 405630479b90..e4aecd38ac4f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -227,6 +227,20 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 		return;
 	}
 
+	/*
+	 * TLB consistency for global ASIDs is maintained with hardware assisted
+	 * remote TLB flushing. Global ASIDs are always up to date.
+	 */
+	if (static_cpu_has(X86_FEATURE_INVLPGB)) {
+		u16 global_asid = mm_global_asid(next);
+
+		if (global_asid) {
+			*new_asid = global_asid;
+			*need_flush = false;
+			return;
+		}
+	}
+
 	if (this_cpu_read(cpu_tlbstate.invalidate_other))
 		clear_asid_other();
 
@@ -382,11 +396,30 @@ void destroy_context_free_global_asid(struct mm_struct *mm)
 
 	guard(raw_spinlock_irqsave)(&global_asid_lock);
 
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
 	/* The global ASID can be re-used only after flush at wrap-around. */
 	__set_bit(mm->context.global_asid, global_asid_freed);
 
 	mm->context.global_asid = 0;
 	global_asid_available++;
+#endif
+}
+
+/*
+ * Is the mm transitioning from a CPU-local ASID to a global ASID?
+ */
+static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
+{
+	u16 global_asid = mm_global_asid(next);
+
+	if (!static_cpu_has(X86_FEATURE_INVLPGB))
+		return false;
+
+	/* Process is transitioning to a global ASID */
+	if (global_asid && prev_asid != global_asid)
+		return true;
+
+	return false;
 }
 
 /*
@@ -694,7 +727,8 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 	 */
 	if (prev == next) {
 		/* Not actually switching mm's */
-		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
+		VM_WARN_ON(is_dyn_asid(prev_asid) &&
+			   this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
 			   next->context.ctx_id);
 
 		/*
@@ -711,6 +745,20 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
 			cpumask_set_cpu(cpu, mm_cpumask(next));
 
+		/* Check if the current mm is transitioning to a global ASID */
+		if (needs_global_asid_reload(next, prev_asid)) {
+			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+			choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
+			goto reload_tlb;
+		}
+
+		/*
+		 * Broadcast TLB invalidation keeps this PCID up to date
+		 * all the time.
+		 */
+		if (is_global_asid(prev_asid))
+			return;
+
 		/*
 		 * If the CPU is not in lazy TLB mode, we are just switching
 		 * from one thread in a process to another thread in the same
@@ -744,6 +792,13 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		 */
 		cond_mitigation(tsk);
 
+		/*
+		 * Let nmi_uaccess_okay() and finish_asid_transition()
+		 * know that we're changing CR3.
+		 */
+		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
+		barrier();
+
 		/*
 		 * Leave this CPU in prev's mm_cpumask. Atomic writes to
 		 * mm_cpumask can be expensive under contention. The CPU
@@ -758,14 +813,12 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 
 		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
-
-		/* Let nmi_uaccess_okay() know that we're changing CR3. */
-		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
-		barrier();
 	}
 
+reload_tlb:
 	new_lam = mm_lam_cr3_mask(next);
 	if (need_flush) {
+		VM_WARN_ON_ONCE(is_global_asid(new_asid));
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
 		load_new_mm_cr3(next->pgd, new_asid, new_lam, true);
@@ -884,7 +937,7 @@ static void flush_tlb_func(void *info)
 	const struct flush_tlb_info *f = info;
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+	u64 local_tlb_gen;
 	bool local = smp_processor_id() == f->initiating_cpu;
 	unsigned long nr_invalidate = 0;
 	u64 mm_tlb_gen;
@@ -907,6 +960,16 @@ static void flush_tlb_func(void *info)
 	if (unlikely(loaded_mm == &init_mm))
 		return;
 
+	/* Reload the ASID if transitioning into or out of a global ASID */
+	if (needs_global_asid_reload(loaded_mm, loaded_mm_asid)) {
+		switch_mm_irqs_off(NULL, loaded_mm, NULL);
+		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	}
+
+	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
+	if (is_global_asid(loaded_mm_asid))
+		return;
+
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
@@ -924,6 +987,8 @@ static void flush_tlb_func(void *info)
 		return;
 	}
 
+	local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+
 	if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
 		     f->new_tlb_gen <= local_tlb_gen)) {
 		/*
@@ -1091,7 +1156,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
 	 * up on the new contents of what used to be page tables, while
 	 * doing a speculative memory access.
 	 */
-	if (info->freed_tables)
+	if (info->freed_tables || in_asid_transition(info->mm))
 		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
 	else
 		on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 14/14] x86/mm: add noinvlpgb commandline option
  2025-02-24  0:34     ` Rik van Riel
@ 2025-02-24  6:29       ` Borislav Petkov
  0 siblings, 0 replies; 45+ messages in thread
From: Borislav Petkov @ 2025-02-24  6:29 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Sun, Feb 23, 2025 at 07:34:03PM -0500, Rik van Riel wrote:
> What specifically are the things you want me to test here?

If clearcpuid= works too, there's no need for yet another "no<bla>" switch
which I'll have to janitor away a couple of releases later.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 01/14] x86/mm: consolidate full flush threshold decision
  2025-02-23 19:48 ` [PATCH v13 01/14] x86/mm: consolidate full flush threshold decision Rik van Riel
@ 2025-02-24 11:48   ` Borislav Petkov
  0 siblings, 0 replies; 45+ messages in thread
From: Borislav Petkov @ 2025-02-24 11:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Dave Hansen

On Sun, Feb 23, 2025 at 02:48:51PM -0500, Rik van Riel wrote:
> Reduce code duplication by consolidating the decision point
> for whether to do individual invalidations or a full flush
> inside get_flush_tlb_info.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Acked-by: Dave Hansen <dave.hansen@intel.com>
> ---
>  arch/x86/mm/tlb.c | 41 +++++++++++++++++++----------------------
>  1 file changed, 19 insertions(+), 22 deletions(-)

Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 02/14] x86/mm: get INVLPGB count max from CPUID
  2025-02-23 19:48 ` [PATCH v13 02/14] x86/mm: get INVLPGB count max from CPUID Rik van Riel
@ 2025-02-24 12:00   ` Borislav Petkov
  0 siblings, 0 replies; 45+ messages in thread
From: Borislav Petkov @ 2025-02-24 12:00 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Dave Hansen

On Sun, Feb 23, 2025 at 02:48:52PM -0500, Rik van Riel wrote:
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 3da645139748..09463a2fb05f 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -183,6 +183,9 @@ static inline void cr4_init_shadow(void)
>  extern unsigned long mmu_cr4_features;
>  extern u32 *trampoline_cr4_features;
>  
> +/* How many pages can we invalidate with one INVLPGB. */

"... can be invalidated... "

Please use passive voice in your text: no "we" or "I", etc,
and describe your changes in imperative mood.

Bottom line is: personal pronouns are ambiguous in text, especially with
so many parties/companies/etc developing the kernel so let's avoid them
please.

Please check all your text.

> +extern u16 invlpgb_count_max;
> +
>  extern void initialize_tlbstate_and_flush(void);
>  
>  /*
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index 54194f5995de..3e8180354303 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -29,6 +29,8 @@
>  
>  #include "cpu.h"
>  
> +u16 invlpgb_count_max __ro_after_init;
> +
>  static inline int rdmsrl_amd_safe(unsigned msr, unsigned long long *p)
>  {
>  	u32 gprs[8] = { 0 };
> @@ -1139,6 +1141,12 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
>  		tlb_lli_2m[ENTRIES] = eax & mask;
>  
>  	tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
> +
> +	/* Max number of pages INVLPGB can invalidate in one shot */
> +	if (boot_cpu_has(X86_FEATURE_INVLPGB)) {
> +		cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
> +		invlpgb_count_max = (edx & 0xffff) + 1;
> +	}
>  }

Ontop:

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 3e8180354303..7a72ef47a983 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1143,10 +1143,8 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
 	tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
 
 	/* Max number of pages INVLPGB can invalidate in one shot */
-	if (boot_cpu_has(X86_FEATURE_INVLPGB)) {
-		cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
-		invlpgb_count_max = (edx & 0xffff) + 1;
-	}
+	if (cpu_has(c, X86_FEATURE_INVLPGB))
+		invlpgb_count_max = (cpuid_edx(0x80000008) & 0xffff) + 1;
 }
 
 static const struct cpu_dev amd_cpu_dev = {

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 03/14] x86/mm: add INVLPGB support code
  2025-02-23 19:48 ` [PATCH v13 03/14] x86/mm: add INVLPGB support code Rik van Riel
@ 2025-02-24 12:14   ` Borislav Petkov
  0 siblings, 0 replies; 45+ messages in thread
From: Borislav Petkov @ 2025-02-24 12:14 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo, Dave Hansen

On Sun, Feb 23, 2025 at 02:48:53PM -0500, Rik van Riel wrote:
> Add invlpgb.h with the helper functions and definitions needed to use

"invlpgb.h" is stale now.

> diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
> index c492bdc97b05..95997caf0935 100644
> --- a/arch/x86/include/asm/disabled-features.h
> +++ b/arch/x86/include/asm/disabled-features.h
> @@ -129,6 +129,13 @@
>  #define DISABLE_SEV_SNP		(1 << (X86_FEATURE_SEV_SNP & 31))
>  #endif
>  
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +#define DISABLE_INVLPGB		0
> +#else
> +/* Keep 32 bit kernels smaller by compiling out the INVLPGB code. */

No need for that comment.

> +#define DISABLE_INVLPGB		(1 << (X86_FEATURE_INVLPGB & 31))
> +#endif
> +
>  /*
>   * Make sure to add features to the correct mask
>   */

...

> +/* Wait for INVLPGB originated by this CPU to complete. */
> +static inline void __tlbsync(void)
> +{
> +	cant_migrate();

I could use a comment above that one...

> +	/* TLBSYNC: supported in binutils >= 0.36. */
> +	asm volatile(".byte 0x0f, 0x01, 0xff" ::: "memory");
> +}
> +

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 04/14] x86/mm: use INVLPGB for kernel TLB flushes
  2025-02-23 19:48 ` [PATCH v13 04/14] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
@ 2025-02-24 12:31   ` Borislav Petkov
  2025-02-24 12:44     ` Nadav Amit
  0 siblings, 1 reply; 45+ messages in thread
From: Borislav Petkov @ 2025-02-24 12:31 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Sun, Feb 23, 2025 at 02:48:54PM -0500, Rik van Riel wrote:
> Use broadcast TLB invalidation for kernel addresses when available.
> 
> Remove the need to send IPIs for kernel TLB flushes.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
> Tested-by: Manali Shukla <Manali.Shukla@amd.com>
> Tested-by: Brendan Jackman <jackmanb@google.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>

What's the point of keeping those Tested-by tags if you still keep changing
the patchset?

> ---
>  arch/x86/mm/tlb.c | 34 ++++++++++++++++++++++++++++++++--
>  1 file changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index dbcb5c968ff9..59396a3c6e9c 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1077,6 +1077,20 @@ void flush_tlb_all(void)
>  	on_each_cpu(do_flush_tlb_all, NULL, 1);
>  }
>  
> +static bool invlpgb_kernel_range_flush(struct flush_tlb_info *info)
> +{
> +	unsigned long addr;
> +	unsigned long nr;

Oh, c'mon:

	unsigned long addr, nr;

> +
> +	for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
> +		nr = (info->end - addr) >> PAGE_SHIFT;
> +		nr = clamp_val(nr, 1, invlpgb_count_max);
> +		invlpgb_flush_addr_nosync(addr, nr);
> +	}
> +	__tlbsync();
> +	return true;

Unused retval - that function looks like it wants to be void.
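
Combining both of those, the function would presumably end up looking
something like this (a sketch, assuming the caller no longer checks a
return value):

	static void invlpgb_kernel_range_flush(struct flush_tlb_info *info)
	{
		unsigned long addr, nr;

		for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
			/* Flush as many pages as one INVLPGB can handle. */
			nr = (info->end - addr) >> PAGE_SHIFT;
			nr = clamp_val(nr, 1, invlpgb_count_max);
			invlpgb_flush_addr_nosync(addr, nr);
		}
		__tlbsync();
	}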

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 04/14] x86/mm: use INVLPGB for kernel TLB flushes
  2025-02-24 12:31   ` Borislav Petkov
@ 2025-02-24 12:44     ` Nadav Amit
  0 siblings, 0 replies; 45+ messages in thread
From: Nadav Amit @ 2025-02-24 12:44 UTC (permalink / raw)
  To: Rik van Riel
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List, peterz,
	Dave Hansen, zhengqi.arch, thomas.lendacky, kernel-team,
	open list:MEMORY MANAGEMENT, Andrew Morton, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, Ingo Molnar,
	Borislav Petkov



> On 24 Feb 2025, at 14:31, Borislav Petkov <bp@alien8.de> wrote:
> 
> On Sun, Feb 23, 2025 at 02:48:54PM -0500, Rik van Riel wrote:
>> Use broadcast TLB invalidation for kernel addresses when available.
>> 
>> Remove the need to send IPIs for kernel TLB flushes.
>> 
>> Signed-off-by: Rik van Riel <riel@surriel.com>
>> Reviewed-by: Nadav Amit <nadav.amit@gmail.com>

Nothing concrete against this patch, but I do not remember reviewing it
thoroughly, and I do not see that I sent any “Reviewed-by” tag for it
before, so please remove it. I only reviewed thoroughly and gave
“reviewed-by” for patch 9.

[ I would note at this opportunity that while I managed to convince myself
patch 9 is safe, I personally would have considered taking a more defensive
approach there. I may reiterate it there. ]



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 05/14] x86/mm: use INVLPGB in flush_tlb_all
  2025-02-23 19:48 ` [PATCH v13 05/14] x86/mm: use INVLPGB in flush_tlb_all Rik van Riel
@ 2025-02-24 12:46   ` Borislav Petkov
  0 siblings, 0 replies; 45+ messages in thread
From: Borislav Petkov @ 2025-02-24 12:46 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Sun, Feb 23, 2025 at 02:48:55PM -0500, Rik van Riel wrote:
> The flush_tlb_all() function is not used a whole lot, but we might
> as well use broadcast TLB flushing there, too.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Tested-by: Manali Shukla <Manali.Shukla@amd.com>
> Tested-by: Brendan Jackman <jackmanb@google.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> ---
>  arch/x86/mm/tlb.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 59396a3c6e9c..2d7ed0fda61f 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1065,6 +1065,16 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>  }
>  
>  
> +static bool broadcast_flush_tlb_all(void)

broadcast?

All those INVLPGB functions have an "invlpgb_" prefix...

In any case, get rid of that function:

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2d7ed0fda61f..feaca53b7685 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1065,16 +1065,6 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 }
 
 
-static bool broadcast_flush_tlb_all(void)
-{
-	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
-		return false;
-
-	guard(preempt)();
-	invlpgb_flush_all();
-	return true;
-}
-
 static void do_flush_tlb_all(void *info)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
@@ -1085,9 +1075,11 @@ void flush_tlb_all(void)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
 
-	/* First try (faster) hardware-assisted TLB invalidation. */
-	if (broadcast_flush_tlb_all())
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+		guard(preempt)();
+		invlpgb_flush_all();
 		return;
+	}
 
 	/* Fall back to the IPI-based invalidation. */
 	on_each_cpu(do_flush_tlb_all, NULL, 1);

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 06/14] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
  2025-02-23 19:48 ` [PATCH v13 06/14] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
@ 2025-02-24 13:27   ` Borislav Petkov
  2025-02-25 19:17     ` Rik van Riel
  0 siblings, 1 reply; 45+ messages in thread
From: Borislav Petkov @ 2025-02-24 13:27 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Sun, Feb 23, 2025 at 02:48:56PM -0500, Rik van Riel wrote:
> In the page reclaim code, we only track the CPU(s) where the TLB needs
> to be flushed, rather than all the individual mappings that may be getting
> invalidated.
> 
> Use broadcast TLB flushing when that is available.
> 
> This is a temporary hack to ensure that the PCID context for
> tasks in the next patch gets properly flushed from the page

There's no "next patch" in git - just merge this three-liner with the next
one.

> reclaim code, because the IPI based flushing in arch_tlbbatch_flush
> only flushes the currently loaded TLB context on each CPU.
> 
> Patch 10 replaces this with the actual mechanism used to do

Same. No "patch 10" in git.

> broadcast TLB flushing from the page reclaim code.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Tested-by: Manali Shukla <Manali.Shukla@amd.com>
> Tested-by: Brendan Jackman <jackmanb@google.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> ---
>  arch/x86/mm/tlb.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 2d7ed0fda61f..16839651f67f 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1326,7 +1326,9 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>  	 * a local TLB flush is needed. Optimize this use-case by calling
>  	 * flush_tlb_func_local() directly in this case.
>  	 */
> -	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
> +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> +		invlpgb_flush_all_nonglobals();

I'm confused. The docs say rAX needs to be 0x6 to do "Invalidate all TLB
entries that match {ASID, PCID} excluding Global". But you're calling INVLPGB
with rAX==0 and the APM doesn't say what this does.

I'm guessing you want this to mean invalidate all non-globals for any ASID and
PCID. So I must be missing the place in the docs where it says so...

Hmmm?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 07/14] x86/mm: global ASID allocation helper functions
  2025-02-23 19:48 ` [PATCH v13 07/14] x86/mm: global ASID allocation helper functions Rik van Riel
@ 2025-02-24 14:16   ` Borislav Petkov
  2025-02-25 20:22     ` Rik van Riel
  0 siblings, 1 reply; 45+ messages in thread
From: Borislav Petkov @ 2025-02-24 14:16 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Sun, Feb 23, 2025 at 02:48:57PM -0500, Rik van Riel wrote:
> Subject: Re: [PATCH v13 07/14] x86/mm: global ASID allocation helper functions

This subject needs a verb. I.e.,

"Add global ..."

> Functions to manage global ASID space. 

Ditto.

> Multithreaded processes that
> are simultaneously active on 4 or more CPUs can get a global ASID,
> resulting in the same PCID being used for that process on every CPU.
> 
> This in turn will allow the kernel to use hardware-assisted TLB flushing
> through AMD INVLPGB or Intel RAR for these processes.
> 
> Helper functions split out by request.

No need for that sentence.

> Signed-off-by: Rik van Riel <riel@surriel.com>
> Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
> Tested-by: Manali Shukla <Manali.Shukla@amd.com>
> Tested-by: Brendan Jackman <jackmanb@google.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> ---
>  arch/x86/include/asm/mmu.h      |  11 +++
>  arch/x86/include/asm/tlbflush.h |  43 ++++++++++
>  arch/x86/mm/tlb.c               | 144 +++++++++++++++++++++++++++++++-
>  3 files changed, 195 insertions(+), 3 deletions(-)

arch/x86/mm/tlb.c:378:6: warning: no previous prototype for ‘destroy_context_free_global_asid’ [-Wmissing-prototypes]
  378 | void destroy_context_free_global_asid(struct mm_struct *mm)
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/x86/mm/tlb.c:355:13: warning: ‘use_global_asid’ defined but not used [-Wunused-function]
  355 | static void use_global_asid(struct mm_struct *mm)
      |             ^~~~~~~~~~~~~~~
arch/x86/mm/tlb.c:327:13: warning: ‘mm_active_cpus_exceeds’ defined but not used [-Wunused-function]
  327 | static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
      |             ^~~~~~~~~~~~~~~~~~~~~~

If those functions are going to remain global they better get a proper prefix
like "tlb_".

And add the functions with their first use - no need to pre-add them.

> diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
> index 3b496cdcb74b..edb5942d4829 100644
> --- a/arch/x86/include/asm/mmu.h
> +++ b/arch/x86/include/asm/mmu.h
> @@ -69,6 +69,17 @@ typedef struct {
>  	u16 pkey_allocation_map;
>  	s16 execute_only_pkey;
>  #endif
> +
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +	/*
> +	 * The global ASID will be a non-zero value when the process has
> +	 * the same ASID across all CPUs, allowing it to make use of
> +	 * hardware-assisted remote TLB invalidation like AMD INVLPGB.
> +	 */
> +	u16 global_asid;
> +	/* The process is transitioning to a new global ASID number. */
> +	bool asid_transition;
> +#endif
>  } mm_context_t;
>  
>  #define INIT_MM_CONTEXT(mm)						\
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 09463a2fb05f..83f1da2f1e4a 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -6,6 +6,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/sched.h>
>  
> +#include <asm/barrier.h>
>  #include <asm/processor.h>
>  #include <asm/cpufeature.h>
>  #include <asm/special_insns.h>
> @@ -234,6 +235,48 @@ void flush_tlb_one_kernel(unsigned long addr);
>  void flush_tlb_multi(const struct cpumask *cpumask,
>  		      const struct flush_tlb_info *info);
>  
> +static inline bool is_dyn_asid(u16 asid)
> +{
> +	return asid < TLB_NR_DYN_ASIDS;
> +}
> +
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +static inline u16 mm_global_asid(struct mm_struct *mm)
> +{
> +	u16 asid;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		return 0;
> +
> +	asid = smp_load_acquire(&mm->context.global_asid);
> +
> +	/* mm->context.global_asid is either 0, or a global ASID */
> +	VM_WARN_ON_ONCE(asid && is_dyn_asid(asid));
> +
> +	return asid;
> +}
> +
> +static inline void assign_mm_global_asid(struct mm_struct *mm, u16 asid)

mm_assign_global_asid()

> +{
> +	/*
> +	 * Notably flush_tlb_mm_range() -> broadcast_tlb_flush() ->
> +	 * finish_asid_transition() needs to observe asid_transition = true
> +	 * once it observes global_asid.
> +	 */
> +	mm->context.asid_transition = true;
> +	smp_store_release(&mm->context.global_asid, asid);
> +}
> +#else
> +static inline u16 mm_global_asid(struct mm_struct *mm)
> +{
> +	return 0;
> +}
> +
> +static inline void assign_mm_global_asid(struct mm_struct *mm, u16 asid)
> +{
> +}
> +#endif
> +
>  #ifdef CONFIG_PARAVIRT
>  #include <asm/paravirt.h>
>  #endif
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 16839651f67f..405630479b90 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -74,13 +74,15 @@
>   * use different names for each of them:
>   *
>   * ASID  - [0, TLB_NR_DYN_ASIDS-1]
> - *         the canonical identifier for an mm
> + *         the canonical identifier for an mm, dynamically allocated on each CPU
> + *         [TLB_NR_DYN_ASIDS, MAX_ASID_AVAILABLE-1]
> + *         the canonical, global identifier for an mm, identical across all CPUs
>   *
> - * kPCID - [1, TLB_NR_DYN_ASIDS]
> + * kPCID - [1, MAX_ASID_AVAILABLE]
>   *         the value we write into the PCID part of CR3; corresponds to the
>   *         ASID+1, because PCID 0 is special.
>   *
> - * uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
> + * uPCID - [2048 + 1, 2048 + MAX_ASID_AVAILABLE]
>   *         for KPTI each mm has two address spaces and thus needs two
>   *         PCID values, but we can still do with a single ASID denomination
>   *         for each mm. Corresponds to kPCID + 2048.
> @@ -251,6 +253,142 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
>  	*need_flush = true;
>  }
>  
> +/*
> + * Global ASIDs are allocated for multi-threaded processes that are
> + * active on multiple CPUs simultaneously, giving each of those
> + * processes the same PCIDs on every CPU, for use with hardware-assisted

"the same PCID" or "the same PCIDs"?

Does a multithreaded process get one or more than one PCIDs? I'd expect only
one ofc.

> + * TLB shootdown on remote CPUs, like AMD INVLPGB or Intel RAR.
> + *
> + * These global ASIDs are held for the lifetime of the process.
> + */
> +static DEFINE_RAW_SPINLOCK(global_asid_lock);
> +static u16 last_global_asid = MAX_ASID_AVAILABLE;
> +static DECLARE_BITMAP(global_asid_used, MAX_ASID_AVAILABLE);
> +static DECLARE_BITMAP(global_asid_freed, MAX_ASID_AVAILABLE);
> +static int global_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
> +
> +/*
> + * When the search for a free ASID in the global ASID space reaches
> + * MAX_ASID_AVAILABLE, a global TLB flush guarantees that previously
> + * freed global ASIDs are safe to re-use.
> + *
> + * This way the global flush only needs to happen at ASID rollover
> + * time, and not at ASID allocation time.
> + */
> +static void reset_global_asid_space(void)
> +{
> +	lockdep_assert_held(&global_asid_lock);
> +
> +	invlpgb_flush_all_nonglobals();
> +
> +	/*
> +	 * The TLB flush above makes it safe to re-use the previously
> +	 * freed global ASIDs.
> +	 */
> +	bitmap_andnot(global_asid_used, global_asid_used,
> +			global_asid_freed, MAX_ASID_AVAILABLE);
> +	bitmap_clear(global_asid_freed, 0, MAX_ASID_AVAILABLE);
> +
> +	/* Restart the search from the start of global ASID space. */
> +	last_global_asid = TLB_NR_DYN_ASIDS;
> +}
> +
> +static u16 allocate_global_asid(void)
> +{
> +	u16 asid;
> +
> +	lockdep_assert_held(&global_asid_lock);
> +
> +	/* The previous allocation hit the edge of available address space */
> +	if (last_global_asid >= MAX_ASID_AVAILABLE - 1)
> +		reset_global_asid_space();
> +
> +	asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, last_global_asid);
> +
> +	if (asid >= MAX_ASID_AVAILABLE) {

	if (asid >= MAX_ASID_AVAILABLE && !global_asid_available)

> +		/* This should never happen. */
> +		VM_WARN_ONCE(1, "Unable to allocate global ASID despite %d available\n",
> +				global_asid_available);
> +		return 0;
> +	}
> +
> +	/* Claim this global ASID. */
> +	__set_bit(asid, global_asid_used);
> +	last_global_asid = asid;
> +	global_asid_available--;
> +	return asid;
> +}
> +
> +/*
> + * Check whether a process is currently active on more than "threshold" CPUs.

@threshold - then it is clear that you mean the function argument.

> + * This is a cheap estimation on whether or not it may make sense to assign
> + * a global ASID to this process, and use broadcast TLB invalidation.
> + */
> +static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
> +{
> +	int count = 0;
> +	int cpu;

Is this function supposed to hold some sort of a lock?

> +
> +	/* This quick check should eliminate most single threaded programs. */
> +	if (cpumask_weight(mm_cpumask(mm)) <= threshold)
> +		return false;
> +
> +	/* Slower check to make sure. */
> +	for_each_cpu(cpu, mm_cpumask(mm)) {

Needs CPU hotplug protection?

> +		/* Skip the CPUs that aren't really running this process. */
> +		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
> +			continue;
> +
> +		if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu))
> +			continue;
> +
> +		if (++count > threshold)
> +			return true;
> +	}
> +	return false;
> +}
> +
> +/*
> + * Assign a global ASID to the current process, protecting against
> + * races between multiple threads in the process.
> + */
> +static void use_global_asid(struct mm_struct *mm)
> +{
> +	u16 asid;
> +
> +	guard(raw_spinlock_irqsave)(&global_asid_lock);
> +
> +	/* This process is already using broadcast TLB invalidation. */
> +	if (mm_global_asid(mm))
> +		return;
> +
> +	/* The last global ASID was consumed while waiting for the lock. */
> +	if (!global_asid_available) {
> +		VM_WARN_ONCE(1, "Ran out of global ASIDs\n");

And? Why do we want to warn here? We need to reset global ASIDs anyway.

> +		return;
> +	}
> +
> +	asid = allocate_global_asid();
> +	if (!asid)
> +		return;
> +
> +	assign_mm_global_asid(mm, asid);
> +}
> +
> +void destroy_context_free_global_asid(struct mm_struct *mm)

That name is a mouthful. mm_free_global_asid() is just fine.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 08/14] x86/mm: global ASID context switch & TLB flush handling
  2025-02-23 19:48 ` [PATCH v13 08/14] x86/mm: global ASID context switch & TLB flush handling Rik van Riel
  2025-02-23 23:08   ` kernel test robot
@ 2025-02-24 18:41   ` kernel test robot
  1 sibling, 0 replies; 45+ messages in thread
From: kernel test robot @ 2025-02-24 18:41 UTC (permalink / raw)
  To: Rik van Riel, x86
  Cc: llvm, oe-kbuild-all, linux-kernel, bp, peterz, dave.hansen,
	zhengqi.arch, nadav.amit, thomas.lendacky, kernel-team, linux-mm,
	akpm, jackmanb, jannh, mhklinux, andrew.cooper3, Manali.Shukla,
	mingo, Rik van Riel

Hi Rik,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/x86/core]
[also build test ERROR on tip/x86/mm tip/master linus/master v6.14-rc4 next-20250224]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Rik-van-Riel/x86-mm-consolidate-full-flush-threshold-decision/20250224-035335
base:   tip/x86/core
patch link:    https://lore.kernel.org/r/20250223194943.3518952-9-riel%40surriel.com
patch subject: [PATCH v13 08/14] x86/mm: global ASID context switch & TLB flush handling
config: i386-buildonly-randconfig-001-20250224 (https://download.01.org/0day-ci/archive/20250225/202502250255.UQONr93N-lkp@intel.com/config)
compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250225/202502250255.UQONr93N-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502250255.UQONr93N-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from arch/x86/kernel/asm-offsets.c:14:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:13:
   In file included from include/linux/cgroup.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:9:
   In file included from include/linux/sched/task.h:13:
   In file included from include/linux/uaccess.h:12:
   In file included from arch/x86/include/asm/uaccess.h:17:
>> arch/x86/include/asm/tlbflush.h:253:37: error: no member named 'asid_transition' in 'mm_context_t'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                                ~~~~~~~~~~~ ^
   include/asm-generic/rwonce.h:49:33: note: expanded from macro 'READ_ONCE'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |                                        ^
   include/asm-generic/rwonce.h:36:35: note: expanded from macro 'compiletime_assert_rwonce_type'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                                          ^
   include/linux/compiler_types.h:509:10: note: expanded from macro '__native_word'
     509 |         (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
         |                 ^
   include/linux/compiler_types.h:542:22: note: expanded from macro 'compiletime_assert'
     542 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                             ^~~~~~~~~
   include/linux/compiler_types.h:530:23: note: expanded from macro '_compiletime_assert'
     530 |         __compiletime_assert(condition, msg, prefix, suffix)
         |                              ^~~~~~~~~
   include/linux/compiler_types.h:522:9: note: expanded from macro '__compiletime_assert'
     522 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   In file included from arch/x86/kernel/asm-offsets.c:14:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:13:
   In file included from include/linux/cgroup.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:9:
   In file included from include/linux/sched/task.h:13:
   In file included from include/linux/uaccess.h:12:
   In file included from arch/x86/include/asm/uaccess.h:17:
>> arch/x86/include/asm/tlbflush.h:253:37: error: no member named 'asid_transition' in 'mm_context_t'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                                ~~~~~~~~~~~ ^
   include/asm-generic/rwonce.h:49:33: note: expanded from macro 'READ_ONCE'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |                                        ^
   include/asm-generic/rwonce.h:36:35: note: expanded from macro 'compiletime_assert_rwonce_type'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                                          ^
   include/linux/compiler_types.h:509:39: note: expanded from macro '__native_word'
     509 |         (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
         |                                              ^
   include/linux/compiler_types.h:542:22: note: expanded from macro 'compiletime_assert'
     542 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                             ^~~~~~~~~
   include/linux/compiler_types.h:530:23: note: expanded from macro '_compiletime_assert'
     530 |         __compiletime_assert(condition, msg, prefix, suffix)
         |                              ^~~~~~~~~
   include/linux/compiler_types.h:522:9: note: expanded from macro '__compiletime_assert'
     522 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   In file included from arch/x86/kernel/asm-offsets.c:14:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:13:
   In file included from include/linux/cgroup.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:9:
   In file included from include/linux/sched/task.h:13:
   In file included from include/linux/uaccess.h:12:
   In file included from arch/x86/include/asm/uaccess.h:17:
>> arch/x86/include/asm/tlbflush.h:253:37: error: no member named 'asid_transition' in 'mm_context_t'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                                ~~~~~~~~~~~ ^
   include/asm-generic/rwonce.h:49:33: note: expanded from macro 'READ_ONCE'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |                                        ^
   include/asm-generic/rwonce.h:36:35: note: expanded from macro 'compiletime_assert_rwonce_type'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                                          ^
   include/linux/compiler_types.h:510:10: note: expanded from macro '__native_word'
     510 |          sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
         |                 ^
   include/linux/compiler_types.h:542:22: note: expanded from macro 'compiletime_assert'
     542 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                             ^~~~~~~~~
   include/linux/compiler_types.h:530:23: note: expanded from macro '_compiletime_assert'
     530 |         __compiletime_assert(condition, msg, prefix, suffix)
         |                              ^~~~~~~~~
   include/linux/compiler_types.h:522:9: note: expanded from macro '__compiletime_assert'
     522 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   In file included from arch/x86/kernel/asm-offsets.c:14:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:13:
   In file included from include/linux/cgroup.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:9:
   In file included from include/linux/sched/task.h:13:
   In file included from include/linux/uaccess.h:12:
   In file included from arch/x86/include/asm/uaccess.h:17:
>> arch/x86/include/asm/tlbflush.h:253:37: error: no member named 'asid_transition' in 'mm_context_t'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                                ~~~~~~~~~~~ ^
   include/asm-generic/rwonce.h:49:33: note: expanded from macro 'READ_ONCE'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |                                        ^
   include/asm-generic/rwonce.h:36:35: note: expanded from macro 'compiletime_assert_rwonce_type'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                                          ^
   include/linux/compiler_types.h:510:38: note: expanded from macro '__native_word'
     510 |          sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
         |                                             ^
   include/linux/compiler_types.h:542:22: note: expanded from macro 'compiletime_assert'
     542 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                             ^~~~~~~~~
   include/linux/compiler_types.h:530:23: note: expanded from macro '_compiletime_assert'
     530 |         __compiletime_assert(condition, msg, prefix, suffix)
         |                              ^~~~~~~~~
   include/linux/compiler_types.h:522:9: note: expanded from macro '__compiletime_assert'
     522 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   In file included from arch/x86/kernel/asm-offsets.c:14:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:13:
   In file included from include/linux/cgroup.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:9:
   In file included from include/linux/sched/task.h:13:
   In file included from include/linux/uaccess.h:12:
   In file included from arch/x86/include/asm/uaccess.h:17:
>> arch/x86/include/asm/tlbflush.h:253:37: error: no member named 'asid_transition' in 'mm_context_t'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                                ~~~~~~~~~~~ ^
   include/asm-generic/rwonce.h:49:33: note: expanded from macro 'READ_ONCE'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |                                        ^
   include/asm-generic/rwonce.h:36:48: note: expanded from macro 'compiletime_assert_rwonce_type'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                                                       ^
   include/linux/compiler_types.h:542:22: note: expanded from macro 'compiletime_assert'
     542 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                             ^~~~~~~~~
   include/linux/compiler_types.h:530:23: note: expanded from macro '_compiletime_assert'
     530 |         __compiletime_assert(condition, msg, prefix, suffix)
         |                              ^~~~~~~~~
   include/linux/compiler_types.h:522:9: note: expanded from macro '__compiletime_assert'
     522 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   In file included from arch/x86/kernel/asm-offsets.c:14:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:13:
   In file included from include/linux/cgroup.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:9:
   In file included from include/linux/sched/task.h:13:
   In file included from include/linux/uaccess.h:12:
   In file included from arch/x86/include/asm/uaccess.h:17:
>> arch/x86/include/asm/tlbflush.h:253:37: error: no member named 'asid_transition' in 'mm_context_t'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                                ~~~~~~~~~~~ ^
   include/asm-generic/rwonce.h:50:14: note: expanded from macro 'READ_ONCE'
      50 |         __READ_ONCE(x);                                                 \
         |                     ^
   include/asm-generic/rwonce.h:44:65: note: expanded from macro '__READ_ONCE'
      44 | #define __READ_ONCE(x)  (*(const volatile __unqual_scalar_typeof(x) *)&(x))
         |                                                                  ^
   include/linux/compiler_types.h:498:13: note: expanded from macro '__unqual_scalar_typeof'
     498 |                 _Generic((x),                                           \
         |                           ^
   In file included from arch/x86/kernel/asm-offsets.c:14:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:13:
   In file included from include/linux/cgroup.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:9:
   In file included from include/linux/sched/task.h:13:
   In file included from include/linux/uaccess.h:12:
   In file included from arch/x86/include/asm/uaccess.h:17:
>> arch/x86/include/asm/tlbflush.h:253:37: error: no member named 'asid_transition' in 'mm_context_t'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                                ~~~~~~~~~~~ ^
   include/asm-generic/rwonce.h:50:14: note: expanded from macro 'READ_ONCE'
      50 |         __READ_ONCE(x);                                                 \
         |                     ^
   include/asm-generic/rwonce.h:44:65: note: expanded from macro '__READ_ONCE'
      44 | #define __READ_ONCE(x)  (*(const volatile __unqual_scalar_typeof(x) *)&(x))
         |                                                                  ^
   include/linux/compiler_types.h:505:15: note: expanded from macro '__unqual_scalar_typeof'
     505 |                          default: (x)))
         |                                    ^
   In file included from arch/x86/kernel/asm-offsets.c:14:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:13:
   In file included from include/linux/cgroup.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:9:
   In file included from include/linux/sched/task.h:13:
   In file included from include/linux/uaccess.h:12:
   In file included from arch/x86/include/asm/uaccess.h:17:
>> arch/x86/include/asm/tlbflush.h:253:37: error: no member named 'asid_transition' in 'mm_context_t'
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                                ~~~~~~~~~~~ ^
   include/asm-generic/rwonce.h:50:14: note: expanded from macro 'READ_ONCE'
      50 |         __READ_ONCE(x);                                                 \
         |                     ^
   include/asm-generic/rwonce.h:44:72: note: expanded from macro '__READ_ONCE'
      44 | #define __READ_ONCE(x)  (*(const volatile __unqual_scalar_typeof(x) *)&(x))
         |                                                                         ^
   In file included from arch/x86/kernel/asm-offsets.c:14:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:13:
   In file included from include/linux/cgroup.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:9:
   In file included from include/linux/sched/task.h:13:
   In file included from include/linux/uaccess.h:12:
   In file included from arch/x86/include/asm/uaccess.h:17:
>> arch/x86/include/asm/tlbflush.h:253:12: error: invalid operands to binary expression ('struct mm_struct *' and 'void')
     253 |         return mm && READ_ONCE(mm->context.asid_transition);
         |                ~~ ^  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   9 errors generated.
   make[3]: *** [scripts/Makefile.build:102: arch/x86/kernel/asm-offsets.s] Error 1 shuffle=1461763656
   make[3]: Target 'prepare' not remade because of errors.
   make[2]: *** [Makefile:1264: prepare0] Error 2 shuffle=1461763656
   make[2]: Target 'prepare' not remade because of errors.
   make[1]: *** [Makefile:251: __sub-make] Error 2 shuffle=1461763656
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:251: __sub-make] Error 2 shuffle=1461763656
   make: Target 'prepare' not remade because of errors.


vim +253 arch/x86/include/asm/tlbflush.h

   247	
   248	static inline bool in_asid_transition(struct mm_struct *mm)
   249	{
   250		if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
   251			return false;
   252	
 > 253		return mm && READ_ONCE(mm->context.asid_transition);
   254	}
   255	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 06/14] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
  2025-02-24 13:27   ` Borislav Petkov
@ 2025-02-25 19:17     ` Rik van Riel
  2025-02-25 20:38       ` Borislav Petkov
  0 siblings, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2025-02-25 19:17 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Mon, 2025-02-24 at 14:27 +0100, Borislav Petkov wrote:
> On Sun, Feb 23, 2025 at 02:48:56PM -0500, Rik van Riel wrote:
> > 
> > +++ b/arch/x86/mm/tlb.c
> > @@ -1326,7 +1326,9 @@ void arch_tlbbatch_flush(struct
> > arch_tlbflush_unmap_batch *batch)
> >  	 * a local TLB flush is needed. Optimize this use-case by
> > calling
> >  	 * flush_tlb_func_local() directly in this case.
> >  	 */
> > -	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
> > +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> > +		invlpgb_flush_all_nonglobals();
> 
> I'm confused. The docs say rAX needs to be 0x6 to do "Invalidate all
> TLB
> entries that match {ASID, PCID} excluding Global". But you're calling
> INVLPGB
> with rAX==0 and the APM doesn't say what this does.
> 
> I'm guessing you want this to mean invalidate all non-globals for any
> ASID and
> PCID. So I muss be missing the place in the docs where it says so...

You are right that it does not explicitly say it.
However, it does strongly hint at it:

"The TLB control field is specified in rAX[5:0]. It determines
which components of the address (VA, PCID, ASID) are valid for
comparison in the TLB and whether to include global entries in 
the invalidation process.

...

rAX[3:0] provides for various types of invalidations. A few 
examples are listed in the following table, but all values 
are legal." 

This text, to me, suggests we can filter the TLB
invalidations by some combination of virtual address,
PCID, or ASID, or not filter by any of those keys,
and invalidate them all.
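
Under that reading, invlpgb_flush_all_nonglobals() is simply the instruction
with none of the rAX[5:0] match bits set (a sketch, built on the illustrative
__invlpgb() wrapper posted earlier in the thread):

	static inline void invlpgb_flush_all_nonglobals(void)
	{
		/*
		 * No VA/PCID/ASID match bits set and globals excluded:
		 * invalidate every non-global entry for any PCID and ASID.
		 * A TLBSYNC is still needed before relying on completion
		 * on remote CPUs.
		 */
		__invlpgb(0, 0, 0, 1, false, 0);
	}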

Who do we need to ask to confirm that reading?

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 07/14] x86/mm: global ASID allocation helper functions
  2025-02-24 14:16   ` Borislav Petkov
@ 2025-02-25 20:22     ` Rik van Riel
  0 siblings, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2025-02-25 20:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Mon, 2025-02-24 at 15:16 +0100, Borislav Petkov wrote:
> On Sun, Feb 23, 2025 at 02:48:57PM -0500, Rik van Riel wrote:
> > 
> >  arch/x86/include/asm/mmu.h      |  11 +++
> >  arch/x86/include/asm/tlbflush.h |  43 ++++++++++
> >  arch/x86/mm/tlb.c               | 144
> > +++++++++++++++++++++++++++++++-
> >  3 files changed, 195 insertions(+), 3 deletions(-)
> 
> arch/x86/mm/tlb.c:378:6: warning: no previous prototype for
> ‘destroy_context_free_global_asid’ [-Wmissing-prototypes]
>   378 | void destroy_context_free_global_asid(struct mm_struct *mm)
>       |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> arch/x86/mm/tlb.c:355:13: warning: ‘use_global_asid’ defined but not
> used [-Wunused-function]
>   355 | static void use_global_asid(struct mm_struct *mm)
>       |             ^~~~~~~~~~~~~~~
> arch/x86/mm/tlb.c:327:13: warning: ‘mm_active_cpus_exceeds’ defined
> but not used [-Wunused-function]
>   327 | static bool mm_active_cpus_exceeds(struct mm_struct *mm, int
> threshold)
>       |             ^~~~~~~~~~~~~~~~~~~~~~
> 
> If those functions are going to remain global they better get a
> proper prefix
> like "tlb_".
> 
I've renamed the global function to 
tlb_destroy_context_free_global_asid

> And add the functions with their first use - no need to pre-add them.

That's what I originally had. Dave requested I split
out the patch into multiple ones.

That means either introducing helper functions in a
separate patch, or coming up with one version of the
code in one patch, and then replacing that code in
the next, resulting in a bunch of extra code to review.

https://lore.kernel.org/linux-kernel/b067a9fc-ff5f-4baa-a1ff-3fa749ae4d44@intel.com/
> 
> > 
> > +static inline void assign_mm_global_asid(struct mm_struct *mm, u16
> > asid)
> 
> mm_assign_global_asid()

Done.

> > +/*
> > + * Global ASIDs are allocated for multi-threaded processes that
> > are
> > + * active on multiple CPUs simultaneously, giving each of those
> > + * processes the same PCIDs on every CPU, for use with hardware-
> > assisted
> 
> "the same PCID" or "the same PCIDs"?
> 
> Does a multithreaded process get one or more than one PCIDs? I'd
> expect only
> one ofc.

It's only one. Fixed the commment.

> 
> > +	asid = find_next_zero_bit(global_asid_used,
> > MAX_ASID_AVAILABLE, last_global_asid);
> > +
> > +	if (asid >= MAX_ASID_AVAILABLE) {
> 
> 	if (asid >= MAX_ASID_AVAILABLE && !global_asid_available)

Done.

> > +/*
> > + * Check whether a process is currently active on more than
> > "threshold" CPUs.
> 
> @threshold - then it is clear that you mean the function argument.
> 
Done. Thank you.

> > + * This is a cheap estimation on whether or not it may make sense
> > to assign
> > + * a global ASID to this process, and use broadcast TLB
> > invalidation.
> > + */
> > +static bool mm_active_cpus_exceeds(struct mm_struct *mm, int
> > threshold)
> > +{
> > +	int count = 0;
> > +	int cpu;
> 
> Is this function supposed to hold some sort of a lock?

I don't think we care too much about total accuracy here.

Not holding up other CPUs is probably more important.

> 
> > +
> > +	/* This quick check should eliminate most single threaded
> > programs. */
> > +	if (cpumask_weight(mm_cpumask(mm)) <= threshold)
> > +		return false;
> > +
> > +	/* Slower check to make sure. */
> > +	for_each_cpu(cpu, mm_cpumask(mm)) {
> 
> Needs CPU hotplug protection?

I don't see CPU hotplug protection in most other loops
that iterate over CPUs and do nothing but read some
per-CPU data.

What are we trying to protect against?

What kind of protection do we need?

> 
> > +	/* The last global ASID was consumed while waiting for the
> > lock. */
> > +	if (!global_asid_available) {
> > +		VM_WARN_ONCE(1, "Ran out of global ASIDs\n");
> 
> And? Why do we want to warn here? We need to reset global ASIDs
> anyway.

This warning prints if we run out of global ASIDs,
but have more processes in the system that want one.

This basically helps us figure out whether or not
we should bother implementing more aggressive ASID
re-use (like ARM and RISCV have), which might
require us to figure out how to make that re-use
more scalable than it is today.

> 
> > +		return;
> > +	}
> > +
> > +	asid = allocate_global_asid();
> > +	if (!asid)
> > +		return;
> > +
> > +	assign_mm_global_asid(mm, asid);
> > +}
> > +
> > +void destroy_context_free_global_asid(struct mm_struct *mm)
> 
> That name is a mouthful. mm_free_global_asid() is just fine.
> 
Done.

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 06/14] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
  2025-02-25 19:17     ` Rik van Riel
@ 2025-02-25 20:38       ` Borislav Petkov
  2025-02-25 21:03         ` Borislav Petkov
  0 siblings, 1 reply; 45+ messages in thread
From: Borislav Petkov @ 2025-02-25 20:38 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Tue, Feb 25, 2025 at 02:17:20PM -0500, Rik van Riel wrote:
> Who do we need to ask to confirm that reading?

Lemme figure it out.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 06/14] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
  2025-02-25 20:38       ` Borislav Petkov
@ 2025-02-25 21:03         ` Borislav Petkov
  2025-02-25 22:03           ` Rik van Riel
  2025-02-26 17:00           ` Tom Lendacky
  0 siblings, 2 replies; 45+ messages in thread
From: Borislav Petkov @ 2025-02-25 21:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Tue, Feb 25, 2025 at 09:38:03PM +0100, Borislav Petkov wrote:
> On Tue, Feb 25, 2025 at 02:17:20PM -0500, Rik van Riel wrote:
> > Who do we need to ask to confirm that reading?
> 
> Lemme figure it out.

Confirmed - that really is the case.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 06/14] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
  2025-02-25 21:03         ` Borislav Petkov
@ 2025-02-25 22:03           ` Rik van Riel
  2025-02-26 17:00           ` Tom Lendacky
  1 sibling, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2025-02-25 22:03 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jackmanb, jannh,
	mhklinux, andrew.cooper3, Manali.Shukla, mingo

On Tue, 2025-02-25 at 22:03 +0100, Borislav Petkov wrote:
> On Tue, Feb 25, 2025 at 09:38:03PM +0100, Borislav Petkov wrote:
> > On Tue, Feb 25, 2025 at 02:17:20PM -0500, Rik van Riel wrote:
> > > Who do we need to ask to confirm that reading?
> > 
> > Lemme figure it out.
> 
> Confirmed - that really is the case.

Awesome, thank you for checking that!

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 06/14] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
  2025-02-25 21:03         ` Borislav Petkov
  2025-02-25 22:03           ` Rik van Riel
@ 2025-02-26 17:00           ` Tom Lendacky
  2025-02-26 17:02             ` Rik van Riel
  1 sibling, 1 reply; 45+ messages in thread
From: Tom Lendacky @ 2025-02-26 17:00 UTC (permalink / raw)
  To: Borislav Petkov, Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	kernel-team, linux-mm, akpm, jackmanb, jannh, mhklinux,
	andrew.cooper3, Manali.Shukla, mingo

On 2/25/25 15:03, Borislav Petkov wrote:
> On Tue, Feb 25, 2025 at 09:38:03PM +0100, Borislav Petkov wrote:
>> On Tue, Feb 25, 2025 at 02:17:20PM -0500, Rik van Riel wrote:
>>> Who do we need to ask to confirm that reading?
>>
>> Lemme figure it out.
> 
> Confirmed - that really is the case.

Hmmm... since this is for host/hypervisor TLB handling, this makes me
think that we should always be setting the ASID valid bit with 0 for the
ASID value in EDX[15:0] on all INVLPGB instructions in this series.
Otherwise we will also be flushing any guest ASID TLB entries that match
the conditions, when that isn't intended.
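
Something along these lines, reusing the illustrative wrapper and flag
names from earlier in the thread (sketch):

	/*
	 * Set the ASID-valid bit but keep EDX[15:0] == 0, so only
	 * host/hypervisor (ASID 0) entries are matched and guest
	 * ASID entries are left alone.
	 */
	__invlpgb(0 /* asid */, 0 /* pcid */, 0 /* va */, 1, false,
		  INVLPGB_FLAG_ASID);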

Thanks,
Tom

> 
> Thx.
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 06/14] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
  2025-02-26 17:00           ` Tom Lendacky
@ 2025-02-26 17:02             ` Rik van Riel
  2025-02-26 17:36               ` Tom Lendacky
  0 siblings, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2025-02-26 17:02 UTC (permalink / raw)
  To: Tom Lendacky, Borislav Petkov
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	kernel-team, linux-mm, akpm, jackmanb, jannh, mhklinux,
	andrew.cooper3, Manali.Shukla, mingo

On Wed, 2025-02-26 at 11:00 -0600, Tom Lendacky wrote:
> On 2/25/25 15:03, Borislav Petkov wrote:
> > On Tue, Feb 25, 2025 at 09:38:03PM +0100, Borislav Petkov wrote:
> > > On Tue, Feb 25, 2025 at 02:17:20PM -0500, Rik van Riel wrote:
> > > > Who do we need to ask to confirm that reading?
> > > 
> > > Lemme figure it out.
> > 
> > Confirmed - that really is the case.
> 
> Hmmm... since this is for host/hypervisor TLB handling, this makes me
> think that we should always be setting the ASID valid bit with 0 for
> the
> ASID value in EDX[15:0] on all INVLPGB instructions in this series.
> Otherwise we will also be flushing any guest ASID TLB entries that
> match
> the conditions, when that isn't intended.

The ASID is always zero in this series.

We only ever use the PCID for address space identification.

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 06/14] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
  2025-02-26 17:02             ` Rik van Riel
@ 2025-02-26 17:36               ` Tom Lendacky
  2025-02-26 17:46                 ` Rik van Riel
  0 siblings, 1 reply; 45+ messages in thread
From: Tom Lendacky @ 2025-02-26 17:36 UTC (permalink / raw)
  To: Rik van Riel, Borislav Petkov
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	kernel-team, linux-mm, akpm, jackmanb, jannh, mhklinux,
	andrew.cooper3, Manali.Shukla, mingo

On 2/26/25 11:02, Rik van Riel wrote:
> On Wed, 2025-02-26 at 11:00 -0600, Tom Lendacky wrote:
>> On 2/25/25 15:03, Borislav Petkov wrote:
>>> On Tue, Feb 25, 2025 at 09:38:03PM +0100, Borislav Petkov wrote:
>>>> On Tue, Feb 25, 2025 at 02:17:20PM -0500, Rik van Riel wrote:
>>>>> Who do we need to ask to confirm that reading?
>>>>
>>>> Lemme figure it out.
>>>
>>> Confirmed - that really is the case.
>>
>> Hmmm... since this is for host/hypervisor TLB handling, this makes me
>> think that we should always be setting the ASID valid bit with 0 for
>> the
>> ASID value in EDX[15:0] on all INVLPGB instructions in this series.
>> Otherwise we will also be flushing any guest ASID TLB entries that
>> match
>> the conditions, when that isn't intended.
> 
> The ASID is always zero in this series.
> 
> We only ever use the PCID for address space identification.

Right, but the ASID valid bit is not set, so the flushing may match more
than just host/hypervisor TLB entries.

Thanks,
Tom

> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 06/14] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
  2025-02-26 17:36               ` Tom Lendacky
@ 2025-02-26 17:46                 ` Rik van Riel
  2025-02-26 18:12                   ` Tom Lendacky
  0 siblings, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2025-02-26 17:46 UTC (permalink / raw)
  To: Tom Lendacky, Borislav Petkov
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	kernel-team, linux-mm, akpm, jackmanb, jannh, mhklinux,
	andrew.cooper3, Manali.Shukla, mingo

On Wed, 2025-02-26 at 11:36 -0600, Tom Lendacky wrote:
> 
> Right, but the ASID valid bit is not set, so the flushing may match
> more
> than just host/hypervisor TLB entries.
> 
Good point; when using SVM these flushes could result
in flushing more TLB entries than we really want.

On the flip side, when SVM is not initialized, the
invlpgb instruction will fail with a general protection
fault if we have anything at all in the ASID field.

I don't know whether setting the ASID valid bit in
rAX will cause a system to crash when SVM is not
enabled or initialized.

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 06/14] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
  2025-02-26 17:46                 ` Rik van Riel
@ 2025-02-26 18:12                   ` Tom Lendacky
  2025-02-26 22:01                     ` Rik van Riel
  2025-02-28  1:13                     ` Rik van Riel
  0 siblings, 2 replies; 45+ messages in thread
From: Tom Lendacky @ 2025-02-26 18:12 UTC (permalink / raw)
  To: Rik van Riel, Borislav Petkov
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	kernel-team, linux-mm, akpm, jackmanb, jannh, mhklinux,
	andrew.cooper3, Manali.Shukla, mingo

On 2/26/25 11:46, Rik van Riel wrote:
> On Wed, 2025-02-26 at 11:36 -0600, Tom Lendacky wrote:
>>
>> Right, but the ASID valid bit is not set, so the flushing may match
>> more
>> than just host/hypervisor TLB entries.
>>
> Good point, when using SVM these flushes could result
> in flushing more TLB entries than we really want.
> 
> On the flip side, when SVM is not initialized, the
> invlpgb instruction will fail with a general protection
> fault if we have anything at all in the ASID field.
> 
> I don't know whether setting the ASID valid bit in
> rAX will cause a system to crash when SVM is not
> enabled or initialized.

As long as you keep the ASID value in EDX[15:0] as 0, then you won't
#GP. ASID 0 is the host/hypervisor. An ASID > 0 belongs to a guest.

Thanks,
Tom

> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 06/14] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
  2025-02-26 18:12                   ` Tom Lendacky
@ 2025-02-26 22:01                     ` Rik van Riel
  2025-02-28  1:13                     ` Rik van Riel
  1 sibling, 0 replies; 45+ messages in thread
From: Rik van Riel @ 2025-02-26 22:01 UTC (permalink / raw)
  To: Tom Lendacky, Borislav Petkov
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	kernel-team, linux-mm, akpm, jackmanb, jannh, mhklinux,
	andrew.cooper3, Manali.Shukla, mingo

On Wed, 2025-02-26 at 12:12 -0600, Tom Lendacky wrote:
> On 2/26/25 11:46, Rik van Riel wrote:
> > 
> > I don't know whether setting the ASID valid bit in
> > rAX will cause a system to crash when SVM is not
> > enabled or initialized.
> 
> As long as you keep the ASID value in EDX[15:0] as 0, then you won't
> #GP. ASID 0 is the host/hypervisor. An ASID > 0 belongs to a guest.

Thank you, that is good to know.

I think I'll wait for feedback on v14 of the series
to figure out whether to post that as part of a new
series, or as a follow-up enhancement.

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 06/14] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
  2025-02-26 18:12                   ` Tom Lendacky
  2025-02-26 22:01                     ` Rik van Riel
@ 2025-02-28  1:13                     ` Rik van Riel
  2025-02-28 15:02                       ` Tom Lendacky
  1 sibling, 1 reply; 45+ messages in thread
From: Rik van Riel @ 2025-02-28  1:13 UTC (permalink / raw)
  To: Tom Lendacky, Borislav Petkov
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	kernel-team, linux-mm, akpm, jackmanb, jannh, mhklinux,
	andrew.cooper3, Manali.Shukla, mingo

On Wed, 2025-02-26 at 12:12 -0600, Tom Lendacky wrote:
> 
> As long as you keep the ASID value in EDX[15:0] as 0, then you won't
> #GP. ASID 0 is the host/hypervisor. An ASID > 0 belongs to a guest.
> 
I've been spending some time reading the KVM code,
and I don't think invlpgb would be currently useful
with KVM.

From reading pre_svm_run(), new_asid(), and svm_vcpu_run(),
it looks like the ASID number used might be different for
each VCPU, assigned on a per (physical host) CPU basis.

It would take some surgery to change that around.

Some googling around also suggests that the ASID address
space is even more limited than the PCID address space :(

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 06/14] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
  2025-02-28  1:13                     ` Rik van Riel
@ 2025-02-28 15:02                       ` Tom Lendacky
  2025-02-28 15:57                         ` Sean Christopherson
  2025-02-28 20:42                         ` Tom Lendacky
  0 siblings, 2 replies; 45+ messages in thread
From: Tom Lendacky @ 2025-02-28 15:02 UTC (permalink / raw)
  To: Rik van Riel, Borislav Petkov
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	kernel-team, linux-mm, akpm, jackmanb, jannh, mhklinux,
	andrew.cooper3, Manali.Shukla, mingo

On 2/27/25 19:13, Rik van Riel wrote:
> On Wed, 2025-02-26 at 12:12 -0600, Tom Lendacky wrote:
>>
>> As long as you keep the ASID value in EDX[15:0] as 0, then you won't
>> #GP. ASID 0 is the host/hypervisor. An ASID > 0 belongs to a guest.
>>
> I've been spending some time reading the KVM code,
> and I don't think invlpgb would be currently useful
> with KVM.
> 
> From reading pre_svm_run(), new_asid(), and svm_vcpu_run(),
> it looks like the ASID number used might be different for
> each VCPU, assigned on a per (physical host) CPU basis.
> 
> It would take some surgery to change that around.
> 
> Some googling around also suggests that the ASID address
> space is even more limited than the PCID address space :(

Right, to support using INVLPGB in guests you need a global ASID, which is
an ASID that doesn't change over the VM's lifetime and is used on all
vCPUs. Global ASIDs are only available and used today with SEV guests. At
that point you would not intercept the instruction and, based on APM vol
3, the ASID value is replaced with the guest ASID value.

"A guest that executes a legal INVLPGB that is not intercepted will have
the requested ASID field replaced by the current ASID and the valid ASID
bit set before doing the broadcast invalidation."

So I'm in the process of verifying that issuing INVLPGB in a guest with
the ASID valid bit set and an ASID value of 0 (EDX[15:0]) won't #GP, but
will just replace the specified ASID value with the guest ASID value in
hardware.

For non-SEV guests, INVLPGB would need to be intercepted and somehow
emulated or just not advertised to the guest so that the IPI path is used.

Thanks,
Tom

> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 06/14] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
  2025-02-28 15:02                       ` Tom Lendacky
@ 2025-02-28 15:57                         ` Sean Christopherson
  2025-02-28 20:42                         ` Tom Lendacky
  1 sibling, 0 replies; 45+ messages in thread
From: Sean Christopherson @ 2025-02-28 15:57 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Rik van Riel, Borislav Petkov, x86, linux-kernel, peterz,
	dave.hansen, zhengqi.arch, nadav.amit, kernel-team, linux-mm,
	akpm, jackmanb, jannh, mhklinux, andrew.cooper3, Manali.Shukla,
	mingo

On Fri, Feb 28, 2025, Tom Lendacky wrote:
> On 2/27/25 19:13, Rik van Riel wrote:
> > On Wed, 2025-02-26 at 12:12 -0600, Tom Lendacky wrote:
> >>
> >> As long as you keep the ASID value in EDX[15:0] as 0, then you won't
> >> #GP. ASID 0 is the host/hypervisor. An ASID > 0 belongs to a guest.
> >>
> > I've been spending some time reading the KVM code,
> > and I don't think invlpgb would be currently useful
> > with KVM.
> > 
> > From reading pre_svm_run(), new_asid(), and svm_vcpu_run(),
> > it looks like the ASID number used might be different for
> > each VCPU, assigned on a per (physical host) CPU basis.
> > 
> > It would take some surgery to change that around.
> > 
> > Some googling around also suggests that the ASID address
> > space is even more limited than the PCID address space :(

KVM's mess of ASID handling isn't due to space limitations; it's because early
AMD hardware didn't support a targeted ASID flush.  To avoid flushing the entire
TLB, KVM fudged around the lack of precise flushing by using a new ASID.  Under the
hood, hardware uses the new ASID so the previous entries are effectively "flushed",
and will eventually be flushed for real once they're evicted due to TLB pressure.

Irrespective of INVLPGB support, I am all in favor of ripping out the ASID
shenanigans in favor of static, per-VM ASIDs for all VM types.  For CPUs that
don't support FLUSHBYASID, KVM can simply blast TLB_CONTROL_FLUSH_ALL_ASID.

FLUSHBYASID was added ~15 years ago.  If someone is still running hardware that's
that old, they can't possibly care about performance.

That would meaningfully simplify KVM code, likely be a performance win on modern
hardware, and gives us direct line of sight to using INVLPGB (assuming it's a
performance win).
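
Roughly (a sketch; TLB_CONTROL_FLUSH_ASID and TLB_CONTROL_FLUSH_ALL_ASID are
the existing VMCB tlb_ctl values, the surrounding code is illustrative):

	/* Precise flush when the CPU supports it, full flush otherwise. */
	if (boot_cpu_has(X86_FEATURE_FLUSHBYASID))
		svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID;
	else
		svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ALL_ASID;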
 
> Right, to support using INVLPGB in guests you need a global ASID, 

I'm pretty sure Rik is talking about using INVLPGB in the _host_, e.g. by doing
INVLPGB in kvm_arch_flush_remote_tlbs() instead of blasting an IPI to all vCPUs.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v13 06/14] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
  2025-02-28 15:02                       ` Tom Lendacky
  2025-02-28 15:57                         ` Sean Christopherson
@ 2025-02-28 20:42                         ` Tom Lendacky
  1 sibling, 0 replies; 45+ messages in thread
From: Tom Lendacky @ 2025-02-28 20:42 UTC (permalink / raw)
  To: Rik van Riel, Borislav Petkov
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	kernel-team, linux-mm, akpm, jackmanb, jannh, mhklinux,
	andrew.cooper3, Manali.Shukla, mingo

On 2/28/25 09:02, Tom Lendacky wrote:
> On 2/27/25 19:13, Rik van Riel wrote:
>> On Wed, 2025-02-26 at 12:12 -0600, Tom Lendacky wrote:
>>>
>>> As long as you keep the ASID value in EDX[15:0] as 0, then you won't
>>> #GP. ASID 0 is the host/hypervisor. An ASID > 0 belongs to a guest.
>>>
>> I've been spending some time reading the KVM code,
>> and I don't think invlpgb would be currently useful
>> with KVM.
>>
>> From reading pre_svm_run(), new_asid(), and svm_vcpu_run(),
>> it looks like the ASID number used might be different for
>> each VCPU, assigned on a per (physical host) CPU basis.
>>
>> It would take some surgery to change that around.
>>
>> Some googling around also suggests that the ASID address
>> space is even more limited than the PCID address space :(
> 
> Right, to support using INVLPGB in guests you need a global ASID, which is
> an ASID that doesn't change over the VMs lifetime and is used on all
> vCPUs. Global ASIDs are only available and used today with SEV guests. At
> that point you would not intercept the instruction and, based on APM vol
> 3, the ASID value is replaced with the guest ASID value.
> 
> "A guest that executes a legal INVLPGB that is not intercepted will have
> the requested ASID field replaced by the current ASID and the valid ASID
> bit set before doing the broadcast invalidation."
> 
> So I'm in the process of verifying that issuing INVLPGB in a guest with
> the ASID valid bit set and an ASID value of 0 (EDX[15:0]) won't #GP, but
> will just replace the specified ASID value with the guest ASID value in
> hardware.

I verified that when (a non-intercepted) INVLPGB is issued in a guest,
hardware will set the ASID valid bit and use the guest ASID value
(regardless of the value specified in EDX[15:0]) before doing the
broadcast invalidation.

So the implementation of setting the ASID-valid bit and specifying ASID 0
is not incompatible in the guest.

Thanks,
Tom

> 
> For non-SEV guests, INVLPGB would need to be intercepted and somehow
> emulated or just not advertised to the guest so that the IPI path is used.
> 
> Thanks,
> Tom
> 
>>


^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2025-02-28 20:43 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-23 19:48 [PATCH v13 00/14] AMD broadcast TLB invalidation Rik van Riel
2025-02-23 19:48 ` [PATCH v13 01/14] x86/mm: consolidate full flush threshold decision Rik van Riel
2025-02-24 11:48   ` Borislav Petkov
2025-02-23 19:48 ` [PATCH v13 02/14] x86/mm: get INVLPGB count max from CPUID Rik van Riel
2025-02-24 12:00   ` Borislav Petkov
2025-02-23 19:48 ` [PATCH v13 03/14] x86/mm: add INVLPGB support code Rik van Riel
2025-02-24 12:14   ` Borislav Petkov
2025-02-23 19:48 ` [PATCH v13 04/14] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
2025-02-24 12:31   ` Borislav Petkov
2025-02-24 12:44     ` Nadav Amit
2025-02-23 19:48 ` [PATCH v13 05/14] x86/mm: use INVLPGB in flush_tlb_all Rik van Riel
2025-02-24 12:46   ` Borislav Petkov
2025-02-23 19:48 ` [PATCH v13 06/14] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
2025-02-24 13:27   ` Borislav Petkov
2025-02-25 19:17     ` Rik van Riel
2025-02-25 20:38       ` Borislav Petkov
2025-02-25 21:03         ` Borislav Petkov
2025-02-25 22:03           ` Rik van Riel
2025-02-26 17:00           ` Tom Lendacky
2025-02-26 17:02             ` Rik van Riel
2025-02-26 17:36               ` Tom Lendacky
2025-02-26 17:46                 ` Rik van Riel
2025-02-26 18:12                   ` Tom Lendacky
2025-02-26 22:01                     ` Rik van Riel
2025-02-28  1:13                     ` Rik van Riel
2025-02-28 15:02                       ` Tom Lendacky
2025-02-28 15:57                         ` Sean Christopherson
2025-02-28 20:42                         ` Tom Lendacky
2025-02-23 19:48 ` [PATCH v13 07/14] x86/mm: global ASID allocation helper functions Rik van Riel
2025-02-24 14:16   ` Borislav Petkov
2025-02-25 20:22     ` Rik van Riel
2025-02-23 19:48 ` [PATCH v13 08/14] x86/mm: global ASID context switch & TLB flush handling Rik van Riel
2025-02-23 23:08   ` kernel test robot
2025-02-24  1:26     ` Rik van Riel
2025-02-24  2:01     ` Rik van Riel
2025-02-24 18:41   ` kernel test robot
2025-02-23 19:48 ` [PATCH v13 09/14] x86/mm: global ASID process exit helpers Rik van Riel
2025-02-23 19:49 ` [PATCH v13 10/14] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
2025-02-23 19:49 ` [PATCH v13 11/14] x86/mm: do targeted broadcast flushing from tlbbatch code Rik van Riel
2025-02-23 19:49 ` [PATCH v13 12/14] x86/mm: enable AMD translation cache extensions Rik van Riel
2025-02-23 19:49 ` [PATCH v13 13/14] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
2025-02-23 19:49 ` [PATCH v13 14/14] x86/mm: add noinvlpgb commandline option Rik van Riel
2025-02-23 21:29   ` Borislav Petkov
2025-02-24  0:34     ` Rik van Riel
2025-02-24  6:29       ` Borislav Petkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).