linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v4 00/10] AMD broadcast TLB invalidation
@ 2025-01-12 15:53 Rik van Riel
  2025-01-12 15:53 ` [PATCH v4 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional Rik van Riel
                   ` (11 more replies)
  0 siblings, 12 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-12 15:53 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh

Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.

This allows the kernel to invalidate TLB entries on remote CPUs without
needing to send IPIs, without having to wait for remote CPUs to handle
those interrupts, and with less interruption to what was running on
those CPUs.

Because x86 PCID space is limited, and there are some very large
systems out there, broadcast TLB invalidation is only used for
processes that are active on 3 or more CPUs, with the threshold
gradually increasing as the PCID space fills up.
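
To illustrate how that threshold scales, here is a small stand-alone
C sketch that mirrors the heuristic implemented in patch 9
(meets_global_asid_threshold); MAX_ASID_AVAILABLE is a placeholder
value here, the real one depends on the PCID configuration:

#include <stdio.h>

#define MAX_ASID_AVAILABLE	4096	/* placeholder value */
#define HALFFULL_THRESHOLD	8

/* Number of active CPUs a process must exceed to get a global ASID. */
static int global_asid_threshold(int avail)
{
	int threshold = HALFFULL_THRESHOLD;

	if (avail > MAX_ASID_AVAILABLE * 3 / 4)
		threshold = HALFFULL_THRESHOLD / 4;
	else if (avail > MAX_ASID_AVAILABLE / 2)
		threshold = HALFFULL_THRESHOLD / 2;
	else if (avail < MAX_ASID_AVAILABLE / 3) {
		/* Scale the threshold up as the remaining space shrinks. */
		do {
			avail *= 2;
			threshold *= 2;
		} while ((avail + threshold) < MAX_ASID_AVAILABLE / 2);
	}

	return threshold;
}

int main(void)
{
	int avail;

	for (avail = MAX_ASID_AVAILABLE; avail >= 1; avail /= 2)
		printf("%4d ASIDs free -> more than %d active CPUs needed\n",
		       avail, global_asid_threshold(avail));

	return 0;
}

With most of the space free the threshold works out to 2, matching
the "active on 3 or more CPUs" rule above.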

Combined with the removal of unnecessary lru_add_drain calls
(see https://lkml.org/lkml/2024/12/19/1388) this results in a
nice performance boost for the will-it-scale tlb_flush2_threads
test on an AMD Milan system with 36 cores:

- vanilla kernel:           527k loops/second
- lru_add_drain removal:    731k loops/second
- only INVLPGB:             527k loops/second
- lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that while
TLB invalidation went down from 40% of the total CPU time
to only around 4%, the contention simply moved to the
LRU lock.

Fixing both at the same time roughly doubles the
number of iterations per second for this test.

Some numbers closer to real-world performance
can be found at Phoronix, thanks to Michael:

https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits

There was a large amount of feedback and debate around v3
of the series. I have tried to incorporate everybody's
feedback, but please let me know if I missed a spot.

v4:
 - Use only bitmaps to track free global ASIDs (Nadav)
 - Improved AMD initialization (Borislav & Tom)
 - Various naming and documentation improvements (Peter, Nadav, Tom, Dave)
 - Fixes for subtle race conditions (Jann)
v3:
 - Remove paravirt tlb_remove_table call (thank you Qi Zheng)
 - More suggested cleanups and changelog fixes by Peter and Nadav
v2:
 - Apply suggestions by Peter and Borislav (thank you!)
 - Fix bug in arch_tlbbatch_flush, where we need to do both
   the TLBSYNC, and flush the CPUs that are in the cpumask.
 - Some updates to comments and changelogs based on questions.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v4 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
  2025-01-12 15:53 [RFC PATCH v4 00/10] AMD broadcast TLB invalidation Rik van Riel
@ 2025-01-12 15:53 ` Rik van Riel
  2025-01-14 12:32   ` Borislav Petkov
  2025-01-12 15:53 ` [PATCH v4 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call Rik van Riel
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Rik van Riel @ 2025-01-12 15:53 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, Rik van Riel

Currently x86 uses CONFIG_MMU_GATHER_RCU_TABLE_FREE when using
paravirt, and not when running on bare metal.

There is no good reason to do things differently for
each setup. Make them all the same.

Currently get_user_pages_fast synchronizes against page table
freeing in two different ways:
- on bare metal, by blocking IRQs, which block TLB flush IPIs
- on paravirt, with MMU_GATHER_RCU_TABLE_FREE

This is done because some paravirt TLB flush implementations
handle the TLB flush in the hypervisor, and will do the flush
even when the target CPU has interrupts disabled.

After this change, the synchronization between get_user_pages_fast
and page table freeing is always handled with MMU_GATHER_RCU_TABLE_FREE,
which allows bare metal to also do TLB flushes while interrupts are
disabled.

That makes it safe to use INVLPGB on AMD CPUs.

Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/Kconfig           | 2 +-
 arch/x86/kernel/paravirt.c | 7 +------
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9d7bd0ae48c4..e8743f8c9fd0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -274,7 +274,7 @@ config X86
 	select HAVE_PCI
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
-	select MMU_GATHER_RCU_TABLE_FREE	if PARAVIRT
+	select MMU_GATHER_RCU_TABLE_FREE
 	select MMU_GATHER_MERGE_VMAS
 	select HAVE_POSIX_CPU_TIMERS_TASK_WORK
 	select HAVE_REGS_AND_STACK_ACCESS_API
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index fec381533555..2b78a6b466ed 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -59,11 +59,6 @@ void __init native_pv_lock_init(void)
 		static_branch_enable(&virt_spin_lock_key);
 }
 
-static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
-{
-	tlb_remove_page(tlb, table);
-}
-
 struct static_key paravirt_steal_enabled;
 struct static_key paravirt_steal_rq_enabled;
 
@@ -191,7 +186,7 @@ struct paravirt_patch_template pv_ops = {
 	.mmu.flush_tlb_kernel	= native_flush_tlb_global,
 	.mmu.flush_tlb_one_user	= native_flush_tlb_one_user,
 	.mmu.flush_tlb_multi	= native_flush_tlb_multi,
-	.mmu.tlb_remove_table	= native_tlb_remove_table,
+	.mmu.tlb_remove_table	= tlb_remove_table,
 
 	.mmu.exit_mmap		= paravirt_nop,
 	.mmu.notify_page_enc_status_changed	= paravirt_nop,
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v4 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call
  2025-01-12 15:53 [RFC PATCH v4 00/10] AMD broadcast TLB invalidation Rik van Riel
  2025-01-12 15:53 ` [PATCH v4 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional Rik van Riel
@ 2025-01-12 15:53 ` Rik van Riel
  2025-01-12 15:53 ` [PATCH v4 03/12] x86/mm: consolidate full flush threshold decision Rik van Riel
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-12 15:53 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, Rik van Riel

Every pv_ops.mmu.tlb_remove_table call ends up calling tlb_remove_table.

Get rid of the indirection by simply calling tlb_remove_table directly,
and not going through the paravirt function pointers.

Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 arch/x86/hyperv/mmu.c                 |  1 -
 arch/x86/include/asm/paravirt.h       |  5 -----
 arch/x86/include/asm/paravirt_types.h |  2 --
 arch/x86/kernel/kvm.c                 |  1 -
 arch/x86/kernel/paravirt.c            |  1 -
 arch/x86/mm/pgtable.c                 | 16 ++++------------
 arch/x86/xen/mmu_pv.c                 |  1 -
 7 files changed, 4 insertions(+), 23 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 1cc113200ff5..cbe6c71e17c1 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -240,5 +240,4 @@ void hyperv_setup_mmu_ops(void)
 
 	pr_info("Using hypercall for remote TLB flush\n");
 	pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
-	pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d4eb9e1d61b8..794ba3647c6c 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -91,11 +91,6 @@ static inline void __flush_tlb_multi(const struct cpumask *cpumask,
 	PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
-static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
-{
-	PVOP_VCALL2(mmu.tlb_remove_table, tlb, table);
-}
-
 static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
 {
 	PVOP_VCALL1(mmu.exit_mmap, mm);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 8d4fbe1be489..13405959e4db 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -136,8 +136,6 @@ struct pv_mmu_ops {
 	void (*flush_tlb_multi)(const struct cpumask *cpus,
 				const struct flush_tlb_info *info);
 
-	void (*tlb_remove_table)(struct mmu_gather *tlb, void *table);
-
 	/* Hook for intercepting the destruction of an mm_struct. */
 	void (*exit_mmap)(struct mm_struct *mm);
 	void (*notify_page_enc_status_changed)(unsigned long pfn, int npages, bool enc);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7a422a6c5983..3be9b3342c67 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -838,7 +838,6 @@ static void __init kvm_guest_init(void)
 #ifdef CONFIG_SMP
 	if (pv_tlb_flush_supported()) {
 		pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
-		pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 		pr_info("KVM setup pv remote TLB flush\n");
 	}
 
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 2b78a6b466ed..c019771e0123 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -186,7 +186,6 @@ struct paravirt_patch_template pv_ops = {
 	.mmu.flush_tlb_kernel	= native_flush_tlb_global,
 	.mmu.flush_tlb_one_user	= native_flush_tlb_one_user,
 	.mmu.flush_tlb_multi	= native_flush_tlb_multi,
-	.mmu.tlb_remove_table	= tlb_remove_table,
 
 	.mmu.exit_mmap		= paravirt_nop,
 	.mmu.notify_page_enc_status_changed	= paravirt_nop,
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 5745a354a241..3dc4af1f7868 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -18,14 +18,6 @@ EXPORT_SYMBOL(physical_mask);
 #define PGTABLE_HIGHMEM 0
 #endif
 
-#ifndef CONFIG_PARAVIRT
-static inline
-void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
-{
-	tlb_remove_page(tlb, table);
-}
-#endif
-
 gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;
 
 pgtable_t pte_alloc_one(struct mm_struct *mm)
@@ -54,7 +46,7 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
 {
 	pagetable_pte_dtor(page_ptdesc(pte));
 	paravirt_release_pte(page_to_pfn(pte));
-	paravirt_tlb_remove_table(tlb, pte);
+	tlb_remove_table(tlb, pte);
 }
 
 #if CONFIG_PGTABLE_LEVELS > 2
@@ -70,7 +62,7 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 	tlb->need_flush_all = 1;
 #endif
 	pagetable_pmd_dtor(ptdesc);
-	paravirt_tlb_remove_table(tlb, ptdesc_page(ptdesc));
+	tlb_remove_table(tlb, ptdesc_page(ptdesc));
 }
 
 #if CONFIG_PGTABLE_LEVELS > 3
@@ -80,14 +72,14 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 
 	pagetable_pud_dtor(ptdesc);
 	paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
-	paravirt_tlb_remove_table(tlb, virt_to_page(pud));
+	tlb_remove_table(tlb, virt_to_page(pud));
 }
 
 #if CONFIG_PGTABLE_LEVELS > 4
 void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
 {
 	paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT);
-	paravirt_tlb_remove_table(tlb, virt_to_page(p4d));
+	tlb_remove_table(tlb, virt_to_page(p4d));
 }
 #endif	/* CONFIG_PGTABLE_LEVELS > 4 */
 #endif	/* CONFIG_PGTABLE_LEVELS > 3 */
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 55a4996d0c04..041e17282af0 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2137,7 +2137,6 @@ static const typeof(pv_ops) xen_mmu_ops __initconst = {
 		.flush_tlb_kernel = xen_flush_tlb,
 		.flush_tlb_one_user = xen_flush_tlb_one_user,
 		.flush_tlb_multi = xen_flush_tlb_multi,
-		.tlb_remove_table = tlb_remove_table,
 
 		.pgd_alloc = xen_pgd_alloc,
 		.pgd_free = xen_pgd_free,
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v4 03/12] x86/mm: consolidate full flush threshold decision
  2025-01-12 15:53 [RFC PATCH v4 00/10] AMD broadcast TLB invalidation Rik van Riel
  2025-01-12 15:53 ` [PATCH v4 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional Rik van Riel
  2025-01-12 15:53 ` [PATCH v4 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call Rik van Riel
@ 2025-01-12 15:53 ` Rik van Riel
  2025-01-12 15:53 ` [PATCH v4 04/12] x86/mm: get INVLPGB count max from CPUID Rik van Riel
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-12 15:53 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, Rik van Riel,
	Dave Hansen

Reduce code duplication by consolidating the decision point
for whether to do individual invalidations or a full flush
inside get_flush_tlb_info.

Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
---
 arch/x86/mm/tlb.c | 42 +++++++++++++++++++-----------------------
 1 file changed, 19 insertions(+), 23 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6cf881a942bb..2b339f55f839 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1009,6 +1009,15 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
 	info->initiating_cpu	= smp_processor_id();
 	info->trim_cpumask	= 0;
 
+	/*
+	 * If the number of flushes is so large that a full flush
+	 * would be faster, do a full flush.
+	 */
+	if ((end - start) >> stride_shift > tlb_single_page_flush_ceiling) {
+		info->start = 0;
+		info->end = TLB_FLUSH_ALL;
+	}
+
 	return info;
 }
 
@@ -1026,17 +1035,8 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				bool freed_tables)
 {
 	struct flush_tlb_info *info;
+	int cpu = get_cpu();
 	u64 new_tlb_gen;
-	int cpu;
-
-	cpu = get_cpu();
-
-	/* Should we flush just the requested range? */
-	if ((end == TLB_FLUSH_ALL) ||
-	    ((end - start) >> stride_shift) > tlb_single_page_flush_ceiling) {
-		start = 0;
-		end = TLB_FLUSH_ALL;
-	}
 
 	/* This is also a barrier that synchronizes with switch_mm(). */
 	new_tlb_gen = inc_mm_tlb_gen(mm);
@@ -1089,22 +1089,18 @@ static void do_kernel_range_flush(void *info)
 
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
-	/* Balance as user space task's flush, a bit conservative */
-	if (end == TLB_FLUSH_ALL ||
-	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
-	} else {
-		struct flush_tlb_info *info;
+	struct flush_tlb_info *info;
+	guard(preempt)();
 
-		preempt_disable();
-		info = get_flush_tlb_info(NULL, start, end, 0, false,
-					  TLB_GENERATION_INVALID);
+	info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
+				  TLB_GENERATION_INVALID);
 
+	if (end == TLB_FLUSH_ALL)
+		on_each_cpu(do_flush_tlb_all, NULL, 1);
+	else
 		on_each_cpu(do_kernel_range_flush, info, 1);
 
-		put_flush_tlb_info();
-		preempt_enable();
-	}
+	put_flush_tlb_info();
 }
 
 /*
@@ -1276,7 +1272,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 
 	int cpu = get_cpu();
 
-	info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
+	info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, PAGE_SHIFT, false,
 				  TLB_GENERATION_INVALID);
 	/*
 	 * flush_tlb_multi() is not optimized for the common case in which only
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v4 04/12] x86/mm: get INVLPGB count max from CPUID
  2025-01-12 15:53 [RFC PATCH v4 00/10] AMD broadcast TLB invalidation Rik van Riel
                   ` (2 preceding siblings ...)
  2025-01-12 15:53 ` [PATCH v4 03/12] x86/mm: consolidate full flush threshold decision Rik van Riel
@ 2025-01-12 15:53 ` Rik van Riel
  2025-01-13 15:50   ` Jann Horn
  2025-01-12 15:53 ` [PATCH v4 05/12] x86/mm: add INVLPGB support code Rik van Riel
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Rik van Riel @ 2025-01-12 15:53 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, Rik van Riel

The CPU advertises the maximum number of pages that can be shot down
with one INVLPGB instruction in the CPUID data.

Save that information for later use.
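
For a quick check from user space, the same CPUID leaf can be read
with GCC's cpuid helper. A small sketch; EBX bit 3 of leaf 0x80000008
is the INVLPGB feature bit (X86_FEATURE_INVLPGB below), and EDX[15:0]
holds the count minus one, hence the same +1 as in the patch:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
		puts("CPUID leaf 0x80000008 not supported");
		return 1;
	}

	if (!(ebx & (1u << 3))) {
		puts("INVLPGB not supported on this CPU");
		return 1;
	}

	printf("INVLPGB can invalidate up to %u pages in one shot\n",
	       (edx & 0xffff) + 1);
	return 0;
}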

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/Kconfig.cpu               | 5 +++++
 arch/x86/include/asm/cpufeatures.h | 1 +
 arch/x86/include/asm/tlbflush.h    | 7 +++++++
 arch/x86/kernel/cpu/amd.c          | 8 ++++++++
 4 files changed, 21 insertions(+)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 2a7279d80460..bacdc502903f 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -395,6 +395,10 @@ config X86_VMX_FEATURE_NAMES
 	def_bool y
 	depends on IA32_FEAT_CTL
 
+config X86_BROADCAST_TLB_FLUSH
+	def_bool y
+	depends on CPU_SUP_AMD
+
 menuconfig PROCESSOR_SELECT
 	bool "Supported processor vendors" if EXPERT
 	help
@@ -431,6 +435,7 @@ config CPU_SUP_CYRIX_32
 config CPU_SUP_AMD
 	default y
 	bool "Support AMD processors" if PROCESSOR_SELECT
+	select X86_BROADCAST_TLB_FLUSH
 	help
 	  This enables detection, tunings and quirks for AMD processors
 
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 17b6590748c0..f9b832e971c5 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -338,6 +338,7 @@
 #define X86_FEATURE_CLZERO		(13*32+ 0) /* "clzero" CLZERO instruction */
 #define X86_FEATURE_IRPERF		(13*32+ 1) /* "irperf" Instructions Retired Count */
 #define X86_FEATURE_XSAVEERPTR		(13*32+ 2) /* "xsaveerptr" Always save/restore FP error pointers */
+#define X86_FEATURE_INVLPGB		(13*32+ 3) /* INVLPGB and TLBSYNC instruction supported. */
 #define X86_FEATURE_RDPRU		(13*32+ 4) /* "rdpru" Read processor register at user level */
 #define X86_FEATURE_WBNOINVD		(13*32+ 9) /* "wbnoinvd" WBNOINVD instruction */
 #define X86_FEATURE_AMD_IBPB		(13*32+12) /* Indirect Branch Prediction Barrier */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 02fc2aa06e9e..8fe3b2dda507 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -183,6 +183,13 @@ static inline void cr4_init_shadow(void)
 extern unsigned long mmu_cr4_features;
 extern u32 *trampoline_cr4_features;
 
+/* How many pages can we invalidate with one INVLPGB. */
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+extern u16 invlpgb_count_max;
+#else
+#define invlpgb_count_max 1
+#endif
+
 extern void initialize_tlbstate_and_flush(void);
 
 /*
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 79d2e17f6582..bcf73775b4f8 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -29,6 +29,8 @@
 
 #include "cpu.h"
 
+u16 invlpgb_count_max __ro_after_init;
+
 static inline int rdmsrl_amd_safe(unsigned msr, unsigned long long *p)
 {
 	u32 gprs[8] = { 0 };
@@ -1135,6 +1137,12 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
 		tlb_lli_2m[ENTRIES] = eax & mask;
 
 	tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
+
+	/* Max number of pages INVLPGB can invalidate in one shot */
+	if (boot_cpu_has(X86_FEATURE_INVLPGB)) {
+		cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
+		invlpgb_count_max = (edx & 0xffff) + 1;
+	}
 }
 
 static const struct cpu_dev amd_cpu_dev = {
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v4 05/12] x86/mm: add INVLPGB support code
  2025-01-12 15:53 [RFC PATCH v4 00/10] AMD broadcast TLB invalidation Rik van Riel
                   ` (3 preceding siblings ...)
  2025-01-12 15:53 ` [PATCH v4 04/12] x86/mm: get INVLPGB count max from CPUID Rik van Riel
@ 2025-01-12 15:53 ` Rik van Riel
  2025-01-13 14:21   ` Tom Lendacky
                     ` (2 more replies)
  2025-01-12 15:53 ` [PATCH v4 06/12] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
                   ` (6 subsequent siblings)
  11 siblings, 3 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-12 15:53 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, Rik van Riel

Add invlpgb.h with the helper functions and definitions needed to use
broadcast TLB invalidation on AMD EPYC 3 and newer CPUs.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/invlpgb.h  | 95 +++++++++++++++++++++++++++++++++
 arch/x86/include/asm/tlbflush.h |  1 +
 2 files changed, 96 insertions(+)
 create mode 100644 arch/x86/include/asm/invlpgb.h

diff --git a/arch/x86/include/asm/invlpgb.h b/arch/x86/include/asm/invlpgb.h
new file mode 100644
index 000000000000..d62e3733a1ab
--- /dev/null
+++ b/arch/x86/include/asm/invlpgb.h
@@ -0,0 +1,95 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_INVLPGB
+#define _ASM_X86_INVLPGB
+
+#include <vdso/bits.h>
+
+/*
+ * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
+ *
+ * The INVLPGB instruction is weakly ordered, and a batch of invalidations can
+ * be done in a parallel fashion.
+ *
+ * TLBSYNC is used to ensure that pending INVLPGB invalidations initiated from
+ * this CPU have completed.
+ */
+static inline void __invlpgb(unsigned long asid, unsigned long pcid, unsigned long addr,
+			    int extra_count, bool pmd_stride, unsigned long flags)
+{
+	u32 edx = (pcid << 16) | asid;
+	u32 ecx = (pmd_stride << 31);
+	u64 rax = addr | flags;
+
+	/* Protect against negative numbers. */
+	extra_count = max(extra_count, 0);
+	ecx |= extra_count;
+
+	asm volatile("invlpgb" : : "a" (rax), "c" (ecx), "d" (edx));
+}
+
+/* Wait for INVLPGB originated by this CPU to complete. */
+static inline void tlbsync(void)
+{
+	asm volatile("tlbsync");
+}
+
+/*
+ * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination
+ * of the three. For example:
+ * - INVLPGB_VA | INVLPGB_INCLUDE_GLOBAL: invalidate all TLB entries at the address
+ * - INVLPGB_PCID:              	  invalidate all TLB entries matching the PCID
+ *
+ * The first can be used to invalidate (kernel) mappings at a particular
+ * address across all processes.
+ *
+ * The latter invalidates all TLB entries matching a PCID.
+ */
+#define INVLPGB_VA			BIT(0)
+#define INVLPGB_PCID			BIT(1)
+#define INVLPGB_ASID			BIT(2)
+#define INVLPGB_INCLUDE_GLOBAL		BIT(3)
+#define INVLPGB_FINAL_ONLY		BIT(4)
+#define INVLPGB_INCLUDE_NESTED		BIT(5)
+
+/* Flush all mappings for a given pcid and addr, not including globals. */
+static inline void invlpgb_flush_user(unsigned long pcid,
+				      unsigned long addr)
+{
+	__invlpgb(0, pcid, addr, 0, 0, INVLPGB_PCID | INVLPGB_VA);
+	tlbsync();
+}
+
+static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
+						unsigned long addr,
+						int nr, bool pmd_stride)
+{
+	__invlpgb(0, pcid, addr, nr - 1, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
+}
+
+/* Flush all mappings for a given PCID, not including globals. */
+static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
+{
+	__invlpgb(0, pcid, 0, 0, 0, INVLPGB_PCID);
+}
+
+/* Flush all mappings, including globals, for all PCIDs. */
+static inline void invlpgb_flush_all(void)
+{
+	__invlpgb(0, 0, 0, 0, 0, INVLPGB_INCLUDE_GLOBAL);
+	tlbsync();
+}
+
+/* Flush addr, including globals, for all PCIDs. */
+static inline void invlpgb_flush_addr_nosync(unsigned long addr, int nr)
+{
+	__invlpgb(0, 0, addr, nr - 1, 0, INVLPGB_INCLUDE_GLOBAL);
+}
+
+/* Flush all mappings for all PCIDs except globals. */
+static inline void invlpgb_flush_all_nonglobals(void)
+{
+	__invlpgb(0, 0, 0, 0, 0, 0);
+	tlbsync();
+}
+
+#endif /* _ASM_X86_INVLPGB */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8fe3b2dda507..dba5caa4a9f4 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -10,6 +10,7 @@
 #include <asm/cpufeature.h>
 #include <asm/special_insns.h>
 #include <asm/smp.h>
+#include <asm/invlpgb.h>
 #include <asm/invpcid.h>
 #include <asm/pti.h>
 #include <asm/processor-flags.h>
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v4 06/12] x86/mm: use INVLPGB for kernel TLB flushes
  2025-01-12 15:53 [RFC PATCH v4 00/10] AMD broadcast TLB invalidation Rik van Riel
                   ` (4 preceding siblings ...)
  2025-01-12 15:53 ` [PATCH v4 05/12] x86/mm: add INVLPGB support code Rik van Riel
@ 2025-01-12 15:53 ` Rik van Riel
  2025-01-12 15:53 ` [PATCH v4 07/12] x86/tlb: use INVLPGB in flush_tlb_all Rik van Riel
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-12 15:53 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, Rik van Riel

Use broadcast TLB invalidation for kernel addresses when available.

Remove the need to send IPIs for kernel TLB flushes.
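
The loop below chunks the kernel range into pieces of at most
invlpgb_count_max pages. A minimal user-space sketch of just that
arithmetic, with PAGE_SHIFT and the count max as example values:

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

int main(void)
{
	unsigned long invlpgb_count_max = 8;	/* example CPUID value */
	unsigned long start = 0x100000UL;	/* example range: 20 pages */
	unsigned long end = start + 20 * PAGE_SIZE;
	unsigned long addr, nr;

	for (addr = start; addr < end; addr += nr << PAGE_SHIFT) {
		nr = (end - addr) >> PAGE_SHIFT;
		if (nr > invlpgb_count_max)
			nr = invlpgb_count_max;
		printf("flush %2lu pages at %#lx\n", nr, addr);
	}
	return 0;
}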

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/mm/tlb.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2b339f55f839..45c7b84f6f80 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1077,6 +1077,30 @@ void flush_tlb_all(void)
 	on_each_cpu(do_flush_tlb_all, NULL, 1);
 }
 
+static bool broadcast_kernel_range_flush(struct flush_tlb_info *info)
+{
+	unsigned long addr;
+	unsigned long nr;
+
+	if (!IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH))
+		return false;
+
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return false;
+
+	if (info->end == TLB_FLUSH_ALL) {
+		invlpgb_flush_all();
+		return true;
+	}
+
+	for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
+		nr = min((info->end - addr) >> PAGE_SHIFT, invlpgb_count_max);
+		invlpgb_flush_addr_nosync(addr, nr);
+	}
+	tlbsync();
+	return true;
+}
+
 static void do_kernel_range_flush(void *info)
 {
 	struct flush_tlb_info *f = info;
@@ -1095,7 +1119,9 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 	info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
 				  TLB_GENERATION_INVALID);
 
-	if (end == TLB_FLUSH_ALL)
+	if (broadcast_kernel_range_flush(info))
+		; /* Fall through. */
+	else if (end == TLB_FLUSH_ALL)
 		on_each_cpu(do_flush_tlb_all, NULL, 1);
 	else
 		on_each_cpu(do_kernel_range_flush, info, 1);
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v4 07/12] x86/tlb: use INVLPGB in flush_tlb_all
  2025-01-12 15:53 [RFC PATCH v4 00/10] AMD broadcast TLB invalidation Rik van Riel
                   ` (5 preceding siblings ...)
  2025-01-12 15:53 ` [PATCH v4 06/12] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
@ 2025-01-12 15:53 ` Rik van Riel
  2025-01-12 15:53 ` [PATCH v4 08/12] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-12 15:53 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, Rik van Riel

The flush_tlb_all() function is not used a whole lot, but we might
as well use broadcast TLB flushing there, too.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/mm/tlb.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 45c7b84f6f80..56634c735351 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1065,6 +1065,18 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 }
 
 
+static bool broadcast_flush_tlb_all(void) {
+	if (!IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH))
+		return false;
+
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return false;
+
+	guard(preempt)();
+	invlpgb_flush_all();
+	return true;
+}
+
 static void do_flush_tlb_all(void *info)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
@@ -1073,6 +1085,8 @@ static void do_flush_tlb_all(void *info)
 
 void flush_tlb_all(void)
 {
+	if (broadcast_flush_tlb_all())
+		return;
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
 	on_each_cpu(do_flush_tlb_all, NULL, 1);
 }
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v4 08/12] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing
  2025-01-12 15:53 [RFC PATCH v4 00/10] AMD broadcast TLB invalidation Rik van Riel
                   ` (6 preceding siblings ...)
  2025-01-12 15:53 ` [PATCH v4 07/12] x86/tlb: use INVLPGB in flush_tlb_all Rik van Riel
@ 2025-01-12 15:53 ` Rik van Riel
  2025-01-12 15:53 ` [PATCH v4 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-12 15:53 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, Rik van Riel

In the page reclaim code, we only track the CPU(s) where the TLB needs
to be flushed, rather than all the individual mappings that may be getting
invalidated.

Use broadcast TLB flushing when that is available.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/mm/tlb.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 56634c735351..b47d6c3fe0af 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1319,7 +1319,9 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+		invlpgb_flush_all_nonglobals();
+	} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
 		flush_tlb_multi(&batch->cpumask, info);
 	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v4 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
  2025-01-12 15:53 [RFC PATCH v4 00/10] AMD broadcast TLB invalidation Rik van Riel
                   ` (7 preceding siblings ...)
  2025-01-12 15:53 ` [PATCH v4 08/12] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
@ 2025-01-12 15:53 ` Rik van Riel
  2025-01-13 13:09   ` Nadav Amit
  2025-01-12 15:53 ` [PATCH v4 10/12] x86,tlb: do targeted broadcast flushing from tlbbatch code Rik van Riel
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Rik van Riel @ 2025-01-12 15:53 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, Rik van Riel

Use broadcast TLB invalidation, using the INVLPGB instruction, on AMD EPYC 3
and newer CPUs.

In order to not exhaust PCID space, and to keep TLB flushes local for
single-threaded processes, we only hand out broadcast ASIDs to processes
active on 3 or more CPUs, and gradually increase the threshold as broadcast
ASID space is depleted.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/mmu.h         |   6 +
 arch/x86/include/asm/mmu_context.h |  14 ++
 arch/x86/include/asm/tlbflush.h    |  64 +++++
 arch/x86/mm/tlb.c                  | 363 ++++++++++++++++++++++++++++-
 4 files changed, 435 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 3b496cdcb74b..d71cd599fec4 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -69,6 +69,12 @@ typedef struct {
 	u16 pkey_allocation_map;
 	s16 execute_only_pkey;
 #endif
+
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+	u16 global_asid;
+	bool asid_transition;
+#endif
+
 } mm_context_t;
 
 #define INIT_MM_CONTEXT(mm)						\
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 795fdd53bd0a..d670699d32c2 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -139,6 +139,8 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
 #define enter_lazy_tlb enter_lazy_tlb
 extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
+extern void destroy_context_free_global_asid(struct mm_struct *mm);
+
 /*
  * Init a new mm.  Used on mm copies, like at fork()
  * and on mm's that are brand-new, like at execve().
@@ -161,6 +163,14 @@ static inline int init_new_context(struct task_struct *tsk,
 		mm->context.execute_only_pkey = -1;
 	}
 #endif
+
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+		mm->context.global_asid = 0;
+		mm->context.asid_transition = false;
+	}
+#endif
+
 	mm_reset_untag_mask(mm);
 	init_new_context_ldt(mm);
 	return 0;
@@ -170,6 +180,10 @@ static inline int init_new_context(struct task_struct *tsk,
 static inline void destroy_context(struct mm_struct *mm)
 {
 	destroy_context_ldt(mm);
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		destroy_context_free_global_asid(mm);
+#endif
 }
 
 extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dba5caa4a9f4..cd244cdd49dd 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -239,6 +239,70 @@ void flush_tlb_one_kernel(unsigned long addr);
 void flush_tlb_multi(const struct cpumask *cpumask,
 		      const struct flush_tlb_info *info);
 
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+static inline bool is_dyn_asid(u16 asid)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return true;
+
+	return asid < TLB_NR_DYN_ASIDS;
+}
+
+static inline bool is_global_asid(u16 asid)
+{
+	return !is_dyn_asid(asid);
+}
+
+static inline bool in_asid_transition(const struct flush_tlb_info *info)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return false;
+
+	return info->mm && info->mm->context.asid_transition;
+}
+
+static inline u16 mm_global_asid(struct mm_struct *mm)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return 0;
+
+	return mm->context.global_asid;
+}
+#else
+static inline bool is_dyn_asid(u16 asid)
+{
+	return true;
+}
+
+static inline bool is_global_asid(u16 asid)
+{
+	return false;
+}
+
+static inline bool in_asid_transition(const struct flush_tlb_info *info)
+{
+	return false;
+}
+
+static inline u16 mm_global_asid(struct mm_struct *mm)
+{
+	return 0;
+}
+
+static inline bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
+{
+	return false;
+}
+
+static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
+{
+}
+
+static inline void consider_global_asid(struct mm_struct *mm)
+{
+}
+#endif
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b47d6c3fe0af..80375ef186d5 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -74,13 +74,15 @@
  * use different names for each of them:
  *
  * ASID  - [0, TLB_NR_DYN_ASIDS-1]
- *         the canonical identifier for an mm
+ *         the canonical identifier for an mm, dynamically allocated on each CPU
+ *         [TLB_NR_DYN_ASIDS, MAX_ASID_AVAILABLE-1]
+ *         the canonical, global identifier for an mm, identical across all CPUs
  *
- * kPCID - [1, TLB_NR_DYN_ASIDS]
+ * kPCID - [1, MAX_ASID_AVAILABLE]
  *         the value we write into the PCID part of CR3; corresponds to the
  *         ASID+1, because PCID 0 is special.
  *
- * uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
+ * uPCID - [2048 + 1, 2048 + MAX_ASID_AVAILABLE]
  *         for KPTI each mm has two address spaces and thus needs two
  *         PCID values, but we can still do with a single ASID denomination
  *         for each mm. Corresponds to kPCID + 2048.
@@ -225,6 +227,19 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 		return;
 	}
 
+	/*
+	 * TLB consistency for global ASIDs is maintained with broadcast TLB
+	 * flushing. The TLB is never outdated, and does not need flushing.
+	 */
+	if (IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH) && static_cpu_has(X86_FEATURE_INVLPGB)) {
+		u16 global_asid = mm_global_asid(next);
+		if (global_asid) {
+			*new_asid = global_asid;
+			*need_flush = false;
+			return;
+		}
+	}
+
 	if (this_cpu_read(cpu_tlbstate.invalidate_other))
 		clear_asid_other();
 
@@ -251,6 +266,292 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 	*need_flush = true;
 }
 
+#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
+/*
+ * Logic for broadcast TLB invalidation.
+ */
+static DEFINE_RAW_SPINLOCK(global_asid_lock);
+static u16 last_global_asid = MAX_ASID_AVAILABLE;
+static DECLARE_BITMAP(global_asid_used, MAX_ASID_AVAILABLE) = { 0 };
+static DECLARE_BITMAP(global_asid_freed, MAX_ASID_AVAILABLE) = { 0 };
+static int global_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
+
+static void reset_global_asid_space(void)
+{
+	lockdep_assert_held(&global_asid_lock);
+
+	/*
+	 * A global TLB flush guarantees that any stale entries from
+	 * previously freed global ASIDs get flushed from the TLB
+	 * everywhere, making these global ASIDs safe to reuse.
+	 */
+	invlpgb_flush_all_nonglobals();
+
+	/*
+	 * Clear all the previously freed global ASIDs from the
+	 * broadcast_asid_used bitmap, now that the global TLB flush
+	 * has made them actually available for re-use.
+	 */
+	bitmap_andnot(global_asid_used, global_asid_used,
+			global_asid_freed, MAX_ASID_AVAILABLE);
+	bitmap_clear(global_asid_freed, 0, MAX_ASID_AVAILABLE);
+
+	/*
+	 * ASIDs 0-TLB_NR_DYN_ASIDS are used for CPU-local ASID
+	 * assignments, for tasks doing IPI based TLB shootdowns.
+	 * Restart the search from the start of the global ASID space.
+	 */
+	last_global_asid = TLB_NR_DYN_ASIDS;
+}
+
+static u16 get_global_asid(void)
+{
+	lockdep_assert_held(&global_asid_lock);
+
+	do {
+		u16 start = last_global_asid;
+		u16 asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
+
+		if (asid >= MAX_ASID_AVAILABLE) {
+			reset_global_asid_space();
+			continue;
+		}
+
+		/* Claim this global ASID. */
+		__set_bit(asid, global_asid_used);
+		last_global_asid = asid;
+		return asid;
+	} while (1);
+}
+
+/*
+ * Returns true if the mm is transitioning from a CPU-local ASID to a global 
+ * (INVLPGB) ASID, or the other way around.
+ */
+static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
+{
+	u16 global_asid = mm_global_asid(next);
+
+	if (global_asid && prev_asid != global_asid)
+		return true;
+
+	if (!global_asid && is_global_asid(prev_asid))
+		return true;
+
+	return false;
+}
+
+void destroy_context_free_global_asid(struct mm_struct *mm)
+{
+	if (!mm->context.global_asid)
+		return;
+
+	guard(raw_spinlock_irqsave)(&global_asid_lock);
+
+	/* The global ASID can be re-used only after flush at wrap-around. */
+	__set_bit(mm->context.global_asid, global_asid_freed);
+
+	mm->context.global_asid = 0;
+	global_asid_available++;
+}
+
+/*
+ * Check whether a process is currently active on more than "threshold" CPUs.
+ * This is a cheap estimation on whether or not it may make sense to assign
+ * a global ASID to this process, and use broadcast TLB invalidation.
+ */
+static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
+{
+	int count = 0;
+	int cpu;
+
+	/* This quick check should eliminate most single threaded programs. */
+	if (cpumask_weight(mm_cpumask(mm)) <= threshold)
+		return false;
+
+	/* Slower check to make sure. */
+	for_each_cpu(cpu, mm_cpumask(mm)) {
+		/* Skip the CPUs that aren't really running this process. */
+		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
+			continue;
+
+		if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu))
+			continue;
+
+		if (++count > threshold)
+			return true;
+	}
+	return false;
+}
+
+/*
+ * Assign a global ASID to the current process, protecting against
+ * races between multiple threads in the process.
+ */
+static void use_global_asid(struct mm_struct *mm)
+{
+	guard(raw_spinlock_irqsave)(&global_asid_lock);
+
+	/* This process is already using broadcast TLB invalidation. */
+	if (mm->context.global_asid)
+		return;
+
+	/* The last global ASID was consumed while waiting for the lock. */
+	if (!global_asid_available)
+		return;
+
+	/*
+	 * The transition from IPI TLB flushing, with a dynamic ASID,
+	 * and broadcast TLB flushing, using a global ASID, uses memory
+	 * ordering for synchronization.
+	 *
+	 * While the process has threads still using a dynamic ASID,
+	 * TLB invalidation IPIs continue to get sent.
+	 *
+	 * This code sets asid_transition first, before assigning the
+	 * global ASID.
+	 *
+	 * The TLB flush code will only verify the ASID transition
+	 * after it has seen the new global ASID for the process.
+	 */
+	WRITE_ONCE(mm->context.asid_transition, true);
+	WRITE_ONCE(mm->context.global_asid, get_global_asid());
+
+	global_asid_available--;
+}
+
+/*
+ * Figure out whether to assign a global ASID to a process.
+ * We vary the threshold by how empty or full global ASID space is.
+ * 1/4 full: >= 4 active threads
+ * 1/2 full: >= 8 active threads
+ * 3/4 full: >= 16 active threads
+ * 7/8 full: >= 32 active threads
+ * etc
+ *
+ * This way we should never exhaust the global ASID space, even on very
+ * large systems, and the processes with the largest number of active
+ * threads should be able to use broadcast TLB invalidation.
+ */
+#define HALFFULL_THRESHOLD 8
+static bool meets_global_asid_threshold(struct mm_struct *mm)
+{
+	int avail = global_asid_available;
+	int threshold = HALFFULL_THRESHOLD;
+
+	if (!avail)
+		return false;
+
+	if (avail > MAX_ASID_AVAILABLE * 3 / 4) {
+		threshold = HALFFULL_THRESHOLD / 4;
+	} else if (avail > MAX_ASID_AVAILABLE / 2) {
+		threshold = HALFFULL_THRESHOLD / 2;
+	} else if (avail < MAX_ASID_AVAILABLE / 3) {
+		do {
+			avail *= 2;
+			threshold *= 2;
+		} while ((avail + threshold) < MAX_ASID_AVAILABLE / 2);
+	}
+
+	return mm_active_cpus_exceeds(mm, threshold);
+}
+
+static void consider_global_asid(struct mm_struct *mm)
+{
+	if (!static_cpu_has(X86_FEATURE_INVLPGB))
+		return;
+
+	/* Check every once in a while. */
+	if ((current->pid & 0x1f) != (jiffies & 0x1f))
+		return;
+
+	if (meets_global_asid_threshold(mm))
+		use_global_asid(mm);
+}
+
+static void finish_asid_transition(struct flush_tlb_info *info)
+{
+	struct mm_struct *mm = info->mm;
+	int bc_asid = mm_global_asid(mm);
+	int cpu;
+
+	if (!READ_ONCE(mm->context.asid_transition))
+		return;
+
+	for_each_cpu(cpu, mm_cpumask(mm)) {
+		/*
+		 * The remote CPU is context switching. Wait for that to
+		 * finish, to catch the unlikely case of it switching to
+		 * the target mm with an out of date ASID.
+		 */
+		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
+			cpu_relax();
+
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
+			continue;
+
+		/*
+		 * If at least one CPU is not using the global ASID yet,
+		 * send a TLB flush IPI. The IPI should cause stragglers
+		 * to transition soon.
+		 *
+		 * This can race with the CPU switching to another task;
+		 * that results in a (harmless) extra IPI.
+		 */
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
+			flush_tlb_multi(mm_cpumask(info->mm), info);
+			return;
+		}
+	}
+
+	/* All the CPUs running this process are using the global ASID. */
+	WRITE_ONCE(mm->context.asid_transition, false);
+}
+
+static void broadcast_tlb_flush(struct flush_tlb_info *info)
+{
+	bool pmd = info->stride_shift == PMD_SHIFT;
+	unsigned long maxnr = invlpgb_count_max;
+	unsigned long asid = info->mm->context.global_asid;
+	unsigned long addr = info->start;
+	unsigned long nr;
+
+	/* Flushing multiple pages at once is not supported with 1GB pages. */
+	if (info->stride_shift > PMD_SHIFT)
+		maxnr = 1;
+
+	/*
+	 * TLB flushes with INVLPGB are kicked off asynchronously.
+	 * The inc_mm_tlb_gen() guarantees page table updates are done
+	 * before these TLB flushes happen.
+	 */
+	if (info->end == TLB_FLUSH_ALL) {
+		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
+		/* Do any CPUs supporting INVLPGB need PTI? */
+		if (static_cpu_has(X86_FEATURE_PTI))
+			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
+	} else do {
+		/*
+		 * Calculate how many pages can be flushed at once; if the
+		 * remainder of the range is less than one page, flush one.
+		 */
+		nr = min(maxnr, (info->end - addr) >> info->stride_shift);
+		nr = max(nr, 1);
+
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
+		/* Do any CPUs supporting INVLPGB need PTI? */
+		if (static_cpu_has(X86_FEATURE_PTI))
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
+		addr += nr << info->stride_shift;
+	} while (addr < info->end);
+
+	finish_asid_transition(info);
+
+	/* Wait for the INVLPGBs kicked off above to finish. */
+	tlbsync();
+}
+#endif /* CONFIG_X86_BROADCAST_TLB_FLUSH */
+
 /*
  * Given an ASID, flush the corresponding user ASID.  We can delay this
  * until the next time we switch to it.
@@ -556,8 +857,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 	 */
 	if (prev == next) {
 		/* Not actually switching mm's */
-		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
-			   next->context.ctx_id);
+		VM_WARN_ON(is_dyn_asid(prev_asid) &&
+				this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
+				next->context.ctx_id);
 
 		/*
 		 * If this races with another thread that enables lam, 'new_lam'
@@ -573,6 +875,23 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
 			cpumask_set_cpu(cpu, mm_cpumask(next));
 
+		/*
+		 * Check if the current mm is transitioning to a new ASID.
+		 */
+		if (needs_global_asid_reload(next, prev_asid)) {
+			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+
+			choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
+			goto reload_tlb;
+		}
+
+		/*
+		 * Broadcast TLB invalidation keeps this PCID up to date
+		 * all the time.
+		 */
+		if (is_global_asid(prev_asid))
+			return;
+
 		/*
 		 * If the CPU is not in lazy TLB mode, we are just switching
 		 * from one thread in a process to another thread in the same
@@ -606,6 +925,13 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		 */
 		cond_mitigation(tsk);
 
+		/*
+		 * Let nmi_uaccess_okay() and finish_asid_transition()
+		 * know that we're changing CR3.
+		 */
+		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
+		barrier();
+
 		/*
 		 * Leave this CPU in prev's mm_cpumask. Atomic writes to
 		 * mm_cpumask can be expensive under contention. The CPU
@@ -620,14 +946,12 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 
 		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
-
-		/* Let nmi_uaccess_okay() know that we're changing CR3. */
-		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
-		barrier();
 	}
 
+reload_tlb:
 	new_lam = mm_lam_cr3_mask(next);
 	if (need_flush) {
+		VM_BUG_ON(is_global_asid(new_asid));
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
 		load_new_mm_cr3(next->pgd, new_asid, new_lam, true);
@@ -746,7 +1070,7 @@ static void flush_tlb_func(void *info)
 	const struct flush_tlb_info *f = info;
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+	u64 local_tlb_gen;
 	bool local = smp_processor_id() == f->initiating_cpu;
 	unsigned long nr_invalidate = 0;
 	u64 mm_tlb_gen;
@@ -769,6 +1093,16 @@ static void flush_tlb_func(void *info)
 	if (unlikely(loaded_mm == &init_mm))
 		return;
 
+	/* Reload the ASID if transitioning into or out of a global ASID */
+	if (needs_global_asid_reload(loaded_mm, loaded_mm_asid)) {
+		switch_mm_irqs_off(NULL, loaded_mm, NULL);
+		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	}
+
+	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
+	if (is_global_asid(loaded_mm_asid))
+		return;
+
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
@@ -786,6 +1120,8 @@ static void flush_tlb_func(void *info)
 		return;
 	}
 
+	local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+
 	if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
 		     f->new_tlb_gen <= local_tlb_gen)) {
 		/*
@@ -953,7 +1289,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
 	 * up on the new contents of what used to be page tables, while
 	 * doing a speculative memory access.
 	 */
-	if (info->freed_tables)
+	if (info->freed_tables || in_asid_transition(info))
 		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
 	else
 		on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
@@ -1049,9 +1385,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+	if (mm_global_asid(mm)) {
+		broadcast_tlb_flush(info);
+	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
 		info->trim_cpumask = should_trim_cpumask(mm);
 		flush_tlb_multi(mm_cpumask(mm), info);
+		consider_global_asid(mm);
 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v4 10/12] x86,tlb: do targeted broadcast flushing from tlbbatch code
  2025-01-12 15:53 [RFC PATCH v4 00/10] AMD broadcast TLB invalidation Rik van Riel
                   ` (8 preceding siblings ...)
  2025-01-12 15:53 ` [PATCH v4 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
@ 2025-01-12 15:53 ` Rik van Riel
  2025-01-13 17:05   ` Jann Horn
  2025-01-12 15:53 ` [PATCH v4 11/12] x86/mm: enable AMD translation cache extensions Rik van Riel
  2025-01-12 15:53 ` [PATCH v4 12/12] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
  11 siblings, 1 reply; 36+ messages in thread
From: Rik van Riel @ 2025-01-12 15:53 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, Rik van Riel

Instead of doing a system-wide TLB flush from arch_tlbbatch_flush,
queue up asynchronous, targeted flushes from arch_tlbbatch_add_pending.

This also allows us to avoid adding the CPUs of processes using broadcast
flushing to the batch->cpumask, and will hopefully further reduce TLB
flushing from the reclaim and compaction paths.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/tlbbatch.h |  1 +
 arch/x86/include/asm/tlbflush.h | 12 +++-------
 arch/x86/mm/tlb.c               | 41 ++++++++++++++++++++++++++++++---
 3 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/tlbbatch.h b/arch/x86/include/asm/tlbbatch.h
index 1ad56eb3e8a8..f9a17edf63ad 100644
--- a/arch/x86/include/asm/tlbbatch.h
+++ b/arch/x86/include/asm/tlbbatch.h
@@ -10,6 +10,7 @@ struct arch_tlbflush_unmap_batch {
 	 * the PFNs being flushed..
 	 */
 	struct cpumask cpumask;
+	bool used_invlpgb;
 };
 
 #endif /* _ARCH_X86_TLBBATCH_H */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cd244cdd49dd..fa4fcafa8b87 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -350,21 +350,15 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 	return atomic64_inc_return(&mm->context.tlb_gen);
 }
 
-static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
-					     struct mm_struct *mm,
-					     unsigned long uaddr)
-{
-	inc_mm_tlb_gen(mm);
-	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
-	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
-}
-
 static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
 {
 	flush_tlb_mm(mm);
 }
 
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+					     struct mm_struct *mm,
+					     unsigned long uaddr);
 
 static inline bool pte_flags_need_flush(unsigned long oldflags,
 					unsigned long newflags,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 80375ef186d5..532911fbb12a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1658,9 +1658,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
-		invlpgb_flush_all_nonglobals();
-	} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
 		flush_tlb_multi(&batch->cpumask, info);
 	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();
@@ -1669,12 +1667,49 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 		local_irq_enable();
 	}
 
+	/*
+	 * If we issued (asynchronous) INVLPGB flushes, wait for them here.
+	 * The cpumask above contains only CPUs that were running tasks
+	 * not using broadcast TLB flushing.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB) && batch->used_invlpgb) {
+		tlbsync();
+		migrate_enable();
+		batch->used_invlpgb = false;
+	}
+
 	cpumask_clear(&batch->cpumask);
 
 	put_flush_tlb_info();
 	put_cpu();
 }
 
+void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+					     struct mm_struct *mm,
+					     unsigned long uaddr)
+{
+	if (static_cpu_has(X86_FEATURE_INVLPGB) && mm_global_asid(mm)) {
+		u16 asid = mm_global_asid(mm);
+		/*
+		 * Queue up an asynchronous invalidation. The corresponding
+		 * TLBSYNC is done in arch_tlbbatch_flush(), and must be done
+		 * on the same CPU.
+		 */
+		if (!batch->used_invlpgb) {
+			batch->used_invlpgb = true;
+			migrate_disable();
+		}
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
+		/* Do any CPUs supporting INVLPGB need PTI? */
+		if (static_cpu_has(X86_FEATURE_PTI))
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false);
+	} else {
+		inc_mm_tlb_gen(mm);
+		cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
+	}
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
+}
+
 /*
  * Blindly accessing user memory from NMI context can be dangerous
  * if we're in the middle of switching the current user task or
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v4 11/12] x86/mm: enable AMD translation cache extensions
  2025-01-12 15:53 [RFC PATCH v4 00/10] AMD broadcast TLB invalidation Rik van Riel
                   ` (9 preceding siblings ...)
  2025-01-12 15:53 ` [PATCH v4 10/12] x86,tlb: do targeted broadcast flushing from tlbbatch code Rik van Riel
@ 2025-01-12 15:53 ` Rik van Riel
  2025-01-13 11:32   ` Andrew Cooper
  2025-01-12 15:53 ` [PATCH v4 12/12] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
  11 siblings, 1 reply; 36+ messages in thread
From: Rik van Riel @ 2025-01-12 15:53 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, Rik van Riel

With AMD TCE (translation cache extensions), only the intermediate mappings
that cover the address range zapped by INVLPG / INVLPGB get invalidated,
rather than all intermediate mappings getting zapped at every TLB invalidation.

This can help reduce the TLB miss rate by keeping more intermediate
mappings in the cache.

From the AMD manual:

Translation Cache Extension (TCE) Bit. Bit 15, read/write. Setting this bit
to 1 changes how the INVLPG, INVLPGB, and INVPCID instructions operate on
TLB entries. When this bit is 0, these instructions remove the target PTE
from the TLB as well as all upper-level table entries that are cached
in the TLB, whether or not they are associated with the target PTE.
When this bit is set, these instructions will remove the target PTE and
only those upper-level entries that lead to the target PTE in
the page table hierarchy, leaving unrelated upper-level entries intact.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/msr-index.h       | 2 ++
 arch/x86/kernel/cpu/amd.c              | 3 +++
 tools/arch/x86/include/asm/msr-index.h | 2 ++
 3 files changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3ae84c3b8e6d..dc1c1057f26e 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -25,6 +25,7 @@
 #define _EFER_SVME		12 /* Enable virtualization */
 #define _EFER_LMSLE		13 /* Long Mode Segment Limit Enable */
 #define _EFER_FFXSR		14 /* Enable Fast FXSAVE/FXRSTOR */
+#define _EFER_TCE		15 /* Enable Translation Cache Extensions */
 #define _EFER_AUTOIBRS		21 /* Enable Automatic IBRS */
 
 #define EFER_SCE		(1<<_EFER_SCE)
@@ -34,6 +35,7 @@
 #define EFER_SVME		(1<<_EFER_SVME)
 #define EFER_LMSLE		(1<<_EFER_LMSLE)
 #define EFER_FFXSR		(1<<_EFER_FFXSR)
+#define EFER_TCE		(1<<_EFER_TCE)
 #define EFER_AUTOIBRS		(1<<_EFER_AUTOIBRS)
 
 /*
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index bcf73775b4f8..b7e84d43a22d 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1071,6 +1071,9 @@ static void init_amd(struct cpuinfo_x86 *c)
 
 	/* AMD CPUs don't need fencing after x2APIC/TSC_DEADLINE MSR writes. */
 	clear_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
+
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		msr_set_bit(MSR_EFER, _EFER_TCE);
 }
 
 #ifdef CONFIG_X86_32
diff --git a/tools/arch/x86/include/asm/msr-index.h b/tools/arch/x86/include/asm/msr-index.h
index 3ae84c3b8e6d..dc1c1057f26e 100644
--- a/tools/arch/x86/include/asm/msr-index.h
+++ b/tools/arch/x86/include/asm/msr-index.h
@@ -25,6 +25,7 @@
 #define _EFER_SVME		12 /* Enable virtualization */
 #define _EFER_LMSLE		13 /* Long Mode Segment Limit Enable */
 #define _EFER_FFXSR		14 /* Enable Fast FXSAVE/FXRSTOR */
+#define _EFER_TCE		15 /* Enable Translation Cache Extensions */
 #define _EFER_AUTOIBRS		21 /* Enable Automatic IBRS */
 
 #define EFER_SCE		(1<<_EFER_SCE)
@@ -34,6 +35,7 @@
 #define EFER_SVME		(1<<_EFER_SVME)
 #define EFER_LMSLE		(1<<_EFER_LMSLE)
 #define EFER_FFXSR		(1<<_EFER_FFXSR)
+#define EFER_TCE		(1<<_EFER_TCE)
 #define EFER_AUTOIBRS		(1<<_EFER_AUTOIBRS)
 
 /*
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v4 12/12] x86/mm: only invalidate final translations with INVLPGB
  2025-01-12 15:53 [RFC PATCH v4 00/10] AMD broadcast TLB invalidation Rik van Riel
                   ` (10 preceding siblings ...)
  2025-01-12 15:53 ` [PATCH v4 11/12] x86/mm: enable AMD translation cache extensions Rik van Riel
@ 2025-01-12 15:53 ` Rik van Riel
  2025-01-13 17:11   ` Jann Horn
  11 siblings, 1 reply; 36+ messages in thread
From: Rik van Riel @ 2025-01-12 15:53 UTC (permalink / raw)
  To: x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh, Rik van Riel

Use the INVLPGB_FINAL_ONLY flag when invalidating mappings with INVLPGB.
This way only leaf mappings get removed from the TLB, leaving intermediate
translations cached.

On the (rare) occasions when we free page tables, we do a full flush,
ensuring intermediate translations get flushed from the TLB.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/invlpgb.h | 10 ++++++++--
 arch/x86/mm/tlb.c              |  8 ++++----
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/invlpgb.h b/arch/x86/include/asm/invlpgb.h
index d62e3733a1ab..4fa48d063b76 100644
--- a/arch/x86/include/asm/invlpgb.h
+++ b/arch/x86/include/asm/invlpgb.h
@@ -61,9 +61,15 @@ static inline void invlpgb_flush_user(unsigned long pcid,
 
 static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
 						unsigned long addr,
-						int nr, bool pmd_stride)
+						int nr, bool pmd_stride,
+						bool freed_tables)
 {
-	__invlpgb(0, pcid, addr, nr - 1, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
+	unsigned long flags = INVLPGB_PCID | INVLPGB_VA;
+
+	if (!freed_tables)
+		flags |= INVLPGB_FINAL_ONLY;
+
+	__invlpgb(0, pcid, addr, nr - 1, pmd_stride, flags);
 }
 
 /* Flush all mappings for a given PCID, not including globals. */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 532911fbb12a..0254e9ebaf15 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -538,10 +538,10 @@ static void broadcast_tlb_flush(struct flush_tlb_info *info)
 		nr = min(maxnr, (info->end - addr) >> info->stride_shift);
 		nr = max(nr, 1);
 
-		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd, info->freed_tables);
 		/* Do any CPUs supporting INVLPGB need PTI? */
 		if (static_cpu_has(X86_FEATURE_PTI))
-			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd, info->freed_tables);
 		addr += nr << info->stride_shift;
 	} while (addr < info->end);
 
@@ -1699,10 +1699,10 @@ void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
 			batch->used_invlpgb = true;
 			migrate_disable();
 		}
-		invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false, false);
 		/* Do any CPUs supporting INVLPGB need PTI? */
 		if (static_cpu_has(X86_FEATURE_PTI))
-			invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false);
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false, false);
 	} else {
 		inc_mm_tlb_gen(mm);
 		cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 11/12] x86/mm: enable AMD translation cache extensions
  2025-01-12 15:53 ` [PATCH v4 11/12] x86/mm: enable AMD translation cache extensions Rik van Riel
@ 2025-01-13 11:32   ` Andrew Cooper
  2025-01-14  1:28     ` Rik van Riel
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Cooper @ 2025-01-13 11:32 UTC (permalink / raw)
  To: riel
  Cc: akpm, bp, dave.hansen, jannh, kernel-team, linux-kernel, linux-mm,
	nadav.amit, peterz, thomas.lendacky, x86, zhengqi.arch

> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index bcf73775b4f8..b7e84d43a22d 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -1071,6 +1071,9 @@ static void init_amd(struct cpuinfo_x86 *c)
>  
>  	/* AMD CPUs don't need fencing after x2APIC/TSC_DEADLINE MSR writes. */
>  	clear_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
> +
> +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		msr_set_bit(MSR_EFER, _EFER_TCE);
>  }
>  
>  #ifdef CONFIG_X86_32

I don't think this is wise.  TCE is orthogonal to INVLPGB.

Either Linux is safe with TCE turned on, and it should be turned on
everywhere (it goes back to Fam10h CPUs IIRC), or Linux isn't safe with
TCE turned on, and this needs to depend on some other condition.

Or, is this a typo and did you mean to check the TCE CPUID bit, rather
than the INVLPGB CPUID bit?

~Andrew

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
  2025-01-12 15:53 ` [PATCH v4 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
@ 2025-01-13 13:09   ` Nadav Amit
  2025-01-14  3:13     ` Rik van Riel
  0 siblings, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2025-01-13 13:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List,
	Borislav Petkov, peterz, Dave Hansen, zhengqi.arch,
	thomas.lendacky, kernel-team, open list:MEMORY MANAGEMENT,
	Andrew Morton, jannh



Not sure my review is thorough, but that’s all the time I have right now...

> On 12 Jan 2025, at 17:53, Rik van Riel <riel@surriel.com> wrote:
> 
> Use broadcast TLB invalidation, using the INVLPGB instruction, on AMD EPYC 3
> and newer CPUs.
> 
> In order to not exhaust PCID space, and keep TLB flushes local for single
> threaded processes, we only hand out broadcast ASIDs to processes active on
> 3 or more CPUs, and gradually increase the threshold as broadcast ASID space
> is depleted.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> arch/x86/include/asm/mmu.h         |   6 +
> arch/x86/include/asm/mmu_context.h |  14 ++
> arch/x86/include/asm/tlbflush.h    |  64 +++++
> arch/x86/mm/tlb.c                  | 363 ++++++++++++++++++++++++++++-
> 4 files changed, 435 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
> index 3b496cdcb74b..d71cd599fec4 100644
> --- a/arch/x86/include/asm/mmu.h
> +++ b/arch/x86/include/asm/mmu.h
> @@ -69,6 +69,12 @@ typedef struct {
> 	u16 pkey_allocation_map;
> 	s16 execute_only_pkey;
> #endif
> +
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +	u16 global_asid;
> +	bool asid_transition;

As I later note, there are various ordering issues between the two. Would it be
just easier to combine them into one field? I know everybody hates bitfields so
I don’t suggest it, but there are other ways...

> +#endif
> +
> } mm_context_t;
> 
> #define INIT_MM_CONTEXT(mm)						\
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 795fdd53bd0a..d670699d32c2 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -139,6 +139,8 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
> #define enter_lazy_tlb enter_lazy_tlb
> extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
> 
> +extern void destroy_context_free_global_asid(struct mm_struct *mm);
> +
> /*
>  * Init a new mm.  Used on mm copies, like at fork()
>  * and on mm's that are brand-new, like at execve().
> @@ -161,6 +163,14 @@ static inline int init_new_context(struct task_struct *tsk,
> 		mm->context.execute_only_pkey = -1;
> 	}
> #endif
> +
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> +		mm->context.global_asid = 0;
> +		mm->context.asid_transition = false;
> +	}
> +#endif
> +
> 	mm_reset_untag_mask(mm);
> 	init_new_context_ldt(mm);
> 	return 0;
> @@ -170,6 +180,10 @@ static inline int init_new_context(struct task_struct *tsk,
> static inline void destroy_context(struct mm_struct *mm)
> {
> 	destroy_context_ldt(mm);
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH

I’d prefer to use IS_ENABLED() and to have a stub for 
destroy_context_free_global_asid().

> +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		destroy_context_free_global_asid(mm);
> +#endif
> }
> 
> extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index dba5caa4a9f4..cd244cdd49dd 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -239,6 +239,70 @@ void flush_tlb_one_kernel(unsigned long addr);
> void flush_tlb_multi(const struct cpumask *cpumask,
> 		      const struct flush_tlb_info *info);
> 
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +static inline bool is_dyn_asid(u16 asid)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		return true;
> +
> +	return asid < TLB_NR_DYN_ASIDS;
> +}
> +
> +static inline bool is_global_asid(u16 asid)
> +{
> +	return !is_dyn_asid(asid);
> +}
> +
> +static inline bool in_asid_transition(const struct flush_tlb_info *info)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		return false;
> +
> +	return info->mm && info->mm->context.asid_transition;

READ_ONCE(context.asid_transition) ?

> +}
> +
> +static inline u16 mm_global_asid(struct mm_struct *mm)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> +		return 0;
> +
> +	return mm->context.global_asid;
> +}
> +#else
> +static inline bool is_dyn_asid(u16 asid)
> +{
> +	return true;
> +}
> +
> +static inline bool is_global_asid(u16 asid)
> +{
> +	return false;
> +}
> +
> +static inline bool in_asid_transition(const struct flush_tlb_info *info)
> +{
> +	return false;
> +}
> +
> +static inline u16 mm_global_asid(struct mm_struct *mm)
> +{
> +	return 0;
> +}
> +
> +static inline bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
> +{
> +	return false;
> +}
> +
> +static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
> +{

Having a VM_WARN_ON() here might be nice.

> +}
> +
> +static inline void consider_global_asid(struct mm_struct *mm)
> +{
> +}
> +#endif
> +
> #ifdef CONFIG_PARAVIRT
> #include <asm/paravirt.h>
> #endif
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index b47d6c3fe0af..80375ef186d5 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -74,13 +74,15 @@
>  * use different names for each of them:
>  *
>  * ASID  - [0, TLB_NR_DYN_ASIDS-1]
> - *         the canonical identifier for an mm
> + *         the canonical identifier for an mm, dynamically allocated on each CPU
> + *         [TLB_NR_DYN_ASIDS, MAX_ASID_AVAILABLE-1]
> + *         the canonical, global identifier for an mm, identical across all CPUs
>  *
> - * kPCID - [1, TLB_NR_DYN_ASIDS]
> + * kPCID - [1, MAX_ASID_AVAILABLE]
>  *         the value we write into the PCID part of CR3; corresponds to the
>  *         ASID+1, because PCID 0 is special.
>  *
> - * uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
> + * uPCID - [2048 + 1, 2048 + MAX_ASID_AVAILABLE]
>  *         for KPTI each mm has two address spaces and thus needs two
>  *         PCID values, but we can still do with a single ASID denomination
>  *         for each mm. Corresponds to kPCID + 2048.
> @@ -225,6 +227,19 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
> 		return;
> 	}
> 
> +	/*
> +	 * TLB consistency for global ASIDs is maintained with broadcast TLB
> +	 * flushing. The TLB is never outdated, and does not need flushing.
> +	 */
> +	if (IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH) && static_cpu_has(X86_FEATURE_INVLPGB)) {
> +		u16 global_asid = mm_global_asid(next);
> +		if (global_asid) {
> +			*new_asid = global_asid;
> +			*need_flush = false;
> +			return;
> +		}
> +	}
> +
> 	if (this_cpu_read(cpu_tlbstate.invalidate_other))
> 		clear_asid_other();
> 
> @@ -251,6 +266,292 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
> 	*need_flush = true;
> }
> 
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +/*
> + * Logic for broadcast TLB invalidation.
> + */
> +static DEFINE_RAW_SPINLOCK(global_asid_lock);
> +static u16 last_global_asid = MAX_ASID_AVAILABLE;
> +static DECLARE_BITMAP(global_asid_used, MAX_ASID_AVAILABLE) = { 0 };
> +static DECLARE_BITMAP(global_asid_freed, MAX_ASID_AVAILABLE) = { 0 };
> +static int global_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
> +
> +static void reset_global_asid_space(void)
> +{
> +	lockdep_assert_held(&global_asid_lock);
> +
> +	/*
> +	 * A global TLB flush guarantees that any stale entries from
> +	 * previously freed global ASIDs get flushed from the TLB
> +	 * everywhere, making these global ASIDs safe to reuse.
> +	 */
> +	invlpgb_flush_all_nonglobals();
> +
> +	/*
> +	 * Clear all the previously freed global ASIDs from the
> +	 * broadcast_asid_used bitmap, now that the global TLB flush
> +	 * has made them actually available for re-use.
> +	 */
> +	bitmap_andnot(global_asid_used, global_asid_used,
> +			global_asid_freed, MAX_ASID_AVAILABLE);
> +	bitmap_clear(global_asid_freed, 0, MAX_ASID_AVAILABLE);
> +
> +	/*
> +	 * ASIDs 0-TLB_NR_DYN_ASIDS are used for CPU-local ASID
> +	 * assignments, for tasks doing IPI based TLB shootdowns.
> +	 * Restart the search from the start of the global ASID space.
> +	 */
> +	last_global_asid = TLB_NR_DYN_ASIDS;
> +}
> +
> +static u16 get_global_asid(void)
> +{
> +	lockdep_assert_held(&global_asid_lock);
> +
> +	do {
> +		u16 start = last_global_asid;
> +		u16 asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
> +
> +		if (asid >= MAX_ASID_AVAILABLE) {
> +			reset_global_asid_space();
> +			continue;
> +		}
> +
> +		/* Claim this global ASID. */
> +		__set_bit(asid, global_asid_used);
> +		last_global_asid = asid;
> +		return asid;
> +	} while (1);

This does not make me feel easy at all. I do not understand
why it might happen. The caller should’ve already checked the global ASID
is available under the lock. If it is not obvious from the code, perhaps
refactoring is needed.

> +}
> +
> +/*
> + * Returns true if the mm is transitioning from a CPU-local ASID to a global 
> + * (INVLPGB) ASID, or the other way around.
> + */
> +static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
> +{
> +	u16 global_asid = mm_global_asid(next);
> +
> +	if (global_asid && prev_asid != global_asid)
> +		return true;
> +
> +	if (!global_asid && is_global_asid(prev_asid))
> +		return true;
> +
> +	return false;
> +}
> +
> +void destroy_context_free_global_asid(struct mm_struct *mm)
> +{
> +	if (!mm->context.global_asid)
> +		return;
> +
> +	guard(raw_spinlock_irqsave)(&global_asid_lock);
> +
> +	/* The global ASID can be re-used only after flush at wrap-around. */
> +	__set_bit(mm->context.global_asid, global_asid_freed);
> +
> +	mm->context.global_asid = 0;
> +	global_asid_available++;
> +}
> +
> +/*
> + * Check whether a process is currently active on more than "threshold" CPUs.
> + * This is a cheap estimation on whether or not it may make sense to assign
> + * a global ASID to this process, and use broadcast TLB invalidation.
> + */
> +static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
> +{
> +	int count = 0;
> +	int cpu;
> +
> +	/* This quick check should eliminate most single threaded programs. */
> +	if (cpumask_weight(mm_cpumask(mm)) <= threshold)
> +		return false;
> +
> +	/* Slower check to make sure. */
> +	for_each_cpu(cpu, mm_cpumask(mm)) {
> +		/* Skip the CPUs that aren't really running this process. */
> +		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
> +			continue;

Do you really want loaded_mm to be accessed from other cores? Does this
really provide a worthwhile benefit?

Why not just use cpumask_weight() and be done with it? Anyhow it’s a heuristic.

> +
> +		if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu))
> +			continue;
> +
> +		if (++count > threshold)
> +			return true;
> +	}
> +	return false;
> +}
> +
> +/*
> + * Assign a global ASID to the current process, protecting against
> + * races between multiple threads in the process.
> + */
> +static void use_global_asid(struct mm_struct *mm)
> +{
> +	guard(raw_spinlock_irqsave)(&global_asid_lock);
> +
> +	/* This process is already using broadcast TLB invalidation. */
> +	if (mm->context.global_asid)
> +		return;
> +
> +	/* The last global ASID was consumed while waiting for the lock. */
> +	if (!global_asid_available)

I think "global_asid_available > 0” would make more sense.

> +		return;
> +
> +	/*
> +	 * The transition from IPI TLB flushing, with a dynamic ASID,
> +	 * and broadcast TLB flushing, using a global ASID, uses memory
> +	 * ordering for synchronization.
> +	 *
> +	 * While the process has threads still using a dynamic ASID,
> +	 * TLB invalidation IPIs continue to get sent.
> +	 *
> +	 * This code sets asid_transition first, before assigning the
> +	 * global ASID.
> +	 *
> +	 * The TLB flush code will only verify the ASID transition
> +	 * after it has seen the new global ASID for the process.
> +	 */
> +	WRITE_ONCE(mm->context.asid_transition, true);

I would prefer smp_wmb() and document where the matching smp_rmb()
(or smp_mb) is.

> +	WRITE_ONCE(mm->context.global_asid, get_global_asid());
> +
> +	global_asid_available--;
> +}
> +
> +/*
> + * Figure out whether to assign a global ASID to a process.
> + * We vary the threshold by how empty or full global ASID space is.
> + * 1/4 full: >= 4 active threads
> + * 1/2 full: >= 8 active threads
> + * 3/4 full: >= 16 active threads
> + * 7/8 full: >= 32 active threads
> + * etc
> + *
> + * This way we should never exhaust the global ASID space, even on very
> + * large systems, and the processes with the largest number of active
> + * threads should be able to use broadcast TLB invalidation.
> + */
> +#define HALFFULL_THRESHOLD 8
> +static bool meets_global_asid_threshold(struct mm_struct *mm)
> +{
> +	int avail = global_asid_available;
> +	int threshold = HALFFULL_THRESHOLD;
> +
> +	if (!avail)
> +		return false;
> +
> +	if (avail > MAX_ASID_AVAILABLE * 3 / 4) {
> +		threshold = HALFFULL_THRESHOLD / 4;
> +	} else if (avail > MAX_ASID_AVAILABLE / 2) {
> +		threshold = HALFFULL_THRESHOLD / 2;
> +	} else if (avail < MAX_ASID_AVAILABLE / 3) {
> +		do {
> +			avail *= 2;
> +			threshold *= 2;
> +		} while ((avail + threshold) < MAX_ASID_AVAILABLE / 2);
> +	}
> +
> +	return mm_active_cpus_exceeds(mm, threshold);
> +}
> +
> +static void consider_global_asid(struct mm_struct *mm)
> +{
> +	if (!static_cpu_has(X86_FEATURE_INVLPGB))
> +		return;
> +
> +	/* Check every once in a while. */
> +	if ((current->pid & 0x1f) != (jiffies & 0x1f))
> +		return;
> +
> +	if (meets_global_asid_threshold(mm))
> +		use_global_asid(mm);
> +}
> +
> +static void finish_asid_transition(struct flush_tlb_info *info)
> +{
> +	struct mm_struct *mm = info->mm;
> +	int bc_asid = mm_global_asid(mm);
> +	int cpu;
> +
> +	if (!READ_ONCE(mm->context.asid_transition))
> +		return;
> +
> +	for_each_cpu(cpu, mm_cpumask(mm)) {
> +		/*
> +		 * The remote CPU is context switching. Wait for that to
> +		 * finish, to catch the unlikely case of it switching to
> +		 * the target mm with an out of date ASID.
> +		 */
> +		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
> +			cpu_relax();

Although this code should rarely run, it seems bad for a couple of reasons:

1. It is a new busy-wait in a very delicate place. Lockdep is blind to this
   change.

2. cpu_tlbstate is supposed to be private for each core - that’s why there
   is cpu_tlbstate_shared. But I really think loaded_mm should be kept
   private.

Can't we just do one TLB shootdown if
	cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids ?


> +
> +		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
> +			continue;
> +
> +		/*
> +		 * If at least one CPU is not using the global ASID yet,
> +		 * send a TLB flush IPI. The IPI should cause stragglers
> +		 * to transition soon.
> +		 *
> +		 * This can race with the CPU switching to another task;
> +		 * that results in a (harmless) extra IPI.
> +		 */
> +		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
> +			flush_tlb_multi(mm_cpumask(info->mm), info);
> +			return;
> +		}
> +	}
> +
> +	/* All the CPUs running this process are using the global ASID. */

I guess it’s ordered with the flushes (the flushes must complete first).

> +	WRITE_ONCE(mm->context.asid_transition, false);
> +}
> +
> +static void broadcast_tlb_flush(struct flush_tlb_info *info)
> +{
> +	bool pmd = info->stride_shift == PMD_SHIFT;
> +	unsigned long maxnr = invlpgb_count_max;
> +	unsigned long asid = info->mm->context.global_asid;
> +	unsigned long addr = info->start;
> +	unsigned long nr;
> +
> +	/* Flushing multiple pages at once is not supported with 1GB pages. */
> +	if (info->stride_shift > PMD_SHIFT)
> +		maxnr = 1;
> +
> +	/*
> +	 * TLB flushes with INVLPGB are kicked off asynchronously.
> +	 * The inc_mm_tlb_gen() guarantees page table updates are done
> +	 * before these TLB flushes happen.
> +	 */
> +	if (info->end == TLB_FLUSH_ALL) {
> +		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
> +		/* Do any CPUs supporting INVLPGB need PTI? */
> +		if (static_cpu_has(X86_FEATURE_PTI))
> +			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
> +	} else do {
> +		/*
> +		 * Calculate how many pages can be flushed at once; if the
> +		 * remainder of the range is less than one page, flush one.
> +		 */
> +		nr = min(maxnr, (info->end - addr) >> info->stride_shift);
> +		nr = max(nr, 1);
> +
> +		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
> +		/* Do any CPUs supporting INVLPGB need PTI? */
> +		if (static_cpu_has(X86_FEATURE_PTI))
> +			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
> +		addr += nr << info->stride_shift;
> +	} while (addr < info->end);

I would have preferred for instead of while...

> +
> +	finish_asid_transition(info);
> +
> +	/* Wait for the INVLPGBs kicked off above to finish. */
> +	tlbsync();
> +}
> +#endif /* CONFIG_X86_BROADCAST_TLB_FLUSH */
> +
> /*
>  * Given an ASID, flush the corresponding user ASID.  We can delay this
>  * until the next time we switch to it.
> @@ -556,8 +857,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> 	 */
> 	if (prev == next) {
> 		/* Not actually switching mm's */
> -		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> -			   next->context.ctx_id);
> +		VM_WARN_ON(is_dyn_asid(prev_asid) &&
> +				this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> +				next->context.ctx_id);
> 
> 		/*
> 		 * If this races with another thread that enables lam, 'new_lam'
> @@ -573,6 +875,23 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> 				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
> 			cpumask_set_cpu(cpu, mm_cpumask(next));
> 
> +		/*
> +		 * Check if the current mm is transitioning to a new ASID.
> +		 */
> +		if (needs_global_asid_reload(next, prev_asid)) {
> +			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> +
> +			choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
> +			goto reload_tlb;
> +		}
> +
> +		/*
> +		 * Broadcast TLB invalidation keeps this PCID up to date
> +		 * all the time.
> +		 */
> +		if (is_global_asid(prev_asid))
> +			return;
> +
> 		/*
> 		 * If the CPU is not in lazy TLB mode, we are just switching
> 		 * from one thread in a process to another thread in the same
> @@ -606,6 +925,13 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> 		 */
> 		cond_mitigation(tsk);
> 
> +		/*
> +		 * Let nmi_uaccess_okay() and finish_asid_transition()
> +		 * know that we're changing CR3.
> +		 */
> +		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
> +		barrier();
> +
> 		/*
> 		 * Leave this CPU in prev's mm_cpumask. Atomic writes to
> 		 * mm_cpumask can be expensive under contention. The CPU
> @@ -620,14 +946,12 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> 
> 		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
> -
> -		/* Let nmi_uaccess_okay() know that we're changing CR3. */
> -		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
> -		barrier();
> 	}
> 
> +reload_tlb:
> 	new_lam = mm_lam_cr3_mask(next);
> 	if (need_flush) {
> +		VM_BUG_ON(is_global_asid(new_asid));
> 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
> 		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
> 		load_new_mm_cr3(next->pgd, new_asid, new_lam, true);
> @@ -746,7 +1070,7 @@ static void flush_tlb_func(void *info)
> 	const struct flush_tlb_info *f = info;
> 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
> 	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> -	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
> +	u64 local_tlb_gen;
> 	bool local = smp_processor_id() == f->initiating_cpu;
> 	unsigned long nr_invalidate = 0;
> 	u64 mm_tlb_gen;
> @@ -769,6 +1093,16 @@ static void flush_tlb_func(void *info)
> 	if (unlikely(loaded_mm == &init_mm))
> 		return;
> 
> +	/* Reload the ASID if transitioning into or out of a global ASID */
> +	if (needs_global_asid_reload(loaded_mm, loaded_mm_asid)) {
> +		switch_mm_irqs_off(NULL, loaded_mm, NULL);
> +		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> +	}
> +
> +	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
> +	if (is_global_asid(loaded_mm_asid))
> +		return;
> +
> 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
> 		   loaded_mm->context.ctx_id);
> 
> @@ -786,6 +1120,8 @@ static void flush_tlb_func(void *info)
> 		return;
> 	}
> 
> +	local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
> +
> 	if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
> 		     f->new_tlb_gen <= local_tlb_gen)) {
> 		/*
> @@ -953,7 +1289,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
> 	 * up on the new contents of what used to be page tables, while
> 	 * doing a speculative memory access.
> 	 */
> -	if (info->freed_tables)
> +	if (info->freed_tables || in_asid_transition(info))
> 		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
> 	else
> 		on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
> @@ -1049,9 +1385,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> 	 * a local TLB flush is needed. Optimize this use-case by calling
> 	 * flush_tlb_func_local() directly in this case.
> 	 */
> -	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {

I think an smp_rmb() here would communicate the fact that in_asid_transition()
and mm_global_asid() must be ordered.

> +	if (mm_global_asid(mm)) {
> +		broadcast_tlb_flush(info);
> +	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
> 		info->trim_cpumask = should_trim_cpumask(mm);
> 		flush_tlb_multi(mm_cpumask(mm), info);
> +		consider_global_asid(mm);
> 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
> 		lockdep_assert_irqs_enabled();
> 		local_irq_disable();
> -- 
> 2.47.1
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 05/12] x86/mm: add INVLPGB support code
  2025-01-12 15:53 ` [PATCH v4 05/12] x86/mm: add INVLPGB support code Rik van Riel
@ 2025-01-13 14:21   ` Tom Lendacky
  2025-01-13 21:10     ` Rik van Riel
  2025-01-13 17:24   ` Jann Horn
  2025-01-14 18:24   ` Michael Kelley
  2 siblings, 1 reply; 36+ messages in thread
From: Tom Lendacky @ 2025-01-13 14:21 UTC (permalink / raw)
  To: Rik van Riel, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	kernel-team, linux-mm, akpm, jannh

On 1/12/25 09:53, Rik van Riel wrote:
> Add invlpgb.h with the helper functions and definitions needed to use
> broadcast TLB invalidation on AMD EPYC 3 and newer CPUs.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  arch/x86/include/asm/invlpgb.h  | 95 +++++++++++++++++++++++++++++++++
>  arch/x86/include/asm/tlbflush.h |  1 +
>  2 files changed, 96 insertions(+)
>  create mode 100644 arch/x86/include/asm/invlpgb.h
> 
> diff --git a/arch/x86/include/asm/invlpgb.h b/arch/x86/include/asm/invlpgb.h
> new file mode 100644
> index 000000000000..d62e3733a1ab
> --- /dev/null
> +++ b/arch/x86/include/asm/invlpgb.h
> @@ -0,0 +1,95 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_INVLPGB
> +#define _ASM_X86_INVLPGB
> +
> +#include <vdso/bits.h>
> +
> +/*
> + * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
> + *
> + * The INVLPGB instruction is weakly ordered, and a batch of invalidations can
> + * be done in a parallel fashion.
> + *
> + * TLBSYNC is used to ensure that pending INVLPGB invalidations initiated from
> + * this CPU have completed.
> + */
> +static inline void __invlpgb(unsigned long asid, unsigned long pcid, unsigned long addr,
> +			    int extra_count, bool pmd_stride, unsigned long flags)
> +{
> +	u32 edx = (pcid << 16) | asid;
> +	u32 ecx = (pmd_stride << 31);
> +	u64 rax = addr | flags;
> +
> +	/* Protect against negative numbers. */
> +	extra_count = max(extra_count, 0);
> +	ecx |= extra_count;

A bad ECX value (ECX[15:0] > invlpgb_count_max) will result in a #GP, is
that ok?

Thanks,
Tom

> +
> +	asm volatile("invlpgb" : : "a" (rax), "c" (ecx), "d" (edx));
> +}
> +
> +/* Wait for INVLPGB originated by this CPU to complete. */
> +static inline void tlbsync(void)
> +{
> +	asm volatile("tlbsync");
> +}
> +
> +/*
> + * INVLPGB can be targeted by virtual address, PCID, ASID, or any combination
> + * of the three. For example:
> + * - INVLPGB_VA | INVLPGB_INCLUDE_GLOBAL: invalidate all TLB entries at the address
> + * - INVLPGB_PCID:              	  invalidate all TLB entries matching the PCID
> + *
> + * The first can be used to invalidate (kernel) mappings at a particular
> + * address across all processes.
> + *
> + * The latter invalidates all TLB entries matching a PCID.
> + */
> +#define INVLPGB_VA			BIT(0)
> +#define INVLPGB_PCID			BIT(1)
> +#define INVLPGB_ASID			BIT(2)
> +#define INVLPGB_INCLUDE_GLOBAL		BIT(3)
> +#define INVLPGB_FINAL_ONLY		BIT(4)
> +#define INVLPGB_INCLUDE_NESTED		BIT(5)
> +
> +/* Flush all mappings for a given pcid and addr, not including globals. */
> +static inline void invlpgb_flush_user(unsigned long pcid,
> +				      unsigned long addr)
> +{
> +	__invlpgb(0, pcid, addr, 0, 0, INVLPGB_PCID | INVLPGB_VA);
> +	tlbsync();
> +}
> +
> +static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
> +						unsigned long addr,
> +						int nr, bool pmd_stride)
> +{
> +	__invlpgb(0, pcid, addr, nr - 1, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
> +}
> +
> +/* Flush all mappings for a given PCID, not including globals. */
> +static inline void invlpgb_flush_single_pcid_nosync(unsigned long pcid)
> +{
> +	__invlpgb(0, pcid, 0, 0, 0, INVLPGB_PCID);
> +}
> +
> +/* Flush all mappings, including globals, for all PCIDs. */
> +static inline void invlpgb_flush_all(void)
> +{
> +	__invlpgb(0, 0, 0, 0, 0, INVLPGB_INCLUDE_GLOBAL);
> +	tlbsync();
> +}
> +
> +/* Flush addr, including globals, for all PCIDs. */
> +static inline void invlpgb_flush_addr_nosync(unsigned long addr, int nr)
> +{
> +	__invlpgb(0, 0, addr, nr - 1, 0, INVLPGB_INCLUDE_GLOBAL);
> +}
> +
> +/* Flush all mappings for all PCIDs except globals. */
> +static inline void invlpgb_flush_all_nonglobals(void)
> +{
> +	__invlpgb(0, 0, 0, 0, 0, 0);
> +	tlbsync();
> +}
> +
> +#endif /* _ASM_X86_INVLPGB */
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 8fe3b2dda507..dba5caa4a9f4 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -10,6 +10,7 @@
>  #include <asm/cpufeature.h>
>  #include <asm/special_insns.h>
>  #include <asm/smp.h>
> +#include <asm/invlpgb.h>
>  #include <asm/invpcid.h>
>  #include <asm/pti.h>
>  #include <asm/processor-flags.h>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 04/12] x86/mm: get INVLPGB count max from CPUID
  2025-01-12 15:53 ` [PATCH v4 04/12] x86/mm: get INVLPGB count max from CPUID Rik van Riel
@ 2025-01-13 15:50   ` Jann Horn
  2025-01-13 21:08     ` Rik van Riel
  0 siblings, 1 reply; 36+ messages in thread
From: Jann Horn @ 2025-01-13 15:50 UTC (permalink / raw)
  To: Rik van Riel, thomas.lendacky
  Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
	nadav.amit, kernel-team, linux-mm, akpm

On Sun, Jan 12, 2025 at 4:55 PM Rik van Riel <riel@surriel.com> wrote:
> +       /* Max number of pages INVLPGB can invalidate in one shot */
> +       if (boot_cpu_has(X86_FEATURE_INVLPGB)) {
> +               cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
> +               invlpgb_count_max = (edx & 0xffff) + 1;

I assume the +1 is just a weird undocumented (or weirdly documented) encoding?
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf
says that field InvlpgbCountMax contains the "Maximum page count for
INVLPGB instruction" and doesn't mention having to add 1 from what I
can tell.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 10/12] x86,tlb: do targeted broadcast flushing from tlbbatch code
  2025-01-12 15:53 ` [PATCH v4 10/12] x86,tlb: do targeted broadcast flushing from tlbbatch code Rik van Riel
@ 2025-01-13 17:05   ` Jann Horn
  2025-01-13 17:48     ` Jann Horn
  2025-01-13 21:16     ` Rik van Riel
  0 siblings, 2 replies; 36+ messages in thread
From: Jann Horn @ 2025-01-13 17:05 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
	nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm

On Sun, Jan 12, 2025 at 4:55 PM Rik van Riel <riel@surriel.com> wrote:
> Instead of doing a system-wide TLB flush from arch_tlbbatch_flush,
> queue up asynchronous, targeted flushes from arch_tlbbatch_add_pending.
>
> This also allows us to avoid adding the CPUs of processes using broadcast
> flushing to the batch->cpumask, and will hopefully further reduce TLB
> flushing from the reclaim and compaction paths.
[...]
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 80375ef186d5..532911fbb12a 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1658,9 +1658,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>          * a local TLB flush is needed. Optimize this use-case by calling
>          * flush_tlb_func_local() directly in this case.
>          */
> -       if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> -               invlpgb_flush_all_nonglobals();
> -       } else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
> +       if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
>                 flush_tlb_multi(&batch->cpumask, info);
>         } else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
>                 lockdep_assert_irqs_enabled();
> @@ -1669,12 +1667,49 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>                 local_irq_enable();
>         }
>
> +       /*
> +        * If we issued (asynchronous) INVLPGB flushes, wait for them here.
> +        * The cpumask above contains only CPUs that were running tasks
> +        * not using broadcast TLB flushing.
> +        */
> +       if (cpu_feature_enabled(X86_FEATURE_INVLPGB) && batch->used_invlpgb) {
> +               tlbsync();
> +               migrate_enable();
> +               batch->used_invlpgb = false;
> +       }
> +
>         cpumask_clear(&batch->cpumask);
>
>         put_flush_tlb_info();
>         put_cpu();
>  }
>
> +void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> +                                            struct mm_struct *mm,
> +                                            unsigned long uaddr)
> +{
> +       if (static_cpu_has(X86_FEATURE_INVLPGB) && mm_global_asid(mm)) {
> +               u16 asid = mm_global_asid(mm);
> +               /*
> +                * Queue up an asynchronous invalidation. The corresponding
> +                * TLBSYNC is done in arch_tlbbatch_flush(), and must be done
> +                * on the same CPU.
> +                */
> +               if (!batch->used_invlpgb) {
> +                       batch->used_invlpgb = true;
> +                       migrate_disable();
> +               }
> +               invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
> +               /* Do any CPUs supporting INVLPGB need PTI? */
> +               if (static_cpu_has(X86_FEATURE_PTI))
> +                       invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false);
> +       } else {
> +               inc_mm_tlb_gen(mm);
> +               cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
> +       }
> +       mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
> +}

How does this work if the MM is currently transitioning to a global
ASID? Should the "mm_global_asid(mm)" check maybe be replaced with
something that checks if the MM has fully transitioned to a global
ASID, so that we keep using the classic path if there might be holdout
CPUs?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 12/12] x86/mm: only invalidate final translations with INVLPGB
  2025-01-12 15:53 ` [PATCH v4 12/12] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
@ 2025-01-13 17:11   ` Jann Horn
  0 siblings, 0 replies; 36+ messages in thread
From: Jann Horn @ 2025-01-13 17:11 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
	nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm

On Sun, Jan 12, 2025 at 4:55 PM Rik van Riel <riel@surriel.com> wrote:
> Use the INVLPGB_FINAL_ONLY flag when invalidating mappings with INVPLGB.
> This way only leaf mappings get removed from the TLB, leaving intermediate
> translations cached.
>
> On the (rare) occasions where we free page tables we do a full flush,
> ensuring intermediate translations get flushed from the TLB.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  arch/x86/include/asm/invlpgb.h | 10 ++++++++--
>  arch/x86/mm/tlb.c              |  8 ++++----
>  2 files changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/invlpgb.h b/arch/x86/include/asm/invlpgb.h
> index d62e3733a1ab..4fa48d063b76 100644
> --- a/arch/x86/include/asm/invlpgb.h
> +++ b/arch/x86/include/asm/invlpgb.h
> @@ -61,9 +61,15 @@ static inline void invlpgb_flush_user(unsigned long pcid,
>
>  static inline void invlpgb_flush_user_nr_nosync(unsigned long pcid,
>                                                 unsigned long addr,
> -                                               int nr, bool pmd_stride)
> +                                               int nr, bool pmd_stride,
> +                                               bool freed_tables)
>  {
> -       __invlpgb(0, pcid, addr, nr - 1, pmd_stride, INVLPGB_PCID | INVLPGB_VA);
> +       unsigned long flags = INVLPGB_PCID | INVLPGB_VA;
> +
> +       if (!freed_tables)
> +               flags |= INVLPGB_FINAL_ONLY;
> +
> +       __invlpgb(0, pcid, addr, nr - 1, pmd_stride, flags);
>  }

Thanks, this looks much nicer to me.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 05/12] x86/mm: add INVLPGB support code
  2025-01-12 15:53 ` [PATCH v4 05/12] x86/mm: add INVLPGB support code Rik van Riel
  2025-01-13 14:21   ` Tom Lendacky
@ 2025-01-13 17:24   ` Jann Horn
  2025-01-14  1:33     ` Rik van Riel
  2025-01-14 18:24   ` Michael Kelley
  2 siblings, 1 reply; 36+ messages in thread
From: Jann Horn @ 2025-01-13 17:24 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
	nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm

On Sun, Jan 12, 2025 at 4:55 PM Rik van Riel <riel@surriel.com> wrote:
> +/* Wait for INVLPGB originated by this CPU to complete. */
> +static inline void tlbsync(void)
> +{
> +       asm volatile("tlbsync");
> +}

If possible, it might be a good idea to add a cant_migrate() assertion
in here, though I'm not sure if that works in terms of include
hierarchy.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 10/12] x86,tlb: do targeted broadcast flushing from tlbbatch code
  2025-01-13 17:05   ` Jann Horn
@ 2025-01-13 17:48     ` Jann Horn
  2025-01-13 21:16     ` Rik van Riel
  1 sibling, 0 replies; 36+ messages in thread
From: Jann Horn @ 2025-01-13 17:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
	nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm

On Mon, Jan 13, 2025 at 6:05 PM Jann Horn <jannh@google.com> wrote:
> On Sun, Jan 12, 2025 at 4:55 PM Rik van Riel <riel@surriel.com> wrote:
> > Instead of doing a system-wide TLB flush from arch_tlbbatch_flush,
> > queue up asynchronous, targeted flushes from arch_tlbbatch_add_pending.
> >
> > This also allows us to avoid adding the CPUs of processes using broadcast
> > flushing to the batch->cpumask, and will hopefully further reduce TLB
> > flushing from the reclaim and compaction paths.
> [...]
> > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> > index 80375ef186d5..532911fbb12a 100644
> > --- a/arch/x86/mm/tlb.c
> > +++ b/arch/x86/mm/tlb.c
> > @@ -1658,9 +1658,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> >          * a local TLB flush is needed. Optimize this use-case by calling
> >          * flush_tlb_func_local() directly in this case.
> >          */
> > -       if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> > -               invlpgb_flush_all_nonglobals();
> > -       } else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
> > +       if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
> >                 flush_tlb_multi(&batch->cpumask, info);
> >         } else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
> >                 lockdep_assert_irqs_enabled();
> > @@ -1669,12 +1667,49 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> >                 local_irq_enable();
> >         }
> >
> > +       /*
> > +        * If we issued (asynchronous) INVLPGB flushes, wait for them here.
> > +        * The cpumask above contains only CPUs that were running tasks
> > +        * not using broadcast TLB flushing.
> > +        */
> > +       if (cpu_feature_enabled(X86_FEATURE_INVLPGB) && batch->used_invlpgb) {
> > +               tlbsync();
> > +               migrate_enable();
> > +               batch->used_invlpgb = false;
> > +       }
> > +
> >         cpumask_clear(&batch->cpumask);
> >
> >         put_flush_tlb_info();
> >         put_cpu();
> >  }
> >
> > +void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> > +                                            struct mm_struct *mm,
> > +                                            unsigned long uaddr)
> > +{
> > +       if (static_cpu_has(X86_FEATURE_INVLPGB) && mm_global_asid(mm)) {
> > +               u16 asid = mm_global_asid(mm);
> > +               /*
> > +                * Queue up an asynchronous invalidation. The corresponding
> > +                * TLBSYNC is done in arch_tlbbatch_flush(), and must be done
> > +                * on the same CPU.
> > +                */
> > +               if (!batch->used_invlpgb) {
> > +                       batch->used_invlpgb = true;
> > +                       migrate_disable();
> > +               }
> > +               invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
> > +               /* Do any CPUs supporting INVLPGB need PTI? */
> > +               if (static_cpu_has(X86_FEATURE_PTI))
> > +                       invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false);
> > +       } else {
> > +               inc_mm_tlb_gen(mm);
> > +               cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
> > +       }
> > +       mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
> > +}
>
> How does this work if the MM is currently transitioning to a global
> ASID? Should the "mm_global_asid(mm)" check maybe be replaced with
> something that checks if the MM has fully transitioned to a global
> ASID, so that we keep using the classic path if there might be holdout
> CPUs?

Ah, but if we did that, we'd also have to ensure that the MM switching
path keeps invalidating the TLB when the MM's TLB generation count
increments, even if the CPU has already switched to the global ASID.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 04/12] x86/mm: get INVLPGB count max from CPUID
  2025-01-13 15:50   ` Jann Horn
@ 2025-01-13 21:08     ` Rik van Riel
  2025-01-13 22:53       ` Tom Lendacky
  0 siblings, 1 reply; 36+ messages in thread
From: Rik van Riel @ 2025-01-13 21:08 UTC (permalink / raw)
  To: Jann Horn, thomas.lendacky
  Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
	nadav.amit, kernel-team, linux-mm, akpm

On Mon, 2025-01-13 at 16:50 +0100, Jann Horn wrote:
> On Sun, Jan 12, 2025 at 4:55 PM Rik van Riel <riel@surriel.com> wrote:
> > +       /* Max number of pages INVLPGB can invalidate in one shot */
> > +       if (boot_cpu_has(X86_FEATURE_INVLPGB)) {
> > +               cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
> > +               invlpgb_count_max = (edx & 0xffff) + 1;
> 
> I assume the +1 is just a weird undocumented (or weirdly documented)
> encoding?
> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf
> says that field InvlpgbCountMax contains the "Maximum page count for
> INVLPGB instruction" and doesn't mention having to add 1 from what I
> can tell.
> 
The way I read the documentation, the number
passed in to invlpgb (and retrieved from cpuid)
corresponds to the number of extra pages
invalidated beyond the first page at the specified
address.

Things have not exploded on me invalidating
multiple pages at once in larger ranges, so I 
suspect my reading is right, but it would be
nice for one of the AMD people to confirm :)
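
To spell out that reading (just an illustrative sketch, not something
taken from the manual):

	/*
	 * Both CPUID 8000_0008 EDX[15:0] and INVLPGB's ECX[15:0] count
	 * *additional* pages beyond the first, so a raw value of N
	 * covers N + 1 pages.
	 */
	static inline unsigned int invlpgb_max_pages(u32 edx)	/* hypothetical helper */
	{
		return (edx & 0xffff) + 1;
	}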

-- 
All Rights Reversed.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 05/12] x86/mm: add INVLPGB support code
  2025-01-13 14:21   ` Tom Lendacky
@ 2025-01-13 21:10     ` Rik van Riel
  2025-01-14 14:29       ` Tom Lendacky
  0 siblings, 1 reply; 36+ messages in thread
From: Rik van Riel @ 2025-01-13 21:10 UTC (permalink / raw)
  To: Tom Lendacky, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	kernel-team, linux-mm, akpm, jannh

On Mon, 2025-01-13 at 08:21 -0600, Tom Lendacky wrote:
> On 1/12/25 09:53, Rik van Riel wrote:
> > 
> > +static inline void __invlpgb(unsigned long asid, unsigned long pcid, unsigned long addr,
> > +			    int extra_count, bool pmd_stride, unsigned long flags)
> > +{
> > +	u32 edx = (pcid << 16) | asid;
> > +	u32 ecx = (pmd_stride << 31);
> > +	u64 rax = addr | flags;
> > +
> > +	/* Protect against negative numbers. */
> > +	extra_count = max(extra_count, 0);
> > +	ecx |= extra_count;
> 
> A bad ECX value (ECX[15:0] > invlpgb_count_max) will result in a #GP,
> is that ok?

The calling code ensures we do not call this code
with more than invlpgb_count_max pages at a time.

Given the choice between "a bug in the calling code
crashes the kernel" and "a bug in the calling code
results in a missed TLB flush", I'm guessing the
crash is probably better.

-- 
All Rights Reversed.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 10/12] x86,tlb: do targeted broadcast flushing from tlbbatch code
  2025-01-13 17:05   ` Jann Horn
  2025-01-13 17:48     ` Jann Horn
@ 2025-01-13 21:16     ` Rik van Riel
  1 sibling, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-13 21:16 UTC (permalink / raw)
  To: Jann Horn
  Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
	nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm

On Mon, 2025-01-13 at 18:05 +0100, Jann Horn wrote:
> On Sun, Jan 12, 2025 at 4:55 PM Rik van Riel <riel@surriel.com>
> wrote:
> 
> > 
> > +void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> > +                                            struct mm_struct *mm,
> > +                                            unsigned long uaddr)
> > +{
> > +       if (static_cpu_has(X86_FEATURE_INVLPGB) && mm_global_asid(mm)) {
> > +               u16 asid = mm_global_asid(mm);
> > +               /*
> > +                * Queue up an asynchronous invalidation. The corresponding
> > +                * TLBSYNC is done in arch_tlbbatch_flush(), and must be done
> > +                * on the same CPU.
> > +                */
> > +               if (!batch->used_invlpgb) {
> > +                       batch->used_invlpgb = true;
> > +                       migrate_disable();
> > +               }
> > +               invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
> > +               /* Do any CPUs supporting INVLPGB need PTI? */
> > +               if (static_cpu_has(X86_FEATURE_PTI))
> > +                       invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false);
> > +       } else {
> > +               inc_mm_tlb_gen(mm);
> > +               cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
> > +       }
> > +       mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
> > +}
> 
> How does this work if the MM is currently transitioning to a global
> ASID? Should the "mm_global_asid(mm)" check maybe be replaced with
> something that checks if the MM has fully transitioned to a global
> ASID, so that we keep using the classic path if there might be
> holdout
> CPUs?
> 
You are right!

If the mm is still transitioning, we should send a
TLB flush IPI, in addition to doing the broadcast shootdown.

Worst case the CPU is already using a global ASID, and
the TLB flush IPI ends up being a noop.
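
A rough sketch of what that could look like on top of this patch (the
asid_transition check is an assumption about how the transition gets
detected, not final code):

	void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
				       struct mm_struct *mm, unsigned long uaddr)
	{
		u16 asid = mm_global_asid(mm);

		if (static_cpu_has(X86_FEATURE_INVLPGB) && asid) {
			/*
			 * Queue up an asynchronous invalidation. The matching
			 * TLBSYNC in arch_tlbbatch_flush() must run on this CPU.
			 */
			if (!batch->used_invlpgb) {
				batch->used_invlpgb = true;
				migrate_disable();
			}
			invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
			/* Do any CPUs supporting INVLPGB need PTI? */
			if (static_cpu_has(X86_FEATURE_PTI))
				invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false);
		}

		/*
		 * Send the IPI-based flush while the mm has no global ASID yet,
		 * or is still transitioning and some CPUs may hold a dynamic ASID.
		 */
		if (!asid || READ_ONCE(mm->context.asid_transition)) {
			inc_mm_tlb_gen(mm);
			cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
		}

		mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
	}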


-- 
All Rights Reversed.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 04/12] x86/mm: get INVLPGB count max from CPUID
  2025-01-13 21:08     ` Rik van Riel
@ 2025-01-13 22:53       ` Tom Lendacky
  0 siblings, 0 replies; 36+ messages in thread
From: Tom Lendacky @ 2025-01-13 22:53 UTC (permalink / raw)
  To: Rik van Riel, Jann Horn
  Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
	nadav.amit, kernel-team, linux-mm, akpm

On 1/13/25 15:08, Rik van Riel wrote:
> On Mon, 2025-01-13 at 16:50 +0100, Jann Horn wrote:
>> On Sun, Jan 12, 2025 at 4:55 PM Rik van Riel <riel@surriel.com>
>> wrote:
>>> +       /* Max number of pages INVLPGB can invalidate in one shot */
>>> +       if (boot_cpu_has(X86_FEATURE_INVLPGB)) {
>>> +               cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
>>> +               invlpgb_count_max = (edx & 0xffff) + 1;
>>
>> I assume the +1 is just a weird undocumented (or weirdly documented)
>> encoding?
>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf
>> says that field InvlpgbCountMax contains the "Maximum page count for
>> INVLPGB instruction" and doesn't mention having to add 1 from what I
>> can tell.
>>
> The way I read the documentation, the number
> passed in to invlpgb (and retrieved from cpuid)
> corresponds to the number of extra pages
> invalidated beyond the first page at the specified
> address.
> 
> Things have not exploded on me invalidating
> multiple pages at once in larger ranges, so I 
> suspect my reading is right, but it would be
> nice for one of the AMD people to confirm :)

That is correct.

Thanks,
Tom
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 11/12] x86/mm: enable AMD translation cache extensions
  2025-01-13 11:32   ` Andrew Cooper
@ 2025-01-14  1:28     ` Rik van Riel
  0 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-14  1:28 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: akpm, bp, dave.hansen, jannh, kernel-team, linux-kernel, linux-mm,
	nadav.amit, peterz, thomas.lendacky, x86, zhengqi.arch

On Mon, 2025-01-13 at 11:32 +0000, Andrew Cooper wrote:
> > +++ b/arch/x86/kernel/cpu/amd.c
> > @@ -1071,6 +1071,9 @@ static void init_amd(struct cpuinfo_x86 *c)
> >  
> >  	/* AMD CPUs don't need fencing after x2APIC/TSC_DEADLINE MSR writes. */
> >  	clear_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
> > +
> > +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
> > +		msr_set_bit(MSR_EFER, _EFER_TCE);
> >  }
> >  
> >  #ifdef CONFIG_X86_32
> 
> I don't think this is wise.  TCE is orthogonal to INVLPGB.
> 
> Either Linux is safe with TCE turned on, and it should be turned on
> everywhere (it goes back to Fam10h CPUs IIRC), or Linux isn't safe with
> TCE turned on, and this needs to depend on some other condition.
> 
> Or, is this a typo and did you mean to check the TCE CPUID bit, rather
> than the INVLPGB CPUID bit?

You're right, this should just check against X86_FEATURE_TCE,
which I did not realize was a separate feature bit.

I've changed this for the next version of the series.
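
i.e. that hunk becomes something like this (sketch):

	if (cpu_feature_enabled(X86_FEATURE_TCE))
		msr_set_bit(MSR_EFER, _EFER_TCE);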

Thank you!

-- 
All Rights Reversed.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 05/12] x86/mm: add INVLPGB support code
  2025-01-13 17:24   ` Jann Horn
@ 2025-01-14  1:33     ` Rik van Riel
  0 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-14  1:33 UTC (permalink / raw)
  To: Jann Horn
  Cc: x86, linux-kernel, bp, peterz, dave.hansen, zhengqi.arch,
	nadav.amit, thomas.lendacky, kernel-team, linux-mm, akpm

On Mon, 2025-01-13 at 18:24 +0100, Jann Horn wrote:
> On Sun, Jan 12, 2025 at 4:55 PM Rik van Riel <riel@surriel.com>
> wrote:
> > +/* Wait for INVLPGB originated by this CPU to complete. */
> > +static inline void tlbsync(void)
> > +{
> > +       asm volatile("tlbsync");
> > +}
> 
> If possible, it might be a good idea to add a cant_migrate() assertion
> in here, though I'm not sure if that works in terms of include
> hierarchy.
> 

Nice idea. It builds without any compiler issues,
so I've included that for the next version.
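
In other words, something like this for the next version (just a sketch):

	/* Wait for INVLPGB originated by this CPU to complete. */
	static inline void tlbsync(void)
	{
		cant_migrate();
		asm volatile("tlbsync");
	}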

-- 
All Rights Reversed.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
  2025-01-13 13:09   ` Nadav Amit
@ 2025-01-14  3:13     ` Rik van Riel
  0 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2025-01-14  3:13 UTC (permalink / raw)
  To: Nadav Amit
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List,
	Borislav Petkov, peterz, Dave Hansen, zhengqi.arch,
	thomas.lendacky, kernel-team, open list:MEMORY MANAGEMENT,
	Andrew Morton, jannh

On Mon, 2025-01-13 at 15:09 +0200, Nadav Amit wrote:
> 
> > On 12 Jan 2025, at 17:53, Rik van Riel <riel@surriel.com> wrote:
> > 
> > +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> > +	u16 global_asid;
> > +	bool asid_transition;
> 
> As I later note, there are various ordering issues between the two.
> Would it be
> just easier to combine them into one field? I know everybody hates
> bitfields so
> I don’t suggest it, but there are other ways...

It's certainly an option, but I think we figured out
the ordering issues, so at this point documentation
and readability of the code might be more important
for future proofing?

> 
> > @@ -170,6 +180,10 @@ static inline int init_new_context(struct task_struct *tsk,
> > static inline void destroy_context(struct mm_struct *mm)
> > {
> > 	destroy_context_ldt(mm);
> > +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> 
> I’d prefer to use IS_ENABLED() and to have a stub for 
> destroy_context_free_global_asid().
> 
> > +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
> > +		destroy_context_free_global_asid(mm);
> > +#endif
> > }

I'll think about how to do this cleaner.

I would like to keep the cpu feature test in the
inline function, so we don't do an unnecessary
function call on systems without INVLPGB.
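
One possible shape that keeps both properties, a stub when the config is off
and the feature check inlined (the wrapper name here is made up):

#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
static inline void mm_free_global_asid(struct mm_struct *mm)
{
	/* Avoid the function call on systems without INVLPGB. */
	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
		destroy_context_free_global_asid(mm);
}
#else
static inline void mm_free_global_asid(struct mm_struct *mm) { }
#endif

with destroy_context() then calling it unconditionally.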

> > 
> > +static inline bool in_asid_transition(const struct flush_tlb_info *info)
> > +{
> > +	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> > +		return false;
> > +
> > +	return info->mm && info->mm->context.asid_transition;
> 
> READ_ONCE(context.asid_transition) ?

Changed for the next version.
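
I.e. (sketch of the tweak):

	return info->mm && READ_ONCE(info->mm->context.asid_transition);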

> 
> > +static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
> > +{
> 
> Having a VM_WARN_ON() here might be nice.

Added. Thank you.

> 
> > +static u16 get_global_asid(void)
> > +{
> > +	lockdep_assert_held(&global_asid_lock);
> > +
> > +	do {
> > +		u16 start = last_global_asid;
> > +		u16 asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
> > +
> > +		if (asid >= MAX_ASID_AVAILABLE) {
> > +			reset_global_asid_space();
> > +			continue;
> > +		}
> > +
> > +		/* Claim this global ASID. */
> > +		__set_bit(asid, global_asid_used);
> > +		last_global_asid = asid;
> > +		return asid;
> > +	} while (1);
> 
> This does not make me feel easy at all. I do not understand
> why it might happen. The caller should’ve already checked the global
> ASID
> is available under the lock. If it is not obvious from the code,
> perhaps
> refactoring is needed.
> 
The caller only checks whether a global ASID is available
somewhere in the ASID space.

This code will claim a specific one.

I guess the global_asid_available-- line could be moved
into get_global_asid() to improve readability?
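
Something like this, perhaps (a sketch of that refactor):

	static u16 get_global_asid(void)
	{
		lockdep_assert_held(&global_asid_lock);

		do {
			u16 start = last_global_asid;
			u16 asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);

			if (asid >= MAX_ASID_AVAILABLE) {
				reset_global_asid_space();
				continue;
			}

			/* Claim this global ASID, and account for it here. */
			__set_bit(asid, global_asid_used);
			last_global_asid = asid;
			global_asid_available--;
			return asid;
		} while (1);
	}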

> > +static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
> > +{
> > +	int count = 0;
> > +	int cpu;
> > +
> > +	/* This quick check should eliminate most single threaded programs. */
> > +	if (cpumask_weight(mm_cpumask(mm)) <= threshold)
> > +		return false;
> > +
> > +	/* Slower check to make sure. */
> > +	for_each_cpu(cpu, mm_cpumask(mm)) {
> > +		/* Skip the CPUs that aren't really running this process. */
> > +		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
> > +			continue;
> 
> Do you really want to make loaded_mm accessed from other cores? Does
> this
> really provide worthy benefit?
> 
> Why not just use cpumask_weight() and be done with it? Anyhow it’s a
> heuristic.

We recently added some code to make mm_cpumask
maintenance a lot lazier, which can result in
more CPUs remaining in the bitmap while not
running the mm.

As for accessing loaded_mm from other CPUs, we
already have to do that in finish_asid_transition.

I don't see any good way around that, but I'm open
to suggestions :)

> 
> > +	/*
> > +	 * The transition from IPI TLB flushing, with a dynamic ASID,
> > +	 * and broadcast TLB flushing, using a global ASID, uses memory
> > +	 * ordering for synchronization.
> > +	 *
> > +	 * While the process has threads still using a dynamic ASID,
> > +	 * TLB invalidation IPIs continue to get sent.
> > +	 *
> > +	 * This code sets asid_transition first, before assigning the
> > +	 * global ASID.
> > +	 *
> > +	 * The TLB flush code will only verify the ASID transition
> > +	 * after it has seen the new global ASID for the process.
> > +	 */
> > +	WRITE_ONCE(mm->context.asid_transition, true);
> 
> I would prefer smp_wmb() and document where the matching smp_rmb()
> (or smp_mb) is.
> 
> > +	WRITE_ONCE(mm->context.global_asid, get_global_asid());
> > +
> > +	global_asid_available--;
> > +}
> > 

I would be happy with either style of ordering.

It's all a bit of a no-op anyway, because x86 does not
reorder stores against other stores.
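
For reference, the smp_wmb() variant Nadav describes would look roughly like
this (just a sketch of the suggestion, not the posted code):

	mm->context.asid_transition = true;
	/*
	 * Pairs with an smp_rmb() (or smp_mb()) on the read side, between
	 * reading mm->context.global_asid and mm->context.asid_transition.
	 */
	smp_wmb();
	mm->context.global_asid = get_global_asid();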

> > +static void finish_asid_transition(struct flush_tlb_info *info)
> > +{
> > +	struct mm_struct *mm = info->mm;
> > +	int bc_asid = mm_global_asid(mm);
> > +	int cpu;
> > +
> > +	if (!READ_ONCE(mm->context.asid_transition))
> > +		return;
> > +
> > +	for_each_cpu(cpu, mm_cpumask(mm)) {
> > +		/*
> > +		 * The remote CPU is context switching. Wait for that to
> > +		 * finish, to catch the unlikely case of it switching to
> > +		 * the target mm with an out of date ASID.
> > +		 */
> > +		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
> > +			cpu_relax();
> 
> Although this code should rarely run, it seems bad for a couple of
> reasons:
> 
> 1. It is a new busy-wait in a very delicate place. Lockdep is blind
> to this
>    change.

This is true. However, if a CPU gets stuck in the middle
of switch_mm_irqs_off, won't we have a bigger problem?

> 
> 2. cpu_tlbstate is supposed to be private for each core - that’s why
> there
>    is cpu_tlbstate_shared. But I really think loaded_mm should be
> kept
>    private.
> 
> Can't we just do one TLB shootdown if 
> 	cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids

That conflicts with a future optimization of simply
not maintaining the mm_cpumask at all for processes
that use broadcast TLB invalidation.

After all, once we no longer use the mm_cpumask for
anything, why incur the overhead of maintaining it?

I would like to understand more about the harm of
reading loaded_mm. How is reading loaded_mm worse
than reading other per-CPU variables that are written
from the same code paths?

> 
> > +	/* All the CPUs running this process are using the global ASID. */
> 
> I guess it’s ordered with the flushes (the flushes must complete
> first).
> 
If there are any flushes.

If all the CPUs we scanned are already using the
global ASID, we do not need any additional ordering
in here, since any CPUs switching to this mm afterward
will be seeing the new global ASID.

> > +	WRITE_ONCE(mm->context.asid_transition, false);
> > +}
> > 


> > +	} else do {
> > +		/*
> > +		 * Calculate how many pages can be flushed at once; if the
> > +		 * remainder of the range is less than one page, flush one.
> > +		 */
> > +		nr = min(maxnr, (info->end - addr) >> info->stride_shift);
> > +		nr = max(nr, 1);
> > +
> > +		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
> > +		/* Do any CPUs supporting INVLPGB need PTI? */
> > +		if (static_cpu_has(X86_FEATURE_PTI))
> > +			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
> > +		addr += nr << info->stride_shift;
> > +	} while (addr < info->end);
> 
> I would have preferred for instead of while...

Changed that for the next version. Thank you.
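
Roughly like this, I expect (a sketch of the reworked loop, reusing the names
from the hunk above):

	for (; addr < info->end; addr += nr << info->stride_shift) {
		/*
		 * Calculate how many pages can be flushed at once; if the
		 * remainder of the range is less than one page, flush one.
		 */
		nr = min(maxnr, (info->end - addr) >> info->stride_shift);
		nr = max(nr, 1);

		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
		if (static_cpu_has(X86_FEATURE_PTI))
			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
	}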

> > @@ -1049,9 +1385,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> > 	 * a local TLB flush is needed. Optimize this use-case by calling
> > 	 * flush_tlb_func_local() directly in this case.
> > 	 */
> > -	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
> 
> I think an smp_rmb() here would communicate the fact
> in_asid_transition() and
> mm_global_asid() must be ordered.
> 
> > +	if (mm_global_asid(mm)) {
> > +		broadcast_tlb_flush(info);

We have the barrier already, in the form of
inc_mm_tlb_gen a few lines up.

Or are you talking about a barrier (or READ_ONCE?)
inside of mm_global_asid() to make sure the compiler
cannot reorder things around mm_global_asid()?

> > +	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
> > 		info->trim_cpumask = should_trim_cpumask(mm);
> > 		flush_tlb_multi(mm_cpumask(mm), info);
> > +		consider_global_asid(mm);
> > 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
> > 		lockdep_assert_irqs_enabled();
> > 		local_irq_disable();
> > -- 
> > 2.47.1
> > 
> 

-- 
All Rights Reversed.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
  2025-01-12 15:53 ` [PATCH v4 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional Rik van Riel
@ 2025-01-14 12:32   ` Borislav Petkov
  0 siblings, 0 replies; 36+ messages in thread
From: Borislav Petkov @ 2025-01-14 12:32 UTC (permalink / raw)
  To: Rik van Riel
  Cc: x86, linux-kernel, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	thomas.lendacky, kernel-team, linux-mm, akpm, jannh

On Sun, Jan 12, 2025 at 10:53:45AM -0500, Rik van Riel wrote:
> Currently x86 uses CONFIG_MMU_GATHER_TABLE_FREE when using
> paravirt, and not when running on bare metal.
> 
> There is no real good reason to do things differently for
> each setup. Make them all the same.
> 
> Currently get_user_pages_fast synchronizes against page table
> freeing in two different ways:
> - on bare metal, by blocking IRQs, which block TLB flush IPIs
> - on paravirt, with MMU_GATHER_RCU_TABLE_FREE
> 
> This is done because some paravirt TLB flush implementations
> handle the TLB flush in the hypervisor, and will do the flush
> even when the target CPU has interrupts disabled.
> 
> After this change, the synchronization between get_user_pages_fast

get_user_pages_fast()  - make it look like a function.

Avoid having "This patch" or "This commit" or "After this change", etc in the
commit message. It is tautologically useless so use imperative tone directly:

"Always handle page table freeing with..."

> and page table freeing is always handled with MMU_GATHER_RCU_TABLE_FREE,
> which allows bare metal to also do TLB flushes while interrupts are
> disabled.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 05/12] x86/mm: add INVLPGB support code
  2025-01-13 21:10     ` Rik van Riel
@ 2025-01-14 14:29       ` Tom Lendacky
  2025-01-14 15:05         ` Dave Hansen
  0 siblings, 1 reply; 36+ messages in thread
From: Tom Lendacky @ 2025-01-14 14:29 UTC (permalink / raw)
  To: Rik van Riel, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	kernel-team, linux-mm, akpm, jannh

On 1/13/25 15:10, Rik van Riel wrote:
> On Mon, 2025-01-13 at 08:21 -0600, Tom Lendacky wrote:
>> On 1/12/25 09:53, Rik van Riel wrote:
>>>
>>> +static inline void __invlpgb(unsigned long asid, unsigned long pcid, unsigned long addr,
>>> +			    int extra_count, bool pmd_stride, unsigned long flags)
>>> +{
>>> +	u32 edx = (pcid << 16) | asid;
>>> +	u32 ecx = (pmd_stride << 31);
>>> +	u64 rax = addr | flags;
>>> +
>>> +	/* Protect against negative numbers. */
>>> +	extra_count = max(extra_count, 0);
>>> +	ecx |= extra_count;
>>
>> A bad ECX value (ECX[15:0] > invlpgb_count_max) will result in a #GP,
>> is that ok?
> 
> The calling code ensures we do not call this code
> with more than invlpgb_count_max pages at a time.
> 
> Given the choice between "a bug in the calling code
> crashes the kernel" and "a bug in the calling code
> results in a missed TLB flush", I'm guessing the
> crash is probably better.

So instead of the negative number protection, shouldn't this just use an
unsigned int for extra_count and panic() if the value is greater than
invlpgb_count_max? The caller has some sort of logic problem and it
could possibly result in missed TLB flushes. Or if a panic() is out of
the question, maybe a WARN() and a full TLB flush to be safe?
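
For the WARN() option, something along these lines (just a sketch of that
alternative, not what the series currently does):

	if (WARN_ON_ONCE(extra_count > invlpgb_count_max)) {
		/* Bogus page count from the caller; flush everything to be safe. */
		flush_tlb_all();
		return;
	}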

Thanks,
Tom

> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 05/12] x86/mm: add INVLPGB support code
  2025-01-14 14:29       ` Tom Lendacky
@ 2025-01-14 15:05         ` Dave Hansen
  2025-01-14 15:23           ` Tom Lendacky
  0 siblings, 1 reply; 36+ messages in thread
From: Dave Hansen @ 2025-01-14 15:05 UTC (permalink / raw)
  To: Tom Lendacky, Rik van Riel, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	kernel-team, linux-mm, akpm, jannh

On 1/14/25 06:29, Tom Lendacky wrote:
>> Given the choice between "a bug in the calling code
>> crashes the kernel" and "a bug in the calling code
>> results in a missed TLB flush", I'm guessing the
>> crash is probably better.
> So instead of the negative number protection, shouldn't this just use an
> unsigned int for extra_count and panic() if the value is greater than
> invlpgb_count_max? The caller has some sort of logic problem and it
> could possibly result in missed TLB flushes. Or if a panic() is out of
> the question, maybe a WARN() and a full TLB flush to be safe?

The current implementation will panic in the #GP handler though. It
should be pretty easy to figure out that INVLPGB is involved with RIP or
the Code: snippet. From there, you'd need to figure out what caused the #GP.

I guess the one nasty thing is that a person debugging this might not
have a CPUID dump handy so wouldn't actually know the number of valid
addresses that INVLPGB can take.

But otherwise, I'm not sure an explicit panic adds _much_ value here
over an implicit one via the #GP handler. I don't know how everybody
else feels about it, but I'm happy just depending on the #GP for now.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 05/12] x86/mm: add INVLPGB support code
  2025-01-14 15:05         ` Dave Hansen
@ 2025-01-14 15:23           ` Tom Lendacky
  2025-01-14 15:47             ` Rik van Riel
  0 siblings, 1 reply; 36+ messages in thread
From: Tom Lendacky @ 2025-01-14 15:23 UTC (permalink / raw)
  To: Dave Hansen, Rik van Riel, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	kernel-team, linux-mm, akpm, jannh

On 1/14/25 09:05, Dave Hansen wrote:
> On 1/14/25 06:29, Tom Lendacky wrote:
>>> Given the choice between "a bug in the calling code
>>> crashes the kernel" and "a bug in the calling code
>>> results in a missed TLB flush", I'm guessing the
>>> crash is probably better.
>> So instead of the negative number protection, shouldn't this just use an
>> unsigned int for extra_count and panic() if the value is greater than
>> invlpgb_count_max? The caller has some sort of logic problem and it
>> could possibly result in missed TLB flushes. Or if a panic() is out of
>> the question, maybe a WARN() and a full TLB flush to be safe?
> 
> The current implementation will panic in the #GP handler though. It
> should be pretty easy to figure out that INVLPGB is involved with RIP or
> the Code: snippet. From there, you'd need to figure out what caused the #GP.

Hmmm, maybe I'm missing something. IIUC, when a negative number is
supplied, the extra_count field will be set to 0 (via the max()
function) and allow the INVLPGB to continue. 0 is valid in ECX[15:0] and
so the instruction won't #GP.

Thanks,
Tom

> 
> I guess the one nasty thing is that a person debugging this might not
> have a CPUID dump handy so wouldn't actually know the number of valid
> addresses that INVLPGB can take.
> 
> But otherwise, I'm not sure an explicit panic adds _much_ value here
> over an implicit one via the #GP handler. I don't know how everybody
> else feels about it, but I'm happy just depending on the #GP for now.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 05/12] x86/mm: add INVLPGB support code
  2025-01-14 15:23           ` Tom Lendacky
@ 2025-01-14 15:47             ` Rik van Riel
  2025-01-14 16:30               ` Tom Lendacky
  0 siblings, 1 reply; 36+ messages in thread
From: Rik van Riel @ 2025-01-14 15:47 UTC (permalink / raw)
  To: Tom Lendacky, Dave Hansen, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	kernel-team, linux-mm, akpm, jannh

On Tue, 2025-01-14 at 09:23 -0600, Tom Lendacky wrote:
> On 1/14/25 09:05, Dave Hansen wrote:
> > On 1/14/25 06:29, Tom Lendacky wrote:
> > > > Given the choice between "a bug in the calling code
> > > > crashes the kernel" and "a bug in the calling code
> > > > results in a missed TLB flush", I'm guessing the
> > > > crash is probably better.
> > > So instead of the negative number protection, shouldn't this just
> > > use an
> > > unsigned int for extra_count and panic() if the value is greater
> > > than
> > > invlpgb_count_max? The caller has some sort of logic problem and
> > > it
> > > could possibly result in missed TLB flushes. Or if a panic() is
> > > out of
> > > the question, maybe a WARN() and a full TLB flush to be safe?
> > 
> > The current implementation will panic in the #GP handler though. It
> > should be pretty easy to figure out that INVLPGB is involved with
> > RIP or
> > the Code: snippet. From there, you'd need to figure out what caused
> > the #GP.
> 
> Hmmm, maybe I'm missing something. IIUC, when a negative number is
> supplied, the extra_count field will be set to 0 (via the max()
> function) and allow the INVLPGB to continue. 0 is valid in ECX[15:0]
> and
> so the instruction won't #GP.

I added that at the request of somebody else :)

Let me remove it again, now that we seem to have a
consensus that a panic is preferable to a wrong
TLB flush.

-- 
All Rights Reversed.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 05/12] x86/mm: add INVLPGB support code
  2025-01-14 15:47             ` Rik van Riel
@ 2025-01-14 16:30               ` Tom Lendacky
  2025-01-14 16:41                 ` Dave Hansen
  0 siblings, 1 reply; 36+ messages in thread
From: Tom Lendacky @ 2025-01-14 16:30 UTC (permalink / raw)
  To: Rik van Riel, Dave Hansen, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	kernel-team, linux-mm, akpm, jannh

On 1/14/25 09:47, Rik van Riel wrote:
> On Tue, 2025-01-14 at 09:23 -0600, Tom Lendacky wrote:
>> On 1/14/25 09:05, Dave Hansen wrote:
>>> On 1/14/25 06:29, Tom Lendacky wrote:
>>>>> Given the choice between "a bug in the calling code
>>>>> crashes the kernel" and "a bug in the calling code
>>>>> results in a missed TLB flush", I'm guessing the
>>>>> crash is probably better.
>>>> So instead of the negative number protection, shouldn't this just
>>>> use an
>>>> unsigned int for extra_count and panic() if the value is greater
>>>> than
>>>> invlpgb_count_max? The caller has some sort of logic problem and
>>>> it
>>>> could possibly result in missed TLB flushes. Or if a panic() is
>>>> out of
>>>> the question, maybe a WARN() and a full TLB flush to be safe?
>>>
>>> The current implementation will panic in the #GP handler though. It
>>> should be pretty easy to figure out that INVLPGB is involved with
>>> RIP or
>>> the Code: snippet. From there, you'd need to figure out what caused
>>> the #GP.
>>
>> Hmmm, maybe I'm missing something. IIUC, when a negative number is
>> supplied, the extra_count field will be set to 0 (via the max()
>> function) and allow the INVLPGB to continue. 0 is valid in ECX[15:0]
>> and
>> so the instruction won't #GP.
> 
> I added that at the request of somebody else :)
> 
> Let me remove it again, now that we seem to have a
> consensus that a panic is preferable to a wrong
> TLB flush.

I believe the instruction will #GP if any of the ECX[30:16] reserved
bits are non-zero (although the APM doesn't document that), in addition
to ECX[15:0] being greater than allowed. But what if 0x80000000 is
passed in. That would set ECX[31] with a zero count field, which is
valid for the instruction, but the input is obviously bogus.

I think the safest thing to do is make the extra_count parameter an
unsigned int and check if it is greater than invlpgb_count_max. Not sure
what to actually do at that point, though... panic()? WARN() with full
TLB flush?

Thanks,
Tom

> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 05/12] x86/mm: add INVLPGB support code
  2025-01-14 16:30               ` Tom Lendacky
@ 2025-01-14 16:41                 ` Dave Hansen
  0 siblings, 0 replies; 36+ messages in thread
From: Dave Hansen @ 2025-01-14 16:41 UTC (permalink / raw)
  To: Tom Lendacky, Rik van Riel, x86
  Cc: linux-kernel, bp, peterz, dave.hansen, zhengqi.arch, nadav.amit,
	kernel-team, linux-mm, akpm, jannh

On 1/14/25 08:30, Tom Lendacky wrote:
>> Let me remove it again, now that we seem to have a
>> consensus that a panic is preferable to a wrong
>> TLB flush.
> I believe the instruction will #GP if any of the ECX[30:16] reserved
> bits are non-zero (although the APM doesn't document that), in addition
> to ECX[15:0] being greater than allowed. But what if 0x80000000 is
> passed in. That would set ECX[31] with a zero count field, which is
> valid for the instruction, but the input is obviously bogus.
> 
> I think the safest thing to do is make the extra_count parameter an
> unsigned int and check if it is greater than invlpgb_count_max. Not sure
> what to actually do at that point, though... panic()? WARN() with full
> TLB flush?

How about we give 'extra_count' a type that can't bleed over into the
stride bits:

static inline void __invlpgb(unsigned long asid, unsigned long pcid,
			     unsigned long addr, u16 extra_count,
			     bool pmd_stride, u8 flags)

plus a check on CPUID 8000_0008h EDX[15:0] to make sure it's not 0xffff.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH v4 05/12] x86/mm: add INVLPGB support code
  2025-01-12 15:53 ` [PATCH v4 05/12] x86/mm: add INVLPGB support code Rik van Riel
  2025-01-13 14:21   ` Tom Lendacky
  2025-01-13 17:24   ` Jann Horn
@ 2025-01-14 18:24   ` Michael Kelley
  2 siblings, 0 replies; 36+ messages in thread
From: Michael Kelley @ 2025-01-14 18:24 UTC (permalink / raw)
  To: riel@surriel.com, x86@kernel.org
  Cc: linux-kernel@vger.kernel.org, bp@alien8.de, peterz@infradead.org,
	dave.hansen@linux.intel.com, zhengqi.arch@bytedance.com,
	nadav.amit@gmail.com, thomas.lendacky@amd.com,
	kernel-team@meta.com, linux-mm@kvack.org,
	akpm@linux-foundation.org, jannh@google.com

From: riel@surriel.com <riel@surriel.com> Sent: Sunday, January 12, 2025 7:54 AM
> 
> Add invlpgb.h with the helper functions and definitions needed to use
> broadcast TLB invalidation on AMD EPYC 3 and newer CPUs.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  arch/x86/include/asm/invlpgb.h  | 95 +++++++++++++++++++++++++++++++++
>  arch/x86/include/asm/tlbflush.h |  1 +
>  2 files changed, 96 insertions(+)
>  create mode 100644 arch/x86/include/asm/invlpgb.h
> 
> diff --git a/arch/x86/include/asm/invlpgb.h b/arch/x86/include/asm/invlpgb.h
> new file mode 100644
> index 000000000000..d62e3733a1ab
> --- /dev/null
> +++ b/arch/x86/include/asm/invlpgb.h
> @@ -0,0 +1,95 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_INVLPGB
> +#define _ASM_X86_INVLPGB
> +
> +#include <vdso/bits.h>
> +
> +/*
> + * INVLPGB does broadcast TLB invalidation across all the CPUs in the system.
> + *
> + * The INVLPGB instruction is weakly ordered, and a batch of invalidations can
> + * be done in a parallel fashion.
> + *
> + * TLBSYNC is used to ensure that pending INVLPGB invalidations initiated from
> + * this CPU have completed.
> + */
> +static inline void __invlpgb(unsigned long asid, unsigned long pcid, unsigned long addr,
> +			    int extra_count, bool pmd_stride, unsigned long flags)
> +{
> +	u32 edx = (pcid << 16) | asid;
> +	u32 ecx = (pmd_stride << 31);
> +	u64 rax = addr | flags;
> +
> +	/* Protect against negative numbers. */
> +	extra_count = max(extra_count, 0);
> +	ecx |= extra_count;
> +
> +	asm volatile("invlpgb" : : "a" (rax), "c" (ecx), "d" (edx));

The above needs to be:

	asm volatile(".byte 0x0f, 0x01, 0xfe" : : "a" (rax), "c" (ecx), "d" (edx));

plus an explanatory comment.

As Boris Petkov previously noted[1], the "invlpgb" instruction name requires
binutils version 2.36. But the current Linux kernel minimum binutils version
is 2.25 (in scripts/min-tool-version.sh). For example, I'm using binutils 2.34,
and your asm statement doesn't compile.

> +}
> +
> +/* Wait for INVLPGB originated by this CPU to complete. */
> +static inline void tlbsync(void)
> +{
> +	asm volatile("tlbsync");

Same as above for "tlbsync".
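
That is, assuming I'm reading the APM encoding right (TLBSYNC should be
0F 01 FF):

	/* TLBSYNC, emitted as raw bytes for older binutils. */
	asm volatile(".byte 0x0f, 0x01, 0xff");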

Michael

[1] https://lore.kernel.org/lkml/20250102124247.GPZ3aJx8JTJa6PcaOW@fat_crate.local/

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2025-01-14 18:24 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-12 15:53 [RFC PATCH v4 00/10] AMD broadcast TLB invalidation Rik van Riel
2025-01-12 15:53 ` [PATCH v4 01/12] x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional Rik van Riel
2025-01-14 12:32   ` Borislav Petkov
2025-01-12 15:53 ` [PATCH v4 02/12] x86/mm: remove pv_ops.mmu.tlb_remove_table call Rik van Riel
2025-01-12 15:53 ` [PATCH v4 03/12] x86/mm: consolidate full flush threshold decision Rik van Riel
2025-01-12 15:53 ` [PATCH v4 04/12] x86/mm: get INVLPGB count max from CPUID Rik van Riel
2025-01-13 15:50   ` Jann Horn
2025-01-13 21:08     ` Rik van Riel
2025-01-13 22:53       ` Tom Lendacky
2025-01-12 15:53 ` [PATCH v4 05/12] x86/mm: add INVLPGB support code Rik van Riel
2025-01-13 14:21   ` Tom Lendacky
2025-01-13 21:10     ` Rik van Riel
2025-01-14 14:29       ` Tom Lendacky
2025-01-14 15:05         ` Dave Hansen
2025-01-14 15:23           ` Tom Lendacky
2025-01-14 15:47             ` Rik van Riel
2025-01-14 16:30               ` Tom Lendacky
2025-01-14 16:41                 ` Dave Hansen
2025-01-13 17:24   ` Jann Horn
2025-01-14  1:33     ` Rik van Riel
2025-01-14 18:24   ` Michael Kelley
2025-01-12 15:53 ` [PATCH v4 06/12] x86/mm: use INVLPGB for kernel TLB flushes Rik van Riel
2025-01-12 15:53 ` [PATCH v4 07/12] x86/tlb: use INVLPGB in flush_tlb_all Rik van Riel
2025-01-12 15:53 ` [PATCH v4 08/12] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing Rik van Riel
2025-01-12 15:53 ` [PATCH v4 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes Rik van Riel
2025-01-13 13:09   ` Nadav Amit
2025-01-14  3:13     ` Rik van Riel
2025-01-12 15:53 ` [PATCH v4 10/12] x86,tlb: do targeted broadcast flushing from tlbbatch code Rik van Riel
2025-01-13 17:05   ` Jann Horn
2025-01-13 17:48     ` Jann Horn
2025-01-13 21:16     ` Rik van Riel
2025-01-12 15:53 ` [PATCH v4 11/12] x86/mm: enable AMD translation cache extensions Rik van Riel
2025-01-13 11:32   ` Andrew Cooper
2025-01-14  1:28     ` Rik van Riel
2025-01-12 15:53 ` [PATCH v4 12/12] x86/mm: only invalidate final translations with INVLPGB Rik van Riel
2025-01-13 17:11   ` Jann Horn
