* [RFC PATCH v4 0/8] Intel RAR TLB invalidation
@ 2025-06-19 20:03 Rik van Riel
2025-06-19 20:03 ` [RFC PATCH v4 1/8] x86/mm: Introduce Remote Action Request MSRs Rik van Riel
` (8 more replies)
0 siblings, 9 replies; 22+ messages in thread
From: Rik van Riel @ 2025-06-19 20:03 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, dave.hansen, luto, peterz, bp, x86, nadav.amit,
seanjc, tglx, mingo
This patch series adds support for IPI-less TLB invalidation
using Intel RAR technology.
Intel RAR differs from AMD INVLPGB in a few ways:
- RAR goes through (emulated?) APIC writes, not instructions
- RAR flushes go through a memory table with 64 entries
- RAR flushes can be targeted to a cpumask
- The RAR functionality must be set up at boot time before it can be used
The cpumask targeting has resulted in Intel RAR and AMD INVLPGB having
slightly different rules (see the sketch below the list):
- Processes with dynamic ASIDs use IPI based shootdowns.
- INVLPGB: processes with a global ASID
  - always have the TLB up to date, on every CPU
  - never need to flush the TLB at context switch time
- RAR: processes with a global ASID
  - have the TLB up to date on CPUs in the mm_cpumask
  - can skip a TLB flush at context switch time if the CPU is in the mm_cpumask
  - need to flush the TLB when scheduled on a CPU not in the mm_cpumask,
    in case the task ran there before and the TLB still has stale entries
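In code, the difference boils down to roughly the following check at
context switch time. This is a condensed sketch of the rule added to
choose_new_asid() later in the series; global_asid_needs_flush() and its
was_in_mm_cpumask argument are illustrative names only, not code from
the patches.

static bool global_asid_needs_flush(bool was_in_mm_cpumask)
{
	/* INVLPGB keeps global ASIDs up to date on every CPU. */
	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
		return false;

	/*
	 * RAR only keeps CPUs in the mm_cpumask up to date; a CPU that
	 * fell out of the mask may still hold stale entries for this mm.
	 */
	return !was_in_mm_cpumask;
}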
RAR functionality is present on Sapphire Rapids and newer CPUs.
Information about Intel RAR can be found in this whitepaper:
https://www.intel.com/content/dam/develop/external/us/en/documents/341431-remote-action-request-white-paper.pdf
This patch series is based off a 2019 patch series created by
Intel, with patches later in the series modified to fit into
the TLB flush code structure we have after AMD INVLPGB functionality
was integrated.
TODO:
- some sort of optimization to avoid sending RARs to CPUs in deeper
idle states when they have init_mm loaded (flush when switching to init_mm?)
v4:
- remove chicken/egg problem that made it impossible to use RAR early
in bootup; now RAR can be used to flush the local TLB (but it's broken?)
- always flush other CPUs with RAR, no more periodic flush_tlb_func
- separate, simplified cpumask trimming code
- attempt to use RAR to flush the local TLB, which should work
according to the documentation
- add a DEBUG patch to flush the local TLB with RAR and again locally,
may need some help from Intel to figure out why this makes a difference
- memory dumps of rar_payload[] suggest we are sending valid RARs
- receiving CPUs set the status from RAR_PENDING to RAR_SUCCESS
- unclear whether the TLB is actually flushed correctly :(
v3:
- move cpa_flush() change out of this patch series
- use MSR_IA32_CORE_CAPS definition, merge first two patches together
- move RAR initialization to early_init_intel()
- remove single-CPU "fast path" from smp_call_rar_many
- remove smp call table RAR entries, just do a direct call
- cleanups suggested (Ingo, Nadav, Dave, Thomas, Borislav, Sean)
- fix !CONFIG_SMP compile in Kconfig
- match RAR definitions to the names & numbers in the documentation
- the code seems to work now
v2:
- Cleanups suggested by Ingo and Nadav (thank you)
- Basic RAR code seems to actually work now.
- Kernel TLB flushes with RAR seem to work correctly.
- User TLB flushes with RAR are still broken, with two symptoms:
- The !is_lazy WARN_ON in leave_mm() is tripped
- Random segfaults.
* [RFC PATCH v4 1/8] x86/mm: Introduce Remote Action Request MSRs
2025-06-19 20:03 [RFC PATCH v4 0/8] Intel RAR TLB invalidation Rik van Riel
@ 2025-06-19 20:03 ` Rik van Riel
2025-06-19 20:03 ` [RFC PATCH v4 2/8] x86/mm: enable BROADCAST_TLB_FLUSH on Intel, too Rik van Riel
` (7 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2025-06-19 20:03 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, dave.hansen, luto, peterz, bp, x86, nadav.amit,
seanjc, tglx, mingo, Yu-cheng Yu, Rik van Riel
From: Yu-cheng Yu <yu-cheng.yu@intel.com>
Remote Action Request (RAR) is a model-specific feature to speed
up inter-processor operations by moving parts of those operations
from software to hardware.
The current RAR implementation handles TLB flushes and MSR writes.
This patch introduces the RAR MSR definitions. The rest of the RAR
support is introduced in later patches.
There are five RAR-related MSRs:
	MSR_IA32_CORE_CAPS (existing, used to enumerate RAR)
	MSR_IA32_RAR_CTRL
	MSR_IA32_RAR_ACT_VEC
	MSR_IA32_RAR_PAYLOAD_BASE
	MSR_IA32_RAR_INFO
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/msr-index.h | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index b7dded3c8113..367a62c50aa2 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -110,6 +110,8 @@
/* Abbreviated from Intel SDM name IA32_CORE_CAPABILITIES */
#define MSR_IA32_CORE_CAPS 0x000000cf
+#define MSR_IA32_CORE_CAPS_RAR_BIT 1
+#define MSR_IA32_CORE_CAPS_RAR BIT(MSR_IA32_CORE_CAPS_RAR_BIT)
#define MSR_IA32_CORE_CAPS_INTEGRITY_CAPS_BIT 2
#define MSR_IA32_CORE_CAPS_INTEGRITY_CAPS BIT(MSR_IA32_CORE_CAPS_INTEGRITY_CAPS_BIT)
#define MSR_IA32_CORE_CAPS_SPLIT_LOCK_DETECT_BIT 5
@@ -122,6 +124,17 @@
#define SNB_C3_AUTO_UNDEMOTE (1UL << 27)
#define SNB_C1_AUTO_UNDEMOTE (1UL << 28)
+/*
+ * Remote Action Requests (RAR) MSRs
+ */
+#define MSR_IA32_RAR_CTRL 0x000000ed
+#define MSR_IA32_RAR_ACT_VEC 0x000000ee
+#define MSR_IA32_RAR_PAYLOAD_BASE 0x000000ef
+#define MSR_IA32_RAR_INFO 0x000000f0
+
+#define RAR_CTRL_ENABLE BIT(31)
+#define RAR_CTRL_IGNORE_IF BIT(30)
+
#define MSR_MTRRcap 0x000000fe
#define MSR_IA32_ARCH_CAPABILITIES 0x0000010a
--
2.49.0
* [RFC PATCH v4 2/8] x86/mm: enable BROADCAST_TLB_FLUSH on Intel, too
2025-06-19 20:03 [RFC PATCH v4 0/8] Intel RAR TLB invalidation Rik van Riel
2025-06-19 20:03 ` [RFC PATCH v4 1/8] x86/mm: Introduce Remote Action Request MSRs Rik van Riel
@ 2025-06-19 20:03 ` Rik van Riel
2025-06-26 13:08 ` Kirill A. Shutemov
2025-06-19 20:03 ` [RFC PATCH v4 3/8] x86/mm: Introduce X86_FEATURE_RAR Rik van Riel
` (6 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2025-06-19 20:03 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, dave.hansen, luto, peterz, bp, x86, nadav.amit,
seanjc, tglx, mingo, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@fb.com>
Much of the code for Intel RAR and AMD INVLPGB is shared.
Place both under the same config option.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/Kconfig.cpu | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index f928cf6e3252..ab763f69f54d 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -360,7 +360,7 @@ menuconfig PROCESSOR_SELECT
config BROADCAST_TLB_FLUSH
def_bool y
- depends on CPU_SUP_AMD && 64BIT
+ depends on (CPU_SUP_AMD || CPU_SUP_INTEL) && 64BIT && SMP
config CPU_SUP_INTEL
default y
--
2.49.0
* [RFC PATCH v4 3/8] x86/mm: Introduce X86_FEATURE_RAR
2025-06-19 20:03 [RFC PATCH v4 0/8] Intel RAR TLB invalidation Rik van Riel
2025-06-19 20:03 ` [RFC PATCH v4 1/8] x86/mm: Introduce Remote Action Request MSRs Rik van Riel
2025-06-19 20:03 ` [RFC PATCH v4 2/8] x86/mm: enable BROADCAST_TLB_FLUSH on Intel, too Rik van Riel
@ 2025-06-19 20:03 ` Rik van Riel
2025-06-19 20:03 ` [RFC PATCH v4 4/8] x86/apic: Introduce Remote Action Request Operations Rik van Riel
` (5 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2025-06-19 20:03 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, dave.hansen, luto, peterz, bp, x86, nadav.amit,
seanjc, tglx, mingo, Yu-cheng Yu, Rik van Riel
From: Yu-cheng Yu <yu-cheng.yu@intel.com>
Introduce X86_FEATURE_RAR and enumeration of the feature.
[riel: moved initialization to intel.c and disabling to Kconfig.cpufeatures]
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/Kconfig.cpufeatures | 4 ++++
arch/x86/include/asm/cpufeatures.h | 2 +-
arch/x86/kernel/cpu/intel.c | 9 +++++++++
3 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/arch/x86/Kconfig.cpufeatures b/arch/x86/Kconfig.cpufeatures
index 250c10627ab3..7d459b5f47f7 100644
--- a/arch/x86/Kconfig.cpufeatures
+++ b/arch/x86/Kconfig.cpufeatures
@@ -195,3 +195,7 @@ config X86_DISABLED_FEATURE_SEV_SNP
config X86_DISABLED_FEATURE_INVLPGB
def_bool y
depends on !BROADCAST_TLB_FLUSH
+
+config X86_DISABLED_FEATURE_RAR
+ def_bool y
+ depends on !BROADCAST_TLB_FLUSH
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index ee176236c2be..e6781541ffce 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -76,7 +76,7 @@
#define X86_FEATURE_K8 ( 3*32+ 4) /* Opteron, Athlon64 */
#define X86_FEATURE_ZEN5 ( 3*32+ 5) /* CPU based on Zen5 microarchitecture */
#define X86_FEATURE_ZEN6 ( 3*32+ 6) /* CPU based on Zen6 microarchitecture */
-/* Free ( 3*32+ 7) */
+#define X86_FEATURE_RAR ( 3*32+ 7) /* Intel Remote Action Request */
#define X86_FEATURE_CONSTANT_TSC ( 3*32+ 8) /* "constant_tsc" TSC ticks at a constant rate */
#define X86_FEATURE_UP ( 3*32+ 9) /* "up" SMP kernel running on UP */
#define X86_FEATURE_ART ( 3*32+10) /* "art" Always running timer (ART) */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 076eaa41b8c8..0cc4ae27127c 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -719,6 +719,15 @@ static void intel_detect_tlb(struct cpuinfo_x86 *c)
cpuid_leaf_0x2(&regs);
for_each_cpuid_0x2_desc(regs, ptr, desc)
intel_tlb_lookup(desc);
+
+ if (cpu_has(c, X86_FEATURE_CORE_CAPABILITIES)) {
+ u64 msr;
+
+ rdmsrl(MSR_IA32_CORE_CAPS, msr);
+
+ if (msr & MSR_IA32_CORE_CAPS_RAR)
+ setup_force_cpu_cap(X86_FEATURE_RAR);
+ }
}
static const struct cpu_dev intel_cpu_dev = {
--
2.49.0
* [RFC PATCH v4 4/8] x86/apic: Introduce Remote Action Request Operations
2025-06-19 20:03 [RFC PATCH v4 0/8] Intel RAR TLB invalidation Rik van Riel
` (2 preceding siblings ...)
2025-06-19 20:03 ` [RFC PATCH v4 3/8] x86/mm: Introduce X86_FEATURE_RAR Rik van Riel
@ 2025-06-19 20:03 ` Rik van Riel
2025-06-26 13:20 ` Kirill A. Shutemov
2025-06-19 20:03 ` [RFC PATCH v4 5/8] x86/mm: Introduce Remote Action Request Rik van Riel
` (4 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2025-06-19 20:03 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, dave.hansen, luto, peterz, bp, x86, nadav.amit,
seanjc, tglx, mingo, Yu-cheng Yu, Rik van Riel
From: Yu-cheng Yu <yu-cheng.yu@intel.com>
RAR TLB flushing is started by sending a command to the APIC.
This patch adds the Remote Action Request APIC plumbing: the RAR
delivery mode, RAR_VECTOR, and a helper to send RAR IPIs.
Because RAR_VECTOR is hardcoded at 0xe0, POSTED_MSI_NOTIFICATION_VECTOR
has to be lowered to 0xdf, reducing the number of vectors available for
device interrupts by 12.
[riel: refactor after 6 years of changes, lower POSTED_MSI_NOTIFICATION_VECTOR]
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/apicdef.h | 1 +
arch/x86/include/asm/irq_vectors.h | 7 ++++++-
arch/x86/include/asm/smp.h | 1 +
arch/x86/kernel/apic/ipi.c | 5 +++++
arch/x86/kernel/apic/local.h | 3 +++
5 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/apicdef.h b/arch/x86/include/asm/apicdef.h
index 094106b6a538..b152d45af91a 100644
--- a/arch/x86/include/asm/apicdef.h
+++ b/arch/x86/include/asm/apicdef.h
@@ -92,6 +92,7 @@
#define APIC_DM_LOWEST 0x00100
#define APIC_DM_SMI 0x00200
#define APIC_DM_REMRD 0x00300
+#define APIC_DM_RAR 0x00300
#define APIC_DM_NMI 0x00400
#define APIC_DM_INIT 0x00500
#define APIC_DM_STARTUP 0x00600
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 47051871b436..52a0cf56562a 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -97,11 +97,16 @@
#define LOCAL_TIMER_VECTOR 0xec
+/*
+ * RAR (remote action request) TLB flush
+ */
+#define RAR_VECTOR 0xe0
+
/*
* Posted interrupt notification vector for all device MSIs delivered to
* the host kernel.
*/
-#define POSTED_MSI_NOTIFICATION_VECTOR 0xeb
+#define POSTED_MSI_NOTIFICATION_VECTOR 0xdf
#define NR_VECTORS 256
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 0c1c68039d6f..0e5ad0dc987a 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -120,6 +120,7 @@ void __noreturn mwait_play_dead(unsigned int eax_hint);
void native_smp_send_reschedule(int cpu);
void native_send_call_func_ipi(const struct cpumask *mask);
void native_send_call_func_single_ipi(int cpu);
+void native_send_rar_ipi(const struct cpumask *mask);
asmlinkage __visible void smp_reboot_interrupt(void);
__visible void smp_reschedule_interrupt(struct pt_regs *regs);
diff --git a/arch/x86/kernel/apic/ipi.c b/arch/x86/kernel/apic/ipi.c
index 98a57cb4aa86..9983c42619ef 100644
--- a/arch/x86/kernel/apic/ipi.c
+++ b/arch/x86/kernel/apic/ipi.c
@@ -106,6 +106,11 @@ void apic_send_nmi_to_offline_cpu(unsigned int cpu)
return;
apic->send_IPI(cpu, NMI_VECTOR);
}
+
+void native_send_rar_ipi(const struct cpumask *mask)
+{
+ __apic_send_IPI_mask(mask, RAR_VECTOR);
+}
#endif /* CONFIG_SMP */
static inline int __prepare_ICR2(unsigned int mask)
diff --git a/arch/x86/kernel/apic/local.h b/arch/x86/kernel/apic/local.h
index bdcf609eb283..833669174267 100644
--- a/arch/x86/kernel/apic/local.h
+++ b/arch/x86/kernel/apic/local.h
@@ -38,6 +38,9 @@ static inline unsigned int __prepare_ICR(unsigned int shortcut, int vector,
case NMI_VECTOR:
icr |= APIC_DM_NMI;
break;
+ case RAR_VECTOR:
+ icr |= APIC_DM_RAR;
+ break;
}
return icr;
}
--
2.49.0
* [RFC PATCH v4 5/8] x86/mm: Introduce Remote Action Request
2025-06-19 20:03 [RFC PATCH v4 0/8] Intel RAR TLB invalidation Rik van Riel
` (3 preceding siblings ...)
2025-06-19 20:03 ` [RFC PATCH v4 4/8] x86/apic: Introduce Remote Action Request Operations Rik van Riel
@ 2025-06-19 20:03 ` Rik van Riel
2025-06-19 23:01 ` Nadav Amit
2025-06-26 15:41 ` Kirill A. Shutemov
2025-06-19 20:03 ` [RFC PATCH v4 6/8] x86/mm: use RAR for kernel TLB flushes Rik van Riel
` (3 subsequent siblings)
8 siblings, 2 replies; 22+ messages in thread
From: Rik van Riel @ 2025-06-19 20:03 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, dave.hansen, luto, peterz, bp, x86, nadav.amit,
seanjc, tglx, mingo, Yu-cheng Yu, Rik van Riel
From: Yu-cheng Yu <yu-cheng.yu@intel.com>
Remote Action Request (RAR) is a TLB flushing broadcast facility.
To start a TLB flush, the initiator CPU creates a RAR payload and
sends a command to the APIC. The receiving CPUs automatically flush
TLBs as specified in the payload without the kernel's involvement.
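In outline, the initiator side added below does the following; the
helpers named here are the ones defined in rar.c by this patch, while
rar_flush_outline() itself is just an illustration, not the literal code.

static void rar_flush_outline(const struct cpumask *mask, u16 pcid,
			      unsigned long start, long pages)
{
	int slot = get_payload_slot();		/* lock one shared payload entry */
	int cpu;

	set_payload(&rar_payload[slot], pcid, start, pages);

	for_each_cpu(cpu, mask)			/* arm each target's action vector */
		set_action_entry(slot, cpu);

	native_send_rar_ipi(mask);		/* RAR signal, handled by microcode */

	for_each_cpu(cpu, mask)			/* RAR_PENDING -> RAR_SUCCESS */
		wait_for_action_done(slot, cpu);

	free_payload_slot(slot);
}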
[ riel: add pcid parameter to smp_call_rar_many so other mms can be flushed ]
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/rar.h | 76 ++++++++++++
arch/x86/kernel/cpu/intel.c | 8 +-
arch/x86/mm/Makefile | 1 +
arch/x86/mm/rar.c | 236 ++++++++++++++++++++++++++++++++++++
4 files changed, 320 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/include/asm/rar.h
create mode 100644 arch/x86/mm/rar.c
diff --git a/arch/x86/include/asm/rar.h b/arch/x86/include/asm/rar.h
new file mode 100644
index 000000000000..c875b9e9c509
--- /dev/null
+++ b/arch/x86/include/asm/rar.h
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_RAR_H
+#define _ASM_X86_RAR_H
+
+/*
+ * RAR payload types
+ */
+#define RAR_TYPE_INVPG 0
+#define RAR_TYPE_INVPG_NO_CR3 1
+#define RAR_TYPE_INVPCID 2
+#define RAR_TYPE_INVEPT 3
+#define RAR_TYPE_INVVPID 4
+#define RAR_TYPE_WRMSR 5
+
+/*
+ * Subtypes for RAR_TYPE_INVLPG
+ */
+#define RAR_INVPG_ADDR 0 /* address specific */
+#define RAR_INVPG_ALL 2 /* all, include global */
+#define RAR_INVPG_ALL_NO_GLOBAL 3 /* all, exclude global */
+
+/*
+ * Subtypes for RAR_TYPE_INVPCID
+ */
+#define RAR_INVPCID_ADDR 0 /* address specific */
+#define RAR_INVPCID_PCID 1 /* all of PCID */
+#define RAR_INVPCID_ALL 2 /* all, include global */
+#define RAR_INVPCID_ALL_NO_GLOBAL 3 /* all, exclude global */
+
+/*
+ * Page size for RAR_TYPE_INVLPG
+ */
+#define RAR_INVLPG_PAGE_SIZE_4K 0
+#define RAR_INVLPG_PAGE_SIZE_2M 1
+#define RAR_INVLPG_PAGE_SIZE_1G 2
+
+/*
+ * Max number of pages per payload
+ */
+#define RAR_INVLPG_MAX_PAGES 63
+
+struct rar_payload {
+ u64 for_sw : 8;
+ u64 type : 8;
+ u64 must_be_zero_1 : 16;
+ u64 subtype : 3;
+ u64 page_size : 2;
+ u64 num_pages : 6;
+ u64 must_be_zero_2 : 21;
+
+ u64 must_be_zero_3;
+
+ /*
+ * Starting address
+ */
+ union {
+ u64 initiator_cr3;
+ struct {
+ u64 pcid : 12;
+ u64 ignored : 52;
+ };
+ };
+ u64 linear_address;
+
+ /*
+ * Padding
+ */
+ u64 padding[4];
+};
+
+void rar_cpu_init(void);
+void rar_boot_cpu_init(void);
+void smp_call_rar_many(const struct cpumask *mask, u16 pcid,
+ unsigned long start, unsigned long end);
+
+#endif /* _ASM_X86_RAR_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 0cc4ae27127c..ddc5e7d81077 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -22,6 +22,7 @@
#include <asm/microcode.h>
#include <asm/msr.h>
#include <asm/numa.h>
+#include <asm/rar.h>
#include <asm/resctrl.h>
#include <asm/thermal.h>
#include <asm/uaccess.h>
@@ -624,6 +625,9 @@ static void init_intel(struct cpuinfo_x86 *c)
split_lock_init();
intel_init_thermal(c);
+
+ if (cpu_feature_enabled(X86_FEATURE_RAR))
+ rar_cpu_init();
}
#ifdef CONFIG_X86_32
@@ -725,8 +729,10 @@ static void intel_detect_tlb(struct cpuinfo_x86 *c)
rdmsrl(MSR_IA32_CORE_CAPS, msr);
- if (msr & MSR_IA32_CORE_CAPS_RAR)
+ if (msr & MSR_IA32_CORE_CAPS_RAR) {
setup_force_cpu_cap(X86_FEATURE_RAR);
+ rar_boot_cpu_init();
+ }
}
}
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5b9908f13dcf..f36fc99e8b10 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -52,6 +52,7 @@ obj-$(CONFIG_ACPI_NUMA) += srat.o
obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
obj-$(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION) += pti.o
+obj-$(CONFIG_BROADCAST_TLB_FLUSH) += rar.o
obj-$(CONFIG_X86_MEM_ENCRYPT) += mem_encrypt.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_amd.o
diff --git a/arch/x86/mm/rar.c b/arch/x86/mm/rar.c
new file mode 100644
index 000000000000..76959782fb03
--- /dev/null
+++ b/arch/x86/mm/rar.c
@@ -0,0 +1,236 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * RAR TLB shootdown
+ */
+#include <linux/sched.h>
+#include <linux/bug.h>
+#include <asm/current.h>
+#include <asm/io.h>
+#include <asm/sync_bitops.h>
+#include <asm/rar.h>
+#include <asm/tlbflush.h>
+
+static DEFINE_PER_CPU(struct cpumask, rar_cpu_mask);
+
+#define RAR_SUCCESS 0x00
+#define RAR_PENDING 0x01
+#define RAR_FAILURE 0x80
+
+#define RAR_MAX_PAYLOADS 64UL
+
+/* How many RAR payloads are supported by this CPU */
+static int rar_max_payloads __ro_after_init = RAR_MAX_PAYLOADS;
+
+/*
+ * RAR payloads telling CPUs what to do. This table is shared between
+ * all CPUs; it is possible to have multiple payload tables shared between
+ * different subsets of CPUs, but that adds a lot of complexity.
+ */
+static struct rar_payload rar_payload[RAR_MAX_PAYLOADS] __page_aligned_bss;
+
+/*
+ * Reduce contention for the RAR payloads by having a small number of
+ * CPUs share a RAR payload entry, instead of a free for all with all CPUs.
+ */
+struct rar_lock {
+ union {
+ raw_spinlock_t lock;
+ char __padding[SMP_CACHE_BYTES];
+ };
+};
+
+static struct rar_lock rar_locks[RAR_MAX_PAYLOADS] __cacheline_aligned;
+
+/*
+ * The action vector tells each CPU which payload table entries
+ * have work for that CPU.
+ */
+static DEFINE_PER_CPU_ALIGNED(u8[RAR_MAX_PAYLOADS], rar_action);
+
+/*
+ * TODO: group CPUs together based on locality in the system instead
+ * of CPU number, to further reduce the cost of contention.
+ */
+static int cpu_rar_payload_number(void)
+{
+ int cpu = raw_smp_processor_id();
+ return cpu % rar_max_payloads;
+}
+
+static int get_payload_slot(void)
+{
+ int payload_nr = cpu_rar_payload_number();
+ raw_spin_lock(&rar_locks[payload_nr].lock);
+ return payload_nr;
+}
+
+static void free_payload_slot(unsigned long payload_nr)
+{
+ raw_spin_unlock(&rar_locks[payload_nr].lock);
+}
+
+static void set_payload(struct rar_payload *p, u16 pcid, unsigned long start,
+ long pages)
+{
+ p->must_be_zero_1 = 0;
+ p->must_be_zero_2 = 0;
+ p->must_be_zero_3 = 0;
+ p->page_size = RAR_INVLPG_PAGE_SIZE_4K;
+ p->type = RAR_TYPE_INVPCID;
+ p->pcid = pcid;
+ p->linear_address = start;
+
+ if (pcid) {
+ /* RAR invalidation of the mapping of a specific process. */
+ if (pages < RAR_INVLPG_MAX_PAGES) {
+ p->num_pages = pages;
+ p->subtype = RAR_INVPCID_ADDR;
+ } else {
+ p->subtype = RAR_INVPCID_PCID;
+ }
+ } else {
+ /*
+ * Unfortunately RAR_INVPCID_ADDR excludes global translations.
+ * Always do a full flush for kernel invalidations.
+ */
+ p->subtype = RAR_INVPCID_ALL;
+ }
+
+ /* Ensure all writes are visible before the action entry is set. */
+ smp_wmb();
+}
+
+static void set_action_entry(unsigned long payload_nr, int target_cpu)
+{
+ u8 *bitmap = per_cpu(rar_action, target_cpu);
+
+ /*
+ * Given a remote CPU, "arm" its action vector to ensure it handles
+ * the request at payload_nr when it receives a RAR signal.
+ * The remote CPU will overwrite RAR_PENDING when it handles
+ * the request.
+ */
+ WRITE_ONCE(bitmap[payload_nr], RAR_PENDING);
+}
+
+static void wait_for_action_done(unsigned long payload_nr, int target_cpu)
+{
+ u8 status;
+ u8 *rar_actions = per_cpu(rar_action, target_cpu);
+
+ status = READ_ONCE(rar_actions[payload_nr]);
+
+ while (status == RAR_PENDING) {
+ cpu_relax();
+ status = READ_ONCE(rar_actions[payload_nr]);
+ }
+
+ WARN_ON_ONCE(rar_actions[payload_nr] != RAR_SUCCESS);
+}
+
+void rar_cpu_init(void)
+{
+ u8 *bitmap;
+ u64 r;
+
+ /* Check if this CPU was already initialized. */
+ rdmsrl(MSR_IA32_RAR_PAYLOAD_BASE, r);
+ if (r == (u64)virt_to_phys(rar_payload))
+ return;
+
+ bitmap = this_cpu_ptr(rar_action);
+ memset(bitmap, 0, RAR_MAX_PAYLOADS);
+ wrmsrl(MSR_IA32_RAR_ACT_VEC, (u64)virt_to_phys(bitmap));
+ wrmsrl(MSR_IA32_RAR_PAYLOAD_BASE, (u64)virt_to_phys(rar_payload));
+
+ /*
+ * Allow RAR events to be processed while interrupts are disabled on
+ * a target CPU. This prevents "pileups" where many CPUs are waiting
+ * on one CPU that has IRQs blocked for too long, and should reduce
+ * contention on the rar_payload table.
+ */
+ wrmsrl(MSR_IA32_RAR_CTRL, RAR_CTRL_ENABLE | RAR_CTRL_IGNORE_IF);
+}
+
+void rar_boot_cpu_init(void)
+{
+ int max_payloads;
+ u64 r;
+
+ /* The MSR contains N defining the max [0-N] rar payload slots. */
+ rdmsrl(MSR_IA32_RAR_INFO, r);
+ max_payloads = (r >> 32) + 1;
+
+ /* If this CPU supports less than RAR_MAX_PAYLOADS, lower our limit. */
+ if (max_payloads < rar_max_payloads)
+ rar_max_payloads = max_payloads;
+ pr_info("RAR: support %d payloads\n", max_payloads);
+
+ for (r = 0; r < rar_max_payloads; r++)
+ rar_locks[r].lock = __RAW_SPIN_LOCK_UNLOCKED(rar_lock);
+
+ /* Initialize the boot CPU early to handle early boot flushes. */
+ rar_cpu_init();
+}
+
+/*
+ * Inspired by smp_call_function_many(), but RAR requires a global payload
+ * table rather than per-CPU payloads in the CSD table, because the action
+ * handler is microcode rather than software.
+ */
+void smp_call_rar_many(const struct cpumask *mask, u16 pcid,
+ unsigned long start, unsigned long end)
+{
+ unsigned long pages = (end - start + PAGE_SIZE) / PAGE_SIZE;
+ int cpu, this_cpu = smp_processor_id();
+ cpumask_t *dest_mask;
+ unsigned long payload_nr;
+
+ /* Catch the "end - start + PAGE_SIZE" overflow above. */
+ if (end == TLB_FLUSH_ALL)
+ pages = RAR_INVLPG_MAX_PAGES + 1;
+
+ /*
+ * Can deadlock when called with interrupts disabled.
+ * Allow CPUs that are not yet online though, as no one else can
+ * send smp call function interrupt to this CPU and as such deadlocks
+ * can't happen.
+ */
+ if (cpu_online(this_cpu) && !oops_in_progress && !early_boot_irqs_disabled) {
+ lockdep_assert_irqs_enabled();
+ lockdep_assert_preemption_disabled();
+ }
+
+ /*
+ * A CPU needs to be initialized in order to process RARs.
+ * Skip offline CPUs.
+ *
+ * TODO:
+ * - Skip RAR to CPUs that are in a deeper C-state, with an empty TLB
+ *
+ * This code cannot use the should_flush_tlb() logic here because
+ * RAR flushes do not update the tlb_gen, resulting in unnecessary
+ * flushes at context switch time.
+ */
+ dest_mask = this_cpu_ptr(&rar_cpu_mask);
+ cpumask_and(dest_mask, mask, cpu_online_mask);
+
+ /* Some callers race with other CPUs changing the passed mask */
+ if (unlikely(!cpumask_weight(dest_mask)))
+ return;
+
+ payload_nr = get_payload_slot();
+ set_payload(&rar_payload[payload_nr], pcid, start, pages);
+
+ for_each_cpu(cpu, dest_mask)
+ set_action_entry(payload_nr, cpu);
+
+ /* Send a message to all CPUs in the map */
+ native_send_rar_ipi(dest_mask);
+
+ for_each_cpu(cpu, dest_mask)
+ wait_for_action_done(payload_nr, cpu);
+
+ free_payload_slot(payload_nr);
+}
+EXPORT_SYMBOL(smp_call_rar_many);
--
2.49.0
* [RFC PATCH v4 6/8] x86/mm: use RAR for kernel TLB flushes
2025-06-19 20:03 [RFC PATCH v4 0/8] Intel RAR TLB invalidation Rik van Riel
` (4 preceding siblings ...)
2025-06-19 20:03 ` [RFC PATCH v4 5/8] x86/mm: Introduce Remote Action Request Rik van Riel
@ 2025-06-19 20:03 ` Rik van Riel
2025-06-27 13:27 ` Kirill A. Shutemov
2025-06-19 20:03 ` [RFC PATCH v4 7/8] x86/mm: userspace & pageout flushing using Intel RAR Rik van Riel
` (2 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2025-06-19 20:03 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, dave.hansen, luto, peterz, bp, x86, nadav.amit,
seanjc, tglx, mingo, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@fb.com>
Use Intel RAR for kernel TLB flushes, when enabled.
Pass in PCID 0 to smp_call_rar_many() to flush the specified addresses,
regardless of which PCID they might be cached under in any destination CPU.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/mm/tlb.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 39f80111e6f1..8931f7029d6c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -21,6 +21,7 @@
#include <asm/apic.h>
#include <asm/msr.h>
#include <asm/perf_event.h>
+#include <asm/rar.h>
#include <asm/tlb.h>
#include "mm_internal.h"
@@ -1468,6 +1469,18 @@ static void do_flush_tlb_all(void *info)
__flush_tlb_all();
}
+static void rar_full_flush(const cpumask_t *cpumask)
+{
+ guard(preempt)();
+ smp_call_rar_many(cpumask, 0, 0, TLB_FLUSH_ALL);
+ invpcid_flush_all();
+}
+
+static void rar_flush_all(void)
+{
+ rar_full_flush(cpu_online_mask);
+}
+
void flush_tlb_all(void)
{
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
@@ -1475,6 +1488,8 @@ void flush_tlb_all(void)
/* First try (faster) hardware-assisted TLB invalidation. */
if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
invlpgb_flush_all();
+ else if (cpu_feature_enabled(X86_FEATURE_RAR))
+ rar_flush_all();
else
/* Fall back to the IPI-based invalidation. */
on_each_cpu(do_flush_tlb_all, NULL, 1);
@@ -1504,15 +1519,36 @@ static void do_kernel_range_flush(void *info)
struct flush_tlb_info *f = info;
unsigned long addr;
+ /*
+ * With PTI kernel TLB entries in all PCIDs need to be flushed.
+ * With RAR the PCID space becomes so large, we might as well flush it all.
+ *
+ * Either of the two by itself works with targeted flushes.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_RAR) &&
+ cpu_feature_enabled(X86_FEATURE_PTI)) {
+ invpcid_flush_all();
+ return;
+ }
+
/* flush range by one by one 'invlpg' */
for (addr = f->start; addr < f->end; addr += PAGE_SIZE)
flush_tlb_one_kernel(addr);
}
+static void rar_kernel_range_flush(struct flush_tlb_info *info)
+{
+ guard(preempt)();
+ smp_call_rar_many(cpu_online_mask, 0, info->start, info->end);
+ do_kernel_range_flush(info);
+}
+
static void kernel_tlb_flush_all(struct flush_tlb_info *info)
{
if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
invlpgb_flush_all();
+ else if (cpu_feature_enabled(X86_FEATURE_RAR))
+ rar_flush_all();
else
on_each_cpu(do_flush_tlb_all, NULL, 1);
}
@@ -1521,6 +1557,8 @@ static void kernel_tlb_flush_range(struct flush_tlb_info *info)
{
if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
invlpgb_kernel_range_flush(info);
+ else if (cpu_feature_enabled(X86_FEATURE_RAR))
+ rar_kernel_range_flush(info);
else
on_each_cpu(do_kernel_range_flush, info, 1);
}
--
2.49.0
* [RFC PATCH v4 7/8] x86/mm: userspace & pageout flushing using Intel RAR
2025-06-19 20:03 [RFC PATCH v4 0/8] Intel RAR TLB invalidation Rik van Riel
` (5 preceding siblings ...)
2025-06-19 20:03 ` [RFC PATCH v4 6/8] x86/mm: use RAR for kernel TLB flushes Rik van Riel
@ 2025-06-19 20:03 ` Rik van Riel
2025-06-19 20:04 ` [RFC PATCH v4 8/8] x86/tlb: flush the local TLB twice (DEBUG) Rik van Riel
2025-06-26 18:08 ` [RFC PATCH v4 0/8] Intel RAR TLB invalidation Dave Jiang
8 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2025-06-19 20:03 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, dave.hansen, luto, peterz, bp, x86, nadav.amit,
seanjc, tglx, mingo, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@fb.com>
Use Intel RAR to flush userspace mappings.
Because RAR flushes are targeted using a cpu bitmap, the rules are
a little different from those for true broadcast TLB invalidation.
With true broadcast TLB invalidation, as done with AMD INVLPGB,
a global ASID always has up to date TLB entries on every CPU.
The context switch code never has to flush the TLB when switching
to a global ASID on any CPU with INVLPGB.
For RAR, the TLB mappings for a global ASID are kept up to date
only on CPUs within the mm_cpumask, which lazily follows the
threads around the system. The context switch code does not
need to flush the TLB if the CPU is in the mm_cpumask, and
the PCID used stays the same.
However, a CPU that falls outside of the mm_cpumask can have
out of date TLB mappings for this task. When switching to
that task on a CPU not in the mm_cpumask, the TLB does need
to be flushed.
Signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/include/asm/tlbflush.h | 9 +-
arch/x86/mm/tlb.c | 217 ++++++++++++++++++++++++++------
2 files changed, 182 insertions(+), 44 deletions(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index e9b81876ebe4..21bd9162df38 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -250,7 +250,8 @@ static inline u16 mm_global_asid(struct mm_struct *mm)
{
u16 asid;
- if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ if (!cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
+ !cpu_feature_enabled(X86_FEATURE_RAR))
return 0;
asid = smp_load_acquire(&mm->context.global_asid);
@@ -263,7 +264,8 @@ static inline u16 mm_global_asid(struct mm_struct *mm)
static inline void mm_init_global_asid(struct mm_struct *mm)
{
- if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+ if (cpu_feature_enabled(X86_FEATURE_INVLPGB) ||
+ cpu_feature_enabled(X86_FEATURE_RAR)) {
mm->context.global_asid = 0;
mm->context.asid_transition = false;
}
@@ -287,7 +289,8 @@ static inline void mm_clear_asid_transition(struct mm_struct *mm)
static inline bool mm_in_asid_transition(struct mm_struct *mm)
{
- if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ if (!cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
+ !cpu_feature_enabled(X86_FEATURE_RAR))
return false;
return mm && READ_ONCE(mm->context.asid_transition);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 8931f7029d6c..590742838e43 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -222,7 +222,8 @@ struct new_asid {
unsigned int need_flush : 1;
};
-static struct new_asid choose_new_asid(struct mm_struct *next, u64 next_tlb_gen)
+static struct new_asid choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
+ bool new_cpu)
{
struct new_asid ns;
u16 asid;
@@ -235,14 +236,22 @@ static struct new_asid choose_new_asid(struct mm_struct *next, u64 next_tlb_gen)
/*
* TLB consistency for global ASIDs is maintained with hardware assisted
- * remote TLB flushing. Global ASIDs are always up to date.
+ * remote TLB flushing. Global ASIDs are always up to date with INVLPGB,
+ * and up to date for CPUs in the mm_cpumask with RAR.
*/
- if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
+ if (cpu_feature_enabled(X86_FEATURE_INVLPGB) ||
+ cpu_feature_enabled(X86_FEATURE_RAR)) {
u16 global_asid = mm_global_asid(next);
if (global_asid) {
ns.asid = global_asid;
ns.need_flush = 0;
+ /*
+ * If the CPU fell out of the cpumask, it can be
+ * out of date with RAR, and should be flushed.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_RAR))
+ ns.need_flush = new_cpu;
return ns;
}
}
@@ -300,7 +309,14 @@ static void reset_global_asid_space(void)
{
lockdep_assert_held(&global_asid_lock);
- invlpgb_flush_all_nonglobals();
+ /*
+ * The global flush ensures that a freshly allocated global ASID
+ * has no entries in any TLB, and can be used immediately.
+ * With Intel RAR, the TLB may still need to be flushed at context
+ * switch time when dealing with a CPU that was not in the mm_cpumask
+ * for the process, and may have missed flushes along the way.
+ */
+ flush_tlb_all();
/*
* The TLB flush above makes it safe to re-use the previously
@@ -377,7 +393,7 @@ static void use_global_asid(struct mm_struct *mm)
{
u16 asid;
- guard(raw_spinlock_irqsave)(&global_asid_lock);
+ guard(raw_spinlock)(&global_asid_lock);
/* This process is already using broadcast TLB invalidation. */
if (mm_global_asid(mm))
@@ -403,13 +419,14 @@ static void use_global_asid(struct mm_struct *mm)
void mm_free_global_asid(struct mm_struct *mm)
{
- if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ if (!cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
+ !cpu_feature_enabled(X86_FEATURE_RAR))
return;
if (!mm_global_asid(mm))
return;
- guard(raw_spinlock_irqsave)(&global_asid_lock);
+ guard(raw_spinlock)(&global_asid_lock);
/* The global ASID can be re-used only after flush at wrap-around. */
#ifdef CONFIG_BROADCAST_TLB_FLUSH
@@ -427,7 +444,8 @@ static bool mm_needs_global_asid(struct mm_struct *mm, u16 asid)
{
u16 global_asid = mm_global_asid(mm);
- if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ if (!cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
+ !cpu_feature_enabled(X86_FEATURE_RAR))
return false;
/* Process is transitioning to a global ASID */
@@ -445,7 +463,8 @@ static bool mm_needs_global_asid(struct mm_struct *mm, u16 asid)
*/
static void consider_global_asid(struct mm_struct *mm)
{
- if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ if (!cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
+ !cpu_feature_enabled(X86_FEATURE_RAR))
return;
/* Check every once in a while. */
@@ -490,6 +509,7 @@ static void finish_asid_transition(struct flush_tlb_info *info)
* that results in a (harmless) extra IPI.
*/
if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
+ info->trim_cpumask = true;
flush_tlb_multi(mm_cpumask(info->mm), info);
return;
}
@@ -499,7 +519,7 @@ static void finish_asid_transition(struct flush_tlb_info *info)
mm_clear_asid_transition(mm);
}
-static void broadcast_tlb_flush(struct flush_tlb_info *info)
+static void invlpgb_tlb_flush(struct flush_tlb_info *info)
{
bool pmd = info->stride_shift == PMD_SHIFT;
unsigned long asid = mm_global_asid(info->mm);
@@ -530,8 +550,6 @@ static void broadcast_tlb_flush(struct flush_tlb_info *info)
addr += nr << info->stride_shift;
} while (addr < info->end);
- finish_asid_transition(info);
-
/* Wait for the INVLPGBs kicked off above to finish. */
__tlbsync();
}
@@ -862,7 +880,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
/* Check if the current mm is transitioning to a global ASID */
if (mm_needs_global_asid(next, prev_asid)) {
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
- ns = choose_new_asid(next, next_tlb_gen);
+ ns = choose_new_asid(next, next_tlb_gen, true);
goto reload_tlb;
}
@@ -900,6 +918,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
ns.asid = prev_asid;
ns.need_flush = true;
} else {
+ bool new_cpu = false;
/*
* Apply process to process speculation vulnerability
* mitigations if applicable.
@@ -914,20 +933,25 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
barrier();
- /* Start receiving IPIs and then read tlb_gen (and LAM below) */
- if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))
+ /* Start receiving IPIs and RAR invalidations */
+ if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next))) {
cpumask_set_cpu(cpu, mm_cpumask(next));
+ if (cpu_feature_enabled(X86_FEATURE_RAR))
+ new_cpu = true;
+ }
+
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
- ns = choose_new_asid(next, next_tlb_gen);
+ ns = choose_new_asid(next, next_tlb_gen, new_cpu);
}
reload_tlb:
new_lam = mm_lam_cr3_mask(next);
if (ns.need_flush) {
- VM_WARN_ON_ONCE(is_global_asid(ns.asid));
- this_cpu_write(cpu_tlbstate.ctxs[ns.asid].ctx_id, next->context.ctx_id);
- this_cpu_write(cpu_tlbstate.ctxs[ns.asid].tlb_gen, next_tlb_gen);
+ if (is_dyn_asid(ns.asid)) {
+ this_cpu_write(cpu_tlbstate.ctxs[ns.asid].ctx_id, next->context.ctx_id);
+ this_cpu_write(cpu_tlbstate.ctxs[ns.asid].tlb_gen, next_tlb_gen);
+ }
load_new_mm_cr3(next->pgd, ns.asid, new_lam, true);
trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
@@ -1115,7 +1139,7 @@ static void flush_tlb_func(void *info)
const struct flush_tlb_info *f = info;
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
- u64 local_tlb_gen;
+ u64 local_tlb_gen = 0;
bool local = smp_processor_id() == f->initiating_cpu;
unsigned long nr_invalidate = 0;
u64 mm_tlb_gen;
@@ -1138,19 +1162,6 @@ static void flush_tlb_func(void *info)
if (unlikely(loaded_mm == &init_mm))
return;
- /* Reload the ASID if transitioning into or out of a global ASID */
- if (mm_needs_global_asid(loaded_mm, loaded_mm_asid)) {
- switch_mm_irqs_off(NULL, loaded_mm, NULL);
- loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
- }
-
- /* Broadcast ASIDs are always kept up to date with INVLPGB. */
- if (is_global_asid(loaded_mm_asid))
- return;
-
- VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
- loaded_mm->context.ctx_id);
-
if (this_cpu_read(cpu_tlbstate_shared.is_lazy)) {
/*
* We're in lazy mode. We need to at least flush our
@@ -1161,11 +1172,31 @@ static void flush_tlb_func(void *info)
* This should be rare, with native_flush_tlb_multi() skipping
* IPIs to lazy TLB mode CPUs.
*/
+ cpumask_clear_cpu(raw_smp_processor_id(), mm_cpumask(loaded_mm));
switch_mm_irqs_off(NULL, &init_mm, NULL);
return;
}
- local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+ /* Reload the ASID if transitioning into or out of a global ASID */
+ if (mm_needs_global_asid(loaded_mm, loaded_mm_asid)) {
+ switch_mm_irqs_off(NULL, loaded_mm, NULL);
+ loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+ }
+
+ /*
+ * Broadcast ASIDs are always kept up to date with INVLPGB; with
+ * Intel RAR IPI based flushes are used periodically to trim the
+ * mm_cpumask, and flushes that get here should be processed.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
+ is_global_asid(loaded_mm_asid))
+ return;
+
+ VM_WARN_ON(is_dyn_asid(loaded_mm_asid) && loaded_mm->context.ctx_id !=
+ this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id));
+
+ if (is_dyn_asid(loaded_mm_asid))
+ local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
if (unlikely(f->new_tlb_gen != TLB_GENERATION_INVALID &&
f->new_tlb_gen <= local_tlb_gen)) {
@@ -1264,7 +1295,8 @@ static void flush_tlb_func(void *info)
}
/* Both paths above update our state to mm_tlb_gen. */
- this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
+ if (is_dyn_asid(loaded_mm_asid))
+ this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
/* Tracing is done in a unified manner to reduce the code size */
done:
@@ -1305,15 +1337,15 @@ static bool should_flush_tlb(int cpu, void *data)
if (loaded_mm == info->mm)
return true;
- /* In cpumask, but not the loaded mm? Periodically remove by flushing. */
- if (info->trim_cpumask)
- return true;
-
return false;
}
static bool should_trim_cpumask(struct mm_struct *mm)
{
+ /* INVLPGB always goes to all CPUs. No need to trim the mask. */
+ if (cpu_feature_enabled(X86_FEATURE_INVLPGB) && mm_global_asid(mm))
+ return false;
+
if (time_after(jiffies, READ_ONCE(mm->context.next_trim_cpumask))) {
WRITE_ONCE(mm->context.next_trim_cpumask, jiffies + HZ);
return true;
@@ -1324,6 +1356,27 @@ static bool should_trim_cpumask(struct mm_struct *mm)
DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
EXPORT_PER_CPU_SYMBOL(cpu_tlbstate_shared);
+static bool should_flush_all(const struct flush_tlb_info *info)
+{
+ if (info->freed_tables)
+ return true;
+
+ if (info->trim_cpumask)
+ return true;
+
+ /*
+ * INVLPGB and RAR do not use this code path normally.
+ * This call cleans up the cpumask or ASID transition.
+ */
+ if (mm_global_asid(info->mm))
+ return true;
+
+ if (mm_in_asid_transition(info->mm))
+ return true;
+
+ return false;
+}
+
STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
{
@@ -1349,7 +1402,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
* up on the new contents of what used to be page tables, while
* doing a speculative memory access.
*/
- if (info->freed_tables || mm_in_asid_transition(info->mm))
+ if (should_flush_all(info))
on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
else
on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
@@ -1380,6 +1433,74 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct flush_tlb_info, flush_tlb_info);
static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx);
#endif
+static void trim_cpumask_func(void *data)
+{
+ struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+ const struct flush_tlb_info *f = data;
+
+ /*
+ * Clearing this bit from an IRQ handler synchronizes against
+ * the bit being set in switch_mm_irqs_off, with IRQs disabled.
+ */
+ if (f->mm != loaded_mm)
+ cpumask_clear_cpu(raw_smp_processor_id(), mm_cpumask(f->mm));
+}
+
+static bool should_remove_cpu_from_mask(int cpu, void *data)
+{
+ struct mm_struct *loaded_mm = per_cpu(cpu_tlbstate.loaded_mm, cpu);
+ struct flush_tlb_info *info = data;
+
+ if (loaded_mm != info->mm)
+ return true;
+
+ return false;
+}
+
+/* Remove CPUs from the mm_cpumask that are running another mm. */
+static void trim_cpumask(struct flush_tlb_info *info)
+{
+ cpumask_t *cpumask = mm_cpumask(info->mm);
+ on_each_cpu_cond_mask(should_remove_cpu_from_mask, trim_cpumask_func,
+ (void *)info, 1, cpumask);
+}
+
+static void rar_tlb_flush(struct flush_tlb_info *info)
+{
+ unsigned long asid = mm_global_asid(info->mm);
+ cpumask_t *cpumask = mm_cpumask(info->mm);
+ u16 pcid = kern_pcid(asid);
+
+ if (info->trim_cpumask)
+ trim_cpumask(info);
+
+ /* Only the local CPU needs to be flushed? */
+ if (cpumask_equal(cpumask, cpumask_of(raw_smp_processor_id()))) {
+ lockdep_assert_irqs_enabled();
+ local_irq_disable();
+ flush_tlb_func(info);
+ local_irq_enable();
+ return;
+ }
+
+ /* Flush all the CPUs at once with RAR. */
+ if (cpumask_weight(cpumask)) {
+ smp_call_rar_many(mm_cpumask(info->mm), pcid, info->start, info->end);
+ if (cpu_feature_enabled(X86_FEATURE_PTI))
+ smp_call_rar_many(mm_cpumask(info->mm), user_pcid(asid), info->start, info->end);
+ }
+}
+
+static void broadcast_tlb_flush(struct flush_tlb_info *info)
+{
+ if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ invlpgb_tlb_flush(info);
+ else /* Intel RAR */
+ rar_tlb_flush(info);
+
+ finish_asid_transition(info);
+}
+
static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
unsigned long start, unsigned long end,
unsigned int stride_shift, bool freed_tables,
@@ -1440,6 +1561,13 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
new_tlb_gen);
+ /*
+ * IPIs and RAR can be targeted to a cpumask. Periodically trim that
+ * mm_cpumask by sending TLB flush IPIs, even when most TLB flushes
+ * are done with RAR.
+ */
+ info->trim_cpumask = should_trim_cpumask(mm);
+
/*
* flush_tlb_multi() is not optimized for the common case in which only
* a local TLB flush is needed. Optimize this use-case by calling
@@ -1448,7 +1576,6 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
if (mm_global_asid(mm)) {
broadcast_tlb_flush(info);
} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
- info->trim_cpumask = should_trim_cpumask(mm);
flush_tlb_multi(mm_cpumask(mm), info);
consider_global_asid(mm);
} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
@@ -1759,6 +1886,14 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
if (cpu_feature_enabled(X86_FEATURE_INVLPGB) && batch->unmapped_pages) {
invlpgb_flush_all_nonglobals();
batch->unmapped_pages = false;
+ } else if (cpu_feature_enabled(X86_FEATURE_RAR) && cpumask_any(&batch->cpumask) < nr_cpu_ids) {
+ rar_full_flush(&batch->cpumask);
+ if (cpumask_test_cpu(cpu, &batch->cpumask)) {
+ lockdep_assert_irqs_enabled();
+ local_irq_disable();
+ invpcid_flush_all_nonglobals();
+ local_irq_enable();
+ }
} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
flush_tlb_multi(&batch->cpumask, info);
} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
--
2.49.0
* [RFC PATCH v4 8/8] x86/tlb: flush the local TLB twice (DEBUG)
2025-06-19 20:03 [RFC PATCH v4 0/8] Intel RAR TLB invalidation Rik van Riel
` (6 preceding siblings ...)
2025-06-19 20:03 ` [RFC PATCH v4 7/8] x86/mm: userspace & pageout flushing using Intel RAR Rik van Riel
@ 2025-06-19 20:04 ` Rik van Riel
2025-06-26 18:08 ` [RFC PATCH v4 0/8] Intel RAR TLB invalidation Dave Jiang
8 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2025-06-19 20:04 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, dave.hansen, luto, peterz, bp, x86, nadav.amit,
seanjc, tglx, mingo, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
The RAR code attempts to flush the local TLB in addition to remote TLBs,
if the local CPU is in the cpumask. That the RAR is processed locally can
be seen from the status for the local CPU changing from RAR_PENDING to
RAR_SUCCESS.
However, it appears that the local TLB is not actually getting flushed
when the microcode flips the status to RAR_SUCCESS!
The RAR white paper suggests it should work:
"At this point, the ILP may invalidate its own TLB by signaling RAR
to itself in order to invoke the RAR handler locally as well."
I would really appreciate some guidance from Intel on how to move
forward here.
Is the RAR code doing something wrong?
Is the CPU not behaving quite as documented?
What is the best way forward?
Not-signed-off-by: Rik van Riel <riel@surriel.com>
---
arch/x86/mm/tlb.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 590742838e43..f12eff2dbcc8 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1469,22 +1469,22 @@ static void rar_tlb_flush(struct flush_tlb_info *info)
{
unsigned long asid = mm_global_asid(info->mm);
cpumask_t *cpumask = mm_cpumask(info->mm);
+ int cpu = raw_smp_processor_id();
u16 pcid = kern_pcid(asid);
if (info->trim_cpumask)
trim_cpumask(info);
/* Only the local CPU needs to be flushed? */
- if (cpumask_equal(cpumask, cpumask_of(raw_smp_processor_id()))) {
+ if (cpumask_test_cpu(cpu, cpumask)) {
lockdep_assert_irqs_enabled();
local_irq_disable();
flush_tlb_func(info);
local_irq_enable();
- return;
}
/* Flush all the CPUs at once with RAR. */
- if (cpumask_weight(cpumask)) {
+ if (cpumask_any_but(cpumask, cpu)) {
smp_call_rar_many(mm_cpumask(info->mm), pcid, info->start, info->end);
if (cpu_feature_enabled(X86_FEATURE_PTI))
smp_call_rar_many(mm_cpumask(info->mm), user_pcid(asid), info->start, info->end);
--
2.49.0
* Re: [RFC PATCH v4 5/8] x86/mm: Introduce Remote Action Request
2025-06-19 20:03 ` [RFC PATCH v4 5/8] x86/mm: Introduce Remote Action Request Rik van Riel
@ 2025-06-19 23:01 ` Nadav Amit
2025-06-20 1:10 ` Rik van Riel
2025-06-26 15:41 ` Kirill A. Shutemov
1 sibling, 1 reply; 22+ messages in thread
From: Nadav Amit @ 2025-06-19 23:01 UTC (permalink / raw)
To: Rik van Riel, linux-kernel
Cc: kernel-team, dave.hansen, luto, peterz, bp, x86, seanjc, tglx,
mingo, Yu-cheng Yu
On 19/06/2025 23:03, Rik van Riel wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>
> Remote Action Request (RAR) is a TLB flushing broadcast facility.
> To start a TLB flush, the initiator CPU creates a RAR payload and
> sends a command to the APIC. The receiving CPUs automatically flush
> TLBs as specified in the payload without the kernel's involvement.
>
> [ riel: add pcid parameter to smp_call_rar_many so other mms can be flushed ]
>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> arch/x86/include/asm/rar.h | 76 ++++++++++++
> arch/x86/kernel/cpu/intel.c | 8 +-
> arch/x86/mm/Makefile | 1 +
> arch/x86/mm/rar.c | 236 ++++++++++++++++++++++++++++++++++++
> 4 files changed, 320 insertions(+), 1 deletion(-)
> create mode 100644 arch/x86/include/asm/rar.h
> create mode 100644 arch/x86/mm/rar.c
>
> diff --git a/arch/x86/include/asm/rar.h b/arch/x86/include/asm/rar.h
> new file mode 100644
> index 000000000000..c875b9e9c509
> --- /dev/null
> +++ b/arch/x86/include/asm/rar.h
> @@ -0,0 +1,76 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_RAR_H
> +#define _ASM_X86_RAR_H
> +
> +/*
> + * RAR payload types
> + */
> +#define RAR_TYPE_INVPG 0
> +#define RAR_TYPE_INVPG_NO_CR3 1
> +#define RAR_TYPE_INVPCID 2
> +#define RAR_TYPE_INVEPT 3
> +#define RAR_TYPE_INVVPID 4
> +#define RAR_TYPE_WRMSR 5
> +
> +/*
> + * Subtypes for RAR_TYPE_INVLPG
> + */
> +#define RAR_INVPG_ADDR 0 /* address specific */
> +#define RAR_INVPG_ALL 2 /* all, include global */
> +#define RAR_INVPG_ALL_NO_GLOBAL 3 /* all, exclude global */
> +
> +/*
> + * Subtypes for RAR_TYPE_INVPCID
> + */
> +#define RAR_INVPCID_ADDR 0 /* address specific */
> +#define RAR_INVPCID_PCID 1 /* all of PCID */
> +#define RAR_INVPCID_ALL 2 /* all, include global */
> +#define RAR_INVPCID_ALL_NO_GLOBAL 3 /* all, exclude global */
> +
> +/*
> + * Page size for RAR_TYPE_INVLPG
> + */
> +#define RAR_INVLPG_PAGE_SIZE_4K 0
> +#define RAR_INVLPG_PAGE_SIZE_2M 1
> +#define RAR_INVLPG_PAGE_SIZE_1G 2
> +
> +/*
> + * Max number of pages per payload
> + */
> +#define RAR_INVLPG_MAX_PAGES 63
> +
> +struct rar_payload {
> + u64 for_sw : 8;
> + u64 type : 8;
> + u64 must_be_zero_1 : 16;
> + u64 subtype : 3;
> + u64 page_size : 2;
> + u64 num_pages : 6;
> + u64 must_be_zero_2 : 21;
> +
> + u64 must_be_zero_3;
> +
> + /*
> + * Starting address
> + */
> + union {
> + u64 initiator_cr3;
> + struct {
> + u64 pcid : 12;
> + u64 ignored : 52;
> + };
> + };
> + u64 linear_address;
> +
> + /*
> + * Padding
> + */
> + u64 padding[4];
> +};
I think __aligned(64) should allow you to get rid of the padding.
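Something like this (untested, and assuming the architectural
requirement is simply that each table entry is 64 bytes, 64-byte
aligned):

struct rar_payload {
	u64 for_sw		: 8;
	u64 type		: 8;
	u64 must_be_zero_1	: 16;
	u64 subtype		: 3;
	u64 page_size		: 2;
	u64 num_pages		: 6;
	u64 must_be_zero_2	: 21;

	u64 must_be_zero_3;

	/* Starting address */
	union {
		u64 initiator_cr3;
		struct {
			u64 pcid	: 12;
			u64 ignored	: 52;
		};
	};
	u64 linear_address;
} __aligned(64);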
> +
> +void rar_cpu_init(void);
> +void rar_boot_cpu_init(void);
> +void smp_call_rar_many(const struct cpumask *mask, u16 pcid,
> + unsigned long start, unsigned long end);
> +
> +#endif /* _ASM_X86_RAR_H */
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index 0cc4ae27127c..ddc5e7d81077 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -22,6 +22,7 @@
> #include <asm/microcode.h>
> #include <asm/msr.h>
> #include <asm/numa.h>
> +#include <asm/rar.h>
> #include <asm/resctrl.h>
> #include <asm/thermal.h>
> #include <asm/uaccess.h>
> @@ -624,6 +625,9 @@ static void init_intel(struct cpuinfo_x86 *c)
> split_lock_init();
>
> intel_init_thermal(c);
> +
> + if (cpu_feature_enabled(X86_FEATURE_RAR))
> + rar_cpu_init();
> }
>
> #ifdef CONFIG_X86_32
> @@ -725,8 +729,10 @@ static void intel_detect_tlb(struct cpuinfo_x86 *c)
>
> rdmsrl(MSR_IA32_CORE_CAPS, msr);
>
> - if (msr & MSR_IA32_CORE_CAPS_RAR)
> + if (msr & MSR_IA32_CORE_CAPS_RAR) {
> setup_force_cpu_cap(X86_FEATURE_RAR);
> + rar_boot_cpu_init();
> + }
> }
> }
>
> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index 5b9908f13dcf..f36fc99e8b10 100644
> --- a/arch/x86/mm/Makefile
> +++ b/arch/x86/mm/Makefile
> @@ -52,6 +52,7 @@ obj-$(CONFIG_ACPI_NUMA) += srat.o
> obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
> obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
> obj-$(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION) += pti.o
> +obj-$(CONFIG_BROADCAST_TLB_FLUSH) += rar.o
>
> obj-$(CONFIG_X86_MEM_ENCRYPT) += mem_encrypt.o
> obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_amd.o
> diff --git a/arch/x86/mm/rar.c b/arch/x86/mm/rar.c
> new file mode 100644
> index 000000000000..76959782fb03
> --- /dev/null
> +++ b/arch/x86/mm/rar.c
> @@ -0,0 +1,236 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * RAR TLB shootdown
> + */
> +#include <linux/sched.h>
> +#include <linux/bug.h>
> +#include <asm/current.h>
> +#include <asm/io.h>
> +#include <asm/sync_bitops.h>
> +#include <asm/rar.h>
> +#include <asm/tlbflush.h>
> +
> +static DEFINE_PER_CPU(struct cpumask, rar_cpu_mask);
> +
> +#define RAR_SUCCESS 0x00
> +#define RAR_PENDING 0x01
> +#define RAR_FAILURE 0x80
> +
> +#define RAR_MAX_PAYLOADS 64UL
> +
> +/* How many RAR payloads are supported by this CPU */
> +static int rar_max_payloads __ro_after_init = RAR_MAX_PAYLOADS;
> +
> +/*
> + * RAR payloads telling CPUs what to do. This table is shared between
> + * all CPUs; it is possible to have multiple payload tables shared between
> + * different subsets of CPUs, but that adds a lot of complexity.
> + */
> +static struct rar_payload rar_payload[RAR_MAX_PAYLOADS] __page_aligned_bss;
> +
> +/*
> + * Reduce contention for the RAR payloads by having a small number of
> + * CPUs share a RAR payload entry, instead of a free for all with all CPUs.
> + */
> +struct rar_lock {
> + union {
> + raw_spinlock_t lock;
> + char __padding[SMP_CACHE_BYTES];
> + };
> +};
I think you can lose the __padding and instead have
____cacheline_aligned (and then you won't need union).
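i.e. something like (untested):

struct rar_lock {
	raw_spinlock_t lock;
} ____cacheline_aligned;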
> +
> +static struct rar_lock rar_locks[RAR_MAX_PAYLOADS] __cacheline_aligned;
> +
> +/*
> + * The action vector tells each CPU which payload table entries
> + * have work for that CPU.
> + */
> +static DEFINE_PER_CPU_ALIGNED(u8[RAR_MAX_PAYLOADS], rar_action);
> +
> +/*
> + * TODO: group CPUs together based on locality in the system instead
> + * of CPU number, to further reduce the cost of contention.
> + */
> +static int cpu_rar_payload_number(void)
> +{
> + int cpu = raw_smp_processor_id();
Why raw_* ?
> + return cpu % rar_max_payloads;
> +}
> +
> +static int get_payload_slot(void)
> +{
> + int payload_nr = cpu_rar_payload_number();
> + raw_spin_lock(&rar_locks[payload_nr].lock);
> + return payload_nr;
> +}
I think it would be better to open-code it to improve readability. If
you choose not to, I think you should use sparse annotations (e.g.,
__acquires()).
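e.g. (untested; "rar_lock" below is just a symbolic token for sparse's
context tracking, not a real symbol):

static int get_payload_slot(void)
	__acquires(rar_lock)
{
	int payload_nr = cpu_rar_payload_number();

	raw_spin_lock(&rar_locks[payload_nr].lock);
	return payload_nr;
}

static void free_payload_slot(unsigned long payload_nr)
	__releases(rar_lock)
{
	raw_spin_unlock(&rar_locks[payload_nr].lock);
}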
> +
> +static void free_payload_slot(unsigned long payload_nr)
> +{
> + raw_spin_unlock(&rar_locks[payload_nr].lock);
> +}
> +
> +static void set_payload(struct rar_payload *p, u16 pcid, unsigned long start,
> + long pages)
> +{
> + p->must_be_zero_1 = 0;
> + p->must_be_zero_2 = 0;
> + p->must_be_zero_3 = 0;
> + p->page_size = RAR_INVLPG_PAGE_SIZE_4K;
I think you can propagate the stride to this point instead of using a
fixed 4KB stride.
> + p->type = RAR_TYPE_INVPCID;
> + p->pcid = pcid;
> + p->linear_address = start;
> +
> + if (pcid) {
> + /* RAR invalidation of the mapping of a specific process. */
> + if (pages < RAR_INVLPG_MAX_PAGES) {
> + p->num_pages = pages;
> + p->subtype = RAR_INVPCID_ADDR;
> + } else {
> + p->subtype = RAR_INVPCID_PCID;
I wonder whether it would be safer to always set p->num_pages to
something, so it can be done unconditionally.
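e.g. (untested):

	p->num_pages = min_t(long, pages, RAR_INVLPG_MAX_PAGES);
	p->subtype = pages < RAR_INVLPG_MAX_PAGES ? RAR_INVPCID_ADDR
						  : RAR_INVPCID_PCID;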
> + }
> + } else {
> + /*
> + * Unfortunately RAR_INVPCID_ADDR excludes global translations.
> + * Always do a full flush for kernel invalidations.
> + */
> + p->subtype = RAR_INVPCID_ALL;
> + }
> +
> + /* Ensure all writes are visible before the action entry is set. */
> + smp_wmb();
Maybe you can drop the smp_wmb() here and instead change the
WRITE_ONCE() in set_action_entry() to smp_store_release() ? It should
have the same effect and I think would be cleaner and convey your intent
better.
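i.e., roughly (sketch):

static void set_action_entry(unsigned long payload_nr, int target_cpu)
{
	u8 *rar_actions = per_cpu(rar_action, target_cpu);

	/* Publish the payload written in set_payload() before RAR_PENDING. */
	smp_store_release(&rar_actions[payload_nr], RAR_PENDING);
}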
> +}
> +
> +static void set_action_entry(unsigned long payload_nr, int target_cpu)
> +{
> + u8 *bitmap = per_cpu(rar_action, target_cpu);
bitmap? It doesn't look like one...
> +
> + /*
> + * Given a remote CPU, "arm" its action vector to ensure it handles
> + * the request at payload_nr when it receives a RAR signal.
> + * The remote CPU will overwrite RAR_PENDING when it handles
> + * the request.
> + */
> + WRITE_ONCE(bitmap[payload_nr], RAR_PENDING);
> +}
> +
> +static void wait_for_action_done(unsigned long payload_nr, int target_cpu)
> +{
> + u8 status;
> + u8 *rar_actions = per_cpu(rar_action, target_cpu);
> +
> + status = READ_ONCE(rar_actions[payload_nr]);
> +
> + while (status == RAR_PENDING) {
> + cpu_relax();
> + status = READ_ONCE(rar_actions[payload_nr]);
> + }
> +
> + WARN_ON_ONCE(rar_actions[payload_nr] != RAR_SUCCESS);
WARN_ON_ONCE(status != RAR_SUCCESS)
> +}
> +
> +void rar_cpu_init(void)
> +{
> + u8 *bitmap;
> + u64 r;
> +
> + /* Check if this CPU was already initialized. */
> + rdmsrl(MSR_IA32_RAR_PAYLOAD_BASE, r);
> + if (r == (u64)virt_to_phys(rar_payload))
> + return;
Seems like a risky test. If anything, I would check that the MSR that is
supposed to be set *last* (MSR_IA32_RAR_CTRL) has the expected value.
But it would still be best to initialize the MSRs unconditionally, or to
avoid repeated initialization using a different scheme.
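For example, keying off the control MSR instead (untested sketch):

	u64 ctrl;

	/* RAR_CTRL is written last, so it shows whether init completed. */
	rdmsrl(MSR_IA32_RAR_CTRL, ctrl);
	if (ctrl & RAR_CTRL_ENABLE)
		return;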
> +
> + bitmap = this_cpu_ptr(rar_action);
> + memset(bitmap, 0, RAR_MAX_PAYLOADS);
> + wrmsrl(MSR_IA32_RAR_ACT_VEC, (u64)virt_to_phys(bitmap));
> + wrmsrl(MSR_IA32_RAR_PAYLOAD_BASE, (u64)virt_to_phys(rar_payload));
> +
> + /*
> + * Allow RAR events to be processed while interrupts are disabled on
> + * a target CPU. This prevents "pileups" where many CPUs are waiting
> + * on one CPU that has IRQs blocked for too long, and should reduce
> + * contention on the rar_payload table.
> + */
> + wrmsrl(MSR_IA32_RAR_CTRL, RAR_CTRL_ENABLE | RAR_CTRL_IGNORE_IF);
> +}
> +
> +void rar_boot_cpu_init(void)
> +{
> + int max_payloads;
> + u64 r;
> +
> + /* The MSR contains N defining the max [0-N] rar payload slots. */
> + rdmsrl(MSR_IA32_RAR_INFO, r);
> + max_payloads = (r >> 32) + 1;
> +
> + /* If this CPU supports less than RAR_MAX_PAYLOADS, lower our limit. */
> + if (max_payloads < rar_max_payloads)
> + rar_max_payloads = max_payloads;
> + pr_info("RAR: support %d payloads\n", max_payloads);
> +
> + for (r = 0; r < rar_max_payloads; r++)
> + rar_locks[r].lock = __RAW_SPIN_LOCK_UNLOCKED(rar_lock);
Not a fan of the reuse of r for different purposes.
> +
> + /* Initialize the boot CPU early to handle early boot flushes. */
> + rar_cpu_init();
> +}
> +
> +/*
> + * Inspired by smp_call_function_many(), but RAR requires a global payload
> + * table rather than per-CPU payloads in the CSD table, because the action
> + * handler is microcode rather than software.
> + */
> +void smp_call_rar_many(const struct cpumask *mask, u16 pcid,
> + unsigned long start, unsigned long end)
> +{
> + unsigned long pages = (end - start + PAGE_SIZE) / PAGE_SIZE;
I think "end" is not inclusive. See for instance flush_tlb_page() where
"end" is set to "a + PAGE_SIZE". So this would flush one extra page.
> + int cpu, this_cpu = smp_processor_id();
> + cpumask_t *dest_mask;
> + unsigned long payload_nr;
> +
> + /* Catch the "end - start + PAGE_SIZE" overflow above. */
> + if (end == TLB_FLUSH_ALL)
> + pages = RAR_INVLPG_MAX_PAGES + 1;
> +
> + /*
> + * Can deadlock when called with interrupts disabled.
> + * Allow CPUs that are not yet online though, as no one else can
> + * send smp call function interrupt to this CPU and as such deadlocks
> + * can't happen.
> + */
> + if (cpu_online(this_cpu) && !oops_in_progress && !early_boot_irqs_disabled) {
> + lockdep_assert_irqs_enabled();
> + lockdep_assert_preemption_disabled();
> + }
> +
> + /*
> + * A CPU needs to be initialized in order to process RARs.
> + * Skip offline CPUs.
> + *
> + * TODO:
> + * - Skip RAR to CPUs that are in a deeper C-state, with an empty TLB
> + *
> + * This code cannot use the should_flush_tlb() logic here because
> + * RAR flushes do not update the tlb_gen, resulting in unnecessary
> + * flushes at context switch time.
> + */
> + dest_mask = this_cpu_ptr(&rar_cpu_mask);
> + cpumask_and(dest_mask, mask, cpu_online_mask);
> +
> + /* Some callers race with other CPUs changing the passed mask */
> + if (unlikely(!cpumask_weight(dest_mask)))
cpumask_and() returns "false if *@dstp is empty, else returns true". So
you can use its return value instead of calling cpumask_weight().
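i.e. (sketch):

	/* Some callers race with other CPUs changing the passed mask */
	if (unlikely(!cpumask_and(dest_mask, mask, cpu_online_mask)))
		return;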
> + return;
> +
> + payload_nr = get_payload_slot();
> + set_payload(&rar_payload[payload_nr], pcid, start, pages);
> +
> + for_each_cpu(cpu, dest_mask)
> + set_action_entry(payload_nr, cpu);
> +
> + /* Send a message to all CPUs in the map */
> + native_send_rar_ipi(dest_mask);
> +
> + for_each_cpu(cpu, dest_mask)
> + wait_for_action_done(payload_nr, cpu);
> +
> + free_payload_slot(payload_nr);
> +}
> +EXPORT_SYMBOL(smp_call_rar_many);
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH v4 5/8] x86/mm: Introduce Remote Action Request
2025-06-19 23:01 ` Nadav Amit
@ 2025-06-20 1:10 ` Rik van Riel
2025-06-20 15:27 ` Sean Christopherson
` (2 more replies)
0 siblings, 3 replies; 22+ messages in thread
From: Rik van Riel @ 2025-06-20 1:10 UTC (permalink / raw)
To: Nadav Amit, linux-kernel
Cc: kernel-team, dave.hansen, luto, peterz, bp, x86, seanjc, tglx,
mingo, Yu-cheng Yu
On Fri, 2025-06-20 at 02:01 +0300, Nadav Amit wrote:
>
> > +/*
> > + * Reduce contention for the RAR payloads by having a small number
> > of
> > + * CPUs share a RAR payload entry, instead of a free for all with
> > all CPUs.
> > + */
> > +struct rar_lock {
> > + union {
> > + raw_spinlock_t lock;
> > + char __padding[SMP_CACHE_BYTES];
> > + };
> > +};
>
> I think you can lose the __padding and instead have
> ____cacheline_aligned (and then you won't need union).
>
I tried that initially, but the compiler was unhappy
to have __cacheline_aligned in the definition of a
struct.
> > +
> > +static struct rar_lock rar_locks[RAR_MAX_PAYLOADS]
> > __cacheline_aligned;
> > +
> > +/*
> > + * The action vector tells each CPU which payload table entries
> > + * have work for that CPU.
> > + */
> > +static DEFINE_PER_CPU_ALIGNED(u8[RAR_MAX_PAYLOADS], rar_action);
> > +
> > +/*
> > + * TODO: group CPUs together based on locality in the system
> > instead
> > + * of CPU number, to further reduce the cost of contention.
> > + */
> > +static int cpu_rar_payload_number(void)
> > +{
> > + int cpu = raw_smp_processor_id();
>
> Why raw_* ?
I'll change that to regular smp_processor_id()
for the next version.
>
> > + return cpu % rar_max_payloads;
> > +}
> > +
> > +static int get_payload_slot(void)
> > +{
> > + int payload_nr = cpu_rar_payload_number();
> > + raw_spin_lock(&rar_locks[payload_nr].lock);
> > + return payload_nr;
> > +}
>
> I think it would be better to open-code it to improve readability. If
> you choose not to, I think you should add sparse annotations (e.g.,
> __acquires()).
Good point about the annotations. This can indeed
be open coded, since any future improvements here,
for example to have cpu_rar_payload_number() take
topology into account to reduce the cost of contention,
will be in that helper function.
>
> > +
> > +static void free_payload_slot(unsigned long payload_nr)
> > +{
> > + raw_spin_unlock(&rar_locks[payload_nr].lock);
> > +}
> > +
> > +static void set_payload(struct rar_payload *p, u16 pcid, unsigned
> > long start,
> > + long pages)
> > +{
> > + p->must_be_zero_1 = 0;
> > + p->must_be_zero_2 = 0;
> > + p->must_be_zero_3 = 0;
> > + p->page_size = RAR_INVLPG_PAGE_SIZE_4K;
>
> I think you can propagate the stride to this point instead of using
> fixed 4KB stride.
Agreed. That's another future optimization, for
once this code all works right.
Currently I am in a situation where the receiving
CPU clears the action vector from RAR_PENDING to
RAR_SUCCESS, but the TLB does not appear to always
be correctly flushed.
>
> > + p->type = RAR_TYPE_INVPCID;
> > + p->pcid = pcid;
> > + p->linear_address = start;
> > +
> > + if (pcid) {
> > + /* RAR invalidation of the mapping of a specific
> > process. */
> > + if (pages < RAR_INVLPG_MAX_PAGES) {
> > + p->num_pages = pages;
> > + p->subtype = RAR_INVPCID_ADDR;
> > + } else {
> > + p->subtype = RAR_INVPCID_PCID;
>
> I wonder whether it would be safer to set p->num_pages to something
> (then we could do it unconditionally)
We have a limited number of bits available for
p->num_pages. I'm not sure we want to try
writing a larger number than what fits in those
bits.
>
> > + }
> > + } else {
> > + /*
> > + * Unfortunately RAR_INVPCID_ADDR excludes global
> > translations.
> > + * Always do a full flush for kernel
> > invalidations.
> > + */
> > + p->subtype = RAR_INVPCID_ALL;
> > + }
> > +
> > + /* Ensure all writes are visible before the action entry
> > is set. */
> > + smp_wmb();
>
> Maybe you can drop the smp_wmb() here and instead change the
> WRITE_ONCE() in set_action_entry() to smp_store_release() ? It should
> have the same effect and I think would be cleaner and convey your
> intent
> better.
>
We need protection against two different things here.
1) Receiving CPUs must see all the writes done by
the originating CPU before we send the RAR IPI.
2) Receiving CPUs must see all the writes done by
set_payload() before the write done by
set_action_entry(), in case another CPU sends
the RAR IPI before we do.
That other RAR IPI could even be sent between
when we write the payload, and when we write
the action entry. The receiving CPU could take
long enough processing other RAR payloads that
it can see our action entry after we write it.
Does removing the smp_wmb() still leave everything
safe in that scenario?
> > +}
> > +
> > +static void set_action_entry(unsigned long payload_nr, int
> > target_cpu)
> > +{
> > + u8 *bitmap = per_cpu(rar_action, target_cpu);
>
> bitmap? It doesn't look like one...
I'll rename this one to rar_actions like I did in
wait_for_action_done()
>
> > +static void wait_for_action_done(unsigned long payload_nr, int
> > target_cpu)
> > +{
> > + u8 status;
> > + u8 *rar_actions = per_cpu(rar_action, target_cpu);
> > +
> > + status = READ_ONCE(rar_actions[payload_nr]);
> > +
> > + while (status == RAR_PENDING) {
> > + cpu_relax();
> > + status = READ_ONCE(rar_actions[payload_nr]);
> > + }
> > +
> > + WARN_ON_ONCE(rar_actions[payload_nr] != RAR_SUCCESS);
>
> WARN_ON_ONCE(status != RAR_SUCCESS)
I'll add that cleanup for v5, too.
>
> > +}
> > +
> > +void rar_cpu_init(void)
> > +{
> > + u8 *bitmap;
> > + u64 r;
> > +
> > + /* Check if this CPU was already initialized. */
> > + rdmsrl(MSR_IA32_RAR_PAYLOAD_BASE, r);
> > + if (r == (u64)virt_to_phys(rar_payload))
> > + return;
>
> Seems like a risky test. If anything, I would check that the MSR that
> is supposed to be set *last* (MSR_IA32_RAR_CTRL) has the expected
> value.
> But it would still be best to initialize the MSRs unconditionally, or
> to avoid repeated initialization using a different scheme.
>
Whatever different scheme we use must be able to deal
with CPU hotplug and suspend/resume. There are legitimate
cases where rar_cpu_init() is called, and the in-MSR
state does not match the in-memory state.
You are right that we could always unconditionally
write the MSRs.
However, I am not entirely convinced that overwriting
the per-CPU rar_action array with all zeroes (RAR_SUCCESS)
is always safe to do without some sort of guard.
I suppose it might be, since if we are in rar_cpu_init()
current->mm should be init_mm, the CPU bit in cpu_online_mask
is not set, and we don't have to worry about flushing memory
all that much?
> >
> > + /* If this CPU supports less than RAR_MAX_PAYLOADS, lower
> > our limit. */
> > + if (max_payloads < rar_max_payloads)
> > + rar_max_payloads = max_payloads;
> > + pr_info("RAR: support %d payloads\n", max_payloads);
> > +
> > + for (r = 0; r < rar_max_payloads; r++)
> > + rar_locks[r].lock =
> > __RAW_SPIN_LOCK_UNLOCKED(rar_lock);
>
> Not a fan of the reuse of r for different purposes.
Fair enough, I'll add another variable name.
>
> > +/*
> > + * Inspired by smp_call_function_many(), but RAR requires a global
> > payload
> > + * table rather than per-CPU payloads in the CSD table, because
> > the action
> > + * handler is microcode rather than software.
> > + */
> > +void smp_call_rar_many(const struct cpumask *mask, u16 pcid,
> > + unsigned long start, unsigned long end)
> > +{
> > + unsigned long pages = (end - start + PAGE_SIZE) /
> > PAGE_SIZE;
>
> I think "end" is not inclusive. See for instance flush_tlb_page()
> where
> "end" is set to "a + PAGE_SIZE". So this would flush one extra page.
>
It gets better. Once we add in a "stride" argument, we
may end up with a range that covers only the first
4kB of one of the 2MB entries the calling code wanted
to invalidate. I fell into that trap already with the
INVLPGB code :)
I'll look into simplifying this for the next version,
probably with only 4k PAGE_SIZE at first. We can think
about stride later.
> >
> > + * This code cannot use the should_flush_tlb() logic here
> > because
> > + * RAR flushes do not update the tlb_gen, resulting in
> > unnecessary
> > + * flushes at context switch time.
> > + */
> > + dest_mask = this_cpu_ptr(&rar_cpu_mask);
> > + cpumask_and(dest_mask, mask, cpu_online_mask);
> > +
> > + /* Some callers race with other CPUs changing the passed
> > mask */
> > + if (unlikely(!cpumask_weight(dest_mask)))
>
> cpumask_and() returns "false if *@dstp is empty, else returns true".
> So you can use its return value instead of calling cpumask_weight().
Will do, thank you.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH v4 5/8] x86/mm: Introduce Remote Action Request
2025-06-20 1:10 ` Rik van Riel
@ 2025-06-20 15:27 ` Sean Christopherson
2025-06-20 21:24 ` Nadav Amit
2025-06-23 10:50 ` David Laight
2 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2025-06-20 15:27 UTC (permalink / raw)
To: Rik van Riel
Cc: Nadav Amit, linux-kernel, kernel-team, dave.hansen, luto, peterz,
bp, x86, tglx, mingo, Yu-cheng Yu
On Thu, Jun 19, 2025, Rik van Riel wrote:
> On Fri, 2025-06-20 at 02:01 +0300, Nadav Amit wrote:
> >
> > > +/*
> > > + * Reduce contention for the RAR payloads by having a small number
> > > of
> > > + * CPUs share a RAR payload entry, instead of a free for all with
> > > all CPUs.
> > > + */
> > > +struct rar_lock {
> > > + union {
> > > + raw_spinlock_t lock;
> > > + char __padding[SMP_CACHE_BYTES];
> > > + };
> > > +};
> >
> > I think you can lose the __padding and instead have
> > ____cacheline_aligned (and then you won't need union).
> >
> I tried that initially, but the compiler was unhappy
> to have __cacheline_aligned in the definition of a
> struct.
____cacheline_aligned_in_smp, a.k.a. ____cacheline_aligned, should work in a
struct or "on" a struct. There are multiple instances of both throughout the
kernel.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH v4 5/8] x86/mm: Introduce Remote Action Request
2025-06-20 1:10 ` Rik van Riel
2025-06-20 15:27 ` Sean Christopherson
@ 2025-06-20 21:24 ` Nadav Amit
2025-06-23 10:50 ` David Laight
2 siblings, 0 replies; 22+ messages in thread
From: Nadav Amit @ 2025-06-20 21:24 UTC (permalink / raw)
To: Rik van Riel, linux-kernel
Cc: kernel-team, dave.hansen, luto, peterz, bp, x86, seanjc, tglx,
mingo, Yu-cheng Yu
On 20/06/2025 4:10, Rik van Riel wrote:
> On Fri, 2025-06-20 at 02:01 +0300, Nadav Amit wrote:
>>> +
>>> +static void set_payload(struct rar_payload *p, u16 pcid, unsigned
>>> long start,
>>> + long pages)
>>> +{
>>> + p->must_be_zero_1 = 0;
>>> + p->must_be_zero_2 = 0;
>>> + p->must_be_zero_3 = 0;
>>> + p->page_size = RAR_INVLPG_PAGE_SIZE_4K;
>>
>> I think you can propagate the stride to this point instead of using
>> fixed 4KB stride.
>
> Agreed. That's another future optimization, for
> once this code all works right.
It might not be an optimization if it leads to a performance regression
relative to the current software-based flush, which does take the stride
into account. IOW: flushing a 2MB huge page would end up doing a full TLB
flush with RAR, in contrast to selectively flushing the specific entry
with the software-based shootdown. Correct me if I'm wrong.
>
> Currently I am in a situation where the receiving
> CPU clears the action vector from RAR_PENDING to
> RAR_SUCCESS, but the TLB does not appear to always
> be correctly flushed.
Interesting. I really do not know anything about it, but you may want to
look at the performance counters to see whether the flush is at least
reported as taking place.
>
>>
>>> + p->type = RAR_TYPE_INVPCID;
>>> + p->pcid = pcid;
>>> + p->linear_address = start;
>>> +
>>> + if (pcid) {
>>> + /* RAR invalidation of the mapping of a specific
>>> process. */
>>> + if (pages < RAR_INVLPG_MAX_PAGES) {
>>> + p->num_pages = pages;
>>> + p->subtype = RAR_INVPCID_ADDR;
>>> + } else {
>>> + p->subtype = RAR_INVPCID_PCID;
>>
>> I wonder whether it would be safer to set p->num_pages to something
>> (then we could do it unconditionally)
>
> We have a limited number of bits available for
> p->num_pages. I'm not sure we want to try
> writing a larger number than what fits in those
> bits.
I was just looking for a way to simplify the code and at the same time
avoid an "undefined" state. History has shown that all kinds of unintended
consequences happen when some state is left uninitialized (even if the
spec does not require it).
>
>>
>>> + }
>>> + } else {
>>> + /*
>>> + * Unfortunately RAR_INVPCID_ADDR excludes global
>>> translations.
>>> + * Always do a full flush for kernel
>>> invalidations.
>>> + */
>>> + p->subtype = RAR_INVPCID_ALL;
>>> + }
>>> +
>>> + /* Ensure all writes are visible before the action entry
>>> is set. */
>>> + smp_wmb();
>>
>> Maybe you can drop the smp_wmb() here and instead change the
>> WRITE_ONCE() in set_action_entry() to smp_store_release() ? It should
>> have the same effect and I think would be cleaner and convey your
>> intent
>> better.
>>
> We need protection against two different things here.
>
> 1) Receiving CPUs must see all the writes done by
> the originating CPU before we send the RAR IPI.
>
> 2) Receiving CPUs must see all the writes done by
> set_payload() before the write done by
> set_action_entry(), in case another CPU sends
> the RAR IPI before we do.
>
> That other RAR IPI could even be sent between
> when we write the payload, and when we write
> the action entry. The receiving CPU could take
> long enough processing other RAR payloads that
> it can see our action entry after we write it.
>
> Does removing the smp_wmb() still leave everything
> safe in that scenario?
Admittedly, I do not understand your concern. IIUC until
set_action_entry() is called, and until RAR_PENDING is set, IPIs are
irrelevant and the RAR entry would not be processed. Once it is set, you
want all the prior writes to be visible. Semantically, that's exactly
what smp_store_release() means.
Technically, smp_wmb() + WRITE_ONCE() are equivalent to
smp_store_release() on x86. smp_wmb() is actually a compiler barrier as
x86 follows the TSO memory model, and smp_store_release() is actually a
compiler-barrier + WRITE_ONCE(). So I think it's all safe and at the
same time clearer.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH v4 5/8] x86/mm: Introduce Remote Action Request
2025-06-20 1:10 ` Rik van Riel
2025-06-20 15:27 ` Sean Christopherson
2025-06-20 21:24 ` Nadav Amit
@ 2025-06-23 10:50 ` David Laight
2 siblings, 0 replies; 22+ messages in thread
From: David Laight @ 2025-06-23 10:50 UTC (permalink / raw)
To: Rik van Riel
Cc: Nadav Amit, linux-kernel, kernel-team, dave.hansen, luto, peterz,
bp, x86, seanjc, tglx, mingo, Yu-cheng Yu
On Thu, 19 Jun 2025 21:10:47 -0400
Rik van Riel <riel@surriel.com> wrote:
> On Fri, 2025-06-20 at 02:01 +0300, Nadav Amit wrote:
> >
> > > +/*
> > > + * Reduce contention for the RAR payloads by having a small number
> > > of
> > > + * CPUs share a RAR payload entry, instead of a free for all with
> > > all CPUs.
> > > + */
> > > +struct rar_lock {
> > > + union {
> > > + raw_spinlock_t lock;
> > > + char __padding[SMP_CACHE_BYTES];
> > > + };
> > > +};
> >
> > I think you can lose the __padding and instead have
> > ____cacheline_aligned (and then you won't need union).
> >
> I tried that initially, but the compiler was unhappy
> to have __cacheline_aligned in the definition of a
> struct.
You should be able to put it on the first structure member
(which would match the normal use).
The padding doesn't have the same effect:
even for rar_locks[] the first entry will share a cache line with
whatever comes before it, and the padding on the last entry just isn't used.
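Putting it on the member would look like this (untested sketch); aligning
the member also rounds sizeof(struct rar_lock) up to a cache line, so each
array entry lands on its own line:

struct rar_lock {
	raw_spinlock_t lock ____cacheline_aligned_in_smp;
};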
David
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH v4 2/8] x86/mm: enable BROADCAST_TLB_FLUSH on Intel, too
2025-06-19 20:03 ` [RFC PATCH v4 2/8] x86/mm: enable BROADCAST_TLB_FLUSH on Intel, too Rik van Riel
@ 2025-06-26 13:08 ` Kirill A. Shutemov
0 siblings, 0 replies; 22+ messages in thread
From: Kirill A. Shutemov @ 2025-06-26 13:08 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, bp, x86,
nadav.amit, seanjc, tglx, mingo, Rik van Riel
On Thu, Jun 19, 2025 at 04:03:54PM -0400, Rik van Riel wrote:
> From: Rik van Riel <riel@fb.com>
>
> Much of the code for Intel RAR and AMD INVLPGB is shared.
>
> Place both under the same config option.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> arch/x86/Kconfig.cpu | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
> index f928cf6e3252..ab763f69f54d 100644
> --- a/arch/x86/Kconfig.cpu
> +++ b/arch/x86/Kconfig.cpu
> @@ -360,7 +360,7 @@ menuconfig PROCESSOR_SELECT
>
> config BROADCAST_TLB_FLUSH
> def_bool y
> - depends on CPU_SUP_AMD && 64BIT
> + depends on (CPU_SUP_AMD || CPU_SUP_INTEL) && 64BIT && SMP
Maybe split it into a few "depends on" lines?
depends on 64BIT
depends on SMP
depends on CPU_SUP_AMD || CPU_SUP_INTEL
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH v4 4/8] x86/apic: Introduce Remote Action Request Operations
2025-06-19 20:03 ` [RFC PATCH v4 4/8] x86/apic: Introduce Remote Action Request Operations Rik van Riel
@ 2025-06-26 13:20 ` Kirill A. Shutemov
2025-06-26 16:09 ` Sean Christopherson
0 siblings, 1 reply; 22+ messages in thread
From: Kirill A. Shutemov @ 2025-06-26 13:20 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, bp, x86,
nadav.amit, seanjc, tglx, mingo, Yu-cheng Yu
On Thu, Jun 19, 2025 at 04:03:56PM -0400, Rik van Riel wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>
> RAR TLB flushing is started by sending a command to the APIC.
> This patch adds Remote Action Request commands.
>
> Because RAR_VECTOR is hardcoded at 0xe0, POSTED_MSI_NOTIFICATION_VECTOR
> has to be lowered to 0xdf, reducing the number of available vectors by
> 13.
>
> [riel: refactor after 6 years of changes, lower POSTED_MSI_NOTIFICATION_VECTOR]
But why? Because it is used as FIRST_SYSTEM_VECTOR?
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> arch/x86/include/asm/apicdef.h | 1 +
> arch/x86/include/asm/irq_vectors.h | 7 ++++++-
> arch/x86/include/asm/smp.h | 1 +
> arch/x86/kernel/apic/ipi.c | 5 +++++
> arch/x86/kernel/apic/local.h | 3 +++
> 5 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/apicdef.h b/arch/x86/include/asm/apicdef.h
> index 094106b6a538..b152d45af91a 100644
> --- a/arch/x86/include/asm/apicdef.h
> +++ b/arch/x86/include/asm/apicdef.h
> @@ -92,6 +92,7 @@
> #define APIC_DM_LOWEST 0x00100
> #define APIC_DM_SMI 0x00200
> #define APIC_DM_REMRD 0x00300
> +#define APIC_DM_RAR 0x00300
Hm. Do we conflict with APIC_DM_REMRD here?
> #define APIC_DM_NMI 0x00400
> #define APIC_DM_INIT 0x00500
> #define APIC_DM_STARTUP 0x00600
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH v4 5/8] x86/mm: Introduce Remote Action Request
2025-06-19 20:03 ` [RFC PATCH v4 5/8] x86/mm: Introduce Remote Action Request Rik van Riel
2025-06-19 23:01 ` Nadav Amit
@ 2025-06-26 15:41 ` Kirill A. Shutemov
2025-06-26 15:54 ` Kirill A. Shutemov
1 sibling, 1 reply; 22+ messages in thread
From: Kirill A. Shutemov @ 2025-06-26 15:41 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, bp, x86,
nadav.amit, seanjc, tglx, mingo, Yu-cheng Yu
On Thu, Jun 19, 2025 at 04:03:57PM -0400, Rik van Riel wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>
> Remote Action Request (RAR) is a TLB flushing broadcast facility.
> To start a TLB flush, the initiator CPU creates a RAR payload and
> sends a command to the APIC. The receiving CPUs automatically flush
> TLBs as specified in the payload without the kernel's involvement.
>
> [ riel: add pcid parameter to smp_call_rar_many so other mms can be flushed ]
>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> arch/x86/include/asm/rar.h | 76 ++++++++++++
> arch/x86/kernel/cpu/intel.c | 8 +-
> arch/x86/mm/Makefile | 1 +
> arch/x86/mm/rar.c | 236 ++++++++++++++++++++++++++++++++++++
> 4 files changed, 320 insertions(+), 1 deletion(-)
> create mode 100644 arch/x86/include/asm/rar.h
> create mode 100644 arch/x86/mm/rar.c
>
> diff --git a/arch/x86/include/asm/rar.h b/arch/x86/include/asm/rar.h
> new file mode 100644
> index 000000000000..c875b9e9c509
> --- /dev/null
> +++ b/arch/x86/include/asm/rar.h
> @@ -0,0 +1,76 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_RAR_H
> +#define _ASM_X86_RAR_H
> +
> +/*
> + * RAR payload types
> + */
> +#define RAR_TYPE_INVPG 0
> +#define RAR_TYPE_INVPG_NO_CR3 1
> +#define RAR_TYPE_INVPCID 2
> +#define RAR_TYPE_INVEPT 3
> +#define RAR_TYPE_INVVPID 4
> +#define RAR_TYPE_WRMSR 5
> +
> +/*
> + * Subtypes for RAR_TYPE_INVLPG
> + */
> +#define RAR_INVPG_ADDR 0 /* address specific */
> +#define RAR_INVPG_ALL 2 /* all, include global */
> +#define RAR_INVPG_ALL_NO_GLOBAL 3 /* all, exclude global */
> +
> +/*
> + * Subtypes for RAR_TYPE_INVPCID
> + */
> +#define RAR_INVPCID_ADDR 0 /* address specific */
> +#define RAR_INVPCID_PCID 1 /* all of PCID */
> +#define RAR_INVPCID_ALL 2 /* all, include global */
> +#define RAR_INVPCID_ALL_NO_GLOBAL 3 /* all, exclude global */
> +
> +/*
> + * Page size for RAR_TYPE_INVLPG
> + */
> +#define RAR_INVLPG_PAGE_SIZE_4K 0
> +#define RAR_INVLPG_PAGE_SIZE_2M 1
> +#define RAR_INVLPG_PAGE_SIZE_1G 2
> +
> +/*
> + * Max number of pages per payload
> + */
> +#define RAR_INVLPG_MAX_PAGES 63
> +
> +struct rar_payload {
> + u64 for_sw : 8;
A bitfield of 8 bits? Why not just u8?
> + u64 type : 8;
> + u64 must_be_zero_1 : 16;
> + u64 subtype : 3;
> + u64 page_size : 2;
> + u64 num_pages : 6;
> + u64 must_be_zero_2 : 21;
> +
> + u64 must_be_zero_3;
> +
> + /*
> + * Starting address
> + */
> + union {
> + u64 initiator_cr3;
Initiator? It is the CR3 to flush, not the CR3 of the initiator.
> + struct {
> + u64 pcid : 12;
> + u64 ignored : 52;
> + };
> + };
> + u64 linear_address;
> +
> + /*
> + * Padding
> + */
> + u64 padding[4];
But it is not padding. It is available for SW, according to the spec.
> +};
As far as I can see, only RAR_TYPE_INVPCID is used. Maybe it is worth
defining a payload struct specifically for this type and getting rid of
the union.
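A type-specific layout could look roughly like this (sketch; the field
widths are copied from the generic struct above, the name is made up):

struct rar_invpcid_payload {
	u64 for_sw		: 8;
	u64 type		: 8;	/* always RAR_TYPE_INVPCID */
	u64 must_be_zero_1	: 16;
	u64 subtype		: 3;
	u64 page_size		: 2;
	u64 num_pages		: 6;
	u64 must_be_zero_2	: 21;
	u64 must_be_zero_3;
	u64 pcid		: 12;
	u64 ignored		: 52;
	u64 linear_address;
	u64 available_for_sw[4];
};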
> +
> +void rar_cpu_init(void);
> +void rar_boot_cpu_init(void);
> +void smp_call_rar_many(const struct cpumask *mask, u16 pcid,
> + unsigned long start, unsigned long end);
> +
> +#endif /* _ASM_X86_RAR_H */
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index 0cc4ae27127c..ddc5e7d81077 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -22,6 +22,7 @@
> #include <asm/microcode.h>
> #include <asm/msr.h>
> #include <asm/numa.h>
> +#include <asm/rar.h>
> #include <asm/resctrl.h>
> #include <asm/thermal.h>
> #include <asm/uaccess.h>
> @@ -624,6 +625,9 @@ static void init_intel(struct cpuinfo_x86 *c)
> split_lock_init();
>
> intel_init_thermal(c);
> +
> + if (cpu_feature_enabled(X86_FEATURE_RAR))
> + rar_cpu_init();
So, the boot CPU gets initialized twice, right? Once via rar_boot_cpu_init()
in intel_detect_tlb(), and a second time here.
> }
>
> #ifdef CONFIG_X86_32
> @@ -725,8 +729,10 @@ static void intel_detect_tlb(struct cpuinfo_x86 *c)
>
> rdmsrl(MSR_IA32_CORE_CAPS, msr);
>
> - if (msr & MSR_IA32_CORE_CAPS_RAR)
> + if (msr & MSR_IA32_CORE_CAPS_RAR) {
> setup_force_cpu_cap(X86_FEATURE_RAR);
> + rar_boot_cpu_init();
> + }
> }
> }
>
> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index 5b9908f13dcf..f36fc99e8b10 100644
> --- a/arch/x86/mm/Makefile
> +++ b/arch/x86/mm/Makefile
> @@ -52,6 +52,7 @@ obj-$(CONFIG_ACPI_NUMA) += srat.o
> obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
> obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
> obj-$(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION) += pti.o
> +obj-$(CONFIG_BROADCAST_TLB_FLUSH) += rar.o
>
> obj-$(CONFIG_X86_MEM_ENCRYPT) += mem_encrypt.o
> obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_amd.o
> diff --git a/arch/x86/mm/rar.c b/arch/x86/mm/rar.c
> new file mode 100644
> index 000000000000..76959782fb03
> --- /dev/null
> +++ b/arch/x86/mm/rar.c
> @@ -0,0 +1,236 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * RAR TLB shootdown
> + */
> +#include <linux/sched.h>
> +#include <linux/bug.h>
> +#include <asm/current.h>
> +#include <asm/io.h>
> +#include <asm/sync_bitops.h>
> +#include <asm/rar.h>
> +#include <asm/tlbflush.h>
> +
> +static DEFINE_PER_CPU(struct cpumask, rar_cpu_mask);
> +
> +#define RAR_SUCCESS 0x00
> +#define RAR_PENDING 0x01
> +#define RAR_FAILURE 0x80
> +
> +#define RAR_MAX_PAYLOADS 64UL
> +
> +/* How many RAR payloads are supported by this CPU */
> +static int rar_max_payloads __ro_after_init = RAR_MAX_PAYLOADS;
> +
> +/*
> + * RAR payloads telling CPUs what to do. This table is shared between
> + * all CPUs; it is possible to have multiple payload tables shared between
> + * different subsets of CPUs, but that adds a lot of complexity.
> + */
> +static struct rar_payload rar_payload[RAR_MAX_PAYLOADS] __page_aligned_bss;
On machines without RAR it would waste 4k. Not a big deal, I guess. But it
would be neat to reclaim it if unused.
> +/*
> + * Reduce contention for the RAR payloads by having a small number of
> + * CPUs share a RAR payload entry, instead of a free for all with all CPUs.
> + */
> +struct rar_lock {
> + union {
> + raw_spinlock_t lock;
> + char __padding[SMP_CACHE_BYTES];
> + };
> +};
> +
> +static struct rar_lock rar_locks[RAR_MAX_PAYLOADS] __cacheline_aligned;
One more 4k.
> +/*
> + * The action vector tells each CPU which payload table entries
> + * have work for that CPU.
> + */
> +static DEFINE_PER_CPU_ALIGNED(u8[RAR_MAX_PAYLOADS], rar_action);
> +
> +/*
> + * TODO: group CPUs together based on locality in the system instead
> + * of CPU number, to further reduce the cost of contention.
> + */
> +static int cpu_rar_payload_number(void)
> +{
> + int cpu = raw_smp_processor_id();
> + return cpu % rar_max_payloads;
> +}
> +
> +static int get_payload_slot(void)
> +{
> + int payload_nr = cpu_rar_payload_number();
> + raw_spin_lock(&rar_locks[payload_nr].lock);
> + return payload_nr;
> +}
> +
> +static void free_payload_slot(unsigned long payload_nr)
> +{
> + raw_spin_unlock(&rar_locks[payload_nr].lock);
> +}
> +
> +static void set_payload(struct rar_payload *p, u16 pcid, unsigned long start,
> + long pages)
> +{
> + p->must_be_zero_1 = 0;
> + p->must_be_zero_2 = 0;
> + p->must_be_zero_3 = 0;
> + p->page_size = RAR_INVLPG_PAGE_SIZE_4K;
> + p->type = RAR_TYPE_INVPCID;
> + p->pcid = pcid;
> + p->linear_address = start;
> +
> + if (pcid) {
> + /* RAR invalidation of the mapping of a specific process. */
> + if (pages < RAR_INVLPG_MAX_PAGES) {
> + p->num_pages = pages;
> + p->subtype = RAR_INVPCID_ADDR;
> + } else {
> + p->subtype = RAR_INVPCID_PCID;
> + }
> + } else {
> + /*
> + * Unfortunately RAR_INVPCID_ADDR excludes global translations.
> + * Always do a full flush for kernel invalidations.
> + */
> + p->subtype = RAR_INVPCID_ALL;
> + }
> +
> + /* Ensure all writes are visible before the action entry is set. */
> + smp_wmb();
> +}
> +
> +static void set_action_entry(unsigned long payload_nr, int target_cpu)
> +{
> + u8 *bitmap = per_cpu(rar_action, target_cpu);
> +
> + /*
> + * Given a remote CPU, "arm" its action vector to ensure it handles
> + * the request at payload_nr when it receives a RAR signal.
> + * The remote CPU will overwrite RAR_PENDING when it handles
> + * the request.
> + */
> + WRITE_ONCE(bitmap[payload_nr], RAR_PENDING);
> +}
> +
> +static void wait_for_action_done(unsigned long payload_nr, int target_cpu)
> +{
> + u8 status;
> + u8 *rar_actions = per_cpu(rar_action, target_cpu);
> +
> + status = READ_ONCE(rar_actions[payload_nr]);
> +
> + while (status == RAR_PENDING) {
> + cpu_relax();
> + status = READ_ONCE(rar_actions[payload_nr]);
> + }
> +
> + WARN_ON_ONCE(rar_actions[payload_nr] != RAR_SUCCESS);
> +}
> +
> +void rar_cpu_init(void)
> +{
> + u8 *bitmap;
> + u64 r;
> +
> + /* Check if this CPU was already initialized. */
> + rdmsrl(MSR_IA32_RAR_PAYLOAD_BASE, r);
> + if (r == (u64)virt_to_phys(rar_payload))
> + return;
> +
> + bitmap = this_cpu_ptr(rar_action);
> + memset(bitmap, 0, RAR_MAX_PAYLOADS);
> + wrmsrl(MSR_IA32_RAR_ACT_VEC, (u64)virt_to_phys(bitmap));
> + wrmsrl(MSR_IA32_RAR_PAYLOAD_BASE, (u64)virt_to_phys(rar_payload));
> +
> + /*
> + * Allow RAR events to be processed while interrupts are disabled on
> + * a target CPU. This prevents "pileups" where many CPUs are waiting
> + * on one CPU that has IRQs blocked for too long, and should reduce
> + * contention on the rar_payload table.
> + */
> + wrmsrl(MSR_IA32_RAR_CTRL, RAR_CTRL_ENABLE | RAR_CTRL_IGNORE_IF);
Hmm. How is RAR_CTRL_IGNORE_IF safe? Wouldn't it break GUP_fast() which
relies on disabling interrupts to block TLB flush and page table freeing?
> +}
> +
> +void rar_boot_cpu_init(void)
> +{
> + int max_payloads;
> + u64 r;
> +
> + /* The MSR contains N defining the max [0-N] rar payload slots. */
> + rdmsrl(MSR_IA32_RAR_INFO, r);
> + max_payloads = (r >> 32) + 1;
> +
> + /* If this CPU supports less than RAR_MAX_PAYLOADS, lower our limit. */
> + if (max_payloads < rar_max_payloads)
> + rar_max_payloads = max_payloads;
> + pr_info("RAR: support %d payloads\n", max_payloads);
> +
> + for (r = 0; r < rar_max_payloads; r++)
> + rar_locks[r].lock = __RAW_SPIN_LOCK_UNLOCKED(rar_lock);
> +
> + /* Initialize the boot CPU early to handle early boot flushes. */
> + rar_cpu_init();
> +}
> +
> +/*
> + * Inspired by smp_call_function_many(), but RAR requires a global payload
> + * table rather than per-CPU payloads in the CSD table, because the action
> + * handler is microcode rather than software.
> + */
> +void smp_call_rar_many(const struct cpumask *mask, u16 pcid,
> + unsigned long start, unsigned long end)
> +{
> + unsigned long pages = (end - start + PAGE_SIZE) / PAGE_SIZE;
> + int cpu, this_cpu = smp_processor_id();
> + cpumask_t *dest_mask;
> + unsigned long payload_nr;
> +
> + /* Catch the "end - start + PAGE_SIZE" overflow above. */
> + if (end == TLB_FLUSH_ALL)
> + pages = RAR_INVLPG_MAX_PAGES + 1;
> +
> + /*
> + * Can deadlock when called with interrupts disabled.
> + * Allow CPUs that are not yet online though, as no one else can
> + * send smp call function interrupt to this CPU and as such deadlocks
> + * can't happen.
> + */
> + if (cpu_online(this_cpu) && !oops_in_progress && !early_boot_irqs_disabled) {
> + lockdep_assert_irqs_enabled();
> + lockdep_assert_preemption_disabled();
> + }
> +
> + /*
> + * A CPU needs to be initialized in order to process RARs.
> + * Skip offline CPUs.
> + *
> + * TODO:
> + * - Skip RAR to CPUs that are in a deeper C-state, with an empty TLB
> + *
> + * This code cannot use the should_flush_tlb() logic here because
> + * RAR flushes do not update the tlb_gen, resulting in unnecessary
> + * flushes at context switch time.
> + */
> + dest_mask = this_cpu_ptr(&rar_cpu_mask);
> + cpumask_and(dest_mask, mask, cpu_online_mask);
> +
> + /* Some callers race with other CPUs changing the passed mask */
> + if (unlikely(!cpumask_weight(dest_mask)))
> + return;
> +
> + payload_nr = get_payload_slot();
> + set_payload(&rar_payload[payload_nr], pcid, start, pages);
> +
> + for_each_cpu(cpu, dest_mask)
> + set_action_entry(payload_nr, cpu);
> +
> + /* Send a message to all CPUs in the map */
> + native_send_rar_ipi(dest_mask);
> +
> + for_each_cpu(cpu, dest_mask)
> + wait_for_action_done(payload_nr, cpu);
> +
> + free_payload_slot(payload_nr);
> +}
> +EXPORT_SYMBOL(smp_call_rar_many);
> --
> 2.49.0
>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH v4 5/8] x86/mm: Introduce Remote Action Request
2025-06-26 15:41 ` Kirill A. Shutemov
@ 2025-06-26 15:54 ` Kirill A. Shutemov
0 siblings, 0 replies; 22+ messages in thread
From: Kirill A. Shutemov @ 2025-06-26 15:54 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, bp, x86,
nadav.amit, seanjc, tglx, mingo, Yu-cheng Yu
On Thu, Jun 26, 2025 at 06:41:09PM +0300, Kirill A. Shutemov wrote:
> > + /*
> > + * Allow RAR events to be processed while interrupts are disabled on
> > + * a target CPU. This prevents "pileups" where many CPUs are waiting
> > + * on one CPU that has IRQs blocked for too long, and should reduce
> > + * contention on the rar_payload table.
> > + */
> > + wrmsrl(MSR_IA32_RAR_CTRL, RAR_CTRL_ENABLE | RAR_CTRL_IGNORE_IF);
>
> Hmm. How is RAR_CTRL_IGNORE_IF safe? Wouldn't it break GUP_fast() which
> relies on disabling interrupts to block TLB flush and page table freeing?
Ah. I missed that x86 switched to MMU_GATHER_RCU_TABLE_FREE.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH v4 4/8] x86/apic: Introduce Remote Action Request Operations
2025-06-26 13:20 ` Kirill A. Shutemov
@ 2025-06-26 16:09 ` Sean Christopherson
0 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2025-06-26 16:09 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Rik van Riel, linux-kernel, kernel-team, dave.hansen, luto,
peterz, bp, x86, nadav.amit, tglx, mingo, Yu-cheng Yu
On Thu, Jun 26, 2025, Kirill A. Shutemov wrote:
> On Thu, Jun 19, 2025 at 04:03:56PM -0400, Rik van Riel wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> >
> > RAR TLB flushing is started by sending a command to the APIC.
> > This patch adds Remote Action Request commands.
> >
> > Because RAR_VECTOR is hardcoded at 0xe0, POSTED_MSI_NOTIFICATION_VECTOR
> > has to be lowered to 0xdf, reducing the number of available vectors by
> > 13.
> >
> > [riel: refactor after 6 years of changes, lower POSTED_MSI_NOTIFICATION_VECTOR]
>
> But why? Because it is used as FIRST_SYSTEM_VECTOR?
The Posted MSI Notifications vector should be the lowest of the system vectors
so that device IRQs are NOT prioritized over "real" system vectors.
> > Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > ---
> > arch/x86/include/asm/apicdef.h | 1 +
> > arch/x86/include/asm/irq_vectors.h | 7 ++++++-
> > arch/x86/include/asm/smp.h | 1 +
> > arch/x86/kernel/apic/ipi.c | 5 +++++
> > arch/x86/kernel/apic/local.h | 3 +++
> > 5 files changed, 16 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/apicdef.h b/arch/x86/include/asm/apicdef.h
> > index 094106b6a538..b152d45af91a 100644
> > --- a/arch/x86/include/asm/apicdef.h
> > +++ b/arch/x86/include/asm/apicdef.h
> > @@ -92,6 +92,7 @@
> > #define APIC_DM_LOWEST 0x00100
> > #define APIC_DM_SMI 0x00200
> > #define APIC_DM_REMRD 0x00300
> > +#define APIC_DM_RAR 0x00300
>
> Hm. Do we conflict with APIC_DM_REMRD here?
Yes and no. Yes, it literally conflicts, but it's easy enough to define the behavior
of APIC_DM_{REMRD,RAR} based on feature support. E.g. KVM is likely going to add
support for Remote Read, which would conflict with KVM's bastardization of
APIC_DM_REMRD for PV kicks. But as Paolo pointed out[*], KVM's PV unhalt/kick
can simply be gated on KVM_FEATURE_PV_UNHALT. Any code that cares should be able
to do the same thing for RAR. E.g. KVM's code could end up being something like:
case APIC_DM_REMRD:
if (guest_pv_has(vcpu, KVM_FEATURE_PV_UNHALT)) {
result = 1;
vcpu->arch.pv.pv_unhalted = 1;
kvm_make_request(KVM_REQ_EVENT, vcpu);
kvm_vcpu_kick(vcpu);
} else if (guest_has_rar(vcpu)) {
<magic!>
} else {
<emulate legacy Remote Read>;
}
break;
For the kernel itself, there's nothing to do, because Linux doesn't use Remote Read.
[*] https://lore.kernel.org/all/CABgObfadZZ5sXYB0xR5OcLDw_eVUmXTOTFSWkVpkgiCJmNnFRQ@mail.gmail.com
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH v4 0/8] Intel RAR TLB invalidation
2025-06-19 20:03 [RFC PATCH v4 0/8] Intel RAR TLB invalidation Rik van Riel
` (7 preceding siblings ...)
2025-06-19 20:04 ` [RFC PATCH v4 8/8] x86/tlb: flush the local TLB twice (DEBUG) Rik van Riel
@ 2025-06-26 18:08 ` Dave Jiang
8 siblings, 0 replies; 22+ messages in thread
From: Dave Jiang @ 2025-06-26 18:08 UTC (permalink / raw)
To: Rik van Riel, linux-kernel
Cc: kernel-team, dave.hansen, luto, peterz, bp, x86, nadav.amit,
seanjc, tglx, mingo
On 6/19/25 1:03 PM, Rik van Riel wrote:
> This patch series adds support for IPI-less TLB invalidation
> using Intel RAR technology.
>
> Intel RAR differs from AMD INVLPGB in a few ways:
> - RAR goes through (emulated?) APIC writes, not instructions
> - RAR flushes go through a memory table with 64 entries
> - RAR flushes can be targeted to a cpumask
> - The RAR functionality must be set up at boot time before it can be used
>
> The cpumask targeting has resulted in Intel RAR and AMD INVLPGB having
> slightly different rules:
> - Processes with dynamic ASIDs use IPI based shootdowns
> - INVLPGB: processes with a global ASID
> - always have the TLB up to date, on every CPU
> - never need to flush the TLB at context switch time
> - RAR: processes with global ASIDs
> - have the TLB up to date on CPUs in the mm_cpumask
> - can skip a TLB flush at context switch time if the CPU is in the mm_cpumask
> - need to flush the TLB when scheduled on a cpu not in the mm_cpumask,
> in case it used to run there before and the TLB has stale entries
>
> RAR functionality is present on Sapphire Rapids and newer CPUs.
>
> Information about Intel RAR can be found in this whitepaper.
>
> https://www.intel.com/content/dam/develop/external/us/en/documents/341431-remote-action-request-white-paper.pdf
>
> This patch series is based off a 2019 patch series created by
> Intel, with patches later in the series modified to fit into
> the TLB flush code structure we have after AMD INVLPGB functionality
> was integrated.
>
> TODO:
> - some sort of optimization to avoid sending RARs to CPUs in deeper
> idle states when they have init_mm loaded (flush when switching to init_mm?)
>
> v4:
> - remove chicken/egg problem that made it impossible to use RAR early
> in bootup, now RAR can be used to flush the local TLB (but it's broken?)
> - always flush other CPUs with RAR, no more periodic flush_tlb_func
> - separate, simplified cpumask trimming code
> - attempt to use RAR to flush the local TLB, which should work
> according to the documentation
> - add a DEBUG patch to flush the local TLB with RAR and again locally,
> may need some help from Intel to figure out why this makes a difference
> - memory dumps of rar_payload[] suggest we are sending valid RARs
> - receiving CPUs set the status from RAR_PENDING to RAR_SUCCESS
> - unclear whether the TLB is actually flushed correctly :(
Hi Rik,
Dave Hansen has asked me to reproduce this locally, so I'm trying to replicate your test setup. What steps are you using to test this patch series? Thanks!
DJ
> v3:
> - move cpa_flush() change out of this patch series
> - use MSR_IA32_CORE_CAPS definition, merge first two patches together
> - move RAR initialization to early_init_intel()
> - remove single-CPU "fast path" from smp_call_rar_many
> - remove smp call table RAR entries, just do a direct call
> - cleanups suggested (Ingo, Nadav, Dave, Thomas, Borislav, Sean)
> - fix !CONFIG_SMP compile in Kconfig
> - match RAR definitions to the names & numbers in the documentation
> - the code seems to work now
> v2:
> - Cleanups suggested by Ingo and Nadav (thank you)
> - Basic RAR code seems to actually work now.
> - Kernel TLB flushes with RAR seem to work correctly.
> - User TLB flushes with RAR are still broken, with two symptoms:
> - The !is_lazy WARN_ON in leave_mm() is tripped
> - Random segfaults.
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH v4 6/8] x86/mm: use RAR for kernel TLB flushes
2025-06-19 20:03 ` [RFC PATCH v4 6/8] x86/mm: use RAR for kernel TLB flushes Rik van Riel
@ 2025-06-27 13:27 ` Kirill A. Shutemov
2025-06-29 1:30 ` Rik van Riel
0 siblings, 1 reply; 22+ messages in thread
From: Kirill A. Shutemov @ 2025-06-27 13:27 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, bp, x86,
nadav.amit, seanjc, tglx, mingo, Rik van Riel
On Thu, Jun 19, 2025 at 04:03:58PM -0400, Rik van Riel wrote:
> From: Rik van Riel <riel@fb.com>
>
> Use Intel RAR for kernel TLB flushes, when enabled.
>
> Pass in PCID 0 to smp_call_rar_many() to flush the specified addresses,
> regardless of which PCID they might be cached under in any destination CPU.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> arch/x86/mm/tlb.c | 38 ++++++++++++++++++++++++++++++++++++++
> 1 file changed, 38 insertions(+)
>
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 39f80111e6f1..8931f7029d6c 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -21,6 +21,7 @@
> #include <asm/apic.h>
> #include <asm/msr.h>
> #include <asm/perf_event.h>
> +#include <asm/rar.h>
> #include <asm/tlb.h>
>
> #include "mm_internal.h"
> @@ -1468,6 +1469,18 @@ static void do_flush_tlb_all(void *info)
> __flush_tlb_all();
> }
>
> +static void rar_full_flush(const cpumask_t *cpumask)
> +{
> + guard(preempt)();
> + smp_call_rar_many(cpumask, 0, 0, TLB_FLUSH_ALL);
> + invpcid_flush_all();
I don't follow why do we need to call invpcid_flush_all() here in addition
to smp_call_rar_many(). Hm?
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH v4 6/8] x86/mm: use RAR for kernel TLB flushes
2025-06-27 13:27 ` Kirill A. Shutemov
@ 2025-06-29 1:30 ` Rik van Riel
0 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2025-06-29 1:30 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: linux-kernel, kernel-team, dave.hansen, luto, peterz, bp, x86,
nadav.amit, seanjc, tglx, mingo, Rik van Riel
On Fri, 2025-06-27 at 16:27 +0300, Kirill A. Shutemov wrote:
> On Thu, Jun 19, 2025 at 04:03:58PM -0400, Rik van Riel wrote:
> > From: Rik van Riel <riel@fb.com>
> >
> > Use Intel RAR for kernel TLB flushes, when enabled.
> >
> > Pass in PCID 0 to smp_call_rar_many() to flush the specified
> > addresses,
> > regardless of which PCID they might be cached under in any
> > destination CPU.
> >
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > ---
> > arch/x86/mm/tlb.c | 38 ++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 38 insertions(+)
> >
> > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> > index 39f80111e6f1..8931f7029d6c 100644
> > --- a/arch/x86/mm/tlb.c
> > +++ b/arch/x86/mm/tlb.c
> > @@ -21,6 +21,7 @@
> > #include <asm/apic.h>
> > #include <asm/msr.h>
> > #include <asm/perf_event.h>
> > +#include <asm/rar.h>
> > #include <asm/tlb.h>
> >
> > #include "mm_internal.h"
> > @@ -1468,6 +1469,18 @@ static void do_flush_tlb_all(void *info)
> > __flush_tlb_all();
> > }
> >
> > +static void rar_full_flush(const cpumask_t *cpumask)
> > +{
> > + guard(preempt)();
> > + smp_call_rar_many(cpumask, 0, 0, TLB_FLUSH_ALL);
> > + invpcid_flush_all();
>
> I don't follow why do we need to call invpcid_flush_all() here in
> addition
> to smp_call_rar_many(). Hm?
>
We shouldn't have to.
Once we figure out why the RAR flush isn't
working right (despite the RAR transitioning
from RAR_PENDING to RAR_SUCCESS) we should be
able to get rid of this call.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2025-06-29 1:42 UTC | newest]
Thread overview: 22+ messages
2025-06-19 20:03 [RFC PATCH v4 0/8] Intel RAR TLB invalidation Rik van Riel
2025-06-19 20:03 ` [RFC PATCH v4 1/8] x86/mm: Introduce Remote Action Request MSRs Rik van Riel
2025-06-19 20:03 ` [RFC PATCH v4 2/8] x86/mm: enable BROADCAST_TLB_FLUSH on Intel, too Rik van Riel
2025-06-26 13:08 ` Kirill A. Shutemov
2025-06-19 20:03 ` [RFC PATCH v4 3/8] x86/mm: Introduce X86_FEATURE_RAR Rik van Riel
2025-06-19 20:03 ` [RFC PATCH v4 4/8] x86/apic: Introduce Remote Action Request Operations Rik van Riel
2025-06-26 13:20 ` Kirill A. Shutemov
2025-06-26 16:09 ` Sean Christopherson
2025-06-19 20:03 ` [RFC PATCH v4 5/8] x86/mm: Introduce Remote Action Request Rik van Riel
2025-06-19 23:01 ` Nadav Amit
2025-06-20 1:10 ` Rik van Riel
2025-06-20 15:27 ` Sean Christopherson
2025-06-20 21:24 ` Nadav Amit
2025-06-23 10:50 ` David Laight
2025-06-26 15:41 ` Kirill A. Shutemov
2025-06-26 15:54 ` Kirill A. Shutemov
2025-06-19 20:03 ` [RFC PATCH v4 6/8] x86/mm: use RAR for kernel TLB flushes Rik van Riel
2025-06-27 13:27 ` Kirill A. Shutemov
2025-06-29 1:30 ` Rik van Riel
2025-06-19 20:03 ` [RFC PATCH v4 7/8] x86/mm: userspace & pageout flushing using Intel RAR Rik van Riel
2025-06-19 20:04 ` [RFC PATCH v4 8/8] x86/tlb: flush the local TLB twice (DEBUG) Rik van Riel
2025-06-26 18:08 ` [RFC PATCH v4 0/8] Intel RAR TLB invalidation Dave Jiang